Oh my god. Great article guys. Long awaited 😉
Fireworks, at some point on their website, claimed 5T tokens every day:
https://fireworks.ai/blog/virtual-cloud
Where is the 2 in FLOPs_(xV) when calculating attn_score × V?
Thank you for reading the blog post carefully and for raising this question.
When multiplying the attention scores by V, scalar-vector products are performed. Since no summation occurs in this step, there's no need for the ×2 factor. The FLOPs for the xV operation therefore depend on the number of score entries involved (B × S² during prefill, B × S during decoding), the dimensionality of v (d_c), and the number of heads (n_h).
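To make this counting convention concrete, here is a minimal sketch (ours, not from the post; B, S, n_h, d_c as used in the reply above, example numbers purely illustrative) of counting only the scalar-vector multiplications:

```python
def flops_xv_mults_only(B, S, n_h, d_c, prefill=True):
    """FLOPs for attn_score x V, counting only the scalar-vector
    multiplications (no x2 multiply-accumulate factor), following
    the convention described in the reply above."""
    # B x S^2 score entries during prefill, B x S during decoding
    score_entries = B * S * S if prefill else B * S
    # one multiplication per score entry and per element of v (length d_c), per head
    return score_entries * n_h * d_c

# Illustrative numbers only (not taken from the post)
print(flops_xv_mults_only(B=1, S=4096, n_h=128, d_c=512))
```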
Thanks for the response, but I'm still confused given the inconsistency with the previous post.
In the previous blog (LLM Inference Economics from First Principle), the xV FLOPs were stated as 2 × S² × hidden_dim, which includes the ×2 factor. But here they are listed as B × S² × n_h × d_h, without the ×2.
Both QK^T and xV are matrix multiplications requiring multiplications AND summations. If we use the same FLOPs counting convention, both should have the ×2 factor, or both should omit it.
Could you clarify why the counting method differs between the two operations in this post?
Thank you for raising this point and for carefully comparing with our earlier post.
Since MLA involves more complexity than simple attention, we decided to follow DeepSeek's mathematical formulation when counting FLOPs (the first equation block in the section Theoretical Performance Model: Computation: MLA). This choice leads to a structurally different presentation compared to the previous post.
The summation over the input tokens (∑ⱼ₌₁ᵗ […]) was indeed not accounted for. Accounting for this adds B × S × n_h × d_c during decoding, and B × S(S+1)/2 × n_h × d_c, i.e. B × O(S²) × n_h × d_c, during prefill. If this step is combined with the xV multiplication, it does result in the ×2 factor. However, in the updated version we chose to keep it separate, both to make clear where the additional term originates and to remain consistent with the mathematical notation we are referencing.
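As a sanity check on this reconciliation, here is a small sketch (ours, not from the post; numbers purely illustrative) showing that counting the multiplications and the summation term separately recovers the same total as the combined 2 × M × N × K convention; the S(S+1)/2 factor follows the causal-mask count in this reply:

```python
def prefill_flops_xv_split(B, S, n_h, d_c):
    """attn_score x V during prefill, with multiplications and the
    summation over input tokens (the sum over j = 1..t) counted separately."""
    score_entries = B * S * (S + 1) // 2   # causal mask: t scores for token t
    mults = score_entries * n_h * d_c      # scalar-vector products
    adds = score_entries * n_h * d_c       # accumulation term, as counted in the reply
    return mults, adds

def prefill_flops_xv_combined(B, S, n_h, d_c):
    """Same operation counted as a matrix multiply (2 x multiply-accumulate convention)."""
    return 2 * B * (S * (S + 1) // 2) * n_h * d_c

# Illustrative numbers only
mults, adds = prefill_flops_xv_split(1, 4096, 128, 512)
assert mults + adds == prefill_flops_xv_combined(1, 4096, 128, 512)  # the x2 reappears
```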
We appreciate your careful reading and thank you for pointing out this inconsistency.