6 Comments
Andreas:

Oh my god. Great article guys. Long awaited 😉

Luigi Pagani:

Fireworks at some point declared on their website: 5T tokens every day.

https://fireworks.ai/blog/virtual-cloud

Yusong Cheng:

Where is the 2 in FLOPs_(xV) when calculating attn_score × V?

Eric Schreiber:

Thank you for reading the blog post carefully and for raising this question.

When multiplying the attention scores by V, a scalar–vector product is performed. Since no summation occurs in this step, there's no need for the 2× factor. The FLOPs for the xV operation therefore depend on the number of tokens involved (e.g., B×S² or B×S), the dimensionality of V (d_c), and the number of heads (n_h).

Yusong Cheng:

Thanks for the response, but I'm still confused given the inconsistency with the previous post.

In the previous blog post (LLM Inference Economics from First Principle), the xV FLOPs are stated as 2 × S² × hidden_dim, which includes the ×2 factor. But here it's listed as B × S² × n_h × d_h without the ×2.

Both QK^T and xV are matrix multiplications requiring multiplications AND summations. If we use the same FLOPs counting convention, both should have the ×2 factor, or both should omit it.

Could you clarify why the counting method differs between the two operations in this post?

Eric Schreiber:

Thank you for raising this point and for carefully comparing with our earlier post.

As MLA involves more complexity than simple attention, we decided to follow DeepSeek's mathematical formulation when counting FLOPs (the first equation block in the section Theoretical Performance Model → Computation → MLA). This choice leads to a different structural presentation compared to the previous post.

The summation over the input tokens (∑_{j=1}^{t} […]) was indeed not accounted for. Accounting for this adds B × S × n_h × d_c during decoding, and B × S(S+1)/2 × n_h × d_c = B × O(S²) × n_h × d_c during prefill. If this step is combined with the xV multiplication, it does result in the ×2 factor. However, in the updated version we chose to keep it separate in order to make clear where the additional term originates, and to remain consistent with the mathematical notation we are referencing.

We appreciate your careful reading and thank you for pointing out this inconsistency.
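The two conventions discussed above can be made concrete with a small numeric sketch (this is an illustrative helper, not code from either post; B, S, n_h, and d_c follow the notation used in this thread, and `count_adds` toggles whether the summation over tokens is folded in):

```python
def xv_flops(B, S, n_h, d_c, count_adds=True):
    """FLOPs for multiplying (B, n_h, S, S) attention scores by (B, n_h, S, d_c) values.

    Each of the B * n_h * S * d_c output elements needs S multiplications and,
    if the summation over the S tokens is counted, S additions as well -- which
    is where the conventional 2x factor for matrix multiplication comes from.
    """
    mults = B * n_h * S * S * d_c                       # one multiply per score-value pair
    adds = B * n_h * S * S * d_c if count_adds else 0   # accumulation over the S tokens
    return mults + adds

# With the summation included (the "x2" convention of the earlier post):
print(xv_flops(1, 4, 2, 8))                    # 2 * 1 * 2 * 4^2 * 8 = 512
# Counting only the multiplications, with the summation tracked separately:
print(xv_flops(1, 4, 2, 8, count_adds=False))  # 1 * 2 * 4^2 * 8 = 256
```

Either bookkeeping gives the same total once the separate summation term is added back in; the difference is only where the additions are attributed.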
