Oh my god. Great article guys. Long awaited 😉
Fireworks, at some point on their website, claimed 5T tokens every day:
https://fireworks.ai/blog/virtual-cloud
Where is the 2 in FLOPs_(xV) when calculating attn_score × V?
Thank you for reading the blog post carefully and for raising this question.
When multiplying the attention scores by V, scalar-vector products are performed. Since no summation occurs in this step, there's no need for the ×2 factor. The FLOPs for the xV operation therefore depend on the number of score entries involved (B × S² during prefill, B × S during decoding), the dimensionality of v (d_c), and the number of heads (n_h).
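To make this counting convention concrete, here is a minimal sketch (ours, not from the post; B, S, n_h, d_c as used in the reply above, example numbers purely illustrative) of counting only the scalar-vector multiplications:

```python
def flops_xv_mults_only(B, S, n_h, d_c, prefill=True):
    """FLOPs for attn_score x V, counting only the scalar-vector
    multiplications (no x2 multiply-accumulate factor), following
    the convention described in the reply above."""
    # B x S^2 score entries during prefill, B x S during decoding
    score_entries = B * S * S if prefill else B * S
    # one multiplication per score entry and per element of v (length d_c), per head
    return score_entries * n_h * d_c

# Illustrative numbers only (not taken from the post)
print(flops_xv_mults_only(B=1, S=4096, n_h=128, d_c=512))
```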
Thanks for the response, but I'm still confused given the inconsistency with the previous post.
In the previous blog (LLM Inference Economics from First Principle), the xV FLOPs were stated as 2 × S² × hidden_dim, which includes the ×2 factor. But here they are listed as B × S² × n_h × d_h, without the ×2.
Both QK^T and xV are matrix multiplications requiring multiplications AND summations. If we use the same FLOPs counting convention, both should have the ×2 factor, or both should omit it.
Could you clarify why the counting method differs between the two operations in this post?
Thank you for raising this point and for carefully comparing with our earlier post.
Since MLA involves more complexity than simple attention, we decided to follow DeepSeek's mathematical formulation when counting FLOPs (the first equation block in the section Theoretical Performance Model: Computation: MLA). This choice leads to a structurally different presentation compared to the previous post.
The summation over the input tokens (∑ⱼ₌₁ᵗ […]) was indeed not accounted for. Accounting for this adds B × S × n_h × d_c during decoding, and B × S(S+1)/2 × n_h × d_c, i.e. B × O(S²) × n_h × d_c, during prefill. If this step is combined with the xV multiplication, it does result in the ×2 factor. However, in the updated version we chose to keep it separate, both to make clear where the additional term originates and to remain consistent with the mathematical notation we are referencing.
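As a sanity check on this reconciliation, here is a small sketch (ours, not from the post; numbers purely illustrative) showing that counting the multiplications and the summation term separately recovers the same total as the combined 2 × M × N × K convention; the S(S+1)/2 factor follows the causal-mask count in this reply:

```python
def prefill_flops_xv_split(B, S, n_h, d_c):
    """attn_score x V during prefill, with multiplications and the
    summation over input tokens (the sum over j = 1..t) counted separately."""
    score_entries = B * S * (S + 1) // 2   # causal mask: t scores for token t
    mults = score_entries * n_h * d_c      # scalar-vector products
    adds = score_entries * n_h * d_c       # accumulation term, as counted in the reply
    return mults, adds

def prefill_flops_xv_combined(B, S, n_h, d_c):
    """Same operation counted as a matrix multiply (2 x multiply-accumulate convention)."""
    return 2 * B * (S * (S + 1) // 2) * n_h * d_c

# Illustrative numbers only
mults, adds = prefill_flops_xv_split(1, 4096, 128, 512)
assert mults + adds == prefill_flops_xv_combined(1, 4096, 128, 512)  # the x2 reappears
```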
We appreciate your careful reading and thank you for pointing out this inconsistency.