Tensor Economics

AI infrastructure in the "Era of experience"

Piotr Mazurek — Wed, 26 Nov 2025 19:08:16 GMT

In the famous essay from May 2025, “Welcome to the Era of Experience,” Rich Sutton and David Silver proposed a new paradigm of training AI models - models that learn not through predicting the next word against text scraped from Common Crawl, but through gaining experience via interaction with environments. As we approach the exhaustion of easily scrapable text data, we predict we’ll observe a shift toward AI models increasingly trained in this fashion via reinforcement learning (RL). In this text, we discuss the technical details underpinning this process.

The text proceeds as follows: First, we introduce the concept of intelligence involution and discuss its consequences and why there is currently a strong incentive for custom RL models; then, we explain the basic principles behind GRPO; we briefly explain why RL training is so information sparse; and we do a deep dive into LoRA and the compute advantages of training and inference it unlocks. Then in the last section we discuss the broader implications; we show how the tech foundation can enable the emergence of the reinforcement fine-tuning (RFT) industry; we highlight where economies of scale can be realized and what some potential first applications of custom models are. We mention the remaining open challenges and the fundamental limitations that can potentially make RFT a similar flop as the first wave of SFT has been.

We intend this text to provide the reader with the theoretical basis needed to reason about AI infrastructure in the context of reinforcement learning. We argue that in the next 6-12 months there are significant opportunities for new businesses to be built around recent developments in RL, particularly for product companies to build sustainable moats through custom models trained on their proprietary environments, as well as for infrastructure players to build “picks and shovels” enabling the RL economy.

Intelligence involution and its consequences

As of late 2025, it appears that the gap between available LLMs is extremely small. There are marginal differences between proprietary LLMs, with some models slightly stronger in some niches like creative writing or coding, but overall differences seem to be diminishing over time. Moreover, the gap between open-source and proprietary models is rapidly closing. EpochAI estimates (see Fig. 1) the lag at around three months; with Kimi K2 Thinking’s release, some argue the gap has largely closed1.

Figure 1: The grate between open-weight models and closed weights models keep closing. Sourced from Epoch.ai

Interestingly, China now leads open-source development. Chinese models have overtaken Western ones in cumulative downloads worldwide (see Fig. 2). While downloads aren’t a perfect proxy for popularity2, the Twitter “vibe check” seems to confirm this. These days, every RL experiment appears built on Qwen. Recently, Cursor and Windsurf appear to have built their models on top of Chinese foundations as well.

Figure 2: China has overtaken the US in cumulative open-source AI model downloads: A16Z twitter

There are dozens of Chinese labs in this space, and it seems like every week a Chinese food delivery company or consumer electronics firm releases a competitive AI model. Next-token prediction appears to share characteristics with EVs, solar panels, or batteries - where Chinese “capital markets” are capable of supporting dozens of entities that relentlessly compete with each other, continuously driving down the price per unit of intelligence, glutting international markets, and making it close to impossible for competitors abroad to make any revenue with models not at the absolute bleeding edge3. While OpenAI, Anthropic and Google remain ahead and this doesn’t yet apply to them, if the trend continues, it seems inevitable their margins will eventually be affected.

We refer to this phenomenon as intelligence involution, where - similar to EVs or solar - competition is so fierce that everyone makes close to zero profits, moats only come from scale, and the pace of competition slowly bleeds out anyone not at the absolute frontier.

Since competition is so fierce, it continuously drives down the price of tokens. For example, DeepSeek during the transition from v3 to v3.2 dropped prices from $2.19 to $0.42 per million output tokens - 5x cost reduction in a span of a few months while simultaneously boosting general model capabilities.

With competition in general-purpose models this brutal, requiring absolute frontier performance to generate any substantial revenue, the obvious choice for less sophisticated players seeking quick profits is model specialization: targeting a niche that should be more defensible than competing in the foundation model space.

Traditionally, companies specialized foundation models through Supervised Fine-Tuning (SFT). The approach was straightforward: collect input-output pairs for your domain, then retrain the model to mimic those outputs. However, as we enter the era of reasoning, SFT is being squeezed out of relevance by a “pincer movement” - it is becoming economically irrational for simple tasks and technically insufficient for complex ones.

For static knowledge or stylistic specialization, SFT has become largely unnecessary. Modern base models are now powerful enough that in-context examples (few-shot prompting) match fine-tuned performance without the complexity of managing model weights.

With the commoditization of prompt caching, this approach is also much cheaper. As of late 2025, caching allows us to “pin” massive instructions into memory at near-zero cost. For example, as of November 2025, DeepSeek charges $0.028 per million cached tokens. Storing 10,000 tokens of examples in the prompt and serving 1 million requests costs just $280 in caching fees:

This is potentially cheap enough to eliminate the need for SFT entirely for straightforward tasks. Fine-tuning a custom model for the same task would cost thousands in compute and engineering time, only to yield a model that becomes obsolete the moment a better base model is released.

While caching kills SFT at the low end, the “Reasoning Data Barrier” kills it at the high end. Even if human data were available, SFT creates a fundamental ceiling: it limits models to mimicking human baselines rather than discovering novel strategies that surpass them. As per the era of experience, there are certain problems for which prompt-answer pairs simply do not exist. The only way to discover the right reasoning chains or tool sequences is for the model to interact with an external environment and observe the effects of its actions. Based on these observations, the model’s behavior is adjusted iteratively. Critically, we explicitly assume the correct steps are unknown upfront and can only be learned through interaction - by observing how the environment responds and adjusting accordingly.

The challenge intensifies with tool-calling and agentic workflows. When models need to orchestrate multiple tools - whether executing Python code, performing web searches, calling APIs like the SharePoint MCP server, or any other programmatic action (see Fig. 3) - the correct sequence of tool invocations and their specific parameters must be discovered through trial and error. Manually creating training examples for every possible toolchain and edge case quickly becomes infeasible.

Figure 3: Example of two different RL rollouts involving different tool calls. LLM can take multiple paths, involving different tools used at different timestamps to achieve the final goal. Source: Building Cursor Composer with Sasha Rush, YouTube.

This creates a natural opening for custom models. Many companies would prefer to own their intelligence stack entirely - for data privacy, cost control, and because they don’t want to depend on vendors whose priorities and pricing might shift unpredictably. Previously, this wasn’t feasible: building competitive models required frontier-level base intelligence that only a handful of labs had.

Intelligence involution changes this calculus fundamentally. With open-source models now matching proprietary performance at a fraction of the cost, companies can build defensible RL-specialized models on top of these commodity foundations. The moat comes not from superior base intelligence - which is rapidly commoditizing - but from proprietary access to specialized environments and the continuous learning loops within them. These environments are unique and proprietary: a company’s internal research infrastructure spanning SharePoint, Confluence, and legacy enterprise systems; an e-commerce platform’s feedback loop where model behavior is continuously refined based on observed customer actions; a SaaS product’s onboarding flow that adapts based on real user engagement patterns, etc. None of these can be replicated by GPT-5 simply because OpenAI never had access to these interactions during training.

However, this defensibility comes at a cost of scale. Training a model for a specific environment to solve a specific problem is inherently less scalable than a single base model like the one powering ChatGPT, where hundreds of millions of users interact with the same foundation model, controlled purely through prompting.

To understand the cost structure and opportunities in RL, we need to examine how these models are actually trained. This requires introducing a few key concepts: GRPO, LoRA, and the fundamental information-theoretic principles that make RL feasible at scale.

GRPO 101

The “renaissance” of RL for large language models (LLMs) can be traced back to the introduction of Group Relative Policy Optimization (GRPO) in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models in April 2024. GRPO was the fundamental technique that enormously simplified running reinforcement learning with verifiable rewards (RLVR)-type training. Later it was applied in the famous DeepSeek R1 model that shook the global financial markets in January 2025.

The GRPO algorithm is conceptually quite simple; it can be compressed to 12 lines of pseudo-code (see Fig. 4). The main innovation comes from the fact that, in contrast to previous RL methods like PPO, it is far simpler to implement. There is no need to train a separate critic model, and as measured empirically, it proves to be much more stable and sample efficient.

Figure 4: GRPO algorithm; arxiv

While it might seem complicated at first glance if the reader is unfamiliar with RL notation, upon closer inspection it is pretty simple. In Tab. 1 we present a detailed explanation and example values for all symbols in the algorithm from Fig. 4.

Table 1: Notation and Variables Used in Group Relative Policy Optimization

GRPO is an example of so called gradient policy optimization algorithm. We use this name to refer to reinforcement learning methods that directly adjust a policy’s (model`s) parameters using gradients to maximize expected cumulative rewards. In gradient policy optimization algorithm we traditionally have two phases:

Rollout generation: Using the current policy, we generate outputs. For LLMs, a rollout is the sequence of tokens produced for a given prompt (see variable `outputs` in Fig. 5).
Optimization step: We evaluate the generated rollout with a reward signal, then use the gradient of the policy (which shows how to change parameters to increase reward) to adjust the model parameters.

In GRPO, for each prompt we sample G answers, calculate the average reward for this group, and subtract it from each individual reward to get the advantage - hence the name “group relative” policy optimization (see Fig. 5).

Figure 5: Advantage calculation in GRPO for the prompt “Solve: 2x + 3 = 7” with group size G=4. For each output, the advantage is computed as its reward minus the group’s mean reward (0.55 in this example).

During the training step, we first calculate the ratio of log-probabilities: new policy divided by old policy (see Fig. 6). We apply PPO-style clipping to this ratio, preventing the model from diverging too far from its previous version (previous optimization step). This clipped ratio is then multiplied by the advantage. As shown in Fig. 5, the advantage measures how much better (or worse) each completion performs relative to the group mean, to other completions for this exact prompt.

Figure 6: GRPO loss calculation for each token. The advantage A[i] is the same for all tokens in a given output o_i. The loss combines a PPO-style clipped objective (preventing large policy updates) with a KL penalty term that keeps the policy close to a reference model.

The core idea behind every RL algorithm is to shift the model’s weights, so that high-scoring outputs become more likely and low-scoring ones become less so.

A key limitation specific to GRPO is that, unlike methods with a learned Critic (like PPO), it assigns the same advantage to all tokens in a sequence. While this means equal credit is assigned even in mixed-quality outputs- for example, in a multi-turn WORDLE game where the model guesses incorrectly for three turns but succeeds on the fourth - we accept this approximation because it removes the massive overhead of a value model. In practice, DeepSeek observed that this coarser heuristic is not only significantly cheaper to calculate, but often leads to more stable training dynamics than trying to learn a complex token-level value function.

Just 1 bit per rollout

The “defining” blog-post of the recent boom in RL is “LoRA Without Regret” by John Schulman and the Thinking Machines team. The argument the authors make is that because policy gradient methods (like GRPO) are so information sparse (only 1 bit per rollout) LoRA adapters have enough capacity (enough parameters) to efficiently learn all of this information without the need to modify the original model parameters. This has profound implications for both training and inference which we discuss in detail in the next section.

As the authors put it:

… when we get a few key details right, LoRA learns with the same sample efficiency as FullFT and achieves the same ultimate performance.

The intuition behind this goes as follows. In GRPO, the gradient update for each rollout is proportional to

where q is the prompt, o_i is the generated output (rollout), and A_i is the advantage (reward minus group mean).

The gradient has two components:

Direction

shows how to change the policy to make output o_i more likely. This depends only on the current policy π and the sampled output—it contains zero information about the reward.

Advantage

A_i s the same for all tokens in output o_i. This scalar is where ALL information about the reward function resides.

By the data processing inequality:

and:

where H is entropy (the maximum information content).

While the calculated Advantage is technically a continuous real number (due to normalization against the group mean), its information content is bounded by the granularity of the reward function:

If reward is binary (correct or incorrect)

If we have granular reward - the reward allows for partial credit (e.g., 5 levels: 0,0.25,0.5,0.75,1.0)

Depending on how “granular” your reward is it log_2(N) is the upper bound of what a single rollout provides.

This is much more sparse signal than the supervised training where each token provides ~1 bit of information, so training is way less efficient. It is possible that our rollout generates tens of thousands of tokens, and we put in hundreds of thousands of input tokens the tool calling, and all of this will be summarized into a few bits worth of information, requiring us to run tens of thousands of rollout to learn anything useful.

While this approach is inherently inefficient, as we previously discussed, in cases where we don’t know the correct labels, this “exploratory” approach seems to be the only one we know that works reliably.

The upside of this extreme information sparsity is that it can be efficiently encoded with very few parameters. To quote the Thinking Machines team:

Past work has shown that neural networks can store 2 bits per parameter. These results pertain to the maximum amount of information absorbed in the long-training limit, not to the compute efficiency or rate of learning.

The Thinky team showed (see Fig. 7) that LoRA adapters of fairly low ranks can learn from RL basically as well as full model fine-tuning. These results have since been independently replicated by many people in the community (see Fig. 8).

Figure 7: “Experiments on the DeepMath dataset with Qwen3-8b-base. In the left plot, we show the learning curve for different ranks and full fine-tuning. For each of these settings, we show the best learning rate, which results in the highest final performance. On the right, we plot learning rate vs final performance. As in our previous math experiments, LoRA seems to have a wider peak of near-optimal learning rates.“ Lora without regret.

Figure 8: One of the examples of community replicating the results observed by Thinky; twitter

This has profound implications for the economics of training and inference. LoRA makes it possible to train massive models on relatively modest hardware (within a single node), as long as the model parameters fit into memory. Moreover, if the model customization relies on LoRA adapters running on top of a single base model, this has significant effects on the economics of serving such models. It becomes possible to batch together multiple requests from multiple users, use a single base model, and assign different adapters to every request in the batch. This makes inference significantly more affordable. An inference provider might reuse a single base model for thousands of clients, each with their own custom adapter, serving their own custom-trained RL model.

In the next sections, we will delve into the details of why LoRA is so much more efficient in model training and why it has a much smaller memory footprint. We will then explain how multi-tenancy works in a modern inference engine and show how well such models perform and scale. We want to highlight once more the key insight: LoRA fine-tunings perform comparably to full model fine-tuning when trained with policy gradient methods such as GRPO (“no regret”). Because the training signal is so information-sparse, there’s no penalty for training fewer parameters.

Backpropagation and LoRA

Before we proceed to training and inference, we should introduce the two concepts:

How backpropagation in model training works.
How low-rank adaptation (LoRA) works.

Let’s start with backpropagation. Consider a simple 2-layer neural network with sigmoid activation. The forward pass goes as follows:

Now if we intend to train the said network, we need to define some objective we want to minimize (J) and we will minimize it using stochastic gradient descent (SGD). We want to optimize the parameters (weights) of the network to minimize our loss (the objective we defined). To find these values, we use the chain rule.

As a reminder, the chain rule states that for composite functions:

For multiple nested functions, we apply the chain rule repeatedly:

A neural network is exactly this type of composition. Notice how our forward pass creates a chain of functions:

Hence to calculate the gradients we need to run the following steps:

Notice how to calculate the gradient with respect to W2 we need to store the values of intermediate activations - input to the layer 2, a1. This requirement creates a significant memory burden: we must cache these intermediate activations during the forward pass to use them later in the backward pass. Crucially, this cost scales linearly with the batch size. For every additional example we process in parallel, we must allocate memory for its specific activations, explaining why increasing the batch size rapidly consumes available VRAM.

What is also important to notice is how much memory is consumed by storing the gradients of parameters. Since we need a gradient for every single weight in the layer it takes as much memory to store the gradients as it takes to store the weights themselves, further increasing the memory footprint.

In practice, estimating the memory footprint gets even more complicated. Some optimizers require additional memory - Adam, for example, stores momentum and variance estimates for each parameter, often doubling or tripling the memory needed beyond just weights and gradients. Our intention here is to provide the reader with intuitions on where compute and memory costs come from in model training, as this context is crucial for understanding why LoRA makes training more efficient.

Under the hood, for the forward and backward pass, PyTorch implements the interface from Fig. 9. Notice how we need to cache the values of layer inputs (x) and how we need the values of the gradient with respect to the input from the next layer (grad_out) to come as an input parameter.

Figure 9: This is what’s actually happening when you use nn.Linear. Notice how the forward pass must cache x and weight, which are then used in the backward pass to compute gradients.

Looking at this, it is quite easy to see how we have to do roughly twice as many operations during the backward pass as we do during the forward. In the forward pass we need to do a single matrix multiplication W @ x; in the backward pass we need to do two large-scale matrix multiplications, one to calculate the gradients with respect to weights and another to calculate gradients with respect to the inputs.

LoRA (Low-Rank Adaptation) is a popular efficient fine-tuning method. As we argued earlier, because of “no regret” property of gradient policy optimization methods, this is very likely how custom models of the future will be trained.

The idea behind LoRA is quite straightforward - instead of optimizing one big parameter matrix W, we freeze it and learn a low-rank update through two much smaller matrices, A and B.

where matrix A is the “Down-projection” (rank r by input dim) and matrix B is the “Up-projection” (output dim by rank r).

During training we freeze the original parameter (meaning we don’t calculate gradients for it), and we only calculate the gradients for the small matrices A and B. Since they are so much smaller, the typical ranks range is 1 to 16 - they are 3 to 4 orders of magnitude smaller than the original matrix.

The result is, on one hand, a massively reduced memory footprint for training. We used to store grad_weights of size (out_features, in_features). For example in Llama3.3. 70B, if we apply this to the down projection in the MLP layer, it comes to:

However, if instead of storing the gradients for the full parameters, we store the gradients for matrices A and B, assuming we use rank 1:

The savings are not limited to just a massively reduced memory footprint. They also apply to compute, which is substantially reduced. In LoRA, if we are slightly smart about the order of operations, the grad_A and grad_B matrix multiplications are very small compared to grad_weight in full model training, in way fewer FLOPs per backward pass (see Fig. 10). The only big matrix multiplication remaining is calculating gradients with respect to the input (grad_x). This means that the cost of the backward pass in LoRA is roughly half that of full model fine-tuning; if we include the forward pass, LoRA training is about 2/3 of the compute cost of full model training (the forward pass has roughly the same compute cost in both cases).

Figure 10: Under the hood of LoRA training. The backward pass only computes gradients for the small A and B matrices, dramatically reducing both memory and compute compared to full fine-tuning.

All of this means that training of relatively big models can be successfully done on even a single node. As long as the model weights fit in memory, we should be able to train MoE-style models with hundreds of billions of parameters. In full-model fine-tuning, storing the gradients and optimizer states balloons the memory footprint far beyond what a single node can handle, requiring multiple nodes, which significantly increases the training complexity. With the adoption of LoRA, training can be executed on a single node. With the adoption of LoRA, the training itself can be executed on a single node, while inference runs on another independent setup.

Economies of scale in inference

As the seasoned readers of our publication probably already know, LLM inference is primarily memory-bound, meaning that token throughput is mainly limited by the time it takes to load the model parameters into the GPU’s streaming multiprocessors (SMs) from GPU memory rather than by the time it takes to perform calculations within SMs. If the reader is unfamiliar with the concept of a roofline model and computation being compute-bound or memory-bound, we highly recommend our past article that introduces these concepts and explores them in detail in the context of LLM inference.

Being memory-bound has pretty straightforward implications. In order to decrease the cost of producing a token, we need to increase the number of tokens produced within a unit of time. If producing a token means mainly waiting to load the model parameters, we can load them once and use the same loaded parameters to serve multiple queries in a single batch.

This increases the cumulative number of tokens produced sublinearly - throughput improves with batch size, but at a diminishing rate. As we increase the batch size, this comes at a cost: the speed individual users experience will decline over time, as demonstrated in Fig. 11. This is mostly due to the KV cache growing larger. As we increase the batch size, the KV cache starts to dominate the load time, and the improvements we observe diminish. However, as we increase the batch size, we saturate the available compute better, driving down the cost of producing an individual token, making it cheaper to serve.

Figure 11: Theoretical estimation of throughput at different batch sizes vs real world observations. 2000 tokens in, 300 tokens out. 4xH100 SXM5. Figure from “LLM Inference Economics from First Principles”. Note that the x axis is log scale and y is linear scale.

This is even more relevant in the case of Mixture of Experts (MoE) models - the most popular architecture of powerful models as of November 2025,. As we wrote in our previous article on MoE inference:

During the decode phase, each token in the batch is activating only a small subset of parameters at every layer. This means that each request requires us to load a different part of the model, as demonstrated in Fig. 12. As the number of requests in a batch increases, a more and more substantial portion of the model will have to be loaded from global memory. The experts are chosen semi-stochastically1, so some of the tokens in the batch will be routed to the same expert. As we progressively increase the batch size, more and more experts will be shared by different requests. This means that at the larger batch sizes we will partially recreate the situation from the dense model - sharing the cost of model loading between multiple users. Unfortunately this means that we will need significantly more requests.

Figure 12: Two request tokens activating different parts of the model, requiring us to load more weights, saturating the memory bandwidth. Figure from “MoE Inference Economics from First Principles”

The main thing the reader should take away from this is that the key to good inference economics is large batches where we share the cost of loading the model weights between as many users as possible, achieving economies of scale of sorts.

This need for large batches might be problematic when we serve custom model fine-tunings. If we were to do full model fine-tuning for a model used by a particular user, this would be very hard to achieve unless we can guarantee massive demand. Some providers have this luxury - for example, Cursor Composer clearly has enough demand to achieve the necessary batch sizes - but if we are serving a model trained via RL by a smaller company to achieve superhuman performance at some niche task, it won’t be possible to find enough demand. Luckily for us, LoRA addresses this problem.

Throughout this text, we refer to the model on top of which we add LoRA adapters as the “base model.” This should not be confused with “base model” meaning a pre-trained model before instruction tuning. The base model can be any model on top of which we would run; e.g., for this model, Qwen3-30B-A3B-Instruct-2507 would be considered a base model.

When we are running LoRA-based model fine-tunings, it is much easier to achieve the necessary batch sizes. We can gather requests from multiple users, each using their own LoRA adapter. During the forward pass, we share the cost of loading the base model weights across multiple users. We still need to load the LoRA weights, which adds a little overhead, but as we showed in the previous section, LoRA adapters have a minimal memory footprint, and loading them is very fast.

This idea is called multi-tenancy and is the core technique that will enable the era of experience-style custom models to be served cost-efficiently. The inference provider will be able to serve thousands of adapters built on top of the same base model. During inference, we allocate a dedicated buffer to store the LoRA adapters of predefined shapes. Such a setup enables dynamically loading the adapters for which there is currently demand. If the particular adapter is not used at some point in time, it is offloaded from the buffer and replaced by another adapter requested by another user.

The exact details of how to implement this are quite complex and beyond the scope of this text, but the high-level is demonstrated in Figure 13. Multi-tenancy means we are able to dynamically load different adapters and share the cost of using the base model across multiple users, driving down the cost for individual users.

Figure 13: Visualization of how various LoRA adapters can be efficiently served alongside the base model. The batched computation of the base model is implemented by GEMM, with the computation in adapters implemented via a custom CUDA kernel. Figure from S-LORA.

Modern inference engines such as SGLang and vLLM already have pretty sophisticated mechanisms to serve multiple LoRAs. In our experiments we were able to achieve ~85% of the baseline (no adapter) throughput when using LoRAs, as demonstrated in Fig. 14. Since the cost is a directly tied to the throughput achieved, this nicely shows how the cost of serving multiple adapters is only marginally higher than serving the base model alone.

Figure 14: Throughput comparison with tokenomics. We measure the performance of SGLang v0.5.5 running on a single B200 rented from DataCrunch (Verda). We run Qwen-32B with this LoRA adapter. We compare the performance of Qwen32B without any adapters to performance when all requests are making calls requesting LoRA adapters. Lora rank is 32. Adapters are applied to both the attention and to MLP weights. We pretend that the setup is running with 8 distinct adapters (we load a single adapter with 8 different names). In the LoRA setup (blue line) each request uses LoRA. We simulate the uniform usage of the adapters (round-robin strategy). Full experiment setup can be found here.

Everything async and multi-turn

As we discussed before, in RL training we have two phases: training and generating rollouts. The workflow goes as follows:

We gather a set of prompts that we want to train the model on. This can be anything, from “write a Python program sorting numbers” to “solve this PhD-level math problem.” The only requirement is that we have some way to verify (or at least estimate) how well our model did on a particular problem.
We use the inference worker to produce the replies for a given prompt. Once the rollout is finished, we assign a reward to it. The exact reward formula is highly dependent on the problem; it can be anything from simple string matching (1 if matching, 0 if not) to sophisticated evaluations consisting of multiple steps such as compilation, running tests, comparing execution time, etc.
Once rewards are calculated, we can use them to calculate the advantages (as we showed in Fig. 5) and proceed to run the optimization step via GRPO as explained by the pseudocode in Fig. 4.
Once we run the optimization step and we have the new version of the model, we update the inference worker with the updated weights, and we repeat the cycle.

This is a high-level overview of an RL training pipeline. However, in practice, making it work is extremely challenging. One canonical problem is token lag. Token lag refers to the number of optimizer steps between the current policy (π_θ being trained) and the old policy (π_θ_old that generated the samples). We sample the rollouts using some policy. We grade the outputs, and then we proceed to training. Due to memory limitations, we train on small batches, consisting of a few rollouts - not on all rollouts at the same time. We update the policy π in each optimization step (on each minibatch). This means that as we calculate the next steps, the divergence between our current policy and the original policy from which we sampled the rollout widens.

In RL, this is traditionally measured through the Effective Sample Size, or ESS. When using off-policy RL, ESS measures how many samples from the current policy π_θ would yield equivalent performance to weighted samples from the sampling policy π_θ_old. The (normalized) ESS is defined as:

Where:

N is the sample size.

ESS ≈ 1.0 (100%): Perfect! Data is basically on-policy, all samples equally useful
ESS ≈ 0.1 (10%): Most samples are useless, a few dominate → high variance, unstable training

ESS is like asking, “Out of my 1000 samples, how many are actually informative vs. just noise?” If only a few samples have all the importance weight, you effectively have very few useful samples.

As we move away from the original policy, the token lag increases, as demonstrated in Fig. 15. This causes ESS to decrease, meaning our samples become progressively less useful for training.

Figure 15: Token lag increases as we process batches in conventional RL. Darker green = more lag. By the time we train on the bottom rollouts, they’re much more off-policy than when they were generated. Figure from PipelineRL.

This is a major problem in RL - from the later samples we learn less and less. This naturally limits our sampling batch size. We can’t produce too many examples for the trainer, because they will not be very useful, as we’ve diverted too far from the sampling policy anyway. This limit in the batch size introduces another problem. Since inference is memory-bound, we want as large batches as possible to achieve high utilization and produce the maximum number of per second. Yet because of low ESS in samples coming later, we can’t effectively utilize these larger batches. This is very wasteful and limiting, substantially slowing down the training and driving up the costs.

To address this, nowadays most organizations are using some sort of Asynchronous RL. The concept is rather simple, as we demonstrate in Fig. 16. We operate three types of workers running concurrently: the training worker, the inference worker, the grading worker. They all run at the same time and communicate through queues.

Figure 16: High-level overview of async RL. Three workers run concurrently and communicate through queues. Inference workers pick up prompts, run rollouts, which may involve multiple turns of interaction with tool calls and environment feedbac, and place finished rollouts in a queue. Grading workers pick them up and assign rewards - this can be simple string matching, code compilation and testing, or LLM-as-judge evaluation. Once rewards are calculated and advantages computed, training workers pick up examples and run GRPO optimization steps.

First, we have inference workers. They pick up prompts from the prompt queue and sample the rollouts, then they push the rollouts to the “completed rollout queue,” where they can be picked up by the grading worker. Depending on the problem, grading can either be very fast, e.g., for a simple string comparison, or take longer than generating the rollout itself, e.g., when it requires time-consuming compilation of a CUDA kernel. Ideally, we would like to scale the grading workers so that the overall throughput of the system is limited by the speed of the inference workers rather than by the grading workers, since the grading workers are usually CPU-bound - hence scaling them should be much cheaper than scaling the GPU-based inference.

After the rollout has been assigned a reward, we can proceed to calculate the advantage. In GRPO, advantage for rollout i is calculated as:

This means we can only calculate it after we have graded all rollouts for the same prompt. In Fig. 5, we demonstrated a simple example of this calculation.

Multi-turn interactions and tool usage add additional complexity to this pipeline. In multi-turn RL, instead of generating a single response, the model engages in back-and-forth exchanges, each turn building on previous context. The rollout now consists of multiple conversation turns, and the reward might only be assigned after the entire conversation concludes. For example: “Did the model successfully build a computer program I asked for across all turns?” or “The model made 6 guesses in WORDLE - did it guess the correct word in the end?”

Additionally, modern RL setups often involve tool-calling, where the model can invoke external functions during generation, for example, as demonstrated in Fig. 3 in the context of Cursor Composer. The “conversation” might consist of model replies, feedback from the environment (e.g., “you did not guess correctly, try again”), and the results of function calls (e.g., what the Python interpreter returns). While there exist all sorts of sophisticated “agentic frameworks,” at the end of the day, an “agent” is just a loop: iterate over turns, pass context to the model, execute any tool calls, append results back to the context, and repeat. We provide a high-level example of such a system in Fig. 17.

Figure 17: A minimal implementation of an agentic loop. We define the tools our model can use; using the single common interface provided by MCP, we ask the model to provide function calls, then we run these functions and append their outputs to the conversation. We continue with this setup as long as the model returns the function calls; when it just returns a reply, we assume the rollout concluded and the model provided the final answer - answer that we can grade using the grader worker.

Once we grade the examples, we can proceed to training and run an RL optimization step. The trainer worker requests batch size B examples from the graded examples queue. In practice we would most likely choose the biggest batch size that will fit into GPU memory. We calculate the RL loss (e.g. GRPO), run a single optimizer step, and send the updated weights to the inference workers.

Inference workers, when they get a weight update, briefly pause generation to update the model weights, and then continue from where they stopped. Crucially, the KV-cache is not recomputed, meaning it goes out of sync with the new weights - a misalignment that adds to the unintuitive nature of why this works at all. Consequently, a single rollout will be comprised of tokens sampled using various, continuously updated versions of the policy. It ensures that the token lag will be spread relatively evenly across all examples, as demonstrated in Fig. 18, not concentrated in the examples processed later in time by the training worker (as we saw in Fig. 15).

Figure 18: In async RL, token lag is distributed within each rollout rather than across batches. Earlier tokens have higher lag (darker green), but every batch has the same lag structure. This maintains more consistent training effectiveness than conventional RL. Figure from PipelineRL.

While it seems unintuitive that it works at all, it has been shown empirically that it performs remarkably well, learning different problems much faster than conventional RL. The key additional benefit of async RL is that it enables sustainably running larger batches and producing more tokens in the same amount of time, resulting in faster training through better hardware utilization.

In RL the learning speed can be expressed as a simple product between how good our samples are at teaching the model the new task and the number of samples we process per unit of time.

Where:

Speed (ΔR/Δt): How fast does reward improve over time?
Effectiveness (ΔR/ΔS): How much does reward improve per sample? (data quality)
Throughput (ΔS/Δt): How many samples can we process per unit time? (computational efficiency)

In classic RL the batch size (throughput) was naturally limited - as over time the effectiveness of the samples was going down, to zero at some point, meaning that increasing the batch size provided no additional benefits. With Async RL the effectiveness is maintained for longer, and we can operate on substantially larger batches, driving up the throughput and increasing the learning speed as a result (see Fig. 19).

Figure 19: Demonstration how async RL (Pipeline RL in this case) achieves higher learning speed (ΔR/ΔS x ΔS/Δt) across different setups. Figure from PipelineRL.

The best introduction to async RL the reader can find is “PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation.” We want to highlight this to the readers as a truly remarkable work! Very well written, with a great open-source implementation. We highly recommend the reader give it a read when they have time if they want to understand async RL concepts at a deeper level.

LoRA in RL

Both the training itself and rollout generation can greatly benefit from combining async RL with LoRA.

In model training, when we use LoRA, we don’t need to store the gradients and optimizer states for all the parameters. We just need to store the base model weights, gradients for small adapters (3-4 orders of magnitude smaller, as we showed earlier), and activations4. Since gradients and optimizer states are a substantial portion of the memory footprint during full model training, this means that when training using just the LoRA adapters we can massively increase the batch size. The memory footprint of activations will still scale linearly with the batch size, as will the FLOPs, but LoRA requires only about 2/3 of the FLOPs compared to full fine-tuning.

While the exact details depend on factors such as which base model we are using, where we apply LoRAs, and what the LoRA rank is, the key takeaway is that LoRAs enable us to run training at substantially larger batches due to massively reduced memory requirements for gradients and optimizer states. This means that the training worker can process rollouts from the inference workers faster. Since it processes rollouts faster, at some point the training worker will start to idle - there won’t be enough rollouts to consume. This means we can reassign the workers, moving some compute resources (GPUs) from training to inference, further accelerating the throughput. Now inference has more resources because training requires fewer, so we can produce more rollouts.

As we discussed in the previous section, running inference with LoRA adapters results in only minuscule performance degradation compared to the baseline (”without regret”). This has profound implications for the economics of rollout generation. It becomes technically feasible to combine rollout generation from multiple policies, each optimizing different reward functions, all built on top of the same base model within a single inference setup.

Inference is memory-bound, so throughput and utilization per GPU increase as we increase the batch size. If we combine multiple LoRA adapters, we can linearly scale the batch size, driving down the marginal cost of producing a token, decreasing the overall cost or enabling much larger scale on some fixed budget as a result.

This property can be extremely useful both for RFT and for full model fine-tuning. For we can use multiple policies and combine them into a single massive batch, fully utilizing the provided compute. Already today a number of companies, such as Thinking Machines or Weights and Biases, offer “RL as a service”-type of products, where clients can train their LoRA-based model fine-tunings running on top of open-source models. Using a single inference setup for multiple clients can be an important axis of cost optimization for these providers, driving down the costs and increasing the margin (assuming there is enough demand, which remains unclear as of today).

What is less obvious is that this ability to combine multiple independent policies into a single inference setup can also be valuable for organizations training universal foundation models. Often when training the model, initially the team would train a few different, independent models, each optimizing a different reward function. After we have a few trained models with good performance on these independent tasks, we would apply some sort of model distillation technique to bring all of these capabilities into the single final model. For example in the report for DeepSeek v3.2 we read:

For each task, we initially develop a specialized model dedicated exclusively to that particular domain, with all specialist models being fine-tuned from the same pre-trained DeepSeek-V3.2 base checkpoint. In addition to writing tasks and general question-answering, our framework encompasses five specialized domains: mathematics, competitive programming, general logical reasoning, agentic coding, and agentic search. Each specialist is trained with large-scale Reinforcement Learning (RL) computing. Furthermore, we employ different models to generate training data for long chain-of-thought reasoning (thinking mode) and direct response generation (non-thinking mode). Once the specialist models are prepared, they are used to produce the domain-specific data for the final checkpoint

If the DeepSeek team were to train these models through LoRA adapters (remember “no regret”), they could combine prompts for different models and run them on top of a single inference setup. This would be much, much more efficient than running five separate setups.

The mental picture we hope the reader takes away from reading this article is the visualization shown in Fig. 20. To increase the throughput of token generation, we need to run very large batches. This comes at a tradeoff - any single generation will be relatively slow, but this doesn’t matter much for RL rollout generation, since we don’t have human users impatiently waiting for a response to appear. All we care about is producing as many tokens as possible given a hardware budget, and here we achieve just that.

Figure 20: To achieve full utilization of GPU resources in an inference setup, large batches are required. This decreases the throughput per request, but the cumulative throughput, what we care about in the end, is maximized. Combining multiple policies running on top of a single base model makes it easier to achieve larger batches, driving down the cost for each client.

The ability to run larger batches is especially relevant for large-scale MoE models, such as DeepSeek. As we elaborated in detail in our previous text on “MoE Inference Economics from First Principles,” this class of models, due to their sparse computation patterns, greatly benefits from a high concentration of GPU resources in a single setup. If the inference provider is capable of running a setup spanning multiple nodes, the performance per GPU will be significantly improved compared to a setup spanning one or two nodes, as demonstrated in Fig. 21. This means that if our training setup allows for larger batches, we can combine multiple nodes, increasing the performance per GPU. All these factors compound, resulting in substantially faster and cheaper training of RL models.

Figure 21: Token throughput for DeepSeek v3 measured on NVL72 SGLang Blog.

Additional benefit of the minuscule memory footprint of LoRAs comes up during the policy update when the training worker sends the updated policy to the inference workers. Because the LoRA weights are so small, sending them from one worker to another is very quick. There is little to no overhead - updates are fast and inference GPUs aren’t left idling during synchronization.

Intelligence markets through RFT

In this text we tried to introduce the technical details that will underpin the “experience economy.” We argue that policy gradient optimization methods, such as GRPO, are very information sparse, providing only log(N) bits per rollout, where N is the reward granularity. Because of this, the data learned by the model can be efficiently encoded in very few parameters. We can do this parameter-efficient fine-tuning technique, such as LoRA, without any penalty (“no regret”) to the model performance compared to the full model fine-tuning.

LoRA makes training substantially cheaper because we don’t need to store the optimizer states nor gradients for the parameters. Hence, on the fixed hardware budget, we can run substantially larger batches, increasing the speed of training. In the context of RL, this means that we can allocate fewer resources for training and more towards inference workers, driving up the data generation speed.

LoRA unlocks the ability to scale the batch size, which plays nicely with ideas of AsyncRL - the leading paradigm of doing RL in 2025. In AsyncRL different tokens in a single rollout are sampled using multiple continuously updated policies. Empirically it has been shown to improve the end-to-end performance as the token lag is spread more evenly throughout the samples.

Because competing with the foundation’s general-purpose model is so challenging and capital intensive, focusing on model specialization seems like a promising path for less resourceful organizations. The biggest of the recent releases in this domain was the introduction of Tinker by Thinking Machines, where they enable organizations to run RL workloads. The distributed training is abstracted away and handled by Thiky. The user of the API has to only focus on reward modeling. Pricing is divided into inference parts (prefill + sampling) and training parts (see Fig. 22).

Figure 22: Token level pricing of Tinker 18.11.2025. Sampling is substanitally pricer than prices offered by the inference prividers such as Together or Fireworks. This demonstrates the potential to improve the margins if an inference provider offers such a service.

The RFT-type services, in our opinion, offer a great opportunity for NeoClouds to increase the margins on compute in smaller clusters. RFT is specifically well positioned to utilize the smaller clusters (a few hundred nodes). Since the memory requirements are so significantly reduced, and we have only 2/3rds of the FLOPs of full-model fine-tuning, training can be conducted on substantially reduced resources, potentially even within a single H100 node for models of +200B parameters. Moreover, using multi-tenancy, it is possible to gather requests from multiple users into a single large batch, further increasing the cost competitiveness.

This has great potential to improve the profit margins on the underutilized compute resources gathered in the smaller datacenter - the compute that up until now was of limited potential, as it was not possible to run larger training runs of it - thing that offers the highest margins.

You can see that NeoClouds are already interested in this direction. E.g., in September 2025, CoreWeave acquired one of the startups pioneering RFT - OpenPipe. CoreWeave also acquired Weights and Biases, which also offers an RL-as-a-service-type product, since October 2025.

The big question is whether there is actually a market for RFT - if there is any point in developing these capabilities, or will the smart models be able to just “figure it out” in context, as they get more and more intelligent? This we can’t answer. What can be argued is that as of today it is not the case; there are clearly moats to be built in this fashion. There are some capabilities that are inherently “not there” in cloud models.

Two good examples are the model’s ability for persuasion in a particular domain. The best example here is the AdLlama paper from Meta. Researchers first train a reward model based on the user preference data. The reward model is able to estimate which headline will “click” with users and which will be ignored. Then the said model was used to optimize Llama 2 (it took them a while to publish this paper) using PPO, so it is more likely to produce more persuasive outcomes. The high-level overview we present in Figure 23.

Figure 23: “Our first contribution, illustrated in the left panel, is RLPF, a reinforcement learning (RL) approach to post-training an LLM based on an aggregate performance metric. We apply RLPF to a generative AI feature in Meta’s Ads Manager that helps advertisers generate new ad text variations. To do this, we use Facebook ad performance data (i.e., click-through rates) to train a reward model, which can score the effectiveness of a piece of ad text. Subsequently, we align the LLM toward this reward model using RL, which involves iteratively generating and scoring ad text, illustrated in lower part of the figure. The resulting LLM is called “AdLlama.” Our second contribution, illustrated in the right panel, are the results of a large-scale advertiser A/B test, encompassing approximately 35,000 advertisers and 640,000 ad variations, which showed that advertisers who received AdLlama achieved 6.7% higher advertiser-level click-through rates (CTR). To our knowledge, this study is the largest reported so far that investigates the use of generative AI in an ecologically valid setting” . Figure from AdLlama paper.

Data on how to best persuade the users of Facebook are clearly not available directly on the Common Crawl. It requires a pretty sophisticated data-gathering mechanism that also can be further refined by providing specific demographic data (e.g., user age, sex, embedding of preferences, etc.) to make the reward model more accurate. A language model optimized using such a reward model is most likely to be much, much better at persuasion than off-the-shelf GPT, though it is fair to say that in the future, as models get more intelligent, the ability to persuade might be just “baked” in them, since they already have a pretty high emotional intelligence.

Another example of successful RFT is building custom deep research systems. The most successful deep research models today are already trained today with RL reward modeling. Training is organized as follows: we ask a question, the model does a number of function calls, web searches, and coding, and maybe searches through third-party tools, and after a number of turns produces a final answer (like in the simple agent loop we introduced in Fig. 17), and the system provides a reward based on how well the model did on this task.

Every company has slightly different data; some use Confluence, others some obscure SAP system written in the 1980s, and others just a continuously updated Excel sheet. It can be argued that this might be too diverse for an out-of-the-box model to be able to search through it successfully, and to build a competitive advantage, companies should be training models that get better at utilizing THEIR data over time. Whether the overhead of this will be too substantial, building good environments is challenging and requires some expertise, or it will be worth it remains to be seen as of now.

The major challenge of RFT is that it requires building custom environments for every problem to be solved, which is inherently not scalable. LLMs can potentially help here, in aiding humans in building these environments, but in our experience, as of late 2025, in zero-shot fashion setups, LLMs exhibit a pretty poor performance when it comes to reward modeling, requiring substantial supervision from humans. Not surprisingly, most of the companies in this space right now focus on building “picks and shovels” of RFT, be it the fine-tuning platforms like we discussed for Thinking Machines or building the environments hub like Prime Intellect.

While these make it easier to iterate on custom models, the main problem of making custom models useful - how to scale environment building and reward modeling remains unsolved. The training foundations are there - the unknown remains the path to scaling.

Acknowledgments

Thanks to each of you for giving me feedback. Lukas (absolute 🐐) for having an insane eye for detail and correcting me on errors small and not so small. Pieter for giving it a first read and pointing out networking advantages, Jordan for discussing the NeoCloud perspective with me. Vedant (big win for Mistral) for his comments on ICL and era of experience. Szymon and Felix for providing me with small suggestions on how to direct this text, and to Felix for discussing with me the token lag problem. Andreas for challenging the 1 bit idea.

@online{tensoreconomics2025aiinfraineraofexperince,
  author = {Piotr Mazurek},
  title = {AI infrastructure in the "Era of experience"},
  url = {https://www.tensoreconomics.com/p/ai-infrastructure-in-the-era-of-experience},
  urldate = {2025-11-16},
  year = {2025},
  month = {November},
  publisher = {Substack}
}

At tensoreconomics we doubt this is actually the case as of today. We believe that there still remains a gap between open-source models and frontier models from Anthropic, OpenAI and most recently Google. The gap is poorly captured by the existing benchmarks, but it is clearly showed by the consumer interest. People just don’t use these free models, but pay for Claude, GPT or Gemini subscriptions.

E.g. facebook/opt-125m models have 4M monthly downloads as of November 2025, even though no one is using them. It is most likely caused by them being the default model used in vllm documentation.

There are limits to this analogy. For example, until recently the Chinese model providers were unable to secure enough compute capacity to serve their models competitively. Users wanting to use Chinese models typically access them through Western inference providers, such as Together or Fireworks. This is in clear contrast to the traditional model where Western companies dominated “IP” and Chinese companies dominated manufacturing; here the dynamic is reversed.

The exact footprint of activations stored is highly dependent on the activations checkpointing strategy.

Why are embeddings so cheap?

Piotr Mazurek — Wed, 24 Sep 2025 18:09:45 GMT

Embeddings are a fundamental component of every modern retrieval augmented generation (RAG) system. State-of-the-art (SOTA) embeddings are provided by companies such as OpenAI or Google at prices up to two orders of magnitude lower on a per-token basis than prices for generative models such as GPT or Gemini. In this analysis, we show what computations are required to produce an embedding and how, based on these computations, we can derive the true dollar cost of processing a token.

Our contention is that since the cost of processing a token is so minuscule and all embedding models converge to similar semantic representations, this means that no provider enjoys a sustainable moat. This results in underlying low costs being passed directly onto the consumers, driving prices to rock bottom.

As we demonstrate, with sufficient demand and scale, the price to process a million input tokens can be driven below 1¢ for SOTA embedding models topping the evaluation leaderboards.

The insight we hope you take from reading this analysis is that, unlike GPT-style autoregressive models, modern embedding models are primarily compute-bound, not memory-bound, meaning they largely don't benefit from batching. We show that FLOPS/dollar is the key hardware parameter for optimizing cost structure when building an embedding API.

The remaining sections proceed as follows. First, we estimate the FLOPs of a forward pass based on model architecture, comparing these FLOPs with the cost of loading the model from global memory. Then we benchmark a real-world embedding inference system, showing how quickly computations saturate our system and how little additional utilization we gain from batching. Finally, we dive deep into CUDA profiling, examining which kernels are invoked and profiling them in detail to identify system limitations.

We assume that the reader is familiar with the concepts presented in our previous text: LLM Inference Economics from First Principles. Before proceeding to read this, the reader should be familiar at a high level with how transformer models work, what it means to be compute- or memory-bound, what the FLOP is, and how many FLOPs there are in a matrix multiplication. We highly recommend you get familiar with these concepts first, as we will be building on top of this understanding. If you are unfamiliar with these concepts, please read the previous text first.

Introduction

Embedding models transform sequences of data, typically text or images, into semantic representations. These representations take the form of high-dimensional vectors that capture meaning across different dimensions.

This semantic representation can be later used in downstream tasks, such as “find me the document semantically similar to the question my user is asking”. Similarity is computed using linear algebra operations such as dot products or cosine similarity between vectors. While there exist some limits to what can be captured in this kind of single vector representation embeddings remain the fundamental block of any modern RAG system.

One of the first companies to offer the semantic embedding as an endpoint was OpenAI, back in December 2021. Since then, their embedding models have been updated and competitors like Google and Cohere have launched their own offerings, though OpenAI remains the market leader.

What is pretty special about the embedding endpoints is how cost-efficient they are to use. While OpenAI is charging up to $15/1M input and $60/1M output tokens for their O1 reasoning model1, embedding models are offered at an astonishingly small 10 cents/1M input tokens for the most popular `text-embedding-ada-002` model (see Fig. 1).

Figure 1: OpenAI embedding pricing as of 13.09.2025.

To illustrate how cheap it is, embedding the entire English Wikipedia corpus would cost approximately:

5B words or ~10M pages of text for $650.

If we look at the offerings from other competitors, the prices are mostly within a ballpark of this, with Google's gemini-embedding-001 at $0.15/1M input tokens and Cohere’s Embed 4 at $0.12/1M input tokens.

While these prices might appear as artificially low if we "count the FLOPS," we can clearly see that they are reflected in the underlying cost structure of processing a token. Generating an embedding is just extremely cheap, and these savings are passed directly onto the consumers. There's another factor driving prices down: embedding models are converging to similar semantic representations regardless of their architecture. If all models capture semantics similarly, differentiation becomes nearly impossible. This makes serving raw embedding a terrible business to be in and forces companies that used to specialize in this to pivot more towards the end-to-end search services that can command higher margins.

Qwen3 Embedding architecture

For our analysis, we use Qwen3-Embedding-8B as our example model. As of September 2025 it is a model topping the embedding tasks leaderboards (see Fig. 2). While many people validly point out the limitations of the real-world usefulness of using MTEB as a proxy for model quality, due to models overfitting on test data, it remains widely adopted. Since we don’t care that much about the exact model performance but more about the compute cost of generating an embedding representation, it should not be a problem for our analysis.

We make several key assumptions about commercial embedding models. We assume that closed-source models like text-embedding-ada-002 and gemini-embedding-001 share similar architectural designs with Qwen3. Specifically, we assume they are dense transformer models that produce embeddings via a single forward pass and are relatively compact at under 10B parameters.

Figure 2: MTEB Leaderboard 8.09.2025

The architecture of Qwen3 embedding is very simple, to quote the Qwen team:

The Qwen3 embedding … models are built on the dense version of Qwen3 foundation models and are available in three sizes: 0.6B, 4B, and 8B parameters. We initialize these models using the Qwen3 foundation models to leverage their capabilities in text modeling and instruction following.
For text embeddings, we utilize LLMs with causal attention, appending an [EOS] token at the end of the input sequence. The final embedding is derived from the hidden state of the last layer corresponding to this [EOS] token.

This means that generating an embedding requires only a single forward pass through the model, extracting the hidden state corresponding to the final [EOS] token (see Fig. 3). In other words, embedding has (nearly) exactly the same compute footprint as completing the prefill phase, or producing the first output token, when using the auto-regressive version of Qwen3.

Figure 3: Model architecture of Qwen3-Embedding.

Counting FLOPs

*Please be aware that FLOPS and FLOPs mean different things. FLOPs (small s) is the plural of floating-point operations not considering time at all, but FLOPS (capital S) means floating-point operations that happen within a second.

In the previous text we estimated the total FLOPs of a forward pass of a Llama model at:

Where S is the sequence length of the processed sequence.

Qwen3 and Llama3 share a very similar architecture. Both are dense causal transformer models; both use group query attention and the silu activation function for MLP. One minor difference is that the ratio of hidden_size to intermediate_size is 3 (12288/4096, see Fig. 4) compared to Llama’s 3.5, though it is such an insignificant difference that we proceed to use the equation above to estimate the FLOPs of a forward pass of Qwen. For further details, please refer to the previous article.

Figure 4: Hugging Face config for Qwen3-8B Embedding.

Since for embedding representation we just use the hidden state of the last layer corresponding to the [EOS] token, this means we don’t need to calculate the probability distribution over the next predicted token, aka the LM head, meaning we can remove the FLOPs for calculating it. There are some extra FLOPs involved from the pooling layer, but for the sake of simplicity we skip them, as they involve very few operations. Bringing the total in a forward pass to:

Assuming we are processing S=1024 tokens and using the Qwen config as in Fig. 4 (num_hidden_layers=36, hidden_size=4096, num_attention_heads=32), we can quite easily estimate the total FLOPs at:

Using this, we can easily estimate the upper bound of how long it takes to process a sentence of 1024 tokens. To do so, we need to look into the FLOPS and memory bandwidth of a GPU. Performance can be capped either by us having to do too many operations (compute-bound) or by a need to load too much data (memory-bound).

On paper NVIDIA H100 SXM5 delivers 989 TFLOPS of compute for BF16 with accumulation in FP32. It also features a fast high bandwidth memory (HBM) offering up to 3.3 TB/s throughput. However, in practice, due to the power constraints, H100 is not capable of actually achieving these declared FLOPS. As Horace He explains:

This observation that GPUs are unable to sustain their peak clock speed due to power throttling is one of the primary factors that separates “real” matmul performance from Nvidia’s marketed specs.
The figure that Nvidia provides for marketing is:
For example, on an H100, there are 528 tensor cores per GPU (4 per SM), the max clock speed for these is 1.830 Ghz, and the FLOP per tensor-core instruction is 1024. Thus, we have 1.830e9 * 528 * 1024 = 989 TFLOPS, exactly Nvidia’s listed number.
However, you can only achieve this number by sustaining 1.83 Ghz clocks, and as we’ve seen above, the GPU just doesn’t have enough power to do that!

In our tests we achieved ~750 TFLOPS in matrix multiplication (see Fig. 5) on H100 SXM, varying with matrix dimensions. As we will show later, most of the time in calculating embeddings is spent in large-scale matrix multiplications - hence this will be a pretty good indicator of the realistic end-to-end performance we can expect.

Figure 5: FLOPS achieved on Nebius on NVIDIA H100 SXM running on CUDA 12.4. Code can be found here.

To calculate an embedding, we need to:

Load the model weights once from global memory
Complete a forward pass involving ~16.4 TFLOPs of matrix multiplications and other calculations.

The model size can be easily estimated based on the number of parameters at:

meaning that a single model load takes around:

Meanwhile, when we do all 16.4 TFLOPs worth of calculations we estimated above, we will wait approximately

In practice the memory loading and computations on a GPU are to some degree overlapped, meaning that the factor limiting our performance will be the larger of these two times - compute in this case, or in other words, our calculation will be more compute than memory bound.

Assuming 0.021s for a single forward pass, we can estimate the upper bound of embedding throughput as:

In practice, we won't achieve these exact numbers. Both memory and compute figures represent theoretical upper bounds. Real-world performance falls short due to factors like imperfect memory-compute overlap, kernel launch overhead, and various computational inefficiencies.

Real world benchmark

To measure the real-world performance of an embedding model, we can use the popular LLM inference engine - SGLang. Running it is quite simple (see Fig. 6) and should work out of the box on any H100 setup. SGLang provides an efficient inference platform, realized in the form of an OpenAI-compatible endpoint, and should be a good basis for our experiments.

Figure 6: Command we use to run the SGLang instance powering our experiments.

We implemented a simple benchmark measuring inference engine response times. The benchmark takes two inputs: sequence length (in tokens) and batch size (simulating concurrent users). We sent batches of requests with predefined lengths to the API and measured processing time.2

Figure 7: Performance measured at different input sequence lengths. We run Qwen3-8B Embedding on Nebius on H100 SXM running on CUDA 12.4 with SGLang 0.5.1. Note that the x-axis is logscale.

What the reader should immediately spot is how, despite batch size increasing, the number of embeddings processed within a second doesn’t scale significantly. For shorter sequences (260 and 520 tokens), increasing batch size from 1 to 64 yields only ~2.3x improvement in total token throughput. As sequence length increases and compute becomes more saturated, this scaling benefit diminishes ever further. For the longest sequences (5200 tokens), larger batch sizes provide virtually no benefit.

The reason for this situation can be largely explained by the FLOPS calculations we did above. We estimated that for batch size=1 and sequence length S=1024 tokens, the upper bound is ~47 embeddings per second, roughly double (47/25.3=1.86) what we observe in practice3. This gap is to be expected, as model calculations have extra overhead from launching multiple kernel grids, the lower arithmetic intensity of some kernels, and other inefficiencies.

Even at batch size=1, the computational workload substantially saturates the compute units. At larger sequence lengths, we're operating on increasingly large matrices with no spare compute resources available—the streaming multiprocessors (SMs) are fully occupied. As batches grow larger or sequences get longer, kernel launch overhead becomes a smaller fraction of total time, with most cycles spent on the calculations themselves.

What you need to remember, though, is that since we are compute bound and we only get marginal improvements in the number of embeddings processed with increased batch size, it has a severe impact on the latency experienced by each user. Since we do these larger and larger scale matrix multiplications, they take longer and longer to finish. This represents a traditional tradeoff in LLM inference - combined throughput vs. the latency experienced by each user.

Figure 8: Latencies corresponding to observations from Fig 7. We use the same hardware setup as for Fig. 7. By latency we mean how long do we wait for an entire batch (requests are batched together).

This contrasts sharply with autoregressive token generation. Here, because our computation is primarily memory rather than compute bound, increasing the batch yields substantial throughput gains. As Figure 9 shows, batch size increases correspond to significant improvements in overall throughput across all requests. The latency penalty for individual users is far less severe than with embeddings.

Figure 9: Qwen3 8B benchmark on a single H100 SXM, 300 tokens in, 2000 tokens out.

Text generation's memory-bound nature creates this favorable scaling. We primarily wait for model weights to load for each token generated while compute resources remain underutilized. Loading the model once and "sharing" this cost across multiple users has minimal impact on per-user experience.4

The fundamental difference lies in computational bottlenecks. While autoregressive generation is limited by memory bandwidth and benefits from amortizing memory costs across batches, embeddings face the opposite constraint. Embedding generation requires only a single forward pass, but this pass involves large-scale matrix operations that fully saturate available compute. This is the main intuition we hope the reader takes out of this text. Embeddings are primarily compute bound. Due to this there is only limited benefit to batching, and it comes at the cost of severely increased latency experienced by each individual user. This means that, unlike in serving GPT-style models, there are very limited benefits to scale - running big batches doesn’t slash our cost “per user”. This further decreases the appeal of serving embeddings as a business.

What profiling reveals

What we calculated while doing the theoretical analysis of the Qwen embedding forward pass, and what we measured via our benchmarking script, can also be observed if we profile the forward pass and look at the individual kernels being invoked. The profile trace can be easily obtained from SGLang by hitting the `start_profile` and `stop_profile` endpoints. All calculations in between these calls will be captured by the torch profiler as demonstrated in Fig. 10.

Figure 10: How to send the profiling request to SGLang.

Before diving into our profiling results, it’s important to understand how to read these traces. Torch Profiler shows both CPU and GPU activity in a unified timeline, but they operate asynchronously.

CPU timeline (top portion): Shows when the host code launches kernels, allocates memory, and performs other CPU operations
GPU timeline (bottom portion): Shows when kernels actually execute on the GPU hardware
Asynchronous execution: When the CPU “launches” a kernel, it doesn’t wait for completion—it immediately continues to the next instruction while the GPU executes the kernel independently

This asynchronous nature means there’s often a visible gap between when a kernel is launched (CPU) and when it actually runs (GPU). When we look at the high-level (Fig. 11) trace of a forward pass, we can clearly see multiple (36) hidden layers being called. The kernels are pretty well packed together.

Figure 11: High-level profiling trace. Profile for 2600 tokens, batch size 1.

When we look at the single-layer level (see Fig. 12), we can identify the individual components of a transformer layer. Even for a relatively long sequence of 2600 tokens, MLP computation visibly dominates execution time. Additionally, the reader should note that except for the time the GPU is working (kernel) is visible, there are some small, but still existing, gaps between the kernels and kernels invoked to prepare the data (the small pink rectangles before the flash attention kernel). All of these gaps and small kernels add extra overhead to our computation.

Figure 12: Single transformer layer profile. Profile for 2600 tokens, batch size 1, on NVIDIA H100 SXM5. For the convenience of the reader, we show which kernel is responsible for which element of a transformer block forward pass.

When we aggregate all of the kernels invoked during the forward pass, an interesting pattern emerges (see Fig. 13). The majority of time during a forward pass is spent computing nvjet_tst_256x144_64x4… kernels - CUDA highly optimized matrix multiplication kernels. This is another one of the key observations we hope the reader takes out of reading this text. The vast majority of forward passes while calculating the embeddings are spent in matrix multiplication kernels. What it implies is that the 750 TFLOPs we calculated for matrix multiplication in the previous section will be a pretty good proxy for estimating our time of a forward pass, and as a result, the throughput.

Other kernels like fused SILU and RMS norm (primarily from FlashInfer) contribute smaller percentages of wall time and aren't explored further here.

Figure 13: Kernel statistics aggregation. Profile for 2600 tokens, batch size 1, on NVIDIA H100 SXM5.

What should be pretty interesting to the reader is how small a percentage of the forward pass is spent calculating the attention. In our 2600-token example, only 6% of execution time occurs in the attention kernel. Note that the flash attention kernel represents just part of the attention block's total cost. As Figure 12 shows, substantial time is spent on QKV projection, O projection, and input preparation (such as applying RoPE).

Since attention has O(N^2) complexity with respect to length, and MLP grows linearly (we calculate the up and down projections for each token in the sequence), as we increase the input length, attention will represent a more and more significant portion of the forward pass time. E.g., in Fig. 14 we show that when we increase the sequence length to 8k tokens, the percentage of time spent in calculating the flash attention kernel grows from 6% to 15% of the total time.

Figure 14: Kernel statistics aggregation. Profile for 8000 tokens, batch size 1, on NVIDIA H100 SXM5.

When we increase the batch size, the number of operations grows linearly both for the matrix multiplication kernels (we need to handle linearly more tokens) and for the flash attention kernel. You can see (Fig. 15) that as we increase the batch size, the percentage of the wall time remains (roughly) the same.

Figure 15: Kernel statistics aggregation. Profile for 2600 tokens, batch size 16, on NVIDIA H100 SXM5.

The traces demonstrate that most time is spent in matrix multiplication kernels. However, the torch profiler only shows that the GPU is working on these kernels—not how efficiently it's utilizing available compute resources. While we might assume NVIDIA's highly optimized kernels achieve good utilization, it would be nice to confirm this.

What we can do is we can run the NCU profile while producing an embedding. Such an option is also available in SGLang, though it is slightly more challenging to make it run than in the case of the torch profiler.

When profiling the kernel with NCU, it becomes pretty evident that we achieve very, very high compute utilization (see Fig. 16). Our compute throughput is pretty close to 100% of utilization. Note how since the majority of time is spent computing, the memory remains underutilized.

Since most of our time is spent in these matrix multiplication kernels, and they by default present a very good compute utilization, this bears the question, “What could be further improved?”

Figure 16: Profile of the nvjet_tst_256x144_64x4_2x1_v_bz_coopA_TNT kernel using the NCU shown in NVIDIA Nsight Systems. The profiling was done using H100 PCIe (different from other calculations shown in this text).

From FLOPs to dollars

As we calculated in the beginning of this text, generating embeddings is incredibly cheap. Embedding an entire English Wikipedia will set a user back around $650. Our contention is that the cheap prices offered to consumers are a result of the underlying cost structure - that producing the tokens is just incredibly efficient and that, due to competitive pressure, the cost savings are passed onto the consumers, driving the price basically to zero.

Based on the experiments we captured in Fig. 7, we can quite easily calculate what the total number of tokens processed per second is at different batch sizes at:

For example, with 1040-token sequences at batch size 4:

In Tab. 1 we present the number of tokens processed each second based on the numbers presented in Fig. 7. What the reader should note is that the numbers saturate around ~45k tokens/s, which once again shows our calculation getting compute bound.

Table 1: Tokens processed within a second assuming input sequences of different lengths and coming in different batches. Numbers for Qwen3 8B Embedding, run on NVIDIA H100 SXM5 with CUDA 12.4 on Nebius cloud.

Assuming that we are renting H100 for $2/h, we can estimate the cost of producing a million tokens as:

e.g., for embeddings of an input sequence of 1040 tokens at batch size 4, we calculated above it would translate to around

or 1.66 cents for processing a million tokens.

In Tab. 2 we present the estimated pricing for different settings.

Table 2: Price for processing 1M input tokens, based on the numbers we presented in Tab. 1. We assume $2/h for NVIDIA H100 SXM.

The prices estimated above build on the assumption that we enjoy 100% utilization at the given batch size, meaning that 24/7 we get requests coming in that we can compute and charge the user for. In practice we will for sure get periods of higher traffic when your API is bombarded by the requests and quiet periods, e.g., outside of business hours, when our compute will be mostly idling.

Our exact calculations will depend on these computation patterns, how many users you have per GPU, how much latency your users are willing to accept (see Fig. 8), and how time-concentrated our requests are.

Assuming that our typical request will be similar to batch size 4 of a request of size 1040 tokens coming at the same time and this will happen around 10% of the time throughout the day, this would result in 32.2 x 3600 x 24 x 0.1 ≈ 276k requests. In such a case our cost for processing 1M input tokens would be around $0.0166 x 10 = $0.16, or 16 cents/1M tokens. Slightly above the OAI pricing for ada-embeddings.

Other than bringing in more users, could we do anything else to improve our cost structure?

Our metric - FLOPS per dollar

Consider embedding generation from first principles. Our computation pattern has four key characteristics:

We do just a single forward pass.
Our model is in bf16. It is relatively small, occupying only 15.2 GB of space (plus some space for activations); we can easily fit it on a single GPU, and we don’t need high communication between the GPU.
The forward pass is mostly compute bound - having fast memory brings us little benefit.
We know that our most important metric is how fast the GPU can do large matrix multiplications - FLOPS

Since we're primarily FLOPS-bound, the metric we should optimize is FLOPS/dollar. Why pay premium prices for H100's 80GB of memory, ultra-fast memory bandwidth, and high-speed interconnects when our small model doesn't require these features? Most of our time is spent computing, not moving data.

The logical approach is finding GPUs with superior FLOPS/dollar ratios. Table 3 compares three popular options using on-demand cloud pricing: H100 SXM, A100, and RTX 4090.

Table 3: TFLOPS declared by NVIDIA for H100 SXM, A100, and 4090. The pricing is an estimation; the exact pricing will depend on the current hardware availability.

On paper, the H100 dominates this comparison, achieving the best FLOPS/dollar ratio out of the three compared GPUs.5. However, as we showed in Fig. 5, in practice H100s gets power-bound. The 989 989TFLOPS promised by Nvidia are unattainable.

Using the more more realistic 750 TFLOPS (as estimated in Fig. 5), the TFLOPS/$ ratio would drop to 750/2=375, considerably below the RTX 4090's theoretical 445. To verify whether the RTX 4090 can achieve its declared 165 TFLOPS, we ran identical matrix multiplication benchmarks (Fig. 17). The RTX 4090 closely matches NVIDIA's specifications.

Figure 17: FLOPS achieved on Runpod on NVIDIA 4090. Code can be found here.

The price difference is significant: H100 costs 5.4x more than RTX 4090 ($2/$0.37), while the performance difference is only:

This suggests RTX 4090 should deliver better performance per dollar for embedding generation. Since embedding computation is dominated by large-scale matrix multiplication, this benchmark performance difference should accurately predict real-world embedding throughput differences.

To confirm this prediction, we ran identical benchmarks on the RTX 4090 (Figs. 18-19).6.

Figure 18: Performance measured at different input sequence lengths. We run Qwen3-8B Embedding on RunPod on NVIDIA 4090 with SGLang 0.5.1. Note that the x-axis is logscale.

The results mirror our H100 observations: as batch size increases, per-user latency grows (Fig. 19). The RTX 4090 demonstrates the same compute-bound behavior - limited throughput scaling with larger batches but severe latency penalties for individual users.

Figure 19: Latencies corresponding to observations from Fig. 18. We use the same hardware setup as for Fig. 18. By latency we mean how long do we wait for an entire batch (requests are batched together).

Especially for the shorter inputs, the 4090 performs surprisingly well. The H100-to-4090 performance ratio ranges from 25.3/7.2 = 3.5x at batch size 1 to 45.2/10.8 = 4.2x at larger batches - smaller gaps than the 4.5x difference we measured in pure matrix multiplication benchmarks.

Figure 20: Comparison of NVIDIA H100 SXM vs NVIDIA 4090 on shorter prompts. We use the same setups as in Figure 7 and Figure 18, respectively.

As sequence length increases and we process larger matrices that more fully saturate compute resources, the performance ratio approaches our measured 4.5x matrix multiplication difference.

Figure 21: Figure 20: Comparison of NVIDIA H100 SXM vs NVIDIA 4090 on longer prompts. We use the same setups as in Figure 7 and Figure 18, respectively.

What should be pretty apparent is that the performance difference is smaller than the price difference between the H100 and the 4090 - meaning that the per dollar spent, we should be able to produce more tokens. To confirm this, similarly to what we did for H100 in Tab. 1, we calculate the number of tokens processed within a second by 4090 in Tab. 4:

Table 4: Tokens processed within a second assuming input sequences of different lengths and coming in different batches. Numbers for Qwen3 8B Embedding, run on NVIDIA 4090 on RunPod.

Assuming we pay $0.37 for an hour of 4090 and we use it to continuously embed an input sequence of 1040 tokens at batch size=4, it would translate to around

or around 1¢ for 1M tokens - 36% cheaper than the number we achieved for H100.

In Tab. 5, we present the estimated pricing for processing 1M input tokens. Once again, please note that this assumes 100% utilization; in practice, you won’t be able to process it so cheaply unless you have some sort of a batching system and massive, not time-sensitive demand that can be spread evenly throughout the day.

Table 5: Price for processing 1M input tokens, based on the numbers we presented in Tab. 4. We assume $0.37/h for NVIDIA 4090.

Note how Tab. 5 demonstrates our point. We were able to find a GPU that offers a superior FLOPS/dollar compared to an H100; this better performance is clearly reflected in the cost for processing 1M tokens. In 4090, in a lot of cases, we are able to drive the price of 1M tokens below 1¢. If we have more demand, we can simply scale the number of 4090s, which should be cheaper than switching to an H100.

Interestingly, for a 4090 that costs only 24x$0.37=$8.8 per day to rent, assuming we continuously get batches of 1040 tokens at batch size=32, it can process

We find it quite remarkable that a consumer-grade GPU that you can rent for less than $9 per day can process 1B tokens a day if provided enough demand.

Noteworthy embeddings are not the only application where you might benefit from prioritizing FLOPS/dollar spent. Prefill, especially in agentic tasks with long context consisting of long tool calls, is another example. Recently NVIDIA released Rubin CPX exactly for this purpose - a chip that’s heavy on compute but light on memory bandwidth, using cheaper GDDR7 instead of expensive HBM. It’s the same logic we’ve been discussing: when you’re compute-bound, optimize for FLOPS per dollar.

Summary

We started this text mentioning that embedding APIs are very cheap to use, and we argued that this underlying price is derived from low costs of producing the embedding representations rather than being subsidized below the cost of production.

We took the leading open-source embedding model, Qwen3 8B Embedding, and analyzed it in detail, including theoretical estimation of the FLOPS and real-world performance benchmarking. We did a deep dive into what is happening under the hood, showed which kernels are invoked, and showed how the majority of the execution time is spent in large-scale matrix multiplication.

Lastly, based on the observed performance, we calculated the price for processing 1M input tokens for different input-tokens/batch-size setups. Please once again note that the numbers shown assume a situation in which you enjoy 100% utilization, with the request being continuously batched.

GPUs are incredibly efficient at these sorts of workloads, driving the price down to zero. Since all leading models converge on a similar latent representation, and perform comparably no one has a pricing power enabling them to command higher prices. As the model gets more efficient, compute and energy get cheaper, and this situation will move towards the generative models as well. The embedding situation offers a nice pre-taste of the "intelligence involution” that is coming …

Acknowledgments

Thanks to Johannes from Prime Intellect for providing me with the compute to run these experiments, and to Jo (check out his new retrieval company), Felix, Pablo and Szymon for giving me comments for parts of this text.

@online{tensoreconomics2025embeddings,
  author = {Piotr Mazurek},
  title = {Why are embeddings so cheap?},
  url = {https://www.tensoreconomics.com/p/why-are-embeddings-so-cheap},
  urldate = {2025-09-24},
  year = {2025},
  month = {September},
  publisher = {Substack}
}

Prices as of 8.09.2025

Before each experiment we do a quick warmup of the GPU. We also repeat each experiment setup multiple times (e.g., 10) and measure the mean and the variance.

We have a small difference between the estimation and experiment. In our estimate we use 1024 tokens; in experiments we use 1040 tokens—a ~1.5% difference. It has a negligible impact on the calculations; we just want the reader to know we are aware of this minor difference.

There is some limit to this; as we increase the batch size, the KV cache will represent a more and more substantial portion of the total memory loaded. As the KV cache starts to dominate the load, the per-user experience will start to be more and more impacted by the batch size.

Note that if you were to buy these GPUs the ratio between the devices would be different, e.g. a new 4090 can be aquired for ~$3200 while a brand new H100 SXM5 would cost around 10x more.

We rented a 4090 on RunPod, running within a container.

MoE Inference Economics from First Principles

Piotr Mazurek — Tue, 02 Sep 2025 18:00:14 GMT

The release of first DeepSeek R1, then Kimi K2 and then DeepSeek V3.1 mixture-of-expert (MoE) models has firmly established them as the leading architecture of large language models (LLMs) at the intelligence frontier. Due to their massive size (1 trillion parameters and up) and sparse computation pattern, selectively activating parameter subsets rather than the entire model for each token, MoE-style LLMs present significant challenges for inference workloads, significantly altering the underlying inference economics. With the ever-growing consumer demand for AI models, as well as the internal need of AGI companies to generate trillions of tokens of synthetic data, the "cost per token" is becoming an ever more important factor, determining the profit margins and the cost of capex required for internal reinforcment learning (RL) training rollouts.

This analysis examines MoE architecture through the lens of hardware limitations and costs. Key bottlenecks: FLOPS, memory bandwidth, and inter-node connection speed directly impacts end-to-end performance and user scaling potential. Building on this foundation, we develop a theoretical cost model for large-scale model serving, comparing Deepseek V3.1 and Kimi K2 and showing how the hardware costs shape the business models of LLM inference providers.

We use this world model to go from tokens to dollars - demonstrating how these models can be cost-effectively served to consumers at scale and how cheaply they can be used for synthetic data generation. We argue that this is a market opportunity currently flying under the radar and a potential growth engine for NeoClouds. Lastly, we address the elephant in the room - the surprising lack of consumption of these models. Despite the great dollar/performance ratio, the global consumption of open-source models remains surprisingly minuscule, suggesting a potential oversupply of inference providers and a capability gap that is felt by users but not necessarily reflected well in the benchmarks.

The remainder of this article proceeds as follows. We first examine DeepSeek's architecture in detail, covering multi-head latent attention (MLA), expert routing, and expert parallelism (EP). Building on DeepSeek's published optimizations—many of which are implemented in SGLang—we develop a theoretical performance model that works across diverse hardware specifications. We then validate this model against real-world performance data and use it to derive per-token pricing for different deployment configurations.

For readers seeking a concise overview of inference economics, we recommend proceeding directly to section “Hardware considerations and profit margins“, returning to examine the DeepSeek V3.1 architectural details and the theoretical model formulation as reference material when needed.

Sections “DeepSeek MoE Architecture”, “Inference Optimization Techniques” and “Throughput: Theory vs Practice” are based on the authors' work at Aleph Alpha. We publish the numbers and methods we used internally for estimating the hardware requirements and the numbers we observed through conducting experiments of running DeepSeekV3.1 on multi-node setups. The theoretical model appeared first on the Aleph Alpha Blog.

Introduction

In January 2025, DeepSeek's release of their R1 reasoning model triggered the so-called "DeepSeek shock" in the global financial markets - with major Western tech companies, most notably NVIDIA (see Fig. 1), taking massive (though short-lived) losses to their market caps. While it is not possible to establish what exactly spooked the investors, it seems widely accepted that the ultimate reason was the realization of how cost-efficient it was to train the original DeepSeek V3, with the figure reported in the paper of only $5.6M, orders of magnitude less than the figures reported by various western labs.

Figure 1: The "DeepSeek moment" for NVIDIA stock price, as investors priced in the potential savings in model training.

What stood out to industry insiders even more than the minuscule training budget was the massive cost advantage DeepSeek API offered. At only $2.1/1M output tokens, it provided an over 27x cost advantage (see Fig. 2) while nearly matching O1-preview's benchmark performance (the leading reasoning model at the time).

Figure 2: o1-preview pricing in January 2025, from source

The DeepSeek team has been unprecedentedly open about the model and training details, and, most relevant to this analysis, about their inference infrastructure details. During the Open Source Week, among other things, they released efficient multi-head latent attention (MLA) kernels, an expert parallel (EP) communication library, and they published the details of their inference stack and setup, explaining the optimization techniques and releasing the theoretical revenue figures (see Fig. 3).

Figure 3: #OpenSourceWeek, when DeepSeek released their inference infrastructure details, lifting the veil on the economics behind LLM API pricing.

Understanding how DeepSeek achieved such dramatic cost advantages requires examining the underlying business model of AI inference. To quote Semianalysis here:

The modern factory is an AI token factory. Raw Silicon, electricity, and water comes into a Datacenter and what comes out is intelligence (in the form of tokens).

The business model of an 'AI token factory' is straightforward. Like any factory, there are fixed equipment costs that owners want to spread across as many users as possible. In AI inference, this fixed cost is the hourly expense of running GPUs, and providers maximize efficiency by producing as many tokens as possible per hour. The more tokens produced, the lower the cost per token - enabling cheaper pricing or higher margins. This creates a classical economics model incentivizing economies of scale: substantial fixed costs (GPU servers) that must be divided across as many users as possible.

In this article we develop a comprehensive cost model to answer one fundamental question: what does it actually cost to generate a DeepSeek V3.1 token, and what factors impact this number? We aim to build a theoretical cost model enabling us to estimate the final throughput, given the hardware specification (FLOPS, memory bandwidth, and the interconnect) and the workload profile (batch size and number of input/output tokens). We hope this will help you build a more accurate world model of this topic and make informed decisions about hardware investments and deployment strategies. We will use said theoretical model to propose the best hardware setup for deployments of different characteristics (the latency/speed/cost tradeoff).

We aim the article at experienced readers. We strongly encourage you to read and understand the core messages of “LLM Inference Economics from First Principles"first. Before proceeding to reading this article, you should be familiar with topics such as what a FLOP is, what it means to be compute/memory bound, what a KV cache is and how to calculate its memory footprint, and what prefill and decode phases are. We assume the reader is familiar with said topics, and we strongly believe that without this background the topics in this text might prove challenging to understand.

DeepSeek V3.1 and Kimi K2 are two prominent examples of mixture-of-expert style models. Understanding their cost advantages requires examining how MoE economics differ from traditional dense models. The key challenge in MoE inference is that, unlike in dense models like Llama3, each processed token activates only a subset of parameters, rather than the entire model. As we learned in the previous text, the decode (or token-by-token) phase is primarily memory bound, meaning that the majority of the execution time, and as a result most of the cost associated with running the model, comes down to the time of loading the model's parameters from the global memory. This property of dense models naturally incentivized amassing a batch as big as possible and sharing the cost of loading the model parameters across as many requests as possible, aka achieving a sort of economy of scale - fixed cost shared by multiple users.

For MoEs this becomes substantially more challenging. During the decode phase, each token in the batch is activating only a small subset of parameters at every layer. This means that each request requires us to load a different part of the model, as demonstrated in Fig. 4. As the number of requests in a batch increases, a more and more substantial portion of the model will have to be loaded from global memory. The experts are chosen semi-stochastically1, so some of the tokens in the batch will be routed to the same expert. As we progressively increase the batch size, more and more experts will be shared by different requests. This means that at the larger batch sizes we will partially recreate the situation from the dense model - sharing the cost of model loading between multiple users. Unfortunately this means that we will need significantly more requests, i.e. more users, to achieve the same "economies of scale" for MoE models.

Figure 4: Two request tokens activating different parts of the model, requiring us to load more weights, saturating the memory bandwidth

To illustrate this challenge: while DeepSeek V3.1 activates only 37B parameters for a single forward pass, this number grows nearly linearly with batch size as different queries activate different experts. At large batch sizes, the system may need to load close to the full 671B parameter model, creating severe memory bandwidth bottlenecks. Furthermore, this need for bigger batches requires substantially more resources to store the KV cache for all of the requests. These two factors necessitate running the model beyond a single node. To put it simply, there is not enough memory bandwidth, and there might not be enough space in memory on a single node to amass enough users to make model serving economically viable.

A GPU node is a specialized computer system designed with high-performance computation in mind. It is essentially a server with multiple GPUs, alongside hardware like CPUs, memory, or networking equipment. A popular example of a node would be a DGX system by NVIDIA, containing eight high-end GPUs (such as H100s, H200s, or B200s) in a single chassis, along with high-speed interconnects between the GPUs (NVLink). Within the node, the GPU-to-GPU (Intra-node) communication is possible at much higher speeds than the communication between the nodes (inter-node).

To effectively host large scale MoE-style models, an inference provider needs multiple GPU nodes. Ideally, we would like to split the model so that each GPU handles some subset of the experts and routes all relevant queries to this GPU. This way each GPU stays busy, and the GPUs do not need to communicate intermediate results as in tensor parallel (TP) setups2. This approach is called expert-parallelism (EP). Note that expert choosing occurs at every layer, for every token. During the decode phase, as the model generates token after token, the new token is routed to different experts, located on different GPUs.

To quote DeepSeek themselves:

Due to the large number of experts in DeepSeek-V3/R1—where only 8 out of 256 experts per layer are activated—the model’s high sparsity necessitates an extremely large overall batch size. This ensures sufficient batch size per expert, enabling higher throughput and lower latency. Large-scale cross-node EP is essential.

As we have adopted prefill-decode disaggregation architecture, we employ different degrees of parallelisms during the prefill and decode phases:

Prefilling Phase [Routed Expert EP32, MLA/Shared Expert DP32]: Each deployment unit spans 4 nodes with 32 redundant routed experts, where each GPU handles 9 routed experts and 1 shared expert.
Decoding Phase [Routed Expert EP144, MLA/Shared Expert DP144]: Each deployment unit spans 18 nodes with 32 redundant routed experts, where each GPU manages 2 routed experts and 1 shared expert.

Increasing the number of nodes operating within a single setup has a beneficial effect not only on the end-to-end performance but also on the per node performance. In other words, we get an increased return on investment (ROI) on our fixed cost of hardware (aka GPUs). This benefit is clearly demonstrated in one of the blog posts by Perplexity (see Fig. 5). You can see that as we increase the number of nodes involved (the EP number), the per node performance increases. We can compare it to a factory - investing more money into automation tools increases the returns on the tools the factory already owns. It has a compounding effect driving down the price per token as we increase the number of nodes in a setup, but it comes at a cost - we need substantially larger usage to utilize the setup.

Figure 5: Throughput per node by Perplexity.

While running a model in the multi-node setup in theory provides enormous economies of scale, enabling the running of bigger batches far more productively, running such an operation at scale is a highly sophisticated endeavor. It requires a mature software stack and an in-depth understanding of the underlying hardware by the people involved in developing such a stack. It is clearly reflected when we look at the list of people involved in the first open-source reproduction of a multi-node DeepSeek setup by SGLang (see Fig. 6). And while there exists an open-source reproduction, as of September 2025 it is still pretty hard to operate, requiring careful coordination of different package versions and branches of underlying software libraries. We are aware of a few public inference providers who, due to the reasons above, went with serving DeepSeek on a single H200 or B200 node. While this setup is far from optimal, it is far easier to maintain and provide it to clients.

Figure 6: List of people involved in the enabling the multi-node DeepSeek setup. SGLang blopost

DeepSeek MoE Architecture

Since the "DeepSeek moment" was the motivation for us to write this article, and DeepSeek V3.1 as of September 2025 remains the most popular open-source model according to OpenRouter (see Fig. 7), we will use it as a reference architecture we use in our calculations. All DeepSeek V3, DeepSeek R1, and DeepSeek V3.1 share exactly the same architectural details; they "only" differ by the values of weights and by their behavior during inference.

DeepSeek V3.1 is a hybrid model that, depending on the inference setting, produces thousands of so-called reasoning tokens before giving a final answer. The underlying inference math is the same at the per-token layer regardless of the setting, but due to potentially vastly different distributions of input to output tokens, the inference math we will achieve will be highly dependent on whether the model is used in the reasoning or non-reasoning mode, as it is reflected by the DeepSeek pricing page (see Fig. 32).

Figure 7: OpenRouter model popularity ranking, showing DeepSeekV3.1 as the most widely used open-source model, with the old DeepSeek V3 0324 remaining a close second (30.08.2025)

The DeepSeek architecture is summarized by the model configuration available on Hugging Face (see Fig. 8). It consists of 61 layers (num-hidden-layers), three of which are dense (first-k-dense-replace), and the remaining 58 are MoE layers. Each MoE layer contains a modified self-attention mechanism (multi-head latent attention, or MLA), a gating mechanism, and 257 experts - 1 shared expert and 256 routed experts - as defined by the n-shared-experts and n-routed-experts. The MoE layers are followed by a traditional language modeling (LM) head. The DeepSeek team also proposed a multi-token prediction (MTP) head for speculative decoding. However, as modeling its real-world performance is complex and less relevant for larger batches, we exclude it from this analysis.

Figure 8: DeepSeekV3 Hugging Face configuration. Note that this is the exact same configuration as used in DeepSeek V3 and DeepSeek R1.

Each layer consists of a Multi-head Latent Attention (MLA) followed by a DeepSeekMoE, as illustrated in Figure 9. MLA is a variant of traditional attention that compresses the KV cache using a linear algebra optimization. Instead of storing full key-value pairs like other models (such as Llama, Qwen, etc.), MLA stores only a compressed latent representation of size kv-lora-rank + qk-rope-head-dim. This reduces memory bandwidth requirements during token-by-token decoding, since less KV cache memory needs to be loaded to produce each token.

These optimizations compress the KV cache to 70KB per token - a 2-7x reduction compared to other models (Qwen3 32B: 163KB, Llama 405B: 516KB per token). This compression directly translates to reduced memory bandwidth requirements and lower inference costs. We detail the computational mechanics of MLA later; the key insight is that this architectural choice fundamentally alters the economics of serving large language models, especially with long context (such as agentic) use cases.

Following the MLA is the DeepSeekMoE component (see Fig. 9). The routing mechanism uses a linear layer mapping from hidden-size to n-routed-experts to classify which experts are most relevant for each token based on its semantic content. Each token is individually routed to a different set of experts. It is a common confusion that the experts are selected at the sequence (or query) level; we want to highlight that this is not the case. DeepSeek selects the top eight experts per token, with routing scores serving as weights when combining expert outputs.

Each expert contains a standard MLP structure with SwiGLU activation: three linear layers (W1 and W3: hidden-size → moe-intermediate-size, W2: reverse). Crucially, the moe-intermediate-size (2048) is smaller than hidden-size (7168) - the opposite of traditional dense models like Llama 3.3 70B, where intermediate dimensions are 3.5x larger (28672 vs 8192). This compression reduces per-expert computational costs while maintaining model capacity through expert diversity.

Beyond the eight routed experts, every token also passes through a shared expert that provides base knowledge common to all inputs. This hybrid approach balances specialization with computational efficiency.

This architecture is repeated across all 58 MoE layers, followed by the LM head for next-token prediction. The key architectural innovations - MLA for memory efficiency and sparse expert activation - represent a fundamental shift from traditional dense transformers toward economically optimized inference. In the following sections, we analyze the computational details of MLA and MoE components, identifying the primary bottlenecks that determine serving costs and scaling limits.

If you are not interested in the architecture details and optimizations, and you just want to read the conclusions and meta analysis feel free to skip to section "Hardware considerations and profit margins".

Figure 9: DeepSeek V3 MoE transformer layers consist of 3 shared experts and 256 routed experts. The router, a single weight matrix, predicts scores where the top 8 scores are chosen and the token is routed to the corresponding expert. In total a token goes through 9 out of 257 (256 routed + 1 shared) experts in each forward pass. Figure taken from DeepSeek V3.

Inference Optimization Techniques

Due to "policy and regulatory constraints", aka export restrictions, the DeepSeek team is operating under severe compute constraints, both for training and for inference of their models. This hardware scarcity is evident when inspecting the average throughput of DeepSeek V3.1 reported by OpenRouter (see Fig. 10). While Western inference providers like Fireworks and Together serve it at a comfortable 60-80 tokens per second (tps), DeepSeek manages only ~25 tps.

For reasoning models, generating thousands of tokens before final answers, this leads to a significantly worse UX for interactive usecases. A typical math problem requiring 3,000 reasoning tokens takes 120 seconds (3000 tokens/ 25 tps) - forcing users into a 2-minute wait that limits practical applications to scenarios where user’s latency tolerance is high.

Faced with hardware scarcity, DeepSeek did what any good engineer would do: they got creative. Rather than designing for perfect conditions they'll never have, they optimized everything around their actual constraints. The inference optimizations detailed in a recent paper by DeepSeek - from expert routing to MLA-MoE computation overlap to networking topology - reflect this constraint-driven mindset. We examine several optimizations that most impact our theoretical cost model.

Figure 10: Average DeepSeek V3.1 throughput as measured by OpenRouter. Note how much slower than the Western providers DeepSeek is on average. This throughput difference suggests constrained hardware allocations, likely forcing DeepSeek to prioritize larger batch sizes over individual request latency. While this strategy reduces per-token serving costs through higher GPU utilization, it comes at the expense of user experience. 31.08.2025

Expert Parallelism

As shown later, expert layers contain approximately 661B parameters, representing 98.5% of the total parameter count. This distribution necessitates careful consideration of parallelization strategies. To minimize weight-related overhead, parameter distribution rather than duplication provides the optimal approach.

In traditional tensor parallel configurations with dense FFN layers, communication involves dispatching and combining hidden-size values per token and layer. MoE models introduce complexity because batch tokens route to different model components (different experts). Given the compact dimensions of expert weight matrices (d-moe=2048), tensor parallel sharding would fragment these matrices into excessively small components, resulting in suboptimal blocked matrix multiplication performance. Expert parallel sharding preserves matrix integrity, enabling more efficient memory access patterns during GEMM operations.

However, this approach increases total communication overhead to

per token and layer, where d is the model's hidden size. Because experts may reside on different devices, expert parallel distribution becomes more exposed to communication bottlenecks, fundamentally changing the performance characteristics compared to dense models.

For deployments with small numbers of devices, particularly single or dual-node configurations, or those with very bad inter-node communication hardware, tensor parallel sharding can achieve superior performance due to the reduced communication overhead.

Expert Parallel Load Balancing

Expert routing probabilities exhibit non-uniform distributions (see Fig. 11), causing some experts to receive disproportionately higher request volume while other experts remain underutilized. Naive uniform expert distribution across available devices creates two critical performance issues: (1) uneven communication patterns where bottlenecks stall entire forward passes, and (2) asymmetric computational loads across devices. Additionally, heavily utilized devices must handle increased activation loading and memory write-back operations, compounding performance degradation.

Load balancing strategies can mitigate these issues through intelligent expert distribution across devices to achieve more uniform computational loads. Furthermore, frequently accessed experts can be duplicated to reduce communication peaks, though this approach carries the trade-off of increased weight loading and reduced per-GPU KV cache capacity. Since the expert layer computation is most often memory bound, having the same number of experts on each device is optimal. Therefore, the number of additional experts per layer must be constrained to multiples of the expert parallel size.

An interesting use case are uneven node configurations, where redundant experts can be used to fill up underutilized devices to achieve balanced expert distribution. For instance, the SGLang team reported using nine nodes for decode operations (72 expert parallel size) with 32 additional experts, achieving favorable trade-offs between additional memory overhead and reduced communication peaks.

Importantly, expert load balancing becomes increasingly challenging as node count increases. This degradation occurs because fewer nodes concentrate more experts per device, increasing the probability of achieving system-wide balance. So for deployments on very few nodes, the improvement in expert balancedness is not worth the corresponding redundant experts.

Figure 11: Expert imbalance during prefill and decode phases as reported by SGLang. These are empirical observations from running over some specific datasets, so in your particular application the exact distribution will differ. The meta point, that experts usage is not uniformly distributed, stands, though.

Location-Aware Expert Selection

To minimize the limiting inter-node communication, expert selection can incorporate locality penalties that preferentially route tokens to experts residing on the same node where their attention computations were performed. This approach reduces cross-node communication overhead, which often represents a primary bottleneck in distributed MoE inference.

During training, DeepSeek V3 implemented expert routing constraints ensuring each token routes to at most M nodes. Node selection follows the sum of the highest Kᵣ/M affinity scores for experts distributed on each node, where Kᵣ represents the number of routed experts and M the maximum number of nodes per token.

This methodology can be adapted for inference scenarios but requires careful tuning to balance locality benefits against response quality. A notable consequence of this approach is that token routing patterns become dependent on the position of the sequence within the batch, potentially creating position-dependent expert utilization patterns that may affect model responses.

Data Parallel Attention

Attention computation employs a data-parallel approach, distributing requests across available devices (see Fig. 12). This strategy enables KV cache sequences to remain on single devices, eliminating the need for duplication or inter-device communication of the latent KV cache, which, on the other hand, would be required under tensor-parallel MLA computation due to projection matrices.

However, this data-parallel approach requires duplicating all MLA weights across devices, approximately 10 GB for DeepSeek V3.1, and they must be loaded during each forward pass. This presents scalability tradeoffs for large-scale deployments. Specifically, in configurations exceeding 64 GPUs, the MLA weight parameters can consume greater memory resources than the expert layers themselves. This duplication reduces available KV cache capacity and renders MLA computations memory-bound for most batch sizes.

Figure 12: Visualization of DP Attention from SGLang blogpost. Different GPUs process different microbatches. It works both for dense layers (first 3 layers in DeepSeek) and mixture-of-experts layers.

Hiding Communication with Two-Batch Overlap

As shown previously, expert parallelism in MoE models generates approximately nine times the communication volume compared to traditional tensor parallelism. To mitigate this overhead, a two-batch overlap (TBO) strategy can be implemented to mask communication time behind computation. This approach partitions the global batch size into two micro-batches, enabling simultaneous execution where one micro-batch performs computation while the other handles communication operations.

Effective overlap implementation requires careful orchestration of computational and communication phases. Figure 13 illustrates a basic TBO configuration for decode operations. Since communication operations consume minimal computational resources, TBO can achieve runtime improvements of close to a factor of two in certain scenarios.

Figure 13: Creating two mini-batches allows for overlapping the communication of one mini-batch with the computation of the other, thus in most cases hiding the communication from the overall runtime. Depending on the inference regime, different overlapping structures can be used. The figure depicts DeepSeeks setup for the decode phase during inference. Figure taken from DeepSeek's profiling of V3.

Prefill Decode Disaggregation

LLM inference comprises two distinct phases with fundamentally different computational characteristics. Prefill operations process entire input sequences simultaneously, creating compute-intensive workloads with high FLOP utilization but minimal KV cache requirements. Decode operations generate tokens iteratively, resulting in memory-bandwidth-bound computations. Also, decoding is much more latency sensitive due to its repetitive nature.

As explained in more detail on SGLang's blog post traditional unified engines process prefill and decode batches together, introducing three critical inefficiencies: (1) incoming prefill batches interrupt ongoing decode operations, causing substantial token generation delays; (2) data-parallel attention imbalances occur when workers simultaneously handle different batch types, increasing decode latency; and (3) incompatibility arises with advanced expert placement strategies that require different dispatch modes for each phase.

Prefill-decode disaggregation resolves these issues by separating workloads into dedicated clusters optimized for each phase's requirements, with prefill usually needing fewer resources than decode due to better compute utilization.

Figure 14: Upon receiving an input request, the workflow proceeds as follows: 1) A Prefill Server and a Decode Server pair via a handshake, establishing a local sender and receiver, respectively. 2) The Decode Server pre-allocates the KV cache, signaling the Prefill Server to begin the model forward pass and compute the KV caches. 3) Once computed, the data transfers to the Decode Server, which handles iterative token generation. As per SGLang blogpost.

Theoretical Performance Model

The theoretical performance model creates a virtual clone of our DeepSeek V3.1 model, enabling analysis of different hardware configurations to determine optimal MoE model serving strategies. Additionally, this framework allows identification of system bottlenecks across various deployment scenarios. While this model is specifically designed for DeepSeek V3 architecture, extensions to Kimi-K2 and other MoE architectures are quite simple.

The theoretical performance model analyzes attention (MLA in the case of DeepSeek V3.1) and expert computations separately, as these components may be constrained by different resources at different times. Since two-batch overlap techniques may be employed to hide communication, these two parts of the model may operate without overlap and are thus not able to hide the memory loading over both computations combined. Furthermore, the model accounts for communication across potentially heterogeneous networks incorporating different intra- and inter-node communication hardware. Communication time can be optionally overlapped using TBO. Finally, we consider scenarios where expert distribution is nonhomogeneous, resulting in imbalanced communication, increased memory loading, and uneven computation across GPUs, which can bottleneck the entire system.

With communication computation overlap, we define the total system performance as:

Without TBO, we can (1) look at the memory loading time and compute time for each block (EP and attention) and take the maximum; (2) add the communication time; and (3) consider imbalanced experts:

The model operates under the following assumptions:

No computational and memory loading overhead from the DeepEP TBO communication library. This assumption does not hold in practice, as the library launches a non-negligible number of CUDA kernels.
All computations and weights are performed and stored in FP8, with the exception of communication operations, where dispatch occurs in BF16. This assumption is largely accurate since over 98% of parameters reside in expert weights, which utilize 8-bit quantization.
The analysis focuses exclusively on decode performance without considering prefill operations. Theoretical derivations for prefill performance will be presented.

First lets look at the execution times of the MLA and expert MLP networks of a transformer block based on the operations performed and the memory loaded. Second we consider the communication. For reference all variable names are listed in Table 1.

Table 1: Nomenclature of the symbols used in this description for DeepSeek V3, including the name in the publicly available config.json file.

Memory Loading

To estimate the time spent loading from memory, we look at what gets loaded during each forward pass. First let's look at the MLA secondly at the expert networks.

MLA

The memory requirements during inference for the MLA can be categorized into three primary components: MLA weights (read), KV cache (read and write), and activations (write).

MLA weights are read once per forward pass. The MLA mechanism requires several weight matrices per layer, stored in FP8 format. As outlined for DeepSeek v2, during inference, the up-projection matrices for the K- and V-tensors can be included into other matrices, thus reducing the overall number of matrices.

This yields a total of 187.2 MB per layer, resulting in 11.4 GB total attention weights that must be replicated across each data-parallel attention rank.

KV Cache size is significantly reduced in MLA architectures compared to traditional attention mechanisms. The cache size per token is determined by

where d_c = 512 represents the KV compression dimension, d^h_R = 64 denotes the per-head dimension of decoupled queries and keys, and L = 61 is the number of layers.

The memory requirement becomes:

For a batch processing S input tokens and generating B output tokens, the system loads S x B tokens and saves B tokens to cache.

Activation vectors require temporary storage for communication between expert computations. These activations are loaded once and written back once during the forward pass:

While non-negligible for very large batch sizes, activation memory remains relatively small compared to weight and cache requirements.

Expert Networks

For the expert MLP networks, we have two sources of memory transfers: model weights, which get read once for each forward pass; and activations, which get loaded once in FP8 and written back once in BF16.

The latent vectors are loaded from and saved back to memory for communication. We load them once in FP8 and write them back once in BF16. This part is often negligible.

The expert MLP is made up of two weight matrices W_1 and W_2, which perform a down-projection followed by an up-projection. Note that traditional transformer models have a larger intermediate space thus they first up-project, which makes the matrices much larger. Secondly, we have to account for the SwiGLU gate weight matrix as well. So the size of one expert is

Lastly, we have to account for the router in each expert which is of size

To get the expert weights per devices, the weights are distributed evenly over the devices. Additionally we assume that one device has to hold the full router (overestimation). So for each expert layer we get:

For DeepSeek V3.1 not all transformer layers are using the MoE architecture. The first three layers are made from traditional dense MLPs. Using d_dense as the intermediate size we get:

When serving in expert parallel with SGLang these weights are sharded data parallel. Thus:

Embedding Layers

The embedding matrix size is V x d, where V is the vocabulary size. As we need to embed and un-embed, and each embedding matrix is sharded across GPUs in a data-parallel fashion, the per-GPU size is:

As this is a large matrix with V » 100k, we assume this calculation is always memory-bound, and thus we only take into account the memory loading time.

Computation

To quantify computational latency, we examine the multi-head latent attention (MLA) mechanism following the architectural specification detailed in DeepSeek v2 Appendix C. Our analysis incorporates matrix absorption optimizations that enable certain linear transformations to be merged during inference. We validate our computational framework against the DeepSeek V3 training time calculator to ensure consistency, though we need to employ some changes due to optimizations only possible during inference. Furthermore, the single-token generation characteristic of decode operations substantially simplifies several equations relative to training contexts where full sequence processing is required. We denote computations specific to prefill scenarios with Prefill annotations to distinguish where prefill diverges from decode execution paths.

Our computational analysis has three main parts: a vanilla MLA implementation baseline, optimized MLA with matrix absorption techniques, and expert network computational latency.

MLA

MLA computational demands differ substantially between prefill and decode phases due to sequence length scaling characteristics, necessitating phase-specific analysis of each MLA component. We begin by reviewing the MLA computation procedure as specified in DeepSeek v2 Appendix C:

where c_t^Q, c_t^{KV} are the compressed Q and KV tensors respectively. During decode operations, MLA requires up-projecting the K- and V-tensors for every token within the cache, leading to significant computational overhead. To mitigate this burden, the up-projection matrices for K- and V-tensors can be absorbed into existing matrix operations, thereby reducing the total number of matrix-vector multiplications, as briefly touched upon in the DeepSeek v2. Through computational reordering, the self-attention mechanism, for instance, transforms to:

The most important advantage, as demonstrated in the following sections, is that the KV cache no longer requires up-projection operations. We apply analogous reordering to the V-tensor up-projection weight matrix W^{UV}, absorbing it into the output projection matrix W^O.

However, materializing

increases the amount of memory transfer and reduces available KV cache capacity. But there is a way to have our cake and eat it. Our approach diverges from DeepSeek's hint by avoiding materialization of the resulting composite matrix. Instead, we achieve efficiency through dynamic computation: rather than storing the composite matrix

we compute it on-demand during each forward pass while calculating q_t^C. This strategy maintains an identical memory footprint and loading patterns while eliminating the computationally expensive sequence-length dependency imposed upon us due to the up-projections of K- and V-tensors during the decode phase.

To understand the computational requirements of MLA, we begin with analyzing the straightforward vanilla implementation to establish baseline FLOP counts, then progress to the optimized variant and show its performance improvements.

Vanilla MLA Implementation

The vanilla implementation follows the MLA specification from DeepSeek v2 Appendix C, comprising three phases: latent projection, self-attention, and output projection.

Latent Up-/Down-Projection The latent projection involves two sequential operations: down-projection to the latent space, followed by up-projection to the attention dimension. For the sake of simplicity, we ignore the RoPE and Softmax calculations.

Prefill:

During decode operations, most sequence-length dependencies can be eliminated since intermediate values are either used once or cached. However, performing K- and V-tensor projections for each token in the context remains necessary due to caching in latent format. This creates a problematic sequence-length dependency that significantly degrades performance. So vanilla decode projection becomes Decode (single token):

where the FLOPs_{k_RoPE_proj} only need to be calculated once for each k, as they are cached.

Attention Computation The attention mechanism computes query-key interactions against cached key-value pairs, exhibiting computational complexity that scales with sequence length.

Prefill:

Decode:

Output Linear Transformation The final linear transformation projects attention outputs back to the model's hidden dimension:

Prefill:

Decode:

We now examine the computational modifications when implementing the non-materialized matrix-absorption approach.

Matrix-Absorbed MLA Implementation

The primary goal of the two matrix absorptions:

is to eliminate per-token KV cache up-projections by enabling direct computation on compressed KV-tensors. This fundamentally alters the memory access pattern.

Latent Up-/Down-Projection Since we absorb the up-projection matrices for K- and V-tensors, we no longer need to perform these projections.

Prefill:

Decode:

This eliminates the sequence-length dependency during the projection stage, which constitutes a significant computational bottleneck.

Attention Computation In this self-attention computation, we need to take the absorbed

into account.

Prefill:

Decode:

As shown, the sequence-length dependency is eliminated for FLOPs×W^(UK) and FLOPs×W^(UV) without introducing additional matrix materialization.

Output Linear Transformation The final linear transformation again projects attention outputs to the model dimension, incorporating the absorption of W^UV into W^O. We again avoid materializing the absorbed matrix to minimize memory overhead.

Prefill:

Decode:

Expert Networks

Following the MLA computational analysis, expert-network computation is relatively simple. Each expert module comprises two components: (1) a router and (2) the experts themselves. The experts consist of two layers with inverse dimensions and a SwiGLU activation function, incorporating gate projection weights of matching dimensions.

In total, for all experts, we get:

With expert imbalance (characterized by the expert imbalance factor β_eb that we introduce in later sections),

Communication

Communication Base Model

Following the analysis presented in the SGLang blog post, the only inter-GPU communications stem from expert-parallelism sharding. Figure 16 illustrates this communication pattern during forward passes within individual layers. Each layer has two distinct communication phases: a dispatch phase routing tokens from data-parallel MLA blocks to selected experts, followed by a combine phase aggregating expert computation results for propagation to the next data-parallel MLA block in the subsequent layer.

Figure 16: Communication pattern during expert-parallel sharding. Each GPU sends out n(r,u)^E + n(s,u)^E = 9 messages going from the MLA to the expert computation and receives n(r,u)^E + n(s,u)^E = 9 many messages going from the experts computation to the next MLA step. This communication can happen intra- or inter-node, where systems with larger EP size increase the percentage of inter-node communication. The base image was taken from the SGLang blog post.

The DeepSeekV3 scaling analysis establishes a baseline communication model assuming uniform expert distribution across devices, where "each device holds one expert’s parameters and processes approximately 32 tokens at a time". This configuration corresponds to a 257-GPU deployment with homogeneous network connections (each GPU-to-GPU connection has the same bandwidth and latency). We extend their formulation to accommodate arbitrary system configurations under assumptions of uniform network topology and perfectly balanced expert allocation. For each GPU communication link, each dispatch and combine operation processes

tokens (accounting for dual-batch overlap), with replication across

The total communication volume per GPU link per forward pass for DeepSeek V3.1 architectures becomes:

The leading factor of 2 reflects computation-communication micro-batch overlap, where consecutive batch processing introduces sequential dependencies. Dispatch operations utilize FP8 precision while combine phases use BF16 precision.

The effective communication bandwidth corresponds to the minimum link bandwidth within the system topology. In well-balanced configurations without bottlenecks this is the link to each GPU. In general cases the constraint becomes:

Within homogeneous network environments, the communication execution time follows:

Improvement 1: Non-Heterogeneous Communication Links

Standard systems often have different interconnect speeds for intra- and inter-node communication. Given that 1/n_nodes of total communication volume remains within individual nodes, this fraction of communication becomes negligible in heterogeneous network configurations where interconnect speeds differ substantially (for example, NVLink at 450 GB/s versus InfiniBand at 50 GB/s). The NVL72 rack configuration represents a notable exception, providing uniform NVLink connectivity across all nodes within the rack.

For systems with heterogeneous interconnections, the total communication time becomes:

Improvement 2: Expert Imbalance

Our initial analysis assumed uniform expert distribution across GPUs, enabling straightforward communication volume calculations from the data-parallel MLA perspective. However, this assumption breaks down in practical deployments. As an easy counter example, consider a system deploying 9 experts across 8 GPUs: one GPU must accommodate 2 experts due to the discreteness of the experts, resulting in nearly double the communication overhead compared to balanced configurations:

\\frac{B}{8} \\times 0.0118 \\text{ GB}\n\\end{aligned}","id":"AAPHZDKQRR"}" data-component-name="LatexBlockToDOM">

While sufficiently large batch sizes with random expert routing (and appropriate shared expert replication) could theoretically rebalance this load, empirical measurements in production systems show inherent differences in expert utilization that contradicts this assumption.

From the perspective of individual GPUs hosting experts, the communication volume becomes:

Incorporating the expert load imbalance factor (introduced in the next section) into the formulation yields:

For heterogeneous network configurations, the resulting communication time then becomes:

List of Common Interconnects

Table 2 presents unidirectional bandwidth specifications for all-to-all communication patterns for commonly used interconnect technologies, where communication throughput is constrained by the bandwidth available to individual GPUs (unidirectional bandwidth):

Table 2: Unidirectional bandwidth specifications across different interconnect technologies.

Expert Balancedness

The distribution of the experts over the GPUs can have a big impact on the communication and execution times of the expert layers. As an illustrative example, we can take a system with 2x8 H100 GPUs and distribute all the experts uniformly. In this case, there is a GPU which has the shared expert plus roughly 16 routed ones. Since all items in a batch will go to the shared expert, this GPU has to load roughly

more activations than the other GPUs. Furthermore, 2.7 times more communication volume will go through the link connecting to the GPU.

To model this imbalance, we define and expose the variable β_eb to the user. Similar to the definition of SGLang, β_eb is defined as the ratio between mean expert load and maximum expert load among GPUs, so:

Therefore, β_eb_gpu=1 is a balanced case and β_eb=1/n_GPUs would be completely imbalanced. Thus the average load increases by L_imbalanced = (1/β_ep) × L_balanced.

For balancing the experts, some n_additional_experts get duplicated onto multiple GPUs. This can lead to an increase in EP memory loading time. As EP is often memory-bound, this can lead to an increased EP execution time. Thus balancing the experts is a tradeoff between loading more weights and more homogeneous communication and computation. Finally, one has to ensure that (n_routed_experts + n_additional_experts) modulo ep_size = 0, as otherwise imbalance in the memory loading and computations would be introduced by design.

Figure 17 from the SGLang blog post shows examples of expert balancedness given a number of GPUs and potentially active load balancing.

Figure 17: Achieved expert balancedness given a fixed number of devices using expert parallel load balancing. Source

From Parts to the Whole

Given all of these considerations, we implemented a theoretical model estimating the model throughput given the hardware. It should make it easier to understand the tradeoffs between the latency, throughput, and cost between different hardware providers.

The model includes a number of assumptions, such as:

All weights are stored in FP8; the MLA is computed in BF16; the matrix multiplications in the expert layers are performed in FP8. The communication is done in FP8 apart from the dispatch which is done in BF16.
We make some strong assumptions about the overhead for compute, memory bandwidth, and communication. We assume the same level of inefficiency across different hardware to make it fair. The levels are arbitrary and arguably can be one of the leading sources of error in our calculation. These inefficiencies in reality are also not the same for every hardware and are strongly dependent on the implementation.
To make the calculations simpler we assume that no MTP is performed. We managed to make it run with MTP; however, we deemed the performance gains not worth the increase in complexity of the model, especially for larger batches.
We only looked at the decode performance without taking into account prefill.
We assume no compute and memory loading overheads from the DeepEP two-batch communication library. This is not true, as these operations start a significant number of CUDA kernels, which can have downstream effects on highly optimized kernels like GEMM, as they will no longer get the expected number of threads.

At a high level, the performance model comprises three primary execution components: MLA computation, expert parallel (EP) computation, and communication overhead. For both MLA and EP operations, we determine whether memory bandwidth or computational throughput constitutes the limiting factor. Communication can be optionally overlapped using two-batch overlap (TBO), where total execution time for one forward pass becomes

The high-level model structure follows:

Computational, memory, and communication time estimates use the previously derived formulae, adjusted for real-world implementation inefficiencies. Practical systems rarely achieve theoretical peak performance, necessitating inefficiency factors across all components. The DeepEP communication library documentation indicates 40 GB/s achieved throughput from 50 GB/s theoretical peak, yielding a communication overhead of ~25%. FlashMLA achieves approximately 66% MFU. Expert-layer computational performance, based on DeepGEMM benchmarks showing 1550 TFLOPs from 1980 TFLOPs theoretical FP8 dense peak performance results in an EP computation overhead factor of ~30%. Both computational inefficiencies receive an additional 10% penalty to account for suboptimal input conditions and overhead between the kernels.

Memory inefficiency estimation proves more challenging without profiling. Due to some operations, such as matrix multiplication, requiring multiple loading operations for the same value, and the fact that most kernels are optimized for compute bound scenarios, we apply a conservative inefficiency factor of 2.0 to account for these overheads.

Token generation rates are calculated as the inverse of total execution time, with global throughput scaling by concurrent batch size:

It is important to note that this model does not consider whether the proposed configurations are viable under real-world memory constraints. For instance, long context sequences can drastically reduce the maximum number of concurrent sequences due to memory limitations, resulting in significantly lower throughput than theoretical predictions.

Predictions and Real-World Comparison

To validate our theoretical model we compare it to real world measurements using three vastly different hardware setups:

4x8 H100: This is the basic setup that we consider reasonable to maintain by a large enterprise. This was also the setup we managed to obtain, therefore we have the measurements for all reasonable batch sizes.
9x8 H100: This is the setup from the SGLang blog post including their tuned performance measurements.
12x4 B200: This involves 48 out of 72 GPUs in an NVL72 setup. We use this to visualize how differently the new generation of hardware performs. This is also a setup tested by the SGLang team.

Figures 18 and 19 demonstrate that our theoretical model achieves reasonable agreement with empirical measurements. The first figure presents a systems total throughput and tokens per second (TPS) per request, while the second emphasizes efficiency by showing TPS per GPU.

As anticipated, our model overestimates actual performance by a considerable margin and we have to tune the model with our inefficiency factors. This discrepancy arises from two primary factors: first, individual component kernels fail to achieve peak performance as discussed previously; and second, peak performance of these individual components is rarely attained with constrained batch sizes. Furthermore, end-to-end optimization is often suboptimal, resulting in kernels optimized for different operational scenarios. These factors justify our incorporated inefficiency assumptions.

Small batch size estimation proved particularly challenging, as illustrated in Figure 18. At batch size 32, actual performance in our setup (shown in blue) exceeds theoretical predictions (given our inefficiency factors; it does not exceed the upper bound posed by the hardware itself). In our model, we assumed uniform expert activation probability, which does not reflect reality. In practice, fewer experts are activated, resulting in higher throughput than predicted. As batch size increases, throughput converges to predicted levels, indicating activation of most to all available experts.

Consistent with our stated assumptions, the model does not assess whether given batch sizes are practically feasible under given systems memory constraints. In our system configuration, sequence eviction begins after batch size 1024, causing a sharp decline in per-request throughput and total throughput saturation. Increasing node count expands the amount of memory available for KV cache, enabling larger batch sizes as demonstrated by the two SGLang configurations.

Figure 18: Comparing the theoretical model with real world measurements for decode throughput performance. The measurements for the 4x8 H100 setup were conducted using our tokenomics benchmark, running AIME requests against the model. Data for the two other setups stems from the SGLang two blogposts Our model works well in different scenarios on multiple hardware setups.

Throughput: Theory vs Practice

Examining Figure 19, we observe that increasing batch size per GPU improves system efficiency substantially. However, realizing these optimal batch sizes necessitates extensive memory allocation for storing KV cache. Given that total weight size remains largely static (excluding data-parallel MLA weights), distributing computation across additional GPUs reduces the per-GPU weight burden proportionally.

Figure 19: Per-GPU efficiency across all configurations demonstrates consistent scaling characteristics. Measurements on our system (in blue) exhibit a performance ceiling due to token memory constraints that trigger sequence evictions, preventing running the full batch concurrently. This limitation clearly illustrates that additional memory capacity would enable our system to achieve higher throughput levels, validating the case for larger setups.

An increasing problem that large-scale systems pose is the communication overhead, which scales linearly with batch size. Consequently, configurations with large batch sizes and short sequence lengths may encounter communication bottlenecks. This phenomenon manifests in Figure 19 where the 4x8 H100 configuration achieves higher per-GPU throughput at batch size 512 compared to the 9x8 H100 setup, because the latter becomes communication-bound. Nevertheless, the former configuration cannot sustain these batch sizes in practice and will evict sequences, effectively running at smaller batch sizes. This also demonstrates the advantages of the NVL72 super node for inference workloads, effectively mitigating potential communication constraints.

Kimi-K2 represents the first open-source LLM surpassing 1T parameters. The model employs the essentially identical architecture to DeepSeek V3.1, just with more routed experts per layer. As demonstrated in Figure 20, this configuration yields reduced throughput, particularly under memory-bound conditions where MLA runtime remains minimal. However, achieving equivalent batch sizes across identical hardware configurations is infeasible for large batches, as Kimi-K2 requires greater GPU memory allocation for storing its weights. Consequently, while theoretical performance degradation appears modest, practical performance disparities may be more pronounced due to reduced effective batch sizes compared to DeepSeek V3.1 deployment.

Figure 20: Comparison of theoretical predictions for Kimi K2 to DeepSeek v3. Even though Kimi K2 has a lot more weights, its only slightly larger due to the MLA and the communication being the same in both models. However, we dont model if a given setup would actually have enough memory and thus would be able to run a given batch size. Kimi K2 would reach this limit much earlier leading to more evictions and smaller achievable batch sizes.

Similar challenges with increased KV cache evictions and diminished effective batch sizes emerge when serving long sequences, as their KV cache demands substantial memory space. Although sequence length exerts a comparatively small impact on decode performance, as shown in Figure 21 (while exhibiting quadratic dependency during prefill on the sequence length), it constrains to running very low batch sizes, significantly reducing system efficiency.

For example, a 4×8 H100 setup provides roughly 20 GB of GPU memory per GPU for the KV cache. At a context length of 32,768 tokens, this translates to a maximum effective batch size of

In practice, fragmentation of the KV cache and other inefficiencies reduce this number further. DeepSeek reports a much shorter average context length of 4989 tokens, which remains within manageable parameters.

Production serving environments typically operate under service level agreement (SLA) requirements that mandate minimum TPS thresholds per request. As illustrated in Figure 22, these performance guarantees often impose surprisingly restrictive limits on achievable batch sizes. Providers find themselves constrained to operate with smaller batches to meet per-request latency requirements, resulting in suboptimal efficiency. This constraint disproportionately affects smaller deployment configurations, creating a natural advantage for large-scale enterprise operations, serving to hundreds of thousands of customers.

Our previous analysis has given limited attention to the prefill phase. During prefill, the system computes the complete KV cache for all input tokens and generates the first output token. The computational complexity of this phase scales quadratically with sequence length due to the full attention computation required. For shorter sequences, prefill duration remains substantially shorter than the subsequent decode phase. However, in long-context scenarios, prefill can exceed decode time, creating significant system bottlenecks.

Serving frameworks typically interrupt decode operations to process prefill batches, stalling the entire inference pipeline. Additionally, prefill operations are generally compute-bound rather than memory-bound, requiring distinct optimization strategies compared to decode. Large-scale deployments address this by implementing prefill-decode disaggregation, physically separating these phases across different instances. The prefill instance typically operates on fewer GPUs than the decode instance, reflecting the shorter duration and different resource requirements of prefill operations.

Figure 21: Serving long contexts affects decode performance less than prefill, since prefill’s runtime scales quadratically with input sequence length. However, a much bigger factor for decode, one we don’t currently model, is the size of the KV cache. A large KV cache renders the MLA memory bound for larger batch sizes and additionally forces running at smaller effective batch sizes, preventing the operation from becoming compute-bound in any realistic setup.

Figure 22: The plots shows the maximum achievable batch size for running with at least 20 TPS per request. This limit is arbitrary but seems close to what providers often offer as an SLA. Again the setup using the NVL72 outshines the competition, allowing for more efficient serving given an SLA.

Interactive chat applications and agentic workflows frequently involve multi-turn sequences where consecutive requests share common prompt prefixes. Given the potential length of these conversational contexts, repeatedly executing prefill for shared content becomes highly inefficient. Sophisticated caching mechanisms can drastically improve performance by reusing computed KV caches across requests. Effective caching architectures extend beyond GPU memory to use CPU memory and even persisting to disk. Even disk-to-GPU transfers often outperform recomputation for sufficiently long sequences. Additionally on-disk caches can be held for longer, potentially for days.

Such caching infrastructure can also serve as a buffer layer between disaggregated prefill and decode instances. Systems like LMCache and Mooncake provide foundational solutions to this problem. However, setting up such a caching infrastructure is non-trivial, and we save this topic for a future blog post. For the current analysis, we note that while prefill can substantially impact overall system performance, well-designed caching strategies offer substantial mitigation. DeepSeek's production deployment reports achieving approximately 56.3% cache hit rates, demonstrating a good reduction in prefill time when deployed.

While open source inference frameworks such as SGLang and vLLM may not achieve the absolute peak performance of specialized commercial inference providers like Fireworks or Together, we believe the performance gap remains relatively narrow. Evidence from production deployments, as referenced in Tweet 23, suggests that open source solutions approach state-of-the-art efficiency levels achieved by major enterprise implementations.

Figure 23: Member of technical staff at Thinky suggesting that the open-source inference software is not that far behind the proparitary inference software stacks available inside the big labs. Twitter

Our theoretical analysis combined with empirical measurements indicates that proprietary inference providers likely achieve comparable computational efficiency to well-optimized local personal deployments. However, these commercial providers maintain competitive advantages through access to superior hardware and more favorable economies of scale. The primary differentiation appears to stem from infrastructure advantages and economies of scale rather than fundamental algorithmic or implementation superiority in the inference stack itself.

Hardware considerations and profit margins

When deciding on hardware for large-scale MoE inference setups like DeepSeek V3.1 or Kimi, several key factors must be considered. First, due to the sparse computation pattern and its effects on the parameters that need to be loaded for a forward pass, there are significant economies of scale from adding more GPUs to the setup. In other words, a combination of four nodes should outcompete two pairs of workers running on two nodes. This is pretty well visualized in the graph created by the SGLang team (see Fig. 24), where a setup of 72 GPUs vastly outperforms one with 16 GPUs involved on a per GPU basis, an observation that confirms what we have seen before in results from Perplexity (see Fig. 2).

Figure 24: Token throughput measured on NVL72 SGLang Blog. 2.08.2025

Second, cumulative throughput and per-user experience are highly dependent on the batch size at which the setup operates. This is well visualized in the benchmarks run we did for testing DeepSeek (see Fig. 18), and in the numbers provided by the SGLang team (see Fig. 25). The larger the allowed batch size the larger cumulative throughput but at a cost of worse "per user" experience (see Tab. 1) - a fundamental trade-off in inference optimization.

Figure 25: Token throughput measured on NVL72 SGLang Blog. The input and output lengths are set to 2000 and 100, respectively.

Table 3: Daily production token production, tps enjoyed by each user, and cost per 1M output tokens. We only consider here the decode phase and leave prefill out of the scope for now. We assume the numbers from SGLang blogpost. The input and output lengths are set to 2000 and 100, respectively. At longer context these numbers will look differently, but we hope this table shows the scale of overcapacity. We assume $2/h and B200 at $8/h. Especially the second number is to be treated with cautious as there is no official market for NVL72, no provider offers this on demand so there is some guessing on our part here.

Hardware selection is the next critical consideration. The optimal choice is highly contingent on the latency/throughput requirements for your specific use case. While B200s in NVL72 will offer superior per-GPU performance compared to H100s, they come at a significantly higher price point - assuming you can even secure them3. Depending on what the inference provider wants to prioritize, either cost or latency, it will affect the type of hardware that will be optimal here.

For your exact application, how many input and output tokens do you run with, how big is your profit per user, how big is your spread in the number of concurrent users per day, how flexible are users, and at the time of peak usage, how low can the tps drop to? All of these factors should impact what hardware will be optimal for you.

One of the interesting observations we made after running the theoretical model for different hardware setups was how much of a bottleneck for B200 slow interconnect is. There seems to be a massive performance gap between the numbers we estimate for B200s connected via InfiniBand and the ones connected by NVLink. This is obviously highly contingent on the model and how much we communicate between the nodes, but overall we believe that for model of a scale such as DeepSeek, running on B200s might be actually suboptimal, as the coms overhead is taking away most of the gains we get from faster memory and more FLOPS compared to H100s (see Fig. 27).

Figure 27: Comparing a system based on the DGX B200 nodes to an NVL72 reveals a striking difference: The DGX systems gets communication bound due to relying on Infiniband for the inter-node communication, drastically limiting the performance compared to the NVL72 system.

Another observation we hope you can take out from reading this text is how "chat centric" the current inference providers are. If you look at the throughput of DeepSeekV3 from various providers, as reported by OpenRouter, most of them offer very comfortable 50+ tps (see Fig. 28). While this is great if we have real time application like a chat, it is less than optimal if we want to use the model to generate the synthetic data. As we have seen multiple times throughout this text, in benchmark from Perplexity (see Fig. 5), in our theoretical estimations and in the real world observations we did (see Fig. 19), keeping the tps so high, while great for real-time applications, is suboptimal when we want to produce as large numbers of tokens as possible. For that the operational batch size would need to be largely increased. This would result in significantly degraded tps performance per request, but substantially higher overall throughput. For asynchronous or non-time-critical workloads, this trade-off is highly beneficial, dramatically reducing cost per token.

Figure 28: DeepSeek V3.1 speeds offered by various providers. 2.08.2025

Such setup would be ideal for synthetic data generation, where individual latency is irrelevant and the goal is maximizing total token production per dollar of hardware investment (Fig. 29). However, we believe that the current inference providers inadequately serve this market. While some offer batch discounts - Fireworks provides 40% off batch APIs, DeepSeek offers 50% off-peak pricing in China (see Fig. 32) - these limited options suggest significant unmet demand for flexible, throughput-optimized serving.

Figure 29: Conceptual model showing the problem with using the same setup for synthetic data generation and real-time API.

This infrastructure gap presents a significant opportunity for NeoCloud providers specializing in short-term, high-throughput compute rentals. Already today some providers, like Prime Intellect, offer on-demand access to the cluster of up to 64 H100s (see Fig. 30). Such a setup would be capable of daily generating billions of synthetic tokens even for large models like DeepSeek.

Reasoning traces from such data runs could be used for a reinforcement-learning fine-tuning (RLFT) in a product similar to the one offered by OpenAI. We believe that using RL to train models that directly maximize the business-specific rewards shows significant growth potential. Think of a virtual assistant helping people in making purchasing decisions, which is rewarded with actual dollar revenues, amplifying the actions that better convert into sales, or a virtual companion that promotes deeply engaging conversations, keeping users longer in the app. There undoubtedly is a huge economic incentive for businesses to apply such techniques, maximizing the revenues in a similar way as YouTube or TikTok already do with recommendation engines.

Figure 30: H100 multi-node cluster available at Prime Intellect. Cluster of InfiniBand connected nodes costs $2.49 × 64 ≈ $160/h or $160/h × 24 = $3825/day. We estimate such cluster to be able to produce up to 30B synthetic tokens within a day.

Furthermore, to improve the inference economics, such RL models could be trained using LoRA adapters or a similar technique and served alongside thousands of other models, all catered to specific use cases. This multi-tenant serving approach represents a compelling business opportunity for inference providers. Clients hosting their custom LoRA adapters on a provider's infrastructure face significant switching costs when migrating to competitors, as the adapters are optimized for specific serving configurations and client workflows. RLFT is based on unique and nuanced rewards that are very client-specific; unlike standard supervised fine-tuning (SFT), it much much more challenging to replicate it just via in-context learning, making it an even more compelling case for inference providers.

We expect the inference markets to further specialize in regard to offered throughput, latency, and pricing. It is only natural for providers of super-fast tokens like Groq and Cerebras to command a much higher premium for the tokens they deliver at few-second latencies and for other providers like NeoCloud specializing in high-latency, high-throughput inference scenarios focused on synthetic data generation. We hope to elaborate on this space in the future text.

From Tokens to Dollars - Estimating Tokenomics

Now we can finally address the original question: What is a fair price per DeepSeek V3.1 token? As we hope you know after reading through this text, the answer is an unsatisfying it depends.

The price per token depends on two factors: how much our hardware costs, and how many tokens it can produce per unit of time. As shown in numbers from Perplexity (Fig. 5) and SGLang results (Fig. 24), there are significant benefits to the performance per GPU when more GPUs are deployed. Putting more GPUs into serving a large-scale MoE model will yield higher performance per GPU and, as a result, lower our costs and boost profits.

Moreover, since LLM inference is heavily memory-bound, the batch size at which we serve the model significantly affects the combined throughput across all requests. The larger the batch size we use, the more tokens we cumulatively produce, but this comes at the cost of increased latency for each individual user, as reflected by Tab. 3.

Furthermore, not all hardware is created equal. While B200s will offer superior compute GPU performance compared to H100s, they are significantly more expensive (see Tab. 3), making them likely a less optimal option when optimizing for cost efficiency and producing as many tokens as possible while minimizing costs

All in all, while we cannot provide an exact number, we hope this analysis provides valuable insights into the factors impacting token pricing. The theoretical performance model we provide, though not perfect, should offer solid intuitions about expected performance and the trade-offs between different hardware options.

The missing tokens

Finally, we want to address the elephant in the room: the problem of missing tokens in the global market. As of this writing, DeepSeek V3.1. remains the most popular open-source model on OpenRouter. While displayed daily consumption hovers around 30B tokens per day, upon closer inspection it becomes clear that the majority of these are input tokens, not output tokens. Daily global consumption of DeepSeek V3.1 output tokens on OpenRouter is approximately 1B tokens. A quick examination of our numbers in Table 3 reveals that with a fraction of a single NVL72, we could meet this demand 20 times over while maintaining a reasonable >30 tokens per second per request.

Figure 31: Daily DeepSeekV3.1 token production on OpenRouter. 30.08.2025

This is a pretty significant gap. How is it possible that the global consumption of the most popular open-source model is so small that it could be met by a single NVL72 with 20 times the capacity to spare? Given this low demand, how can so many inference providers sustain their businesses? Put simply: who is making money here?

One might argue that we only account for the decoded tokens and that the majority of income comes from the input tokens. We do this because, due to the caching mechanism, it is quite challenging to accurately estimate how big portion of input tokens cost can be captured by the inference providers.

To quote DeepSeek:

Within the 24-hour statistical period ... Total input tokens: 608B, of which 342B tokens 56.3% hit the on-disk KV cache.

Caching drastically reduces the cost of prefill, slashing times to first token, and enabling the inference provider to move nodes from doing prefill to only working on decode. For example, DeepSeek offers a 75% discount (see Fig. 32).

Assuming that the DeepSeek caching numbers hold across the industry, this would put the daily total profit achieved via OpenRouter at:

spread across all of the inference providers. Some providers don't offer caching, some offer cheaper pricing, and some offer more expensive pricing than DeepSeek so estimating the exact amount being spent daily is difficult, but we don't expect it to be much different than that. Which begs a question? Where is the demand for DeepSeek?

The first natural answer is that OpenRouter just captures only a small portion of the global demand for DeepSeek models. The question is, how small?
Even had it been just 1%, assuming that our estimations are accurate, it could easily be fulfilled by 3 to 4 NVL72s. One caveat of this calculation is that our numbers (based on the SGLang benchmark) assume a short input length of 2000 tokens, something that we try to account for in our theoretical model. If we increase the context length from 2k to 32k the KV cache footprint increases 16x, severely limiting the batch size at which we can operate, considerably altering our potential margin.

Overall we don't have an answer backed by precise data for the question "where are the missing tokens?"; In the numbers revealed by DeepSeek, they claim to be processing 168B output tokens a day (these numbers are from Feburary 2025, the current numbers are likely significantly higher). This is orders of magnitude more than OpenRouter, a gap that we find quite surprising, but that would largely answer this question. Perhaps the vast majority (>99.9%) of the global demand for the DeepSeek tokens is matched by calling the providers directly and not via services gathering multiple providers.

The only other provider that we were able to find to openly share their numbers is Chutes (see Fig. 33). At around 0.2B output tokens and much lower pricing (only 80¢/1M output tokens), they generate an estimated daily income of $160 from output tokens of DeepSeek V3.14. On top of that, they seem to generate significantly more income from input tokens, but this seems to be mostly due to lack of caching. With the emergence of easily accessible caching solutions, such as LMCache and Mooncake, it is something we expect to be solved in the coming months, with the resulting savings being passed onto consumers.

Figure 32: DeepSeek API pricing, Caching offers substantial savings of (1.0-7/27)=~75%. Note that deepseek-chat and deepseek-reasoner are the same model - DeepSeek V3.1, just with the reasoning setting on or off. 30.08.2025

Figure 33: Daily DeepSeekV3.1 token production on chutes. 30.08.2025.

While talking to industry insiders, it was suggested to us that some leading inference providers, those that have raised nine-figure funding rounds, are processing trillions of tokens daily, but as of September 2025, there is no publicly available evidence supporting such claims. We find this dichotomy between Google, ByteDance, or MSFT declaring that they are processing trillions of tokens daily and the minuscule numbers we see for open-source providers to be quite perplexing!

Acknowledgements

Thanks to @felix_red_panda for giving it a read before the publication, and bouncing out the ideas

@online{tensoreconomics2025llm,
  author = {Piotr Mazurek, Eric Schreiber},
  title = {MoE Inference Economics from First Principles },
  url = {https://www.tensoreconomics.com/p/moe-inference-economics-from-first},
  urldate = {2025-09-02},
  year = {2025},
  month = {September},
  publisher = {Substack}
}

We elaborate on this in a later part of the text

See our previous article on for the detial of TP.

While investigating this topic with industry insiders, we learned that due to extremely limited supply, securing NVL72 is close to impossible at the moment.

Please note that for some reason, DeepSeek R1 and V3 0324 remain much more popular on chutes, with a combined output token production of ~2B tokens a day as of of 30.08.2025.

LLM Inference Economics from First Principles

Piotr Mazurek — Wed, 14 May 2025 18:00:32 GMT

The main product LLM companies offer these days is access to their models via an API, and the key question that will determine the profitability they can enjoy is the inference cost structure. In this text we will explain where the cost of serving/hosting LLMs comes from, how many tokens can be produced by a GPU, and why this is the case. We will build a (simplified) world model of LLM inference arithmetics, based on the popular open-source model-LLama 3.3. The goal is to develop an accurate intuition regarding LLM inference.

The topic of LLM inference economics has far-reaching implications beyond technical considerations. As AI capabilities rapidly advance, inference efficiency directly shapes both industry economics and accessibility. For AI labs, token production costs fundamentally determine profit margins and the cost of generating synthetic training data-more efficient inference means higher returns on a fixed investment in hardware that can fuel further research and development cycles. For users, lower token costs democratize access to these powerful tools, potentially transforming AI from a premium resource into an everyday utility available for even routine tasks. Understanding these cost structures isn't merely academic-it provides insight into one of the key economic forces that will shape AI development in the coming years as we approach increasingly capable systems.

The primary cost behind a generated token boils down to the cost of compute - you need to buy or rent a GPU. In both cases, there is a fixed cost associated with running a GPU per hour. Each GPU can produce a limited number of tokens in an hour. The number of tokens produced per hour divided by the cost of hardware per hour will tell you the unit cost of generating a single token. This is how most of the LLM providers price their API offerings, and this will be the model we will explore.

Model parameters and hardware requirements

As a basis for our inference economics analysis, we will use LLama 3.3 70B. Even today it is still one of the most popular open-source models and an architecture around which a big portion of the industry standardized. There are numerous model fine-tunes of the Llama weights, and while these models will produce different outputs, because they share the same model architecture, they require exactly the same compute resources to run them. Hence, we consider Llama a good candidate to provide a real-world example that will be representative but, at the same time, quite simple to grasp.

LLMs store their "knowledge" in parameters-essentially the weights that define the model's behavior. These parameters require memory to store and compute resources to process. Generally, the more parameters a model has, the greater its resource demands, but so are its potential capability on downstream tasks. Llama 3.3 70B has around 70 billion parameters, which is where its name comes from.

A so-called decoder-only transformer model, like Llama, usually consists of the following components:

one input embedding layer-converting tokens, or words, into vector representations.
multiple transformer layers, each layer containing some parameters for the self-attention part and some for the MLP part.
language modeling (LM) head-the final layer

We assume the reader has a basic understanding of these concepts; hence, we will not be providing the deep intuitions behind them. If you are unfamiliar with the transformer architecture, please stop here and check out one of these amazing tutorials.

Fig. 1: Config of llama 3.3 that will be using throughput this text as a reference model.

Fig. 2 Llama architecture; source

Now, let's break down the model's parameter count step by step, verifying that the claimed 70 billion parameters hold up. To do so, let’s start by looking at the Llama model config in Fig. 1. We can see there are multiple keys and values. Keys, such as hidden_size informing us about the sizes of specific parts of the model. We can use them to calculate the total model size. To grasp the high-level overview of the architecture, take a look at Fig. 2. You can see all three parts we described above, and the graph also shows some implementation details that we will dive into in the next section.

For input embedding layer, we need to convert every possible token position into its vector representation. Hence we have:

parameters.

Then we have N transformer layers (see Fig. 2). Each of these layers has typically many millions parameters that we will need to store in memory in order to run the model. Their sizes can be calculated with hyper parameters from the config.json above:

*The w_v and w_k being 1/8th the size of the w_q is something Llama architecture specific. This is due to the Llama team using a technique called Group Query Attention in which the model has fewer K and V heads than the total attention heads. You can verify this by looking at num_key_value_heads in the hyperparameters from the model config. The model intermediate_size being 3.5x the hidden size is as well a Llama architecture-specific value. These were chosen by the Llama team, and we take them at face value, also to simplify our calculations.

Bringing us to a total of

per transformer block.

Finally, we apply a last RMS Norm before feeding the representation into the LM head, which converts vectors into token logits.

Summing up all of these parameters we obtain:

We can find the values of each of these, vocab_size , hidden_size , num_hidden_layers in the config in Fig. 1. Substituting these values into the equation we will get:

Each parameter is a floating-point number in a bfloat16 format-e.g., 0.22312, -4.3131. Storing each of these numbers takes 16 bits which is 2 bytes of memory. Given that we have a total of 70,553,706,496 parameters to store, we will need 141,107,412,992 bytes or 141GB just to store the model weights in GPU memory.

Note that 141GB is more memory than there is on the most common data center GPUs, such as the Nvidia A100 or H100. Each of these GPUs comes with only 80GB of total memory (we refer to this memory interchangeably as HBM, high bandwidth memory, or global memory). Hence, for serving models, we usually use multiple cards for a single instance of a model. In practice, for more optimal model serving, we want to use more, 4 or even 8 such GPUs. Let’s now just take it at face value, and we will elaborate on why this is the case in the later part of this text.

Compute and memory bound

When looking at the specification of a GPU, you should be paying most attention to two metrics:

Compute: measured in FLOPS - how many floating-point operations (addition and multiplication) a GPU can do in a second*.
Memory bandwidth: how many bytes can be loaded from the global memory in a second.

These two factors dictate how quickly you can process computations; they affect the speed of a single feedforward operation and determine the generation speed (measured in tokens per second, tps) and ultimately define your cost per token.

* Please be aware that FLOPS and FLOPs mean different things. FLOPs (small s) is the plural of floating-point operations not considering time at all but FLOPS (capital S) means floating-point operations that happen within a second

Fig. 3: Compute and memory bandwidth A100 cards vs H100 cards; source

A computer program (such as running an LLM) can be characterized by its arithmetic intensity. Arithmetic intensity is a concept that describes the ratio of computational operations, such as addition or multiplication (measured in FLOPs), to memory accesses (measured in bytes). A higher arithmetic intensity indicates that the program performs more computations per unit of data fetched from memory, which typically leads to better utilization of the processor's computational capabilities and reduced bottlenecking on memory bandwidth. LLM inference has a very low compute intensity because it involves repeatedly accessing large model weights from memory with relatively few computations per byte fetched.

For A100:

FLOPS: 3.12 * 10^14 floating point operations can be performed in a per second
Memory: 2.03 * 10^12 bytes can be loaded from global memory (HBM) per a second

For H100:

FLOPS: 9.89 * 10^14 floating point operations can be performed in a per second
Memory: 3.35 * 10^12 bytes can be loaded from global memory (HBM) per a second

As you'll see in the next parts of this text, LLM inference has both a phase that's heavily compute-bound (very high arithmetic intensity) and a heavily memory-bound (very low arithmetic intensity). The majority of the wall clock time is spent in the memory-bound phase, so the goal of efficient LLM inference is to maximize the utilization of the GPUs' compute capacity during the memory-bound phase. So increasing the arithmetic intensity for the memory-bound phase represents a fundamental optimization target that directly translates to improved inference economics.

In the context of LLMs, in order to generate a token, we need to load the entire model with all parameters from global memory (HBM) (we utilize the memory bandwidth) and calculate the intermediate activations (we use the compute). The ratio between compute and memory utilization is crucial in determining what can be optimized and how to enjoy better inference economics. In the next part we will go more in depth on the two phases of LLM inference:

prompt processing or so called pre-fill phase
token-by-token or so called decoding phase

The end-to-end latency of an LLM request depends critically on the efficiency of both these two phases.

FLOPs in matrix multiplication

Before delving into the two phases, let's clarify how we count floating point operations (FLOPs) in matrix multiplication.

When multiplying matrices A (shape m×n) and B (shape n×o), we produce matrix C = A @ B (shape m×o). The computation involves:

Taking each row from A and each column from B
Computing their dot product to fill each element of C

For a single dot product between vectors of length n:

We perform n multiplications
Followed by n-1 additions
Resulting in n + n-1 = 2n-1 operations total

Since we need to compute this for every element in our result matrix C (m×o elements):

Total FLOPs = (2n-1) × m × o ≈ 2mno

For simplicity in this post, we'll use 2mno as our FLOP count for matrix multiplication.

Prompt processing/prefill phase

The first phase of generating text with LLMs is prompt processing. In this phase an LLM is presented with a list of input (prompt) tokens, and we try to predict our first new token. The duration of this phase is what the API providers present as “latency” or “time to first token" (TTFT) (See Fig. 4).

Fig. 4: You can see how latency, or time to first token, is reported in OpenRouter

This phase is heavily compute bound, which is good; we utilize most of the compute we have available on our GPU. Let’s estimate the FLOPs of a single forward pass to see why it is the case.

Let’s manually count the FLOPs in the model during processing S tokens. For reference, we include the diagram of the Llama architecture (see Fig. 2).

Embedding Layer

FLOPs:

Lookup operation: Embedding lookups involve retrieving vectors from the embedding matrix and are considered to have negligible FLOPs since they involve memory access rather than arithmetic computations.

- Sine:

For each element in q and k, the following operations are performed:

Multiplications: 2 per tensor
- Multiply q with cos
- Multiply the rotated version of q (rotate_half(q)) with sin
Additions: 1 per tensor
- Add the two results to get the embedded q

Since these operations are performed on both q and k, the total per element is:

Total operations per element: 3 FLOPs per tensor x 2 tensors = 6 FLOPs

FLOPS:

Q × K^T

We assume the naive attention implementation. In practice, with algorithms like flash attention, we calculate it iteratively to save memory.

Shapes:

- Query:

- Key:

- Transposed Key (after appropriate reshaping and transposition):

- Result:

FLOPS:

For each attention head:

For all attention heads:

Note: This quadratic dependence on sequence length (S^2) is why attention becomes expensive for long sequences.

Softmax

It is kind of hard to estimate the FLOPs for Softmax. We approximate softmax as 5 FLOPs per element:

Shapes:

- Input:

- Output:

FLOPS:

This is a simplified approximation of the actual operations in softmax (exponentiation, sum, division).

Attention Output (Q @ K^T) @ V

Shapes:

- Attention Scores:

- Value Matrix:

- Output:

FLOPS:

For each head:

For all heads:

O-Projection

Shapes:

- Input:

- Weight Matrix:

- Output:

FLOPS:

Total FLOPs for Self-Attention

RMS Norm:

Query projection:

Keys and values projections:

Positional Embedding (RoPE):

Q @ K^T (across all heads):

Softmax (across all heads):

Attention Output (Q @ K^T) @ V:

O-Projection:

Total FLOPs:

- Weight:

FLOPS:

Total FLOPs in a Llama Model

Total FLOPs in the Llama model is a product of the number of FLOPs per transformer block times the number of blocks, plus the FLOPs for the LM head.

Transformer Block

Components

- Attention:

- MLP:

Total Per Block

Total FLOPs Calculation

Formula

Example: Llama 3.3 70B

For Llama 3.3 70B with:

hidden_size = 8192
vocab_size = 128256
attention_heads = 64
num_hidden_layers = 80
S = 2048

291 TFLOPs is roughly the order of magnitude of FLOPs available in a modern GPU. For example, with H100 cards (see TFLOPS in Fig. 3), it would theoretically take roughly 291/989 = 0.29s to process a prompt of 2048 tokens.

As a reminder, to load the model from global memory, we need to load 141 GB worth of parameters. The memory bandwidth of a modern GPU is around 3350 GB/s, meaning that in theory it will take 141/3350 = 0.04s to load the entire model from global memory - roughly 7x faster than the time needed for all of the computations.

This demonstrates that in the pre-fill phase we are much more bound by the available compute than by the memory bandwidth. This is a desirable situation, as we want to utilize all of the existing compute resources.

The decode phase

This first forward pass for doing the prefill is computationally very expensive. We can eliminate doing large parts of that computation over and over again by introducing a special cache. This cache is called KV cache because it stores the key and value matrices for each token position.

In the attention mechanism, we calculate attention relationships between all tokens in the sequence. The key insight is that at step S+1, we've already calculated the attention between all of the first S tokens during the pre-fill phase. We can store these intermediate values in memory (the "cache") and only calculate new attention values involving the most recently generated token.

This optimization works elegantly with matrix operations:

During Pre-fill (S tokens):

Where d is the hidden dimension.

The attention scores and outputs are computed as:

During Token-by-Token Generation (token S+1):

For generating the next token, we only need to compute:

The new attention calculation becomes:

The key efficiency gain comes from:

1. Reusing K^T_cache and V_cache from previous calculations

2. Only computing new key-value projections for the latest token

3. Reducing the attention calculation from O(S^2) to O(S) for each new token

Fig. 8: We need only to calculate the query, key, and a value for the new token.

KV caching reduces the total FLOPs by a factor of approximately S for all parts of the forward pass:

In self-attention, we only compute attention for the new token against all previous tokens
In the MLP and LM head components, we only process the new token
The LM head remains the same, but it is such a small fraction of the overall computations that we will skip it in our calculations.

For example, with a 2048-token context:

During pre-fill: ~291 TFLOPs total
For generating token 2049: ~291/2048 ≈ 0.14 TFLOPs

On an H100 GPU (989 TFLOP/s), this would take only:

This is approximately 2048 times faster than the pre-fill phase in terms of pure computation, but remember that we still need to load the entire model parameters (141 GB) from the global memory, and that now we need to also load the KV cache.

KV cache memory footprint can be easily calculated as

For Llama 3.3 70B with 2048 tokens using BF16 precision, this amounts to:

Fig. 9: How the memory footprint of KV cache increases with sequence length for Llama 3.3 70B. Note that the x-axis uses a logarithmic scale (powers of 2), which visually compresses the exponential growth. In reality, the memory usage is growing superlinearly with sequence length, not linearly as might be incorrectly inferred from the graph's appearance.

While 671 MB may sound not that significant, this number scales linearly with the batch size (we elaborate on batches later) and with the sequence length (see Fig. 9). This is also the main reason why, at longer sequence lengths, the token-by-token generation* is slower than at shorter sequence lengths-on top of the model weights, the KV cache also needs to be loaded from global memory, increasing processing time for each token produced.

Model weights, plus KV cache, are roughly 141 + 0.6 ≈ 142GB so it takes 142/3350 = 0.04s to load them from the global memory. We calculated above that it only takes 0.00014s to do all computations (assuming 100% compute utilization) - so it takes two orders of magnitude more time to load the model weights than to do the actual computations. This is what we mean by the token-by-token phase of using LLMs being memory bound. We are primarily limited by the time memory transfer takes, not by the speed of compute.

This is one of the insights we hope you take out of reading this article - the token-by-token phase is memory bound; the amount of available compute throughput is of secondary importance, as we massively underutilize the compute resources anyway while waiting for weights to be loaded.

* we use the terms token-by-token phase, decode phase and generation phase interchangeably

Scaling with the input length

One of the main challenges in LLM serving is understanding how input prompt length affects end-to-end performance. Sequence length impacts both the prefill stage and the token-by-token decoding phase, though in fundamentally different ways.

The prompt processing phase exhibits O(N^2) computational complexity; as the sequence length grows, the processing time will grow quadratically*. We derived the FLOPs before as

2 S² hidden_size for score matrix Q @ Kᵗ
2 S² hidden_size for (Q @ Kᵗ) @ V.

This is especially relevant for longer sequence lengths, where the time to the first token will quadratically increase with the length of the input sequence. You can experience this, e.g., when using the long-context Gemini that, as of May 2025, can take up to 2M tokens, but you can wait up to 2 minutes for the first token of the answer to be generated. The intuition you should have developed here is that as the input length keeps increasing, a larger and larger percentage of the total time of processing a request will be spent in prompt processing - the compute-bound part (see Fig. 10).

Fig. 10: As the prompt length increases, the compute required, and hence the time, increases quadratically, occupying an ever-increasing portion of the total processing time of a request. Please note: this is just a visualization to give you some intuitions; it is not actually based on real-world observations.

During the token-by-token phase, the relationship between the generation speed and the sequence length is less straightforward.

FLOP-wise compute is scaling linearly; the new token needs to attend to all of the tokens from the past, or 2 S hidden_size for Q @ Kᵗ and 2 S hidden_size for (Q @ Kᵗ) @ V, but as we already know, FLOPs are not that relevant for the token-by-token case because we are primarily memory bound. What is much more relevant is the size of the data that we need to load from the global memory.

As a reminder, for each forward pass, we need to load the entire model weights from the global memory; on top of this, we also need to load the KV cache from global memory. As we showed in the section above, as we increase the KV cached sequence length, it will occupy linearly more and more memory or by factor S in 2 x bytes_per_param x num_hidden_layers x head_size x num_key_value_heads x S.

Initially the size of the KV cache will be negligible compared to the model size that we need to load, but as we increase the size of the processed prompt, it will occupy an increasingly big portion of the memory (see Fig. 11). Note that if we decide to process larger batches (we discuss this in detail in a later section), the size of the KV cache will grow linearly w.r.t. batch size, as we need to cache the keys and values independently for all examples from the batch. Then at some point the size of the KV cache will overtake the size of the model itself.

The intuition to develop here is that for small batches and short sequences, the sequence length has minimal impact on throughput because loading model weights dominates the memory bandwidth utilization. However, as either batch size or sequence length increases, loading the KV cache takes an amount of time to load for each token, eventually surpassing the time consumed by loading of the model weights themselves.

This transition creates two distinct performance regimes: in the so-called model-dominated regime (short sequences/small batches), throughput remains relatively stable despite increasing sequence length. Once we enter the KV-cache-dominated regime, generation speed begins to degrade in proportion to sequence length. This is largely irrelevant for short sequence lengths but is a significant issue at very long sequence lengths (in the order of tens of thousands of tokens). The additional time loading the KV cache takes scales linearly w.r.t. sequence length.

Fig. 11: KV cache scaling with the increasing sequence length. At 128k tokens for Llama 3.3 70B, the KV cache for a single sentence will amount to 40GB.

*In the naive implementations of the attention mechanism memory would also scale quadratically when we realize the SxS score matrix, but flash attention replaces these naive implementations. Flash attention is calculating (Q @ Kᵗ) @ V in an iterative way, where memory required is kept at O(N).

Multi GPU inference

As you might have noted, the 141 GB we need to store the Llama 3.3 70B parameters is more than what we have available on a single Nvidia H100 GPU. H100 cards come with 80GB of HBM memory. We would need a minimum of two to store the model in memory; however, in practice, we would probably like to use more for the KV cache. If we have more memory available, we will be able to allocate a higher proportion of it to the KV cache and a smaller proportion to the model weights, allowing us to run larger batches. We would also linearly increase the available memory bandwidth, at the cost of an increased overhead in cross-GPU communication, though.

Using just two GPUs for Llama 3.3 70B would result in only having a tiny amount of memory available for KV cache because the model weights already take up 88% (141/160=88%) of that memory, leaving only 19GB of memory (160-141=19GB) available for KV cache (in practice even less than that because we can't use 100% of GPU memory but are limited to around 95%). We wouldn’t be able to run large batches nor long sequence lengths; this would be very inefficient. We touch on this in a later section, but being able to run larger batches is the key to enjoying the good inference economics.

GPU servers almost always come in deployments of 4 or 8 GPUs per node, so using 3 GPUs would be wasteful because that would lead to one GPU in the server being entirely unused in a lot of circumstances. Hence, we jump from 2 to 4 GPUs for a single model instance right away.

Let’s assume then we will run the Llama 3.3 70B on 4 H100 cards. There are two main ways to run large-scale AI models on multiple GPUs:

Pipeline parallel
Tensor parallel

Both offer different tradeoffs between throughput and latency. Let’s explore them briefly.

Pipeline parallelism

In the pipeline parallel (PP) setting, we split the model along the layers axis, meaning that each GPU will have a fraction of all layers in the model. E.g., in the case of Llama 3.3 70B with 80 hidden layers served on 4 GPUs, GPU:0 will host the first 20 layers, GPU:1 the next 20, and so on.

Fig. 12: Pipeline parallelism with continuous batching visualization; source

The upside of such an approach is the very limited communications between the devices. We only need to broadcast activations from one device to the next one 3 times for a single forward pass.

We can also do pipelining - GPU:0 processes batch 0 and is passing it to GPU:1, then batch 1 is coming in, and GPU:0 can process batch 1 while GPU:1 is processing batch 0, etc. (see Fig. 12). This setting minimizes the stall time of each device, ensuring the maximum throughput; however, it comes at a price - at a single point in time, we only have access to 1/4 of the available compute and memory bandwidth per batch. So generation time is significantly slower than if we were to use all 4 GPUs at the same time.

In practice, orchestrating efficient overlapping batching can be quite challenging; hence, for the remaining part of this text, we will focus on analyzing the far more common tensor parallel setting.

Tensor parallelism

The mainstream parallelism strategy, used by a vast majority of LLM inference providers, is tensor parallelism (TP). In TP, individual neural network layers are split across multiple GPUs, harnessing the combined compute power and memory bandwidth of all devices. This approach significantly shortens per-layer inference time but introduces important trade-offs that must be carefully considered:

Communication Overhead: In regular intervals, e.g., twice per transformer block, the execution must synchronize across GPUs, introducing a significant delay (in the order of milliseconds) per synchronization event. This overhead varies significantly based on interconnect technology (NVLink, PCIe, etc.) and network topology.
Sequential Batch Processing: Unlike pipeline parallelism, TP requires all GPUs to process the same batch simultaneously. A new batch cannot begin until the current one completes, reducing throughput efficiency under dynamic workloads.

The most efficient parallelization strategy is to have a so-called column-wise split linear layer (we split by the column dimension) followed by the row-wise layer (split across the row dimension). Such a layout minimizes the synchronization to only one sync every two MLP layers.

Mathematical Intuition:

For a weight matrix W₁ split column-wise across 2 GPUs:

Each GPU computes its partial output independently (no communication):

The hidden layer activation becomes:

No communication is needed here because each GPU has all the necessary

data. For the subsequent row-wise split in W₂:

Each GPU computes:

The final output requires an all-reduce sum; in other words, we need to synchronize between the devices:

This layout we can apply to the transformer block, reducing it to only two synchronizations per transformer block (see Fig. 13 and Fig. 14).

Self-Attention: Heads are processed independently, with synchronization only during the output projection (o_proj).
MLP: The up-projection (w1, w3) is split column-wise and the down-projection (w2) is split row-wise, and the sync is only executed after the down-projection.

Fig. 13: The Colwise → Rowwise layout for a transformer layer from that is used as an example in torch documentation; source

Fig. 14: The Colwise → Rowwise split we see in a transformer block.

Correctly estimating the extra overhead from the coms is quite complicated. In theory, we would need to take into account the two following factors:

Message passing latency: Typically 3-5 μs depending on hardware
Data transfer time: Based on interconnect bandwidth

In an ideal scenario with modern NVLink connections, we could estimate:

The total overhead of 8 or 9 µs would be awesome. However, in practice, it gets much more complicated. During the sync barrier, the compute graph is stalled. We have a constant overhead of a few ms when the GPUs are idling while waiting for the sync to be finished. This extra "tax" is of the main reasons preventing us from utilizing the full memory bandwidth we have available across all the GPUs. Accurately modeling the overhead is quite challenging. As we'll demonstrate in the next sections, the gap between theoretical and actual performance can be quite substantial, requiring empirical measurement for accurate system modeling.

Batching - the key to good economics

As we showed before, during the token-by-token phase, we are primarily memory bound-meaning the main limitation in terms of tokens/s we can get is how fast our GPU is able to load the model weights from the global memory. To generate every single token, we always need to load the entire model.

There is an obvious optimization we could introduce to improve the economics of our operation: run larger batches. Batches are a natural improvement because we can load the model weights once, but we can use the loaded weights to do inference with multiple items from our batch at the same time (and so serve different customers simultaneously).

Increasing the batch size increases the compute usage linearly - we have k times more multiplications to do, but it only marginally changes the memory bandwidth used (only for loading the KV cache), so it's an easy way to increase the compute intensity for our otherwise heavily memory-bound algorithm (to make it less memory-bound). Since the extra memory for the KV cache is significantly smaller than the memory needed for the model, it only adds a small overhead, but it linearly increases the number of produced tokens. We produce twice as many tokens for a batch of 2 and 16x as many tokens with a batch of sixteen.

This is the core message of this post and the main intuition we hope you take away from reading this text: As we grow the batch size, we can effectively share the time to load the model from high bandwidth memory (HBM), aka our cost of loading the model is split across an increasing number of clients, enjoying the economies of scale and decreasing the per-request cost. Having sufficient demand and continuously serving big batches is the key to running a profitable LLM inference business; if you can't support large batches, your cost per token will balloon, making your operation unprofitable.*

*One thing to note is that there is a limit to this model. As we approach really long sequences or really big batches, as we will see in our experiments, the memory footprint of the KV cache starts to slowly overtake the memory footprint of the model itself (see Fig. 15). When this happens, the cost of loading the model will become increasingly irrelevant to the total time of loading data from the global memory.

"Luckily" for us, this situation also has its limit-the memory limit of a GPU node; in the case of H100 cards, it will be 8 × H100 = 8 × 80GB = 640GB. Note how for a batch of 8 at the full context length of Llama, we are already nearly there.

Fig. 15: KV cache scaling - comparison between different batch sizes. Note how the KV cache scales linearly with the batch size.

Throughput: theory vs practice

After all of the theoretical introductions, let’s try to combine all that we learned so far to estimate the LLM throughput. We will:

Develop a theoretical performance model based on the GPU spec sheet.
Compare it with a real-world throughput of Llama 3.3 70B on 4 H100 cards.
Explain the discrepancies between theoretical and actual performance.

The time to produce a response to a prompt is a combination of the pre-fill time and the decode time times the number of decode tokens. The more output tokens we produce, the smaller share of time will be spent in the prefill phase. The prefill time is primarily compute-bound, and the time for token-by-token is primarily memory-bound.

Since prefill is so heavily compute-bound, we can estimate the time by dividing the number of floating-point operations during prefill by the total effective FLOPS across all of the GPUs, plus the extra latency from cross-GPU communication time.

The decode time is mainly memory bound, but as we increase the batch size, the compute component will become increasingly important. We will calculate both and take the more significant factor. We also spend some small amount of time in coms.

The simple modeling script is based on what we have discussed above. We take in the model size and its architecture and estimate the throughput we should get given the hardware characteristics.

Simple script we use for throughput estimation.

For a Llama 3.3 70B, with 2035 tokens in and 300 tokens out, we will get these estimates:

Fig. 16: The throughput estimated using first principles. We can observe what we discussed before: as the batch size increases, the extra memory that needs to be loaded for the KV cache and the speed per request decrease.

Let’s first look at the estimated model performance under the different batch sizes. As we derived before, batch size is the single most relevant statistic regarding the tokenomics. Look how the throughput scales nearly linearly with the batch size (see Fig. 16). This is the single most important message you should get out of this text. The key to the good tokenomics is running the models at a large batch size, which is caused by the LLM inference being memory bound, meaning we want to share the cost of loading the model into the memory across as many requests as possible. As the KV cache size is approaching the model size, the total throughput gains are diminishing.

While the total throughput is increasing with the batch size, the per-request speed is decreasing, and the slowdown is accelerating as we increase the batch size due to the increased memory footprint of the KV cache. At some point, with massive batches, the per-request experience will be massively degraded. This means that even if we can support larger batches, e.g., due to the massive demand as DeepSeek experienced in February 2025, we might not want to do so because of the poor token generation speed each user will experience. In general, the variance in speed you experience on OpenRouter (see Fig. 17) can be largely attributed to the current demand, aka the batch size.

Fig. 17: Throughput variance throughout the day from OpenRouter. The variance can be somehow explained by the varying demand for the service throughout the day, resulting in varying batch sizes.

It is also part of the reason why the batch API is so much cheaper. In cases where speed is not of utmost importance, an LLM provider can just run massive batches, enjoying economy of scale, with individual requests being handled pretty slowly but processed at a higher profit. There are more nuances to this, e.g., the parallelism strategy (pipeline parallelism offers less cross-device communications overhead), that we consider beyond the scope of this text. We just wanted to give you a real-world example of the real impact of batch size on the price of a generated token.

Now let’s compare the results we get from our model to the actual LLM inference performance. For this, we run a vLLM inference server (a popular LLM serving framework) with Llama 3.3 70B on 4 H100 cards connected with NVLink. The result is quite underwhelming.

Fig. 18: The results of a benchmark we run on Llama 3.3 70B TP=4 for different batch sizes. Note that we keep the input tokens and output tokens fixed. To make the model not stop too early, we run it with random weights and just add max_tokens param to them.

Fig. 19: Model vs. the reality. While the shape is somehow similar, the exact values are not due to the real-world constraints.

While the general sigmoid-like shape is quite similar, the actual values are very much different. We go from around ~60% of theoretical estimated performance in small batches to around 40%. Where does this difference come from?

Fig. 20: As we increase the batch size the discrepancies between estimated throughput and the measured one increase.

Well, this is the problem at the core of GPU optimization - in practice, it is actually extremely hard to properly estimate the wall-time performance of the GPU application. There are many compounding factors here, making the picture more blurry. To mention a few things that are likely contributing to the discrepancy in the observed results vs. reality:

In our model we assumed 100% memory and compute utilization. Theoretical peak performance metrics like TFLOPS and memory bandwidth are never fully achieved in practice. GPUs typically achieve only 50-70% of their theoretical compute capacity due to
- Kernel launch overhead and synchronization costs
- Memory access patterns that don't perfectly align with cache architectures
- Warp divergence and other GPU-specific inefficiencies
- Instruction mix that isn't optimal for utilizing all GPU units
- … and a lot of other factors
We assumed very little overhead from the cross-device communication; as we explored previously, this is not necessarily the case in practice. In practice, we have tons of other factors that contribute to the potential extra latency, such as
- Having to sync the CUDA graph across devices for cross-GPU communication
- Synchronization barriers that force all GPUs to wait for the slowest one
- internal buffers and management overhead
- potential suboptimal (non-coalesced) memory access patterns, especially at larger sizes, with KV caches being stored in random pages of the VRAM memory
- and other factors we don’t mention here
For simplicity of calculating FLOPs, we assumed a very naive implementation of the attention mechanism. In practice, everyone is using something like Flash attention. Properly estimating the time and FLOPs involved is quite challenging and complicated and outside the scope of this text.
Overhead coming from the practical implementation of the LLM serving engine, such as paged attention
Extra overhead from using Python, PyTorch, and the vLLM framework itself

We tried to account for some of the above and include them in the extended simulation model. The details can be found in the code, but TL;DR; we assumed decreased compute and memory utilization values, the extra latency from coms, and a few other factors, like extra memory overhead increasing exponentially with the batch size.

Fig. 21: Updated model vs. reality. Now the two lines are much closer. We achieve this by adding the estimated overhead to the model. We run it for 2k tokens input and 300 tokens output.

While far from perfect, it kind of works. It generalizes pretty well to other sizes; e.g., it works for the Llama 3.1 8B run on TP1.

Fig. 22: The estimation we run seems to work quite well for other model sizes. E.g., here we run for Llama 3.1 8B, 2k tokens input 300 tokens output. The two lines are pretty close.

However, when tried with different batches of different lengths, the differences are more significant, suggesting that the model is far, far from perfect. E.g., we tried estimating the throughput for long context model performance, with 16k tokens in and 1000 tokens out. Due to excessive memory consumption, we kept this setting at a batch size of 8. In such a case, the model failed to correctly predict the final throughput, suggesting that our model is far from perfect.

Fig. 23: Unfortunately, for different input/output configurations (16k tokens input, 2k tokens output in this case), the model estimates the throughput less accurately, more so for Llama 3.3 70B and less so for Llama 3.1 8B.

How challenging it is to correctly predict a model's throughput is another thing that we hope you can take out of reading this text. Accurately estimating it would require a series of very detailed profiling steps, such as profiling the memory access patterns across multiple model sizes, batch sizes, and prompt length combinations; delving deeply into the exact implementation details of (paged) attention implementation; implementation details of the batch scheduling; estimating how different shapes affect the compute and memory utilization; and dozens of different experiments. We deem an accurate estimation across different settings challenging to a degree that it might not be feasible to do so in practice. If you want accurate results, you are probably better off just measuring the real-world results.

From tokens to dollars - estimating tokenomics

As you might know, the pricing of various LLM providers is token-based: you will pay a specific price per million input and output tokens. Some providers use the same price for input and output tokens; others have two distinct prices for input and output tokens.

To summarize what we've learned so far:

The time of prefill is quadratically dependent on the sequence length. As the input size grows, it will occupy an increasingly big percentage of the request processing time.
The time spent generating a single token grows linearly as the context length grows. KV cache will gradually become a more and more substantial percentage of the total data loaded from the global memory (alongside the model parameters).
Since the time of generating a single token can be well approximated by the cost of loading the model weights once from global memory, this time grows linearly with the number of generated tokens.

What should be apparent from the description above is that estimating a fair market price for the input tokens is a non-trivial task. Increased input quadratically increases the cost of prefill, but for standard use cases, prefill is only a minority of the time the GPU spends on processing the request. Then, depending on the batch size and context length and the proportion of these two to the model size, it will affect the throughput.

Fig. 24: Market prices for input and output tokens of Llama 3.3 70B on OpenRouter

While estimating a universal proportion of cost between the input and output tokens that generalizes well across all possible shapes of input and output is quite complicated, estimating the cost for a specific input and output config is much simpler. We can just:

Measure the execution wall time.
Assume a fixed ratio γ between input and output token costs (e.g., γ=0.3). There is no deep reasoning behind choosing this particular value of γ we just need to choose some value.
Calculate the per-token cost β by solving the following equation:

For example, with 2,035 input tokens, 300 output tokens, a batch size of 16, and a runtime of 8.96 second - as we measure in one of the experiments mentioned in the previous section:

Which equals approximately $1.72 per million output tokens and $0.51 per million input tokens.

We apply the same calculations to the different batch sizes based on the experiments we did run in the previous section. As you can see, there is a dramatic price reduction with increasing batch size. The reader should verify that the reduction is directly proportional to the increased total throughput from running larger batches.

Table 1: Estimated cost per 1M tokens for different batch sizes, assuming that the batch is 2035 tokens input and 300 tokens output. The pricing is estimated via the method described in this paragraph.

Obviously, the above is just a single data point from a single experiment for a single pair of input and output tokens. If we change the proportions of input and output tokens, the picture will be very different. To get a more realistic picture, we could, for example, run a Monte Carlo simulation- gathering multiple data points for different input and output configs, calculating the pricing for each example, removing the outliers (e.g., random slower executions due to external factors), and calculating the final price as an average of the median of these samples.

The above strategy still suffers from making very strong assumptions, e.g., we only benchmark the performance of rectangularly shaped inputs-all elements of the batch have the same number of input tokens and the same number of output tokens. We don’t mix the prefill phase with the decode phase-we submit an entire batch at the same time, since all of its elements have the same shape; we expect the prefill to take more or less the same amount of time, after which we only do the decoding part. This is a very strong assumption that is not necessarily going to be the case in the real world. We also hardcoded the value of γ, which is not necessarily fair and optimal for the settings we will be serving.

We’ll pause our pricing model discussion here, and we hope to do a deep dive into more sophisticated pricing strategies in the future text. There are multiple pricing strategies one could use; another interesting angle is estimating the minimal number of users of the LLM model under which an inference is profitable.

Luckily for us, at the end of the day, you can easily verify if you are running a profitable operation. You just add up the profits you made from charging users for input and output tokens, and you verify if this sums up to more than what you paid for renting a GPU for an hour.

Summary

We really hope this text to be a founding block for you to build an accurate world model of LLM inference economics, using Llama 3.3 70B as an example. We start explaining what parameters are present in an LLM, what it means for a model to have 70B parameters, and how the memory is consumed. Then we give you a brief introduction to compute- and memory bandwidth performance and what it means to be compute- or memory-bound, and we explain the concept of FLOP and how many FLOPs are in a matrix multiplication.

We introduce the concept of the prefill phase and the token-by-token phase. We break down FLOPs during different parts of a forward pass, and we show how the prefill is primarily compute-bound. We then explain how different the token-by-token phase is. We introduce the concept of kv-cache; we go again through a forward pass and show how, thanks to the kv-caching, it is now far less dependent on compute (times S less FLOPs), hence how it becomes primarily memory bound. We show how, with the increasing input length, the KV cache occupies an increasingly big portion of the memory load time. We then briefly mention different parallelization strategies, and we describe how extra latency is added when running a multi-GPU setting.

We follow this by introducing the concept of batching. We explain why, since we are primarily memory bound, we can radically improve the economics of our operation by running larger batches. This is the core message of this text and the intuition with which we hope you leave after reading it. We then build a simplified throughput model from first principles and compare its performance to a real-world Llama 3.3 70b run via vLLM. We show the difference between the theoretical performance and a real one, and we give a brief explanation of where the extra overhead is coming from. We show how inaccurate the theoretical model is, which we hope lets you build an intuition about the challenges of predicting real-world performance by a bunch of heuristics.

Lastly, we discuss the challenges of establishing pricing and a fair pricing ratio between input and output tokens. We present a simplified cost model that, while not fully accurate, enables you to build a simple heuristic to price the input and output tokens.

Readers should realize why running on more than the minimal number of GPUs is actually highly beneficial to the inference economics. More GPUs enable higher efficiency through better caching that give a better unit cost per token. With additional GPUs, the same model weights occupy proportionally less of the total memory, allowing more space for KV cache and consequently supporting larger batch sizes. Since throughput scales nearly linearly with batch size, this directly translates to better economics. Each additional GPU also contributes its memory bandwidth to the system, further improving the token generation rate since we are memory-bound in the token-by-token phase.

When evaluating hardware for LLM inference, readers should understand that memory size is not the only important factor - memory bandwidth is equally, if not more, critical. Since token-by-token generation is primarily memory-bound, they should always ask, "What is the memory speed?" as this determines how fast models can run. For example, NVIDIA L40S GPUs offer 48GB of memory but with a bandwidth of only 864 GB/s (compared to 3350 in H100 cards), resulting in very slow inference. Similarly, the Apple Mac Studio with M3 Ultra has 512GB of unified memory but only 819GB/s of memory bandwidth (see Fig. 25), limiting its LLM inference capabilities despite the large memory pool.

Fig. 25: Comparison of memory characteristics between Mac MacStudio and an Nvidia GPU. While Mac has a large memory on paper, it is pretty slow compared to an Nvidia GPU, making it far less suitable to serve LLMs.

Readers should also realize why running models on edge devices will always be relatively expensive on a per-token basis. When running on a consumer end device, we are always running batch-size=1. We can’t enjoy the economies of scale from sharing the cost of model load between multiple users. We always bear the entire device and energy cost ourselves. This, combined with the likely suboptimal characteristics of the on-edge hardware and slow memory, will result in high costs such as electricity and hardware depreciation.

The above model is just the tip of the iceberg; we don't discuss any of the possible optimizations, such as speculation models, hardware-specific optimizations, or quantization techniques. The goal of this text is for you to build some basic intuitions on LLM inference. We hope you understand why there are two phases, why one is compute-bound and the other one is memory-bound, and why we hope you developed some intuitions about the relationship between the number of input and output tokens and the request processing time.

To replicate the experiments, find the instructions at Github

@online{tensoreconomics2025llm,
  author = {Piotr Mazurek, Felix Gabriel},
  title = {LLM Inference Economics from First Principles},
  url = {https://www.tensoreconomics.com/p/moe-inference-economics-from-first},
  urldate = {2025-09-17},
  year = {2025},
  month = {May},
  publisher = {Substack}
}