The main product LLM companies offer these days is API access to their models, and the key factor that will determine their profitability is the inference cost structure.
Wow, can't remember when I last read such a detailed write-up. Are you doing this for work, or was that your Master's thesis?
Should I ever want to build an inference provider business, it's all in here! Too competitive and expensive to launch, though, but still great insight.
I got some real value out of it regardless, as I now understand why I won't get far with the 8GB GPU in my laptop: the KV cache for a 131,072-token context window would itself need 40GB (nice diagram!)
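For anyone who wants to sanity-check that 40GB figure, here's a minimal back-of-the-envelope sketch. It assumes a Llama-70B-like configuration (80 layers, 8 grouped KV heads, head dim 128, fp16 cache) — these parameters are my assumptions, not taken from the article:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Per token, each layer stores one K and one V vector of size
    # kv_heads * head_dim, at dtype_bytes per element (2 for fp16).
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Assumed Llama-70B-like config: 80 layers, 8 KV heads (GQA), head dim 128.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=131_072)
print(f"{size / 2**30:.1f} GiB")  # → 40.0 GiB
```

With these assumed numbers the cache alone comes out to about 40 GiB at full context, which is why an 8GB laptop GPU doesn't get far.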
Thanks for the kind words; this is mostly some thoughts gathered over a few weekends.
You'd probably still be missing a few key facts needed to run a competitive inference operation, but brace yourself for future releases :)
hidden gold mine of information!
Impressive article, thank you!
Could you please publish all the parameters you used for the Llama 8B numbers in Figure 22, similar to Fig. 1 for Llama 70B? Thank you!
https://github.com/tugot17/tokenomics-from-first-principles
You can just model this yourself; we open-sourced the code.
impressive!