How DeepSeek Built a Frontier AI for Just $5.6M (And How It Works)

For the last few years, the entire artificial intelligence industry has run on one brutal, unspoken rule. If you want a smarter AI, you buy more computers. That is it. That is the whole strategy.

The technical name for this rule is scaling laws. The idea is simple. A model's capability improves predictably as you add compute, data, and parameters, and all three cost money. More chips, more electricity, more data, more "parameters." Spend more, get more. Engineers call this the brute-force approach, and for a long time it worked beautifully.

But it also built a wall. A financial moat so wide that only a handful of companies on Earth could even sit at the table. Training one frontier model the old way burns tens of millions of GPU hours. We are talking about machines that cost more than houses, running flat out for months.

Look at Meta. They are one of the richest tech companies in the world. They spent somewhere north of $90 million to train their Llama 3.1 405B model. They bought thousands of premium NVIDIA H100 chips. They burned a staggering amount of power. And they proved the rule: if you want a world-class brain, you pay a world-class fortune.

Then a company called DeepSeek walked in and quietly broke the rule.

They trained a model with 671 billion parameters — a massive brain — for roughly $5.6 million. Not 5% cheaper. Not 50% cheaper. More than 90% cheaper. And here is the part that should make every business owner sit up straight: they did it on slower, export-restricted chips. They did not win by buying a bigger factory. They won by redesigning the machine inside it.

This post is the full teardown of how they did it. I want to walk through it the way I wish someone had walked me through it — nothing left as undefined jargon, but nothing rounded off either, not a single number or formula. The rhythm is usually the same: the plain-English version of an idea first, often with an analogy to make it stick, then the real engineering underneath so you can see how it actually works. It's long. It earns the length.

1. The Big Idea: Stop Fighting Your Constraints, Redesign Around Them

Before we touch a single piece of architecture, you need the one mental model that explains everything DeepSeek did. They call it algorithm–infrastructure co-design. That is a mouthful, so let me translate it.

Most companies treat their hardware as a fixed wall. "We have these chips, this much memory, this much network speed — now let's cram the biggest possible model onto it." DeepSeek did the opposite. They designed the math of the model and the plumbing of the data center at the same time, so each one bent to fit the other. The model was shaped to respect the hardware's limits, and the hardware was driven to serve the model's exact needs. Nothing was left to the default.

That is the whole philosophy. Every trick below — the memory compression, the expert routing, the 8-bit math, the custom networking — is just one expression of that single idea: treat your biggest constraint as a design input, not a barrier.

The Core Idea in One Sentence

Instead of buying their way past their limits, DeepSeek engineered their way around them — reshaping the model's "brain" and the data center's wiring in the same breath, so each was built to suit the other rather than merely tolerate it.

The restaurant version: one place hires another cook every time the kitchen falls behind, and the payroll climbs forever. The place next door re-lays the kitchen, rewrites the recipes so stations share prep, and re-sequences how orders move. Same dishes on the plate, a fraction of the cost to make them — and, unlike the first place, it keeps working as the line gets longer. DeepSeek is the second restaurant.

Keep that idea in your back pocket. It keeps resurfacing in different disguises — in how they handle memory, then compute, then numerical precision, then the network — for the rest of the article.

2. A Two-Minute Vocabulary Primer

The technical sections below use a handful of words over and over. Let me define them once, in plain language, so nothing trips you up later. Skim this and refer back whenever you need to.

| Term | What It Actually Means | Coffee-Shop Translation |
|---|---|---|
| Parameter | A single tunable number inside the model. 671B of them here. | One knob on a mixing board. More knobs, more nuance. |
| Token | A chunk of text the AI reads or writes (roughly 3/4 of a word). | One Lego brick of language. |
| FLOP | One floating-point math operation (a single multiply or add). | One heartbeat of calculation. |
| GPU Hour | One graphics chip running flat-out for one hour. The unit of cost. | One worker-hour. Your bill is the headcount times the clock. |
| KV Cache | The short-term memory that holds the conversation so far. | The notebook the AI keeps open during a chat. |
| Precision | How many bits each number uses. Fewer bits = faster but riskier. | Rounding. $1.00 is precise; rounding to $1 is faster but loses cents. |

That's the vocabulary. Now the machine itself.

3. Fixing the Memory Bottleneck: Multi-Head Latent Attention (MLA)

Every time you chat with an AI, it has to remember everything said so far. It stores that memory in a structure called the Key-Value cache, or KV cache. Think of it as the AI's open notebook. The longer the conversation, and the more people chatting at once, the fatter that notebook gets. Eventually it eats all the memory on the chip, and the whole system grinds to a halt. This is one of the single biggest bottlenecks in AI today.

There is a popular shortcut to shrink that notebook called Grouped-Query Attention (GQA). It works by making several "query heads" share the same keys and values. It saves memory, sure. But it does it by throwing away representational capacity — the model simply has fewer distinct things it can pay attention to. You save space and you get a slightly dumber model. Bad trade.

DeepSeek refused that trade. Their answer is Multi-Head Latent Attention (MLA).

The Simple Explanation

The Meaning: MLA is a brilliant zip file for the AI's memory. It crushes the core meaning of each word into a tiny package before storing it, and only unzips it back to full size at the exact moment it's needed.

The Business Analogy: Imagine you have to remember a 500-page financial report. You can't memorize every word. So you write one razor-sharp page of summary. When your boss asks a question, you read your summary and answer perfectly — without re-reading all 500 pages. You stored 1 page instead of 500, but lost none of the meaning. That is exactly what MLA does to the AI's memory.

How the compression actually works

Here is the real mechanism. Normally, for each word, the model stores big, full-size "key" and "value" vectors. MLA refuses to store those. Instead, it squeezes them down into one small latent vector and stores only that.

In plain math: it takes the input for a word, written $h_t \in \mathbb{R}^{d}$ (a vector with $d$ numbers in it), and projects it down into a much smaller latent vector $c_t^{KV} \in \mathbb{R}^{d_c}$, where the compressed size is far smaller than the full size of all the per-head keys and values, written $d_c \ll n_h d_h$. It does this squeeze using a single "down-projection" matrix $W^{DKV}$:

$$c_t^{KV} = W^{DKV} h_t$$

That tiny vector $c_t^{KV}$ is the only thing written into the notebook. Memory use drops like a stone. Then, the instant the model needs the full-resolution keys and values back, it reconstructs them on the fly using two "up-projection" matrices, $W^{UK}$ and $W^{UV}$:

$$k_t = W^{UK} c_t^{KV}, \qquad v_t = W^{UV} c_t^{KV}$$

The beautiful part: because the stored thing is so small, the model also performs far fewer raw multiply-add operations. So MLA gives you two wins at once — less memory and fewer calculations. That is the kind of two-for-one that bends a cost curve.
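To make the compress-then-reconstruct flow concrete, here is a minimal sketch in plain Python. The sizes are toy values chosen for illustration, the weights are random rather than trained, and the names `W_DKV`, `W_UK`, `W_UV`, `write_token`, and `read_keys_values` are illustrative assumptions, not DeepSeek's code. The only point is that the cache holds the small latent and nothing else.

```python
import random

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix, stored as a list of rows, by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumptions for illustration; the real model is far larger).
D_MODEL, D_LATENT, N_HEADS, D_HEAD = 16, 4, 2, 8

rng = random.Random(0)
W_DKV = random_matrix(D_LATENT, D_MODEL, rng)          # down-projection
W_UK = random_matrix(N_HEADS * D_HEAD, D_LATENT, rng)  # up-projection for keys
W_UV = random_matrix(N_HEADS * D_HEAD, D_LATENT, rng)  # up-projection for values

kv_cache = []  # one small latent vector per token, nothing else

def write_token(hidden):
    """Compress a token's hidden state and store only the latent."""
    kv_cache.append(matvec(W_DKV, hidden))

def read_keys_values():
    """Rebuild full-size keys and values from the cached latents on demand."""
    keys = [matvec(W_UK, c) for c in kv_cache]
    values = [matvec(W_UV, c) for c in kv_cache]
    return keys, values

for _ in range(3):
    write_token([rng.gauss(0.0, 1.0) for _ in range(D_MODEL)])

keys, values = read_keys_values()
print(len(kv_cache[0]), len(keys[0]), len(values[0]))  # 4 16 16
```

Each cached entry holds 4 numbers, while the rebuilt key and value hold 16 each. In a real model the ratio is far larger and the projections are learned.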

The RoPE Conflict (and the elegant fix)

There is a catch, and it is a nasty one. This compression trick clashes head-on with a critical tool called Rotary Position Embeddings (RoPE).

RoPE is how the model knows word order — that "dog bites man" is not "man bites dog." It works by applying a position-dependent rotation directly to the keys and queries. And here is the problem: you cannot mathematically pull that rotation back out through the compression matrices. The rotation and the compression simply refuse to commute. If you compress, you lose the ability to rebuild the positional information cleanly. Dead end.
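For readers who want to see what "a position-dependent rotation" means, here is a small sketch of RoPE in plain Python. It assumes the common convention of rotating consecutive pairs of components with a base of 10000; implementations differ in how they pair components. It shows the property that makes RoPE useful: the query-key dot product depends only on how far apart the two positions are.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    out = []
    half = len(vec) // 2
    for i in range(half):
        theta = position * base ** (-i / half)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (2 positions apart) at two different places in the sequence.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 105), rope(k, 103))
print(abs(near - far) < 1e-9)  # True
```

Because the rotation is applied to the key itself and changes with every position, it cannot be folded into a fixed projection matrix, which is the root of the conflict with compression.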

DeepSeek's fix is clever. It is called decoupled RoPE, and the idea is to split the job in two:

The meaning half — the non-positional part, carrying the core semantics of each word. This is what gets compressed into the latent vector.
The position half — a small, separate set of components that are projected independently, with RoPE applied directly to them. These carry pure word-order information.

Because that positional key component is shared across all attention heads, it adds almost nothing to the size of the notebook. You keep the compression and you keep perfect word order. Instead of forcing RoPE and the compression to coexist where they refuse to, DeepSeek split the work so each gets its own clean path.

So what does this buy them? Using half-rank latent dimensions, MLA cuts the total KV-cache memory footprint by a massive 45%, while the validation loss (the model's error rate) goes up by a microscopic 0.3%. In engineering terms, that is a Pareto-optimal result — you can't do meaningfully better on one axis without giving up the other. A 45% memory saving for a third of a percent of accuracy is the kind of trade you take every time.

Here are the actual dimensions DeepSeek used in their attention block:

| Parameter | Value | Purpose / Role |
|---|---|---|
| Transformer Layers | 61 | How deep the model is: the number of stacked reasoning blocks |
| Attention Heads (n_h) | 128 | How many things it can pay attention to at once |
| Head Dimension (d_h) | 128 | The resolution of each individual attention head |
| Hidden Dimension (d) | 7168 | The overall width of the model's internal representation |
| KV Compression Rank (d_c) | 512 | The size of the compressed latent vector stored in memory |
| Query Compression Rank (d_c′) | 1536 | Compresses queries during training to shrink the activation footprint |
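A quick back-of-the-envelope calculation shows what these dimensions imply for the notebook's size. The sketch below counts how many numbers are cached per token per layer if full keys and values were stored for every head, versus the MLA latent plus the shared positional key. The width of that positional key (`ROPE_DIM = 64`) is not given in this article and is an assumption here. Note that this is a comparison against uncompressed caching at these dimensions, which is a different baseline from the 45% figure quoted above.

```python
N_HEADS = 128    # attention heads (n_h), from the table
HEAD_DIM = 128   # per-head dimension (d_h), from the table
KV_RANK = 512    # compressed latent size (d_c), from the table
ROPE_DIM = 64    # ASSUMPTION: width of the shared positional key

def values_per_token_full():
    """Numbers cached per token per layer with full keys and values."""
    return 2 * N_HEADS * HEAD_DIM

def values_per_token_mla():
    """Numbers cached per token per layer with the latent plus positional key."""
    return KV_RANK + ROPE_DIM

full = values_per_token_full()
mla = values_per_token_mla()
print(full, mla, round(100 * mla / full, 2))  # 32768 576 1.76
```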

4. Doing Less Work: DeepSeekMoE and Loss-Free Load Balancing

If MLA is about saving memory, this next idea is about saving effort. And it is the single biggest reason DeepSeek is so cheap to run. It is called a Mixture-of-Experts, or MoE, architecture.

A normal "dense" model — like Llama 3.1 405B — wakes up its entire brain for every word it processes. Every one of its 405 billion parameters fires, every time. That is enormous, wasteful, and expensive. It is like turning on every light in a skyscraper because one person walked into the lobby.

The Simple Explanation

The Meaning: A dense AI is one giant generalist who has to think about everything for every question. DeepSeek instead built a huge team of tiny specialists, and a smart receptionist who only wakes up the handful of specialists each token actually needs (in DeepSeek-V3, 8 of the 256 routed experts).

The Business Analogy: A hospital doesn't send your sprained ankle to the chief brain surgeon. The triage desk routes you to the right specialist. DeepSeekMoE is that triage desk. For a coding question, it wakes the coding experts. For a poetry question, the language experts. The other roughly 250 specialists stay asleep, drawing no power. Same quality of care, a fraction of the staff working at any one moment.

Fine-grained experts plus a shared expert

Standard MoE designs (like Google's GShard) chop the model's feed-forward layers into a few big, coarse-grained experts. DeepSeek went the other way — they use many small, fine-grained, lightweight experts. More specialists, each more narrowly skilled. This lets the routing be far more precise, and it stops different experts from redundantly learning the same things.

But they added one more twist. Alongside the specialists, they keep one shared expert that is always on, for every token. Its job is to handle the universal, general-purpose stuff — basic grammar, common logic, the patterns that every question needs. With the shared expert covering the basics, the specialized experts are freed up to store only their narrow, domain-specific knowledge. The result is a model that is both stable to train and very efficient.

The hidden killer: route collapse

There is a famous failure mode in MoE models called route collapse. The receptionist gets lazy and starts sending almost every question to the same two or three favorite experts. Those few experts get overwhelmed while the rest sit idle. Training becomes unstable, the hardware bottlenecks badly, and the gradients between overworked and underworked experts diverge. The whole thing wobbles.

The standard fix is to add an auxiliary loss — a penalty term bolted onto the training objective that punishes the model for imbalanced routing. It works, but at a real cost. That penalty fights the main goal of the model, which is to predict language well. You are now optimizing two things that pull against each other, and the language quality suffers. You bought balance and paid for it with intelligence.

The Technical Deep Dive: Loss-Free Balancing

The Core Architecture: DeepSeek invented a non-differentiable, loss-free load-balancing strategy. Instead of a penalty that distorts the gradient, they add a simple, per-expert bias term directly to each expert's gating affinity score when the router picks its top-K experts.

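The raw gating score is computed with a Sigmoid function over the token's hidden state $u_t$ and that expert's centroid $e_i$, that is, $s_{i,t} = \sigma(u_t^{\top} e_i)$. The bias $b_i$ is then added on top only for the routing decision, so experts are ranked by $s_{i,t} + b_i$ while the output weights still come from $s_{i,t}$ alone.

The Self-Correcting Loop: After every training step, each bias is nudged based on that expert's load error, the gap between the target average load $\bar{c}$ and the number of tokens $c_i$ the expert actually received:

$$b_i \leftarrow b_i + \gamma \cdot \operatorname{sign}(\bar{c} - c_i)$$

Here $\gamma$ is the update speed; $c_i$ and $\bar{c}$ are shorthand introduced for this formula.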

If an expert is overloaded, its bias drops, so it looks less attractive next step. If it's starving, its bias rises. The system self-balances in real time. The step size of that nudge is a tunable update speed: too small and balance arrives too slowly; too large and the routing thrashes wildly.
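The loop described above is simple enough to simulate. The sketch below is a toy model under stated assumptions: 8 experts, top-2 routing, random affinities in which experts 0 and 1 are artificially favored, and a sign-based update of size `gamma`. The function names and all numeric settings are illustrative, not DeepSeek's implementation. The bias only affects which experts are picked; the gate weights still use the raw scores.

```python
import math
import random

def route(scores, bias, top_k):
    """Pick top_k experts by (score + bias); weight them by raw score only."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

def update_bias(bias, load, gamma):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    target = sum(load) / len(load)
    return [b - gamma if c > target else b + gamma if c < target else b
            for b, c in zip(bias, load)]

def simulate(steps, n_experts=8, top_k=2, tokens=256, gamma=0.01, seed=0):
    """Return the busiest expert's load at each step of a toy training run."""
    rng = random.Random(seed)
    skew = [2.0, 1.0] + [0.0] * (n_experts - 2)  # experts 0 and 1 are "favorites"
    bias = [0.0] * n_experts
    history = []
    for _ in range(steps):
        load = [0] * n_experts
        for _ in range(tokens):
            scores = [1 / (1 + math.exp(-(rng.gauss(0, 1) + s))) for s in skew]
            for expert in route(scores, bias, top_k):
                load[expert] += 1
        history.append(max(load))
        bias = update_bias(bias, load, gamma)
    return history

history = simulate(steps=200)
print(history[0], history[-1])  # busiest expert's load: first step vs last step
```

With perfectly even routing each expert would receive 64 of the 512 assignments per step. The first number printed is far above that, and the last is much closer to it, even though no penalty term was ever added to a loss.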

Why it's brilliant: That bias adjustment is non-differentiable: it lives entirely outside backpropagation. So it never touches, distorts, or fights the language-modeling gradient. You get balanced routing and a clean optimization signal at the same time, without the usual trade-off.

That last line is the whole point. The textbook fix makes you choose: you can have balanced routing, or you can have an undistorted language signal, but the auxiliary loss makes you buy one at the expense of the other. DeepSeek declined the trade and built a mechanism that delivers both at once.

| Feature | Standard MoE (e.g. GShard) | DeepSeekMoE |
|---|---|---|
| Expert Granularity | Few big, coarse, heavy experts | Many small, fine-grained, lightweight experts |
| Common Knowledge | Redundantly copied across experts | Handled by one dedicated, always-on shared expert |
| Load Balancing | Auxiliary loss bolted onto the main objective | Dynamic routing bias, adjusted online, outside backprop |
| The Cost | Gradient conflict degrades model quality | Perfect balance with zero gradient interference |

5. Halving the Cost of Every Calculation: Native FP8 Training

This is the most technically daring thing DeepSeek did, and it is worth slowing down for. They ran the bulk of training in FP8 (8-bit floating-point math) from step zero, keeping only a few sensitive components in higher precision. Almost nobody does this at scale, because it is numerically risky.

Let me explain why it matters. Every number inside an AI is stored with a certain precision — a number of bits. The industry standard is 32-bit (FP32) or 16-bit (BF16). FP8 uses just 8 bits per number. The payoff is enormous: roughly double the peak calculation speed and half the memory bandwidth. If you can pull it off, you have just made every calculation in your model twice as cheap.

The catch is that 8 bits is tiny. The specific format, called E4M3, has a very narrow dynamic range — the span between the smallest and largest numbers it can represent. Push a number too big and it overflows to garbage. Let one get too small and it underflows to zero and vanishes. Training a model is a storm of numbers of wildly different sizes. Run it all in FP8 naively and the math diverges into noise within hours.

The Simple Explanation

The Meaning: Normal AI math is like keeping books to the exact cent. FP8 math is like rounding everything to the nearest dollar — twice as fast to add up, but if you round carelessly across millions of transactions, the errors stack up and the books go bankrupt.

The Business Analogy: DeepSeek stopped the bankruptcy with two accounting tricks. First, instead of one rounding rule for the whole company, they set a separate, custom rounding scale for each tiny department — so a department dealing in pennies and one dealing in millions each round sensibly. Second, for the final tally, they switch back to exact-cent bookkeeping before the rounding errors can compound. The speed of rounding, with the safety of precision.

Trick one: fine-grained block-wise quantization

The naive way to do FP8 is to pick one scaling factor for an entire giant tensor (a whole matrix of numbers). The problem: one outlier number in that matrix forces a scale that ruins the precision of everything else.

DeepSeek instead chops everything into small blocks and gives each block its own scaling factor:

Activations are quantized in 1 × 128 tiles (small strips).
Weights are quantized in 128 × 128 blocks (small squares).

Because each little block gets its own scale, a wild outlier in one block can only damage that block — it can't poison the whole matrix. Each block's scale adapts to its own local range of values. This fine-grained control is what makes FP8 stable enough to use the efficient E4M3 format everywhere, on both the forward and backward passes. That lets them skip the usual clunky hybrid setup (E4M3 forward, E5M2 backward) that costs precision during the gradient calculation.
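The effect of per-block scaling is easy to demonstrate with a simplified simulation. The sketch below rounds values to an approximation of the E4M3 grid (3 mantissa bits, largest value 448) and compares one scale for a whole row against one scale per 128-value tile. `fake_e4m3` is a hypothetical helper and is not bit-exact; the data is made up, with a single outlier planted in the second tile. The scale is taken from the current maximum of each tile, which is the online approach.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format

def fake_e4m3(x):
    """Round x to a nearby E4M3-representable value (simplified, not bit-exact)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    exponent = max(math.floor(math.log2(mag)), -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)                    # 3 mantissa bits: 8 steps per binade
    return math.copysign(round(mag / step) * step, x)

def quantize_dequantize(values, tile):
    """Quantize with one scale per `tile` consecutive values, then map back."""
    out = []
    for start in range(0, len(values), tile):
        chunk = values[start:start + tile]
        scale = max(abs(v) for v in chunk) / E4M3_MAX or 1.0  # current max of this tile
        out += [fake_e4m3(v / scale) * scale for v in chunk]
    return out

def total_error(original, restored, count):
    """Sum of absolute errors over the first `count` values."""
    return sum(abs(a - b) for a, b in zip(original[:count], restored[:count]))

# 256 small activations with a single large outlier in the second tile.
row = [0.01 * (1 + i % 7) for i in range(256)]
row[200] = 1000.0

per_tensor = quantize_dequantize(row, tile=256)  # one scale for everything
per_tile = quantize_dequantize(row, tile=128)    # one scale per 1 x 128 tile

print(total_error(row, per_tensor, 128) > total_error(row, per_tile, 128))  # True
```

With a single scale, the outlier pushes the small values in the first tile down into the coarsest part of the format. With per-tile scales, the first tile never sees the outlier.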

Trick two: online scaling, not delayed scaling

Most FP8 schemes use delayed quantization — they guess the right scaling factor for this step based on the maximum values seen in previous steps. It's a guess, and guesses drift. DeepSeek computes the scaling factor online, in real time: for every 1 × 128 tile and every 128 × 128 block, it finds the actual maximum absolute value right now, during the forward and backward passes, and scales to that. This fully eliminates scale drift and keeps the narrow E4M3 range perfectly matched to the live data at every moment.

Trick three: the hybrid accumulation pipeline

Here is a hardware flaw most people never hear about. When NVIDIA's Tensor Cores multiply FP8 numbers together (the operation called a GEMM), they add up the running total in a cramped accumulator that keeps only about 14 bits of precision. As thousands of small products get summed, the low-order bits are rounded away, and the relative error can reach roughly 2%. On a model with trillions of operations, a 2% systematic error is fatal.

DeepSeek built a hybrid accumulation pipeline to route around the flaw:

Run exactly 128 FP8 multiplications inside the fast Tensor Cores.
Then promote those intermediate partial sums up to the CUDA core registers.
Finish adding them there in full, precise FP32.

You get the blistering speed of FP8 for the multiplications and the rock-solid accuracy of FP32 for the additions. Best of both, by hand-routing the data to the right place at the right time.
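A toy model makes the benefit of promoting partial sums visible. The sketch below imitates a narrow accumulator by rounding the running total to 14 significant bits after every addition, then compares summing 4096 products entirely in that accumulator against summing them in groups of 128 and adding each group to a full-precision total. This is an assumption-laden illustration: real Tensor Core rounding behaves differently, Python floats stand in for FP32, and the helper names are invented for this example.

```python
import math

def round_to_bits(x, bits):
    """Keep only `bits` significant binary digits of x (toy narrow register)."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)
    return math.ldexp(round(mantissa * 2 ** bits) / 2 ** bits, exponent)

def accumulate_low_precision(products, bits=14):
    """Sum everything inside one narrow accumulator."""
    total = 0.0
    for p in products:
        total = round_to_bits(total + p, bits)
    return total

def accumulate_hybrid(products, interval=128, bits=14):
    """Sum `interval` products in the narrow accumulator, then promote each
    partial sum to a full-precision running total."""
    total = 0.0
    for start in range(0, len(products), interval):
        total += accumulate_low_precision(products[start:start + interval], bits)
    return total

products = [0.01] * 4096  # stand-in for 4096 multiply results
exact = math.fsum(products)
naive_error = abs(accumulate_low_precision(products) - exact) / exact
hybrid_error = abs(accumulate_hybrid(products) - exact) / exact
print(hybrid_error < naive_error)  # True
```

The narrow accumulator loses accuracy as the running total grows, because each new small product is rounded onto an ever coarser grid. Restarting it every 128 elements keeps the total small while it sits in the narrow register.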

Crucially, DeepSeek did not run everything in FP8. They were surgical. The numerically sensitive parts of the model stayed in safer formats. Here is the exact precision map:

| Model Component | Numerical Format | Why This Format |
|---|---|---|
| Linear layers (FFN, MoE experts) | FP8 (E4M3) | Maximize speed and slash activation memory where volume is highest |
| Post-attention linear inputs | Custom E5M6 | Preserve dynamic range on a sensitive activation path |
| MoE SwiGLU intermediate states | FP8 (recomputed) | Recompute on the fly to balance memory against precision |
| Attention, gating, embedding, output head | BF16 | Too sensitive to rounding errors to risk in FP8 |
| Optimizer states (AdamW moments) | BF16 (master weights in FP32) | Shrink the optimizer footprint while keeping a precise master copy |

That table is the whole lesson in one frame: be aggressive where it's safe, conservative where it's not, and know the difference. FP8 where the volume lives, BF16 and FP32 where the fragility lives.

6. Beating a Slower Network: DualPipe and DeepEP

Now we leave the model itself and walk out onto the data center floor. A 671-billion-parameter MoE model is far too big to fit on one chip. It has to be sharded — split — across thousands of GPUs. And those GPUs have to talk to each other constantly, shuffling tokens to the right experts. That conversation between chips is where fortunes are won and lost.

Here is the kicker. Because of export restrictions, DeepSeek trained on NVIDIA H800 chips, not the top-tier H100s Meta used. The H800's high-speed NVLink interconnect runs at 400 GB/s — only 44% of the 900 GB/s an H100 system enjoys. DeepSeek was racing on a track with less than half the straightaway speed.

On a slow network, two problems normally dominate and waste most of your expensive compute time:

Pipeline bubbles — chips sitting idle, waiting for the next batch of work to arrive.
All-to-all latency — the delay while every chip ships its tokens to every other chip's experts.

The Simple Explanation

The Meaning: DeepSeek was stuck with narrower roads than their competitors. So they became masters of traffic control — making the roads run in both directions at once, and freeing up their drivers from ever having to direct traffic themselves.

The Business Analogy: If your logistics company is stuck on two-lane country roads, you cannot afford a single traffic jam. So you write custom routing software that runs trucks both directions simultaneously, times every light perfectly, and — critically — stops making your drivers get out and direct traffic. The drivers (the GPUs) get to just drive (compute). The roads stay full. Nobody waits.

DualPipe: making the road run both ways

The DualPipe scheduling algorithm is DeepSeek's answer to idle chips. It runs bidirectional pipeline parallelism — it pushes micro-batches of work in from both ends of the pipeline at the same time, so forward passes and backward passes run concurrently instead of waiting on each other.

DualPipe slices each chunk of work into four overlapping stages:

Attention — the model reads context.
All-to-all dispatch — tokens are shipped out to their experts.
MLP — the experts do their computation.
All-to-all combine — the results are shipped back and reassembled.

The magic is the overlap. While one device is busy on the MLP stage of a forward pass, it is simultaneously doing the all-to-all dispatch for the next step. Computation hides the communication. The network delay happens "for free," behind work that was going to run anyway.

DualPipe adds one more refinement: it splits the backward pass into two separate steps — a backward pass for the inputs, and a backward pass for the weights. Decoupling these removes a sequential dependency that would otherwise force chips to wait, which shrinks the pipeline bubbles even further. The one cost: each device has to hold two active copies of the model's parameters to run the bidirectional schedule. A memory price they happily paid for the throughput.
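To see why hiding communication behind computation matters, consider a deliberately simple timing model. It is not DualPipe's actual schedule; it only compares doing compute and communication strictly one after another against letting the communication of one chunk run while the next chunk computes. The per-chunk timings are hypothetical, chosen for illustration only.

```python
def serial_time(chunks, compute, comm):
    """Every chunk computes, then waits for its own communication."""
    return chunks * (compute + comm)

def overlapped_time(chunks, compute, comm):
    """Communication for one chunk runs while the next chunk computes."""
    return compute + (chunks - 1) * max(compute, comm) + comm

# Hypothetical per-chunk timings in milliseconds (not measured values).
CHUNKS, COMPUTE_MS, COMM_MS = 16, 10.0, 8.0

print(serial_time(CHUNKS, COMPUTE_MS, COMM_MS))      # 288.0
print(overlapped_time(CHUNKS, COMPUTE_MS, COMM_MS))  # 168.0
```

As long as communication per chunk takes no longer than computation, almost all of the network time disappears from the total, which is the effect the overlap is designed to achieve.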

DeepEP: firing the GPU's traffic cops

The second weapon is a custom communication library called DeepEP. Here's the problem it solves. Normally, when GPUs do all-to-all communication, the work hogs the GPU's Streaming Multiprocessors (SMs) — the very cores that are supposed to be doing math. Your expensive compute engine spends its time playing traffic cop instead of calculating.

DeepEP is purpose-built for both intra-node NVLink and inter-node RDMA networks, and it gets the GPU out of the traffic-management business. It uses a lightweight NCCL Gin backend and a hook-based, low-occupancy design.

The Technical Deep Dive: DeepEP's Numbers

From manual tuning to automatic: The first version of DeepEP leaned on NVSHMEM and aggressive load/store instructions that required painstaking manual tuning. The updated version is a JIT-compiled framework that analytically computes the optimal number of SMs and Queue Pairs to use, eliminating the empirical tuning entirely. The system figures out its own best configuration.

The payoff in hard numbers: DeepEP slashes SM occupancy from 24 SMs down to just 4–6 SMs, while sustaining the same or higher throughput — around 643 GB/s on dispatch and 675 GB/s on combine even under that minimal SM allocation.

What that means: Eighteen to twenty of each GPU's SMs are handed back to doing model math instead of shuffling network packets. On a cluster of thousands of chips, that recovered compute adds up to real money.

On paper, an interconnect running at 44% of your rival's speed should be fatal for a model this size — too much of the cluster's time would drain away into waiting. DeepSeek wrote the scheduler and the communication library specifically around that bottleneck, and the slow network mostly stopped being the thing that decided their costs.

7. Learning More From Every Example: Multi-Token Prediction (MTP)

We've covered memory, compute, precision, and networking. The last two tricks are about learning efficiency — getting more value out of every training token and every training step. Because if each example teaches the model more, you need fewer of them, and fewer examples means lower cost.

A traditional language model is trained to do one thing: given everything so far, guess the single next word. One word at a time. It works, but it is a thin learning signal. The model only ever gets graded on one prediction per position.

DeepSeek-V3 uses Multi-Token Prediction (MTP) instead. During training, it forecasts several future tokens at once, using extra "prediction heads" arranged in sequence.

The Simple Explanation

In plain terms: a normal model learns by guessing the next word, then the next, then the next — always reacting one step at a time. DeepSeek made its model predict several words ahead at once, which pushes it to plan rather than merely react.

Picture a checkers player who only sees the move in front of them next to a chess player reading three moves out. Forecasting a few tokens ahead forces the model to form a richer sense of where a sentence is heading — same training data, more learning pulled out of each example.

How MTP keeps the logic honest

MTP doesn't just guess wildly; it maintains a strict causal chain. To predict the token two steps ahead, $t_{i+2}$, the first MTP module takes the main Transformer's hidden representation for the current position, $h_i$, and combines it with the embedding of the actual, ground-truth next token, $\mathrm{Emb}(t_{i+1})$. That combined representation is then run through a shared output head to make the prediction. Each future prediction is properly conditioned on the real tokens before it, so the model is forced to build representations that capture genuine long-range dependencies, not shortcuts.

By supervising several future tokens per position, MTP increases the density of the learning signal per step. More feedback, more learning, same data. That is data efficiency, and data efficiency is money.

Now, the elegant bookkeeping. The weight given to that extra prediction loss is not constant. DeepSeek uses a weight of 0.3 for the first 10 trillion tokens of training, then drops it to 0.1 for the remaining 4.8 trillion tokens of pre-training. Early on, planning-ahead is emphasized; later, the model is allowed to focus on its core objective.
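The indexing and the loss weighting are easier to follow in code. The sketch below is a bookkeeping illustration only, with no neural network involved: `mtp_examples` lists, for each position, which token the first MTP module is anchored at, which ground-truth token it is fed, and which token it must predict, and `total_loss` adds the weighted MTP loss to the main loss. The function names are hypothetical, and averaging over MTP modules is an assumption about how several modules would be combined.

```python
def mtp_weight(tokens_seen):
    """Weight on the MTP loss: 0.3 for the first 10T training tokens, then 0.1."""
    return 0.3 if tokens_seen < 10e12 else 0.1

def mtp_examples(tokens, depth=1):
    """For each position i, return (context_token, teacher_token, target_token):
    the module uses the hidden state at i, is fed the true token at i + depth,
    and must predict the token at i + depth + 1."""
    return [(tokens[i], tokens[i + depth], tokens[i + depth + 1])
            for i in range(len(tokens) - depth - 1)]

def total_loss(main_loss, mtp_losses, tokens_seen):
    """Main next-token loss plus the weighted average of the MTP losses."""
    return main_loss + mtp_weight(tokens_seen) * sum(mtp_losses) / len(mtp_losses)

sentence = ["the", "dog", "bites", "the", "man"]
print(mtp_examples(sentence))
# [('the', 'dog', 'bites'), ('dog', 'bites', 'the'), ('bites', 'the', 'man')]
print(total_loss(2.0, [3.0], tokens_seen=5e12))
```

Each training position now contributes two graded predictions instead of one, and the second is always conditioned on the real intermediate token rather than on the model's own guess.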

And the best part for deployment: during normal inference, those extra MTP heads can simply be thrown away. The training-time efficiency gains cost you nothing at runtime. Or — if you want — you can keep the heads and use them to speculatively decode, generating multiple tokens at once to speed up inference. Free upside either way.

8. Firing the Expensive Grader: Group Relative Policy Optimization (GRPO)

The final stage of building a helpful AI is alignment — teaching the raw model to actually be useful, follow instructions, and behave. This is usually done with reinforcement learning, and reinforcement learning is notoriously memory-hungry.

The standard method, Proximal Policy Optimization (PPO), needs a second model called a critic running alongside the main model (the actor). The critic's job is to estimate how good each response is. The brutal part: that critic is usually the same size as the actor. So your reinforcement-learning phase needs double the VRAM and double the compute, just to grade the homework.

The Simple Explanation

The Meaning: Normal AI alignment hires a second, equally expensive AI just to grade the first AI's answers. DeepSeek fired the grader. Instead, it has the AI write several answers to the same question, then grades them against each other: the group's average score becomes the benchmark.

The Business Analogy: Don't hire a $200k professor to score one essay. Instead, ask the student to write five drafts, lay them on the table, and judge each one against the group average. The above-average drafts get reinforced, the below-average ones discouraged. No expensive grader required, and the memory that grader would have occupied is yours again.

The Technical Deep Dive: How GRPO Works

No critic, ever: Group Relative Policy Optimization (GRPO) eliminates the critic model fully. For any given prompt, the actor model generates a whole group of candidate responses. Each one is scored by a rule-based verifier or a reward model.

The group is its own benchmark: Instead of a learned critic estimating value, GRPO computes the baseline directly from the empirical mean and standard deviation of that group's rewards. Each response's relative advantage — how far above or below the group average it scored — is used directly to update the policy.

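The memory math: Standard PPO has to keep at least three actor-sized models in GPU memory (actor, critic, and a frozen reference), and two of them are being trained. GRPO needs only two, the active policy and the frozen reference, and trains just one. Because a trained model also carries gradients and optimizer state, dropping the critic removes roughly half of the training-side memory, which is exactly what lets large-scale RL run on cost-effective hardware.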
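The group-relative advantage itself is a few lines of arithmetic. Here is a minimal sketch, assuming the population standard deviation and a small `eps` to avoid dividing by zero when every response scores the same; both are choices made for this illustration, and the reward values are invented.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each response against its own group: (reward - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Five sampled answers to one prompt, scored by a hypothetical verifier.
rewards = [1.0, 0.0, 0.0, 1.0, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])  # [1.12, -1.12, -1.12, 1.12, 0.0]
```

Responses above the group mean get a positive advantage and are reinforced, those below get a negative one, and an exactly average response contributes nothing. No learned value estimate appears anywhere.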

It's the same instinct that runs under everything else they built. The field had quietly accepted that reinforcement learning means paying for a second, actor-sized model just to grade the first. DeepSeek asked whether that grader needed to exist at all, decided it didn't, and removed it.

9. The Final Scorecard: What All of This Adds Up To

Individually, each of these six innovations saves maybe 20% to 50% on one axis. Stacked together, they compound into something staggering. Let's look at the receipts.

[Figure: Training Economics. Estimated cost to train one frontier-class model, in millions of dollars.]

The Efficiency Gap

While the giants scaled by buying more chips and burning more power, DeepSeek scaled by removing waste. Three bets did most of the heavy lifting:

MoE sparsity: only 37B of 671B parameters fire per token, cutting the compute per token to roughly a tenth of what a dense 405B model needs.
FP8 precision: half the bit-width of standard BF16 training, for roughly double the peak speed.
DualPipe scheduling: removes most of the idle GPU time across the whole cluster.

First, where did the money actually go? Here is how DeepSeek allocated its GPU hours across the entire build. Notice how brutally efficient even the expensive late stages are.

| Training Phase | GPU Hours (H800) | Share of Total | What Happened |
|---|---|---|---|
| Pre-training (Stage 1) | 2,664,000 | 95.55% | The core training on 14.8T tokens at 4K context length |
| Context Extension | 119,000 | 4.27% | Stretching the context window from 4K to 32K, then to 128K tokens |
| Post-training (SFT / RL) | 5,000 | 0.18% | Alignment via supervised fine-tuning and GRPO |
| Total | 2,788,000 | 100% | The full cycle: the finished 671B base and chat models |

That 0.18% is worth dwelling on. The entire alignment phase — the part that turns a raw model into a usable assistant — ran in just 5,000 GPU hours, a rounding error against the pre-training bill. That is GRPO's memory savings showing up as real money.
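The headline dollar figure follows directly from these GPU hours. The sketch below redoes the arithmetic, assuming a rental price of $2 per H800 GPU hour; that rate is not stated in this article and is labeled as an assumption. The 14.8 trillion training tokens come from the pre-training row.

```python
GPU_HOURS = {
    "pre-training": 2_664_000,
    "context extension": 119_000,
    "post-training": 5_000,
}
H800_RATE = 2.00        # ASSUMPTION: rental price in dollars per H800 GPU hour
TRAINING_TOKENS_T = 14.8  # trillions of pre-training tokens

total_hours = sum(GPU_HOURS.values())
total_cost = total_hours * H800_RATE
shares = {phase: round(100 * h / total_hours, 2) for phase, h in GPU_HOURS.items()}

print(total_hours)  # 2788000
print(total_cost)   # 5576000.0
print(shares)       # {'pre-training': 95.55, 'context extension': 4.27, 'post-training': 0.18}
print(round(total_cost / TRAINING_TOKENS_T))  # dollars per trillion training tokens
```

At that rate the total comes to about $5.58M, which rounds to the $5.6M in the headline, and the cost per trillion tokens lands a little under $380K.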

Now the headline comparison. Here is DeepSeek-V3 next to Meta's Llama 3.1 405B, a comparable dense model trained on a comparable amount of data:

| Training Metric | Llama 3.1 405B | DeepSeek-V3 |
| --- | --- | --- |
| Total Parameters | 405B | 671B |
| Active Parameters per Token | 405B (all of them) | 37B (only 5.5%) |
| Primary Chip | NVIDIA H100 | NVIDIA H800 (slower) |
| Training Tokens | 15.6T | 14.8T |
| Total GPU Hours | ~30.8M | 2.788M |
| Estimated Training Cost | ~$92.4M – $123.2M | ~$5.6M |
| Cost per Trillion Tokens | ~$5.92M – $7.90M | ~$378K |
| Model FLOPs Utilization | ~40% (BF16) | ~21.4% FP8 (≈42.9% BF16) |
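The cost rows are simple arithmetic on the GPU-hour and token rows, and it is worth seeing that laid out. The sketch below assumes rental rates of $2 per H800 GPU hour and $3 to $4 per H100 GPU hour. Those rates are inferred from the table (they reproduce its cost figures) rather than stated in it, and the helper names are hypothetical.

```python
def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Rental-style cost estimate: GPU hours times an hourly rate."""
    return gpu_hours * usd_per_gpu_hour


def cost_per_trillion_tokens(cost_usd: float, tokens_trillions: float) -> float:
    """Training cost normalized by the amount of training data."""
    return cost_usd / tokens_trillions


# ASSUMED hourly rates, chosen because they reproduce the table's figures.
deepseek_cost = training_cost_usd(2_788_000, 2.0)
llama_cost_low = training_cost_usd(30_800_000, 3.0)
llama_cost_high = training_cost_usd(30_800_000, 4.0)

deepseek_per_t = cost_per_trillion_tokens(deepseek_cost, 14.8)
llama_per_t_low = cost_per_trillion_tokens(llama_cost_low, 15.6)
llama_per_t_high = cost_per_trillion_tokens(llama_cost_high, 15.6)

# Headline saving against the low end of the Llama estimate.
saving_vs_low = 1 - deepseek_cost / llama_cost_low

print(f"DeepSeek-V3: ${deepseek_cost / 1e6:.3f}M total, ${deepseek_per_t / 1e3:.0f}K per trillion tokens")
print(f"Llama 3.1:   ${llama_cost_low / 1e6:.1f}M to ${llama_cost_high / 1e6:.1f}M total")
print(f"Saving vs low-end Llama estimate: {saving_vs_low:.1%}")
```

The unrounded DeepSeek total is $5.576M. The table's ~$378K per trillion tokens uses the rounded $5.6M, so the script prints a slightly lower per-token figure. Even against the cheaper Llama estimate, the saving stays above 90%.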

Let me make sure two of those numbers really land, because they are the entire story.

Active parameters: 37B vs 405B. Even though DeepSeek's model is bigger overall (671B vs 405B), its MoE design means only 37 billion parameters actually fire for any given token. That is the DeepSeekMoE chapter paying off. The practical effect: the nominal compute (FLOPs) needed per token drops to roughly 10% of what Llama 3.1 405B burns. A bigger brain that thinks with a tenth of the effort.
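To put numbers on "a tenth of the effort," here is a minimal sketch. It uses the common rule of thumb that training compute per token is about 6 FLOPs per active parameter. That rule is my assumption, not something the table states, and it ignores attention costs that grow with context length. The constant cancels in the ratio anyway.

```python
# ASSUMPTION: rule-of-thumb training cost of ~6 FLOPs per active parameter per token.
FLOPS_PER_ACTIVE_PARAM = 6


def train_flops_per_token(active_params: float) -> float:
    """Nominal training FLOPs for one token, given the parameters that fire."""
    return FLOPS_PER_ACTIVE_PARAM * active_params


deepseek_total_params = 671e9
deepseek_active_params = 37e9
llama_active_params = 405e9  # dense model: every parameter fires

# Share of DeepSeek-V3's own parameters used per token.
active_fraction = deepseek_active_params / deepseek_total_params

# DeepSeek-V3's nominal compute per token relative to Llama 3.1 405B.
relative_compute = (
    train_flops_per_token(deepseek_active_params)
    / train_flops_per_token(llama_active_params)
)

print(f"Active fraction of DeepSeek-V3: {active_fraction:.1%}")
print(f"Compute per token vs Llama 3.1 405B: {relative_compute:.1%}")
```

Two different ratios are in play here: 37B is about 5.5% of DeepSeek's own 671B parameters, and about 9% of Llama's 405B. The second one is the "roughly 10%" figure.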

That MFU number is the subtle flex. Model FLOPs Utilization (MFU) measures how much of your hardware's theoretical peak you actually use — higher is better. DeepSeek's 21.4% in FP8 looks lower than Llama's 40%, but that's apples to oranges. FP8 has double the peak FLOPs of BF16, so 21.4% of an FP8 peak is equivalent to about 42.9% in BF16 terms — better utilization than Llama, achieved on a slower network. That number is DualPipe and DeepEP doing their job. They kept thousands of slower chips really busy.
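The conversion is easy to sketch. This assumes, as the paragraph above does, that the chip's FP8 peak is exactly twice its BF16 peak. The function names are illustrative.

```python
def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: achieved throughput over theoretical peak."""
    return achieved_flops_per_s / peak_flops_per_s


def fp8_mfu_in_bf16_terms(mfu_fp8: float, fp8_to_bf16_peak_ratio: float = 2.0) -> float:
    """Re-express an FP8 MFU against the (smaller) BF16 peak.

    ASSUMPTION: the FP8 peak is `fp8_to_bf16_peak_ratio` times the BF16 peak,
    so the same achieved throughput is a proportionally larger share of BF16 peak.
    """
    return mfu_fp8 * fp8_to_bf16_peak_ratio


deepseek_bf16_equiv = fp8_mfu_in_bf16_terms(0.214)
llama_bf16 = 0.40

print(f"DeepSeek-V3 MFU in BF16 terms: {deepseek_bf16_equiv:.1%}")
print(f"Ahead of Llama's ~40%: {deepseek_bf16_equiv > llama_bf16}")
```

Doubling the rounded 21.4% gives 42.8%, within rounding of the 42.9% quoted above. Either way the conclusion holds: once you compare against the same peak, DeepSeek's utilization edges out Llama's.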

When you multiply it all out (a tenth of the compute per token, twice the speed per calculation from FP8, hardware utilization on par with Llama's despite a 44%-speed network, half the memory in alignment), you don't get a 20% saving. You get a 90%+ saving. That is what compounding efficiency looks like.

10. What This Actually Means For Your Business

You are probably not about to train a 671-billion-parameter model. So why should any of this matter to you? Because the lesson underneath the math is the most valuable business principle of the decade, and it has nothing to do with AI specifically.

For years, the entire industry believed that better outcomes required bigger budgets. That belief was a wall, and walls keep small players out. DeepSeek didn't climb the wall or buy a ladder. They proved the wall was made of assumptions. Here is how that translates to your operation:

Your biggest constraint is your best design brief. DeepSeek's slow H800 chips weren't a handicap — they were the forcing function that produced DualPipe and DeepEP. The next time you hit a hard limit (budget, headcount, time), ask DeepSeek's question: what would we redesign if we accepted this limit as permanent? That is usually where the real innovation hides.
Stop waking the whole building for every task. This is the MoE lesson, and it is exactly what good automation does for a business. You don't need your most expensive people answering routine questions all day. You need a smart "router" that handles the common 90% automatically and escalates only the rare, really hard cases to a human expert. Same quality, a fraction of the cost-per-task.
Efficiency compounds; it doesn't add. No single trick here was a silver bullet. A 45% memory cut here, a 50% compute cut there, a 2x speedup somewhere else — stacked, they became a 90% reduction. The same is true of business systems. One automation saves a little. A dozen well-chosen ones, working together, change your entire cost structure.
Elegance is now a moat. The old moat was capital. The new moat is knowing where the waste is and engineering it out. That is a moat a focused, smart operator can actually build — without a Meta-sized bank account.

The Bottom Line

DeepSeek didn't beat the giants by outspending them. They beat them by out-thinking them. They took six different bottlenecks — memory, sparse computation, numerical precision, network bandwidth, learning density, and reinforcement-learning memory — and instead of paying to brute-force past each one, they redesigned the machine around all of them at once.

It came together piece by piece. Multi-Head Latent Attention took the memory problem largely off the table, while DeepSeekMoE with loss-free balancing dropped the compute per token by an order of magnitude. FP8 made each calculation roughly twice as cheap; DualPipe and DeepEP quietly absorbed the penalty of a slower network. MTP wrung more learning out of every token, and GRPO cut the memory bill for alignment in half. No single one of those explains the headline number — but stacked on top of each other, they turn a $92 million problem into a $5.6 million one.

The era of "just buy more chips" is over. The era of elegant, constraint-driven engineering has begun — and the good news for everyone who isn't a trillion-dollar company is that this game rewards intelligence over raw capital. That is a game far more of us can win.

