The Pipeline: A Language Model from Scratch in Twelve Steps

Breaking apart what feels like magic!

May 06, 2026

Every frontier language model in 2026, from Claude to GPT to DeepSeek to Llama, comes out of the same twelve-step pipeline. Some steps take months. Some take milliseconds. The names of the techniques vary by lab. The steps don’t.

I’ll walk through all twelve. This is the second piece in a series adapted from a book called Built, Not Born about how modern AI is engineered, part by part. The first piece argued that LLMs are following the same demystification arc that cars went through in the early twentieth century. This one is the map.

A note before we start. You will see a wide range of numbers in what follows. Training a frontier model costs anywhere from $5 million to $150 million depending on the lab and chip generation. A “step” can take a Tuesday afternoon or six months. Where I give a number it’s representative, often pulled from a published technical report. Where the actual cost is undisclosed (as it usually is for closed labs), I’ll say so.

*Figure 2.1. What the 12 steps build: an exploded view of a transformer LM, with each part labeled by the pipeline step that produces it.*

Step 1. Decide what to build

Someone, somewhere, decides whether the next model is going to be a 30-billion-parameter dense model, a 600-billion-parameter sparse model, a small on-device model, a vision model, a code model, or something else. The decision is not just business. It’s also physics. The lab has a fixed amount of compute, and that compute can be spent on more parameters, more training data, longer context, or more post-training.

The famous 2022 paper called Chinchilla (Hoffmann et al.) showed that there’s a sweet spot. For a given budget, there’s an optimal ratio of parameters to training tokens. Most labs once built with that ratio in mind. The ratio has shifted since. Modern frontier models train on far more tokens per parameter than Chinchilla recommended, because the model gets used in inference for a long time and inference cost dominates training cost over the model’s life. Sardana and Frankle (2023) made this argument explicit and gave it a name: inference-aware scaling.

The other meta-decisions: dense or MoE, what context length to design for, what languages to prioritize, what reasoning capability to target. These cascade through every later step. If you decide on MoE in step 1, you live with that decision through training, post-training, and inference. There is no easy way back.

Time and cost: weeks to months of internal debate. Zero compute. Mostly meetings, dashboards, and one nervous person writing the slide deck.

Step 2. Get the data

A modern frontier model trains on between 15 and 35 trillion tokens of text, code, math, multilingual material, and increasingly synthetic content. A token is roughly three-quarters of a word in English. Thirty-three trillion tokens is more than every book ever printed, several times over.

Most of the data comes from a web crawl that someone (usually a non-profit called Common Crawl) has been doing for over a decade. The lab takes that crawl and runs it through filters: dedup the text, throw out spam and adult content, keep only the educational stuff, balance the languages, and add code from open repositories.

There are many publicly known recipes for doing this. FineWeb-Edu (Penedo et al. 2024) is the high-quality educational subset that became the de facto starting point for several 2024–2026 frontier models. The Stack v2 (Lozhkov et al. 2024) is the standard for code. Dolma (Soldaini et al. 2024) is the 3-trillion-token open dataset from Allen AI. RefinedWeb(Penedo et al. 2023) is the older, filtering-heavy CommonCrawl derivative that Falcon used.

The data step is one of the highest-return parts of the whole pipeline. A model trained on better data outperforms a bigger model trained on worse data. It’s also the part where labs are most secretive, because data is one of their few defensible advantages. A lab that knows how to filter a 100-trillion-token raw crawl down to 30 trillion tokens of high-quality input has a moat that doesn’t show up in the model card.

Time and cost: typically 3–6 months of pipeline development for a frontier-grade dataset, then weeks of running the pipeline at scale. Compute cost for filtering is on the order of single-digit millions, dwarfed by the training run that follows.

Step 3. Tokenize it

Computers don’t understand text directly. They understand numbers. So between the raw text and the model, there’s a translator called a tokenizer. The tokenizer takes a sentence like “the cat sat on the mat” and turns it into a list of integers: maybe [464, 3797, 6155, 319, 262, 7775]. Each integer corresponds to a chunk of text the model knows how to handle.

Most modern tokenizers use a scheme called byte-pair encoding (BPE), originally proposed for machine translation by Sennrich et al. (2015 arXiv, ACL 2016). The exact scheme isn’t important right now. What matters is that the tokenizer is a fixed component, decided early, hard to change later, and it shapes what the model can do.

If your tokenizer merges multi-digit numbers into single tokens, your model will be bad at arithmetic. The number 2048 becomes one token, 2049 becomes three, and the model can’t see the digit-by-digit structure. Modern tokenizers (Llama 3 onward) split numbers into individual digits to fix this. If the tokenizer merges common words but breaks rare ones into letters, your model will be slow on technical text. If it doesn’t have enough Chinese vocabulary in its merge list, your Chinese performance is capped. The tokenizer is the model’s mouth and ears, and you live with whatever you build.

Vocabulary sizes have grown over the years. GPT-2 used 50,000 tokens. Llama 1 used 32,000. Llama 3 jumped to 128,000. DeepSeek V3 uses 128,000. Qwen 2.5 uses 152,064. Larger vocabularies mean fewer tokens per sentence (cheaper inference), but more parameters in the embedding table.

Time and cost: a few weeks to design and train a new tokenizer if you don’t reuse an existing one. Most labs reuse, with modifications. The compute cost is negligible.

Step 4. Decide the architecture

This is where you choose which engine to build.

A modern frontier model has a transformer at its core. On top of that core, the lab makes about twenty named decisions. Will the attention be Multi-Head, Multi-Query, Grouped-Query, or Multi-Head Latent? Will the feed-forward layer be a regular dense MLP or a Mixture of Experts? If MoE, how many experts and how many active per token? What’s the position encoding scheme: RoPE, ALiBi, or something newer? What activation function: SwiGLU, GeGLU? What normalization placement and which variant: LayerNorm or RMSNorm? Will there be a multi-token prediction head?

Don’t try to hold any of those names in your head right now. The wall of acronyms is supposed to feel like a wall.

These twenty-some decisions add up to the architecture. Different labs make different choices, and the differences matter. They determine context length, inference cost, and capability ceiling.

A concrete example, same warning. DeepSeek V4 uses Multi-Head Latent Attention plus Compressed Sparse Attention plus Heavily Compressed Attention; Multi-Token Prediction heads; RMSNorm with QK-Norm; SwiGLU with output clamping; manifold-constrained hyper-connections; DeepSeekMoE-style fine-grained experts plus one shared expert with auxiliary-loss-free balancing. Ten named decisions, each with a paper or a section in the V4 technical report. You don’t need to recognize any of them yet.

Time and cost: one to three months of architecture design and ablation experiments at smaller scales. The ablation experiments alone consume on the order of $1–5 million in compute. The architecture decisions get frozen before the main training run begins.

Step 5. Train it

Training is where the model actually learns. It’s also where most of the money goes. A frontier-scale training run uses 10,000 to 30,000 GPUs, costs $50–$150 million depending on hardware generation, and takes a couple of months.

What’s actually happening: the model is shown a sequence of tokens and asked to predict the next one. It guesses wrong. Based on how wrong, every parameter in the model is adjusted by a tiny amount in the direction that would have made the guess more right. Then the next sequence. Then the next. Repeat for 30 trillion tokens.

The math behind this is older than computers: gradient descent with an algorithm called Adam (Kingma & Ba 2014), or its newer challenger Muon (Jordan 2024). What’s new is doing it at this scale without crashing, which turns out to be the hard part. Training runs at frontier scale fail constantly. GPUs die. Network links flap. Gradients explode. The loss spikes for no obvious reason. Surviving the failures is its own engineering discipline.

Some specific costs. DeepSeek V3 reported training on 2,048 H800 GPUs for 53 days at a marginal cost of around $5.6 million. Llama 3.1 405B trained on 16,000 H100s for 54 days, with total compute spend estimated in the $50–$70 million range (Meta hasn’t disclosed exactly, but the math is straightforward). Training a frontier closed-weight model in 2024 was widely estimated at $100–$150 million all-in (training run plus failed runs plus engineering costs spread across the project).

Compute is by far the largest line item in the budget for any frontier model.

Time and cost: 6–10 weeks of compute, $5M to $150M. Plus 6+ months of preparation and tuning before the run begins.

Step 6. Make it stable

Closely related to step 5, but worth its own listing. Training a language model at frontier scale is like running a marathon over icy roads. The lab spends real engineering effort on:

Detecting when something is going wrong (rising gradient norms, loss spikes).
Rolling back to a good checkpoint.
Modifying the architecture or the routing to avoid recurrence.
Reordering the data so problematic examples don’t all hit at once.

Meta’s Llama 3 paper (one of the most rigorously documented training runs in the open literature) reports 466 total interruptions on its 16,000-GPU cluster over the 54-day run. 419 of those were unexpected: GPUs failed, network links flapped, memory threw errors. Roughly one unexpected interruption every three hours, for two months straight. That’s an unsung subsystem. There are entire teams whose job is “the training run doesn’t die.” Without them, no model.

DeepSeek V4 introduced something called anticipatory routing specifically to handle MoE-related loss spikes. When the loss starts to spike in a routing-related way, V4’s system rolls back routing assignments by a few steps and shifts them out of sync, breaking the feedback loop that produced the spike. It’s an industrial-engineering fix, not a research result. It worked.

Time and cost: stability work runs in parallel with training. Hard to attribute separately. If forced to estimate, 10–20% of the total engineering effort on a frontier run is dedicated to detection-and-recovery infrastructure.

Step 7. Post-train it for instructions

After all that compute, what you have is called a base model. A base model is a brilliant savant with no manners. If you ask it “What is the capital of France?” it might continue with “What is the capital of Germany? What is the capital of Spain?”, copying the structure of a quiz, not answering you. Base models complete text. They don’t follow instructions.

To make a base model useful, you do post-training. The first step is supervised fine-tuning(SFT). Show the model many thousands of examples of instruction / good response pairs, and it learns the convention.

The original demonstration was OpenAI’s InstructGPT (Ouyang et al. 2022), which combined SFT with reinforcement learning from human feedback (RLHF). For a long time the SFT corpus was hand-written by paid annotators. By 2023, most labs had moved to a mix of hand-written, synthetic (model-generated), and distilled-from-stronger-model data. Tulu 3 and Zephyr are the open-weight reference recipes; FLAN (Wei et al. 2021) is the original instruction-tuning collection.

After SFT, your model will follow instructions. It still won’t know that some responses are better than others. That’s the next step.

Time and cost: 1–3 weeks of compute. Single-digit millions of dollars at most. The hard part is the data, not the training.

Step 8. Post-train it for preferences

SFT is not enough. SFT teaches the model what a good response looks like. It doesn’t teach the model that some responses are better than others. To do that, you train it on human (or AI) preferences: pairs where two responses are shown and one is marked better.

The classical method is RLHF (reinforcement learning from human feedback) using an algorithm called PPO (Schulman et al. 2017). Newer methods skip the reinforcement-learning machinery and just optimize the preferences directly. DPO (Rafailov et al. 2023), SimPO (Meng et al. 2024), ORPO (Hong et al. 2024), KTO (Ethayarajh et al. 2024). DPO is the open-weight standard since mid-2024.

The data is the bottleneck. High-quality preference pairs cost real money to collect: humans need to read both responses, judge them, and label them consistently. This used to be the largest cost in post-training. Now, with RLAIF (Lee et al. 2023) and Constitutional AI (Bai et al. 2022, Anthropic), AI feedback substitutes for much of the human labor, and the cost has dropped.

Anthropic’s Constitutional AI is the framework that produces Claude. It works by writing down a “constitution” of principles, having the model critique its own outputs against the constitution, and then training on the resulting preference pairs. The constitution does the work that humans used to do.

Time and cost: 1–4 weeks of compute. Cost depends heavily on whether preference data is collected from humans (millions of dollars) or generated via AI (relatively small).

Step 9. Post-train it for reasoning

This is the newest step in the pipeline. Until late 2024, most labs stopped at step 8. Then OpenAI’s o1 model arrived, and the field discovered that you could push reasoning capability much further by training the model with reinforcement learning where the reward is whether the answer is verifiably correct.

For math, the reward is a checker that knows whether 2 + 2 = 4. For code, it’s whether the unit test passes. The model generates a long chain of reasoning, and if the final answer is right, the whole chain gets reinforced. After enough rounds, the model gets dramatically better at problems it never saw.

GRPO (Group Relative Policy Optimization, from DeepSeekMath, Shao et al. 2024) is the algorithm DeepSeek-R1 (DeepSeek-AI 2025) used. The R1 paper described it in plain English and shipped open weights. Within months it was the open community’s default for reasoning training.

The capability shift is striking. AIME math benchmark scores jumped from 13.4% on GPT-4o (the strongest non-reasoning frontier model at the time) to 83% on o1’s release. R1 scored within a few points of o1 with open weights. Test-time compute became a tunable dial: you could pay more inference compute and get more correct answers.

This step is the single biggest reason 2026 LLMs feel different from 2023 LLMs.

Time and cost: 4–8 weeks of compute on top of the preference-trained model. Compute cost is comparable to a moderate-scale pretraining run, in the range of $10–$30M for a frontier reasoning training pass.

Step 10. Make it cheap to run

The model is now finished. But finishing is not the same as deploying.

A finished frontier model has hundreds of billions of parameters. Running it for one user’s question takes a server full of GPUs. Running it for a million users at once requires a small data center. Without a serving stack (software that batches requests together, caches their shared parts, compresses the model’s weights, and predicts ahead so the model spits out tokens faster) none of these models would be economically viable.

The serving stack is its own world. vLLM (Kwon et al. 2023, with PagedAttention), SGLang(Zheng et al. 2024), TensorRT-LLM, EAGLE-2 (Li et al. 2024) for speculative decoding, MTP heads as drafters, prefix caching, FP8 weights, INT4 quantization via AWQ (Lin et al. 2023). Each is a 1.3× to 3× cost reduction. They compose. Together they’re roughly an order of magnitude cheaper than the same model running on naive code. That order of magnitude is why ChatGPT is a free service and not a hundred-dollar-a-month one.

This is the part of the pipeline where most of the engineering investment per dollar of value happens in 2026. A model is built once. It runs for billions of requests over its lifetime. A 2× improvement in serving cost saves more money than a 2× improvement in training cost.

Time and cost: ongoing. Every major lab has a permanent inference-engineering team. The infrastructure costs hundreds of millions of dollars over time across cluster, software, and personnel.

Step 11. Wrap it for tools

The model can now answer questions. But to be useful as an agent (something that does work for you, not just talks to you) it needs hands. Tools.

A tool, in this context, is a function the model can call: search the web, read a file, execute a piece of Python, draft an email, hit an API. The model emits a structured request (usually JSON) and a runtime executes it and feeds the result back. The whole framework around this is sometimes called the agent loop: plan, act, observe, replan.

The post-training pipeline has to teach the model the convention. The serving stack has to enforce that the JSON is valid (constrained decoding). The runtime has to handle when a tool fails. None of these are model-architecture problems. They’re all systems problems on top of the model.

In late 2024, Anthropic released the Model Context Protocol (MCP) as an open spec for tool-model integration. By mid-2026, most major labs and open-weight ecosystems support it. Tools are now a plug-in ecosystem, much like browser extensions or app stores.

Time and cost: the agent layer is its own product engineering discipline. A small team can build a useful agent harness in weeks; a frontier-quality one (the kind that drives Claude’s computer use, for example) takes months and continuing investment.

Step 12. Align it and ship it

Before the model goes out, the lab runs a final alignment pass. This is where harmlessness, refusal calibration, jailbreak resistance, and policy compliance get baked in. Anthropic’s name for the framework is Constitutional AI; OpenAI’s is deliberative alignment; the techniques rhyme. There’s also an interpretability and monitoring layer (sparse autoencoders à la Templeton et al. 2024, content classifiers like Llama Guard, jailbreak detectors) that sits in front of the deployed model and watches what it’s doing.

Then it ships. Then someone types a question in. Then the inference half of the pipeline (steps 10 and 11, plus the safety checks bolted on top of the served model) runs in under a second to produce a reply. The other ten steps are already frozen into the weights on disk.

Time and cost: weeks of red-teaming and refusal-tuning. Plus ongoing monitoring infrastructure.

Where the time and money go

Putting the twelve steps together, a rough breakdown of where a frontier model’s all-in cost goes:

PhaseSteps% of cost% of timePre-build (decide, data, tokenize, architecture)1–45%30%Pretraining5–660%40%Post-training (SFT, preference, reasoning)7–915%15%Inference engineering1015%continuousAgentic + alignment11–125%15%

This breakdown is approximate and lab-dependent. Closed labs with heavy safety investment shift more cost into 11–12. Open-weight labs typically shift more into 5–6 because they have less safety machinery. Inference engineering is structurally different from the rest because it’s continuous, not project-based.

*Figure 2.2. Where the time and money go: approximate breakdown of cost and time across the five pipeline phases.*

The compute cost is dominated by step 5 (pretraining). The capability quality is dominated by steps 2 (data), 8 (preferences), and 9 (reasoning). The deployed cost is dominated by step 10 (inference engineering). The user experience is dominated by step 11 (agent layer). Different labs prioritize different steps based on their bet about which dominates user value.

Where the labs differ

The twelve steps are universal. The recipes within them are not.

DeepSeek‘s distinguishing characteristic is publishing the recipe in detail. V3 and V4 technical reports include exact configurations, FP8 numerical recipes, and ablation tables. They also lean architectural: most of their advantage comes from steps 4 and 5.

Anthropic‘s distinguishing characteristic is post-training and alignment. Constitutional AI, sparse autoencoder monitoring, and a heavy refusal/safety pipeline mean steps 8 and 12 consume more relative effort than at most labs. The architecture details aren’t published.

OpenAI‘s distinguishing characteristic is the reasoning RL frontier. The o-series got there first (step 9), and it’s still the source of much of OpenAI’s lead on math and code benchmarks.

Meta‘s distinguishing characteristic is the training-systems engineering. The Llama 3 paper documented step 5 better than anything else in the open literature. They also commit to open weights, which puts them on a faster post-training cycle: Llama 3 to 3.1 to 3.3 in a year.

Mistral and Qwen share a distinguishing characteristic: the small-model line, 7B-class models that punch above their weight. They emphasize step 8 efficiency and step 10 deployment for on-device use cases.

Knowing which lab cares most about which step is a useful shortcut for predicting where their next release will be strongest.

What an “intermediate state” looks like

A useful mental note. Each post-training step produces a different kind of model:

Base model (after step 5): a brilliant savant with no manners. Continues text. Doesn’t answer questions.
SFT model (after step 7): can answer questions in the right form. Often verbose, hedging, or off-target on subtle requests.
Preference-trained model (after step 8): polished. Sounds good. Misses on hard reasoning. This is what most non-reasoning chat models look like.
Reasoning model (after step 9): can think through hard problems for many tokens before answering. Slower per turn, dramatically better on math and code. This is the o1 / R1 / Claude-with-extended-thinking class.
Distilled and merged (final variant): a smaller or specialized version of the above, with multiple post-training paths combined.

When you see “Llama 3.1 8B Instruct” or “DeepSeek-R1-Distill” in a model card, the suffix is telling you which intermediate state you’re using. Base means raw step 5 output. Instruct or Chat means through step 7 or 8. R1 / Reasoning / Thinking means through step 9.

A real walkthrough: DeepSeek V4-Pro through the twelve steps

Each step traced to what V4 actually did, where the technical report tells us:

Decide what to build: a 1.6T-parameter MoE with 49B active per token, 1M context window, designed to compete on agentic and reasoning tasks.
Get the data: ~33 trillion tokens, including high-quality web (FineWeb-class), code (Stack v2-class), math, multilingual, and a substantial synthetic-data fraction.
Tokenize: DeepSeek’s own BPE tokenizer, vocabulary ~128K.
Decide the architecture: MLA + CSA + HCA attention; DeepSeekMoE with 384 routed experts, 1 shared, 6 active per token; RoPE position encoding; RMSNorm + QK-Norm; SwiGLU with clamping; manifold-constrained hyper-connections; MTP heads.
Train it: BF16 master weights with FP8 matrix multiplies, AdamW optimizer, WSD schedule. Distributed via FSDP + tensor parallel + DualPipe pipeline parallel + expert parallel.
Stabilize it: anticipatory routing, SwiGLU clamping, attention sinks, sliding-window branch.
SFT: on a mix of instruction data including reasoning traces from earlier R1-class checkpoints.
Preference: DPO-class preference training plus iterative rounds.
Reasoning: GRPO with rule-based rewards on math and code, scaled to V4 base.
Inference engineering: vLLM/SGLang continuous batching, prefix caching, EAGLE-2 plus MTP-as-drafter speculative decoding, FP8 inference, disaggregated prefill/decode.
Tool wrap: MCP-compatible tool interface, function calling, computer-use compatibility.
Align and ship: refusal calibration, safety classifiers, public release with API and open weights.

That’s twelve steps for one specific model. Other labs follow the same shape. The recipes within each step differ.

What this map gives you

Twelve steps. Three categories of work behind them.

The first category is what’s in the engine: steps 4 through 7. The architecture, the training, the stability work. Most of the named parts live here. Most of the recent innovation has happened here.

The second category is how it learns to behave: steps 7 through 9. Post-training and reasoning RL.

The third category is how it gets used in the world: steps 10 through 12. Serving stack and the agent layer.

If you remember one thing from this piece, remember the steps and roughly which parts of the model live at which step. When you read about a new release (”DeepSeek V4 introduces Compressed Sparse Attention, ships with 1M context, and reduces inference cost by 73%”), you’ll know that “Compressed Sparse Attention” is at step 4 (architecture), “1M context” is jointly steps 4 and 10 (architecture and serving), and “73% inference cost reduction” is at step 10. That’s the shape of the win, expressed in pipeline coordinates.

Five sentences to take with you

A frontier language model is the output of a 12-step pipeline that’s universal across labs: decide, data, tokenize, architecture, train, stabilize, SFT, preference, reasoning, inference, tools, align.
Most of the cost lives at step 5 (pretraining); most of the quality comes from steps 2, 8, and 9 (data, preferences, reasoning); most of the deployed performance comes from step 10 (inference engineering).
The intermediate states matter: a base model is not an instruct model is not a reasoning model. Reading a model card means knowing which intermediate state you’re using.
Different labs distinguish themselves by which step they invest most in: DeepSeek on architecture, Anthropic on post-training and alignment, OpenAI on reasoning, Meta on training systems, Mistral and Qwen on the small-model deployment surface.
The map is what makes new releases legible. Once you know the steps, “they shipped a new attention scheme” tells you it’s a step-4 win and you know roughly what to expect downstream.

The next piece in this series unpacks the most important box on the map: step 4, the architecture. The transformer in plain English. One page of words.

Three papers worth your time

Vaswani et al. 2017. Attention Is All You Need. arXiv:1706.03762. The transformer paper. The architectural foundation.
DeepSeek-V3 Technical Report. arXiv:2412.19437. The most thorough public description of a working frontier model and its full pipeline.
Llama 3 Herd of Models. arXiv:2407.21783. The most rigorously documented frontier-scale training run in the open literature.

Read in that order, you have the engine, the recipe, and the production reality.

This is the second piece in a series adapted from the book Built, Not Born: How Modern AI Is Engineered, Part by Part. Subscribe to follow the series. The next chapter unpacks the transformer itself.

Learn Agentic AI

Discussion about this post

Ready for more?