A transformer is a neural network architecture that turns text into tokens, converts those tokens into vectors, repeatedly mixes those vectors through attention and feed-forward layers, and then predicts the next token from the resulting scores. In modern language models, that loop is the engine behind chat, search, coding, summarization, and agent behavior.
If you want the inside view, the shortest useful description is this: tokenization chooses the units, embeddings place those units in vector space, positional information tells the model where they sit, attention lets each token pull information from other tokens, the MLP reshapes each token’s representation locally, the residual stream carries the running state forward, layer norm keeps that state trainable, and logits plus decoding turn the final state into actual output text.
The original Transformer was introduced as an encoder-decoder architecture for sequence-to-sequence tasks such as translation. Many large language models used today are decoder-only variants, but the core block logic is still similar enough that understanding the original design gives you a durable mental model for modern systems.
The inside path from raw text to next-token prediction
Before zooming into the parts, it helps to see the whole path in order.
- Tokenization breaks raw text into model units such as words, subwords, or byte-level pieces.
- Embeddings map each token ID to a dense vector.
- Positional encoding or positional embeddings add order information, because the model would otherwise see a bag of tokens.
- Transformer blocks repeatedly update the running representation through attention, MLP layers, residual connections, and normalization.
- Output projection turns the final hidden state into vocabulary-sized logits, which are raw scores for every possible next token.
- Decoding chooses the next token from those scores, appends it to the sequence, and repeats the loop.
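The steps above can be sketched as a loop. This is a minimal illustration, not a real inference stack: `generate`, `toy_model`, and the five-token vocabulary are hypothetical names invented here, and the stand-in model replaces the tokenizer, embeddings, and transformer blocks with a fixed rule so that the loop itself stays visible.

```python
from typing import Callable, List


def generate(prompt_ids: List[int],
             next_token_logits: Callable[[List[int]], List[float]],
             max_new_tokens: int,
             eos_id: int) -> List[int]:
    """Greedy autoregressive loop: score, pick, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)  # one raw score per vocabulary item
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy choice
        ids.append(next_id)
        if next_id == eos_id:  # stop once the end-of-sequence token appears
            break
    return ids


def toy_model(ids: List[int]) -> List[float]:
    """Hypothetical stand-in for a real model: prefers (last token + 1) mod 5."""
    vocab_size = 5
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


print(generate([0], toy_model, max_new_tokens=10, eos_id=4))
```

In a real system, `next_token_logits` would wrap the full model forward pass, and the greedy `max` would be replaced by whatever decoding policy the product uses.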
That sequence is why transformer explanations that only talk about attention are incomplete. Attention is central, but it is one part of a larger pipeline, and many practical issues come from the steps before and after it.
Before the model can “think”: tokenization, embeddings, and position
Tokenization decides what the model is allowed to see
Transformers do not read raw characters the way humans do. They read tokens. In practice, modern models usually use subword tokenization, which means a rare word may be split into several pieces while a common word may stay intact.
This matters more than many teams expect. Tokenization affects sequence length, which affects latency, memory use, and cost. It also affects failure modes. A product name, account code, legal clause reference, or unusual spelling may become several tokens, making it harder for the model to treat it as one stable unit.
A useful way to think about tokenization is that it is the model’s input contract. If the tokenizer breaks a phrase into awkward fragments, the rest of the network has to recover from that choice downstream.
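To make the input-contract idea concrete, here is a toy greedy longest-match splitter. It is only a sketch under stated assumptions: production tokenizers learn their vocabularies from data and apply their own splitting rules, and both `greedy_subword_tokenize` and `TOY_VOCAB` are hypothetical names made up for this example.

```python
from typing import List, Set


def greedy_subword_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Split `word` into the longest vocabulary pieces, left to right.

    Falls back to a single character when no longer piece matches.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hand-written toy vocabulary (an assumption, not taken from any real model).
TOY_VOCAB = {"contract", "renew", "al", "acct", "-", "20", "24"}

for text in ["contract", "renewal", "acct-2024"]:
    print(text, greedy_subword_tokenize(text, TOY_VOCAB))
```

With this vocabulary, a common word such as `contract` stays whole, while an account code such as `acct-2024` breaks into four pieces. Each extra piece adds to sequence length and gives the model one more fragment to reassemble.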
Embeddings turn token IDs into vectors
After tokenization, each token ID is looked up in an embedding table. That table maps discrete symbols into continuous vectors, which gives the model something it can manipulate with linear algebra. Similar meanings, usages, or roles can end up near one another in this space, but the embedding itself is only the starting representation. The model still has to contextualize it.
For example, the token for “bank” starts with one learned vector, but its final meaning will differ depending on whether the surrounding context is about finance or rivers. The embedding gives the model a starting point; the transformer blocks do the disambiguation.
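A minimal sketch of the lookup, in plain Python. The table here is filled with small random numbers purely for illustration; in a trained model these values are learned. The function names, the sizes, and the 0.02 scale are assumptions chosen for this example.

```python
import random
from typing import List


def make_embedding_table(vocab_size: int, d_model: int, seed: int = 0) -> List[List[float]]:
    """Build a vocab_size x d_model table of small random values (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]


def embed(token_ids: List[int], table: List[List[float]]) -> List[List[float]]:
    """Look up one vector per token ID. No context is involved at this stage."""
    return [list(table[t]) for t in token_ids]


table = make_embedding_table(vocab_size=10, d_model=4)
vectors = embed([3, 7, 3], table)

# The same token ID always yields the same starting vector, wherever it appears.
print(vectors[0] == vectors[2])
```

The lookup is context-free: token ID 3 gets an identical vector at both positions. Any difference in meaning has to come from the later blocks.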
Positional encoding tells the model where tokens sit
Attention by itself does not inherently know which token comes first, second, or last. The model therefore needs positional information. In the original Transformer, this came from sinusoidal positional encodings added to token embeddings. Many newer systems use different positional schemes, but the job is the same: preserve order.
Without position, the sentences “the contract replaced the policy” and “the policy replaced the contract” would contain the same tokens but lose the relation that makes them mean different things.
The practical takeaway is simple: token identity and token position are both required. One tells the model what is present; the other tells it where.
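The sinusoidal scheme from the original Transformer can be sketched in a few lines. Even dimensions use a sine and odd dimensions a cosine of `pos / 10000^(2i / d_model)`. The function names are illustrative, and newer models often use other positional schemes.

```python
import math
from typing import List


def sinusoidal_position(pos: int, d_model: int) -> List[float]:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos of the same angle."""
    vec = []
    for j in range(d_model):
        i = j // 2  # dimensions 2i and 2i+1 share one frequency
        angle = pos / (10000 ** (2 * i / d_model))
        vec.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return vec


def add_positions(token_vectors: List[List[float]]) -> List[List[float]]:
    """Add the encoding for position 0, 1, 2, ... to each token vector in order."""
    d_model = len(token_vectors[0])
    return [
        [x + p for x, p in zip(vec, sinusoidal_position(pos, d_model))]
        for pos, vec in enumerate(token_vectors)
    ]


# Two identical token vectors become distinguishable once position is added.
same_token_twice = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
with_position = add_positions(same_token_twice)
print(with_position[0] != with_position[1])
```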
What happens inside a transformer block
Once the model has token vectors with positional information, it sends them through a stack of transformer blocks. Each block updates a running hidden state. In interpretability work, that running state is often called the residual stream.
Multi-head attention mixes information across tokens
Attention is the part that lets one token look at other tokens and decide what matters right now. Each token produces query, key, and value projections. Query-key interactions produce attention weights, and those weights determine how much value information gets pulled from other positions.
In plain language, attention answers a question like: for this token, which other tokens should influence its next representation?
Multi-head attention repeats that process with several learned projections in parallel. Different heads can specialize in different relationships: nearby syntax, long-range references, separators, formatting patterns, or task-specific cues. You should not imagine each head as a human-understandable rule, but the multi-head design gives the model multiple ways to inspect the same sequence.
A simple example is the sentence “The contract expired in March, but it was renewed in April.” When the model processes the token “it”, attention helps it connect that token back to “contract” rather than to “March” or “April”.
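A minimal single-head sketch of the query-key-value computation described above, in plain Python. It assumes the query, key, and value projections have already been applied, uses the scaled dot-product form from the original Transformer, and applies by default the causal mask typical of decoder-only models. The function names are illustrative, and real implementations use batched matrix operations instead of Python loops.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Turn raw scores into weights that sum to 1 (max subtracted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: List[Vector], keys: List[Vector], values: List[Vector],
              causal: bool = True) -> List[Vector]:
    """Scaled dot-product attention for one head."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # With a causal mask, position i may only look at positions 0..i.
        visible = list(range(i + 1)) if causal else list(range(len(keys)))
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k) for j in visible]
        weights = softmax(scores)
        # Each output is a weighted average of the visible value vectors.
        out = [sum(w * values[j][c] for w, j in zip(weights, visible))
               for c in range(len(values[0]))]
        outputs.append(out)
    return outputs


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

Multi-head attention runs several such computations in parallel on different learned projections and then combines the results.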
The MLP or feed-forward block reshapes each token locally
After attention, each position goes through a feed-forward network, often called an MLP block. Unlike attention, this step does not mix information across positions. Instead, it applies the same nonlinear transformation to each token representation independently.
This is where the model can expand the representation into a larger intermediate space, transform it, and compress it back. A practical mental model is that attention gathers context and the MLP digests what was gathered. If attention is the communication layer, the MLP is where much of the per-token computation happens.
That distinction matters because teams often over-credit attention for everything the model knows. In reality, useful behavior comes from repeated interaction between context mixing and local nonlinear transformation.
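The expand, transform, compress pattern can be sketched as two linear layers with a nonlinearity between them, applied to each position separately. This example uses ReLU, as in the original Transformer; many newer models use other activations. The weights below are hand-picked toy values, not learned ones, and all names are illustrative.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # shape [out_features][in_features]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute weight @ x + bias for one vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def feed_forward(x: Vector, w_in: Matrix, b_in: Vector, w_out: Matrix, b_out: Vector) -> Vector:
    """Expand to a wider hidden vector, apply the nonlinearity, project back."""
    hidden = relu(linear(x, w_in, b_in))
    return linear(hidden, w_out, b_out)


def apply_per_position(sequence: List[Vector], w_in: Matrix, b_in: Vector,
                       w_out: Matrix, b_out: Vector) -> List[Vector]:
    """Same weights at every position; no information moves between positions."""
    return [feed_forward(x, w_in, b_in, w_out, b_out) for x in sequence]


# Toy weights: model width 2, hidden width 4.
W_IN = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_IN = [0.0, 0.0, 0.0, 0.0]
W_OUT = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
B_OUT = [0.0, 0.0]

print(apply_per_position([[2.0, -3.0], [0.5, 0.5]], W_IN, B_IN, W_OUT, B_OUT))
```

Because each position is processed independently, changing one token's vector cannot affect another token's output at this step. That is the key contrast with attention.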
Residual stream and layer norm keep the system usable at depth
The residual stream is the running vector state that flows through the network from block to block. Attention and MLP layers do not completely replace that state. They write updates into it through residual connections, which means the model keeps carrying forward prior information while adding new adjustments.
That design makes very deep networks much easier to train and reason about. Instead of rebuilding the representation from scratch at every block, the model keeps refining an ongoing working state.
Layer normalization helps stabilize that process by rescaling each token’s activation vector to a consistent mean and variance, so the signal neither blows up nor fades as it passes through many layers. In practice, this is one of the quiet components that makes the flashy components work.
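A minimal sketch of how a sublayer writes into the residual stream. It uses the pre-norm arrangement, where the input is normalized before the sublayer. The original Transformer instead normalized after the residual addition, and real layer norm also has a learned gain and bias, which are omitted here. The function names are illustrative.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Rescale one token vector to roughly zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_update(stream: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-norm residual step: stream + sublayer(layer_norm(stream)).

    The sublayer (attention or MLP) adds an update; it does not replace the stream.
    """
    update = sublayer(layer_norm(stream))
    return [s + u for s, u in zip(stream, update)]


stream = [1.0, 2.0, 3.0]
# A sublayer that contributes nothing leaves the stream untouched.
print(residual_update(stream, lambda h: [0.0] * len(h)))
```

If a block has nothing useful to add, the earlier state passes through unchanged. That is the sense in which blocks refine the representation instead of rebuilding it.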
Transformer components at a glance
| Component | Main job | Why it matters in practice |
|---|---|---|
| Tokenization | Choose the units the model reads | Changes cost, sequence length, and how well rare terms survive preprocessing |
| Embeddings plus position | Turn tokens into ordered vectors | Without them the model has no continuous representation of tokens and no way to tell their order |
| Multi-head attention | Mix information across tokens | Enables reference tracking, long-range dependencies, and context-sensitive interpretation |
| MLP block | Transform each token representation nonlinearly | Handles a large share of per-token feature processing after context is gathered |
| Residual stream plus layer norm | Carry and stabilize the running hidden state | Makes deep stacks trainable and lets blocks refine rather than replace the representation |
| Logits plus decoding | Turn hidden states into actual output tokens | Directly affects determinism, diversity, repetition, and runtime behavior |
How the model turns hidden states into text
Logits are raw scores, not final words
At the output side, the model takes the final hidden state for the current position and projects it into a vector with one score per vocabulary item. Those scores are the logits. They are not probabilities yet. They are raw preferences.
Softmax converts logits into a probability distribution over the vocabulary. A very high logit for one token means the model strongly prefers that token next, but the system still needs a policy for choosing from the distribution.
This is where many product discussions quietly confuse the model with the generation system. The model produces logits. The inference stack decides how to use them.
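The logits-to-probabilities step is small enough to show in full. The vocabulary and scores below are invented for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical four-word vocabulary with made-up logits.
vocab = ["April", "March", "the", "renewed"]
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)

for word, logit, prob in zip(vocab, logits, probs):
    print(f"{word:>8}  logit={logit:5.1f}  prob={prob:.3f}")
```

Logits can be negative or larger than 1, and only their differences matter: adding the same constant to every logit leaves the probabilities unchanged.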
Decoding is the policy that turns scores into output
Decoding is the step that chooses the next token. Greedy decoding picks the highest-probability token every time. Beam-style methods keep multiple candidate continuations alive. Sampling-based methods introduce randomness, often with controls such as temperature, top-k, or top-p.
These choices create real tradeoffs. Greedy decoding is predictable but can become dull or repetitive. Sampling can produce more varied and creative output, but it can also become less stable. Two products using the same base model can therefore behave very differently because of decoding choices rather than because the underlying model weights changed.
For business teams, this is an important operational point: not every output problem is a training problem. Some are decoding-policy problems.
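The decoding controls named above can be sketched as small functions over the same logits. This is an illustrative standard-library version, not the API of any particular inference library. The function names are assumptions, and beam search is omitted for brevity.

```python
import math
import random
from typing import Dict, List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature below 1 sharpens the distribution; above 1 flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits: List[float]) -> int:
    """Always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_filter(probs: List[float], k: int) -> Dict[int, float]:
    """Keep the k most likely tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs: List[float], p: float) -> Dict[int, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(dist: Dict[int, float], rng: random.Random) -> int:
    """Draw one token ID according to the filtered distribution."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]
rng = random.Random(0)
print("greedy:", greedy(logits))
print("top-k sample:", sample(top_k_filter(softmax(logits, temperature=0.8), k=2), rng))
print("top-p sample:", sample(top_p_filter(softmax(logits), p=0.9), rng))
```

The model weights never change between these calls. Only the policy applied to the logits does, which is why two products on the same base model can behave so differently.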
Pretraining and fine-tuning are different jobs
Pretraining is the large-scale stage where a model learns broad statistical structure from massive unlabeled corpora. For decoder-style language models, this is usually next-token prediction. The goal is not to memorize one business workflow. The goal is to learn a general-purpose internal representation of language, patterns, and world structure.
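As a rough sketch, the next-token objective scores each position on how much probability it assigned to the token that actually came next. The loss is the cross-entropy, `-log softmax(logits)[target]`, averaged over positions. The function names and toy inputs are illustrative; real training computes this over large batches with automatic differentiation.

```python
import math
from typing import List


def next_token_loss(logits: List[float], target_id: int) -> float:
    """Cross-entropy for one position: -log softmax(logits)[target_id]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target_id]


def sequence_loss(logits_per_position: List[List[float]], token_ids: List[int]) -> float:
    """Average loss, where the logits at position t are scored against token t + 1."""
    losses = [
        next_token_loss(logits_per_position[t], token_ids[t + 1])
        for t in range(len(token_ids) - 1)
    ]
    return sum(losses) / len(losses)


# A confident, correct prediction costs little; a confident, wrong one costs a lot.
print(next_token_loss([5.0, 0.0, 0.0], target_id=0))
print(next_token_loss([5.0, 0.0, 0.0], target_id=1))
```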
Fine-tuning happens later. It adapts a pretrained model to a narrower objective, behavior style, domain, or task. In some cases that means supervised fine-tuning on labeled examples. In others it means instruction tuning, preference tuning, or adapter-style updates that change behavior without retraining the entire model from scratch.
The difference matters because teams often expect fine-tuning to create missing fundamentals. It usually does not. If the base model is weak on reasoning, long-context handling, or domain vocabulary, fine-tuning can help shape behavior, but it cannot cheaply replace the value of a stronger pretrained foundation.
The safer mental model is this: pretraining builds the broad engine, while fine-tuning steers that engine toward a narrower destination.
Common mistakes when people explain or implement transformers
- Treating attention as the whole model. Attention is crucial, but embeddings, positional information, MLP layers, residual paths, normalization, and decoding are also essential.
- Ignoring tokenization. Many production issues start before the first attention calculation because the tokenizer split the input in an awkward or expensive way.
- Confusing logits with probabilities or outputs. Logits are just raw scores. Output behavior depends heavily on decoding.
- Assuming longer context means perfect long-range understanding. A model can accept long sequences and still fail to reliably use information from distant parts of them.
- Expecting fine-tuning to fix everything. Poor data, narrow labels, or unrealistic expectations can make a fine-tuned model worse, not better.
- Forgetting compute tradeoffs. Classic attention cost grows quadratically with sequence length, which is one reason context-window decisions matter for latency and spend.
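The last point can be made concrete with a back-of-the-envelope count. In classic attention every query is scored against every key, so the number of scores per head per layer is the square of the sequence length, or roughly half that with a causal mask. The helper name is hypothetical, and the count ignores optimizations such as caching or sparse attention.

```python
def attention_score_count(seq_len: int, causal: bool = False) -> int:
    """Number of query-key scores one attention head computes for one sequence."""
    if causal:
        # Position i attends to positions 0..i, giving 1 + 2 + ... + seq_len scores.
        return seq_len * (seq_len + 1) // 2
    return seq_len * seq_len


for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n))
```

Doubling the sequence length quadruples the count, which is why long prompts cost disproportionately more than short ones.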
Practical checklist after reading this guide
If you are evaluating or deploying transformer-based systems, use this checklist:
- Identify the tokenizer and inspect how it splits your important domain terms, IDs, names, and abbreviations.
- Separate model-quality questions from decoding-policy questions before you decide you need retraining.
- Check whether the use case depends on long-range context, exact copying, or structured output, because each stresses different parts of the stack.
- Ask whether you need raw pretraining strength, task-specific fine-tuning, retrieval, or a workflow layer around the model rather than assuming one lever solves everything.
- Measure latency and cost in tokens, not only in requests, because tokenization and context length change the real compute profile.
- When debugging behavior, trace the full path: tokenizer, prompt format, context, model choice, decoding settings, and post-processing.
The practical payoff of understanding transformer architecture is not academic trivia. It gives you a better way to reason about why a model is slow, why it misses a reference, why it repeats itself, why a fine-tune underdelivers, or why two systems built on similar models still behave very differently. Once you can see the architecture from the inside, product decisions around LLMs become much less mystical and much more operational.