A feed-forward neural network layer is the part of a model that takes an input vector, applies learned weights and a bias, and then usually passes the result through a non-linear activation. An MLP, or multi-layer perceptron, is just several of those layers stacked together. In transformers, this same idea appears as the feed-forward or MLP block that sits beside attention and does most of the per-token feature transformation.
If attention decides which tokens should exchange information, the feed-forward block decides how each token representation should be rewritten once that information arrives. That is why dense layers, activations, hidden sizes, residual paths, and normalization still matter even in modern transformer models.
What a dense layer actually does
A dense layer, fully connected layer, or linear layer is the basic building block inside most feed-forward networks. Each output feature is computed from all input features, not just one of them. In plain language, the layer learns how much every input signal should matter for every output signal.
Matrix multiplication intuition
The simplest mental model is this: put your input features into a row of numbers, put the learned weights into a matrix, multiply them, then add a bias term. The result is a new feature vector. Each output dimension is a weighted mixture of the full input.
If your input has shape batch × input_features and your weight matrix maps input_features → output_features, the layer can transform many examples at once with one matrix multiplication. That is why dense layers are so common: they are expressive, easy to optimize, and fast on modern hardware.
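As a minimal sketch of that multiplication, assuming NumPy and made-up values (the function name `dense` and the numbers are illustrative, not taken from any particular library):

```python
import numpy as np

def dense(x, weight, bias):
    """Apply one dense layer: x @ weight + bias.

    x:      (batch, input_features)
    weight: (input_features, output_features)
    bias:   (output_features,)
    """
    return x @ weight + bias

# Two examples with three input features each (values are made up).
x = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.0]])

# Hypothetical weights mapping 3 input features to 2 output features.
weight = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
bias = np.array([0.5, -0.5])

y = dense(x, weight, bias)
print(y.shape)  # (2, 2)
print(y)        # rows: [4.5, 4.5] and [0.5, 0.5]
```

Both examples are transformed by the same single matrix multiplication, which is the batching property described above.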
Imagine an input vector with three business signals: response time, ticket complexity, and customer sentiment. A dense layer can learn one output feature that acts like an escalation score, another that acts like a churn-risk feature, and another that combines both. It does not need a human to hard-code those combinations first.
Why this block is called feed-forward
The term feed-forward means information moves from the current input toward the current output without an internal recurrence loop in that layer. There is no step-by-step hidden state inside the block the way a classic recurrent network has. During training, gradients still flow backward through backpropagation, but the forward computation itself is direct.
What makes an MLP an MLP
An MLP is not one dense layer. It is multiple dense layers with non-linear activations between them. That middle non-linearity is what turns a simple affine transform into a more expressive function approximator.
Why activations matter
If you stack linear layers with no activation in between, the whole stack can be collapsed into one equivalent linear transform. That means depth alone adds no expressive power. Activations are what stop that collapse and let the model learn curved, piecewise, or otherwise non-linear decision boundaries.
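The collapse is easy to check numerically. The sketch below assumes NumPy and random illustrative weights; the names `w_merged` and `b_merged` are introduced here only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no activation between them (random illustrative weights).
w1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
w2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

x = rng.normal(size=(5, 4))
stacked = (x @ w1 + b1) @ w2 + b2

# The same function written as a single linear layer.
w_merged = w1 @ w2
b_merged = b1 @ w2 + b2
merged = x @ w_merged + b_merged

print(np.allclose(stacked, merged))  # True

# With a ReLU in between, the merged layer no longer reproduces the output.
with_relu = np.maximum(x @ w1 + b1, 0.0) @ w2 + b2
print(np.allclose(with_relu, merged))  # False
```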
ReLU is the classic example: it keeps positive values and clips negative ones to zero. GELU is smoother and is widely used in transformer-style models: instead of making a hard cutoff, it multiplies each value by the standard normal CDF at that value, so strongly negative inputs go to nearly zero while small negative inputs are only damped. You do not need to memorize the formulas to use them well. The practical point is that activations decide how aggressively the network gates or reshapes intermediate features.
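For readers who do want to see the two functions side by side, here is a small sketch using only the Python standard library. It uses the exact erf-based form of GELU; many frameworks also offer a tanh approximation, which is not shown here.

```python
import math

def relu(x: float) -> float:
    """Keep positive values, clip negatives to zero."""
    return max(0.0, x)

def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Compare the hard cutoff with the smooth gate on a few sample inputs.
for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"{value:+.1f}  relu={relu(value):.4f}  gelu={gelu(value):.4f}")
```

Note that GELU lets small negative values through at reduced magnitude, while ReLU zeroes them completely.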
Hidden dimensions are a capacity dial
The hidden dimension is the width of the intermediate layer inside the MLP. Increasing it gives the network a larger temporary workspace to build richer internal features before projecting back to the output space. Decreasing it creates a tighter bottleneck.
This is one of the most important design levers in an MLP. A hidden layer that is too small may underfit because it cannot express enough useful combinations of features. A hidden layer that is much too large may improve training fit, but it also adds parameters, memory pressure, latency, and overfitting risk.
A useful mental model is:
- First linear layer: expand or remix the input features.
- Activation: gate or reshape that expanded representation.
- Second linear layer: compress or project it into the output space you actually need.
That simple pattern is the core of many MLPs, from small tabular models to the feed-forward blocks inside large language models.
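The expand, activate, project pattern can be sketched in a few lines. This assumes NumPy, a ReLU activation, and illustrative sizes (8 → 32 → 4); none of these choices come from a specific model.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_forward(x, w1, b1, w2, b2):
    hidden = x @ w1 + b1        # first linear layer: expand or remix
    hidden = relu(hidden)       # activation: gate the expanded features
    return hidden @ w2 + b2     # second linear layer: project to output space

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 8, 32, 4   # illustrative sizes

w1 = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.1, size=(hidden_dim, output_dim))
b2 = np.zeros(output_dim)

x = rng.normal(size=(16, input_dim))   # batch of 16 examples
out = mlp_forward(x, w1, b1, w2, b2)
print(out.shape)  # (16, 4)
```

Changing `hidden_dim` is the capacity dial discussed above: the input and output shapes stay fixed while the intermediate workspace grows or shrinks.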
Residual connections and normalization are not optional details in deep stacks
Once networks get deeper, the feed-forward block itself is only part of the story. Residual connections and normalization are what make those blocks trainable at scale.
Residual connections let a layer learn a correction
A residual connection adds the input of a block back to that block’s output. Instead of forcing a layer to rewrite the entire representation from scratch, the model can learn a delta or correction. That makes optimization easier because a layer can stay close to the identity map when a large change is not useful yet.
In practice, this helps gradients move through deep networks and reduces the chance that adding more layers immediately makes training worse. For transformer-style architectures, residual paths are a major reason deep stacks remain stable enough to optimize.
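A small sketch of the idea, assuming NumPy. The zero-initialized output projection is an illustrative choice made here to show the identity behavior; it is not something the surrounding text prescribes.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def residual_mlp(x, w1, b1, w2, b2):
    # The block output is added to its own input, so it only has to learn a delta.
    return x + mlp(x, w1, b1, w2, b2)

rng = np.random.default_rng(0)
dim, hidden = 6, 24
x = rng.normal(size=(3, dim))

w1 = rng.normal(size=(dim, hidden))
b1 = np.zeros(hidden)
# Zero-initialized output projection: the block contributes no correction yet.
w2 = np.zeros((hidden, dim))
b2 = np.zeros(dim)

y = residual_mlp(x, w1, b1, w2, b2)
print(np.array_equal(y, x))  # True: the block starts as the identity map
```

The addition also makes the shape constraint explicit: the block output must have the same shape as its input, or the residual sum is not defined.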
Normalization keeps feature scale under control
Normalization helps prevent intermediate activations from drifting into unstable scales. In transformer blocks, layer normalization is the standard choice. Instead of normalizing across the whole batch the way batch normalization does, layer normalization works across the feature dimension for each example.
That matters because transformers often operate on variable sequence lengths, large model widths, and sometimes small effective batch sizes. Layer norm gives the model a more predictable scale from block to block, which makes deep feed-forward and attention stacks easier to train.
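Here is a minimal layer normalization sketch, assuming NumPy. The parameters `gamma` and `beta` stand for the learned scale and shift and are shown at typical initial values; `eps` is a small constant for numerical safety.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each example over its last (feature) axis, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
features = 8
# Rows with very different scales, to mimic drifting activations.
x = rng.normal(size=(4, features)) * np.array([[1.0], [10.0], [100.0], [1000.0]])

gamma = np.ones(features)   # learned scale, shown at a typical initial value
beta = np.zeros(features)   # learned shift, shown at a typical initial value

y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1))  # each entry close to 0
print(y.std(axis=-1))   # each entry close to 1
```

Because the statistics are computed per row, normalizing one example gives the same result whether or not the other examples are in the batch.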
Pre-norm and post-norm
You will often see two common patterns:
- Pre-norm: normalize first, then run the block, then add the residual.
- Post-norm: run the block, add the residual, then normalize.
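Both use the same ingredients, but they behave differently during optimization. Pre-norm leaves the residual path itself un-normalized, which tends to make very deep stacks easier to train. Post-norm, the arrangement in the original transformer, normalizes the summed output and is generally more sensitive to learning-rate warmup and initialization. The takeaway is not to memorize a preferred pattern blindly, but to recognize that normalization and residual design change how well very deep MLP and transformer stacks train.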
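The two orderings differ only in where the normalization call sits. The sketch below assumes NumPy, omits the learned scale and shift for brevity, and uses a hypothetical `sublayer` function as a stand-in for an MLP or attention block.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Learned scale and shift are omitted to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm(x, sublayer):
    # Normalize, run the block, then add the untouched input back.
    return x + sublayer(layer_norm(x))

def post_norm(x, sublayer):
    # Run the block, add the residual, then normalize the sum.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
dim = 8
w = rng.normal(scale=0.1, size=(dim, dim))

def sublayer(h):
    # Hypothetical stand-in for an MLP or attention block.
    return np.maximum(h @ w, 0.0)

x = rng.normal(size=(2, dim)) * 50.0   # deliberately large-scale input
pre = pre_norm(x, sublayer)
post = post_norm(x, sublayer)

print(np.abs(pre).max())    # residual stream keeps the input's scale
print(np.abs(post).max())   # output is renormalized after the sum
```

In the pre-norm version the input passes through the addition unchanged, so its scale carries forward; in the post-norm version every block output is renormalized.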
Where feed-forward blocks appear inside transformers
Transformers are often explained as if attention is the whole model. It is not. Each transformer layer usually has an attention sub-layer and a feed-forward sub-layer, with residual paths and normalization around them.
Attention and feed-forward layers do different jobs
Attention mixes information across positions. It lets one token look at other tokens and pull in context. The feed-forward block then processes each position separately using the same learned weights at every position.
That division of labor is important:
- Attention: decides what information should move between tokens.
- Feed-forward / MLP: decides how the representation at each token should be transformed once that context is available.
So if you are trying to understand what gives a transformer expressivity, do not think of the MLP block as an accessory. It is one of the main places where the model builds new per-token features.
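The per-position behavior can be verified directly. This sketch assumes NumPy, a ReLU-based block, and small illustrative sizes; `tokens` stands for one sequence of representations after attention.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32   # illustrative sizes
w1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)

tokens = rng.normal(size=(seq_len, d_model))   # one sequence, after attention

# Apply the block to the whole sequence at once...
all_at_once = ffn(tokens, w1, b1, w2, b2)
# ...and to each position on its own, with the same weights.
one_by_one = np.stack([ffn(tok, w1, b1, w2, b2) for tok in tokens])

print(np.allclose(all_at_once, one_by_one))  # True

# Changing one token leaves the outputs at every other position unchanged.
edited = tokens.copy()
edited[2] += 1.0
changed = ffn(edited, w1, b1, w2, b2)
others = [0, 1, 3, 4]
print(np.allclose(changed[others], all_at_once[others]))  # True
```

Any interaction between positions therefore has to come from attention, not from the feed-forward block.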
Why transformer FFNs are usually wide
The original transformer used a model dimension of 512 and a feed-forward inner dimension of 2048 in its base setup. That 4× expansion is a good illustration of what the block is for: create a wider intermediate space, apply a non-linearity, then project back down.
Modern models vary the exact ratio, but the pattern remains common because width buys the model a larger representational workspace. That is one big reason feed-forward blocks matter for capacity. They are not just moving numbers around. They are providing a large, learned transformation space at every layer.
There is also a practical downside: wide FFN blocks are expensive. Increasing hidden width can improve quality, but it also increases parameter count, memory usage, and inference cost. That tradeoff is central when teams are trying to balance model quality against runtime constraints.
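To make the cost concrete, the helper below counts parameters for a plain two-layer FFN (no gating), using the 512 and 2048 dimensions mentioned above. The function name `ffn_param_count` is introduced here for illustration.

```python
def ffn_param_count(d_model: int, d_ff: int, bias: bool = True) -> int:
    """Parameters in a two-layer FFN: d_model -> d_ff -> d_model."""
    weights = d_model * d_ff + d_ff * d_model
    biases = (d_ff + d_model) if bias else 0
    return weights + biases

base = ffn_param_count(512, 2048)
print(base)   # 2099712

# Doubling the inner width roughly doubles the block's parameters.
wider = ffn_param_count(512, 4096)
print(wider)  # 4198912
```

That count applies to every layer in the stack, which is why the inner width is such a large share of a model's total parameters.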
Why MLPs matter even when attention is strong
Research on attention-only networks has shown that skip connections and MLPs are not decorative extras. Without them, stacking self-attention layers tends to push every token representation toward the same vector as depth increases, a degeneration often described as rank collapse. In practical terms, that means the MLP block is part of what keeps a transformer useful as a deep model rather than a token-mixing mechanism with weak per-token transformation power.
Step-by-step implementation logic
If you are designing or reviewing an MLP block, a simple implementation checklist is usually enough:
- Define the input dimension. This is your feature count or model dimension.
- Choose an output dimension. In a classifier, this may be the number of classes. In a transformer block, it usually returns to the original model dimension.
- Pick a hidden dimension. Start with the smallest width that can plausibly do the job. In transformer blocks, 2× to 4× expansion is a common baseline.
- Add a non-linearity. ReLU is simple and cheap. GELU is common in transformer-style models.
- Add residual and normalization if the stack is deep. These usually matter once you are beyond very shallow networks.
- Estimate parameter count and latency before scaling up. Hidden width gets expensive fast.
- Validate with real metrics. A larger MLP is only better if validation quality or downstream task performance improves enough to justify the cost.
Concrete examples
Example 1: a small tabular MLP
A fraud model with 40 engineered features might use an MLP shaped like 40 → 128 → 64 → 1. The first dense layer expands the representation, the second refines it, and the final layer emits a risk score. Residual paths may be unnecessary if the model is shallow, but the same design logic still applies.
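A forward pass for that 40 → 128 → 64 → 1 shape might look like the sketch below. It assumes NumPy, ReLU hidden activations, He-style initialization, and a sigmoid on the final score; those choices, the function names, and the random inputs are illustrative assumptions, since the example above only fixes the layer sizes.

```python
import numpy as np

def init_layer(rng, fan_in, fan_out):
    # He-style scaling is an assumption; the example does not specify an initializer.
    w = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
    return w, np.zeros(fan_out)

def fraud_mlp(x, layers):
    h = x
    for w, b in layers[:-1]:
        h = np.maximum(h @ w + b, 0.0)      # hidden layers with ReLU
    w, b = layers[-1]
    logit = h @ w + b                        # final layer emits one raw score
    return 1.0 / (1.0 + np.exp(-logit))      # sigmoid maps it into (0, 1)

rng = np.random.default_rng(0)
sizes = [40, 128, 64, 1]
layers = [init_layer(rng, a, b) for a, b in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(10, 40))    # 10 transactions, random stand-in features
scores = fraud_mlp(x, layers)
print(scores.shape)  # (10, 1)

n_params = sum(w.size + b.size for w, b in layers)
print(n_params)  # 13569
```

At roughly 13.6 thousand parameters, this model is small enough that width and depth can be tuned freely against validation metrics.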
Example 2: a transformer feed-forward block
A transformer layer might receive a token embedding of width 4096, expand it to a much larger hidden dimension, apply a smooth activation or gated variant, and project it back to 4096. Attention has already mixed cross-token context. The MLP now performs the heavy per-token rewriting step.
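One common gated form multiplies a smoothly activated projection by a second, purely linear projection before projecting back. The sketch below assumes NumPy and a SiLU gate, with tiny stand-in widths instead of 4096; the weight names are hypothetical and the exact variant differs between models.

```python
import numpy as np

def silu(x):
    # SiLU / swish: x * sigmoid(x), a smooth activation.
    return x / (1.0 + np.exp(-x))

def gated_ffn(x, w_gate, w_up, w_down):
    gate = silu(x @ w_gate)          # smooth gate values per hidden unit
    up = x @ w_up                    # parallel linear expansion
    return (gate * up) @ w_down      # gate the expansion, then project back

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64           # tiny stand-ins for widths like 4096
w_gate = rng.normal(scale=0.1, size=(d_model, d_hidden))
w_up = rng.normal(scale=0.1, size=(d_model, d_hidden))
w_down = rng.normal(scale=0.1, size=(d_hidden, d_model))

tokens = rng.normal(size=(7, d_model))   # 7 positions, already mixed by attention
out = gated_ffn(tokens, w_gate, w_up, w_down)
print(out.shape)  # (7, 16)
```

The gated form uses three weight matrices instead of two, so models that adopt it often shrink the hidden width to keep the parameter count comparable.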
Example 3: when width becomes the bottleneck
If a model seems able to move information across tokens but still struggles to build useful task-specific features, the FFN width may be too constrained. If the model is already accurate but slow and memory-hungry, the FFN is often one of the first places to inspect for optimization.
Common mistakes
- Calling every dense layer an MLP. A single dense layer is part of an MLP, not automatically an MLP by itself.
- Stacking linear layers with no activation. Without a non-linearity, the extra layers compose into a single linear transform and add no expressive power.
- Treating attention as a replacement for the feed-forward block. Attention mixes tokens, but it does not remove the need for strong per-token transformation.
- Oversizing hidden dimensions by habit. Wider layers increase cost quickly and should be justified by validation gains.
- Ignoring residual and normalization choices. Deep stacks become harder to optimize when these are treated as afterthoughts.
- Using the same design everywhere. Tabular MLPs, recommendation models, and transformer FFNs may all use feed-forward blocks, but they do not need identical widths, activations, or normalization schemes.
A practical checklist you can use after reading
- Can I clearly name the input, hidden, and output dimensions of the block?
- Did I place a real non-linearity between linear layers?
- If this is a transformer, do I understand what attention is doing separately from what the MLP is doing?
- Is the hidden dimension wide enough to add useful capacity, but not so wide that it dominates cost for no measurable gain?
- Are residual connections shape-compatible with the block output?
- Do I know whether the design is pre-norm or post-norm?
- Have I measured memory, latency, and validation quality instead of assuming wider is always better?
The bottom line
Feed-forward layers and MLPs are still some of the most important machinery inside modern neural networks. Dense layers provide learned weighted mixing, activations add non-linearity, hidden dimensions control capacity, residual connections help deep stacks learn corrections, and normalization stabilizes training. In transformers, attention may get more attention, but the feed-forward block is still one of the main places where the model actually builds expressive internal features.