Weights in an LLM are the learned tensor parameters inside the model layers—numbers that get updated during training and then stay fixed during inference. Those tensors determine how token IDs become hidden states, how attention patterns are computed, and how the final hidden states are converted into logits over the vocabulary.
Key Takeaways
- Weights are the trainable tensor parameters (e.g., projection matrices, FFN kernels, layer-norm scale/bias) whose values determine the model’s mapping from token sequences to output logits.
- In a transformer, weights live in linear layers and normalization layers; activations (hidden states) are the per-input values computed at runtime.
- Attention is fundamentally a weighted mixing operation: Q/K/V are produced by linear transforms using weights, and the attention weights multiply V to form new representations.
- The FFN (feed-forward network) uses additional weight matrices to apply nonlinear transformations (typically GELU/SwiGLU) to each token’s hidden state independently.
- Weights are learned by minimizing a next-token prediction loss (cross-entropy) using backpropagation and an optimizer like AdamW.
- Generalization comes from weights encoding distributed statistical structure—features that can be composed across layers and positions.
- Checkpoints store weights as serialized tensors plus configuration; deployment loads those tensors into the runtime and may apply quantization or pruning that changes numerical behavior.
What do “weights” mean in an LLM?
In deep learning, “weights” usually means the parameters of a model: tensors that are learned from data by gradient-based optimization. In an LLM (transformer-based language model), weights are the numeric contents of layers such as linear projections (matrix multiplications) and normalization parameters (scale and bias). They are distinct from activations, which are the intermediate tensors computed for a specific input sequence during the forward pass.
A precise way to separate them is:
- Weights (parameters): fixed after training for a given checkpoint; they select the specific function the model computes from the class of functions the architecture can represent.
- Activations (runtime values): computed per input; they include hidden states, attention outputs, and FFN outputs.
Concretely, consider the attention projection for queries:
[
Q = XW_Q
]
Here (X) is the hidden state activation tensor for a particular batch of tokens (runtime), and (W_Q) is a weight matrix (parameter). If you change (X) while keeping (W_Q) fixed, you change the computed Q values; if you change (W_Q) while keeping (X) fixed, you change the model’s behavior. That “behavior-defining” property is what practitioners mean by weights.
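A minimal sketch of that distinction, using plain Python lists instead of a tensor library so it stays dependency-free. The `matmul` helper and the tiny 2×2 values are illustrative assumptions, not taken from any real model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# W_Q is a weight (parameter): fixed for a given checkpoint.
W_Q = [[1.0, 0.0],
       [0.5, 2.0]]

# X is an activation: it depends on the current input tokens.
X_a = [[1.0, 2.0]]
X_b = [[3.0, -1.0]]

Q_a = matmul(X_a, W_Q)  # same weights, input A
Q_b = matmul(X_b, W_Q)  # same weights, input B: different Q values

# Changing the weights changes the function itself, even for the same input.
W_Q_other = [[0.0, 1.0],
             [1.0, 0.0]]
Q_a_other = matmul(X_a, W_Q_other)
```

Swapping the input changes only the computed values; swapping the matrix changes what the layer does to every input.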
It’s also worth clarifying what weights are not. Weights are not the embedding vectors for a specific token occurrence; those are retrieved from the embedding matrix (which is a weight tensor), but the retrieved embedding for token id 123 is an activation instance of that weight row. Similarly, attention scores are not weights; they are computed from Q and K using dot products and scaling, producing activations that depend on the input.
Finally, weights are typically stored as floating-point tensors (FP32 during training; FP16/BF16 in many inference/training setups). The exact dtype affects numerical behavior, but the conceptual role remains: weights are the learnable constants that define the mapping from token sequences to logits.
Which weight tensors exist inside a transformer?
A transformer layer is mostly a stack of submodules, each with its own parameter tensors. The key groups are the attention projections, the attention output projection, the FFN projections, and normalization parameters. Different architectures vary (pre-norm vs post-norm, GELU vs SwiGLU, tied embeddings vs untied), but the weight “shape” is remarkably consistent across modern LLMs.
Attention projection weights: Q/K/V and output
For multi-head attention, you can think of a single weight block that produces Q, K, V for all heads. If the model hidden size is (d_{model}) and there are (h) heads with head dimension (d_{head}) such that (d_{model}=h\cdot d_{head}), then:
- (W_Q \in \mathbb{R}^{d_{model}\times d_{model}})
- (W_K \in \mathbb{R}^{d_{model}\times d_{model}})
- (W_V \in \mathbb{R}^{d_{model}\times d_{model}})
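Many implementations use a single fused weight matrix for QKV for efficiency, but conceptually it’s three projections. Variants such as grouped-query and multi-query attention share K/V heads across query heads, so in those models (W_K) and (W_V) have fewer output columns than (W_Q).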
After computing attention for each head, the heads are concatenated and projected back to the model dimension using:
- (W_O \in \mathbb{R}^{d_{model}\times d_{model}})
So the attention block has four major weight matrices per layer (Q, K, V, and output projection), plus normalization parameters around the block (discussed below).
FFN weights: up/down (and sometimes gate)
The FFN (also called MLP) is usually a two-layer (or three-layer in gated variants) position-wise network: it transforms each token’s hidden state independently using weights, then applies a nonlinearity.
A common “dense” FFN uses:
- (W_{up} \in \mathbb{R}^{d_{model}\times d_{ff}})
- (W_{down} \in \mathbb{R}^{d_{ff}\times d_{model}})
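where (d_{ff}) is typically larger than (d_{model}): commonly about 4× for GELU FFNs and somewhat smaller (around 8/3×) for gated variants, which spread a similar parameter budget over three matrices instead of two. In gated FFNs like SwiGLU, you also have a second “up” projection that acts as a gate: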
- (W_{gate} \in \mathbb{R}^{d_{model}\times d_{ff}})
Then the FFN computes something like: [ \text{FFN}(x) = \big(\text{SiLU}(xW_{gate}) \odot xW_{up}\big)W_{down} ] where the gate branch passes through the SiLU nonlinearity and then multiplies the up branch elementwise before the down projection.
LayerNorm (and RMSNorm) weights
Transformers use normalization to stabilize training. Two common variants:
- LayerNorm: has learnable scale (\gamma) and bias (\beta) vectors of length (d_{model}).
- RMSNorm: often only has a learnable scale (\gamma) (no bias), which reduces parameters and compute.
In either case, normalization parameters are small compared to the large projection matrices, but they materially affect behavior and must be included in checkpoints.
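The two variants can be sketched for a single hidden vector as follows. This is a plain-Python illustration; the function names and the `eps` value are assumptions, and real implementations operate on batched tensors.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm: center, scale to unit variance, then apply learnable scale and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: divide by the root-mean-square, then scale; no centering, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]        # one token's hidden state (activation)
gamma = [1.0, 1.0, 1.0, 1.0]    # learnable scale (weight)
beta = [0.0, 0.0, 0.0, 0.0]     # learnable bias (weight, LayerNorm only)

ln_out = layer_norm(x, gamma, beta)
rms_out = rms_norm(x, gamma)
```

Only `gamma` and `beta` are parameters here; the mean, variance, and RMS are recomputed from the activations on every forward pass.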
Embedding weights and output head weights
At the model level (outside per-layer blocks), there are also weight tensors:
- Token embedding matrix (E \in \mathbb{R}^{V\times d_{model}}), where (V) is vocab size.
- Output projection (LM head) that maps hidden states to vocabulary logits.
Many LLMs tie embeddings and output weights (using (E) for both input embedding and output projection), reducing parameters and often improving results. When untied, the LM head has its own weight matrix (W_{lm}\in\mathbb{R}^{d_{model}\times V}) (plus bias in some designs).
Weight tensor inventory (conceptual)
Here’s a conceptual mapping for one transformer block (assuming pre-norm and SwiGLU-like FFN; exact details vary by implementation):
| Component | Parameter tensors (typical) | Notes |
|---|---|---|
| Attention QKV | (W_Q, W_K, W_V) | Often fused into one tensor for speed |
| Attention output | (W_O) | Projects concatenated heads back |
| FFN | (W_{up}, W_{gate}, W_{down}) | SwiGLU uses gate + up |
| Norms | (\gamma) (and (\beta) for LayerNorm) | Per-channel vectors of length (d_{model}) |
A sharp distinction: weights vs “structure”
The transformer architecture defines where weights are used and how tensors interact (matrix multiplications, attention softmax, residual connections). But the actual numeric values inside the weights are what encode the learned behavior. Two models with the same architecture but different training runs will produce different logits because the weight tensors differ.
How do weights turn tokens into predictions?
The forward pass is the mechanism by which weights transform input token IDs into logits. The core idea: token IDs become initial embeddings, then a sequence of weighted linear transforms and attention mixing produce a final hidden representation at each position, which is converted into vocabulary scores.
Stage 1: Token IDs → embeddings (weights are looked up)
Given input token IDs (t_1,\dots,t_n), the model retrieves embedding vectors from the token embedding weight matrix (E): [ x_i = E[t_i] ] Stacked into a tensor (X \in \mathbb{R}^{n\times d_{model}}). This is where weights first appear: the embedding matrix is a weight tensor, but the retrieval produces an activation tensor for the current input.
If positional information is added (absolute learned embeddings or rotary positional embeddings), additional operations occur. For rotary embeddings (RoPE), the positional effect is applied by modifying the Q/K computations rather than adding a separate positional embedding tensor.
Stage 2: Hidden states → attention features (weights via linear transforms)
In one transformer block, you typically have a normalization, then projections: [ \tilde{X} = \text{Norm}(X) ] [ Q = \tilde{X}W_Q,\quad K = \tilde{X}W_K,\quad V = \tilde{X}W_V ] Each of these uses weight matrices. After shaping into heads, attention scores are computed: [ S = \frac{QK^\top}{\sqrt{d_{head}}} ] Then apply causal masking (for autoregressive decoding) and softmax: [ A = \text{softmax}(S + M) ] where (M) is the mask (e.g., (-\infty) for disallowed future positions).
Now the attention output is the weighted sum of V: [ \text{AttnOut} = AV ] Finally, the result is projected back: [ Y = \text{AttnOut}W_O ] and combined with residual connections: [ X' = X + Y ]
Notice the role of weights: they create Q/K/V and the output mixing transformation. The softmax attention weights (A) are activations computed per input; they are not parameters.
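The stage above can be sketched as single-head causal attention in plain Python. This is a simplified illustration under stated assumptions: one head, no normalization step, no batching, and helper names (`matmul`, `softmax`, `causal_attention`) that are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(X, W_Q, W_K, W_V, W_O):
    """Single-head causal self-attention over X of shape (n, d); returns (X', A)."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_head = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    A = []
    for i, row in enumerate(scores):
        # Causal mask: position i may only attend to positions j <= i.
        masked = [s / math.sqrt(d_head) if j <= i else float("-inf")
                  for j, s in enumerate(row)]
        A.append(softmax(masked))
    Y = matmul(matmul(A, V), W_O)
    # Residual connection: X' = X + Y
    return [[x + y for x, y in zip(xr, yr)] for xr, yr in zip(X, Y)], A

identity = [[1.0, 0.0], [0.0, 1.0]]      # stand-in weights for the example
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X_out, A = causal_attention(X, identity, identity, identity, identity)
```

The four matrices passed in are the parameters; `A` is rebuilt from scratch for every input, which is why it counts as an activation.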
Stage 3: Hidden states → FFN transformation (weights via MLP)
Next comes the FFN, again usually preceded by normalization: [ \hat{X} = \text{Norm}(X') ] Then the FFN uses weight matrices to expand and transform each token representation:
- For a SwiGLU-like FFN: [ U = \hat{X}W_{up},\quad G = \hat{X}W_{gate} ] [ H = \text{SwiGLU}(U,G) = \text{SiLU}(G)\odot U ] [ \text{FFNOut} = H W_{down} ]
- With GELU FFN, it’s typically: [ \text{FFNOut} = \text{GELU}(\hat{X}W_{up})W_{down} ]
Then residual add: [ X'' = X' + \text{FFNOut} ]
Again, weights are the matrices (W_{up}, W_{gate}, W_{down}). The nonlinearities are fixed functions; they apply pointwise to activations.
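A small sketch of the SwiGLU-style FFN for one token vector, following the equations above. The helper names and the toy dimensions (d_model = 2, d_ff = 3) are assumptions for illustration only.

```python
import math

def silu(v):
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def vecmat(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]

def swiglu_ffn(x, W_up, W_gate, W_down):
    u = vecmat(x, W_up)                            # U = x W_up
    g = vecmat(x, W_gate)                          # G = x W_gate
    h = [silu(gi) * ui for gi, ui in zip(g, u)]    # H = SiLU(G) * U, elementwise
    return vecmat(h, W_down)                       # FFNOut = H W_down

W_up = [[1.0, 0.0, 1.0],
        [0.0, 1.0, 1.0]]          # d_model x d_ff
W_gate = W_up                     # reuse the same values to keep the example small
W_down = [[1.0, 0.0],
          [0.0, 1.0],
          [1.0, 1.0]]             # d_ff x d_model

out = swiglu_ffn([1.0, 2.0], W_up, W_gate, W_down)
```

Because the function takes one token vector at a time, it also makes the "position-wise" property concrete: no information moves between positions inside the FFN.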
Stage 4: Final hidden state → logits → probabilities
After the final transformer block, you get a final hidden state tensor (H_{final}\in\mathbb{R}^{n\times d_{model}}). Apply a final normalization (common in pre-norm transformers): [ \bar{H} = \text{Norm}(H_{final}) ]
Then compute logits over vocabulary: [ \text{logits}_i = \bar{h}_i W_{lm} + b ] If embeddings are tied, (W_{lm}) may be (E^\top).
Autoregressive decoding uses the last position: [ p(t_{n+1}\mid t_{1:n}) = \text{softmax}(\text{logits}_n) ]
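As a rough sketch of this last stage with tied embeddings and no bias: the three-token vocabulary, the values in `E`, and the greedy selection rule are all illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_head(h, E):
    """Tied LM head: the logit for token v is the dot product of h with row E[v]."""
    return [sum(hi * ei for hi, ei in zip(h, row)) for row in E]

E = [[1.0, 0.0],     # embedding row for token 0
     [0.0, 1.0],     # token 1
     [1.0, 1.0]]     # token 2
h_last = [0.5, 2.0]  # normalized hidden state at the last position

logits = lm_head(h_last, E)
probs = softmax(logits)
next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy decoding
```

Multiplying by the rows of `E` is the same as multiplying by (E^\top) in the equation above; sampling strategies other than greedy would draw from `probs` instead of taking the maximum.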
A minimal “weights-to-logits” mental model
You can compress the whole process to this chain:
- Look up embeddings: (x = E[t])
- Repeatedly apply blocks where:
- weights produce Q/K/V and output projections,
- attention mixes activations,
- weights produce FFN transformations.
- Project final hidden state to vocabulary logits using the LM head weights.
The non-obvious but crucial point: attention is not a learned “pattern” directly stored in weights; instead, weights shape representations so that the computed attention scores become useful for the current input.
How are LLM weights learned during training?
Training updates weights so that the model’s predicted token distributions match the training data. In practice, LLM training is almost always next-token prediction with a causal language modeling objective, optimized with backpropagation and an optimizer like AdamW.
Stage 1: Build training examples and compute loss
Given a tokenized sequence (t_{1:n}), the model is trained to predict (t_{i}) from (t_{<i}). The standard causal LM loss is cross-entropy: [ \mathcal{L} = -\sum_{i=1}^{n} \log p(t_i \mid t_{<i}) ] where (p(\cdot)) comes from a softmax over logits produced using the current weights.
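In implementation, you typically shift labels by one position: inputs are tokens (t_{0:n-1}) (treating (t_0) as a start-of-sequence token) and labels are (t_{1:n}), so the logits produced at position (i-1) are scored against (t_i). You compute logits for all positions in one forward pass and apply cross-entropy per position, often masking out padding.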
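The shift-and-score step can be sketched like this. It is a simplified illustration: `causal_lm_loss` and `pad_id` are hypothetical names, the logits are passed in rather than computed by a model, and the loss is averaged over non-padding positions (a common choice, whereas the formula above is written as a sum).

```python
import math

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def causal_lm_loss(tokens, logits_per_position, pad_id=None):
    """Mean next-token loss.

    logits_per_position[i] is the model output after reading tokens[:i + 1],
    so it is scored against tokens[i + 1].
    """
    labels = tokens[1:]                     # inputs would be tokens[:-1]
    losses = [
        cross_entropy(logits_per_position[i], label)
        for i, label in enumerate(labels)
        if label != pad_id                  # mask padding positions
    ]
    return sum(losses) / len(losses)

# Uniform logits over a 3-token vocabulary give a loss of log(3) per position.
uniform_loss = causal_lm_loss([0, 1, 2], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
```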
Stage 2: Backpropagation computes gradients for each weight tensor
The loss is a scalar function of all weights (\theta). Backpropagation applies the chain rule through:
- embedding lookup (gradient flows into embedding weight rows used),
- each attention projection matrix,
- the softmax attention (gradients flow through Q/K/V and the softmax),
- FFN linear layers and nonlinearities,
- normalization operations,
- final LM head.
In a simplified view: [ \nabla_{\theta}\mathcal{L} = \frac{\partial \mathcal{L}}{\partial \theta} ] Every weight tensor (e.g., (W_Q), (W_{up}), layer-norm (\gamma)) receives a gradient.
A practical implementation detail: because the model is huge, training uses automatic differentiation frameworks (PyTorch/XLA/JAX) and relies on memory-efficient techniques like activation checkpointing, fused kernels, and mixed precision to make the gradient computation feasible.
Stage 3: AdamW updates weights with moments and weight decay
AdamW is widely used. For each parameter tensor (\theta), Adam maintains:
- first moment estimate (m_t) (exponential moving average of gradients)
- second moment estimate (v_t) (exponential moving average of squared gradients)
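Then updates parameters roughly as: [ m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t ] [ v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2 ] [ \theta_{t+1} = \theta_t - \alpha\frac{m_t}{\sqrt{v_t}+\epsilon} - \alpha\cdot \lambda\theta_t ] where (g_t=\nabla_\theta \mathcal{L}), (\alpha) is learning rate, and (\lambda) is weight decay. This form omits Adam's bias correction, which rescales (m_t) and (v_t) during the first steps when both moving averages are still close to their zero initialization.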
The “W” in AdamW matters: decoupled weight decay applies a separate shrinkage term rather than mixing it into the gradient. This can improve training stability for large transformer models.
Stage 4: Training loop sketch
A simplified training loop (PyTorch-like) looks like:
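```python
import torch.nn.functional as F

# Assumes `model`, `optimizer` (e.g. torch.optim.AdamW) and `loader` already exist,
# and that `labels` are already shifted by one position relative to `input_ids`.
for step, batch in enumerate(loader):
    input_ids, labels = batch["input_ids"], batch["labels"]
    logits = model(input_ids)                  # (batch, seq, vocab); uses current weights
    loss = F.cross_entropy(                    # scalar; flatten batch and sequence axes
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
    )
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                            # gradients for all weight tensors
    optimizer.step()                           # AdamW update of weights
```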
In real training, you’ll also see:
- gradient accumulation (simulate larger batch size),
- learning rate schedules (warmup + cosine decay),
- mixed precision (FP16 with loss scaling, or BF16, which usually does not need it),
- distributed data parallel (shard batches across GPUs).
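As one example from the list above, a warmup-plus-cosine learning rate schedule can be sketched as a pure function of the step index. The function name and all default values are illustrative assumptions, not recommendations.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s) for s in (0, 50, 99, 100, 550, 1000)]
```

In a training loop, the returned value would be written into the optimizer's learning rate before each step; frameworks typically wrap this in a scheduler object.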
A sharp distinction: weights vs optimizer state
After each optimizer step, the weights (\theta) are updated, but the optimizer also stores state tensors (m_t) and (v_t) for each parameter. Those are not model weights; they’re training-time variables used to compute future updates. At inference time, you load only the model weights (plus any required config), not the optimizer state.
What do weights “store” and why do they generalize?
Weights don’t store “facts” in a literal key-value database sense. They store parameters of a high-dimensional function that maps contexts to next-token distributions. The reason this generalizes is that the model learns distributed representations and compositional features that can be reused in new contexts.
Weights encode statistical structure
During training, the optimizer pushes weights so that the model’s output distribution matches observed next-token frequencies and their dependencies. In transformer terms, weights shape:
- how token embeddings map into hidden representations,
- how attention heads select and combine information,
- how FFN layers transform representations into more linearly separable features for the final LM head.
A useful analogy: weights define a giant set of learned basis functions, and activations are the values those functions take on a particular input. Generalization happens when the learned basis functions capture reusable structure across the training distribution.
Distributed representations: features spread across dimensions
A single “feature” is rarely a single neuron. Instead, information is distributed across many dimensions of the hidden state vectors. That means even if a specific combination of tokens hasn’t appeared exactly, the model can still activate a similar pattern of features because the relevant structure is captured in the weights.
Empirically, this is consistent with observations from mechanistic interpretability research: attention heads and MLP neurons often develop roles like copying, induction, or pattern detection, but those roles are distributed and context-dependent rather than hard-coded.
Composition across layers
Transformers stack multiple blocks. Early layers tend to learn more local or surface-level transformations; later layers refine and compose features. Composition is enabled because each block applies learned linear maps plus nonlinearities and residual connections: [ X_{l+1} = X_l + f_l(X_l; W_l) ] where (f_l) is the sublayer function parameterized by weights (W_l). Residual connections preserve information flow, allowing later layers to build on representations rather than relearn from scratch.
Why this improves out-of-distribution behavior (up to a point)
Generalization is not magic; it’s approximation. Weights learn a mapping that performs well on the training distribution and nearby regions in representation space. When inputs are too far (different grammar, different domain vocabulary, adversarial prompts), the hidden states may activate unfamiliar feature combinations, and predictions degrade.
So the “why” is: weights learn a function that is smooth enough in the learned representation space to handle variations. But the smoothness is induced by training data and architecture, not guaranteed.
How do weights relate to checkpoints, formats, and loading?
A model checkpoint is the serialized form of the learned weight tensors plus metadata describing the architecture. At inference time, loading a checkpoint means reconstructing the model structure and copying the stored tensors into the corresponding parameter shapes.
What a checkpoint contains
A typical checkpoint includes:
- Weight tensors: e.g., embedding matrix, QKV/O matrices, FFN matrices, layer-norm parameters, LM head.
- Configuration: vocab size, hidden size, number of layers, number of heads, FFN dimension, normalization type, positional encoding type, etc.
- Sometimes: tokenizer metadata or references, and sometimes optimizer state (usually omitted for inference).
If the config doesn’t match the checkpoint tensor shapes, the load will fail or produce incorrect behavior.
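A shape check of this kind can be sketched as follows. The tensor names (`attn.W_Q`, `ffn.W_up`, and so on) and config keys are hypothetical; real checkpoints use architecture-specific names, and this sketch only covers one block with a gated FFN and scale-only norms.

```python
def expected_shapes(config):
    """Expected parameter shapes for one block, following the inventory in this article."""
    d, d_ff = config["d_model"], config["d_ff"]
    return {
        "attn.W_Q": (d, d), "attn.W_K": (d, d), "attn.W_V": (d, d), "attn.W_O": (d, d),
        "ffn.W_up": (d, d_ff), "ffn.W_gate": (d, d_ff), "ffn.W_down": (d_ff, d),
        "norm1.gamma": (d,), "norm2.gamma": (d,),
    }

def find_mismatches(config, checkpoint_shapes):
    """Return human-readable problems; an empty list means the shapes agree."""
    problems = []
    for name, shape in expected_shapes(config).items():
        actual = checkpoint_shapes.get(name)
        if actual is None:
            problems.append(f"missing tensor: {name}")
        elif tuple(actual) != shape:
            problems.append(f"{name}: expected {shape}, found {tuple(actual)}")
    return problems

config = {"d_model": 4, "d_ff": 8}
bad_checkpoint = dict(expected_shapes(config))
bad_checkpoint["ffn.W_up"] = (4, 16)       # simulate a config/checkpoint mismatch
problems = find_mismatches(config, bad_checkpoint)
```

Checking shapes before copying tensors turns a silent mismatch into an explicit error message.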
Serialization formats: common patterns
Common formats include:
- PyTorch: `.pt`/`.bin` using `state_dict` serialization.
- Safetensors: `.safetensors` stores tensors with a safer loading mechanism (no code execution on load).
- Sharded checkpoints: split weights across multiple files for huge models.
The exact on-disk format varies, but the logical content is the same: named tensors with specific shapes and dtypes.
Weight loading: what happens at runtime
Inference frameworks load weights into memory (often GPU). The forward pass uses these tensors in matrix multiplications. For performance, frameworks may also:
- fuse operations (e.g., fused QKV projection),
- use specialized kernels (FlashAttention),
- convert weights to inference dtype (FP16/BF16/INT8/INT4).
Comparison: tied vs untied embeddings affects checkpoint tensors
| Design choice | What’s stored | Impact |
|---|---|---|
| Tied embeddings | One embedding tensor reused for LM head | Fewer parameters; logits computed using same matrix |
| Untied embeddings | Separate LM head weight tensor | More parameters; sometimes flexibility |
When you inspect a checkpoint, this difference shows up as whether the LM head tensors are present separately or referenced from the embedding tensor.
Minimal load example (conceptual)
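```python
import torch
from safetensors.torch import load_file

weights = load_file("model.safetensors")  # dict mapping tensor names to torch.Tensor

# Example: copy one stored tensor into the matching model parameter.
# `model` and its attribute path are illustrative; real names depend on the architecture.
with torch.no_grad():
    model.transformer.layers[0].attn.W_Q.weight.copy_(weights["...W_Q.weight"])
```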
In practice you use a library (Hugging Face Transformers, Megatron-LM, vLLM, TensorRT-LLM) that maps checkpoint tensor names to model parameter objects automatically.
How do weight quantization and pruning change behavior?
Quantization and pruning modify the weight tensors (and sometimes the computation graph) to reduce memory bandwidth and improve latency. They change behavior because they alter the numerical values used in matrix multiplications and the effective capacity of the model.
Quantization: changing dtype and representing weights approximately
Quantization maps floating-point weights to lower-precision formats like INT8 or INT4. The simplest view:
- FP16/BF16 weights are replaced with integer representations plus scale factors.
- During inference, the runtime uses integer or mixed-precision kernels to compute approximate outputs.
A typical INT8 scheme stores:
- int8 weights (W_q)
- per-channel or per-group scale (s)
- dequantization uses (W \approx s \cdot W_q)
This approximation introduces error in linear layers. The error propagates through attention and FFN transformations, affecting logits.
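A minimal sketch of symmetric per-channel INT8 quantization, matching (W \approx s \cdot W_q). Treating each row of the stored matrix as one channel is an assumption made for the example; which axis counts as the channel depends on the layout, and production methods add calibration on top of this.

```python
def quantize_row(row, num_bits=8):
    """Symmetric quantization of one channel: returns (integer codes, scale)."""
    q_max = 2 ** (num_bits - 1) - 1                  # 127 for INT8
    max_abs = max(abs(w) for w in row)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    codes = [max(-q_max, min(q_max, round(w / scale))) for w in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Reconstruct approximate float weights: W is approximately scale * W_q."""
    return [scale * c for c in codes]

W = [[0.10, -0.52, 0.33],
     [2.00, -1.00, 0.01]]

quantized = [quantize_row(row) for row in W]
W_hat = [dequantize_row(codes, scale) for codes, scale in quantized]
max_err = max(abs(a - b) for ra, rb in zip(W, W_hat) for a, b in zip(ra, rb))
```

The rounding error per weight is at most half a scale step, so a channel with one large outlier gets a coarse grid for all of its smaller weights. That is one reason per-channel or per-group scales tend to work better than a single scale for the whole tensor.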
Why quantization can fail in specific ways
Quantization failure modes often show up as:
- degraded perplexity (worse next-token accuracy),
- instability in softmax attention due to score scaling errors,
- “rare token” issues where small logits are perturbed more relative to their magnitude.
To mitigate this, modern quantization methods use calibration, per-channel scales, and sometimes keep sensitive components (like layer-norms or embedding layers) in higher precision.
Pruning: removing weights (sparsity)
Pruning sets some weights to zero (or removes them structurally), creating sparsity. There are two broad types:
- Unstructured pruning: arbitrary weights set to zero; effective only if the runtime exploits sparsity efficiently.
- Structured pruning: remove entire channels/neurons/heads; easier to accelerate on many hardware backends.
Pruning reduces capacity, which can harm accuracy. However, if pruning targets weights that contribute least to the loss (often determined by magnitude or gradient-based importance), the accuracy drop can be controlled.
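Magnitude-based unstructured pruning can be sketched as below. `magnitude_prune` is a hypothetical helper; it uses absolute value as the importance score and may zero slightly more than the requested fraction when several weights tie at the threshold.

```python
def magnitude_prune(weights, sparsity):
    """Zero out (roughly) the smallest-magnitude fraction `sparsity` of entries."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)             # number of entries to remove
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

W = [[0.1, -2.0],
     [0.5, -0.05]]
W_pruned = magnitude_prune(W, sparsity=0.5)
```

The result has the same shape as the input, which is why unstructured pruning saves time only when the runtime has sparse kernels; structured pruning would instead delete whole rows or columns and shrink the matrices.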
Comparison: quantization vs pruning
| Technique | What changes | Typical benefit | Typical risk |
|---|---|---|---|
| Quantization | Numerical precision of weights (and sometimes activations) | Lower memory + faster matmuls | Accumulated approximation error; attention instability |
| Pruning | Remove or zero weights to create sparsity | Smaller effective model; potential speedups | Capacity loss; irregular sparsity overhead |
A practical perspective: why deployment stacks both
In real deployments, teams often combine:
- quantization to reduce weight memory bandwidth,
- pruning or structured sparsity to reduce compute,
- kernel fusion and attention optimizations to reduce overhead.
But combining them increases the risk of compounding errors. That’s why production pipelines typically include regression tests on perplexity/accuracy proxies and latency budgets.
Frequently Asked Questions
Are weights the same as embeddings?
Embeddings are a type of weight. The token embedding matrix (E \in \mathbb{R}^{V\times d_{model}}) is a weight tensor. When you feed a token ID into the model, the embedding lookup selects a row of (E), producing an embedding vector for that particular token occurrence. That selected vector is an activation instance derived from weights. So:
- The embedding matrix is weights.
- The embedding vector for token id 42 in a specific batch is an activation (a runtime tensor) produced by indexing into weights.
If the model uses tied embeddings, the same weight tensor (E) may also be used in the LM head to compute logits. In that case, weights directly influence both input representation and output scoring.
How many weights does an LLM have?
The number of weights depends on architecture hyperparameters: vocab size (V), hidden size (d_{model}), number of layers (L), number of heads (h), and FFN dimension (d_{ff}). A rough rule of thumb for transformer parameter scaling is that the dominant weights come from attention projections and FFN matrices, which scale like:
- attention: (\mathcal{O}(L\cdot d_{model}^2))
- FFN: (\mathcal{O}(L\cdot d_{model}\cdot d_{ff}))
The embeddings and LM head add (\mathcal{O}(V\cdot d_{model})). For very large models, the projection matrices dominate, and vocab size is a smaller fraction, though for smaller models vocab can matter more.
If you want the exact count, you sum tensor sizes from the config and whether embeddings are tied.
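A rough counting function along those lines is sketched below. It rests on simplifying assumptions: standard multi-head attention with full-width K/V projections, no bias terms, and one scale vector per norm. The function name and the example hyperparameters are illustrative.

```python
def count_parameters(vocab_size, d_model, n_layers, d_ff,
                     gated_ffn=True, tied_embeddings=True):
    """Approximate parameter count for a decoder-only transformer."""
    attention = 4 * d_model * d_model                    # W_Q, W_K, W_V, W_O
    ffn = (3 if gated_ffn else 2) * d_model * d_ff       # up, (gate), down
    norms = 2 * d_model                                  # one scale vector per sublayer
    per_layer = attention + ffn + norms
    embeddings = vocab_size * d_model
    lm_head = 0 if tied_embeddings else vocab_size * d_model
    final_norm = d_model
    return n_layers * per_layer + embeddings + lm_head + final_norm

total = count_parameters(vocab_size=32000, d_model=4096, n_layers=32,
                         d_ff=11008, tied_embeddings=False)
embedding_share = (2 * 32000 * 4096) / total
```

With these example values the total is about 6.7 billion, and the embedding plus LM head matrices account for only about 4% of it, which illustrates why the projection matrices dominate at this scale.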
Do weights update during inference?
No—at least not in standard inference. Inference runs the forward pass with weights fixed. Gradients are not computed, and parameters are not modified. However, there are exceptions:
- Online learning / continual learning: weights are updated during or after inference (rare in typical serving).
- Test-time training / adaptation: some methods update a subset of parameters (e.g., adapters or LoRA weights) for a specific prompt.
- Safety or personalization: some systems fine-tune lightweight layers on user data.
But for a typical LLM serving stack, weights are constant after loading the checkpoint.
What’s the difference between weights and activations?
Weights are parameters learned during training and stored in the checkpoint. Activations are the intermediate tensors computed from weights and the current input. Activations change with every batch; weights are constant for a given checkpoint.
A useful mental model is: weights define the “rules,” activations are the “state” produced by those rules for a particular input.
Why do small weight changes sometimes cause big output changes?
Because LLMs are deep and nonlinear. A small change in a weight matrix can alter hidden representations, which then changes attention patterns (softmax over dot products), which can cascade through residual connections across many layers. Softmax can be sensitive: a tiny change in scores can shift probabilities if logits are close.
Additionally, the final logits are a linear projection of the last hidden state. If the hidden state changes direction slightly, the dot products with vocabulary vectors can change enough to flip the top token.
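A toy illustration of that sensitivity: when two candidates are nearly tied, a perturbation of 0.02 in one logit is enough to change the greedy choice. The logit values are made up for the example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

logits = [2.00, 1.99, -1.00]    # two near-tied candidates
nudged = [2.00, 2.01, -1.00]    # second logit moved by only 0.02

top_before = argmax(softmax(logits))
top_after = argmax(softmax(nudged))
prob_shift = abs(softmax(nudged)[1] - softmax(logits)[1])
```

The probability of the second token moves by less than one percentage point, yet the top-ranked token flips; once a different token is emitted, every later step conditions on it.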
Are attention patterns stored in weights?
Not directly. Attention patterns (the softmax weights (A)) are activations computed per input from Q and K: [ A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_{head}}}\right) ] Weights determine how Q and K are computed, which shapes which tokens attend to which. But the actual attention matrix is computed at runtime.
You can think of weights as shaping the geometry so that, for a given context, the dot products produce useful attention distributions.
What happens to weights when you use LoRA or adapters?
LoRA doesn’t replace the original weights; it adds low-rank update matrices. During training of LoRA, the base weights may be frozen while new parameters (the low-rank factors) are learned. At inference, the effective weight used in linear layers becomes: [ W_{eff} = W_{base} + \Delta W ] where (\Delta W) is low-rank. So the system uses both base weights and adapter weights, and only the adapter weights are changed relative to the checkpoint you started from.
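The effective weight can be sketched as follows, using this article's row-vector convention (x W). The factor names `down` and `up`, the helper name, and the `alpha / r` scaling are assumptions based on how LoRA is commonly set up, not a specific library's API.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(W_base, down, up, alpha, r):
    """W_eff = W_base + (alpha / r) * (down @ up).

    down has shape (d_in, r) and up has shape (r, d_out), so the update
    has rank at most r while W_base itself is left untouched.
    """
    delta = matmul(down, up)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W_base, delta)]

W_base = [[1.0, 0.0],
          [0.0, 1.0]]          # frozen base weight
down = [[1.0], [2.0]]          # d_in x r, with r = 1
up = [[0.5, 0.0]]              # r x d_out

W_eff = lora_effective_weight(W_base, down, up, alpha=2.0, r=1)
```

Serving stacks can either merge the update into the base matrix once, as here, or keep the factors separate and add their contribution at runtime, which makes it cheap to swap adapters.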
Do quantized weights remain “the same model”?
They remain intended to approximate the same model function, but numerically they are different. Quantization changes the representation of weights, which changes outputs. In well-calibrated systems, the degradation is small enough for target tasks, but it’s not identical.
For sensitive tasks (e.g., exact reasoning steps or low-probability tokens), quantization error can matter. That’s why deployment uses task-specific evaluation after quantization.
Conclusion
Weights in LLMs are the learned tensor parameters that govern behavior: they define the linear maps that produce Q/K/V, transform hidden states through FFNs, scale and shift normalized activations, and finally project hidden states into vocabulary logits. Training updates these tensors via backpropagation and optimizers like AdamW, while checkpoints serialize them so inference can load fixed parameters and run the forward pass deterministically (modulo quantization/precision choices).
Single most actionable next step: inspect a real checkpoint and map a few named tensors (e.g., q_proj, k_proj, v_proj, mlp.up_proj, lm_head) to the equations in the forward pass, then verify their shapes against the model config.
Two adjacent advanced topics worth reading next: (1) mechanistic interpretability of attention heads and MLP features, and (2) post-training quantization methods (GPTQ/AWQ-style) and their calibration mechanics.