
From RNNs to Transformers

Natural Language Processing has undergone a massive evolution in recent years. To understand state-of-the-art models, we first need to look back at how we used to process sequences and the critical bottleneck that led to the invention of Attention.

The Foundation: Seq2Seq with RNNs

In traditional Sequence-to-Sequence (Seq2Seq) models—commonly used for tasks like machine translation—the architecture relies on Recurrent Neural Networks (RNNs) and consists of two main components:

  • Encoder RNN: Processes the input sequence $x = (x_1, x_2, \dots, x_T)$ and produces a sequence of hidden states $h = (h_1, h_2, \dots, h_T)$.
  • Decoder RNN: Generates the output sequence $y = (y_1, y_2, \dots, y_{T'})$ step by step.

The encoder’s final hidden state $h_T$ is passed as the initial hidden state of the decoder. This vector—often called the context vector—carries a heavy burden: it must encode all necessary information from the entire input sequence.

The Context Bottleneck
When dealing with long input sequences, a single fixed-size context vector struggles to capture all relevant details. Information inevitably gets lost or diluted, leading to poor model performance on longer text.

The Solution: The Attention Mechanism

To overcome the context bottleneck, the attention mechanism was introduced. It allows the decoder to dynamically focus on different parts of the input sequence at each decoding step, rather than relying on a single static vector.

How It Works Step-by-Step

At decoder timestep $t$, the model performs the following operations:

  1. Compute Alignment Scores: Calculate the relevance between the current decoder state $s_{t-1}$ and each encoder hidden state $h_i$:

$$e_{ti} = f_{\text{attn}}(s_{t-1}, h_i)$$

A common mathematical choice for this is:

$$e_{ti} = v_a^\top \tanh(W_a [s_{t-1}; h_i])$$

(Where $W_a$ and $v_a$ are learnable parameters, and $[\cdot\,;\cdot]$ denotes concatenation.)

  2. Normalize Scores: Use a softmax function to convert the scores into probabilities (attention weights):

$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j=1}^{T} \exp(e_{tj})}$$

  3. Compute the Context Vector: Create a weighted sum of the encoder hidden states using the attention weights:

$$c_t = \sum_{i=1}^{T} \alpha_{ti} h_i$$

  4. Decode: Use the new, time-dependent context vector $c_t$ in the decoder. This is typically done by concatenating $c_t$ with the decoder input $y_{t-1}$, or feeding $[c_t; s_{t-1}]$ directly into the output layer to predict $y_t$.
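To make the steps concrete, here is a minimal NumPy sketch of one decoding step of additive (Bahdanau-style) attention. The names (`additive_attention`, `s_prev`, `H`, `W_a`, `v_a`) are illustrative stand-ins, not from any particular library:

```python
import numpy as np

def additive_attention(s_prev, H, W_a, v_a):
    """One decoding step of additive (Bahdanau-style) attention.

    s_prev : (D_s,)    previous decoder state s_{t-1}
    H      : (T, D_h)  encoder hidden states h_1..h_T
    W_a    : (D_a, D_s + D_h) and v_a : (D_a,)  learnable parameters
    """
    T = H.shape[0]
    # Step 1: alignment scores e_{ti} = v_a^T tanh(W_a [s_{t-1}; h_i])
    concat = np.concatenate([np.tile(s_prev, (T, 1)), H], axis=1)  # (T, D_s + D_h)
    e = np.tanh(concat @ W_a.T) @ v_a                              # (T,)
    # Step 2: softmax over the input positions -> attention weights
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Step 3: context vector c_t = sum_i alpha_{ti} h_i
    c_t = alpha @ H                                                # (D_h,)
    return c_t, alpha
```

Step 4 would then concatenate the returned `c_t` with the decoder input and continue decoding as described above.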

Key Advantages of Attention

  • Dynamic Focus: The model attends to the most relevant input tokens for each specific output step.
  • No Fixed Bottleneck: The full input sequence remains accessible throughout the entire decoding process.
  • Fully Differentiable: Attention weights are learned end-to-end via backpropagation, requiring no external supervision for alignment.

Generalizing to Attention Layers

In the Seq2Seq model, we can think of the decoder RNN states as Query Vectors and the encoder RNN states as Data Vectors. These get transformed to output Context Vectors. This specific operation is so powerful that it was extracted into a standalone, general-purpose neural network component: the Attention Layer.

To prevent vanishing gradients caused by large similarities saturating the softmax function, modern layers use Scaled Dot-Product Attention. Furthermore, separating the data into distinct “Keys” and “Values” increases flexibility, and processing multiple queries simultaneously maximizes parallelizability.

The Inputs

  • Query vectors $Q$ [Dimensions: $N_Q \times D_Q$]
  • Data vectors $X$ [Dimensions: $N_X \times D_X$]
  • Key matrix $W_K$ [Dimensions: $D_X \times D_Q$]
  • Value matrix $W_V$ [Dimensions: $D_X \times D_V$]

(Note: The key and value matrices are learnable parameters; because the whole computation is differentiable, the attention weights are effectively learned end-to-end through backpropagation.)

The Computation Process

  1. Keys: The data vectors are projected into the key space.

$$K = X W_K \quad [N_X \times D_Q]$$

  2. Values: The data vectors are projected into the value space.

$$V = X W_V \quad [N_X \times D_V]$$

  3. Similarities: Calculate the dot product between queries and keys, scaled by the square root of the query dimension.

$$E = Q K^\top / \sqrt{D_Q} \quad [N_Q \times N_X], \qquad E_{ij} = Q_i \cdot K_j / \sqrt{D_Q}$$

  4. Attention Weights: Normalize the similarities using a softmax along dimension 1 (i.e., over the $N_X$ data positions).

$$A = \text{softmax}(E, \text{dim}=1) \quad [N_Q \times N_X]$$

  5. Output Vectors: Generate the final output as a weighted sum of the values.

$$Y = A V \quad [N_Q \times D_V], \qquad Y_i = \sum_j A_{ij} V_j$$
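The five-step computation can be written down almost verbatim in NumPy. This is an illustrative sketch, not any particular library's API:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(Q, X, W_K, W_V):
    """General attention layer: queries Q attend over data vectors X.

    Q   : (N_Q, D_Q)  query vectors
    X   : (N_X, D_X)  data vectors
    W_K : (D_X, D_Q), W_V : (D_X, D_V)  learnable projections
    """
    K = X @ W_K                          # 1. keys          (N_X, D_Q)
    V = X @ W_V                          # 2. values        (N_X, D_V)
    E = Q @ K.T / np.sqrt(Q.shape[1])    # 3. similarities  (N_Q, N_X)
    A = softmax(E, axis=1)               # 4. weights: each row sums to 1
    return A @ V                         # 5. outputs       (N_Q, D_V)
```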

Flavors of Attention

Depending on how we route the data, attention layers take on different properties:

  • Cross-Attention Layer: The data vectors and the query vectors come from two completely different sets of data.
  • Self-Attention Layer: The exact same set of vectors is used for both the data and the query. Because self-attention is permutation equivariant (it doesn’t inherently know the sequence order), we must add positional encoding to inject position information into each input vector.
  • Masked Self-Attention: We override certain similarities with negative infinity ($-\infty$) before the softmax step. This strictly controls which inputs each vector is allowed to “look at” (often used to prevent looking into the future during text generation).
  • Multi-Headed Attention: We run $H$ copies of Self-Attention in parallel, each with its own independent weights (called “heads”). We then stack the $H$ independent outputs and use a final output projection matrix to fuse the data together.
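As a small illustration of the masked flavor, the single-head NumPy sketch below (illustrative names, not a library API) overrides future positions with negative infinity so each token can only attend to itself and earlier tokens:

```python
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    """Masked (causal) self-attention: token i may only attend to j <= i.

    X : (N, D) input vectors; W_Q, W_K, W_V : (D, D_k) projections.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    E = Q @ K.T / np.sqrt(Q.shape[1])        # (N, N) similarities
    # Override future positions with -inf; softmax then gives them zero weight
    mask = np.tril(np.ones_like(E)) == 1
    E = np.where(mask, E, -np.inf)
    E = E - E.max(axis=1, keepdims=True)     # stabilize before exponentiating
    A = np.exp(E)
    A /= A.sum(axis=1, keepdims=True)
    return A @ V
```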

The Rise of Transformers

Transformers Paper

Under the hood, self-attention boils down to four highly optimized matrix multiplications. However, calculating attention across every token pair requires $O(N^2)$ compute.

By taking these attention layers and building around them, we get the modern Transformer:

The Transformer Block
A single block consists of a self-attention layer, a residual connection, Layer Normalization, an MLP applied independently to each output vector, followed by another residual connection and a final Layer Normalization.

Because they discard the sequential nature of RNNs, Transformers are incredibly parallelizable. Ultimately, a full Transformer Model is simply a stack of these identical, highly efficient blocks working together to process complex sequential data.

Applications

The true power of the Transformer lies in its versatility. By simply changing how data is pre-processed and fed into the model, the exact same attention mechanism can solve vastly different problems.

Large Language Models (LLMs)

Modern text-generation giants (like GPT-4 or Gemini) are primarily built on decoder-only Transformers. Here is how the pipeline flows:

  1. Embedding: The model begins with an embedding matrix that converts discrete words (or sub-word tokens) into continuous, dense vectors.
  2. Masked Self-Attention: These vectors are passed through stacked Transformer blocks. Crucially, these blocks use Masked Multi-Headed Self-Attention. The mask prevents the model from “cheating” by looking at future words, forcing it to learn sequence dependencies based only on past context.
  3. Projection: After the final Transformer block, a learned projection matrix transforms each vector into a set of scores (logits) mapping to every word in the model’s vocabulary. A softmax function converts these into probabilities to predict the next word.

Vision Transformers (ViTs)

Who says Transformers are only for text? In 2020, researchers proved that the exact same architecture could achieve state-of-the-art results on images.

  1. Patching: Instead of tokens, a ViT breaks an image down into a grid of fixed-size patches (e.g., 16x16 pixels).
  2. Flattening: These 2D patches are flattened into 1D vectors and passed through a linear projection layer.
  3. Positional Encoding: Because the model processes all patches simultaneously, positional encodings are added to retain the image’s 2D spatial relationships.
  4. Unmasked Attention: Unlike LLMs, ViTs use an encoder-only architecture. There is no masking—the model is allowed to attend to the entire image at once to understand global context.
  5. Pooling: At the end of the transformer blocks, the output vectors are pooled (or a special [CLS] classification token is used) to make a final prediction about the image.
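Steps 1–2 of the pipeline above amount to a reshape plus a linear map. A minimal sketch of the patching step under assumed shapes (the `patchify` name is illustrative):

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into a sequence of flattened patches.

    Returns an array of shape (H*W / patch**2, patch*patch*C),
    one row per patch, ready for the linear projection layer.
    """
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0, "image must divide evenly into patches"
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/p, W/p, p, p, C): group by patch
    return x.reshape(-1, patch * patch * C)   # flatten each patch to a vector
```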

Modern Architectural Upgrades

The original “vanilla” Transformer from the 2017 Attention is All You Need paper is rarely used exactly as written today. Researchers have introduced several key modifications to make models train faster, scale larger, and perform better.

Pre-Norm (vs. Post-Norm)

The original Transformer applied Layer Normalization after adding the residual connection (Post-Norm). Modern architectures apply it before the Attention and MLP blocks (Pre-Norm). This seemingly minor change drastically improves training stability, allowing researchers to train much deeper networks without the gradients blowing up or vanishing.
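The difference is easiest to see in code. A minimal sketch, where `sublayer` stands for either the attention or the MLP sub-block (names are illustrative; LayerNorm gain and bias omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over the last axis (gain/bias omitted for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Post-Norm (original 2017 design): normalize AFTER the residual addition.
def post_norm(x, sublayer):
    return layer_norm(x + sublayer(x))

# Pre-Norm (modern design): normalize the sublayer INPUT; the residual path
# stays a clean identity, which keeps gradients stable in very deep stacks.
def pre_norm(x, sublayer):
    return x + sublayer(layer_norm(x))
```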

RMSNorm (Root Mean-Square Normalization)

Standard Layer Normalization is computationally expensive because it requires calculating the mean to center the data. RMSNorm is a leaner alternative that drops the mean-centering step entirely, scaling the activations purely by their Root Mean Square. This makes training slightly more stable and noticeably faster.

Given an input vector $x$ of shape $D$ and a learned weight parameter $\gamma$ of shape $D$, the output $y$ is calculated as:

$$y_i = \frac{x_i}{\text{RMS}(x)} \cdot \gamma_i$$

Where the Root Mean Square is defined as:

$$\text{RMS}(x) = \sqrt{\epsilon + \frac{1}{D} \sum_{i=1}^{D} x_i^2}$$

(Note: $\epsilon$ is a very small number added to prevent division by zero.)
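A direct NumPy transcription of the two formulas (an illustrative sketch, not the implementation from any specific library):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: rescale by the root mean square; no mean-centering step.

    x, gamma : (D,) input vector and learned per-dimension gain.
    """
    rms = np.sqrt(eps + np.mean(x * x))
    return (x / rms) * gamma
```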

SwiGLU Activation in MLPs

Inside a Transformer block, the output of the attention layer is passed through a Multi-Layer Perceptron (MLP). To understand the modern upgrade, let’s look at the classic setup versus the new standard.

The Classic MLP:

  • Input: $X$ $[N \times D]$
  • Weights: $W_1$ $[D \times 4D]$ and $W_2$ $[4D \times D]$
  • Output: $Y = \sigma(XW_1)W_2$ $[N \times D]$

Modern models (like LLaMA) have replaced this with the SwiGLU (Swish-Gated Linear Unit) architecture, which introduces a gating mechanism via element-wise multiplication ($\odot$):

The SwiGLU MLP:

  • Input: $X$ $[N \times D]$
  • Weights: $W_1$ and $W_2$ $[D \times H]$, plus $W_3$ $[H \times D]$
  • Output: $Y = (\sigma(XW_1) \odot XW_2)W_3$ $[N \times D]$

To ensure this new architecture doesn’t inflate the model’s size, researchers typically set the hidden dimension $H = 8D/3$, which keeps the total parameter count identical to the classic MLP.
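As a sketch (illustrative names; $\sigma$ taken to be the Swish/SiLU activation, as in LLaMA; biases omitted):

```python
import numpy as np

def silu(x):
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(X, W1, W2, W3):
    """SwiGLU MLP: the Swish branch gates the linear branch element-wise.

    X : (N, D); W1, W2 : (D, H); W3 : (H, D)
    """
    return (silu(X @ W1) * (X @ W2)) @ W3

# With H = 8D/3 the parameter count 3*D*H equals the classic MLP's 8*D^2.
```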

Interestingly, while SwiGLU consistently yields better performance and smoother optimization, the original authors famously quipped about its empirical nature in their paper:

“We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.”

Mixture of Experts (MoE)

As models grow, compute costs skyrocket. MoE is a clever architectural trick to increase a model’s parameter count (its “knowledge”) without proportionately increasing the compute required to run it.

  • How it works: Instead of a single, massive MLP layer in each Transformer block, the model learns $E$ separate, smaller sets of MLP weights. Each of these smaller MLPs is considered an “expert.”
  • Routing: When a token passes through the layer, a learned routing network decides which experts are best suited to process that specific token. Each token gets routed to a subset of the experts. These are the active experts.
  • The Benefit: This is called Sparse Activation. A 70-billion parameter MoE model might only activate 12 billion parameters per token. You get the capacity of a massive model with the speed and cost of a much smaller one.
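A toy sketch of the routing logic for a single token (all names are illustrative; real implementations batch this and add load-balancing losses):

```python
import numpy as np

def moe_layer(x, router_W, experts, k=2):
    """Route one token through its top-k experts (sparse activation).

    x        : (D,) token vector
    router_W : (D, E) routing weights
    experts  : list of E callables, each mapping (D,) -> (D,)
    """
    logits = x @ router_W
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                          # (E,) routing probabilities
    active = np.argsort(gate)[-k:]              # indices of the k active experts
    w = gate[active] / gate[active].sum()       # renormalize over the active set
    # Only the k active experts run; the rest cost no compute for this token
    return sum(wi * experts[i](x) for wi, i in zip(w, active))
```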

Reference

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems (pp. 5998–6008).

repetition_penalty: What It Does and How It Is Implemented

1. How It Works

When running LLM inference, the model sometimes falls into endless repetition: it keeps emitting the same token or token sequence and never terminates. The transformers library provides a parameter, repetition_penalty, specifically to mitigate this. Setting it to a float greater than 1.0 (e.g., 1.05, 1.1, 1.2) can relieve the repetition problem in some cases. The idea was proposed in the 2019 CTRL paper.

So how does this parameter reduce repetition? The implementation is actually quite simple: for each token that has already appeared, a repetition_penalty factor is applied to its logit (the raw score before softmax), lowering it and thereby reducing the probability of that token being chosen as the next one.

In principle, you could also set repetition_penalty to a float smaller than 1.0, increasing the probability that the model repeats earlier tokens (building yourself a parrot), though this seems to have little practical use.

The core of this feature in the transformers library looks like this (see the RepetitionPenaltyLogitsProcessor class for the full implementation):

if self.prompt_ignore_length:
    input_ids = input_ids[:, self.prompt_ignore_length :]
score = torch.gather(scores, 1, input_ids)
# if score < 0 then repetition penalty has to be multiplied to reduce the token probabilities
score = torch.where(score < 0, score * self.penalty, score / self.penalty)
scores_processed = scores.scatter(1, input_ids, score)

Explanation of the code:

  1. Lines 1–2: If prompt_ignore_length is set (usually the length of the user's original input), the original input is skipped, i.e., the penalty is not applied to the prompt tokens. Note that input_ids here contains both the input and the previously generated tokens.
  2. Line 3: Gather from the full scores (logits) tensor the score of each token in input_ids.
  3. Line 5: If a score is < 0, multiply it by the penalty so the logit becomes smaller (e.g., -0.5 * 1.1 -> -0.55); if the score is > 0, divide it by the penalty, which also makes the logit smaller (e.g., 0.5 / 1.1 -> 0.454).
  4. Line 6: Scatter the penalized scores back into the full scores tensor.

As you can see, the implementation is simple and direct, with nothing convoluted about it.

2. Measuring the Effect

The code below makes the parameter's influence on the output clearly visible. We feed in I love coding. I love and predict the next token:

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.logits_process import RepetitionPenaltyLogitsProcessor

def print_top_tokens(tokenizer, scores, message=""):
    # Get the top-5 tokens and their probabilities
    probs = F.softmax(scores, dim=-1)
    top_scores = torch.topk(scores[0], 5)

    print(f"\n{message}")
    print("-" * 50)
    print(f"{'Token':<15} {'Raw Score':<15} {'Probability':<15}")
    print("-" * 50)

    for idx, (score, prob) in enumerate(
        zip(top_scores.values, probs[0][top_scores.indices])
    ):
        token = tokenizer.decode([top_scores.indices[idx]])
        print(f"{token:<15} {score.item():>8.3f} {prob.item():>8.6f}")

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

# Input text
text = "I love coding. I love"
input_ids = tokenizer.encode(text, return_tensors="pt")

# Get the model's logits
with torch.no_grad():
    outputs = model(input_ids)
    original_scores = outputs.logits[:, -1, :].clone()  # logits at the last position

# Compare several penalty values
penalty_values = [0.8, 1.2, 2.0]

print(f"Input text: {text}")

# Print the original scores
print_top_tokens(tokenizer, original_scores, "Original output (no repetition penalty)")

# Show the effect of each penalty value
for penalty in penalty_values:
    processor = RepetitionPenaltyLogitsProcessor(penalty=penalty)
    processed_scores = processor(input_ids, original_scores.clone())
    print_top_tokens(
        tokenizer, processed_scores, f"Output with repetition penalty = {penalty}"
    )

The results:

Input text: I love coding. I love

Original output (no repetition penalty)
--------------------------------------------------
Token           Raw Score       Probability
--------------------------------------------------
the              16.583        0.176431
to               15.963        0.094929
learning         15.550        0.062831
solving          15.482        0.058693
programming      15.221        0.045199

Output with repetition penalty = 0.8
--------------------------------------------------
Token           Raw Score       Probability
--------------------------------------------------
coding           18.377        0.519966
the              16.583        0.086431
to               15.963        0.046504
learning         15.550        0.030780
solving          15.482        0.028753

Output with repetition penalty = 1.2
--------------------------------------------------
Token           Raw Score       Probability
--------------------------------------------------
the              16.583        0.180972
to               15.963        0.097372
learning         15.550        0.064449
solving          15.482        0.060203
programming      15.221        0.046362

Output with repetition penalty = 2.0
--------------------------------------------------
Token           Raw Score       Probability
--------------------------------------------------
the              16.583        0.181423
to               15.963        0.097615
learning         15.550        0.064609
solving          15.482        0.060353
programming      15.221        0.046477

As shown, with repetition_penalty set to 0.8 the most likely next token becomes coding, with probability 0.519966, while setting repetition_penalty to 1.2 or 2.0 raises the probability of predicting the.


An Introduction to the Data Measurements Tool

Resources

Background

With the rapid growth of unified platforms for machine-learning datasets (Lhoest et al., 2021), the HuggingFace team began exploring how to manage dataset documentation (McMillan-Major et al., 2021). Documentation is the necessary first step toward understanding a dataset: it tells us how to compute statistics on and inspect the data, and lets us dynamically examine the dataset from different angles.

Here we introduce an open-source Python library and no-code interface called the Data Measurements Tool. Hosted through the Dataset Spaces community and built with the Streamlit tool, it can help you understand, build, gain insight into, and compare datasets.


A Preface to Reading the Transformers Codebase

The Transformers repository is HuggingFace's enormously popular open-source library of pretrained models. It wraps the whole pretrained-model workflow behind high-level APIs, which makes it ideal for library users who just want to call a few functions and go. But for a newcomer dissecting the source for the first time, taking the functionality apart step by step carries a real learning cost. The codebase is also vast and touches countless details, making it hard to grasp the essentials all at once.

With that in mind, as a confessed API caller myself, I will try to break the repository down module by module from a class-oriented perspective and, together with hands-on code, systematically analyze the basic functionality of each module.
