
From RNNs to Transformers

Natural Language Processing has undergone a massive evolution in recent years. To understand state-of-the-art models, we first need to look back at how we used to process sequences and the critical bottleneck that led to the invention of Attention.

The Foundation: Seq2Seq with RNNs

In traditional Sequence-to-Sequence (Seq2Seq) models—commonly used for tasks like machine translation—the architecture relies on Recurrent Neural Networks (RNNs) and consists of two main components:

  • Encoder RNN: Processes the input sequence $x = (x_1, x_2, \dots, x_T)$ and produces a sequence of hidden states $h = (h_1, h_2, \dots, h_T)$.
  • Decoder RNN: Generates the output sequence $y = (y_1, y_2, \dots, y_{T'})$ step by step.

The encoder’s final hidden state $h_T$ is passed as the initial hidden state of the decoder. This vector—often called the context vector—carries a heavy burden: it must encode all necessary information from the entire input sequence.

The Context Bottleneck
When dealing with long input sequences, a single fixed-size context vector struggles to capture all relevant details. Information inevitably gets lost or diluted, leading to poor model performance on longer text.

The Solution: The Attention Mechanism

To overcome the context bottleneck, the attention mechanism was introduced. It allows the decoder to dynamically focus on different parts of the input sequence at each decoding step, rather than relying on a single static vector.

How It Works Step-by-Step

At decoder timestep $t$, the model performs the following operations:

  1. Compute Alignment Scores: Calculate the relevance between the current decoder state $s_{t-1}$ and each encoder hidden state $h_i$:

$$e_{ti} = f_{\text{attn}}(s_{t-1}, h_i)$$

A common mathematical choice for this is:

$$e_{ti} = v_a^\top \tanh(W_a [s_{t-1}; h_i])$$

(Where $W_a$ and $v_a$ are learnable parameters, and $[\cdot\,;\cdot]$ denotes concatenation.)

  2. Normalize Scores: Use a softmax function to convert the scores into probabilities (attention weights):

$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j=1}^{T} \exp(e_{tj})}$$

  3. Compute the Context Vector: Create a weighted sum of the encoder hidden states using the attention weights:

$$c_t = \sum_{i=1}^{T} \alpha_{ti} h_i$$

  4. Decode: Use the new, time-dependent context vector $c_t$ in the decoder. This is typically done by concatenating $c_t$ with the decoder input $y_{t-1}$, or feeding $[c_t; s_{t-1}]$ directly into the output layer to predict $y_t$.
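To make the steps concrete, here is a minimal NumPy sketch of one decoding step of additive (Bahdanau-style) attention. The names (`additive_attention`, `s_prev`, `H`, `W_a`, `v_a`) are illustrative stand-ins, not from any particular library:

```python
import numpy as np

def additive_attention(s_prev, H, W_a, v_a):
    """One decoding step of additive (Bahdanau-style) attention.

    s_prev : (D_s,)    previous decoder state s_{t-1}
    H      : (T, D_h)  encoder hidden states h_1..h_T
    W_a    : (D_a, D_s + D_h) and v_a : (D_a,)  learnable parameters
    """
    T = H.shape[0]
    # Step 1: alignment scores e_{ti} = v_a^T tanh(W_a [s_{t-1}; h_i])
    concat = np.concatenate([np.tile(s_prev, (T, 1)), H], axis=1)  # (T, D_s + D_h)
    e = np.tanh(concat @ W_a.T) @ v_a                              # (T,)
    # Step 2: softmax over the input positions -> attention weights
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Step 3: context vector c_t = sum_i alpha_{ti} h_i
    c_t = alpha @ H                                                # (D_h,)
    return c_t, alpha
```

Step 4 would then concatenate the returned `c_t` with the decoder input and continue decoding as described above.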

Key Advantages of Attention

  • Dynamic Focus: The model attends to the most relevant input tokens for each specific output step.
  • No Fixed Bottleneck: The full input sequence remains accessible throughout the entire decoding process.
  • Fully Differentiable: Attention weights are learned end-to-end via backpropagation, requiring no external supervision for alignment.

Generalizing to Attention Layers

In the Seq2Seq model, we can think of the decoder RNN states as Query Vectors and the encoder RNN states as Data Vectors. These get transformed to output Context Vectors. This specific operation is so powerful that it was extracted into a standalone, general-purpose neural network component: the Attention Layer.

To prevent vanishing gradients caused by large similarities saturating the softmax function, modern layers use Scaled Dot-Product Attention. Furthermore, separating the data into distinct “Keys” and “Values” increases flexibility, and processing multiple queries simultaneously maximizes parallelizability.

The Inputs

  • Query vectors $Q$ [Dimensions: $N_Q \times D_Q$]
  • Data vectors $X$ [Dimensions: $N_X \times D_X$]
  • Key matrix $W_K$ [Dimensions: $D_X \times D_Q$]
  • Value matrix $W_V$ [Dimensions: $D_X \times D_V$]

(Note: The key and value matrices are learnable parameters; because the whole computation is differentiable, the attention weights are effectively learned end-to-end through backpropagation.)

The Computation Process

  1. Keys: The data vectors are projected into the key space.

$$K = X W_K \quad [N_X \times D_Q]$$

  2. Values: The data vectors are projected into the value space.

$$V = X W_V \quad [N_X \times D_V]$$

  3. Similarities: Calculate the dot product between queries and keys, scaled by the square root of the query dimension.

$$E = Q K^\top / \sqrt{D_Q} \quad [N_Q \times N_X], \qquad E_{ij} = Q_i \cdot K_j / \sqrt{D_Q}$$

  4. Attention Weights: Normalize the similarities using a softmax along dimension 1 (i.e., over the $N_X$ data positions).

$$A = \text{softmax}(E, \text{dim}=1) \quad [N_Q \times N_X]$$

  5. Output Vectors: Generate the final output as a weighted sum of the values.

$$Y = A V \quad [N_Q \times D_V], \qquad Y_i = \sum_j A_{ij} V_j$$
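The five-step computation can be written down almost verbatim in NumPy. This is an illustrative sketch, not any particular library's API:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(Q, X, W_K, W_V):
    """General attention layer: queries Q attend over data vectors X.

    Q   : (N_Q, D_Q)  query vectors
    X   : (N_X, D_X)  data vectors
    W_K : (D_X, D_Q), W_V : (D_X, D_V)  learnable projections
    """
    K = X @ W_K                          # 1. keys          (N_X, D_Q)
    V = X @ W_V                          # 2. values        (N_X, D_V)
    E = Q @ K.T / np.sqrt(Q.shape[1])    # 3. similarities  (N_Q, N_X)
    A = softmax(E, axis=1)               # 4. weights: each row sums to 1
    return A @ V                         # 5. outputs       (N_Q, D_V)
```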

Flavors of Attention

Depending on how we route the data, attention layers take on different properties:

  • Cross-Attention Layer: The data vectors and the query vectors come from two completely different sets of data.
  • Self-Attention Layer: The exact same set of vectors is used for both the data and the query. Because self-attention is permutation equivariant (it doesn’t inherently know the sequence order), we must add positional encoding to inject position information into each input vector.
  • Masked Self-Attention: We override certain similarities with negative infinity ($-\infty$) before the softmax step. This strictly controls which inputs each vector is allowed to “look at” (often used to prevent looking into the future during text generation).
  • Multi-Headed Attention: We run $H$ copies of Self-Attention in parallel, each with its own independent weights (called “heads”). We then stack the $H$ independent outputs and use a final output projection matrix to fuse the data together.
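As a small illustration of the masked flavor, the single-head NumPy sketch below (illustrative names, not a library API) overrides future positions with negative infinity so each token can only attend to itself and earlier tokens:

```python
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    """Masked (causal) self-attention: token i may only attend to j <= i.

    X : (N, D) input vectors; W_Q, W_K, W_V : (D, D_k) projections.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    E = Q @ K.T / np.sqrt(Q.shape[1])        # (N, N) similarities
    # Override future positions with -inf; softmax then gives them zero weight
    mask = np.tril(np.ones_like(E)) == 1
    E = np.where(mask, E, -np.inf)
    E = E - E.max(axis=1, keepdims=True)     # stabilize before exponentiating
    A = np.exp(E)
    A /= A.sum(axis=1, keepdims=True)
    return A @ V
```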

The Rise of Transformers

Transformers Paper

Under the hood, self-attention boils down to four highly optimized matrix multiplications. However, calculating attention across every token pair requires $O(N^2)$ compute.

By taking these attention layers and building around them, we get the modern Transformer:

The Transformer Block
A single block consists of a self-attention layer, a residual connection, Layer Normalization, an MLP applied independently to each output vector, followed by another residual connection and a final Layer Normalization.

Because they discard the sequential nature of RNNs, Transformers are incredibly parallelizable. Ultimately, a full Transformer Model is simply a stack of these identical, highly efficient blocks working together to process complex sequential data.

Applications

The true power of the Transformer lies in its versatility. By simply changing how data is pre-processed and fed into the model, the exact same attention mechanism can solve vastly different problems.

Large Language Models (LLMs)

Modern text-generation giants (like GPT-4 or Gemini) are primarily built on decoder-only Transformers. Here is how the pipeline flows:

  1. Embedding: The model begins with an embedding matrix that converts discrete words (or sub-word tokens) into continuous, dense vectors.
  2. Masked Self-Attention: These vectors are passed through stacked Transformer blocks. Crucially, these blocks use Masked Multi-Headed Self-Attention. The mask prevents the model from “cheating” by looking at future words, forcing it to learn sequence dependencies based only on past context.
  3. Projection: After the final Transformer block, a learned projection matrix transforms each vector into a set of scores (logits) mapping to every word in the model’s vocabulary. A softmax function converts these into probabilities to predict the next word.

Vision Transformers (ViTs)

Who says Transformers are only for text? In 2020, researchers proved that the exact same architecture could achieve state-of-the-art results on images.

  1. Patching: Instead of tokens, a ViT breaks an image down into a grid of fixed-size patches (e.g., 16x16 pixels).
  2. Flattening: These 2D patches are flattened into 1D vectors and passed through a linear projection layer.
  3. Positional Encoding: Because the model processes all patches simultaneously, positional encodings are added to retain the image’s 2D spatial relationships.
  4. Unmasked Attention: Unlike LLMs, ViTs use an encoder-only architecture. There is no masking—the model is allowed to attend to the entire image at once to understand global context.
  5. Pooling: At the end of the transformer blocks, the output vectors are pooled (or a special [CLS] classification token is used) to make a final prediction about the image.
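Steps 1–2 of the pipeline above amount to a reshape plus a linear map. A minimal sketch of the patching step under assumed shapes (the `patchify` name is illustrative):

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into a sequence of flattened patches.

    Returns an array of shape (H*W / patch**2, patch*patch*C),
    one row per patch, ready for the linear projection layer.
    """
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0, "image must divide evenly into patches"
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/p, W/p, p, p, C): group by patch
    return x.reshape(-1, patch * patch * C)   # flatten each patch to a vector
```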

Modern Architectural Upgrades

The original “vanilla” Transformer from the 2017 Attention is All You Need paper is rarely used exactly as written today. Researchers have introduced several key modifications to make models train faster, scale larger, and perform better.

Pre-Norm (vs. Post-Norm)

The original Transformer applied Layer Normalization after adding the residual connection (Post-Norm). Modern architectures apply it before the Attention and MLP blocks (Pre-Norm). This seemingly minor change drastically improves training stability, allowing researchers to train much deeper networks without the gradients blowing up or vanishing.
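The difference is easiest to see in code. A minimal sketch, where `sublayer` stands for either the attention or the MLP sub-block (names are illustrative; LayerNorm gain and bias omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over the last axis (gain/bias omitted for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Post-Norm (original 2017 design): normalize AFTER the residual addition.
def post_norm(x, sublayer):
    return layer_norm(x + sublayer(x))

# Pre-Norm (modern design): normalize the sublayer INPUT; the residual path
# stays a clean identity, which keeps gradients stable in very deep stacks.
def pre_norm(x, sublayer):
    return x + sublayer(layer_norm(x))
```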

RMSNorm (Root Mean-Square Normalization)

Standard Layer Normalization is computationally expensive because it requires calculating the mean to center the data. RMSNorm is a leaner alternative that drops the mean-centering step entirely, scaling the activations purely by their Root Mean Square. This makes training slightly more stable and noticeably faster.

Given an input vector $x$ of shape $D$ and a learned weight parameter $\gamma$ of shape $D$, the output $y$ is calculated as:

$$y_i = \frac{x_i}{\text{RMS}(x)} \cdot \gamma_i$$

Where the Root Mean Square is defined as:

$$\text{RMS}(x) = \sqrt{\epsilon + \frac{1}{D} \sum_{i=1}^{D} x_i^2}$$

(Note: $\epsilon$ is a very small number added to prevent division by zero.)
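A direct NumPy transcription of the two formulas (an illustrative sketch, not the implementation from any specific library):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: rescale by the root mean square; no mean-centering step.

    x, gamma : (D,) input vector and learned per-dimension gain.
    """
    rms = np.sqrt(eps + np.mean(x * x))
    return (x / rms) * gamma
```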

SwiGLU Activation in MLPs

Inside a Transformer block, the output of the attention layer is passed through a Multi-Layer Perceptron (MLP). To understand the modern upgrade, let’s look at the classic setup versus the new standard.

The Classic MLP:

  • Input: $X$ $[N \times D]$
  • Weights: $W_1$ $[D \times 4D]$ and $W_2$ $[4D \times D]$
  • Output: $Y = \sigma(XW_1)W_2$ $[N \times D]$

Modern models (like LLaMA) have replaced this with the SwiGLU (Swish-Gated Linear Unit) architecture, which introduces a gating mechanism via element-wise multiplication ($\odot$):

The SwiGLU MLP:

  • Input: $X$ $[N \times D]$
  • Weights: $W_1$ and $W_2$ $[D \times H]$, plus $W_3$ $[H \times D]$
  • Output: $Y = (\sigma(XW_1) \odot XW_2)W_3$ $[N \times D]$

To ensure this new architecture doesn’t inflate the model’s size, researchers typically set the hidden dimension $H = 8D/3$, which keeps the total parameter count identical to the classic MLP.
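As a sketch (illustrative names; $\sigma$ taken to be the Swish/SiLU activation, as in LLaMA; biases omitted):

```python
import numpy as np

def silu(x):
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(X, W1, W2, W3):
    """SwiGLU MLP: the Swish branch gates the linear branch element-wise.

    X : (N, D); W1, W2 : (D, H); W3 : (H, D)
    """
    return (silu(X @ W1) * (X @ W2)) @ W3

# With H = 8D/3 the parameter count 3*D*H equals the classic MLP's 8*D^2.
```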

Interestingly, while SwiGLU consistently yields better performance and smoother optimization, the original authors famously quipped about its empirical nature in their paper:

“We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.”

Mixture of Experts (MoE)

As models grow, compute costs skyrocket. MoE is a clever architectural trick to increase a model’s parameter count (its “knowledge”) without proportionately increasing the compute required to run it.

  • How it works: Instead of a single, massive MLP layer in each Transformer block, the model learns $E$ separate, smaller sets of MLP weights. Each of these smaller MLPs is considered an “expert.”
  • Routing: When a token passes through the layer, a learned routing network decides which experts are best suited to process that specific token. Each token gets routed to a subset of the experts. These are the active experts.
  • The Benefit: This is called Sparse Activation. A 70-billion parameter MoE model might only activate 12 billion parameters per token. You get the capacity of a massive model with the speed and cost of a much smaller one.
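A toy sketch of the routing logic for a single token (all names are illustrative; real implementations batch this and add load-balancing losses):

```python
import numpy as np

def moe_layer(x, router_W, experts, k=2):
    """Route one token through its top-k experts (sparse activation).

    x        : (D,) token vector
    router_W : (D, E) routing weights
    experts  : list of E callables, each mapping (D,) -> (D,)
    """
    logits = x @ router_W
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                          # (E,) routing probabilities
    active = np.argsort(gate)[-k:]              # indices of the k active experts
    w = gate[active] / gate[active].sum()       # renormalize over the active set
    # Only the k active experts run; the rest cost no compute for this token
    return sum(wi * experts[i](x) for wi, i in zip(w, active))
```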

Reference

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems (pp. 5998–6008).

repetition_penalty: What It Does and How It Is Implemented

1. How It Works

When running LLM inference, the model sometimes falls into endless repetition: it keeps emitting the same token or token sequence and never terminates. The transformers library provides a parameter, repetition_penalty, specifically to mitigate this. Setting it to a float greater than 1.0 (e.g., 1.05, 1.1, 1.2) can relieve the repetition problem in some cases. The idea was proposed in the 2019 CTRL paper.

So how does this parameter reduce repetition? The implementation is actually quite simple: for each token that has already appeared, a repetition_penalty factor is applied to its logit (the raw score before softmax), lowering it and thereby reducing the probability of that token being chosen as the next one.

In principle, you could also set repetition_penalty to a float smaller than 1.0, increasing the probability that the model repeats earlier tokens (building yourself a parrot), though this seems to have little practical use.

The core of this feature in the transformers library looks like this (see the RepetitionPenaltyLogitsProcessor class for the full implementation):

if self.prompt_ignore_length:
    input_ids = input_ids[:, self.prompt_ignore_length :]
score = torch.gather(scores, 1, input_ids)
# if score < 0 then repetition penalty has to be multiplied to reduce the token probabilities
score = torch.where(score < 0, score * self.penalty, score / self.penalty)
scores_processed = scores.scatter(1, input_ids, score)

Explanation of the code:

  1. Lines 1–2: If prompt_ignore_length is set (usually the length of the user's original input), the original input is skipped, i.e., the penalty is not applied to the prompt tokens. Note that input_ids here contains both the input and the previously generated tokens.
  2. Line 3: Gather from the full scores (logits) tensor the score of each token in input_ids.
  3. Line 5: If a score is < 0, multiply it by the penalty so the logit becomes smaller (e.g., -0.5 * 1.1 -> -0.55); if the score is > 0, divide it by the penalty, which also makes the logit smaller (e.g., 0.5 / 1.1 -> 0.454).
  4. Line 6: Scatter the penalized scores back into the full scores tensor.

As you can see, the implementation is simple and direct, with nothing convoluted about it.

2. Measuring the Effect

The code below makes the parameter's influence on the output clearly visible. We feed in I love coding. I love and predict the next token:

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.logits_process import RepetitionPenaltyLogitsProcessor

def print_top_tokens(tokenizer, scores, message=""):
    # Get the top-5 tokens and their probabilities
    probs = F.softmax(scores, dim=-1)
    top_scores = torch.topk(scores[0], 5)

    print(f"\n{message}")
    print("-" * 50)
    print(f"{'Token':<15} {'Raw Score':<15} {'Probability':<15}")
    print("-" * 50)

    for idx, (score, prob) in enumerate(
        zip(top_scores.values, probs[0][top_scores.indices])
    ):
        token = tokenizer.decode([top_scores.indices[idx]])
        print(f"{token:<15} {score.item():>8.3f} {prob.item():>8.6f}")

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

# Input text
text = "I love coding. I love"
input_ids = tokenizer.encode(text, return_tensors="pt")

# Get the model's logits
with torch.no_grad():
    outputs = model(input_ids)
    original_scores = outputs.logits[:, -1, :].clone()  # logits at the last position

# Compare several penalty values
penalty_values = [0.8, 1.2, 2.0]

print(f"Input text: {text}")

# Print the original scores
print_top_tokens(tokenizer, original_scores, "Original output (no repetition penalty)")

# Show the effect of each penalty value
for penalty in penalty_values:
    processor = RepetitionPenaltyLogitsProcessor(penalty=penalty)
    processed_scores = processor(input_ids, original_scores.clone())
    print_top_tokens(
        tokenizer, processed_scores, f"Output with repetition penalty = {penalty}"
    )

The results:

Input text: I love coding. I love

Original output (no repetition penalty)
--------------------------------------------------
Token           Raw Score       Probability
--------------------------------------------------
the              16.583        0.176431
to               15.963        0.094929
learning         15.550        0.062831
solving          15.482        0.058693
programming      15.221        0.045199

Output with repetition penalty = 0.8
--------------------------------------------------
Token           Raw Score       Probability
--------------------------------------------------
coding           18.377        0.519966
the              16.583        0.086431
to               15.963        0.046504
learning         15.550        0.030780
solving          15.482        0.028753

Output with repetition penalty = 1.2
--------------------------------------------------
Token           Raw Score       Probability
--------------------------------------------------
the              16.583        0.180972
to               15.963        0.097372
learning         15.550        0.064449
solving          15.482        0.060203
programming      15.221        0.046362

Output with repetition penalty = 2.0
--------------------------------------------------
Token           Raw Score       Probability
--------------------------------------------------
the              16.583        0.181423
to               15.963        0.097615
learning         15.550        0.064609
solving          15.482        0.060353
programming      15.221        0.046477

As shown, with repetition_penalty set to 0.8 the most likely next token becomes coding, with probability 0.519966, while setting repetition_penalty to 1.2 or 2.0 raises the probability of predicting the.


An Introduction to the Data Measurements Tool

Resources

Background

With the rapid growth of unified platforms for machine-learning datasets (Lhoest et al., 2021), the HuggingFace team began exploring how to manage dataset documentation (McMillan-Major et al., 2021). Documentation is the necessary first step toward understanding a dataset: it tells us how to compute statistics on and inspect the data, and lets us dynamically examine the dataset from different angles.

Here we introduce an open-source Python library and no-code interface called the Data Measurements Tool. Hosted through the Dataset Spaces community and built with the Streamlit tool, it can help you understand, build, gain insight into, and compare datasets.


A Preface to Reading the Transformers Codebase

The Transformers repository is HuggingFace's enormously popular open-source library of pretrained models. It wraps the whole pretrained-model workflow behind high-level APIs, which makes it ideal for library users who just want to call a few functions and go. But for a newcomer dissecting the source for the first time, taking the functionality apart step by step carries a real learning cost. The codebase is also vast and touches countless details, making it hard to grasp the essentials all at once.

With that in mind, as a confessed API caller myself, I will try to break the repository down module by module from a class-oriented perspective and, together with hands-on code, systematically analyze the basic functionality of each module.
