```python
def unbroadcast(grad, target_shape):
    # If the shapes already match, return the gradient unchanged
    if grad.shape == target_shape:
        return grad
    # Case 1: sum out leading dimensions added by broadcasting
    # (e.g. a scalar or vector broadcast against a matrix)
    ndims_added = grad.ndim - len(target_shape)
    for _ in range(ndims_added):
        grad = grad.sum(axis=0)
    # Case 2: sum over dimensions that were expanded from size 1 to N,
    # keeping the dimension so the shape matches the target
    for i, dim in enumerate(target_shape):
        if dim == 1:
            grad = grad.sum(axis=i, keepdims=True)
    return grad
```
Natural Language Processing has undergone a massive evolution in recent years. To understand state-of-the-art models, we first need to look back at how we used to process sequences and the critical bottleneck that led to the invention of Attention.
The Foundation: Seq2Seq with RNNs
In traditional Sequence-to-Sequence (Seq2Seq) models—commonly used for tasks like machine translation—the architecture relies on Recurrent Neural Networks (RNNs) and consists of two main components:
Encoder RNN: Processes the input sequence x=(x1,x2,…,xT) and produces a sequence of hidden states h=(h1,h2,…,hT).
Decoder RNN: Generates the output sequence y=(y1,y2,…,yT′) step by step.
The encoder’s final hidden state $h_T$ is passed as the initial hidden state of the decoder. This vector—often called the context vector—carries a heavy burden: it must encode all necessary information from the entire input sequence.
The Context Bottleneck
When dealing with long input sequences, a single fixed-size context vector struggles to capture all relevant details. Information inevitably gets lost or diluted, leading to poor model performance on longer text.
The Solution: The Attention Mechanism
To overcome the context bottleneck, the attention mechanism was introduced. It allows the decoder to dynamically focus on different parts of the input sequence at each decoding step, rather than relying on a single static vector.
How It Works Step-by-Step
At decoder timestep t, the model performs the following operations:
Compute Alignment Scores: Calculate the relevance between the current decoder state st−1 and each encoder hidden state hi:
$$e_{t,i} = f_{attn}(s_{t-1}, h_i)$$
A common mathematical choice for this is:
$$e_{t,i} = v_a^\top \tanh(W_a [s_{t-1}; h_i])$$
(Where $W_a$ and $v_a$ are learnable parameters, and $[\cdot;\cdot]$ denotes concatenation.)
Normalize Scores: Use a softmax function to convert the scores into probabilities (attention weights):
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}$$
Compute the Context Vector: Create a weighted sum of the encoder hidden states using the attention weights:
$$c_t = \sum_{i=1}^{T} \alpha_{t,i} h_i$$
Decode: Use the new, time-dependent context vector ct in the decoder. This is typically done by concatenating ct with the decoder input yt−1, or feeding [ct;st−1] directly into the output layer to predict yt.
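The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name, shapes, and parameter layout are my own.

```python
import numpy as np

def attention_step(s_prev, h, W_a, v_a):
    """One decoder timestep of the additive attention described above.

    s_prev: (Ds,) previous decoder state s_{t-1}
    h:      (T, Dh) encoder hidden states
    W_a:    (Da, Ds + Dh) and v_a: (Da,) learnable alignment parameters
    Returns the context vector c_t and the attention weights alpha_t.
    """
    T = h.shape[0]
    # e_{t,i} = v_a^T tanh(W_a [s_{t-1}; h_i]) for every encoder position i
    concat = np.concatenate([np.tile(s_prev, (T, 1)), h], axis=1)
    e = np.tanh(concat @ W_a.T) @ v_a   # (T,) alignment scores
    alpha = np.exp(e - e.max())         # softmax over encoder positions
    alpha /= alpha.sum()
    c = alpha @ h                       # weighted sum of encoder states
    return c, alpha
```

The weights `alpha` are nonnegative and sum to one, so `c` is always a convex combination of the encoder states.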
Key Advantages of Attention
Dynamic Focus: The model attends to the most relevant input tokens for each specific output step.
No Fixed Bottleneck: The full input sequence remains accessible throughout the entire decoding process.
Fully Differentiable: Attention weights are learned end-to-end via backpropagation, requiring no external supervision for alignment.
Generalizing to Attention Layers
In the Seq2Seq model, we can think of the decoder RNN states as Query Vectors and the encoder RNN states as Data Vectors. These get transformed to output Context Vectors. This specific operation is so powerful that it was extracted into a standalone, general-purpose neural network component: the Attention Layer.
To prevent vanishing gradients caused by large similarities saturating the softmax function, modern layers use Scaled Dot-Product Attention. Furthermore, separating the data into distinct “Keys” and “Values” increases flexibility, and processing multiple queries simultaneously maximizes parallelizability.
The Inputs
Query vectors $Q$ [Dimensions: $N_Q \times D_Q$]
Data vectors $X$ [Dimensions: $N_X \times D_X$]
Key matrix $W_K$ [Dimensions: $D_X \times D_Q$]
Value matrix $W_V$ [Dimensions: $D_X \times D_V$]
(Note: The key and value projection matrices $W_K$ and $W_V$ are learned through backpropagation; the attention weights themselves are computed from them, not stored as parameters.)
The Computation Process
Keys: The data vectors are projected into the key space.
$$K = XW_K \quad [N_X \times D_Q]$$
Values: The data vectors are projected into the value space.
$$V = XW_V \quad [N_X \times D_V]$$
Similarities: Calculate the dot product between queries and keys, scaled by the square root of the query dimension.
$$E = QK^\top / \sqrt{D_Q} \quad [N_Q \times N_X]$$
$$E_{ij} = Q_i \cdot K_j / \sqrt{D_Q}$$
Attention Weights: Normalize the similarities using softmax along dimension 1.
$$A = \mathrm{softmax}(E, \text{dim}=1) \quad [N_Q \times N_X]$$
Output Vector: Generate the final output as a weighted sum of the values.
$$Y = AV \quad [N_Q \times D_V]$$
$$Y_i = \sum_j A_{ij} V_j$$
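Putting the five steps together, scaled dot-product attention is a short NumPy sketch (helper names are mine; real implementations add batching and masking):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(Q, X, W_K, W_V):
    """Scaled dot-product attention with the shapes used in the text."""
    K = X @ W_K                          # (N_X, D_Q) keys
    V = X @ W_V                          # (N_X, D_V) values
    E = Q @ K.T / np.sqrt(Q.shape[1])    # (N_Q, N_X) scaled similarities
    A = softmax(E, axis=1)               # attention weights, rows sum to 1
    return A @ V                         # (N_Q, D_V) output
```

Each output row is a weighted average of the value vectors, with weights given by how strongly that query matches each key.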
Flavors of Attention
Depending on how we route the data, attention layers take on different properties:
Cross-Attention Layer: The data vectors and the query vectors come from two completely different sets of data.
Self-Attention Layer: The exact same set of vectors is used for both the data and the query. Because self-attention is permutation equivariant (it doesn’t inherently know the sequence order), we must add positional encoding to inject position information into each input vector.
Masked Self-Attention: We override certain similarities with negative infinity (−∞) before the softmax step. This strictly controls which inputs each vector is allowed to “look at” (often used to prevent looking into the future during text generation).
Multi-Headed Attention: We run H copies of Self-Attention in parallel, each with its own independent weights (called “heads”). We then stack the H independent outputs and use a final output projection matrix to fuse the data together.
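As a concrete illustration of the masking trick, here is one minimal NumPy sketch: overwrite the upper triangle of the similarity matrix with $-\infty$ before the softmax, so row $i$ puts zero weight on every position $j > i$.

```python
import numpy as np

def causal_self_attention_weights(E):
    """Apply a causal mask to a square similarity matrix E, then softmax
    each row, so position i only attends to positions j <= i."""
    N = E.shape[0]
    masked = np.array(E, dtype=float)
    masked[np.triu_indices(N, k=1)] = -np.inf    # forbid looking at the future
    masked -= masked.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(masked)                           # exp(-inf) = 0
    return P / P.sum(axis=1, keepdims=True)
```

After the softmax, every entry above the diagonal is exactly zero, which is what prevents a token from attending to future tokens during generation.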
The Rise of Transformers
Under the hood, self-attention boils down to four highly optimized matrix multiplications. However, calculating attention across every token pair requires O(N2) compute.
By taking these attention layers and building around them, we get the modern Transformer:
The Transformer Block
A single block consists of a self-attention layer, a residual connection, Layer Normalization, several Multi-Layer Perceptrons (MLPs) applied independently to each output vector, followed by another residual connection and a final Layer Normalization.
Because they discard the sequential nature of RNNs, Transformers are incredibly parallelizable. Ultimately, a full Transformer Model is simply a stack of these identical, highly efficient blocks working together to process complex sequential data.
Applications
The true power of the Transformer lies in its versatility. By simply changing how data is pre-processed and fed into the model, the exact same attention mechanism can solve vastly different problems.
Large Language Models (LLMs)
Modern text-generation giants (like GPT-4 or Gemini) are primarily built on decoder-only Transformers. Here is how the pipeline flows:
Embedding: The model begins with an embedding matrix that converts discrete words (or sub-word tokens) into continuous, dense vectors.
Masked Self-Attention: These vectors are passed through stacked Transformer blocks. Crucially, these blocks use Masked Multi-Headed Self-Attention. The mask prevents the model from “cheating” by looking at future words, forcing it to learn sequence dependencies based only on past context.
Projection: After the final Transformer block, a learned projection matrix transforms each vector into a set of scores (logits) mapping to every word in the model’s vocabulary. A softmax function converts these into probabilities to predict the next word.
Vision Transformers (ViTs)
Who says Transformers are only for text? In 2020, researchers proved that the exact same architecture could achieve state-of-the-art results on images.
Patching: Instead of tokens, a ViT breaks an image down into a grid of fixed-size patches (e.g., 16x16 pixels).
Flattening: These 2D patches are flattened into 1D vectors and passed through a linear projection layer.
Positional Encoding: Because the model processes all patches simultaneously, positional encodings are added to retain the image’s 2D spatial relationships.
Unmasked Attention: Unlike LLMs, ViTs use an encoder-only architecture. There is no masking—the model is allowed to attend to the entire image at once to understand global context.
Pooling: At the end of the transformer blocks, the output vectors are pooled (or a special [CLS] classification token is used) to make a final prediction about the image.
Modern Architectural Upgrades
The original “vanilla” Transformer from the 2017 Attention is All You Need paper is rarely used exactly as written today. Researchers have introduced several key modifications to make models train faster, scale larger, and perform better.
Pre-Norm (vs. Post-Norm)
The original Transformer applied Layer Normalization after adding the residual connection (Post-Norm). Modern architectures apply it before the Attention and MLP blocks (Pre-Norm). This seemingly minor change drastically improves training stability, allowing researchers to train much deeper networks without the gradients blowing up or vanishing.
RMSNorm (Root Mean-Square Normalization)
Standard Layer Normalization is computationally expensive because it requires calculating the mean to center the data. RMSNorm is a leaner alternative that drops the mean-centering step entirely, scaling the activations purely by their Root Mean Square. This makes training slightly more stable and noticeably faster.
Given an input vector x of shape D, and a learned weight parameter γ of shape D, the output y is calculated as:
$$y_i = \frac{x_i}{\mathrm{RMS}(x)} \cdot \gamma_i$$
Where the Root Mean Square is defined as:
$$\mathrm{RMS}(x) = \sqrt{\epsilon + \frac{1}{D} \sum_{i=1}^{D} x_i^2}$$
(Note: ϵ is a very small number added to prevent division by zero).
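A minimal NumPy sketch of the definition above (the function name is my own):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: scale by the root mean square; no mean-centering step."""
    rms = np.sqrt(eps + np.mean(x ** 2, axis=-1, keepdims=True))
    return x / rms * gamma
```

With `gamma` set to ones, the output always has an RMS of (almost exactly) one, regardless of the input's scale.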
SwiGLU Activation in MLPs
Inside a Transformer block, the output of the attention layer is passed through a Multi-Layer Perceptron (MLP). To understand the modern upgrade, let’s look at the classic setup versus the new standard.
The Classic MLP:
Input: $X$ [$N \times D$]
Weights: $W_1$ [$D \times 4D$] and $W_2$ [$4D \times D$]
Output: $Y = \sigma(XW_1)W_2$ [$N \times D$]
Modern models (like LLaMA) have replaced this with the SwiGLU (Swish-Gated Linear Unit) architecture, which introduces a gating mechanism via element-wise multiplication (⊙):
The SwiGLU MLP:
Input: $X$ [$N \times D$]
Weights: $W_1$ and $W_2$ [$D \times H$], plus $W_3$ [$H \times D$]
Output: $Y = (\sigma(XW_1) \odot XW_2)W_3$
To ensure this new architecture doesn’t inflate the model’s size, researchers typically set the hidden dimension H=8D/3, which keeps the total parameter count identical to the classic MLP.
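The two variants can be compared directly in NumPy. This sketch assumes ReLU for the classic MLP's nonlinearity and Swish ($x \cdot \mathrm{sigmoid}(x)$, as in LLaMA-style blocks) for the gate; with $D = 6$ and $H = 16 = 8D/3$, the parameter counts match exactly.

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))   # Swish / SiLU: x * sigmoid(x)

def classic_mlp(X, W1, W2):
    return np.maximum(0, X @ W1) @ W2        # ReLU stand-in for sigma

def swiglu_mlp(X, W1, W2, W3):
    return (swish(X @ W1) * (X @ W2)) @ W3   # gated elementwise product
```

The gate `swish(X @ W1)` modulates `X @ W2` element by element before the final projection, which is the "gated linear unit" part of the name.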
Interestingly, while SwiGLU consistently yields better performance and smoother optimization, the original authors famously quipped about its empirical nature in their paper:
“We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.”
Mixture of Experts (MoE)
As models grow, compute costs skyrocket. MoE is a clever architectural trick to increase a model’s parameter count (its “knowledge”) without proportionately increasing the compute required to run it.
How it works: Instead of a single, massive MLP layer in each Transformer block, the model learns E separate, smaller sets of MLP weights. Each of these smaller MLPs is considered an “expert.”
Routing: When a token passes through the layer, a learned routing network decides which experts are best suited to process that specific token. Each token gets routed to a subset of the experts. These are the active experts.
The Benefit: This is called Sparse Activation. A 70-billion parameter MoE model might only activate 12 billion parameters per token. You get the capacity of a massive model with the speed and cost of a much smaller one.
Reference
Vaswani, A., Shazeer, N. M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is All you Need. In Neural Information Processing Systems (pp. 5998–6008).
Recurrent Neural Networks (RNNs) are a class of neural networks designed to handle sequential data. Unlike standard feedforward networks, RNNs maintain an internal state (memory) that is updated as each element of a sequence is processed, allowing information to persist across time steps.
Sequence Architectures
RNNs are flexible and can be adapted to various input-output mapping structures depending on the task:
One-to-Many: A single input produces a sequence of outputs.
Example: Image Captioning (Input: Image → Output: Sequence of words).
Many-to-One: A sequence of inputs produces a single output.
Example: Action Prediction or Sentiment Analysis (Input: Sequence of video frames/words → Output: Label).
Many-to-Many: A sequence of inputs produces a sequence of outputs.
Example: Video Captioning (Input: Sequence of frames → Output: Sequence of words) or Video Classification (Frame-by-frame labeling).
The Recurrence Mechanism
The “key idea” behind an RNN is the use of a recurrence formula applied at every time step (t). This allows the network to process a sequence of vectors x by maintaining a hidden state h.
A. Hidden State Update
The hidden state is updated by combining the previous state with the current input.
$$h_t = f_W(h_{t-1}, x_t)$$
Parameters:
ht: New state (current hidden state).
ht−1: Old state (hidden state from the previous time step).
xt: Input vector at the current time step.
fW: A function (often a non-linear activation like tanh or ReLU) with trainable parameters W.
Note: the same function and the same set of parameters are used at every time step.
B. Output Generation
After the hidden state is updated, the network can produce an output at that specific time step.
$$y_t = f_{W_{hy}}(h_t)$$
Parameters:
yt: Output at time t.
ht: New state (the updated hidden state).
$f_{W_{hy}}$: A separate function with its own trainable parameters $W_{hy}$, used to map the hidden state to the output space.
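Both equations fit in a short NumPy sketch (weight names are illustrative):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h, W_hy, b_y):
    """One vanilla RNN timestep: tanh recurrence, then a linear readout."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)  # hidden state update
    y_t = W_hy @ h_t + b_y                           # output generation
    return h_t, y_t
```

The same weights are passed in at every timestep, which is exactly the parameter sharing the note above describes.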
Backpropagation Through Time
Optimizing an RNN requires a specialized version of gradient descent known as Backpropagation Through Time (BPTT).
In its standard form, the model performs a complete forward pass through the entire sequence to calculate a global loss. During the subsequent backward pass, gradients are propagated from the final loss all the way back to the first time step.
While this captures long-range dependencies accurately, it is often computationally prohibitive for long sequences due to the massive memory requirements for storing intermediate states.
To mitigate these resource constraints, researchers often employ Truncated BPTT. This technique involves partitioning the sequence into smaller, manageable chunks.
The model performs a forward and backward pass on a specific chunk to update its weights before moving to the next. Crucially, while the hidden state is carried forward into the next chunk to maintain continuity, the gradient flow is “truncated” at the chunk boundary.
This approximation significantly reduces memory overhead and allows for the training of models on extended temporal data without sacrificing the benefits of sequential learning.
RNN Tradeoffs
RNNs can process input sequences of arbitrary length, and their model size remains constant regardless of input length since the same weights are shared across all time steps—ensuring temporal symmetry in how inputs are processed. In principle, they can leverage information from arbitrarily distant time steps.
However, in practice, recurrent computation tends to be slow due to its sequential nature, and capturing long-range dependencies remains challenging because relevant information from many steps back often becomes inaccessible or diluted over time.
Long Short Term Memory
Long Short-Term Memory (LSTM) networks alleviate vanishing and exploding gradients when training on long sequences by using a gated architecture that regulates the flow of information.
LSTM Mathematics
The operations of an LSTM cell at time step t are defined by three primary equations:
1. The Gate Vectors
LSTMs compute four internal vectors (gates and candidates) simultaneously by concatenating the previous hidden state $h_{t-1}$ and the current input $x_t$:
i (Input Gate): Decides which new information to store in the cell state.
f (Forget Gate): Decides which information to discard from the previous state.
o (Output Gate): Decides which part of the cell state to output.
g (Cell Candidate): Creates a vector of new candidate values (using tanh) to be added to the state.
2. The Cell State ($c_t$)
This is the “long-term memory” of the network. It is updated via a linear combination of the old state and the new candidates:
$$c_t = f \odot c_{t-1} + i \odot g$$
The use of the Hadamard product (⊙, element-wise multiplication) allows the forget gate to “zero out” specific memories while the input gate adds new ones. This additive update is the secret to preventing vanishing gradients.
3. The Hidden State ($h_t$)
The hidden state is the “working memory” passed to the next cell and the higher layers. It is a filtered version of the cell state:
$$h_t = o \odot \tanh(c_t)$$
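The full cell update can be sketched in NumPy. Here the four gate pre-activations come from one matrix multiply and are then split into $i, f, o, g$ (a common implementation layout, assumed here rather than prescribed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep. W has shape (4H, D + H): the four gate blocks
    are computed in a single matrix multiply over [x_t; h_prev]."""
    H = h_prev.shape[0]
    a = W @ np.concatenate([x_t, h_prev]) + b           # (4H,) pre-activations
    i, f, o = sigmoid(a[:H]), sigmoid(a[H:2*H]), sigmoid(a[2*H:3*H])
    g = np.tanh(a[3*H:])                                # cell candidate
    c_t = f * c_prev + i * g                            # long-term memory
    h_t = o * np.tanh(c_t)                              # working memory
    return h_t, c_t
```

Note that `c_t` is updated additively, gate by gate, while `h_t` is a squashed, output-gated view of it.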
Batch normalization and layer normalization improve training stability and reduce sensitivity to initialization by normalizing intermediate activations.
When training a deep neural network, the distribution of inputs to each layer can change as the network’s weights are updated. If the weights become too large, the inputs to subsequent layers may grow excessively; if the weights shrink toward zero, the inputs may diminish accordingly. These shifts in input distributions make training more difficult, as each layer must continuously adapt to changing conditions. This phenomenon is known as internal covariate shift.
Normalization allows us to use much higher learning rates and be less careful about initialization.
Plain Normalization
If normalization (e.g., centering the activations to zero mean) is performed outside of the gradient descent step, the optimizer will remain “blind” to the normalization’s effects.
In formal terms, if the optimizer treats the mean-subtraction operation as a fixed constant, the gradient updates will fail to reflect the true dynamics of the network.
Consider a layer that computes $x = u + b$ and subsequently normalizes it: $\hat{x} = x - E[x]$.
If the gradient $\partial \ell / \partial b$ is computed without considering how $E[x]$ depends on $b$, the optimizer will attempt to adjust $b$ to minimize the loss.
Because the normalization step subsequently subtracts the updated mean, the update Δb is effectively cancelled. The output x^ remains numerically identical to its state prior to the update:
$$(u + b + \Delta b) - E[u + b + \Delta b] = (u + b) - E[u + b]$$
Since the output—and consequently the loss—remains invariant despite the update, the optimizer will continue to increase b in a futile attempt to reach a lower loss. This results in unbounded parameter growth while the network’s predictive performance stagnates.
To maintain training stability, normalization must be included within the computational graph so that gradients correctly capture its dependence on the parameters.
However, performing full whitening across all examples can be computationally expensive. This is why techniques like Batch Normalization and Layer Normalization are used instead.
Batch Normalization
Batch Normalization performs normalization for each training mini-batch.
In Batch Normalization, we normalize each scalar feature independently, giving it zero mean and unit variance.
$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}] + \epsilon}}$$
But simply normalizing each input of a layer may change what the layer can represent. The authors make sure that the transformation inserted in the network can represent the identity transform. Thus introducing:
$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$
The parameters here are learnable along with the original model parameters, and restore the representation power of the network.
By setting $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = E[x^{(k)}]$, we could recover the original activations, if that were the optimal thing to do.
Algorithm 1: Batch Normalization Forward Pass
The forward pass of the Batch Normalization layer transforms a mini-batch of activations B={x1…m} into a normalized and linearly scaled output {yi}. This process ensures that the input to subsequent layers maintains a stable distribution throughout training.
Input: Values of $x$ over a mini-batch: $B = \{x_{1 \ldots m}\}$; Parameters to be learned: $\gamma, \beta$. Output: $\{y_i = \mathrm{BN}_{\gamma,\beta}(x_i)\}$.
Mini-batch Mean:
$$\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i$$
Mini-batch Variance:
$$\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$
Normalize:
$$\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
Scale and Shift:
$$y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$$
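The four steps of Algorithm 1 map directly onto NumPy (a sketch for a 2D mini-batch of shape (m, D); the cache is returned for use in a backward pass):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalization forward pass over a mini-batch x of shape (m, D)."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    y = gamma * x_hat + beta               # scale and shift
    cache = (x, x_hat, mu, var, gamma, eps)
    return y, cache
```

With `gamma = 1` and `beta = 0`, each output feature has (almost exactly) zero mean and unit variance over the batch.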
Differentiation
To train the network using stochastic gradient descent, we must compute the gradient of the loss function ℓ with respect to the input xi and the learnable parameters γ and β. This is achieved by applying the chain rule through the computational graph of the BN transform.
1. Gradients for Learnable Parameters
The parameters γ and β are updated based on their contribution to all samples in the mini-batch:
Gradient w.r.t. $\beta$: Since $\partial y_i / \partial \beta = 1$, the gradient is the sum of the upstream gradients:
$$\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}$$
Gradient w.r.t. $\gamma$: Since $\partial y_i / \partial \gamma = \hat{x}_i$, the gradient is the sum of the product of the upstream gradient and the normalized input:
$$\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} \cdot \hat{x}_i$$
2. Gradient w.r.t. Intermediate Statistics
The gradient propagates backward from yi to the normalized value x^i, and then to the batch statistics μB and σB2:
Gradient w.r.t. $\hat{x}_i$:
$$\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i} \cdot \gamma$$
Gradient w.r.t. $\sigma_B^2$: This accounts for how the variance affects every $\hat{x}_i$ in the batch:
$$\frac{\partial \ell}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot (x_i - \mu_B) \cdot \left(-\frac{1}{2}\right) (\sigma_B^2 + \epsilon)^{-3/2}$$
Gradient w.r.t. $\mu_B$:
$$\frac{\partial \ell}{\partial \mu_B} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{\sum_{i=1}^{m} -2(x_i - \mu_B)}{m}$$
Finally, the gradient with respect to the original input $x_i$ is a combination of three paths: the direct path through $\hat{x}_i$, the path through the variance $\sigma_B^2$, and the path through the mean $\mu_B$:
$$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{2(x_i - \mu_B)}{m} + \frac{\partial \ell}{\partial \mu_B} \cdot \frac{1}{m}$$
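The chain-rule expressions for the backward pass can be translated line by line into NumPy and checked against a numerical gradient (a self-contained sketch that recomputes the batch statistics internally):

```python
import numpy as np

def batchnorm_backward(dy, x, gamma, eps=1e-5):
    """Gradients of the BN transform w.r.t. its input and parameters,
    following the chain-rule derivation in the text."""
    m = x.shape[0]
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mu) * inv_std

    dbeta = dy.sum(axis=0)                     # sum of upstream gradients
    dgamma = (dy * x_hat).sum(axis=0)          # weighted by normalized input

    dx_hat = dy * gamma                        # path through y_i
    dvar = (dx_hat * (x - mu)).sum(axis=0) * -0.5 * inv_std ** 3
    dmu = (dx_hat * -inv_std).sum(axis=0) + dvar * (-2.0 / m) * (x - mu).sum(axis=0)
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta
```

Because every sample contributes to $\mu_B$ and $\sigma_B^2$, `dx[i]` depends on the upstream gradients of all samples in the batch, not just sample `i`.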
The normalization of activations allows efficient training, but is neither necessary nor desirable during inference. Thus, once the network has been trained, we use the normalization using the population, as opposed to the mini-batch:
$$\hat{x} = \frac{x - E[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}$$
In practice, we use the fixed moving average calculated during training. During training, the mini-batch statistics are stochastic estimates of the true data distribution. The moving average serves as a stable, low-variance estimate of the population mean (E[x]) and population variance (Var[x]).
These running statistics are typically updated at each training step t using a momentum coefficient α (usually 0.9 or 0.99):
$$\hat{\mu}_{new} = \alpha \hat{\mu}_{old} + (1 - \alpha) \mu_B$$
$$\hat{\sigma}^2_{new} = \alpha \hat{\sigma}^2_{old} + (1 - \alpha) \sigma_B^2$$
Layer Normalization
Layer Normalization (LayerNorm) is an alternative normalization technique that normalizes across the features of a single sample rather than across a mini-batch.
For an input vector x∈Rd, LayerNorm computes the mean and variance over the feature dimension, ensuring that each individual sample has zero mean and unit variance.
Unlike Batch Normalization, it does not depend on batch statistics, which makes it particularly suitable for recurrent neural networks and transformer architectures where batch sizes may be small or variable.
Similar to BatchNorm, LayerNorm includes learnable parameters γ and β to scale and shift the normalized output, preserving the model’s representational capacity.
Reference
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167.
Normalization layers standardize activations and then apply a learned scale and shift:
$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$
The difference between methods is which dimensions are used to compute μ and σ (for input shape N,C,H,W):
BatchNorm: per channel, over (N,H,W); depends on batch statistics
LayerNorm: per sample, over all features; independent of batch
InstanceNorm: per sample & channel, over (H,W); removes instance contrast
GroupNorm: per sample, over channel groups + spatial dims; stable for small batches
Key idea: same formula, different normalization axes → different behavior and use cases.
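One way to see that only the axes differ is a small NumPy sketch over an (N, C, H, W) tensor (the group count G = 2 is an arbitrary choice for illustration):

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Standardize x over the given axes; only the axes differ per method."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(2, 6, 4, 4))  # (N, C, H, W)

bn = normalize(x, (0, 2, 3))    # BatchNorm: statistics per channel
ln = normalize(x, (1, 2, 3))    # LayerNorm: statistics per sample
inorm = normalize(x, (2, 3))    # InstanceNorm: per sample and channel
G = 2                           # GroupNorm: per sample, over channel groups
gn = normalize(x.reshape(2, G, 6 // G, 4, 4), (2, 3, 4)).reshape(x.shape)
```

All four calls share the same formula; the choice of reduction axes is the entire difference.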
Regularization: Dropout
In each forward pass, randomly set a subset of activations to zero with probability p (a hyperparameter, often 0.5). To keep the expected activation unchanged, the remaining activations are scaled:
$$\tilde{x} = \frac{m}{1 - p} \odot x, \qquad m \sim \mathrm{Bernoulli}(1 - p)$$
This forces the network to learn redundant representations, preventing co-adaptation of neurons.
Acts like implicit ensemble averaging
Applied only during training; disabled at inference (no randomness)
With inverted dropout, no test-time scaling is needed because the activations were already rescaled during training.
Training: Add some kind of randomness
Testing: Average out randomness (sometimes approximate)
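A sketch of inverted dropout under these rules (drop probability `p`; survivors are rescaled at training time so inference is a no-op):

```python
import numpy as np

def dropout(x, p, train=True, rng=None):
    """Inverted dropout: drop with probability p and rescale survivors by
    1/(1-p) during training; identity at test time."""
    if not train:
        return x
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= p     # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)
```

Because of the `1/(1-p)` scaling, the expected value of each activation is the same in training and test mode.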
Activation Functions
Goal: Introduce non-linearity to the model.
Sigmoid:
Problem: the sigmoid saturates, so the gradient gets small and can vanish through many layers.
ReLU (Rectified Linear Unit):
Does not saturate
Converge faster than Sigmoid
Problem: not zero centered; dead when x<0.
GELU (Gaussian Error Linear Unit): f(x)=xΦ(x)
Smoothness facilitates training
Problem: higher computational cost; large negative values get small gradients
Case Study
AlexNet, VGG
Small filters, deep network.
Among the first landmark deep CNNs.
ResNet
Increasing the depth of a CNN does not always improve performance; in fact, very deep plain networks can exhibit the degradation problem, where training error increases as layers are added.
Although a deeper model should theoretically perform at least as well as a shallower one (since additional layers could learn identity mappings), in practice, standard layers struggle to approximate identity functions, making optimization difficult.
ResNet addresses this by reformulating the learning objective: instead of directly learning a mapping H(x), each block learns a residual function F(x)=H(x)−x, so the output becomes H(x)=F(x)+x.
The shortcut (skip) connection that adds the input x back to the learned residual improves gradient flow and makes it significantly easier to train very deep networks, especially when the desired transformation is close to identity.
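A toy fully-connected version makes the idea concrete (real ResNets use convolutions, batch norm, and a ReLU after the addition; this sketch keeps only the skip connection):

```python
import numpy as np

def residual_block(x, W1, W2):
    """Simplified residual block: H(x) = F(x) + x.
    With zero weights the residual F(x) vanishes and the block is the identity."""
    return np.maximum(0, x @ W1) @ W2 + x
```

This shows why residual blocks are easy to optimize near the identity: the block only needs to drive the residual toward zero, not learn an identity mapping from scratch.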
Weight Initialization
If the initialization scale is too small, activations tend toward zero in deeper network layers.
```python
import numpy as np

dims = [4096] * 7                 # example layer widths
x = np.random.randn(16, dims[0])  # a batch of inputs
hs = []
for Din, Dout in zip(dims[:-1], dims[1:]):
    W = np.random.randn(Din, Dout) * np.sqrt(2 / Din)  # Kaiming initialization
    x = np.maximum(0, x.dot(W))   # ReLU activation
    hs.append(x)
```
This makes the std almost constant. Activations are nicely scaled for all layers.
Data Preprocessing
Image Normalization: center and scale for each channel
Subtract per-channel mean and Divide by per-channel std (almost all modern models) (stats along each channel = 3 numbers)
Requires pre-computing means and std for each pixel channel (given your dataset)
Data Augmentation
We can do flips, random crops and scales, color jitters, random cutouts to enhance our training data.
Transfer Learning
How do we train without much data?
We can start with a large dataset like ImageNet and train a deep CNN to learn general features.
With a small dataset, we can use the pretrained model and freeze most layers. We replace and reinitialize the final layer to match our new classes. We train only this last layer (a linear classifier).
With a bigger dataset, we can start from the pretrained model and initialize all layers with the pretrained weights. We then finetune more (or all) layers.
Deep learning frameworks provide pretrained models that we can transfer from.
A structured approach to stabilizing and optimizing deep learning training:
Step 1: Check Initial Loss
Verify the loss starts at a reasonable value (e.g., log(num_classes) for classification). Large deviations typically indicate implementation issues such as incorrect labels or loss computation.
Step 2: Overfit a Small Sample
Train on a tiny subset (10–100 samples) and confirm the model can achieve near-zero loss. Failure here suggests problems with model capacity, optimization, or data preprocessing.
Step 3: Find a Working Learning Rate
Using the full dataset and a small weight decay (e.g., $10^{-4}$), test learning rates: 1e-1, 1e-2, 1e-3, 1e-4, 1e-5.
Run ~100 iterations and select the rate that yields a clear, stable loss decrease.
Step 4: Coarse Hyperparameter Search
Explore a broad range of values (learning rate, weight decay, batch size, model size) using grid or random search. Train each configuration for ~1–5 epochs to identify promising regions.
Step 5: Refine the Search
Narrow the search around strong candidates and train longer. Optionally introduce schedules (e.g., cosine decay) or warmup.
Step 6: Analyze Curves
Inspect training vs. validation loss and accuracy:
Overfitting → increase regularization
Underfitting → increase capacity or training time
Instability → adjust learning rate or optimizer
Summary Insight
Most failures are due to basic setup issues (bad LR, incorrect loss, broken pipeline), not fine-grained hyperparameter choices. Systematic debugging is more effective than premature tuning.
When images get really large, we would have too many neurons to train in a regular neural network. The full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting. What’s more, regular neural networks ignore the spatial structures of the image, missing out on some important information.
As a result, a new type of neural network, the Convolutional Neural Network (CNN) was developed.
Feature Extraction
Extract features from raw pixels to feed the model with more information.
Examples include Color Histogram and Histogram of Oriented Gradients. Hand-engineered features (e.g., HOG, SIFT) are largely obsolete in modern deep learning pipelines.
Stack them all together.
Traditional feature extraction is manually designed and fixed; CNNs learn hierarchical feature representations end-to-end.
Data and compute can solve problems better than human designers?
Linear Classification Intuition
Matrix multiplication involves vector dot products. Dot products measure how strongly an input aligns with learned feature directions, which act like soft, continuous templates.
Architecture
ConvNets use convolution and pooling operators to extract features while respecting 2D image structures.
The input remains a 3D tensor (e.g. 32x32x3), preserving spatial structures.
The layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth.
The neurons in a layer will only be connected to a small region of the layer before it.
We use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks).
A simple ConvNet could have the architecture [INPUT - CONV - RELU - POOL - FC].
CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume.
POOL layer will perform a downsampling operation along the spatial dimensions (width, height).
ConvNets transform the original image layer by layer from the original pixel values to the final class scores.
The parameters in the CONV/FC layers will be trained with gradient descent.
Convolutional Layers
A convolutional layer performs a linear operation on the input using a set of learnable filters (also called kernels). Although commonly referred to as “convolution,” most deep learning libraries actually implement cross-correlation; the distinction is usually not important in practice.
The CONV layer’s parameters consist of a set of learnable filters. Each filter is spatially small but spans the full depth of the input.
Each filter typically has a single scalar bias, shared across all spatial positions in its activation map.
During the forward pass, each filter slides across the input volume. At every spatial location, the layer computes a dot product between the filter weights and the corresponding local region of the input, then adds a bias term. This produces a 2D activation map (feature map) for that filter.
A convolutional layer contains multiple filters. Each filter produces its own activation map, and these maps are stacked along the depth (channel) dimension to form the output volume. As a result, the number of filters determines the number of output channels.
Filters are initialized differently to break the symmetry.
For a convolution with filter size K, each element of the output depends on a $K \times K$ receptive field. When you stack convolutional layers, the effective receptive field grows: each deeper layer aggregates information from a larger region of the original input.
Apart from 2D conv, we also have 1D conv and 3D conv etc.
Zero Padding
Without zero-padding, feature maps may shrink with each layer. So we add padding around the input before sliding the filter.
With an input of size W×W, filter size K×K, and padding P, the output has size W−K+2P+1.
A common setting is P = (K−1)/2, which keeps the output the same size as the input.
Stride
To grow the effective receptive field more quickly, we can skip positions when sliding the filter by using a stride.
With an input of size W×W, filter size K×K, stride S, and padding P, the output has size (W−K+2P)/S + 1.
In this way, we are downsampling the image and the receptive field grows exponentially.
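The output-size formulas above can be made concrete with a small helper (a sketch; names are mine):

```python
def conv_output_size(W, K, P=0, S=1):
    """Spatial output size of a convolution: (W - K + 2P) / S + 1."""
    assert (W - K + 2 * P) % S == 0, "filter does not tile the input evenly"
    return (W - K + 2 * P) // S + 1

# 'Same' padding with stride 1: P = (K - 1) / 2 keeps the size unchanged
print(conv_output_size(32, 3, P=1, S=1))  # 32
# Stride 2 halves the spatial size
print(conv_output_size(32, 4, P=1, S=2))  # 16
```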
Pooling
Pooling layers downsample a feature map by summarizing local neighborhoods.
In max pooling, a small window, such as 2×2, slides across the input and outputs the largest value in each window. This keeps the strongest local activation and reduces spatial resolution.
In average pooling, the layer outputs the mean value in each window instead of the maximum.
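A direct (unvectorized) sketch of both pooling modes, assuming a 2D single-channel input (function name and layout are my choices):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Downsample a 2D feature map by summarizing each size x size window."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 10, 13, 14],
              [11, 12, 15, 16]])
print(pool2d(x, mode="max"))   # strongest activation in each 2x2 window
print(pool2d(x, mode="avg"))   # mean of each 2x2 window
```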
Translation Equivariance
CNNs use shared filters so the same local features can be detected at different positions in an image, while preserving information about where those features occur. This is called translation equivariance.
If you are diving into the mechanics of neural networks, you will inevitably encounter the backpropagation of the Softmax and Cross-Entropy loss. At first glance, the matrix calculus can feel a bit intimidating. However, once you break it down step-by-step using the chain rule, you will discover that the final gradients are incredibly elegant and intuitive.
In this post, we will walk through the complete mathematical derivation of the gradients for a linear classifier, moving from single variables to full matrix vectorization.
1. The Setup: Forward Propagation
Let’s define our variables for a single training sample. Assume our input features have a dimension of D, and we are classifying them into C distinct classes.
Input x: A D×1 column vector.
Weights W: A C×D matrix.
Bias b: A C×1 column vector.
True Label y: The true class index (or a C×1 one-hot encoded vector where only the true class index is 1).
The Linear Layer (Logits): First, we compute the raw scores (logits) z for each class:
z=Wx+b
The Softmax Layer: We convert these raw scores into a valid probability distribution. The probability of the sample belonging to class i is:
$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$
The Cross-Entropy Loss: For a single sample where the true class is y, the loss only cares about the predicted probability assigned to that true class:
L=−log(py)
By substituting the Softmax formula into the loss function, we get:
$$L = -z_y + \log\left(\sum_{j=1}^{C} e^{z_j}\right)$$
2. The Core Derivation: Gradient with respect to Logits (z)
To perform backpropagation, we first need to find how the loss changes with respect to each logit $z_i$. We denote this gradient as $\frac{\partial L}{\partial z_i}$.
We must split this into two scenarios: when i is the true class, and when i is any other class.
Case 1: Deriving for the true class ($i = y$)
$$\frac{\partial L}{\partial z_y} = \frac{\partial}{\partial z_y}\left(-z_y + \log\sum_{j=1}^{C} e^{z_j}\right)$$
Applying the chain rule to the logarithm term:
$$\frac{\partial L}{\partial z_y} = -1 + \frac{e^{z_y}}{\sum_{j=1}^{C} e^{z_j}}$$
Notice that the fractional term is exactly our definition of $p_y$!
$$\frac{\partial L}{\partial z_y} = p_y - 1$$
Case 2: Deriving for an incorrect class ($i \ne y$). Because the first term $-z_y$ does not depend on $z_i$, its derivative is 0.
$$\frac{\partial L}{\partial z_i} = 0 + \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$
Again, the fractional term is $p_i$:
$$\frac{\partial L}{\partial z_i} = p_i$$
The Vectorized Form: We can beautifully combine these two cases using an indicator function I(i=y) (which is 1 if i is the true class, and 0 otherwise).
Let $dz$ be the gradient vector $\frac{\partial L}{\partial z}$. Its vectorized form is simply:
dz=p−y
Intuition Check: This result is incredibly logical. The gradient is simply the Predicted Probability minus the True Probability. If the model is 100% confident and correct (py≈1), the gradient is 0, and no weights will be updated. The larger the error, the larger the gradient pushing the model to learn.
3. Gradients with respect to Weights (W) and Bias (b)
Now that we have the gradient of the loss with respect to the logits (dz), we use the chain rule to pass this error signal back to our parameters W and b.
Deriving for W: Let's look at a single weight element $W_{ij}$. It only influences the loss through the specific logit $z_i$.
$$\frac{\partial L}{\partial W_{ij}} = \frac{\partial L}{\partial z_i} \cdot \frac{\partial z_i}{\partial W_{ij}}$$
Since $z_i = \sum_k W_{ik} x_k + b_i$, the local gradient $\frac{\partial z_i}{\partial W_{ij}}$ is simply $x_j$. Therefore:
$$\frac{\partial L}{\partial W_{ij}} = dz_i \cdot x_j$$
To vectorize this back into a C×D matrix (the same shape as W), we take the outer product of the column vector $dz$ and the row vector $x^T$:
$$\frac{\partial L}{\partial W} = dz \cdot x^T$$
Deriving for b: Since $z = Wx + b$, the local derivative $\frac{\partial z}{\partial b}$ is the identity, so the error signal passes straight through:
$$\frac{\partial L}{\partial b} = dz$$
4. Scaling up: The Mini-Batch Form
In real-world training, we process N samples at a time to stabilize gradients and utilize parallel computing.
X becomes a D×N matrix.
dZ=P−Y becomes a C×N matrix.
To find the average gradient across the entire batch, we perform a matrix multiplication and divide by N:
Batch Weight Gradient:
$$\frac{\partial L}{\partial W} = \frac{1}{N}\, dZ \cdot X^T$$
Batch Bias Gradient: Sum the errors across all N samples for each class, then average:
$$\frac{\partial L}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} dZ^{(i)}$$
Summary
By systematically applying the chain rule, we transformed a seemingly complex matrix calculus problem into clean, highly efficient linear algebra operations. Understanding this dz=p−y dynamic is the fundamental key to grasping how classification networks “learn” from their mistakes.
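The $dz = p - y$ result is easy to verify numerically; a small sketch (function names are mine):

```python
import numpy as np

def softmax_ce_loss(z, y):
    """Cross-entropy loss L = -z_y + log(sum_j e^{z_j}) for one sample."""
    z = z - z.max()                      # shift for numeric stability
    return -z[y] + np.log(np.sum(np.exp(z)))

def softmax_ce_grad(z, y):
    """Analytic gradient dz = p - onehot(y)."""
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0
    return p

# Compare against a centered-difference numerical gradient
z = np.array([2.0, -1.0, 0.5]); y = 0; h = 1e-5
num = np.zeros_like(z)
for i in range(z.size):
    zp, zm = z.copy(), z.copy()
    zp[i] += h; zm[i] -= h
    num[i] = (softmax_ce_loss(zp, y) - softmax_ce_loss(zm, y)) / (2 * h)
print(np.allclose(softmax_ce_grad(z, y), num, atol=1e-6))  # True
```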
The primary goal of backpropagation is to calculate the partial derivatives of the cost function C with respect to every weight W and bias b in the network.
1. Notation (Matrix Form)
To facilitate a clean derivation using matrix calculus, we adopt the following conventions:
$a^l$: Activation vector of the $l$-th layer ($n_l \times 1$).
$z^l$: Weighted input vector of the $l$-th layer ($n_l \times 1$), where $z^l = W^l a^{l-1} + b^l$.
$W^l$: Weight matrix of the $l$-th layer ($n_l \times n_{l-1}$).
$b^l$: Bias vector of the $l$-th layer ($n_l \times 1$).
$\delta^l$: Error vector of the $l$-th layer, defined as $\delta^l \equiv \frac{\partial C}{\partial z^l}$.
$\sigma'(z^l)$: The derivative of the activation function applied element-wise to $z^l$.
2. Derivation of the Equations
BP1: Error at the Output Layer
Goal: $\delta^L = \nabla_{a^L} C \odot \sigma'(z^L)$
Derivation Steps:
Apply the Chain Rule: The error $\delta^L$ represents how the cost $C$ changes with respect to the weighted input $z^L$. Since $C$ depends on $z^L$ through the activations $a^L$:
$$\delta^L = \frac{\partial C}{\partial z^L} = \left(\frac{\partial a^L}{\partial z^L}\right)^T \frac{\partial C}{\partial a^L}$$
Compute the Jacobian: Because $a_j^L = \sigma(z_j^L)$ (each output depends only on its own input), the Jacobian matrix $\frac{\partial a^L}{\partial z^L}$ is diagonal:
$$\frac{\partial a^L}{\partial z^L} = \mathrm{diag}\left(\sigma'(z_1^L), \ldots, \sigma'(z_{n_L}^L)\right)$$
Result: Multiplying a diagonal matrix by a vector is equivalent to the Hadamard product:
$$\delta^L = \nabla_{a^L} C \odot \sigma'(z^L)$$
BP2: Propagating Error to Hidden Layers
Goal: $\delta^l = \left((W^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$
Derivation Steps:
Link Consecutive Layers: We express the error at layer $l$ in terms of the error at layer $l+1$ using the multivariate chain rule:
$$\delta^l = \frac{\partial C}{\partial z^l} = \left(\frac{\partial z^{l+1}}{\partial z^l}\right)^T \frac{\partial C}{\partial z^{l+1}} = \left(\frac{\partial z^{l+1}}{\partial z^l}\right)^T \delta^{l+1}$$
Differentiate the Linear Transformation: Recall $z^{l+1} = W^{l+1} a^l + b^{l+1}$ and $a^l = \sigma(z^l)$. Applying the chain rule to find $\frac{\partial z^{l+1}}{\partial z^l}$:
$$\frac{\partial z^{l+1}}{\partial z^l} = \frac{\partial z^{l+1}}{\partial a^l} \cdot \frac{\partial a^l}{\partial z^l} = W^{l+1} \cdot \mathrm{diag}(\sigma'(z^l))$$
Transpose and Simplify: Substitute back and use the property $(AB)^T = B^T A^T$.
Final Form: Convert the diagonal matrix multiplication to a Hadamard product:
$$\delta^l = \left((W^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$$
BP3: Gradient for Biases
Goal: $\frac{\partial C}{\partial b^l} = \delta^l$
Derivation Steps:
Chain Rule via Intermediate z:
$$\frac{\partial C}{\partial b^l} = \left(\frac{\partial z^l}{\partial b^l}\right)^T \frac{\partial C}{\partial z^l}$$
Compute Local Gradient: From $z^l = W^l a^{l-1} + b^l$, we see that $b^l$ is added directly to the weighted sum. Thus $\frac{\partial z^l}{\partial b^l}$ is the identity matrix $I$.
Conclusion:
$$\frac{\partial C}{\partial b^l} = I^T \delta^l = \delta^l$$
BP4: Gradient for Weights
Goal: $\frac{\partial C}{\partial W^l} = \delta^l (a^{l-1})^T$
Derivation Steps:
Element-wise Approach: Consider a single weight $w_{jk}^l$ connecting neuron $k$ in layer $l-1$ to neuron $j$ in layer $l$:
$$\frac{\partial C}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l} \frac{\partial z_j^l}{\partial w_{jk}^l} = \delta_j^l \frac{\partial z_j^l}{\partial w_{jk}^l}$$
Solve for Partial Derivative: Since $z_j^l = \sum_m w_{jm}^l a_m^{l-1} + b_j^l$, the derivative with respect to $w_{jk}^l$ is simply $a_k^{l-1}$. Therefore $\frac{\partial C}{\partial w_{jk}^l} = \delta_j^l a_k^{l-1}$.
Vectorize as an Outer Product: The collection of all such derivatives $\delta_j^l a_k^{l-1}$ for all $j, k$ forms the outer product of the error vector $\delta^l$ and the input activation vector $a^{l-1}$:
$$\frac{\partial C}{\partial W^l} = \delta^l (a^{l-1})^T$$
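The four equations can be exercised on a tiny network; a numpy sketch under my own choices (a 3→4→2 architecture, sigmoid activations, and a quadratic cost):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Tiny 2-layer network (3 -> 4 -> 2) with quadratic cost C = 0.5 * ||a2 - t||^2
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))
a0 = rng.normal(size=(3, 1))
t = np.array([[1.0], [0.0]])

# Forward pass
z1 = W1 @ a0 + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# BP1: error at the output layer, delta^L = grad_a C (Hadamard) sigma'(z^L)
delta2 = (a2 - t) * sigmoid(z2) * (1 - sigmoid(z2))
# BP2: propagate backwards, delta^l = ((W^{l+1})^T delta^{l+1}) (Hadamard) sigma'(z^l)
delta1 = (W2.T @ delta2) * sigmoid(z1) * (1 - sigmoid(z1))
# BP3: bias gradients are the error vectors themselves
db1, db2 = delta1, delta2
# BP4: weight gradients are outer products with the previous layer's activations
dW1 = delta1 @ a0.T
dW2 = delta2 @ a1.T
print(dW1.shape, dW2.shape)  # (4, 3) (2, 4)
```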
The number of neurons in the hidden layers is often related to the capacity of information: more neurons, more capacity.
A regularizer keeps the network simple and prevents overfitting. Do not use the size of the neural network as a regularizer; use stronger regularization instead.
Derivatives
$$f(x,y) = x + y \;\rightarrow\; \frac{\partial f}{\partial x} = 1,\quad \frac{\partial f}{\partial y} = 1$$
$$f(x,y) = \max(x,y) \;\rightarrow\; \frac{\partial f}{\partial x} = \mathbb{1}(x \ge y),\quad \frac{\partial f}{\partial y} = \mathbb{1}(y \ge x)$$
$$f(x,y) = xy \;\rightarrow\; \frac{\partial f}{\partial x} = y,\quad \frac{\partial f}{\partial y} = x$$
The derivative on each variable tells you the sensitivity of the whole expression on its value.
add gate: gradient distributor
mul gate: “swap multiplier”
copy gate: gradient adder
max gate: gradient router
Backpropagation
The problem now is to calculate the gradients in order to learn the parameters.
Obviously we can’t simply derive it on paper, for it can be very tedious and not feasible for complex models.
Here’s the better idea: Computational graphs + Backpropagation.
```
FORWARD PASS: computes output z from inputs x, y

[ Input x ] ----\
                 \
                  v
              _____________
             |             |
             |   Node f    | ----> [ Output z ]
             |_____________|
                  ^
                 /
[ Input y ] ----/
```
Gate / Node / Function object: Actual PyTorch code
```python
class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)  # Need to cache some values for use in backward
        z = x * y
        return z

    @staticmethod
    def backward(ctx, grad_z):  # Upstream gradient (grad_z)
        x, y = ctx.saved_tensors
        grad_x = y * grad_z  # dz/dx * dL/dz <-- Multiply upstream and local gradients
        grad_y = x * grad_z  # dz/dy * dL/dz
        return grad_x, grad_y
```
Aside: Vector Derivatives
For vector derivatives:
Vector to Scalar
x∈RN,y∈R
The derivative is a gradient.
$$\frac{\partial y}{\partial x} \in \mathbb{R}^N, \qquad \left(\frac{\partial y}{\partial x}\right)_n = \frac{\partial y}{\partial x_n}$$
Vector to Vector
x∈RN,y∈RM
The derivative is a Jacobian Matrix.
$$\frac{\partial y}{\partial x} \in \mathbb{R}^{N \times M}, \qquad \left(\frac{\partial y}{\partial x}\right)_{n,m} = \frac{\partial y_m}{\partial x_n}$$
Aside: Jacobian Matrix
A Jacobian Matrix is the derivative for a function that maps a Vector to a Vector.
Definition
Let f:Rn→Rm be a function such that each of its first-order partial derivatives exists on Rn. This function takes a point x=(x1,…,xn)∈Rn as input and produces the vector f(x)=(f1(x),…,fm(x))∈Rm as output.
Then the Jacobian matrix of $f$, denoted $J_f$, is the $m \times n$ matrix whose $(i,j)$ entry is $\frac{\partial f_i}{\partial x_j}$; explicitly:
$$J_f = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
With the score function and the loss function, now we focus on how we minimize the loss. Optimization is the process of finding the set of parameters W that minimize the loss function.
Strategy #1: A first very bad idea solution: Random search
Core idea: iterative refinement
Strategy #2: Follow the slope
In 1-dimension, the derivative of a function:
$$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
In multiple dimensions, the gradient is the vector of (partial derivatives) along each dimension ∇WL.
The slope in any direction is the dot product of that direction with the gradient. The direction of steepest descent is the negative gradient.
Numerical Gradient
We can get an approximate numerical gradient. It is often sufficient to use a very small value (such as 1e-5).
This is easy to write, but might be slow.
Analytic Gradient
We can also use calculus to get the exact value of gradients.
Practical considerations
Always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check.
It often works better to compute the numeric gradient using the centered difference formula: [f(x+h)−f(x−h)]/2h.
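A sketch of such a gradient check using the centered difference formula (the test function and tolerance are my choices):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference gradient: [f(x+h) - f(x-h)] / (2h) per dimension."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h; fp = f(x)
        x.flat[i] = old - h; fm = f(x)
        x.flat[i] = old                      # restore the original value
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

# Gradient check against the analytic gradient of f(x) = ||x||^2, which is 2x
x = np.array([1.0, -2.0, 3.0])
num = numerical_gradient(lambda v: np.sum(v * v), x)
print(np.allclose(num, 2 * x, atol=1e-6))  # True
```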
But in large-scale applications, the training data can have on order of millions of examples. It seems wasteful to compute the full loss function over the entire training set in order to perform only a single parameter update.
Stochastic Gradient Descent (SGD)
A very common approach to addressing this challenge is to compute the gradient over batches of the training data, which is called the Mini-batch Gradient Descent.
The gradient from a mini-batch is a good approximation of the gradient of the full objective. Therefore, much faster convergence can be achieved in practice by evaluating the mini-batch gradients to perform more frequent parameter updates.
The extreme case of this is a setting where the mini-batch contains only a single example. This process is called Stochastic Gradient Descent (SGD).
The size of the mini-batch is a hyperparameter. It is usually based on memory constraints (if any), or set to some value, e.g. 32, 64 or 128. We use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2.
Problems with SGD
Jittering
Local minima or saddle points
Gradients from mini-batches might be too noisy.
SGD + Momentum
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations.
ρ (rho): Momentum coefficient (friction), typically 0.9 or 0.99.
α (alpha): Learning rate.
∇f(x): Gradient.
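Using these symbols, the update can be sketched as $v \leftarrow \rho v - \alpha \nabla f(x)$ followed by $x \leftarrow x + v$ (a minimal sketch; the toy objective is mine):

```python
def sgd_momentum_step(x, v, grad, alpha=0.01, rho=0.9):
    """One SGD+momentum update: v <- rho * v - alpha * grad, then x <- x + v."""
    v = rho * v - alpha * grad
    x = x + v
    return x, v

# Minimize f(x) = x^2 (gradient 2x): velocity accumulates along the consistent
# downhill direction and damps oscillations
x, v = 5.0, 0.0
for _ in range(200):
    x, v = sgd_momentum_step(x, v, grad=2 * x)
print(abs(x) < 1e-2)  # True: converged near the minimum at 0
```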
Aside: Nesterov Momentum
Standard Momentum calculates the gradient first, and then adds the previous velocity to it.
Unlike Standard Momentum, Nesterov Momentum first adds the previous velocity to the parameters, and then calculates the gradient at the updated point. It "looks ahead," making it more stable.
$$v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t)$$
$$x_{t+1} = x_t + v_{t+1}$$
We can let $\tilde{x}_t = x_t + \rho v_t$ to make it look nicer. Rearranging:
$$v_{t+1} = \rho v_t - \alpha \nabla f(\tilde{x}_t)$$
$$\tilde{x}_{t+1} = \tilde{x}_t - \rho v_t + (1+\rho) v_{t+1} = \tilde{x}_t + v_{t+1} + \rho (v_{t+1} - v_t)$$
“Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction
RMSProp
RMSProp (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm designed to improve the performance and speed of training deep learning models.
It adds element-wise scaling of the gradient based on historical sums of squares in each dimension (with decay).
Mathematical Formulation:
Compute Gradient
$$g_t = \nabla_\theta J(\theta_t)$$
Update Moving Average of Squared Gradients
$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2$$
Update Parameters
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \odot g_t$$
where:
Learning Rate (η): Controls the step size during parameter updates.
Decay Rate (γ): Determines how quickly the moving average of squared gradients decays.
Epsilon (ϵ): A small constant (e.g., 1e-8) added to the denominator to prevent division by zero and ensure numerical stability.
```python
grad_squared = 0
while True:
    dx = compute_gradient(x)
    # Update moving average of the squared gradients
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    # Update weights
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```
However, in Adam, since the first and second moment estimates start at zero, the initial steps might be gigantic. So we apply bias correction to prevent early-stage instability.
For standard Adam, the regularization term is added to the gradient before the moment computation.
For AdamW (weight decay), the regularization term is instead applied directly to x in the final update, after the moments are computed.
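A plain-Python sketch of the Adam update with bias correction (names and the toy objective are mine; the AdamW variant is noted in a comment):

```python
import math

def adam_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction: both moments
    v_hat = v / (1 - beta2 ** t)                # start at zero, too small early on
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    # AdamW variant: apply weight decay directly to the parameter here instead,
    # e.g. x = x - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * x)
    return x, m, v

# Minimize f(x) = x^2; without bias correction the first steps would be mis-scaled
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 3001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.05)
```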
Learning Rate Schedules
The learning rate is a hyperparameter. We may make it decay over time.
Step decay: reduce the learning rate at a few fixed points.
Cosine decay
Linear decay
Inverse sqrt decay
etc.
Empirical rule of thumb: If you increase the batch size by N, also scale the initial learning rate by N.
Second-Order Optimization
We can use gradient and Hessian to form a quadratic approximation, then step to its minima.
Second-Order Taylor Series:
$$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta J(\theta_0) + \frac{1}{2} (\theta - \theta_0)^\top H (\theta - \theta_0)$$
Solving for the critical point we obtain the Newton parameter update:
$$\theta^* = \theta_0 - H^{-1} \nabla_\theta J(\theta_0)$$
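On a quadratic objective, a single Newton step of this form lands exactly at the minimum; a quick numpy check (the matrices are made up for illustration):

```python
import numpy as np

# Quadratic objective J(theta) = 0.5 * theta^T A theta - b^T theta
# has gradient A theta - b and constant Hessian H = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

theta0 = np.zeros(2)
grad = A @ theta0 - b
theta_star = theta0 - np.linalg.solve(A, grad)   # Newton update: theta0 - H^{-1} grad

print(np.allclose(A @ theta_star, b))  # True: the gradient vanishes after one step
```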
This is often impractical for deep learning because of the computation and memory required to form and invert the Hessian.
Quasi-Newton methods (BFGS)
L-BFGS (Limited-memory BFGS): does not store the full inverse Hessian.
Aside: Hessian Matrices
The Hessian matrix (or simply the Hessian), denoted as H, represents the second-order partial derivatives of a function. While the gradient (∇) tells you the slope (direction of steepest descent), the Hessian tells you the curvature of the loss function surface.
It is a square matrix of second-order partial derivatives of a scalar-valued function, f:Rn→R. It describes the local curvature of a function of many variables.
With the disadvantages of the KNN algorithm, we need to come up with a more powerful approach. The new approach will have two major components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels.
Score Function
The score function maps the pixel values of an image to confidence scores for each class.
As before, let’s assume a training dataset of images xi∈RD, each associated with a label yi. Here i=1…N and yi∈1…K.
That is, we have N examples (each with a dimensionality D) and K distinct categories.
We will define the score function f:RD↦RK that maps the raw image pixels to class scores.
Linear Classifier
We will start out with arguably the simplest possible function, a linear mapping.
f(xi,W,b)=Wxi+b
In the above equation, we are assuming that the image xi has all of its pixels flattened out to a single column vector of shape [D x 1]. The matrix W (of size [K x D]), and the vector b (of size [K x 1]) are the parameters of the function.
The parameters in W are often called the weights, and b is called the bias vector because it influences the output scores, but without interacting with the actual data xi.
The input data are given and fixed. The goal is to set W,b in such a way that the computed scores match the ground truth labels across the whole training set.
Bias Tricks
We can combine the two sets of parameters into a single matrix that holds both of them by extending the vector xi with one additional dimension that always holds the constant 1 - a default bias dimension.
f(xi,W)=Wxi
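A quick numpy check of the bias trick (the numbers here are made up):

```python
import numpy as np

# Hypothetical small example: D = 3 features, K = 2 classes
W = np.array([[0.2, -0.5, 0.1],
              [1.5, 1.3, 2.1]])
b = np.array([1.0, -0.3])
x = np.array([2.0, 1.0, 3.0])

# Standard form: f = Wx + b
scores = W @ x + b

# Bias trick: append a constant 1 to x and fold b into the last column of W
W_ext = np.hstack([W, b[:, None]])     # shape (K, D+1)
x_ext = np.append(x, 1.0)              # shape (D+1,)
scores_ext = W_ext @ x_ext

print(np.allclose(scores, scores_ext))  # True
```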
Image Data Preprocessing
In Machine Learning, it is a very common practice to always perform normalization of input features. In particular, it is important to center your data by subtracting the mean from every feature.
Loss Function
We will develop Multiclass Support Vector Machine (SVM) loss.
The score function takes the pixels and computes the vector f(xi,W) of class scores, which we will abbreviate to s (short for scores).
The Multiclass SVM loss for the i-th example is then formalized as follows:
$$L_i = \sum_{j \ne y_i} \max(0,\, s_j - s_{y_i} + \Delta)$$
The function accumulates loss from every incorrect class whose score comes within Δ of (or exceeds) the correct class's score.
In summary, the SVM loss function wants the score of the correct class yi to be larger than the incorrect class scores by at least Δ (delta).
The threshold at zero max(0, -) function is often called the hinge loss. We also have squared hinge loss SVM (or L2-SVM), which uses the form max(0, -)² that penalizes violated margins more strongly.
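A minimal numpy sketch of the per-example hinge loss above (the scores are made up):

```python
import numpy as np

def svm_loss_single(scores, y, delta=1.0):
    """Multiclass SVM (hinge) loss for one example given its class scores."""
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0                       # the sum skips j == y_i
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])      # hypothetical class scores
# ≈ 2.9 = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
print(svm_loss_single(scores, y=0))
```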
Regularization
We wish to encode some preference for a certain set of weights W over others to remove this ambiguity. We can do so by extending the loss function with a regularization penalty R(W). The most common regularization penalty is the squared L2 norm that discourages large weights through an elementwise quadratic penalty over all parameters:
$$R(W) = \sum_k \sum_l W_{k,l}^2$$
Including the regularization penalty completes the full Multiclass Support Vector Machine loss, which is made up of two components: the data loss (which is the average loss Li over all examples) and the regularization loss.
Penalizing large weights tends to improve generalization, because it means that no input dimension can have a very large influence on the scores all by itself. It keeps the weights small and simple. This can improve the generalization performance of the classifiers on test images and lead to less overfitting.
It prevents the model from doing too well on the training data.
Note that due to the regularization penalty we can never achieve loss of exactly 0.0 on all examples.
Practical Considerations
Setting Delta: It turns out that this hyperparameter can safely be set to Δ=1.0 in all cases. (The exact value of the margin between the scores is in some sense meaningless because the weights can shrink or stretch the differences arbitrarily.)
Binary Support Vector Machine: The loss for the i-th example can be written as
$$L_i = C \max(0,\, 1 - y_i w^T x_i) + R(W)$$
C in this formulation and λ in our formulation control the same tradeoff and are related through the reciprocal relation $C \propto \frac{1}{\lambda}$.
Other Multiclass SVM formulations: Multiclass SVM presented in this section is one of few ways of formulating the SVM over multiple classes.
Another commonly used form is the One-Vs-All (OVA) SVM which trains an independent binary SVM for each class vs. all other classes. Related, but less common to see in practice is also the All-vs-All (AVA) strategy.
The last formulation you may see is a Structured SVM, which maximizes the margin between the score of the correct class and the score of the highest-scoring incorrect runner-up class.
Softmax Classifier
In the Softmax Classifier, we now interpret these scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss that has the form:
$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$
where we are using the notation $f_j$ to mean the $j$-th element of the vector of class scores $f$.
The function $f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$ is called the softmax function: it takes a vector of arbitrary real-valued scores (in $z$) and squashes it to a vector of values between zero and one that sum to one.
Information Theory View
The cross-entropy between a “true” distribution p and an estimated distribution q is defined as:
$$H(p,q) = -\sum_x p(x) \log q(x)$$
Minimizing Cross-Entropy is equivalent to minimizing the KL Divergence.
$$H(p,q) = H(p) + D_{KL}(p \,\|\, q)$$
Because the true distribution p is fixed (a one-hot distribution, so its entropy H(p) is zero), minimizing cross-entropy is the same as forcing the predicted distribution q to look exactly like the true distribution p.
The Softmax Loss objective is to force the neural network to output a probability distribution where the correct class has a probability very close to 1.0, and all other classes are close to 0.0.
Information Theory Supplementary
Information Entropy measures the uncertainty or unpredictability of a random variable.
The more unpredictable an event is (lower probability), the more information is gained when it occurs, and the higher the entropy. Conversely, if an event has a probability of 1 (certainty), its entropy is 0.
For a discrete random variable X with possible outcomes {x1,...,xn} and probabilities P(xi), the entropy H(X) is defined as:
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)$$
Cross Entropy measures the total cost of using distribution q to represent distribution p.
Minimizing the cross-entropy H(p,q) is mathematically equivalent to minimizing the KL Divergence. It forces the predicted distribution q to become as close as possible to the true distribution p.
Probabilistic View
$$P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}$$
The formula maps raw scores to a range of (0,1) such that the sum of all class probabilities equals 1.
Using the Cross-Entropy loss function during training is equivalent to maximizing the likelihood of the correct class.
Numeric Stability
The exponentials $e^{f_j}$ can be very large, and dividing large numbers can be numerically unstable, so it is important to use a normalization trick. Multiplying the numerator and denominator by a constant C does not change the result:
$$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{C e^{f_{y_i}}}{C \sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$$
A common choice is to set $\log C = -\max_j f_j$. This simply states that we should shift the values inside the vector f so that the highest value is zero. In code:
```python
f = np.array([123, 456, 789])  # example with 3 classes and each having large scores
p = np.exp(f) / np.sum(np.exp(f))  # Bad: numeric problem, potential blowup

# instead: first shift the values of f so that the highest number is 0:
f -= np.max(f)  # f becomes [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f))  # safe to do, gives the correct answer
```
Note
To be precise, the SVM classifier uses the hinge loss, also sometimes called the max-margin loss.
The Softmax classifier uses the cross-entropy loss.
SVM vs Softmax
The SVM interprets these as class scores and its loss function encourages the correct class to have a score higher by a margin than the other class scores.
The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high (equivalently the negative of it to be low).
Softmax classifier provides “probabilities” for each class.
The “probabilities” are dependent on the regularization strength. They are better thought of as confidences where the ordering of the scores is interpretable.
In practice, SVM and Softmax are usually comparable. Compared to the Softmax classifier, the SVM is a more local objective.
The Softmax classifier is never fully happy with the scores it produces: the correct class could always have a higher probability and the incorrect classes always a lower probability and the loss would always get better.
However, the SVM is happy once the margins are satisfied and it does not micromanage the exact scores beyond this constraint.
Task: assigning an input image one label from a fixed set of categories
Images are defined as tensors of integers between [0,255], e.g. 800 x 600 x 3
Challenges:
Viewpoint variation
Scale variation
Deformation
Occlusion
Illumination conditions
Background clutter
Intra-class variation
A good image classification model must be invariant to the cross product of all these variations, while simultaneously retaining sensitivity to the inter-class variations.
The data-driven approach: first accumulate a training dataset of labeled images, then develop learning algorithms that learn from them.
The image classification pipeline:
Input: Input a set of N images, each labeled with one of K different classes.
Learn: use the training set to learn what every one of the classes looks like. This step is referred to as training a classifier, or learning a model.
Evaluate: evaluate the quality of the classifier by asking it to predict labels for a new set of images that it has never seen before.
Nearest Neighbor Classifier
L1 Distance (Manhattan Distance)
L1 distance is the sum of the absolute values of the differences between corresponding dimensions of two vectors. The calculation formula is:
$$d_1(I_1, I_2) = \sum_p \left| I_1^p - I_2^p \right|$$
L2 Distance (Euclidean Distance)
L2 distance is the square root of the sum of the squared differences between corresponding dimensions of two vectors. This is what we commonly refer to as the straight-line distance between two points. The calculation formula is:
$$d_2(I_1, I_2) = \sqrt{\sum_p \left( I_1^p - I_2^p \right)^2}$$
Note on L2: Squaring amplifies values, thereby magnifying the influence of outliers.
Evaluation
For evaluation, we use accuracy, which measures the fraction of predictions that were correct.
k-Nearest Neighbor Classifier
The idea: we will find the top k closest images, and have them vote on the label of the test image.
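A minimal numpy sketch of this voting scheme, using the L1 distance from earlier (the data and function name are made up):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    dists = np.sum(np.abs(X_train - x), axis=1)   # L1 distance to each training example
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = y_train[nearest]
    return np.bincount(votes).argmax()            # most common label wins

X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.6, 0.2])))  # 0
print(knn_predict(X_train, y_train, np.array([5.4, 5.1])))  # 1
```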
Hyperparameters
It’s often not obvious what values/settings one should choose for hyperparameters.
We cannot use the test set for the purpose of tweaking hyperparameters.
-> Split your training set into a smaller training set and a validation set. Use the validation set to tune all hyperparameters. At the end, run a single time on the test set and report performance.
Cross-validation
Cross-validation: iterating over different validation sets and averaging the performance across these.
In this part of the Cache Lab, the mission is simple yet devious: optimize matrix transposition for three specific sizes: 32x32, 64x64, and 61x67. Our primary enemy? Cache misses.
Matrix Transposition
A standard transposition swaps rows and columns directly:
```c
void trans(int M, int N, int A[N][M], int B[M][N])
{
    int i, j, tmp;

    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            tmp = A[i][j];
            B[j][i] = tmp;
        }
    }
}
```
While correct, this approach is a cache-miss nightmare because it ignores how data is actually stored in memory.
Cache Overview
To optimize effectively, we first have to understand our hardware constraints. The lab specifies a directly mapped cache with the following parameters:
| Parameter | Value |
| --- | --- |
| Sets (S) | 32 |
| Block Size (B) | 32 bytes |
| Associativity (E) | 1 (Direct-mapped) |
| Integer Size | 4 bytes |
| Capacity per line | 8 integers |
We will use Matrix Tiling and Loop Unrolling to optimize the codes.
32x32 Case
In this case, a row of the matrix needs 32/8 = 4 cache sets to store, and cache conflicts occur every 32/4 = 8 rows. This makes 8x8 tiling the sweet spot.
By processing the matrix in 8×8 blocks, we ensure that once a line of A is loaded, we use all 8 integers before it gets evicted. We also use loop unrolling with 8 local variables to minimize the overhead of accessing B.
61x67 Case
Since 61 and 67 are not powers of two, the conflict misses don't occur in a regular pattern like they do in the square matrices. This "irregularity" is actually a blessing. We can get away with simple tiling. A 16x16 block size typically yields enough performance to pass the miss-count threshold.
```c
int BLOCK_SIZE = 16;
int i, j, k, l, tmp;
int a, b;
for (i = 0; i < N; i += BLOCK_SIZE) {
    for (j = 0; j < M; j += BLOCK_SIZE) {
        a = i + BLOCK_SIZE;
        b = j + BLOCK_SIZE;
        for (k = i; k < N && k < a; k++) {
            for (l = j; l < M && l < b; l++) {
                tmp = A[k][l];
                B[l][k] = tmp;
            }
        }
    }
}
```
64x64 Case
This is the hardest part. In a 64x64 matrix, a row needs 8 sets, but conflict misses occur every 32/8=4 rows. If we use 8x8 tiling, the bottom half of the block will evict the top half.
We can try a 4x4 matrix tiling first.
```c
int BLOCK_SIZE = 4;
int i, j, k, l, tmp;
int a, b;
for (i = 0; i < N; i += BLOCK_SIZE) {
    for (j = 0; j < M; j += BLOCK_SIZE) {
        a = i + BLOCK_SIZE;
        b = j + BLOCK_SIZE;
        for (k = i; k < N && k < a; k++) {
            for (l = j; l < M && l < b; l++) {
                tmp = A[k][l];
                B[l][k] = tmp;
            }
        }
    }
}
```
But this isn’t enough to pass the miss-count threshold.
We try an 8x8 matrix tiling. We solve the conflict problem by partitioning the 8×8 block into four 4×4 sub-blocks and using the upper-right corner of B as a "buffer" to store data temporarily.
```c
int i, j, k;
int tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8;

// Iterate through the matrix in 8x8 blocks to improve spatial locality
for (i = 0; i < N; i += 8) {
    for (j = 0; j < M; j += 8) {
        /*
         * STEP 1: Handle the top half of the 8x8 block (rows i to i+3)
         */
        for (k = 0; k < 4; k++) {
            // Read 8 elements from row i+k of matrix A into registers
            tmp1 = A[i + k][j];
            tmp2 = A[i + k][j + 1];
            tmp3 = A[i + k][j + 2];
            tmp4 = A[i + k][j + 3];  // Top-left 4x4
            tmp5 = A[i + k][j + 4];
            tmp6 = A[i + k][j + 5];
            tmp7 = A[i + k][j + 6];
            tmp8 = A[i + k][j + 7];  // Top-right 4x4

            // Transpose top-left 4x4 from A directly into top-left of B
            B[j][i + k] = tmp1;
            B[j + 1][i + k] = tmp2;
            B[j + 2][i + k] = tmp3;
            B[j + 3][i + k] = tmp4;

            // Temporarily store top-right 4x4 of A in the top-right of B
            // This avoids cache misses by using the already-loaded cache line in B
            B[j][i + k + 4] = tmp5;
            B[j + 1][i + k + 4] = tmp6;
            B[j + 2][i + k + 4] = tmp7;
            B[j + 3][i + k + 4] = tmp8;
        }

        /*
         * STEP 2: Handle the bottom half and fix the temporary placement
         */
        for (k = 0; k < 4; k++) {
            // Read bottom-left 4x4 column-wise from A
            tmp1 = A[i + 4][j + k];
            tmp2 = A[i + 5][j + k];
            tmp3 = A[i + 6][j + k];
            tmp4 = A[i + 7][j + k];
            // Read bottom-right 4x4 column-wise from A
            tmp5 = A[i + 4][j + k + 4];
            tmp6 = A[i + 5][j + k + 4];
            tmp7 = A[i + 6][j + k + 4];
            tmp8 = A[i + 7][j + k + 4];

            // Retrieve the top-right elements we temporarily stored in B in Step 1
            int t1 = B[j + k][i + 4];
            int t2 = B[j + k][i + 5];
            int t3 = B[j + k][i + 6];
            int t4 = B[j + k][i + 7];

            // Move bottom-left of A into the top-right of B
            B[j + k][i + 4] = tmp1;
            B[j + k][i + 5] = tmp2;
            B[j + k][i + 6] = tmp3;
            B[j + k][i + 7] = tmp4;

            // Move the retrieved temporary values into the bottom-left of B
            B[j + k + 4][i] = t1;
            B[j + k + 4][i + 1] = t2;
            B[j + k + 4][i + 2] = t3;
            B[j + k + 4][i + 3] = t4;

            // Place bottom-right of A into the bottom-right of B
            B[j + k + 4][i + 4] = tmp5;
            B[j + k + 4][i + 5] = tmp6;
            B[j + k + 4][i + 6] = tmp7;
            B[j + k + 4][i + 7] = tmp8;
        }
    }
}
```
Note: The key trick here is traversing B by columns where possible (so B stays right in the cache) and utilizing local registers (temporary variables) to bridge the gap between conflicting cache lines.
Conclusion
Optimizing matrix transposition is less about the math and more about mechanical sympathy—understanding the underlying hardware to write code that plays nice with the CPU’s cache.
The jump from the naive version to these optimized versions isn’t just a marginal gain; it’s often a 10x reduction in cache misses. It serves as a stark reminder that in systems programming, how you access your data is just as important as the algorithm itself.
A cache can be described with the following four parameters:
$S = 2^s$ (Cache Sets): The cache is divided into S sets.
E (Cache Lines per set): This is the “associativity.”
If E=1, it’s a direct-mapped cache. If E>1, it’s set-associative.
Each line contains a valid bit, a tag, and the actual data block.
B=2b (Block Size): The number of bytes stored in each line.
The b bits at the end of an address tell the cache the offset within that block.
m: The bits of the machine memory address.
2. Address Decomposition
When the CPU wants to access a 64-bit address, the cache doesn’t look at the whole number at once. It slices the address into three distinct fields:
| Field | Purpose |
| --- | --- |
| Tag | Uniquely identifies the memory block within a specific set. t = m - b - s |
| Set Index | Determines which set the address maps to. |
| Block Offset | Identifies the specific byte within the cache line. |
3. The “Search and Match” Process
When our simulator receives an address (e.g., from an L or S operation in the trace file), it follows these steps:
Find the Set: Use the set index bits to jump to the correct set in our cache structure.
Search the Lines: Look through all the lines in that set.
Hit: If a line has valid == true AND its tag matches the address tag.
Miss: If no line matches.
Handle the Miss:
Cold Start: If there is an empty line (valid == false), fill it with the new tag and set valid = true.
Eviction: If all lines are full, we must kick one out. This is where the LRU (Least Recently Used) policy comes in: we find the line that hasn’t been touched for the longest time and replace it.
Lab Requirements
For this Lab Project, we will write a cache simulator that takes a valgrind memory trace as an input.
Input
The input looks like:
```
I 0400d7d4,8
 M 0421c7f0,4
 L 04f6b868,8
 S 7ff0005c8,8
```
Each line denotes one or two memory accesses. The format of each line is
```
[space]operation address,size
```
The operation field denotes the type of memory access:
“I” denotes an instruction load, “L” a data load,
“S” a data store
“M” a data modify (i.e., a data load followed by a data store).
Mind you: There is never a space before each “I”. There is always a space before each “M”, “L”, and “S”.
The address field specifies a 64-bit hexadecimal memory address. The size field specifies the number of bytes accessed by the operation.
CLI
Our program should take the following command line arguments:
getopt is declared in unistd.h, but our compiler option is set to -std=c99, which hides the POSIX extensions, so that declaration is not visible. GNU systems provide a standalone <getopt.h> header, so we include getopt.h instead.
```c
opt = getopt(argc, argv, "hvs:E:b:t:");
```
h and v: These are boolean flags.
s:, E:, b:, and t:: These are required arguments. The colon tells getopt that these flags must be followed by a value (e.g., -s 4).
After parsing the arguments, we set the initial value of our Cache Data Model.
fscanf does not skip spaces before %c, so we add a space before %c in the format string.
!feof(traceFile) does not work correctly here. feof only returns true after a read operation has already attempted to go past the end of the file and failed. Using it as a loop condition (e.g., while (!feof(p))) causes an "off-by-one" error, where the loop executes one extra time with stale data from the last successful read.
```c
void loadData(long long address, int size) {
    // Simulate accessing data at the given address
    int s = getSetIndex(address);
    long long t = getTag(address);
    global_timer++;

    // 1. Hit: the tag is already present in a valid line
    for (int i = 0; i < associativity; i++) {
        if (cache[s][i].valid && cache[s][i].tag == t) {
            hit_count++;
            cache[s][i].lru_counter = global_timer;
            if (verboseMode) printf(" hit");
            return;
        }
    }

    miss_count++;
    if (verboseMode) printf(" miss");

    // 2. Cold miss: fill the first empty line
    for (int i = 0; i < associativity; i++) {
        if (!cache[s][i].valid) {
            cache[s][i].valid = true;
            cache[s][i].tag = t;
            cache[s][i].lru_counter = global_timer;
            return;
        }
    }

    // 3. All lines full: evict the least recently used line
    eviction_count++;
    if (verboseMode) printf(" eviction");

    int victim_index = 0;
    long long min_lru = cache[s][0].lru_counter;
    for (int i = 1; i < associativity; i++) {
        if (cache[s][i].lru_counter < min_lru) {
            min_lru = cache[s][i].lru_counter;
            victim_index = i;
        }
    }
    // Replace the victim line with the new block
    cache[s][victim_index].tag = t;
    cache[s][victim_index].lru_counter = global_timer;
}
```
We first check if the data already exists in the cache.
If it doesn’t exist, we have to scan for blank lines to load the data.
If blank lines don’t exist, we need to evict a line using the LRU strategy. We replace the victim line with the new line.
Other Operations
```c
void storeData(long long address, int size) {
    // Simulate storing data at the given address
    loadData(address, size);
}

void modifyData(long long address, int size) {
    // Simulate modifying data at the given address:
    // a load (which may hit or miss) followed by a store (always a hit)
    loadData(address, size);
    hit_count++;
    if (verboseMode) printf(" hit\n");
}
```
For this simulator, storing data and modifying data are basically the same thing as loading data.
Print Summary
We are asked to output the answer using the printSummary function.
In this project, we moved from the theory of hierarchy to the practical reality of memory management. By building this simulator, we reinforced several core concepts of computer systems.
With our simulator passing all the trace tests, we’ve effectively mirrored how a CPU “thinks” about memory. The next step is applying these insights to optimize actual code, ensuring our algorithms play nicely with the hardware we’ve just simulated.
Local search algorithms move between neighboring states, climbing toward local maxima or minima, to find a configuration that satisfies some constraints or optimizes some objective function.
Hill-Climbing Search
The hill-climbing search algorithm (or steepest-ascent) moves from the current state towards the neighboring state that increases the objective value the most.
The “greediness” of hill-climbing makes it vulnerable to being trapped in local maxima and plateaus.
There are variants of hill-climbing. Stochastic hill-climbing selects an action randomly among the possible uphill moves. Allowing random sideways moves, moves that don't strictly increase the objective, helps the algorithm escape "shoulders" (flat regions that are not maxima).
```
function HILL-CLIMBING(problem) returns a state
    current ← make-node(problem.initial-state)
    loop do
        neighbor ← a highest-valued successor of current
        if neighbor.value ≤ current.value then return current.state
        current ← neighbor
```
The algorithm iteratively moves to a state with a higher objective value until no such progress is possible. Hill-climbing is incomplete.
Random-restart hill-climbing conducts a number of hill-climbing searches from randomly chosen initial states. It is trivially complete.
Simulated Annealing Search
Simulated annealing aims to combine random walk (randomly moving to nearby states) and hill-climbing to obtain a complete and efficient search algorithm.
The algorithm chooses a random move at each timestep. If the move leads to a higher objective value, it is always accepted. If it leads to a smaller objective value, the move is accepted with some probability.
This probability is determined by the temperature parameter, which initially is high (allowing more “bad” moves) and decreases according to some “schedule.”
Theoretically, if the temperature is decreased slowly enough, the simulated annealing algorithm will reach the global maximum with a probability approaching 1.
```
function SIMULATED-ANNEALING(problem, schedule) returns a state
    current ← problem.initial-state
    for t = 1 to ∞ do
        T ← schedule(t)
        if T = 0 then return current
        next ← a randomly selected successor of current
        ΔE ← next.value – current.value
        if ΔE > 0 then current ← next
        else current ← next only with probability e^(ΔE/T)
```
Local Beam Search
Local beam search is another variant of the hill-climbing search algorithm.
The algorithm starts with a random initialization of k states, and at each iteration, it selects k new states. The algorithm selects the k best successor states from the complete list of successor states from all the threads. If any of the threads find the optimal value, the algorithm stops.
The threads can share information between them, allowing “good” threads (for which objectives are high) to “attract” the other threads to that region as well.
```
function LOCAL-BEAM-SEARCH(problem, k) returns a state
    states ← k randomly generated states
    loop do
        if any state in states is a goal then return that state
        all_successors ← an empty list
        for each state in states do
            all_successors.add(all successors of state)
        current_best ← the highest valued state in all_successors
        if current_best.value ≤ max(states).value then
            return the highest valued state in states  (local maximum reached)
        states ← the k highest-valued states from all_successors
```
Genetic Algorithms
Genetic Algorithms are a variant of local beam search.
Genetic algorithms begin as beam search with k randomly initialized states called the population. States (called individuals) are represented as a string over a finite alphabet.
Their main advantage is the use of crossovers — large blocks of letters that have evolved and lead to high valuations can be combined with other such blocks to produce a solution with a high total score.
```
function GENETIC-ALGORITHM(population, FITNESS-FN) returns an individual
    repeat
        new_population ← empty set
        for i = 1 to SIZE(population) do
            x ← RANDOM-SELECTION(population, FITNESS-FN)
            y ← RANDOM-SELECTION(population, FITNESS-FN)
            child ← REPRODUCE(x, y)
            if (small random probability) then child ← MUTATE(child)
            add child to new_population
        population ← new_population
    until some individual is fit enough, or enough time has elapsed
    return the best individual in population, according to FITNESS-FN
```
```
function REPRODUCE(x, y) returns an individual
    inputs: x, y, parent individuals

    n ← LENGTH(x); c ← random number from 1 to n
    return APPEND(SUBSTRING(x, 1, c), SUBSTRING(y, c + 1, n))
```
Here are the lecture notes for UC Berkeley CS188 Lecture 3.
Heuristics
Heuristics are estimates of distance to goal states.
Heuristics are typically solutions to relaxed problems (where some of the constraints of the original problem have been removed).
With heuristics, it becomes very easy to implement logic in our agent that enables them to “prefer” expanding states that are estimated to be closer to goal states when deciding which action to perform.
Greedy Search
Greedy Search always selects the frontier node with the lowest heuristic value for expansion.
The frontier is represented by a priority queue.
Greedy search is not guaranteed to find a goal state if one exists, nor is it optimal, particularly when a very bad heuristic function is selected.
A* Search
A* search is a strategy for exploration that always selects the frontier node with the lowest estimated total cost for expansion.
A* search also uses a priority queue to represent its frontier.
A* search is both complete and optimal, if the heuristic is admissible.
g(n): The function representing total backwards cost computed by UCS.
h(n): The heuristic value function, or estimated forward cost, used by greedy search.
f(n): The function representing estimated total cost, used by A* search.
f(n)=g(n)+h(n)
Admissibility
The condition required for optimality when using A* tree search is known as admissibility.
Defining h∗(n) as the true optimal forward cost to reach a goal state from a given node n, we can formulate the admissibility constraint mathematically as follows:
∀n,0≤h(n)≤h∗(n)
Theorem. For a given search problem, if the admissibility constraint is satisfied by a heuristic function h, using A* tree search with h on that search problem will yield an optimal solution.
We can prove this theorem using proof by contradiction.
Graph Search
We keep track of which states you’ve already expanded, and never expand them again.
Tree search with this added optimization is known as graph search.
```
function A*-GRAPH-SEARCH(problem, frontier) returns a solution or failure
    reached ← an empty dict mapping nodes to the cost to each one
    frontier ← INSERT((MAKE-NODE(INITIAL-STATE[problem]), 0), frontier)
    while not IS-EMPTY(frontier) do
        node, node.CostToNode ← POP(frontier)
        if problem.IS-GOAL(node.STATE) then return node
        if node.STATE is not in reached or reached[node.STATE] > node.CostToNode then
            reached[node.STATE] = node.CostToNode
            for each child-node in EXPAND(problem, node) do
                frontier ← INSERT((child-node, child-node.COST + node.CostToNode), frontier)
    return failure
```
Note that in implementation, it's critically important to store the reached set as a hash-based set or dict with O(1) membership checks, not a list, since we query it on every expansion.
Dominance
If heuristic a is dominant over heuristic b, then the estimated goal distance for a is greater than or equal to the estimated goal distance for b for every node in the state space graph. Mathematically,
∀n:ha(n)≥hb(n)
Dominance very intuitively captures the idea of one heuristic being better than another - if one admissible heuristic is dominant over another, it must be better because it will always more closely estimate the distance to a goal from any given state.
Additionally, the trivial heuristic is defined as h(n)=0, and using it reduces A* search to UCS.
All admissible heuristics dominate the trivial heuristic.
As a general rule, the max function applied to multiple admissible heuristics will also always be admissible.
Consistency
A heuristic function is said to be consistent, or monotone, if its estimate is always less than or equal to the estimated distance from any neighbouring vertex to the goal, plus the cost of reaching that neighbour.
Mathematically, this is expressed as the triangle inequality:
∀n,n′:h(n)≤c(n,n′)+h(n′)
where c(n,n′) is the actual cost of the edge from n to n′.
If h is consistent, the first time A* reaches a node, it is guaranteed to be via the optimal path.
Here are the lecture notes for UC Berkeley CS188 Lecture 1 and 2.
Agents
The central problem of AI is to create a rational agent. A rational agent is an entity that has goals or preferences and tries to perform a series of actions that yield the best/optimal expected outcome given these goals.
The agent exists in an environment. Together, the agent and the environment constitute the world.
Types of Agents
A reflex agent selects its action based on only the current state of the world.
A planning agent maintains a model of the world and uses this model to simulate performing various actions.
Sometimes agents fail due to wrong world models.
Task Environment
The PEAS (Performance Measure, Environment, Actuators, Sensors) description is used for defining the task environment.
The performance measure describes what utility the agent tries to increase.
The environment summarizes where the agent acts and what affects the agent.
The actuators and the sensors are the methods with which the agent acts on the environment and receives information from it.
Types of Environment
We can characterize the types of environments in the following ways:
partially observable environments
fully observable environments
stochastic environments (uncertainty in the transition model)
deterministic environments
multi-agent environments
static environments
dynamic environments
environments with known physics
environments with unknown physics
State Space and Search Problems
A search problem consists of the following elements:
A state space - The set of all possible states that are possible in your given world
A set of actions available in each state
A transition model - Outputs the next state when a specific action is taken at current state
An action cost - Incurred when moving from one state to another after applying an action
A start state - The state in which an agent exists initially
A goal test - A function that takes a state as input, and determines whether it is a goal state
A search problem is solved by first considering the start state, then exploring the state space using the action and transition and cost methods, iteratively computing children of various states until we arrive at a goal state.
A world state contains all information about a given state, whereas a search state contains only the information about the world that’s necessary for planning.
State Space Graphs and Search Trees
A state space graph is constructed with states representing nodes, with directed edges existing from a state to its children. These edges represent actions, and any associated weights represent the cost of performing the corresponding action.
They can be noticeably large, but they represent states well. (All states occur only once)
Search trees encode the entire path (or plan) from the start state to the given state in the state space graph.
Since there often exist multiple ways to get from one state to another, states tend to show up multiple times in search trees.
Since these structures can be too large for computers, we can choose to store only states we’re immediately working with, and compute new ones on-demand.
Uninformed Search
The standard protocol for finding a plan from the start state to the goals is to maintain an outer frontier and continually expand it by replacing a node with its children.
Most implementations of such algorithms will encode information about the parent node, distance to node, and the state inside the node object. This procedure is known as tree search.
```
function TREE-SEARCH(problem, frontier) returns a solution or failure
    frontier ← INSERT(MAKE-NODE(INITIAL-STATE[problem]), frontier)
    while not IS-EMPTY(frontier) do
        node ← POP(frontier)
        if problem.IS-GOAL(node.STATE) then return node
        for each child-node in EXPAND(problem, node) do
            add child-node to frontier
    return failure
```
The EXPAND function appearing in the pseudocode returns all the possible nodes that can be reached from a given node by considering all available actions.
When we have no knowledge of the location of goal states in our search tree, we are forced to do uninformed search.
The completeness of each search strategy - if there exists a solution to the search problem, is the strategy guaranteed to find it given infinite computational resources?
The optimality of each search strategy - is the strategy guaranteed to find the lowest cost path to a goal state?
The **branching factor** b - The increase in the number of nodes on the frontier each time a frontier node is dequeued and replaced with its children is b. At depth k in the search tree, there exist b^k nodes.
The maximum depth m.
The depth of the shallowest solution s.
Depth-First Search
Depth-First Search (DFS) always selects the deepest frontier node.
The frontier is represented by a stack.
Depth-first search is not complete. If there exist cycles in the state space graph, this inevitably means that the corresponding search tree will be infinite in depth.
Depth-first search is not optimal.
Breadth-First Search
Breadth-first search (BFS) always selects the shallowest frontier node.
The frontier is represented by a queue.
If a solution exists, then the depth s of the shallowest solution must be finite, so BFS must eventually search this depth. Hence, it's complete.
BFS is generally not optimal.
Iterative Deepening
Iterative Deepening (ID) is basically BFS implemented with DFS. It repeatedly executes depth-first search with an increasing depth limit.
The frontier is represented by a stack.
Iterative deepening is complete. Because it exhaustively searches each depth level before increasing the limit, it avoids the infinite loops of DFS and will find a solution if one exists at a finite depth.
ID is generally not optimal.
If the cost is uniform, ID and BFS will be optimal.
```c
// On entry to the phase: edx = 14, esi = 0, edi = a (the user's input),
// i.e. the call is func4(a, 0, 14).
int func4(int a, int l, int r) {
    int mid = l + ((r - l) >> 1);
    if (mid <= a) {
        if (mid == a) {
            return 0;
        }
        l = mid + 1;
        return 2 * func4(a, l, r) + 1;
    } else {
        r = mid - 1;
        return 2 * func4(a, l, r);
    }
}
```
```c
/* Hmm...  Six phases must be more secure than one phase! */
input = read_line();             /* Get input                   */
phase_1(input);                  /* Run the phase               */
phase_defused();                 /* Drat!  They figured it out!
                                  * Let me know how they did it. */
printf("Phase 1 defused. How about the next one?\n");

/* The second phase is harder.  No one will ever figure out
 * how to defuse this... */
input = read_line();
phase_2(input);
phase_defused();
printf("That's number 2.  Keep going!\n");

/* I guess this is too easy so far.  Some more complex code will
 * confuse people. */
input = read_line();
phase_3(input);
phase_defused();
printf("Halfway there!\n");

/* Oh yeah?  Well, how good is your math?  Try on this saucy problem! */
input = read_line();
phase_4(input);
phase_defused();
printf("So you got that one.  Try this one.\n");

/* Round and 'round in memory we go, where we stop, the bomb blows! */
input = read_line();
phase_5(input);
phase_defused();
printf("Good work!  On to the next...\n");

/* This phase will never be used, since no one will get past the
 * earlier ones.  But just in case, make this one extra hard. */
input = read_line();
phase_6(input);
phase_defused();
```