
CS231n Lecture Note VI: CNN Architectures and Training

In this note, we will continue discussing CNNs.

Normalization Layers

Normalization layers standardize activations and then apply a learned scale and shift:

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y = \gamma \hat{x} + \beta$$

The difference between methods is which dimensions are used to compute $\mu$ and $\sigma$ (for input shape $(N, C, H, W)$):

  • BatchNorm: per channel, over $(N, H, W)$; depends on batch statistics
  • LayerNorm: per sample, over all features; independent of batch
  • InstanceNorm: per sample & channel, over $(H, W)$; removes instance contrast
  • GroupNorm: per sample, over channel groups + spatial dims; stable for small batches

Key idea: same formula, different normalization axes → different behavior and use cases.
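The four variants can be sketched in NumPy by changing only the axes over which statistics are computed (a sketch; the learned scale/shift is omitted for brevity, and the shapes and group count are illustrative):

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Standardize x over the given axes (gamma/beta scale-and-shift omitted)."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.randn(8, 32, 16, 16)   # (N, C, H, W)

bn = normalize(x, (0, 2, 3))         # BatchNorm: per channel, over (N, H, W)
ln = normalize(x, (1, 2, 3))         # LayerNorm: per sample, over (C, H, W)
inn = normalize(x, (2, 3))           # InstanceNorm: per sample & channel, over (H, W)

G = 4                                # GroupNorm: per sample, per group of channels
xg = x.reshape(8, G, 32 // G, 16, 16)
gn = normalize(xg, (2, 3, 4)).reshape(x.shape)
```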

Regularization: Dropout

In each forward pass, randomly set a subset of activations to zero with probability $p$ (a hyperparameter, often $0.5$). To keep the expected activation unchanged, the remaining activations are scaled:

$$\tilde{x} = \frac{m \odot x}{1 - p}, \quad m \sim \text{Bernoulli}(1 - p)$$

  • This forces the network to learn redundant representations, preventing co-adaptation of neurons
  • Acts like implicit ensemble averaging
  • Applied only during training; disabled at inference (no randomness)

With inverted dropout, no test-time scaling is needed because the activations were already rescaled during training.

  • Training: Add some kind of randomness
  • Testing: Average out randomness (sometimes approximate)
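A minimal NumPy sketch of inverted dropout (training-time scaling, identity at test time; shapes are illustrative):

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    """Inverted dropout: rescale at train time so no test-time scaling is needed."""
    if not train:
        return x                                     # inference: no randomness, no scaling
    mask = (np.random.rand(*x.shape) > p) / (1 - p)  # keep with prob 1-p, then rescale
    return x * mask

x = np.ones((1000, 100))
out_train = dropout_forward(x, p=0.5, train=True)   # E[out_train] == x
out_test = dropout_forward(x, p=0.5, train=False)   # identity at inference
```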

Activation Functions

Goal: Introduce non-linearity to the model.

  • Sigmoid:
    • Problem: the gradient vanishes where the function saturates, and this compounds across many layers.
  • ReLU (Rectified Linear Unit):
    • Does not saturate (in the positive region)
    • Converges faster than Sigmoid
    • Problem: not zero-centered; neurons can "die" (zero gradient) when $x < 0$.
  • GELU (Gaussian Error Linear Unit): $f(x) = x\Phi(x)$, where $\Phi$ is the standard normal CDF
    • Smoothness facilitates training
    • Problem: higher computational cost; large negative values get small gradients
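These activations can be sketched in NumPy; note that the GELU below uses the common tanh approximation rather than the exact normal CDF (an assumption for this sketch):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def gelu(x):
    # Tanh approximation of x * Phi(x); the exact form uses the normal CDF
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, 0.0, 2.0])
# For large positive x, GELU approaches x; for large negative x, it approaches 0
```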

Case Study

AlexNet, VGG

Small filters, deep network.

Among the earliest landmark deep CNNs (though not the very first CNNs; LeNet came earlier).

ResNet

Increasing the depth of a CNN does not always improve performance; in fact, very deep plain networks can exhibit the degradation problem, where training error increases as layers are added.

Although a deeper model should theoretically perform at least as well as a shallower one (since additional layers could learn identity mappings), in practice, standard layers struggle to approximate identity functions, making optimization difficult.

ResNet addresses this by reformulating the learning objective: instead of directly learning a mapping $H(x)$, each block learns a residual function $F(x) = H(x) - x$, so the output becomes $H(x) = F(x) + x$.

The shortcut (skip) connection that adds the input $x$ back to the learned residual improves gradient flow and makes it significantly easier to train very deep networks, especially when the desired transformation is close to identity.
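The residual formulation $H(x) = F(x) + x$ can be sketched with a toy NumPy block, where $F$ is a small two-layer network (shapes and the zero initialization are illustrative; with a zero residual, the block reduces exactly to the identity, which is what makes near-identity mappings easy to represent):

```python
import numpy as np

def residual_block(x, W1, W2):
    """Output H(x) = F(x) + x, where F is a two-layer ReLU network."""
    F = np.maximum(0, x.dot(W1)).dot(W2)  # learned residual F(x)
    return F + x                           # skip connection adds the input back

D = 16
x = np.random.randn(4, D)
W1, W2 = np.zeros((D, D)), np.zeros((D, D))  # zero residual -> exact identity
out = residual_block(x, W1, W2)
```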

Weight Initialization

If the initial weights are too small, activations shrink toward zero in deeper layers.

If they are too large, activations blow up quickly.

The fix: Kaiming (He) Initialization

ReLU correction: std = sqrt(2 / D_in)

```python
import numpy as np

dims = [4096] * 7
hs = []
x = np.random.randn(16, dims[0])

for Din, Dout in zip(dims[:-1], dims[1:]):
    W = np.random.randn(Din, Dout) * np.sqrt(2 / Din)  # Kaiming initialization
    x = np.maximum(0, x.dot(W))
    hs.append(x)
```

This makes the std almost constant. Activations are nicely scaled for all layers.

Data Preprocessing

Image Normalization: center and scale for each channel

  • Subtract the per-channel mean and divide by the per-channel std (almost all modern models do this); the statistics are computed along each channel, so just 3 numbers each for an RGB image
  • Requires precomputing the mean and std for each channel over your dataset

Data Augmentation

We can do flips, random crops and scales, color jitters, random cutouts to enhance our training data.
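Two of these augmentations sketched in NumPy (the crop size and flip probability are illustrative):

```python
import numpy as np

def random_flip(img, p=0.5):
    """Horizontal flip with probability p; img has shape (H, W, C)."""
    return img[:, ::-1, :] if np.random.rand() < p else img

def random_crop(img, crop):
    """Random spatial crop of size (crop, crop)."""
    H, W, _ = img.shape
    top = np.random.randint(0, H - crop + 1)
    left = np.random.randint(0, W - crop + 1)
    return img[top:top + crop, left:left + crop, :]

img = np.random.rand(32, 32, 3)
aug = random_crop(random_flip(img), crop=28)
```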

Transfer Learning

How do we train without much data?

We can start with a large dataset like the ImageNet and train a deep CNN, learning general features.

With a small dataset, we can use the pretrained model and freeze most layers. We replace and reinitialize the final layer to match our new classes. We train only this last layer (a linear classifier).

With a bigger dataset, we can start from the pretrained model and initialize all layers with the pretrained weights. We then finetune more (or all) layers.
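The freeze-and-retrain idea in miniature (a NumPy sketch: a "pretrained" hidden layer is kept frozen and only a newly initialized final linear layer is trained; all shapes, data, and the squared-error loss are toy assumptions, not the real recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
W_pre = rng.normal(size=(20, 8)) * np.sqrt(2 / 20)  # "pretrained" features (frozen)
W_new = np.zeros((8, 3))                            # reinitialized layer for 3 new classes

X = rng.normal(size=(64, 20))
y = rng.integers(0, 3, size=64)
Y = np.eye(3)[y]                                    # one-hot targets

feats = np.maximum(0, X @ W_pre)                    # frozen: no gradient reaches W_pre
for _ in range(200):                                # train only the last linear layer
    scores = feats @ W_new
    grad_scores = (scores - Y) / len(X)             # gradient of 0.5*mean squared error
    W_new -= 0.1 * feats.T @ grad_scores

loss = 0.5 * np.mean((feats @ W_new - Y) ** 2)
```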

Deep learning frameworks provide pretrained models that we can transfer from:

Hyperparameters

A structured approach to stabilizing and optimizing deep learning training:

Step 1: Check Initial Loss

Verify the loss starts at a reasonable value (e.g., $\log(\text{num\_classes})$ for classification). Large deviations typically indicate implementation issues such as incorrect labels or loss computation.

Step 2: Overfit a Small Sample

Train on a tiny subset (10–100 samples) and confirm the model can achieve near-zero loss. Failure here suggests problems with model capacity, optimization, or data preprocessing.

Step 3: Find a Working Learning Rate

Using the full dataset and a small weight decay (e.g., $10^{-4}$), test learning rates: $10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}$

Run ~100 iterations and select the rate that yields a clear, stable loss decrease.

Step 4: Coarse Hyperparameter Search

Explore a broad range of values (learning rate, weight decay, batch size, model size) using grid or random search. Train each configuration for ~1–5 epochs to identify promising regions.

Step 5: Refine the Search

Narrow the search around strong candidates and train longer. Optionally introduce schedules (e.g., cosine decay) or warmup.

Step 6: Analyze Curves

Inspect training vs. validation loss and accuracy:

  • Overfitting → increase regularization
  • Underfitting → increase capacity or training time
  • Instability → adjust learning rate or optimizer

Summary Insight

Most failures are due to basic setup issues (bad LR, incorrect loss, broken pipeline), not fine-grained hyperparameter choices. Systematic debugging is more effective than premature tuning.


CS231n Lecture Note V: Convolutional Neural Network Basics

When images get really large, we would have too many neurons to train in a regular neural network. The full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting. What’s more, regular neural networks ignore the spatial structures of the image, missing out on some important information.

As a result, a new type of neural network, the Convolutional Neural Network (CNN) was developed.

Feature Extraction

Extract features from raw pixels to feed the model with more information.

Examples include Color Histogram and Histogram of Oriented Gradients. Hand-engineered features (e.g., HOG, SIFT) are largely obsolete in modern deep learning pipelines.

These feature vectors are then stacked together into one input vector.

Traditional feature extraction is manually designed and fixed; CNNs learn hierarchical feature representations end-to-end.

Can data and compute solve problems better than human designers can?

Linear Classification Intuition

Matrix multiplication involves vector dot products. Dot products measure how strongly an input aligns with learned feature directions, which act like soft, continuous templates.

Architecture

ConvNets use convolution and pooling operators to extract features while respecting 2D image structures.

The input remains a 3D tensor (e.g. 32x32x3), preserving spatial structures.

The layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth.

The neurons in a layer will only be connected to a small region of the layer before it.

CNN Architecture

We use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks).

A simple ConvNet could have the architecture [INPUT - CONV - RELU - POOL - FC].

  • CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume.
  • POOL layer will perform a downsampling operation along the spatial dimensions (width, height).

ConvNets transform the original image layer by layer from the original pixel values to the final class scores.

The parameters in the CONV/FC layers will be trained with gradient descent.

Convolutional Layers

A convolutional layer performs a linear operation on the input using a set of learnable filters (also called kernels). Although commonly referred to as “convolution,” most deep learning libraries actually implement cross-correlation; the distinction is usually not important in practice.

The CONV layer’s parameters consist of a set of learnable filters. Each filter is spatially small but spans the full depth of the input.

Each filter typically has a single scalar bias, shared across all spatial positions in its activation map.

During the forward pass, each filter slides across the input volume. At every spatial location, the layer computes a dot product between the filter weights and the corresponding local region of the input, then adds a bias term. This produces a 2D activation map (feature map) for that filter.

A convolutional layer contains multiple filters. Each filter produces its own activation map, and these maps are stacked along the depth (channel) dimension to form the output volume. As a result, the number of filters determines the number of output channels.

Filters are initialized differently to break the symmetry.

For a convolution with filter size $K$, each element of the output depends on a $K \times K$ receptive field in the input. When you stack convolutional layers, the effective receptive field grows: each deeper layer aggregates information from a larger region of the original input.

Apart from 2D convolution, there are also 1D and 3D convolutions, among others.

Zero Padding

Without zero-padding, feature maps may shrink with each layer. So we add padding around the input before sliding the filter.

With input size $W \times W$, filter size $K \times K$, and padding size $P$, we get an output of size $W - K + 2P + 1$.

A common setting is $P = \frac{K-1}{2}$ (for odd $K$), which makes the output the same size as the input.

Stride

To increase effective receptive fields more quickly, we can skip positions by using a stride greater than 1.

With input size $W \times W$, filter size $K \times K$, stride $S$, and padding size $P$, we get an output of size $\frac{W - K + 2P}{S} + 1$.

In this way, we are downsampling the image and the receptive field grows exponentially.
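The output-size formula can be wrapped in a small helper (a sketch; integer division floors the result when the stride does not divide evenly):

```python
def conv_output_size(W, K, P=0, S=1):
    """Spatial output size for input W, filter K, padding P, stride S."""
    return (W - K + 2 * P) // S + 1

# 32x32 input, 3x3 filter, "same" padding P=(K-1)/2, stride 1 -> 32x32 output
# stride 2 halves the spatial resolution
```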

Pooling

Pooling layers downsample a feature map by summarizing local neighborhoods.

In max pooling, a small window, such as $2 \times 2$, slides across the input and outputs the largest value in each window. This keeps the strongest local activation and reduces spatial resolution.

In average pooling, the layer outputs the mean value in each window instead of the maximum.
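A NumPy sketch of $2 \times 2$ max pooling with stride 2 (assumes the height and width are divisible by 2):

```python
import numpy as np

def max_pool_2x2(x):
    """x: (H, W) feature map; returns the (H//2, W//2) max-pooled map."""
    H, W = x.shape
    windows = x.reshape(H // 2, 2, W // 2, 2)  # split into 2x2 windows
    return windows.max(axis=(1, 3))            # take the max inside each window

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [4, 5, 6, 7]], dtype=float)
# Each 2x2 window is reduced to its maximum value
```

Swapping `.max` for `.mean` gives average pooling.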

Translation Equivariance

CNNs use shared filters so the same local features can be detected at different positions in an image, while preserving information about where those features occur. This is called translation equivariance.


CS231n Lecture Note IV: Neural Networks and Backpropagation

Here we are introducing Neural Networks and Backpropagation.

Neural Networks

We will start with a 2-layer example:

$$f = W_2 \max(0, W_1 x)$$

$$x \in \mathbb{R}^D, \quad W_1 \in \mathbb{R}^{H \times D}, \quad W_2 \in \mathbb{R}^{C \times H}$$

The $\max$ here creates some non-linearity between $W_1$ and $W_2$. It is called the activation function.

In practice, we will usually add a learnable bias at each layer as well.

Here’s an example of a 3-layer one.

$$f = W_3 \max(0, W_2 \max(0, W_1 x))$$

$$x \in \mathbb{R}^D, \quad W_1 \in \mathbb{R}^{H_1 \times D}, \quad W_2 \in \mathbb{R}^{H_2 \times H_1}, \quad W_3 \in \mathbb{R}^{C \times H_2}$$

The networks we are talking about here are more accurately called fully-connected networks, or sometimes called multi-layer perceptrons (MLP).

Activation Functions

The most important characteristic of activation functions is to create some non-linearity.

There are multiple activation functions to choose from.

The function $\max(0, x)$ is called ReLU. It is a good default choice for most problems.

$$\text{ReLU}(x) = \max(0, x)$$

We also have:

  • Leaky ReLU: $\max(0.1x, x)$ — Allows a small gradient for negative inputs. Avoids dead neurons.
  • Sigmoid: $\sigma(x) = \frac{1}{1+e^{-x}}$ — S-shaped curve squashing output between 0 and 1.
  • Tanh: $\tanh(x)$ — S-shaped curve squashing output between -1 and 1.
  • ELU: Exponential Linear Unit.
  • GELU & SiLU: Newer functions often used in modern architectures like Transformers.

Choosing an activation function is very empirical.

Architecture

In a Neural Network, we have an input layer, an output layer, and several hidden layers.

Sample Code

```python
import numpy as np
from numpy.random import randn

# Describe the network
# N: batch size (number of inputs processed at once)
# D_in: input dimension (features)
# H: hidden layer dimension
# D_out: output dimension
N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)
w1, w2 = randn(D_in, H), randn(H, D_out)

for t in range(2000):
    # Forward pass
    h = 1 / (1 + np.exp(-x.dot(w1)))
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Compute the analytic gradients
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))

    # Gradient descent
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
```

The number of neurons in the hidden layers relates to the model's capacity: more neurons, more capacity.

A regularizer keeps the network simple and prevents overfitting. Do not use the size of the neural network as a regularizer; use stronger regularization instead.

Derivatives

$$f(x, y) = x + y \to \frac{\partial f}{\partial x} = 1, \quad \frac{\partial f}{\partial y} = 1$$

$$f(x, y) = \max(x, y) \to \frac{\partial f}{\partial x} = \mathbb{1}(x \geq y), \quad \frac{\partial f}{\partial y} = \mathbb{1}(y \geq x)$$

$$f(x, y) = xy \to \frac{\partial f}{\partial x} = y, \quad \frac{\partial f}{\partial y} = x$$

The derivative with respect to each variable tells you the sensitivity of the whole expression to that variable's value.

  • add gate: gradient distributor
  • mul gate: “swap multiplier”
  • copy gate: gradient adder
  • max gate: gradient router

Backpropagation

The problem now is to calculate the gradients in order to learn the parameters.

Obviously we can’t simply derive it on paper, for it can be very tedious and not feasible for complex models.

Here’s the better idea: Computational graphs + Backpropagation.

Backpropagation is essentially the repeated application of the chain rule.

$$\text{Downstream} = \text{Upstream} \times \text{Local}$$

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x}$$

```
================================================================================
COMPUTATIONAL GRAPH NODE: BACKPROPAGATION
================================================================================

FORWARD PASS
Computes output z from inputs x, y

    [ Input x ] ----\
                     v
                 +--------+
                 | Node f | ------> [ Output z ]
                 +--------+
                     ^
    [ Input y ] ----/

================================================================================

BACKWARD PASS
Chain Rule: Downstream = Upstream * Local

    Upstream gradient (received): dL/dz

    Downstream gradient to x:  dL/dx = dL/dz * dz/dx   (local gradient dz/dx)
    Downstream gradient to y:  dL/dy = dL/dz * dz/dy   (local gradient dz/dy)
```

Modularize: Forward/Backward API

Gate / Node / Function object: Actual PyTorch code

```python
import torch

class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)  # Cache values for use in backward
        z = x * y
        return z

    @staticmethod
    def backward(ctx, grad_z):  # grad_z is the upstream gradient
        x, y = ctx.saved_tensors
        grad_x = y * grad_z  # dz/dx * dL/dz <-- multiply local and upstream gradients
        grad_y = x * grad_z  # dz/dy * dL/dz
        return grad_x, grad_y
```

Aside: Vector Derivatives

For vector derivatives:

  1. Vector to Scalar

$x \in \mathbb{R}^N, \quad y \in \mathbb{R}$

The derivative is a gradient.

$$\frac{\partial y}{\partial x} \in \mathbb{R}^N, \quad \left(\frac{\partial y}{\partial x}\right)_n = \frac{\partial y}{\partial x_n}$$

  2. Vector to Vector

$x \in \mathbb{R}^N, \quad y \in \mathbb{R}^M$

The derivative is a Jacobian Matrix.

$$\frac{\partial y}{\partial x} \in \mathbb{R}^{N \times M}, \quad \left(\frac{\partial y}{\partial x}\right)_{n,m} = \frac{\partial y_m}{\partial x_n}$$

Aside: Jacobian Matrix

A Jacobian Matrix is the derivative for a function that maps a Vector to a Vector.

Definition

Let $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ be a function such that each of its first-order partial derivatives exists on $\mathbb{R}^n$. This function takes a point $\mathbf{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$ as input and produces the vector $\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_m(\mathbf{x})) \in \mathbb{R}^m$ as output.

Then the Jacobian matrix of $\mathbf{f}$, denoted $\mathbf{J}_{\mathbf{f}}$, is the $m \times n$ matrix whose $(i, j)$ entry is $\frac{\partial f_i}{\partial x_j}$; explicitly:

$$\mathbf{J}_{\mathbf{f}} = \begin{bmatrix} \frac{\partial \mathbf{f}}{\partial x_1} & \cdots & \frac{\partial \mathbf{f}}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \nabla^T f_1 \\ \vdots \\ \nabla^T f_m \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

where $\nabla^T f_i$ is the transpose (row vector) of the gradient of the $i$-th component.

Backpropagation with Vectors

```
================================================================================
COMPUTATIONAL GRAPH NODE: BACKPROP WITH VECTORS
================================================================================

FORWARD PASS
Computes output vector z (shape Dz) from input vectors x (shape Dx), y (shape Dy)

    [ Vector x ] ----\
      Shape: Dx       v
                  +--------+
                  | Node f | ------> [ Vector z ]
                  +--------+           Shape: Dz
                      ^
    [ Vector y ] ----/
      Shape: Dy

================================================================================

BACKWARD PASS
Matrix-Vector Multiply: Downstream = Local * Upstream

    Upstream gradient: dL/dz, shape Dz

    Downstream gradient to x:  dL/dx = (dz/dx) * (dL/dz), shape Dx
                               (Jacobian dz/dx has shape [Dx x Dz])
    Downstream gradient to y:  dL/dy = (dz/dy) * (dL/dz), shape Dy
                               (Jacobian dz/dy has shape [Dy x Dz])
```

Jacobians can be sparse. For element-wise operations, the off-diagonal elements are always zero; the Jacobian becomes a diagonal matrix.
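For example, the Jacobian of element-wise ReLU is diagonal, so the backward pass reduces to an element-wise multiply rather than a full matrix-vector product (a small NumPy sketch):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
upstream = np.array([0.1, 0.2, 0.3])          # dL/dz

# Full Jacobian of z = relu(x): diagonal, with (dz/dx)_ii = 1 if x_i > 0 else 0
J = np.diag((x > 0).astype(float))
full = J @ upstream                            # explicit matrix-vector multiply

shortcut = upstream * (x > 0)                  # exploit sparsity: element-wise multiply
```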

Four Fundamental Equations of Backpropagation

These allow us to propagate error ($\delta$) from the output backward through the network. We get them from simple chain rules.

Notation:

  • $\odot$: Hadamard (element-wise) product
  • $\sigma'$: Derivative of the activation function
  • $\delta^l$: Error term for layer $l$

(BP1) Error at the Output Layer

$$\delta^L = \nabla_a C \odot \sigma'(z^L)$$

Interpretation: How much the output layer “missed” the target, scaled by the activation’s sensitivity.

(BP2) Propagating Error to Hidden Layers

$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$$

Interpretation: The error at layer $l$ is the weighted sum of errors from the next layer ($l+1$), pulled backward through the transpose of the weights.

(BP3) Gradient for Biases

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j$$

Interpretation: The gradient for a bias is exactly equal to the error at that neuron.

(BP4) Gradient for Weights

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$$

Interpretation: The gradient for a weight is the product of the input activation ($a^{l-1}$) and the output error ($\delta^l$).
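The four equations, sketched for one step on a toy two-layer sigmoid network (NumPy; the shapes and data are illustrative, and the cost is assumed to be $\frac{1}{2}\|a - y\|^2$ so that $\nabla_a C = a - y$):

```python
import numpy as np

def sigma(z):  return 1 / (1 + np.exp(-z))
def dsigma(z): return sigma(z) * (1 - sigma(z))

rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 1)), rng.normal(size=(2, 1))
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3, 1))
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 1))

# Forward pass
z1 = W1 @ x + b1;  a1 = sigma(z1)
z2 = W2 @ a1 + b2; a2 = sigma(z2)

delta2 = (a2 - y) * dsigma(z2)          # BP1: error at the output layer
delta1 = (W2.T @ delta2) * dsigma(z1)   # BP2: propagate error backward
grad_b2, grad_b1 = delta2, delta1       # BP3: bias gradients equal the errors
grad_W2 = delta2 @ a1.T                 # BP4: weight gradients
grad_W1 = delta1 @ x.T
```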


CS231n Lecture Note III: Optimization

With the score function and the loss function, now we focus on how we minimize the loss. Optimization is the process of finding the set of parameters $W$ that minimize the loss function.

Strategy #1: A first very bad idea solution: Random search

Core idea: iterative refinement

Strategy #2: Follow the slope

In 1-dimension, the derivative of a function:

$$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

In multiple dimensions, the gradient is the vector of partial derivatives along each dimension, $\nabla_W L$.

The slope in any direction is the dot product of that direction with the gradient. The direction of steepest descent is the negative gradient.

Numerical Gradient

We can compute an approximate numerical gradient. It is often sufficient to use a very small step size $h$ (such as 1e-5).

This is easy to write, but might be slow.

Analytic Gradient

We can also use calculus to get the exact value of gradients.

Practical considerations

Always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check.

It often works better to compute the numeric gradient using the centered difference formula: $\frac{f(x+h) - f(x-h)}{2h}$.
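A minimal centered-difference gradient checker (a sketch; the test function and step size are illustrative):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered differences: [f(x+h) - f(x-h)] / 2h along each dimension."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        grad.flat[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda x: np.sum(x ** 2)     # analytic gradient is 2x, for comparison
x = np.array([1.0, -2.0, 3.0])
num = numerical_gradient(f, x)
```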

Gradient Descent

```python
# Vanilla gradient descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad  # perform parameter update
```

But in large-scale applications, the training data can have on the order of millions of examples. It seems wasteful to compute the full loss function over the entire training set in order to perform only a single parameter update.

Stochastic Gradient Descent (SGD)

A very common approach to addressing this challenge is to compute the gradient over batches of the training data, which is called the Mini-batch Gradient Descent.

The gradient from a mini-batch is a good approximation of the gradient of the full objective. Therefore, much faster convergence can be achieved in practice by evaluating the mini-batch gradients to perform more frequent parameter updates.

```python
# Vanilla minibatch gradient descent
while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad  # perform parameter update
```

The extreme case of this is a setting where the mini-batch contains only a single example. This process is called Stochastic Gradient Descent (SGD).

The size of the mini-batch is a hyperparameter. It is usually based on memory constraints (if any), or set to some value, e.g. 32, 64 or 128. We use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2.

Problems with SGD

  1. Jittering
  2. Local minima or saddle points
  3. Gradients from mini-batches might be too noisy.

SGD + Momentum

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations.

$$v_{t+1} = \rho v_t + \nabla f(x_t)$$

$$x_{t+1} = x_t - \alpha v_{t+1}$$

```python
vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx
    x -= learning_rate * vx
```

  • $v$: Velocity (accumulates gradient directions)
  • $\rho$ (rho): Momentum coefficient (friction), typically 0.9 or 0.99.
  • $\alpha$ (alpha): Learning rate.
  • $\nabla f(x)$: Gradient.

Aside: Nesterov Momentum

Standard momentum calculates the gradient first, then adds the previous velocity to it.

Nesterov momentum instead first applies the previous velocity to the parameters, then calculates the gradient at that look-ahead point. Because it looks ahead, it is more stable.

$$v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t)$$

$$x_{t+1} = x_t + v_{t+1}$$

We can let $\tilde{x}_t = x_t + \rho v_t$ to make it look nicer. Rearranging:

$$v_{t+1} = \rho v_t - \alpha \nabla f(\tilde{x}_t)$$

$$\tilde{x}_{t+1} = \tilde{x}_t - \rho v_t + (1 + \rho)v_{t+1} = \tilde{x}_t + v_{t+1} + \rho(v_{t+1} - v_t)$$

“Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction
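The rearranged update can be sketched in NumPy on a simple quadratic (the learning rate, $\rho$, and objective are illustrative assumptions):

```python
import numpy as np

def grad(x):           # gradient of f(x) = 0.5 * ||x||^2
    return x

x = np.array([10.0, -10.0])   # plays the role of x~ in the rearranged form
v = np.zeros_like(x)
rho, lr = 0.9, 0.1

for _ in range(200):
    old_v = v
    v = rho * v - lr * grad(x)          # gradient evaluated at the look-ahead variable
    x = x + v + rho * (v - old_v)       # x~_{t+1} = x~_t + v_{t+1} + rho*(v_{t+1} - v_t)
```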

RMSProp

RMSProp (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm designed to improve the performance and speed of training deep learning models.

It adds element-wise scaling of the gradient based on historical sums of squares in each dimension (with decay).

Mathematical Formulation:

  1. Compute Gradient

$$g_t = \nabla_\theta J(\theta_t)$$

  2. Update Moving Average of Squared Gradients

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma)g_t^2$$

  3. Update Parameters

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t$$

where:

  • Learning Rate (η): Controls the step size during parameter updates.
  • Decay Rate (γ): Determines how quickly the moving average of squared gradients decays.
  • Epsilon (ϵ): A small constant (e.g., 1e-8) added to the denominator to prevent division by zero and ensure numerical stability.

```python
grad_squared = 0
while True:
    dx = compute_gradient(x)
    # Update moving average of the squared gradients
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    # Update weights
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```

Aside: AdaGrad

```python
grad_squared = 0
while True:
    dx = compute_gradient(x)
    # Difference HERE: accumulate the sum of squared gradients (no decay)
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```

RMSProp is basically AdaGrad with decay.

Adam & AdamW

Adam (Adaptive Moment Estimation) optimizer combines the advantages of Momentum and RMSprop techniques to adjust learning rates during training.

```python
first_moment = 0
second_moment = 0
for t in range(1, num_iterations):
    dx = compute_gradient(x)
    # Momentum
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    # RMSProp
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    # Bias correction
    first_unbias = first_moment / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    # RMSProp-style update
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
```

However, since the first and second moment estimates start at zero, the initial steps could be gigantic. Bias correction prevents this early-stage instability.

In standard Adam, the L2 regularization term is added to the gradient before the moment estimates are computed.

In AdamW (Adam with decoupled weight decay), the weight-decay term is instead applied directly to the parameters, separately from the moment-based update.

Learning Rate Schedules

The learning rate is a hyperparameter. We may make it decay over time.

  1. Reduce the learning rate at a few fixed points.
  2. Cosine decay
  3. Linear decay
  4. Inverse sqrt decay
  5. etc.

Empirical rule of thumb: If you increase the batch size by N, also scale the initial learning rate by N.

Second-Order Optimization

We can use gradient and Hessian to form a quadratic approximation, then step to its minima.

Second-Order Taylor Series:

$$J(\boldsymbol{\theta}) \approx J(\boldsymbol{\theta}_0) + (\boldsymbol{\theta} - \boldsymbol{\theta}_0)^\top \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_0) + \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}_0)^\top \mathbf{H} (\boldsymbol{\theta} - \boldsymbol{\theta}_0)$$

Solving for the critical point we obtain the Newton parameter update:

$$\boldsymbol{\theta}^* = \boldsymbol{\theta}_0 - \mathbf{H}^{-1} \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_0)$$

This is impractical for deep learning because of the computation and memory required to form and invert the Hessian.

  • Quasi-Newton methods (BFGS)
  • L-BFGS (Limited-Memory BFGS): Does not store the full inverse Hessian.

Aside: Hessian Matrices

The Hessian matrix (or simply the Hessian), denoted $\mathbf{H}$, represents the second-order partial derivatives of a function. While the gradient ($\nabla$) tells you the slope (direction of steepest descent), the Hessian tells you the curvature of the loss function surface.

It is a square matrix of second-order partial derivatives of a scalar-valued function, $f: \mathbb{R}^n \to \mathbb{R}$. It describes the local curvature of a function of many variables.

$$\mathbf{H}(f) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

$$\mathbf{H}_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

Hessian Matrices are symmetric.


CS231n Lecture Note II: Linear Classifiers

With the disadvantages of the KNN algorithm, we need to come up with a more powerful approach. The new approach will have two major components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels.

Score Function

The score function maps the pixel values of an image to confidence scores for each class.

As before, let’s assume a training dataset of images $\mathbf{x}_i \in \mathbb{R}^D$, each associated with a label $y_i$. Here $i = 1 \dots N$ and $y_i \in 1 \dots K$.

That is, we have $N$ examples (each with a dimensionality $D$) and $K$ distinct categories.

We will define the score function $f : \mathbb{R}^D \mapsto \mathbb{R}^K$ that maps the raw image pixels to class scores.

Linear Classifier

We will start out with arguably the simplest possible function, a linear mapping.

$$f(\mathbf{x}_i, \mathbf{W}, \mathbf{b}) = \mathbf{W}\mathbf{x}_i + \mathbf{b}$$

In the above equation, we are assuming that the image $\mathbf{x}_i$ has all of its pixels flattened out to a single column vector of shape [D x 1]. The matrix $\mathbf{W}$ (of size [K x D]) and the vector $\mathbf{b}$ (of size [K x 1]) are the parameters of the function.

The parameters in $\mathbf{W}$ are often called the weights, and $\mathbf{b}$ is called the bias vector because it influences the output scores without interacting with the actual data $\mathbf{x}_i$.

The input data are given and fixed. The goal is to set W,b\mathbf{W,b} in such a way that the computed scores match the ground truth labels across the whole training set.
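To make the shapes concrete, here is a minimal numpy sketch (D = 4 and K = 3 are made-up toy sizes, not real image dimensions):

```python
import numpy as np

np.random.seed(0)
D, K = 4, 3                       # toy input dimension and number of classes

x_i = np.random.randn(D)          # one flattened "image", shape [D]
W = np.random.randn(K, D)         # weights, shape [K x D]
b = np.random.randn(K)            # bias vector, shape [K]

scores = W.dot(x_i) + b           # one score per class, shape [K]
predicted_class = np.argmax(scores)
```

Each row of W acts as a template for one class; that class's score is the (biased) dot product of the row with the image.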

Bias Tricks

We can combine the two sets of parameters into a single matrix that holds both of them by extending the vector xi\mathbf{x}_i with one additional dimension that always holds the constant 1\mathbf{1} - a default bias dimension.

f(xi,W)=Wxif(\mathbf{x}_i, \mathbf{W}) = \mathbf{W}\mathbf{x}_i

Bias Tricks

Image Data Preprocessing

In Machine Learning, it is a very common practice to always perform normalization of input features. In particular, it is important to center your data by subtracting the mean from every feature.

Loss Function

We will develop Multiclass Support Vector Machine (SVM) loss.

The score function takes the pixels and computes the vector f(xi,W)f(\mathbf{x}_i, \mathbf{W}) of class scores, which we will abbreviate to s\mathbf{s} (short for scores).

The Multiclass SVM loss for the i-th example is then formalized as follows:

Li=jyimax(0,sjsyi+Δ)L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)

The loss accumulates, over all incorrect classes, the amount by which each incorrect score fails to stay at least Δ\Delta below the correct score.

In summary, the SVM loss function wants the score of the correct class yiy_i to be larger than the incorrect class scores by at least Δ\Delta (delta).

The threshold at zero max(0, -) function is often called the hinge loss. We also have squared hinge loss SVM (or L2-SVM), which uses the form max(0, -)² that penalizes violated margins more strongly.

The Multiclass Support Vector Machine "wants" the score of the correct class to be higher than all other scores by at least a margin of delta.
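A minimal unvectorized sketch of LiL_i in numpy (with Δ=1\Delta = 1 and made-up scores):

```python
import numpy as np

def svm_loss_i(scores, y_i, delta=1.0):
    """Multiclass SVM (hinge) loss for one example, given its class scores."""
    margins = np.maximum(0, scores - scores[y_i] + delta)
    margins[y_i] = 0              # the correct class contributes no loss
    return np.sum(margins)

scores = np.array([3.2, 5.1, -1.7])   # hypothetical scores for 3 classes
loss = svm_loss_i(scores, y_i=0)      # only class 1 violates the margin here
```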

Regularization

We wish to encode some preference for a certain set of weights W over others to remove this ambiguity. We can do so by extending the loss function with a regularization penalty R(W)R(\mathbf{W}). The most common regularization penalty is the squared L2 norm that discourages large weights through an elementwise quadratic penalty over all parameters:

R(W)=klWk,l2R(\mathbf{W}) = \sum_k \sum_l W_{k,l}^2

Including the regularization penalty completes the full Multiclass Support Vector Machine loss, which is made up of two components: the data loss (which is the average loss LiL_i over all examples) and the regularization loss.

L=1NiLidata loss+λR(W)regularization lossL = \underbrace{\frac{1}{N} \sum_i L_i}_{\text{data loss}} + \underbrace{\lambda R(\mathbf{W})}_{\text{regularization loss}}

Or in full form:

L=1Nijyi[max(0,f(xi;W)jf(xi;W)yi+Δ)]+λklWk,l2L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \left[ \max(0, f(\mathbf{x}_i; \mathbf{W})_j - f(\mathbf{x}_i; \mathbf{W})_{y_i} + \Delta) \right] + \lambda \sum_k \sum_l W_{k,l}^2

Penalizing large weights tends to improve generalization, because it means that no input dimension can have a very large influence on the scores all by itself. It keeps the weights small and simple. This can improve the generalization performance of the classifiers on test images and lead to less overfitting.

It prevents the model from doing too well on the training data.

Note that due to the regularization penalty we can never achieve loss of exactly 0.0 on all examples.

Practical Considerations

Setting Delta: It turns out that this hyperparameter can safely be set to Δ=1.0\Delta = 1.0 in all cases. (The exact value of the margin between the scores is in some sense meaningless because the weights can shrink or stretch the differences arbitrarily.)

Binary Support Vector Machine: The loss for the i-th example can be written as

Li=Cmax(0,1yiwTxi)+R(W)L_i = C \max(0, 1 - y_i \mathbf{w}^T \mathbf{x}_i) + R(\mathbf{W})

C\mathbf{C} in this formulation and λ\lambda in our formulation control the same tradeoff and are related through reciprocal relation C1λC \propto \frac{1}{\lambda}.

Other Multiclass SVM formulations: The Multiclass SVM presented in this section is one of several ways of formulating the SVM over multiple classes.

Another commonly used form is the One-Vs-All (OVA) SVM which trains an independent binary SVM for each class vs. all other classes. Related, but less common to see in practice is also the All-vs-All (AVA) strategy.

The last formulation you may see is a Structured SVM, which maximizes the margin between the score of the correct class and the score of the highest-scoring incorrect runner-up class.

Softmax Classifier

In the Softmax Classifier, we now interpret these scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss that has the form:

Li=log(efyijefj)or equivalentlyLi=fyi+logjefjL_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) \hspace{1cm} \text{or equivalently} \hspace{1cm} L_i = -f_{y_i} + \log\sum_j e^{f_j}

where we are using the notation fjf_j to mean the j-th element of the vector of class scores f\mathbf{f}.

The function fj(z)=ezjkezkf_j(\mathbf{z}) = \frac{e^{z_j}}{\sum_k e^{z_k}} is called the softmax function: It takes a vector of arbitrary real-valued scores (in z\mathbf{z}) and squashes it to a vector of values between zero and one that sum to one.
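A minimal numpy sketch of the resulting loss for one example (the scores are made up; the max subtraction anticipates the numeric-stability trick discussed later in this note):

```python
import numpy as np

def softmax_loss_i(f, y_i):
    """Cross-entropy loss -log P(y_i | x_i) for one score vector f."""
    f = f - np.max(f)                          # shift for numeric stability
    log_probs = f - np.log(np.sum(np.exp(f)))  # log softmax
    return -log_probs[y_i]

f = np.array([3.2, 5.1, -1.7])
loss = softmax_loss_i(f, y_i=0)
```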

Information Theory View

The cross-entropy between a “true” distribution p\mathbf{p} and an estimated distribution q\mathbf{q} is defined as:

H(p,q)=xp(x)logq(x)H(\mathbf{p}, \mathbf{q}) = -\sum_x p(x) \log q(x)

Minimizing Cross-Entropy is equivalent to minimizing the KL Divergence.

H(p,q)=H(p)+DKL(pq)H(p, q) = H(p) + D_{KL}(p||q)

Because the true distribution pp is a fixed one-hot vector (so its entropy H(p)H(p) is zero in this scenario), minimizing cross-entropy is the same as forcing the predicted distribution qq to look exactly like the true distribution pp.

The Softmax Loss objective is to force the neural network to output a probability distribution where the correct class has a probability very close to 1.0, and all other classes are close to 0.0.

Information Theory Supplementary

Information Entropy measures the uncertainty or unpredictability of a random variable.

The more unpredictable an event is (lower probability), the more information is gained when it occurs, and the higher the entropy. Conversely, if an event has a probability of 1 (certainty), its entropy is 0.

For a discrete random variable XX with possible outcomes {x1,...,xn}\{x_1, ..., x_n\} and probabilities P(xi)P(x_i), the entropy H(X)H(X) is defined as:

H(X)=i=1nP(xi)logP(xi)H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)

Cross Entropy measures the total cost of using distribution qq to represent distribution pp.

Minimizing the cross-entropy H(p,q)H(p, q) is mathematically equivalent to minimizing the KL Divergence. It forces the predicted distribution qq to become as close as possible to the true distribution pp.
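These identities are easy to check numerically; a small sketch with a made-up one-hot pp and predicted qq:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                       # 0 * log(0) is taken as 0
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([1.0, 0.0, 0.0])   # one-hot "true" label: H(p) = 0
q = np.array([0.7, 0.2, 0.1])   # model's predicted distribution

# identity: H(p, q) = H(p) + KL(p || q)
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl(p, q)
```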

Probabilistic View

P(yixi;W)=efyijefjP(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}

The formula maps raw scores to a range of (0,1)(0, 1) such that the sum of all class probabilities equals 1.

Using the Cross-Entropy loss function during training is equivalent to maximizing the likelihood of the correct class.

Numeric Stability

Dividing large numbers can be numerically unstable, so it is important to use a normalization trick.

efyijefj=CefyiCjefj=efyi+logCjefj+logC\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}

A common choice for CC is to set logC=maxjfj\log C = -\max_j f_j. This simply states that we should shift the values inside the vector ff so that the highest value is zero. In code:

f = np.array([123, 456, 789]) # example with 3 classes and each having large scores
p = np.exp(f) / np.sum(np.exp(f)) # Bad: Numeric problem, potential blowup

# instead: first shift the values of f so that the highest number is 0:
f -= np.max(f) # f becomes [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer

Note

To be precise, the SVM classifier uses the hinge loss, or also sometimes called the max-margin loss.

The Softmax classifier uses the cross-entropy loss.

SVM vs Softmax

SVM vs Softmax

The SVM interprets these as class scores and its loss function encourages the correct class to have a score higher by a margin than the other class scores.

The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high (equivalently the negative of it to be low).

Softmax classifier provides “probabilities” for each class.

The “probabilities” are dependent on the regularization strength. They are better thought of as confidences where the ordering of the scores is interpretable.

In practice, SVM and Softmax are usually comparable. Compared to the Softmax classifier, the SVM is a more local objective.

The Softmax classifier is never fully happy with the scores it produces: the correct class could always have a higher probability and the incorrect classes always a lower probability and the loss would always get better.

However, the SVM is happy once the margins are satisfied and it does not micromanage the exact scores beyond this constraint.

CS231n Lecture Note I: Image Classification

Image Classification

Task: assigning an input image one label from a fixed set of categories

Images are defined as tensors of integers between [0,255], e.g. 800 x 600 x 3

Challenges:

  • Viewpoint variation
  • Scale variation
  • Deformation
  • Occlusion
  • Illumination conditions
  • Background clutter
  • Intra-class variation

A good image classification model must be invariant to the cross product of all these variations, while simultaneously retaining sensitivity to the inter-class variations.

The data-driven approach: first accumulate a training dataset of labeled images, then develop learning algorithms that learn from it.

The image classification pipeline:

  1. Input: Input a set of N images, each labeled with one of K different classes.
  2. Learn: use the training set to learn what every one of the classes looks like. This step is referred to as training a classifier, or learning a model.
  3. Evaluate: evaluate the quality of the classifier by asking it to predict labels for a new set of images that it has never seen before.

Nearest Neighbor Classifier

L1 Distance (Manhattan Distance)

L1 Distance is the sum of the absolute values of the differences between corresponding dimensions of two vectors. The calculation formula is:

L1(X,Y)=i=1nxiyi=x1y1+x2y2+...+xnynL1(X, Y) = \sum_{i=1}^{n} |x_i - y_i| = |x_1 - y_1| + |x_2 - y_2| + ... + |x_n - y_n|

L2 Distance (Euclidean Distance)

L2 Distance is the square root of the sum of the squared differences between corresponding dimensions of two vectors. This is what we commonly refer to as the straight-line distance between two points. The calculation formula is:

L2(X,Y)=i=1n(xiyi)2=(x1y1)2+(x2y2)2+...+(xnyn)2L2(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_n - y_n)^2}

Note on L2: Squaring amplifies values, thereby magnifying the influence of outliers.
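A quick numpy sketch of both distances on two made-up 3-dimensional vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

l1 = np.sum(np.abs(x - y))          # |1-4| + |2-0| + |3-3| = 5
l2 = np.sqrt(np.sum((x - y) ** 2))  # sqrt(9 + 4 + 0) = sqrt(13)
```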

Evaluation

For evaluation, we use accuracy, which measures the fraction of predictions that were correct.

k-Nearest Neighbor Classifier

The idea: we will find the top k closest images, and have them vote on the label of the test image.

Hyperparameters

It’s often not obvious what values/settings one should choose for hyperparameters.

We cannot use the test set for the purpose of tweaking hyperparameters.

-> Split your training set into a training set and a validation set. Use the validation set to tune all hyperparameters. At the end, run a single time on the test set and report performance.

Cross-validation

Cross-validation: iterating over different validation sets and averaging the performance across these.

Cross-validation
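A minimal sketch of how the folds could be produced (the 5-fold split and the dataset size of 100 are illustrative; the training/evaluation step is left as a comment since any classifier fits there):

```python
import numpy as np

def kfold_indices(n, num_folds=5):
    """Split indices 0..n-1 into num_folds roughly equal folds."""
    return np.array_split(np.arange(n), num_folds)

folds = kfold_indices(100, num_folds=5)
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train on train_idx, evaluate on val_idx, then average the accuracies
```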

CS231n Lecture Note IV: Neural Networks and Backpropagation

Here we are introducing Neural Networks and Backpropagation.

Neural Networks

We will start with a 2-layer example:

f=W2max(0,W1x)f = W_2 \max(0, W_1 x)

xRD,W1RH×D,W2RC×Hx \in \mathbb{R}^D, W_1 \in \mathbb{R}^{H \times D}, W_2 \in \mathbb{R}^{C \times H}

The maxmax here creates some non-linearity between W1W1 and W2W2. It is called the activation function.

In practice, we will usually add a learnable bias at each layer as well.

Here’s an example of a 3-layer one.

f=W3max(0,W2max(0,W1x))f = W_3 \max(0, W_2 \max(0, W_1 x))

xRD,W1RH1×D,W2RH2×H1,W3RC×H2x \in \mathbb{R}^D, W_1 \in \mathbb{R}^{H_1 \times D}, W_2 \in \mathbb{R}^{H_2 \times H_1}, W_3 \in \mathbb{R}^{C \times H_2}

The networks we are talking about here are more accurately called fully-connected networks, or sometimes called multi-layer perceptrons (MLP).

Activation Functions

The most important characteristic of activation functions is to create some non-linearity.

There are multiple activation functions to choose from.

The function max(0,x)max(0,x) is called ReLU. It is a good default choice for most problems.

ReLU(x)=max(0,x)\text{ReLU}(x) = \max(0, x)

We also have:

  • Leaky ReLU: max(0.1x,x)\max(0.1x, x) — Allows a small gradient for negative inputs. Avoids dead neurons.
  • Sigmoid: σ(x)=11+ex\sigma(x) = \frac{1}{1+e^{-x}} — S-shaped curve squashing output between 0 and 1.
  • Tanh: tanh(x)\tanh(x) — S-shaped curve squashing output between -1 and 1.
  • ELU: Exponential Linear Unit.
  • GELU & SiLU: Newer functions often used in modern architectures like Transformers.

Choosing an activation function is very empirical.
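For reference, most of the functions above are one-liners in numpy (ELU/GELU omitted; the 0.1 slope for Leaky ReLU follows the definition above):

```python
import numpy as np

def relu(x):       return np.maximum(0, x)
def leaky_relu(x): return np.maximum(0.1 * x, x)
def sigmoid(x):    return 1 / (1 + np.exp(-x))
def tanh(x):       return np.tanh(x)

x = np.array([-2.0, 0.0, 3.0])
out = relu(x)          # negative inputs are clamped to zero
```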

Architecture

In a Neural Network, we have an input layer, an output layer, and several hidden layers.

Sample Code

import numpy as np
from numpy.random import randn

# Describe the network
# N: batch size (number of inputs processed at once)
# D_in: input dimension (features)
# H: hidden layer dimension
# D_out: output dimension
N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)
w1, w2 = randn(D_in, H), randn(H, D_out)

for t in range(2000):
    # Forward pass
    h = 1 / (1 + np.exp(-x.dot(w1)))
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Calculate the analytical gradients
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))

    # Gradient descent
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2

The number of neurons in the hidden layers relates to the model's capacity: more neurons, more capacity.

A regularizer keeps the network simple and prevents overfitting. Do not use size of neural network as a regularizer. Use stronger regularization instead.

Derivatives

f(x,y)=x+yfx=1,fy=1f(x, y) = x + y \to \frac{\partial f}{\partial x} = 1, \frac{\partial f}{\partial y} = 1

f(x,y)=max(x,y)fx=1(xy),fy=1(yx)f(x, y) = \max(x, y) \to \frac{\partial f}{\partial x} = \mathbb{1}(x \geq y), \frac{\partial f}{\partial y} = \mathbb{1}(y \geq x)

f(x,y)=xyfx=y,fy=xf(x, y) = xy \to \frac{\partial f}{\partial x} = y, \frac{\partial f}{\partial y} = x

The derivative on each variable tells you the sensitivity of the whole expression on its value.

  • add gate: gradient distributor
  • mul gate: “swap multiplier”
  • copy gate: gradient adder
  • max gate: gradient router
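These gate behaviors can be traced by hand on a tiny made-up graph, f=(x+y)zf = (x + y)z:

```python
# Toy computational graph: f = (x + y) * z
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y              # add gate
f = q * z              # mul gate

# backward pass, starting from df/df = 1
df_dq = z              # mul gate: "swap multiplier" (gradient is the other input)
df_dz = q
df_dx = df_dq * 1.0    # add gate: distributes the upstream gradient unchanged
df_dy = df_dq * 1.0
```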

Backpropagation

The problem now is to calculate the gradients in order to learn the parameters.

Obviously we can’t simply derive it on paper, for it can be very tedious and not feasible for complex models.

Here’s the better idea: Computational graphs + Backpropagation.

Backpropagation is generally chain rules.

Downstream=Upstream×Local\text{Downstream} = \text{Upstream} \times \text{Local}

Lx=Lzzx\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x}

================================================================================
COMPUTATIONAL GRAPH NODE: BACKPROPAGATION
================================================================================

FORWARD PASS
Computes output z from inputs x, y

[ Input x ] ----\
                 \
                  [ Node f ] ----> [ Output z ]
                 /
[ Input y ] ----/

================================================================================

BACKWARD PASS
Chain Rule: Downstream = Upstream * Local

                                          Upstream gradient: dL/dz
                                                   |
                                                   v
dL/dx = dL/dz * dz/dx  <--- [ local grad dz/dx ] <--- [ Node f ]
dL/dy = dL/dz * dz/dy  <--- [ local grad dz/dy ] <--- [ Node f ]

Modularize: Forward/Backward API

Gate / Node / Function object: Actual PyTorch code

import torch

class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)  # cache the inputs for use in backward
        z = x * y
        return z

    @staticmethod
    def backward(ctx, grad_z):  # grad_z is the upstream gradient dL/dz
        x, y = ctx.saved_tensors
        grad_x = y * grad_z  # dz/dx * dL/dz -- multiply local and upstream gradients
        grad_y = x * grad_z  # dz/dy * dL/dz
        return grad_x, grad_y

Aside: Vector Derivatives

For vector derivatives:

  1. Vector to Scalar

xRN,yRx \in \mathbb{R}^N, y \in \mathbb{R}

The derivative is a gradient.

yxRN(yx)n=yxn\frac{\partial y}{\partial x} \in \mathbb{R}^N \quad \left(\frac{\partial y}{\partial x}\right)_n = \frac{\partial y}{\partial x_n}

  2. Vector to Vector

xRN,yRMx \in \mathbb{R}^N, y \in \mathbb{R}^M

The derivative is a Jacobian Matrix.

yxRN×M(yx)n,m=ymxn\frac{\partial y}{\partial x} \in \mathbb{R}^{N \times M} \quad \left(\frac{\partial y}{\partial x}\right)_{n,m} = \frac{\partial y_m}{\partial x_n}

Aside: Jacobian Matrix

A Jacobian Matrix is the derivative for a function that maps a Vector to a Vector.

Definition

Let f:RnRm\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m be a function such that each of its first-order partial derivatives exists on Rn\mathbb{R}^n. This function takes a point x=(x1,,xn)Rn\mathbf{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n as input and produces the vector f(x)=(f1(x),,fm(x))Rm\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_m(\mathbf{x})) \in \mathbb{R}^m as output.

Then the Jacobian matrix of f\mathbf{f}, denoted Jf\mathbf{J}_{\mathbf{f}}, is the m×nm \times n matrix whose (i,j)(i, j) entry is fixj\frac{\partial f_i}{\partial x_j}; explicitly:

Jf=[fx1fxn]=[Tf1Tfm]=[f1x1f1xnfmx1fmxn]\mathbf{J}_{\mathbf{f}} = \begin{bmatrix} \frac{\partial \mathbf{f}}{\partial x_1} & \cdots & \frac{\partial \mathbf{f}}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \nabla^T f_1 \\ \vdots \\ \nabla^T f_m \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

where Tfi\nabla^T f_i is the transpose (row vector) of the gradient of the ii-th component.

Backpropagation with Vectors

================================================================================
COMPUTATIONAL GRAPH NODE: BACKPROP WITH VECTORS
================================================================================

FORWARD PASS
Computes output vector z from input vectors x, y

[ Vector x ] ----\
  shape Dx        \
                   [ Node f ] ----> [ Vector z ]
  shape Dy        /                   shape Dz
[ Vector y ] ----/

================================================================================

BACKWARD PASS
Matrix-vector multiply: Downstream = Local * Upstream

                                                Upstream gradient: dL/dz (shape Dz)
                                                          |
                                                          v
dL/dx (shape Dx) = Jacobian dz/dx [Dx x Dz] * dL/dz  <--- [ Node f ]
dL/dy (shape Dy) = Jacobian dz/dy [Dy x Dz] * dL/dz  <--- [ Node f ]

Jacobians can be sparse. For element-wise operations, the off-diagonal elements are always zero; the Jacobian becomes a diagonal matrix.

Four Fundamental Equations of Backpropagation

These allow us to propagate error (δ\delta) from the output backward through the network. We get them from simple chain rules.

Notation:

  • \odot: Hadamard (element-wise) product
  • σ\sigma': Derivative of the activation function
  • δl\delta^l: Error term for layer ll

(BP1) Error at the Output Layer

δL=aCσ(zL)\delta^L = \nabla_a C \odot \sigma'(z^L)

Interpretation: How much the output layer “missed” the target, scaled by the activation’s sensitivity.

(BP2) Propagating Error to Hidden Layers

δl=((wl+1)Tδl+1)σ(zl)\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)

Interpretation: The error at layer ll is the weighted sum of errors from the next layer (l+1l+1), pulled backward through the transpose of the weights.

(BP3) Gradient for Biases

Cbjl=δjl\frac{\partial C}{\partial b^l_j} = \delta^l_j

Interpretation: The gradient for a bias is exactly equal to the error at that neuron.

(BP4) Gradient for Weights

Cwjkl=akl1δjl\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j

Interpretation: The gradient for a weight is the product of the input activation (al1a^{l-1}) and the output error (δl\delta^l).
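A compact numeric sketch of BP1–BP4 for one hidden layer, using sigmoid activations and a squared-error cost CC (the layer sizes and random values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

rng = np.random.default_rng(0)
x = rng.standard_normal(3)                        # input activations a^0
w1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
w2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
target = np.array([1.0, 0.0])

# forward pass
z1 = w1 @ x + b1;  a1 = sigmoid(z1)
z2 = w2 @ a1 + b2; a2 = sigmoid(z2)

# BP1: output error (for C = 0.5*||a2 - target||^2, grad_a C = a2 - target)
delta2 = (a2 - target) * sigmoid_prime(z2)
# BP2: pull the error backward through w2
delta1 = (w2.T @ delta2) * sigmoid_prime(z1)
# BP3: bias gradients equal the errors
grad_b2, grad_b1 = delta2, delta1
# BP4: weight gradients are outer products of errors and input activations
grad_w2, grad_w1 = np.outer(delta2, a1), np.outer(delta1, x)
```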

CS231n Lecture Note III: Optimization

With the score function and the loss function, now we focus on how we minimize the loss. Optimization is the process of finding the set of parameters WW that minimize the loss function.

Strategy #1: A first very bad idea solution: Random search

Core idea: iterative refinement

Strategy #2: Follow the slope

In 1-dimension, the derivative of a function:

df(x)dx=limh0f(x+h)f(x)h\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

In multiple dimensions, the gradient is the vector of (partial derivatives) along each dimension WL\nabla_W L.

The slope in any direction is the dot product of the direction with the gradient. The direction of steepest descent is the negative gradient.

Numerical Gradient

We can get an approximate numerical gradient. It is often sufficient to use a very small value of hh (such as 1e-5).

This is easy to write, but might be slow.

Analytic Gradient

We can also use calculus to get the exact value of gradients.

Practical considerations

Always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check.

It often works better to compute the numeric gradient using the centered difference formula: [f(x+h)−f(x−h)]/2h.
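A minimal sketch of the centered-difference gradient check, applied to a toy function whose analytic gradient is known:

```python
import numpy as np

def numeric_gradient(f, x, h=1e-5):
    """Centered difference [f(x+h) - f(x-h)] / 2h, one dimension at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x[i]
        x[i] = old + h; fp = f(x)
        x[i] = old - h; fm = f(x)
        x[i] = old                      # restore the original value
        grad[i] = (fp - fm) / (2 * h)
    return grad

square_sum = lambda v: np.sum(v ** 2)   # analytic gradient is 2v
x = np.array([1.0, -3.0, 2.0])
grad = numeric_gradient(square_sum, x)
```

Comparing `grad` against the analytic gradient `2 * x` is exactly the gradient check described above.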

Gradient Descent

# Vanilla Gradient Descent

while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad  # perform parameter update

But in large-scale applications, the training data can have on order of millions of examples. It seems wasteful to compute the full loss function over the entire training set in order to perform only a single parameter update.

Stochastic Gradient Descent (SGD)

A very common approach to addressing this challenge is to compute the gradient over batches of the training data, which is called the Mini-batch Gradient Descent.

The gradient from a mini-batch is a good approximation of the gradient of the full objective. Therefore, much faster convergence can be achieved in practice by evaluating the mini-batch gradients to perform more frequent parameter updates.

# Vanilla Minibatch Gradient Descent

while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad  # perform parameter update

The extreme case of this is a setting where the mini-batch contains only a single example. This process is called Stochastic Gradient Descent (SGD).

The size of the mini-batch is a hyperparameter. It is usually based on memory constraints (if any), or set to some value, e.g. 32, 64 or 128. We use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2.

Problems with SGD

  1. Jittering
  2. Local minima or saddle points
  3. Gradients from mini-batches might be too noisy.

SGD + Momentum

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations.

vt+1=ρvt+f(xt)v_{t+1} = \rho v_t + \nabla f(x_t)

xt+1=xtαvt+1x_{t+1} = x_t - \alpha v_{t+1}

vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx
    x -= learning_rate * vx

  • vv: Velocity (accumulates gradient directions)
  • ρ\rho (rho): Momentum coefficient (friction), typically 0.9 or 0.99.
  • α\alpha (alpha): Learning rate.
  • f(x)\nabla f(x): Gradient.

Aside: Nesterov Momentum

Standard Momentum calculates the gradient first, and then adds the previous velocity to it.

Unlike standard Momentum, Nesterov Momentum adds the previous velocity to the parameters first, and then calculates the gradient at the updated point. It looks ahead, making it more stable.

vt+1=ρvtαf(xt+ρvt)v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t)

xt+1=xt+vt+1x_{t+1} = x_t + v_{t+1}

We can let x~t=xt+ρvt\tilde{x}_t = x_t + \rho v_t to make it look nicer. Rearrange:

vt+1=ρvtαf(x~t)v_{t+1} = \rho v_t - \alpha \nabla f(\tilde{x}_t)

x~t+1=x~tρvt+(1+ρ)vt+1\tilde{x}_{t+1} = \tilde{x}_t - \rho v_t + (1 + \rho)v_{t+1}

=x~t+vt+1+ρ(vt+1vt)= \tilde{x}_t + v_{t+1} + \rho(v_{t+1} - v_t)

“Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction
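A toy sketch of the rearranged x~\tilde{x} update on f(x)=x2f(x) = x^2 (the learning rate and ρ\rho here are made-up values):

```python
import numpy as np

def compute_gradient(x):
    return 2 * x                       # gradient of the toy objective f(x) = x^2

x = np.array([5.0])                    # this plays the role of x_tilde
v = np.zeros_like(x)
rho, learning_rate = 0.9, 0.1

for _ in range(100):
    old_v = v
    v = rho * v - learning_rate * compute_gradient(x)
    x += v + rho * (v - old_v)         # x_tilde update from the rearranged form
```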

RMSProp

RMSProp (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm designed to improve the performance and speed of training deep learning models.

It adds element-wise scaling of the gradient based on historical sums of squares in each dimension (with decay).

Mathematical Formulation:

  1. Compute Gradient

gt=θJ(θt)g_t = \nabla_{\theta} J(\theta_t)

  2. Update Moving Average of Squared Gradients

E[g2]t=γE[g2]t1+(1γ)gt2E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma)g_t^2

  3. Update Parameters

θt+1=θtηE[g2]t+ϵgt\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t

where:

  • Learning Rate (η): Controls the step size during parameter updates.
  • Decay Rate (γ): Determines how quickly the moving average of squared gradients decays.
  • Epsilon (ϵ): A small constant (e.g., 1e-8) added to the denominator to prevent division by zero and ensure numerical stability.

grad_squared = 0
while True:
    dx = compute_gradient(x)
    # Update moving average of the squared gradients
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    # Update weights
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

Aside: AdaGrad

grad_squared = 0
while True:
    dx = compute_gradient(x)
    # Difference HERE: accumulate the sum of squared gradients (no decay)
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

RMSProp is basically AdaGrad with decay.

Adam & AdamW

Adam (Adaptive Moment Estimation) optimizer combines the advantages of Momentum and RMSprop techniques to adjust learning rates during training.

first_moment = 0
second_moment = 0
for t in range(1, num_iterations):
    dx = compute_gradient(x)
    # Momentum
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    # RMSProp
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    # Bias correction
    first_unbias = first_moment / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    # RMSProp-style update with the corrected first moment
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)

However, since the first and second moment estimates start at zero (and are therefore biased toward zero), the initial adaptive step could be gigantic. So we implement bias correction to prevent early-stage instability.

For standard Adam, L2 regularization is implemented by adding the penalty's gradient to dx before the moment updates, so the regularization gets mixed into the adaptive scaling.

For AdamW (decoupled weight decay), the weight-decay term is instead applied directly to x after the moment-based step, separately from the adaptive scaling.

Learning Rate Schedules

The learning rate is a hyperparameter. We may make it decay over time.

  1. Reduce learning rate at a few fixed points.
  2. Cosine decay
  3. Linear decay
  4. Inverse sqrt decay
  5. etc.
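For instance, one common form of cosine decay is the half-cosine schedule below (the learning-rate values are made up):

```python
import math

def cosine_lr(t, T, lr_max, lr_min=0.0):
    """Decay the learning rate from lr_max at step 0 to lr_min at step T."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

lr_start = cosine_lr(0, 100, 0.1)     # begins at lr_max
lr_end = cosine_lr(100, 100, 0.1)     # ends at lr_min
```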

Empirical rule of thumb: If you increase the batch size by N, also scale the initial learning rate by N.

Second-Order Optimization

We can use gradient and Hessian to form a quadratic approximation, then step to its minima.

Second-Order Taylor Series:

J(θ)J(θ0)+(θθ0)θJ(θ0)+12(θθ0)H(θθ0)J(\boldsymbol{\theta}) \approx J(\boldsymbol{\theta}_0) + (\boldsymbol{\theta} - \boldsymbol{\theta}_0)^\top \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_0) + \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}_0)^\top \mathbf{H} (\boldsymbol{\theta} - \boldsymbol{\theta}_0)

Solving for the critical point we obtain the Newton parameter update:

θ=θ0H1θJ(θ0)\boldsymbol{\theta}^* = \boldsymbol{\theta}_0 - \mathbf{H}^{-1} \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_0)
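For a purely quadratic objective the approximation is exact, so a single Newton step lands on the minimizer; a made-up 2-D example:

```python
import numpy as np

# Quadratic objective J(theta) = 0.5 * theta^T A theta - b^T theta,
# whose gradient is A @ theta - b and whose Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])       # symmetric positive definite
b = np.array([1.0, 1.0])

theta0 = np.array([5.0, -5.0])
grad = A @ theta0 - b
theta_star = theta0 - np.linalg.solve(A, grad)   # the Newton update H^{-1} grad
```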

This can be bad for deep learning because of the amount of computation required: the Hessian has one entry per pair of parameters, and inverting it is more expensive still.

  • Quasi-Newton methods (BFGS)
  • L-BFGS (Limited-Memory BFGS): Does not store the full inverse Hessian.

Aside: Hessian Matrices

The Hessian matrix (or simply the Hessian), denoted as H\mathbf{H}, represents the second-order partial derivatives of a function. While the gradient (\nabla) tells you the slope (direction of steepest descent), the Hessian tells you the curvature of the loss function surface.

It is a square matrix of second-order partial derivatives of a scalar-valued function, f:RnRf: \mathbb{R}^n \to \mathbb{R}. It describes the local curvature of a function of many variables.

H(f)=[2fx122fx1x22fx1xn2fx2x12fx222fx2xn2fxnx12fxnx22fxn2]\mathbf{H}(f) = \begin{bmatrix}\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\\vdots & \vdots & \ddots & \vdots \\\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}\end{bmatrix}

Hij=2fxixj\mathbf{H}_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}

Hessian Matrices are symmetric.

🔲 ☆

CS231n Lecture Note II: Linear Classifiers

With the disadvantages of the KNN algorithm, we need to come up with a more powerful approach. The new approach will have two major components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels.

Score Function

The score function maps the pixel values of an image to confidence scores for each class.

As before, let’s assume a training dataset of images xiRD\mathbf{x}_i \in \mathbf{R}^D, each associated with a label yiy_i. Here i=1Ni = 1 \dots N and yi1Ky_i \in 1 \dots K.

That is, we have N\mathbf{N} examples (each with a dimensionality D\mathbf{D}) and K\mathbf{K} distinct categories.

We will define the score function f:RDRKf : \mathbf{R}^D \mapsto \mathbf{R}^K that maps the raw image pixels to class scores.

Linear Classifier

We will start out with arguably the simplest possible function, a linear mapping.

f(xi,W,b)=Wxi+bf(\mathbf{x}_i, \mathbf{W}, \mathbf{b}) = \mathbf{W}\mathbf{x}_i + \mathbf{b}

In the above equation, we are assuming that the image xi\mathbf{x}_i has all of its pixels flattened out to a single column vector of shape [D x 1]. The matrix W\mathbf{W} (of size [K x D]), and the vector b\mathbf{b} (of size [K x 1]) are the parameters of the function.

The parameters in W\mathbf{W} are often called the weights, and b\mathbf{b} is called the bias vector because it influences the output scores, but without interacting with the actual data xi\mathbf{x}_i.

The input data are given and fixed. The goal is to set W,b\mathbf{W,b} in such way that the computed scores match the ground truth labels across the whole training set.

Bias Tricks

We can combine the two sets of parameters into a single matrix that holds both of them by extending the vector xi\mathbf{x}_i with one additional dimension that always holds the constant 1\mathbf{1} - a default bias dimension.

f(xi,W)=Wxif(\mathbf{x}_i, \mathbf{W}) = \mathbf{W}\mathbf{x}_i


Image Data Preprocessing

In Machine Learning, it is a very common practice to always perform normalization of input features. In particular, it is important to center your data by subtracting the mean from every feature.

Loss Function

We will develop Multiclass Support Vector Machine (SVM) loss.

The score function takes the pixels and computes the vector f(xi,W)f(\mathbf{x}_i, \mathbf{W}) of class scores, which we will abbreviate to s\mathbf{s} (short for scores).

The Multiclass SVM loss for the i-th example is then formalized as follows:

Li=jyimax(0,sjsyi+Δ)L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)

This term accumulates loss from every incorrect class whose score comes within Δ of the correct class score.

In summary, the SVM loss function wants the score of the correct class $y_i$ to be larger than the incorrect class scores by at least $\Delta$ (delta).

The threshold at zero max(0, -) function is often called the hinge loss. We also have squared hinge loss SVM (or L2-SVM), which uses the form max(0, -)² that penalizes violated margins more strongly.

The Multiclass Support Vector Machine "wants" the score of the correct class to be higher than all other scores by at least a margin of delta.
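A vectorized sketch of the loss defined above (the function name and the toy scores are illustrative; the single-example numbers were chosen so the margins are easy to check by hand):

```python
import numpy as np

def svm_loss(scores, y, delta=1.0):
    """Multiclass SVM (hinge) loss, averaged over a batch.

    scores: (N, K) array of class scores; y: (N,) array of correct labels.
    """
    N = scores.shape[0]
    correct = scores[np.arange(N), y][:, None]          # (N, 1) correct-class scores
    margins = np.maximum(0, scores - correct + delta)   # (N, K) hinge terms
    margins[np.arange(N), y] = 0                        # skip the j == y_i term
    return margins.sum() / N

# Toy example: one example, 3 classes, correct class 0.
# Loss = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1) = 2.9 + 0 = 2.9
scores = np.array([[3.2, 5.1, -1.7]])
loss = svm_loss(scores, np.array([0]))
```

Zeroing the `j == y_i` entry matters: without it, the correct class would contribute a constant `max(0, delta) = delta` to every example's loss.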

Regularization

We wish to encode some preference for a certain set of weights W over others to remove this ambiguity. We can do so by extending the loss function with a regularization penalty R(W)R(\mathbf{W}). The most common regularization penalty is the squared L2 norm that discourages large weights through an elementwise quadratic penalty over all parameters:

R(W)=klWk,l2R(\mathbf{W}) = \sum_k \sum_l W_{k,l}^2

Including the regularization penalty completes the full Multiclass Support Vector Machine loss, which is made up of two components: the data loss (which is the average loss LiL_i over all examples) and the regularization loss.

L=1NiLidata loss+λR(W)regularization lossL = \underbrace{\frac{1}{N} \sum_i L_i}_{\text{data loss}} + \underbrace{\lambda R(\mathbf{W})}_{\text{regularization loss}}

Or in full form:

L=1Nijyi[max(0,f(xi;W)jf(xi;W)yi+Δ)]+λklWk,l2L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \left[ \max(0, f(\mathbf{x}_i; \mathbf{W})_j - f(\mathbf{x}_i; \mathbf{W})_{y_i} + \Delta) \right] + \lambda \sum_k \sum_l W_{k,l}^2

Penalizing large weights tends to improve generalization, because it means that no input dimension can have a very large influence on the scores all by itself. It keeps the weights small and simple. This can improve the generalization performance of the classifiers on test images and lead to less overfitting.

In other words, regularization deliberately gives up a little training accuracy in exchange for better performance on unseen data.

Note that due to the regularization penalty we can never achieve loss of exactly 0.0 on all examples.

Practical Considerations

Setting Delta: It turns out that this hyperparameter can safely be set to Δ=1.0\Delta = 1.0 in all cases. (The exact value of the margin between the scores is in some sense meaningless because the weights can shrink or stretch the differences arbitrarily.)

Binary Support Vector Machine: The loss for the i-th example can be written as

Li=Cmax(0,1yiwTxi)+R(W)L_i = C \max(0, 1 - y_i \mathbf{w}^T \mathbf{x}_i) + R(\mathbf{W})

CC in this formulation and λ\lambda in our formulation control the same tradeoff and are related through a reciprocal relation C1λC \propto \frac{1}{\lambda}.

Other Multiclass SVM formulations: The Multiclass SVM presented in this section is only one of several ways of formulating the SVM over multiple classes.

Another commonly used form is the One-Vs-All (OVA) SVM which trains an independent binary SVM for each class vs. all other classes. Related, but less common to see in practice is also the All-vs-All (AVA) strategy.

The last formulation you may see is a Structured SVM, which maximizes the margin between the score of the correct class and the score of the highest-scoring incorrect runner-up class.

Softmax Classifier

In the Softmax Classifier, we now interpret these scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss that has the form:

Li=log(efyijefj)or equivalentlyLi=fyi+logjefjL_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) \hspace{1cm} \text{or equivalently} \hspace{1cm} L_i = -f_{y_i} + \log\sum_j e^{f_j}

where we are using the notation fjf_j to mean the j-th element of the vector of class scores f\mathbf{f}.

The function fj(z)=ezjkezkf_j(\mathbf{z}) = \frac{e^{z_j}}{\sum_k e^{z_k}} is called the softmax function: It takes a vector of arbitrary real-valued scores (in z\mathbf{z}) and squashes it to a vector of values between zero and one that sum to one.

Information Theory View

The cross-entropy between a “true” distribution p\mathbf{p} and an estimated distribution q\mathbf{q} is defined as:

H(p,q)=xp(x)logq(x)H(\mathbf{p}, \mathbf{q}) = -\sum_x p(x) \log q(x)

Minimizing Cross-Entropy is equivalent to minimizing the KL Divergence.

H(p,q)=H(p)+DKL(pq)H(p, q) = H(p) + D_{KL}(p||q)

Because the true distribution pp is fixed (in classification it is a one-hot vector, so its entropy H(p)H(p) is zero), minimizing the cross-entropy is the same as minimizing the KL divergence, i.e. forcing the predicted distribution qq to look exactly like the true distribution pp.
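Concretely, when pp is one-hot at the correct class yiy_i (so p(yi)=1p(y_i) = 1 and all other entries are zero), the cross-entropy collapses to the softmax loss defined earlier:

H(p,q)=xp(x)logq(x)=logq(yi)=log(efyijefj)=LiH(\mathbf{p}, \mathbf{q}) = -\sum_x p(x) \log q(x) = -\log q(y_i) = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) = L_i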

The Softmax Loss objective is to force the neural network to output a probability distribution where the correct class has a probability very close to 1.0, and all other classes are close to 0.0.

Information Theory Supplementary

Information Entropy measures the uncertainty or unpredictability of a random variable.

The lower the probability of an event, the more information is gained when it occurs; a distribution spread over many unlikely outcomes therefore has high entropy. Conversely, if one outcome has probability 1 (certainty), the entropy is 0.

For a discrete random variable XX with possible outcomes {x1,...,xn}\{x_1, ..., x_n\} and probabilities P(xi)P(x_i), the entropy H(X)H(X) is defined as:

H(X)=i=1nP(xi)logP(xi)H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)
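The formula can be checked on small distributions. A sketch (the helper name is illustrative; entropy is computed in nats, i.e. with the natural log, and terms with p(x) = 0 contribute 0 by the usual convention):

```python
import numpy as np

def entropy(p):
    """Entropy in nats; terms with p(x) = 0 contribute 0 by convention."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                      # drop zero-probability outcomes
    return -np.sum(p[nz] * np.log(p[nz]))

h_fair = entropy([0.5, 0.5])   # most uncertain two-outcome case: log 2
h_skew = entropy([0.9, 0.1])   # less uncertain, so lower entropy
h_cert = entropy([1.0, 0.0])   # certainty: entropy 0
```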

Cross Entropy measures the total cost of using distribution qq to represent distribution pp.

Minimizing the cross-entropy H(p,q)H(p, q) is mathematically equivalent to minimizing the KL Divergence. It forces the predicted distribution qq to become as close as possible to the true distribution pp.

Probabilistic View

P(yixi;W)=efyijefjP(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}

The formula maps raw scores to a range of (0,1)(0, 1) such that the sum of all class probabilities equals 1.

Using the Cross-Entropy loss function during training is equivalent to maximizing the likelihood of the correct class.

Numeric Stability

Exponentiating large scores can overflow, making the division numerically unstable, so it is important to use a normalization trick.

efyijefj=CefyiCjefj=efyi+logCjefj+logC\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}

A common choice for CC is to set logC=maxjfj\log C = -\max_j f_j. This simply states that we should shift the values inside the vector ff so that the highest value is zero. In code:

import numpy as np

f = np.array([123, 456, 789]) # example with 3 classes and each having large scores
p = np.exp(f) / np.sum(np.exp(f)) # Bad: Numeric problem, potential blowup

# instead: first shift the values of f so that the highest number is 0:
f -= np.max(f) # f becomes [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer
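Combining the shift with the equivalent form Li=fyi+logjefjL_i = -f_{y_i} + \log\sum_j e^{f_j} gives a stable softmax loss in one step; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def softmax_loss(f, y):
    """Numerically stable cross-entropy loss for one example.

    f: (K,) raw class scores; y: index of the correct class.
    """
    f = f - np.max(f)                       # shift so the max score is 0
    return -f[y] + np.log(np.sum(np.exp(f)))

# With the huge scores from above and correct class 2 (the max score),
# the loss is essentially zero, as expected.
loss = softmax_loss(np.array([123.0, 456.0, 789.0]), y=2)
```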

Note

To be precise, the SVM classifier uses the hinge loss, or also sometimes called the max-margin loss.

The Softmax classifier uses the cross-entropy loss.

SVM vs Softmax


The SVM interprets these as class scores and its loss function encourages the correct class to have a score higher by a margin than the other class scores.

The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high (equivalently the negative of it to be low).

Softmax classifier provides “probabilities” for each class.

The “probabilities” are dependent on the regularization strength. They are better thought of as confidences where the ordering of the scores is interpretable.

In practice, SVM and Softmax are usually comparable. Compared to the Softmax classifier, the SVM is a more local objective.

The Softmax classifier is never fully happy with the scores it produces: the correct class could always have a higher probability and the incorrect classes always a lower probability and the loss would always get better.

However, the SVM is happy once the margins are satisfied and it does not micromanage the exact scores beyond this constraint.


CS231n Lecture Note I: Image Classification

Image Classification

Task: assigning an input image one label from a fixed set of categories

Images are defined as tensors of integers between [0,255], e.g. 800 x 600 x 3

Challenges:

  • Viewpoint variation
  • Scale variation
  • Deformation
  • Occlusion
  • Illumination conditions
  • Background clutter
  • Intra-class variation

A good image classification model must be invariant to the cross product of all these variations, while simultaneously retaining sensitivity to the inter-class variations.

The data-driven approach: first accumulate a training dataset of labeled images, then develop learning algorithms that learn from them.

The image classification pipeline:

  1. Input: Input a set of N images, each labeled with one of K different classes.
  2. Learn: use the training set to learn what every one of the classes looks like. This step is referred to as training a classifier, or learning a model.
  3. Evaluate: evaluate the quality of the classifier by asking it to predict labels for a new set of images that it has never seen before.

Nearest Neighbor Classifier


L1 Distance (Manhattan Distance)

L1 Distance is the sum of the absolute values of the differences between corresponding dimensions of two vectors. The calculation formula is:

L1(X,Y)=i=1nxiyi=x1y1+x2y2+...+xnynL1(X, Y) = \sum_{i=1}^{n} |x_i - y_i| = |x_1 - y_1| + |x_2 - y_2| + ... + |x_n - y_n|

L2 Distance (Euclidean Distance)

L2 Distance is the square root of the sum of the squared differences between corresponding dimensions of two vectors. This is what we commonly refer to as the straight-line distance between two points. The calculation formula is:

L2(X,Y)=i=1n(xiyi)2=(x1y1)2+(x2y2)2+...+(xnyn)2L2(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_n - y_n)^2}

Note on L2: Squaring amplifies values, thereby magnifying the influence of outliers.
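Both distances are one-liners in numpy; a sketch with vectors small enough to verify by hand (the example values are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# L1: sum of absolute coordinate differences: |1-4| + |2-0| + |3-3| = 5
l1 = np.sum(np.abs(x - y))

# L2: straight-line distance: sqrt(9 + 4 + 0) = sqrt(13)
l2 = np.sqrt(np.sum((x - y) ** 2))
```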

Evaluation

For evaluation, we use accuracy, which measures the fraction of predictions that were correct.

k-Nearest Neighbor Classifier

The idea: we will find the top k closest images, and have them vote on the label of the test image.
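The vote can be sketched in a few lines of numpy (the function name, the choice of L1 distance, and the toy clusters are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Predict a label by majority vote over the k nearest training points."""
    dists = np.sum(np.abs(X_train - x_test), axis=1)   # L1 distance to every training example
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    votes = y_train[nearest]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]                   # majority label among the k neighbors

# Toy data: two well-separated clusters with labels 0 and 1.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 0, 1, 1])
pred = knn_predict(X, y, np.array([0.05, 0.05]), k=3)  # all 3 neighbors vote 0
```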

Hyperparameters

It’s often not obvious what values/settings one should choose for hyperparameters.

We cannot use the test set for the purpose of tweaking hyperparameters.

-> Split your training set into a smaller training set and a validation set. Use the validation set to tune all hyperparameters. At the end, run a single time on the test set and report performance.

Cross-validation

Cross-validation: iterating over different validation sets and averaging the performance across these.
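A minimal k-fold splitting sketch (the helper name and fold count are illustrative): shuffle the indices once, cut them into disjoint folds, and let each fold serve once as the validation set while the rest is used for training.

```python
import numpy as np

def kfold_indices(n, n_folds=5, seed=0):
    """Split indices 0..n-1 into n_folds disjoint validation folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, n_folds)

folds = kfold_indices(20, n_folds=5)
for i, val_idx in enumerate(folds):
    # Training indices: everything outside the current validation fold.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # ...train on train_idx, evaluate on val_idx, then average accuracy over folds
```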

