
Stitchable Neural Networks

September 13, 2020 08:00

TL;DR: the Stitchable Neural Networks paper includes some interesting concepts. It allows the creation of multiple neural networks with varying complexity and performance trade-offs from a family of pretrained models.

Key Principles

  • How to choose anchors from well-performing pretrained models in a model family
  • The design of stitching layers
  • The stitching direction and strategy
  • A simple but effective training strategy

A key question about combining sub-networks from different pretrained models is how to maintain accuracy. The paper concludes that the final performance of these combinations is nearly predictable due to an interpolation-like performance curve between anchors. This predictability allows for selective pre-training of stitches based on various deployment scenarios.

The Choice of Anchors

Anchors that are pretrained on different tasks can learn very different representations due to the large distribution gap of different domains. Therefore, the selected anchors should be consistent in terms of the pretrained domain.

The Stitching Layer and its Initialization

SN-Net is built upon pretrained models. The anchors have therefore already learned good representations, which allows us to directly obtain an accurate transformation matrix by solving the least-squares problem:

$$\|AM_o - B\|_F = \min_M \|AM - B\|_F$$

where \(A \in \mathbb{R}^{N \times D_1}\) and \(B \in \mathbb{R}^{N \times D_2}\) are two feature maps of the same spatial size but with different numbers of hidden dimensions.

This problem has a closed-form solution based on singular value decomposition; the optimum is obtained through an orthogonal projection in the space of matrices:

$$M_o = A^\dagger B$$

where $A^\dagger$ denotes the Moore-Penrose pseudoinverse of $A$.
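The closed-form initialization is easy to sketch in numpy (an illustrative toy of our own, not the paper's code): we fabricate feature maps whose relation is exactly linear, then recover the transformation with the pseudoinverse.

```python
import numpy as np

# Toy feature maps: N samples, D1-dim activations from anchor A, D2-dim from
# anchor B. All names and sizes here are illustrative, not from the paper.
rng = np.random.default_rng(0)
N, D1, D2 = 64, 16, 8
A = rng.standard_normal((N, D1))
M_true = rng.standard_normal((D1, D2))
B = A @ M_true  # pretend B's features are a linear map of A's

# Closed-form solution M_o = A^+ B via the Moore-Penrose pseudoinverse
M_o = np.linalg.pinv(A) @ B

assert np.allclose(A @ M_o, B, atol=1e-6)  # the recovered map reproduces B
```

In practice `np.linalg.lstsq` solves the same problem without forming the pseudoinverse explicitly.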

Where to Stitch

SN-Net takes Fast-to-Slow as the default stitching direction, meaning it stitches the bigger, slower network after the smaller, faster one to achieve better model performance. It also proposes a nearest-stitching strategy that limits stitching to two anchors of the nearest model complexity/performance.

Way to Stitch

Prior work shows that neighboring layers dealing with feature maps of the same scale share similar representations. Therefore, SN-Net uses a sliding window, where stitches within the same window share a common stitching layer.

sliding-window

Stitching Space

The stitching space is controlled by configuring the sliding window's kernel size $k$ and step size $s$.
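A small sketch of how $k$ and $s$ carve layers into windows (our own helper, not the paper's code); stitches within one window share a single stitching layer:

```python
# Illustrative sketch: enumerate which layer indices fall into each sliding
# window of size k with stride s; stitches within a window share one layer.
def stitch_windows(num_layers, k, s):
    windows = []
    start = 0
    while start + k <= num_layers:
        windows.append(list(range(start, start + k)))
        start += s
    return windows

windows = stitch_windows(num_layers=12, k=4, s=2)
# windows[0] == [0, 1, 2, 3], windows[1] == [2, 3, 4, 5], ...
```

Larger $k$ or smaller $s$ grows the number of candidate stitches, which is exactly the knob the paper uses to size the stitching space.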

Training Strategy

The training algorithm of SN-Net can be described as:

sn-net-training

The training algorithm can be summarized as:

  1. Define a configuration set that contains all possible stitches
  2. Initialize all stitching layers with least-squares matching
  3. At each training iteration, randomly sample a stitch and follow the standard training process
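The loop itself can be sketched in a few lines of Python (a minimal sketch with placeholder names; the configuration tuples and `train_step` are our own assumptions, not the paper's API):

```python
import random

# A stand-in configuration set of all possible stitches
configs = [("anchor_small", "anchor_large", idx) for idx in range(4)]

def train_step(batch, config):
    # placeholder for forward/backward on the sampled stitched sub-network
    return {"config": config, "loss": 0.0}

random.seed(0)
for batch in range(8):               # stand-in for a data loader
    cfg = random.choice(configs)     # randomly sample one stitch per iteration
    out = train_step(batch, cfg)     # then follow the standard training process
```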

TensorIR Transformation

August 30, 2022 08:00

In the previous post, we’ve explored how to write primitive functions in TensorIR. Here, we will see how to transform TensorIR into other (potentially more performant) variants. The content is derived from the MLC course taught by Tianqi Chen.

Batched BMM ReLU

A batched matrix multiplication followed by a ReLU operation can be expressed using numpy as:

def lnumpy_mm_relu_v2(A: np.ndarray, B: np.ndarray, C: np.ndarray):
    Y = np.empty((16, 128, 128), dtype="float32")
    for n in range(16):
        for i in range(128):
            for j in range(128):
                for k in range(128):
                    if k == 0:
                        Y[n, i, j] = 0
                    Y[n, i, j] = Y[n, i, j] + A[n, i, k] * B[n, k, j]
    for n in range(16):
        for i in range(128):
            for j in range(128):
                C[n, i, j] = max(Y[n, i, j], 0)

Translating the numpy code into TensorIR we get:

@tvm.script.ir_module
class MyBmmRule:
    @T.prim_func
    def bmm_relu(A: T.Buffer[(16, 128, 128), "float32"],
                 W: T.Buffer[(16, 128, 128), "float32"],
                 Y: T.Buffer[(16, 128, 128), "float32"]):
        T.func_attr({"global_symbol": "bmm_relu", "tir.noalias": True})
        # we must allocate the intermediate buffer here!
        Y_ = T.alloc_buffer([16, 128, 128], dtype="float32")
        for n, i, j, k in T.grid(16, 128, 128, 128):
            with T.block("M"):
                vn = T.axis.spatial(16, n)
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j)
                vk = T.axis.reduce(128, k)
                with T.init():
                    Y_[vn, vi, vj] = T.float32(0)
                Y_[vn, vi, vj] += A[vn, vi, vk] * W[vn, vk, vj]
        for n, i, j in T.grid(16, 128, 128):
            with T.block("R"):
                vn = T.axis.spatial(16, n)
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j)
                Y[vn, vi, vj] = T.max(Y_[vn, vi, vj], T.float32(0))

Our ultimate goal is to transform the TensorIR above to the following form:

@tvm.script.ir_module
class TargetModule:
    @T.prim_func
    def bmm_relu(A: T.Buffer[(16, 128, 128), "float32"], B: T.Buffer[(16, 128, 128), "float32"], C: T.Buffer[(16, 128, 128), "float32"]) -> None:
        T.func_attr({"global_symbol": "bmm_relu", "tir.noalias": True})
        Y = T.alloc_buffer([16, 128, 128], dtype="float32")
        for i0 in T.parallel(16):
            for i1, i2_0 in T.grid(128, 16):
                for ax0_init in T.vectorized(8):
                    with T.block("M_init"):
                        n, i = T.axis.remap("SS", [i0, i1])
                        j = T.axis.spatial(128, i2_0 * 8 + ax0_init)
                        Y[n, i, j] = T.float32(0)
                for ax1_0 in T.serial(32):
                    for ax1_1 in T.unroll(4):
                        for ax0 in T.serial(8):
                            with T.block("M_update"):
                                n, i = T.axis.remap("SS", [i0, i1])
                                j = T.axis.spatial(128, i2_0 * 8 + ax0)
                                k = T.axis.reduce(128, ax1_0 * 4 + ax1_1)
                                Y[n, i, j] = Y[n, i, j] + A[n, i, k] * B[n, k, j]
                for i2_1 in T.vectorized(8):
                    with T.block("R"):
                        n, i = T.axis.remap("SS", [i0, i1])
                        j = T.axis.spatial(128, i2_0 * 8 + i2_1)
                        C[n, i, j] = T.max(Y[n, i, j], T.float32(0))

Before we perform the transformation, let’s understand what the transformed TensorIR is doing by looking at several loops here.

First, taking a look at

for i1, i2_0 in T.grid(128, 16):
    for ax0_init in T.vectorized(8):
        with T.block("M_init"):
            n, i = T.axis.remap("SS", [i0, i1])
            j = T.axis.spatial(128, i2_0 * 8 + ax0_init)
            Y[n, i, j] = T.float32(0)

This code block initializes the Y matrix to 0, but it does so by initializing every 8 consecutive elements in each row of Y with a vectorized operation (which might be faster).

The next loop is a bit tricky:

for ax1_0 in T.serial(32):
    for ax1_1 in T.unroll(4):
        for ax0 in T.serial(8):
            with T.block("M_update"):
                n, i = T.axis.remap("SS", [i0, i1])
                j = T.axis.spatial(128, i2_0 * 8 + ax0)
                k = T.axis.reduce(128, ax1_0 * 4 + ax1_1)
                Y[n, i, j] = Y[n, i, j] + A[n, i, k] * B[n, k, j]

This loop performs the matrix multiplication of A and B: we multiply a row of A with a column of B and sum the products into a single number.

Here, i is mapped to i1, which means we access A one row at a time. k = T.axis.reduce(128, ax1_0 * 4 + ax1_1) means we walk along a row of A and down a column of B sequentially during the multiplication, with unrolling applied in the hope of better access efficiency (\(128 = 32 \times 4\)). j = T.axis.spatial(128, i2_0 * 8 + ax0) really just means accessing each column sequentially, nothing special.

Perform Transformation

To perform a transformation on any TensorIR, it’s very important to follow the steps listed below:

  1. Get block
  2. Get loops
  3. Organize loops by split, reorder, compute_at/reverse_compute_at
  4. Decompose reduction
  5. vectorize/unroll/parallel

Applying step 1, 2, and 3, we first get the block from the original TensorIR:

sch = tvm.tir.Schedule(MyBmmRule)
# Step 1. Get blocks
block_M = sch.get_block("M", func_name="bmm_relu")

# Step 2. Get loops
n, i, j, k = sch.get_loops(block_M)

# Step 3. Organize loops
k0, k1 = sch.split(k, factors=[32, 4])
j0, j1 = sch.split(j, factors=[16, 8])

The reason we split k and j this way: we already mentioned that the k dimension is accessed sequentially with unrolling (by 4); and when matrix Y is initialized, a vectorized operation over 8 elements is applied along dimension j, i.e. every 8 elements in a row (TVM buffers are row-major, so this might be faster).

But the next question is: how do we reorder the split loops? I spent a lot of time trying to figure that out. It turns out the simplest way is to write out the implementation in numpy and proceed from there. Remember, we’ve already split k and j, which are used during matrix multiplication, so our new matrix multiplication in numpy would be:

for j0 in range(16):
    for k0 in range(32):
        for k1 in range(4):
            for j1 in range(8):
                Y[i, 8*j0+j1] += A[i, 4*k0 + k1] * B[4*k0+k1, 8*j0+j1]
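As a sanity check (our own, pure numpy, wrapping the nest in the outer i loop for one batch), the split and reordered loop nest produces the same result as a plain matmul:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 128)).astype("float32")
B = rng.standard_normal((128, 128)).astype("float32")
Y = np.zeros((128, 128), dtype="float32")

# split/reordered nest: j -> (j0, j1) with factor 8, k -> (k0, k1) with factor 4
for i in range(128):
    for j0 in range(16):
        for k0 in range(32):
            for k1 in range(4):
                for j1 in range(8):
                    Y[i, 8*j0 + j1] += A[i, 4*k0 + k1] * B[4*k0 + k1, 8*j0 + j1]

assert np.allclose(Y, A @ B, atol=1e-2)  # same result as an unsplit matmul
```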

Because we move to the next column in B only after traversing the previous one, we put j1 in the innermost loop. Therefore, the transformation for TensorIR would be:

sch.reorder(j0, k0, k1, j1)

We can print out the transformed TensorIR with print(sch.mod.script()):

@tvm.script.ir_module
class Module:
    @tir.prim_func
    def bmm_relu(A: tir.Buffer[(16, 128, 128), "float32"], B: tir.Buffer[(16, 128, 128), "float32"], C: tir.Buffer[(16, 128, 128), "float32"]) -> None:
        tir.func_attr({"global_symbol": "bmm_relu", "tir.noalias": True})
        Y = tir.alloc_buffer([16, 128, 128], dtype="float32")
        for n in tir.parallel(16):
            for i, j_0, k_0, k_1, j_1 in tir.grid(128, 16, 32, 4, 8):
                with tir.block("M"):
                    vn, vi = tir.axis.remap("SS", [n, i])
                    vj = tir.axis.spatial(128, j_0 * 8 + j_1)
                    vk = tir.axis.reduce(128, k_0 * 4 + k_1)
                    tir.reads(A[vn, vi, vk], B[vn, vk, vj])
                    tir.writes(Y[vn, vi, vj])
                    with tir.init():
                        Y[vn, vi, vj] = tir.float32(0)
                    Y[vn, vi, vj] = Y[vn, vi, vj] + A[vn, vi, vk] * B[vn, vk, vj]
        for n, i, j in tir.grid(16, 128, 128):
            with tir.block("R"):
                vn, vi, vj = tir.axis.remap("SSS", [n, i, j])
                tir.reads(Y[vn, vi, vj])
                tir.writes(C[vn, vi, vj])
                C[vn, vi, vj] = tir.max(Y[vn, vi, vj], tir.float32(0))

Now, we just need to move the ReLU operation (for n, i, j in tir.grid(16, 128, 128):) into the loop nest above:

block_M = sch.get_block("M", func_name="bmm_relu")
sch.reverse_compute_at(block_M, j0)

Step 4 involves separating initialization from matrix multiplication, so we use M_init = sch.decompose_reduction(block_M, k0), which results in:

@tvm.script.ir_module
class Module:
    @tir.prim_func
    def bmm_relu(A: tir.Buffer[(16, 128, 128), "float32"], B: tir.Buffer[(16, 128, 128), "float32"], C: tir.Buffer[(16, 128, 128), "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "bmm_relu", "tir.noalias": True})
        # body
        # with tir.block("root")
        Y = tir.alloc_buffer([16, 128, 128], dtype="float32")
        for n in tir.parallel(16):
            for i, j_0 in tir.grid(128, 16):
                for j_1_init in tir.serial(8):
                    with tir.block("M_init"):
                        vn, vi = tir.axis.remap("SS", [n, i])
                        vj = tir.axis.spatial(128, j_0 * 8 + j_1_init)
                        tir.reads()
                        tir.writes(Y[vn, vi, vj])
                        Y[vn, vi, vj] = tir.float32(0)
                for k_0, k_1, j_1 in tir.grid(32, 4, 8):
                    with tir.block("M_update"):
                        vn, vi = tir.axis.remap("SS", [n, i])
                        vj = tir.axis.spatial(128, j_0 * 8 + j_1)
                        vk = tir.axis.reduce(128, k_0 * 4 + k_1)
                        tir.reads(Y[vn, vi, vj], A[vn, vi, vk], B[vn, vk, vj])
                        tir.writes(Y[vn, vi, vj])
                        Y[vn, vi, vj] = Y[vn, vi, vj] + A[vn, vi, vk] * B[vn, vk, vj]
                for ax0 in tir.serial(8):
                    with tir.block("R"):
                        vn, vi = tir.axis.remap("SS", [n, i])
                        vj = tir.axis.spatial(128, j_0 * 8 + ax0)
                        tir.reads(Y[vn, vi, vj])
                        tir.writes(C[vn, vi, vj])
                        C[vn, vi, vj] = tir.max(Y[vn, vi, vj], tir.float32(0))

The final step is easy: just apply vectorize/parallel/unroll to the corresponding loops:

n, i, j_0, j_1_init = sch.get_loops(M_init)
sch.vectorize(j_1_init)

# grab the "R" block before scheduling it
block_R = sch.get_block("R", func_name="bmm_relu")
n, i, j_0, i2_1 = sch.get_loops(block_R)
sch.vectorize(i2_1)

block_M_update = sch.get_block("M_update", func_name="bmm_relu")
n, i, j_0, k_0, k_1, j_1 = sch.get_loops(block_M_update)
sch.unroll(k_1)  # unroll the inner reduction loop, as in the target module

Print out the final TensorIR to find out its final form ( ͡❛ ͜ʖ ͡❛).

Dive into TensorIR

August 28, 2022 08:00

TensorIR is a compiler abstraction for optimizing programs with tensor computation primitives in TVM. Imagine a DNN task as a graph, where each node represents a tensor computation. TensorIR explains how each node/tensor computation primitive in the graph is carried out. This post explains my attempt to implement 2D convolution using TensorIR. It is derived from the Machine Learning Compilation course offered by Tianqi Chen.

Implement 2D Convolution

2D convolution is a common operation in image processing. The image below captures how 2D convolution operates. I won’t go into details here, but you can find plenty of information online regarding convolution.

2D-convolution

First, we initialize both the input matrix and the weight matrix:

# batch, input_channel_dim, image_height, image_width, output_channel_dim, kernel_width & height
N, CI, H, W, CO, K = 1, 1, 8, 8, 2, 3
# output_height, output_width, assuming kernel has stride=1 and padding=0
OUT_H, OUT_W = H - K + 1, W - K + 1
data = np.arange(N*CI*H*W).reshape(N, CI, H, W)
weight = np.arange(CO*CI*K*K).reshape(CO, CI, K, K)

We can validate the results using torch.nn.functional.conv2d() from PyTorch.

One thing Tianqi recommended for starters is to write the implementation first in numpy, and then translate the numpy implementation to TensorIR. I started my implementation directly in TensorIR and got totally confused. So here’s how I approach the problem.

First, and perhaps most importantly, you should figure out the access pattern of the output matrix, and gradually fill in the compute rules for each element of the output matrix. We know the output matrix has a shape of (N, CO, OUT_H, OUT_W) (which corresponds to batch, number of output channels, output height, and output width). The numpy loop will look like:

for b in np.arange(0, N):
    for co in np.arange(0, CO):
        for h in np.arange(0, OUT_H):
            for w in np.arange(0, OUT_W):
                Y[b, co, h, w] = 0

Here, we access the elements of the output matrix one by one and initialize each to 0. Next, we figure out how to compute each element. Each element of the output matrix is just the sum of the element-wise multiplication of the 2D convolutional kernel (1 by 3 by 3) and the corresponding area of the input matrix (1 by 3 by 3):

for b in np.arange(0, N):
    for co in np.arange(0, CO):
        for h in np.arange(0, OUT_H):
            for w in np.arange(0, OUT_W):
                # init to 0
                Y[b, co, h, w] = 0
                # 2d conv kernel
                for ci in np.arange(0, CI):
                    for kh in np.arange(0, K):
                        for kw in np.arange(0, K):
                            # reduction
                            Y[b, co, h, w] += A[b, ci, h+kh, w+kw] * W[co, ci, kh, kw]

We can verify the function has the same output as torch.nn.functional.conv2d() from PyTorch.
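If PyTorch isn’t at hand, a self-contained numpy cross-check also works (our own sketch): the loop version should agree with a slice-based formulation that accumulates one kernel tap at a time.

```python
import numpy as np

N, CI, H, W, CO, K = 1, 1, 8, 8, 2, 3
OUT_H, OUT_W = H - K + 1, W - K + 1
data = np.arange(N*CI*H*W).reshape(N, CI, H, W)
weight = np.arange(CO*CI*K*K).reshape(CO, CI, K, K)

# loop version from above
Y = np.zeros((N, CO, OUT_H, OUT_W), dtype=data.dtype)
for b in range(N):
    for co in range(CO):
        for h in range(OUT_H):
            for w in range(OUT_W):
                for ci in range(CI):
                    for kh in range(K):
                        for kw in range(K):
                            Y[b, co, h, w] += data[b, ci, h+kh, w+kw] * weight[co, ci, kh, kw]

# slice-based formulation: accumulate one kernel tap at a time
ref = np.zeros_like(Y)
for co in range(CO):
    for ci in range(CI):
        for kh in range(K):
            for kw in range(K):
                ref[:, co] += data[:, ci, kh:kh+OUT_H, kw:kw+OUT_W] * weight[co, ci, kh, kw]

assert np.array_equal(Y, ref)
```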

The next part is to translate the numpy code into TensorIR. I won’t go into the details of every single line here, but you can find full explanations in this note.

The nested loop can be encapsulated using T.grid() like this:

@tvm.script.ir_module
class MyConv:
    @T.prim_func
    def conv2d(data: T.Buffer[(N, CI, H, W), "int64"],
               weight: T.Buffer[(CO, CI, K, K), "int64"],
               result: T.Buffer[(N, CO, OUT_H, OUT_W), "int64"]):
        T.func_attr({"global_symbol": "conv2d", "tir.noalias": True})
        # loop through each elem in the output matrix
        for b, o, h, w in T.grid(N, CO, OUT_H, OUT_W):
            # kernel access pattern
            for kc, kh, kw in T.grid(CI, K, K):

Next, we define the block (a basic unit of computation in TensorIR). A block contains a set of block axes (vi, vj, vk) and the computations defined around them. Here, we define the properties of each block axis:

class MyConv:
    @T.prim_func
    def conv2d(data: T.Buffer[(N, CI, H, W), "int64"],
               weight: T.Buffer[(CO, CI, K, K), "int64"],
               result: T.Buffer[(N, CO, OUT_H, OUT_W), "int64"]):
        T.func_attr({"global_symbol": "conv2d", "tir.noalias": True})
        # impl
        for b, o, h, w in T.grid(N, CO, OUT_H, OUT_W):
            for kc, kh, kw in T.grid(CI, K, K):
                with T.block("A"):
                    vb = T.axis.spatial(N, b)
                    vc_o = T.axis.spatial(CO, o)
                    vh = T.axis.spatial(OUT_H, h)
                    vw = T.axis.spatial(OUT_W, w)
                    vc_i = T.axis.reduce(CI, kc)
                    vw_h = T.axis.reduce(K, kh)
                    vw_w = T.axis.reduce(K, kw)

The outer loop axes all receive T.axis.spatial(), because we access each element of the output matrix spatially, element by element, without doing anything else. The inner loop axes, on the other hand, receive T.axis.reduce(). Remember, each element of the output matrix is the sum of the element-wise multiplication of the 2D convolutional kernel (1 by 3 by 3) and the corresponding area of the input matrix (1 by 3 by 3). Therefore, after the element-wise multiplication finishes, we need to perform a reduction over all three axes. More concretely, we sum over the row (K), column (K), and channel (CI) axes: (1, 3, 3) -> (1)

@tvm.script.ir_module
class MyConv:
    @T.prim_func
    def conv2d(data: T.Buffer[(N, CI, H, W), "int64"],
               weight: T.Buffer[(CO, CI, K, K), "int64"],
               result: T.Buffer[(N, CO, OUT_H, OUT_W), "int64"]):
        T.func_attr({"global_symbol": "conv2d", "tir.noalias": True})
        # impl
        for b, o, h, w in T.grid(N, CO, OUT_H, OUT_W):
            for kc, kh, kw in T.grid(CI, K, K):
                with T.block("A"):
                    vb = T.axis.spatial(N, b)
                    vc_o = T.axis.spatial(CO, o)
                    vh = T.axis.spatial(OUT_H, h)
                    vw = T.axis.spatial(OUT_W, w)
                    vc_i = T.axis.reduce(CI, kc)
                    vw_h = T.axis.reduce(K, kh)
                    vw_w = T.axis.reduce(K, kw)

                    with T.init():
                        result[vb, vc_o, vh, vw] = T.int64(0)
                    # compute rule
                    result[vb, vc_o, vh, vw] += data[vb, vc_i, vh+vw_h, vw+vw_w] * weight[vc_o, vc_i, vw_h, vw_w]

Pathways: Google's New ML System

March 31, 2022 08:00

Google recently released the paper about its new ML system called Pathways. I’m a bit surprised, since I expected it to introduce a brand-new model architecture. In fact, this paper is not easy to digest at all; it feels like it’s written for people who have spent many years developing ML frameworks. Anyway, we will try to understand why it was developed and how it works. You should also check this post (in Chinese), which explains many concepts in Pathways much more clearly. Much of the content here is credited to that post.

This paper spends a long time discussing single-controller and multi-controller designs. It’s genuinely confusing to keep all the SPMD, MPMD, single-controller, and multi-controller concepts straight. Pathways claims future ML frameworks should go back to single-controller. By “back” I mean ML frameworks were originally single-controller, then adopted multi-controller, and now we are going back to single-controller again.

Single-Controller

TensorFlow v1 is a classic example of a single-controller system. The high-level idea is that the user defines a dataflow graph through a Python client. This graph is then submitted via session.run to the runtime system, which consists of a single master and many workers. The master compiles the dataflow graph submitted by the client and divides it into sub-graphs. The master then submits those sub-graphs to the workers.

In this case, each worker computes its own share of sub-graph. The client + master are the controller.

tf1-spmd

Fig. control messages (orange lines) need to go through the slow DCN between Ctrlr and hosts

As the paper suggests, dispatching computations in a single-controller system requires communication across the data center network (DCN). All the orange lines are control messages flowing through the DCN. We can see the workers are idle for a long time between steps, even though there’s no gap between adjacent steps on the controller.

The controller submits jobs to all workers in each step, then waits for all workers to finish computing their sub-graphs. The problems are: 1) waiting for all workers to finish computation in lock-step fashion is inefficient; 2) sending and waiting for control messages (orange lines) is costly, since these messages go through the slow DCN.

Multi-Controller Systems

Contrary to single-controller systems, multi-controller systems like JAX adopt a different philosophy. In a multi-controller system, each worker runs the same code but executes different stages/branches of it. This is why they are called SPMD (single-program-multiple-data) systems.

jax-spmd

Fig. Dispatching jobs only happens locally on hosts without going through DCN

Take MPI as an example: every MPI process is an entrance (client) to the program (in single-controller systems, only the client + master can be the entrance).

Since multi-controller systems don’t have a centralized coordinator, all workers can initiate communication with each other using much faster channels such as PCIe or NVLink. In the multi-controller graph, the black dotted lines represent messages between hosts and devices (through PCIe); communication between devices happens through fast NVLink. So we don’t have the big overhead introduced by the DCN.

If you want to get a taste of how PyTorch vs TensorFlow v1’s (multi-controller vs single-controller) programming style feels like, here are two examples: Writing Distributed Applications with PyTorch and End-to-End Tutorial for Distributed TensorFlow 1.x.

Going Back to Single-Controller

We could stick with multi-controller systems forever. If every worker node shares symmetric workloads and communications (like all-reduce, all-gather, etc.), then there’s nothing to be worried about. After all, multi-controller seems much more efficient than single-controller based on what we’ve discussed so far.

However, pipeline parallelism changes the story. Under pipeline parallelism, different workers in the pipeline execute different programs, so we have MPMD (multi-program-multi-data). For example, one worker can be doing convolution on batch 1 while another worker does encoding work on batch 2. At each stage of the pipeline, the worker does a different job on a different data batch (think of a CPU pipeline where each stage executes a different instruction).

tf1-non-spmd

Take the above graph as an example and assume we have three workers (1, 2, 3, top to bottom). Each worker performs asymmetric workloads and does irregular point-to-point communication (instead of symmetric collectives like all-gather). Obviously, the multi-controller design doesn’t fit this kind of workload: how do you write a single copy of code that performs all these irregular communications across multiple processes?

Thus, Pathways proposes going back to single-controller, so the master node can handle all these nasty communication patterns.

Deadlock

Single-controller brings back gang-scheduling and a centralized coordinator. The reason to use them is to help prevent deadlocks. However, the rationale behind this design decision is hard to extract from the paper alone. I’m going to use the post from Jinhui Yan (the developer behind OneFlow) to explain why gang-scheduling and a centralized coordinator prevent deadlocks.

Gang-scheduling is essential in the case of TPUs, since they are single-threaded and only run non-preemptible kernels, so the system will deadlock if communicating computations are not enqueued in a consistent order.

We can think of a computing device as a FIFO task queue (e.g. CUDA streams, TPUs, or CPUs). Each FIFO task queue essentially has a stream of tasks to process.

FIFO-queue

Src. Jinhui Yan

The paper emphasizes that TPUs are single-threaded and only run non-preemptible kernels. That means we can think of each TPU as a single FIFO task queue: once we enqueue a task, it cannot be preempted. We must wait until the task finishes its computation before the next task in the queue can execute. This is a problem!

deadlock

Src. Jinhui Yan

Imagine we have two devices (1 and 2), represented as two FIFO queues. Device 1 enqueues task A first and then B; device 2 enqueues task B first and then A. Both tasks A and B perform an all-scatter operation. Therefore, task A on device 1 needs to wait for messages from task A on device 2; similarly, task B on device 2 needs to wait for messages from task B on device 1.

deadlock-conditions

This is a classical example of deadlock in operating systems.
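To make the circular wait concrete, here is a toy model of our own (not from the paper): a collective task can only run when it heads every queue it appears in, so inconsistent enqueue orders stall forever, while a gang-scheduled global order makes progress.

```python
def deadlocks(queues):
    """Return True if the given per-device enqueue orders deadlock.

    Toy model: each device is a FIFO queue of collective tasks; a task runs
    only once it is at the head of every queue that contains it.
    """
    queues = [list(q) for q in queues]
    while any(queues):
        ready = None
        for q in queues:
            if q and all(q[0] not in other or other[0] == q[0] for other in queues):
                ready = q[0]
                break
        if ready is None:
            return True  # circular wait: no head task can run anywhere
        for q in queues:
            if q and q[0] == ready:
                q.pop(0)
    return False

assert deadlocks([["A", "B"], ["B", "A"]]) is True   # inconsistent orders
assert deadlocks([["A", "B"], ["A", "B"]]) is False  # one global order
```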

Solutions to Deadlock

Using gang-scheduling helps prevent deadlocks, because it enforces a global enqueue order across multiple FIFO queues instead of letting each queue handle tasks separately.

The paper also mentions that allowing devices (e.g. GPUs) to execute tasks concurrently can prevent deadlocks, because concurrency eliminates the non-preemption property required for deadlocks to happen.

concurrency

Src. Jinhui Yan

If each device allows concurrent execution (each device has multiple queues), then a task in one queue can be preempted to let another task start executing, so there is no deadlock (this is not strictly true; the post explains an interesting scenario in NCCL where deadlocks can still happen if there are too many communications).

Add Mermaid to Hugo with Dark Mode

February 15, 2022 08:00

Recently, I was revisiting materials in deep learning and needed a tool that generates diagrams easily. Drawing graphs from scratch and uploading them individually to an image-hosting platform is a daunting process. This is where Mermaid comes to the rescue: now I can generate diagrams directly from Markdown. Here’s how to do it inside a Hugo site.

I use the etch theme, but this process should apply to any Hugo site. First, we create a new file /layouts/shortcodes/mermaid.html and fill it with:

<script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
<script>
 let isDark = window.matchMedia('(prefers-color-scheme: dark)').matches;
 let mermaidTheme = (isDark) ? 'dark' : 'default';
 let mermaidConfig = {
 theme: mermaidTheme,
 logLevel: 'fatal',
 securityLevel: 'strict',
 startOnLoad: true,
 arrowMarkerAbsolute: false,

 er: {
 diagramPadding: 20,
 layoutDirection: 'TB',
 minEntityWidth: 100,
 minEntityHeight: 75,
 entityPadding: 15,
 stroke: 'gray',
 fill: 'honeydew',
 fontSize: 12,
 useMaxWidth: true,
 },
 flowchart: {
 diagramPadding: 8,
 htmlLabels: true,
 curve: 'basis',
 },
 sequence: {
 diagramMarginX: 50,
 diagramMarginY: 10,
 actorMargin: 50,
 width: 150,
 height: 65,
 boxMargin: 10,
 boxTextMargin: 5,
 noteMargin: 10,
 messageMargin: 35,
 messageAlign: 'center',
 mirrorActors: true,
 bottomMarginAdj: 1,
 useMaxWidth: true,
 rightAngles: false,
 showSequenceNumbers: false,
 },
 gantt: {
 titleTopMargin: 25,
 barHeight: 20,
 barGap: 4,
 topPadding: 50,
 leftPadding: 75,
 gridLineStartPadding: 35,
 fontSize: 11,
 fontFamily: '"Open-Sans", "sans-serif"',
 numberSectionStyles: 4,
 axisFormat: '%Y-%m-%d',
 topAxis: false,
 },
 };
 mermaid.initialize(mermaidConfig);
</script>

This setup lets Mermaid-generated diagrams switch themes based on the website’s current (light/dark) theme. The configuration is borrowed from Setup.md in mermaid-js (except the theme part); you can find more information there about configuring Mermaid.

You could also do this in /partials, but that would slow down page loading, because the Mermaid JS file would always be loaded regardless of whether the page actually uses Mermaid.

Next, we add the following lines to the file /layouts/shortcodes/mermaid.html:

<center>
<div class="mermaid">
 {{.Inner}}
</div>
</center>

Feel free to remove the <center> tag if you want to customize the diagram’s layout. And… we are done!

Here is an example sequenceDiagram; you should see the diagram adjust its theme according to light/dark mode. We use the example code from the Mermaid docs (just uncomment mermaid in the shortcode {{/*< mermaid >*/}}):

{{/*< mermaid >*/}}
sequenceDiagram
 participant Alice
 participant Bob
 Alice->>John: Hello John, how are you?
 loop Healthcheck
 John->>John: Fight against hypochondria
 end
 Note right of John: Rational thoughts <br/>prevail!
 John-->>Alice: Great!
 John->>Bob: How about you?
 Bob-->>John: Jolly good!
{{/*< /mermaid >*/}}

This diagram will adjust its theme based on light/dark theme. You can find more features from the Mermaid website.

Cross Entropy Loss

February 13, 2022 08:00

Many deep learning tasks involve classification, where a model outputs a series of probabilities for the corresponding labels. The goal is to correctly predict a given input’s label; mathematically, this means assigning the maximum probability to the correct label. The probabilities are generated through a process called softmax.

The softmax function outputs a vector \(\hat{y}\) representing the estimated conditional probability of each class given an input \(x\). For example, \(\hat{y}_1 = P(y=\textrm{car}\ |\ x)\). Assume we have many features \(x^{(i)}\) and their corresponding labels \(y^{(i)}\). Then the outputs of the model can be expressed succinctly as

\[ P(Y\ |\ X) = \prod^{k}_{i=1} P(y^{(i)} | \ x^{(i)}) \]

Our goal is to maximize \(P(Y | X)\). This is equivalent to minimizing the negative log-likelihood \( -\textrm{log} P(Y\ |\ X) = \sum^{k}_{i=1} -\textrm{log} P(y^{(i)} | \ x^{(i)}) \).
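To make the estimated conditional probabilities concrete, here is a minimal numpy softmax (our own sketch, with the usual max-subtraction for numerical stability; the logits are made up):

```python
import numpy as np

def softmax(z):
    # subtract the row max before exponentiating, for numerical stability
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1]])  # hypothetical raw model outputs
y_hat = softmax(logits)

assert np.allclose(y_hat.sum(axis=-1), 1.0)  # each row is a valid distribution
```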

This loss function is called the cross-entropy loss. It is widely used in classification tasks. Our objective is to reduce the value of this loss, which is equivalent to maximizing the predicted probability of the correct label.

To see why this works, let’s take a toy example. Suppose we have three classes, and our model produces a vector of three probabilities for each given input.

import numpy as np

# produces two probability vectors for two inputs
y_hat = np.array([[0.1, 0.3, 0.6], [0.2, 0.3, 0.5]])

The label is represented as indices into the probability vectors in y_hat, which gives us the predicted probability for the correct label.

y = np.array([0, 2])

Then, we implement the cross-entropy loss function as:

def cross_entropy(y_hat, y):
 return - np.log(y_hat[range(len(y_hat)), y])

Finally, we calculate the loss value for our given probability vectors:

cross_entropy(y_hat, y)

The result is array([2.30258509, 0.69314718]). For the first output [0.1, 0.3, 0.6], the label is at index 0, but our model gives the max probability to index 2 and only \(0.1\) to the label, hence the larger loss value. For the second probability vector [0.2, 0.3, 0.5], we made the right prediction by giving the max probability to index 2, which corresponds to the label, hence the smaller loss value.

Maximum Likelihood for Classification

January 24, 2022 08:00

Let’s say we want to classify an input text \(y\) and give it a label \(x\). Formally, we want to find:

\[ \textrm{argmax} P(x | y) \]

By Bayes’ rule this is the same as

\[ \textrm{argmax} \frac{P(y|x)P(x)}{P(y)} \]

Suppose we have five documents as training data and one document as testing data. Our objective is to assign a label to the test sentence.

text-example

Credit: Eunsol Choi

Let’s define the probability of a class as (\(N\) is the total number of training documents)

\[ p(x) = \frac{count(x)}{N} \]

and the smoothed probability of a word appearing given a class label (\(|V|\) is the size of the vocabulary)

\[ p(w_i|x) = \frac{count(w_i,x) + 1}{count(x) + |V|} \]

The conditional probabilities \(p(w_i|x)\) are

conditional-probabilities

Now, we want to find out which language label we should assign to the sentence “Chinese Chinese Tokyo Japan”. This is the same as asking which label \(x\) we should pick so that \(P(W|x)P(x)\) yields the greatest value.

If we label the sentence as j (Japanese), we have \(P(j | d_5) \propto \frac{1}{4}\cdot (\frac{2}{9})^3\cdot \frac{2}{9}\cdot \frac{2}{9} \approx 0.0001\). If we instead calculate \(P(c|d_5)\), we get \(\approx 0.0003\), which is the larger value, so c (Chinese) generates the greatest value for \(P(x | y)\).
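Plugging in the smoothed estimates behind these numbers (priors \(3/4\) and \(1/4\); \(p(\textrm{Chinese}|c)=3/7\), \(p(\textrm{Tokyo}|c)=p(\textrm{Japan}|c)=1/14\); each of the three word probabilities \(2/9\) under \(j\)) reproduces the comparison:

```python
# Score "Chinese Chinese Chinese Tokyo Japan" under each label.
def score(prior, p_chinese, p_tokyo, p_japan):
    return prior * p_chinese**3 * p_tokyo * p_japan

p_c = score(3/4, 3/7, 1/14, 1/14)  # Chinese label
p_j = score(1/4, 2/9, 2/9, 2/9)    # Japanese label
assert p_c > p_j                    # so we label the sentence c (Chinese)
```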

Machine Learning System Resources

January 8, 2022 08:00

This is my personal list of resources related to machine learning systems. Feel free to drop me an email if you think there’s something worth mentioning. I will try to update this page frequently to include the most recent work in mlsys.

Resources

Courses

Labs & Faculties

Tutorials

LLM Optimization

Communication

Seminars

Papers

This section could potentially be extremely long…

Training

Really broad topic…

LLM

You can also refer to Awesome-LLM

NAS

  • Puzzle: Distillation-Based NAS for Inference-Optimized LLMs: Applying block-wise local distillation to every alternative subblock replacement in parallel and scoring its quality and inference cost to build a “library” of blocks. Then, using Mixed-Integer-Programming to assemble a heterogeneous architecture that optimizes quality under constraints such as throughput, latency and memory usage.

Diffusion

KV Cache

Datasets

ML Compilers

Graph Optimization

Inference

Multitenancy

Dynamic Neural Network

Auto Placement

Reasoning LLM

Federated Learning

Switch & ML

Memory Management

System Design

Trade-off

Structured LLM Generation

Async Training

Self-play

Costs

RAG

Megatron with FastMoE

2021年12月1日 08:00

This is a guide on setting up Megatron-LM with FastMoE. Megatron is a transformer model developed by the Applied Deep Learning Research team at NVIDIA. FastMoE adds PyTorch support for Mixture-of-Experts (MoE) models. We use the FastMoE layer to replace the MLP layers in the transformer language model.

Prerequisites

Docker

We recommend using one of NGC’s recent PyTorch containers. The Megatron-LM repo uses pytorch:20.12-py3. We pull the image with:

docker pull nvcr.io/nvidia/pytorch:20.12-py3

Note: it’s possible to use the official PyTorch image. However, a few dependencies are missing and require manual installation. Also, PyTorch versions greater than 1.8 seem to have problems during the forward pass, so we don’t use the official PyTorch image here.

After the image is pulled successfully, we want to start a container. The NGC site contains instructions on how to start a docker image. We use the following script:

docker run --gpus all -it --rm --ipc=host -v /home/edwardhu/:/home/edwardhu/ --name pytorch-moe <image_id>

Note: we might encounter problems when starting the docker container. Make sure to set up the GPG key and remote repo for the nvidia-docker2 package on the host and install the required packages:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
 && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
 && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

Set up FastMoE

After we spin up the container, we clone the fastmoe repo and enter the project directory. There is a setup.py file in the root of the project. Then we execute:

USE_NCCL=1 python setup.py install

to install FastMoE. There is a compilation error saying that broadcastUniqueNCCLID(&ncclID)’s definition cannot be found. We see there is a condition check right above the function that errors:

#if defined(TORCH_VERSION_MAJOR) && (TORCH_VERSION_MAJOR > 1 || \
 (TORCH_VERSION_MAJOR == 1 && TORCH_VERSION_MINOR >= 8))

The check fails despite the container having PyTorch version 1.8.0a0+1606899. According to the author, the if macro deals with PyTorch’s API variance between v1.7.x and v1.8.x. For now, we simply comment out the if check and force broadcastUniqueNCCLID(&ncclID, c10d::OpType::SEND, "fastmoe_nccl_comm", rank); to be used instead of broadcastUniqueNCCLID(&ncclID):

//#if defined(TORCH_VERSION_MAJOR) && (TORCH_VERSION_MAJOR > 1 || \
// (TORCH_VERSION_MAJOR == 1 && TORCH_VERSION_MINOR >= 8))
 broadcastUniqueNCCLID(&ncclID,
 c10d::OpType::SEND,
 "fastmoe_nccl_comm",
 rank);
//#else
 //broadcastUniqueNCCLID(&ncclID);
//#endif
 ncclComm_t comm;
 NCCL_SAFE_CALL(ncclCommInitRank(&comm, getSize(), ncclID, rank));
 return comm;
 }
};

Finally, we need to download a vocab file for later use, since the Megatron repo doesn’t ship one. Here, we use the vocab file from the SDNet repo. Feel free to use something else.

Megatron-LM Setup

After we set up FastMoE, we clone the Megatron-LM repo into the container. The FastMoE’s example guide on Megatron uses Megatron v2.2 release, so we need to choose the v2.2 tag in the Megatron repo.

Next, we follow FastMoE’s guide on Megatron and apply the clip-grad-v2.2.patch and fmoefy-v2.2.patch accordingly. Instructions on how to apply patches in Linux are easy to find; for example, here is one.

RACE Dataset

After setting up Megatron-LM, we download the RACE dataset for fine-tuning downstream tasks (RACE is used for BERT evaluation; the Megatron repo also has several other examples using GPT, but here we stick to BERT). The Megatron repo provides instructions on how to acquire these datasets for evaluation. For now, we just want to get the fine-tuning process up and running, without caring too much about accuracy. Therefore, we don’t need to pre-train the BERT model just yet. After the dataset finishes downloading, we simply decompress it.

Summary

The key change to convert a model to FastMoE style is:

# Initialize FastMoE
if args.fmoefy:
    from fmoe.megatron import patch_forward_step, patch_model_provider

    forward_step_func = patch_forward_step(forward_step_func)
    model_provider = patch_model_provider(model_provider)

More information can be found in the fmoefy patch file.

Paper Review - Dynamic Tensor Rematerialization

2021年11月9日 08:00

Dynamic Tensor Rematerialization (DTR) treats GPU memory as a large cache, where tensors can be evicted to save memory, and recomputed if needed later.

DTR’s eviction policy relies on the heuristic \(h\). The heuristic assigns a value \(h(t)\) to each resident tensor \(t\), approximating the cost of evicting the tensor. DTR evicts the tensor with the lowest cost based on the value of \(h\). \(h\) can factor in arbitrary metadata.

During every operator call in PyTorch, DTR intercepts the call and performs the following tasks:

DTR-operator-intercept

In short, whenever we perform an operation, we first recursively recompute all the non-resident tensors the current operation depends on, evicting tensors we don’t need until there is enough GPU space left. To decide which tensors to evict, DTR picks the tensor with the lowest value of \(h\):

tensor-evict

The heuristic \(h\) scores tensors based on three properties: staleness, size, and compute cost. It evicts tensors that are least recently used, occupy large GPU memory, and are cheap to recompute. \(h _{DTR}\) is computed as:

\[ h _{DTR}(s, m, c) (t) := \frac{c(t)}{m(t) \cdot s(t)} \]

Recomputing an evicted tensor \(t\) may result in recomputing many more tensors that \(t\) recursively depends on. Thus, the paper proposes an improved heuristic that takes these recursive recomputations into account (at higher maintenance cost). The evicted tensors that \(t\) recursively depends on are called the evicted neighborhood \(e ^{*} (t)\).

\[ h_ {DTR-improved}(s, m, c) (t) := \frac{c(t) + \sum _{u \in e ^{*} (t)} c(u)}{m(t) \cdot s(t)} \]

This heuristic captures the recomputation costs for all tensors that \(t\) recursively depends on.
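The basic eviction rule can be sketched in a few lines; the Tensor record below is a hypothetical stand-in for DTR’s per-tensor metadata, not the paper’s implementation:

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    compute_cost: float  # c(t): time needed to recompute the tensor
    memory: float        # m(t): GPU memory it occupies
    staleness: float     # s(t): time since its last access

def h_dtr(t: Tensor) -> float:
    # h_DTR(t) = c(t) / (m(t) * s(t)): tensors that are cheap to
    # recompute, large, and stale get low scores.
    return t.compute_cost / (t.memory * t.staleness)

def pick_eviction(residents):
    # Evict the resident tensor with the lowest heuristic value.
    return min(residents, key=h_dtr)

residents = [Tensor("a", 4.0, 1.0, 1.0),   # expensive to recompute
             Tensor("b", 1.0, 8.0, 2.0)]   # big, stale, cheap
print(pick_eviction(residents).name)  # "b"
```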

Paper Review - Capuchin: Tensor-based GPU Memory Management for Deep Learning

2021年11月7日 08:00

This paper aims to reduce GPU memory usage during DNN training. Capuchin achieves this goal through swapping and recomputation, using the tensor as the unit of operation. The major question is how to balance swapping and recomputation to achieve maximum resource utilization.

Swap and Recomputation Benefit

The ultimate goal of swapping and recomputation is to hide the overhead as much as possible to minimize the wait time of back-access (a tensor evicted earlier being accessed again). For swapping, we should increase the overlap between swapping and computing; for recomputation, we should use cheap operations.

Determining Tensor Re-generation Cost

For swapping, it is usually not optimal to swap a tensor back in only when we access it. The reason is that copying a tensor from CPU memory to GPU memory usually introduces overhead greater than the computation itself. It’s thus better to swap in a tensor earlier, proactively.

The paper uses the term in-trigger. It means we use another tensor access between the evicted-access (a tensor access that triggers self-eviction after the tensor is used in computation) and the back-access to bring back an evicted tensor a little earlier.

Of course, this may raise two questions:

  • How do we know when in-trigger should happen?
  • How do we deal with PCIe lane interference? E.g., one swap-in may start later than its in-trigger because a previous swap-in has not yet finished.

The answer is quite simple: we use runtime feedback at the back-access of a tensor. If the tensor is still being swapped in, the in-trigger time should be adjusted earlier. Note that this relies on the assumption of regular tensor access patterns in deep learning training, as illustrated in the paper.

tensor-access-pattern

Recomputation, on the other hand, is performed only in an on-demand manner. No in-trigger is used for recomputation.

Capuchin relies on the principle that swapping can be largely overlapped with computation, while recomputation will certainly incur a performance penalty. Thus, it chooses swapping first, until no in-trigger can be chosen that perfectly hides the prefetching overhead.
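That swap-first preference can be sketched as a simple decision rule; the overlap test below is an illustrative assumption, not the paper’s exact cost model:

```python
def choose_regeneration(swap_in_time, overlap_window):
    """Pick how to regenerate an evicted tensor.

    Prefer swapping when the PCIe transfer can be fully hidden behind
    upcoming computation (i.e., an in-trigger can prefetch it in time);
    otherwise fall back to on-demand recomputation, which always costs
    extra compute.
    """
    return "swap" if swap_in_time <= overlap_window else "recompute"

print(choose_regeneration(2.0, 5.0))  # "swap": transfer fully overlapped
print(choose_regeneration(6.0, 5.0))  # "recompute"
```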

One thing to note: when we select a tensor \(T\) to be recomputed, but \(T\) relies on another tensor that is evicted, we need to recompute that evicted parent first. This can cascade multiple times if more recomputations target tensor \(T\). In short, recomputation and swapping cannot occur at the same time.

For more information, please refer to the original paper.

Starting Out PhD

2021年11月5日 08:00

Today marks the third month of my PhD life. Things are finally starting to become a little clearer, and I now have some potentially concrete ideas to work on.

Finding a research topic was the most difficult part. For several months, I was wandering around like a headless chicken, reading paper after paper: serverless, ML inference, compilers, pathlet routing, RDMA, you name it. The feeling of not having a topic was suffocating.

Talking to other people, especially people outside my own research area, is extremely beneficial. In fact, I was able to narrow down what I want to work on after discussions with a friend of mine who works on NLP, an area completely outside of networking. Chatting with lab mates and collaborators is also extremely helpful. They usually ask questions I would never have thought of, saving me from spending countless hours exploring cluelessly.

To me, current systems research feels application-driven. Many research projects are designed to address a very specific challenge faced at the application level. Thus, it is quite possible to find an interesting systems problem in a non-systems conference like KDD or even ICLR.

Fault Tolerance in Distributed Systems

2021年10月5日 08:00

No system can provide fault-free guarantees, including distributed systems. However, failures in distributed systems are independent: only a subset of processes fail at once. We can exploit this feature to provide some degree of fault tolerance. The problem is, fault tolerance makes everything else much more difficult.

The most common fault model is fail-stop. It means a process completely “bricks”: once a process fail-stops, no messages can emerge from it any more. We also don’t know if the process will ever restart. In addition, we must account for all possible states of the faulty process, including unsent messages in the process’s queue. On the other hand, it’s important to point out that a process that takes a long time to respond is indistinguishable from a fail-stopped one. The intuition is that both may take an unknown amount of time before messages emerge.

We use an example here to illustrate how and why a system fails to provide fault tolerance. We take a system that replicates data by broadcasting operations with logical timestamps. This system uses the Lamport clock to update local clocks (our previous post on Lamport Distributed Mutual Exclusion explains how the Lamport clock works). In short, we overwrite the value stored in a replica if an incoming message has a later timestamp \(ts\).

replication

Source: Ken McMillan

This system is not fault-tolerant. Imagine when one replica receives a write (marked by red incoming “write”) request and then tries to write the value to other replicas. This replica then fail-stops right after it writes to one replica and never writes to the other replica. In this case, not all replicas see the writes, thus violating consistency.

The solution to this problem is quite simple: reliable atomic broadcast. Under atomic broadcast, all writes are all-or-nothing. It eliminates the possibility for a process to fail-stop amid broadcasting.

Now let’s take the above example and update it with additional requirements. Instead of overwriting existing values, we append writes to an array and want to ensure every replica eventually has the same array. The major difference is that replicas need to wait for acks with higher timestamps before they can append to their arrays.

This system is also not fault-tolerant. If one replica fail-stops, the others will wait forever on a write, because every replica relies on acks with higher timestamps before committing the append.

Thus, we want to extend atomic broadcast so that updates become ordered. Under ordered atomic broadcast, writes are all-or-nothing and everyone agrees on the order of updates. If we assume the system is fully asynchronous, ordered atomic broadcast is not possible: (1) we can’t guarantee termination under asynchrony; (2) we could lose ordering. Therefore, we rely on the synchronous approach.

Under the synchronous assumption, we can safely say a process fails after waiting for time \(t _{fail}\), where \(t _{fail} = 3 t _{msg} + 2 t _{turn}\). Here, \(t _{msg}\) is the message transfer latency, and \(t _{turn}\) is the max time to respond to a message.

To see why \(t _{fail}\) is calculated this way, we use the following example to explain the process:

process-fail

Source: Ken McMillan

Imagine process \(p\) sends a message to \(q\) and waits for an ack from \(q\). Before \(q\) is able to respond with the ack, it crashes. The max time for \(p\) to see an ack from \(q\) is two message transfer times plus the execution time at \(q\), which is \(2t _{msg} + t _{turn}\). We add \(t\) to indicate the elapsed time when \(p\) sends the request.

Now imagine \(q\) is able to broadcast a request right before it fails. Later, another process forwards this request back to \(p\). Then \(p\) needs to wait for three message transfer times plus two message processing times before it can assume it will no longer receive messages from \(q\).
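The timeout itself is simple arithmetic; a sketch with illustrative latency values:

```python
def t_fail(t_msg, t_turn):
    # Worst case before p can declare q failed: q broadcast a request
    # just before crashing, a third process forwarded it back to p, and
    # p must still allow time for q's own reply -- three message
    # transfers plus two turnaround times in total.
    return 3 * t_msg + 2 * t_turn

print(t_fail(10, 2))  # 34 time units
```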

process-fail

Under the synchronous assumption, ordered reliable atomic broadcast works as follows:

  • When a client sends a request to process \(p\), the process records logical time \(r _l(p,p)\) and physical time \(r _{t}(p,p)\). Then it broadcasts the request.
  • When process \(p\) receives a message \(m\), and \(m\) contains a previously unseen timestamp \(t_m\), we record the logical time at which we see \(m\) at \(p\), denoted \(r_l(p, p_m)\), as well as the physical time \(r _t(p, p_m)\). Then we broadcast the request. Finally, we send an ack back to the originator \(p_m\) without an updated timestamp.
  • When process \(p\) receives a message \(m\), \(p\) updates \(t _l(p, p_m)\). It means we update \(p\)’s notion of the latest timestamp of another process who just acked.
  • Process \(p\) sets another process \(q\) as “failed” (denoted by \(f(p, q)\)) if \(t _l(p, q) \leq r _l(p, p’) < +\infty\) and \(t _t < r _t(p, p’) + t _{fail}\). In short, it means if we broadcast message and don’t receive any response after time \(t _{fail}\), and our last recorded logical time of process \(q\) is before our broadcast, then we know process \(q\) must have failed.
  • Then we perform updates when for all \(q \neq p\), \(r _l(p,p) < r _l (p, q)\) (meaning everyone else’s request time is later than mine) and \(t _l(p,q) > r _l(p,p) \lor f(p, q)\). Intuitively, it means we only perform updates when we receive message from other processes after our broadcast, and when we think other processes’ timestamps are after us, or when they have all failed.

Consistency Models Explained

2021年9月23日 08:00

In a distributed system, eventual consistency provides a weak guarantee that data updates will be reflected in all nodes eventually. The downside is that clients can observe awkward intermediate states. For example, a client appending numbers may observe states like [10], [10,13], [10,12,13].

Therefore, we need stronger consistency guarantees, which are easier to reason about. These consistency models provide various degrees of consistency guarantees. However, it’s not always feasible to provide the strongest guarantee: usually, one trades off consistency for availability and partition tolerance (the CAP theorem). Much of the content here is attributed to Prof. Ken McMillan.

Global Consistency

The most primitive notion of consistency is global consistency. It means we have some history of events that is totally ordered.

It is (globally) consistent if it respects the intended sequential semantics of the events.

Kenneth McMillan

Take read/write as an example: we have a sequence of write operations to the same memory location \(12\) at various timestamps, and we want to read values from this location:

\[ \texttt{write}(0, 12, 2) \rightarrow \texttt{write}(1, 12, 3) \rightarrow \texttt{read}(2, 12, 3) \]

If we have global consistency, every read at a given memory location \(x\) will yield the most recently written value to \(x\).

In reality, it’s impossible to implement a global clock in a distributed system. Therefore, no node can observe the entire history of ordered events, making global consistency hard to implement.

Linearizability

In essence, linearizability is an approximation of global consistency. The difference is that linearizability is based on logical time, as opposed to the physical time used in global consistency. It means we don’t care in what order the events occur physically. We just want the ordering of events to be consistent with what we know about time, and what we know about time is based on causality.

Linearizability has two assumptions:

  • Clients don’t have shared clock, but are able to send messages to each other. In other words, we want to create an illusion of global consistency by causality.
  • If an event \(e_1\) ends before another event \(e_2\) begins in physical time, then \(e_1\) happens-before \(e_2\). We define the happen before relationship as \(hb(e_1, e_2)\). In simpler terms, is it possible for us to assume a causal connection between two events?

Take the following scenario as an example. We can say \(hb(e_1, e_2)\) holds because we can establish a causality relation between these two events. We are able to assume \(P_1\) sent a message \(m_1\) to \(P_2\), which caused the execution of \(e_2\).

causality

Here is another example. Here, we cannot establish a causal connection between \(e_1\) and \(e_2\) because we cannot assume \(m_1\) caused the execution of \(e_2\).

non-causality

To say a set of events, denoted as \(E\), is linearizable, the following conditions must be met:

  • There exists a total order \(<_{lin}\) over events \(E\) s.t.
    • \((E,\ <_{lin})\) is globally consistent.
    • \(hb(e_1, e_2) \rightarrow e_1 <_{lin} e_2\)

In other words, a set of events \(E\) is linearizable if it respects the happens-before relationship and is totally ordered.

Let’s look at one example. Suppose we have two processes \(P_1\) and \(P_2\). \(P_1\) writes 1 to location 0, and later reads from 0 and gets 1. \(P_2\) writes 2 to location 0 before \(P_1\) finishes its write.

linearizable

Source: Ken McMillan

These events are linearizable. We can order these three events as:

\[ \texttt{write}(1, 0, 2) \rightarrow \texttt{write}(0, 0, 1) \rightarrow \texttt{read}(0, 0, 1) \]

We know \(\texttt{write}(0, 0, 1)\) happens before \(\texttt{read}(0, 0, 1)\). The read gets the most recently written (not physical time, but causality) value, satisfying global consistency. Therefore, these events are linearizable.
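This check can be mechanized: replay a candidate total order against a memory map and verify every read returns the latest write. A sketch (a global-consistency check for a given order, not a full linearizability search over orders):

```python
def globally_consistent(order):
    """Replay ('wr'|'rd', addr, value) events in the given total order;
    every read must return the most recently written value."""
    mem = {}
    for op, addr, val in order:
        if op == "wr":
            mem[addr] = val
        elif mem.get(addr) != val:
            return False
    return True

# The linearization from the example above: wr(1,0,2), wr(0,0,1), rd(0,0,1).
good = [("wr", 0, 2), ("wr", 0, 1), ("rd", 0, 1)]
print(globally_consistent(good))                        # True
print(globally_consistent([("wr", 0, 2), ("rd", 0, 1)]))  # False: stale read
```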

Here is another example showing events that are not linearizable.

not-linearizable

Source: Ken McMillan

No matter how you order these four events, there will always be a contradiction. For example, \(rd(0,0,1)\) happens after \(wr(1,0,2)\) and \(wr(0,0,1)\). In order to satisfy global consistency requirement (reading the most recently written value), we must order these three events as

\[ wr(1,0,2) \rightarrow wr(0,0,1)\rightarrow rd(0,0,1)\rightarrow\ ? \]

However, \(wr(0,0,1)\) happens before \(rd(1,0,2)\), so \(rd(1,0,2)\) must be put after \(wr(0,0,1)\), but that way the most recently written value would be 1 and it would be impossible to read value 2, thus violating global consistency.

Commit Points

A different and perhaps easier way of thinking about linearizability is using commit points. We say a set of events \(E\) is linearizable if every event can be assigned a physical commit time \(t_e\) such that:

  • \(t_e\) occurs during the execution of an event \(e\).
  • \((E,\lt _{lin})\) is globally consistent, where \(e < _{lin}\ d\) iff \(t_e < t_d\)

The following picture presents a scenario where we set three commit points on three write/read operations.

commit-points

Source: Ken McMillan

We know these events are linearizable because the three commit points we picked respects the \(\lt_{\textrm{lin}}\) relationship. The commit point \(wr(0,0,1)\) is set after the commit point for \(wr(1,0,2)\) and we know \(wr(1,0,2)< _{lin}wr(0,0,1)\).

Sequential Consistency

We can relax the requirement of linearizability even more, which leads us to sequential consistency. Sequential consistency (Lamport) is based on slightly different assumptions compared to linearizability:

  • Assume clients don’t send messages to each other
  • \(hb _{sc}(e_1, e_2)\) only holds if \(e _1\) executed before \(e _2\) in the same process.

These assumptions indicate that each process doesn’t know the relative order of operations happening on other processes. Thus, we don’t have happens-before arcs between processes.

Take the following example:

sequential

Source: Ken McMillan

We know these events meet sequential consistency under the following order. The reason is that we can’t say \(hb_{sc}(wr(0,0,1), wr(1, 0, 2))\) must hold. This example would not be linearizable, because \(wr(0,0,1)\) happens before \(wr(1,0,2)\) in physical time.

\[ wr(1, 0, 2) \rightarrow wr(0,0,1) \rightarrow rd(0,0,1) \]

Take another example that is not sequentially consistent:

not-sequential

Source: Ken McMillan

For \(rd(0,0,2)\) to be true, it must be that \(hb_{sc}(wr(0,0,1), wr(1,0,2))\) holds; for \(rd(1,0,1)\) to be true, it must be that \(hb_{sc}(wr(1,0,2), wr(0,0,1))\) holds. Now we have a cycle of ordering constraints, thus reaching a contradiction.

Causal Consistency

Causal consistency is an even weaker consistency model than sequential consistency. However, unlike the consistency models discussed before, causal consistency only applies to read/write operations. In the causal consistency model, we define a causal order on those read/write operations such that reads must see writes in an order that respects causality.

Precisely, we define a reads-from map, denoted \(RF\). \(RF\) of a read event \(e\) produces the write operation that supplied the value read (there will be ambiguity if two writes write the same value). For example, \(RF(rd(1,0,2))\) produces the write operation that wrote the value \(2\), such as \(wr(0,0,2)\). Putting \(RF\) in formal terms:

\[ RF(rd(p,a,v)) = wr(p',a',v') \rightarrow a=a' \land v = v' \]

In addition, \(hb_{RF}\) is the least transitive relation such that:

  • \(hb_{SC}(e, e’) \rightarrow hb_{RF}(e,e’)\)
  • \(RF(e’) = e \rightarrow hb_{RF}(e,e’)\). It means whoever gave me the value must happen before me, which represents a notion of causality.

We say a set of events \(E\) is causally consistent if there exists a \(RF\) map for \(E\) such that:

  • For all reads \(e\in E\), there is no write \(e’ \in E\) to the same address such that \(hb_{RF}(RF(e),e’)\) and \(hb_{RF}(e’,e)\).

In layman’s terms, it says that if a write operation caused us to read a value \(x\), there can’t be another write operation \(e’\) to the same address that happens after that write and before our read. If there were such a write, then \(RF\) would have to produce \(e’\) instead.

Take a previous example here:

causal-consistency

Source: Ken McMillan

These events are causally consistent because \(RF(rd(0,0,1)) = wr(0,0,1)\) and \(RF(rd(1,0,2)) = wr(1,0,2)\). Thus \(hb_{RF}(wr(0,0,1), rd(0,0,1))\) and \(hb _{RF}(wr(1,0,2), rd(1,0,2))\). We also know we can’t say \(hb _{SC}(wr(0,0,1), wr(1,0,2))\) because sequential consistency assumes no communication between processes. Therefore, \(hb _{RF}(wr(0,0,1), wr(1,0,2))\) doesn’t hold, and we can safely say these events are causally consistent.

Let’s look at another example:

not-causally-consistent

Source: Ken McMillan

These events are not causally consistent because \(RF(rd(2,0,3)) = wr(0,0,3)\), yet there is a write operation \(wr(0,0,1)\) happening after \(wr(0,0,3)\) that writes 1 to location 0. Therefore, it can’t be that \(wr(0,0,3)\) causes \(rd(2,0,3)\): \(wr(0,0,1)\) interferes and creates a contradiction. The easiest way to detect whether a set of events is causally consistent is to look for a cycle of dependencies.

S3 Strong Consistency

A consistency model widely used in production is the one provided by the Amazon S3 storage service. The S3 consistency model holds that:

  • if \(hb(w_1, w_2)\) and \(hb(w_2, r)\), then \(RF(r) \neq w_1\)
  • Two reads must agree on the order of writes that happen before both reads.

Here is an example that is causally consistent, but not S3 consistent:

causal-consistency

Source: Ken McMillan

The reason is that \(rd(1,0,2)\) sees the write order as \(wr(0,0,1) \rightarrow wr(1,0,2)\) while \(rd(0,0,1)\) sees \(wr(1,0,2)\rightarrow wr(0,0,1)\).

However, with a slight adjustment to the example, we have S3 consistency.

S3-consistent

Source: Ken McMillan

The reason is that \(hb(wr(1,0,2),rd(0,0,2))\) doesn’t hold. So even if \(rd(1,0,2)\) sees the write order as \(wr(0,0,1) \rightarrow wr(1,0,2)\) and \(rd(0,0,1)\) sees \(wr(1,0,2)\rightarrow wr(0,0,1)\), only \(wr(0,0,1)\) happens before both reads, so they agree on the ordering of writes.

Summary

consistency-hierarchy

Source: Ken McMillan

The consistency models discussed here are only the tip of the iceberg. In fact, different storage providers usually offer different consistency models. This may result in vendor lock-in, because an application designed for one storage system may fall apart when deployed on another due to differing consistency implications.

Lamport Distributed Mutual Exclusion

2021年9月21日 08:00

Normally, achieving consistent event ordering in a distributed system is hard because we have no common clock. Since we don’t have a common clock to measure with, we rely on logical properties of time in the absence of a clock. Here we use the causality relation between events.

In essence, causality says a clock \(C\) is a map from events to time satisfying: \(e\rightarrow e’\) implies \(C(e) < C(e’)\)

We can synthesize a clock with a simple protocol, usually referred to as a scalar clock or Lamport clock:

  • Each process \(p\) has a local clock \(C(p)\).
  • A message sent by a process is stamped with the sender’s current local clock.
  • On receiving \(M\), set the process’s local clock to be \(max(C(p), C(M)) + 1\).

This gives us an ordering of events consistent with causality (ties can be broken by process ID to obtain a total order).
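The protocol above can be sketched directly:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Advance on a local event or before sending a message.
        self.time += 1
        return self.time

    def recv(self, msg_time):
        # On receiving a message stamped msg_time, jump past both our
        # clock and the sender's: max(C(p), C(M)) + 1.
        self.time = max(self.time, msg_time) + 1
        return self.time

p, q = LamportClock(), LamportClock()
stamp = p.tick()        # p stamps an outgoing message with 1
print(q.recv(stamp))    # q's clock becomes max(0, 1) + 1 = 2
```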

Let’s take Lamport distributed mutual exclusion (DME) as an example. We use the scalar clock to agree on the order of access to critical sections. Each process broadcasts a request stamped with its local clock time. The receiver stores the request time and responds with its updated local time (\(max(C(p), C(M)) + 1\)).

lamport-dme

A process can only enter the critical section when the condition \(W\) is met: \(W \equiv \forall q \neq p,\ t(p,q) > r(p,p) \land r(p,p) < r(p,q)\). Here, \(t(p, q)\) is the latest time received by \(p\) from \(q\), and \(r(p, q)\) is the request time received by \(p\) from \(q\), or \(+\infty\). Intuitively, it says that if a process’s request time is smaller than all response times and smaller than all other request times, then this process was the first to send out its request and thus should enter the critical section.

The reason why this protocol works is illustrated below:

lamport-dme-process

When \(p_1\) sends a request at timestamp 2 and gets a response with timestamp 3, we know \(p_1\) has the greatest clock value and \(p_0\) will update its own clock based on the timestamp sent from \(p_1\). Once \(p_1\) sees the response message from \(p_0\) with timestamp 3, it knows any request from \(p_0\) must have already been received, because the network channel is ordered, so any request sent by \(p_0\) arrived before the response with timestamp 3.

To see Lamport DME in action, we use Ivy to specify the protocol. The source file is borrowed from Ken’s presentation. The code is annotated and self-explanatory:

#lang ivy1.8

# This is an implementation of Lamport's distributed mutual exclusion
# (DME) algorithm.

include order
include network

# We start the module with a 'global' section. This contains the
# declarations of any resources that are used in common by all
# processes. These usually include:
#
# - Data types
# - Services, such as network services
# - Immutable global parameters, such as network addresses
#
# We can't have mutable global variables, since processes, being
# distributed, don't have shared memory.
#

global {

 # Our first global data type is the type of host identifiers. We
 # will have one process for each value of this type. Host
 # identifiers take on integer values from `0` to `node_max`.
 # We create the host identifier type by instantiating the
 # `host_iterable` module. It has a parameter `max` that gives the
 # maximum value of the type (and is supplied at run time).

 instance host_id : iterable

 # Since we have three kinds of messages in our protocol, we define
 # an enumerated type for the message kind with three symbolic
 # values.

 type msg_kind = {request_kind,reply_kind,release_kind}

 # In addition, we use a sequence type to represent timestamps. The
 # `unbounded_sequence` template in the `order` library gives a
 # discrete totally ordered type with a least value `0` and a
 # `next` operator.

 instance timestamp : unbounded_sequence

 # Our messages are structs with three fields: the message kind, the
 # host identifier of the sender and a timestamp. We order messages
 # according to the timestamp. This ordering is useful in the proof
 # of correctness.

 class msg_t = {
 field kind : msg_kind
 field sender_id : host_id
 field ts : timestamp
 # definition (M1:msg_t < M2:msg_t) = ts(M1) < ts(M2)
 }

 # A useful enumerated type to describe node state:

 type state_t = {idle,waiting,critical}

 # Finally we instantiate a network service via which our processes
 # will communicate. Here, `transport.net` is a template defined in the
 # `network` library that we included above. The template takes one
 # parameter, which is the type of messages to be sent. Our instance
 # of this template is an object called `net`.

 instance net : tcp.net(msg_t)
}


# After the global section, we introduce some distributed processes.
# A process with parameters has one instance for each value of the
# parameters. In this case we have one parameter of type `host_id`
# which means there is one process in the system for each value of
# `host_id` in the range `0..host_id.max`. The parameter is named `self`.
# This means that the process can refer to its own host identifier by
# the name `self`.

process node(self:host_id) = {

 # A process usually begins by declaring an *interface*. This
 # consists of a set of *actions* that are either calls in from the
 # environment (exports) or calls out to the environment (imports).

 # Our action is an export `request`, which our client uses to
 # request to enter the critical section. It takes no parameters.

 export action request

 # Our second action is an import `grant`. This is a callback to
 # the client indicating that it is safe to enter the critical
 # section.

 import action grant

 # Our third action is an export `release`. This is called by the
 # client when exiting the critical section, indicating it is safe
 # for another process to enter.

 export action release



 common {
 specification {

 var client_state(H:host_id) : state_t

 after init {
 client_state(H) := idle;
 }

 before request(self:host_id) {
 require client_state(self) = idle;
 client_state(self) := waiting;
 }

 before grant(self:host_id) {
 require client_state(self) = waiting;
 require client_state(X) ~= critical;
 client_state(self) := critical;
 }

 before release(self:host_id) {
 require client_state(self) = critical;
 client_state(self) := idle;
 }

 }
 }

 implementation {

 # Next we declare per-process objects. Each process needs a socket
 # on network `net` in order to communicate. We declare the socket
 # here. The socket `sock` is an instance of the template `socket`
 # declared by the network service `net`.

 instance sock : net.socket

 # We also declare some local (per-process) types and variables.

 var state : state_t

 # We also keep track of the current timestamp

 var ts : timestamp

 # Each process maintains a 'request queue', which is a map from host_ids to
 # the timestamp of the current request from that host, or `0` if none.

 var request_ts(X:host_id) : timestamp

 # This map records the highest timestamp of a reply received from
 # each host.

 var reply_ts(X:host_id) : timestamp

 # Having declared our variables, we initialize them. Code in an
 # `after init` section runs on initialization of the process. You
 # aren't allowed to do much here, just assign values to local
 # variables.

 after init {
 state := idle;
 ts := 0;
 request_ts(X) := 0;
 reply_ts(X) := 0;
 }

 # Now we come to the implementation code. Here we implement our
 # exported actions, if any, and also any callback actions from the
 # services we use (i.e., actions that these services import from
 # us).

 # We start with the `request` action. This builds a request message,
 # appends it to the request queue, and broadcasts it. The action `broadcast` is
 # a local action (i.e., a subroutine) and is defined later.

 implement request {
 ts := ts.next;
 var outgoing : msg_t;
 outgoing.kind := request_kind;
 outgoing.sender_id := self;
 outgoing.ts := ts;
 broadcast(outgoing);
 request_ts(self) := ts;
 state := waiting;
 # BUG: should check waiting condition here, if host_id.max = 0
 }

 # Next we implement the callback `recv` from our network socket,
 # indicating we have an incoming message. This is called
 # `sock.recv`. It gives us as input parameters the network address
 # of the sending socket (not useful here) and the incoming
 # message.


 implement sock.recv(src:tcp.endpoint,incoming:msg_t) {

 # debug "recv" with self = self, src = src, msg = incoming;

 # First, we update our timestamp to reflect the incoming
 # message.

 ts := timestamp.max2(incoming.ts,ts).next;

 # We partly construct an outgoing message

 var outgoing : msg_t;
 outgoing.sender_id := self;
 outgoing.ts := ts;

 # What we do here depends on the kind of message.

 # When we receive a `request` message, we put it on our request queue,
 # and return a reply message to the sender.

 if incoming.kind = request_kind {
 outgoing.kind := reply_kind;
 request_ts(incoming.sender_id) := incoming.ts;
 unicast(outgoing,incoming.sender_id);
 }

 # When we receive a `release` message, the sender's request
 # must be at the head of our queue. We dequeue it.

 else if incoming.kind = release_kind {
 request_ts(incoming.sender_id) := 0;

 }

 # On a reply, we update the highest timestamp received from
 # this sender. Because of in-order delivery, the timestamps
 # are received in increasing order, so the incoming one must
 # be the greatest so far.

 else if incoming.kind = reply_kind {
 reply_ts(incoming.sender_id) := incoming.ts;
 }

 # Having processed the incoming message, we might now be able
 # to enter our critical section. We do this if:
 #
 # - We are in the waiting state
 # - Our request message has the least timestamp in lexicographic order
 # - Every host has sent a reply later than our request

 # debug "waiting" with self = self, rq = request_ts(X), ts = reply_ts(X);

 if state = waiting
 & forall X. X ~= self ->
 (request_ts(X) = 0 | lexord(request_ts(self),self,request_ts(X),X))
 & reply_ts(X) > request_ts(self)
 {
 state := critical;
 grant;
 }
 }

 implement release {
 ts := ts.next;
 request_ts(self) := 0;
 var outgoing : msg_t;
 outgoing.sender_id := self;
 outgoing.ts := ts;
 outgoing.kind := release_kind;
 broadcast(outgoing);
 state := idle;
 }

 # At the end, we have definitions of internal (non-interface)
 # actions (in other words, subroutines) and functions (i.e., pure
 # functions).

 # This function takes two timestamp-host_id pairs and determines
 # whether (X1,Y1) < (X2,Y2) in lexicographic order.

 function lexord(X1:timestamp,Y1:host_id,X2:timestamp,Y2:host_id) =
 X1 < X2 | X1 = X2 & Y1 < Y2


 # The action `unicast` sends a message to just one process.
 # To actually send a message to a socket, we call the `send` action
 # of our socket, giving it the receiving socket's network address
 # and the message to be sent. Notice we can get the network
 # address of process with identifier `idx` with the expression
 # `node(idx).sock.id`. This might seem odd, as we are asking for
 # the local state of an object in another process. This is allowed
 # because the network addresses of the sockets are immutable
 # parameters that are determined at initialization and are
 # provided to all processes.

 action unicast(outgoing:msg_t, dst_id : host_id) = {
 # debug "send" with dst = dst_id, msg = outgoing;
 sock.send(node(dst_id).sock.id,outgoing);
 }

 # Action `broadcast` sends a message to all processes with
 # identifiers not equal to `self`. We use a 'for' loop to
 # iterate over the type `host_id`. The 'for' construct defines
 # two variables:
 #
 # - `it` is an 'iterator' of type `host_id.iter`
 # - `dst_id` is the value of the type the iterator refers to
 #
 # The reason we do it this way is that the finite subrange type
 # `host_id` has no value that is 'past the end' of the type, so
 # you can't write a traditional 'for' loop over this type. The
 # iterator type, however, does have a value corresponding to
 # 'past the end'.

 action broadcast(outgoing:msg_t) = {
 for it,dst_id in host_id.iter {
 # do not send to self!
 if dst_id ~= self {
 unicast(outgoing, dst_id);
 }
 }
 }
 }
}

# To compile and run with 3 nodes:
#
# $ ivyc lamport_mutex.ivy
# $ ivy_launch host_id.max=3
#
# To test:
#
# $ ivyc target=test lamport_mutex.ivy
# $ ivy_launch host_id.max=3
#
# Bounded model checking:
#
# TODO: As usual, we need the assumption that all endpoint id's are
# distinct.

axiom node(X).sock.id = node(Y).sock.id -> X = Y

# This says to try bounded model checking up to 20 steps (but Ivy
# won't actually get that far). The second parameter says to unroll the
# loops three times. This means that BMC ignores all executions in
# which a loop is executed more than three times. We need this because of
# the loop in `node.broadcast`

attribute method = bmc[20][3]

# Try adding a bug and see if you can find it with testing and bmc. Change
# this definition above:
#
# function lexord(X1:timestamp,Y1:host_id,X2:timestamp,Y2:host_id) =
# X1 < X2 | X1 = X2 & Y1 < Y2
#
# to this:
#
# function lexord(X1:timestamp,Y1:host_id,X2:timestamp,Y2:host_id) =
# X1 <= X2 | X1 = X2 & Y1 < Y2
#
# This mistake could allow two nodes with requests with the same timestamp
# to enter the CS at the same time. Here's a counter-example produced
# by BMC (it takes a while!):
#
# > node.request(1)
# > node.request(0)
# > node.sock.recv(0,{tcp.endpoint.addr:...,tcp.endpoint.port:...},{msg_t.kind:request,msg_t.sender_id:1,msg_t.ts:1})
# > node.sock.recv(1,{tcp.endpoint.addr:...,tcp.endpoint.port:...},{msg_t.kind:request,msg_t.sender_id:0,msg_t.ts:1})
# > node.sock.recv(1,{tcp.endpoint.addr:...,tcp.endpoint.port:...},{msg_t.kind:reply,msg_t.sender_id:0,msg_t.ts:2})
# < node.enter_cs(1)
# > node.sock.recv(0,{tcp.endpoint.addr:...,tcp.endpoint.port:...},{msg_t.kind:reply,msg_t.sender_id:1,msg_t.ts:2})
# < node.enter_cs(0)
# lamport_mutex_save.ivy: line 137: error: assertion failed

Writing Specifications for a Distributed System using Ivy

2021年9月8日 08:00

Before we jump into writing specifications in a distributed setting, we first define what a specification is. I take the definition from the magnificent Ken McMillan: a specification is a statement.

A statement describes an abstract view of a program. The view itself is often at an interface, which hides or abstracts internal states. A specification is stated in terms of two elements:

  • Assumption: properties of the environment the system relies on
  • Guarantee: properties that must hold if the assumptions are met

The way we write specifications is through an abstract program that observes or monitors all program events. This abstract program remembers the execution history of the program being monitored and decides, at any given moment, whether an action is allowable according to the specification.

One way to implement this abstract monitor program is to use guarded command form:

  • Let \(A\) be a set of program actions
  • An event \(e(x_1,\ …,\ x_n)\) is an action \(e\in A\) with parameter values \(x_1,\ …,\ x_n\) of the right types for \(e\).
  • Let \(S\) be a set of states and \(s_0 \in S\) be the initial state.
  • Guarded command set \(G\) is specified as:

\[e(V):\ \gamma (S,V) \rightarrow {S := \tau(S, V)}\]

It means that if the guard \(\gamma\) determines a given event \(e\) with parameter \(V\) is allowed under state \(S\), then we accept the event and deterministically update the state with the function \(\tau\).

The observation \(E\) of the system is a finite sequence of events corresponding to the system behavior, denoted \(e_0(V_0)…e_{n-1}(V_{n-1})\). A run of \(E\) is a state sequence \(s_0\ …s_n\) such that for \(i\in 0\ … n-1\), \(\gamma(s_i, V_i)\) is true and \(s_{i+1} = \tau(s_i, V_i)\). Observation \(E\) is accepted by the specification iff it has a run. We can test whether an observation is accepted by simply executing the guarded commands. In layman’s terms, if each guarded command accepts its corresponding event at each step, then the sequence of events satisfies our specification and is accepted.
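The guarded-command machinery can be sketched in plain Python (a hypothetical illustration, not Ivy): each event is checked against its guard \(\gamma\), and on acceptance the state is updated with \(\tau\). The "ping/pong" protocol below is made up purely for illustration.

```python
# A minimal guarded-command monitor: an observation (a finite sequence
# of events) is accepted iff, step by step, every event's guard holds
# in the current state; each accepted event updates the state with tau.
def make_monitor(initial_state, commands):
    """commands maps an action name to a (guard, tau) pair."""
    state = dict(initial_state)

    def step(action, *args):
        guard, tau = commands[action]
        if not guard(state, *args):
            return False              # event rejected: spec violated
        tau(state, *args)             # deterministic state update
        return True

    return step

# Toy specification: "ping" and "pong" events must strictly alternate.
commands = {
    "ping": (lambda s: s["turn"] == "ping",
             lambda s: s.update(turn="pong")),
    "pong": (lambda s: s["turn"] == "pong",
             lambda s: s.update(turn="ping")),
}
monitor = make_monitor({"turn": "ping"}, commands)
print(monitor("ping"), monitor("pong"))  # True True
print(monitor("pong"))                   # False: two pongs in a row
```

Running the guards over an observation is exactly the acceptance test described above: the observation `ping, pong` has a run, while `ping, pong, pong` does not.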

Now let’s use a replicated file as an example. Our first informal attempt at a specification for the “append” operation would be:

  • Assumption: network is ordered and non-duplicating
  • Guarantee: if no further append requests, eventually replicas are equal

However, the problem with this specification is that it is a liveness property, meaning that we can’t practically test such a property by observing a finite sequence of events. Therefore, we resort to a different safety specification we can test:

  • If all sent messages are delivered, the two replicas are identical.

Now we convert liveness to safety by explicitly defining the moment when the eventuality should hold.

A liveness property means a good thing eventually happens; it can never be refuted by a finite execution. A safety property means a bad thing never happens; it can always be refuted by a finite execution.

To see how replicated file specification plays in action, we use the example given in Prof. McMillan’s presentation. The code is written in Ivy and is pretty self-explanatory. In this demo we only have two processes.

To install Ivy, simply execute virtualenv ivyenv && source ivyenv/bin/activate && pip install ms-ivy. This is tested on Ubuntu 18.04 LTS and may vary slightly on other distros.

#lang ivy1.8

include numbers
include collections
include network

global {
 alias byte = uint[8]
 instance file : vector(byte)
 type pid = {0..1}
 instance net : tcp.net(byte)
}

process host(self:pid) = {
 export action append(val:byte)
 import action show(content:file)
 instance sock : net.socket
 var contents : file

 after init{
 contents := file.empty;
 }

 implement append {
 contents := contents.append(val);
 sock.send(host(1-self).sock.id, val);
 show(contents);
 }

 implement sock.recv(src:tcp.endpoint, val:byte) {
 contents := contents.append(val);
 show(contents);
 }
}

Then we form our specification based on the guarantee that if all sent messages are delivered, the two replicas are identical. The specification is equivalent to the guarded-command form we discussed earlier.

specification {
 var msg_count : nat

 after init {
 msg_count := 0;
 }

 after host.sock.send(self:pid, dst:tcp.endpoint, val:byte) {
 msg_count := msg_count + 1;
 }

 after host.sock.recv(self:pid, src:tcp.endpoint, val:byte) {
 msg_count := msg_count - 1;
 ensure msg_count = 0 -> host(0).contents.eq(host(1).contents);
 }
}

We wrote the above code into a file named append.ivy and we generate the testing code using ivyc target=test append.ivy. Then we run the code using ivy_launch append.ivy.

Interestingly, the program yields an error message:

`ivy_shell`; ./append "[[0,{addr:0x7f000001,port:49124}],[1,{addr:0x7f000001,port:49125}]]"
> host.append(1,251)
< host.show(1,[251])
< host.show(0,[251])
> host.append(1,46)
< host.show(1,[251,46])
> host.append(0,183)
< host.show(0,[251,183])
< host.show(0,[251,183,46])
< host.show(1,[251,46,183])
assertion_failed("append.ivy: line 49")
append.ivy: line 49: error: assertion failed

What happens is that the generated tests randomize message arrival times: a delivered message may arrive after its target has already sent another message, producing divergent file contents.

Notice that here we are actually running on a real network to find counterexamples; the downside is that a test may run arbitrarily long depending on the randomized test cases. Instead, we will use bounded model checking (BMC) to test whether the specification is correct. This way we can rely purely on the logic of our specification instead of running on a real network. The Ivy checker uses the Z3 Theorem Prover.

BMC constructs a Boolean formula that is satisfiable if and only if the underlying state transition system can realize a finite sequence of state transitions that reaches certain states of interest.
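To make the idea concrete, here is a solver-free Python sketch: instead of encoding transitions as a Boolean formula for a solver such as Z3, we explicitly enumerate every execution of up to a bounded number of steps and report the first trace that reaches a bad state. The counter-style transition system below is a made-up toy, not the append protocol.

```python
# Explicit-state sketch of the idea behind BMC: search all executions
# of length <= bound for one that reaches a "bad" state.
INIT = 0

def successors(s):
    return [s + 1, 0]          # a counter that can increment or reset

def bad(s):
    return s >= 3              # states we hope are unreachable

def bmc(bound):
    """Return a counterexample trace of length <= bound, or None."""
    frontier = [[INIT]]
    for _ in range(bound):
        next_frontier = []
        for trace in frontier:
            for nxt in successors(trace[-1]):
                if bad(nxt):
                    return trace + [nxt]
                next_frontier.append(trace + [nxt])
        frontier = next_frontier
    return None

print(bmc(2))   # None: no bad state is reachable within 2 steps
print(bmc(3))   # [0, 1, 2, 3]: a counterexample found at the bound
```

As the sketch shows, the bound matters: a violation that needs more steps than the bound is silently missed, which is why a real BMC engine encodes the unrolled transition relation symbolically and lets the solver search it.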

To tell Ivy to use bounded model checking, we add the following lines to append.ivy:

axiom host(0).sock.id ~= host(1).sock.id

attribute method=bmc[10]

Executing ivy_check detailed=false append.ivy, we see an error message:

> host.append(1,80)
< host.show(1,[80])
> host.append(0,64)
< host.show(0,[64])
> host.sock.recv(0,{tcp.endpoint.addr:...,tcp.endpoint.port:...},80)
< host.show(0,[64,80])
> host.sock.recv(1,{tcp.endpoint.addr:...,tcp.endpoint.port:...},64)
< host.show(1,[80,64])
append.ivy: line 49: error: assertion failed

Sometimes BMC can help us find the error faster because it systematically checks all possible actions. However, increasing the number of BMC steps makes the exploration space grow exponentially, so in practice we use some combination of BMC and randomized test cases.

Deploy Hugo Site to GitHub Pages

2021年8月27日 08:00

Update: The official guide from Hugo is for deploying from public repo. This post is intended for deploying from private repo.

This post assumes the user has already setup two separate repositories: a private repository for Hugo source files, and a public repository for GitHub Pages.

Note: test Hugo site by executing hugo server in the source code directory to make sure the site is generated properly.

Then, we need to generate a pair of keys by using the following command:

ssh-keygen -t rsa -b 4096 -C "$(git config user.email)" -f deployment -N ""

This will create two files: deployment and deployment.pub, which corresponds to a private key and a public key.

Next, execute cat deployment and copy the private key. Navigate to the private source repository -> Settings -> Secrets -> New repository secret. Paste the private key and save the change.

Github-actions-secrets

I’ve already added the private key to the source repository and named it PRIVATE_KEY. You can name it whatever you want.

Then, we go to the public repository for hosting our website. Navigate to the public site repository -> Settings -> Deploy keys -> Add deploy key. Execute cat deployment.pub and paste the result. You should see an SSH key added:

Github-keys

Finally, create a workflow file in the private repository at the following path: .github/workflows/deploy.yml.

name: github pages

on:
  push:
    branches:
      - main  # Set a branch to deploy
  pull_request:

jobs:
  deploy:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
        with:
          submodules: true  # Fetch Hugo themes (true OR recursive)
          fetch-depth: 0    # Fetch all history for .GitInfo and .Lastmod

      - name: Setup Hugo
        uses: peaceiris/actions-hugo@v2
        with:
          hugo-version: 'latest'
          extended: true

      - name: Build
        run: hugo --minify

      - name: Deploy
        uses: peaceiris/actions-gh-pages@v3
        with:
          deploy_key: ${{ secrets.PRIVATE_KEY }}
          external_repository: your_username/public_repository_name
          publish_branch: branch_to_publish
          publish_dir: ./public

Finally, make sure you create a file named .nojekyll in the root directory of the public repository to prevent GitHub Pages from building the site using Jekyll.

Every time you make commits to the private repository, the site will be automatically generated and published on the public repository.

Writing in the Sciences - Structure

2021年8月8日 08:00

This post covers how to improve sentence structure and builds up to writing strong paragraphs. Most of the content comes from the Writing in the Sciences course offered on Coursera.

Punctuation

Here is a list of punctuation marks ranked by their power to separate:

  • Comma (,)
  • Colon (:)
  • Dash (-)
  • Parentheses ( () )
  • Semicolon (;)
  • Period (.)

The formality of these punctuation marks is ranked as:

  • Dash (-)
  • Parentheses ( () )
  • The others (comma (,), colon (:), semicolon (;), period (.))

A dash is a mark of separation stronger than a comma, less formal than a colon, and more relaxed than parentheses.

– Strunk and White

Semicolon

It connects two independent clauses (a clause always contains a subject and a predicate; an independent clause can stand alone as a complete sentence).

Here is an example: ‘‘It was the best of times; it was the worst of times.’’

Semicolons can also be used to separate items in lists that contain internal punctuation. If some clauses contain commas, the comma inside the clause is no longer sufficient to separate different items in a list, because we don’t know where the boundaries are.

Parenthesis

Parentheses are used to insert an afterthought or explanation into a passage that is grammatically complete without it.

Colon

Colons are used after an independent clause to introduce a list, quote, explanation, conclusion, or amplification.

The colon has more effect than the comma, less power to separate than the semicolon, and more formality than the dash.

– Strunk and White

Dash

A dash can add emphasis or insert an abrupt definition or description almost anywhere in the sentence.

Use a dash only when a more common mark of punctuation seems inadequate.

– Strunk and White

Here is an example illustrating how dash emphasizes and adds information: ‘‘Researchers who study shipworms say these mislabeled animals–they are clams, not worms–are actually a scientific treasure’’.

I like the example provided in the class to illustrate how to use dash to join and condense a sentence. The original sentence is:

Finally, the lessons of clinical epidemiology are not meant to be limited to academic physician-epidemiologists, who sometimes have more interest in analyzing data than caring for patients. Clinical epidemiology holds the promise of providing clinicians with the tools necessary to improve the outcomes of their patients.

By using a dash, we can connect these two sentences while maintaining the description of physician-epidemiologists:

Finally, clinical epidemiology is not limited to academic physician-epidemiologists–who are sometimes more interested in analyzing data than caring for patients–but provides clinicians with tools to improve their patients’ outcomes.

Parallelism

It is often better–in scientific writing–to write pairs of ideas joined by ‘‘and’’, ‘‘or’’, or ‘‘but’’ in parallel form.

Here is an example sentence with a list of things in parallel form: ‘‘NASA’s intrepid Mars rover, Curiosity, has been through a lot in the past year. It flew 354 million miles, blasted through the Mars atmosphere, deployed a supersonic parachute, unfurled a giant sky crane, and touched down gently on the surface of Mars’’.

Paragraph

There are several tips for writing paragraphs:

  • 1 paragraph = 1 idea
  • Give away the punch line early. Scientists like putting details, details, details, data, and conclusion, which is a nightmare for readers. Invert the order!
  • Paragraph flow is helped by:
    • logical flow of ideas. Fewer pointers improve readability.
    • parallel sentence structures
    • if necessary, transition words.
  • Reader remembers the first and the last sentence best.
  • Sequential in time
  • From general to specific
  • Logical arguments (if else)

Repetition

It’s ok to repeat a word. It’s important to ask yourself whether the second instance of the word is necessary. If the word is needed, is a synonym really better than repeating the word? Using synonyms–especially in scientific writing–may lead readers to think you are referring to a different instrument, model, etc.

Other notes including Cut the Clutter, Verbs, and Writing Process are also available.

Writing in the Sciences - Verbs

2021年7月31日 08:00

This is an overview of the second chapter of Writing in the Sciences offered by Stanford. This chapter focuses on writing with strong, active verbs. Lessons include how to:

  • write in the active voice
  • avoid turning verbs into nouns
  • choose strong verbs
  • get to the main verb of a sentence quickly

Active Voice

There are three advantages of using active voice:

  • Emphasizes author responsibility
  • Improves readability
  • Reduces ambiguity

Author responsibility

Here is an example sentence: ‘‘No attempt was made to contact non-responders because they were deemed unimportant to the analysis’’. When we put it in the active voice, we get ‘’We did not attempt to contact non-responders because we deemed them unimportant to the analysis’’. The active voice version places more emphasis on the role of the authors in the decision making, subtly indicating human judgement and potential fallibility.

Readability

Putting sentences into active voice often leads us to be more direct. For example, putting ‘‘a strong correlation was found between use of passive voice and other sins of writing’’ into active voice yields ‘‘We found a strong correlation between use of the passive voice and other sins of writing’’. Active voice tends to make sentences more direct.

Ambiguity

The example sentence is: ‘‘General dysfunction of the immune system at the leukocyte level is suggested by both animal and human studies’’. Turning the sentence into active voice gives: ‘‘Both human and animal studies suggest that diabetics have general immune dysfunction at the leukocyte level’’. A sentence in the form of agent - verb - recipient forces us to be more specific, thus reducing the ambiguity of the sentence.

It is important to point out that passive voice may be appropriate in the methods section where what was done is more important than who did it.

After all, human agents are responsible for designing experiments, and they are present in the laboratory. Writing awkward phrases to avoid admitting their responsibility and their presence is an odd way of being objective.

– Jane J. Robinson, Science 7 June 1957: 1160.

Write with Verbs

Verbs with Embedded Meaning

For example, phrases like ‘‘reports that approximately’’ can be shortened to ‘’estimates’’ with ‘‘approximately’’ as its embedded meaning. They can make a big difference in sentences.

Avoid ‘’to be’’ verbs

These verbs are rather boring. Substituting stronger verbs for ‘‘to be’’ verbs can lead to more engaging writing.

Don’t Turn Verbs into Nouns

Nouns slow readers down by their lack of action. Turning nouns into verbs gives a clearer picture of what is going on. It has the bonus of avoiding ambiguity.

Turning verbs into nouns sometimes leads to the use of weaker verbs. For example, ‘‘decide’’ can be transformed into ‘‘make a decision’’, where ‘‘make’’ is a much weaker verb than ‘‘decide’’.

Don’t Bury the Main Verb

The principle is to keep the predicate close to the subject. Here is a sentence:

‘’one study of 930 adults with multiple sclerosis (MS) receiving care and one of two managed care settings or in a fee-for-service setting found that only two-thirds of those needing to contact a neurologist for an MS-related problem in the prior 6 months had done so’’

Readers struggle to understand the sentence due to the clutter between the subject and the predicate. Moving ‘‘found’’ to the front of the sentence gives us ‘‘One study found that…’’. Readers are less bothered by the descriptive material once they have gotten to the verb.

Example

Here is a great example provided in the course:

Important studies to examine the descriptive epidemiology of autism, including the prevalence and changes in the characteristics of the population over time, have begun.

There are multiple problems with this sentence: 1) the main verb appears at the very end of the sentence while the main subject ‘‘studies’’ is placed at the beginning; 2) fluff words like ‘‘important’’; 3) redundant phrases: ‘‘changes’’ almost always happen ‘‘over time’’; 4) ‘‘of the population’’ sounds vague. After addressing those issues, the sentence becomes:

Studies have begun to describe the epidemiology of autism, including recent changes in the disorder’s prevalence and characteristics.

Grammar Tips

Data is/are:

  • ‘‘Data’’ is plural.

Compared to/with:

  • Compared to: used to point out similarities between two objects.
  • Compared with: (used more often in science) used to point out differences between similar things.

That/which:

  • ‘‘That’’ is the restrictive (defining) pronoun (doesn’t take a comma). Eliminating the essential clause changes the meaning of the sentence.
  • ‘‘Which’’ is the nonrestrictive (non-defining) pronoun (must be set off by commas). Eliminating the non-essential clause does not alter the basic meaning of the sentence.

Careful writers, watchful for small conveniences, go witch-hunting, remove the defining which-es, and by doing so improve their work.

Strunk and White

Singular antecedents:

  • Do not use ‘’they’’ or ‘’their’’ when the subject is singular. To avoid gender choice, turn to a plural.

Other notes including Cut the Clutter, Structure, and Writing Process are also available.
