
CSAPP Cache Lab II: Optimizing Matrix Transposition

In this part of the Cache Lab, the mission is simple yet devious: optimize matrix transposition for three specific sizes: 32x32, 64x64, and 61x67. Our primary enemy? Cache misses.

Matrix Transposition

A standard transposition swaps rows and columns directly:

void trans(int M, int N, int A[N][M], int B[M][N])
{
    int i, j, tmp;

    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            tmp = A[i][j];
            B[j][i] = tmp;
        }
    }
}

While correct, this approach is a cache-miss nightmare: A is read row by row (good spatial locality), but B is written column by column, so almost every write to B touches a different cache line.

Cache Overview

To optimize effectively, we first have to understand our hardware constraints. The lab specifies a direct-mapped cache with the following parameters:

Parameter           Value
Sets (S)            32
Block size (B)      32 bytes
Associativity (E)   1 (direct-mapped)
Integer size        4 bytes
Capacity per line   8 integers

We will use matrix tiling and loop unrolling to optimize the code.

32x32 Case

In this case, a row of the matrix holds 32 ints, which span 32/8 = 4 cache lines (sets). With 32 sets available, rows that are 32/4 = 8 rows apart map to the same sets, so conflicts occur every 8 rows. This makes 8x8 tiling the sweet spot.

By processing the matrix in 8×8 blocks, we ensure that once a line of A is loaded, we use all 8 integers before it gets evicted. We also use loop unrolling with 8 local variables to minimize the overhead of accessing B.

int i, j, k;
int tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8;

for (i = 0; i < N; i += 8) {
    for (j = 0; j < M; j += 8) {
        for (k = i; k < N && k < i + 8; k++) {
            // Read a full row of the block from A
            tmp1 = A[k][j];
            tmp2 = A[k][j + 1];
            tmp3 = A[k][j + 2];
            tmp4 = A[k][j + 3];
            tmp5 = A[k][j + 4];
            tmp6 = A[k][j + 5];
            tmp7 = A[k][j + 6];
            tmp8 = A[k][j + 7];

            // Write it as a column of B
            B[j][k] = tmp1;
            B[j + 1][k] = tmp2;
            B[j + 2][k] = tmp3;
            B[j + 3][k] = tmp4;
            B[j + 4][k] = tmp5;
            B[j + 5][k] = tmp6;
            B[j + 6][k] = tmp7;
            B[j + 7][k] = tmp8;
        }
    }
}

61x67 Case

Since 61 and 67 are not powers of two, the conflict misses don’t occur in a regular pattern like they do in the square matrices. This “irregularity” is actually a blessing. We can get away with simple tiling. A 16x16 block size typically yields enough performance to pass the miss-count threshold.

int BLOCK_SIZE = 16;
int i, j, k, l, tmp;
int a, b;

for (i = 0; i < N; i += BLOCK_SIZE) {
    for (j = 0; j < M; j += BLOCK_SIZE) {
        a = i + BLOCK_SIZE;
        b = j + BLOCK_SIZE;
        for (k = i; k < N && k < a; k++) {
            for (l = j; l < M && l < b; l++) {
                tmp = A[k][l];
                B[l][k] = tmp;
            }
        }
    }
}

64x64 Case

This is the hardest part. In a 64x64 matrix, a row needs 8 sets, so conflict misses occur every 32/8 = 4 rows. If we use 8x8 tiling, the bottom half of the block will evict the top half.

We can try a 4x4 matrix tiling first.

int BLOCK_SIZE = 4;
int i, j, k, l, tmp;
int a, b;

for (i = 0; i < N; i += BLOCK_SIZE) {
    for (j = 0; j < M; j += BLOCK_SIZE) {
        a = i + BLOCK_SIZE;
        b = j + BLOCK_SIZE;
        for (k = i; k < N && k < a; k++) {
            for (l = j; l < M && l < b; l++) {
                tmp = A[k][l];
                B[l][k] = tmp;
            }
        }
    }
}

But this isn’t enough to pass the miss-count threshold.

Next we try 8x8 tiling again, but smarter: we partition each 8×8 block into four 4×4 sub-blocks and use the upper-right quarter of B as a “buffer” to store data temporarily.

    Block A = | A_TL  A_TR |   --transpose-->   Block B = | A_TL^T  A_BL^T |
              | A_BL  A_BR |                              | A_TR^T  A_BR^T |

Here are the steps:

  1. Transpose A_TL into B_TL while simultaneously moving A_TR into B_TR (as temporary storage).
  2. Move the stored A_TR from B_TR to its final position (B_BL), while moving A_BL into the vacated B_TR.
  3. Transpose A_BR into B_BR.
int i, j, k;
int tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8;

// Iterate through the matrix in 8x8 blocks to improve spatial locality
for (i = 0; i < N; i += 8) {
    for (j = 0; j < M; j += 8) {

        /**
         * STEP 1: Handle the top half of the 8x8 block (rows i to i+3)
         */
        for (k = 0; k < 4; k++) {
            // Read 8 elements from row i+k of matrix A into registers
            tmp1 = A[i + k][j];     tmp2 = A[i + k][j + 1];
            tmp3 = A[i + k][j + 2]; tmp4 = A[i + k][j + 3]; // Top-left 4x4
            tmp5 = A[i + k][j + 4]; tmp6 = A[i + k][j + 5];
            tmp7 = A[i + k][j + 6]; tmp8 = A[i + k][j + 7]; // Top-right 4x4

            // Transpose top-left 4x4 from A directly into top-left of B
            B[j][i + k] = tmp1;
            B[j + 1][i + k] = tmp2;
            B[j + 2][i + k] = tmp3;
            B[j + 3][i + k] = tmp4;

            // Temporarily store top-right 4x4 of A in the top-right of B
            // This avoids cache misses by using the already-loaded cache line in B
            B[j][i + k + 4] = tmp5;
            B[j + 1][i + k + 4] = tmp6;
            B[j + 2][i + k + 4] = tmp7;
            B[j + 3][i + k + 4] = tmp8;
        }

        /**
         * STEP 2: Handle the bottom half and fix the temporary placement
         */
        for (k = 0; k < 4; k++) {
            // Read bottom-left 4x4 column-wise from A
            tmp1 = A[i + 4][j + k]; tmp2 = A[i + 5][j + k];
            tmp3 = A[i + 6][j + k]; tmp4 = A[i + 7][j + k];

            // Read bottom-right 4x4 column-wise from A
            tmp5 = A[i + 4][j + k + 4]; tmp6 = A[i + 5][j + k + 4];
            tmp7 = A[i + 6][j + k + 4]; tmp8 = A[i + 7][j + k + 4];

            // Retrieve the top-right elements we temporarily stored in B in Step 1
            int t1 = B[j + k][i + 4];
            int t2 = B[j + k][i + 5];
            int t3 = B[j + k][i + 6];
            int t4 = B[j + k][i + 7];

            // Move bottom-left of A into the top-right of B
            B[j + k][i + 4] = tmp1;
            B[j + k][i + 5] = tmp2;
            B[j + k][i + 6] = tmp3;
            B[j + k][i + 7] = tmp4;

            // Move the retrieved temporary values into the bottom-left of B
            B[j + k + 4][i] = t1;
            B[j + k + 4][i + 1] = t2;
            B[j + k + 4][i + 2] = t3;
            B[j + k + 4][i + 3] = t4;

            // Place bottom-right of A into the bottom-right of B
            B[j + k + 4][i + 4] = tmp5;
            B[j + k + 4][i + 5] = tmp6;
            B[j + k + 4][i + 6] = tmp7;
            B[j + k + 4][i + 7] = tmp8;
        }
    }
}

Note: The key trick here is keeping accesses to B inside already-loaded cache lines wherever possible, and using local variables (which the compiler keeps in registers) to bridge the gap between conflicting cache lines.

Conclusion

Optimizing matrix transposition is less about the math and more about mechanical sympathy—understanding the underlying hardware to write code that plays nice with the CPU’s cache.

The jump from the naive version to these optimized versions isn’t just a marginal gain; it’s often a 10x reduction in cache misses. It serves as a stark reminder that in systems programming, how you access your data is just as important as the algorithm itself.


CSAPP Cache Lab I: Let's simulate a cache memory!

For the CSAPP Cache Lab, students are asked to write a small C program (200–300 lines) that simulates a cache memory.

The full code is here on GitHub.

Understanding a Cache

1. The Anatomy of a Cache (S, E, B, m)

A cache can be described with the following four parameters:

  • S = 2^s (Cache Sets): The cache is divided into sets.
  • E (Cache Lines per set): This is the “associativity.”
    • If E = 1, it’s a direct-mapped cache. If E > 1, it’s set-associative.
    • Each line contains a valid bit, a tag, and the actual data block.
  • B = 2^b (Block Size): The number of bytes stored in each line.
    • The low b bits of an address tell the cache the offset within that block.
  • m: The number of bits in a machine memory address.

2. Address Decomposition

When the CPU wants to access a 64-bit address, the cache doesn’t look at the whole number at once. It slices the address into three distinct fields:

Field          Purpose
Tag            Uniquely identifies the memory block within a specific set (t = m - b - s bits).
Set Index      Determines which set the address maps to.
Block Offset   Identifies the specific byte within the cache line.

3. The “Search and Match” Process

When our simulator receives an address (e.g., from an L or S operation in the trace file), it follows these steps:

  1. Find the Set: Use the set index bits to jump to the correct set in our cache structure.
  2. Search the Lines: Look through all the lines in that set.
     • Hit: a line has valid == true AND its tag matches the address tag.
     • Miss: no line matches.
  3. Handle the Miss:
     • Cold Start: If there is an empty line (valid == false), fill it with the new tag and set valid = true.
     • Eviction: If all lines are full, we must kick one out. This is where the LRU (Least Recently Used) policy comes in: we find the line that hasn’t been touched for the longest time and replace it.

Lab Requirements

For this Lab Project, we will write a cache simulator that takes a valgrind memory trace as an input.

Input

The input looks like:

I 0400d7d4,8
M 0421c7f0,4
L 04f6b868,8
S 7ff0005c8,8

Each line denotes one or two memory accesses. The format of each line is

[space]operation address,size

The operation field denotes the type of memory access:

  • “I” denotes an instruction load,
  • “L” a data load,
  • “S” a data store, and
  • “M” a data modify (i.e., a data load followed by a data store).

Mind you: there is never a space before an “I”, but there is always a space before each “M”, “L”, and “S”.

The address field specifies a 64-bit hexadecimal memory address. The size field specifies the number of bytes accessed by the operation.

CLI

Our program should take the following command line arguments:

Usage: ./csim-ref [-hv] -s <s> -E <E> -b <b> -t <tracefile>

  • -h: Optional help flag that prints usage info
  • -v: Optional verbose flag that displays trace info
  • -s <s>: Number of set index bits (S = 2^s is the number of sets)
  • -E <E>: Associativity (number of lines per set)
  • -b <b>: Number of block bits (B = 2^b is the block size)
  • -t <tracefile>: Name of the valgrind trace to replay

Caveats

For this lab, we ignore all Is (the instruction cache accesses).

We assume that memory accesses are aligned properly, such that a single memory access never crosses block boundaries.

The Codes

We basically start from scratch, given an almost blank csim.c file to fill in. The file comes with only a main function and no header files.

Data Models

// Data Model
char* fileName = NULL;
int set_bit = -1;
long long sets = -1;
int associativity = -1;
int block_bit = -1;
long long block_size = -1;
bool verboseMode = false;

int global_timer = 0;   // For LRU

int memory_bit = 64;    // Assuming 64-bit addresses
int tag_bit = 0;        // Tag bits

Handling Command-Line Arguments

First, we add the int argc, char** argv parameters to the main function. argc stands for argument count, while argv stands for argument values.

We use getopt to parse arguments.

void handleArgs(int argc, char** argv) {
    int opt;

    while ((opt = getopt(argc, argv, "hvs:E:b:t:")) != -1) {
        switch (opt) {
        case 'h':
            printUsage(argv);
            exit(0);
        case 'v':
            verboseMode = true;
            break;
        case 't':
            fileName = optarg;
            break;
        case 's':
            set_bit = atoi(optarg);
            break;
        case 'E':
            associativity = atoi(optarg);
            break;
        case 'b':
            block_bit = atoi(optarg);
            break;
        case '?':
            printUsage(argv);
            exit(1);
        default:
            exit(1);
        }
    }

    if (fileName == NULL || set_bit == -1 || associativity == -1 || block_bit == -1) {
        printf("Missing required command line argument");
        printUsage(argv);
        exit(1);
    }

    sets = 1LL << set_bit;
    block_size = 1LL << block_bit;

    tag_bit = memory_bit - (set_bit + block_bit);
}

getopt is declared in unistd.h, but the compiler is invoked with -std=c99, which hides POSIX extensions. GNU systems also provide a standalone <getopt.h> header, so we include getopt.h instead.

opt = getopt(argc, argv, "hvs:E:b:t:")
  • h and v: These are boolean flags.
  • s:, E:, b:, and t:: These are required arguments. The colon tells getopt that these flags must be followed by a value (e.g., -s 4).

After parsing the arguments, we set the initial value of our Cache Data Model.

sets = 1LL << set_bit;
block_size = 1LL << block_bit;

tag_bit = memory_bit - (set_bit + block_bit);

Initialize Cache

// Cache Line Structure
typedef struct CacheLine {
    bool valid;
    long long tag;
    /*
     * Need an LRU stamp to implement the LRU eviction policy
     */
    int lru_counter;
} CacheLine;

CacheLine** cache = NULL;

void initCache() {
    // Initialize cache data structures
    cache = (CacheLine**) malloc(sizeof(CacheLine*) * sets);
    for (int i = 0; i < sets; i++) {
        cache[i] = (CacheLine*) calloc(associativity, sizeof(CacheLine));
    }
}

Caution: memory returned by malloc is uninitialized, so it may contain garbage values.

So we use calloc. The calloc (stands for contiguous allocation) function is similar to malloc but it initializes the allocated memory to zero.

And don’t forget to free the allocated memory!

void freeCache() {
    // Free allocated memory for the cache
    for (int i = 0; i < sets; i++) free(cache[i]);
    free(cache);
}

Handling File Input

// Handle the trace file
FILE *traceFile = fopen(fileName, "r");
if (traceFile == NULL) {
    printf("Error opening file: %s\n", fileName);
    exit(1);
}

char operation;
long long address;
int size;

while (fscanf(traceFile, " %c %llx,%d", &operation, &address, &size) == 3) {
    switch (operation) {
    case 'L':
        loadData(address, size);    // data load
        break;
    case 'S':
        storeData(address, size);   // data store
        break;
    case 'M':
        modifyData(address, size);  // data modify (load + store)
        break;
    default:
        break;                      // ignore other operations
    }
}

// Close the trace file
fclose(traceFile);

Caution:

  1. fscanf does not skip spaces before %c, so we add a space before %c in the format string.
  2. !feof(traceFile) does not work correctly here. It only returns true after a read has already attempted to go past the end of the file and failed. Using it as a loop condition (e.g., while (!feof(p))) causes an “off-by-one” error, where the loop executes one extra time with stale data from the last successful read.

Parsing Addresses

// Parse the address fields
long long getTag(long long address) {
    return address >> (set_bit + block_bit);
}

long long getSetIndex(long long address) {
    long long mask = (1LL << set_bit) - 1;
    return (address >> block_bit) & mask;
}

long long getBlockOffset(long long address) {
    long long mask = (1LL << block_bit) - 1;
    return address & mask;
}

We use bit masks to parse the addresses.

Loading Cache

void loadData(long long address, int size) {
    // Simulate accessing data at the given address
    int s = getSetIndex(address);
    long long t = getTag(address);
    global_timer++;

    // 1. Hit: a valid line in the set already holds this tag
    for (int i = 0; i < associativity; i++) {
        if (cache[s][i].valid && cache[s][i].tag == t) {
            hit_count++;
            cache[s][i].lru_counter = global_timer;
            if (verboseMode) printf(" hit");
            return;
        }
    }

    miss_count++;
    if (verboseMode) printf(" miss");

    // 2. Cold miss: fill the first empty line
    for (int i = 0; i < associativity; i++) {
        if (!cache[s][i].valid) {
            cache[s][i].valid = true;
            cache[s][i].tag = t;
            cache[s][i].lru_counter = global_timer;
            return;
        }
    }

    // 3. Eviction: replace the least recently used line
    eviction_count++;
    if (verboseMode) printf(" eviction");

    int victim_index = 0;
    int min_lru = cache[s][0].lru_counter;

    for (int i = 1; i < associativity; i++) {
        if (cache[s][i].lru_counter < min_lru) {
            min_lru = cache[s][i].lru_counter;
            victim_index = i;
        }
    }

    cache[s][victim_index].tag = t;
    cache[s][victim_index].lru_counter = global_timer;
}

The code simulates the process of loading cache.

We first check if the data already exists in the cache.

If it doesn’t exist, we have to scan for blank lines to load the data.

If blank lines don’t exist, we need to evict a line using the LRU strategy. We replace the victim line with the new line.

Other Operations

void storeData(long long address, int size) {
    // A store behaves exactly like a load for this simulator
    loadData(address, size);
}

void modifyData(long long address, int size) {
    // A modify is a load followed by a store; the store always hits
    loadData(address, size);
    hit_count++;
    if (verboseMode) printf(" hit\n");
}

For this simulator, storing and modifying data behave the same as loading data; a modify simply adds one guaranteed hit for the store that follows the load.

Print Summary

We are asked to output the answer using the printSummary function.

// Print Summary
printSummary(hit_count, miss_count, eviction_count);

And Voila!

                        Your simulator           Reference simulator
Points  (s,E,b)      Hits  Misses  Evicts     Hits  Misses  Evicts
  3     (1,1,1)         9       8       6        9       8       6   traces/yi2.trace
  3     (4,2,4)         4       5       2        4       5       2   traces/yi.trace
  3     (2,1,4)         2       3       1        2       3       1   traces/dave.trace
  3     (2,1,3)       167      71      67      167      71      67   traces/trans.trace
  3     (2,2,3)       201      37      29      201      37      29   traces/trans.trace
  3     (2,4,3)       212      26      10      212      26      10   traces/trans.trace
  3     (5,1,5)       231       7       0      231       7       0   traces/trans.trace
  6     (5,1,5)    265189   21775   21743   265189   21775   21743   traces/long.trace
 27

Summary

In this project, we moved from the theory of the memory hierarchy to the practical reality of memory management. By building this simulator, we reinforced several core concepts of computer systems.

With our simulator passing all the trace tests, we’ve effectively mirrored how a CPU “thinks” about memory. The next step is applying these insights to optimize actual code, ensuring our algorithms play nicely with the hardware we’ve just simulated.
