
Greyson Chance 2023 Beijing

Summary

I remember first listening to him back in middle school, starting with No Fear and later my all-time favorite, Seasons. I had no idea he came to China in 2019; I only found out in college, when a classmate spotted my avatar at a class meeting and added me. This time, too, I learned about the show from her WeChat Moments, hahaha, respect! I finally made it!

Hahahaha, I met so many new friends this time, and everyone was so nice. White Rose squad, see you next time! I even met the uploader whose Hangzhou Seasons fan video I had been watching on Bilibili, which was quite the coincidence.

The live show was stunning; it took me a long while to come back down afterwards. GC really is a walking CD, haha. I asked him whether he could sing Seasons; he said sorry, and that he hoped I would enjoy the night. I had hardly followed GC since starting college, but Seasons still went on repeat on certain days of the year, and I had never even heard the new album. From April on I slowly listened through it, looking forward to hearing it live!

Afterwards I kept thinking: middle school, high school, college, the pandemic. Time really does fly, and in the blink of an eye I have graduated. This time it truly felt as if the regrets of my youth were slowly being sealed with an exclamation mark!

May I always have No Fear,

Move forward like the seasons!

Full concert video

https://www.aliyundrive.com/s/ZCaugLSs8TJ


Restarting the Life category: Seasons

After hearing GC's Beijing show I was deeply moved, so restarting the Life category really is worth it, haha!

I have also been putting this post off for quite a while.

[I deleted the 2023 Xiamen post because of its terrible layout. Sorry, I will repost it later.]

Also, congratulations to the team on making the CPC2023 finals; see you in Qingdao, haha! Confetti!

I have realized that some moments simply must be recorded in photos or words, or they really are forgotten later.

Setting a flag: this year the Life category will get all of the 2023 trips plus this GC Beijing show.

The rough plan going forward: write the text in the blog and attach videos and photos from the cloud drive. That avoids the terrible-layout problem, mainly because I am still not very familiar with front-end work.


Paper notes: Towards Efficient SpMV on Sunway Manycore Architectures

Paper link:

Towards Efficient SpMV on Sunway Manycore Architectures | Proceedings of the 2018 International Conference on Supercomputing (acm.org)

Summary

Dual-side multi-level partitioning technique

Three-level partitioning: Block -> Tile -> Slice

At the Tile level, some tiles are empty and need no computation

Likewise, at the Slice level, empty slices are skipped

The bottom-level Slice is the unit of computation

Multi-level queues for load balancing: "The work sharing mechanism in the block and slice queues guarantees the workload balance across fleets and cores."


Mapping details: (figure)

Compute-core logic

Each row has 8 cores: 7 compute cores and 1 I/O core

The compute cores perform the SpMV computation

The I/O core writes the results back to memory

Multiple slices are combined into a batch to simplify DMA, and data is prefetched at batch granularity; the compute unit is still the slice

Vector registers are cleverly used to piggyback messages (msg)

I/O-core logic

Results are written back only after the whole block has been computed, avoiding repeated memory traffic

The msg values carried in vector registers are reduced (msg -> reduce)

Uses Sunway's RMA (remote memory access)


Paper notes: 稀疏矩阵向量乘法在申威众核架构上的性能优化 (Performance Optimization of Sparse Matrix-Vector Multiplication on the Sunway Manycore Architecture)

Paper link:

稀疏矩阵向量乘法在申威众核架构上的性能优化 - 中国知网 (cnki.net)

Summary

Fixed partitioning method

  1. Write back only after all computation is done
  2. Sub-matrix (a task) -> sub-matrix blocks (the per-core compute unit)
  3. Store the nonzeros of each sub-matrix block contiguously, to suit DMA on the Sunway processor (packing)
  4. Inter-core communication: register communication -> RMA
  5. Work sizes are precomputed from the LDM capacity so that everything a computation needs resides in LDM
  6. Load redundant parts of x to avoid fragmented DMA transfers

1D load-balanced partitioning

  1. Try to give every CPE (slave core) a similar number of nonzeros
  2. Four levels: original matrix -> matrix strips -> sub-matrices -> blocks
  3. Matrix strips are assigned to CPE rows in alternation
  4. The compute unit is still the sub-matrix block, the same as in fixed partitioning

2D load-balanced partitioning

  1. Matrix strips are assigned greedily, keeping the load as balanced as possible across CPE rows
  2. This removes the fine-grained synchronization problem the 1D method introduces
  3. A strip with very few nonzeros is handed to a single CPE rather than split across the CPEs of one row
  4. The 1D and 2D methods only change the preprocessing; computation still uses the sub-matrix blocks of fixed partitioning
  5. Sort the strips by nonzero count in descending order (a logical sort)
  6. Select picks, among all ROWS CPE rows, the one currently holding the fewest nonzeros and assigns strip i to it
Input: tiles, nnz_tile, ROWS
Output: set
nnz_set <- 0
for i = 0 to tiles - 1 do
    Select id such that nnz_set[id] is minimal  // note: this selects a CPE row, not a strip
    nnz_set[id] += nnz_tile[i]
    set[id] = set[id] ∪ {i}  // task assignment
end for

Note: fixed partitioning is what solves the poor-data-locality problem; the later 1D and 2D schemes only do task assignment, and the per-core computation on sub-matrix blocks never changes.


Paper notes: 面向国产申威26010众核处理器的SpMV实现与优化 (SpMV Implementation and Optimization for the Domestic Sunway 26010 Manycore Processor)

Paper link:

面向国产申威26010众核处理器的SpMV实现与优化 - 中国知网 (cnki.net)

Summary

Storage format: CSR

Data definitions:

  • col: column index of each nonzero
  • data: nonzero values, stored contiguously
  • row_off: the offsets array; entry i counts the nonzeros before the first element of row i, and the last entry is the total nonzero count
  • vec: the input vector
  • y: the result vector
  • rows: row indices
  • srow: the maximum number of sparse rows that one CPE's LDM can hold on the current Sunway processor

Static and dynamic buffers for x

The static x buffer never changes after initialization (the paper gives two initialization strategies).

On a miss, the dynamic x buffer is refreshed via DMA.

// slave.c
LDM -> malloc x_sbuffer  // static x buffer
LDM -> malloc x_dbuffer  // dynamic x buffer
sstart, send, dstart, dend
dma(vec -> x_sbuffer, sstart, send)
dma(vec -> x_dbuffer, dstart, dend)
for srow
    for row
        if (in x_sbuffer)
            cal
        else if (in x_dbuffer)
            cal
        else
            update_dma(vec -> x_dbuffer, dstart, dend)
            cal
store

Load balancing

Static-plus-dynamic partitioning

Round one, static: every CPE is given the same amount of work, namely srow rows.

After that, dynamic: whichever CPE finishes first takes a new task from the task pool.

// slave.c
// e.g. 64 CPEs
task_num  // number of remaining tasks
now_addr  // how far the current task has progressed

// round 1: every CPE executes the same amount of work, srow
cal srow

// remaining tasks go into a task pool, protected by a lock for mutual exclusion
// the pool is drained when task_num reaches 0
mutex = 1
P(mutex)
// critical section: update task_num and now_addr
task_num
now_addr
V(mutex)
// take now_addr
cal srow
// then keep looping until task_num reaches 0

Packing into contiguous memory

This yields a dramatic performance boost:

We now reach 90% of the processor's turbo-boost peak!

Optimization_4x4_12

Before calling AddDot4x4, we now pack A into a 4xk block, and we see performance drop. Inspecting the inner kernel shows why: each 4xk block of A is packed repeatedly, once for every iteration of the outer loop.


/* Create macros so that the matrices are stored in column-major order */

#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Block sizes */
#define mc 256
#define kc 128

#define min( i, j ) ( (i)<(j) ? (i): (j) )

/* Routine for computing C = A * B + C */

void AddDot4x4( int, double *, int, double *, int, double *, int );
void PackMatrixA( int, double *, int, double * );
void InnerKernel( int, int, int, double *, int, double *, int, double *, int );

void MY_MMult( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, p, pb, ib;

/* This time, we compute a mc x n block of C by a call to the InnerKernel */

for ( p=0; p<k; p+=kc ){
pb = min( k-p, kc );
for ( i=0; i<m; i+=mc ){
ib = min( m-i, mc );
InnerKernel( ib, n, pb, &A( i,p ), lda, &B(p, 0 ), ldb, &C( i,0 ), ldc );
}
}
}

void InnerKernel( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, j;
double
packedA[ m * k ];

for ( j=0; j<n; j+=4 ){ /* Loop over the columns of C, unrolled by 4 */
for ( i=0; i<m; i+=4 ){ /* Loop over the rows of C */
/* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
one routine (four inner products) */
PackMatrixA( k, &A( i, 0 ), lda, &packedA[ i*k ] );
AddDot4x4( k, &packedA[ i*k ], 4, &B( 0,j ), ldb, &C( i,j ), ldc );
}
}
}

void PackMatrixA( int k, double *a, int lda, double *a_to )
{
int j;

for( j=0; j<k; j++){ /* loop over columns of A */
double
*a_ij_pntr = &A( 0, j );

*a_to++ = *a_ij_pntr;
*a_to++ = *(a_ij_pntr+1);
*a_to++ = *(a_ij_pntr+2);
*a_to++ = *(a_ij_pntr+3);
}
}

#include <mmintrin.h>
#include <xmmintrin.h> // SSE
#include <pmmintrin.h> // SSE2
#include <emmintrin.h> // SSE3

typedef union
{
__m128d v;
double d[2];
} v2df_t;

void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc )
{
/* So, this routine computes a 4x4 block of matrix C

C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ).
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ).
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ).
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ).

Notice that this routine is called with c = C( i, j ) in the
previous routine, so these are actually the elements

C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 )
C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 )
C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 )
C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 )

in the original matrix C

And now we use vector registers and instructions */

int p;
v2df_t
c_00_c_10_vreg, c_01_c_11_vreg, c_02_c_12_vreg, c_03_c_13_vreg,
c_20_c_30_vreg, c_21_c_31_vreg, c_22_c_32_vreg, c_23_c_33_vreg,
a_0p_a_1p_vreg,
a_2p_a_3p_vreg,
b_p0_vreg, b_p1_vreg, b_p2_vreg, b_p3_vreg;

double
/* Point to the current elements in the four columns of B */
*b_p0_pntr, *b_p1_pntr, *b_p2_pntr, *b_p3_pntr;

b_p0_pntr = &B( 0, 0 );
b_p1_pntr = &B( 0, 1 );
b_p2_pntr = &B( 0, 2 );
b_p3_pntr = &B( 0, 3 );

c_00_c_10_vreg.v = _mm_setzero_pd();
c_01_c_11_vreg.v = _mm_setzero_pd();
c_02_c_12_vreg.v = _mm_setzero_pd();
c_03_c_13_vreg.v = _mm_setzero_pd();
c_20_c_30_vreg.v = _mm_setzero_pd();
c_21_c_31_vreg.v = _mm_setzero_pd();
c_22_c_32_vreg.v = _mm_setzero_pd();
c_23_c_33_vreg.v = _mm_setzero_pd();

for ( p=0; p<k; p++ ){
a_0p_a_1p_vreg.v = _mm_load_pd( (double *) &A( 0, p ) );
a_2p_a_3p_vreg.v = _mm_load_pd( (double *) &A( 2, p ) );

b_p0_vreg.v = _mm_loaddup_pd( (double *) b_p0_pntr++ ); /* load and duplicate */
b_p1_vreg.v = _mm_loaddup_pd( (double *) b_p1_pntr++ ); /* load and duplicate */
b_p2_vreg.v = _mm_loaddup_pd( (double *) b_p2_pntr++ ); /* load and duplicate */
b_p3_vreg.v = _mm_loaddup_pd( (double *) b_p3_pntr++ ); /* load and duplicate */

/* First row and second rows */
c_00_c_10_vreg.v += a_0p_a_1p_vreg.v * b_p0_vreg.v;
c_01_c_11_vreg.v += a_0p_a_1p_vreg.v * b_p1_vreg.v;
c_02_c_12_vreg.v += a_0p_a_1p_vreg.v * b_p2_vreg.v;
c_03_c_13_vreg.v += a_0p_a_1p_vreg.v * b_p3_vreg.v;

/* Third and fourth rows */
c_20_c_30_vreg.v += a_2p_a_3p_vreg.v * b_p0_vreg.v;
c_21_c_31_vreg.v += a_2p_a_3p_vreg.v * b_p1_vreg.v;
c_22_c_32_vreg.v += a_2p_a_3p_vreg.v * b_p2_vreg.v;
c_23_c_33_vreg.v += a_2p_a_3p_vreg.v * b_p3_vreg.v;
}

C( 0, 0 ) += c_00_c_10_vreg.d[0]; C( 0, 1 ) += c_01_c_11_vreg.d[0];
C( 0, 2 ) += c_02_c_12_vreg.d[0]; C( 0, 3 ) += c_03_c_13_vreg.d[0];

C( 1, 0 ) += c_00_c_10_vreg.d[1]; C( 1, 1 ) += c_01_c_11_vreg.d[1];
C( 1, 2 ) += c_02_c_12_vreg.d[1]; C( 1, 3 ) += c_03_c_13_vreg.d[1];

C( 2, 0 ) += c_20_c_30_vreg.d[0]; C( 2, 1 ) += c_21_c_31_vreg.d[0];
C( 2, 2 ) += c_22_c_32_vreg.d[0]; C( 2, 3 ) += c_23_c_33_vreg.d[0];

C( 3, 0 ) += c_20_c_30_vreg.d[1]; C( 3, 1 ) += c_21_c_31_vreg.d[1];
C( 3, 2 ) += c_22_c_32_vreg.d[1]; C( 3, 3 ) += c_23_c_33_vreg.d[1];
}

Optimization_4x4_13

This version saves the packed blocks of A so that, after the first iteration of InnerKernel's outer loop, the saved copy is reused. The performance gain is clear. The only change from the previous version is the added if ( j == 0 ).


/* Create macros so that the matrices are stored in column-major order */

#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Block sizes */
#define mc 256
#define kc 128

#define min( i, j ) ( (i)<(j) ? (i): (j) )

/* Routine for computing C = A * B + C */

void AddDot4x4( int, double *, int, double *, int, double *, int );
void PackMatrixA( int, double *, int, double * );
void InnerKernel( int, int, int, double *, int, double *, int, double *, int );

void MY_MMult( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, p, pb, ib;

/* This time, we compute a mc x n block of C by a call to the InnerKernel */

for ( p=0; p<k; p+=kc ){
pb = min( k-p, kc );
for ( i=0; i<m; i+=mc ){
ib = min( m-i, mc );
InnerKernel( ib, n, pb, &A( i,p ), lda, &B(p, 0 ), ldb, &C( i,0 ), ldc );
}
}
}

void InnerKernel( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, j;
double
packedA[ m * k ];

for ( j=0; j<n; j+=4 ){ /* Loop over the columns of C, unrolled by 4 */
for ( i=0; i<m; i+=4 ){ /* Loop over the rows of C */
/* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
one routine (four inner products) */
if ( j == 0 ) PackMatrixA( k, &A( i, 0 ), lda, &packedA[ i*k ] );
AddDot4x4( k, &packedA[ i*k ], 4, &B( 0,j ), ldb, &C( i,j ), ldc );
}
}
}

void PackMatrixA( int k, double *a, int lda, double *a_to )
{
int j;

for( j=0; j<k; j++){ /* loop over columns of A */
double
*a_ij_pntr = &A( 0, j );

*a_to++ = *a_ij_pntr;
*a_to++ = *(a_ij_pntr+1);
*a_to++ = *(a_ij_pntr+2);
*a_to++ = *(a_ij_pntr+3);
}
}

#include <mmintrin.h>
#include <xmmintrin.h> // SSE
#include <pmmintrin.h> // SSE2
#include <emmintrin.h> // SSE3

typedef union
{
__m128d v;
double d[2];
} v2df_t;

void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc )
{
/* So, this routine computes a 4x4 block of matrix C

C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ).
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ).
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ).
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ).

Notice that this routine is called with c = C( i, j ) in the
previous routine, so these are actually the elements

C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 )
C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 )
C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 )
C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 )

in the original matrix C

And now we use vector registers and instructions */

int p;
v2df_t
c_00_c_10_vreg, c_01_c_11_vreg, c_02_c_12_vreg, c_03_c_13_vreg,
c_20_c_30_vreg, c_21_c_31_vreg, c_22_c_32_vreg, c_23_c_33_vreg,
a_0p_a_1p_vreg,
a_2p_a_3p_vreg,
b_p0_vreg, b_p1_vreg, b_p2_vreg, b_p3_vreg;

double
/* Point to the current elements in the four columns of B */
*b_p0_pntr, *b_p1_pntr, *b_p2_pntr, *b_p3_pntr;

b_p0_pntr = &B( 0, 0 );
b_p1_pntr = &B( 0, 1 );
b_p2_pntr = &B( 0, 2 );
b_p3_pntr = &B( 0, 3 );

c_00_c_10_vreg.v = _mm_setzero_pd();
c_01_c_11_vreg.v = _mm_setzero_pd();
c_02_c_12_vreg.v = _mm_setzero_pd();
c_03_c_13_vreg.v = _mm_setzero_pd();
c_20_c_30_vreg.v = _mm_setzero_pd();
c_21_c_31_vreg.v = _mm_setzero_pd();
c_22_c_32_vreg.v = _mm_setzero_pd();
c_23_c_33_vreg.v = _mm_setzero_pd();

for ( p=0; p<k; p++ ){
a_0p_a_1p_vreg.v = _mm_load_pd( (double *) a );
a_2p_a_3p_vreg.v = _mm_load_pd( (double *) ( a+2 ) );
a += 4;

b_p0_vreg.v = _mm_loaddup_pd( (double *) b_p0_pntr++ ); /* load and duplicate */
b_p1_vreg.v = _mm_loaddup_pd( (double *) b_p1_pntr++ ); /* load and duplicate */
b_p2_vreg.v = _mm_loaddup_pd( (double *) b_p2_pntr++ ); /* load and duplicate */
b_p3_vreg.v = _mm_loaddup_pd( (double *) b_p3_pntr++ ); /* load and duplicate */

/* First row and second rows */
c_00_c_10_vreg.v += a_0p_a_1p_vreg.v * b_p0_vreg.v;
c_01_c_11_vreg.v += a_0p_a_1p_vreg.v * b_p1_vreg.v;
c_02_c_12_vreg.v += a_0p_a_1p_vreg.v * b_p2_vreg.v;
c_03_c_13_vreg.v += a_0p_a_1p_vreg.v * b_p3_vreg.v;

/* Third and fourth rows */
c_20_c_30_vreg.v += a_2p_a_3p_vreg.v * b_p0_vreg.v;
c_21_c_31_vreg.v += a_2p_a_3p_vreg.v * b_p1_vreg.v;
c_22_c_32_vreg.v += a_2p_a_3p_vreg.v * b_p2_vreg.v;
c_23_c_33_vreg.v += a_2p_a_3p_vreg.v * b_p3_vreg.v;
}

C( 0, 0 ) += c_00_c_10_vreg.d[0]; C( 0, 1 ) += c_01_c_11_vreg.d[0];
C( 0, 2 ) += c_02_c_12_vreg.d[0]; C( 0, 3 ) += c_03_c_13_vreg.d[0];

C( 1, 0 ) += c_00_c_10_vreg.d[1]; C( 1, 1 ) += c_01_c_11_vreg.d[1];
C( 1, 2 ) += c_02_c_12_vreg.d[1]; C( 1, 3 ) += c_03_c_13_vreg.d[1];

C( 2, 0 ) += c_20_c_30_vreg.d[0]; C( 2, 1 ) += c_21_c_31_vreg.d[0];
C( 2, 2 ) += c_22_c_32_vreg.d[0]; C( 2, 3 ) += c_23_c_33_vreg.d[0];

C( 3, 0 ) += c_20_c_30_vreg.d[1]; C( 3, 1 ) += c_21_c_31_vreg.d[1];
C( 3, 2 ) += c_22_c_32_vreg.d[1]; C( 3, 3 ) += c_23_c_33_vreg.d[1];
}

Optimization_4x4_14

We now also pack the kx4 blocks of B. Note that in this version the panels are packed repeatedly, which hurts performance.

/* Create macros so that the matrices are stored in column-major order */

#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Block sizes */
#define mc 256
#define kc 128

#define min( i, j ) ( (i)<(j) ? (i): (j) )

/* Routine for computing C = A * B + C */

void AddDot4x4( int, double *, int, double *, int, double *, int );
void PackMatrixA( int, double *, int, double * );
void PackMatrixB( int, double *, int, double * );
void InnerKernel( int, int, int, double *, int, double *, int, double *, int, int );

void MY_MMult( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, p, pb, ib;

/* This time, we compute a mc x n block of C by a call to the InnerKernel */

for ( p=0; p<k; p+=kc ){
pb = min( k-p, kc );
for ( i=0; i<m; i+=mc ){
ib = min( m-i, mc );
InnerKernel( ib, n, pb, &A( i,p ), lda, &B(p, 0 ), ldb, &C( i,0 ), ldc, i==0 );
}
}
}

void InnerKernel( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc, int first_time )
{
int i, j;
double
packedA[ m * k ], packedB[ k*n ];

for ( j=0; j<n; j+=4 ){ /* Loop over the columns of C, unrolled by 4 */
PackMatrixB( k, &B( 0, j ), ldb, &packedB[ j*k ] );
for ( i=0; i<m; i+=4 ){ /* Loop over the rows of C */
/* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
one routine (four inner products) */
if ( j == 0 )
PackMatrixA( k, &A( i, 0 ), lda, &packedA[ i*k ] );
AddDot4x4( k, &packedA[ i*k ], 4, &packedB[ j*k ], k, &C( i,j ), ldc );
}
}
}

void PackMatrixA( int k, double *a, int lda, double *a_to )
{
int j;

for( j=0; j<k; j++){ /* loop over columns of A */
double
*a_ij_pntr = &A( 0, j );

*a_to = *a_ij_pntr;
*(a_to+1) = *(a_ij_pntr+1);
*(a_to+2) = *(a_ij_pntr+2);
*(a_to+3) = *(a_ij_pntr+3);

a_to += 4;
}
}

void PackMatrixB( int k, double *b, int ldb, double *b_to )
{
int i;
double
*b_i0_pntr = &B( 0, 0 ), *b_i1_pntr = &B( 0, 1 ),
*b_i2_pntr = &B( 0, 2 ), *b_i3_pntr = &B( 0, 3 );

for( i=0; i<k; i++){ /* loop over rows of B */
*b_to++ = *b_i0_pntr++;
*b_to++ = *b_i1_pntr++;
*b_to++ = *b_i2_pntr++;
*b_to++ = *b_i3_pntr++;
}
}

#include <mmintrin.h>
#include <xmmintrin.h> // SSE
#include <pmmintrin.h> // SSE2
#include <emmintrin.h> // SSE3

typedef union
{
__m128d v;
double d[2];
} v2df_t;

void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc )
{
/* So, this routine computes a 4x4 block of matrix C

C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ).
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ).
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ).
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ).

Notice that this routine is called with c = C( i, j ) in the
previous routine, so these are actually the elements

C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 )
C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 )
C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 )
C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 )

in the original matrix C

And now we use vector registers and instructions */

int p;
v2df_t
c_00_c_10_vreg, c_01_c_11_vreg, c_02_c_12_vreg, c_03_c_13_vreg,
c_20_c_30_vreg, c_21_c_31_vreg, c_22_c_32_vreg, c_23_c_33_vreg,
a_0p_a_1p_vreg,
a_2p_a_3p_vreg,
b_p0_vreg, b_p1_vreg, b_p2_vreg, b_p3_vreg;

c_00_c_10_vreg.v = _mm_setzero_pd();
c_01_c_11_vreg.v = _mm_setzero_pd();
c_02_c_12_vreg.v = _mm_setzero_pd();
c_03_c_13_vreg.v = _mm_setzero_pd();
c_20_c_30_vreg.v = _mm_setzero_pd();
c_21_c_31_vreg.v = _mm_setzero_pd();
c_22_c_32_vreg.v = _mm_setzero_pd();
c_23_c_33_vreg.v = _mm_setzero_pd();

for ( p=0; p<k; p++ ){
a_0p_a_1p_vreg.v = _mm_load_pd( (double *) a );
a_2p_a_3p_vreg.v = _mm_load_pd( (double *) ( a+2 ) );
a += 4;

b_p0_vreg.v = _mm_loaddup_pd( (double *) b ); /* load and duplicate */
b_p1_vreg.v = _mm_loaddup_pd( (double *) (b+1) ); /* load and duplicate */
b_p2_vreg.v = _mm_loaddup_pd( (double *) (b+2) ); /* load and duplicate */
b_p3_vreg.v = _mm_loaddup_pd( (double *) (b+3) ); /* load and duplicate */

b += 4;

/* First row and second rows */
c_00_c_10_vreg.v += a_0p_a_1p_vreg.v * b_p0_vreg.v;
c_01_c_11_vreg.v += a_0p_a_1p_vreg.v * b_p1_vreg.v;
c_02_c_12_vreg.v += a_0p_a_1p_vreg.v * b_p2_vreg.v;
c_03_c_13_vreg.v += a_0p_a_1p_vreg.v * b_p3_vreg.v;

/* Third and fourth rows */
c_20_c_30_vreg.v += a_2p_a_3p_vreg.v * b_p0_vreg.v;
c_21_c_31_vreg.v += a_2p_a_3p_vreg.v * b_p1_vreg.v;
c_22_c_32_vreg.v += a_2p_a_3p_vreg.v * b_p2_vreg.v;
c_23_c_33_vreg.v += a_2p_a_3p_vreg.v * b_p3_vreg.v;
}

C( 0, 0 ) += c_00_c_10_vreg.d[0]; C( 0, 1 ) += c_01_c_11_vreg.d[0];
C( 0, 2 ) += c_02_c_12_vreg.d[0]; C( 0, 3 ) += c_03_c_13_vreg.d[0];

C( 1, 0 ) += c_00_c_10_vreg.d[1]; C( 1, 1 ) += c_01_c_11_vreg.d[1];
C( 1, 2 ) += c_02_c_12_vreg.d[1]; C( 1, 3 ) += c_03_c_13_vreg.d[1];

C( 2, 0 ) += c_20_c_30_vreg.d[0]; C( 2, 1 ) += c_21_c_31_vreg.d[0];
C( 2, 2 ) += c_22_c_32_vreg.d[0]; C( 2, 3 ) += c_23_c_33_vreg.d[0];

C( 3, 0 ) += c_20_c_30_vreg.d[1]; C( 3, 1 ) += c_21_c_31_vreg.d[1];
C( 3, 2 ) += c_22_c_32_vreg.d[1]; C( 3, 3 ) += c_23_c_33_vreg.d[1];
}

Optimization_4x4_15

And once more we add a little code so that the kx4 blocks of B are not repeatedly packed. Now the performance is impressive!

/* Create macros so that the matrices are stored in column-major order */

#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Block sizes */
#define mc 256
#define kc 128
#define nb 1000

#define min( i, j ) ( (i)<(j) ? (i): (j) )

/* Routine for computing C = A * B + C */

void AddDot4x4( int, double *, int, double *, int, double *, int );
void PackMatrixA( int, double *, int, double * );
void PackMatrixB( int, double *, int, double * );
void InnerKernel( int, int, int, double *, int, double *, int, double *, int, int );

void MY_MMult( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, p, pb, ib;

/* This time, we compute a mc x n block of C by a call to the InnerKernel */

for ( p=0; p<k; p+=kc ){
pb = min( k-p, kc );
for ( i=0; i<m; i+=mc ){
ib = min( m-i, mc );
InnerKernel( ib, n, pb, &A( i,p ), lda, &B(p, 0 ), ldb, &C( i,0 ), ldc, i==0 );
}
}
}

void InnerKernel( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc, int first_time )
{
int i, j;
double
packedA[ m * k ];
static double
packedB[ kc*nb ]; /* Note: using a static buffer is not thread safe... */

for ( j=0; j<n; j+=4 ){ /* Loop over the columns of C, unrolled by 4 */
if ( first_time )
PackMatrixB( k, &B( 0, j ), ldb, &packedB[ j*k ] );
for ( i=0; i<m; i+=4 ){ /* Loop over the rows of C */
/* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
one routine (four inner products) */
if ( j == 0 )
PackMatrixA( k, &A( i, 0 ), lda, &packedA[ i*k ] );
AddDot4x4( k, &packedA[ i*k ], 4, &packedB[ j*k ], k, &C( i,j ), ldc );
}
}
}

void PackMatrixA( int k, double *a, int lda, double *a_to )
{
int j;

for( j=0; j<k; j++){ /* loop over columns of A */
double
*a_ij_pntr = &A( 0, j );

*a_to = *a_ij_pntr;
*(a_to+1) = *(a_ij_pntr+1);
*(a_to+2) = *(a_ij_pntr+2);
*(a_to+3) = *(a_ij_pntr+3);

a_to += 4;
}
}

void PackMatrixB( int k, double *b, int ldb, double *b_to )
{
int i;
double
*b_i0_pntr = &B( 0, 0 ), *b_i1_pntr = &B( 0, 1 ),
*b_i2_pntr = &B( 0, 2 ), *b_i3_pntr = &B( 0, 3 );

for( i=0; i<k; i++){ /* loop over rows of B */
*b_to++ = *b_i0_pntr++;
*b_to++ = *b_i1_pntr++;
*b_to++ = *b_i2_pntr++;
*b_to++ = *b_i3_pntr++;
}
}

#include <mmintrin.h>
#include <xmmintrin.h> // SSE
#include <pmmintrin.h> // SSE2
#include <emmintrin.h> // SSE3

typedef union
{
__m128d v;
double d[2];
} v2df_t;

void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc )
{
/* So, this routine computes a 4x4 block of matrix C

C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ).
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ).
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ).
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ).

Notice that this routine is called with c = C( i, j ) in the
previous routine, so these are actually the elements

C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 )
C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 )
C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 )
C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 )

in the original matrix C

And now we use vector registers and instructions */

int p;
v2df_t
c_00_c_10_vreg, c_01_c_11_vreg, c_02_c_12_vreg, c_03_c_13_vreg,
c_20_c_30_vreg, c_21_c_31_vreg, c_22_c_32_vreg, c_23_c_33_vreg,
a_0p_a_1p_vreg,
a_2p_a_3p_vreg,
b_p0_vreg, b_p1_vreg, b_p2_vreg, b_p3_vreg;

c_00_c_10_vreg.v = _mm_setzero_pd();
c_01_c_11_vreg.v = _mm_setzero_pd();
c_02_c_12_vreg.v = _mm_setzero_pd();
c_03_c_13_vreg.v = _mm_setzero_pd();
c_20_c_30_vreg.v = _mm_setzero_pd();
c_21_c_31_vreg.v = _mm_setzero_pd();
c_22_c_32_vreg.v = _mm_setzero_pd();
c_23_c_33_vreg.v = _mm_setzero_pd();

for ( p=0; p<k; p++ ){
a_0p_a_1p_vreg.v = _mm_load_pd( (double *) a );
a_2p_a_3p_vreg.v = _mm_load_pd( (double *) ( a+2 ) );
a += 4;

b_p0_vreg.v = _mm_loaddup_pd( (double *) b ); /* load and duplicate */
b_p1_vreg.v = _mm_loaddup_pd( (double *) (b+1) ); /* load and duplicate */
b_p2_vreg.v = _mm_loaddup_pd( (double *) (b+2) ); /* load and duplicate */
b_p3_vreg.v = _mm_loaddup_pd( (double *) (b+3) ); /* load and duplicate */

b += 4;

/* First row and second rows */
c_00_c_10_vreg.v += a_0p_a_1p_vreg.v * b_p0_vreg.v;
c_01_c_11_vreg.v += a_0p_a_1p_vreg.v * b_p1_vreg.v;
c_02_c_12_vreg.v += a_0p_a_1p_vreg.v * b_p2_vreg.v;
c_03_c_13_vreg.v += a_0p_a_1p_vreg.v * b_p3_vreg.v;

/* Third and fourth rows */
c_20_c_30_vreg.v += a_2p_a_3p_vreg.v * b_p0_vreg.v;
c_21_c_31_vreg.v += a_2p_a_3p_vreg.v * b_p1_vreg.v;
c_22_c_32_vreg.v += a_2p_a_3p_vreg.v * b_p2_vreg.v;
c_23_c_33_vreg.v += a_2p_a_3p_vreg.v * b_p3_vreg.v;
}

C( 0, 0 ) += c_00_c_10_vreg.d[0]; C( 0, 1 ) += c_01_c_11_vreg.d[0];
C( 0, 2 ) += c_02_c_12_vreg.d[0]; C( 0, 3 ) += c_03_c_13_vreg.d[0];

C( 1, 0 ) += c_00_c_10_vreg.d[1]; C( 1, 1 ) += c_01_c_11_vreg.d[1];
C( 1, 2 ) += c_02_c_12_vreg.d[1]; C( 1, 3 ) += c_03_c_13_vreg.d[1];

C( 2, 0 ) += c_20_c_30_vreg.d[0]; C( 2, 1 ) += c_21_c_31_vreg.d[0];
C( 2, 2 ) += c_22_c_32_vreg.d[0]; C( 2, 3 ) += c_23_c_33_vreg.d[0];

C( 3, 0 ) += c_20_c_30_vreg.d[1]; C( 3, 1 ) += c_21_c_31_vreg.d[1];
C( 3, 2 ) += c_22_c_32_vreg.d[1]; C( 3, 3 ) += c_23_c_33_vreg.d[1];
}

Blocking to maintain performance

Now, performance is maintained:

Optimization_4x4_11

We notice that for all optimizations so far, performance drops considerably once the matrices become much larger than the L2 cache can hold. In this optimization we add an extra level of blocking to overcome that. There is now a main routine that calls the inner kernel used by GotoBLAS and BLIS, and the AddDot4x4 routine then plays the role of BLIS's micro-kernel.

This step is all about blocking: the old MY_MMult becomes InnerKernel, and the new MY_MMult's only job is to carve the matrices into blocks, whose sizes are given by the macros.


/* Create macros so that the matrices are stored in column-major order */

#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Block sizes */
#define mc 256
#define kc 128

#define min( i, j ) ( (i)<(j) ? (i): (j) )

/* Routine for computing C = A * B + C */

void AddDot4x4( int, double *, int, double *, int, double *, int );
void InnerKernel( int, int, int, double *, int, double *, int, double *, int );

void MY_MMult( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, j, p, pb, ib;

/* This time, we compute a mc x n block of C by a call to the InnerKernel */

for ( p=0; p<k; p+=kc ){
pb = min( k-p, kc );
for ( i=0; i<m; i+=mc ){
ib = min( m-i, mc );
InnerKernel( ib, n, pb, &A( i,p ), lda, &B(p, 0 ), ldb, &C( i,0 ), ldc );
}
}
}

void InnerKernel( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, j;

for ( j=0; j<n; j+=4 ){ /* Loop over the columns of C, unrolled by 4 */
for ( i=0; i<m; i+=4 ){ /* Loop over the rows of C */
/* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
one routine (four inner products) */

AddDot4x4( k, &A( i,0 ), lda, &B( 0,j ), ldb, &C( i,j ), ldc );
}
}
}

#include <mmintrin.h>
#include <xmmintrin.h> // SSE
#include <pmmintrin.h> // SSE2
#include <emmintrin.h> // SSE3

typedef union
{
__m128d v;
double d[2];
} v2df_t;

void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc )
{
/* So, this routine computes a 4x4 block of matrix C

C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ).
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ).
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ).
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ).

Notice that this routine is called with c = C( i, j ) in the
previous routine, so these are actually the elements

C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 )
C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 )
C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 )
C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 )

in the original matrix C

And now we use vector registers and instructions */

int p;
v2df_t
c_00_c_10_vreg, c_01_c_11_vreg, c_02_c_12_vreg, c_03_c_13_vreg,
c_20_c_30_vreg, c_21_c_31_vreg, c_22_c_32_vreg, c_23_c_33_vreg,
a_0p_a_1p_vreg,
a_2p_a_3p_vreg,
b_p0_vreg, b_p1_vreg, b_p2_vreg, b_p3_vreg;

double
/* Point to the current elements in the four columns of B */
*b_p0_pntr, *b_p1_pntr, *b_p2_pntr, *b_p3_pntr;

b_p0_pntr = &B( 0, 0 );
b_p1_pntr = &B( 0, 1 );
b_p2_pntr = &B( 0, 2 );
b_p3_pntr = &B( 0, 3 );

c_00_c_10_vreg.v = _mm_setzero_pd();
c_01_c_11_vreg.v = _mm_setzero_pd();
c_02_c_12_vreg.v = _mm_setzero_pd();
c_03_c_13_vreg.v = _mm_setzero_pd();
c_20_c_30_vreg.v = _mm_setzero_pd();
c_21_c_31_vreg.v = _mm_setzero_pd();
c_22_c_32_vreg.v = _mm_setzero_pd();
c_23_c_33_vreg.v = _mm_setzero_pd();

for ( p=0; p<k; p++ ){
a_0p_a_1p_vreg.v = _mm_load_pd( (double *) &A( 0, p ) );
a_2p_a_3p_vreg.v = _mm_load_pd( (double *) &A( 2, p ) );

b_p0_vreg.v = _mm_loaddup_pd( (double *) b_p0_pntr++ ); /* load and duplicate */
b_p1_vreg.v = _mm_loaddup_pd( (double *) b_p1_pntr++ ); /* load and duplicate */
b_p2_vreg.v = _mm_loaddup_pd( (double *) b_p2_pntr++ ); /* load and duplicate */
b_p3_vreg.v = _mm_loaddup_pd( (double *) b_p3_pntr++ ); /* load and duplicate */

/* First row and second rows */
c_00_c_10_vreg.v += a_0p_a_1p_vreg.v * b_p0_vreg.v;
c_01_c_11_vreg.v += a_0p_a_1p_vreg.v * b_p1_vreg.v;
c_02_c_12_vreg.v += a_0p_a_1p_vreg.v * b_p2_vreg.v;
c_03_c_13_vreg.v += a_0p_a_1p_vreg.v * b_p3_vreg.v;

/* Third and fourth rows */
c_20_c_30_vreg.v += a_2p_a_3p_vreg.v * b_p0_vreg.v;
c_21_c_31_vreg.v += a_2p_a_3p_vreg.v * b_p1_vreg.v;
c_22_c_32_vreg.v += a_2p_a_3p_vreg.v * b_p2_vreg.v;
c_23_c_33_vreg.v += a_2p_a_3p_vreg.v * b_p3_vreg.v;
}

C( 0, 0 ) += c_00_c_10_vreg.d[0]; C( 0, 1 ) += c_01_c_11_vreg.d[0];
C( 0, 2 ) += c_02_c_12_vreg.d[0]; C( 0, 3 ) += c_03_c_13_vreg.d[0];

C( 1, 0 ) += c_00_c_10_vreg.d[1]; C( 1, 1 ) += c_01_c_11_vreg.d[1];
C( 1, 2 ) += c_02_c_12_vreg.d[1]; C( 1, 3 ) += c_03_c_13_vreg.d[1];

C( 2, 0 ) += c_20_c_30_vreg.d[0]; C( 2, 1 ) += c_21_c_31_vreg.d[0];
C( 2, 2 ) += c_22_c_32_vreg.d[0]; C( 2, 3 ) += c_23_c_33_vreg.d[0];

C( 3, 0 ) += c_20_c_30_vreg.d[1]; C( 3, 1 ) += c_21_c_31_vreg.d[1];
C( 3, 2 ) += c_22_c_32_vreg.d[1]; C( 3, 3 ) += c_23_c_33_vreg.d[1];
}

Further optimizing

Now we start optimizing in a way different from the 1x4 case.

Note that we are now using many more regular registers than are physically available...

We notice a considerable performance boost:

Still, there is plenty of room for improvement.

Optimization_4x4_8

We now use registers for the elements of the current row of B as well. (Note that we did not do this in the case where four elements of C were computed at a time.) Performance actually drops slightly, but this step enables further optimizations.


/* Create macros so that the matrices are stored in column-major order */

#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Routine for computing C = A * B + C */

void AddDot4x4( int, double *, int, double *, int, double *, int );

void MY_MMult( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, j;

for ( j=0; j<n; j+=4 ){ /* Loop over the columns of C, unrolled by 4 */
for ( i=0; i<m; i+=4 ){ /* Loop over the rows of C */
/* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
one routine (four inner products) */

AddDot4x4( k, &A( i,0 ), lda, &B( 0,j ), ldb, &C( i,j ), ldc );
}
}
}


void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc )
{
/* So, this routine computes a 4x4 block of matrix C

C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ).
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ).
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ).
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ).

Notice that this routine is called with c = C( i, j ) in the
previous routine, so these are actually the elements

C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 )
C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 )
C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 )
C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 )

in the original matrix C

In this version, we use registers for elements in the current row
of B as well */

int p;
register double
/* hold contributions to
C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 )
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 )
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 )
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ) */
c_00_reg, c_01_reg, c_02_reg, c_03_reg,
c_10_reg, c_11_reg, c_12_reg, c_13_reg,
c_20_reg, c_21_reg, c_22_reg, c_23_reg,
c_30_reg, c_31_reg, c_32_reg, c_33_reg,
/* hold
A( 0, p )
A( 1, p )
A( 2, p )
A( 3, p ) */
a_0p_reg,
a_1p_reg,
a_2p_reg,
a_3p_reg,
b_p0_reg,
b_p1_reg,
b_p2_reg,
b_p3_reg;

double
/* Point to the current elements in the four columns of B */
*b_p0_pntr, *b_p1_pntr, *b_p2_pntr, *b_p3_pntr;

b_p0_pntr = &B( 0, 0 );
b_p1_pntr = &B( 0, 1 );
b_p2_pntr = &B( 0, 2 );
b_p3_pntr = &B( 0, 3 );

c_00_reg = 0.0; c_01_reg = 0.0; c_02_reg = 0.0; c_03_reg = 0.0;
c_10_reg = 0.0; c_11_reg = 0.0; c_12_reg = 0.0; c_13_reg = 0.0;
c_20_reg = 0.0; c_21_reg = 0.0; c_22_reg = 0.0; c_23_reg = 0.0;
c_30_reg = 0.0; c_31_reg = 0.0; c_32_reg = 0.0; c_33_reg = 0.0;

for ( p=0; p<k; p++ ){
a_0p_reg = A( 0, p );
a_1p_reg = A( 1, p );
a_2p_reg = A( 2, p );
a_3p_reg = A( 3, p );

b_p0_reg = *b_p0_pntr++;
b_p1_reg = *b_p1_pntr++;
b_p2_reg = *b_p2_pntr++;
b_p3_reg = *b_p3_pntr++;

/* First row */
c_00_reg += a_0p_reg * b_p0_reg;
c_01_reg += a_0p_reg * b_p1_reg;
c_02_reg += a_0p_reg * b_p2_reg;
c_03_reg += a_0p_reg * b_p3_reg;

/* Second row */
c_10_reg += a_1p_reg * b_p0_reg;
c_11_reg += a_1p_reg * b_p1_reg;
c_12_reg += a_1p_reg * b_p2_reg;
c_13_reg += a_1p_reg * b_p3_reg;

/* Third row */
c_20_reg += a_2p_reg * b_p0_reg;
c_21_reg += a_2p_reg * b_p1_reg;
c_22_reg += a_2p_reg * b_p2_reg;
c_23_reg += a_2p_reg * b_p3_reg;

/* Fourth row */
c_30_reg += a_3p_reg * b_p0_reg;
c_31_reg += a_3p_reg * b_p1_reg;
c_32_reg += a_3p_reg * b_p2_reg;
c_33_reg += a_3p_reg * b_p3_reg;
}

C( 0, 0 ) += c_00_reg; C( 0, 1 ) += c_01_reg; C( 0, 2 ) += c_02_reg; C( 0, 3 ) += c_03_reg;
C( 1, 0 ) += c_10_reg; C( 1, 1 ) += c_11_reg; C( 1, 2 ) += c_12_reg; C( 1, 3 ) += c_13_reg;
C( 2, 0 ) += c_20_reg; C( 2, 1 ) += c_21_reg; C( 2, 2 ) += c_22_reg; C( 2, 3 ) += c_23_reg;
C( 3, 0 ) += c_30_reg; C( 3, 1 ) += c_31_reg; C( 3, 2 ) += c_32_reg; C( 3, 3 ) += c_33_reg;
}

Optimization_4x4_9

The change from 4x4_8 to 4x4_9 is subtle: rather than updating the 4x4 block of C one row at a time, we compute two rows at a time. This prepares us for the use of vector operations, with which we will update the pairs C(0,j) and C(1,j) (j = 0, …, 3).
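The point of the reordering can be seen in a toy form: each pair of updates shares one element of B and writes two vertically adjacent accumulators, which is exactly the shape a 2-wide vector operation wants. A minimal sketch (the helper name `paired_update` is ours, not the tutorial's):

```c
#include <assert.h>

/* After the 4x4_9 reordering, updates come in pairs that share one element
   of B and touch two vertically adjacent accumulators of C.  Each such pair
   maps 1:1 onto a single 2-wide vector multiply-add in the next step. */
static void paired_update(double *c0, double *c1,
                          double a0, double a1, double b)
{
    *c0 += a0 * b;   /* e.g. c_00_reg += a_0p_reg * b_p0_reg */
    *c1 += a1 * b;   /* e.g. c_10_reg += a_1p_reg * b_p0_reg */
}
```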


/* Create macros so that the matrices are stored in column-major order */

#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Routine for computing C = A * B + C */

void AddDot4x4( int, double *, int, double *, int, double *, int );

void MY_MMult( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, j;

for ( j=0; j<n; j+=4 ){ /* Loop over the columns of C, unrolled by 4 */
for ( i=0; i<m; i+=4 ){ /* Loop over the rows of C */
/* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
one routine (four inner products) */

AddDot4x4( k, &A( i,0 ), lda, &B( 0,j ), ldb, &C( i,j ), ldc );
}
}
}


void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc )
{
/* So, this routine computes a 4x4 block of matrix C

C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ).
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ).
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ).
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ).

Notice that this routine is called with c = C( i, j ) in the
previous routine, so these are actually the elements

C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 )
C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 )
C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 )
C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 )

in the original matrix C

A simple rearrangement to prepare for the use of vector registers */

int p;
register double
/* hold contributions to
C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 )
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 )
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 )
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ) */
c_00_reg, c_01_reg, c_02_reg, c_03_reg,
c_10_reg, c_11_reg, c_12_reg, c_13_reg,
c_20_reg, c_21_reg, c_22_reg, c_23_reg,
c_30_reg, c_31_reg, c_32_reg, c_33_reg,
/* hold
A( 0, p )
A( 1, p )
A( 2, p )
A( 3, p ) */
a_0p_reg,
a_1p_reg,
a_2p_reg,
a_3p_reg,
b_p0_reg,
b_p1_reg,
b_p2_reg,
b_p3_reg;

double
/* Point to the current elements in the four columns of B */
*b_p0_pntr, *b_p1_pntr, *b_p2_pntr, *b_p3_pntr;

b_p0_pntr = &B( 0, 0 );
b_p1_pntr = &B( 0, 1 );
b_p2_pntr = &B( 0, 2 );
b_p3_pntr = &B( 0, 3 );

c_00_reg = 0.0; c_01_reg = 0.0; c_02_reg = 0.0; c_03_reg = 0.0;
c_10_reg = 0.0; c_11_reg = 0.0; c_12_reg = 0.0; c_13_reg = 0.0;
c_20_reg = 0.0; c_21_reg = 0.0; c_22_reg = 0.0; c_23_reg = 0.0;
c_30_reg = 0.0; c_31_reg = 0.0; c_32_reg = 0.0; c_33_reg = 0.0;

for ( p=0; p<k; p++ ){
a_0p_reg = A( 0, p );
a_1p_reg = A( 1, p );
a_2p_reg = A( 2, p );
a_3p_reg = A( 3, p );

b_p0_reg = *b_p0_pntr++;
b_p1_reg = *b_p1_pntr++;
b_p2_reg = *b_p2_pntr++;
b_p3_reg = *b_p3_pntr++;

/* First row and second rows */
c_00_reg += a_0p_reg * b_p0_reg;
c_10_reg += a_1p_reg * b_p0_reg;

c_01_reg += a_0p_reg * b_p1_reg;
c_11_reg += a_1p_reg * b_p1_reg;

c_02_reg += a_0p_reg * b_p2_reg;
c_12_reg += a_1p_reg * b_p2_reg;

c_03_reg += a_0p_reg * b_p3_reg;
c_13_reg += a_1p_reg * b_p3_reg;

/* Third and fourth rows */
c_20_reg += a_2p_reg * b_p0_reg;
c_30_reg += a_3p_reg * b_p0_reg;

c_21_reg += a_2p_reg * b_p1_reg;
c_31_reg += a_3p_reg * b_p1_reg;

c_22_reg += a_2p_reg * b_p2_reg;
c_32_reg += a_3p_reg * b_p2_reg;

c_23_reg += a_2p_reg * b_p3_reg;
c_33_reg += a_3p_reg * b_p3_reg;
}

C( 0, 0 ) += c_00_reg; C( 0, 1 ) += c_01_reg; C( 0, 2 ) += c_02_reg; C( 0, 3 ) += c_03_reg;
C( 1, 0 ) += c_10_reg; C( 1, 1 ) += c_11_reg; C( 1, 2 ) += c_12_reg; C( 1, 3 ) += c_13_reg;
C( 2, 0 ) += c_20_reg; C( 2, 1 ) += c_21_reg; C( 2, 2 ) += c_22_reg; C( 2, 3 ) += c_23_reg;
C( 3, 0 ) += c_30_reg; C( 3, 1 ) += c_31_reg; C( 3, 2 ) += c_32_reg; C( 3, 3 ) += c_33_reg;
}

Optimization_4x4_10

Here we start using vector registers and vector instructions.


/* Create macros so that the matrices are stored in column-major order */

#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Routine for computing C = A * B + C */

void AddDot4x4( int, double *, int, double *, int, double *, int );

void MY_MMult( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, j;

for ( j=0; j<n; j+=4 ){ /* Loop over the columns of C, unrolled by 4 */
for ( i=0; i<m; i+=4 ){ /* Loop over the rows of C */
/* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
one routine (four inner products) */

AddDot4x4( k, &A( i,0 ), lda, &B( 0,j ), ldb, &C( i,j ), ldc );
}
}
}

#include <mmintrin.h>
#include <xmmintrin.h> // SSE
#include <pmmintrin.h> // SSE2
#include <emmintrin.h> // SSE3

typedef union
{
__m128d v;
double d[2];
} v2df_t;

void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc )
{
/* So, this routine computes a 4x4 block of matrix C

C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ).
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ).
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ).
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ).

Notice that this routine is called with c = C( i, j ) in the
previous routine, so these are actually the elements

C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 )
C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 )
C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 )
C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 )

in the original matrix C

And now we use vector registers and instructions */

int p;

v2df_t
c_00_c_10_vreg, c_01_c_11_vreg, c_02_c_12_vreg, c_03_c_13_vreg,
c_20_c_30_vreg, c_21_c_31_vreg, c_22_c_32_vreg, c_23_c_33_vreg,
a_0p_a_1p_vreg,
a_2p_a_3p_vreg,
b_p0_vreg, b_p1_vreg, b_p2_vreg, b_p3_vreg;

double
/* Point to the current elements in the four columns of B */
*b_p0_pntr, *b_p1_pntr, *b_p2_pntr, *b_p3_pntr;

b_p0_pntr = &B( 0, 0 );
b_p1_pntr = &B( 0, 1 );
b_p2_pntr = &B( 0, 2 );
b_p3_pntr = &B( 0, 3 );

c_00_c_10_vreg.v = _mm_setzero_pd();
c_01_c_11_vreg.v = _mm_setzero_pd();
c_02_c_12_vreg.v = _mm_setzero_pd();
c_03_c_13_vreg.v = _mm_setzero_pd();
c_20_c_30_vreg.v = _mm_setzero_pd();
c_21_c_31_vreg.v = _mm_setzero_pd();
c_22_c_32_vreg.v = _mm_setzero_pd();
c_23_c_33_vreg.v = _mm_setzero_pd();

for ( p=0; p<k; p++ ){
a_0p_a_1p_vreg.v = _mm_load_pd( (double *) &A( 0, p ) );
a_2p_a_3p_vreg.v = _mm_load_pd( (double *) &A( 2, p ) );

b_p0_vreg.v = _mm_loaddup_pd( (double *) b_p0_pntr++ ); /* load and duplicate */
b_p1_vreg.v = _mm_loaddup_pd( (double *) b_p1_pntr++ ); /* load and duplicate */
b_p2_vreg.v = _mm_loaddup_pd( (double *) b_p2_pntr++ ); /* load and duplicate */
b_p3_vreg.v = _mm_loaddup_pd( (double *) b_p3_pntr++ ); /* load and duplicate */

/* First and second rows: vectorized, two doubles at a time */
c_00_c_10_vreg.v += a_0p_a_1p_vreg.v * b_p0_vreg.v;
c_01_c_11_vreg.v += a_0p_a_1p_vreg.v * b_p1_vreg.v;
c_02_c_12_vreg.v += a_0p_a_1p_vreg.v * b_p2_vreg.v;
c_03_c_13_vreg.v += a_0p_a_1p_vreg.v * b_p3_vreg.v;

/* Third and fourth rows */
c_20_c_30_vreg.v += a_2p_a_3p_vreg.v * b_p0_vreg.v;
c_21_c_31_vreg.v += a_2p_a_3p_vreg.v * b_p1_vreg.v;
c_22_c_32_vreg.v += a_2p_a_3p_vreg.v * b_p2_vreg.v;
c_23_c_33_vreg.v += a_2p_a_3p_vreg.v * b_p3_vreg.v;
}

C( 0, 0 ) += c_00_c_10_vreg.d[0]; C( 0, 1 ) += c_01_c_11_vreg.d[0];
C( 0, 2 ) += c_02_c_12_vreg.d[0]; C( 0, 3 ) += c_03_c_13_vreg.d[0];

C( 1, 0 ) += c_00_c_10_vreg.d[1]; C( 1, 1 ) += c_01_c_11_vreg.d[1];
C( 1, 2 ) += c_02_c_12_vreg.d[1]; C( 1, 3 ) += c_03_c_13_vreg.d[1];

C( 2, 0 ) += c_20_c_30_vreg.d[0]; C( 2, 1 ) += c_21_c_31_vreg.d[0];
C( 2, 2 ) += c_22_c_32_vreg.d[0]; C( 2, 3 ) += c_23_c_33_vreg.d[0];

C( 3, 0 ) += c_20_c_30_vreg.d[1]; C( 3, 1 ) += c_21_c_31_vreg.d[1];
C( 3, 2 ) += c_22_c_32_vreg.d[1]; C( 3, 3 ) += c_23_c_33_vreg.d[1];
}

Repeating the same optimizations


At this point, we again start seeing some performance improvements:

[performance plots]

Optimization_4x4_3

Unroll the loop over i as well: the original AddDot1x4 becomes AddDot4x4, computing 16 inner products (a 4x4 block of C) at a time.


/* Create macros so that the matrices are stored in column-major order */

#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Routine for computing C = A * B + C */

void AddDot4x4( int, double *, int, double *, int, double *, int );
void AddDot( int, double *, int, double *, double * );

void MY_MMult( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, j;

for ( j=0; j<n; j+=4 ){ /* Loop over the columns of C, unrolled by 4 */
for ( i=0; i<m; i+=4 ){ /* Loop over the rows of C */
/* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
one routine (four inner products) */

AddDot4x4( k, &A( i,0 ), lda, &B( 0,j ), ldb, &C( i,j ), ldc );
}
}
}


void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc )
{
/* So, this routine computes a 4x4 block of matrix C

C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ).
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ).
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ).
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ).

Notice that this routine is called with c = C( i, j ) in the
previous routine, so these are actually the elements

C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 )
C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 )
C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 )
C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 )

in the original matrix C */

/* First row */
AddDot( k, &A( 0, 0 ), lda, &B( 0, 0 ), &C( 0, 0 ) );
AddDot( k, &A( 0, 0 ), lda, &B( 0, 1 ), &C( 0, 1 ) );
AddDot( k, &A( 0, 0 ), lda, &B( 0, 2 ), &C( 0, 2 ) );
AddDot( k, &A( 0, 0 ), lda, &B( 0, 3 ), &C( 0, 3 ) );

/* Second row */
AddDot( k, &A( 1, 0 ), lda, &B( 0, 0 ), &C( 1, 0 ) );
AddDot( k, &A( 1, 0 ), lda, &B( 0, 1 ), &C( 1, 1 ) );
AddDot( k, &A( 1, 0 ), lda, &B( 0, 2 ), &C( 1, 2 ) );
AddDot( k, &A( 1, 0 ), lda, &B( 0, 3 ), &C( 1, 3 ) );

/* Third row */
AddDot( k, &A( 2, 0 ), lda, &B( 0, 0 ), &C( 2, 0 ) );
AddDot( k, &A( 2, 0 ), lda, &B( 0, 1 ), &C( 2, 1 ) );
AddDot( k, &A( 2, 0 ), lda, &B( 0, 2 ), &C( 2, 2 ) );
AddDot( k, &A( 2, 0 ), lda, &B( 0, 3 ), &C( 2, 3 ) );

/* Fourth row */
AddDot( k, &A( 3, 0 ), lda, &B( 0, 0 ), &C( 3, 0 ) );
AddDot( k, &A( 3, 0 ), lda, &B( 0, 1 ), &C( 3, 1 ) );
AddDot( k, &A( 3, 0 ), lda, &B( 0, 2 ), &C( 3, 2 ) );
AddDot( k, &A( 3, 0 ), lda, &B( 0, 3 ), &C( 3, 3 ) );
}


/* Create macro to let X( i ) equal the ith element of x */

#define X(i) x[ (i)*incx ]

void AddDot( int k, double *x, int incx, double *y, double *gamma )
{
/* compute gamma := x' * y + gamma with vectors x and y of length n.

Here x starts at location x with increment (stride) incx and y starts at location y and has (implicit) stride of 1.
*/

int p;

for ( p=0; p<k; p++ ){
*gamma += X( p ) * y[ p ];
}
}
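Because A is stored column-major, a row of A is not contiguous in memory: AddDot reads x with stride incx = lda. A small usage sketch (the routine below mirrors the AddDot above, so the example is self-contained):

```c
#include <assert.h>

#define X(i) x[ (i)*incx ]

/* Mirrors AddDot above: gamma += x' * y, where x is read with stride incx
   (here, a row of a column-major matrix) and y is contiguous. */
static void AddDot( int k, double *x, int incx, double *y, double *gamma )
{
    int p;
    for ( p=0; p<k; p++ )
        *gamma += X( p ) * y[ p ];
}
```

For a column-major 2x3 matrix stored as {1,2, 3,4, 5,6}, row 0 is {1, 3, 5}: passing &a[0] with incx = 2 walks exactly that row.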

Optimization_4x4_4

Inline the AddDot compute kernel into AddDot4x4.


/* Create macros so that the matrices are stored in column-major order */

#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Routine for computing C = A * B + C */

void AddDot4x4( int, double *, int, double *, int, double *, int );
void AddDot( int, double *, int, double *, double * );

void MY_MMult( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, j;

for ( j=0; j<n; j+=4 ){ /* Loop over the columns of C, unrolled by 4 */
for ( i=0; i<m; i+=4 ){ /* Loop over the rows of C */
/* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
one routine (four inner products) */

AddDot4x4( k, &A( i,0 ), lda, &B( 0,j ), ldb, &C( i,j ), ldc );
}
}
}


void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc )
{
/* So, this routine computes a 4x4 block of matrix C

C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ).
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ).
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ).
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ).

Notice that this routine is called with c = C( i, j ) in the
previous routine, so these are actually the elements

C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 )
C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 )
C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 )
C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 )

in the original matrix C

In this version, we "inline" AddDot */

int p;

/* First row */
// AddDot( k, &A( 0, 0 ), lda, &B( 0, 0 ), &C( 0, 0 ) );
for ( p=0; p<k; p++ ){
C( 0, 0 ) += A( 0, p ) * B( p, 0 );
}
// AddDot( k, &A( 0, 0 ), lda, &B( 0, 1 ), &C( 0, 1 ) );
for ( p=0; p<k; p++ ){
C( 0, 1 ) += A( 0, p ) * B( p, 1 );
}
// AddDot( k, &A( 0, 0 ), lda, &B( 0, 2 ), &C( 0, 2 ) );
for ( p=0; p<k; p++ ){
C( 0, 2 ) += A( 0, p ) * B( p, 2 );
}
// AddDot( k, &A( 0, 0 ), lda, &B( 0, 3 ), &C( 0, 3 ) );
for ( p=0; p<k; p++ ){
C( 0, 3 ) += A( 0, p ) * B( p, 3 );
}

/* Second row */
// AddDot( k, &A( 1, 0 ), lda, &B( 0, 0 ), &C( 1, 0 ) );
for ( p=0; p<k; p++ ){
C( 1, 0 ) += A( 1, p ) * B( p, 0 );
}
// AddDot( k, &A( 1, 0 ), lda, &B( 0, 1 ), &C( 1, 1 ) );
for ( p=0; p<k; p++ ){
C( 1, 1 ) += A( 1, p ) * B( p, 1 );
}
// AddDot( k, &A( 1, 0 ), lda, &B( 0, 2 ), &C( 1, 2 ) );
for ( p=0; p<k; p++ ){
C( 1, 2 ) += A( 1, p ) * B( p, 2 );
}
// AddDot( k, &A( 1, 0 ), lda, &B( 0, 3 ), &C( 1, 3 ) );
for ( p=0; p<k; p++ ){
C( 1, 3 ) += A( 1, p ) * B( p, 3 );
}

/* Third row */
// AddDot( k, &A( 2, 0 ), lda, &B( 0, 0 ), &C( 2, 0 ) );
for ( p=0; p<k; p++ ){
C( 2, 0 ) += A( 2, p ) * B( p, 0 );
}
// AddDot( k, &A( 2, 0 ), lda, &B( 0, 1 ), &C( 2, 1 ) );
for ( p=0; p<k; p++ ){
C( 2, 1 ) += A( 2, p ) * B( p, 1 );
}
// AddDot( k, &A( 2, 0 ), lda, &B( 0, 2 ), &C( 2, 2 ) );
for ( p=0; p<k; p++ ){
C( 2, 2 ) += A( 2, p ) * B( p, 2 );
}
// AddDot( k, &A( 2, 0 ), lda, &B( 0, 3 ), &C( 2, 3 ) );
for ( p=0; p<k; p++ ){
C( 2, 3 ) += A( 2, p ) * B( p, 3 );
}

/* Fourth row */
// AddDot( k, &A( 3, 0 ), lda, &B( 0, 0 ), &C( 3, 0 ) );
for ( p=0; p<k; p++ ){
C( 3, 0 ) += A( 3, p ) * B( p, 0 );
}
// AddDot( k, &A( 3, 0 ), lda, &B( 0, 1 ), &C( 3, 1 ) );
for ( p=0; p<k; p++ ){
C( 3, 1 ) += A( 3, p ) * B( p, 1 );
}
// AddDot( k, &A( 3, 0 ), lda, &B( 0, 2 ), &C( 3, 2 ) );
for ( p=0; p<k; p++ ){
C( 3, 2 ) += A( 3, p ) * B( p, 2 );
}
// AddDot( k, &A( 3, 0 ), lda, &B( 0, 3 ), &C( 3, 3 ) );
for ( p=0; p<k; p++ ){
C( 3, 3 ) += A( 3, p ) * B( p, 3 );
}
}

Optimization_4x4_5

Merge the sixteen for loops into one.

Now we see a performance benefit when the matrices get larger, because data gets more reuse once it has been brought into registers.

Before: 1x4_5 (computing 4 elements of C at a time). Now: 4x4_5 (computing 16 elements of C at a time).
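The merged kernel is equivalent to the compact triple loop below, which is handy for checking a hand-unrolled version against. A minimal sketch (the name `add_dot_4x4` is ours):

```c
#include <assert.h>

/* Compact form of the merged loop in AddDot4x4: for each p, all 16 elements
   of the 4x4 block of C are updated, so A(i,p) and B(p,j) are each reused
   four times per load.  Column-major: A(i,j) = a[j*lda + i], etc. */
static void add_dot_4x4(int k, const double *a, int lda,
                        const double *b, int ldb,
                        double *c, int ldc)
{
    for (int p = 0; p < k; p++)
        for (int j = 0; j < 4; j++)
            for (int i = 0; i < 4; i++)
                c[j*ldc + i] += a[p*lda + i] * b[j*ldb + p];
}
```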


/* Create macros so that the matrices are stored in column-major order */

#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Routine for computing C = A * B + C */

void AddDot4x4( int, double *, int, double *, int, double *, int );
void AddDot( int, double *, int, double *, double * );

void MY_MMult( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, j;

for ( j=0; j<n; j+=4 ){ /* Loop over the columns of C, unrolled by 4 */
for ( i=0; i<m; i+=4 ){ /* Loop over the rows of C */
/* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
one routine (four inner products) */

AddDot4x4( k, &A( i,0 ), lda, &B( 0,j ), ldb, &C( i,j ), ldc );
}
}
}


void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc )
{
/* So, this routine computes a 4x4 block of matrix C

C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ).
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ).
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ).
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ).

Notice that this routine is called with c = C( i, j ) in the
previous routine, so these are actually the elements

C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 )
C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 )
C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 )
C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 )

in the original matrix C

In this version, we merge each set of four loops, computing four
inner products simultaneously. */

int p;

for ( p=0; p<k; p++ ){
/* First row */
C( 0, 0 ) += A( 0, p ) * B( p, 0 );
C( 0, 1 ) += A( 0, p ) * B( p, 1 );
C( 0, 2 ) += A( 0, p ) * B( p, 2 );
C( 0, 3 ) += A( 0, p ) * B( p, 3 );

/* Second row */
C( 1, 0 ) += A( 1, p ) * B( p, 0 );
C( 1, 1 ) += A( 1, p ) * B( p, 1 );
C( 1, 2 ) += A( 1, p ) * B( p, 2 );
C( 1, 3 ) += A( 1, p ) * B( p, 3 );

/* Third row */
C( 2, 0 ) += A( 2, p ) * B( p, 0 );
C( 2, 1 ) += A( 2, p ) * B( p, 1 );
C( 2, 2 ) += A( 2, p ) * B( p, 2 );
C( 2, 3 ) += A( 2, p ) * B( p, 3 );

/* Fourth row */
C( 3, 0 ) += A( 3, p ) * B( p, 0 );
C( 3, 1 ) += A( 3, p ) * B( p, 1 );
C( 3, 2 ) += A( 3, p ) * B( p, 2 );
C( 3, 3 ) += A( 3, p ) * B( p, 3 );
}
}

Optimization_4x4_6

Hold the 4x4 block of C and the current column of A in registers.

Performance benefits from the fact that we use (regular) registers for the 4x4 block of C and for the elements of the current 4x1 column of A. Note that we declare more registers than physically exist, so it is anyone's guess what the compiler will do with them.
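The payoff can be made concrete with a quick count (our arithmetic, not the tutorial's): each iteration of the p loop in the 4x4 kernel performs 16 multiplies and 16 adds against 4 loads from A and 4 reads of B, whereas the one-element-at-a-time update does 2 flops per 2 loads:

```latex
\underbrace{\frac{2 \cdot 4 \cdot 4\ \text{flops}}{4 + 4\ \text{loads}} = 4}_{4 \times 4\ \text{block}}
\qquad \text{vs.} \qquad
\underbrace{\frac{2\ \text{flops}}{2\ \text{loads}} = 1}_{\text{one element at a time}}
```

A higher flops-to-loads ratio is what lets the kernel hide memory latency behind arithmetic.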


/* Create macros so that the matrices are stored in column-major order */

#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Routine for computing C = A * B + C */

void AddDot4x4( int, double *, int, double *, int, double *, int );

void MY_MMult( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, j;

for ( j=0; j<n; j+=4 ){ /* Loop over the columns of C, unrolled by 4 */
for ( i=0; i<m; i+=4 ){ /* Loop over the rows of C */
/* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
one routine (four inner products) */

AddDot4x4( k, &A( i,0 ), lda, &B( 0,j ), ldb, &C( i,j ), ldc );
}
}
}


void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc )
{
/* So, this routine computes a 4x4 block of matrix C

C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ).
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ).
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ).
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ).

Notice that this routine is called with c = C( i, j ) in the
previous routine, so these are actually the elements

C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 )
C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 )
C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 )
C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 )

in the original matrix C

In this version, we accumulate in registers and put A( 0, p ) in a register */

int p;
register double
/* hold contributions to
C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 )
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 )
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 )
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ) */
c_00_reg, c_01_reg, c_02_reg, c_03_reg,
c_10_reg, c_11_reg, c_12_reg, c_13_reg,
c_20_reg, c_21_reg, c_22_reg, c_23_reg,
c_30_reg, c_31_reg, c_32_reg, c_33_reg,
/* hold
A( 0, p )
A( 1, p )
A( 2, p )
A( 3, p ) */
a_0p_reg,
a_1p_reg,
a_2p_reg,
a_3p_reg;

c_00_reg = 0.0; c_01_reg = 0.0; c_02_reg = 0.0; c_03_reg = 0.0;
c_10_reg = 0.0; c_11_reg = 0.0; c_12_reg = 0.0; c_13_reg = 0.0;
c_20_reg = 0.0; c_21_reg = 0.0; c_22_reg = 0.0; c_23_reg = 0.0;
c_30_reg = 0.0; c_31_reg = 0.0; c_32_reg = 0.0; c_33_reg = 0.0;

for ( p=0; p<k; p++ ){
a_0p_reg = A( 0, p );
a_1p_reg = A( 1, p );
a_2p_reg = A( 2, p );
a_3p_reg = A( 3, p );

/* First row */
c_00_reg += a_0p_reg * B( p, 0 );
c_01_reg += a_0p_reg * B( p, 1 );
c_02_reg += a_0p_reg * B( p, 2 );
c_03_reg += a_0p_reg * B( p, 3 );

/* Second row */
c_10_reg += a_1p_reg * B( p, 0 );
c_11_reg += a_1p_reg * B( p, 1 );
c_12_reg += a_1p_reg * B( p, 2 );
c_13_reg += a_1p_reg * B( p, 3 );

/* Third row */
c_20_reg += a_2p_reg * B( p, 0 );
c_21_reg += a_2p_reg * B( p, 1 );
c_22_reg += a_2p_reg * B( p, 2 );
c_23_reg += a_2p_reg * B( p, 3 );

/* Fourth row */
c_30_reg += a_3p_reg * B( p, 0 );
c_31_reg += a_3p_reg * B( p, 1 );
c_32_reg += a_3p_reg * B( p, 2 );
c_33_reg += a_3p_reg * B( p, 3 );
}

C( 0, 0 ) += c_00_reg; C( 0, 1 ) += c_01_reg; C( 0, 2 ) += c_02_reg; C( 0, 3 ) += c_03_reg;
C( 1, 0 ) += c_10_reg; C( 1, 1 ) += c_11_reg; C( 1, 2 ) += c_12_reg; C( 1, 3 ) += c_13_reg;
C( 2, 0 ) += c_20_reg; C( 2, 1 ) += c_21_reg; C( 2, 2 ) += c_22_reg; C( 2, 3 ) += c_23_reg;
C( 3, 0 ) += c_30_reg; C( 3, 1 ) += c_31_reg; C( 3, 2 ) += c_32_reg; C( 3, 3 ) += c_33_reg;
}

Optimization_4x4_7

Here we switch to using pointers to track where we are in the four columns of B.
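The pointer form relies on B's columns being contiguous in column-major storage: bumping a pointer by one moves down a column, so an expression like `*b_p0_pntr++` visits B(0,0), B(1,0), … in turn. A minimal sketch (the helper `sum_column` is ours):

```c
#include <assert.h>

/* Walking a column of a column-major matrix with a pointer increment:
   p starts at &B(0,j) and *p++ visits B(0,j), B(1,j), ..., B(k-1,j),
   the same elements as the indexed form b[j*ldb + i]. */
static double sum_column(const double *b, int ldb, int j, int k)
{
    const double *p = &b[j*ldb];
    double s = 0.0;
    for (int i = 0; i < k; i++)
        s += *p++;
    return s;
}
```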


/* Create macros so that the matrices are stored in column-major order */

#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Routine for computing C = A * B + C */

void AddDot4x4( int, double *, int, double *, int, double *, int );

void MY_MMult( int m, int n, int k, double *a, int lda,
double *b, int ldb,
double *c, int ldc )
{
int i, j;

for ( j=0; j<n; j+=4 ){ /* Loop over the columns of C, unrolled by 4 */
for ( i=0; i<m; i+=4 ){ /* Loop over the rows of C */
/* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
one routine (four inner products) */

AddDot4x4( k, &A( i,0 ), lda, &B( 0,j ), ldb, &C( i,j ), ldc );
}
}
}


void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc )
{
/* So, this routine computes a 4x4 block of matrix C

C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ).
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ).
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ).
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ).

Notice that this routine is called with c = C( i, j ) in the
previous routine, so these are actually the elements

C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 )
C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 )
C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 )
C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 )

in the original matrix C

In this version, we use pointer to track where in four columns of B we are */

int p;
register double
/* hold contributions to
C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 )
C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 )
C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 )
C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ) */
c_00_reg, c_01_reg, c_02_reg, c_03_reg,
c_10_reg, c_11_reg, c_12_reg, c_13_reg,
c_20_reg, c_21_reg, c_22_reg, c_23_reg,
c_30_reg, c_31_reg, c_32_reg, c_33_reg,
/* hold
A( 0, p )
A( 1, p )
A( 2, p )
A( 3, p ) */
a_0p_reg,
a_1p_reg,
a_2p_reg,
a_3p_reg;
double
/* Point to the current elements in the four columns of B */
*b_p0_pntr, *b_p1_pntr, *b_p2_pntr, *b_p3_pntr;

c_00_reg = 0.0; c_01_reg = 0.0; c_02_reg = 0.0; c_03_reg = 0.0;
c_10_reg = 0.0; c_11_reg = 0.0; c_12_reg = 0.0; c_13_reg = 0.0;
c_20_reg = 0.0; c_21_reg = 0.0; c_22_reg = 0.0; c_23_reg = 0.0;
c_30_reg = 0.0; c_31_reg = 0.0; c_32_reg = 0.0; c_33_reg = 0.0;

for ( p=0; p<k; p++ ){
a_0p_reg = A( 0, p );
a_1p_reg = A( 1, p );
a_2p_reg = A( 2, p );
a_3p_reg = A( 3, p );

b_p0_pntr = &B( p, 0 );
b_p1_pntr = &B( p, 1 );
b_p2_pntr = &B( p, 2 );
b_p3_pntr = &B( p, 3 );

/* First row */
c_00_reg += a_0p_reg * *b_p0_pntr;
c_01_reg += a_0p_reg * *b_p1_pntr;
c_02_reg += a_0p_reg * *b_p2_pntr;
c_03_reg += a_0p_reg * *b_p3_pntr;

/* Second row */
c_10_reg += a_1p_reg * *b_p0_pntr;
c_11_reg += a_1p_reg * *b_p1_pntr;
c_12_reg += a_1p_reg * *b_p2_pntr;
c_13_reg += a_1p_reg * *b_p3_pntr;

/* Third row */
c_20_reg += a_2p_reg * *b_p0_pntr;
c_21_reg += a_2p_reg * *b_p1_pntr;
c_22_reg += a_2p_reg * *b_p2_pntr;
c_23_reg += a_2p_reg * *b_p3_pntr;

/* Fourth row */
c_30_reg += a_3p_reg * *b_p0_pntr++;
c_31_reg += a_3p_reg * *b_p1_pntr++;
c_32_reg += a_3p_reg * *b_p2_pntr++;
c_33_reg += a_3p_reg * *b_p3_pntr++;
}

C( 0, 0 ) += c_00_reg; C( 0, 1 ) += c_01_reg; C( 0, 2 ) += c_02_reg; C( 0, 3 ) += c_03_reg;
C( 1, 0 ) += c_10_reg; C( 1, 1 ) += c_11_reg; C( 1, 2 ) += c_12_reg; C( 1, 3 ) += c_13_reg;
C( 2, 0 ) += c_20_reg; C( 2, 1 ) += c_21_reg; C( 2, 2 ) += c_22_reg; C( 2, 3 ) += c_23_reg;
C( 3, 0 ) += c_30_reg; C( 3, 1 ) += c_31_reg; C( 3, 2 ) += c_32_reg; C( 3, 3 ) += c_33_reg;
}