In my CS parallelism class homework, we were given a simple nested for-loop and asked to make it faster without changing its semantics. In other words, the only lever we consider is the cache hit rate. Here is the code:
#define N (128)
double A[N][N][N], B[N][N], C[N][N][N], CC[N][N][N];
...
for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    for (k=0; k<N; k++)
      for (l=0; l<N; l++)
        C[i][j][k] += A[l][i][j]*B[l][k];
By permuting and unrolling the loops and checking whether the running time drops, this is the best version I was able to produce:
for (i=0; i<N; i++)
  for (l=0; l<N; l+=2)
    for (j=0; j<N; j+=2)
      for (k=0; k<N; k+=2)
      {
        CC[i][j][k]     += A[l][i][j]*B[l][k];
        CC[i][j][k+1]   += A[l][i][j]*B[l][k+1];
        CC[i][j][k]     += A[l+1][i][j]*B[l+1][k];
        CC[i][j][k+1]   += A[l+1][i][j]*B[l+1][k+1];
        CC[i][j+1][k]   += A[l][i][j+1]*B[l][k];
        CC[i][j+1][k+1] += A[l][i][j+1]*B[l][k+1];
        CC[i][j+1][k]   += A[l+1][i][j+1]*B[l+1][k];
        CC[i][j+1][k+1] += A[l+1][i][j+1]*B[l+1][k+1];
      }
Tensor Size = 128
Base-TensorMult: 113.4 MFLOPS; Time = 4.736 sec
Test-TensorMult: 1863.8 MFLOPS; Time = 0.288 sec
No differences found between base and test versions
You see? It is more than 16 times faster! So we really do need to take caches into account when writing code, even if the result looks ugly :(