Cache matters


In my CS parallelism class homework we were given a simple nested for-loop and asked to make it faster without changing its semantics. In other words, we only consider increasing the cache hit rate. Here is the code:


 
#define N (128)
double A[N][N][N], B[N][N], C[N][N][N], CC[N][N][N];
...

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        for (l = 0; l < N; l++)
          C[i][j][k] += A[l][i][j] * B[l][k];
 

By permuting and unrolling the loops and checking whether the running time dropped, here is the best version I arrived at:

  
for (i = 0; i < N; i++)
  for (l = 0; l < N; l += 2)
    for (j = 0; j < N; j += 2)
      for (k = 0; k < N; k += 2)
      {
        CC[i][j][k]     += A[l][i][j]     * B[l][k];
        CC[i][j][k+1]   += A[l][i][j]     * B[l][k+1];

        CC[i][j][k]     += A[l+1][i][j]   * B[l+1][k];
        CC[i][j][k+1]   += A[l+1][i][j]   * B[l+1][k+1];

        CC[i][j+1][k]   += A[l][i][j+1]   * B[l][k];
        CC[i][j+1][k+1] += A[l][i][j+1]   * B[l][k+1];

        CC[i][j+1][k]   += A[l+1][i][j+1] * B[l+1][k];
        CC[i][j+1][k+1] += A[l+1][i][j+1] * B[l+1][k+1];
      }
 

 
And here are the results (Base is the original code; Test is my modified version):

Tensor Size = 128
Base-TensorMult: 113.4 MFLOPS; Time = 4.736 sec; 
Test-TensorMult: 1863.8 MFLOPS; Time = 0.288 sec;
No differences found between base and test versions
 

You see? It can be more than 16 times faster!! So we really do need to take caches into consideration when we are coding, even though the resulting code can get pretty ugly :(