In this article I will show a cache efficient algorithm for matrix multiplication and the impact of caches on the performance of the algorithms. Matrix multiplication is a very simple algorithm with upper and lower bounded Q(n^{3}) complexity. If you are not familiar with the algorithm I would suggest first to read the Matrix multiplication at Wikipedia. CPU Cache is a low capacity very fast storage which physical location is very close to the CPU. CPU caches make our programs run faster in three different ways:
 The program operations that are executed repeatedly are cached and therefore our program runs faster because the new instructions are not fetched from the slow memory but the cache – this is mainly relevant to a specialized instruction cache and is not in our focus since we will not deal with the complexity optimizations.
 The repeatedly accessed variables in the code are cached and the subsequent access is faster – this is known as temporal locality.
 Variables that are located next to each other in memory, such as array elements, are accessed faster. When the first array element is accessed for the first time in the program, that element together with the subsequent few array elements are brought into the cache – this is known as spatial locality.
For more information about how cache is implemented to provide the above features you can refer again to the Wikipedia entry CPU Cache or a very nice article by Jon Stokes titled Understanding CPU caching and performance.
The Matrix Multiplication Algorithms
Now, after a brief overview, let's return back to our matrix multiplication algorithm. Here I have two algorithms:
 The well known STANDARD algorithm for matrix multiplication that most of us are taught at high school.
 An OPTIMIZED algorithm, which gets advantage from the spatial locality of the array members (i.e. the (3) way caches make programs run faster) by traversing the matrices' elements in a different order than in the STANDARD algorithm.
After multiplying matrices A_{m,n} and B_{n,k} each entry c_{i,j} in the resultant matrix C_{m,k} is:
c_{i,j}=a_{i,0}.b_{0,j} + a_{i,1}.b_{1,j} + .. + a_{i,n}.b_{n,j}, where (0 ≤ i ≤ m) and (0 ≤ i ≤ k).
The Standard Algorithm
Below is the standard algorithm for multiplying 2 square matrices. It resembles the formula above by computing each element in matrix C in by traversing the matrix C only one time.

The Optimized Algorithm
Below is the standard algorithm for multiplying 2 square matrices. This algorithm incrementally computes the elements in matrix C by iterating on matrix C N^{2} times. The two algorithms are different only at line 7. The optimized algorithm traverses matrix B along columns, thus benefiting from the spatial locality. Also, it traverses matrix C multiple times thus benefiting from temporal locality of L2.
1: for (k = 0; k < n; k++) 2: { 3: for (i = 0; i < n; i++) 4: { 5: for (j = 0; j < n; j++) 6: { 7: c[i][j] = c[i][j] + a[i][k]*b[k][j]; 8: } 9: } 10: } 
Performance Results
The table below shows the performance of the two algorithms for different sizes of the matrices. For simplicity, here we assume that matrices are square. Also the table shows the cache misses obtained with VTune profiling tool. The characteristics of the computer I run the application are 1.66GHz Dual Core CPU, with 2MB L2 cache per core and 32KB L1 instruction and data cache. I compiled the source using the Microsoft Visual C++ compiler. The L1 miss rate is per retired instruction and L2 miss rate is per referenced data.
Size 
MB 
Standard 
Optimized 

Time 
L1 Miss 
L2 Miss 
Time 
L1 Miss 
L2 Miss 

128×128 
64K 
0.1 
5.5% 
0% 
0.09 
0.7% 
0% 
256×256 
256K 
0.23 
7.3% 
0% 
0.2 
0.5% 
0% 
512×512 
1MB 
1.5 
7.1% 
0.1% 
1.5 
0.5% 
0.2% 
1024×1024 
4MB 
44.2 
7.2% 
12.5% 
7.9 
0.5% 
0.6% 
2048×2048 
16MB 
394.7 
10.1% 
12.5% 
63.5 
0.5% 
0.6% 
Why the Optimized Algorithm is Faster?
Now let's try to explain why the optimized algorithm runs faster. From the table above, we see that for small numbers the performance hurt is not evident. The reason for this is that matrices are small enough to fit almost all in the L1 cache and all in L2 cache.
1^{st} major reason for the performance penalty with big matrices in the standard algorithm comes from L1 cache. The standard algorithm traverses matrix B along its columns and this way not benefiting from spatial locality in the L1 cache. 2^{nd} performance penalty comes from L2 cache. The standard algorithm does not benefit both from the spatial locality because it traverses the matrix B along its columns and whenever a noncached row is accessed it should be cached (note this is valid for both L1 and L2). Moreover, the standard matrix does not benefit from the temporal locality as it accesses each element of matrix C only once (not repeatedly). Although the element being computed element (c[k][j]) is stored in a register obviously this doesn't contribute to the performance at all.
The optimized algorithm is not as optimal on traversing matrix A, as is in the standard algorithm but the gained performance from the other optimizations makes it negligible.
Conclusion
From this small example we can see that, besides the theoretical complexity dimension as studied and analyzed in the "Introduction to Algorithms", the delivered performance of algorithms depends also on the micro architectural details of our computers. And to write performance efficient programs we should know the underlying architectural details. This algorithm for example could be further refined to benefit from the cache line sizes but this would make it microarchitectural (CPU) dependent and thus run slower in other processors.
Downloads
The source code of the matrices is available here: MatrixMuliplay.
In the code give second free is missing.
I wonder if this be applied to more efficient algorithms like Strassen’s matrix multiplication algorithm…
Ben, this algorithm, is cache efficient because of the roworder traversal. I am not very familiar with the Strassen’s matrix multiplication algorithm but it should be possible to use similar technique during the recursion while multiplying the submatrices.
I am getting TIMER_T is not defined error