The CPU cache is a small, fast memory unit that sits between the CPU and main memory. Understanding how it works and how to use it well is crucial for high-performance programming.
A good starting point for learning about the CPU cache is “Gallery of Processor Cache Effects” by Igor Ostrovsky, which uses code samples and graphs to show how the CPU cache can affect a program. There is also a Chinese translation of it on the well-known coolshell blog. After reading it, you will have a clear picture of how the cache works and will be able to answer most of the interview questions correctly :)
Since I have been studying system performance analysis recently, I thought it would be better to collect actual metrics about the CPU cache and show some comparison results. Perf is the best tool for this.
Perf is a performance analysis tool on Linux that collects both kernel and userspace events and provides some nice metrics. It is widely used in my team to find bottlenecks in CPU-bound applications.
Let’s see how to use perf to show CPU cache effects. Below is a simple program, modified from Example 6 of Igor’s post.
The logic is pretty simple: create several threads, each of which constantly operates on one position of an array. The positions are passed in as parameters. The question is: does the execution time depend on which positions are visited? Here are some results.
Wow, the results show that visiting 4 contiguous elements is more than 1 second slower than visiting 4 far-apart elements. If you have read the post above, you will understand the reason: in the first case, elements on the same cache line are modified constantly, causing false sharing and therefore degrading performance. It is effectively the same as visiting resources protected by a global lock, according to Herb Sutter’s great explanation.
Well, let’s use perf to see what really happened. Here we use the perf stat command, for example `perf stat -e cache-references,cache-misses,LLC-loads ./a.out 0 1 2 3` (adjust the binary name and positions to your build).
It’s pretty clear, isn’t it? You can see that the second program reduces the cache miss rate from 1.54% to 0.01% and, as a result, dramatically reduces the number of LLC-loads (loads from the last-level CPU cache) by 99.9%, from 4,846,815 to 7,896. Perf tells us everything. Great!
Here is another simple program that shows the performance gain from the cache line padding technique.
The program above may load x and y into the CPU together, since they are likely to be on the same cache line. However, two separate threads constantly visit x and y, causing contention. If you manually add padding around both variables, you can avoid this false sharing effect, which shows up clearly in the perf stat reports.
In summary, understanding how the CPU cache works is very helpful for writing high-performance C++ code, and perf stat can help us verify that our cache line padding technique actually works. You can find all the sample code here. For more details about perf, please refer to Brendan Gregg’s awesome post about it.