Program Profiling Using SystemTap
SystemTap (stap) is both a scripting language and a tool for profiling programs on Linux-based operating systems. Its most important feature is the ability to probe user-land functions, unlike tools such as strace and perf, which offer little or only partial support for user-land probing. You can see what it can do in the latest Linux performance tools graph:
Stap supports C and C++. Java support may be available from RHEL7 according to this. The key is the systemtap-runtime-java package, which I don’t see in the RHEL6 repo. (Yes, I use RHEL6.)
##Install Guide
- Make sure utrace is enabled in your OS. By default, RHEL should have this option enabled, but it is better to double check.
grep CONFIG_UTRACE /boot/config-`uname -r`
- Install the required kernel packages: kernel-debuginfo, kernel-debuginfo-common, and kernel-devel. (Optional if you only use user-land probes.)
- Install debug info for the ls program, which we will use as a test.
debuginfo-install `rpm -qf /bin/ls`
- Test whether the installation works and user-land probing is enabled. If it succeeds, it should print the line “hello world!”
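
The command for this last check is not reproduced above. Here is a minimal sketch of what such a smoke test might look like, assuming stap is on PATH and is run with sufficient privileges (typically root). The probe point and the message are illustrative, not necessarily the exact command originally used:

```shell
#!/bin/sh
# Assumption: probing main() in /bin/ls relies on the debug info for ls installed above.
STAP_SCRIPT='probe process("/bin/ls").function("main") { printf("hello world!\n") }'

if command -v stap >/dev/null 2>&1; then
    # -e runs the inline script; -c starts /bin/ls as the target and stops when it exits.
    stap -e "$STAP_SCRIPT" -c /bin/ls
else
    echo "stap not found; install the systemtap package first" >&2
fi
```

If user-land probing works, "hello world!" should appear alongside the normal output of ls.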
##Usage Guide
You don’t need to know the stap language very well to use it. There are many good resources on GitHub. I mainly use scripts from agentzh, who is very active in performance tuning with stap. Below are the two repositories of his that I use:
Although this repo is mainly for nginx performance tuning, it also contains many general-purpose scripts. Here I use sample-bt and sample-bt-off-cpu to generate flame graphs.
- On-CPU Flame Graph
sample-bt can be used to generate an on-CPU flame graph. This graph shows how your program consumes CPU cycles, function by function. Follow the instructions here.
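
As a rough sketch of the whole workflow, the helper below chains the three steps: sample backtraces, fold them, and render the SVG. It assumes that sample-bt (from agentzh's toolkit repo) and stackcollapse-stap.pl and flamegraph.pl (from the FlameGraph repo) are on PATH. The function name `profile_on_cpu` and its arguments are hypothetical, introduced only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: profile_on_cpu PID SECONDS OUTPUT_PREFIX
profile_on_cpu() {
    pid=$1; seconds=$2; prefix=$3
    # -p: target process id, -t: sampling time in seconds, -u: user-space backtraces
    sample-bt -p "$pid" -t "$seconds" -u > "$prefix.bt" || return 1
    # Fold the raw backtraces into one line per unique stack
    stackcollapse-stap.pl "$prefix.bt" > "$prefix.cbt" || return 1
    # Render the folded stacks as an interactive SVG
    flamegraph.pl "$prefix.cbt" > "$prefix.svg"
}
```

For instance, `profile_on_cpu 8736 5 a` (8736 being a made-up pid) would sample that process for 5 seconds and leave a.bt, a.cbt and a.svg in the current directory. sample-bt itself usually needs to run as root.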
For example, the graph below comes from profiling a redis server. From the graph, you can clearly see what redis is doing and which functions consume the most CPU cycles. Some of the functions may not be optimizable, but most of the time you can find the bottleneck or a program fault. Click a function to follow its stack trace further.
ATTENTION: For C++ programs, we need to demangle the .bt file. Please run cat [bt file] | c++filt -n > [output bt file]
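
To see what the demangling step does, here is a tiny check on a single symbol. `_ZN3Foo3barEv` is a made-up mangled name chosen for illustration (it encodes a member function `bar` of a class `Foo`). c++filt ships with GNU binutils:

```shell
#!/bin/sh
# Hypothetical mangled symbol, not taken from any real backtrace
demangled=$(echo '_ZN3Foo3barEv' | c++filt -n)
echo "$demangled"   # prints Foo::bar()
```

Running the whole .bt file through the same filter rewrites every mangled symbol in place and leaves the other text untouched.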
Download the svg file and open it in your browser to see the interactive graph
- Off-CPU Flame Graph
sample-bt-off-cpu can be used to generate an off-CPU flame graph, which shows where all the blocking (or latency) comes from. The procedure is exactly the same as with sample-bt.
The off-CPU flame graph is a good starting point for latency analysis. I have a personal use case here. After a new feature was pushed to production, we noticed that latency went up sharply. We then generated the off-CPU flame graph of the app and found that one specific function consumed most of the time. Following that lead, we found that something was wrong with our cache layer under high traffic, and finally fixed it. The off-CPU flame graph really helped us quickly pinpoint the root cause, and it also gave us a clear picture of whether the application was doing the right thing.
Here is a flame graph of redis. Download the svg file and open it in your browser to see the interactive graph.
FlameGraph is very powerful for performance tuning, since it gives you an overview of the program without requiring any domain-specific knowledge. Note that FlameGraph is independent of stap; you can use other tools to generate flame graphs. Please check the following links.
###stapxx
stapxx is a macro language built on top of stap, which provides more functionality and is easier to use. For more details, please see the main page. Here I mainly use the func-latency-distr script. Its usage is clearly described here: https://github.com/openresty/stapxx#func-latency-distr
For example, below is a comparison I made using func-latency-distr.sxx to show how large the cross-colo latency is:
Please leave a message if you have any questions. I am glad to help :)