Caliper - Java Micro-Benchmark Helper

Anyone who has faced a performance challenge has run into the problem of correctly measuring the execution time of a method invocation, a loop iteration, etc. The first solution that comes to mind is to call System.nanoTime() before and after the execution and subtract the two readings to get the result. But are we sure that the simplest answer is also the best one? What about comparing multiple measurements? What about the impact of the JIT compiler?
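To see the pitfalls, here is a minimal sketch of that naive approach (doWork() is just a hypothetical placeholder for the code under test):

  public class NaiveTimer {
      public static void main(String[] args) {
          long start = System.nanoTime();
          doWork(); // the code we want to measure
          long elapsed = System.nanoTime() - start;
          System.out.println("Took " + elapsed + " ns");
      }

      // hypothetical workload standing in for the real code under test
      private static void doWork() {
          StringBuilder sb = new StringBuilder();
          for (int i = 0; i < 10000; i++) {
              sb.append(i);
          }
      }
  }

A single reading like this is dominated by JIT compilation, class loading and GC noise - exactly the effects a benchmarking framework is supposed to factor out.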

Fortunately, a new king is in town - Caliper. It solves all these problems and protects us from other common mistakes. It’s not my intention to write yet another tutorial (those available in Caliper’s source repository are good enough to start using this framework). Instead, let's go into the details of how to interpret the obtained results and how to choose the right parameters for a benchmark.


The first important thing that Caliper does is build the full Cartesian product of scenarios from:

  • defined virtual machines
  • benchmark methods (for time measurements they must be public and their names must start with “time”)
  • user parameters (fields annotated by @Param)

Then, for each element of that set, Caliper starts a new JVM - that’s how it prevents the test execution order from influencing the results.
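To make this concrete, here is a minimal benchmark sketch in the style of the examples from Caliper's repository (the class name, the parameter and the workload are mine, based on the old 0.x API):

  import com.google.caliper.Param;
  import com.google.caliper.Runner;
  import com.google.caliper.SimpleBenchmark;

  public class StringBuilderBenchmark extends SimpleBenchmark {

      // user parameter - Caliper creates a scenario for each value
      @Param({"10", "100", "1000"}) private int length;

      // time method: public, name starts with "time",
      // Caliper passes in the number of repetitions to execute
      public void timeAppend(int reps) {
          for (int i = 0; i < reps; i++) {
              StringBuilder sb = new StringBuilder();
              for (int j = 0; j < length; j++) {
                  sb.append(j);
              }
          }
      }

      public static void main(String[] args) {
          Runner.main(StringBuilderBenchmark.class, args);
      }
  }

With three parameter values and one time method, Caliper will run three scenarios, each in a freshly forked JVM.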

The second thing is the warm-up. There are two reasons for doing it:

  • to allow the JIT to make all its optimizations before the proper measurement starts
  • to estimate the time needed to execute a single iteration of the test method

We can define the warm-up time with the --warmupMillis parameter.

Another important start parameter is --runMillis. But it’s very important to understand how Caliper treats a “run” in this case. First, Caliper performs three measurement cycles with different duration scales:
  • first takes 1.0 * runMillis
  • second takes 0.5 * runMillis
  • third takes 1.5 * runMillis
After these cycles, the framework verifies the quality of the results using a simple rule: the standard deviation has to be less than 1% of the average. As long as this tolerance is not met, Caliper can perform up to 7 additional measurements to reach the acceptable threshold. That explains why 2000 ms of warm-up plus 5000 ms of fixed run time can add up to 30 seconds of total test time: the three base cycles alone take 3.0 * runMillis = 15 s per scenario, and the warm-up and any additional measurements come on top of that :)
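For illustration, an invocation combining these flags could look like this (the classpath and benchmark class are hypothetical; the flags are the ones discussed above):

  java -cp caliper.jar:. com.google.caliper.Runner StringBuilderBenchmark \
      --warmupMillis 2000 --runMillis 5000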
When the test runs are finished, it’s time to interpret the outcome. Omitting some obvious output, we get a line like this:

21042,13 ns; σ=568,81 ns @ 10 trials

The first element is the median of the results (here 21042.13 ns). The second part is the standard deviation (568.81 ns). And finally, after the @ sign, we have the number of performed test runs (here 10 => 3 base runs + 7 additional ones because of the poor distribution). Note that these trials have nothing in common with the --trials parameter; the latter defines the number of full suite runs.

Last but not least, and very useful especially for tuning the durations of the individual phases, is the combination of the --captureVmLog and --saveResults filename.json parameters. It results in the creation of the filename.json file, which contains the raw measurements and (most importantly) the JIT compilation messages emitted during the different phases of the test execution (warm-up run and proper run).
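An invocation using them could look like this (again, the benchmark class and the file name are just examples):

  java -cp caliper.jar:. com.google.caliper.Runner StringBuilderBenchmark \
      --captureVmLog --saveResults results.json

Inspecting the JIT messages in the saved file is the easiest way to check whether the warm-up phase was long enough for all compilations to finish before the proper run.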

I hope this article will be useful and will save you some time when investigating how Caliper works.

Some useful links:
Caliper's home page: https://code.google.com/p/caliper/
Examples: http://code.google.com/p/caliper/source/browse/examples/src/main/java/examples/
