Installation¶
Direct Inclusion¶
Download nanobench.h from the release and make it available in your project. Create a .cpp file, e.g. nanobench.cpp, where the bulk of nanobench is compiled.

nanobench.cpp¶

```cpp
#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>
```
Compile it, e.g. with g++ -O3 -I../include -c nanobench.cpp. This compiles the bulk of nanobench; on my machine it took 2.4 seconds. It only needs to be recompiled when you upgrade nanobench.
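Every other translation unit in your project then just includes the header; only nanobench.cpp defines ANKERL_NANOBENCH_IMPLEMENT. A minimal sketch (the file and function names here are hypothetical, not part of nanobench):

```cpp
// my_benchmarks.cpp - hypothetical additional translation unit.
// Do NOT define ANKERL_NANOBENCH_IMPLEMENT here; the implementation is
// already compiled into nanobench.o, which you link against.
#include <nanobench.h>

#include <cstdint>

void runMyBenchmarks() {
    uint64_t x = 1;
    ankerl::nanobench::Bench().run("++x", [&] {
        ankerl::nanobench::doNotOptimizeAway(++x);
    });
}
```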
CMake Integration¶
nanobench can be integrated with CMake's FetchContent or as a git submodule. Here is a full example of how this can be done:
```cmake
cmake_minimum_required(VERSION 3.14)

set(CMAKE_CXX_STANDARD 17)

project(
    CMakeNanobenchExample
    VERSION 1.0
    LANGUAGES CXX)

include(FetchContent)

FetchContent_Declare(
    nanobench
    GIT_REPOSITORY https://github.com/martinus/nanobench.git
    GIT_TAG v4.1.0
    GIT_SHALLOW TRUE)

FetchContent_MakeAvailable(nanobench)

add_executable(MyExample my_example.cpp)
target_link_libraries(MyExample PRIVATE nanobench)
```
Usage¶
Create the actual benchmark code in full_example.cpp:

full_example.cpp¶

```cpp
#include <nanobench.h>

#include <atomic>

int main() {
    int y = 0;
    std::atomic<int> x(0);
    ankerl::nanobench::Bench().run("compare_exchange_strong", [&] {
        x.compare_exchange_strong(y, 0);
    });
}
```
The most important entry point is ankerl::nanobench::Bench. It creates a benchmarking object, optionally configures it, and then runs the code to benchmark with run().

Compile and link the example with g++ -O3 -I../include nanobench.o full_example.cpp -o full_example. This takes just 0.28 seconds on my machine.
Run ./full_example, which gives an output like this:

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 5.63 | 177,595,338.98 | 0.0% | 3.00 | 17.98 | 0.167 | 1.00 | 0.1% | 0.00 | `compare_exchange_strong`
Which renders as:

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 5.63 | 177,595,338.98 | 0.0% | 3.00 | 17.98 | 0.167 | 1.00 | 0.1% | 0.00 | compare_exchange_strong |
This means that a single x.compare_exchange_strong(y, 0); call takes 5.63ns on my machine (wall-clock time), or ~178 million operations per second. Runtime fluctuates by around 0.0%, so the results are very stable. Each call required 3 instructions, which took ~18 CPU cycles. There was a single branch per call, with only 0.1% mispredicted.
Nanobench does not come with a test runner, so you can easily use it with any framework you like. In the remaining examples, I’m using doctest as a unit test framework.
Note

CPU statistics like instructions, cycles, branches, and branch misses are only available on Linux, through perf events. On some systems you might need to change permissions through perf_event_paranoid or use ACLs.
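If perf events are not available, or you simply don't want to rely on them, you can turn counter collection off explicitly with Bench::performanceCounters(); the comparison example later in this document uses the same setter to turn it on. A minimal sketch, assuming that disabling it just omits the counter-based columns:

```cpp
#include <nanobench.h>

#include <cstdint>

int main() {
    uint64_t x = 1;
    // Assumption: performanceCounters(false) disables perf-event based
    // hardware counters, so only wall-clock based columns are reported.
    ankerl::nanobench::Bench()
        .performanceCounters(false)
        .run("++x", [&] {
            ankerl::nanobench::doNotOptimizeAway(++x);
        });
}
```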
Examples¶
Something Fast¶
Let's benchmark how fast we can do ++x for a uint64_t:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

TEST_CASE("tutorial_fast_v1") {
    uint64_t x = 1;
    ankerl::nanobench::Bench().run("++x", [&]() {
        ++x;
    });
}
```
After 0.2ms we get this output:
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| - | - | - | - | :boom: `++x` (iterations overflow. Maybe your code got optimized away?)
No data there! We only get :boom: iterations overflow. The compiler could optimize ++x away because we never use the result. Thanks to doNotOptimizeAway, this is easy to fix:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

TEST_CASE("tutorial_fast_v2") {
    uint64_t x = 1;
    ankerl::nanobench::Bench().run("++x", [&]() {
        ankerl::nanobench::doNotOptimizeAway(x += 1);
    });
}
```
This time the benchmark runs for 2.2ms and we actually get reasonable data:
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 0.31 | 3,192,444,232.50 | 0.0% | 1.00 | 1.00 | 0.998 | 0.00 | 0.0% | 0.00 | `++x`
It's a very stable result. In one run I get 3,192 million op/s, the next time I execute it I get 3,168 million op/s. On my machine it always takes 1.00 instructions per operation, and can do this in about one cycle.
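When one invocation of the benchmark body processes several elements, you can tell nanobench the batch size so that ns/op and op/s are reported per element instead of per invocation; the batch value also appears later in the CSV and JSON templates as {{batch}}. A minimal sketch, assuming Bench::batch() sets exactly that factor:

```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <cstdint>
#include <vector>

TEST_CASE("tutorial_batch_sketch") {
    std::vector<uint64_t> values(1000, 1);
    uint64_t sum = 0;

    // One lambda invocation touches 1000 elements; batch(values.size())
    // scales the reported numbers to a single element.
    ankerl::nanobench::Bench()
        .batch(values.size())
        .run("sum 1000 values", [&] {
            for (auto v : values) {
                sum += v;
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
        });
}
```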
Something Slow¶
Let’s benchmark if sleeping for 100ms really takes 100ms.
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <chrono>
#include <thread>

TEST_CASE("tutorial_slow_v1") {
    ankerl::nanobench::Bench().run("sleep 100ms, auto", [&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });
}
```
After 1.1 seconds I get:
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------------------
| 100,125,753.00 | 9.99 | 0.0% | 51.00 | 7,714.00 | 0.007 | 11.00 | 90.9% | 1.10 | `sleep 100ms, auto`
So sleeping actually takes 100.125ms instead of 100ms. The next time I run it, I get 100.141ms. Also a very stable result. Interestingly, the sleep takes 51 instructions but 7,714 cycles, so we only get 0.007 instructions per cycle. That's extremely low, but expected of sleep. It also requires 11 branches, of which 90.9% were mispredicted on average.

If the 1.1 seconds of runtime is too much for you, you can manually configure the number of evaluations (epochs):
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <chrono>
#include <thread>

TEST_CASE("tutorial_slow_v2") {
    ankerl::nanobench::Bench().epochs(3).run("sleep 100ms", [&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });
}
```
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 100,099,096.00 | 9.99 | 0.0% | 51.00 | 7,182.00 | 0.007 | 11.00 | 90.9% | 0.30 | `sleep 100ms`
This time it took only 0.3 seconds, with only 3 evaluations instead of 11. The err% will be less meaningful, but since the benchmark is so stable it doesn't really matter.
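Instead of pinning the number of epochs, you could also cap the time budget per epoch. This is a sketch only; it assumes Bench::maxEpochTime() (the maxEpochTime field that shows up in the JSON template below) accepts a std::chrono duration:

```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <chrono>
#include <thread>

TEST_CASE("tutorial_slow_v3_sketch") {
    // Assumption: maxEpochTime() bounds how long a single epoch may run,
    // which keeps the total runtime low for slow benchmarks.
    ankerl::nanobench::Bench()
        .maxEpochTime(std::chrono::milliseconds(150))
        .run("sleep 100ms, capped epoch time", [&] {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        });
}
```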
Something Unstable¶
Let's create an extreme, artificial test that's hard to benchmark because the runtime fluctuates randomly: each iteration performs a random number (0-255) of rng() calls:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <random>

TEST_CASE("tutorial_fluctuating_v1") {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    ankerl::nanobench::Bench().run("random fluctuations", [&] {
        // each run, perform a random number of rng calls
        auto iterations = rng() & UINT64_C(0xff);
        for (uint64_t i = 0; i < iterations; ++i) {
            ankerl::nanobench::doNotOptimizeAway(rng());
        }
    });
}
```
After 2.3ms, I get this result:
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 334.12 | 2,992,911.53 | 6.3% | 3,486.44 | 1,068.67 | 3.262 | 287.86 | 0.7% | 0.00 | :wavy_dash: `random fluctuations` (Unstable with ~56.7 iters. Increase `minEpochIterations` to e.g. 567)
So on average each loop takes about 334.12ns, but we get a warning that the results are unstable. The median percentage error of 6.3% is quite high.
Let’s use the suggestion and set the minimum number of iterations to 5000, and try again:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <random>

TEST_CASE("tutorial_fluctuating_v2") {
    std::random_device dev;
    std::mt19937_64 rng(dev());
    ankerl::nanobench::Bench().minEpochIterations(5000).run(
        "random fluctuations", [&] {
            // each run, perform a random number of rng calls
            auto iterations = rng() & UINT64_C(0xff);
            for (uint64_t i = 0; i < iterations; ++i) {
                ankerl::nanobench::doNotOptimizeAway(rng());
            }
        });
}
```
The fluctuations are now much smaller:
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 277.31 | 3,606,106.48 | 0.7% | 3,531.75 | 885.18 | 3.990 | 291.59 | 0.7% | 0.00 | `random fluctuations`
The results are more stable, with only 0.7% error.
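If you prefer fully deterministic measurement effort over letting nanobench choose, you can also pin the exact number of iterations per epoch. A sketch, assuming Bench::epochIterations() (the epochIterations field in the JSON template below) sets an exact per-epoch iteration count:

```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <random>

TEST_CASE("tutorial_fluctuating_v3_sketch") {
    std::random_device dev;
    std::mt19937_64 rng(dev());

    // Assumption: epochIterations() fixes the iteration count per epoch
    // instead of letting nanobench pick it adaptively.
    ankerl::nanobench::Bench().epochIterations(5000).run(
        "random fluctuations, fixed iterations", [&] {
            auto iterations = rng() & UINT64_C(0xff);
            for (uint64_t i = 0; i < iterations; ++i) {
                ankerl::nanobench::doNotOptimizeAway(rng());
            }
        });
}
```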
Comparing Results¶
I have implemented a comparison of multiple random number generators. Here several RNGs are compared to a baseline calculated from std::default_random_engine. I factored out the general benchmarking code so it’s easy to use for each of the random number generators:
```cpp
namespace {

// Benchmarks how fast we can get 64bit random values from Rng.
template <typename Rng>
void bench(ankerl::nanobench::Bench* bench, char const* name) {
    std::random_device dev;
    Rng rng(dev());
    bench->run(name, [&]() {
        auto r = std::uniform_int_distribution<uint64_t>{}(rng);
        ankerl::nanobench::doNotOptimizeAway(r);
    });
}

} // namespace

TEST_CASE("example_random_number_generators") {
    // perform a few warmup calls, and since the runtime is not always stable
    // for each generator, increase the number of epochs to get more accurate
    // numbers.
    ankerl::nanobench::Bench b;
    b.title("Random Number Generators")
        .unit("uint64_t")
        .warmup(100)
        .relative(true);
    b.performanceCounters(true);

    // sets the first one as the baseline
    bench<std::default_random_engine>(&b, "std::default_random_engine");
    bench<std::mt19937>(&b, "std::mt19937");
    bench<std::mt19937_64>(&b, "std::mt19937_64");
    bench<std::ranlux24_base>(&b, "std::ranlux24_base");
    bench<std::ranlux48_base>(&b, "std::ranlux48_base");
    bench<std::ranlux24>(&b, "std::ranlux24");
    bench<std::ranlux48>(&b, "std::ranlux48");
    bench<std::knuth_b>(&b, "std::knuth_b");
    bench<WyRng>(&b, "WyRng");
    bench<NasamRng>(&b, "NasamRng");
    bench<Sfc4>(&b, "Sfc4");
    bench<RomuTrio>(&b, "RomuTrio");
    bench<RomuDuo>(&b, "RomuDuo");
    bench<RomuDuoJr>(&b, "RomuDuoJr");
    bench<Orbit>(&b, "Orbit");
    bench<ankerl::nanobench::Rng>(&b, "ankerl::nanobench::Rng");
}
```
It runs for 60ms and prints this table:
| relative | ns/uint64_t | uint64_t/s | err% | ins/uint64_t | cyc/uint64_t | IPC | bra/uint64_t | miss% | total | Random Number Generators
|---------:|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:-------------------------
| 100.0% | 35.87 | 27,881,924.28 | 2.3% | 127.80 | 114.61 | 1.115 | 9.77 | 3.7% | 0.00 | `std::default_random_engine`
| 490.3% | 7.32 | 136,699,693.21 | 0.6% | 89.55 | 23.49 | 3.812 | 9.51 | 0.1% | 0.00 | `std::mt19937`
| 1,767.4% | 2.03 | 492,786,582.33 | 0.6% | 24.38 | 6.48 | 3.761 | 1.26 | 0.6% | 0.00 | `std::mt19937_64`
| 85.2% | 42.08 | 23,764,853.03 | 0.7% | 157.07 | 134.62 | 1.167 | 19.51 | 7.6% | 0.00 | `std::ranlux24_base`
| 121.3% | 29.56 | 33,824,759.51 | 0.5% | 91.03 | 94.35 | 0.965 | 10.00 | 8.1% | 0.00 | `std::ranlux48_base`
| 17.4% | 205.67 | 4,862,080.59 | 1.2% | 709.83 | 657.10 | 1.080 | 101.79 | 16.1% | 0.00 | `std::ranlux24`
| 8.7% | 412.46 | 2,424,497.97 | 1.8% | 1,514.70 | 1,318.43 | 1.149 | 219.09 | 16.7% | 0.00 | `std::ranlux48`
| 59.2% | 60.60 | 16,502,276.18 | 1.9% | 253.77 | 193.39 | 1.312 | 24.93 | 1.5% | 0.00 | `std::knuth_b`
| 5,187.1% | 0.69 | 1,446,254,071.66 | 0.1% | 6.00 | 2.21 | 2.714 | 0.00 | 0.0% | 0.00 | `WyRng`
| 1,431.7% | 2.51 | 399,177,833.54 | 0.0% | 21.00 | 8.01 | 2.621 | 0.00 | 0.0% | 0.00 | `NasamRng`
| 2,629.9% | 1.36 | 733,279,957.30 | 0.1% | 13.00 | 4.36 | 2.982 | 0.00 | 0.0% | 0.00 | `Sfc4`
| 3,815.7% | 0.94 | 1,063,889,655.17 | 0.0% | 11.00 | 3.01 | 3.661 | 0.00 | 0.0% | 0.00 | `RomuTrio`
| 3,529.5% | 1.02 | 984,102,081.37 | 0.3% | 9.00 | 3.25 | 2.768 | 0.00 | 0.0% | 0.00 | `RomuDuo`
| 4,580.4% | 0.78 | 1,277,113,402.06 | 0.0% | 7.00 | 2.50 | 2.797 | 0.00 | 0.0% | 0.00 | `RomuDuoJr`
| 2,291.2% | 1.57 | 638,820,992.09 | 0.0% | 11.00 | 5.00 | 2.200 | 0.00 | 0.0% | 0.00 | `ankerl::nanobench::Rng`
It shows that ankerl::nanobench::Rng is one of the fastest RNGs and has the least fluctuation. It takes only 1.57ns to generate a random uint64_t, so ~638 million calls per second are possible. The leftmost column shows relative performance compared to std::default_random_engine.
Note
Here pure runtime performance is not necessarily the best benchmark. Especially the fastest RNGs can be inlined and use instruction-level parallelism to their advantage: they immediately return an old state, and while user code can already use that value, the next value is calculated in parallel. See the excellent paper at romu-random for details.
Asymptotic Complexity¶
It is possible to calculate asymptotic complexity (Big O) from multiple runs of a benchmark. Run the benchmark with different complexity N, then nanobench can calculate the best fitting curve.
The following example finds out the asymptotic complexity of std::set's find().
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <iostream>
#include <set>

TEST_CASE("tutorial_complexity_set_find") {
    // Create a single benchmark instance that is used in multiple benchmark
    // runs, with different settings for complexityN.
    ankerl::nanobench::Bench bench;

    // a RNG to generate input data
    ankerl::nanobench::Rng rng;

    std::set<uint64_t> set;

    // Running the benchmark multiple times, with different number of elements
    for (auto setSize :
         {10U, 20U, 50U, 100U, 200U, 500U, 1000U, 2000U, 5000U, 10000U}) {

        // fill up the set with random data
        while (set.size() < setSize) {
            set.insert(rng());
        }

        // Run the benchmark, provide setSize as the scaling variable.
        bench.complexityN(set.size()).run("std::set find", [&] {
            ankerl::nanobench::doNotOptimizeAway(set.find(rng()));
        });
    }

    // calculate BigO complexity best fit and print the results
    std::cout << bench.complexityBigO() << std::endl;
}
```
The loop runs the benchmark 10 times, with different set sizes from 10 to 10k.
Note
Each of the 10 benchmark runs automatically scales the number of iterations so results are still fast and accurate. In total the whole test takes about 90ms.
The Bench object holds the benchmark results of the 10 benchmark runs. Each benchmark is recorded with a different setting for complexityN.
After the benchmark has printed its results, we calculate and print the best fitting Big O for the most important complexity functions.
std::cout << bench.complexityBigO() << std::endl;
prints e.g. this markdown table:
| coefficient | err% | complexity
|--------------:|-------:|------------
| 6.66562e-09 | 29.1% | O(log n)
| 1.47588e-11 | 58.3% | O(n)
| 1.10742e-12 | 62.6% | O(n log n)
| 5.15683e-08 | 63.8% | O(1)
| 1.40387e-15 | 78.7% | O(n^2)
| 1.32792e-19 | 85.7% | O(n^3)
The table is sorted, best fitting complexity function first. So \(\mathcal{O}(\log{}n)\) provides the best approximation for the complexity. Interestingly, the error for \(\mathcal{O}(n)\) is not much larger, which can be an indication that even though the red-black tree should theoretically have logarithmic complexity, in practice that is not perfectly the case.
Rendering Mustache-like Templates¶
Nanobench comes with a powerful Mustache-like template mechanism to process the benchmark results into all kinds of formats. You can find a full description of all possible tags at ankerl::nanobench::render().

Several preconfigured formats exist in the namespace ankerl::nanobench::templates. Rendering these templates can be done with either ankerl::nanobench::render(), or directly with ankerl::nanobench::Bench::render().
The following example shows how to use the CSV (Comma-Separated Values) template while disabling nanobench's default output:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <atomic>
#include <iostream>

TEST_CASE("tutorial_render_simple") {
    std::atomic<int> x(0);

    ankerl::nanobench::Bench()
        .output(nullptr)
        .run("std::vector",
             [&] {
                 ++x;
             })
        .render(ankerl::nanobench::templates::csv(), std::cout);
}
```
We call Bench::output() with nullptr, thus disabling nanobench's default output. After the benchmark we directly call Bench::render() with the CSV template and write the rendered output to std::cout. When running, we get just the CSV output on the console, which looks like this:
"title";"name";"unit";"batch";"elapsed";"error %";"instructions";"branches";"branch misses";"total"
"benchmark";"std::vector";"op";1;6.51982200647249e-09;8.26465858909014e-05;23.0034662045061;5;0.00116867939228672;0.000171959
Nanobench comes with a few preconfigured templates, residing in the namespace ankerl::nanobench::templates. To demonstrate what these templates can do, here is a simple example that benchmarks the two random number generators std::mt19937_64 and std::knuth_b and writes both the template and the rendered output to files:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <fstream>
#include <random>

namespace {

void gen(std::string const& typeName, char const* mustacheTemplate,
         ankerl::nanobench::Bench const& bench) {

    std::ofstream templateOut("mustache.template." + typeName);
    templateOut << mustacheTemplate;

    std::ofstream renderOut("mustache.render." + typeName);
    ankerl::nanobench::render(mustacheTemplate, bench, renderOut);
}

} // namespace

TEST_CASE("tutorial_mustache") {
    ankerl::nanobench::Bench bench;
    bench.title("Benchmarking std::mt19937_64 and std::knuth_b");

    std::mt19937_64 rng1;
    bench.run("std::mt19937_64", [&] {
        ankerl::nanobench::doNotOptimizeAway(rng1());
    });

    std::knuth_b rng2;
    bench.run("std::knuth_b", [&] {
        ankerl::nanobench::doNotOptimizeAway(rng2());
    });

    gen("json", ankerl::nanobench::templates::json(), bench);
    gen("html", ankerl::nanobench::templates::htmlBoxplot(), bench);
    gen("csv", ankerl::nanobench::templates::csv(), bench);
}
```
CSV - Comma-Separated Values¶
The function ankerl::nanobench::templates::csv() provides this template:
```
"title";"name";"unit";"batch";"elapsed";"error %";"instructions";"branches";"branch misses";"total"
{{#result}}"{{title}}";"{{name}}";"{{unit}}";{{batch}};{{median(elapsed)}};{{medianAbsolutePercentError(elapsed)}};{{median(instructions)}};{{median(branchinstructions)}};{{median(branchmisses)}};{{sumProduct(iterations, elapsed)}}
{{/result}}
```
This generates a compact CSV file where entries are separated by a semicolon (;). Running the example, I get this output:
```
"title";"name";"unit";"batch";"elapsed";"error %";"instructions";"branches";"branch misses";"total"
"Benchmarking std::mt19937_64 and std::knuth_b";"std::mt19937_64";"op";1;2.54441805225653e-08;0.0236579384033733;125.989678899083;16.7645714285714;0.564133016627078;0.000218811
"Benchmarking std::mt19937_64 and std::knuth_b";"std::knuth_b";"op";1;3.19013867488444e-08;0.00091350764819687;170.013008130081;28;0.0031104199066874;0.000217248
```
Rendered as CSV table:

| title | name | unit | batch | elapsed | error % | instructions | branches | branch misses | total |
|---|---|---|---|---|---|---|---|---|---|
| Benchmarking std::mt19937_64 and std::knuth_b | std::mt19937_64 | op | 1 | 2.54441805225653e-08 | 0.0236579384033733 | 125.989678899083 | 16.7645714285714 | 0.564133016627078 | 0.000218811 |
| Benchmarking std::mt19937_64 and std::knuth_b | std::knuth_b | op | 1 | 3.19013867488444e-08 | 0.00091350764819687 | 170.013008130081 | 28 | 0.0031104199066874 | 0.000217248 |
Note that the CSV template doesn’t provide all the data that is available.
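If you need columns that the CSV template leaves out, you can pass your own template string to ankerl::nanobench::render(). Here is a sketch that adds cycle and branch-miss columns; all tags used also appear in the JSON template shown further below, but the exact column selection is just an illustration:

```cpp
#include <nanobench.h>

#include <iostream>
#include <random>

int main() {
    std::mt19937_64 rng;
    ankerl::nanobench::Bench bench;
    bench.run("std::mt19937_64", [&] {
        ankerl::nanobench::doNotOptimizeAway(rng());
    });

    // Same mechanism as templates::csv(), with two extra columns.
    char const* myTemplate =
        "\"name\";\"elapsed\";\"cpucycles\";\"branch misses\"\n"
        "{{#result}}\"{{name}}\";{{median(elapsed)}};{{median(cpucycles)}};"
        "{{median(branchmisses)}}\n{{/result}}";

    ankerl::nanobench::render(myTemplate, bench, std::cout);
}
```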
HTML Box Plots¶
With the template ankerl::nanobench::templates::htmlBoxplot() you get a plotly-based HTML output that generates a boxplot of the runtimes. The template is rather simple:
```html
<html>
<head>
<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
</head>
<body>
<div id="myDiv"></div>
<script>
var data = [
{{#result}}{
name: '{{name}}',
y: [{{#measurement}}{{elapsed}}{{^-last}}, {{/last}}{{/measurement}}],
},
{{/result}}
];
var title = '{{title}}';
data = data.map(a => Object.assign(a, { boxpoints: 'all', pointpos: 0, type: 'box' }));
var layout = { title: { text: title }, showlegend: false, yaxis: { title: 'time per unit', rangemode: 'tozero', autorange: true } };
Plotly.newPlot('myDiv', data, layout, {responsive: true});
</script>
</body>
</html>
```
This generates an interactive boxplot that gives a nice visual showcase of the runtime performance of the evaluated benchmarks. Each epoch is visualized as a dot, and the boxplot itself shows median, percentiles, and outliers. You might want to increase the default number of epochs for an even better visualization.
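For example, here is a sketch that raises the epoch count and writes the boxplot to an HTML file (the output file name is just an illustration):

```cpp
#include <nanobench.h>

#include <fstream>
#include <random>

int main() {
    std::mt19937_64 rng;

    ankerl::nanobench::Bench bench;
    // More epochs mean more dots in the boxplot, giving a better picture
    // of the runtime distribution.
    bench.epochs(100).run("std::mt19937_64", [&] {
        ankerl::nanobench::doNotOptimizeAway(rng());
    });

    std::ofstream out("boxplot.html");
    bench.render(ankerl::nanobench::templates::htmlBoxplot(), out);
}
```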
JSON - JavaScript Object Notation¶
The ankerl::nanobench::templates::json() template gives you everything: all data that is available, from all runs. The template is therefore quite complex:
```
{
"results": [
{{#result}} {
"title": "{{title}}",
"name": "{{name}}",
"unit": "{{unit}}",
"batch": {{batch}},
"complexityN": {{complexityN}},
"epochs": {{epochs}},
"clockResolution": {{clockResolution}},
"clockResolutionMultiple": {{clockResolutionMultiple}},
"maxEpochTime": {{maxEpochTime}},
"minEpochTime": {{minEpochTime}},
"minEpochIterations": {{minEpochIterations}},
"epochIterations": {{epochIterations}},
"warmup": {{warmup}},
"relative": {{relative}},
"median(elapsed)": {{median(elapsed)}},
"medianAbsolutePercentError(elapsed)": {{medianAbsolutePercentError(elapsed)}},
"median(instructions)": {{median(instructions)}},
"medianAbsolutePercentError(instructions)": {{medianAbsolutePercentError(instructions)}},
"median(cpucycles)": {{median(cpucycles)}},
"median(contextswitches)": {{median(contextswitches)}},
"median(pagefaults)": {{median(pagefaults)}},
"median(branchinstructions)": {{median(branchinstructions)}},
"median(branchmisses)": {{median(branchmisses)}},
"totalTime": {{sumProduct(iterations, elapsed)}},
"measurements": [
{{#measurement}} {
"iterations": {{iterations}},
"elapsed": {{elapsed}},
"pagefaults": {{pagefaults}},
"cpucycles": {{cpucycles}},
"contextswitches": {{contextswitches}},
"instructions": {{instructions}},
"branchinstructions": {{branchinstructions}},
"branchmisses": {{branchmisses}}
}{{^-last}},{{/-last}}
{{/measurement}} ]
}{{^-last}},{{/-last}}
{{/result}} ]
}
```
This also gives the data from each separate epoch (see ankerl::nanobench::Bench::epochs()), not just the accumulated data as in the CSV template.
```json
{
"results": [
{
"title": "Benchmarking std::mt19937_64 and std::knuth_b",
"name": "std::mt19937_64",
"unit": "op",
"batch": 1,
"complexityN": -1,
"epochs": 11,
"clockResolution": 1.8e-08,
"clockResolutionMultiple": 1000,
"maxEpochTime": 0.1,
"minEpochTime": 0,
"minEpochIterations": 1,
"warmup": 0,
"relative": 0,
"median(elapsed)": 2.54441805225653e-08,
"medianAbsolutePercentError(elapsed)": 0.0236579384033733,
"median(instructions)": 125.989678899083,
"medianAbsolutePercentError(instructions)": 0.035125448044942,
"median(cpucycles)": 81.3479809976247,
"median(contextswitches)": 0,
"median(pagefaults)": 0,
"median(branchinstructions)": 16.7645714285714,
"median(branchmisses)": 0.564133016627078,
"totalTime": 0.000218811,
"measurements": [
{
"iterations": 875,
"elapsed": 2.54708571428571e-08,
"pagefaults": 0,
"cpucycles": 81.472,
"contextswitches": 0,
"instructions": 125.885714285714,
"branchinstructions": 16.7645714285714,
"branchmisses": 0.574857142857143
},
{
"iterations": 809,
"elapsed": 2.58467243510507e-08,
"pagefaults": 0,
"cpucycles": 82.5290482076638,
"contextswitches": 0,
"instructions": 128.771322620519,
"branchinstructions": 17.0296662546354,
"branchmisses": 0.582200247218789
},
{
"iterations": 737,
"elapsed": 2.24097693351425e-08,
"pagefaults": 0,
"cpucycles": 71.6431478968792,
"contextswitches": 0,
"instructions": 118.374491180461,
"branchinstructions": 15.9470827679783,
"branchmisses": 0.417910447761194
},
{
"iterations": 872,
"elapsed": 2.53405963302752e-08,
"pagefaults": 0,
"cpucycles": 80.9896788990826,
"contextswitches": 0,
"instructions": 125.989678899083,
"branchinstructions": 16.7580275229358,
"branchmisses": 0.563073394495413
},
{
"iterations": 834,
"elapsed": 2.59256594724221e-08,
"pagefaults": 0,
"cpucycles": 82.7661870503597,
"contextswitches": 0,
"instructions": 127.635491606715,
"branchinstructions": 16.9352517985612,
"branchmisses": 0.575539568345324
},
{
"iterations": 772,
"elapsed": 2.25310880829016e-08,
"pagefaults": 0,
"cpucycles": 72.0129533678757,
"contextswitches": 0,
"instructions": 117.108808290155,
"branchinstructions": 15.8341968911917,
"branchmisses": 0.405440414507772
},
{
"iterations": 842,
"elapsed": 2.54441805225653e-08,
"pagefaults": 0,
"cpucycles": 81.3479809976247,
"contextswitches": 0,
"instructions": 127.266033254157,
"branchinstructions": 16.8859857482185,
"branchmisses": 0.564133016627078
},
{
"iterations": 792,
"elapsed": 2.20126262626263e-08,
"pagefaults": 0,
"cpucycles": 70.3623737373737,
"contextswitches": 0,
"instructions": 116.420454545455,
"branchinstructions": 15.7588383838384,
"branchmisses": 0.396464646464646
},
{
"iterations": 757,
"elapsed": 2.63870541611625e-08,
"pagefaults": 0,
"cpucycles": 84.332892998679,
"contextswitches": 0,
"instructions": 131.462351387054,
"branchinstructions": 17.334214002642,
"branchmisses": 0.618229854689564
},
{
"iterations": 850,
"elapsed": 2.23305882352941e-08,
"pagefaults": 0,
"cpucycles": 71.3505882352941,
"contextswitches": 0,
"instructions": 114.629411764706,
"branchinstructions": 15.5823529411765,
"branchmisses": 0.392941176470588
},
{
"iterations": 774,
"elapsed": 2.60607235142119e-08,
"pagefaults": 0,
"cpucycles": 83.1679586563308,
"contextswitches": 0,
"instructions": 130.576227390181,
"branchinstructions": 17.2635658914729,
"branchmisses": 0.590439276485788
}
]
},
{
"title": "Benchmarking std::mt19937_64 and std::knuth_b",
"name": "std::knuth_b",
"unit": "op",
"batch": 1,
"complexityN": -1,
"epochs": 11,
"clockResolution": 1.8e-08,
"clockResolutionMultiple": 1000,
"maxEpochTime": 0.1,
"minEpochTime": 0,
"minEpochIterations": 1,
"warmup": 0,
"relative": 0,
"median(elapsed)": 3.19013867488444e-08,
"medianAbsolutePercentError(elapsed)": 0.00091350764819687,
"median(instructions)": 170.013008130081,
"medianAbsolutePercentError(instructions)": 4.11992392254248e-06,
"median(cpucycles)": 101.973254086181,
"median(contextswitches)": 0,
"median(pagefaults)": 0,
"median(branchinstructions)": 28,
"median(branchmisses)": 0.0031104199066874,
"totalTime": 0.000217248,
"measurements": [
{
"iterations": 568,
"elapsed": 3.2137323943662e-08,
"pagefaults": 0,
"cpucycles": 102.55985915493,
"contextswitches": 0,
"instructions": 170.014084507042,
"branchinstructions": 28,
"branchmisses": 0.00528169014084507
},
{
"iterations": 576,
"elapsed": 3.19305555555556e-08,
"pagefaults": 0,
"cpucycles": 102.059027777778,
"contextswitches": 0,
"instructions": 170.013888888889,
"branchinstructions": 28,
"branchmisses": 0.00347222222222222
},
{
"iterations": 643,
"elapsed": 3.18973561430793e-08,
"pagefaults": 0,
"cpucycles": 101.973561430793,
"contextswitches": 0,
"instructions": 170.012441679627,
"branchinstructions": 28,
"branchmisses": 0.0031104199066874
},
{
"iterations": 591,
"elapsed": 3.1912013536379e-08,
"pagefaults": 0,
"cpucycles": 101.944162436548,
"contextswitches": 0,
"instructions": 170.013536379019,
"branchinstructions": 28,
"branchmisses": 0.00169204737732657
},
{
"iterations": 673,
"elapsed": 3.19049034175334e-08,
"pagefaults": 0,
"cpucycles": 101.973254086181,
"contextswitches": 0,
"instructions": 170.011887072808,
"branchinstructions": 28,
"branchmisses": 0.00297176820208024
},
{
"iterations": 649,
"elapsed": 3.19013867488444e-08,
"pagefaults": 0,
"cpucycles": 101.850539291217,
"contextswitches": 0,
"instructions": 170.012326656394,
"branchinstructions": 28,
"branchmisses": 0.00308166409861325
},
{
"iterations": 606,
"elapsed": 3.18547854785479e-08,
"pagefaults": 0,
"cpucycles": 101.83498349835,
"contextswitches": 0,
"instructions": 170.013201320132,
"branchinstructions": 28,
"branchmisses": 0.0033003300330033
},
{
"iterations": 650,
"elapsed": 3.18769230769231e-08,
"pagefaults": 0,
"cpucycles": 101.898461538462,
"contextswitches": 0,
"instructions": 170.012307692308,
"branchinstructions": 28,
"branchmisses": 0.00307692307692308
},
{
"iterations": 615,
"elapsed": 3.18520325203252e-08,
"pagefaults": 0,
"cpucycles": 101.858536585366,
"contextswitches": 0,
"instructions": 170.013008130081,
"branchinstructions": 28,
"branchmisses": 0.0032520325203252
},
{
"iterations": 579,
"elapsed": 3.18618307426598e-08,
"pagefaults": 0,
"cpucycles": 101.989637305699,
"contextswitches": 0,
"instructions": 170.013816925734,
"branchinstructions": 28,
"branchmisses": 0.00345423143350604
},
{
"iterations": 657,
"elapsed": 3.19558599695586e-08,
"pagefaults": 0,
"cpucycles": 102.229832572298,
"contextswitches": 0,
"instructions": 170.012176560122,
"branchinstructions": 28,
"branchmisses": 0.0030441400304414
}
]
}
]
}
```
pyperf - Python pyperf module Output¶
Pyperf is a powerful tool for benchmarking and system tuning, and it can also analyze benchmark results. The pyperf template generates output that can be used for further analysis with pyperf.
Note
Pyperf supports only a single benchmark result per generated output, so it is best to create a new Bench object for each benchmark.
The template looks like this. Note that it directly makes use of {{#measurement}}, which is only possible when there is a single result in the benchmark.
```
{
"benchmarks": [
{
"runs": [
{
"values": [
{{#measurement}} {{elapsed}}{{^-last}},
{{/last}}{{/measurement}}
]
}
]
}
],
"metadata": {
"loops": {{sum(iterations)}},
"inner_loops": {{batch}},
"name": "{{title}}",
"unit": "second"
},
"version": "1.0"
}
```
Here is an example that generates pyperf compatible output for a benchmark that shuffles a vector:
```cpp
#include <nanobench.h>
#include <thirdparty/doctest/doctest.h>

#include <algorithm>
#include <fstream>
#include <random>
#include <vector>

TEST_CASE("shuffle_pyperf") {
    std::vector<uint64_t> data(500, 0); // input data for shuffling

    std::default_random_engine defaultRng(123);
    auto fout = std::ofstream("pyperf_shuffle_std.json");
    ankerl::nanobench::Bench()
        .epochs(100)
        .run("std::shuffle with std::default_random_engine",
             [&]() {
                 std::shuffle(data.begin(), data.end(), defaultRng);
             })
        .render(ankerl::nanobench::templates::pyperf(), fout);

    fout = std::ofstream("pyperf_shuffle_nanobench.json");
    ankerl::nanobench::Rng rng(123);
    ankerl::nanobench::Bench()
        .epochs(100)
        .run("ankerl::nanobench::Rng::shuffle",
             [&]() {
                 rng.shuffle(data);
             })
        .render(ankerl::nanobench::templates::pyperf(), fout);
}
```
This benchmark run creates the two files pyperf_shuffle_std.json and pyperf_shuffle_nanobench.json.

Here are some of the analyses you can do:
Show Benchmark Statistics¶
Output from python3 -m pyperf stats pyperf_shuffle_std.json:
Total duration: 364 ms
Raw value minimum: 3.57 ms
Raw value maximum: 4.21 ms
Number of calibration run: 0
Number of run with values: 1
Total number of run: 1
Number of warmup per run: 0
Number of value per run: 100
Loop iterations per value: 100
Total number of values: 100
Minimum: 35.7 us
Median +- MAD: 36.2 us +- 0.2 us
Mean +- std dev: 36.4 us +- 0.9 us
Maximum: 42.1 us
0th percentile: 35.7 us (-2% of the mean) -- minimum
5th percentile: 35.8 us (-2% of the mean)
25th percentile: 36.1 us (-1% of the mean) -- Q1
50th percentile: 36.2 us (-0% of the mean) -- median
75th percentile: 36.4 us (+0% of the mean) -- Q3
95th percentile: 36.7 us (+1% of the mean)
100th percentile: 42.1 us (+16% of the mean) -- maximum
Number of outlier (out of 35.6 us..36.9 us): 4
Show a Histogram¶
It’s often interesting to see a histogram, especially to visually find out if there are outliers involved.
Running python3 -m pyperf hist pyperf_shuffle_std.json produces this output:
35.7 us: 21 ######################################
36.0 us: 33 ############################################################
36.3 us: 37 ###################################################################
36.6 us: 5 #########
36.9 us: 0 |
37.2 us: 1 ##
37.5 us: 0 |
37.8 us: 0 |
38.1 us: 0 |
38.4 us: 0 |
38.7 us: 0 |
39.0 us: 0 |
39.3 us: 0 |
39.6 us: 1 ##
39.9 us: 0 |
40.2 us: 0 |
40.5 us: 1 ##
40.8 us: 0 |
41.1 us: 0 |
41.5 us: 0 |
41.8 us: 0 |
42.1 us: 1 ##
Compare Results¶
We have generated two results in the above example, and we can compare them easily with python3 -m pyperf compare_to pyperf_shuffle_std.json pyperf_shuffle_nanobench.json:
+-----------+--------------------+------------------------------+
| Benchmark | pyperf_shuffle_std | pyperf_shuffle_nanobench |
+===========+====================+==============================+
| benchmark | 36.4 us | 11.2 us: 3.24x faster (-69%) |
+-----------+--------------------+------------------------------+
For more information on pyperf's analysis capabilities, please see pyperf - Analyze benchmark results.