Benchmarks

I've done a little optimisation between 0.2 and 0.3; until now I'd never had time to see how fast Tapper was, or could be. After all, it's just spawning and message passing, and both are fast in Erlang, right?

The design of Tapper assumed this would be fast: it avoids doing much processing on the client side of the API, apart from a little parameter checking, and uses only GenServer.cast/2, never GenServer.call/3, so the client never blocks on the tracer. But what did we achieve?
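
For illustration, a minimal sketch of the fire-and-forget pattern (a hypothetical module, not Tapper's actual internals): the client casts to the server and returns immediately, so tracing never blocks the caller.

    defmodule Sketch.Tracer do
      use GenServer

      def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, %{events: []}, opts)

      # Client side: fire-and-forget; returns :ok immediately, without
      # waiting for the server to process the message.
      def record(server, event), do: GenServer.cast(server, {:record, event})

      @impl true
      def init(state), do: {:ok, state}

      @impl true
      def handle_cast({:record, event}, state) do
        # All the expensive work happens here, off the caller's critical path.
        {:noreply, %{state | events: [event | state.events]}}
      end
    end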

Findings

The 0.2.x release, on my 3.10 GHz i7 MacBook Pro, could do ~18k start/finish sampled trace operations per second, adding about 52 μs of overhead to your code; a little more with annotations, and less if you join rather than start traces, since joining doesn't need to generate a trace id. Note that the same pair of operations for an unsampled trace takes about 3 μs (~300k/s). I also noticed that update_span/3 wasn't short-circuiting unsampled traces: it ran at less than half the speed of the other operations when unsampled!

The additional overhead, compared with the >200k raw spawn/sends per second my hardware can do, comes from various places: some in GenServer and OTP, which you can't avoid without giving up the advantages of using OTP, and the rest in function calls in Tapper's code. You can't avoid generating the Tapper.Id on start/1, you can't avoid generating a decent monotonic time-stamp, and you can't avoid adding the Logger metadata for the trace id (assuming you want it); that leaves just parameter checking and other miscellaneous code to improve.
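
Roughly, that unavoidable client-side work looks like this (a sketch, not Tapper's actual code; the Logger metadata key is an assumption):

    trace_id  = :rand.uniform(0xFFFFFFFFFFFFFFFF)    # generate a random id
    timestamp = System.monotonic_time(:microsecond)  # decent monotonic timestamp
    Logger.metadata(tapper_id: trace_id)             # attach the trace id to Logger metadata
    {trace_id, timestamp}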

From this baseline, I've tried a few things:

  • Removed debug logging during Tracer start-up; although Logger can purge this at compile time, it was superfluous anyway.
  • Removed use of the Access module via [], preferring direct use of Keyword.get/3.
  • Removed multiple traversals of Keyword lists in favour of a tailored function.

Rewriting to avoid multiple traversals made more difference than merely avoiding indirection: several Keyword.get/3 calls and two function calls became a single function call making one traversal, doing some list appending and a bunch of pattern matches. The gain is fairly small, but it also consolidates the options-processing code in a pleasing way.
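
The shape of the change, sketched with hypothetical option names (not the actual Tapper code): a single recursive pass pattern-matches each known option and accumulates the result, instead of calling Keyword.get/3 once per option.

    defmodule Sketch.Opts do
      @defaults %{name: "unknown", sample: false, debug: false}

      # One traversal: pattern-match each known option, accumulating
      # into the defaults; unknown options are skipped.
      def process(opts), do: process(opts, @defaults)

      defp process([], acc), do: acc
      defp process([{:name, v} | rest], acc), do: process(rest, %{acc | name: v})
      defp process([{:sample, v} | rest], acc), do: process(rest, %{acc | sample: v})
      defp process([{:debug, v} | rest], acc), do: process(rest, %{acc | debug: v})
      defp process([_other | rest], acc), do: process(rest, acc)
    end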

0.3.0 achieves ~20k start/finish sampled trace operations per second on my hardware, adding about 49 μs per pair of operations. More could be gained by moving all the option checking/defaulting to the server code, at the expense of error locality; by using macros for annotations; and by API changes, such as not using a keyword list in start/1 for the sample and debug flags, which the client needs to know for the sample/no-sample optimisation (this enables unsampled traces to short-circuit, a much more significant boost). Alternatively, the whole thing could be re-coded outside of OTP/GenServer. Do submit your PR! :)
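
For reference, this is the client-side usage in question; option and function names follow the Tapper docs, but treat the details as illustrative:

    # sample/debug are passed in start/1's keyword list, so the client knows
    # up-front whether the trace is sampled and can short-circuit when it isn't.
    id = Tapper.start(name: "main", sample: true, debug: false)
    id = Tapper.start_span(id, name: "child")
    id = Tapper.finish_span(id)
    Tapper.finish(id)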

Results for Tapper 0.3.0

Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Number of Available Cores: 4
Available memory: 17.179869184 GB
Elixir 1.4.4
Erlang 19.3

##### With input sampled #####
Name                                                   ips        average  deviation         median
start, finish                                      20.69 K       48.34 μs    ±13.96%       47.00 μs
child span                                         15.54 K       64.36 μs    ±18.68%       64.00 μs
child span, contextual interface                   15.43 K       64.83 μs    ±18.38%       66.00 μs
child span with some annotations                   15.42 K       64.84 μs    ±26.10%       71.00 μs
child span with some annotations, via update       13.88 K       72.07 μs    ±27.96%       77.00 μs

##### With input unsampled #####
Name                                                   ips        average  deviation         median
child span                                        311.53 K        3.21 μs   ±446.82%        3.00 μs
start, finish                                     306.84 K        3.26 μs   ±459.05%        3.00 μs
child span, contextual interface                  291.43 K        3.43 μs   ±398.55%        3.00 μs
child span with some annotations                  297.41 K        3.36 μs   ±351.59%        3.00 μs
child span with some annotations, via update      280.81 K        3.56 μs   ±481.33%        3.00 μs

Note that the deviation for unsampled traces is so high because there's really very little to measure, so any jitter makes a big difference.

Results for Tapper 0.5.0

March 2019
Tapper 0.5.0
commit: 46d0ebafc64e4cbd5be01ad457405889b885311e

The previous benchmark suite didn't include a representation of encoding the trace to HTTP headers, which is a typical client activity, and one that might be optimised, as it's currently doing a number-to-hex conversion every time.

For this reason the benchmarks now include a child span, with destructuring benchmark, which makes a Tapper.Id.destructure/1 call within a child span; you can see how this negatively affects the IPS.
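
The destructuring benchmark approximates what a client does when propagating a trace downstream, e.g. building B3-style HTTP headers inside a child span. A sketch, assuming a destructured tuple of ids and flags (check Tapper.Id.destructure/1's docs for the exact shape):

    # Inside a child span, destructure the id and encode B3-style headers
    # for the outgoing call (tuple shape and header handling illustrative):
    id = Tapper.start_span(id, name: "call-out")

    {trace_id, span_id, parent_span_id, sample, debug} = Tapper.Id.destructure(id)

    headers = [
      {"x-b3-traceid", trace_id},
      {"x-b3-spanid", span_id},
      {"x-b3-parentspanid", parent_span_id},
      {"x-b3-sampled", if(sample, do: "1", else: "0")},
      {"x-b3-flags", if(debug, do: "1", else: "0")}
    ]

    # ... make the HTTP call with `headers`, then:
    id = Tapper.finish_span(id)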

We also now benchmark unsampled spans, which are critical to median performance when sampling, since being unsampled turns off much of the functionality.

Finally, there's now a benchmark for decoding trace headers. This is implemented in tapper_plug, but most of the hard work is done by Tapper.TraceId.parse/1 and Tapper.SpanId.parse/1, so we combine those to mirror the work done.
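
A sketch of that combined decode step, assuming parse/1 returns {:ok, value} | :error (check the docs for the exact shape), and that trace_header/span_header hold the raw incoming header values:

    # Decode incoming B3 headers, paralleling tapper_plug's work:
    with {:ok, trace_id} <- Tapper.TraceId.parse(trace_header),
         {:ok, span_id}  <- Tapper.SpanId.parse(span_header) do
      {:ok, {trace_id, span_id}}
    else
      :error -> :error
    end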

With these extra benchmarks we have some more realistic outcomes, and some more targets for optimisation.

The benchmarking software, Erlang, and Elixir have changed too, and we now have (possibly ineffective) OS patches for Meltdown and Spectre, which slow nearly everything down.

Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.8.1
Erlang 21.2.4

##### With input sampled #####
Name                                                   ips        average  deviation         median         99th %
decode_trace_headers                              323.43 K        3.09 μs   ±957.70%        2.97 μs        6.97 μs
start, finish                                      16.39 K       61.00 μs    ±31.11%       54.97 μs      131.97 μs
child span                                         14.31 K       69.89 μs    ±32.82%       59.97 μs      147.97 μs
child span, contextual interface                   13.58 K       73.62 μs    ±34.93%       62.97 μs      159.97 μs
child span with some annotations                   16.38 K       61.04 μs    ±31.78%       55.97 μs      148.97 μs
child span with some annotations, via update       15.00 K       66.69 μs    ±30.78%       60.97 μs      150.90 μs
child span, with destructuring                     10.17 K       98.36 μs    ±30.09%       84.97 μs      199.97 μs

##### With input unsampled #####
Name                                                   ips        average  deviation         median         99th %
decode_trace_headers                              322.25 K        3.10 μs   ±857.30%        2.97 μs        6.97 μs
start, finish                                     211.33 K        4.73 μs   ±413.71%        3.97 μs        9.97 μs
child span                                        208.11 K        4.81 μs   ±465.87%        3.97 μs       10.97 μs
child span, contextual interface                  173.16 K        5.78 μs   ±219.45%        4.97 μs       12.97 μs
child span with some annotations                  176.54 K        5.66 μs   ±257.49%        4.97 μs       11.97 μs
child span with some annotations, via update      187.28 K        5.34 μs   ±374.50%        4.97 μs       11.97 μs
child span, with destructuring                     53.68 K       18.63 μs    ±34.07%       17.97 μs       33.97 μs

Results for Tapper 0.6.0

April 2019
Tapper 0.6.0
commit: 1aa4884648fd6cc826a90912e94318809719d39b

Tapper now keeps the trace ids as binaries rather than integers. This means that decoding or encoding trace headers no longer has to convert an integer to/from hex format every time, which should reduce real-world overhead. Significant effort has been put into optimising the generation and parsing of the ids, applying many of the binary pattern-matching tricks from the core Elixir Base.encode16/2 and Integer.parse/1 functions, but optimised further for this specific use-case. Yes, I've looked at the BEAM code, and it was good. ☺️
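
To give the flavour of those tricks (a sketch, not Tapper's actual code): generating one function clause per hex digit lets the BEAM dispatch on the binary directly, avoiding intermediate lists and per-call arithmetic.

    defmodule Sketch.Hex do
      # Decode a (lowercase) hex binary to an integer one byte at a time:
      # a generated clause per digit pattern-matches directly on the binary.
      def decode(bin) when is_binary(bin), do: decode(bin, 0)

      defp decode(<<>>, acc), do: acc

      for {digit, value} <- Enum.with_index(~c"0123456789abcdef") do
        defp decode(<<unquote(digit), rest::binary>>, acc),
          do: decode(rest, acc * 16 + unquote(value))
      end
    end

    # Sketch.Hex.decode("7b") #=> 123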

Note that if you were relying on directly interpreting the previous Tapper.TraceId or Tapper.SpanId internal representations outside of the official Tapper API functions, your code may break!

The benchmark below shows the significant improvement in decoding trace headers and destructuring the Tapper.Id over the previous version, while other benchmarks remain stable.

Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.8.1
Erlang 21.2.4

##### With input sampled #####
Name                                                   ips        average  deviation         median         99th %
decode_trace_headers                              535.67 K        1.87 μs  ±1556.89%        1.90 μs        3.90 μs
start, finish                                      15.92 K       62.80 μs    ±31.60%       57.90 μs      136.90 μs
child span                                         14.45 K       69.22 μs    ±33.90%       59.90 μs      149.90 μs
child span, contextual interface                   14.05 K       71.15 μs    ±32.57%       60.90 μs      153.90 μs
child span with some annotations                   16.31 K       61.31 μs    ±31.21%       56.90 μs      142.90 μs
child span with some annotations, via update       15.04 K       66.49 μs    ±31.12%       60.90 μs      149.90 μs
child span, with destructuring                     14.18 K       70.54 μs    ±33.18%       60.90 μs      151.90 μs

##### With input unsampled #####
Name                                                   ips        average  deviation         median         99th %
decode_trace_headers                              544.80 K        1.84 μs   ±996.76%        1.90 μs        3.90 μs
start, finish                                     206.53 K        4.84 μs   ±426.57%        3.90 μs       10.90 μs
child span                                        196.10 K        5.10 μs   ±392.41%        4.90 μs       11.90 μs
child span, contextual interface                  174.98 K        5.72 μs   ±306.46%        4.90 μs       12.90 μs
child span with some annotations                  171.34 K        5.84 μs   ±396.31%        4.90 μs       12.90 μs
child span with some annotations, via update      167.21 K        5.98 μs   ±364.75%        4.90 μs       13.90 μs
child span, with destructuring                    179.33 K        5.58 μs   ±270.04%        4.90 μs       11.90 μs

Some Performance Tips

To get the last drop out of the current implementation:

  • Ensure that you are not over-sampling: using debug or a high sampling rate will add to the median processing time; unsampled traces have extremely low overhead.
  • Send annotations bundled with start/1 or start_span/2 if you can, rather than adding them with update_span/2, to avoid the overhead of extra calls; at the least, batch annotations into a single update_span/2 call rather than calling it multiple times (see the sketch after this list).
  • Use literals for annotations rather than the helper functions, e.g. :cr rather than Tapper.client_receive(), to avoid the function-call overhead.
  • The contextual API (Tapper.Ctx) is a little bit slower, due to the additional process dictionary look-up/store for each operation.
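
Putting these tips together (the annotations: option and return values are assumed per the Tapper docs; treat the details as illustrative):

    # Bundle literal annotations (:cs) with span creation, rather than
    # issuing a separate update_span call afterwards:
    id = Tapper.start_span(id, name: "fetch", annotations: [:cs])

    # If you must annotate later, make one batched update_span call with
    # literals (:cr), not several single-annotation calls via helpers:
    id = Tapper.update_span(id, [:cr])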

Running Benchmarks

Run the benchmarks with the following command:

MIX_ENV=bench mix bench

The config/bench.exs config just sets the logging level to :error.

Running the profiler

To locate optimisation opportunities, I used fprof, outside of Benchee, calling the test function in a loop over a range.

Run fprof with:

FPROF=1 MIX_ENV=bench mix profile.fprof --callers benchmarking/tapper_bench.exs