| A good benchmark suite... | CI services... |
|---|---|
| runs on the same hardware every time | provide a different (standardized) machine for each run |
| runs on dedicated hardware | run on shared resources |
| runs on a frozen OS configuration | update their VM images often |
| requires renting or acquiring such hardware | are free |
| requires authentication mechanisms | implement authentication out of the box |
| requires hardware that can be abused through public PRs | are designed for public PRs |
`scikit-image`, the project that commissioned this task, uses Airspeed Velocity, or `asv`, for their benchmark tests. `asv`'s main feature is being able to track performance measurements over time in a JSON database and generate beautiful dashboards that can be published to a static site server like GitHub Pages. For an example, look at the reports for `pandas` at their speed.pydata.org website.
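As a rough sketch, this is the usual `asv` reporting workflow behind those dashboards (run from a repository that already has an `asv` setup; not specific to CI):

```bash
asv run        # run the benchmark suite and store the timings as JSON results
asv publish    # build the static HTML dashboard from the stored results
asv preview    # serve the generated dashboard locally for inspection
```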
`asv` also has a special subcommand named `continuous`, which might provide the functionality we needed. The help message says:

> Run a side-by-side comparison of two commits for continuous integration.
When you run `asv continuous A B`, `asv` will create at least two† virtual environments (one per commit) and install revisions A and B in those, respectively. If the project involves compiled libraries (as is the case with `scikit-image`), this can be a lengthy process! `asv` will run the suite four times (`A->B->A->B`)! This is done to account for the unavoidable deviations from ideality caused by co-running processes, as mentioned above. If the performance ratio between the two revisions exceeds a given threshold (1.2 by default), an error is emitted.

† `asv` supports the notion of a configuration matrix, so you can test your code under different environments; e.g. NumPy versions, Python interpreters, etc.
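Coming back to the `asv continuous A B` call: in a pull request build, A would be the target branch and B the proposed changes. A minimal sketch (the branch names and the explicit `--factor` value are illustrative):

```bash
# Compare the PR head (HEAD) against the target branch (origin/main here).
# --factor sets the ratio above which a benchmark is flagged as a regression.
asv continuous --factor 1.2 origin/main HEAD
```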
For `scikit-image`, as of June/July 2021, this ends up taking up to two hours. This raises two questions, which we will answer in the following sections: are the measurements reliable enough to be useful, and can we make the runs faster?
To check the reliability, we repeatedly ran the comparison on the same code, so every performance ratio should be 1.0. In other words, performance should be the same. Of course, these are not ideal conditions, so some kind of error is expected. We just want it to stay reliably under an acceptable threshold. Check the GitHub Actions workflow in this fork!

Ideally, every ratio would be exactly 1.0. We know this will not happen, but maybe the errors are not that big and stay like that regardless of the time of the day or the day of the week. Plotting the ratios collected from many runs shows most of them sitting right at y=1. Of course, not all of them are there, but a surprising number of them are! Every ratio that exceeds the threshold is a false positive: `asv` would report a performance regression when, in fact, there's none. However, in the observed measurements, the outliers were always within y ∈ (0.5, 1.4). That means we can affirm that the method is sensitive enough to detect performance regressions of 50% or more! This is good enough for our project and, in fact, some projects might even be happy with a threshold of 2.0.

If you are curious about how we automatically downloaded the artifacts, parsed the output, and plotted the performance ratios, check the Jupyter Notebook here.
`asv` runs several passes and repeats to reduce the measurement error, but maybe some of those default counter-measures are not needed. Namely:

- The benchmark runs with `--interleave-processes`, but it can be disabled with `--no-interleave-processes`. The help message for this flag says: "Interleave benchmarks with multiple processes across commits. This can avoid measurement biases from commit ordering, can take longer." How much longer? Does it help keep the error under control? We should measure that.
- By default, all tests are run several times, with different schedules. There are two benchmark attributes that govern these settings: `processes` and `repeat`. `processes` defaults to `2`, which means that the full suite will be run twice per commit. If we only do one pass (`processes=1`), we will cut the running time in half, but will we lose too much accuracy? Both options are sketched right after this list.
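As a rough command-line sketch of these options (branch names are illustrative, and the `-a`/`--attribute` override for benchmark attributes is an assumption about the installed `asv` version):

```bash
# Default strategy: interleaved processes, two passes per commit.
asv continuous --factor 1.2 origin/main HEAD

# "No interleaving": same two passes, but without alternating between commits.
asv continuous --factor 1.2 --no-interleave-processes origin/main HEAD

# "Single pass": only one process per commit (assumes the -a/--attribute
# override is available in the installed asv version).
asv continuous --factor 1.2 -a processes=1 origin/main HEAD
```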
The table below summarizes the results for each strategy and the corresponding `asv` command-line options. %FP is the percentage of benchmarks flagged as false positives, and Min/Max/Mean/Std are statistics over the measured performance ratios.

| Strategy | Runtime | %FP | Min | Max | Mean | Std |
|---|---|---|---|---|---|---|
| Default | 1h55 | 3.7 | 0.51 | 1.36 | 1.00 | 0.05 |
| No interleaving | 1h39 | 9.99 | 0.43 | 1.50 | 0.99 | 0.07 |
| Single pass | 1h07 | 12.5 | 0.51 | 2.76 | 1.01 | 0.07 |
We also considered running several single-pass replicas in parallel using different GitHub Actions jobs. In theory, a false positive could be spotted by comparing the values of the failing tests across the other replicas; only true positives would reliably appear in all of them. However, this consumes more CI resources (compilation happens several times) and is noisier for maintainers, who would need to check all replicas. Not to mention that more replicas increase the chances for more false positives!
`asv` will also spend some time setting up virtual environments and installing the project in them. Since installing `scikit-image` involves compiling some extensions, this can add up to a non-trivial amount of time.

Since the environments are created with `conda`, we replaced the `conda` calls with a faster implementation called `mamba`. We rely on an `asv` implementation detail: to find `conda`, `asv` will first check the value of the `CONDA_EXE` environment variable. This is normally set by `conda activate <env>`, but we overwrite it with the path to `mamba` to have `asv` use it instead.
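A minimal sketch of that override (the branch names are again illustrative; the `CONDA_EXE` lookup is the implementation detail described above):

```bash
# asv checks CONDA_EXE before searching for conda on its own, so pointing
# it at mamba makes the environments get created with mamba instead.
export CONDA_EXE="$(command -v mamba)"

# Subsequent asv calls now build their conda environments through mamba.
asv continuous --factor 1.2 origin/main HEAD
```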
We also use `ccache` to keep the unchanged compiled libraries around between runs. Check the workflow file to see how it can be implemented on GitHub Actions.

By default, the `on.pull_request` trigger is configured to act on three event types: `[opened, reopened, synchronize]`.
In practice, this means the workflow runs after every push to the PR, or after closing and reopening it. However, there are more triggers! One of them is `labeled`. This means that the workflow will be triggered whenever the PR is tagged with a label. To specify which label(s) are able to trigger the workflow, you can use an `if` clause at the `job` level, like this:

```yaml
name: Benchmark

on:
  pull_request:
    types: [labeled]

jobs:
  benchmark:
    if: ${{ github.event.label.name == 'run-benchmark' && github.event_name == 'pull_request' }}
    name: Linux
    runs-on: ubuntu-20.04
    ...
```
With this configuration, the benchmarks will only run when the PR is tagged with the label `run-benchmark`. This works surprisingly well as a manual trigger! It is also restricted to authorized users (triaging permissions or above), so there is no need to fiddle with authentication tokens or similar complications.
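For example, a maintainer could add the label from the command line with the GitHub CLI (the PR number below is just a placeholder):

```bash
# Adding the label fires the "labeled" pull_request event; the workflow's
# `if` clause then checks for the run-benchmark label before running.
gh pr edit 1234 --add-label run-benchmark
```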
Full runs of the benchmark suite should be done in a comparative way (e.g. with `asv continuous`). This takes a bit more time, but you can speed it up a bit with `mamba` and `ccache` for compiled libraries. Even in that case, it is probably overkill to run it for every push event, so we are using the `on.pull_request.labeled` trigger to let the maintainers decide when to run `asv` on demand.