Tutorial: your first co-run experiment
This walkthrough takes you end to end: you'll measure an all-to-all communication benchmark running alone, then under interference from a bursty "noise" aggressor, and finally visualize the difference. It uses the bundled Blink suite (CRAB's fully supported benchmark), so everything here is runnable, not pseudocode.
By the end you'll have touched every part of the framework: crab setup, an experiment config
with a victim and an aggressor, crab run, the results layout, and crab export.
Prerequisites
- CRAB installed and its environment activated (source .venv/bin/activate).
- Access to a Slurm cluster with an allocation you can submit to. crab run submits via sbatch, so this tutorial can't run on a plain laptop.
- A preset for your cluster. The commands below use leonardo as the example; substitute your own preset name. If your cluster isn't defined yet, create a preset first.
It helps to skim System-dependent vs system-independent and the victim/aggressor idea first, but you can follow along without them.
Step 1 — Build the benchmark
CRAB needs the Blink binaries on your cluster. Run the crab setup wizard for your preset.
Select the Blink Suite, then choose Build from source (the wizard clones and compiles it,
streaming the build log). When it finishes it writes a receipt to
config/environments/blink.json recording where the binaries live — the wrappers read this at run
time. See Installation & benchmark setup for
the other strategies (e.g. pointing at an existing build).
Step 2 — Write the experiment config
Create tutorial.json. It defines two experiments that run back-to-back: a baseline (the
victim alone) and with_noise (the same victim plus a bursty aggressor). They share one job, so
the comparison is apples-to-apples.
{
  "global_options": {
    "name": "tutorial_a2a",
    "numnodes": "8",
    "ppn": "1",
    "allocationmode": "p",
    "partitionsplit": "50:50",
    "partitionlayout": "l",
    "minruns": "5",
    "maxruns": "10",
    "timeout": "1200.0",
    "outformat": "csv",
    "sbatch_directives": { "time": "00:20:00" }
  },
  "experiments": {
    "baseline": {
      "description": "All-to-all victim alone on half the nodes — the reference.",
      "apps": {
        "0": { "path": "blink/a2a_comm_only.py", "args": "-msgsize 64 -iter 2000",
               "collect": true, "start": "0", "end": "", "partition": 0 }
      }
    },
    "with_noise": {
      "description": "Same victim, now with a bursty all-to-all aggressor on the other half.",
      "apps": {
        "0": { "path": "blink/a2a_comm_only.py", "args": "-msgsize 64 -iter 2000",
               "collect": true, "start": "0", "end": "", "partition": 0 },
        "1": { "path": "blink/bursty_noise_a2a.py", "args": "0.001 0.01",
               "collect": false, "start": "0", "end": "f", "partition": 1 }
      }
    }
  }
}
What the key fields mean (full list in the Configuration schema):
- numnodes: 8, allocationmode: "p", and partitionsplit: "50:50" allocate 8 nodes and split them into two equal partitions. The victim lives in partition 0, the aggressor in partition 1, so each gets 4 dedicated nodes.
- App 0 is the victim, with collect: true (its metrics are recorded) and end: "" (CRAB waits for it to finish naturally). In baseline it's the only app, so its half runs undisturbed; the other half sits idle. That's deliberate: the victim runs on the same 4 nodes in both experiments.
- App 1 is the aggressor, with collect: false (not measured) and end: "f" (force-killed as soon as the victim finishes). Its args (0.001 0.01) are the arguments passed to the benchmark; here they are noise-timing values controlling how bursty the interference is (smaller quiet interval / longer burst = heavier interference).
- minruns / maxruns bound the repetitions: each experiment runs 5 to 10 times, stopping early once the victim's metrics converge.
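The node arithmetic behind partitionsplit can be sketched in a few lines of Python. This is an illustration only: split_nodes is a hypothetical helper, and the rule for handing out leftover nodes is an assumption, not necessarily how CRAB rounds uneven splits.

```python
def split_nodes(numnodes, partitionsplit):
    """Divide numnodes among partitions according to an 'a:b[:c...]' share string."""
    shares = [float(s) for s in partitionsplit.split(":")]
    total = sum(shares)
    if total <= 0:
        raise ValueError("partitionsplit must contain positive shares")
    # Round each partition down first.
    sizes = [int(numnodes * s / total) for s in shares]
    # Assumption: nodes lost to rounding go to the earliest partitions.
    for i in range(numnodes - sum(sizes)):
        sizes[i % len(sizes)] += 1
    return sizes


print(split_nodes(8, "50:50"))  # [4, 4]
```

With the tutorial's values, both the victim and the aggressor partition end up with 4 nodes.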
path points to the wrapper, not the binary
blink/a2a_comm_only.py is the wrapper file, given relative to
CRAB_PATH_WRAPPERS — it is not an absolute path and not the benchmark executable. That's
what keeps the config system-independent: the
wrapper looks up the actual binary location from the receipt at run time,
so the same config file runs on any cluster where you've done crab setup.
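To make the relative-path rule concrete, here is a minimal sketch of how a wrapper path could be resolved. It assumes CRAB_PATH_WRAPPERS is exposed as an environment variable holding the wrappers root; resolve_wrapper is a hypothetical helper, not part of CRAB's API.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_wrapper(path: str, wrappers_root: Optional[str] = None) -> Path:
    """Join a config 'path' value onto the wrappers root (hypothetical helper)."""
    # Assumption: the root comes from the CRAB_PATH_WRAPPERS environment variable.
    root = wrappers_root or os.environ.get("CRAB_PATH_WRAPPERS")
    if not root:
        raise RuntimeError("CRAB_PATH_WRAPPERS is not set")
    relative = Path(path)
    if relative.is_absolute():
        # The config stays portable only if paths are relative to the root.
        raise ValueError(f"wrapper path must be relative: {path}")
    return Path(root) / relative
```

Because only the root changes between clusters, the "path" values in tutorial.json never need editing.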
For more shapes (sequential sweeps, timed aggressors, multiple victims) see Writing experiment configs.
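Before submitting, it can help to check that each experiment has the victims and aggressors you intended. The sketch below is a hypothetical helper (summarize is not a CRAB command) that relies only on the fields shown in the config above; SAMPLE is a trimmed copy of that config, used here so the snippet runs on its own.

```python
import json

# Trimmed copy of the tutorial config, kept inline so the example is self-contained.
SAMPLE = """
{
  "experiments": {
    "baseline": {
      "apps": {
        "0": { "path": "blink/a2a_comm_only.py", "collect": true }
      }
    },
    "with_noise": {
      "apps": {
        "0": { "path": "blink/a2a_comm_only.py", "collect": true },
        "1": { "path": "blink/bursty_noise_a2a.py", "collect": false }
      }
    }
  }
}
"""


def summarize(config):
    """Map each experiment to its measured (victim) and unmeasured (aggressor) wrappers."""
    summary = {}
    for name, experiment in config["experiments"].items():
        apps = list(experiment["apps"].values())
        summary[name] = {
            "victims": [app["path"] for app in apps if app["collect"]],
            "aggressors": [app["path"] for app in apps if not app["collect"]],
        }
    return summary


for name, roles in summarize(json.loads(SAMPLE)).items():
    print(name, roles)
```

To check your real file, load it with json.load on an open handle to tutorial.json and pass the result to summarize.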
Don't want to hand-write JSON?
You can build a config interactively instead of editing JSON. Launch crab tui, add your
applications and set the global options across the tabs, then press s to save it to a
.json file — or space to run it right away. See
Running an experiment → TUI.
Step 3 — Run it
When you start the run with crab run, CRAB's orchestrator prepares a timestamped output directory, writes the
resolved config there, generates a Slurm batch script, and submits it with sbatch. The job itself
runs the worker, which executes baseline then with_noise on the allocated
nodes. (See Architecture for the full
two-phase flow.)
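As a rough picture of the "generates a Slurm batch script" step, the sketch below turns global_options into #SBATCH header lines. The mapping of name to --job-name and numnodes to --nodes is an assumption made for illustration, and render_sbatch_header is a hypothetical function; the real crab_job.sh may look different.

```python
def render_sbatch_header(global_options):
    """Build illustrative #SBATCH lines from a config's global_options."""
    lines = [
        "#!/bin/bash",
        # Assumed mappings; CRAB's generated script may use other options.
        f"#SBATCH --job-name={global_options['name']}",
        f"#SBATCH --nodes={global_options['numnodes']}",
    ]
    # Each sbatch_directives entry becomes one long-form sbatch option.
    for key, value in global_options.get("sbatch_directives", {}).items():
        lines.append(f"#SBATCH --{key}={value}")
    return "\n".join(lines)


print(render_sbatch_header({
    "name": "tutorial_a2a",
    "numnodes": "8",
    "sbatch_directives": {"time": "00:20:00"},
}))
```

The authoritative version is always the crab_job.sh saved in the run directory.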
Track it with the usual Slurm tools (for example, squeue -u $USER) while it's queued or running and, once it starts, watch the logs inside the run directory (slurm_output.log / slurm_error.log).
Step 4 — Read the results
Results land under data/<system>/tutorial_a2a_<timestamp>/:
data/leonardo/tutorial_a2a_2026-06-10_14-30-05-123456/
├── config.json # exactly what ran (reproducible)
├── crab_job.sh # the generated batch script
├── slurm_output.log
├── baseline/
│ └── data_app_0.csv # victim metrics, no interference
└── with_noise/
└── data_app_0.csv # victim metrics, under interference
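Because every run gets its own timestamped directory, a small helper for finding the newest one is handy when you rerun the tutorial. This sketch assumes the data/<system>/<name>_<timestamp>/ layout shown above and a timestamp format that sorts chronologically as text; latest_run_dir is a hypothetical helper, not a CRAB function.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(data_root: str, system: str, name: str) -> Optional[Path]:
    """Return the newest <name>_<timestamp> directory under data_root/system, if any."""
    system_dir = Path(data_root) / system
    # Timestamps like 2026-06-10_14-30-05-123456 sort correctly as plain strings.
    candidates = sorted(p for p in system_dir.glob(f"{name}_*") if p.is_dir())
    return candidates[-1] if candidates else None


print(latest_run_dir("data", "leonardo", "tutorial_a2a"))
```

The call prints None when no matching run directory exists yet.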
Each data_app_0.csv has a run_id column plus one column per metric the wrapper reported (Blink
records duration statistics). Compare the two with a couple of lines of pandas:
import pandas as pd
base = pd.read_csv(".../baseline/data_app_0.csv")
noisy = pd.read_csv(".../with_noise/data_app_0.csv")
print("baseline mean:", base.filter(like="Avg-Duration").mean().item())
print("with noise mean:", noisy.filter(like="Avg-Duration").mean().item())
If the aggressor is interfering, the victim's average duration under with_noise will be higher
than baseline. See Reading results for the full layout and CSV format.
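To put a single number on the comparison, you can compute the slowdown ratio between the two runs. The sketch below uses only the Python standard library and assumes, as in the pandas snippet, that exactly one column name contains "Avg-Duration"; mean_metric and slowdown are hypothetical helpers, and the exact column names depend on what the wrapper reports.

```python
import csv
from statistics import mean


def mean_metric(csv_path, column_hint="Avg-Duration"):
    """Average the first column whose name contains column_hint."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError(f"no data rows in {csv_path}")
    column = next((c for c in rows[0] if column_hint in c), None)
    if column is None:
        raise KeyError(f"no column matching {column_hint!r} in {csv_path}")
    return mean(float(row[column]) for row in rows)


def slowdown(baseline_csv, noisy_csv, column_hint="Avg-Duration"):
    """Ratio of the noisy mean to the baseline mean; above 1.0 means interference."""
    return mean_metric(noisy_csv, column_hint) / mean_metric(baseline_csv, column_hint)
```

Call slowdown with the paths to the two data_app_0.csv files; a result of 1.5, for example, would mean the victim took 50% longer on average under noise.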
Step 5 — Visualize
Use crab export to turn the run into a self-contained HTML dashboard (tutorial.html in this example).
Open tutorial.html in any browser — no server needed. Use the Compare tab to overlay the
baseline and with_noise distributions on one chart; the shift between them is the interference
you just measured.
What you did, and where to go next
You built a benchmark, expressed a victim-vs-aggressor study, ran it through Slurm, and quantified interference. From here:
- Scale the study up. The bundled examples/leonardo/blink_noise_study.json runs this idea across 16 experiments (message-size sweeps, different collectives, light/heavy noise, delayed starts, linear vs interleaved layouts). It's the natural next read.
- Vary placement. Try allocationmode: "i" (interleaved) to make victim and aggressor share the fabric more aggressively. See Writing experiment configs.
- Benchmark your own application. Teach CRAB to run something new in Extending CRAB.