Scoring Guide
The benchmark has an automated quantitative layer and a human qualitative layer.
Recommended final score
Final score = Human quality score + automated bonus/penalty context
For V1, do not let efficiency dominate. A fast, cheap, wrong simulation is a very small bonfire.
Recording the reviewer (required)
Reviews are performed by AI models, not people. Every recorded score must capture
which model and harness produced it: reviewer_model (e.g. claude-opus-4-8) and
reviewer_harness (e.g. claude-code). This is stored alongside the score and shown
on the leaderboard so results are attributable and reviewer drift across model
versions is visible. Do not fabricate a version you are unsure of — use unknown
rather than a guess, but prefer the exact id.
Benchmark 002 and difficulty
002 (Asia–Europe container shipping) is scored on the same 100-point rubric, but it
runs at maximum difficulty: the prompt withholds the output schema, the scenarios
contain designed traps, and the harness cross-checks the event log against the
summary. At this difficulty, presentation alone earns little — award the top of each
band only for a correctly diagnosed bottleneck and the trap scorecard in
benchmarks/002_container_shipping_throughput/templates/reviewer_form.md.
Human quality score: 100 points
| Category | Points |
|---|---|
| Conceptual modelling | 20 |
| Data and topology handling | 15 |
| Simulation correctness | 20 |
| Experimental design | 15 |
| Results and interpretation | 15 |
| Code quality and reproducibility | 10 |
| Traceability and auditability | 5 |
1. Conceptual modelling, 20 points
Assess whether the agent defines:
- system boundary
- entities
- resources
- events
- state variables
- assumptions
- limitations
- performance measures
High score: clear, concise, useful conceptual model that separates data-derived facts from introduced assumptions.
Low score: jumps straight into code or gives vague modelling waffle with no operational content.
2. Data and topology handling, 15 points
Assess whether the solution:
- reads the input files
- uses nodes and edges meaningfully
- calculates routes and travel times from the graph
- handles constrained road segments
- avoids hard-coded answers
- reacts correctly to scenario perturbations
3. Simulation correctness, 20 points
Assess whether the model:
- uses SimPy properly
- represents trucks as active entities
- represents loaders, crusher, and constrained roads as resources
- models loading, hauling, dumping, and return travel coherently
- records tonnes based on completed dump events
- handles queues and resource occupancy plausibly
4. Experimental design, 15 points
Assess whether the solution:
- runs required scenarios
- uses at least 30 replications
- controls random seeds
- reports uncertainty
- uses stochasticity sensibly
- explains warm-up choice or lack of warm-up
- supports reproducibility
5. Results and interpretation, 15 points
Assess whether the agent:
- answers the decision questions
- identifies bottlenecks plausibly
- explains operational implications
- avoids overclaiming
- presents clear results
- discusses what would improve throughput
6. Code quality and reproducibility, 10 points
Assess:
- structure
- readability
- simple dependency management
- no hard-coded local paths
- configurable parameters
- clean run instructions
- reasonable file organisation
7. Traceability and auditability, 5 points
Assess whether:
event_log.csvis useful- truck movements can be audited
- state transitions are visible
- queueing/resource behaviour can be inspected
- visualisation, if present, is derived from the simulation or event log
Automated quantitative metrics
The evaluator reports, but does not fully judge:
- runtime seconds
- return code
- Python LOC
- file counts
- output files present
- schema coverage
- scenario coverage
- behavioural sanity checks
- token usage if supplied
Suggested interpretation of automated behavioural checks
Treat behavioural checks as evidence, not as absolute truth.
For example:
trucks_12should usually produce higher throughput thantrucks_4ramp_upgradeshould usually improve or maintain throughput versus baselinecrusher_slowdownshould usually reduce throughputramp_closedshould usually reduce throughput or force rerouting
A failing behavioural check may reveal:
- a model bug
- a scenario not applied correctly
- a legitimate modelling choice that should be reviewed manually