Scoring guide

The benchmark has an automated quantitative layer and a human qualitative layer.

Final score = Human quality score + automated bonus/penalty context

For V1, do not let efficiency metrics dominate the final score. A fast, cheap, wrong simulation is a very small bonfire.

Recording the reviewer (required)

Reviews are performed by AI models, not people. Every recorded score must capture which model and harness produced it: reviewer_model (e.g. claude-opus-4-8) and reviewer_harness (e.g. claude-code). Both values are stored alongside the score and shown on the leaderboard, so that results are attributable and reviewer drift across model versions is visible. Never guess a version: record the exact id when you know it, and unknown when you do not.
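As a minimal sketch of what a recorded score could look like, the Python below stores the two reviewer fields next to the score and falls back to unknown when a value is not supplied. Only reviewer_model, reviewer_harness and the unknown convention come from this guide; the ReviewRecord class, the benchmark_id and score fields, and the example score of 71 are illustrative assumptions, not the benchmark's actual storage format.

```python
from dataclasses import dataclass, asdict

UNKNOWN = "unknown"  # the guide's fallback when the exact id is not known


@dataclass(frozen=True)
class ReviewRecord:
    """Hypothetical record shape; only the two reviewer_* names come from the guide."""
    benchmark_id: str
    score: int
    reviewer_model: str = UNKNOWN
    reviewer_harness: str = UNKNOWN

    def __post_init__(self):
        # A blank value hides attribution, so require an explicit string.
        for name in ("reviewer_model", "reviewer_harness"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be non-empty; use '{UNKNOWN}' if unsure")
        if not 0 <= self.score <= 100:
            raise ValueError("score must be between 0 and 100")


# Illustrative values only.
record = ReviewRecord("002", 71, "claude-opus-4-8", "claude-code")
row = asdict(record)  # plain dict, ready to store next to the score
```

Defaulting to unknown keeps a missing version visible on the leaderboard instead of silently replacing it with a guess.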

Benchmark 002 and difficulty

Benchmark 002 (Asia–Europe container shipping) is scored on the same 100-point rubric, but it runs at maximum difficulty: the prompt withholds the output schema, the scenarios contain designed traps, and the harness cross-checks the event log against the summary. At this difficulty, presentation alone earns little. Award the top of each band only for a correctly diagnosed bottleneck and the trap scorecard in benchmarks/002_container_shipping_throughput/templates/reviewer_form.md.
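To illustrate the idea of cross-checking an event log against a summary, here is a rough sketch that recounts events from the log and compares them with the counts a summary claims. It is not the harness's actual check: the event shape (a dict with a "type" key), the summary shape (event type mapped to a count), the function name cross_check and the event names are all hypothetical, since the guide does not describe the real schema.

```python
from collections import Counter


def cross_check(event_log, summary):
    """Return (event_type, reported, observed) for every count that disagrees.

    Assumed shapes (hypothetical): event_log is a list of dicts with a "type"
    key; summary maps an event type to the count the submission reported.
    """
    observed = Counter(event["type"] for event in event_log)
    mismatches = []
    for event_type, reported in summary.items():
        seen = observed.get(event_type, 0)
        if seen != reported:
            mismatches.append((event_type, reported, seen))
    return mismatches


# Made-up example: the summary claims two departures, the log contains one.
events = [
    {"type": "vessel_arrival"},
    {"type": "vessel_arrival"},
    {"type": "vessel_departure"},
]
claimed = {"vessel_arrival": 2, "vessel_departure": 2}
problems = cross_check(events, claimed)
```

This sketch only checks the types the summary mentions; event types that appear in the log but not in the summary are not flagged.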

Human quality score: 100 points

| Category | Points |
| --- | --- |
| Conceptual modelling | 20 |
| Data and topology handling | 15 |
| Simulation correctness | 20 |
| Experimental design | 15 |
| Results and interpretation | 15 |
| Code quality and reproducibility | 10 |
| Traceability and auditability | 5 |
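The category weights above can be captured as a small lookup that also guards against awarding more points than a category allows. This is a sketch only: the weights are taken from the table, while the function name total_quality_score and the example awards are illustrative assumptions.

```python
# Maximum points per category, copied from the rubric table.
RUBRIC = {
    "Conceptual modelling": 20,
    "Data and topology handling": 15,
    "Simulation correctness": 20,
    "Experimental design": 15,
    "Results and interpretation": 15,
    "Code quality and reproducibility": 10,
    "Traceability and auditability": 5,
}
assert sum(RUBRIC.values()) == 100  # the rubric is defined as 100 points


def total_quality_score(awarded):
    """Sum per-category awards after checking names and per-category caps."""
    missing = RUBRIC.keys() - awarded.keys()
    extra = awarded.keys() - RUBRIC.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} unexpected={sorted(extra)}")
    for category, points in awarded.items():
        if not 0 <= points <= RUBRIC[category]:
            raise ValueError(f"{category}: {points} is outside 0..{RUBRIC[category]}")
    return sum(awarded.values())


# Made-up awards for illustration.
example = {
    "Conceptual modelling": 16,
    "Data and topology handling": 12,
    "Simulation correctness": 14,
    "Experimental design": 10,
    "Results and interpretation": 11,
    "Code quality and reproducibility": 8,
    "Traceability and auditability": 4,
}
total = total_quality_score(example)  # 75
```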

1. Conceptual modelling, 20 points

Assess whether the agent defines:

High score: clear, concise, useful conceptual model that separates data-derived facts from introduced assumptions.

Low score: jumps straight into code or gives vague modelling waffle with no operational content.

2. Data and topology handling, 15 points

Assess whether the solution:

3. Simulation correctness, 20 points

Assess whether the model:

4. Experimental design, 15 points

Assess whether the solution:

5. Results and interpretation, 15 points

Assess whether the agent:

6. Code quality and reproducibility, 10 points

Assess:

7. Traceability and auditability, 5 points

Assess whether:

Automated quantitative metrics

The evaluator reports, but does not fully judge:

Suggested interpretation of automated behavioural checks

Treat behavioural checks as evidence, not as absolute truth.

For example:

A failing behavioural check may reveal: