2026-06-04__002_container_shipping_throughput__claude-code__claude-opus-4-8__max-effort
Date: 2026-06-04 · Benchmark: 002_container_shipping_throughput · Harness: claude-code · Model: claude-opus-4-8 (max-effort) · ✓ Autonomous
Scores
| Category | Points | Max |
|---|---|---|
| Conceptual modelling | 20 | 20 |
| Data and topology | 15 | 15 |
| Simulation correctness | 19 | 20 |
| Experimental design | 14 | 15 |
| Results & interpretation | 15 | 15 |
| Code quality | 8 | 10 |
| Traceability | 5 | 5 |
| Total | 96 | 100 |
Run metrics
- Total tokens: n/a (method: unknown)
- Input / output tokens: n/a / n/a
- Runtime: 1.664314 s
- Reviewer model: claude-opus-4-8 · harness: claude-code · on 2026-06-04
- Recommendation: Strong submission
- Notes: Autonomously reproduced all three designed traps from the data alone: Rotterdam discharge berth is the true bottleneck (util 65% vs canal 3%), canal_upgrade is a near-no-op (1.01x), the fleet saturates (marginal 21.7k->11.0k TEU/vessel; anchorage wait 84->157->482h), and canal_closed reroutes via the Cape (verified 811/811 deliveries cape, 0 canal events, 0.86x) -- then recommends the non-obvious Rotterdam berth (+14% at fixed fleet). Genuine, reproducible SimPy DES: I re-ran it (~1.7s) and every committed total matched bit-for-bit; event-log cross-check 0.0% error on all 7 scenarios; prompt byte-identical to the benchmark. The 17 failed automated checks are purely output-naming mismatches (base_seed/sim_day/nested metrics.* and no benchmark_id vs the harness's flat synonyms), not modelling errors. Total 96/100. Minor dings: even crane-split + whole-ship (10k TEU) quantization make the headline total_teu coarse; 14-day warmup under-covers the ~28-day pipeline fill; schema naming hurts machine-readability the prompt emphasized.
Evaluation report
- Automated checks: 46 / 63 (73%)
- Behavioural checks: — / —
| Scenario | Mean throughput |
|---|---|
| baseline | 315,333.333 |
| canal_closed | 270,333.333 |
| canal_upgrade | 317,000 |
| fleet_large | 403,333.333 |
| fleet_small | 228,666.667 |
| port_slowdown | 229,000 |
| rotterdam_upgrade | 360,000 |
Source files
- README.md
- conceptual_model.md
- data/berths.csv
- data/nodes.csv
- data/scenarios/baseline.yaml
- data/scenarios/canal_closed.yaml
- data/scenarios/canal_upgrade.yaml
- data/scenarios/fleet_large.yaml
- data/scenarios/fleet_small.yaml
- data/scenarios/port_slowdown.yaml
- data/scenarios/rotterdam_upgrade.yaml
- data/sea_legs.csv
- data/vessels.csv
- prompt.md
- requirements.txt
- results/evaluation_report.json
- results.csv
- run_metrics.json
- submission.yaml
- summary.json
- token_usage.json
Conceptual model
Conceptual Model — Asia–Europe Container Shipping Throughput
This document defines the system being modelled, the entities and resources, the events and state, and — kept deliberately separate — which assumptions are derived from the supplied data and which are introduced modelling choices. It closes with the model’s limitations.
The implementation is a genuine SimPy discrete-event simulation
(container_sim/simulation.py); nothing here is a spreadsheet average.
1. System boundary
In scope. The liner service that lifts laden containers from two Asian
origin ports (Shanghai CNSHA, Singapore SGSIN), sails them along a directed
maritime network to the primary European import port (Rotterdam NLRTM),
discharges them, and returns the vessels empty (ballast) to reload. The horizon
is a 180-day planning window.
Objective (the thing we measure). Total TEU discharged at Rotterdam over the horizon, and the rate at which it is delivered.
Out of scope / boundary conditions.
- Hamburg (DEHAM) is a reachable secondary port but not the delivery objective; the service never calls there. It is a distractor sink and is excluded from routing.
- Cargo demand is treated as unlimited: a vessel always finds a full load at its origin (the operator’s objective is to maximise TEU to Rotterdam, so the binding limits are ships and ports, not cargo availability).
- Backhaul (Europe→Asia cargo) is out of scope; the return leg is ballast.
2. Entities
| Entity | Count | Description |
|---|---|---|
| Vessel | scenario fleet.vessel_count (8 / 12 / 20) | A neo-panamax of 10 000 TEU, service speed 19 kn, with a home port. Moves through a repeating round-trip process. |
| Voyage (round trip) | emergent | load → sail laden → wait+discharge → sail ballast → (maintenance). One delivery of 10 000 TEU per voyage. |
The fleet for a scenario is the first vessel_count rows of vessels.csv
(file order alternates SGSIN/CNSHA home ports, so any prefix is a balanced
split: 4/4, 6/6, 10/10).
3. Resources (the contended capacity)
| Resource | SimPy object | Capacity | Source |
|---|---|---|---|
| Rotterdam discharge berth | Resource | berth_count = 1 | berths.csv B_RTM |
| Shanghai / Singapore load berths | Resource | berth_count = 3 each | berths.csv B_SHA, B_SIN |
| Suez Canal (NB and SB) | Resource per directed leg | capacity = 3 (12 in canal_upgrade) | sea_legs.csv L06_* |
| Open-water / strait / coastal legs | none (pure delay) | capacity = 999 ⇒ unconstrained | sea_legs.csv |
A leg is treated as a contended chokepoint only if its capacity is below a threshold (100); in this network only the canal qualifies. Everything coded 999 is open water and modelled as a transit delay with no queue.
Handling rate. A port advertises berth_count and crane_count. Cranes are
modelled as evenly pre-assigned to berths, so one berthed vessel is worked at
(crane_count / berth_count) × moves_per_hour_per_crane TEU/h and up to
berth_count vessels are served in parallel. The terminal’s aggregate
discharge capacity is therefore exactly crane_count × moves_per_hour_per_crane,
and cranes are never double-counted. At Rotterdam this is 4 × 28 = 112 TEU/h on
the single berth (≈ 89 h to discharge 10 000 TEU).
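The even crane split can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `Port` class and its method names are hypothetical and are not taken from container_sim; only the Rotterdam figures (1 berth, 4 cranes, 28 moves/h/crane, 10 000 TEU per ship) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """Hypothetical container for the three berths.csv fields named in the text."""
    berth_count: int
    crane_count: int
    moves_per_hour_per_crane: float

    def per_berth_rate(self) -> float:
        # Cranes are pre-assigned evenly, so each berth gets crane_count / berth_count cranes.
        return self.crane_count / self.berth_count * self.moves_per_hour_per_crane

    def aggregate_rate(self) -> float:
        # All berths busy at once: equals crane_count * moves_per_hour_per_crane.
        return self.berth_count * self.per_berth_rate()

    def handling_hours(self, teu: float) -> float:
        # One move is treated as one TEU, as stated in the assumptions.
        return teu / self.per_berth_rate()


rotterdam = Port(berth_count=1, crane_count=4, moves_per_hour_per_crane=28)
print(rotterdam.per_berth_rate())                  # 112.0
print(round(rotterdam.handling_hours(10_000), 1))  # 89.3
```

Because the per-berth rate is multiplied back by `berth_count`, the aggregate never exceeds `crane_count × moves_per_hour_per_crane`, which is the "never double-counted" property described above.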
4. Events
Per voyage, the model emits (and logs) these events:
LOAD_START → DEPART_ORIGIN → [CANAL_ENTER → CANAL_EXIT]* →
ARRIVE_DEST → DISCHARGE_START → DELIVER → [canal on return]* →
ARRIVE_HOME.
DELIVER carries the TEU discharged; summing DELIVER.teu over the event log
reconstructs delivered throughput exactly (verified for every scenario).
5. State
- Per vessel: position in its life-cycle, laden/ballast, current TEU on board, cycle start time.
- Per resource: number of units in use / queued (SimPy-managed).
- Per replication (accumulated): list of deliveries (time, TEU), anchorage waits, origin waits, completed cycle times, and busy-hours per port and per canal leg (used to derive utilisation).
6. Routing and dispatching logic
- Routing = shortest time on the directed graph. Each leg’s free-flow time is distance_nm / min(service_speed × speed_factor, leg_max_speed). Dijkstra finds the least-time path and is recomputed per scenario, so a closed leg simply disappears. At baseline the Suez path beats the Cape path; when the canal is closed the Cape path (IOX→CAPE→WAFR→NLRTM, +3 200 nm) becomes shortest.
- Fail-clear. If no origin→Rotterdam (or return) path exists (e.g. the canal is closed and the Cape reroute is forbidden), the model raises RouteError and the scenario is recorded as failed with a reason, never as a low number.
- Dispatching = return to home port (baseline.yaml). Each vessel repeatedly serves its own origin; the tie-breaker (shortest_expected_cycle_time) is moot because each vessel has a single home port.
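The routing and fail-clear rules above can be sketched with a small least-time search. This is a minimal illustration, not the submission's code: it uses a hand-written `heapq` Dijkstra, the leg ids other than L06_CANAL_NB are invented, the canal is collapsed into a single BAB→GIB leg, and every distance and speed limit is a made-up value chosen only so that the Cape detour is 3 200 nm longer.

```python
import heapq


class RouteError(Exception):
    """No open path connects the requested origin and destination."""


SERVICE_SPEED_KN = 19.0  # vessel service speed quoted in the text

# (leg_id, src, dst, distance_nm, leg_max_speed_kn): illustrative values only
LEGS = [
    ("L01", "SGSIN", "IOX", 1500, 25.0),
    ("L02", "IOX", "BAB", 2200, 25.0),
    ("L06_CANAL_NB", "BAB", "GIB", 3300, 19.0),
    ("L07", "GIB", "NLRTM", 1400, 25.0),
    ("L08", "IOX", "CAPE", 4400, 25.0),
    ("L09", "CAPE", "WAFR", 3200, 25.0),
    ("L10", "WAFR", "NLRTM", 2500, 25.0),
]


def leg_hours(distance_nm, leg_max_speed, speed_factor=1.0):
    # Free-flow time: distance over the slower of vessel speed and leg limit.
    return distance_nm / min(SERVICE_SPEED_KN * speed_factor, leg_max_speed)


def least_time_path(legs, origin, dest, closed=frozenset()):
    # Closed legs are simply left out of the graph before the search.
    graph = {}
    for leg_id, src, dst, dist, vmax in legs:
        if leg_id not in closed:
            graph.setdefault(src, []).append((dst, leg_hours(dist, vmax)))
    best, prev, heap = {origin: 0.0}, {}, [(0.0, origin)]
    while heap:
        hours, node = heapq.heappop(heap)
        if node == dest:
            path = [dest]
            while path[-1] != origin:
                path.append(prev[path[-1]])
            return hours, path[::-1]
        if hours > best[node]:
            continue  # stale heap entry
        for nxt, leg_h in graph.get(node, []):
            cand = hours + leg_h
            if cand < best.get(nxt, float("inf")):
                best[nxt], prev[nxt] = cand, node
                heapq.heappush(heap, (cand, nxt))
    # Fail-clear: surface the missing route instead of returning a number.
    raise RouteError(f"no open path {origin} -> {dest}")


print(least_time_path(LEGS, "SGSIN", "NLRTM")[1])
print(least_time_path(LEGS, "SGSIN", "NLRTM", closed={"L06_CANAL_NB"})[1])
```

With these toy legs the first call routes through BAB and GIB and the second falls back to CAPE and WAFR; closing a Cape leg as well makes the function raise `RouteError`.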
7. Stochasticity, seeds, and the transient
- Leg times: multiplicative lognormal noise with unit mean and cv = 0.10 (leg_time_noise_cv), applied to every leg traversal.
- Handling times: truncated normal about the mean TEU / rate, cv = 0.10, floored at 0.1 h (truncation essentially never binds at this cv).
- Seeds / reproducibility: each vessel draws from its own stream seeded by (base_random_seed, replication, vessel_index). Re-running a replication is bit-identical (python -m container_sim verify). Because all scenarios share the baseline seed, the same vessel sees the same noise across scenarios. These are common random numbers, which reduce the variance of scenario differences.
- Transient handling. All vessels start empty at their home port at t = 0 (a cold start). The pipeline is empty until the first vessels transit (~day 28), so we report (a) total TEU over the full horizon (what the operator receives), (b) a post-warmup rate teu_per_day (deliveries after day 14, the scenario warmup_days), and (c) a fully-ramped second-half rate teu_per_day_second_half. The second-half rate is the cleanest steady-state estimate.
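The noise model and per-vessel seeding described above might look roughly like the following. It is a sketch under stated assumptions: it uses the standard-library `random` module rather than whatever generator container_sim uses, the string-based seed and all function names are hypothetical, and "truncated" is implemented here as a simple floor at 0.1 h.

```python
import math
import random


def vessel_rng(base_seed, replication, vessel_index):
    # Hypothetical seeding scheme: one independent stream per
    # (base seed, replication, vessel). String seeds are hashed deterministically.
    return random.Random(f"{base_seed}:{replication}:{vessel_index}")


def leg_noise(rng, cv=0.10):
    # Lognormal with mean exactly 1: sigma^2 = ln(1 + cv^2), mu = -sigma^2 / 2.
    sigma2 = math.log(1.0 + cv * cv)
    return rng.lognormvariate(-0.5 * sigma2, math.sqrt(sigma2))


def handling_hours(rng, mean_hours, cv=0.10, floor=0.1):
    # Normal about the mean, floored at 0.1 h (assumed reading of "truncated").
    return max(floor, rng.gauss(mean_hours, cv * mean_hours))


# Same key gives the same stream, so a replication can be replayed exactly.
a = vessel_rng(20260603, 0, 3)
b = vessel_rng(20260603, 0, 3)
print([leg_noise(a) for _ in range(3)] == [leg_noise(b) for _ in range(3)])  # True
```

Keying the stream on the vessel rather than on the scenario is what lets two scenarios with the same base seed expose a given vessel to the same draws.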
8. Assumptions — data-derived vs introduced
8a. Derived directly from the supplied data
- Network topology, leg distances, per-leg speed limits, leg capacities and closed flags (sea_legs.csv + scenario leg_overrides).
- Vessel capacity (10 000 TEU), service speed (19 kn), laden/ballast factors (1.00), per-vessel availability, and home ports (vessels.csv).
- Berth counts, crane counts, moves/h/crane (berths.csv + berth_overrides).
- Horizon (180 d), replications (30), base seed (20260603), warmup (14 d), load ports, delivery port, routing flags, and noise cv (scenario YAML).
8b. Introduced modelling choices (with justification)
- Vessels always sail full (10 000 TEU loaded, 10 000 discharged). The brief’s objective is to maximise TEU to Rotterdam with unlimited export demand, so the constraint is capacity, not cargo.
- One move ≈ one TEU (stated in the B_RTM metadata), so discharge time = TEU / (cranes × moves/h).
- Even crane-to-berth split (Section 3). Makes Rotterdam exact (112 TEU/h); at origins it splits 5 cranes over 3 berths. Aggregate port capacity is exact.
- Capacity-constrained leg ⇔ capacity < 100 ⇒ only the canal is a queueing resource; open water is a pure delay.
- Discharge-only destination; the return leg is ballast (factors are 1.0, so ballast speed equals laden speed here).
- Availability → inter-voyage maintenance downtime calibrated so long-run availability equals the stated value: downtime = cycle_time × (1−A)/A (zero for A = 1).
- Cold start at home ports at t = 0 (Section 7).
- Draft is ignored: no leg or port in the data carries a depth limit, so the max_draft_m column is non-binding (a deliberate distractor; documented, not used).
- The two canal directions are modelled as two independent directed resources (mirroring the two L06 rows) rather than one shared channel; at ~3 % utilisation the distinction is immaterial.
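The availability rule in the list above follows from requiring A = cycle_time / (cycle_time + downtime) and solving for downtime. A minimal sketch, with a hypothetical function name and made-up example inputs:

```python
def maintenance_downtime(cycle_time_h: float, availability: float) -> float:
    """Downtime to add after a round trip so long-run availability equals `availability`."""
    if not 0.0 < availability <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    return cycle_time_h * (1.0 - availability) / availability


# Example values are illustrative, not taken from vessels.csv.
cycle_h = 100.0
down_h = maintenance_downtime(cycle_h, 0.8)
print(down_h)                        # 25.0
print(cycle_h / (cycle_h + down_h))  # 0.8
print(maintenance_downtime(cycle_h, 1.0))  # 0.0
```

The second print confirms the calibration: the share of time spent on the voyage itself recovers the stated availability.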
9. Limitations
- Cold-start synchronisation. Starting all vessels together produces a visible “staircase” in the cumulative-delivery curve (clustered arrivals) and a pipeline fill (~28 d) longer than the 14-day warmup. We mitigate by reporting the second-half steady-state rate; a staggered/random initial phase would smooth this but is not specified by the data.
- Low throughput variance. Delivered TEU is quantised in whole 10 000-TEU loads; at the specified cv = 0.10 the count of completed voyages is robust, so the total_teu confidence intervals are tight while cycle-time and anchorage-wait intervals are wider. This is a finding (throughput here is constraint-limited, not noise-limited), not a missing source of randomness.
- Origin dwell is conservative. The even crane-split gives a slow per-ship origin load (~7.8 d); origins nonetheless stay at 26–29 % utilisation, far from binding, so this does not affect any conclusion. Rotterdam, the binding resource, is exact under any crane-allocation convention because it has one berth.
- Unmodelled real-world stochastics (weather closures, demand seasonality, unplanned breakdowns beyond the stated availability) are out of scope; the model reflects only the variability the brief specifies.
README
Asia–Europe Container Shipping Throughput — SimPy Model
A discrete-event simulation (SimPy) of a liner service moving containers from Shanghai and Singapore to Rotterdam, used to answer the operator’s seven decision questions. The model derives routes and travel times from the directed network, treats the Suez Canal and the discharge berths as contended resources, runs 30 seeded replications per scenario, and reports throughput with 95 % confidence intervals.
Headline: the binding constraint is Rotterdam’s single discharge berth, not the canal and not the fleet. Expanding the canal is a near-no-op (+0.5 %); adding ships saturates the berth (diminishing returns, 20-day anchorage waits); adding one more crane-equipped berth at Rotterdam raises throughput +14 % with the same fleet and is the precondition for any fleet growth.
1. Install and run
Requires Python 3.10+ and the standard scientific stack.
pip install -r requirements.txt # simpy, numpy, pandas, scipy, pyyaml, networkx, matplotlib
# from this submission directory:
python -m container_sim run # run all scenarios -> results.csv, summary.json, event_log.csv
python -m container_sim verify # reproducibility + event-log reconstruction self-tests
python -m container_sim figures # render the 5 figures in figures/
# optional subset:
python -m container_sim run --scenarios baseline,canal_closed
The full run is deterministic and takes ≈ 1 second on a laptop (210 replications).
2. Routing and dispatching logic
- Routing: shortest time, derived from the graph. Every leg’s free-flow time is distance_nm / min(vessel_service_speed × speed_factor, leg_max_speed). Dijkstra (NetworkX) finds the least-time path from each origin to Rotterdam and back, and is recomputed for every scenario, so closing a leg removes it.
  - Baseline path: …→ IOX → BAB → Suez Canal → GIB → NLRTM (Suez beats Cape).
  - Canal closed: the Cape path IOX → CAPE → WAFR → NLRTM (+3 200 nm) becomes shortest and is used automatically.
- Fail-clear. If no path exists (canal closed and Cape forbidden/closed), the model raises RouteError and the scenario is recorded as failed with a reason, never as a misleadingly low throughput. (Verified: see verify and the limitation note below.)
- Dispatching: return to home port. Each vessel repeatedly loads full at its own origin, delivers to Rotterdam, and returns empty (ballast) to reload. The Suez Canal (capacity 3) is a shared convoy-slot resource held during transit; Rotterdam’s berth (capacity 1) is the discharge resource vessels queue for.
3. Key results
30 replications/scenario, 180-day horizon. total_teu = TEU discharged at
Rotterdam over the horizon; vs base = ratio to baseline.
| Scenario | Fleet | Route | total_teu (95 % CI) | TEU/day (2nd-half) | Cycle (d) | Anchorage wait (h) | RTM berth util | TEU / vessel | vs base |
|---|---|---|---|---|---|---|---|---|---|
| fleet_small | 8 | suez | 228 667 (226 319–231 015) | 1 452 | 61.6 | 84 | 0.48 | 28 583 | 0.73 |
| baseline | 12 | suez | 315 333 (312 435–318 231) | 1 970 | 66.0 | 157 | 0.65 | 26 278 | 1.00 |
| rotterdam_upgrade ⭐ | 12 | suez | 360 000 (360 000–360 000) | 2 348 | 59.7 | 30 | 0.37 | 30 000 | 1.14 |
| canal_upgrade | 12 | suez | 317 000 (314 378–319 622) | 1 978 | 65.5 | 157 | 0.66 | 26 417 | 1.01 |
| canal_closed | 12 | cape | 270 333 (269 139–271 528) | 1 670 | 77.9 | 163 | 0.56 | 22 528 | 0.86 |
| port_slowdown | 12 | suez | 229 000 (226 955–231 045) | 1 541 | 87.2 | 568 | 0.83 | 19 083 | 0.73 |
| fleet_large | 20 | suez | 403 333 (400 503–406 164) | 2 678 | 83.2 | 482 | 0.84 | 20 167 | 1.28 |
⭐ rotterdam_upgrade is my own added scenario (Section 4, Q7).
Resource utilisation at baseline (the bottleneck fingerprint):
| Rotterdam discharge berth | Shanghai berths | Singapore berths | Suez Canal (NB) | Suez Canal (SB) |
|---|---|---|---|---|
| 65 % | 26 % | 29 % | 3.3 % | 2.9 % |
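As a back-of-envelope cross-check, the Rotterdam berth utilisation can be approximated from the delivered TEU alone: ships discharged × mean hours per ship ÷ (berths × horizon hours). The sketch below is an approximation rather than the model's own calculation. It ignores handling-time noise and any discharge still in progress at day 180, the function name is hypothetical, and the inputs are the rounded values from the key-results table.

```python
HORIZON_H = 180 * 24   # 180-day horizon in hours
TEU_PER_SHIP = 10_000


def approx_berth_util(total_teu, teu_per_hour_per_berth, berth_count=1):
    ships = total_teu / TEU_PER_SHIP
    hours_per_ship = TEU_PER_SHIP / teu_per_hour_per_berth
    return ships * hours_per_ship / (berth_count * HORIZON_H)


# (total_teu, per-berth TEU/h, berths) taken from the scenario descriptions.
cases = {
    "baseline": (315_333, 4 * 28, 1),
    "port_slowdown": (229_000, 4 * 16, 1),
    "rotterdam_upgrade": (360_000, 4 * 28, 2),  # 8 cranes over 2 berths
}
for name, args in cases.items():
    print(name, round(approx_berth_util(*args), 2))
# baseline 0.65
# port_slowdown 0.83
# rotterdam_upgrade 0.37
```

These three estimates agree with the reported rtm_berth_util column; for the other scenarios the same estimate lands within about 0.01 of the reported value.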
Figures in figures/: topology.png, throughput_transient.png,
fleet_saturation.png, bottleneck_utilisation.png, scenario_comparison.png.
4. Answers to the operator’s decision questions
Q1 — Baseline throughput & uncertainty.
≈ 315 000 TEU over 180 days (95 % CI 312 435–318 231, n = 30), i.e. about
31.5 full vessel-voyages. Sustained steady-state rate ≈ 1 970 TEU/day
(second-half; ≈ 1 900 TEU/day counting from the 14-day warmup, which is depressed
by the empty-pipeline ramp — first deliveries land ~day 28; see throughput_transient.png).
The CI is tight because delivered TEU is quantised in whole 10 000-TEU loads and
the completed-voyage count is robust to the specified transit/handling noise — the
variability shows up in cycle time and anchorage wait, not in the count.
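The two rates quoted in Q1 differ only in the window they average over. The sketch below shows one plausible way to compute them from a list of (day, TEU) deliveries; the function name, the sample delivery log, and the reading of "second half" as days 90 to 180 are assumptions, not details confirmed by the source.

```python
def rate_per_day(deliveries, start_day, horizon_days=180):
    """TEU/day over the window (start_day, horizon_days]."""
    teu = sum(t for day, t in deliveries if start_day < day <= horizon_days)
    return teu / (horizon_days - start_day)


# Hypothetical delivery log for one replication: (sim_day, teu).
deliveries = [(28.0, 10_000), (45.5, 10_000), (95.0, 10_000),
              (130.0, 10_000), (171.0, 10_000)]

post_warmup = rate_per_day(deliveries, start_day=14)   # window of 166 days
second_half = rate_per_day(deliveries, start_day=90)   # window of 90 days
print(round(post_warmup, 1), round(second_half, 1))

# Since first deliveries land around day 28, all baseline TEU falls after day 14,
# so the post-warmup rate is roughly total_teu / 166:
print(round(315_333 / (180 - 14)))  # 1900
```

The post-warmup window still contains roughly two weeks with nothing arriving, which is why it reads lower than the second-half rate.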
Q2 — Where is the binding constraint? Rotterdam’s discharge operation — one berth worked by 4 cranes (112 TEU/h, ≈ 89 h per ship). The evidence is from the model, not the labels:
- It is by far the most-utilised resource (65 % vs origins 26–29 % vs canal 3 %).
- Degrading it (port_slowdown) cuts throughput 27 %, the largest swing.
- Saturating it (fleet_large) collapses marginal returns and explodes the anchorage queue (below).
- Relieving it (rotterdam_upgrade) raises throughput 14 % with the same fleet.

The famous chokepoint, the Suez Canal, is nearly idle (3 %) and is not the constraint. Reading the labels would mislead; the utilisation evidence is decisive.
Q3 — Does adding ships help, or does it saturate? It saturates. As the fleet grows 8 → 12 → 20:
- Marginal delivered TEU per added vessel falls 21 667 → 11 000.
- TEU per vessel falls 28 583 → 26 278 → 20 167 (ships get less productive).
- Mean anchorage wait rises 84 h → 157 h → 482 h — at 20 ships each vessel idles ~20 days per round trip waiting for the single berth.
- Cycle time stretches 61.6 → 66.0 → 83.2 days; RTM berth utilisation climbs 0.48 → 0.65 → 0.84.

Throughput still rises (+28 % at 20 ships) but the extra ships largely queue; the fleet is running into the discharge berth (fleet_saturation.png).
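The saturation figures in Q3 follow directly from the scenario means. A short sketch that recomputes them (variable names are illustrative; the throughput values are the mean-throughput figures reported for fleet_small, baseline, and fleet_large):

```python
# (fleet size, mean total_teu) for fleet_small, baseline, fleet_large
fleet_results = [(8, 228_666.667), (12, 315_333.333), (20, 403_333.333)]

# Marginal TEU per added vessel between consecutive fleet sizes.
for (n0, teu0), (n1, teu1) in zip(fleet_results, fleet_results[1:]):
    marginal = (teu1 - teu0) / (n1 - n0)
    print(f"{n0} -> {n1} vessels: {round(marginal)} TEU per added vessel")

# Average productivity per vessel at each fleet size.
per_vessel = [round(teu / n) for n, teu in fleet_results]
print(per_vessel)  # [28583, 26278, 20167]
```

The first step yields about 21 667 TEU per added vessel and the second 11 000, so each ship added beyond 12 contributes roughly half as much as each ship added between 8 and 12.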
Q4 — Would expanding the canal help? No — essentially no effect (+0.5 %, within noise). The canal is only ~3 % utilised at baseline, so it is not the binding resource. Adding slots (3 → 12) and speeding transit drops canal utilisation further (3.3 % → 0.5 %) and shaves a few hours off a ~25-day transit, but the Rotterdam berth — untouched — still gates the system. Canal spend buys almost nothing here.
Q5 — Sensitivity to destination discharge productivity. Very high.
Cutting Rotterdam’s crane rate by about 43 % (28 → 16 moves/h, port_slowdown) cuts throughput
27 % (315 k → 229 k), drives berth utilisation to 0.83 and the anchorage wait
to 568 h. Throughput falls steeply with discharge capacity, though less than
proportionally, which is the clearest confirmation that the discharge operation is the lever.
Q6 — Operational impact of losing the canal. ≈ −14 % (315 k → 270 k, 0.86×). The service reroutes automatically via the Cape of Good Hope (+3 200 nm each way), lengthening the cycle 66 → 78 days, cutting voyages and per-vessel delivery (26 278 → 22 528 TEU). The model does not fail, because a valid alternative exists; it only fails clearly if the Cape route is also unavailable.
Q7 — Single recommended intervention. Add a second crane-equipped deep-sea
discharge berth at Rotterdam (my rotterdam_upgrade scenario: berths 1 → 2,
cranes 4 → 8, fleet unchanged at 12). Evidence:
- +14.2 % throughput (315 k → 360 k) with the same ships.
- Anchorage wait collapses 157 h → 30 h; per-ship productivity rises 26 278 → 30 000 TEU (the opposite of what adding ships does).
- It is the precondition for fleet growth: with one berth, extra ships merely queue (Q3); doubling discharge capacity roughly doubles the system ceiling, so berth expansion is what makes a later fleet increase pay off.

Compared with the alternatives, this beats buying 8 ships (far more capital for +28 % but each ship idles ~20 days/voyage) and dwarfs a canal upgrade (+0.5 %).
5. Output file schema
All machine-readable files use clear, conventional names — no bespoke key needed.
results.csv — one row per (scenario, replication)
| column | meaning |
|---|---|
| scenario_id | scenario name |
| replication | replication index (0…29) |
| base_seed | base random seed for the scenario |
| fleet_size | number of vessels |
| route_type | dominant route used: suez or cape |
| total_teu | TEU discharged at Rotterdam over the horizon (primary metric) |
| deliveries | number of completed discharge voyages |
| teu_per_day | TEU/day delivered after the warmup (> warmup_days) |
| teu_per_day_second_half | TEU/day over the fully-ramped second half (steady state) |
| teu_per_vessel | total_teu / fleet_size |
| mean_cycle_time_days | mean round-trip duration |
| mean_anchorage_wait_h | mean wait at Rotterdam for the discharge berth |
| mean_origin_wait_h | mean wait at the origin for a load berth |
| rtm_berth_util | Rotterdam discharge-berth utilisation (busy ÷ capacity·horizon) |
| util_port_CNSHA, util_port_SGSIN | origin berth utilisations |
| util_port_DEHAM | Hamburg berth utilisation, always 0 (the distractor sink is never served; useful confirmation) |
| util_L06_CANAL_NB, util_L06_CANAL_SB | Suez Canal slot utilisations (0 when closed) |
summary.json — scenario-level summary + cross-scenario analysis
{
"generated_with": "container_sim 1.0.0",
"scenarios": {
"<id>": {
"scenario_id", "description", "horizon_days", "warmup_days",
"replications", "base_random_seed", "fleet_size", "route_type",
"deliver_to", "status", // "ok" or "failed" (+ "reason")
"metrics": { "<metric>": {mean, std, sem, ci95_low, ci95_high, n}, ... }
}, ...
},
"analysis": {
"throughput_ratio_vs_baseline": { "<id>": ratio, ... },
"fleet_marginal_teu_per_vessel": [ {from, to, delta_vessels, delta_teu, marginal_teu_per_added_vessel}, ... ],
"teu_per_vessel_by_fleet": {...}, "anchorage_wait_h_by_fleet": {...},
"baseline_utilisation": {...}
}
}
Metrics carrying a 95 % CI: total_teu, teu_per_day, teu_per_day_second_half,
teu_per_vessel, deliveries, mean_cycle_time_days, mean_anchorage_wait_h,
mean_origin_wait_h, rtm_berth_util, util_L06_CANAL_NB, util_L06_CANAL_SB.
CIs are Student-t with n−1 degrees of freedom.
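For reference, a Student-t interval of this kind can be computed with the standard library alone. This is a sketch, not the submission's code (which lists scipy among its requirements): the function name is hypothetical and the two-sided 95 % critical values are hard-coded for the degrees of freedom used here rather than looked up from a distribution.

```python
import math
import statistics

# Two-sided 95 % Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {4: 2.776, 29: 2.045}


def ci95(samples):
    n = len(samples)
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)  # sample std with n - 1 denominator
    half_width = T_CRIT_95[n - 1] * sem
    return mean - half_width, mean + half_width


# If every one of 30 replications delivers the same total, the interval collapses
# to a point, as in the rotterdam_upgrade row (360 000-360 000).
print(ci95([360_000] * 30))  # (360000.0, 360000.0)
```

A zero-width interval therefore signals identical replication totals, which is consistent with throughput being quantised in whole 10 000-TEU loads.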
event_log.csv — auditable vessel-movement trace
One row per important event, sufficient to audit movements and reconstruct
delivered throughput: sum of teu over rows where event_type == "DELIVER",
grouped by (scenario_id, replication), equals that replication’s total_teu
(verified exactly for all scenarios).
| column | meaning |
|---|---|
| scenario_id, replication | keys |
| sim_time_h, sim_day | event time (hours; days) |
| vessel_id, vessel_class | the vessel |
| event_type | LOAD_START, DEPART_ORIGIN, CANAL_ENTER, CANAL_EXIT, ARRIVE_DEST, DISCHARGE_START, DELIVER, ARRIVE_HOME |
| location | node id or leg id |
| route_type | suez or cape |
| teu | TEU on the move (10 000 on DELIVER) |
| wait_h | queue wait recorded on the event (anchorage / origin) |
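The reconstruction check described for event_log.csv amounts to a filtered group-by. A minimal sketch using only the standard library; the sample rows, vessel ids, and the function name are invented for illustration, and only the column names follow the schema above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt with a subset of the documented columns.
SAMPLE_LOG = """scenario_id,replication,sim_time_h,vessel_id,event_type,teu
baseline,0,0.0,V01,LOAD_START,0
baseline,0,672.0,V01,DELIVER,10000
baseline,0,701.5,V02,DELIVER,10000
baseline,1,680.2,V01,DELIVER,10000
baseline,1,690.0,V02,ARRIVE_DEST,10000
"""


def delivered_teu(rows):
    """Sum teu over DELIVER events, grouped by (scenario_id, replication)."""
    totals = defaultdict(int)
    for row in rows:
        if row["event_type"] == "DELIVER":
            key = (row["scenario_id"], int(row["replication"]))
            totals[key] += int(float(row["teu"]))
    return dict(totals)


totals = delivered_teu(csv.DictReader(io.StringIO(SAMPLE_LOG)))
print(totals)  # {('baseline', 0): 20000, ('baseline', 1): 10000}
```

Comparing each total with the matching total_teu row in results.csv gives the kind of audit the text describes; non-DELIVER rows carry TEU on board but are deliberately excluded from the sum.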
6. Reproducibility, validation, limitations
- Reproducible: per-vessel RNG streams seeded by (base_seed, replication, vessel_index); python -m container_sim verify confirms bit-identical reruns, seed independence, and event-log reconstruction. Common random numbers (shared base seed) reduce the variance of scenario comparisons.
- Warmup / transient: cold start (all vessels at home, t = 0). The pipeline fills by ~day 28, beyond the 14-day warmup_days, so we report total TEU (full horizon) and a fully-ramped second-half rate. The cumulative curve (throughput_transient.png) shows a “staircase” from synchronised first departures, a cold-start artifact rather than a steady-state feature.
- Crane model: cranes split evenly across berths (aggregate port capacity exact; Rotterdam exact at 112 TEU/h). This makes origin loads conservatively slow, but origins stay at 26–29 % utilisation, which is non-binding, so conclusions are unaffected.
- Scope: unlimited export demand (ships always full), discharge-only destination, no backhaul, Hamburg unused (distractor), draft non-binding (no depth limits in the data). See conceptual_model.md §8–9 for the full list and the data-derived vs introduced split.