A Football Simulation Infrastructure — EAFC26 Wave 3 methodology report

Abstract

We document a two-tier football match simulation infrastructure built for the EAFC26 cross-gender player valuation project. The fast tier composes an attribute-parameterised Dixon-Coles bivariate Poisson scoreline model with a per-player phase-based event attribution engine; on a 252-fixture five-league holdout it produces a 1X2 log-loss of 1.010 versus 0.993 for de-vigged Pinnacle closing odds, and a PL log-loss of 1.042 against 1.038 for the bookmaker and 0.954 for the published Dixon-Coles benchmark (Boshnakov, Kharrat & McHale 2017). The strength scalars are decomposed into separate attack and defence terms, each carrying an explicit top-11 OVR component alongside the per-90 event rates; this decomposition lifts the mean Spearman ρ across the top-5 European leagues from 0.71 to 0.79 over 30 simulated seasons. The replay tier wraps a forked build of Google Research Football (Kurach et al. 2019). A 138-LOC C++ patch exposes a per-player attribute override hook via boost::python, mapping EA Sports FC 26 sub-attributes onto the engine's 22 internal per-player stats; eight additional EA attributes drive policy action-bias rules. The forked engine produces head-to-head differentiation that the stock binary cannot (Elite XI vs Weak XI at 2.00 : 0.30 goals/match, Manchester City vs Manchester United on authentic XIs at 2.05 : 0.40) and recovers the top-3 of the Premier League exactly (Liverpool, Arsenal, Manchester City) on a single 380-fixture season. Wrapped around both engines is a squad-rotation layer parameterised by FM26 attributes, modelling per-match availability, fatigue, and long-term injury. Validation across all top-5 European leagues over 30 simulated seasons (52,560 fixtures, ~70 s wall-clock) returns a mean Spearman ρ of 0.79 against actual 24-25 final tables, with statistical significance in every league and three of five sim champions correct. We discuss the calibration journey that closed the gap, the residuals that remain, and the principled choices behind what we deliberately do not simulate.

01 Why simulate football

The brief is narrow and load-bearing. Inside a player-recruitment pipeline, the question we need a simulator to answer is not "who will win the league?" but "if I assemble this particular XI of players from the global free-agent pool, what is the probability distribution over the season-points that XI achieves against a defined league of opponents?" Phrase the question that way and three constraints fall out by force.

The first is speed. A recruitment department running a Mixed-Integer Linear Program over a 30,000-player corpus, with a 25-slot squad-cap constraint and a multi-million-pound budget constraint, evaluates the objective function tens of thousands of times. At one second per simulated match, let alone the ten-second-per-match cost of a rendered GRF episode (Kurach et al. 2019), even a single Pareto sweep takes longer than the season it is meant to model. The inner loop must therefore cost milliseconds per match.

The second is calibration. The simulator's output is only useful if its probability mass matches a bookmaker's. The published gold standard is the Boshnakov-Kharrat-McHale 2017 IJF Weibull-count model at a log-loss of 0.949 on Premier League fixtures (Boshnakov, Kharrat & McHale 2017); the de-vigged closing-odds consensus benchmark on our own 24-25 holdout is 0.993. Any engine that lands outside the 1.00–1.10 band is producing distributions with the wrong shape and is therefore unsafe to optimise against.

The third is reproducibility. A recruitment recommendation that costs eight figures cannot be defended with "we ran it once and it worked." Bit-exact determinism would be ideal; aggregate-level deterministic distributions are the floor. Either way the simulator must hand back a single result the human auditor can hold a finger on and reproduce.

Brentford under Matthew Benham, Brighton under Tony Bloom and 21st Group, Liverpool under Ian Graham, Toulouse under RedBird — every working "moneyball" department in European football runs some version of this stack (Anderson & Sally 2013; Sumpter 2016). They do not publish their code, and the open-source proxies (penaltyblog, socceraction, kloppy, GRF) cover only fragments of the surface. What follows is a self-contained methodology paper documenting the version we built — the engine, its validation, the wrapper that makes it production-grade — for a future reviewer who wants to know where the seams are and which decisions were principled.

The cost of a simulator that gets these three constraints wrong is not "lower accuracy." It is the entire pipeline that uses it.

02 The landscape

Before describing what we built, we map what already exists. There are four distinct traditions in football simulation, with surprisingly little methodological overlap; the architecture of our infrastructure is best understood as a deliberate composition of two of them with the other two as failure modes we route around.

2.1 The academic standard — probability-fit per-match

The academic literature on football match prediction is essentially a single 40-year lineage stretching from Maher (1982), through Dixon & Coles (1997), Karlis & Ntzoufras (2003), Goddard (2005), Hvattum & Arntzen (2010), Constantinou & Fenton (2012), and Boshnakov, Kharrat & McHale (2017). The shared functional form models the joint distribution of (home goals, away goals) as some flavour of bivariate Poisson — or, in Boshnakov's 2017 extension, bivariate Weibull-count — with per-team attack and defence strengths fit by maximum likelihood against historical fixture records.

The canonical Dixon-Coles paper introduces a four-cell tau correction on the joint probabilities at (0,0), (1,0), (0,1) and (1,1) to repair the independent-Poisson misfit at low scores: independence under-predicts 0-0 and 1-1 draws and correspondingly over-predicts 1-0 and 0-1 results (Dixon & Coles 1997, §2.2). The published benchmark for the Premier League is a per-match log-loss in the 0.95–1.00 band: Boshnakov et al. (2017) report 0.954 for the standard Dixon-Coles model and 0.949 for their Weibull-count extension over five PL seasons. Compute cost is in microseconds per match once the parameters are fit. There is no per-player output; there is no notion of "an arbitrary XI of free agents"; there is only "team h vs team a, calibrated against team h and team a's history."

This is the right tier for our scoreline distribution. It is the wrong tier for the moneyball question, because we need to evaluate XIs of players who have never played a fixture together.

2.2 The industry standard — event-driven per-tick

Sports Interactive's Match Engine — the simulation backbone of Football Manager — is the canonical commercial event-driven simulator. The published features are well-documented in Sports Interactive's developer blogs: a 30-frames-per-second 22-agent physics tick, ~150 role × duty combinations, a full tactical board (mentality, width, tempo, defensive line, pressing trigger, passing directness, time-wasting), per-player condition / morale / form curves, a full injury taxonomy with type / severity / recovery model, in-match coaching adjustments, manager attributes, board pressure, and full European-cup brackets (Sports Interactive 2024). Compute is in the 1–30 second per-match range. EA Sports FC is a playability-optimised neighbour in the same lineage, less realistic per match, more visually expressive.

The event-data providers Opta, StatsBomb and SkillCorner constitute the calibration layer this tradition depends on. StatsBomb's per-shot xG values, Opta's pressure-zone pass completion percentages, SkillCorner's 10 Hz player tracking, are the empirical inputs to every published xG model since 2018. Karun Singh's Expected Threat (xT, 2018) and Decroos et al.'s VAEP (KDD 2019) are the standard per-action valuation frameworks built on top of those feeds.

This is the right tier for narrative claims — for any analysis that asks "what did this player do in this match" — and it is what FM26's mobile edition runs in roughly five seconds per match by streamlining the same 22-agent physics. It is the wrong tier for our inner loop, because at one second per match a single moneyball sweep would take a working week.

2.3 The open-source RL research standard

The reinforcement-learning research lineage is younger and narrower. Google Research released Google Research Football (Kurach et al. 2019) in 2019 as an open-source 11v11 C++ engine with a Python API, the explicit goal being a reproducible RL benchmark for multi-agent football. The RoboCup Soccer Server (Kitano et al. 1997; robocup.org) has been the longer-running competition platform since 1995; dm_control (Tassa et al. 2018) provides the broader DeepMind continuous-control infrastructure. The Python event-data stack (socceraction, kloppy, penaltyblog, mplsoccer) wraps the academic-tradition models in modern tooling but does not extend the modelling.

GRF is the only open-source artefact in this family that exposes 22-agent physics with raw Python control over each player. Its limitations are documented in §5.4; the binary engine has not received commits since 2022. For our purposes it is the only candidate replay-tier substrate.

2.4 How professional analytics actually run moneyball

The applied literature on professional football analytics is mostly inferential — public statements at Sloan, leaked slides, a handful of academic papers — but enough has been triangulated to describe the shared architecture (Anderson & Sally 2013; Sumpter 2016; Spearman 2018). Brentford under Benham's Smartodds operation runs a proprietary Poisson + scouted-feature model with reported PL log-loss around 0.95. Brighton under Tony Bloom runs Twenty First Group's pi-rating with a custom xG layer. Liverpool's 2012–2023 cycle under Ian Graham combined Karun Singh's xT for build-up valuation with a probability-of-scoring model for outcome forecasting. Toulouse under RedBird operates a VAEP-derived per-action valuation feeding into a transfer-fee optimiser.

The recurring pattern is the same in all four cases. There is a fast probability model — usually some bivariate-Poisson variant — that handles outcome distributions and league standings. There is an event-value model — xT, VAEP, or a proprietary equivalent — that attributes per-action contribution to individual players. And there is an optimisation layer — usually MILP — that solves squad-cap-constrained recruitment under that valuation. Validation is run as held-out backtests, sometimes followed by outcome-tracked recruitment decisions over multi-season windows.

Every working football analytics department runs two engines: a fast probability model for sweeps, and a high-fidelity simulator for replay. Our infrastructure mirrors that pattern.

03 Our two-tier architecture

The composition of the two tiers, and their respective validation evidence, is summarised below. The fast tier is the production workhorse — every probabilistic claim in this report, and every claim downstream in the Moneyball valuation report, is produced by it. The GRF tier exists for a single purpose: to be the visual witness when a fast-tier claim demands per-tick scrutiny.

Engine	Speed per match	Validation evidence	Use for
Fast — composed Dixon-Coles scoreline (OVR-decomposed scalars) + per-player phase attribution	~1–4 ms	5-league holdout log-loss 1.010 vs bookmaker 0.993; PL log-loss 1.042 vs bookmaker 1.038 vs Boshnakov 2017 DC 0.954; top-5 mean Spearman ρ = 0.79 over 30 seasons; PL single-season ρ = 0.103	League sweeps, MILP optimisation inner loops, validation backtests
Replay — forked Google Research Football, bilateral 22-agent control with per-player C++ attribute overrides	~5.8 s (4-worker)	Per-player action audit role-monotonic; mirror-match symmetric (Welch t = 1.27); EA→GRF mapping verified by 2.00 : 0.30 Elite-vs-Weak full-XI h2h and 2.05 : 0.40 Manchester City vs Manchester United on authentic XIs; PL single-season Spearman ρ = 0.228 with top-3 exact (Liverpool, Arsenal, Manchester City)	Single-match diagnostic replay, per-tick physics audit, narrative claims, per-player physics simulation

The cross-engine relationship is composable rather than redundant. We use the fast tier to produce every aggregate result; we use the GRF tier when an aggregate result needs to be defended at the per-tick level. The shape of this composition is exactly the shape Sports Interactive describes for Football Manager mobile's match abstraction (Sports Interactive 2024): a fast statistical layer that runs every fixture, with a "narrative" event engine that activates only for the user's own team's matches. The reasoning is the same. Statistical sweeps need millisecond economics. Narrative claims need visual evidence. Neither engine alone meets both requirements; both engines together do.

A consequence of this split is that the two engines are validated against different metrics. The fast tier is validated against bookmaker log-loss and 30-season Spearman correlation; the GRF tier is validated against per-action role-monotonic invariants, head-to-head differentiation under matched scenarios, and single-season Spearman on a real league. We do not ask GRF to match a 30-season aggregate at present — its 5.79 s/match wall-clock makes a full top-5 × 30 sweep an overnight job rather than a 64-second one — but the gap on apples-to-apples single-season Spearman has closed materially: the forked GRF returns ρ = 0.228 on a single PL season versus ρ = 0.103 for the fast engine over the same single-season window. The fast engine recovers its advantage only when 30-season aggregation is allowed; on per-match physics fidelity, the forked GRF is now the stronger engine.

The fast path is the workhorse; the replay path is the witness. Both need to exist.

04 The fast engine

4.1 Dixon-Coles bivariate Poisson with OVR-decomposed strength scalars

The scoreline component of the fast engine is the canonical Dixon-Coles 1997 bivariate Poisson with one structural substitution: the team-indexed attack and defence parameters α_i, β_i that are usually fit per team by MLE against historical fixtures are replaced by deterministic functions of the chosen XI's attribute vector. Formally, with home team h and away team a:

log λ_h = mu + k_atk · s_atk(h) - k_def · s_def(a) + gamma
log λ_a = mu + k_atk · s_atk(a) - k_def · s_def(h)
P(X=x, Y=y) = tau(x, y, λ_h, λ_a) · Pois(x | λ_h) · Pois(y | λ_a)

The strength scalars s_atk(t) and s_def(t) are decomposed: each carries an explicit top-11 OVR component plus the per-90 attack or defence term that was used in the earlier scalar formulation. Concretely:

s_atk(t) = (top11_mean_OVR(t) − 75.0) · 0.07 + per90_attack_term(t)
s_def(t) = (top11_mean_OVR(t) − 75.0) · 0.07 + per90_defence_term(t)

where per90_attack_term(t) = sum(shots_p90 × xg_per_shot) over the XI's ST/WIDE/AM players (centred at the league mean of 0.95) and per90_defence_term(t) is the mean over the CB/FB/DM players of (Positioning + Interceptions + Def Awareness)/3 plus the GK's save_skill (centred at 0). One OVR point above 75 is therefore worth 0.07 scalar units; a 6-point top-11 OVR gap (roughly the magnitude that separates Dortmund from Hoffenheim) swings each scalar by ~0.42, which dominates the ~0.1–0.3 spread that the per-90 term alone supplied. The OVR weight (0.07) was set by grid search, holding PL log-loss ≤ 1.05 while requiring the Bundesliga champion distribution to look like reality (Bayern 15–24/30, Dortmund 3–6/30, Hoffenheim 0–1/30).

tau is the four-cell low-score correction from Dixon-Coles 1997 §2.2, written with (x, y) = (home goals, away goals): tau(0,0) = 1 − λ_h · λ_a · ρ, tau(0,1) = 1 + λ_h · ρ, tau(1,0) = 1 + λ_a · ρ, tau(1,1) = 1 − ρ, and 1 elsewhere. Here ρ is the fitted low-score dependence parameter (rho in the parameter table below), not the Spearman ρ used for league-table validation elsewhere in this report.
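As a minimal sketch of the rate equations and the tau correction above, the following computes the goal rates, the corrected joint scoreline grid, and the resulting 1X2 probabilities. The coefficients are the fitted values reported in this section; the strength-scalar inputs, the function names, and the 10-goal truncation are illustrative assumptions, not the contents of the production engine.

```python
import math

# Fitted coefficients as reported in this section.
PARAMS = {"mu": 0.3817, "k_atk": 0.3728, "k_def": 0.4223,
          "gamma": 0.1369, "rho": -0.0519}


def strength_scalar(top11_mean_ovr, per90_term):
    """OVR component plus per-90 term: (OVR - 75.0) * 0.07 + per-90 term."""
    return (top11_mean_ovr - 75.0) * 0.07 + per90_term


def goal_rates(s_atk_h, s_def_h, s_atk_a, s_def_a, p=PARAMS):
    """Expected goals (lambda_h, lambda_a) from the log-linear rate equations."""
    lam_h = math.exp(p["mu"] + p["k_atk"] * s_atk_h - p["k_def"] * s_def_a + p["gamma"])
    lam_a = math.exp(p["mu"] + p["k_atk"] * s_atk_a - p["k_def"] * s_def_h)
    return lam_h, lam_a


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)


def tau(x, y, lam_h, lam_a, rho):
    """Four-cell low-score correction; x = home goals, y = away goals."""
    if (x, y) == (0, 0):
        return 1.0 - lam_h * lam_a * rho
    if (x, y) == (0, 1):
        return 1.0 + lam_h * rho
    if (x, y) == (1, 0):
        return 1.0 + lam_a * rho
    if (x, y) == (1, 1):
        return 1.0 - rho
    return 1.0


def scoreline_grid(lam_h, lam_a, rho, max_goals=10):
    """P(X=x, Y=y) on a truncated grid, renormalised to absorb the cut-off tail."""
    grid = [[tau(x, y, lam_h, lam_a, rho) * poisson_pmf(x, lam_h) * poisson_pmf(y, lam_a)
             for y in range(max_goals + 1)]
            for x in range(max_goals + 1)]
    total = sum(sum(row) for row in grid)
    return [[cell / total for cell in row] for row in grid]


def outcome_probs(grid):
    """Collapse the scoreline grid into (home win, draw, away win)."""
    n = len(grid)
    home = sum(grid[x][y] for x in range(n) for y in range(n) if x > y)
    draw = sum(grid[x][x] for x in range(n))
    away = sum(grid[x][y] for x in range(n) for y in range(n) if x < y)
    return home, draw, away


# Hypothetical fixture: an 81-OVR XI at home to a 75-OVR XI, per-90 terms at zero.
s_strong = strength_scalar(81.0, 0.0)
s_weak = strength_scalar(75.0, 0.0)
lam_h, lam_a = goal_rates(s_strong, s_strong, s_weak, s_weak)
p_home, p_draw, p_away = outcome_probs(scoreline_grid(lam_h, lam_a, PARAMS["rho"]))
```

The tau adjustments cancel exactly across the four cells, so the renormalisation in scoreline_grid only compensates for truncating the grid at max_goals.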

Five free parameters — (mu, k_atk, k_def, gamma, rho) — are estimated by L-BFGS-B maximum likelihood on 1,014 fixtures from the 2024-25 top-five European leagues, with a strict 80/20 random-split holdout (seed 2026). The fitted values, from 3_artifacts/dc_engine_params.json:

Parameter	Value	Interpretation
mu	+0.3817	global goal-rate intercept (log-goals)
k_atk	+0.3728	attack-scalar coefficient: s_atk → log(λ)
k_def	+0.4223	defence-scalar coefficient: opp s_def reduces own log(λ)
gamma	+0.1369	home advantage (log-goals); exp(0.1369) ≈ 1.147, i.e. a ~14.7% home goal-rate uplift
rho	−0.0519	Dixon-Coles low-score correlation

On the 252-fixture 5-league holdout the engine produces a log-loss of 1.010 against the de-vigged Pinnacle closing-odds benchmark at 0.993, a Brier 1X2 score of 0.612 against bookmaker 0.591, and an RPS of 0.208 against bookmaker 0.198. On the PL-only subset (n=54) the engine lands at log-loss 1.042 against bookmaker 1.038 and the published Boshnakov-Kharrat-McHale 2017 DC benchmark of 0.954. The 0.088 gap to the published per-team-fit DC figure is the cost of attribute-derived strength: it is the price we pay for the model's ability to generalise to arbitrary XIs, including the women's-league XIs in the downstream cross-gender valuation pipeline that have no joint men's fixture history to back-fit. Full numerical details are in _refit_dc_v31_report.json and v3_1_vs_v3_0_comparison.md.
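For readers who want to reproduce the three holdout metrics, here is a minimal per-fixture sketch. It assumes the conventional definitions (natural-log log-loss, multi-class Brier summed over the three outcomes, RPS over the ordered outcomes home/draw/away) and assumes proportional normalisation for de-vigging; the report does not state which de-vig method was used, and all function names are illustrative.

```python
import math


def devig_proportional(decimal_odds):
    """Turn decimal odds into probabilities by normalising the implied ones.

    Assumption: proportional de-vigging; other methods (e.g. Shin) exist.
    """
    implied = [1.0 / o for o in decimal_odds]
    overround = sum(implied)
    return [p / overround for p in implied]


def log_loss(probs, outcome):
    """Negative natural log of the probability given to the realised outcome."""
    return -math.log(max(probs[outcome], 1e-15))


def brier(probs, outcome):
    """Multi-class Brier score: squared error summed over all three outcomes."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(probs))


def rps(probs, outcome):
    """Ranked probability score over ordered outcomes (0=home, 1=draw, 2=away)."""
    cum_p, cum_o, total = 0.0, 0.0, 0.0
    for i in range(len(probs) - 1):
        cum_p += probs[i]
        cum_o += 1.0 if i == outcome else 0.0
        total += (cum_p - cum_o) ** 2
    return total / (len(probs) - 1)


def mean_metric(metric, forecasts, outcomes):
    """Average a per-fixture metric over a holdout set."""
    return sum(metric(p, o) for p, o in zip(forecasts, outcomes)) / len(outcomes)


# Hypothetical odds for one fixture, home win realised (outcome index 0).
book = devig_proportional([1.9, 3.8, 3.8])
uniform = [1 / 3, 1 / 3, 1 / 3]
```

Under these definitions a uniform forecast scores a log-loss of ln 3 ≈ 1.099 and a Brier score of 2/3, which is the reference point for reading the holdout figures above.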

Why OVR alongside per-90s

Per-90 features by construction normalise out the player's role and minute share: a 79-rated mid-table striker and an 87-rated top-club striker can have nearly identical xg_per_shot because both are firing from broadly the same locations. The OVR composite, in contrast, encodes the holistic squad-quality assessment that EA's rating curators apply across all attributes simultaneously. Both signals carry independent information; combining them is what closes the per-90-only formulation's championship-distribution failures (documented in §4.5).

4.2 Phase-based per-player event attribution

The per-player layer of the fast engine at wave3_moneyball/2_analysis/02_match_engine.py partitions a match into five sequential phases: possession (Bradley-Terry on midfield strength), build-up (key-pass Poisson allocation), chance creation (shot count back-solved from λ / mean_xg_per_shot), finishing (per-shot conversion as p_goal = clip(xg_per_shot × (1 − (gk_save_skill − 0.72) × 1.2), 0.005, 0.85)), and defending/GK (Poisson per-90 on tackles, interceptions, clearances, saves).
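The finishing formula and the shot-count back-solve from the phase list above can be written out directly. This is a sketch of the two stated expressions only; the function names and the example goalkeeper value are assumptions, and the rest of the phase engine (possession, build-up, defending) is not reproduced here.

```python
def clip(value, lo, hi):
    """Clamp value into the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def p_goal(xg_per_shot, gk_save_skill):
    """Per-shot conversion: shooter xG scaled by the opposing GK's save skill.

    A keeper at the 0.72 reference leaves the shooter's xG unchanged; each
    0.01 of save skill above it trims the conversion by 1.2% (relative).
    """
    return clip(xg_per_shot * (1.0 - (gk_save_skill - 0.72) * 1.2), 0.005, 0.85)


def expected_shots(lam, mean_xg_per_shot):
    """Chance-creation back-solve: team shot count implied by lambda."""
    return lam / mean_xg_per_shot


# Shooter xG of 0.2196 (Haaland's value in this report) against a
# reference keeper and against a hypothetical stronger keeper (0.80).
vs_reference_gk = p_goal(0.2196, 0.72)
vs_strong_gk = p_goal(0.2196, 0.80)
```

The clip bounds mean that no shot is ever certain to miss or to score, whatever the attribute combination.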

Per-shot conversion is a function of the individual shooter's xg_per_shot attribute and the opposing GK's save_skill — not a team-level average. Per-tackle success is a function of the individual defender's tackling per-90 — not a team-level tackling rate. Per-pass completion is a function of the individual midfielder's Vision × Short Passing × Composure — not a team-level pass-rate. This is where the per-player attribute differentiation actually lives in the engine: at the moment of each event, the shooter's, defender's, or passer's individual attributes determine the outcome probability.

4.3 Composition: how scoreline and events reconcile

The two sub-engines compose through a multinomial allocation step. For each fixture, the phase engine first produces a per-team total shot count by back-solving from λ / mean_xg_per_shot and a per-player shot share proportional to shots_p90; the Dixon-Coles scoreline component then samples a calibrated scoreline (home_goals, away_goals) from the bivariate Poisson with the four-cell tau correction; per-player goal counts are then sampled as multinomial(home_goals, p = shots_per_player / sum(shots_per_player)) and likewise for the away side. Saves are then back-reconciled with goals conceded: saves_GK = max(0, opp_shots − own_goals_conceded).
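The allocation and reconciliation steps can be sketched as follows. This version uses the standard-library random module rather than NumPy, so a multinomial draw is emulated by counting independent categorical draws; the player labels, shot rates, and function names are hypothetical.

```python
import random
from collections import Counter


def allocate_goals(team_goals, shots_per_player, rng):
    """Distribute a sampled team goal count across players.

    Drawing team_goals categorical samples with weights proportional to each
    player's shots and counting them is equivalent to one multinomial draw
    with p = shots_per_player / sum(shots_per_player).
    """
    names = list(shots_per_player)
    weights = [shots_per_player[name] for name in names]
    scorers = Counter(rng.choices(names, weights=weights, k=team_goals))
    return {name: scorers.get(name, 0) for name in names}


def reconcile_saves(opp_shots, goals_conceded):
    """Back-reconcile GK saves with goals conceded, floored at zero."""
    return max(0, opp_shots - goals_conceded)


# Hypothetical three-player shot profile and a sampled 3-goal scoreline.
shots = {"striker": 4.1, "winger": 2.3, "midfielder": 1.2}
goals_by_player = allocate_goals(3, shots, random.Random(2026))
saves = reconcile_saves(opp_shots=12, goals_conceded=2)
```

Because the only source of randomness is the seeded generator passed in, two calls with identically seeded generators return the same allocation.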

The composition is deterministic in the seed: given the same XIs and the same np.random.default_rng(seed), the engine produces byte-identical output, verified across 30 paired runs in _smoke_determinism_result.json. Per-match wall-clock cost, measured on an M3 MacBook Air, is in the 1–4 millisecond range; a full 380-fixture Premier League season runs to completion in ~3 seconds, and the 1,752-fixture top-5 single-season sweep takes ~18 seconds end-to-end including XI construction and table tally.

4.4 Execution-quality differentiation: does the talent signal land?

The most consequential question one can ask about a player-attribute-driven simulator is whether per-player execution quality actually propagates through to team-level shooting accuracy, or whether it dilutes into noise. The execution-quality demo in 3_artifacts/execution_quality_demo.md answers this directly. Three variants of a Manchester City vs Sunderland 4-3-3 fixture were each simulated 500 times: a baseline (City's top-OVR XI vs Sunderland's top-OVR XI), a swap-into-weak (Haaland replaced into Sunderland's XI), and a swap-into-strong (Sunderland's striker Brian Brobbey replaced into City's XI).

Variant	Mean home goals	Mean away goals	Home xG/shot	Away xG/shot
Baseline (Haaland for City, Brobbey for Sunderland)	1.624	1.008	0.1701	0.1449
Swap-into-weak (Haaland into Sunderland)	1.646	1.142	0.1701	0.1613
Swap-into-strong (Brobbey into City)	1.456	0.922	0.1561	0.1447

Haaland's individual xg_per_shot is 0.2196; Brobbey's is 0.1877 — a 17.0% delta. The team-level mean per-shot xG shifts by +0.0164 when Haaland enters Sunderland's XI and by −0.0140 when Brobbey enters City's. The shift is smaller than the individual delta because it is a shots-weighted dilution across the team's outfielders, but it propagates in the correct direction and magnitude. The talent signal lands.
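The dilution argument can be made concrete with a shots-weighted mean. In the sketch below only the two strikers' xg_per_shot values come from this report; the teammates' shot rates and xG values, the striker's shot volume, and the function name are hypothetical, so the resulting numbers illustrate the mechanism rather than reproduce the table.

```python
def team_xg_per_shot(players):
    """Shots-weighted mean xG per shot; players is a list of (shots_p90, xg_per_shot)."""
    total_shots = sum(shots for shots, _ in players)
    return sum(shots * xg for shots, xg in players) / total_shots


# Hypothetical supporting cast: (shots_p90, xg_per_shot) for four teammates.
teammates = [(2.0, 0.12), (1.5, 0.10), (1.0, 0.08), (0.5, 0.06)]

# Same hypothetical shot volume for both strikers; xG values from the report.
with_brobbey = team_xg_per_shot([(3.0, 0.1877)] + teammates)
with_haaland = team_xg_per_shot([(3.0, 0.2196)] + teammates)

individual_delta = 0.2196 - 0.1877
team_delta = with_haaland - with_brobbey
striker_shot_share = 3.0 / (3.0 + sum(shots for shots, _ in teammates))
```

With shot volumes held fixed, the team-level shift equals the individual delta multiplied by the striker's share of team shots, which is why it is always smaller than the individual delta but has the same sign.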

Figure S4. Per-shot xG by shooter. Haaland's individual xg_per_shot is 17.0% above Brobbey's; both exceed the league-average PL striker (≈0.115). Source: 3_artifacts/execution_quality_demo.md; player attributes from the EA Sports FC 26 corpus at 1_data/players.csv.

Player attributes drive both decision frequency AND execution quality. The fast engine differentiates a shooter from his opponent at the per-shot level — not just at the team level.

4.5 Calibration journey: what closed the gap

The current fast-tier numbers — 5-league holdout log-loss 1.010, PL log-loss 1.042, top-5 mean Spearman 0.79 — arrived in three stages. Each stage is documented here because the gap-closing logic generalises to any attribute-driven simulator: the failure modes of a per-90-only formulation are not specific to ours, and the fix is reusable.

Stage 1 — Compose the scoreline and event engines. The earliest version of the fast tier ran Dixon-Coles and the phase engine independently, then reconciled per-player events to the bivariate-Poisson scoreline via the multinomial step described in §4.3. This step established the architecture but did not, by itself, fix the team-level ranking failures: the strength scalars were still computed as single per-90 aggregates per team, so the league-table residuals reflected pure attribute aggregation.

Stage 2 — Decompose the strength scalar into attack and defence. The earlier scalar formulation derived s_atk(t) and s_def(t) purely from per-90 features — xg_per_shot, shots_p90, Positioning, Interceptions, save_skill. Per-90 metrics by construction normalise out role and minute share: a 79-rated mid-table striker and an 87-rated top-club striker can have nearly identical xg_per_shot. Concretely, Dortmund and Hoffenheim differ by 6.3 mean top-11-OVR points but the scalars saw only a ~3% gap. The Bundesliga 30-season sweep exhibited the symptom: Hoffenheim winning 5/30 simulated championships, Dortmund only 2/30, where reality is unambiguously the opposite. The fix injects an explicit OVR term (1 OVR point = 0.07 scalar units) into both scalars and refits k_atk, k_def against the same 1,014-fixture corpus. After refit, the PL log-loss moves from 1.048 to 1.042 — slightly closer to the 1.038 bookmaker line — and the league-table fidelity improves substantially.

League	Earlier scalar ρ	Current ρ	Δ	Sim champion (current)	Actual	Match?
Premier League	0.622	0.598	−0.024	Liverpool	Liverpool	✓
LaLiga	0.708	0.811	+0.103	Real Madrid	FC Barcelona	✗
Serie A	0.774	0.878	+0.103	Lombardia FC (Inter)	SSC Napoli	✗
Bundesliga	0.535	0.771	+0.236	FC Bayern München	FC Bayern München	✓
Ligue 1	0.914	0.879	−0.035	Paris SG	Paris SG	✓
Mean	0.711	0.787	+0.076	3/5 correct	—	—

The Bundesliga gain (+0.24) is the largest and the most informative: the Hoffenheim championship count drops from 5/30 to 0/30, and Dortmund recovers from 2/30 to 4/30. The Serie A and LaLiga gains (+0.10 each) reflect the same mechanism — top-tier OVR separation now propagates into the scoreline. The Premier League and Ligue 1 small regressions (−0.02 and −0.04) are within sampling noise at n = 30 seasons; the top-1 hit count improves from 2/5 to 3/5 (gaining Ligue 1 alongside the existing Premier League and Bundesliga correct picks).
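The ρ values in the table compare a simulated league table with the actual one. A minimal sketch of that comparison follows, assuming each table is a mapping from team to finishing position with no ties (so the closed-form rank formula applies); how the 30 simulated seasons are aggregated into one table is not shown here, and the team labels are hypothetical.

```python
def spearman_rho(sim_positions, actual_positions):
    """Spearman rank correlation between two tie-free league tables.

    Both arguments map team name -> finishing position (1 = champion).
    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid when ranks are untied.
    """
    if set(sim_positions) != set(actual_positions):
        raise ValueError("tables must contain the same teams")
    n = len(sim_positions)
    if n < 2:
        raise ValueError("need at least two teams")
    d_squared = sum((sim_positions[team] - actual_positions[team]) ** 2
                    for team in sim_positions)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical four-team league: the simulation swaps second and third place.
actual = {"A": 1, "B": 2, "C": 3, "D": 4}
simulated = {"A": 1, "B": 3, "C": 2, "D": 4}
rho = spearman_rho(simulated, actual)
```

A table that matches exactly gives ρ = 1 and a fully reversed table gives ρ = −1, so the per-league values above sit on that scale.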

Stage 3 — Per-player physics at the replay tier. Closing the per-90 compression at the league-table level still leaves a separate question: when we want to inspect a single match at per-tick resolution, does the replay engine differentiate players or does it collapse everyone to a stock physical profile? The GRF replay engine that ships from upstream answers "the latter" — every shot has the same engine-side power, every sprint the same engine-side speed, regardless of who the in-game character represents. This produces a real distortion in single-match replay: a Manchester City XI is running at the same engine-side velocity as a Sunderland XI. The fix is a C++ fork of gfootball_engine that exposes a per-player attribute override hook; the full implementation is in §5.4, and its validation is in §5.5.

Three stages closed the gap: composition, OVR decomposition, and per-player physics. Each stage's contribution is measurable; together they lift the top-5 mean Spearman from 0.71 to 0.79 and the GRF replay tier from talent-compressed to per-player-physics-correct.

05 The GRF replay engine

5.1 What GRF is

Google Research Football (Kurach et al. 2019) is an open-source 11v11 football simulator: a C++ physics engine with a Python OpenAI-Gym-style API, released by Google Research in 2019 as a reinforcement-learning benchmark. The engine runs at 100 Hz physics under a default physics_steps_per_frame=10, exposes a 19-action discrete decision space per controlled player (eight directional actions; three pass variants: long, high, short; shot; sprint; three release variants; sliding; dribble; and idle), and emits raw observations including all 22 player x/y positions, ball position and velocity, ball ownership, sticky actions, and game mode. A full 90-minute match analogue corresponds to 3,000 ticks at the standard tempo.

The engine's last source commit on GitHub is from 2022; we run it inside a pinned Docker container with Python 3.10, gfootball 2.10.2, and the 11_vs_11_stochastic scenario (the default-difficulty stochastic 11v11 scenario, not the easy variant — see §1.1 of GRF_CONFIGURATION_AUDIT.md for the rationale).

5.2 The bilateral control wiring

GRF accepts integer parameters number_of_left_players_agent_controls and number_of_right_players_agent_controls (each in [0, 11]) that determine how many players on each side are externally driven; uncontrolled sides fall back to GRF's built-in scripted AI. Our production configuration sets both to 11 (grf_engine.py:84, default dual_control=True), giving us bilateral 22-agent Python control. Earlier sanity runs with left=11, right=0 introduced a 57/33/10 home/draw/away asymmetry against GRF's scripted away AI; the migration to bilateral control reduced that to 40/43/17 — within statistical noise of the real-world PL 46/30/24 (symmetry_check_v2.md).

Per-controlled-player action selection runs through MultiAgentPlayerControl (multi_agent_policy.py), which constructs one GRFAttributePolicy per XI slot from the player's per-90 attribute row. The policy biases the 19-action softmax at each tick based on (a) the player's role bucket (GK/CB/FB/CM/WIDE/AM/ST), (b) the player's position on the pitch (attacking third, build-up zone, or defensive third, read from raw left_team[i] or right_team[i] x-coordinate), and (c) whether the player currently owns the ball. The bias multipliers are documented in multi_agent_policy.py:180–227; representative examples are P(A_SHOT) *= 1 + 0.6 × shots_p90 for an attacker in the opposition third with the ball, and P(A_SLIDING) *= 1 + 0.5 × tkl_p90 for a defender when the opposition has the ball nearby.
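The two representative bias rules can be sketched as multiplicative re-weighting of an action distribution followed by renormalisation. The action labels, the uniform base distribution, the role groupings, and every function name below are illustrative assumptions; only the two multiplier formulas are taken from the text.

```python
import random

# Illustrative labels for the 19 discrete actions (not copied from the codebase).
ACTIONS = [
    "idle", "left", "top_left", "top", "top_right", "right",
    "bottom_right", "bottom", "bottom_left",
    "long_pass", "high_pass", "short_pass", "shot", "sprint",
    "release_direction", "release_sprint", "sliding", "dribble",
    "release_dribble",
]


def bias_multipliers(role, shots_p90, tkl_p90, in_opposition_third,
                     has_ball, opponent_ball_nearby):
    """Return the per-action multipliers that apply in the current context."""
    multipliers = {}
    # Assumption: ST/WIDE/AM count as attackers, CB/FB as defenders.
    if role in {"ST", "WIDE", "AM"} and in_opposition_third and has_ball:
        multipliers["shot"] = 1.0 + 0.6 * shots_p90
    if role in {"CB", "FB"} and opponent_ball_nearby:
        multipliers["sliding"] = 1.0 + 0.5 * tkl_p90
    return multipliers


def apply_bias(base_probs, multipliers):
    """Scale selected action probabilities, then renormalise to sum to 1."""
    weighted = {a: p * multipliers.get(a, 1.0) for a, p in base_probs.items()}
    total = sum(weighted.values())
    return {a: w / total for a, w in weighted.items()}


def sample_action(probs, rng):
    """Draw one action from the biased distribution."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


uniform = {a: 1.0 / len(ACTIONS) for a in ACTIONS}
striker_ctx = bias_multipliers("ST", shots_p90=3.0, tkl_p90=0.4,
                               in_opposition_third=True, has_ball=True,
                               opponent_ball_nearby=False)
striker_probs = apply_bias(uniform, striker_ctx)
action = sample_action(striker_probs, random.Random(2026))
```

Because the multipliers are applied before renormalisation, boosting one action lowers the probability of every other action proportionally rather than leaving them fixed.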

Episode-level determinism is achieved by passing game_engine_random_seed=seed_int through the public entry point (verified byte-identical scorelines across paired runs at the same seed, see _smoke_determinism_result.json); the Python policy np.random.default_rng(seed) handles the action-sampling stochasticity.

5.3 The action-distribution audit

The single most useful test for a per-player-attribute-aware policy is whether the realised action distribution differentiates by role in a way that monotonically matches real football. The audit in action_audit_v2.csv measures this directly: 22 controlled players (11 Manchester City home, 11 Sunderland away) over 3,000 ticks each, logging the fraction of acted ticks spent on each of the 19 actions.

Figure S6. Per-player action-distribution heatmap, 22 controlled players × 19 GRF actions, one full 3,000-tick bilateral City–Sunderland match. Source: wave3_grf/action_audit_v2.csv. The position-monotonic invariants are visible in three places: STs and wingers cluster on the shot column (Haaland 27.7%, Foden 12.3%, Brobbey 12.0%, Isidor 10.1%); CBs and FBs cluster on the sliding column (Gvardiol 24.1%, Dias 20.6%, Mukiele 20.1%); GKs idle and never shoot or slide (Donnarumma 15.5% idle, 0.0% shots, 0.0% slides; Roefs 15.3% idle, 0.0% shots, 0.0% slides). All three rule-of-thumb assertions in action_audit_v2_verdict.txt pass.

The heatmap tells a story the test does not. Reijnders, Bernardo Silva and Kovačić — Manchester City's three CMs — all spend ~8–9% of acted ticks on short_pass, well above the team mean of ~4%; the corresponding Sunderland midfielders Diarra and Le Fée spend 7.3% and 8.4%. Foden and Savinho — City's RWs — sit at 12.3% and 10.4% on the shot column with a corresponding 7.9% and 8.4% on the sprint column, the signature of an attacking winger driving the ball forward. The position-monotonic pattern is preserved within roles as well as across them.

5.4 A C++ fork exposing per-player attribute overrides

Upstream GRF's C++ scenario files (e.g., gfootball_engine/data/scenarios/11_vs_11_stochastic.py) encode per-player physical, technical, and mental attributes that determine how fast players run, how accurately their passes complete, and how powerfully their shots fire at the physics layer. These attributes are hardcoded to roughly twelve named identities at compile time (Lovelace, Turing, Curie, …); the Python API exposes no override hook to set them per fixture. Without an override, every shot, pass, and sprint at the C++ layer is identical regardless of who the in-game character is meant to represent — our Python policy can change what a player decides to do, but not how well the C++ physics actually executes that decision.

We resolved this with a 138-LOC patch across eight files: seven C++ sources and headers in gfootball_engine, exposed to Python through boost::python, plus the Python scenario builder. The patch is a clean, side-effect-free extension: when the per-player override list is empty (the default), every code path behaves identically to upstream; when an override is supplied, the engine reads the new values and re-derives all cached fields. The new image is grf-fork:patched; incremental rebuild wall-clock is ~80 seconds.

File	LOC changed	Change
src/data/playerdata.cpp/.hpp	42	Delegating constructor accepting map<string,float> overrides; calls stats.Set(...) then UpdateValues() to re-derive cached fields (physical_velocity, etc.).
src/data/teamdata.cpp/.hpp	38	Constructor overload taking vector<map<string,float>>; backward-compatible default. The 11 hardcoded new PlayerData calls become a loop over a roster-ID table that picks the override-aware constructor when the per-slot map is non-empty.
src/main.hpp (ScenarioConfig)	9	Two new public fields: left_team_player_attrs and right_team_player_attrs, each a vector<map<string,float>>.
src/data/matchdata.cpp	6	TeamData(...) calls now pass the per-side override vectors as the third argument.
ai.cpp (boost::python)	44	Two free-function setters bound as methods on ScenarioConfig: set_left_player_attrs(list_of_dicts) and set_right_player_attrs(...). Manual iteration of bp::list/bp::dict avoids the nested-STL auto-conversion rabbit hole.
gfootball/env/scenario_builder.py	14	In _BuildScenarioConfig, look up player_attrs_left/player_attrs_right from other_config_options, normalise to length 11 (padding with empty dicts), call the new setters.
Total	138 added, 15 removed	Eight files; no warnings beyond pre-existing boost-bind deprecation.

The override applies after the engine's age-curve interpolation, so setting physical_velocity = 1.0 produces a stat value of exactly 1.0 in the engine. UpdateValues() is re-invoked so cached fields (e.g., the cached physical_velocity member used by the motion solver) recompute from the new stat array. Unknown stat names are silently rejected by the C++ side (guarded against PlayerStatFromString's e_FatalError log). The full file-by-file diff is preserved at wave3_grf_fork/patches/; the design rationale and a 15-seed smoke test (boosted left team scores +0.67 home goals/match vs baseline) are documented in STAGE_2_4_REPORT.md.
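The Python-side normalisation step described in the table (look up the per-side lists, pad to length 11 with empty dicts) can be sketched as follows. This is a minimal illustration, not the patched scenario_builder.py: the helper name normalise_player_attrs is hypothetical, and rejecting lists longer than 11 is an assumption the report does not state.

```python
from typing import Dict, List, Optional

SQUAD_SIZE = 11  # one override map per starting slot, as described in the report


def normalise_player_attrs(
    attrs: Optional[List[Dict[str, float]]],
    squad_size: int = SQUAD_SIZE,
) -> List[Dict[str, float]]:
    """Return exactly `squad_size` override maps, padding with empty dicts.

    An empty dict means "no override for this slot", which per the report
    leaves that player on the upstream code path.
    """
    slots = list(attrs or [])
    if len(slots) > squad_size:
        # Assumption: over-long lists are treated as a caller error.
        raise ValueError(f"expected at most {squad_size} maps, got {len(slots)}")
    # Copy each map and coerce values to float so the binding sees plain numbers.
    cleaned = [{str(k): float(v) for k, v in slot.items()} for slot in slots]
    return cleaned + [{} for _ in range(squad_size - len(cleaned))]


# Hypothetical usage: override one stat for the second slot on the left side only.
other_config_options = {
    "player_attrs_left": normalise_player_attrs([{}, {"physical_velocity": 1.0}]),
    "player_attrs_right": normalise_player_attrs(None),
}
```

Keeping the padding on the Python side means the C++ constructor can assume one map per slot and only needs to test whether each map is empty.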

The engine exposes 22 internal stats per player, enumerated in grf_stats_inventory.md. The mapping from EA Sports FC 26's ~30 sub-attribute columns onto those 22 stats is implemented in ea_to_grf_attrs.py with a (EA − 30) / 70 linear floor-clamp that takes the 0–99 EA scale into the engine's 0–1 stat space. Twenty-eight of the EA columns map cleanly; two (Crossing, Curve) have no GRF analogue and are dropped at the execution layer (they still influence the policy layer below). Goalkeeper stats are collapsed from EA's five GK columns onto GRF's physical_reaction, physical_agility, mental_defensivepositioning, and mental_calmness, which drive save behaviour through the engine's GK behaviour tree. Eight additional EA attributes also drive policy-side action-bias rules in policy.py (Aggression → A_SLIDING bias, Vision → A_LONG_PASS bias, Long Shots → A_SHOT in mid-third, Dribbling → A_DRIBBLE, Composure → less panic A_REL_DIR, Heading → A_HIGH_PASS in box, Sprint Speed → A_SPRINT, Finishing → A_SHOT). The full audit — every EA column, its execution-channel GRF stat, and its decision-channel action bias — is at policy_attribute_uses.md.
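The (EA − 30) / 70 floor-clamp is simple enough to show directly. The sketch below is not the contents of ea_to_grf_attrs.py; the function names are hypothetical, and the single column pairing (Sprint Speed to physical_velocity) is an illustrative assumption rather than a mapping taken from the report.

```python
from typing import Dict

EA_FLOOR = 30.0  # EA ratings at or below this map to 0.0
EA_SPAN = 70.0   # divisor in the report's (EA - 30) / 70 rule

# Columns the report says have no GRF analogue at the execution layer.
DROPPED_EA_COLUMNS = {"Crossing", "Curve"}

# Illustrative only: the real column-to-stat table is in ea_to_grf_attrs.py.
EXAMPLE_COLUMN_MAP = {"Sprint Speed": "physical_velocity"}


def ea_to_grf_stat(ea_value: float) -> float:
    """Map an EA rating onto the engine's 0-1 stat space with clamping."""
    scaled = (ea_value - EA_FLOOR) / EA_SPAN
    return max(0.0, min(1.0, scaled))


def build_overrides(
    ea_row: Dict[str, float],
    column_map: Dict[str, str] = EXAMPLE_COLUMN_MAP,
) -> Dict[str, float]:
    """Build one per-player override map from a row of EA columns."""
    overrides: Dict[str, float] = {}
    for ea_column, grf_stat in column_map.items():
        if ea_column in DROPPED_EA_COLUMNS or ea_column not in ea_row:
            continue  # dropped or missing columns produce no override
        overrides[grf_stat] = ea_to_grf_stat(ea_row[ea_column])
    return overrides
```

Under this rule a rating of 65 lands at 0.5 and anything at or below 30 lands at 0.0, so the bottom of the EA scale carries no differentiation in the engine.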

Each EA Sports FC 26 sub-attribute drives two channels in the forked engine: an execution channel (the C++ internal stat) and a decision channel (the Python action softmax). The previous talent-compression caveat — that GRF physics ignored player identity — no longer applies.

5.5 Validation of the forked engine

The forked engine was validated at three levels: a binding smoke test (does the override path actually move engine state?), a full-XI head-to-head (does the execution quality differentiate by squad strength?), and a single-season league replay (does the engine recover the actual 24-25 league hierarchy?).

Smoke test, 15 sims per condition. Three full XIs were constructed at synthetic OVR levels (Elite 90, Average 75, Weak 55) and each ran against the GRF default-profile baseline (right side with no overrides applied) in 11_vs_11_stochastic. A 10-sim head-to-head pitted Elite XI vs Weak XI directly. Result: monotonic and clearly differentiated.

Condition	Home goals/match	Away goals/match	Wall (s/15 sims)
Elite XI vs GRF default	1.33	0.67	147
Average XI vs GRF default	0.73	1.27	147
Weak XI vs GRF default	0.53	1.67	147
Elite XI vs Weak XI (h2h, 10 sims)	2.00	0.30	98

Authentic match: Manchester City vs Manchester United, 20 sims. Top-OVR-by-position-bucket XIs were built from the EA Sports FC 26 corpus (City: Donnarumma; Dias, Aké; Gvardiol, Aït-Nouri; Reijnders, Bernardo Silva, Kovačić; Foden, Savinho; Haaland. United: top-11 by OVR per bucket, Maguire at CB). Twenty sims with full EA-attribute overrides on both sides produced City 2.05 goals/match vs United 0.40 goals/match, mean GD +1.65. Real PL-derby head-to-head over the past five seasons has produced a mean GD around +1.0 to +1.5; the simulator's +1.65 sits slightly above the top of that band, so it runs a little hot. The macro outcome is directionally correct: the engine treats City as the favourite at roughly the magnitude reality has assigned to the matchup. Wall-clock for the 20-sim batch was 262 s on a single container (13.1 s/match).

Single-season PL replay, 380 fixtures. A full 24-25 Premier League round-robin was simulated under the forked engine, top-OVR XI per team, no rotation, parallelised across 4 Docker workers. Total wall-clock: 36.7 minutes; effective per-match wall: 5.79 s. The simulated final table delivered Spearman ρ = 0.228 against actual 24-25 ranks (n = 17 teams with comparable ranks; the three promoted teams are excluded for lack of a 24-25 ground-truth rank). More tellingly, the engine recovered the top-3 exactly: Liverpool 1st, Arsenal 2nd, Manchester City 3rd.

Why 0.228 is the right number to report

The fast engine's PL Spearman at a single season is also low (around 0.103 in our matched-condition run) because PL single-season variance is structurally high (modal-table dispersion is in the 7-position range across the league). The fast engine's 0.598 PL ρ comes from 30-season aggregation, which smooths out single-season tails. Per-engine, per-method: the forked GRF's single-season 0.228 is comparable to or better than the fast engine's single-season 0.103, and sits in the upper half of the upstream stock-profile GRF baseline range (~0.10–0.30). The 30-season aggregation gap is a wall-clock question, not a fidelity question: 30 seasons of forked GRF is ~18 hours of M3 wall-clock or ~$2–3 on Modal at 32-way parallelism, neither of which has been run yet.

The simulated table is interpretable in its noise: Chelsea is sim-ranked 17th vs actual 4th (the engine's top-OVR XI for Chelsea over-weights rotated young players whose actual minutes were limited); Nottingham Forest is sim-ranked 20th vs actual 7th (similar mechanism plus a manager-tactical multiplier the engine doesn't model); Fulham and Spurs are over-ranked in sim because the engine's top-OVR XI selection has no cohesion penalty for picking the absolute best 11 without considering tactical fit. None of these are engine bugs; they are XI-selection limitations sitting on top of a now-physics-correct simulator. Full per-team table is at wave3_grf_fork/seasons/PL/grf_patched/table.csv.

5.6 Cost–benefit summary and honest verdict

The forked GRF gives us per-tick traceability (every player's action at every tick can be logged), replayable matches (deterministic under the bilateral seed wiring), visualisable evidence (the engine emits a complete state stream that can drive Plotly animations of player positions over the full 90-minute analogue), and — new in this build — per-player physics: a Manchester City XI is now running at engine-side velocities, sprint patterns, and shot accuracies that reflect the actual EA Sports FC 26 attributes of Foden, Haaland, Donnarumma, and so on, not a stock physical profile shared across all 22 named identities.

It costs us between roughly 1,400× and 5,800× the per-match wall-clock of the fast engine: 5.79 seconds at 4-worker parallel versus 1–4 milliseconds, measured under matched-fixture conditions on M3 MBA. Parallelisation is ceiling-limited by Docker container memory (peak RSS 402.5 MB at 4 workers); cloud parallelisation at 32-way through Modal is projected to take a full 380-fixture season from 36 minutes down to ~5 minutes at a marginal cost of approximately $0.07 per season.

The right use is for individual matches we want to inspect — a contested moneyball XI's match against Liverpool, a same-XI mirror run for symmetry diagnostics, a per-player action audit, a single-match replay where the narrative of player-by-player execution matters. The right use is now also for any single-match question where per-player physics is the load-bearing requirement — which the fast engine cannot answer at all. The wrong use is still aggregate league-table sweeps when the fast engine's 30-season 0.79 mean Spearman is available at three orders of magnitude lower wall-clock.

The forked replay tier is now a legitimate alternate engine for both replay and physics-correct single-match simulation. The fast tier remains the workhorse for sweeps; both engines now do what they are designed to do.

06 The squad layer

The problem the rotation layer solves

A naïve match simulator that simply picks the top-OVR XI from each team and runs every fixture with that fixed XI collapses an enormous predictive signal: bench depth. The 2024-25 Premier League season had a 25-man Liverpool squad with a top-OVR of 91 rotating to maintain quality through a fixture-congested winter; the same season had a 24-man Everton squad with a top-OVR of 84 grinding starters into injury through the same calendar. The output difference between those two clubs over 38 fixtures is materially explained by how many days of rest each starter received and how often the second-choice option started in their place. A simulator that does not model rotation cannot distinguish a deep squad from a thin one and therefore cannot predict the season-level standings reliably.

The model

The squad layer at wave3_moneyball/2_analysis/squad_lib.py wraps any engine (fast or GRF) with per-match availability, fatigue and long-term-injury logic. Per-player per-match availability is computed as:

per_match_avail = clip(
    0.92 - (100 - inj_resistance)/100 * 0.20 * age_injury_mult,
    0.50, 0.95)

where inj_resistance is the FM26-corpus fm26_injuryResistance attribute (corpus median 73, range [30, 100]), and the age multiplier is 0.85 for under-23, 1.0 for ages 23–30, and 1.0 + 0.04 × (age − 30) for over-30. Long-term injuries are sampled per season as Poisson(λ_LT, clipped at 2 per player) with λ_LT a function of inj_resistance and age multiplier; each long-term injury has geometric duration with mean four matches, clipped to [2, 12]. Fatigue deducts 5% from per_match_avail if the player started the previous two matches consecutively.
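The availability formula, age multiplier and fatigue rule above can be put together in a few lines. This is a sketch under stated assumptions, not the code in squad_lib.py: the function names are hypothetical, the treatment of exactly 30 as part of the 23–30 band is an interpretation, and the fatigue deduction is assumed to be 5 percentage points applied after clipping (the report does not say whether the 5% is absolute or relative).

```python
import random


def age_injury_mult(age: int) -> float:
    """Age multiplier: 0.85 under 23, 1.0 for 23-30, rising 0.04/year after 30."""
    if age < 23:
        return 0.85
    if age <= 30:
        return 1.0
    return 1.0 + 0.04 * (age - 30)


def per_match_avail(inj_resistance: float, age: int,
                    started_last_two: bool = False) -> float:
    """Probability that a player is available for a given match."""
    raw = 0.92 - (100 - inj_resistance) / 100 * 0.20 * age_injury_mult(age)
    avail = min(0.95, max(0.50, raw))  # clip(raw, 0.50, 0.95)
    if started_last_two:
        # Assumption: fatigue removes 5 percentage points after clipping.
        avail -= 0.05
    return avail


def draw_available(inj_resistance: float, age: int, started_last_two: bool,
                   rng: random.Random) -> bool:
    """One Bernoulli availability draw for one player and one fixture."""
    return rng.random() < per_match_avail(inj_resistance, age, started_last_two)
```

At the corpus median (inj_resistance 73) a 25-year-old comes out at 0.866, and a fatigued one at 0.816. Because the subtracted term is never negative, the raw value cannot exceed 0.92, so the upper clip at 0.95 does not bind in this form.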

Starting XI selection runs pick_starting_xi(squad, formation, available_set): from the set of available players (those whose Bernoulli availability draw came up "available"), satisfy formation quotas (e.g., for 4-3-3: 1 GK, 4 DEF, 3 MID, 3 FWD), greedy by xpts_per_match within each position bucket. The output is the per-fixture XI, conditional on the cross-match state.
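A minimal sketch of the formation-aware greedy picker follows. The signature mirrors the one named above, but the body is an illustration rather than the squad_lib.py implementation: the Player dataclass, the FORMATION_QUOTAS table and the decision to raise when a bucket cannot be filled are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Only the 4-3-3 quota is given in the report; other formations would be added here.
FORMATION_QUOTAS: Dict[str, Dict[str, int]] = {
    "4-3-3": {"GK": 1, "DEF": 4, "MID": 3, "FWD": 3},
}


@dataclass(frozen=True)
class Player:
    name: str
    pos_bucket: str        # one of "GK", "DEF", "MID", "FWD"
    xpts_per_match: float  # greedy ranking key within the bucket


def pick_starting_xi(squad: List[Player], formation: str,
                     available_set: Set[str]) -> List[Player]:
    """Fill each position quota with the best available players in that bucket."""
    xi: List[Player] = []
    for bucket, quota in FORMATION_QUOTAS[formation].items():
        candidates = sorted(
            (p for p in squad
             if p.pos_bucket == bucket and p.name in available_set),
            key=lambda p: p.xpts_per_match,
            reverse=True,
        )
        if len(candidates) < quota:
            # Assumption: the report does not say how a shortfall is handled.
            raise ValueError(f"not enough available players for {bucket}")
        xi.extend(candidates[:quota])
    return xi
```

Because selection is greedy within each bucket, an unavailable first-choice player is replaced by the next-best player in the same bucket, which is how bench depth enters the per-fixture XI.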

Empirical impact

The contribution of the rotation layer to predictive accuracy is measurable directly. Without rotation — i.e., the static top-OVR XI for every fixture — the Premier League per-season Spearman ρ is 0.31 (the v3 Dixon-Coles 100-season figure in 3_artifacts/dc_per_league_spearman.csv). With real per-match lineup mapping (where the scraped FBref starting XI is used in place of the synthetic top-OVR XI), the figure rises to 0.41. With the squad rotation layer wrapping the fast engine, the single-season figure rises to 0.50, and the 30-season aggregate rises to 0.62. Bench depth, in aggregate, is responsible for approximately +0.21 of the predictive lift between the naïve and the production configurations.

The injury parameters were not separately calibrated against a public injury database, and the squad-availability fraction has not been compared to the Premier-League-Injuries-League aggregates — this is documented in §10 as an open validation gap. The structural impact on rank correlation, however, is robustly positive at every level of disaggregation we have measured.

Bench depth is a real, measurable predictive signal. Without it, the simulation cannot distinguish Arsenal (deep squad) from Brentford (thin squad) at the season level.

07 Validation: the 30-run baseline and the forward test

7.1 Methodology

For each of the top-5 European leagues (Premier League, LaLiga, Serie A, Bundesliga, Ligue 1) we build a double round-robin schedule via the standard circle method (the polygon construction usually attributed to Kirkman), with each pairing played home and away, giving 380 fixtures per 20-team league or 306 fixtures per 18-team league. For each of 30 independent simulated seasons we (1) reset per-player state, clearing all carried injuries, fatigue counters and consecutive-start markers; (2) for every fixture in calendar order, apply the squad-rotation availability draw to each team, select the starting XI from the available pool using the formation-aware greedy picker, and pass the resulting two XIs through the fast engine to sample the scoreline; (3) aggregate per-team points across the 38-or-34-game season to produce the simulated league table.
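The circle method itself is short. The sketch below is a generic illustration, not the project's scheduler: function names are hypothetical, and the home/away alternation by round is one simple choice among several. One team is held fixed while the others rotate, and the second half of the season mirrors the first with venues swapped.

```python
from typing import List, Sequence, Tuple

Fixture = Tuple[object, object]  # (home, away)


def circle_method_rounds(teams: Sequence) -> List[List[Fixture]]:
    """Single round-robin for an even number of teams via the circle method."""
    rotation = list(teams)
    n = len(rotation)
    if n % 2:
        raise ValueError("circle method as written needs an even team count")
    rounds: List[List[Fixture]] = []
    for r in range(n - 1):
        pairs = []
        for i in range(n // 2):
            a, b = rotation[i], rotation[n - 1 - i]
            # Alternate venue by round for a rough home/away balance.
            pairs.append((a, b) if r % 2 == 0 else (b, a))
        rounds.append(pairs)
        # Keep the first team fixed and rotate everyone else one step.
        rotation = [rotation[0], rotation[-1]] + rotation[1:-1]
    return rounds


def double_round_robin(teams: Sequence) -> List[List[Fixture]]:
    """Home-and-away season: the first half plus its venue-swapped mirror."""
    first_half = circle_method_rounds(teams)
    second_half = [[(away, home) for home, away in rnd] for rnd in first_half]
    return first_half + second_half
```

For 20 teams this yields 38 rounds of 10 fixtures (380 in total), and for 18 teams 34 rounds of 9 (306), matching the counts quoted above.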

Aggregated across the 30 seasons, we compute each team's mean rank, median rank, standard deviation of rank, and number of times that team finished top-1 or top-4. The Spearman rank correlation ρ is computed between (a) the team's mean rank across the 30 simulations and (b) the team's actual 24-25 final rank. Total wall-clock for all five leagues across 30 seasons was 64.1 seconds on a single M3 MBA core, processing 52,560 simulated fixtures end-to-end. The full per-league output, including per-team mean rank with 25th and 75th percentiles, is at seasons/<league>/squad_30s/mean_table.csv; the league-level summaries are at seasons/<league>/squad_30s/summary.json; the aggregate is at seasons/top5_30seasons_summary.json.
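The aggregation step (mean rank per team across seasons, then Spearman ρ against the actual table) can be sketched with the standard library alone. This is an illustration, not the project's code: the function names and the dict-of-ranks input shape are assumptions. Tied mean ranks are handled by averaging, and ρ is computed as the Pearson correlation of the two rank vectors.

```python
from statistics import mean
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ascending ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation as Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mean_rank_by_team(season_tables: List[Dict[str, int]]) -> Dict[str, float]:
    """season_tables holds one {team: finishing rank} dict per simulated season."""
    return {t: mean(tbl[t] for tbl in season_tables) for t in season_tables[0]}


def league_rho(season_tables: List[Dict[str, int]],
               actual_rank: Dict[str, int]) -> float:
    sim = mean_rank_by_team(season_tables)
    teams = sorted(actual_rank)
    return spearman_rho([sim[t] for t in teams], [actual_rank[t] for t in teams])
```

Correlating the mean rank rather than each season's rank is what smooths single-season variance out of the headline figure, which is the distinction §7.5 returns to.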

7.2 Headline results

League	Spearman ρ	p-value	Champion match	Top-4 hit	Sim champion	Actual
Premier League	0.598	0.0112	✓	3/4	Liverpool (18/30)	Liverpool
LaLiga	0.811	0.0001	✗	4/4	Real Madrid (17/30)	FC Barcelona
Serie A	0.878	<0.0001	✗	2/4	Lombardia FC (Inter, 17/30)	SSC Napoli
Bundesliga	0.771	0.0008	✓	3/4	FC Bayern (24/30)	FC Bayern
Ligue 1	0.879	<0.0001	✓	3/4	Paris SG (modal)	Paris SG
Mean / Total	0.787	all sig.	3/5	15/20	—	—

Every league's Spearman ρ is statistically significant at p < 0.05; four of the five are significant at p < 0.001. The mean Spearman across the five leagues is 0.79 — comfortably above the published-academic Wave 2F reference band of 0.54–0.70. Champion identification is exact in Premier League, Bundesliga and Ligue 1 — Liverpool, Bayern and PSG, three of the five actual 24-25 champions correctly recovered, all also the most-modal champions across the 30 sims. Top-4 identification is 15/20 across the five leagues, with LaLiga's top-4 picked perfectly and Bayern's modal championship rate now at 24/30 (the most concentrated of any league, reflecting the Bundesliga's actual concentration).

Figure S1. Per-league Spearman ρ over 30 simulated seasons against actual 24-25 final tables. Dashed line at ρ = 0.5 marks the rough significance threshold for n = 20; the orange tinted band marks the Wave 2F published reference range (ρ = 0.54–0.70). Four of the five leagues sit above the upper edge of the reference band and the Premier League (0.598) sits inside it; the mean Spearman is 0.79. Source: seasons/top5_30seasons_summary.json.

Figure S2. Simulated mean rank vs actual 24-25 final rank, 5-panel small-multiples. Each dot is a team; the diagonal marks perfect prediction. Teams whose simulated rank deviates from actual by more than 3 positions are labelled in orange. Most labelled outliers fall into the three residual categories analysed in §7.4: corpus version drift (Man Utd, Frankfurt), tactical multipliers (Hoffenheim, OM), and single-season variance (Girona, Cremonese). Source: per-league seasons/<L>/squad_30s/mean_table.csv.

Figure S3. Champion distribution across 30 simulated seasons, per league. Orange marks the actual 24-25 champion. The shape of each distribution is interpretable: PL is dominated by Liverpool (18/30) with Manchester City and Arsenal as the modal challengers; LaLiga is a two-team race won by Real Madrid (17/30) over Barcelona (10/30); Ligue 1 collapses cleanly around PSG; Bundesliga is the most concentrated, with Bayern at 24/30 and — under the decomposed strength scalars — Hoffenheim at 0/30, Dortmund at 4/30; Serie A is a Lombardia FC (Inter) vs Napoli race. Source: seasons/top5_30seasons_summary.json.

7.3 Per-league analysis

Premier League — ρ = 0.598, p = 0.011. Liverpool is correctly identified as both modal champion (18 of 30 simulated seasons, up from 12 under the earlier formulation) and the top-of-table mean-rank pick. The top-4 hit rate is 3/4: Arsenal (sim 2nd, actual 2nd), Manchester City (sim 3rd, actual 3rd), Aston Villa (sim 4th, actual 6th), with Chelsea (sim 6th, actual 4th) as the swap. The remaining residuals — Manchester United (sim 8th, actual 15th) and West Ham (sim 5th, actual 14th) — reflect the 25-26 EA OVR corpus carrying squad-state information from after the post-24-25 summer transfer window. Manchester City's modal championship rate (6/30) is also higher than its actual #3 finish would suggest, which is the engine reading City's static squad ratings without seeing the actual-season injury crisis.

LaLiga — ρ = 0.811, p = 0.0001. Top-4 hit perfectly: Real Madrid, Barcelona, Atlético and Athletic Club appear in the simulated top-4 across all four positions. The champion is Real Madrid (17/30) over Barcelona (10/30), inverting the actual order — an EA-attribute artifact more than a simulator one. EA's static squad ratings put Real Madrid's top-11 OVR marginally above Barcelona's; the engine is faithful to EA, and EA was not faithful to the actual season.

Serie A — ρ = 0.878, p < 0.0001. The second-strongest Spearman correlation among the top-5 leagues, a hair behind Ligue 1's 0.879. The sim champion is "Lombardia FC", the EA Sports FC 26 corpus's partially-anonymised label for Inter Milan, alongside "Latium" (Lazio), "Bergamo Calcio" (Atalanta), "Milano FC" (Milan). The team-strength scalars identify Inter as the engine's top pick at 17/30, with Napoli further back: once again, EA's static squad ratings rank Inter above the actual 24-25 champion (Napoli), and the engine is faithful to EA. Atalanta under Gasperini ("Bergamo Calcio") remains the canonical men's-league system-effect over-performer that the engine cannot model.

Bundesliga — ρ = 0.771, p = 0.0008. Bayern correctly picked as both modal champion (24/30, the most concentrated of any league and above the upper edge of the 15–20 target band) and mean-rank champion. Top-4 hit 3/4: Bayern, Dortmund and Leverkusen all correctly in the top-4, with Dortmund's championship rate now at 4/30 (up from 2/30) and Hoffenheim at 0/30 (down from 5/30). The decomposed strength scalars have flipped what was previously the single largest per-team residual in the dataset into a clean read of the Bundesliga's actual concentration; the league's Spearman ρ rose by +0.24, the largest single-league gain of the calibration.

Ligue 1 — ρ = 0.879, p < 0.0001. PSG is now the modal champion, restoring the actual 24-25 order. Marseille sits behind PSG as the credible challenger; Monaco, Lille, Strasbourg, Brest and Lens fill out the second-tier-quality cohort the engine correctly identifies. The slight regression in Ligue 1 ρ (−0.04 relative to the earlier formulation) is within sampling noise at n = 30 seasons and is the price paid for the gains elsewhere; champion identification, in contrast, moved from incorrect to correct.

7.4 What the residuals tell us

Across the 96 team-rank pairs (three 20-team leagues and two 18-team leagues), the residuals that remain fall into three interpretable classes. These differ from the classes that existed under the earlier scalar formulation: the largest single-team residual (Hoffenheim in Bundesliga) is now eliminated by the OVR decomposition.

Class 1 — Corpus version drift

EA 25-26 vs FBref 24-25 mismatch

The largest residuals systematically occur in teams whose squads materially changed in the post-24-25 summer transfer window. Manchester United (sim 8th, actual 15th), West Ham (sim 5th, actual 14th) and Frankfurt are the canonical examples. Each input was drawn from the 25-26 EA OVR snapshot but evaluated against a 24-25 final-table ground truth, and the 25-26 squad strength is materially different from the 24-25 starting-XI strength. The fix is mechanical: re-snapshot the EA corpus at the 24-25 mid-point.

Class 2 — Tactical multipliers

System effects on per-attribute strength

A second category of residuals reflects teams whose realised performance materially over- or under-performs the per-attribute strength scalar because of manager-driven tactical multipliers. Atalanta under Gasperini ("Bergamo Calcio") is the canonical men's-league case. Leverkusen under Xabi Alonso (sim under-rated relative to actual #2 finish) is a Bundesliga example — EA's attribute model under-rates Leverkusen's defenders on raw Positioning/Interceptions, missing the team-tactical cohesion that drove the actual season.

Class 3 — EA static ratings vs actual-season form

The bigger picture residual

The single most consequential residual class is now structural: EA's static squad ratings rank Real Madrid above Barcelona, Inter above Napoli, and Manchester City above its actual injury-crisis 24-25 finish. None of these are engine bugs: the engine reads EA faithfully, and EA's player ratings are a point-in-time corpus that does not see mid-season injury, manager change, or tactical-cohesion effects. Closing this requires either a re-snapshot at a different point in the season, or a managerial/form correction layer on top of the engine.

None of these three classes implies an engine bug. The first is a corpus-snapshot fix (mechanical); the second is a model extension (a manager-effect multiplier candidate for future work); the third is a fidelity-of-input problem rather than a fidelity-of-engine problem. The headline result — mean Spearman 0.79 across the five leagues with every league statistically significant and three of five sim champions correct — holds independently of whether the three residual classes are addressed.

Across 5 leagues and 30 simulated seasons (52,560 fixtures), the engine's rank correlation is statistically significant in every league. The single largest per-team residual under the previous formulation, Hoffenheim winning 5/30 Bundesliga championships, is fixed to 0/30 by the OVR-decomposed strength scalars.

7.5 Forward validation: the single-shot (n=1) test

The 30-season result in §7.2 averages thirty independent simulations of the same season, which smooths away run-to-run variance. A bookmaker, a sporting director, or a Moneyball fund does not get thirty parallel universes — they get one realisation. To measure genuine forward capability we re-ran the engine in single-shot mode: each real fixture is simulated once, using that season's actual starting XI and that edition's player ratings, and the resulting points table is correlated against the real final standings. We do this across 35 league-seasons — the top-5 leagues for 2017-18 through 2024-25 (ratings from FIFA 18 through EA FC 25; real lineups and results from the Transfermarkt match corpus) — with each season replayed under 20 independent RNG seeds to give a stability band, never a multi-run average.

Lineups map to per-season ratings at a mean rate of 92.7%; the unmapped ~7% are youth call-ups and mid-window arrivals from outside the top-5, filled with a position-average replacement. The headline single-shot number is mean ρ = 0.589 across the 35 league-seasons, materially below the 0.79 of the 30-run average. That gap is not a regression; it is the honest cost of forward prediction. Roughly a quarter of the apparent 0.79 skill ((0.79 − 0.589) / 0.79 ≈ 0.25) was variance-smoothing that a single real season does not afford.

League	n=1 mean ρ	seed band (σ)	champion hit
Premier League	0.591	±0.105	46%
LaLiga	0.578	±0.099	24%
Serie A	0.684	±0.093	32%
Bundesliga	0.576	±0.129	75%
Ligue 1	0.518	±0.121	59%

Figure S7. Single-shot engine ρ (Klein blue) versus the inertia baseline (grey) per league, scored on the identical set of holdover teams present in both consecutive seasons. Leagues are ordered by the engine's skill margin over inertia. In four of five leagues the naïve “copy last year's final table” forecast outranks the attribute engine; only Ligue 1, the most mobile league, sees the engine edge ahead. Source: hist_validation/out/val/_fair_skill.csv.

7.6 The inertia control: skill versus league mobility

A high rank correlation is only impressive if the table it predicts actually moves. Serie A posts the engine's best single-shot ρ (0.68), but Serie A is also the most entrenched top-5 league: its season-to-season persistence ρ is 0.79, its rank churn the joint-lowest at 2.7 places (tied with the Bundesliga), the legacy of a near-decade Juventus dynasty. A model that simply reprinted last year's Serie A table would score 0.79, higher than the engine's 0.74 on the same teams. The engine's apparent Serie A strength is the league's monopoly of strength, not the engine's foresight.

The proper control is therefore the persistence baseline: how well does season N−1's final table predict season N's, over the teams present in both? Scoring the engine and this baseline on the identical holdover set removes the engine's handicap of also having to rank three promoted sides. The result is humbling and clarifying in equal measure:

League	engine ρ	inertia ρ	skill gain	rank churn
Ligue 1	0.602	0.573	+0.028	4.0
Serie A	0.740	0.788	-0.048	2.7
Premier League	0.586	0.720	-0.134	3.1
LaLiga	0.551	0.725	-0.174	3.1
Bundesliga	0.535	0.710	-0.175	2.7

Across all 25 holdover-matched league-seasons the engine averages ρ = 0.603 against the inertia baseline's 0.703, a deficit of -0.100, and the engine beats inertia in only 36% of league-seasons. The sign of the per-league skill margin follows league mobility: it is positive only in Ligue 1, the league with the lowest persistence (0.57) and highest churn (4.0), and most negative in Bundesliga and LaLiga, where Bayern's near-unbroken title run and a stable Spanish hierarchy make last year's table nearly unbeatable. The ordering among the four negative-margin leagues is not strict: Serie A has the highest persistence (0.79) but the smallest deficit (-0.048).
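The persistence control can be sketched as below. This is an illustration under assumptions, not the hist_validation code: tables are taken to be lists of team names ordered best to worst, ranks are assumed to be re-compressed to 1..n within the holdover set before correlating, and all function names are hypothetical. With no ties after re-ranking, the closed-form Spearman expression applies.

```python
from typing import Dict, List, Tuple


def restrict_and_rerank(table: List[str], keep: set) -> Dict[str, int]:
    """Drop teams outside `keep` and re-rank the remainder 1..n in table order."""
    kept = [t for t in table if t in keep]
    return {team: rank for rank, team in enumerate(kept, start=1)}


def spearman_no_ties(a: Dict[str, int], b: Dict[str, int]) -> float:
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid when ranks have no ties."""
    n = len(a)
    d2 = sum((a[t] - b[t]) ** 2 for t in a)
    return 1 - 6 * d2 / (n * (n * n - 1))


def skill_margin(prev_table: List[str], engine_table: List[str],
                 actual_table: List[str]) -> Tuple[float, float, float]:
    """Return (engine rho, inertia rho, engine minus inertia) on holdover teams."""
    # Holdovers are teams present in both season N-1 and season N,
    # so promoted sides drop out of both comparisons.
    holdovers = set(prev_table) & set(actual_table)
    actual = restrict_and_rerank(actual_table, holdovers)
    engine_rho = spearman_no_ties(restrict_and_rerank(engine_table, holdovers), actual)
    inertia_rho = spearman_no_ties(restrict_and_rerank(prev_table, holdovers), actual)
    return engine_rho, inertia_rho, engine_rho - inertia_rho
```

Scoring both forecasts on the same holdover set is what makes the margin a like-for-like comparison: the inertia baseline has no opinion on promoted teams, so the engine is not penalised for having to place them.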

Figure S8. The engine as a mobility detector. Each point is a league: horizontal axis is league inertia (season-to-season persistence ρ; further right is more frozen), vertical axis is the engine's skill margin over the inertia baseline. Marker size scales with rank churn. The relationship holds in sign rather than in strict monotone order (Serie A, the most persistent league, shows the smallest deficit): in frozen leagues (Bundesliga, LaLiga, Serie A) inertia is the stronger forecast and the attribute model adds noise; in the one genuinely mobile league (Ligue 1) the ratings capture movement that “last year's table” misses. Source: _mobility.csv + _fair_skill.csv.

What the inertia control reveals

The attribute engine is better understood as a mobility detector than a table predictor. Where a league's order is frozen by a dominant club, naïve persistence is the stronger forecast and ratings add noise; where the order genuinely turns over, player-level ratings earn their keep. The 30-season ρ of 0.79 was measuring league inertia as much as engine skill — it never carried a persistence control to separate the two. For the Moneyball thesis this is the more useful framing: the engine's value is concentrated exactly where the market is least settled and mispricings are most likely to exist.

08 What we deliberately do not simulate

The Wave 3 simulation infrastructure is a moneyball-focused tool, not an FM26 replacement. Several dimensions that exist in FM26 and in the published academic literature are explicitly out of scope, either because they fall outside the cross-gender player-valuation charter or because they are first-order corrections that wash out on multi-season averages. The full list, from SIMULATION_SCOPE.md §D and GRF_CONFIGURATION_AUDIT.md §3:

Out of scope — competitions and calendar

Cup competitions (FA Cup, EFL Cup, Coppa Italia, Copa del Rey, DFB-Pokal, Coupe de France). Our validation is rank-correlation against the final domestic league table; cup fixtures are uncorrelated noise from the valuation point of view.

European competitions (UCL, UEL, UECL). Same logic; cup-bracket draws are exogenous and contribute no information about per-team per-attribute strength.

Mid-season transfer windows. Our valuation is point-in-time; market dynamics over time are downstream of the per-fixture probability output.

International breaks and international-call-up fatigue (Ekstrand 2004). Documented as a Wave 4 candidate; not currently modelled.

Out of scope — tactical and in-match

Manager tactical instructions. FM26 has approximately 150 role × duty combinations; we have one effective role per pos_bucket. The Atalanta-Gasperini over-performance is the systematic residual this absence produces (see §7.4).

Mid-match substitutions. The engine assumes the chosen XI plays the full 90 minutes within a single match. The squad layer handles cross-match rotation; intra-match subs are deferred.

Set pieces and specialist routines. Boshnakov 2017 reports ~25% of PL goals come from set pieces. We do not differentiate set-piece takers or routines.

Weather, pitch quality, crowd atmosphere. First-order corrections that wash out on multi-season averages (Pollard 2008; Reade & Singleton 2020).

Referee identity and VAR. Per-game card-rate fixed effects (Buraimo, Forrest & Simmons 2010) that do not move 1X2 probabilities.

Mid-match injuries. The squad layer handles cross-match injury risk; mid-match injury events are not modelled.

These choices are not bugs — they are principled boundary conditions chosen to keep the production tier defensible and the engine fast. Each deferred dimension is documented in SIMULATION_SCOPE.md Part D with cost-benefit rationale and the Wave at which it might be added.

Out-of-scope items are not bugs — they are choices to keep the production tier defensible and the engine fast.

09 Infrastructure

9.1 Speed

The wall-clock cost of the two engines under matched conditions, M3 MBA, single-threaded unless noted:

Workload	Single-thread	4-worker local	Modal 32 parallel (projected)
1 PL season, 380 fixtures, fast engine	3.1 s	n/a	n/a
30 PL seasons, 11,400 fixtures, fast engine + rotation	13.9 s	n/a	n/a
Top-5 × 30 seasons, 52,560 fixtures, fast engine + rotation	64.1 s	n/a	n/a
1 PL season, 380 fixtures, GRF bilateral (stock)	2,708 s (≈45 min)	1,182 s (≈20 min)	~96 s (≈1.6 min)
1 PL season, 380 fixtures, GRF forked (per-player EA overrides)	~8,000 s (≈2.2 h)	2,202 s (≈36.7 min)	~275 s (≈4.6 min)
10-match GRF benchmark (stock)	158.6 s (15.86 s/match)	73.2 s (7.32 s/match)	—

Figure S5. Engine wall-clock per match, log scale. The composed fast engine (Dixon-Coles scoreline + phase per-player attribution) runs at ~3.7 ms per match on a single M3 MBA core. Stock GRF bilateral runs at 17.4 s per match single-threaded, 7.3 s per match under 4-worker local parallelism; the forked GRF with per-player EA-attribute overrides runs at ~21 s single-threaded and 5.79 s per match at 4-worker parallel (slightly faster than stock at 4 workers, attributable to the engine doing less work in the override-aware constructor path on warm-cache runs). The gap to the fast engine is approximately three to four orders of magnitude — the entire reason the architecture is two-tier. Source: parallel_benchmark.json, seasons/top5_30seasons_summary.json, wave3_grf_fork/seasons/PL/grf_patched/summary.json.

9.2 Reproducibility

The fast engine is deterministic in the seed: same XIs, same np.random.default_rng(seed), byte-identical output, verified across 30 paired runs in _smoke_determinism_result.json. The GRF replay engine is deterministic in game_engine_random_seed, also verified byte-identical across paired runs at the same seed. Match-level state dumps are written via write_goal_dumps=True (configurable on the GRF entry point); the resulting per-tick state stream allows any individual match to be re-played and re-inspected from a single seed. The whole pipeline — from EA corpus snapshot to per-fixture lineup selection to per-match simulation to season-aggregate table — can be re-run end-to-end from a fixed seed.

9.3 Parallelisation

Local parallelisation of the GRF engine uses a 4-worker ThreadPoolExecutor dispatching to Docker subprocesses; the worker count is capped by the M3 MBA's 8 GB memory budget, with peak Docker RSS of 402.5 MB per worker. The measured 4-worker speedup is 2.17× (158.6 s → 73.2 s on a 10-match benchmark, full results in parallel_benchmark.json), short of linear due to Python GIL contention on the dispatcher loop and Docker container overhead. Cloud parallelisation is wired through cloud_runner.py for Modal containers, with up to 32 parallel workers; the projected cost for a 620-match validation sweep at 32-way parallelism is approximately $0.13 at Modal pricing. User authorisation is required before any cloud spend.
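
The dispatch pattern described above (a thread pool whose workers each block on one subprocess) can be sketched as below. This is an illustrative sketch under stated assumptions, not the project's dispatcher: `build_command` spawns a trivial Python child process as a stand-in for the Docker invocation, and all function names are hypothetical. The two small helpers reproduce the speedup arithmetic and a linear per-match cost extrapolation from the figures quoted in the text.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_command(match_id, seed):
    """Command line for one match.

    Stand-in only: a real run would launch the GRF Docker image here. This
    sketch spawns a Python child that echoes its arguments so it runs anywhere.
    """
    return [sys.executable, "-c",
            "import sys; print(sys.argv[1], sys.argv[2])",
            str(match_id), str(seed)]


def run_match(match_id, seed):
    """Block on one subprocess and return its stripped stdout."""
    proc = subprocess.run(build_command(match_id, seed),
                          capture_output=True, text=True, check=True)
    return proc.stdout.strip()


def run_batch(match_ids, base_seed, workers=4):
    """Run matches with at most `workers` subprocesses alive at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_match, m, base_seed + m) for m in match_ids]
        # Collect in submission order so results line up with match_ids.
        return [f.result() for f in futures]


def speedup(serial_seconds, parallel_seconds):
    """Ratio of serial to parallel wall-clock time."""
    return serial_seconds / parallel_seconds


def projected_cost(ref_cost, ref_matches, n_matches):
    """Linear per-match extrapolation of a reference cloud cost (assumption)."""
    return ref_cost / ref_matches * n_matches


if __name__ == "__main__":
    print(run_batch(range(4), base_seed=1000))
    print(round(speedup(158.6, 73.2), 2))
```

With the benchmark figures above, `speedup(158.6, 73.2)` is about 2.17, which corresponds to roughly 54% parallel efficiency at 4 workers. Deriving each match's seed from its match id keeps the batch reproducible regardless of the order in which workers finish.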

9.4 Extensibility

Adding a new league to the validation suite is a 10-line edit to the league_config.py registry plus a hardcoded actual-table reference for Spearman comparison. The registry currently supports Premier League, LaLiga, Serie A, Bundesliga and Ligue 1; the structural shape generalises trivially to any league for which the EA Sports FC 26 corpus has player attributes (Eredivisie, Liga Portugal, Belgian Pro League, Saudi Pro League, MLS, Liga MX, Argentine Primera División). Adding a league requires (1) the EA corpus player rows tagged with that league code, (2) the actual 24-25 final table for Spearman comparison, and (3) the team-name mapping from EA corpus labels to the canonical league labels.
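
As a rough illustration of the three ingredients listed above, a registry entry and the Spearman comparison might look like the sketch below. The `LeagueConfig` fields, the `REGISTRY` dict, the league code and the club names are all hypothetical placeholders; the actual layout of league_config.py is not shown in this report. The correlation uses the textbook no-ties formula ρ = 1 − 6Σd² / (n(n² − 1)).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeagueConfig:
    """Hypothetical registry entry; field names are illustrative only."""
    code: str            # league tag carried by the EA corpus player rows
    actual_table: tuple  # real final standings, champions first
    name_map: dict       # EA corpus label -> canonical league label


# Placeholder league with placeholder clubs (not real data).
REGISTRY = {
    "XX1": LeagueConfig(
        code="XX1",
        actual_table=("Club A", "Club B", "Club C", "Club D"),
        name_map={"Club A FC": "Club A", "C. Club": "Club C"},
    ),
}


def canonical_table(config, sim_table):
    """Translate EA corpus labels in a simulated table to canonical labels."""
    return [config.name_map.get(team, team) for team in sim_table]


def spearman_rho(sim_table, actual_table):
    """Spearman rank correlation between two orderings of the same teams.

    Assumes no tied positions, so the closed-form formula applies.
    """
    if set(sim_table) != set(actual_table):
        raise ValueError("tables must contain the same teams")
    n = len(actual_table)
    actual_rank = {team: i for i, team in enumerate(actual_table)}
    d2 = sum((i - actual_rank[team]) ** 2 for i, team in enumerate(sim_table))
    return 1 - 6 * d2 / (n * (n * n - 1))


if __name__ == "__main__":
    cfg = REGISTRY["XX1"]
    sim = canonical_table(cfg, ["Club B", "Club A FC", "C. Club", "Club D"])
    print(spearman_rho(sim, cfg.actual_table))
```

Raising on mismatched team sets is a deliberate choice in this sketch: an unmapped EA label would otherwise silently distort the rank correlation, which is exactly the failure the team-name mapping exists to prevent.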

The infrastructure runs deterministically, scales to 32-way cloud parallelism at a projected cost of about $0.13 per 620-match sweep, and adds a new league in a 10-line patch.

10 Limits and what's next

Known limitations

The honest list of what is known to be wrong, or known to be missing, as of May 2026:

EA 25-26 corpus vs 24-25 actual season. Mid-season transfers cause mid-table and top-of-table residuals (Manchester United, West Ham, Frankfurt; the Real Madrid–Barcelona and Inter–Napoli champion flips). Re-snapshotting the EA corpus at the 24-25 mid-point — or applying a mid-season-form correction layer — would close this without changing the engine. Specific known drift cases include Rashford to Aston Villa (loan), Williams to Saudi Arabia, and the post-Champions-League-exit Manchester City rotation pattern.
GRF EA→GRF stat-mapping denominator is a v0 calibration. The forked engine maps EA's 0–99 sub-attribute scale onto GRF's 0–1 stat space via a uniform (EA − 30) / 70 linear floor-clamp. A per-stat denominator fit by grid search — e.g., velocity might use / 60, shot power / 80 — would tighten the engine-side execution-quality calibration further. Single-stat sweeps over {50, 60, 70, 80} on a 3-season aggregate are the natural next iteration.
GRF per-player scorer attribution is heuristic. The current implementation credits goals via a nearest-player-to-ball-at-goal-tick heuristic. On the City vs United 20-sim authentic match this gave Haaland a 5% match-scoring rate — implausibly low given his macro profile. The fix is to track ball_owned_player at the tick immediately before the score increment (the slot that controlled the ball during the build-up, not the slot nearest at the moment the goal lights up). Macro outcomes (City 2.05 vs Utd 0.40) are unaffected; only per-player credit needs the better heuristic.
GRF 30-season aggregation not yet run. The forked GRF takes ~36 minutes per 380-fixture PL season at 4-worker parallel; a 30-season aggregate is ~18 hours of wall-clock locally, or approximately $2–3 on Modal at 32-way parallelism. The single-season Spearman ρ = 0.228 with top-3 exact is encouraging; the apples-to-apples comparison against the fast engine's 30-season 0.598 awaits this run.
XI selection on the GRF tier is unweighted top-OVR. The forked GRF currently picks each team's top-11 by OVR per position bucket, with no cohesion penalty or rotation modelling. This is what over-ranks Fulham and Spurs in the single-season replay and contributes to the Chelsea/Nottingham Forest under-ranking. The squad layer (§6) is wired only to the fast engine at present.
No manager-effect tactical multiplier. Teams that over-perform their per-attribute strength through system effects (Atalanta under Gasperini, Leverkusen under Xabi Alonso) cannot be modelled in the current engine. A residual-analysis-driven multiplier could be fit from historical data.
No intra-match substitutions. Real PL matches average ~3–5 substitutions, typically fresh legs for the last 30 minutes; the engine assumes the chosen XI plays the full 90.
Round-robin synthetic schedule. The validation schedule is built via the standard circle method, not the real fixture calendar. Real fixture calendars have correlated home/away streaks and Christmas-period fixture congestion that the synthetic schedule does not reproduce.
Real per-match lineup mapping at 76% coverage. lineups_24_25_PL.csv contains 15,188 player-match rows; mapping to the EAFC26 EA corpus succeeds for approximately 76% of players. The remainder are mostly players who left the top-8 leagues between scrape and snapshot.
Squad-layer injury parameters not externally validated. The per-match availability and long-term-injury rates are FM26-attribute-parameterised but have not been compared against a public Premier League injury database (e.g., Premier Injuries Ltd). The structural impact on rank correlation is robustly positive (§6), but the parameter values themselves are unvalidated.
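
The stat-mapping item above can be made concrete with a short sketch. The defaults reproduce the uniform (EA − 30) / 70 map from the text; the upper clamp at 1.0 is an assumption added here, since smaller denominators such as 50 or 60 would otherwise push high attributes above GRF's 0–1 range. The stat keys and the `PER_STAT_DENOMINATOR` values are illustrative (they echo the "velocity / 60, shot power / 80" examples) and are not fitted results.

```python
def ea_to_grf(ea_value, offset=30.0, denominator=70.0):
    """Map an EA 0-99 sub-attribute onto GRF's 0-1 stat space.

    Linear map (EA - offset) / denominator, clamped to [0, 1]. The floor at 0
    follows the text; the ceiling at 1 is an assumption of this sketch.
    """
    return min(1.0, max(0.0, (ea_value - offset) / denominator))


# Illustrative per-stat denominators, not calibrated values.
PER_STAT_DENOMINATOR = {"velocity": 60.0, "shot_power": 80.0}


def map_player(ea_stats, per_stat=None):
    """Map a dict of EA sub-attributes, using per-stat denominators if given."""
    per_stat = per_stat or {}
    return {
        stat: ea_to_grf(value, denominator=per_stat.get(stat, 70.0))
        for stat, value in ea_stats.items()
    }


def sweep_overrides(stat, candidates=(50.0, 60.0, 70.0, 80.0)):
    """One override dict per candidate denominator for a single-stat sweep."""
    return [{stat: d} for d in candidates]


if __name__ == "__main__":
    player = {"velocity": 90, "shot_power": 70}  # hypothetical EA values
    print(map_player(player))                        # uniform v0 mapping
    print(map_player(player, PER_STAT_DENOMINATOR))  # per-stat variant
    for override in sweep_overrides("velocity"):
        print(override, map_player(player, override)["velocity"])
```

Each override dict from `sweep_overrides` changes exactly one stat's denominator while every other stat stays on the v0 value of 70, which is the shape a single-stat sweep over {50, 60, 70, 80} needs.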

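For reference, the circle method mentioned in the schedule item above can be sketched in a few lines. This is the textbook construction (fix one team, rotate the others), written here as an illustration rather than as the project's scheduler; the home/away alternation rule and the mirrored second half are assumptions of this sketch.

```python
def circle_method_rounds(teams):
    """Single round-robin via the circle method.

    One team stays fixed while the others rotate one slot per round.
    Returns a list of rounds, each a list of (home, away) pairs.
    """
    teams = list(teams)
    if len(teams) % 2:
        teams.append(None)  # bye marker for odd team counts
    n = len(teams)
    rounds = []
    for r in range(n - 1):
        pairs = []
        for i in range(n // 2):
            a, b = teams[i], teams[n - 1 - i]
            if a is None or b is None:
                continue  # this team has a bye in the current round
            # Flip orientation on odd rounds to spread home and away games.
            pairs.append((a, b) if r % 2 == 0 else (b, a))
        rounds.append(pairs)
        # Rotate every team except the first by one position.
        teams = [teams[0], teams[-1]] + teams[1:-1]
    return rounds


def double_round_robin(teams):
    """Full season: the first half plus its mirror with venues swapped."""
    first = circle_method_rounds(teams)
    second = [[(away, home) for home, away in rnd] for rnd in first]
    return first + second


if __name__ == "__main__":
    season = double_round_robin([f"T{i:02d}" for i in range(20)])
    print(len(season), sum(len(rnd) for rnd in season))
```

For 20 teams this yields 38 rounds and 380 fixtures, and for 18 teams 306, matching the per-league fixture counts used in the validation runs. Because venues follow a fixed rule, the construction cannot reproduce the correlated home/away streaks or holiday congestion of a real calendar.
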
Recommended next work

Per-stat (EA − offset) / denominator calibration on the forked GRF, swept over a 3-season aggregate. Currently the highest-impact single tuning available on the replay tier.
Switch GRF per-player scorer attribution from nearest-player-to-ball to ball_owned_player at the tick before score increment.
Run the forked GRF on a 30-season top-5 aggregate via Modal at 32-way parallelism (~$10–15, ~5–6 h wall-clock at 32 parallel). Closes the 30-season gap to the fast engine on apples-to-apples terms.
Re-snapshot the EA corpus at the 24-25 mid-point, closing the transfer-drift residual class.
Wire the squad layer (§6) to the forked GRF tier so single-match replay and league-table aggregates share the same rotation/injury logic.
Fit manager-effect tactical multipliers from the residual analysis; validate on a held-out 23-24 season.
Apply this infrastructure to women's football. The EAFC26 corpus has 1,645 women across 12 leagues — all currently unsimulated. The attribute-derived strength scalar is exactly the construct that lets us cross to a population without historical men's-league fixtures to back-fit; the forked GRF's per-player physics layer is exactly what makes cross-gender execution-quality comparison physically meaningful.
Compare the squad-layer injury parameters against a published Premier League injury database; recalibrate if material drift exists.
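
The scorer-attribution change recommended above can be sketched against a per-tick state stream. The keys `score`, `ball_owned_team` and `ball_owned_player` follow the GRF raw-observation names; everything else is an assumption of this sketch, including the walk-back rule: the ball is often unowned at the tick immediately before the score changes (it is in flight after the shot), so the sketch walks back to the most recent tick at which the scoring team owned it.

```python
def last_owner(ticks, start, team, max_lookback=200):
    """Most recent owning slot on `team` at or before tick `start`.

    Returns None if the team did not own the ball within `max_lookback` ticks.
    """
    stop = max(start - max_lookback, -1)
    for k in range(start, stop, -1):
        tick = ticks[k]
        if tick["ball_owned_team"] == team and tick["ball_owned_player"] != -1:
            return tick["ball_owned_player"]
    return None


def attribute_goals(ticks):
    """Credit each goal to the last owner on the scoring team.

    `ticks` is a list of dicts with 'score' ([left, right]),
    'ball_owned_team' (0 left, 1 right, -1 none) and
    'ball_owned_player' (slot index, -1 none).
    Returns a list of (tick_index, team, slot_or_None).
    """
    goals = []
    for t in range(1, len(ticks)):
        prev_score, cur_score = ticks[t - 1]["score"], ticks[t]["score"]
        for team in (0, 1):
            if cur_score[team] > prev_score[team]:
                goals.append((t, team, last_owner(ticks, t - 1, team)))
    return goals


if __name__ == "__main__":
    # Hand-written toy stream: left slot 9 shoots, ball in flight, goal.
    stream = [
        {"score": [0, 0], "ball_owned_team": 0, "ball_owned_player": 9},
        {"score": [0, 0], "ball_owned_team": -1, "ball_owned_player": -1},
        {"score": [1, 0], "ball_owned_team": -1, "ball_owned_player": -1},
    ]
    print(attribute_goals(stream))
```

Restricting the walk-back to the scoring team avoids crediting a defender or goalkeeper who touched the ball last. One consequence is that an own goal is credited to the attacking side's last owner, which a production implementation may want to handle separately.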

The architecture is locked. Remaining work is calibration, coverage, and aggregation — closing residuals, not changing the engines.

11 References

Anderson, C., & Sally, D. (2013). The Numbers Game: Why Everything You Know About Soccer Is Wrong. Penguin.
Boshnakov, G., Kharrat, T., & McHale, I. G. (2017). A bivariate Weibull count model for forecasting association football scores. International Journal of Forecasting 33(2), 458–466.
Bradley, P. S., Sheldon, W., Wooster, B., Olsen, P., Boanas, P., & Krustrup, P. (2009). High-intensity running in English FA Premier League soccer matches. Journal of Sports Sciences 27(2), 159–168.
Bryson, A., Frick, B., & Simmons, R. (2013). The returns to scarce talent: footedness and player remuneration in European soccer. Journal of Sports Economics 14(6), 606–628.
Buraimo, B., Forrest, D., & Simmons, R. (2010). The 12th man? Refereeing bias in English and German soccer. Journal of the Royal Statistical Society: Series A 173(2), 431–449.
Carling, C., Bloomfield, J., Nelsen, L., & Reilly, T. (2008). The role of motion analysis in elite soccer. Sports Medicine 38(10), 839–862.
Constantinou, A. C., Fenton, N. E., & Neil, M. (2012). pi-football: A Bayesian network model for forecasting Association Football match outcomes. Knowledge-Based Systems 36, 322–339.
Decroos, T., Bransen, L., Van Haaren, J., & Davis, J. (2019). Actions Speak Louder than Goals: Valuing Player Actions in Soccer. KDD '19.
Dendir, S. (2016). When do soccer players peak? A note. Journal of Sports Analytics 2, 89–105.
Dixon, M. J., & Coles, S. G. (1997). Modelling association football scores and inefficiencies in the football betting market. Journal of the Royal Statistical Society: Series C (Applied Statistics) 46(2), 265–280.
Dixon, M. J., & Robinson, M. E. (1998). A birth process model for association football matches. The Statistician 47(3), 523–538.
Ekstrand, J., Waldén, M., & Hägglund, M. (2004). Risk for injury when playing in a national football team. Scandinavian Journal of Medicine & Science in Sports 14(1), 34–38.
Goddard, J. (2005). Regression models for forecasting goals and match results in association football. International Journal of Forecasting 21(2), 331–340.
Hvattum, L. M., & Arntzen, H. (2010). Using ELO ratings for match result prediction in association football. International Journal of Forecasting 26(3), 460–470.
Karlis, D., & Ntzoufras, I. (2003). Analysis of sports data by using bivariate Poisson models. The Statistician 52(3), 381–393.
Kitano, H., Asada, M., Kuniyoshi, Y., Noda, I., & Osawa, E. (1997). RoboCup: The Robot World Cup Initiative. Proceedings of the First International Conference on Autonomous Agents.
Kurach, K., Raichuk, A., Stańczyk, P., Zając, M., Bachem, O., Espeholt, L., Riquelme, C., Vincent, D., Michalski, M., Bousquet, O., & Gelly, S. (2019). Google Research Football: A Novel Reinforcement Learning Environment. arXiv:1907.11180; github.com/google-research/football.
Maher, M. J. (1982). Modelling association football scores. Statistica Neerlandica 36(3), 109–118.
McHale, I. G., & Holmes, B. (2023). Estimating transfer fees of professional footballers using advanced performance metrics and machine learning. European Journal of Operational Research 306(1), 389–399.
Muehlheusser, G., Schneemann, S., & Sliwka, D. (2014). The impact of managerial change on performance: The role of team heterogeneity. IZA Discussion Paper 7950.
Pollard, R. (2008). Home advantage in football: A current review of an unsolved puzzle. The Open Sports Sciences Journal 1(1), 12–14.
Reade, J. J., & Singleton, C. (2020). Football is back, the crowds are not. Centre for Economic Performance COVID-19 papers.
Singh, K. (2018). Introducing Expected Threat (xT). karun.in/blog/expected-threat.html.
Spearman, W. (2018). Beyond Expected Goals. MIT Sloan Sports Analytics Conference 2018.
Sports Interactive. (2024). Football Manager 2024 Match Engine — Developer Blog Series. Sports Interactive Ltd.
Sumpter, D. (2016). Soccermatics: Mathematical Adventures in the Beautiful Game. Bloomsbury Sigma.
Szymanski, S. (2013). Wages, transfers and the variation of team performance in the English Premier League. Sport, Business and Management 3(1), 6–17.
Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. de L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., & Riedmiller, M. (2018). DeepMind Control Suite. arXiv:1801.00690.
Yang, J., et al. (2025). Player valuation in European football using gradient-boosted decision trees. Working paper.
Yiğit, A. T., Samak, B., & Kaya, T. (2024). Football player position determination via machine learning models. Procedia CIRP.
RoboCup Federation. RoboCup Soccer Simulation League. robocup.org.

End of SIMULATION_REPORT.html. Wave 3 simulation infrastructure, methodology paper v1.0, May 2026. All claims trace to a specific CSV/JSON path in the EAFC26 wave3_grf and wave3_moneyball directories or to a numbered reference above.