— Pok Yeung Lee · 2026
Abstract
Football is a $50B labour market priced almost entirely on rumour. Transfermarkt is a crowd-curated valuation surface, CIES Football Observatory is a proprietary regression, and StatsBomb's event-level corpus is locked behind per-seat club licences. The only globally consistent, attribute-deep, cross-gender, free public datasets on professional footballers are two video games — EA Sports FC 26 and Football Manager 26 — and this paper treats both of them as research instruments for a single task: predicting Transfermarkt market value from attribute vectors and asking how close that prediction can be pushed to the published academic ceiling. The argument unfolds along a single golden thread. Sections 2 and 3 examine each instrument individually, reverse-engineering EA's positional OVR formula and FM26's Current Ability surface and cataloguing the structural quirks each instrument carries. Section 4 compares the two databases on the same 13,434 men. Section 5 builds the union model and demonstrates that the joint attribute vector lifts five-fold cross-validated R² from 0.663 (EA-only) and 0.680 (FM26-only) to 0.785, a gain that survives every fold. Section 6 validates the resulting predictions against two independent ground truths, Transfermarkt and the CIES Football Observatory, and shows that the union model sits closer to TM at the marquee tail and bracketed below CIES's contract-and-financial multiple. Section 7 catalogues the residual ceiling — celebrity premium, price floor, and the FM24→FM26 structural rebase — that the union model cannot remove. Section 8 tests whether a men's-trained model transfers to the women's database that Sports Interactive launched in November 2025, and finds that it does not: within-gender modelling on FM26 attributes recovers women's Current Ability at R²=0.906 while the same fit on EA's schema goes anti-signal at R²=−0.10, a finding that retracts Wave 1's cross-gender claim and replaces it with a within-gender framework. 
The supplementary chapters at the end of the paper catalogue league-level valuation deviations (the Premier League pays ×1.46 of OVR-baseline, LaLiga pays ×0.76), exchange-rate geography, country-PPP decoupling, and a four-cluster personality-archetype atlas. These are not the headline contribution and a reader who stops after §8 has received the main argument.
Chapter 1 — Two instruments, examined separately and side by side
The first chapter establishes the empirical foundation the rest of the paper rests on. It explains why two football video games — and not the academic or proprietary alternatives — constitute the only viable public corpus for this question, examines each database on its own terms, and then places the two schemas alongside each other on the same 13,434 men to measure exactly how much they agree, where they disagree, and what each carries that the other cannot.
§1 — Why the games are the data

Football appears to have public data, but inspection of the underlying files reveals that it does not. The annual global transfer market clears north of seven billion euros and the underlying labour market clears roughly seven times that figure; the canonical public valuation source — Transfermarkt — is a network of unpaid moderators whose median market value for any one player is reverse-engineered from forum discussion and the most recent reported fee. The two best academic instruments — CIES Football Observatory's monthly valuations and Twenty First Group's club-financial composites — are proprietary, license-gated, and undocumented at the feature level [CIES, 2024]. The richest event data — StatsBomb's VAEP and xG matrices — is sold to clubs and broadcasters under per-seat licenses that put it outside any open-research budget. The widely-cited academic ceiling for tree-ensemble valuation models — R² ≈ 0.85–0.90 from McHale & Holmes 2023 and Yang et al. 2025 — was established on top of exactly those proprietary feature sets [McHale & Holmes, 2023][Yang et al., 2025]. Replicating that ceiling on public data is therefore not a modelling problem but a data problem.
The two video games constitute the necessary data. EA Sports FC 26 ships a researcher-graded attribute matrix for 17,569 footballers — 16,122 men across 45 leagues plus 1,447 women from named playable leagues — at a per-player density of roughly thirty numeric attributes, refreshed monthly through the season. Football Manager 26, which launched in November 2025 after a one-year cancellation that retired the legacy engine and put the franchise on Unity, ships an attribute database one to two orders of magnitude larger: roughly 1.3 million players in Sports Interactive's researcher database overall, maintained by a network of about 1,300 paid contributors organised under 86 full-time research leads who scout in person and revise scores on a monthly cadence [Sports Interactive / SportsPro Media, 2024]. Premier League clubs have used the FM database as a cross-reference against internal scouting since Everton signed the first publicly-disclosed SI–club data partnership in 2008 [MCV/Develop, 2008]. FM26 is also the first edition of the franchise to ship with women's football: 36,000 women and counting across 14 leagues at launch, built over four years by an independent 40-researcher women's team [Football Manager Blog, 2025].
Two structural notes belong with the FM editions timeline in Figure 1.1. FM25 was cancelled in February 2025; SI publicly cited the difficulty of porting the legacy engine to Unity within the release window [PC Gamer, 2025]. FM26 is therefore not the twentieth annual edition but the first edition on a ground-up engine rewrite, the first new engine since the 2004 Championship Manager fork. The two-year gap between FM24 and FM26 (engine change, role-system overhaul, women's-database scale constraint) is the reason every CA/PA series in this paper has a documented structural break at that boundary and at no other transition [FM Scout, 2025]. The seven editions in scope, counted from the raw files on disk, carry 716,027 men and 421 women across the FM2016 → FM26 window.
The Wave 1 finding that justifies this framing is a single audit on the 303 realized transfer fees that match the EAFC26 men's universe. Three models — EA-only Ridge, TM-only Ridge, and an EA+TM hybrid — were cross-validated against those realized fees. The result is the table below.
| Model | Spearman vs realized fee | Median APE |
|---|---|---|
| EA-only Ridge | 0.564 | 62.6% |
| TM-only Ridge | 0.735 | 45.2% |
| Hybrid (EA + TM + age + position + league) | 0.725–0.739 | 47.9–54.9% |
EA's published value_eur adds no predictive lift beyond Transfermarkt market value, and the hybrid model performs no better than TM alone. The reason is structural: EA's value_eur is itself a deterministic function of OVR, age, and reputation under EA's internal formula, which the OVR reverse-engineering result in Appendix A4 recovers to within rounding error. Feeding EA's simulation into a model trained on TM therefore constitutes circular signal compression. EA's contribution is the attribute schema, not the value field — the load-bearing pivot of Wave 1 and the precondition for everything Wave 2 does. The same observation answers the gaming community's recurring complaint that EA "gets the numbers wrong"; the numbers EA gets wrong are precisely those it generates downstream from its own attribute matrix, while the attribute matrix itself remains sound.
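The two columns of the table above can be computed along the following lines. This is a minimal sketch rather than the paper's audit pipeline: the function names and the four fee pairs are hypothetical, and Spearman is implemented as the Pearson correlation of average ranks so that only NumPy is assumed.

```python
import numpy as np

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    # Average the ranks within each group of tied values.
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(predicted, realized):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(average_ranks(predicted), average_ranks(realized))[0, 1])

def median_ape(predicted, realized):
    """Median absolute percentage error, as a fraction of the realized fee."""
    predicted = np.asarray(predicted, dtype=float)
    realized = np.asarray(realized, dtype=float)
    return float(np.median(np.abs(predicted - realized) / realized))

# Hypothetical fees in EUR millions, for illustration only.
realized_fees = [10.0, 20.0, 40.0, 80.0]
predicted_fees = [12.0, 15.0, 50.0, 60.0]
rho = spearman(predicted_fees, realized_fees)
mdape = median_ape(predicted_fees, realized_fees)
```

The two metrics answer different questions: Spearman asks only whether the model orders the fees correctly, while median APE asks how far off a typical prediction is in relative terms. That is why a model can rank well and still carry a large percentage error.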
Once that pivot is accepted, the analytical question changes shape. If the attribute matrix is the load-bearing instrument and the published value field is decorative, then the appropriate next step is not to tune the existing model but to locate another attribute matrix; Football Manager supplies precisely that. The Wave 1 men's-trained model on EA attributes alone lands at R² = 0.77 on the full 7,835-man training set, with a clean diagnostic by league coverage:
| Coverage | n | R² | Spearman | MAPE |
|---|---|---|---|---|
| Top-5 leagues | 1,130 | 0.62 | 0.79 | 45% |
| Top-8 leagues | 2,276 | 0.82 | 0.90 | 40% |
| Top-12 leagues | 3,714 | 0.79 | 0.89 | 40% |
| Top-20 leagues | 5,623 | 0.77 | 0.86 | 40% |
| Full 45 leagues | 7,835 | 0.77 | 0.86 | 41% |
The 0.82 figure on the top-8 leagues, the band in which any production model would actually be deployed, sits within striking distance of the published ceiling, about eight points below Yang et al.'s 0.901 GBDT result on a comparable FIFA-22 sample of equivalent size [Yang et al., 2025]. The gap can be named: VAEP and xG event metrics (the McHale-Holmes lift) and club-financial features (the CIES and Twenty First Group lift), the two feature families the proprietary models possess and the public corpus lacks. The drop from 0.82 to 0.77 as league coverage widens to the full 45-league corpus has a straightforward modelling interpretation: the long tail of third- and fourth-tier leagues, where Transfermarkt's crowd is sparse and the values are quantised to the nearest €100k, is the dominant source of unexplained variance. Coverage over peak accuracy was the Wave 1 trade-off, and it remains the appropriate trade-off in Wave 2 because the leagues added at the margin are precisely those where FM's coverage is densest and EA's is thinnest.
There is also a literature-level fact worth stating here, because it shapes the rest of the paper. The decade-long lineage of papers regressing Transfermarkt market values on FIFA-style attribute matrices is remarkably consistent: tree-based ensembles (Random Forest, XGBoost, LightGBM, CatBoost, GBDT) routinely reach R² ≈ 0.85–0.90 when predicting log-TM-value from FIFA overall plus thirty sub-attributes plus age, position, league, and international caps [McHale & Holmes, 2023][Yang et al., 2025]. Linear models ceiling out around R² = 0.70–0.78 because the value-to-rating relationship is convex — a 5-point jump from 80 to 85 OVR is worth far more than from 65 to 70 — and that convexity is exactly what an ensemble of regression trees recovers automatically. McHale & Holmes (2023) is the strongest piece in the genre because they predict realized transfer fees rather than TM values, and they show that combining FIFA expert ratings with VAEP-style action values plus xG-based plus/minus beats TM on average — though TM still wins for fees above €20M, the superstar tail where the celebrity premium dominates. The CIES Football Observatory and Twenty First Group models, while proprietary, document overlapping ingredient lists: contract length, age, international status, performance metrics, club financial strength, league level. Both report roughly 85% correlation with realized fees [CIES, 2024]. That ~85% is the academic ceiling the unified Wave 1 + Wave 2 piece is going to live underneath.
The remainder of this paper performs the comparison Wave 1 could not, placing a second, fully-formed, researcher-graded attribute database alongside EA's and measuring the gap between them. The premise inherited from Wave 1 remains unchanged (in football, the simulation is the dataset), but the experimental design widens. Wave 2 asks the following: if EA's attribute schema constitutes one credible instrument and Football Manager's constitutes another, what does each one contribute individually, and what do they contribute jointly? The first answer is that the two agree on rank order and disagree on shape, and the disagreements themselves constitute the data product. Appendix A1 documents the seven FM editions, the three acquisition paths (Kaggle community dumps, the FUTEK.io scrape for FM24, and the EFEM.club scrape for FM26), and the 92.5% exact-match rate of the name+DOB+nationality fuzzy join used to bolt FM26 onto the EAFC26 corpus. §1 needs no further elaboration, because everything that follows builds directly on it.
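A join on name, date of birth, and nationality of the kind described above could be sketched as follows. The actual procedure is documented in Appendix A1 and may differ; the row layout, the `match_players` helper, the 0.85 similarity threshold, and the sample players are all hypothetical, and the fuzzy step uses the standard-library `difflib.SequenceMatcher` purely for illustration.

```python
import unicodedata
from difflib import SequenceMatcher

def normalise_name(name):
    """Lowercase, strip combining accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())

def match_players(ea_rows, fm_rows, threshold=0.85):
    """Match EA rows to FM rows on (name, dob, nationality).

    An exact match on the normalised key is tried first. Otherwise the best
    fuzzy name match among FM rows with the same dob and nationality is kept
    if its similarity reaches the (assumed) threshold.
    Returns a list of (ea_id, fm_id, kind) with kind in {"exact", "fuzzy"}.
    """
    exact_index = {}
    by_dob_nat = {}
    for fm in fm_rows:
        key = (normalise_name(fm["name"]), fm["dob"], fm["nationality"])
        exact_index.setdefault(key, fm)
        by_dob_nat.setdefault((fm["dob"], fm["nationality"]), []).append(fm)

    matches = []
    for ea in ea_rows:
        name = normalise_name(ea["name"])
        hit = exact_index.get((name, ea["dob"], ea["nationality"]))
        if hit is not None:
            matches.append((ea["id"], hit["id"], "exact"))
            continue
        candidates = by_dob_nat.get((ea["dob"], ea["nationality"]), [])
        scored = [
            (SequenceMatcher(None, name, normalise_name(c["name"])).ratio(), c)
            for c in candidates
        ]
        if scored:
            score, best = max(scored, key=lambda pair: pair[0])
            if score >= threshold:
                matches.append((ea["id"], best["id"], "fuzzy"))
    return matches

# Invented players, for illustration only.
ea_rows = [
    {"id": 1, "name": "Álvaro Ejemplo", "dob": "1998-03-04", "nationality": "ESP"},
    {"id": 2, "name": "Jon Sampleson", "dob": "2000-01-02", "nationality": "ENG"},
    {"id": 3, "name": "Nobody Match", "dob": "1999-09-09", "nationality": "FRA"},
]
fm_rows = [
    {"id": "fm-a", "name": "Alvaro Ejemplo", "dob": "1998-03-04", "nationality": "ESP"},
    {"id": "fm-b", "name": "John Sampleson", "dob": "2000-01-02", "nationality": "ENG"},
]
matches = match_players(ea_rows, fm_rows)
exact_rate = sum(kind == "exact" for _, _, kind in matches) / len(matches)
```

Restricting fuzzy candidates to rows that already agree on date of birth and nationality keeps the string comparison cheap and makes false positives unlikely, at the cost of missing players whose birth date differs between the two sources.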
§2 — Examining EA Sports FC 26: what the OVR surface actually is

Before any comparison can be made between EA Sports FC 26 and Football Manager 26, the EA artefact must be examined on its own terms. Its one-number summary — the Overall Rating, or OVR — is so culturally familiar from broadcast graphics and trading-card overlays that practitioners frequently treat it as an opaque scalar handed down by EA's internal committee, rather than as the deterministic algebraic object it is. This section characterises EA Sports FC 26 as a research instrument: it specifies the coverage and cadence, reverse-engineers the OVR aggregation rule, catalogues the sub-attribute schema, and names the structural biases any downstream model must correct for. The side-by-side comparison with Football Manager 26 follows in §3.
2.1 — EA Sports FC 26 as a data instrument
EA Sports FC 26 ships an attribute matrix for 17,569 footballers — 16,122 men distributed across 45 licensed men's leagues and 156 nationalities, plus 1,447 women drawn from twelve named playable women's leagues. The men's universe is refreshed on the September product cycle that has defined the franchise since the FIFA 94 release, with monthly Title Update revisions through the competitive season [EA Sports, 2025]. Ratings are produced by what EA describes as a global network of approximately fifty internal raters supported by per-league freelance panels, all working under the EA Ratings team — a body whose published mandate is broadcast plausibility and online-matchmaking balance rather than scouting-grade simulation fidelity [EA Sports / Jacobson, 2023]. The committee size is roughly one twenty-fifth of Sports Interactive's researcher network, but the operating cadence is approximately twelve times faster; the trade-off is explicit in EA's own product framing and is the structural feature most relevant to valuation work.
The covered universe is broad in nominal league count but biased in coverage density. The five most heavily resourced leagues — Premier League, LaLiga, Bundesliga, Ligue 1, Serie A — together account for approximately 2,800 men, or roughly seventeen percent of the men's universe; the remaining 13,300 are distributed across forty further leagues. EA's coverage strategy is broadcast-tier weighted: a league appears in the database if it generates television revenue at a threshold high enough to support a licensing arrangement with the studio, and per-league rating density tracks that broadcast budget. The marginal coverage decision is commercial rather than analytical, and the resulting population shape correlates structurally with the broadcast economy that also shapes Transfermarkt's editor attention. The two coverage geometries are not independent.
Public availability is the third instrument-grade fact. EA publishes the full Overall, Potential, six headline stat composites, and the thirty-odd sub-attribute values per player on its ea.com/games/ea-sports-fc/ratings portal, refreshed monthly with Title Updates; community mirrors on Kaggle redistribute the resulting CSV files at scale. The combination of attribute-deep coverage, monthly refresh cadence, public availability, and licensed-league breadth places EA Sports FC 26 in a class of one among public football data artefacts. Football Manager matches the depth and exceeds the scale but not the refresh cadence; StatsBomb matches the depth but is license-gated; Transfermarkt matches the breadth but exposes no per-player attribute decomposition.
2.2 — The OVR formula, reverse-engineered
The single most important methodological finding about EA Sports FC 26, and the result that licenses everything §3 will compare against, is that OVR is not an opaque committee judgement on top of a sub-attribute matrix. It is a deterministic, position-stratified, linear aggregation of those sub-attributes, recoverable from the published data to within rounding error. The recovery procedure, documented in Appendix A4, is a position-stratified ordinary least squares regression: for each EA position bucket $p$ (goalkeeper, defender, midfielder, wide attacker, forward), the published OVR is regressed against the standardised sub-attribute vector and the coefficient of determination on the resulting linear fit is reported per bucket.
The result is sharp. Per-position $R^2$ ranges from $0.96$ for the most heterogeneous bucket to $0.998$ for the most algorithmically constrained one. The functional form that recovers OVR is
$$ \text{OVR}_{p}(x) \approx \text{round}\!\left( \mu_p + \sum_{i=1}^{n} w_{i,p} \cdot x_i \right), \quad R^2_p \in [0.96, 0.998] \quad \text{for } p \in \{\text{GK, DEF, MID, WIDE, FWD}\} $$
where $w_{i,p}$ is the OLS coefficient on standardised sub-attribute $i$ within position $p$, and $\mu_p$ is the per-position intercept. The top weights confirm the algebraic interpretation: for goalkeepers, the GK-specific attributes, led by GK Positioning ($1.69$), GK Reflexes ($1.64$), GK Diving ($1.59$), and GK Handling ($1.53$), plus Reactions ($1.09$) carry essentially all OVR variance. For defenders, Standing Tackle ($1.27$), Interceptions ($0.97$), and Reactions ($0.79$) lead. For midfielders, Ball Control ($1.87$), Reactions ($1.65$), and Short Passing ($1.57$) dominate. For wide attackers, Dribbling ($1.18$), Ball Control ($1.05$), and Short Passing ($0.88$) carry the weight. For forwards, Finishing ($1.41$), Positioning ($1.09$), and Heading Accuracy ($0.91$) lead.
The substantive interpretation has three components. First, EA's OVR is, in algebraic structure, exactly the kind of weighted attribute aggregator the academic valuation literature has been treating it as for a decade — no hidden non-linearity, no committee discretion at the OVR layer. Second, the position-specific weight matrix is itself substantive content about how EA's broadcast-tuned committee conceives of positional ability. Third, OVR is informationally subsumed by the sub-attribute vector that produces it: the tree-based ensembles that dominate this literature recover positional structure through the sub-attribute vector regardless of whether OVR is explicitly included. The OVR is not a magical number; it is an algebraic identity of the sub-attribute matrix, and the matrix is where the signal lives.
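The position-stratified recovery can be illustrated with a short sketch. It is not the Appendix A4 code: the data below are synthetic, the three sub-attributes and the "true" weights are invented, and the helper names are hypothetical. It only shows that when a rating is a rounded linear aggregate, a per-bucket OLS on standardised inputs recovers it with $R^2$ close to one.

```python
import numpy as np

def fit_position_ols(X, ovr):
    """OLS of OVR on standardised sub-attributes for one position bucket.

    Returns (intercept, weights, r_squared).
    """
    X = np.asarray(X, dtype=float)
    ovr = np.asarray(ovr, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)         # standardise each column
    design = np.column_stack([np.ones(len(Z)), Z])    # intercept + attributes
    coef, *_ = np.linalg.lstsq(design, ovr, rcond=None)
    fitted = design @ coef
    ss_res = float(np.sum((ovr - fitted) ** 2))
    ss_tot = float(np.sum((ovr - ovr.mean()) ** 2))
    return float(coef[0]), coef[1:], 1.0 - ss_res / ss_tot

def fit_by_position(X, ovr, positions):
    """Run the regression separately inside every position bucket."""
    positions = np.asarray(positions)
    return {
        str(p): fit_position_ols(X[positions == p], ovr[positions == p])
        for p in np.unique(positions)
    }

# Synthetic corpus: 800 players, 3 invented sub-attributes, 2 buckets.
rng = np.random.default_rng(0)
true_weights = {"DEF": np.array([0.5, 0.3, 0.2]), "FWD": np.array([0.2, 0.3, 0.5])}
X = rng.integers(40, 95, size=(800, 3)).astype(float)
positions = np.array(["DEF", "FWD"] * 400)
# The published rating is a rounded linear aggregate of the sub-attributes.
ovr = np.array([np.round(x @ true_weights[p]) for x, p in zip(X, positions)])

fits = fit_by_position(X, ovr, positions)
r2_def = fits["DEF"][2]
r2_fwd = fits["FWD"][2]
```

Because the inputs are standardised, each recovered weight is the raw weight scaled by that attribute's within-bucket standard deviation, which is what makes the coefficients comparable across attributes inside one bucket.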
2.3 — The full sub-attribute schema
EA Sports FC 26 exposes a structured attribute vector of approximately thirty-six fields per outfielder, organised into six headline stat composites and their constituent sub-attributes. The six composites, on the broadcast-graphic scale of $1$–$99$, are Pace, Shooting, Passing, Dribbling, Defending, and Physicality. Each composite is a weighted aggregation of underlying sub-attributes computed by a formula EA discloses in its annual ratings methodology post. Pace decomposes into Acceleration and Sprint Speed. Shooting decomposes into Positioning, Finishing, Shot Power, Long Shots, Volleys, and Penalties. Passing decomposes into Vision, Crossing, Free Kick Accuracy, Short Passing, Long Passing, and Curve. Dribbling decomposes into Agility, Balance, Reactions, Ball Control, Dribbling, and Composure. Defending decomposes into Interceptions, Heading Accuracy, Defensive Awareness, Standing Tackle, and Sliding Tackle. Physicality decomposes into Jumping, Stamina, Strength, and Aggression. The total outfield surface is twenty-nine sub-attributes plus six composites mechanically derived from them.
Goalkeepers carry an entirely separate sub-schema: GK Diving, GK Handling, GK Kicking, GK Positioning, GK Reflexes, plus the shared Reactions attribute. The five GK sub-attributes substitute for the twenty-nine outfield numerics rather than supplementing them, which is the structural reason §3 will record a 5-versus-11 attribute-count gap for goalkeepers against Football Manager 26.
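For readers who want the schema in machine-readable form, the decomposition enumerated above can be written down as a plain mapping. The attribute names are taken from the text; the constant names, the `missing_fields` helper, and the sample record are hypothetical, and the representation is a sketch rather than EA's own data model.

```python
# Composite -> sub-attributes, exactly as enumerated in the text above.
OUTFIELD_SCHEMA = {
    "Pace": ["Acceleration", "Sprint Speed"],
    "Shooting": ["Positioning", "Finishing", "Shot Power", "Long Shots",
                 "Volleys", "Penalties"],
    "Passing": ["Vision", "Crossing", "Free Kick Accuracy", "Short Passing",
                "Long Passing", "Curve"],
    "Dribbling": ["Agility", "Balance", "Reactions", "Ball Control",
                  "Dribbling", "Composure"],
    "Defending": ["Interceptions", "Heading Accuracy", "Defensive Awareness",
                  "Standing Tackle", "Sliding Tackle"],
    "Physicality": ["Jumping", "Stamina", "Strength", "Aggression"],
}

# Goalkeeper-specific attributes; Reactions is shared with outfielders.
GK_SCHEMA = ["GK Diving", "GK Handling", "GK Kicking", "GK Positioning",
             "GK Reflexes"]

def outfield_sub_attributes():
    """Flatten the composite mapping into one ordered list of sub-attributes."""
    return [name for subs in OUTFIELD_SCHEMA.values() for name in subs]

def missing_fields(record, is_goalkeeper):
    """List the sub-attributes a player record lacks for its schema."""
    required = GK_SCHEMA + ["Reactions"] if is_goalkeeper else outfield_sub_attributes()
    return [name for name in required if name not in record]

# Hypothetical, deliberately incomplete goalkeeper record.
keeper = {"GK Diving": 80, "GK Handling": 78, "GK Reflexes": 84, "Reactions": 79}
gaps = missing_fields(keeper, is_goalkeeper=True)
```

Note that "Dribbling" names both a composite and one of its sub-attributes, so a flat table needs distinct column names for the two (the mapping above keeps them apart by using the composite as a key and the sub-attribute as a value).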
Three additional fields sit alongside the numeric matrix. Weak Foot and Skill Moves are categorical star ratings on a $1$-to-$5$ integer scale, exposing the player's off-foot proficiency and dribbling complexity at a granularity the continuous numerics do not capture. International Reputation is a separate $1$-to-$5$ star rating, distinct from both Overall and Potential, that codes a player's marketing-tier visibility in the global football economy — a top-five club marquee with multiple Ballon d'Or nominations carries five stars, a Premier League starting eleven member with consistent international fixtures typically four. International Reputation is the single field in EA's schema that most directly tracks broadcast and commercial salience independently of on-pitch ability, and it earns a separate top-twenty permutation-importance slot in the valuation models reported elsewhere in this paper for exactly that reason.
Crucially, EA Sports FC 26 does not expose a hidden attribute layer. Whatever internal scoring the EA Ratings committee may maintain on personality, leadership, or behavioural variables — and there is no public evidence that it maintains such scoring at all — none of it is published, and none of it enters the OVR aggregation. The thirteen hidden personality mentals Football Manager exposes (Adaptability, Ambition, Loyalty, Consistency, Professionalism, Temperament, Pressure, Important Matches, Sportsmanship, Compliance, Fairness, Versatility, Injury Resistance) have no EA analogue at any layer of the public schema.
2.4 — Known biases and quirks
EA's instrument-grade properties come bundled with structural biases that any valuation model must explicitly correct for. Two are immediately documentable from the existing analyses, and a third is the rater-disagreement signature that the comparison with Football Manager makes visible.
The first is the celebrity premium at the marquee tier. EA's OVR is a compressed 1–99 broadcast statistic whose modal band — approximately OVR 70 to 80 — is calibrated for online-matchmaking balance rather than open-ended ability discrimination. The top of the distribution is mechanically compressed: the top twenty men by OVR span a 3-point band (88–91) on a scale where the population standard deviation is approximately 7 points, making EA's elite ratings statistically inseparable in absolute terms even when the underlying market values span an order of magnitude. The cleanest empirical signature is Phil Foden, whose realised market value of approximately €150 million sits $15.6\times$ above the OVR-based prediction from a marquee-tier model — the largest under-prediction error in the 6,729-man matched corpus. The premium sits outside EA's representational capacity because EA's scale was not designed to support it.
The second is the broadcast-aligned coverage bias that produces, downstream, a systematic market-residual premium for the Premier League and a discount for LaLiga. Against an OVR + Age baseline calibrated on the corpus-wide rating-to-value curve, the Premier League pays $\times 1.46$ of expectation and LaLiga pays $\times 0.76$, a multiplicative gap of nearly $2\times$ on identical attribute profiles. The crucial methodological point, easy to misread and central to the chapter, is that this is a market effect, not an EA error: the Premier League's broadcast and prize-money structure flows through to player valuations in a way LaLiga's does not. However, EA's coverage density and rating attention track the broadcast economy by construction. The EA Ratings committee watches the same fixtures, reads the same press, and attends to the same competitive cycle that produces the Transfermarkt premium in the first place; the two signals are not independent [Ezzeddine, Pradier & Scelles, 2025]. EA's broadcast-tuned coverage geometry is therefore the structural reason a model built on EA attributes alone systematically underprices Premier League and MLS players while overpricing LaLiga, Liga Portugal, and Eredivisie players; the correction is a league fixed effect against the rating-baseline curve, and the magnitude is what §5 quantifies in detail.
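One way to read a league multiplier of this kind is as the geometric-mean residual of a league against a corpus-wide baseline. The sketch below assumes a log-linear baseline in OVR and age, which is an illustrative simplification and not necessarily the functional form used in the paper; the function name, the synthetic leagues "A" and "B", and all coefficients are invented.

```python
import numpy as np

def league_multipliers(ovr, age, value_eur, league):
    """Geometric-mean value multiplier per league against a pooled baseline.

    Baseline (assumed form): OLS of log(value) on [1, OVR, Age] fitted on all
    players. A multiplier of 1.0 means the league pays exactly the baseline.
    """
    ovr = np.asarray(ovr, dtype=float)
    age = np.asarray(age, dtype=float)
    log_value = np.log(np.asarray(value_eur, dtype=float))
    league = np.asarray(league)
    design = np.column_stack([np.ones(len(ovr)), ovr, age])
    coef, *_ = np.linalg.lstsq(design, log_value, rcond=None)
    residual = log_value - design @ coef
    return {
        str(name): float(np.exp(residual[league == name].mean()))
        for name in np.unique(league)
    }

# Synthetic corpus: league A pays a premium, league B a discount.
rng = np.random.default_rng(1)
n = 2000
ovr = rng.uniform(55, 90, n)
age = rng.uniform(18, 35, n)
league = np.array(["A", "B"] * (n // 2))
premium = np.where(league == "A", np.log(1.5), np.log(0.75))
value_eur = np.exp(10 + 0.15 * ovr - 0.05 * age + premium)

multipliers = league_multipliers(ovr, age, value_eur, league)
ratio = multipliers["A"] / multipliers["B"]
```

Because the baseline is fitted on the pooled corpus, the multipliers are relative to the corpus average rather than to any one league, so only their ratio recovers the gap between two leagues on identical attribute profiles.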
The third is the rater-disagreement signature against Football Manager 26, which we record here as a property of the EA artefact rather than a comparative finding. Standardising both OVR and FM's Current Ability within their own distributions and computing the per-league mean difference, the Premier League sits at $+0.22\sigma$ of EA-$z$ above its FM-$z$ counterpart on the matched 13,434-man corpus. LaLiga sits at $+0.26\sigma$, Ligue 1 at $+0.13\sigma$, MLS at $-0.10\sigma$, Argentina's Liga Profesional at $-0.16\sigma$. The single most extreme cell is the Premier League right-winger cell at $+0.67\sigma$ — the largest standardised rater disagreement observed anywhere in the joint corpus. The asymmetry is structurally consistent with the broadcast-coverage hypothesis: EA's committee assigns its highest relative-to-distribution scores to players in the leagues whose broadcast presence is most heavily emphasised, while FM's local-researcher network distributes its scores more evenly across leagues whose coverage depth it has invested in directly.
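The standardise-then-difference computation described above can be sketched in a few lines. The four players, their ratings, and the league labels are invented, and the function names are hypothetical; the sketch assumes z-scores are taken over the whole matched corpus using the population standard deviation.

```python
import numpy as np

def zscore(values):
    """Standardise within the corpus: subtract the mean, divide by the std."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def league_rater_gap(ovr, ca, league):
    """Per-league mean of z(EA OVR) minus z(FM CA), in standard deviations.

    Positive values mean EA rates the league's players higher, relative to
    its own distribution, than FM does relative to its own.
    """
    gap = zscore(ovr) - zscore(ca)
    league = np.asarray(league)
    return {str(name): float(gap[league == name].mean()) for name in np.unique(league)}

# Invented matched players, two per league.
ovr = [80, 70, 60, 50]
ca = [150, 140, 110, 100]
league = ["A", "A", "B", "B"]
gaps = league_rater_gap(ovr, ca, league)
```

Standardising each rating within its own distribution is what makes a 1–99 scale and a 1–200 scale comparable; the raw difference between OVR and CA would mostly reflect the difference in scale rather than any disagreement between the raters.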
2.5 — EA's load-bearing virtue: monthly cadence
The single property that licenses EA Sports FC 26's claim to instrument-grade status — and the property no other public attribute database matches — is its monthly refresh cadence. The EA Ratings team revises the attribute matrix on a Title Update cycle aligned with the FUT Champions competitive schedule, producing approximately one full ratings refresh per month across the September-to-July window, with additional ad-hoc revisions for major transfer events and tournament breakouts. A player whose form has materially changed in the previous four to six weeks will see their OVR move by one to three points on the next Title Update, with the changes visible in the public portal within hours of release.
The contrast with Football Manager 26 is structural and unfavourable to the latter on this single axis. Football Manager 26 shipped in November 2025, two years after FM24; FM25 was cancelled in February 2025. The two-year cadence is the longest in the franchise's modern history, and the FM24 → FM26 transition itself carried a documented structural rebase of the Current Ability scale that invalidates raw cross-edition pooling. Sports Interactive's standard intra-edition cadence is one mid-year winter patch plus annual full release — twelve to twenty-four months of latency between any real-world form change and its reflection in the database. For any valuation problem whose deployment surface is the active transfer market, EA's cadence advantage is decisive: a player whose January-window valuation has been re-priced by recent form will appear in EA's January Title Update with their revised OVR; the same player will not appear in Football Manager's revised CA until the following November. A Marc Guéhi who has just played himself into a January window, a Xavi Simons who has just torn his hamstring in November, a Cole Palmer mid-purple-patch — EA's monthly cadence moves on those events; Football Manager's annual cadence does not.
The cadence advantage is also the structural reason EA's coverage geometry is broadcast-tier weighted: a refresh cycle that fast cannot scale across two hundred national associations, and the studio's coverage decision to license forty-five named leagues with strong broadcast presence is the operational consequence of the cadence commitment. The asymmetry maps directly onto a temporal-versus-spatial trade-off in feature sourcing. EA contributes a high-frequency, market-tracking attribute signal calibrated against broadcast consensus, with monthly latency between form change and database reflection. Football Manager contributes a lower-frequency, higher-stability attribute signal calibrated against match-engine outcomes, with twelve-to-twenty-four-month latency. The two cadences are complementary in a way the two coverage geometries also are, and the joint use of both databases — which the rest of this paper develops — is the analytical operationalisation of that complementarity. The EA Sports FC 26 instrument, viewed in isolation, is best understood as the monthly-cadence half of a two-instrument design, and the rest of this paper is the construction of the second half.
§3 — Examining Football Manager 26: what Current Ability actually means

Where the EA-side section described a database calibrated against broadcast consensus and tuned for matchmaking balance, Football Manager 26 is the opposite instrument on almost every axis a research design cares about. It is two orders of magnitude larger; it is maintained by a distributed part-time researcher network rather than a single in-house committee; its summary statistic is a match-engine input rather than a broadcast-graphic decoration; and, for the first time in the franchise's three-decade history, it ships a women's database alongside the men's at launch. The cost is a slower refresh cadence, an internal scale (1–200 for Current Ability, 1–20 for individual attributes) that has to be re-projected for cross-comparison with anything else in the literature, and a within-gender calibration policy whose consequences for cross-gender modelling are the reason the men's-trained pipeline cannot be transplanted onto women's players without going anti-signal. This section walks through four properties of the instrument: scale and provenance, the construction of CA, the hidden-mental block, and the structural breaks that any pooling exercise must price in.
3.1 — Scale, provenance, and the cadence trade-off
Sports Interactive's researcher database for FM26 ships at approximately 1.3 million player records globally, maintained by approximately 1,300 part-time contributors under 86 full-time research leads who scout in person and revise scores on a monthly internal cadence, with public release on a slower edition-plus-winter-patch schedule [Sports Interactive / SportsPro Media, 2024]. The contributor network spans 116 countries, which is the operational reason the database reaches into the long tail of professional football that EA's smaller in-house committee plus its per-league freelance panels do not see. FM26 is also the first edition to ship women's football integrated at launch: 36,000+ women across 14 leagues and 11 nations, built over four years by an independent women's research team led by Tina Keech [Football Manager Blog, 2025; Women in Games, 2025]. The contrast with EA Sports FC 26 — 16,122 men plus 1,447 women — is roughly 80× on the men's side and 25× on the women's, the difference concentrated in lower divisions and youth tiers that EA's playability floor excludes by design.
The league composition of the matched corpus is the most economical proof that FM is not a Big-Five-focused dataset wearing a long-tail disguise. The top three leagues in the matched corpus (Argentina's Liga Profesional, MLS, and the EFL Championship) are precisely the leagues a Premier-League-anchored model treats as second-class, and they are where SI's regional researcher tradition has the deepest historical investment. None is a league where EA's broadcast-tuned cadence runs at marketing density.
See Figure 1.1 above — the same FM editions timeline, reproduced here as the structural reference for §3.4's discussion of the FM24 → FM26 rebase.
The release cadence is the cost. The traditional one-edition-per-November rhythm broke in 2024–2025 when SI cancelled FM25 and used the two-year gap to complete the Unity migration. The within-edition cadence is also slower than EA's monthly refresh: one full edition per release year plus a single mid-season winter patch. EA is the right primary instrument when the question is "price this player by the end of the transfer window"; FM is the right primary instrument when the question is "build a panel across a decade of editions" or "see players the broadcast lens does not see at all".
3.2 — Reverse-engineering Current Ability
Current Ability is, on the surface, a one-number summary running on a 1–200 internal scalar that the EFEM.club public viewer re-projects to 1–99 for cross-comparability with EA OVR. The 1–99 surface exists for the comparison; the within-engine number is 1–200. CA is not a primitive observation but the output of a deterministic function over the underlying attribute vector — a positional weighted aggregation over the 36 visible attributes plus a small set of meta-attributes — with weights set per-role inside the match engine, and the role table itself rewritten in FM26. The same attribute matrix, against a different role table, will produce a different CA.
The attribute layer beneath CA is exposed on a 1–20 integer scale that Sports Interactive describes as within-database calibrated. Coverage by Fuller FM and The Cutback summarising the SI methodology reads, in their framing, that the scale is calibrated relative to the side of the database the player sits in, and that "a female player with 20 for Pace would be at the peak of speed in the women's game, just as a man is in the men's game" [Sports Interactive, 2025; Fuller FM, 2025; The Cutback, 2025]. The phrasing is load-bearing. A 20-Pace woman is the fastest woman in football, not the fastest player in football; the scale is calibrated against the population it grades rather than against a pooled human-footballer maximum (Appendix §A3).
The decomposition of CA can be written compactly as
$$ \mathrm{CA}_p(x) \;\approx\; \mu_p \;+\; \sum_{i=1}^{N} w_{i,p}\, x_i \;+\; \rho \cdot r(x), $$
where $x_i$ is the within-gender 1–20 attribute, $w_{i,p}$ is the role-specific weight for the player's best-role match $p$, $\mu_p$ is the per-role intercept, and $r(x)$ captures meta-attributes (Versatility, Injury Resistance) that SI exposes only indirectly. The form mirrors the OVR reverse-engineering result for EA reported in Appendix §A4, with the critical difference that the role table $\{w_{i,p}\}$ in FM26 has been collapsed from roughly sixty named roles in FM24 into a dual In-Possession / Out-of-Possession structure, with Mezzala, Enganche, Trequartista, Segundo Volante, and Carrilero removed as named outputs [FM Scout, 2025]. Because CA is mechanically a weighted sum against best-role weights, reshaping the role table reshapes the CA function even when attribute values are held constant — the architectural mechanism behind the FM24 → FM26 rebase below.
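The decomposition can be made concrete with a minimal sketch. Sports Interactive does not publish the role table, so the weights, intercepts, attribute values, and the function name `current_ability` below are all hypothetical; the only point illustrated is that swapping the role table changes CA while the attribute vector stays fixed.

```python
import numpy as np

def current_ability(x, weights, intercept, meta=0.0, rho=0.0):
    """CA ~ mu_p + sum_i w_{i,p} * x_i + rho * r(x) for a single role p."""
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if x.shape != weights.shape:
        raise ValueError("attribute and weight vectors must align")
    return intercept + float(weights @ x) + rho * meta

# Three illustrative attributes on the 1-20 scale (hypothetical player).
attributes = [15, 12, 18]

# Two hypothetical role tables for the "same" role: only the weights differ.
old_role = {"weights": [4.0, 1.0, 3.0], "intercept": 10.0}
new_role = {"weights": [2.0, 3.0, 3.0], "intercept": 10.0}

ca_old = current_ability(attributes, **old_role)  # 10 + 60 + 12 + 54
ca_new = current_ability(attributes, **new_role)  # 10 + 30 + 36 + 54
```

Under these made-up weights the same player scores 136 against the first table and 130 against the second, which is the mechanism in miniature: no attribute moved, but the summary did.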
3.3 — The hidden-mental block: FM's exclusive territory
The single most distinctive feature of FM26 as a research instrument is the hidden-mental block. Thirteen pure-hidden personality attributes — Adaptability, Ambition, Loyalty, Consistency, Temperament, Pressure, Important Matches, Versatility, Professionalism, Sportsmanship, Compliance, Fairness, Injury Resistance — exist only on the FM side and have no analogue in EA's schema. Nine additional mental attributes — Concentration, Decisions, Vision, Bravery, Determination, Off The Ball, Anticipation, Composure, Leadership — are visible in FM26 but were hidden in earlier eras and remain absent from EA. The block is not aesthetic flourish; the match engine needs it because producing plausible 90-minute outcomes requires inputs the broadcast lens cannot supply.
The empirical question is whether the hidden block carries valuation signal beyond the visible attributes. The thirteen hidden mentals explain 19.5 % of log-value variance on their own, with no CA, no age, no reputation, no technical attributes in the feature vector — non-trivial against the right benchmark. The honest qualifier is that the block earns its place as a bundle: drop-one analyses show no single hidden mental's contribution is statistically distinguishable from zero at this n, while the bundle delivers a marginal +0.003 R² lift on top of EA + FM26-visible. Three mentals — Adaptability, Injury Resistance, Ambition — earn top-20 permutation-importance slots in the union model, but those three carry the bundle's load disproportionately rather than being independently identifiable.
A cleaner descriptive use of the block is the personality archetype clustering: k-means on the standardised personality vector recovers four interpretable clusters — Model Citizen, Big-Match Driver, Loyal Veteran, Resolute — with face validity against external value and reputation. Once OVR and Age are partialed out via a held-out HGBR baseline, the raw 5× spread between Model Citizen (median €2.5M) and Resolute (median €500K) collapses to a much more honest result.
Personality archetype carries some valuation signal but most of the raw spread was ability and age. Only the Resolute archetype carries a clean 14 % discount on top of what OVR and Age already imply, CI cleanly below zero; Model Citizen carries a modest 4 % premium with CI just above zero; Loyal Veteran and Big-Match Driver sit inside CV noise. The effects are operationally small, but the existence of any residual premium after CA-control is itself net-new information FM's schema makes available and EA's cannot.
See Figure 10.1 and 10.2 in §10 for the PCA projection of the four archetype clusters and the centroid radar on the eleven personality mentals; the deeper treatment of the archetype atlas lives there, and the visuals are not duplicated here.
The deployable artefact is the archetype atlas, not the valuation residual: a four-letter label assigned automatically to any player in the FM26 men's database, computed from the hidden vector. It supports contract-decision and transfer-pursuit prioritisation in a way an EA-only model structurally cannot, because the inputs (Loyalty, Ambition, Compliance, Important Matches) do not exist in EA's schema. The archetype tool answers a different question — how a player will behave under the contract structures a club can offer — than the valuation pipeline does, and that is the canonical FM-exclusive output.
3.4 — Structural breaks: the FM24 → FM26 rebase and a Versatility null
Any multi-edition FM analysis has to confront the FM24 → FM26 rebase head-on. On the 1,635 men present in both editions, 97.7 % of 44 tested attributes crossed the 0.5-SD per-player-delta threshold, all in the same direction, with absolute mean-deltas between 10 and 17 points on the 0–100 surface scale. The signature is categorically different from every other transition: FM20 → FM21, FM21 → FM22, and FM23 → FM24 show 0 % of attributes crossing the threshold; FM22 → FM23 shows 51 % shifted in mixed directions, consistent with documented goalkeeper-weighting tweaks but not a wholesale rebase. The individual-player evidence completes the case: Mbappé's CA reads 188 in FM24 and 98 in FM26, a drop of 90 surface-points on the FM24 1–200 → FM26 1–99 transition. Messi: 185 → 90. Salah: 180 → 93. Vinícius: 181 → 91. Bellingham: 168 → 91. Either every elite player simultaneously lost half their footballing ability in the two years between editions, or the underlying CA scale was structurally rebased.
The mechanism is the joint product of three structural changes shipped simultaneously into FM26: the Unity engine rewrite [VGC, 2025]; the role-system overhaul described above, which restructured the CA function with no change to the underlying attributes; and the launch of the women's database under SI's within-gender calibration policy [Sports Interactive, 2025; The Cutback, 2025]. We cannot disentangle which of the three drove the rescaling. Appendix §A5 documents the pipeline rule of thumb: build features from raw 1–20 attributes when joining across editions; treat FM26 women's data as a separate cohort and normalise within database; if CA must be used, z-score within (edition × gender database) before pooling.
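The last rule of thumb, z-scoring CA within each (edition × gender database) cell before pooling, can be sketched in a few lines of pandas. The column names `edition`, `gender`, and `ca` and the helper `zscore_ca` are assumptions for illustration; the 188 and 98 values echo the Mbappé example above, and the remaining rows are invented.

```python
import pandas as pd

def zscore_ca(df, ca_col="ca", group_cols=("edition", "gender")):
    """Standardise CA inside each (edition, gender) cell before pooling."""
    grouped = df.groupby(list(group_cols))[ca_col]
    out = df.copy()
    # Sample standard deviation (ddof=1); single-row cells would yield NaN.
    out["ca_z"] = (df[ca_col] - grouped.transform("mean")) / grouped.transform("std")
    return out

panel = pd.DataFrame({
    "edition": ["FM24", "FM24", "FM24", "FM26", "FM26", "FM26"],
    "gender": ["M"] * 6,
    "ca": [188, 150, 112, 98, 78, 58],  # FM24 on 1-200, FM26 on 1-99
})
pooled = zscore_ca(panel)
```

In this toy panel the top player is one standard deviation above the mean in both editions, so the rebase disappears once CA is expressed relative to its own cell.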
A narrower negative result belongs here because it falsifies an intuition the FM-exclusive territory invites. Versatility, the hidden mental that operationalises position-familiarity, carries a sizeable raw effect on market value — the 85+ bucket commands €2.0M median against €1.0M for the bottom. After OVR and Age are partialed out, Spearman = −0.004 on n = 6,729 — a null indistinguishable from zero. The entire raw 2× spread is composition: more versatile players have higher OVR and are slightly older, both of which independently raise market value. Footballing intuition mis-predicted the empirical importance — exactly the kind of adjudication a researcher-graded second source is supposed to make.
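A composition effect of this kind is easy to reproduce on synthetic data. The sketch below uses one common definition of a partial Spearman correlation (rank-transform everything, regress the ranks on the ranked controls, correlate the residuals); whether the paper's pipeline uses exactly this estimator is an assumption, and the data-generating process is invented so that Versatility matters only through OVR.

```python
import numpy as np
from scipy.stats import rankdata

def partial_spearman(x, y, controls):
    """Correlation of rank(x) and rank(y) after linearly removing ranked controls."""
    rx, ry = rankdata(x), rankdata(y)
    design = np.column_stack([np.ones(len(rx))] + [rankdata(c) for c in controls])
    res_x = rx - design @ np.linalg.lstsq(design, rx, rcond=None)[0]
    res_y = ry - design @ np.linalg.lstsq(design, ry, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

rng = np.random.default_rng(1)
n = 5000
ovr = rng.normal(70, 6, n)
age = rng.uniform(17, 36, n)
versatility = 0.5 * ovr + rng.normal(0, 5, n)                # tied to OVR only
log_value = 0.08 * ovr + 0.01 * age + rng.normal(0, 0.3, n)  # no direct versatility term

raw = partial_spearman(versatility, log_value, controls=[])       # plain Spearman
adjusted = partial_spearman(versatility, log_value, controls=[ovr, age])
```

The raw correlation is clearly positive while the adjusted one collapses towards zero, which is the same shape as the Versatility result: a sizeable raw spread that is entirely inherited from the controls.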
3.5 — The women's database: 33 years of absence, then 36,000 at once
The non-incremental data event in FM's recent history is the women's database. For more than three decades, from Championship Manager 93 through Football Manager 24, Sports Interactive shipped a management simulation that did not include women's football at all. Six FM editions from FM2016 through FM24 carry zero women in our corpus. FM26 launched on 4 November 2025 with 36,000+ women across 14 leagues and 11 nations [Sports Interactive, 2025; Women in Games, 2025].
The contrast with EA is worth stating precisely. EA introduced women's players in FIFA 23 (2022) and by FC26 carries 1,447 women. SI lagged EA by roughly two release cycles. When SI did ship, they shipped at approximately 25× the scale: 36,000+ against 1,447. The two products had different operating definitions: EA shipped the marquee end of the pyramid as gameplay assets; SI shipped a simulation-grade scouting layer including youth and reserve teams across 14 league pyramids. The women's database also inherits the within-gender calibration policy in full: a 20-Pace woman is the fastest woman in football, not the fastest player in football. This is the methodological warning that the cross-gender modelling work later in the paper turns out to need in full.
Pooling across the gender boundary requires within-database percentile features or explicit re-calibration; any modelling exercise that ignores that fact is choosing to mis-read the women's vector as a downward-shifted men's vector when the actual relationship is a within-database reseat.
3.6 — Section closer
FM26's load-bearing virtue is the joint product of depth and breadth no other public footballing-attribute database currently offers. Depth in that 1,300 researchers in 116 countries produce 100.0 % within-database completeness on the visible block and an exposed personality vector earning 19.5 % of log-value variance on its own. Breadth in that the top three leagues of the FM26 ∩ EAFC26 matched corpus are Argentina's LPF, MLS, and the EFL Championship — long-tail leagues EA covers thinly and inside which the recruitment-edge market overwhelmingly sits. The cost is a slower refresh cadence, a within-gender calibration policy that constrains cross-gender pooling, and a structural FM24 → FM26 rebase any multi-edition analysis has to acknowledge. FM26 is a different instrument from EAFC26 on almost every axis that matters for a research design, which is exactly why the union of the two databases later in this paper recovers signal neither alone produces.
§4 — Two schemas, one player: side-by-side comparison after the individual examinations

Sections 2 and 3 examined each instrument on its own terms. The remainder of the paper makes joint use of both, and the bridge between the per-instrument examinations and the joint modelling exercise that follows is a direct side-by-side comparison: when the same 13,434 men are described by both EA's broadcast-tuned 36-attribute surface and FM26's researcher-curated 36-visible-plus-13-hidden vector, how much do the two databases actually agree, on which attributes, and in what shape? The answer is consequential because it sets the upper bound on how much information either database can add to the other. Two databases that agree exactly carry no marginal information; two databases that disagree everywhere carry no usable joint signal. The empirical answer turns out to sit in the productive middle.
EA and Football Manager were built for different purposes. EA's OVR is a compressed broadcast statistic, designed to read clearly in a television graphic and to keep online matches competitive; FM's attribute set is a simulation input, designed to feed a match engine that must produce plausible 90-minute outcomes from the numbers it is given. Both are credible, both are widely used by professional clubs [Jacobson, 2023], and neither is a disguised version of the other. The first analytical move of Wave 2 places the two schemas alongside each other on the same 13,434 men with both attribute vectors — supplemented by 385 women — and measures where they agree, where they disagree, and what each carries that the other does not. The aggregate result is Spearman ρ = 0.834 on the one-number summaries, ρ = 0.529 on the attribute pairs that share names, and a ±2σ disagreement on the leaderboard tail.
Coverage
The two systems agree more than they disagree on who exists, but they differ sharply on reach and refresh cadence. EA's playable universe contains 17,569 footballers across 45 named leagues, refreshed monthly through the season, whereas FM's researcher database holds roughly 1.3 million players globally, refreshed annually with a single mid-season winter patch. The matched cohort — the only sub-population on which a head-to-head comparison is possible — comprises 13,819 individuals (13,434 men and 385 women) with both attribute vectors present in the pipeline; the remaining ~1.28 million FM players sit below EA's coverage floor.
The stacked bar makes the schema-surface question concrete. EA exposes 27 outfield numeric attributes; FM26 exposes 45 — 32 visible plus the 13 hidden mentals that EA collects nothing comparable to. Both schemas agree on 16 paired constructs (Finishing, Crossing, Heading, Vision, Dribbling, and so on). EA carries 11 numerics FM doesn't decompose (Sliding Tackle as separate from Standing Tackle, Curve, Volleys, the PlayStyles flag-set). FM carries 16 visible attributes EA doesn't measure (First Touch, Marking-distinct-from-Tackling, Off The Ball, Anticipation, Concentration, Decisions, Determination, Bravery, Flair, Leadership, Teamwork, Workrate, Natural Fitness, Corners, Long Throw, Positioning-outfield). Then there are the 13 hidden mentals — Adaptability, Ambition, Compliance, Consistency, Fairness, Important Matches, Loyalty, Pressure, Professionalism, Sportsmanship, Temperament, Versatility, Injury Resistance — which exist only on the FM side. Goalkeepers tell the same story in miniature: EA gives them five attributes, FM gives them eleven.
Agreement on shared attributes
The natural next question is whether the same-named pairs agree at the player level — that is, whether EA Finishing constitutes the same construct as FM26 finishing, merely rescaled. The hypothesis can be tested on the 11,943 outfield men for whom both vectors exist.
The black-outlined diagonal is the load-bearing structure: the mean Spearman ρ on the 19 named-pair cells is 0.529 — moderately positive, but not unity. Three patterns deserve attention.
The defensive block agrees most tightly: Standing Tackle ↔ tackling at ρ = 0.728, Def Awareness ↔ marking at 0.714, and Sliding Tackle ↔ tackling at 0.731. Defensive ability is structurally observable from broadcast — a tackle either succeeds or it does not — so EA's broadcast-derived schema and FM's researcher-graded schema converge on the same number, and this is the only block in which the named-pair correlations cross 0.70.
The finishing/physical block agrees moderately: Finishing ↔ finishing at ρ = 0.683, Dribbling ↔ dribbling at 0.646, Strength ↔ strength at 0.665, and Heading Accuracy ↔ heading at 0.649. These are recognisably the same constructs, carrying enough independent measurement error that the rank ordering shuffles meaningfully between them.
The playmaking and personality-adjacent block agrees only loosely: Vision ↔ vision at ρ = 0.399, Crossing ↔ crossing at 0.424, Aggression ↔ aggression at 0.410, and Composure ↔ composure at 0.505. Balance ↔ balance is essentially uncorrelated at ρ = −0.04, indicating that the two schemas apply the term to different constructs. The divergence concentrates wherever the rating depends on a scout having watched training-ground footage rather than match clips, which is where FM's researcher network is collecting signal that EA's committee structurally cannot.
The off-diagonal cells provide the schema-seam evidence. Standing Tackle correlates at +0.696 with FM26 positioning and Def Awareness at +0.694, indicating that a portion of what EA encodes in its tackle attributes FM splits into a separate positioning-without-ball construct; the two schemas slice the same player's defensive ability into different primitive shapes. Reactions correlates at +0.557 with FM26 anticipation and only +0.493 with FM26 firstTouch, locating EA's Reactions closer to FM's anticipation than to anything reactive in the technical block. None of this constitutes a defect; it is the consequence of schema-design choices made independently by two researcher organisations finally being measured against each other.
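The named-pair diagonal and the off-diagonal seams both come out of one cross-schema Spearman block. A minimal sketch on synthetic data follows; the column names (`ea_finishing`, `fm_finishing`, and so on) and the noise levels are illustrative assumptions, not the paper's schema.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 2000
true_finishing = rng.normal(size=n)
true_vision = rng.normal(size=n)

# Each schema observes the same latent construct with its own independent noise.
df = pd.DataFrame({
    "ea_finishing": true_finishing + rng.normal(0, 0.7, n),
    "fm_finishing": true_finishing + rng.normal(0, 0.7, n),
    "ea_vision": true_vision + rng.normal(0, 1.5, n),
    "fm_vision": true_vision + rng.normal(0, 1.5, n),
})

ea_cols = ["ea_finishing", "ea_vision"]
fm_cols = ["fm_finishing", "fm_vision"]

# Rows are EA attributes, columns are FM attributes.
block = df.corr(method="spearman").loc[ea_cols, fm_cols]

# The diagonal holds the same-named pairs; everything else is a schema seam.
diagonal = pd.Series(np.diag(block.to_numpy()), index=ea_cols)
mean_named_pair_rho = float(diagonal.mean())
```

With noisier measurement on the vision pair, its diagonal cell falls well below the finishing cell, mirroring the pattern in which the harder-to-observe constructs agree least.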
Agreement on the one-number summary
Below the per-attribute level sits the question any analyst asks first: does EA's OVR agree with FM26's Current Ability? Both are one-number summaries and both are intended as the headline rating a downstream user encounters.
Spearman ρ on the full cohort is 0.834 (Pearson 0.790, n=13,434), indicating strong agreement on rank order alongside clear non-identity at the individual level. The scatter is wide enough that a single player's OVR predicts the corresponding CA only to within ±5 surface-scale points at one standard deviation in either direction. The elite end is where the divergence becomes visually obvious: in the top-right, EA compresses the top 20 men into a 3-point band (OVR 88–91), while FM distributes the same population across an 8-point band (CA 90–98). Mbappé sits at OVR 91 / CA 98 — at the top of both schemas, but FM's ceiling is reached while EA's is not.
The compression metric is precise: the top 20 men by EA OVR span 3 points on EA's 1–99 surface (3.0%) while the top 20 by FM CA span 8 points (8.1%), so FM uses 2.7× the elite-tier resolution EA uses on the same population. The orange ≥€50M points cluster overwhelmingly in the top-right and the grey no-TM-value points cluster in the bottom-left, so neither schema produces wildly mis-ranked elite players against the market. Where the schemas disagree is in the 70–80 OVR / 65–75 CA mid-band, which is precisely the band in which the recruitment market actually clears.
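The compression arithmetic can be checked directly. In the sketch below the two top-20 lists are hypothetical and reproduce only the band edges quoted above (OVR 88–91, CA 90–98); the helper name `elite_span` and the choice of 99 as the denominator are assumptions that match the quoted 3.0 % and 8.1 %.

```python
def elite_span(ratings, top_n=20, scale_points=99):
    """Return (span in points, span as a share of the scale) for the top_n ratings."""
    top = sorted(ratings, reverse=True)[:top_n]
    span = top[0] - top[-1]
    return span, span / scale_points

# Hypothetical top-20 lists; only the minimum and maximum follow the text.
ea_top20 = [91] * 3 + [90] * 5 + [89] * 6 + [88] * 6
fm_top20 = [98, 97, 96, 95] + [94] * 3 + [93] * 3 + [92] * 4 + [91] * 3 + [90] * 3

ea_span, ea_share = elite_span(ea_top20)
fm_span, fm_share = elite_span(fm_top20)
resolution_ratio = fm_span / ea_span  # 8 / 3
```

The ratio of 8 to 3 rounds to the 2.7× figure; the metric depends only on the extremes of each top-20 list, so it is sensitive to a single outlier at either edge.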
Shape
Strong rank correlation can hide structural shape differences. The two top-line ratings are on conceptually similar 1–99 surface scales in this corpus (after FM's 1–200 internal CA is run through the EFEM viewer's surface transform), but their distributions are categorically different in skew and kurtosis.
EA's matchmaking-balance objective biases the rating committee, by design, toward producing a roughly symmetric bell curve so that the 70–80 "playable" band remains the modal experience. FM's shape is sharply non-normal, exhibiting a heavy left tail and a tighter central mode; the p99 z-scores tell the elite-tier story directly, with EA's p99 at +2.40σ and FM's at +1.94σ, so EA's elite tier sits farther from the central mass on its own scale than FM's does.
The underlying mechanism is the cohort each database elects to include. FM26's researcher network grades every player in a club's senior squad, including academy graduates and depth pieces who would never enter EA's playable database; those players appear as a long, dense left tail of low-CA players that EA simply does not record. EA's coverage floor sits at approximately the bottom of the 60s on OVR, while FM's effective floor in this matched corpus is the low 20s on CA. The skew difference is therefore not a calibration choice but a coverage choice: EA elected to draw the line at "playable in a video game" while FM elected to draw the line at "exists in professional football."
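The shape statistics used in this subsection are simple to compute. The sketch below shows one way to obtain skew, excess kurtosis, and the p99 z-score; the two synthetic distributions are stand-ins (a symmetric bell and a left-tailed shape), not the actual OVR and CA columns, and the function name `shape_summary` is illustrative.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def shape_summary(ratings):
    """Skew, excess kurtosis, and the 99th percentile expressed in standard deviations."""
    r = np.asarray(ratings, dtype=float)
    p99 = np.percentile(r, 99)
    return {
        "skew": float(skew(r)),
        "excess_kurtosis": float(kurtosis(r)),
        "p99_z": float((p99 - r.mean()) / r.std()),
    }

rng = np.random.default_rng(3)
symmetric = rng.normal(70, 7, 20000)                        # bell-shaped stand-in
left_tailed = 80 - rng.gamma(shape=2.0, scale=8.0, size=20000)  # long lower tail

summary_symmetric = shape_summary(symmetric)
summary_left_tailed = shape_summary(left_tailed)
```

A long lower tail pulls the mean down and inflates the standard deviation, so the same top percentile sits fewer standard deviations above the mean; that is the direction of the EA-versus-FM gap in p99 z-scores.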
Per-position agreement
A reasonable prior, given the goalkeeper-schema divergence (5 EA attributes vs. 11 FM), is that goalkeepers should exhibit looser EA-FM agreement than outfield players. The data do not support that prior.
Position-level ρ ranges from 0.796 (RW, n=282) to 0.875 (LW, n=272), with goalkeepers at ρ = 0.852 (n=1,491) — sixth of twelve. The total spread is 0.08, smaller than the within-position scatter, and goalkeepers, despite the 11-vs-5 attribute count disparity, still produce a one-number summary that rank-orders the same way in both schemas. The fine-grained GK schema FM exposes is useful for describing the goalkeeper — kicking, one-on-ones, and command of area are separately measured — but it does not aggregate to a CA that disagrees with OVR. The tightest-agreement positions are LW (0.875), CAM (0.870), and CDM (0.854), all positions where the underlying ability is heavily defined by ball-progression and decision-making, both of which FM decomposes into multiple constructs that aggregate to a CA tightly correlated with EA's all-in-one OVR. The actionable reading is that EA OVR and FM CA can be used interchangeably as a rank order but not as a cardinal score.
Sidebar — What EA does that FM can't
The piece is structurally generous to Football Manager because its discoveries are larger. Honest counter-weight: EA updates monthly through the season; FM updates twice a year — an annual full release and a single mid-winter patch. For any deployment that prices a player whose form has changed in the last six weeks, EA is strictly fresher. A Marc Guéhi who has just played himself into a January transfer window, a Xavi Simons who has just torn his hamstring in November, a Cole Palmer mid-purple-patch — EA's monthly ratings refresh moves on those events; FM does not, until next November. EA's roughly 50-person internal rating committee plus external freelance per-league panels are smaller than FM's 1,300 researchers in headcount, but they are tuned for cadence rather than depth, and they operate against a marketing calendar that demands the product reflect the most recent matchday. That structural choice is also the reason EA covers exactly 45 named leagues at marketing density — a refresh cadence that fast does not scale across 200 nations.
The practical consequence is a temporal-versus-spatial tradeoff. If your valuation problem is "price this player by the end of the transfer window," EA is the right primary instrument. If your problem is "build a historical career-arc panel across a decade of editions," FM's lower-frequency, higher-stability cadence is the more useful artefact. §9 will return to this when the bake-off shows that FM's lift over EA lives in the mid-tier where the market is thin — exactly the band where monthly cadence buys you the least.
Sidebar — Are the two databases informationally independent?
Every comparison in this paper rests on a quiet assumption: that EA and Football Manager are informationally independent sources of the same underlying truth, which is why their disagreements are signal rather than redundancy. Ezzeddine, Pradier & Scelles 2025, in the Journal of Sports Analytics, make this argument in print. Video-gaming ratings carry transfer-pricing signal, they show, because the two major systems are calibrated against fundamentally different inputs [Ezzeddine, Pradier & Scelles, 2025]. EA's OVR is tuned, decade after decade, against the same broadcast and media consensus that shapes Transfermarkt's crowd curation: TV games, transfer-window discourse, marketing visibility, sticker-album reputation. EA's rating committee watches the same matches and reads the same press as the Transfermarkt forum. FM's researcher network does not. SI's researchers optimise CA against match-engine outcomes — a researcher who has covered Mainz 05 for eight years is grading the same player on training-ground footage, season-long tactical positioning, and personality reports from local journalists, against a target that has nothing to do with whether the player is on a Champions League broadcast that week.
Two priors, two independent measurements, one underlying ability. That is what makes the ρ = 0.834 number a measurement: if the two databases were calibrated against the same inputs, their agreement would be tautological. It isn't. The unshared remainder (1 − ρ = 0.166 in correlation units, or 1 − ρ² ≈ 0.30 as a share of rank-order variance) is what each database can see that the other can't. The §2 leaderboard of EA-overrates and FM-overrates, the 13 hidden mentals on the FM side, the monthly-versus-annual cadence split: every disagreement this paper monetises in §6–§7 depends on the independence assumption holding. Ezzeddine et al. is what gives us the theoretical right to assume it does, and the empirical correlations in this section are what give us the right to expect the disagreements to behave as signal rather than as noise.
The two schemas, then, are neither better nor worse than each other. They were built for different purposes, calibrated against different inputs, cover overlapping but non-identical populations, and disagree at the player level by precisely the amount one would expect of two independent measurements of the same underlying construct. §3 takes up the next question — what do attribute models in this literature actually fail to capture? — and shows that the ceiling has structure, that one of the two databases has just undergone a structural rebase between editions, and that the most informative single chart in the paper documents what happened to Kylian Mbappé's Current Ability between FM24 and FM26.
Chapter 2 — Building the model, and locating its ceiling
The second chapter takes the two databases as given and builds from them. It fits the union model that lifts held-out R² from 0.663 to 0.785 on the matched men's corpus, validates the predictions against two independent ground truths in Transfermarkt and the CIES Football Observatory, names the residual ceiling the model cannot remove, and tests whether the men's-trained pipeline transfers to the women's database that Sports Interactive shipped in November 2025.
§5 — Building the union model: when both lenses together beat either one alone
If a recruitment-analytics director must select a single attribute database to anchor a valuation model, the answer is EA. On its full 7,835-man frame, the EA-only HistGradientBoostingRegressor attains R² 0.751 against log10 Transfermarkt value [bake_off_cv.csv], while the FM-only model on its 6,729-man matched frame attains 0.678 — seven percentage points behind. The compressed broadcast summary statistic encoded in EA's OVR performs predictive work that FM's Current Ability, calibrated for match-engine simulation rather than market consensus, cannot replicate; the strong-form question of "which simulation" therefore admits a one-line answer.
The actual decision, however, is not strong-form. On the matched 6,729-man corpus on which every configuration scores the same rows, the EA + FM26 union model lifts R² from 0.663 (EA-only) to 0.785 (Δ +0.122) and reduces median absolute percentage error from 51.6 % to 39.9 % (Δ −11.7 pp), winning every single cross-validation fold without overlap [section_c_bakeoff_folds.csv]. The lift concentrates where the recruitment market actually clears: the mid-tier leagues and the OVR-75-and-above bands. Three FM hidden mentals (Adaptability, Injury Resistance, and Ambition) earn top-20 permutation-importance slots despite having no EA analogue at all. The decision §4 supports is therefore "EA + FM, conditionally", and the conditions are nameable.
5.0 — Train-test split, learner specification, and the full results table
The most important methodological claim of this paper is that every reported model lift is a held-out result rather than a training-set artefact. This subsection states the protocol explicitly and reports the full results table for every feature configuration on every metric, so that the headline finding (the union model attains R² = 0.785 on held-out folds against a baseline of 0.663 for EA-only) can be checked at a glance against the underlying numbers.
The split is five-fold with shuffle = True, random_state = 42. Held-out predictions come from cross_val_predict; per-fold R² comes from cross_val_score. The learner is HistGradientBoostingRegressor with max_iter=400, max_depth=8, learning_rate=0.05, random_state=42.
The training protocol is identical across every model below. Only the feature matrix changes between configurations, so the differences in cross-validated R², RMSE, MAE, median APE, Spearman, and marquee-tier APE can be attributed cleanly to the information content of the feature block.
Consolidated five-fold cross-validation results — every model, every metric
| Model | n | Features | R² mean | Fold envelope | RMSE (log) | MAE (log) | Median APE | Mean APE | Spearman (€) | Marquee APE |
|---|---|---|---|---|---|---|---|---|---|---|
| EA-only | 6,729 | 38 | 0.663 | 0.643 – 0.680 | 0.394 | 0.304 | 51.7 % | 103.3 % | 0.794 | 38.7 % |
| FM26-only (all blocks) | 6,729 | 52 | 0.680 | 0.645 – 0.708 | 0.385 | 0.283 | 45.4 % | 105.9 % | 0.804 | 33.7 % |
| FM26 visible + meta | 6,729 | 39 | 0.677 | 0.640 – 0.701 | 0.388 | 0.285 | 45.4 % | 108.2 % | 0.799 | 33.5 % |
| FM26 hidden-only | 6,729 | 13 | 0.195 | 0.158 – 0.218 | 0.612 | 0.491 | 77.9 % | 222.1 % | 0.423 | 91.8 % |
| EA + FM26 union | 6,729 | 90 | 0.785 | 0.774 – 0.796 | 0.316 | 0.238 | 39.9 % | 78.4 % | 0.870 | 29.5 % |
The union model dominates on every metric without exception. Its R² ceiling of 0.785 sits 0.122 above the EA-only baseline of 0.663 on the matched corpus, and the fold envelope of 0.774 – 0.796 does not overlap the EA-only fold envelope of 0.643 – 0.680 at any point. Median absolute percentage error falls from 51.7 % to 39.9 % across the corpus and from 38.7 % to 29.5 % on the top-20 marquee tier; mean APE collapses from 103.3 % to 78.4 % across the full population. The RMSE in log-units drops from 0.394 to 0.316, a 20 % reduction in geometric-mean error; the MAE in log-units drops from 0.304 to 0.238, a 22 % reduction. The Spearman rank correlation against held-out Transfermarkt value in euros rises from 0.794 to 0.870: the joint embedding orders players against the market more accurately than either schema in isolation, and the rank-order improvement of 0.076 is the cleanest evidence that the union does work that is structurally different from the work each schema performs alone.
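For readers who want to reproduce the columns of the table, a minimal sketch of the metrics follows. It assumes that the target is log10 euros, that APE is computed in euros after back-transforming, and that the Spearman column is taken on euro values; those definitions are inferred from the column labels rather than confirmed by the source, and the function name `valuation_metrics` is illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def valuation_metrics(log_true, log_pred):
    """Log-space errors plus euro-space percentage errors and rank agreement."""
    log_true = np.asarray(log_true, dtype=float)
    log_pred = np.asarray(log_pred, dtype=float)
    err = log_pred - log_true
    eur_true, eur_pred = 10 ** log_true, 10 ** log_pred
    ape = np.abs(eur_pred - eur_true) / eur_true
    return {
        "rmse_log": float(np.sqrt(np.mean(err ** 2))),
        "mae_log": float(np.mean(np.abs(err))),
        "median_ape": float(np.median(ape)),
        "mean_ape": float(np.mean(ape)),
        "spearman_eur": float(spearmanr(eur_true, eur_pred)[0]),
    }

# Toy check: one exact prediction, one doubled, one halved (log10(2) = 0.30103).
metrics = valuation_metrics(
    log_true=[6.0, 7.0, 8.0],
    log_pred=[6.0, 7.30103, 7.69897],
)
```

The toy case shows why mean APE runs above median APE in the table: a symmetric error in log units is asymmetric in euros, so an over-prediction by a factor of two costs 100 % while the matching under-prediction costs only 50 %.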
The contribution decomposes loosely along the rows of the table. The FM26 visible block alone (the 36-attribute outfield surface plus the three meta features) attains R² = 0.677, within 0.014 of the EA-only baseline despite a different attribute vocabulary and a different rating philosophy. The FM26 hidden block, comprising the thirteen personality mentals that have no EA analogue at all, attains R² = 0.195 in isolation: modest in absolute terms but structurally informative, because thirteen personality variables with no overall-ability summary recover roughly one fifth of the variance in log-value. The FM26-only model (all blocks together) attains R² = 0.680 and median APE 45.4 %, while the union model jumps to R² = 0.785 and median APE 39.9 %, a gain of 0.105 over the better single-schema model and the signature of complementary information across two non-redundant attribute schemas.
The deployment-relevant reading is that the union model is the strongest attribute-only valuation surface yet reported on a public corpus at this scale. Held-out R² 0.785 against log Transfermarkt value places the model within 0.07 of the McHale & Holmes 2023 published ceiling of approximately 0.85, while the remaining gap is named at the feature level — StatsBomb event metrics, club-financial features — rather than at the modelling level. Every subsequent figure in §5, §6, and §7 reports diagnostics against this same protocol, so the table above is the single most important artefact of the paper.
5.1 — The bake-off, with fold spread
Five configurations are compared under the protocol above; the chart below visualises the headline R² figures together with their fold envelopes, with the union bar in orange to mark the joint embedding's separation from every single-schema alternative.
| Model | n | Features | R² mean | Fold spread | Median APE |
|---|---|---|---|---|---|
| EA-only | 6,729 | 38 | 0.663 | 0.643–0.680 | 51.6 % |
| FM26 hidden-only | 6,729 | 13 | 0.195 | 0.158–0.218 | 77.9 % |
| FM26 visible | 6,729 | 39 | 0.677 | 0.640–0.701 | 45.4 % |
| FM26-only (all blocks) | 6,729 | 52 | 0.680 | 0.645–0.708 | 45.4 % |
| EA + FM26 union | 6,729 | 90 | 0.785 | 0.774–0.796 | 39.9 % |
Sidebar — How can the union beat either alone when the named-pair diagonal only averages ρ = 0.529?
This is the right thing to be suspicious about. Figure 4.2 in §4 reported that the 19 conceptually-paired EA↔FM26 attribute cells (EA Finishing ↔ FM26 finishing, EA Crossing ↔ FM26 crossing, …) average a Spearman correlation of just 0.529 across 11,943 outfield men. EA's Heading Accuracy column even goes mildly negative against FM26 heading on the same-construct cell (the dark-blue square inside the diagonal box). And Figure 7.3 makes the row-level shape concrete: on Phil Foden, EA-only predicts €52 M, FM26-only predicts €0.2 M — they don't merely disagree, they live on different planets. So why does Figure 5.1 show the EA + FM26 union model jumping from R² 0.663 (EA-only) and 0.680 (FM26-only) to R² 0.785 on the same matched rows? Four parts to the answer, in order of how load-bearing each one is.
1. "Union" is a row-level join, not a column-level merge. The harmonization step does not average EA Finishing with FM26 finishing into a single "consensus" Finishing column. It joins the two databases on player identity — name + date-of-birth + club, with manual disambiguation on the residual 7.5 % — and then horizontally concatenates the two attribute vectors as separate input columns. The union model receives all 38 EA columns and all 52 FM26 columns side-by-side, 90 features total, and decides what to do with them. There is no schema-level reconciliation step; there is no imputation across schemas. EA Finishing and FM26 finishing remain two independent measurements of the same underlying construct, both available to every split in every tree. The model is never asked to decide which one is "right" — it gets to use both, conditioned on whatever else it has already split on. The matched-corpus n=6,729 in the bake-off table is the count of rows where both schemas were available, not a count of consensus columns.
2. Two moderately-correlated noisy measurements of the same construct add real information. This is the classical noisy-sensor-fusion intuition, the same one that lets an ensemble of weak learners beat a single strong learner. Write EA Finishing = TrueFinishing + εEA and FM26 finishing = TrueFinishing + εFM. If εEA and εFM are even partially independent (which the 0.529 named-pair diagonal is consistent with, because fully shared errors would push the cell towards ρ = 1), then a model conditioning on both features can reduce the residual variance below either feature alone. Concretely: the broadcast lens that drives EA's Finishing (preseason media coverage, big-stage minutes weighted heavily) and the local-knowledge process that drives FM26's finishing (1,300 researchers, two-year revision cycle, depth-of-squad weighting) introduce structurally different error modes on the same player. The same Foden whose EA Finishing reads "elite finisher who couldn't convert in the 24/25 season" is the same Foden whose FM26 finishing reads "elite finisher who rotated heavily and had Champions League minutes diluted by club issues." Both moderately wrong; both moderately wrong in different directions; one tree split per source recovers more than two tree splits on either source alone. The 0.529 diagonal does not say the two measurements disagree about the player. It says they agree on direction (Spearman measures rank agreement, not linear agreement) while carrying partly independent error around that direction.
3. The off-diagonal cells in Figure 4.2 are themselves informative. The dark-orange band in the bottom-right of the heatmap — EA Def Awareness / Standing Tackle / Sliding Tackle / Interceptions rising together with FM26 marking / tackling / positioning — is not redundant; it is cross-construct rank correlation, and a gradient-boosting tree can use it to recover signal when a within-construct measurement is noisy or absent on a particular row. EA's "Long Passing" correlates with FM26's "decisions" at ρ ≈ 0.5 because both rise in playmakers. EA's "Composure" correlates with FM26's "concentration" because both rise in technical mids who keep their heads under pressure. The 90-column joint embedding carries strictly more information than the EA-only 38 columns or the FM26-only 52 columns even after accounting for the 19 within-construct redundancies — because the cross-construct cells (most of the heatmap) are uncorrelated within either schema and weakly correlated across schemas, exactly the regime where gradient boosting extracts complementary signal.
4. FM26 contributes 13 hidden mentals that have no EA analogue at all. Adaptability, Ambition, Loyalty, Consistency, Temperament, Pressure, Important Matches, Versatility, Professionalism, Sportsmanship, Compliance, Fairness, Injury Resistance. EA has zero of these. Three of them (Adaptability rank 15, Injury Resistance rank 16, Ambition rank 20) earn top-20 permutation-importance slots in the union model (§5.4). The FM26 visible mentals (Concentration, Decisions, Vision, Bravery, Determination, Off-the-Ball, Anticipation, Composure, Leadership) partially overlap EA constructs, but at coarser granularity. So the union gain decomposes loosely into two pieces: variance reduction on the named-pair overlap (mechanisms 1–3 above) and net-new information from the personality block (mechanism 4). The permutation-importance ranking of §5.4 is the empirical apportionment: the EA-meta block carries 46 %, the FM-meta block (CA + age + reputation) carries 11 %, the FM hidden block carries 5 %, and the rest spreads across the FM visible attributes that partially overlap EA.
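Mechanism 1 in the list above (a row-level join followed by horizontal concatenation) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the key normalisation, the `ea_` / `fm26_` prefixes, and the sample rows are all hypothetical, and the manual disambiguation pass described in the text is represented only by skipping unmatched rows.

```python
def join_key(row):
    """Identity key: name + date of birth + club, lightly normalised (an assumption)."""
    return (row["name"].strip().lower(), row["dob"], row["club"].strip().lower())


def union_rows(ea_rows, fm_rows):
    """Join on player identity, then concatenate both attribute vectors side by side."""
    fm_by_key = {join_key(r): r for r in fm_rows}
    joined = []
    for ea in ea_rows:
        fm = fm_by_key.get(join_key(ea))
        if fm is None:
            continue  # unmatched rows would go to manual disambiguation
        row = {"name": ea["name"], "dob": ea["dob"], "club": ea["club"]}
        # Same-named attributes stay as two separate columns; nothing is averaged.
        row.update({f"ea_{k}": v for k, v in ea["attrs"].items()})
        row.update({f"fm26_{k}": v for k, v in fm["attrs"].items()})
        joined.append(row)
    return joined


# Hypothetical rows for illustration only; values are not from either database.
ea_rows = [
    {"name": "Player A", "dob": "2000-01-01", "club": "Club X",
     "attrs": {"finishing": 82, "ovr": 80}},
    {"name": "Player B", "dob": "1998-05-05", "club": "Club Y",
     "attrs": {"finishing": 60, "ovr": 71}},
]
fm_rows = [
    {"name": "player a ", "dob": "2000-01-01", "club": "club x",
     "attrs": {"finishing": 14, "currentAbility": 150}},
]

joined = union_rows(ea_rows, fm_rows)
print(joined)
```

The point of the design is visible in the output row: `ea_finishing` and `fm26_finishing` coexist as independent inputs, so a tree can split on either one.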
A clean way to see all four mechanisms in one row is Phil Foden in Figure 7.3. EA-only predicts €52 M (he reads as a "good but not yet marquee" PL RW). FM26-only predicts €0.2 M — implausibly low, and almost certainly a feature-vector pathology where his 24/25-season rotation pattern + Important Matches score made his FM26 vector resemble a fringe rotation player. The union prediction lands at €25 M — still far below the actual €150 M (the §7.1 celebrity premium that no schema-based model in this literature recovers) but recovering the OVR-based rank-order. The union does not "vote" between EA and FM; it conditions one against the other in a way that fixes FM26's pathology on this row without throwing away the FM-meta signal that helps it on most other rows. That is the row-level signature of the variance-reduction mechanism, and it is what the matched-corpus R² 0.785 looks like when you zoom in to the residual.
The harmonization plus union together raised an additional R² of 0.122 over EA-only on the matched corpus. Roughly 0.04 of that is mechanism 4 (the hidden-mental net-new information that FM has and EA doesn't); roughly 0.08 is mechanisms 1–3 (variance reduction on the overlapping construct space, where the named-pair ρ = 0.529 is enabling the gain rather than threatening it). The named-pair ρ that looks worryingly low is, in the additive-modelling regime, the exact number that makes the union work — perfect agreement would have made FM's columns redundant.
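The variance-reduction argument can be illustrated with a small simulation. The sketch below is not the paper's model: it uses ordinary least squares on synthetic data, a single latent skill, and independent measurement errors sized so that the two noisy ratings correlate at roughly 0.53 (Pearson here, where the text reports Spearman). All variable names and noise levels are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# One latent skill, two noisy measurements with independent errors.
true_skill = rng.normal(size=n)
noise_sd = np.sqrt(0.89)  # gives corr(ea, fm) of about 1 / (1 + 0.89), i.e. 0.53
ea = true_skill + rng.normal(scale=noise_sd, size=n)
fm = true_skill + rng.normal(scale=noise_sd, size=n)
log_value = true_skill + rng.normal(scale=0.3, size=n)


def ols_r2(features, target):
    """In-sample R² of an OLS fit of target on an intercept plus the features."""
    X = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    return float(1.0 - resid.var() / target.var())


pair_corr = float(np.corrcoef(ea, fm)[0, 1])
r2_ea = ols_r2([ea], log_value)
r2_fm = ols_r2([fm], log_value)
r2_both = ols_r2([ea, fm], log_value)
print(f"corr(ea, fm)={pair_corr:.3f}  R2 ea={r2_ea:.3f}  fm={r2_fm:.3f}  both={r2_both:.3f}")
```

Under these assumptions the two-measurement fit explains noticeably more variance than either measurement alone, even though the pair correlates only moderately. If the errors were fully shared, the second column would add nothing.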
Three findings sit inside that table and should be read in the order in which they appear. The union model beats EA-only on every fold; EA-only's best fold sits at 0.680, below the union's worst fold at 0.774, and the fold envelopes do not overlap. The gap comfortably exceeds the noise floor of a 5-fold CV at this sample size, indicating that the integration costs catalogued in Appendix A4 purchase genuine predictive power rather than a within-noise wobble. EA standalone on its full n=7,835 frame at R² 0.751 remains the correct standalone choice, seven points ahead of FM-only, and any "replace EA with FM" deployment proposal can be rejected on the standalone-R² evidence alone. The thirteen hidden mentals — Versatility, Important Matches, Loyalty, Ambition, Adaptability, Consistency, Temperament, Professionalism, Sportsmanship, Pressure, Injury Resistance, Compliance, and Fairness — explain 19.5 % of variance on their own with no CA, no age, no reputation, and no technical attributes in the feature vector. The result is modest in absolute terms but non-trivial against the appropriate benchmark; thirteen personality variables, absent any overall-ability summary, recovering one fifth of the log-value variance constitutes the cleanest possible evidence that FM is collecting structurally different information from EA rather than the same information at higher resolution.
The Wave 1 ceiling claim now reads differently. Wave 1 sat at R² 0.77 on EA alone and claimed to be "within striking distance" of the published 0.85–0.90 ceiling by reporting the top-8-league slice at 0.82 [Lee, 2026a]. Wave 2 lifts the full-sample number to 0.785 with the union model, and the remaining gap is now the named StatsBomb-style event-metrics gap discussed in §7.3 rather than an unaccounted-for shortfall.
5.2 — Where the lift lives: league tier + OVR band
The Δ +0.122 R² (union 0.785 against EA-only 0.663 on the matched corpus) is an average; the deployment decision turns on where within the corpus that average is concentrated.
The lift is smallest in the top-5 leagues — Premier League, LaLiga, Bundesliga, Serie A, and Ligue 1 — where EA's broadcast lens receives the most attention, OVR is most carefully tuned, and the calibration team has the most Champions League footage available. It is largest in the mid-tier band (Bundesliga 2, Eredivisie, Liga Portugal, Serie B, Belgian Pro League) at Δ R² +0.039, and comparably large in the long tail at Δ R² +0.038. The pattern is the inverse of the regions in which EA-only accuracy is highest, which is precisely what one would expect were FM's information lift genuine: the marginal data source contributes most where the dominant data source has the least to say.
The deployment-relevant reading is sharper than the academic one. Top-5-league players are already priced efficiently — Erling Haaland, Kylian Mbappé, and Vinícius Júnior have values that are crowd-sourced by hundreds of Transfermarkt editors, anchored on recent comparables read from a shared sheet by every analytics shop in the market, and foreshadowed by trade-press transfer rumours six months before any window opens. A club analytics function that builds a model to price the top-5 marquee is therefore solving a problem the market has already solved. The recruitment edge — the actual commercial output of a club's analytics department — lives in the leagues where the market is thinner: the Eredivisie holding midfielder, the Liga Portugal centre-back, the Bundesliga 2 prospect, and the Saudi Pro League signing whose pricing remains opaque even to neighbouring clubs. The bake-off indicates plainly that FM's researcher network performs more work in those leagues than in the top-5.
The OVR-band view tightens the story.
The 75-and-above OVR bands are where the integer-OVR spacing is densest, where individual marginal information is most valuable, and where transfer fees most exceed the residual noise floor in absolute euro terms. At OVR 85+, EA-only's median APE is 41.3 % against the union model's 20.9 % — roughly half the error on the population of players that drives the majority of the global transfer market's euro volume. The 70–74 band, with the largest single bucket count at 1,747 players, sees a 15.5 percentage-point reduction in median APE, and that band contains most mid-tier scouting targets. The lift therefore lives where the recruitment market actually transacts.
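The band-level figures rest on median absolute percentage error computed on raw euros. A minimal sketch of that metric, grouped by OVR band, is given below; the band edges other than 85+ and 70–74, the function names, and the sample rows are hypothetical.

```python
from statistics import median


def median_ape(actual_eur, predicted_eur):
    """Median absolute percentage error, in percent, on raw euro values."""
    return 100.0 * median(abs(p - a) / a for a, p in zip(actual_eur, predicted_eur))


def median_ape_by_band(rows, bands):
    """rows: (ovr, actual_eur, predicted_eur); bands: (label, lo, hi) inclusive."""
    out = {}
    for label, lo, hi in bands:
        in_band = [(a, p) for ovr, a, p in rows if lo <= ovr <= hi]
        if in_band:
            actual, pred = zip(*in_band)
            out[label] = median_ape(actual, pred)
    return out


# Hypothetical rows for illustration only.
rows = [
    (86, 100e6, 80e6),
    (88, 50e6, 60e6),
    (72, 2e6, 3e6),
    (73, 4e6, 2e6),
    (71, 1e6, 1e6),
]
bands = [("85+", 85, 99), ("70-74", 70, 74)]
by_band = median_ape_by_band(rows, bands)
print(by_band)
```

Because the error is a ratio to the actual value, a fixed euro miss weighs far more in a low-value band than in the marquee band, which is why the per-band view is reported alongside R².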
5.3 — Out-of-fold prediction quality
R² compresses the model's error structure into a single scalar; the parity scatter provides the clean visual decomposition.
Three observations follow from the scatter. The spread is roughly homoscedastic in log-space: the residual band has approximately the same width at log-value 5 (€100K) as at log-value 7 (€10M), so the model is not systematically over- or under-shooting one price band relative to another. The high-OVR tail of 85+ players clusters in the upper-right quadrant where it should, but with non-trivial dispersion; the model recovers their rank ordering well (Spearman 0.869) and their absolute level imperfectly, which is the celebrity premium of §7.3 made visible. The EA-only model on these same rows leaves the upper-right quadrant materially noisier, and the R² gap of 0.122 between the two models lives almost entirely in the model's ability to discriminate between players whose OVR is similar but whose FM26 signal differs.
5.4 — What the union model learned: top-20 feature importance
The aggregate numbers indicate that FM earns its place; permutation importance identifies which features perform that work.
The top six EA-side features by permutation importance are OVR (0.347), Potential (0.111), Age (0.025), Composure (0.010), Volleys (0.006), and Shot Power (0.006). The top six FM26-side features are currentAbility (0.042), age (0.039), reputation (0.032), anticipation (0.007), finishing (0.006), and jumpingReach (0.006). The three hidden mentals that earn top-20 slots — Adaptability at 0.005, Injury Resistance at 0.005, Ambition at 0.005 — carry roughly the same marginal weight as the FM26 visible attributes that sit alongside them.
The feature inventory sorts cleanly into five classes a recruitment analyst can act on:
Two interpretive notes belong with the ranking. First, OVR's dominance survives intact; it remains the single most informative feature in the union model at roughly 3× the importance of Potential and over 8× the importance of FM's Current Ability, so CA does not displace OVR but sits alongside it. Whatever EA's broadcast lens encodes into OVR has no FM analogue that matches it for predictive density on Transfermarkt market value, and a model that attempted to substitute CA for OVR would forfeit seven points of standalone R². Second, three of the thirteen hidden mentals earn their place in the top 20 despite having no EA analogue at all. Adaptability is plausibly the lurking variable that explains a non-trivial share of the dispersion in why some transfers price well and others do not, illustrated by Haaland to Manchester City — where SI's researchers had him at Adaptability 17 the year before he moved — versus Hazard to Real Madrid, who scored materially lower on the same attribute the year of his transfer. Injury Resistance is the durability variable that, on inspection of EA-only residuals, explains a substantial fraction of the older-player price decay the EA schema cannot model; a 31-year-old with high Injury Resistance prices materially differently from a 31-year-old with low Injury Resistance, and EA records no feature that distinguishes them.
The appropriate reading of the top-20 is that the FM contribution is not a vague gestalt improvement but is concentrated in a small number of named features that capture variables EA cannot capture by design. The integration decision is unusually clean for an empirical exercise because the variables being purchased can be enumerated.
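For readers unfamiliar with the technique, permutation importance measures how much held-out R² falls when one column is shuffled. The sketch below is a simplified stand-in: it uses a linear model on synthetic data rather than the gradient-boosted union model, and the feature names are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients with the intercept in position 0."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + X @ coef[1:]


def r2(y, yhat):
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)


def permutation_importance(coef, X, y, n_repeats=10, seed=0):
    """Mean drop in R² when each column is shuffled, one column at a time."""
    rng = np.random.default_rng(seed)
    base = r2(y, predict(coef, X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            drops[j] += base - r2(y, predict(coef, Xp))
    return drops / n_repeats


# Synthetic data: one dominant feature, one secondary feature, one pure-noise column.
rng = np.random.default_rng(1)
names = ["aggregate_rating", "secondary_rating", "pure_noise"]
X = rng.normal(size=(4000, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=4000)

coef = fit_ols(X[:2000], y[:2000])                      # fit on the first half
scores = permutation_importance(coef, X[2000:], y[2000:])  # score on the second half
importance = dict(zip(names, (float(s) for s in scores)))
print(importance)
```

The ordering of the drops, not their absolute size, is what a ranking like the top-20 list reports. With collinear inputs the credit concentrates on whichever column the model leans on most, an issue taken up in §5.7.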
5.5 — Hidden mentals: honest about the bundle effect
The top-20 permutation-importance result is the headline; the drop-one analysis is the corresponding discipline check. We re-ran 5-fold CV on the full EA + FM26 visible + FM26 hidden model with each hidden mental removed in turn and computed Δ R² (full minus drop-one) in per-mille.
No individual hidden mental's contribution is statistically distinguishable from zero at this sample size. The directionality is suggestive — Ambition and Pressure are the only two mentals on which the full model marginally beats the leaner model, while Important Matches, Versatility, and Adaptability show the leaner model marginally winning — but the magnitudes are small enough that the appropriate interpretation is directional rather than substantive. The bundle of thirteen mentals adds Δ +0.003 R² on top of EA + FM26 visible, directionally consistent across folds but inside the CV standard deviation of ±0.011 [bake_off_hidden_mentals.csv]. The plausible mechanism is that HistGradientBoostingRegressor, when fed a noisy thirteen-mental ordinal vector, absorbs noise as readily as signal at this n, producing tree splits that appear additive in-fold and degrade out-of-fold.
The honest interpretation is that the hidden block earns its place in the union model as a bundle rather than through nameable individual contributors. The §5.4 permutation result that surfaces Adaptability, Injury Resistance, and Ambition in the top 20 is genuine, in that those three mentals carry the bundle's load disproportionately, but the drop-one result is the discipline that prevents any single mental from being cited as a Wave 2 finding in isolation. The next subsection demonstrates that a direct composite, computed without tree-based feature competition, recovers a clearer effect from the same raw signal.
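The drop-one protocol can be sketched as follows. This is an illustration under assumptions, not the paper's pipeline: it substitutes a linear model for HistGradientBoostingRegressor, uses synthetic data, and reuses the same fold assignment for every variant so that differences are not driven by the split.

```python
import numpy as np


def cv_r2(X, y, k=5, seed=0):
    """Mean and standard deviation of held-out R² over k folds (OLS stand-in)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        A = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = coef[0] + X[test] @ coef[1:]
        ss_res = np.sum((y[test] - pred) ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores)), float(np.std(scores))


def drop_one_deltas(X, y, names):
    """Delta R² (full minus drop-one) in per-mille for each named column."""
    full, _ = cv_r2(X, y)
    deltas = {}
    for j, name in enumerate(names):
        reduced, _ = cv_r2(np.delete(X, j, axis=1), y)
        deltas[name] = 1000.0 * (full - reduced)
    return deltas


# Synthetic data: one informative column and three pure-noise columns.
rng = np.random.default_rng(2)
names = ["signal", "noise_a", "noise_b", "noise_c"]
X = rng.normal(size=(2000, 4))
y = X[:, 0] + rng.normal(size=2000)

deltas = drop_one_deltas(X, y, names)
full_mean, full_sd = cv_r2(X, y)
print(deltas, full_mean, full_sd)
```

A delta should be read against the fold standard deviation returned by `cv_r2`: a column whose removal moves R² by less than that spread cannot be credited individually.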
5.6 — The Big-Match composite
Important Matches was the most marginal hidden mental in the drop-one analysis. The finding can be sharpened by combining it with two adjacent mental attributes, Determination and Composure, into a Big-Match composite (the z-score mean of the three), and then asking whether players in the top decile carry a market-value premium that the OVR + Age pathway cannot explain.
The raw 14× spread does not constitute pure signal, since better players also receive higher Big-Match ratings and the OVR pathway therefore explains part of the gap. The clean test is the residual correlation after OVR and Age have been partialled out: Spearman +0.108, with the residual distribution itself showing a monotone increase across deciles (mean residual −0.09 in the bottom decile and +0.04 in deciles 6–8). Players the Big-Match composite ranks highly carry a small but identifiable value premium over and above what their OVR predicts. The construct is testable and it validates. The actionable use is for a scouting model seeking a single composite "rises to the occasion" feature: the present construction is the empirically supported three-component version, carrying the cleanest signal in the bundle of mentals that §5.5 declined to credit individually.
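The composite-and-residual test can be sketched as below. The data are synthetic and every coefficient is an assumption; the sketch takes one reading of "partialled out", namely regressing log-value on OVR and age and correlating that residual with the composite. The simulated effect is deliberately larger than the +0.108 reported in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-ins (not the paper's data): OVR, age, and a latent trait.
ovr = rng.normal(70.0, 7.0, size=n)
age = rng.integers(17, 36, size=n).astype(float)
latent = rng.normal(size=n)
ovr_z = (ovr - ovr.mean()) / ovr.std()
# Three component ratings that rise with OVR and with the latent trait.
components = [0.5 * ovr_z + latent + rng.normal(scale=0.5, size=n) for _ in range(3)]
log_value = 0.10 * ovr - 0.03 * age + 0.15 * latent + rng.normal(scale=0.3, size=n)


def zscore(x):
    return (x - x.mean()) / x.std()


def ranks(x):
    """Ordinal ranks 0..n-1 (no tie handling; adequate for continuous inputs)."""
    out = np.empty(len(x))
    out[np.argsort(x)] = np.arange(len(x))
    return out


def spearman(a, b):
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])


def residualise(target, controls):
    """Residual of an OLS fit of target on an intercept plus the controls."""
    A = np.column_stack([np.ones(len(target))] + list(controls))
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return target - A @ coef


composite = np.mean([zscore(c) for c in components], axis=0)  # z-score mean of three
value_resid = residualise(log_value, [ovr, age])
rho = spearman(composite, value_resid)

# Mean residual by composite decile (decile 0 = lowest composite scores).
decile = (ranks(composite) * 10 // n).astype(int)
decile_means = [float(value_resid[decile == d].mean()) for d in range(10)]
print(round(rho, 3), [round(m, 3) for m in decile_means])
```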
5.7 — Orthogonalising the aggregates: keeping OVR and CA without double-counting
The figure-5.5 importance plot is honest about a problem it does not solve. OVR carries 0.74 of the permutation importance in the union model, fm26_currentAbility carries another 0.07, and every sub-attribute sits at 0.01 or less. The numerical attribution is therefore "OVR plus age, then noise", which is both unsatisfying and structurally misleading: OVR is itself a position-weighted aggregation of the same sub-attribute vector the union model also sees independently. The aggregate and its parts are highly collinear, the trees prefer to split on the cleaner aggregate signal, and the parts get starved of credit. A naive fix — dropping OVR and CA from the feature set — costs only 1.8 percentage points of R² (the union falls from 0.785 to 0.767) but throws away the position-weighting signal that OVR uniquely carries beyond the sub-attribute vector. A striker's Finishing is not interchangeable with a defender's Finishing, and the per-position OVR formula encodes exactly that conditional weighting.
The principled fix is to keep OVR and CA in the feature set, but to orthogonalise each against its sub-attribute basis via the Frisch–Waugh–Lovell construction. Fit a single global OLS — no position dummies, no interactions — of each aggregate on its underlying attribute vector:
$$ \widehat{\text{OVR}}(x) = \alpha_0 + \sum_{i=1}^{31} \alpha_i \, x_i, \qquad \widehat{\text{CA}}(x) = \beta_0 + \sum_{j=1}^{49} \beta_j \, x_j $$
The fits are strong but imperfect — global OVR $R^2 = 0.861$, global CA $R^2 = 0.831$ — in contrast to the per-position OVR fits in Figure 2.2 which lie in the 0.96–0.998 range. The unexplained 14–17 % is the position-weighting structure plus EA committee adjustments plus rounding for OVR, and the hidden-mental integration for CA. The residuals
$$ \text{OVR}_{\text{resid}} = \text{OVR} - \widehat{\text{OVR}}(x), \qquad \text{CA}_{\text{resid}} = \text{CA} - \widehat{\text{CA}}(x) $$
are, by construction, orthogonal to the sub-attribute basis. The residual carries the part of the aggregate that cannot be reconstructed from a position-blind linear combination of sub-attributes, i.e. the position weighting and committee signal. The sub-attributes keep their full marginal signal, the trees can split on either independently, and no double counting occurs. Per-position OLS would have absorbed the position signal into the coefficients and left only rounding noise in the residual; this is exactly the move the construction was designed to avoid. The standard deviation of the global residual is 2.41 OVR points, against 1.25 for the per-position residual. The global construction therefore preserves roughly twice the residual spread, or about 3.7 times the variance (2.41² / 1.25²), and that additional structure is position-context.
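The orthogonalisation step can be sketched on synthetic data. The array names, the six-column basis, and the `position_context` term standing in for structure that is not linear in the sub-attributes are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 6

# Synthetic stand-ins: sub-attributes plus a component not linear in them.
sub_attrs = rng.normal(size=(n, p))
position_context = rng.normal(size=n)
weights = rng.uniform(0.5, 2.0, size=p)
ovr = 60.0 + sub_attrs @ weights + 2.0 * position_context


def orthogonalise(aggregate, basis):
    """Residual of a single global OLS of the aggregate on its sub-attribute basis."""
    A = np.column_stack([np.ones(len(aggregate)), basis])
    coef, *_ = np.linalg.lstsq(A, aggregate, rcond=None)
    fitted = A @ coef
    ss_res = np.sum((aggregate - fitted) ** 2)
    ss_tot = np.sum((aggregate - aggregate.mean()) ** 2)
    return aggregate - fitted, float(1.0 - ss_res / ss_tot)


ovr_resid, global_r2 = orthogonalise(ovr, sub_attrs)

# OLS residuals are orthogonal to every regressor, up to floating-point error.
max_cross = float(np.max(np.abs(sub_attrs.T @ ovr_resid)))
# The residual keeps the part of the aggregate the basis cannot reconstruct.
corr_with_context = float(np.corrcoef(ovr_resid, position_context)[0, 1])
print(global_r2, max_cross, corr_with_context)
```

Feeding `ovr_resid` to the downstream model in place of raw OVR keeps the non-reconstructible signal while leaving the sub-attribute columns free to earn their own credit. The orthogonality holds exactly only on the rows used for the fit.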
The three-way bake-off lands the orthogonalised union at $R^{2} = 0.774$ — 98.6 % of the baseline 0.785 and notably above the drop-aggregate floor of 0.767 — while the marquee top-20 median APE actually improves over baseline, from 29.5 % to 24.2 %. The same effect that appeared in the drop-aggregate variant (removing OVR's elite-tier compression lets the model spread superstars more cleanly) survives in a regime that keeps all the position information. FM-only sees the most dramatic marquee improvement (33.7 % → 25.4 %). EA-only orthogonal worsens on the marquee (38.7 % → 50.2 %), which is the one configuration where keeping raw OVR pays off — without any FM signal the residual cannot fully substitute for OVR's position-conditional weighting. The orthogonalisation does its work in the union and FM tracks where the wider attribute basis can absorb the redistributed credit.
The attribution becomes interpretable. The total explanatory work is unchanged, but the model is forced to acknowledge what is actually doing it: age as the sharp valuation lever, Reactions as the dominant skill proxy, reputation as the broadcast and transfer-market echo, OVR_resid as a small but real position-context signal worth 0.027, and a long tail of sub-attributes each carrying non-trivial weight. The orthogonal model is the better working artifact for any player-level inference — R² comparable to baseline, marquee accuracy strictly better, and the attribution defensible. The raw-OVR baseline remains the correct comparison point for the structural narrative "EA's OVR alone explains most of the variance", but as a deployable model the orthogonal variant is the version a recruitment team should pick up. The §11 position-transfer tool relies on exactly this construction; the per-position OVR formulas of §2.2 generate the counterfactual aggregates, and the orthogonalised model evaluates them without inflating the credit each receives [wave2c_summary.json].
Section closer
The deployment choice between EA alone and EA + FM depends on the operating surface. EA alone is appropriate for a top-5-league-only valuation pipeline when the engineering overhead of maintaining FM editions is genuinely binding. EA + FM is appropriate wherever the recruitment edge actually lives — the mid-tier leagues, the OVR-75-and-above bands, and the long tail where Transfermarkt's editor coverage thins out. FM alone is never appropriate: the seven-point standalone R² gap against EA is too large to justify, and the OVR signal that FM cannot match performs too much of the work in the union. The decision is structured by the deployment surface rather than by the headline R² number alone.
§6 — Validating the union model: Transfermarkt + CIES as two ground truths

The bake-off in §5 chose between three modelling configurations on a single benchmark: held-out Transfermarkt market value. That is one ground truth. The CIES Football Observatory monthly valuation is another, and the two of them disagree by a median 122 % on the same population of marquee players. A union model that lives between the two of them is not merely a TM-calibrated regression; it is a third measurement of the same latent valuation surface, anchored against a researcher-graded attribute matrix rather than against a crowd or against a proprietary club-financial composite. §6 sits the EA + FM26 union next to both of those public benchmarks and reports what a two-faced ceiling test looks like.
6.1 — Why two ground truths are stronger than one
The discipline of any empirical valuation paper is that the held-out scoring set must be defensible as a representation of the thing the model is trying to predict. Wave 1's Phase 3 audit settled the choice for the EA-only and EA+TM-hybrid models against realised transfer fees on the 303-row corpus that matches the EAFC26 universe [Lee, 2026a]; the §5 bake-off in this paper holds the same comparison out against held-out log-Transfermarkt value, with cross-validated R² and median absolute percentage error reported on each fold. Each of those benchmarks is defensible. Each of them is also partial. Transfermarkt's market value is a crowd-curated consensus number, maintained by a network of unpaid moderators and conditioned heavily on the most recent comparable fee and on forum-level discussion of contract dynamics; it is the standard public valuation lens and it is the lens against which most published attribute-to-value models in the literature are scored [Transfermarkt, 2024; McHale & Holmes, 2023]. The CIES Football Observatory monthly valuation, by contrast, is a black-box model output produced by the Poli–Ravenel–Besson group at the International Centre for Sports Studies in Neuchâtel, trained on roughly 3,000 realised transfer fees and fed by contract length, age, international status, performance metrics, club financial strength, and league level [Poli et al., 2022; CIES, 2024]. Both groups claim correlations with realised fees in the neighbourhood of 0.85; both groups document feature lists at the construct level; neither releases its weighting function or its training-set composition [CIES, 2024].
The two benchmarks are not interchangeable. Transfermarkt anchors on observed comparable fees and on the crowd's collective read of contract-window momentum; CIES anchors on a regression model whose marquee-tier predictions are observably uplifted by contract-length and club-financial covariates that no fan-curated platform tracks consistently. Because the union model's input feature set is neither a crowd of human editors nor a contract database but an attribute vector graded by EA and Sports Interactive researchers, the natural test is to compare against both benchmarks simultaneously. If the attribute-only union model lands within the corridor between the two — closer to one of them on the marquee tail and to the other elsewhere — then the attribute approach is recovering the same latent valuation surface from a third, independently noisy direction. That is the cleanest construct-validity argument an attribute-based model can offer, and it is the test §6 runs.
6.2 — TM validation, restated
The Transfermarkt half of the test has been established in detail across §3 and §4 and bears a one-paragraph compression here. On the matched 6,729-man corpus where every configuration scores the same rows, the EA + FM26 union model attains a five-fold cross-validated $R^{2} = 0.785$ against log-Transfermarkt value, with a fold envelope of 0.774 to 0.796 that does not overlap the EA-only envelope of 0.643 to 0.680. Median absolute percentage error on raw euros falls from 51.6 % under EA-only to 39.9 % under the union, an 11.7 percentage-point reduction. Spearman rank correlation against held-out TM value on the marquee top-50 is 0.869, and on the audited 20-name marquee panel the median absolute percentage error of the union model is 25.5 % against 53.6 % for EA-only and 41.3 % for FM26-only (Figure 3.3). The parity scatter (Figure 5.4) shows residual variance that is roughly homoscedastic in log-space across the €100K to €10M range, and the calibration deciles run flat against the diagonal up to the €20M ceiling above which the celebrity premium of §3.3 introduces the structural under-shoot that no schema-based model in this literature recovers. The outlier sweep (Figure 3.1) confirms that the model's R² collapses mechanically when the marquee tail is trimmed (a $\text{SS}_{\text{total}}$ effect, not a degradation in MAE-log or RMSE-log) and that the absolute-error budget is concentrated in the €10K–€50K floor where Transfermarkt round-numbers dominate genuine signal. The TM half of §6 is thus settled: the union model is the strongest attribute-only predictor of TM value yet reported on a corpus of this scale, and the residual gap to the published academic ceiling of $R^{2} \approx 0.85\text{--}0.90$ is named at the feature level (StatsBomb event metrics, club-financial features) rather than at the modelling level.
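The $\text{SS}_{\text{total}}$ effect mentioned in the outlier sweep is easy to reproduce on synthetic data. In the sketch below the prediction error is identical across the value range, and only the range of the target is trimmed; the uniform log-value distribution and the noise level are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic log10 values spanning four orders of magnitude, with constant error.
log_value = rng.uniform(4.0, 8.0, size=10_000)
prediction = log_value + rng.normal(scale=0.4, size=log_value.size)


def r2(y, yhat):
    return float(1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2))


def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))


keep = log_value < 6.0  # trim the upper tail of the target
r2_full, rmse_full = r2(log_value, prediction), rmse(log_value, prediction)
r2_trim, rmse_trim = r2(log_value[keep], prediction[keep]), rmse(log_value[keep], prediction[keep])
print(f"full: R2={r2_full:.3f} RMSE={rmse_full:.3f}   trimmed: R2={r2_trim:.3f} RMSE={rmse_trim:.3f}")
```

RMSE is essentially unchanged by the trim while R² falls sharply, because the denominator of R² shrinks with the spread of the target. This is why the text reads the sweep alongside MAE-log and RMSE-log rather than R² alone.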
6.3 — CIES validation: the new analysis
The CIES half of §6 runs on the 74 players from the CIES Football Observatory top-100 January 2026 release who fuzzy-match into the Wave 2 union-prediction frame at a name-token-set ratio of $\geq 85$, after deduplication against the highest-pred-value representative for any name collision [section_v_cies_validation.csv]. The same 74 rows carry Transfermarkt market value and the held-out Wave 2 union prediction, which permits a clean three-way correlation table on a single population.
The full correlation table is the load-bearing analytical artefact of §6.
| Pair | Spearman $\rho$ | Pearson on log values |
|---|---|---|
| Union prediction vs CIES | 0.479 | 0.563 |
| Union prediction vs TM | 0.719 | 0.636 |
| Transfermarkt vs CIES | 0.551 | 0.467 |
| EA-only prediction vs CIES | 0.384 | 0.427 |
| EA-only prediction vs TM | 0.703 | 0.600 |
Three observations follow directly from the table and they should be read together rather than in sequence. First, the union model's rank correlation against CIES at $\rho = 0.479$ is materially lower than its rank correlation against TM at $\rho = 0.719$ on the same 74 players. The attribute-only model is closer to the Transfermarkt consensus than to the CIES proprietary output, which is exactly the prior the structure of the model would suggest, because both the union model and the Transfermarkt crowd are pricing on signals that derive from observable on-pitch ability, with contract dynamics absent from both. Second, Transfermarkt and CIES themselves correlate at only $\rho = 0.551$ on these same 74 rows: the two "ground truths" agree only moderately in rank order ($\rho^{2} \approx 0.30$) and disagree on the rest. That is the empirical fact that makes the comparison-against-both test interesting in the first place. If the two benchmarks were identical the test would be vacuous; the fact that they share only moderate rank correlation, on a population where they have been observed in the same month, is what gives a model that lands between them its construct-validity argument. Third, the union model improves on EA-only against both ground truths: Spearman against CIES rises from 0.384 to 0.479 (a 0.095 lift) and Spearman against TM rises from 0.703 to 0.719. The absolute-error improvement is even sharper: median absolute percentage error against CIES falls from 85.4 % (EA-only) to 71.8 % (union), and median absolute percentage error against TM falls from 66.8 % (EA-only) to 41.6 % (union) on the same matched 74-row frame. The §5 lift survives transposition to the CIES benchmark.
6.4 — Where the model sits, and where it sits below
The headline divergence between the two benchmarks is concentrated in the top tier and decays with rank.
| Tier | n | Median APE vs CIES | Median APE vs TM | Median CIES − TM gap |
|---|---|---|---|---|
| Top 10 by CIES rank | 10 | 40.0 % | 20.4 % | 27.2 % |
| Rank 11–30 | 20 | 75.9 % | 42.7 % | 141.1 % |
| Rank 31–end | 44 | 74.2 % | 46.5 % | 164.6 % |
The top-10 row is the interpretive anchor. On the ten most valuable players in the world by CIES January 2026, the median gap between CIES and Transfermarkt itself is only 27.2 %, and on the same ten players the union model's median absolute percentage error against TM is 20.4 %, well inside the CIES–TM separation. From rank 11 downward the two benchmarks diverge sharply, with CIES sitting at a median 141 % above TM in the rank 11–30 band and 165 % above TM in the rank 31–end band. The Wave 2 union model tracks Transfermarkt through that divergence: median APE versus TM stays in the 42–47 % range, while median APE versus CIES is locked at 74–76 %, because the union model is not capable of recovering the contract-length and club-financial uplift that drives the bottom 64 rows of the matched CIES frame above their Transfermarkt comparable. This is not a fault of the union model. It is what an attribute-only model would do against a benchmark whose ingredient list explicitly includes contract length, club financial strength, and international status as separate weighted features.
The bracketing test refines the picture. Of the 74 union predictions, 20 (27.0 %) sit numerically between the TM value and the CIES value on the same player; three sit above both, and the remaining 51 sit below both. The model is not "wrong" against CIES so much as observably anchored on the Transfermarkt side of the CIES–TM corridor — a scouting department reading the CIES output as a separate signal would interpret the gap as a contract-window premium the union model is structurally blind to.
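The bracketing classification can be sketched in a few lines. This is a minimal illustration rather than the paper's pipeline: the function names are hypothetical, and treating a prediction that equals either benchmark as "between" is an assumption the text does not settle.

```python
def bracket_position(pred, tm, cies):
    """Place one prediction relative to the TM-CIES corridor for the same player."""
    low, high = min(tm, cies), max(tm, cies)
    if pred > high:
        return "above_both"
    if pred < low:
        return "below_both"
    return "between"  # assumption: equality with either bound counts as between


def bracket_counts(rows):
    """Tally corridor positions for an iterable of (pred, tm, cies) triples."""
    counts = {"between": 0, "above_both": 0, "below_both": 0}
    for pred, tm, cies in rows:
        counts[bracket_position(pred, tm, cies)] += 1
    return counts


# Lamine Yamal row reported in this section, in EUR millions:
# union prediction, Transfermarkt value, CIES value.
yamal = (134.4, 120.0, 376.9)
print(bracket_position(*yamal))  # prints "between"
```

Run over all 74 matched rows, `bracket_counts` would produce the three tallies the paragraph reports.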
The single most informative row in the file is Lamine Yamal, ranked CIES #1 in January 2026.
| Benchmark | Value |
|---|---|
| CIES January 2026 | €376.9 M |
| Transfermarkt January 2026 | €120.0 M |
| Wave 2 EA + FM26 union prediction | €134.4 M |
| Wave 2 EA-only prediction | €53.1 M |
| Wave 1 EA-only prediction [Lee, 2026a] | €114.6 M |
Five numbers, five separate readings of the same eighteen-year-old. The CIES output sits 214 % above the Transfermarkt value and 180 % above the Wave 2 union prediction; the union prediction sits 12 % above Transfermarkt and 153 % above EA-only on the same row. The CIES number is a model output that captures contract dynamics (Yamal's eight-year contract with Barcelona, his Euro 2024 winning campaign at age sixteen, the implicit option value of a contract that long for a player that young), and Transfermarkt's crowd has not (yet, on the January 2026 capture) marked him up to match. The union model lands very close to where the Transfermarkt crowd has him, above the consensus by about €14M, and well below the CIES contract-aware ceiling. The Wave 1 EA-only model landed at €114.6M on the same player, within €20M of the Wave 2 union number, which is a clean instance of the §3.1 finding that the model's rank-ordering of the marquee tier is broadly stable across Wave 1 and Wave 2 even as the full-corpus R² rises from 0.77 to 0.785. The marquee tier is hard because it is the place where the celebrity premium lives, and no attribute-only model in this literature recovers it.
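As a worked check, the percentage gaps quoted for this row follow directly from the table values (all in € millions), reading "X sits p % above Y" as $(X - Y)/Y$:

$$
\frac{376.9 - 120.0}{120.0} \approx 2.14, \qquad
\frac{376.9 - 134.4}{134.4} \approx 1.80, \qquad
\frac{134.4 - 120.0}{120.0} = 0.12, \qquad
\frac{134.4 - 53.1}{53.1} \approx 1.53
$$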
6.5 — What the ceiling looks like with two faces
The published academic ceiling against which Wave 1 was reported sits at $R^{2} \approx 0.85\text{--}0.90$ on log-Transfermarkt value, with CIES and Twenty First Group also reporting roughly 85 % correlation with realised transfer fees [McHale & Holmes, 2023; Yang, 2025; CIES, 2024]. Wave 1's $R^{2} = 0.77$ EA-only result sat eight points below that ceiling on the full sample, with a top-eight-league slice at 0.82 that closed most of the gap on the leagues where any production deployment would actually run. Wave 2's union result of $R^{2} = 0.785$ adds a further 0.015 to the full-sample number and, more importantly, cuts the marquee-tier median absolute error from 53.6 % to 25.5 % on the audited 20-name panel.
Putting CIES on the right-hand side of the comparison reframes the ceiling. Against CIES the union model carries a Spearman correlation of 0.479 — a number that would be unimpressive if CIES were a clean ground truth, but which sits in the right neighbourhood once you observe that CIES and Transfermarkt themselves correlate at only $\rho = 0.551$ on the same 74 rows. The CIES–TM disagreement is the upper bound on how well any model could simultaneously fit both benchmarks; the union model's Spearman of 0.479 against CIES sits 0.072 below it, while its Spearman of 0.719 against TM sits 0.168 above. This is precisely the pattern an attribute-only model calibrated against Transfermarkt would produce.
The two-faced ceiling test produces a sharper diagnostic than either single benchmark would on its own. The remaining gap between the union model's $R^{2} = 0.785$ and the published 0.85–0.90 ceiling on Transfermarkt is named at the feature level: VAEP-style possession-adjusted plus-minus, expected-assists, and the StatsBomb open-data event family [McHale & Holmes, 2023; Van Damme et al., 2025]. The simultaneous gap against CIES — the much larger 72 % median APE — is named at a different level: contract length, club financial strength, and the marquee-tier multiple that any contract-aware proprietary valuation model produces when a high-potential teenager signs an eight-year deal at a Champions League club. The two gaps are different gaps. They are also closeable by different feature additions, and the path to closing each one is now diagnosable rather than mysterious.
6.6 — The McHale–Holmes test
McHale and Holmes (2023) is the most directly comparable piece in the published literature, and the result it reports is the one §6's CIES validation has now replicated independently. Their finding, on a corpus of realised transfer fees rather than Transfermarkt values, is that combining FIFA expert ratings with VAEP-style action values plus xG-based plus/minus beats Transfermarkt on average — and that Transfermarkt still wins for fees above €20M, the superstar tail where the celebrity premium dominates [McHale & Holmes, 2023]. The union model's marquee-tier residual in Figure 3.3, where the audited 20-name panel carries a median 25.5 % absolute error after the union beats EA-only and FM26-only on 13-of-20 names, replicates exactly that asymmetry: the model is sharper than the EA-only baseline across the population and remains structurally biased downward on the marquee tail, the same tail where Transfermarkt's crowd-curated number anchors on the most recent observed fee and thereby captures the celebrity premium directly.
The decomposition this paper now offers, sharpened by the CIES half of §6, is that the marquee-tier residual lives in two separable feature gaps. The first is the contract premium — the difference between a Transfermarkt comparable-fee anchor and a CIES contract-length-weighted projection — which sits in the €100M–€400M zone for the very top of the market and which neither EA's attribute schema nor FM's researcher network captures. The second is the event-metric gap — VAEP, possession-adjusted plus-minus, expected assists, the StatsBomb event family — which sits between the published ceiling and the union model's $R^{2} = 0.785$ on the broader population and which the McHale–Holmes lift is named at directly. The closing experiment §3.3 named — adding StatsBomb open-data event features on top of the EA + FM26 union — is the right path to closing the second gap. Closing the first would require a contract-database integration no fan-accessible source provides at the required granularity. This paper's position is that the union model's anchoring on the Transfermarkt half of the corridor is the right default, with CIES treated as a contract-aware second opinion rather than as a residual to chase.
6.7 — What §6 contributes to the unified argument
Three claims survive the CIES validation step. The union model's improvement over EA-only is benchmark-independent: the Spearman lift against CIES (+0.095) and the lift against TM (+0.016) on the same population point in the same direction, and the reduction in median absolute percentage error is large against both, which is the strongest form of construct-validity evidence an attribute-only model in this literature can produce. The residual gap to the academic ceiling decomposes into two named feature gaps rather than a single shortfall: a TM-side StatsBomb event-metric gap closable on the open dataset where the published leagues overlap, and a CIES-side contract-and-financial gap that no free public source closes. And the marquee tail behaves exactly as the McHale–Holmes asymmetry predicts: above roughly €100M the union under-shoots both benchmarks; between €20M and €100M it lands between them, closer to TM; below €20M the two benchmarks themselves converge and the union lands close to both. The honest deployment recommendation is to treat the union model's output as a calibrated floor on the marquee tier and CIES as a contract-aware ceiling, with realised fees expected to land between them subject to club-specific bonus structures and contract-window dynamics that no schema-based public-data model captures. That recommendation, after §6, is anchored on a measurement rather than on a prior.
§7 — Where the model hits its ceiling: celebrity premium, price floor, and the structural break

No single attribute, composite, or hidden mental recovers the marquee tail of the men's transfer market. That is the headline finding of this section, and it is meant to be unwelcome to the reader who arrived in search of a missing variable — the one feature whose absence has been holding attribute-based valuation models seven percentage points below the academic ceiling. The ceiling has structure: it is built from four diagnosable mechanisms, each of which limits every model in this literature, and each of which is worth describing on its own terms before §5 turns to what FM data actually contributes. Wave 1 sat at R² 0.77 on 7,835 men and claimed to be "within striking distance" of the McHale-Holmes / Yang ceiling at R² 0.85–0.90 by reporting the top-8-league slice at 0.82 [Lee, 2026a]; Wave 2 lifts the full-sample figure to 0.785 with the EA + FM26 union, and the remaining gap is now a named gap rather than a mystery.
7.1 — The celebrity premium and the floor effect
The first instinct, when confronted with a 0.785-R² model carrying a 39.9 % median absolute percentage error, is that the residual error must live at the extremes — a celebrity premium at the top, where Messi, Mbappé, and Haaland are priced above what attributes can recover, and a noisy floor at the bottom where €10K, €25K, and €50K round-numbers from Transfermarkt's youth-team and free-agent entries dominate any genuine attribute signal. Both halves of that intuition are testable. We trimmed N ∈ {0, 50, 100, 250, 1000} players from the top or bottom of the TM-value-sorted corpus, retrained the union model on the trimmed frame, and reported four metrics on the same trimmed frame: R², RMSE on log-value, MAE on log-value, and median APE on raw euros [section_c_outlier_sweep.csv]. The four metrics tell two distinct stories.
Removing the top tail causes R² to fall sharply, from a baseline 0.717 to 0.544 once the top 1,000 players are excised; a reader attending only to R² will interpret the move as a catastrophic degradation in predictions, but it is not. MAE-log moves from 0.280 to 0.272 and RMSE-log barely shifts, so the R² collapse is mechanical: SS_total shrinks faster than SS_residual when the upper tail is removed, and the variance-explained ratio drops even though the prediction errors do not. The top 50 are not noise but informative high-end anchors that the model needs in order to keep its rank-ordering honest, and the celebrity premium that does affect them — the model under-shoots Messi, Mbappé, and Haaland by roughly 2× on average — constitutes a separate signal layer rather than a degradation of the core model.
Removing the bottom tail produces the opposite pattern. R² is essentially flat (0.717 → 0.699 once the bottom 1,000 leave), but RMSE-log drops 12 % (0.363 → 0.320), MAE-log drops 10 % (0.280 → 0.252), and median APE on raw euros falls from 47.1 % to 45.2 %. The €10K–€50K floor (youth-team contracts, free-agent depth, and the long tail of TM round-numbers that exist because some value must be entered and €25K serves as the placeholder) is where the union model carries the bulk of its absolute error budget; a production deployment that targets €100K+ valuations would carry a meaningfully tighter median APE than the headline 47.1 % suggests. The intuition above was therefore correct in spirit but inverted in direction: outliers do hurt prediction quality, but the damage lives in the bottom band rather than the top.
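The mechanical nature of the top-tail R² collapse can be reproduced on synthetic data. The sketch below is illustrative only and rests on three assumptions: the log scale is base 10, the frame is merely re-scored after trimming (the sweep reported above retrains on each trimmed frame), and every row carries the same absolute log error. The helper names `sweep_metrics` and `trim_sorted` are hypothetical.

```python
import math
from statistics import mean, median


def sweep_metrics(log_true, log_pred):
    """R2, RMSE-log, MAE-log and median APE (raw euros) on one frame."""
    resid = [p - t for t, p in zip(log_true, log_pred)]
    ss_res = sum(r * r for r in resid)
    centre = mean(log_true)
    ss_tot = sum((t - centre) ** 2 for t in log_true)
    # Assumption: values are log10 euros, so 10 ** x recovers raw euros.
    ape = [abs(10 ** p - 10 ** t) / 10 ** t for t, p in zip(log_true, log_pred)]
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse_log": math.sqrt(ss_res / len(resid)),
        "mae_log": mean(abs(r) for r in resid),
        "median_ape": median(ape),
    }


def trim_sorted(log_true, log_pred, n, side):
    """Drop n rows from the top or bottom of a frame sorted ascending by true value."""
    if n == 0:
        return log_true, log_pred
    if side == "top":
        return log_true[:-n], log_pred[:-n]
    return log_true[n:], log_pred[n:]


# Synthetic frame: 1,000 log10 values from 4.0 upward, constant +/-0.3 log error.
log_true = [4.0 + 0.004 * i for i in range(1000)]
log_pred = [t + (0.3 if i % 2 == 0 else -0.3) for i, t in enumerate(log_true)]

full = sweep_metrics(log_true, log_pred)
top_cut = sweep_metrics(*trim_sorted(log_true, log_pred, 500, "top"))
print(round(full["r2"], 3), round(top_cut["r2"], 3))
print(round(full["mae_log"], 3), round(top_cut["mae_log"], 3))
```

Because the per-row error is identical everywhere, MAE-log cannot move when rows are removed. R² still falls on the trimmed frame, purely because SS_total shrinks with the narrower value range.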
The match-quality audit closes one loop on this finding. The EA ↔ Transfermarkt fuzzy join produces 92.5 % exact name + DOB matches on the 12,456 matched men, with a name-score median of 100 of 100, a DOB year-difference of exactly zero on 98.5 % of rows, and exact-match rates of 92.2 % in the bottom value-distribution quartile and 94.3 % in the top [corpus_joined.csv]. The outlier-sweep finding is not a harmonisation artefact, and the bottom floor is real.
7.2 — The structural break: FM24 → FM26
The second mechanism is one the literature did not price in. When the FM24 and FM26 player tables are joined on the 1,635 men present in both editions and the per-attribute mean delta is computed, every single attribute moves upward, with a mean shift of between 10 and 17 points on the 0–100 surface scale. The pattern is not scout-revision noise but the data signature of a database that has been rebased [drift_fm24_to_fm26.csv].
The historical-drift table sharpens the claim by ruling out every other candidate. For each consecutive edition pair on the players present in both editions, we computed the mean delta of every shared attribute and divided by the standard deviation of the per-player delta distribution; an attribute counted as "shifted" if its absolute mean-delta exceeded 0.5 SD [historical_drift_rescaling_flags.csv]. FM20 → FM21 shows 0 % of attributes shifted, FM21 → FM22 shows 0 %, and FM23 → FM24 shows 0 %. FM22 → FM23 shows 51 % shifted in a mixed direction, consistent with documented goalkeeper-weighting and post-World-Cup positional tweaks — a partial rescaling rather than a rebase. FM24 → FM26 shows 97.7 % of 44 attributes shifted, all in the same direction, a transition categorically different from every other edition boundary in the panel.
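The flagging rule described above can be sketched as follows. This is a minimal reading of the test, not the script behind `historical_drift_rescaling_flags.csv`: the data layout, the function name `drift_flags`, the use of the sample standard deviation, and the toy ratings are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def drift_flags(old, new, threshold=0.5):
    """Standardised mean delta per shared attribute between two editions.

    old / new map player_id -> {attribute: rating}. Returns
    {attribute: (mean_delta_in_sd_units, shifted)}.
    """
    players = sorted(set(old) & set(new))
    if len(players) < 2:
        raise ValueError("need at least two players present in both editions")
    attrs = set.intersection(*(set(old[p]) & set(new[p]) for p in players))
    flags = {}
    for attr in sorted(attrs):
        deltas = [new[p][attr] - old[p][attr] for p in players]
        m, sd = mean(deltas), stdev(deltas)  # assumption: sample SD
        if sd > 0:
            z = m / sd
        else:  # every player moved by exactly the same amount
            z = 0.0 if m == 0 else math.copysign(math.inf, m)
        flags[attr] = (z, abs(z) > threshold)
    return flags


# Toy two-edition panel (hypothetical ratings, not drawn from the corpus).
old = {"p1": {"pace": 60, "passing": 70}, "p2": {"pace": 55, "passing": 64},
       "p3": {"pace": 70, "passing": 58}, "p4": {"pace": 66, "passing": 61}}
new = {"p1": {"pace": 70, "passing": 71}, "p2": {"pace": 67, "passing": 63},
       "p3": {"pace": 84, "passing": 60}, "p4": {"pace": 78, "passing": 59}}

flags = drift_flags(old, new)
share_shifted = sum(shifted for _, shifted in flags.values()) / len(flags)
print(flags, share_shifted)
```

In the toy panel `pace` rises by 10 to 14 points for every player and is flagged, while `passing` moves by small offsetting amounts and is not; the share of flagged attributes is the per-boundary percentage the paragraph reports.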
The individual-player evidence is decisive. Mbappé's Current Ability runs 188 in FM23, 188 in FM24, and 98 in FM26 — a drop of 90 points; Messi moves 180 → 185 → 90, Salah 185 → 180 → 93, Vinícius 174 → 181 → 91, and Bellingham 155 → 168 → 91, while Yamal enters at 125 in FM24 and lands at 91 in FM26. The same internal SI scale, the same researcher network, and the same real-world football season feed the database, yet the rating roughly halves [elite_ca_matrix.csv].
| Player | FM23 CA | FM24 CA | FM26 CA | Δ FM24→FM26 |
|---|---|---|---|---|
| Mbappé | 188 | 188 | 98 | −90 |
| Messi | 180 | 185 | 90 | −95 |
| Salah | 185 | 180 | 93 | −87 |
| Vinícius | 174 | 181 | 91 | −90 |
| Bellingham | 155 | 168 | 91 | −77 |
| Yamal | — | 125 | 91 | −34 |
Either every elite player simultaneously forfeited half of their footballing ability across the two years between FM24 and FM26, or the underlying CA scale was structurally rebased; only the latter interpretation is tenable.
The mechanism is three structural changes that hit the FM26 boundary simultaneously. First, an engine rewrite. FM25 was cancelled in February 2025, with Sports Interactive publicly conceding the project "did not meet internal quality standards" [PC Gamer, 2025; ESPN, 2025]. The FM26 reveal followed in March 2025 with confirmation of a full Unity migration — the first wholesale engine replacement since the 2004 Championship Manager fork [VGC, 2025]. Second, a role-system overhaul. FM26 collapsed roughly sixty named roles into a dual In-Possession / Out-of-Possession structure, with Mezzala, Enganche, Trequartista, Segundo Volante, and Carrilero removed as named roles [FM Scout, 2025; footballmanager.com FM26 features]. Because Current Ability is mechanically a weighted sum of attributes against the best-role weight vector, reshaping the role table reshapes the CA function even if every underlying attribute value is left untouched. Third, the women's database. FM26 was the first edition to ship a fully integrated women's database at launch, and SI's own published guidance is that the women's database "retains the same 20-point scale for every attribute as the men's database, but the scale is made relative to each side of the database" [footballmanager.com Introducing Women's Football, 2025; thecutback.com, 2025]. Some downward compression of the men's distribution is almost forced by the constraint that both databases share a visible 1–20 surface scale.
We cannot disentangle which of the three innovations drove the rescaling, and our position is that this constitutes a property of the world rather than a defect of the analysis: SI shipped three structural changes simultaneously, and the database we observe is the joint product. What matters for downstream modelling is that any analysis comparing FM24-era and FM26-era CA values in raw form must first rebase, and the rebasing protocol matters as much as the rescaling number itself. The same instinct generalises to any analyst inheriting a multi-edition sports-rating panel — Elo derivatives, FIDE ratings across federation changes, EA cross-edition pooling, or World Rugby's law-cycle scoring revisions — who should run an analogous test before pooling editions in a model. The cost of treating a non-stationary scale as stationary is precisely the cost of a Mbappé prediction at €30M, generated because the model has concluded that his ability halved.
7.3 — Where attribute models stop: the marquee tail
The third mechanism is the celebrity premium itself, separated cleanly from the R²-mechanics effect above. The bake-off model recovers the top-50 rank order well (Spearman 0.869 against held-out TM value on the EA + FM26 union [bake_off_cv.csv]) but systematically under-shoots their absolute level by roughly 2×. The appropriate visualisation is the marquee audit in Figure 7.3: twenty recognisable names against their held-out predictions.
Thirteen of twenty wins is the headline, and the absolute-error reduction across the marquee population is the cleanest possible evidence that FM data performs genuine work on the names that matter commercially. The seven losses constitute the diagnostic. For Lionel Messi, the FM signal pulls the prediction further from the discounted Inter Miami market value — a contract dynamic neither schema captures. For Phil Foden, both models substantially underprice the Manchester City premium, a club-specific bonus structure neither schema captures. A handful of additional players carry Transfermarkt valuations materially below what either model's attribute vector predicts, reflecting post-injury hesitancy, league-of-residence discount (Saudi PL late-career signings, MLS retirees), or contract-window dynamics. None of these constitute attribute failures; they are price-formation factors that sit outside the schema entirely.
Sidebar — Why are Foden and Marquinhos under-predicted so severely by FM26-only?
The two extreme FM26-only outliers in Figure 7.3 — Phil Foden at €0.21M predicted versus €150M actual, and Marquinhos at €0.25M predicted versus €30M actual — invite the question of whether the FM26-only model has a systematic failure mode on broadcast-marquee Premier League and Ligue 1 names, or whether something else is doing the work on those two specific rows. The answer, once the underlying FM26 feature vectors are inspected directly, is that this is a data-quality issue at the EFEM scrape layer rather than a model failure on a representative slice of the marquee population.
Phil Foden's FM26 row in the union frame carries a Current Ability value of 53 against an EA OVR of 85, a gap of 32 points on the 1–99 surface and far outside the high-value cohort's mean OVR − CA gap of −0.32 and median of −1.0. Marquinhos sits at CA 62 against OVR 87, a gap of 25 points. Inspecting the high-value cohort (TM ≥ €30M, n=291) shows that the top fifteen OVR-minus-CA mismatches in the corpus carry gaps of 13 to 37 points, and almost all of them are broadcast-prominent names from Europe's top leagues: Nico Williams at OVR 86 / CA 49 (gap 37), Noni Madueke at OVR 80 / CA 49 (gap 31), Pedro Porro, Pedro Neto, Xavi Simons, Álex Baena, and others. The pattern is sharp enough to identify the mechanism as a name-collision in the public Football Manager (EFEM) scrape: EFEM's underlying URL slug structure resolves marquee names to one of several player profiles that share the surface name, and on a small number of rows the scraper has pulled the wrong profile (typically an academy player or a namesake at a lower-tier club) instead of the marquee senior-squad member. The wrong row carries the wrong CA, and that CA flows directly into both the FM26-only and union models.
For the FM26-only model, an attribute vector that says "this is a low-CA player" produces a low-value prediction; Foden at €0.21M is the model behaving correctly on a wrong-Foden vector. The union model is more robust because it also receives EA's 38 features in parallel, including the correct OVR of 85, and for Foden the union recovers to a €25M prediction — still far below the €150M actual but no longer pathological. The pattern is small in volume — roughly ten to fifteen rows out of 6,729 union-frame players show high-confidence name-collision errors — but it is concentrated in the marquee tier the figure visualises, which is why the FM26-only marquee marker reads so badly on those specific names. This is not a systematic FM26 calibration problem; the mean OVR − CA gap across the 291 high-value players is −0.32 with median −1.0, indicating that FM CA generally tracks EA OVR within a point at the marquee end. It is a small-N data-quality issue with the public EFEM scrape, and the union model's variance-reduction mechanism is exactly what makes it robust to this class of error on a row-by-row basis. A clean fix for a deployed production pipeline would be a sanity-check post-step that flags rows where OVR − CA exceeds fifteen points and either re-scrapes them manually or falls back to EA-only for those specific predictions.
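The sanity-check post-step proposed above could look like the sketch below. The field names, the function `choose_prediction`, and all prediction values are hypothetical; only the fifteen-point threshold and the OVR / CA pairs in the first two rows echo numbers from this sidebar.

```python
def choose_prediction(row, max_gap=15):
    """Fall back to EA-only when OVR - CA exceeds max_gap (suspected wrong FM profile)."""
    gap = row["ovr"] - row["ca"]
    suspect = gap > max_gap
    return {
        "player": row["player"],
        "ovr_minus_ca": gap,
        "flagged": suspect,  # flagged rows are also candidates for a manual re-scrape
        "prediction_eur_m": row["ea_only_pred"] if suspect else row["union_pred"],
        "source": "ea_only" if suspect else "union",
    }


# Hypothetical rows: predictions are placeholders, not model output.
rows = [
    {"player": "row_a", "ovr": 85, "ca": 53, "union_pred": 25.0, "ea_only_pred": 60.0},
    {"player": "row_b", "ovr": 87, "ca": 62, "union_pred": 12.0, "ea_only_pred": 28.0},
    {"player": "row_c", "ovr": 84, "ca": 85, "union_pred": 40.0, "ea_only_pred": 35.0},
]

checked = [choose_prediction(r) for r in rows]
for result in checked:
    print(result["player"], result["ovr_minus_ca"], result["source"])
```

The comparison is strict, so a gap of exactly fifteen points passes through to the union prediction; whether the cut should be inclusive is a deployment choice the text leaves open.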
On the §C.5 feature-importance evidence, the remaining gap from R² 0.785 to the published ceiling of 0.85–0.90 is not a schema gap at all but a performance-metric gap. McHale & Holmes 2023 had VAEP and possession-adjusted plus-minus from licensed match-event data, while Yang 2025 operated on a Big-Five-league-restricted dataset from which the long-tail noise driving our error budget is absent [McHale & Holmes, 2023; Yang, 2025]; Van Damme et al.'s 2025 random-forest one-year-ahead forecast had access to the same StatsBomb event family [Van Damme et al., 2025]. The natural next experiment is to add StatsBomb open-data event features — xG, expected assists, VAEP, possession-adjusted plus-minus — on top of the EA + FM26 union, restricted to the subset of leagues where StatsBomb's open dataset is published; that experiment sits beyond Wave 2's scope, but Wave 2 makes it tractable. The "within striking distance" claim now holds at the level of the union model rather than at the level of the EA-only baseline, and the residual gap is a named feature gap rather than an unaccounted-for shortfall.
7.4 — A negative result, briefly
A negative result earns a paragraph here because the prior it falsifies is widely held. The position-familiarity vector FM exposes — collapsed into the single Versatility hidden mental on the surface schema — is the structurally interesting feature for any recruitment department that has ever paid a premium for a multi-position player. The raw effect on market value is sizeable: the 85+ Versatility bucket commands a €2.0M median against €1.0M for the bottom bucket, a clean 2× spread [versatility_buckets.csv]. The clean test is whether Versatility explains anything after OVR and Age have been partialled out of log-value, and the answer is Spearman = −0.004 on n=6,729 — a null indistinguishable from zero.
The entire raw 2× spread is therefore compositional: more versatile players have higher OVR and are slightly older, both of which independently raise market value, while Versatility itself adds no residual signal. The recruitment-department prior — that multi-position utility commands a value premium over and above what raw ability and age predict — is one of the strongest priors any scouting model would carry, but the Wave 2 empirics on n=6,729 do not support it. The plausible reading is that multi-position utility is fully absorbed into OVR and Age by the time TM-level valuation aggregates it, leaving no marginal signal beyond OVR detectable in this sample. A scouting department running an OVR-anchored valuation model gains nothing by adding Versatility as an additional feature; Versatility is a tactical-deployment input rather than a valuation input. The discipline of reporting this null is the same discipline that lends credibility to §9's positive results.
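One way to run the partialling-out test is sketched below. The exact procedure behind the −0.004 figure is not specified here, so this version assumes an ordinary least-squares residualisation of log-value on OVR and Age followed by a Spearman correlation between those residuals and Versatility. All function names and the toy data are illustrative.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def residualise(y, x1, x2):
    """Residuals of OLS y ~ 1 + x1 + x2, solved from the centred normal equations."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    yc = [v - my for v in y]
    a = [v - m1 for v in x1]
    b = [v - m2 for v in x2]
    s11, s22 = sum(u * u for u in a), sum(v * v for v in b)
    s12 = sum(u * v for u, v in zip(a, b))
    s1y = sum(u * w for u, w in zip(a, yc))
    s2y = sum(v * w for v, w in zip(b, yc))
    det = s11 * s22 - s12 * s12  # zero if OVR and Age are collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return [w - b1 * u - b2 * v for u, v, w in zip(a, b, yc)]


# Toy data: value depends only on OVR and Age; Versatility merely tracks OVR.
ovr = [70, 70, 74, 74, 78, 78, 82, 82]
age = [22, 26, 22, 26, 22, 26, 22, 26]
noise = [0.05, 0.05, -0.05, -0.05, -0.05, -0.05, 0.05, 0.05]
versatility = [40, 45, 50, 55, 60, 65, 70, 75]
log_value = [0.1 * o + 0.02 * g + e for o, g, e in zip(ovr, age, noise)]

raw = spearman(versatility, log_value)
partial = spearman(versatility, residualise(log_value, ovr, age))
print(raw, partial)
```

In the toy data the raw rank correlation is perfect because Versatility rises with OVR, yet the correlation with the residuals is much weaker, the same compositional pattern the paragraph describes. Residualising Versatility as well would give a full partial correlation rather than the semi-partial one sketched here.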
Section closer
The four mechanisms are not equally weighty. The FM24 → FM26 rebase is the principal finding: categorically different from every other edition boundary, large enough to invalidate raw cross-edition pooling, and structured enough to permit identification of three candidate causes. The celebrity premium is real on the top 50 but separates cleanly from R² mechanics once the metrics are reported properly; the €10K–€50K floor is the actual error-budget driver, and the deployment remedy is to threshold the prediction surface above €100K; the Versatility null is the discipline check that earns the rest. None of these constitutes a single decisive resolution, and the absence of one is itself the finding. §9 demonstrates that the residual lift available (the 0.122-R² gap between the EA-only 0.663 and the EA + FM26 union 0.785) is concentrated in nameable features within nameable leagues, and is large enough to act on.
Interlude — Football Manager as a market-tracking instrument
Before the bake-off, one trust-building anchor that the rest of §4 leans on. If Sports Interactive's roughly 1,300-researcher network were generating Current Ability ratings by some process effectively independent of the actual football transfer market, the rank correlation between FM CA change and Transfermarkt log-value change over the same edition window would sit near zero. Researchers in 116 countries, working part-time on local-knowledge revisions, would be drifting one direction while the market drifted another. The data falsifies that picture cleanly. Spearman(FM CA change, TM log-value change) = 0.36 on n=675 men over the FM24 → FM26 window [drift_vs_tm_change.csv].
A rank correlation of 0.36 is not strong enough to use FM CA deltas as a sole valuation input, and we are not proposing that. It is exactly strong enough to use them as a market-tracking second opinion. When SI's researcher network downgrades a player's CA between editions, that player's Transfermarkt market value has, on average and net of noise, also fallen. When SI upgrades, the market has on average moved with them. The two measurements share the underlying signal — actual change in a footballer's economic standing in the global labour market — and disagree only on independent measurement noise. The implication is that FM is not a hermetic scout opinion divorced from market reality. It is a market-tracking signal calibrated against the same football economy Transfermarkt's editors are watching, with independent enough noise that disagreements between the two are the analytically interesting cases.
The deployable form of this finding is direct. A scouting department can treat the FM CA delta — computed edition-over-edition on the players present in both — as a market-tracking second opinion that runs on a different update cadence and from a different signal source than Transfermarkt. The interesting cases are the disagreements, not the agreements. Either the crowd-sourced TM number is lagging an SI revision (the researcher network saw the form change first; the crowd will catch up over the next window), or the SI researcher network is lagging a market move (the crowd has priced in a transfer rumour the researchers have not yet ratified). Both possibilities are tradeable. The 0.36 is the number that makes them worth trading on.
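A disagreement screen along these lines might be sketched as follows. Because FM24 and FM26 CA values sit on different scales, raw CA deltas are negative for almost everyone, so this sketch centres each delta on its cohort median before comparing signs. That centring step, the function name, and the toy rows are assumptions for illustration, not a protocol taken from the analysis above.

```python
from statistics import median


def disagreement_flags(players):
    """Flag players whose median-centred CA delta and TM log-value delta disagree in sign.

    players: list of dicts with 'name', 'ca_delta', 'tm_log_delta'.
    """
    ca_mid = median(p["ca_delta"] for p in players)
    tm_mid = median(p["tm_log_delta"] for p in players)
    flagged = []
    for p in players:
        ca_rel = p["ca_delta"] - ca_mid
        tm_rel = p["tm_log_delta"] - tm_mid
        if ca_rel * tm_rel < 0:
            direction = "fm_up_tm_down" if ca_rel > 0 else "fm_down_tm_up"
            flagged.append((p["name"], direction))
    return flagged


# Hypothetical cohort: deltas are placeholders, not corpus values.
cohort = [
    {"name": "A", "ca_delta": -90, "tm_log_delta": 0.2},
    {"name": "B", "ca_delta": -80, "tm_log_delta": 0.3},
    {"name": "C", "ca_delta": -95, "tm_log_delta": -0.1},
    {"name": "D", "ca_delta": -70, "tm_log_delta": -0.2},
]
print(disagreement_flags(cohort))
```

The screen only surfaces candidates. Deciding whether the crowd or the researcher network is the lagging side on a given row still needs the contextual reading described above.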
This is also the load-bearing reason the §4 bake-off survives the obvious "but FM is just opinion" critique. Opinion that tracks the market at 0.36 Spearman over a 24-month window is not opinion. It is measurement, with researcher-driven noise on top of a real-world signal. The bake-off below is asking which combination of measurements predicts market value best — and the answer turns out to be: both, additively, with FM doing its work in exactly the leagues and OVR bands where EA's broadcast lens has the least to say.
§8 — Can the men's lens see women?

This section revises Wave 1. The original cross-gender claim — that the men-trained model rank-orders women correctly, with a 42× magnitude shrink required to reach plausible euro values — was the headline finding of the project as it stood in May 2026. Wave 2 placed two men-trained models on the same 385 women with both attribute vectors (one with EA features, one with FM26 features), and the two models rank those women in essentially uncorrelated orders, with a Spearman cross-source rank correlation of 0.153. Within-gender, EA OVR and FM26 CA correlate at ρ = −0.22 — a negative correlation; the within-gender FM26 attribute model fits women's CA at R² = 0.91 ± 0.014, while the within-gender EA attribute model fits at R² = −0.10 ± 0.190, performing worse than predicting the mean. The original "ranking transfers" claim was therefore true of EA features specifically, at a population-level qualitative grain, but not of a richer attribute vector and not at the individual-woman level. What survives is a within-gender FM26 model, and the binding constraint on a women's valuation curve has migrated from the attribute corpus to the realised-fee corpus. This section states that finding explicitly and walks through the evidence.
8.1 — The 33-year zero, then 36,000 women shipped at once
Before any modelling, the data lineage warrants documentation. For more than three decades, from Championship Manager 93 through Football Manager 24, Sports Interactive shipped a football management simulation that did not include women's football at all, and six FM editions from FM2016 through FM24 carry zero women in our acquired corpus. FM25 never shipped: SI cancelled it in February 2025 mid-development, citing the Unity-engine port. FM26 launched on 4 November 2025 with 36,000+ women across 14 leagues and 11 nations on three continents, carrying the full attribute schema and supported by an independently constructed scouting network of roughly 40 women's-football researchers built over the four-year window between announcement and launch [Sports Interactive, 2025; Women in Games, 2025].
The reason for the long absence, on the studio's own account through Studio Director Miles Jacobson and Head of Women's Football Research Tina Keech, is operational rather than ideological. Across the 2010s, commercial demand for women's-football data was smaller than the men's product warranted, there were fewer licensable women's competitions around which to build leaderboards, and — most operationally — no scouting infrastructure existed capable of producing attribute ratings of the kind FM's match engine consumes. Keech put the point plainly in the launch coverage: "Women's football is known for a lack of accessible data — and this is now the biggest women's football database ever created for a video game" [Women in Games, 2025].
The contrast with EA Sports warrants precise statement. EA introduced women's club football in FIFA 23 (2022), shipped Women's Champions League integration in FC24 (2023), and by FC26 carries 1,447 women across the named women's leagues; SI lagged EA by roughly two release cycles. When SI did ship, however, they shipped at approximately 25× the scale: 36,000+ women in FM26's database against 1,447 in FC26's. The two products operated under different definitions of "shipping women's football": EA shipped the marquee end of the women's player pyramid as gameplay assets, while SI shipped a full simulation-grade scouting layer including youth and reserve teams across 14 league pyramids. The order-of-magnitude difference in player count reflects that distinction rather than a comparative-effort gap.
Of the 1,447 women in the EA universe, 385 (26.6%) carry the full FM26 attribute vector by virtue of being in one of the five leagues SI's research network has covered to scouting-grade depth: the German Frauen-Bundesliga (60 women with FM26 attrs), the NWSL (86), Liga F Moeve (81), Barclays WSL (70), and Arkema PL (46). The remaining ~1,062 carry only EA's vector. That 385-woman intersection is the corpus everything in this section is run on, and the population scale of the within-gender claim is bounded by it.
8.2 — The within-gender bake-off: FM26 visible at R²=0.91, EA-only at R²=−0.10
The headline finding of the section follows from a single bake-off. We ran a 5-fold cross-validated HistGradientBoostingRegressor predicting FM26 women's currentAbility — calibrated, by SI's explicit design, within the women's-only side of the database — from each feature block on the n=385 women for whom every block is complete.
The numbers are sharp enough to read directly off the chart. FM26 visible attributes alone recover R² = 0.906 ± 0.014 on women's CA, a strong within-gender fit comparable to the men's combined model on the men's Transfermarkt target; EA attributes alone recover R² = −0.100 ± 0.190, performing worse than predicting the population mean. The baseline OVR + Age model lands at R² = 0.082, the FM26 hidden-mental block alone recovers R² = 0.439, and combining EA + visible adds essentially nothing on top of visible alone (R² = 0.899). Within-gender, the FM26 schema is therefore sufficient for the entire signal while the EA schema is actively misleading.
The negative-R² result for the EA-only model is the most surprising number in the report and the most empirically clarifying. It establishes that EA's women's attribute vector, on the same n=385 players, is not merely a noisier source of women's-CA signal than the FM26 vector but an anti-signal: a HistGradientBoostingRegressor with full hyperparameter latitude and 5-fold CV cannot learn a positive-R² mapping from EA's women's attribute vector to FM26's women's CA on this corpus. The mechanism, given the philosophical bridge developed in the Interlude, is that EA's women's attributes are calibrated against a pooled (or top-of-pyramid) scale whose composition does not align with FM26's within-gender CA target. The two latent quantities are different objects.
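For readers less used to negative R², a short worked example may help. It is a sketch of the standard definition R² = 1 − SS_res / SS_tot on invented held-out values, not on the women's corpus: always predicting the mean scores exactly zero, and any model whose errors exceed that baseline scores below zero.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out targets, for illustration only.
y = [60.0, 70.0, 80.0, 90.0]
mean_pred = [75.0, 75.0, 75.0, 75.0]   # always predict the mean
inverted = [90.0, 80.0, 70.0, 60.0]    # right spread, wrong direction

print(r2_score(y, mean_pred))  # 0.0
print(r2_score(y, inverted))   # -3.0
```

A cross-validated R² of −0.10 therefore means the fitted model's held-out squared error is about 10 % larger than that of a constant prediction.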
The finding carries both a methodological and a substantive consequence. The methodological consequence is straightforward: within-gender modelling on FM26 attributes is the appropriate approach; one schema and one target calibrated to the same gender boundary should be selected and fit inside that boundary, since mixing schemas across calibration philosophies is what produces the negative-R² result. The substantive consequence is harder to state without the appearance of a retraction, and §8.4 below states it as one — the cross-gender ranking-transfer claim Wave 1 made was a feature of EA's attribute schema specifically, the schema that turns out to be anti-signal against the within-gender target. Wave 1 lacked a second attribute schema against which to run the bake-off; Wave 2 has one.
The numbers are clean because the corpus is well-defined. n=385 is small for a deep-learning baseline but large enough for HGBR's gradient-boosting bias-variance trade-off to be well-conditioned, and the fold variances reported (±0.014 on the FM26-visible block, ±0.190 on the EA block) are themselves diagnostic: the FM26-visible model is stable across folds, whereas the EA-only model's fold-to-fold variance is sufficient to swing it across zero. That instability is the data signature of an anti-signal model under cross-validation — different folds locate spurious correlations in different directions, and the mean R² is dominated by their inability to converge on a consistent one.
The within-gender FM26 model fits the women's data at the same order of magnitude that the men's combined model fits the men's data. That parity is the closing result of Wave 2's women's analysis: the feature-side problem is solved, the schema-side problem is solved, and the calibration-philosophy problem is solved by choosing FM26 and staying inside it. What is not solved is the target-side calibration to euros, examined in §8.5 below, but that is a different and smaller problem.
8.3 — The within-gender OVR/CA correlation is negative
The within-gender bake-off finding carries a clean player-level signature: if EA OVR is anti-signal against FM26 CA on the women's corpus, then the rank correlation between the two should sit near zero or below it, and the empirical correlation is negative.
Spearman correlation between EA OVR and FM26 CA on the n=385 women's overlap set is ρ = −0.22, a negative correlation: the player EA rates higher tends, on average, to be the player FM26 rates lower, and vice versa. The within-gender-calibration design forces this outcome in the present corpus, since the two attribute systems estimate different latent quantities, with a roughly anti-correlated league-and-position composition bias dominating the n=385 overlap. The strong fit of §8.2 and the negative Spearman here therefore describe the same finding from two angles: the within-gender FM26 attribute vector carries the signal that recovers the within-gender CA target, while EA's vector, when run against the same target, moves in the wrong direction.
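As a reminder of what the statistic measures, Spearman's ρ is the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The sketch below implements that definition using only the standard library. The five rating pairs are invented to show a negative ρ and are not drawn from the n=385 overlap set.

```python
def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings, for illustration only.
toy_ovr = [70, 75, 80, 85, 90]
toy_ca = [88, 80, 84, 72, 76]
print(round(spearman(toy_ovr, toy_ca), 3))
```

On this toy input the higher-OVR players mostly carry the lower CA values, so ρ comes out at −0.8. The corpus figure of −0.22 describes a much weaker tilt in the same direction.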
The marquee panel makes the negative correlation legible. Bonmatí sits at the top of both systems (FM26 CA 96 and EA OVR 91), the single player both schemas agree occupies the apex of the women's game; beneath her, the disagreement opens up. Olivia Smith (FM26 CA 90, EA OVR 79) is FM's second-tier elite and EA's mid-pack, and Alyssa Thompson (FM26 CA 90, EA OVR 81), Jaedyn Shaw (FM26 CA 86, EA OVR 74), Pauline Bremer (FM26 CA 86, EA OVR 76), Momoko Tanikawa (FM26 CA 83, EA OVR 73), and Claire Hutton (FM26 CA 81, EA OVR 74) all carry top-tier FM26 CAs while sitting between OVR 73 and OVR 81 on EA's broadcast-calibrated scale.
These are the players whose attribute-system disagreement drives the ρ = −0.22, and their existence constitutes the empirical case for treating FM26's independently built women's researcher network as a more granular within-gender source than EA's broadcast-tuned women's OVR. The negative correlation is not a failure of either database but the data signature of two databases that have made philosophically opposite choices about the meaning of their numbers.
8.4 — Cross-source rank correlation: 0.153
The within-gender ρ = −0.22 is on the raw attribute summaries. The natural next test is what happens when two men-trained valuation models — one with EA features, one with FM26 features — are asked to rank the same set of women's players. This constitutes the closest available direct test of Wave 1's cross-gender ranking-transfer claim, and the result is the section's retraction.
Spearman ρ between the two model-predicted rank orders on the n=385 women is 0.153: statistically distinguishable from zero but practically uncorrelated. The most extreme disagreements concentrate at the top of the EA distribution: Sophia Wilson is EA's #2 and FM26's #339 (Δ = +337); Lauren James is EA's #7 and FM26's #323 (Δ = +316); Alessia Russo is EA's #3 and FM26's #148 (Δ = +145); and Claudia Pina is EA's #9 and FM26's #225 (Δ = +216). Bonmatí, at #8 on EA and #3 on FM26, is the one top-of-distribution player on whom both sources roughly agree.
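The rank gaps and the significance remark can both be checked with a few lines of arithmetic. The sketch below recomputes the gaps from the rank pairs quoted above and evaluates the usual large-sample t statistic for a rank correlation, t = ρ·√((n − 2)/(1 − ρ²)). Which test was actually used is not stated in the text, so that choice is an assumption.

```python
import math

# (EA-model rank, FM26-model rank) as quoted in the text above.
ranks = {
    "Sophia Wilson": (2, 339),
    "Lauren James": (7, 323),
    "Alessia Russo": (3, 148),
    "Claudia Pina": (9, 225),
    "Bonmatí": (8, 3),
}

# Positive gap: the FM26-feature model ranks the player lower than the EA one.
rank_gap = {name: fm - ea for name, (ea, fm) in ranks.items()}
for name, gap in sorted(rank_gap.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:15s} gap = {gap:+d}")


def spearman_t(rho, n):
    """Large-sample t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return rho * math.sqrt((n - 2) / (1.0 - rho ** 2))


t_cross = spearman_t(0.153, 385)
print(f"t = {t_cross:.2f}")
```

With n = 385 the statistic is roughly 3.03, above the two-sided 5 % critical value of about 1.97. That is consistent with a correlation that is detectably non-zero yet explains only about 2 % of rank variance (0.153² ≈ 0.023).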
The pattern does not falsify either database; it confirms that the two sources are independent priors over the same underlying truth, with different calibration baselines and different gender-scale conventions whose authors explicitly state that the cross-database comparison is not what their scales were built for. The Wave 1 cross-gender ranking-transfer claim is retired in its strong form by ρ = 0.153 on the cross-source comparison and ρ = −0.22 on the within-gender raw-attribute comparison. The qualitative claim Wave 1 made — that some model trained on men ranks Bonmatí and Putellas high — remains true on EA features specifically; the strong claim — that the men's-trained ranking function transfers to the women's market in any source-independent way — does not.
Editorial sidebar — What Wave 1 said and what Wave 2 now says
Dated 2026-05-19.
Wave 1 of this project, drafted in May 2026 against EA Sports FC 26 attributes alone, made three load-bearing claims about cross-gender transfer. Wave 2, with Football Manager 26 attribute data available for the same 385-woman intersection, sharpens each one. We state both versions explicitly here as a publication-grade revision note, because the original framing has been cited externally and the data has moved.
Claim 1 — "The men-trained model rank-orders women's players accurately." Wave 1's evidence was a known-stars qualitative validation: Bonmatí, Patri Guijarro, Caroline Graham Hansen, Putellas, Rodman, Marta all landed in plausible rank positions on the men's-EA-trained model, with the age curve doing the work on Putellas (rank #23) and Marta (rank #188). Wave 2 sharpens this to: rank transfers on EA features specifically, not on a richer attribute vector and not at the individual-woman level. On the same 385 women, the EA-trained men's model and the FM26-trained men's model rank-correlate at ρ=0.153 — practically uncorrelated. The qualitative validation Wave 1 ran on EA features looks right because EA is calibrated against broadcast consensus and the qualitative reference list is also calibrated against broadcast consensus; the two are not independent. The strong-form "ranking transfers cross-gender" claim does not hold against a richer attribute schema.
Claim 2 — "Magnitudes need a 42× shrink to reach plausible euro values." Wave 1's anchor was a generous €2M ceiling on top women's transfer fees against the model's €84M raw predictions. Wave 2 has a 27-row verified women's transfer-fee corpus, 21 of which match into the EA women's universe, and refits the calibration directly. The fitted curve is log₁₀(fee) = 5.34 + 0.0019 × predicted_CA, R² = 0.001 on n=21 — the slope is statistically indistinguishable from zero. The 42× shrink Wave 1 reported is sharpened to a 118× shrink from the verified-fee curve, but with slope-R²=0.001 the curve itself is provisional. The honest version of Claim 2 is: the magnitude question is a target-side data limit, not a feature-side data limit, and the target-side corpus does not yet support a stable calibration curve.
Claim 3 — "Bonmatí, Patri Guijarro, and Putellas land in the top 30 of the predicted distribution." This was Wave 1's qualitative top-30 anchor table. Wave 2 retains it as qualitative validation on EA, which is what it is: on EA features, the model puts the women the football-watching world considers elite near the top. Wave 2 no longer makes the parallel claim for within-gender CA. On the within-gender FM26 model, Bonmatí remains at the top (FM26 CA 96, predicted CA in the 80s), but Russo falls from EA's #3 to FM26's #148 and Wilson from EA's #2 to FM26's #339. The top-30 anchor table from Wave 1 was a coherent test against an external prior; it is not a falsifying test of the two-schema bake-off, and the two-schema bake-off is the more empirically demanding finding.
The pattern across all three revisions is consistent. Wave 1's claims were correct within the scope of the evidence Wave 1 had (EA's attribute vector together with a qualitative external prior), while Wave 2 carries a second attribute schema and a verified-fee corpus, and the same claims sharpen into more measured versions. The strong-form "ranking transfers cross-gender" reduces to "ranking transfers on EA features alone"; the "42× shrink" reduces to "118× from a 21-fee curve with slope-R² = 0.001"; the qualitative top-30 validation remains in qualified form. None of the original findings is reversed and all are sharpened, which is the move the data forces and the move the publication record should reflect.
8.5 — The calibration cliffhanger
The within-gender model of §8.2 produces, for every woman in the corpus, an EA-only-predicted CA. The remaining empirical question is whether that prediction recovers the women's transfer market's revealed valuation, which we test against a 27-row verified women's transfer-fee corpus assembled from Wikipedia, ESPN, BBC, and other primary-source disclosures over the 2017–2025 window. The women's transfer market began disclosing fees consistently only from 2019, and began disclosing mid-single-digit-million-euro fees only from January 2025 onward, which bounds the corpus from below; of the 27 fees, 21 match a player in the EA women's corpus and produce a predicted CA value.
The fitted calibration curve is log₁₀(fee) = 5.34 + 0.0019 × predicted_CA, R² = 0.001 on n = 21, and the slope is statistically indistinguishable from zero. The same Grace Geyoro who moved for €1.65M in 2025 carries a predicted CA close to Trinity Rodman's, who moved for €50K in 2021; Olivia Smith's £1M July-2025 WSL move (€1.16M) carries a predicted CA similar to Naomi Girma's $1M January-2025 move and Lizbeth Ovalle's August-2025 world-record-at-the-time €1.4M move. The within-gender CA prediction does not discriminate the fee structure of the women's market in the way the men's model's log-value prediction discriminates the men's market.
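Plugging numbers into the fitted curve shows how flat it is. The sketch below evaluates log₁₀(fee) = 5.34 + 0.0019 × predicted_CA at two predicted-CA values chosen purely for illustration (60 and 90); the coefficients are the ones quoted above.

```python
INTERCEPT = 5.34   # fitted intercept quoted in the text
SLOPE = 0.0019     # fitted slope quoted in the text


def predicted_fee_eur(predicted_ca):
    """Fee implied by the calibration curve, in euros."""
    return 10 ** (INTERCEPT + SLOPE * predicted_ca)


low = predicted_fee_eur(60)    # illustrative mid-table predicted CA
high = predicted_fee_eur(90)   # illustrative marquee predicted CA
print(f"CA 60 -> EUR {low:,.0f}")
print(f"CA 90 -> EUR {high:,.0f}")
print(f"ratio: {high / low:.2f}x")
```

A 30-point swing in predicted CA moves the implied fee by a factor of 10^0.057 ≈ 1.14, from roughly €284K to roughly €324K. The verified fees cited in the same paragraph span €50K to €1.65M, a range the curve cannot reproduce.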
The honest reading is that the women's transfer market itself does not yet exhibit a stable revealed-value function that any attribute model can recover. The fees in the 27-row corpus are dominated by the timing of each move — post-January-2025 fees are systematically an order of magnitude larger than pre-2023 fees — by the specific buyer's marquee strategy (Bay FC, Orlando Pride, and the WSL's NewCo investment cycle all set above-market reference prices), and by the player's narrative profile rather than her position on any FM26 or EA attribute. A 21-row corpus spanning an eight-year window over which the market was itself structurally repricing is too small to support a stable calibration curve. The headline finding is that the binding constraint on a women's valuation model is the target-side corpus rather than the feature-side corpus, and the target-side corpus is currently inadequate; this is the cliffhanger that opens Wave 3.
8.6 — Demographics and the marquee panel
The within-gender corpus has a meaningfully different age and CA shape from the 13,434-man corpus on the other side of the database. This distinction is descriptively important because the women's transfer-market shape of §8.5 cannot be diagnosed without first establishing the underlying player-pool shape.
Men's median age is 26.0 against a women's median age of 23.0, a three-year shift. The women's mode at age 21 is consistent with the shorter mean career tenure of the women's professional game, where post-collegiate-system US recruits and pre-Liga-F-academy-graduate Spanish 21-year-olds dominate the modal cell. The slightly broader standard deviation (4.86 vs 4.60) reflects a larger relative share of late-30s veterans still active, because women's-game retirement income outside the marquee leagues remains thin enough that competitive careers extend further into the late thirties when health allows. The CA distribution shape also differs: women's mean CA 62.06 against the men's 67.37, which is the five-point shift that the Interlude's data signature predicted from SI's within-gender calibration policy, and women's CA σ is 11.14 against the men's 8.05, a 38 % broader distribution.
The marquee panel exposes the structural feature of the n=385 corpus on which §8 turns: the top-CA women are not uniformly the top-OVR women. Bonmatí (FM26 CA 96, EA OVR 91) is the rare case where both systems agree at the top, while Olivia Smith, Alyssa Thompson, Jaedyn Shaw, Pauline Bremer, Momoko Tanikawa, and Claire Hutton all carry top-tier FM26 CAs while sitting between OVR 73 and OVR 81 on EA's broadcast-calibrated scale. The mean absolute prediction error of ≈ 4.8 CA points in the top 20 is comparable to the men's combined model's mid-tier MAPE, and the within-gender model performs the same task on women's CA that the union model performs on the men's TM target, in approximately the same accuracy band.
Section closer
The arc of the paper closes here. Wave 1 established that EA's attribute schema, not its value field, is the load-bearing artefact for football valuation; Wave 2 added Football Manager alongside it and ran the controlled experiment Wave 1 could not, supplying a second attribute schema, an independent rating network, and a within-gender bake-off. The findings come in pairs that map cleanly to the sections: two schemas disagree on shape (§4) but agree on rank for men (ρ = 0.834); the geographic seam in EA's calibration is the largest single bias and operates at the league level rather than the position level (§9, +0.22σ on the Premier League and +0.67σ on Premier League right-wingers); the men's combined model lands at R² = 0.785 with mid-tier MAPE collapsing from 51.6 % to 39.9 % on the addition of FM (§5); the within-gender FM26 women's model attains R² = 0.91 on women's CA while the EA-only women's model goes anti-signal at R² = −0.10 (§8.2); the cross-source rank correlation is 0.153 on n = 385 women (§8.4); and the 21-fee women's transfer corpus does not yet support a stable euro calibration (§8.5). The piece's central empirical claim (that two attribute schemas built for different purposes produce disagreements that themselves constitute the data product) survives both halves of the analysis, and the Wave 1 revision (the §8.4 sidebar) is the move that publication-grade work requires when new data arrive. Wave 3 sits squarely on the women's transfer-fee corpus: extending it to 100+ verified fees and refitting the within-gender calibration directly against the verified-fee target. The feature side is solved; the data product that opens the next paper is the same data product that closed this one. Women's football became measurable for the first time in November 2025, and everything that follows builds on that fact.
Chapter 3 — What the joined corpus newly enables
The third chapter steps past the headline argument and into the second-tier discoveries the joined corpus makes available. League-level valuation deviations and the country-PPP decoupling result, a personality-archetype atlas computed from FM's exclusive hidden-mental block, and a position-transfer valuation tool that reads a player's attribute vector under every alternative positional formula — none are the primary contribution, and a reader who stopped after the previous chapter has already received the main argument, but each is a deployable artefact in its own right.
§9 — Supplementary insights: league-level valuation deviations, exchange-rate geography, and the PPP-decoupling result

The chapters above carry the load-bearing argument: examine each instrument, compare them, build the union model, validate it against TM and CIES, and test its transferability to women. The findings in this chapter belong to a second tier. They are interesting and mechanically rigorous, and they carry real implications for a recruitment-analytics deployment of the union model, but they are not the primary contribution of the paper. The geographic deviations catalogued here are best understood as additional reading on the structural texture of the data rather than as the headline finding the union model produces.
A widely voiced gaming-community prior holds that EA over-rates forwards in order to please attacking players. The claim has the surface appearance of explanation (it would account for the celebrity-tier glow around Mbappé's 91 and the consistent way every Ultimate Team meta tilts toward strikers and wingers), but when EA's full 13,434-man universe is placed next to Football Manager 26's attribute vectors for the same players, the position-side hypothesis collapses on first inspection: every position lies within ±0.06σ of FM's calibration, and the most-EA-overrated position is goalkeeper. The position-bias story is therefore wrong, and what replaces it is much larger, much more actionable, and rarely discussed. EA inflates Premier League players by 0.22σ on average, and Premier League right-wingers by 0.67σ, the single most-overrated cell in the entire EA × FM matrix. The seam in EA's calibration is geographic rather than positional, which is the subject of this section.
9.1 — Testing the forwards-overrate hypothesis
The cleanest way to test a position-bias claim is to z-score both ratings inside their own distributions and examine the per-position difference. We computed the quantity directly: for each of EA's twelve position codes, mean OVR_z minus mean CA_z, on the 13,434 men with both vectors. The result is the population-level disagreement at the position level, with FM26's researcher network serving as the second prior against EA's broadcast committee.
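The computation just described is small enough to sketch with the standard library. The function below z-scores OVR and CA over the whole player list and averages the difference by position. The six example players are invented for illustration, the function names are hypothetical, and the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def z_scores(values):
    """Standardise values against their own mean and population std dev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def position_overrate_index(players):
    """players: list of (position, ea_ovr, fm_ca) tuples.

    Returns {position: mean OVR_z - mean CA_z}; positive means EA rates the
    position higher than FM does, relative to each system's own distribution.
    """
    ovr_z = z_scores([p[1] for p in players])
    ca_z = z_scores([p[2] for p in players])
    by_position = defaultdict(list)
    for (position, _, _), oz, cz in zip(players, ovr_z, ca_z):
        by_position[position].append(oz - cz)
    return {pos: mean(diffs) for pos, diffs in by_position.items()}


# Hypothetical players, for illustration only.
toy_players = [
    ("GK", 78, 70), ("GK", 72, 66),
    ("ST", 80, 78), ("ST", 74, 73),
    ("LW", 76, 77), ("LW", 70, 72),
]
index = position_overrate_index(toy_players)
for position, delta in sorted(index.items(), key=lambda kv: -kv[1]):
    print(f"{position}: {delta:+.3f} sigma")
```

Because both columns are standardised over the same players, the count-weighted average of the per-position deltas is zero by construction. The index therefore only redistributes disagreement across positions and cannot show EA as inflated everywhere at once.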
The numbers do not support the prior. The most-EA-overrated positions are goalkeeper at +0.035σ and central defensive midfielder at +0.034σ, while the most-EA-underrated position is left winger at −0.060σ; striker sits near the bottom of the position-overrate index at −0.024σ. On the full 13k men's corpus, EA rates forwards slightly lower than FM does, not higher. The magnitudes are small in absolute terms — none of the position deltas crosses one-tenth of a standard deviation — but the direction is what matters: the intuition that EA inflates the attacking tier of its database does not survive a controlled comparison against a researcher-graded second source.
The plausible mechanism for the modest direction that is observed is that FM's role system rewards the attribute combinations defining an elite winger or striker — high Pace combined with Finishing, Off The Ball, and Composure — more multiplicatively than EA's OVR formula does. EA's OVR is a compressed broadcast statistic with a 60–80 modal band tuned for online matchmaking balance, so when a player's underlying attributes are excellent, EA's formula shaves the top of the OVR distribution in order to preserve that modal band. FM's CA, which must feed a match engine capable of distinguishing a 90-rated striker from a 95-rated one in order to produce different 90-minute outcomes, leaves the top of the distribution open. The net effect is that elite forwards sit a touch higher in FM's CA than in EA's OVR, and the gaming-community story has the sign inverted.
A negative result of this kind is methodologically expensive because most readers are committed to a position-bias prior at the outset and the data must perform real work to shift it; the result earns its place by clearing the field, because no position-side calibration drift is large enough to act on in either direction. The next question is whether the market knows something about positions that the two video-game schemas do not, and the answer is yes — in a manner that does not alter the conclusion of §9.1.
9.2 — Where the market actually pays a premium
The market does pay an attacking premium, and both schemas agree the premium exists; neither schema, however, is its source.
Median Transfermarkt value runs as follows: LW €2.5M, CAM €2.0M, RW €2.0M, CDM €1.8M, LM €1.8M, RM €1.6M, CB / CM / LB / ST €1.5M each, RB €1.3M, and GK €0.6M. Left wingers sit at roughly 4× the goalkeeper median and right wingers and attacking midfielders at roughly 3×, with strikers and left wingers carrying €180M and €150M ceilings respectively, so the market premium on attackers is real and large.
But the EA-vs-FM disagreement does not amplify this premium. Were EA inflating attackers relative to FM, the position-side overrate-index of Figure 9.1 would carry the market signal, which by a clear empirical margin it does not, so whatever drives the market's attacking premium is operating outside the schema-disagreement surface. Scoring contribution is the candidate explanation that most analysts reach for: a goal-scorer's marginal product is more transferable across clubs and leagues than a defender's, scarcity at the top of the wide-attacker market is higher, and the residual market-driver is unrelated to which video-game schema is consulted. The market knows something the schemas measure correctly and agree on — namely, that wide attackers cost more — and the schemas, having agreed on it, leave no diagnostic signal in their disagreement.
The implication for a recruitment-analytics director is that the EA-vs-FM disagreement is not the appropriate locus for a position-side correction. The question of whether a forward is overvalued is answered by features outside both schemas: scoring rate, expected goals, finishing efficiency relative to position, and the StatsBomb event metrics that Wave 1 identified as the unaddressed gap. The schemas as a pair are silent on position-side calibration drift, and that silence is the load-bearing setup for §9.3.
9.3 — The Premier League pays ×1.46 of OVR-baseline and LaLiga pays ×0.76: a market effect, not an EA error
Two questions live in the league dimension and must be separated before either can be answered cleanly. The first is whether EA and FM26 disagree about a league — a measurement-noise diagnostic between two raters. The second is whether the Transfermarkt market prices a league above or below what the rating-to-value curve predicts — a mispricing or premium diagnostic. We answer both, and they tell different stories.
An early draft of this section answered the second question the wrong way, by treating the rating-z disagreement matrix as a mispricing finding, which it is not. A league row with high mean OVR does not indicate that EA over-rates the league; it likely indicates that the league hosts the world's best players. The Premier League's +0.22σ EA-z lead over its FM-z value is the residual after standardising both columns, so it survives the naive "PL has the best players" reading, but it still only measures which database considers the players to be better; it says nothing about whether the market judges them worth more than their ability warrants. Answering the mispricing question requires the market price as the y-axis and the rating as the x-axis, with the per-league residual as the diagnostic.
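A minimal sketch of that residual diagnostic follows. It fits one pooled line of log₁₀(market value) on rating, averages the residual by league, and converts each mean residual to a multiplier. To stay within the standard library it uses rating as the only regressor, whereas the baseline described in this section also conditions on age. The two leagues and their values are invented, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


def league_multipliers(rows):
    """rows: list of (league, rating, market_value_eur) tuples.

    Returns {league: 10 ** mean log10 residual} against the pooled curve.
    """
    ratings = [r[1] for r in rows]
    log_values = [math.log10(r[2]) for r in rows]
    a, b = fit_line(ratings, log_values)
    residuals = defaultdict(list)
    for (league, rating, _), log_value in zip(rows, log_values):
        residuals[league].append(log_value - (a + b * rating))
    return {lg: 10 ** (sum(res) / len(res)) for lg, res in residuals.items()}


# Hypothetical data: identical rating profiles, League A listed at twice
# League B's value at every rating.
rows = []
for rating in (70, 75, 80, 85):
    base = 10 ** (4.0 + 0.03 * rating)
    rows.append(("League A", rating, 2.0 * base))
    rows.append(("League B", rating, base))

multipliers = league_multipliers(rows)
for league, m in sorted(multipliers.items()):
    print(f"{league}: x{m:.2f}")
```

In this toy setup the pooled curve runs midway between the two leagues in log space, so the multipliers come out at √2 and 1/√2 and their ratio recovers the built-in factor of two. The multipliers are always relative to the pooled baseline, never to another league directly.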
The story is sharp, large, and replicable. The Premier League pays ×1.46 (+0.166 log₁₀) of the rating-baseline expectation — for a given OVR and Age, English-eligible players in the PL list at 46 % above what the corpus-wide curve predicts. MLS pays ×1.34, the second-largest premium, driven by Designated Player contracts and an inflated TM listing convention for North-American marquee names; the EFL Championship at ×1.18 carries the promotion-lottery option-value premium. Serie A and Bundesliga sit slightly above baseline (×1.13 and ×1.05), Süper Lig and Ligue 1 are essentially flat, and the Eredivisie is mildly discounted (×0.93). LaLiga 2 at ×0.86 and Liga Portugal at ×0.82 sit in the second tier of discount. The headline asymmetry is at the bottom: LaLiga 1 pays ×0.76 (−0.121 log₁₀), so Spanish-eligible players in the top tier are listed 24 % below what their OVR and Age predict. The PL vs LaLiga gap of ×1.46 vs ×0.76 amounts to a factor of nearly 2× on the same rating profile.
The interpretation is that the premium and discount columns are market effects, not rating errors. The Premier League's premium reflects its broadcast and prize-money structure: a PL place is worth ~£100M+ in television revenue before a single match is played, and that revenue flows through to player valuations in a way LaLiga's drier broadcast pot does not. MLS's premium reflects the DP-contract listing convention that has hardened over a decade of US-market expansion, and the Championship's premium reflects the option-value of promotion. On the discount side, LaLiga's −24 % residual is partly attributable to the financial-stress regime at Barcelona and Real Madrid that has produced systematically low listings for star players (the Lionel Messi listing of €35M in this corpus is the clearest case), and partly to a Spanish-football crowd-curation convention that has historically priced Spanish players conservatively against the English-market lens. Liga Portugal's −18 % reflects the well-documented "feeder league" effect: Porto, Benfica, and Sporting groom players for European resale, and TM lists their pre-sale value against an English-market reference frame.
The per-cell breakdown sharpens the earlier draft's "PL right-winger" framing. The PL premium is not concentrated at RW (+0.146 in the residual matrix, middling within the PL row); it is broadest across CDM (+0.116), CM (+0.202), CAM (+0.223), LM (+0.218), and RM (+0.249) — the midfield band that the rating-z disagreement matrix had also flagged but which we now read as a market rather than a measurement anomaly. MLS's largest cells are at LM, RB, and especially RW (+0.395) — the marquee-attacker tier where North American clubs list players against global comparable-sale benchmarks rather than US-internal ones. LaLiga's largest discounts are at goalkeeper (−0.229), the central midfielders (−0.143 to −0.176), and the wide midfielders — broad enough to read as a league-wide discount rather than a position-specific one.
For a recruitment-analytics director, this is the actionable correction the earlier draft was reaching for. A model that prices players from EA OVR + Age while ignoring league produces predictions that are systematically too low in the PL and MLS and too high in LaLiga, Liga Portugal, and the Eredivisie; the remedy is a league fixed effect calibrated against the corpus-wide rating-to-value curve, with the per-league multipliers above. The EA-vs-FM rating disagreement matrix (preserved below) is a separate diagnostic — it identifies which leagues the two raters disagree about, but the disagreement is between two raters rather than between a rater and the market. The market disagreement, which a valuation pipeline must correct, is much larger than the rating disagreement and lives in a different set of league rows.
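Applied in a pipeline, the league fixed effect amounts to one multiplication. The sketch below uses the multipliers quoted in §9.3 and treats any league not listed there (including the ones the text calls essentially flat) as ×1.0, which is a simplifying assumption. The function names and the €10M baseline are illustrative.

```python
import math

# Per-league multipliers as quoted in §9.3. Leagues not listed here are
# treated as x1.0, an assumption of this sketch.
LEAGUE_MULTIPLIER = {
    "Premier League": 1.46,
    "MLS": 1.34,
    "EFL Championship": 1.18,
    "Serie A": 1.13,
    "Bundesliga": 1.05,
    "Eredivisie": 0.93,
    "LaLiga 2": 0.86,
    "Liga Portugal": 0.82,
    "LaLiga": 0.76,
}


def league_adjusted_value(baseline_eur, league):
    """Scale a rating-and-age baseline prediction by the league multiplier."""
    return baseline_eur * LEAGUE_MULTIPLIER.get(league, 1.0)


def log10_offset(league):
    """The same correction expressed as an additive log10 fixed effect."""
    return math.log10(LEAGUE_MULTIPLIER.get(league, 1.0))


baseline = 10_000_000  # hypothetical baseline prediction in euros
for league in ("Premier League", "LaLiga", "Ligue 1"):
    adjusted = league_adjusted_value(baseline, league)
    print(f"{league:15s} EUR {adjusted:,.0f} ({log10_offset(league):+.3f} log10)")
```

The same €10M baseline becomes about €14.6M in the Premier League and €7.6M in LaLiga, a ratio of roughly 1.9. The log₁₀ offsets recomputed from the rounded multipliers (about +0.164 and −0.119) differ slightly from the +0.166 and −0.121 quoted above, because the published multipliers are themselves rounded.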
The original rating-disagreement matrix is informative on its own terms as a measurement-noise diagnostic between EA's broadcast-tuned committee and FM's local-researcher network. We retain it below for completeness, with the caveat that it is not a mispricing finding.
The two diagnostics agree on direction for the Premier League, where EA's broadcast-aligned ratings and the market premium both point upward, but they diverge sharply in magnitude. The PL's +0.22σ rater disagreement is small compared to its ×1.46 market premium, meaning that the market performs most of the work and the rater disagreement contributes a marginal additional effect at the top. LaLiga breaks the pattern: its +0.26σ rater disagreement sits against a ×0.76 market discount, so EA rates LaLiga players above FM and the market still discounts them. The EA-vs-market gap is therefore unusually large, and this is the league in which treating EA OVR as the standalone valuation signal will damage a model most severely.
The league-side calibration question has a Wave 1 antecedent that pre-dates the FM26 comparison. In the Wave 1 corpus, treating league as a one-hot feature in the EA-only model produced a multiplicative premium for each league against a Premier League baseline, and the OLS coefficient is the market exchange-rate for an identical-OVR, identical-age, identical-position player.
The exchange-rate bar can be read as the macro version of Wave 2's league × position heatmap. The two tell the same story from different angles: a Premier League player commands a 50–150 % premium over an otherwise-identical player elsewhere, and EA's OVR system itself inflates that same Premier League player by ~0.22σ over FM's calibration. The Premier League's premium is therefore both a real-market phenomenon and an EA-side measurement bias, and Wave 2's contribution is identifying which portion is which.
The natural follow-up question is whether league premia track host-country economic strength. The naive prior is that richer countries' leagues should pay more, but the Wave 1 data on n=34 leagues indicates otherwise.
Spearman ρ = 0.147 and log-log Pearson = 0.100, so the football economy is essentially decoupled from the country economy: the Premier League's premium is not attributable to the UK's GDP per capita, and the LPF's discount is not attributable to Argentina's. League market value constitutes its own latent dimension, set by broadcast rights, club ownership concentration, and historical talent flows rather than by the macroeconomic context of the host country, and this is the cross-league finding that justifies the league fixed-effect approach that Wave 2's heatmap operationalises at finer position-level granularity.
The mechanism beneath this finding is the one [Ezzeddine, Pradier & Scelles, 2025] describe in the Journal of Sports Analytics: EA's rating committee is calibrated, over decades, against the same broadcast and media inputs that shape Transfermarkt's crowd-curation. EA watches the same matches and reads the same press the Transfermarkt crowd reads, so the two outputs are close cousins by construction rather than by independent observation, which is precisely why EA's signal is densest in the leagues with the most broadcast coverage and thinnest in those with the least. FM's researcher network, by contrast, optimises against match-engine outcomes rather than broadcast consensus. The argument is one of independence-of-priors: when two sources disagree by 0.6σ on the Premier League right-winger cell, the disagreement constitutes signal, and the geographic distribution of that signal is the story.
9.4 — Player-level disagreement: the leaderboard
The league-level pattern of §9.3 carries a clean player-level signature. Z-score OVR within EA and CA within FM, then rank by the z-difference; the top 15 in each direction make the geographic banding visible at the individual level, with FM-overrates concentrated in South America and Liga Portugal and EA-overrates concentrated in EA-marketed top-5 European leagues.
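The ranking step can be sketched in a few lines. The five players and their ratings are invented, and the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

# (name, EA OVR, FM CA): invented values for illustration only.
players = [("A", 84, 70), ("B", 70, 72), ("C", 66, 78), ("D", 75, 74), ("E", 62, 60)]

def zscores(values):
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

z_ovr = zscores([p[1] for p in players])   # standardise within EA
z_ca = zscores([p[2] for p in players])    # standardise within FM

# Positive difference: EA rates the player higher than FM does.
z_diff = {p[0]: zo - zc for p, zo, zc in zip(players, z_ovr, z_ca)}

ea_overrates = sorted(z_diff, key=z_diff.get, reverse=True)[:2]
fm_overrates = sorted(z_diff, key=z_diff.get)[:2]
```

The sign convention matches the leaderboard: EA-overrates carry positive z-differences and FM-overrates negative ones.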
Xavi Simons tops the EA-overrate column at +2.22σ (Premier League, OVR 84 vs CA 70), followed by Álex Baena at +2.10σ (LALIGA, OVR 84 vs CA 71), Leo Román at +1.83σ, Tammy Abraham at +1.58σ (Süper Lig), and Kasper Schmeichel at +1.37σ (Scottish Prem). Of the top 15 EA-overrates, 13 play in EA-marketed top-5-European leagues or the Süper Lig — the precise league set that EA's marketing-density coverage emphasises — and the names are predominantly players for whom EA's broadcast-tuned committee identifies a marketing-prominent name and rates them higher than FM's local researchers do.
The FM-overrate column tells the inverse story. Federico Vera leads at −2.17σ (Liga Profesional Argentina), Enzo Martínez at −2.03σ (also LPF), and Jonatan Torres at −1.88σ (Libertadores), with four players tied at −1.74σ across LPF, Bundesliga, Allsvenskan, EFL, and Liga Portugal. Of the top 15 FM-overrates, 13 play in leagues that EA covers thinly — six in the LPF alone and five in Liga Portugal — where FM's regional researcher tradition (Argentina has long carried one of FM's deepest coverage networks) registers a high CA and EA's committee rates them down. The leaderboard makes the structural argument concrete: the disagreements between the two databases are dominated by which leagues each schema's rating network can actually observe.
For a scouting model that aims to surface mid-tier players the European broadcast lens has missed, treating "FM CA higher than EA OVR" as a positive signal in precisely the league set the leaderboard identifies is the operational move.
9.5 — Long-tail coverage as the structural reason
The geographic seam in §9.3 and §9.4 is not an accident of which leagues the two studios chose to license but a coverage-depth signature, and FM's coverage depth is its load-bearing data product. The top-30 league bar chart from §3.1 makes the point visible.
See Figure 3.1 above — the top-30 league bar is shared with §3.1's discussion of FM's coverage strength; the same data underwrites this section's argument about FM's value as a long-tail valuation instrument.
None of the Big-Five top tiers occupies the top three by player count in the FM26 ∩ EA matched corpus: Argentina's LPF leads at 658 players, followed by MLS at 597 and the EFL Championship at 510. The Premier League sits fourth at roughly 500 players, which is the ceiling produced by a 25-man-squad × 20-club structure; LaLiga is sixth, Bundesliga eighth, Ligue 1 eleventh, and Serie A twelfth. The 3. Liga and Bundesliga 2 both clear the Premier League because they contain more clubs within the same vertical league hierarchy, so FM's coverage strength resides in the long tail by design.
This is the structural reason the §9.3 valuation-residual and rater-disagreement patterns both read the way they do. FM has invested deeply in researcher coverage of the leagues that EA's broadcast lens covers thinly, so the FM-vs-EA rater disagreement is largest in precisely those leagues. The market, independently, pays a premium for the leagues whose broadcast and prize-money structure can support it, and that league premium is concentrated in the same broadcast-aligned set that EA's rating committee tracks most closely. The two effects are correlated by construction — EA's coverage density tracks the market's broadcast structure — but they remain distinct: a model that draws its training signal from FM's coverage breadth will pull disproportionately on the long tail, while a model that draws its signal from EA's coverage density will pull disproportionately on the Big-Five top tier and the marquee names within it. The two coverage choices are design decisions rather than defects, and the resulting geographic seam in both the market-residual and the rater-disagreement matrix is the inevitable joint output.
Section closer
The actionable correction is geographic rather than positional, and it is a market correction rather than an EA-error correction. A recruitment-analytics shop pricing players from EA OVR + Age while ignoring league will systematically underprice Premier League and MLS players by a factor of approximately 1.4×, slightly underprice Championship and Serie A players, and systematically overprice LaLiga, Liga Portugal, and Eredivisie players (by 17 %, 21 %, and 7 % respectively against the corpus baseline). The league fixed effect that these numbers operationalise is the single highest-leverage correction in any EA-only valuation pipeline, while the position-side correction is approximately zero across every cell. The forwards-are-overrated story was wrong, and the position-side intuition was the wrong place to look; the PL-and-MLS-pay-a-premium-while-LaLiga-discounts-its-stars story is correct, large, and replicable on disk — and it is a market story rather than an EA-rating story. The earlier draft of this section read the EA-vs-FM rater disagreement as a mispricing finding, which it is not; the rater disagreement is a measurement-noise diagnostic between two raters, and the mispricing finding appears in Figure 9.3 above where the market price itself sits on the y-axis. The structural bias in the world's most-used football valuation pipeline is not about who plays which position but about where in the world they play, with a magnitude of nearly 2× between the Premier League's premium and LaLiga's discount on the same rating profile.
Interlude — Two simulations, two theories of women's football in the database
§5 closed with EA and FM disagreeing about leagues on the men's side. The same two databases have made opposite methodological choices on a different and harder question: how to calibrate ratings across gender. §6 below cannot be read without first understanding what each studio said, because the choice, not the implementation, is what determines whether the women's data can be modelled at all.
EA's marketing implies a unified 1–99 scale. The FC26 product copy describes a "unified scale for all players … to enhance mixed-gender Ultimate Team squads" [EA Sports, 2025]. There is no published methodology document spelling the implication out, but the product framing is consistent: a 90-OVR woman and a 90-OVR man are intended to read as equivalently elite within the same calibration surface. In our matched cohort of 385 women with both EA and FM26 attribute vectors, mean EA OVR for the women is 71.5, against 66.4 for the men. The women's cohort sits five points higher on the OVR scale than the men's. The data signature is consistent with the unified-scale framing on one mechanism: the EA women's universe is a top-of-pyramid selection (the 1,447 women in FC26 are all from named playable leagues, which is the top tier of the women's professional game), while the men's universe extends down to depth pieces and academy graduates. The unified-scale interpretation says: same scale, different population shapes.
Sports Interactive is explicit, and the direction is opposite. From the FM26 launch coverage, restating SI's official methodology: "a female player with 20 for Pace would be at the peak of speed in the women's game, just as a man is in the men's game" [Sports Interactive, 2025]. Independent coverage by Fuller FM and The Cutback sharpens the policy: "the 1–20 attribute scale is calibrated relative to the women's game only. You should NOT be comparing a male footballer with a female footballer" [Fuller FM, 2025; The Cutback, 2025]. In the same matched cohort, mean FM26 CA for the 385 women is 62.1, against 67.4 for the men. The women's cohort sits five points lower on the surface scale than the men's. The data signature is the exact inverse of EA's, in the exact same direction the methodology document predicts: when you calibrate the women's scale to the women's distribution rather than to a pooled distribution, the women's mean comes out below the men's by the natural gap in the lower-tier coverage on the men's side, exactly because the women's scale has been pulled toward its own per-database centre.
The symmetry of the two data signatures is what makes the philosophical bridge work. Both studios shipped a coherent design. Neither is wrong on its own terms. EA's unified-scale framing produces a women's-mean five points above the men's because the women's playable universe is the top of its pyramid, with the men's universe extending below. SI's within-gender framing produces a women's-mean five points below the men's because the women's scale has been pulled into a self-contained calibration window. The same underlying truth — women's professional football is a younger, smaller, more concentrated population than men's professional football — produces opposite numerical signatures depending on which scale you use to look at it.
The consequence for §6 is direct, and it is the reason the women's analysis below carries such a sharp empirical contrast. A model that consumes EA's women's attributes is reading a vector calibrated against a pooled scale whose actual coverage is top-of-pyramid; the model interprets that vector through a men-trained prior that expects a wider distribution of underlying ability. A model that consumes FM26's women's attributes is reading a vector explicitly designed not to be compared against the men's vector; the same men-trained prior is being fed a vector its training data does not include. The two attribute systems are estimating different latent quantities — by design, in writing, with both studios on the record — and the question §6 asks is which of those latent quantities supports a within-gender valuation model on a 385-woman corpus.
The answer, when we ran it on disk, is much sharper than either Wave 1 or our Wave 2 priors expected. The within-gender FM26-attribute model lands at R²=0.91 on women's CA. The within-gender EA-attribute model lands at R²=−0.10. The two databases have not just made different methodological choices on cross-gender calibration; they have produced measurably different signal qualities when you ask each one to predict women's ability within women's data. That is the finding §6 is built around, and the philosophical bridge above is the reason the finding lands as cleanly as it does. The data is honest about a choice both studios made — and the choice has consequences.
§10 — Supplementary insights: personality archetypes, four named clusters in the men's database

The eleven personality mentals support a clustering exercise on which the rest of the paper does not depend but which a recruitment department will recognise as the most directly deployable descriptive output of Wave 2. K-means on the standardised personality vector, sweeping k from four to eight and selecting k = 4 on the best silhouette (0.116), recovers four clusters interpretable enough to be named — via Hungarian-assignment matching against Sports Interactive's internal taxonomy — as Model Citizen, Big-Match Driver, Loyal Veteran, and Resolute. The clusters are not tight, since a silhouette of 0.116 is modest, but the centroids carry interpretable signature directions, and the resulting archetypes validate cleanly against external value and reputation variables [archetype_summary.csv].
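The silhouette sweep can be sketched without external dependencies beyond NumPy. This is an illustration on four synthetic blobs rather than the personality data, and the deterministic farthest-point seeding is a choice made here for reproducibility, not a description of the paper's pipeline; in practice scikit-learn's `KMeans` and `silhouette_score` would be the usual tools.

```python
import numpy as np

def init_farthest(X, k):
    # Deterministic seeding: start at row 0, then repeatedly add the farthest point.
    idx = [0]
    for _ in range(k - 1):
        d = np.linalg.norm(X[:, None, :] - X[idx][None, :, :], axis=2).min(axis=1)
        idx.append(int(d.argmax()))
    return X[idx].copy()

def kmeans(X, k, n_iter=50):
    centroids = init_farthest(X, k)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep the old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

def silhouette(X, labels):
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i, own in enumerate(labels):
        same = labels == own
        if same.sum() == 1:                  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = dist[i, same].sum() / (same.sum() - 1)   # mean distance within own cluster
        b = min(dist[i, labels == c].mean() for c in np.unique(labels) if c != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy stand-in for the 11-dimensional personality vector: four synthetic blobs.
rng = np.random.default_rng(0)
centres = rng.normal(scale=5.0, size=(4, 11))
raw = np.vstack([c + rng.normal(scale=0.5, size=(40, 11)) for c in centres])
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)       # standardise each column

scores = {k: silhouette(X, kmeans(X, k)) for k in range(4, 9)}
best_k = max(scores, key=scores.get)
```

The synthetic blobs are far better separated than real personality data, so the silhouette here is much higher than the 0.116 reported above; only the selection logic carries over.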
The four archetypes, by population and market position:
| Archetype | n | Median TM value | Mean age | Mean intl reputation |
|---|---|---|---|---|
| Model Citizen | 2,716 | €2.5M | 27.6 | 1.33 |
| Big-Match Driver | 1,021 | €2.0M | 27.7 | 1.25 |
| Loyal Veteran | 2,348 | €900K | 26.1 | 1.04 |
| Resolute | 644 | €500K | 24.7 | 1.02 |
The cluster centroids in z-score form expose the signature axes.
The radar resolves the four signatures. Model Citizen sits above the population on Professionalism, Loyalty, Sportsmanship, and Compliance (the conscientiousness composite) and slightly below on Temperament; this is the corporate footballer, above-average on every social and contractual virtue and slightly cooler on the emotional-volatility axis, with a median value of €2.5M and a mean international reputation of 1.33, both the highest in the corpus. Big-Match Driver sits above the population on Important Matches, Pressure, and Consistency (the clutch composite) and notably high on Sportsmanship and Temperament; the cohort is smaller (1,021) but commercially valuable, with a median TM value of €2.0M just below Model Citizen and a near-identical mean age (27.7 against 27.6). Loyal Veteran is high on Loyalty and Compliance and below on Ambition and Adaptability (the settled-veteran signature) and constitutes the cluster a club re-signs because the player will not move, with the lower value (€900K median) reflecting precisely that contractual stickiness. Resolute is below the population on Adaptability and Professionalism and slightly above on Consistency, the journeyman signature; it is the most mobile cohort, carries the lowest mean international reputation, and unsurprisingly forms the lowest-value cluster at a €500K median.
A first-pass reading of the raw median column would conclude that "Model Citizen commands 5× the median value of Resolute", but that reading would be misleading. The Model Citizen cohort also carries the highest mean OVR (72.4) and is the second-oldest cohort (mean age 27.6), while the Resolute cohort carries the lowest mean OVR (65.5) and is the youngest (24.7), so most of the headline 5× gap is the raw-ability and age-curve difference rather than any independent contribution from the personality archetype. The honest test of whether archetype carries valuation signal beyond ability is to partial out OVR and Age first and examine the residual.
The ability-controlled finding is much smaller and more honest than the raw column suggested. After partialling out OVR + Age, only the Resolute archetype shows a clean, large discount of ×0.86 (14 % below baseline, 95 % CI ×0.81 – ×0.91); the Model Citizen "premium" shrinks from the headline 5× to a modest ×1.04 above baseline (CI ×1.01 – ×1.07), statistically positive but a thin slice of the original gap, while Big-Match Driver and Loyal Veteran sit within CV noise of no effect. The substantive reading is that personality archetype carries some valuation signal but most of the raw spread was attributable to ability and age. The Resolute discount is the load-bearing finding: younger, lower-OVR, lower-Adaptability, lower-Professionalism players carry a 14 % discount on top of what their rating and age already imply, which is the archetype effect performing real work, while the Model Citizen +4 % constitutes a marginal premium, statistically present but operationally small. The personality clustering is not the headline valuation tool (§9's pipeline-level R² occupies that role) but it is the headline descriptive tool: a one-of-four archetype label for any player in the FM26 men's database, computed automatically from the hidden vector, with face validity against market position and international reputation, and carrying one ability-controlled premium worth about 4 % and one ability-controlled discount worth 14 %.
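One simple way to partial out ability and age is a two-stage residual comparison, sketched below on a balanced toy corpus. The ×0.86 multiplier is planted to echo the headline figure, the slopes are invented, and the baseline here is "every non-Resolute player" rather than the paper's corpus baseline. On unbalanced real data the archetype dummies would normally enter the same regression as the controls.

```python
import numpy as np

# Balanced toy corpus: every (OVR, Age) pair appears once per group,
# so archetype is orthogonal to the controls by construction.
grid = [(o, a) for o in range(60, 86, 5) for a in range(20, 34, 2)]
ovr = np.array([o for o, _ in grid] * 2, dtype=float)
age = np.array([a for _, a in grid] * 2, dtype=float)
is_resolute = np.array([0] * len(grid) + [1] * len(grid))

planted_multiplier = 0.86   # illustrative, echoing the reported discount
log_value = 4.0 + 0.12 * ovr - 0.04 * age + is_resolute * np.log(planted_multiplier)

# Stage 1: regress log value on the controls only (intercept, OVR, Age).
X = np.column_stack([np.ones_like(ovr), ovr, age])
beta, *_ = np.linalg.lstsq(X, log_value, rcond=None)
residual = log_value - X @ beta

# Stage 2: the exponentiated gap in mean residuals is the archetype multiplier.
gap = residual[is_resolute == 1].mean() - residual[is_resolute == 0].mean()
multiplier = float(np.exp(gap))
```

A multiplier below 1 reads as a discount on top of what rating and age already imply; 0.86 corresponds to the 14 % figure.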
The actionable use case is contract-decision support, and it is specific enough to be deployed without re-derivation. A club holding a player in the Loyal Veteran archetype who is one year from contract expiry is statistically more likely to extend than a player in the Big-Match Driver archetype with identical OVR and Age; contract-rejection probability differs by personality signature, and the Loyalty / Ambition centroid gap (Loyal Veteran sits above population on Loyalty and below on Ambition, while Big-Match Driver inverts both) is the mechanism. A club running a transfer-pursuit list can prioritise targets by archetype: a Resolute player at any OVR is a likelier mover than a Model Citizen at the same OVR. None of this is achievable for an EA-only model, because the input features (Loyalty, Ambition, Compliance, Important Matches) do not exist in EA's schema at all. The archetype atlas is the kind of model FM data uniquely enables, and Wave 2 reports it as a descriptive output rather than a valuation output; that disclosure is the discipline. The valuation tool resides in §4, and the archetype tool resides here, answering a different question (how a player will behave under the contract structures a club can offer) than the valuation pipeline answers.
§11 — Repurposing the model: a position-transfer valuation tool

The §2 finding that EA's OVR is a position-weighted aggregation, recoverable per-position at $R^2 = 0.96$ – $0.998$ via OLS on the sub-attribute vector, supplies a tool the union model can deploy directly. The per-position OLS coefficients are EA's positional formula. If we apply a different bucket's coefficients to the same player's sub-attributes, we obtain a counterfactual answer to a question the labels normally suppress: what would this player's OVR have been, and what value would the model project, if EA had classified them as a striker rather than a centre-back? §11 builds that tool, validates its outputs against football intuition, and discloses the methodological caveat that limits its interpretation.
11.1 — Pipeline
The construction has four stages. First, the twelve EA position codes (CB, ST, LW, …) are mapped to the five OVR-formula buckets (GK, DEF, MID, WIDE, FWD). Second, a separate OLS is fit on each bucket, regressing OVR on the 31-attribute sub-vector and recovering that bucket's positional formula; the fits range from $R^{2} = 0.88$ for goalkeepers (whose specialist attribute layer is excluded from the visible vector) through 0.97–0.998 for outfield buckets, recapitulating the §2 finding on a per-bucket basis. Third, for every player in the matched corpus, the five bucket-specific formulas are applied to their actual sub-attribute vector, producing five counterfactual OVRs (OVR_as_GK, OVR_as_DEF, and so on). Fourth, the union model from §5.7 is used to score five counterfactual market values per player by substituting each OVR_as_<bucket> into the player's feature vector and re-predicting log Transfermarkt value. The output is a 7,835-row × 25-column table of every player's projected value under every position bucket, augmented by a cosine-similarity position-fit score that flags conversions where the player's attribute profile is too far from the target bucket's centroid to support a credible projection.
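Stages two and three can be sketched as follows. The three attributes stand in for the 31-attribute sub-vector, only two buckets are shown, and the "true" weights are invented solely to generate toy OVRs; stage four (re-scoring with the union model) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
attributes = ["Finishing", "Tackling", "Passing"]   # stand-in for the 31 sub-attributes
# Hypothetical positional weights, used only to generate the toy OVRs.
toy_weights = {"DEF": [0.10, 0.70, 0.20], "FWD": [0.70, 0.05, 0.25]}

# Stage 2: fit one OLS per bucket on the players labelled in that bucket.
formulas = {}
for bucket, w in toy_weights.items():
    attrs = rng.uniform(40, 90, size=(50, len(attributes)))
    ovr = attrs @ np.array(w)
    X = np.column_stack([np.ones(len(attrs)), attrs])   # intercept + sub-attributes
    coef, *_ = np.linalg.lstsq(X, ovr, rcond=None)
    formulas[bucket] = coef

# Stage 3: apply every bucket's recovered formula to one player's actual attributes.
def counterfactual_ovr(player_attrs, bucket):
    coef = formulas[bucket]
    return float(coef[0] + np.dot(coef[1:], player_attrs))

centre_back = np.array([45.0, 85.0, 65.0])   # Finishing, Tackling, Passing
ovr_as = {bucket: counterfactual_ovr(centre_back, bucket) for bucket in formulas}
```

With these toy weights the same attribute vector scores 77 under the DEF formula and 52 under the FWD formula, which is the counterfactual gap the tool is built to expose.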
Two safeguards sit between the model's raw projection and the recommended conversion set. The first is the cosine-similarity position-fit score against the bucket centroid; without it, the GK formula's heavy reliance on Reactions, Composure and Positioning would happily project Kylian Mbappé as a €49M goalkeeper. The filter catches every gross mismatch in the sanity panel: Mbappé-as-GK ($-0.75$), Vinícius Júnior-as-CB ($-0.52$), Phil Foden-as-defender ($-0.35$) all flag below zero and are excluded from the recommended conversion set. The second safeguard is more categorical: GK ↔ outfield conversions are excluded from the best-alternative recommendation regardless of posfit. Goalkeeping is genuinely separable in a way that no posfit threshold can quite capture: the entire GK specialist attribute layer (Diving, Handling, Kicking, Reflexes, GK Positioning) lives outside the visible sub-attribute vector, so any cross-GK projection is the model speaking outside its competence. The earlier version of the tool surfaced 1,600 outfielders (mostly centre-backs whose Heading, Strength and Jumping proxy weakly for GK signal) as having GK as their best alternative bucket; the current version excludes those projections by construction so that they cannot reach the recommendation surface. The same rule removes the inverse-direction artefact, namely the five goalkeepers whose attribute profiles happened to project them into the DEF bucket.
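The two safeguards compose naturally into a single selection function. The sketch below uses invented three-attribute centroids and toy values, and the zero fit threshold is an assumption drawn from the "flag below zero" wording rather than a documented parameter.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented z-scored centroids over [Finishing, Tackling, Reactions].
centroids = {
    "GK":  np.array([-1.2, -0.8, 1.0]),
    "DEF": np.array([-0.8,  1.1, 0.1]),
    "MID": np.array([ 0.3,  0.2, 0.1]),
    "FWD": np.array([ 1.2, -0.9, 0.3]),
}

def best_alternative(profile, own_bucket, projected_value, min_fit=0.0):
    own_is_gk = own_bucket == "GK"
    candidates = {}
    for bucket, centroid in centroids.items():
        if bucket == own_bucket:
            continue
        if (bucket == "GK") != own_is_gk:          # categorical GK <-> outfield exclusion
            continue
        if cosine(profile, centroid) < min_fit:    # position-fit safeguard
            continue
        candidates[bucket] = projected_value[bucket]
    return max(candidates, key=candidates.get) if candidates else None

striker = np.array([1.5, -1.0, 0.8])
striker_values = {"GK": 49.0, "DEF": 3.0, "MID": 60.0, "FWD": 120.0}   # toy, in €M

holding_mid = np.array([-0.5, 1.0, 0.2])
holding_mid_values = {"GK": 30.0, "DEF": 27.6, "MID": 8.4, "FWD": 2.0}

keeper = np.array([-1.0, -0.9, 1.1])
keeper_values = {"GK": 20.0, "DEF": 25.0, "MID": 5.0, "FWD": 1.0}
```

Here `best_alternative(striker, "FWD", striker_values)` ignores the tempting goalkeeper projection and the poorly fitting DEF bucket, while the goalkeeper receives no recommendation at all.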
11.2 — The transfer matrix
The matrix recovers football's positional topology cleanly. MID is the universal pivot — a midfielder retains 90 % of value as a defender, 85 % as a winger, and 71 % as a forward, which is exactly the pattern that supports the routine CDM-to-CB and CAM-to-winger conversions clubs already make. GK is structurally isolated, retaining only 24 % of value as any outfield bucket and outfield players retaining only 24 % as goalkeepers; the GK row and column should be read with a stronger caveat than the rest of the matrix because the goalkeeping specialist attribute layer (Diving, Handling, Reflexes, Kicking, GK Positioning) is not part of the visible sub-attribute vector at all, and the cells are computed from a formula that cannot see the specialist signal. The values shown are the model's best projection under that information ceiling and the recommended conversion set excludes GK ↔ outfield swaps regardless of how high the numbers look. DEF↔FWD is the worst conversion within the outfield buckets, with defenders retaining only 28 % of value as forwards and forwards retaining 24 % as defenders; the attribute profiles are structurally opposite and the model says so. The asymmetries within the matrix — MID→DEF at 90 % but DEF→MID only 58 %, FWD→MID at 71 % but MID→FWD only 50 % — are not bugs but signal: defenders' attribute vectors do not project as cleanly into the offensive bucket as midfielders' do because defensive sub-attributes (Tackling, Marking, Heading) carry less weight in the offensive formulas than the equivalent offensive sub-attributes carry in defensive ones. The matrix is, in effect, a structural map of which position swaps are attribute-supported and which are not, with GK ↔ outfield treated as a categorical exclusion rather than a continuous projection.
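A retention matrix of this shape can be built from per-player projections as sketched below. The aggregation by median is an assumption, since the text does not state how cells are summarised, and the four toy players and their values are invented.

```python
from statistics import median

# Toy projected values (€M) per player under each bucket; numbers are invented.
players = [
    {"own": "MID", "value_as": {"DEF": 18.0, "MID": 20.0, "FWD": 14.0}},
    {"own": "MID", "value_as": {"DEF": 9.0,  "MID": 10.0, "FWD": 8.0}},
    {"own": "DEF", "value_as": {"DEF": 10.0, "MID": 6.0,  "FWD": 3.0}},
    {"own": "FWD", "value_as": {"DEF": 5.0,  "MID": 15.0, "FWD": 20.0}},
]
buckets = ["DEF", "MID", "FWD"]

def transfer_matrix(players, buckets):
    # Cell [src][dst]: typical share of own-bucket value retained when a
    # player listed in `src` is re-scored under `dst`.
    matrix = {}
    for src in buckets:
        group = [p for p in players if p["own"] == src]
        matrix[src] = {
            dst: median(p["value_as"][dst] / p["value_as"][src] for p in group)
            for dst in buckets
        }
    return matrix

matrix = transfer_matrix(players, buckets)
```

The diagonal is 1 by construction, and nothing forces `matrix["MID"]["DEF"]` to equal `matrix["DEF"]["MID"]`, which is why asymmetric cells are expected rather than anomalous.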
11.3 — Case studies and what the tool surfaces
Six named cases anchor the tool's behaviour. Jude Bellingham projects above 50 % of own-bucket value in every outfield alternative — he is a four-outfield-bucket player, which matches the football record of his having been deployed as ST, CAM, CM, and CDM at club level. Federico Valverde scores similarly, with a notably high projection as a defender (€101M) that recapitulates his Real Madrid emergency-fullback deployments. Erling Haaland, by contrast, collapses sharply outside FWD — his projection as a defender is €1.2M and his projection as a goalkeeper, despite the formula returning €92M from Reactions and Composure proxies, is excluded from the recommended set on the categorical GK exclusion rule. Virgil van Dijk mirrors Haaland in the opposite direction, with strong DEF and reasonable MID projections but no support for any further-forward conversion. Kenan Yıldız and Bukayo Saka illustrate the MID↔WIDE interchangeability the matrix predicted: both project at substantially higher value in alternative buckets than at their listed positions, which matches the modern positional reality that wingers and central attacking midfielders are increasingly the same player deployed differently.
The scatter surfaces a quietly important class of player. Aleksandar Pavlović (CDM, €65M actual) projects at €8.4M in his own bucket but €27.6M as a centre-back — a +229 % uplift on the same attribute vector under a different positional weighting; the model is saying that EA's choice to label him a CDM is suppressing what his attributes are actually worth. Kenan Yıldız at +194 % to WIDE, Alejandro Garnacho at +210 % to WIDE, Noah Okafor at +232 % to WIDE all carry the same signature: a player whose listed position is one click central of where their attribute profile lives. The conversions are attribute-implied, not scouting predictions — they reflect a mispricing inside EA's own formula machine, not a counterfactual football scenario.
11.4 — Naturally positioned vs versatile
The same construction reads in the reverse direction. The players with the highest cosine similarity to their own bucket's attribute centroid are the archetypally positioned: Lucas Chevalier and Anatoliy Trubin top the goalkeeper subset at fits above 0.95, Evan Ferguson tops the strikers, Kiernan Dewsbury-Hall tops the midfielders, and Vitinha leads the central-midfield subset of players above €100M at fit 0.927 — by attribute, Vitinha is the prototypical modern CM. The versatility ranking inverts the criterion: Jude Bellingham, Federico Valverde, Tijjani Reijnders, and Aleksandar Pavlović score full marks (5 / 5 buckets where projected value ≥ 50 % of own-bucket value), and these are the players whose attribute profiles support the broadest set of deployments. The two rankings are dual — high own-bucket fit means one deployment is uniquely well-supported; high versatility breadth means several are. Both are useful, both are derivable from the same construction, and the recommended use is to consult them jointly: a scouting target should ideally show high own-bucket fit at the destination and moderate-or-better fit at adjacent buckets, so that a tactical revision after signing does not strand them.
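The versatility count reduces to a threshold test against own-bucket value. A minimal sketch with invented projections follows; the own bucket is included in the count, which is how a 5 / 5 score becomes possible.

```python
def versatility(value_as, own_bucket, threshold=0.5):
    # Number of buckets (own bucket included) whose projected value is at
    # least `threshold` times the own-bucket projection.
    own = value_as[own_bucket]
    return sum(1 for v in value_as.values() if v >= threshold * own)

# Invented projections in €M.
all_rounder = {"GK": 60.0, "DEF": 101.0, "MID": 110.0, "WIDE": 95.0, "FWD": 80.0}
specialist = {"GK": 40.0, "DEF": 1.2, "MID": 30.0, "WIDE": 70.0, "FWD": 180.0}

all_rounder_score = versatility(all_rounder, "MID")
specialist_score = versatility(specialist, "FWD")
```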
Meta-analysis sidebar — Why "raw capability" is not what we measured
The position-transfer tool answers a tightly scoped question (what would EA value this attribute profile under a different position's weighting formula, holding observed attributes constant) and it is important not to read it as answering a wider one. Professional footballers have fixed positions by their late teens. Their training load, match exposure, coaching emphasis, and rater observation are all shaped by that position for the decade preceding the attribute record. By the time EA's committee or Sports Interactive's researcher network records a centre-back's Finishing at 45, that 45 is a measurement of (a) whatever finishing capability the player was born with, (b) the years they spent learning to defend instead of finishing, and (c) the rater's inference from a small sample of low-leverage CB shots. The three components are not separable.
This means we never recover raw, position-independent capabilities from EA or FM. What we record is a position-baked observed-capability vector. The contamination is heterogeneous: physical attributes (Pace, Strength, Jumping) and FM's hidden mentals (Determination, Bravery, Adaptability) are close to inherent, while technical and tactical attributes (Finishing, Tackling, Marking, Crossing) are heavily position-trained. The gradient is real, and EA's OVR is dominated by the high-contamination category.
The §11 tool should therefore be read as a valuation lens, not a scouting projection. Pavlović projecting at €27.6M as a centre-back surfaces a real and useful signal: EA's own formula machine is pricing him below where his attribute vector should land if it were weighted differently. That is information about EA's pricing inefficiency, not about whether a club could actually convert him from CDM to CB and realise that valuation. The latter would require retraining, which would change the attributes themselves; the former is a static recombination of the existing measurements under a different aggregator. The tool is honest about what it is, and the caveat is a methodological feature of any model trained on this kind of label-conditioned observational data, not a defect specific to this construction.
The tool ships as position_transfer_full.csv in the Wave 2d artifact directory, with one row per player and columns for every alternative-bucket projection, position-fit score, and pre-filtered best plausible conversion. The 7,835-row table is the directly deployable output of §11; the recruitment use is to filter by uplift_pct > 50 %, best_alt_fit > 0.2, and a market-value floor that matches a club's scouting budget, then read the case studies the filter surfaces against the matrix and the contamination caveat. The construction is reproducible from the artifacts and the per-position OVR coefficients are written out alongside the projection table; any team wanting to extend the tool to FM-side counterfactuals (a per-position CA formula on the visible + hidden vector) has the schema to do so.
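The recruitment filter can be sketched against a small in-memory stand-in for the artifact. Only `uplift_pct` and `best_alt_fit` are column names taken from the text; `player`, `tm_value_eur`, and `best_alt_bucket` are hypothetical, as are the four rows and the €5M floor.

```python
import csv
import io

# In-memory stand-in for position_transfer_full.csv (rows are invented).
sample = io.StringIO(
    "player,tm_value_eur,best_alt_bucket,best_alt_fit,uplift_pct\n"
    "Player A,65000000,DEF,0.61,229\n"
    "Player B,30000000,WIDE,0.15,194\n"
    "Player C,2000000,WIDE,0.55,210\n"
    "Player D,40000000,MID,0.70,12\n"
)

def shortlist(handle, min_uplift=50.0, min_fit=0.2, value_floor=5_000_000):
    keep = []
    for row in csv.DictReader(handle):
        if (float(row["uplift_pct"]) > min_uplift
                and float(row["best_alt_fit"]) > min_fit
                and float(row["tm_value_eur"]) >= value_floor):
            keep.append(row["player"])
    return keep

targets = shortlist(sample)

# Uplift arithmetic from the case study: €8.4M own bucket, €27.6M as centre-back.
pavlovic_uplift_pct = (27.6 / 8.4 - 1.0) * 100.0
```

Player B fails the fit threshold, Player C the value floor, and Player D the uplift threshold, so only Player A survives; the uplift line reproduces the +229 % figure after rounding.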
§12 — Squad-level tactical fit: the top-five-league atlas

The §11 position-transfer tool runs one player at a time. The same construction, lifted to the squad level, answers a different question: which clubs are structurally skewed toward attack or defence, where does each club have realistic depth across the five tactical buckets, and how does formation choice shift the projected attacking output the squad can produce. §12 builds that atlas across the 96 clubs in the Premier League, LaLiga, Bundesliga, Serie A, and Ligue 1 — roughly 2,500 players — and reports three findings that arise when a transparent attribute-level model is scaled from the individual to the team.
12.1 — Squad style spectrum
The first finding is descriptive but non-trivial. A squad's attacking-vs-defending lean can be measured directly from its players' sub-attribute vectors by constructing a per-player z-score on offensive attributes (Finishing, Long Shots, Dribbling, Vision, Pace and ten other technical and creative attributes) minus a z-score on defensive attributes (Interceptions, Tackling, Marking, Heading Accuracy, Strength and four other physical-defensive attributes), then averaging across the squad with weights proportional to the square of OVR — a rough minutes-played proxy that emphasises starters. The result is a single number per club, comparable across leagues, that captures what the squad's attribute composition supports before any tactical instruction is applied.
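The squad-level aggregation is a weighted mean of per-player leans. A minimal sketch, assuming the offensive and defensive z-scores have already been computed per player; the two-player squad is invented.

```python
import numpy as np

def squad_style(offensive_z, defensive_z, ovr):
    # Per-player lean: offensive composite minus defensive composite.
    lean = np.asarray(offensive_z, dtype=float) - np.asarray(defensive_z, dtype=float)
    # OVR-squared weights emphasise likely starters.
    weights = np.asarray(ovr, dtype=float) ** 2
    return float(np.sum(weights * lean) / np.sum(weights))

# Two-player toy squad: an attacking star and a defensive squad player.
style = squad_style(offensive_z=[1.0, -1.0], defensive_z=[-0.5, 0.5], ovr=[90, 60])
unweighted = float(np.mean(np.array([1.0, -1.0]) - np.array([-0.5, 0.5])))
```

The unweighted lean of this squad is exactly zero, while the OVR²-weighted lean is positive because the 90-rated attacker dominates the weights.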
The atlas resolves cleanly. Real Madrid sits at the upper-right corner: high overall quality, attacking lean across the whole squad, including a noticeably forward-skewed defensive group consistent with Carlo Ancelotti's positional structure that pushes both fullbacks into the half-spaces. Paris Saint-Germain, Liverpool, Bayer Leverkusen, and Atalanta populate the same upper-right quadrant. The lower-left is occupied by Heidenheim, Hellas Verona, Udinese, Angers SCO, and St. Pauli, small-budget Bundesliga, Serie A, and Ligue 1 sides whose attribute composition mechanically favours defending. The defender-style lean column is the more interesting measurement: every team's defenders skew defensive in absolute terms (style_z < 0), but the least defensively-skewed defensive groups are the upper-right teams, with the Premier League's top six showing defender style scores between −0.45 and −0.65 against the lower-tier teams' −0.85 to −1.10. The interpretation: when even your centre-backs sit closer to the league-mean attacking profile, the squad supports an attacking shape structurally, not just on the manager's tactical whim.
12.2 — Bucket-depth atlas and per-bucket spend imbalance
The position-transfer construction of §11 supplies the second layer. For each squad, the mean of the five highest OVR_as_<bucket> projections gives the realistic starting depth that bucket can field across an XI's worth of rotation, regardless of nominal position labels. The top-5 mean is the right granularity for a depth measurement: a 4-3-3 fields four defenders and a fifth on the bench, a 3-5-2 fields three central defenders and two wing-backs in the same DEF bucket, and a deep cup run requires the fifth-best at any bucket to be playable — so the top-3 mean used in the earlier version of this analysis under-counted the bucket's actual usable depth and over-weighted the very-top tier. A squad whose top-5 mean for OVR_as_FWD is 82 — counted across every player who could plausibly be deployed as a forward, not just the nominal strikers — has the talent to support an attacking front through fatigue and rotation; a squad whose OVR_as_MID top-5 is 80 has the talent to overload midfield across a full week of fixtures.
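The depth measure is a top-n mean over one bucket's counterfactual OVRs. The sketch below uses an invented squad to show how the top-3 and top-5 variants diverge.

```python
def bucket_depth(ovr_as_values, top_n=5):
    # Mean of the top_n highest counterfactual OVRs for one bucket.
    top = sorted(ovr_as_values, reverse=True)[:top_n]
    return sum(top) / len(top)

# Invented OVR_as_FWD projections for every player in one squad.
ovr_as_fwd = [86, 84, 82, 80, 78, 70, 65]

depth_top5 = bucket_depth(ovr_as_fwd)            # 82.0
depth_top3 = bucket_depth(ovr_as_fwd, top_n=3)   # 84.0
```

The two-point gap between the variants is the over-weighting of the very top tier that the switch to a top-5 mean is meant to remove.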
The heatmap surfaces three structural patterns. The top quintile of clubs are deep in every bucket — Real Madrid, Manchester City, Paris Saint-Germain, Bayern, Liverpool, Barcelona, Arsenal, Inter all carry top-5 OVR_as values above 78 across all five buckets, including GK. The middle band is bucket-asymmetric — many mid-tier clubs have one or two buckets where depth collapses, and the collapse is informative: clubs with very low OVR_as_WIDE are typically structurally unable to run a 4-3-3 even with their nominal wingers, while clubs with low OVR_as_DEF cannot run a back five even when nominally listing five defenders. The bottom quintile of clubs sit roughly flat at 68–72 across buckets, which means the limiting factor for those squads is gross talent rather than tactical fit.
The spend layer combines this depth map with valuation. For each club, the sum of Transfermarkt market values across the squad is the actual spend; the sum of the orthogonal §5.7 model's per-player predictions is the attribute-implied value. The ratio surfaces over- and under-payment relative to the model's view of what the squad's attributes are worth.
The Premier League sits cleanly above the y=x diagonal across all 20 clubs. Brentford, Manchester United, Nottingham Forest, Burnley, Everton, Crystal Palace, Manchester City, Brighton, Fulham, Leeds — every club in the league is valued at roughly two times the model's attribute-implied prediction. This is not a measurement of squad mismanagement. It is the consequence of the Premier League broadcast-revenue premium: TM market values incorporate buyer-side purchasing power, and English clubs are the highest-paying buyers in world football by a margin that has widened year-on-year. The model, which sees only attributes and not the buyer's wallet, recovers the gap as a systematic offset. The mirror-image effect appears in LaLiga's mid-tier: Rayo Vallecano, Mallorca, Getafe, Celta, Osasuna all sit at roughly half the model's predicted value, which is consistent with LaLiga's significantly tighter broadcast-revenue distribution and weaker buy-side market.
The per-bucket spend chart cuts the headline ratio finer, and surfaces a distinction that matters enough to state explicitly: a squad's playing style, the absolute distribution of its budget across buckets, and the model's overspend ratio per bucket are three independent measurements with three different answers. Brentford is the worked example. The §12.1 style score reads +0.039, essentially neutral, despite the squad's optimal formation in §12.5 being 3-4-3. The absolute spend allocation is heavily defensive (€156M on DEF, €117M on MID, only €25M on FWD) because Brentford's strategy is a small handful of high-attribute attackers (Wissa, Schade) backstopped by a deep, talented defensive group. The overspend is highest as a ratio at FWD (4.96×) but largest in absolute euros at DEF (€87M above prediction). The three readings reconcile cleanly once they are separated: style reflects per-player attribute composition, absolute spend reflects roster construction, and the ratio reflects buyer-market pricing relative to attribute-implied value. They are not contradictions; they are the three axes a recruitment department needs to read jointly. Real Madrid and PSG sit close to model expectation across the board, which is informative in the opposite direction: for clubs at the top of the wage table, the model is saying the spend is mostly attribute-supported, and the standard "over-paying for stars" critique does not survive an attribute-controlled comparison.
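The difference between the ratio reading and the absolute-euro reading can be sketched in a few lines. The spend figures are the Brentford DEF and FWD numbers quoted above; the attribute-implied values are back-derived from the quoted ratio and gap (an assumption for illustration, not values read from the artifact tables), and the function name is hypothetical.

```python
def bucket_overspend(spend_eur_m, implied_eur_m):
    """Per-bucket overspend as a ratio and as an absolute gap, in EUR millions."""
    out = {}
    for bucket, spend in spend_eur_m.items():
        implied = implied_eur_m[bucket]
        out[bucket] = {"ratio": spend / implied, "gap": spend - implied}
    return out

spend = {"DEF": 156.0, "FWD": 25.0}
# Back-derived: DEF implied = 156 - 87; FWD implied = 25 / 4.96 (rounded).
implied = {"DEF": 69.0, "FWD": 5.04}

res = bucket_overspend(spend, implied)
highest_ratio = max(res, key=lambda b: res[b]["ratio"])
largest_gap = max(res, key=lambda b: res[b]["gap"])
```

The two rankings disagree by construction: a small bucket can carry the largest multiple while a large bucket carries the largest euro gap.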
12.3 — Formation-conditioned utility, attacking output, and spend
The third layer answers the user's original question. For each squad and each of six common formations — 4-3-3, 4-4-2, 4-2-3-1, 3-5-2, 3-4-3, 5-4-1 — the squad is optimally assigned to the formation's 11 position slots by Hungarian algorithm on the cost matrix −OVR_as_<bucket>, so position-transferable players are valued under the bucket each slot demands. The goalkeeper slot is treated as a hard constraint and assigned only from the squad's nominal goalkeepers — the §11 finding that GK is structurally isolated (per-position OVR R² of 0.88 versus 0.96–0.998 for outfield buckets, because the goalkeeping specialist attributes are not part of the visible sub-attribute vector) means that no outfielder's projection into the GK slot is credible regardless of how well their Reactions or Composure scores rank. The resulting starting XI is then scored on two synthetic composites:
- Attack index — a weighted z-score on offensive sub-attributes (Finishing, Long Shots, Vision, Dribbling, Crossing, Pace, Ball Control and twelve others) summed across the eleven selected players. Reads as "attacking-output capability" of the XI, with the corpus mean at zero.
- Defend index — the analogous defensive composite (Tackling, Marking, Interceptions, Heading, Strength, Aggression and five others). Reads as "defensive-output capability."
These indices are not expected goals. They are attribute-implied capacity, anchored in the same Top-5 corpus the model was trained on. The construction is honest about what it measures: when a squad is reassembled into a different formation with optimal player-to-slot assignment, how does the team's attribute-implied attacking and defending capability shift?
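The slot-assignment step can be illustrated on a toy squad. The sketch below is not the Hungarian algorithm itself: it solves the same assignment problem by exhaustive search, which returns the same optimum on small inputs but does not scale to a full squad, where a solver such as `scipy.optimize.linear_sum_assignment` would be the more plausible choice. Player names, ratings, and the four-slot shape are hypothetical.

```python
from itertools import permutations

def best_xi(squad, slots):
    """Exhaustive search for the player-to-slot assignment with maximum total OVR_as."""
    best_total, best_assign = float("-inf"), None
    for combo in permutations(squad, len(slots)):
        pairs = list(zip(slots, combo))
        # Hard constraint: the GK slot is filled only by a nominal goalkeeper.
        if any(slot == "GK" and p["nominal"] != "GK" for slot, p in pairs):
            continue
        total = sum(p["ovr_as"][slot] for slot, p in pairs)
        if total > best_total:
            best_total = total
            best_assign = [(slot, p["name"]) for slot, p in pairs]
    return best_total, best_assign

squad = [
    {"name": "Keeper", "nominal": "GK", "ovr_as": {"GK": 78, "DEF": 30, "MID": 28, "FWD": 25}},
    {"name": "Stopper", "nominal": "DEF", "ovr_as": {"GK": 80, "DEF": 82, "MID": 74, "FWD": 60}},
    {"name": "Pivot", "nominal": "MID", "ovr_as": {"GK": 20, "DEF": 76, "MID": 81, "FWD": 72}},
    {"name": "Winger", "nominal": "FWD", "ovr_as": {"GK": 20, "DEF": 55, "MID": 79, "FWD": 80}},
    {"name": "Utility", "nominal": "DEF", "ovr_as": {"GK": 20, "DEF": 79, "MID": 80, "FWD": 83}},
]
slots = ["GK", "DEF", "MID", "FWD"]
total, xi = best_xi(squad, slots)
```

Two things are visible in the result: the outfielder with an inflated GK projection is excluded from goal by the hard constraint, and the nominal defender "Utility" is placed at FWD because that is where the projection is highest.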
Interactive XI explorer — pick a league, team, and formation, see the worst- and best-matchup opponent
The formation-conditioned construction supplies one further analytical move: for any team A playing formation F_A, the same-league opponent that beats A by the largest margin, once that opponent has picked the formation that best counters F_A, is A's worst likely matchup. The opponent that A still beats by the largest margin under the opponent's best counter is A's best matchup. The metric per formation pair is the sum of two phase deltas: A_net_attack = A_attack_index(F_A) − B_defend_index(F_B) and A_net_defend = A_defend_index(F_A) − B_attack_index(F_B). A positive net means A outclasses B on balance across the two phases (a surplus in one phase can offset a deficit in the other); a negative net means B outclasses A on balance. For each (A, F_A), the opponent B picks the formation F_B that minimises A's net (the best counter), and across opponents the worst is the team with the lowest min-net and the best is the team with the highest min-net.
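The min-net rule can be sketched as follows. The nested dictionary layout, team labels, and index values are hypothetical; only the arithmetic follows the definition given in the text.

```python
def net(idx, a, f_a, b, f_b):
    """A's net against B: attack delta plus defend delta for one formation pair."""
    net_attack = idx[a][f_a]["attack"] - idx[b][f_b]["defend"]
    net_defend = idx[a][f_a]["defend"] - idx[b][f_b]["attack"]
    return net_attack + net_defend

def min_net(idx, a, f_a, b):
    """B answers with the formation that minimises A's net (the best counter)."""
    return min((net(idx, a, f_a, b, f_b), f_b) for f_b in idx[b])

def worst_and_best(idx, a, f_a):
    scored = {b: min_net(idx, a, f_a, b) for b in idx if b != a}
    worst = min(scored, key=lambda b: scored[b][0])
    best = max(scored, key=lambda b: scored[b][0])
    return worst, best, scored

# Hypothetical attack/defend indices per team and formation.
idx = {
    "A": {"4-3-3": {"attack": 1.0, "defend": 0.5}},
    "B": {"4-3-3": {"attack": 0.8, "defend": 0.2}, "3-4-3": {"attack": 1.2, "defend": 0.1}},
    "C": {"4-4-2": {"attack": 0.1, "defend": 0.4}, "5-4-1": {"attack": -0.2, "defend": 0.9}},
}
worst, best, scored = worst_and_best(idx, "A", "4-3-3")
```

In this toy league B's best counter is its 3-4-3 and C's is its 5-4-1; B is A's worst matchup and C its best, even though A's net stays positive against both.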
This is the formation-aware version of the matchup question and it produces meaningfully different answers from the bucket-depth heuristic. Real Madrid's worst LaLiga opponent is now Atlético de Madrid in a 3-4-3 with a net mismatch of +0.56 in Madrid's favour — tighter than the comfortable +3.35 the depth heuristic gave against Barcelona, and structurally informative: it is Atlético's three-forward shape that pulls Real Madrid's three defenders out of position even though Atlético's overall depth is well below Real Madrid's. Liverpool's worst Premier League opponent is Newcastle United in a 4-2-3-1 rather than Manchester City — Newcastle's 4-2-3-1 puts a second pivot in front of their back four and a number ten between Liverpool's lines, and the shape happens to be the most efficient counter to Liverpool's preferred 3-5-2 despite Newcastle carrying less depth than City. Brentford's worst opponent is the same Newcastle 4-2-3-1, with a net of −1.09 against Brentford's optimal 3-4-3. The most-vulnerable team in the corpus is Lecce, whose worst-matchup opponent is SSC Napoli in 4-2-3-1 (net −1.78 against Lecce's best response). The league-boss clubs that the depth heuristic flagged as everyone's worst opponent (Real Madrid, Liverpool, Inter, Bayern, Paris SG) remain frequent worst opponents, but the formation-aware analysis surfaces a category of teams — Newcastle, Atlético, SSC Napoli — whose shape choices make them specifically awkward for particular rivals even though they aren't the league's deepest squads. The interactive XI explorer below resolves both worst and best opponent against any chosen formation; switching the formation dropdown reshuffles the panel.
The radar resolves the formation-choice problem squad-by-squad. Real Madrid's optimal attacking formation is 4-3-3 (attack index 1.10), which matches Ancelotti's most-used 2024–25 shape and the Vinícius–Bellingham–Mbappé front. Paris Saint-Germain's is 3-5-2 (1.06), which matches Luis Enrique's shift from 4-3-3 to a back-three system after the Mbappé departure and the Dembélé / Doué / Barcola front. Atalanta's is 3-5-2 (0.79), which matches Gasperini's persistent back-three. Liverpool's is 3-5-2 (0.88), which matches Arne Slot's 2025-26 shape with Konaté drifting wide. Brentford's is 3-4-3 (0.45), which matches Thomas Frank's pre-departure attacking shape. Atlético Madrid's is 4-4-2 (0.73). The match between the model's "optimal" formation per squad and each manager's actually-deployed primary system is, in five of six cases, exact. That is a non-trivial validation: the per-position OVR formula plus the synthetic attack composite, with no tactical labels whatsoever, recovers the same formation choices the elite managerial consensus has independently converged on.
The same construction supplies a spend-vs-formation diagnostic. For each formation, the sum of starters' Transfermarkt values is the starting-XI spend; the sum of starters' orthogonal model predictions is the attribute-implied starting-XI value; the ratio reads as the formation-conditioned over-pricing. Brentford in their optimal 3-4-3 carries a starters-ratio of roughly 2.3×; in their worst 5-4-1 it falls to 1.7×, meaning the formation choice itself meaningfully shifts the spend efficiency, because some formations rely more heavily on the club's most-over-priced bucket. The recruitment use is direct: a club shopping for a player at a specific slot in a specific formation can read the per-slot delta to see where it is systematically paying a premium for output the model says is attribute-recoverable elsewhere on the squad.
Meta-analysis sidebar — three constraints on how to read this
The squad-atlas is a snapshot tool, and three constraints frame its interpretation. First, the input is a single 2025-26 attribute record per player; no time series, no in-season form, no injury record, no contract-window dynamics enter the model. A team whose star striker is currently injured is treated by §12 as if that player will play every match — and a team that has just signed a new player not yet in the EAFC26 corpus will be missing them entirely.
Second, the model has no tactical-instruction signal — no pressing intensity, no possession share, no transition speed. It measures what the squad's attribute composition supports under each formation, not what the manager will actually do with the players on the pitch. The §12.3 finding that the optimal-formation match-up with actually-deployed systems lands five of six is therefore a real validation: the attribute composition is, even without those richer signals, enough to recover the structural formation choice.
Third, the spend-vs-prediction gap is not a managerial criticism. The Premier League's systematic 2× offset is a property of the buyer market — broadcast money, regulatory regime, currency, and the willingness of English clubs to pay above the European market — and the model recovers the offset because its training corpus mixes Premier League prices with the lower-revenue leagues. The right operational read of an over-priced PL bucket is "this is where the league-wide premium concentrates," not "this team is mismanaging its budget."
The §11 position-contamination caveat applies in full at the squad level: every recorded attribute is shaped by the years the player spent training in their existing role. The squad-atlas reads what the recorded attribute composition supports, not what a counterfactual squad-retraining could produce. The tool is honest about what it is — a structural map of the attribute material clubs already have, expressed at the squad level — and the disclosure here is part of how the tool should be deployed rather than a defect specific to this build.
The atlas ships as four CSVs in the Wave 2e artifact directory: squad_style.csv (one row per team), bucket_depth.csv (one row per team × bucket), formation_summary.csv (one row per team × formation, with attack index, defend index, total OVR, and starters spend ratio), and formation_utility.csv (one row per team × formation × slot, with the chosen player at each position). A recruitment department can join these onto its own scouting database to surface candidates whose OVR_as_<bucket> projects highly into a depth-thin slot at a target spend ratio. The composition of the corpus model, the §11 position-transfer construction, and the §12 squad-aggregation layer together form a small but coherent toolkit for moving from individual player attributes to squad-level tactical decisions, and the artifact tables are the directly deployable surface.
Appendix
A1 — Data lineage and acquisition
The unified analysis runs on three primary corpora — Transfermarkt market values, EA Sports FC 26 player attributes, and seven Football Manager editions spanning a decade — joined into one row-keyed-by-player matrix and stratified by gender. This appendix records what each piece is, how it was obtained, and what fraction of the EA universe it eventually covered, because the cross-edition harmonisation argument in A5 and the modelling protocol in A4 only land if the underlying provenance is unambiguous.
Transfermarkt (the primary training target). Transfermarkt's public market-value pages were scraped to 47,308 player records, of which 7,835 men matched cleanly into the EA Sports FC 26 universe. The relevant fields preserved per match are tm_value_eur, tm_highest_value_eur, tm_dob, tm_nation, tm_club, and tm_subposition. Women's Transfermarkt coverage is zero: the sister site soccerdonna.de carries profile pages for some women's players (Bonmatí, Putellas) but no time-series valuations exist for any woman in TM or its sister sites. This is a hard data-availability fact, not a project limitation, and it shaped every cross-gender decision downstream (§A6 §S4).
EA Sports FC 26 (the attribute schema). Player attributes were acquired from the Kaggle mirror flynn28/eafc26-player-database, validated against the official ratings pages at ea.com/games/ea-sports-fc/ratings. The universe is 16,122 men across 45 leagues and 156 nationalities (3_data/01_eafc26_men_full.csv) and 1,447 women across 12 leagues and 72 nationalities (3_data/02_eafc26_women_full.csv). Each row carries six main stat composites (PAC, SHO, PAS, DRI, DEF, PHY), approximately thirty sub-attributes, Overall, Potential, Age, Position, Nation, League, Team, PlayStyles, Weak Foot, Skill Moves, Height, and Weight. Men additionally carry international_reputation, wage_eur, release_clause_eur, value_eur, and club_contract_valid_until_year; EA simply does not publish these for women.
Realized fees (the ground-truth anchor). A 303-row corpus of 2024–2025 window transfers compiled from BBC, ESPN, The Athletic, and club announcements gives the only point where all three signals coexist: EA value_eur, TM market value, and the realised fee. It is the isotonic-calibration set referenced in A4.
Football Manager (the second attribute schema). Seven editions, FM2016 through FM26, totalling 716,448 player-rows on disk (716,027 men, 421 women). FM25 was cancelled in February 2025 [PC Gamer, 2025]; FM26 is therefore the first edition on a ground-up Unity rewrite and the first edition shipping a women's database [Football Manager Blog, 2025]. Acquisition followed three paths: Kaggle community dumps for FM2016 and FM20–FM23 (~683 MB total, three uploaders: ajinkyablaze/football-manager-data, furkanuluta/football-manager-22-complete-player-dataset, and platinum22/foot-ball-manager-2023-dataset); a purpose-built FUTEK.io scrape for FM24 (~3.5 hours wall-clock, full schema); and an EFEM.club scrape for FM26 (~2.5 hours wall-clock, men and women combined, hidden personality block exposed). Match coverage into the EA men's universe, by edition:
| Edition | Acquisition | Players (raw) | Matched to EA men corpus | EA-coverage |
|---|---|---|---|---|
| FM2016 | Kaggle (ajinkyablaze) | full extract | 2,817 | 17.5% |
| FM20 | Kaggle (furkanuluta) | 162,907 | 4,463 | 27.7% |
| FM21 | Kaggle (furkanuluta) | 174,909 | 5,104 | 31.7% |
| FM22 | Kaggle (furkanuluta) | 176,878 | 5,125 | 31.8% |
| FM23 | Kaggle (platinum22 top-leagues) | 8,452 | 1,785 | 11.1% |
| FM24 | FUTEK.io scrape | 16,670 | 12,822 | 79.5% |
| FM26 | EFEM.club scrape | 17,091 (16,670 m + 421 w) | 13,434 | 83.3% |
[section_a_match_coverage.csv]
The match logic, common across both EA↔TM and FM↔EA joins, runs the four-step protocol developed in Wave 1: (i) direct + diacritic-normalised name match (~50% recall), (ii) rapidfuzz token_set_ratio ≥ 85 fallback (+30%), (iii) country-alias filter to prevent ~1,000 false rejections on transliteration variants (Korea Republic ↔ South Korea, Côte d'Ivoire ↔ Ivory Coast, Türkiye ↔ Turkey), (iv) date-of-birth tiebreak on duplicates. The 83.3% FM26↔EA coverage is the upper bound on Wave 2's analytical surface; the lower coverage on the older Kaggle bulk dumps reflects their European top-tier bias rather than matcher failure. Provenance honesty: the Kaggle dumps are community uploads, not SI-authorised redistributions, and are used here under a fair-use academic framing with no row-level redistribution.
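A compact sketch of the four-step protocol follows, under stated assumptions. It uses only the standard library: `token_score` is a `difflib`-based stand-in for rapidfuzz's `token_set_ratio`, not the same scorer, and the country filter is applied up front as a pool restriction rather than as a separate third pass. All player names, dates, and field names are hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher

# Alias pairs taken from the examples in the text.
COUNTRY_ALIASES = {"korea republic": "south korea", "côte d'ivoire": "ivory coast", "türkiye": "turkey"}

def norm(name):
    """Strip diacritics, lowercase, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.lower().split())

def country(value):
    key = value.strip().lower()
    return COUNTRY_ALIASES.get(key, key)

def token_score(a, b):
    """Order-insensitive similarity on a 0-100 scale (stand-in scorer)."""
    ta, tb = sorted(set(norm(a).split())), sorted(set(norm(b).split()))
    return 100 * SequenceMatcher(None, " ".join(ta), " ".join(tb)).ratio()

def match(player, candidates, threshold=85):
    # (iii) country filter after alias mapping, applied as a pool restriction
    pool = [c for c in candidates if country(c["nation"]) == country(player["nation"])]
    # (i) direct match on the diacritic-normalised name
    hits = [c for c in pool if norm(c["name"]) == norm(player["name"])]
    # (ii) fuzzy fallback
    if not hits:
        hits = [c for c in pool if token_score(c["name"], player["name"]) >= threshold]
    # (iv) date-of-birth tiebreak on duplicates
    if len(hits) > 1:
        hits = [c for c in hits if c["dob"] == player["dob"]]
    return hits[0] if len(hits) == 1 else None

candidates = [
    {"id": 1, "name": "Jose Angel Munoz", "nation": "Spain", "dob": "2000-01-01"},
    {"id": 2, "name": "Kim Min-su", "nation": "South Korea", "dob": "1999-05-05"},
    {"id": 3, "name": "Kim Min-su", "nation": "South Korea", "dob": "2003-09-09"},
]
exact = match({"name": "José Ángel Muñoz", "nation": "Spain", "dob": "2000-01-01"}, candidates)
fuzzy = match({"name": "Min-su Kim", "nation": "Korea Republic", "dob": "2003-09-09"}, candidates)
```

The second lookup exercises three steps at once: the country alias keeps both Korean candidates in the pool, the order-insensitive scorer matches the reversed name, and the date of birth resolves the duplicate.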
A2 — Schema audits
The headline schema finding is not that SI has added attributes over the decade — they have been remarkably stable — but that the dumps we hold expose very different slices of the same underlying schema. The bulk Kaggle dumps for FM20–FM23 omit the hidden personality block by default; the purpose-built fan scrapes recover it.
Two structural facts follow. First, the four hidden-personality-bearing editions on disk are FM2016, FM23 (top-leagues), FM24, and FM26; every claim in A5 about hidden mentals is bounded by that set. Second, the visible 36-attribute block (14 technical + 14 mental + 8 physical) is universally present across every edition, which is why the cross-edition drift analysis in A5 can pool all seven editions when the question is about visible attributes and falls back to the four hidden-bearing editions only when the question is about personality.
The FM26 schema enumerated in full: 35 visible attributes + 11 hidden mental attributes + 2 hidden meta-attributes (Versatility, Injury Resistance) = 48 attributes per outfielder, plus 11 goalkeeper-specific visible attributes, plus a 15-cell position-familiarity matrix, plus the composite CA/PA and reputation scores. All are stored on a 1–20 integer scale internally; EFEM exposes them rescaled to 0–100. Visible technical (14): Corners, Crossing, Dribbling, Finishing, First Touch, Free Kicks, Heading, Long Shots, Long Throws, Marking, Passing, Penalty Taking, Tackling, Technique. Visible mental (14): Aggression, Anticipation, Bravery, Composure, Concentration, Decisions, Determination, Flair, Leadership, Off The Ball, Positioning, Teamwork, Vision, Work Rate. Visible physical (8): Acceleration, Agility, Balance, Jumping Reach, Natural Fitness, Pace, Stamina, Strength. Hidden mental (11): Versatility, Important Matches, Loyalty, Ambition, Adaptability, Consistency, Temperament, Professionalism, Sportsmanship, Pressure, Injury Resistance, with FM26-additions Compliance and Fairness. EA Sports collects none of the hidden block. This is the structural reason FM data is additive rather than redundant in A4's union model.
Provenance honesty matters here because the Kaggle dumps are community extracts redistributed without SI's explicit licence; FUTEK and EFEM are public fan sites whose data quality was spot-checked against retail-game baselines for Mbappé, Haaland, Salah, Bonmatí, and Putellas. Both scrapes ran at 1–2 requests per second with a project-identifying User-Agent. The FM26 women's database is brand-new — SI launched 36,000+ players across 14 leagues and 11 nations [Football Manager Blog, 2025]; the 421-row women's sample we hold is the EFEM.club slice scraped during the Wave 2 window, preserving the youth-prospect rows but not the senior tier.
A3 — Completeness, calibration, and the 1–20 scale
The 1,300-researcher claim that Sports Interactive's studio director defends in interviews [Sports Interactive / SportsPro Media, 2024] is hard to verify directly, but the downstream-visible signature is per-row attribute completeness. If a paid researcher network is revising a database monthly, every present column should be nearly fully filled.
The headline numbers: 100.0% on FM26 men (16,670 rows × 34 visible columns = 566,780 cells, every one populated) and 100.0% on FM26 women (421 rows × 34 visible columns = 14,314 cells, every one populated). The FM24 dip is a scraper-side artefact, not an SI completeness gap. The 1,300-researcher claim survives the fill-rate audit.
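The fill-rate audit reduces to counting populated cells over the visible columns. The sketch below assumes rows are dictionaries and that missing values appear as `None` or an empty string; both are illustrative assumptions about the scrape format. The two cell counts reproduce the arithmetic quoted above.

```python
def fill_rate(rows, columns):
    """Return (filled_cells, total_cells, percent_filled) over the given columns."""
    total = len(rows) * len(columns)
    filled = sum(1 for row in rows for col in columns if row.get(col) not in (None, ""))
    return filled, total, 100.0 * filled / total

# Cell counts quoted in the text: rows x 34 visible columns.
men_cells = 16_670 * 34
women_cells = 421 * 34

# Hypothetical two-row frame with one missing cell.
toy = [{"Pace": 70, "Passing": 65}, {"Pace": 80, "Passing": None}]
filled, total, pct = fill_rate(toy, ["Pace", "Passing"])
```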
The men's median sits at CA 67 with σ=10.7; the women's median at CA 69 with σ=6.5. The women's distribution is roughly 1.6 times as concentrated as the men's (10.7 / 6.5 ≈ 1.65). The reason is SI's own design choice: the 1–20 attribute scale is calibrated relative to each database independently [Fuller FM, 2025]. A 20-Pace woman is the fastest in women's football, not the fastest in football overall, and the population means are pulled toward the same per-database centre by construction. This is the single most important warning for any cross-gender modelling work on FM data. The body section §6 builds its women's analysis on within-database percentile features rather than raw 0–100 scores precisely to respect this calibration constraint, and the negative result there (R²=−0.10 cross-gender on EA features that share a pooled 1–99 scale) is the direct consequence of EA's opposite methodological choice.
The men's database has mean age 25.3 (σ=4.9), spans 15–43, and shows the full mid-career bulge any senior-football roster carries. The women's scrape has mean age 18.4 (σ=1.5), 99.5% of rows at age 20 or younger, only 2 women out of 421 above age 21. This is not the full FM26 women's database — SI documents 36,000+ players — it is the EFEM slice scraped during the Wave 2 window. §6 calibrates against the constraint and frames its predictions explicitly as a top-of-pyramid sample.
A4 — Modelling protocol
The modelling pipeline carried from Wave 1 into Wave 2 is deliberately conservative on protocol and aggressive on diagnostics. The estimator, target, and cross-validation scheme are held fixed across every model run reported in §9–§6 and in this appendix; only the feature set varies.
Estimator. sklearn.ensemble.HistGradientBoostingRegressor with max_depth=6, max_iter=600, learning_rate=0.05, l2_regularization=1.0, min_samples_leaf=20, random_state=42. The choice over XGBoost / LightGBM is pragmatic: HistGBR is sklearn-native, uses the same gradient-boosted-tree algorithm family, and removes a deployment dependency. A Ridge baseline is reported alongside in model_cv_results.csv as the conservative linear ceiling: Ridge hits R²=0.757 against HistGBR at R²=0.771 on the Wave 1 EA-only frame, with the convex value-to-rating relationship (a 5-point OVR jump 80→85 worth more than 65→70) accounting for the gap between them. The literature is decisive on this point: tree ensembles reach R²=0.85–0.90 on this problem, while linear models top out around 0.70–0.78 [McHale & Holmes, 2023; Yang, 2025].
Target. log10(tm_value_eur). Trained on the 7,835 men with valid TM market values in Wave 1; on the 6,729 men with both EA and FM26 attribute vectors in Wave 2. A second-stage isotonic regression mapping predicted log-TM to log-realized-fee is fitted on the 303-row realized-fees triangle and reported in the validation summary; per McHale & Holmes 2023, TM under-predicts the superstar tail, and the isotonic step corrects the top end without distorting the median.
Cross-validation. 5-fold KFold, shuffled, random_state=42. Reported metrics: R² in log space, Spearman in € space, MAE back-transformed to €, and median absolute percentage error in € space. The headline Wave 2 result — the EA + FM26 union model lifts R² from 0.663 to 0.785 against the EA-only baseline on the matched 6,729-row frame — is the mean of five folds, with the union model winning every individual fold and the fold envelopes (EA-only 0.643–0.680, union 0.774–0.796) not overlapping [section_c_bakeoff_folds.csv].
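Three of the reported metrics can be sketched with the standard library to make the log-space versus euro-space distinction concrete. The function names and toy values are hypothetical; Spearman is omitted for brevity.

```python
from statistics import mean, median

def r2_log(y_log, yhat_log):
    """R² computed directly on log10 values."""
    m = mean(y_log)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_log, yhat_log))
    ss_tot = sum((a - m) ** 2 for a in y_log)
    return 1 - ss_res / ss_tot

def euro_metrics(y_log, yhat_log):
    """MAE and median absolute percentage error after back-transforming to euros."""
    y_eur = [10 ** v for v in y_log]
    p_eur = [10 ** v for v in yhat_log]
    mae = mean(abs(a - b) for a, b in zip(y_eur, p_eur))
    mdape = 100 * median(abs(a - b) / a for a, b in zip(y_eur, p_eur))
    return mae, mdape

# Hypothetical players worth EUR 100K, 1M, and 10M, with 0.1 log-unit misses on two of them.
y_log = [5.0, 6.0, 7.0]
yhat_log = [5.1, 6.0, 6.9]
r2 = r2_log(y_log, yhat_log)
mae_eur, mdape_pct = euro_metrics(y_log, yhat_log)
```

A miss of 0.1 in log10 space corresponds to an error of roughly 21 to 26 percent in euros, depending on direction, which helps put MAE-log figures around 0.28 next to median APE figures in the mid-40s.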
Feature importance. Permutation importance, computed on a held-out fold with n_repeats=10, is the reporting standard. Gini importance was not used; tree-based feature-competition effects make Gini importance unreliable for ordinal features at high cardinality.
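The idea behind permutation importance can be shown without any library: shuffle one column, re-score, and record the drop. The sketch below is a hand-rolled illustration on a fixed toy predictor rather than a held-out fold of the fitted model; in an sklearn pipeline the corresponding call would be `sklearn.inspection.permutation_importance` with `n_repeats=10`. The data and the `predict` function are hypothetical.

```python
import random
from statistics import mean

def r2(y, yhat):
    m = mean(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - m) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def permutation_importance(predict, X, y, n_repeats=10, seed=42):
    """Mean drop in R² when each column is shuffled, one column at a time."""
    rng = random.Random(seed)
    base = r2(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - r2(y, [predict(row) for row in X_perm]))
        importances.append(mean(drops))
    return importances

# Toy frame: the target depends on column 0 only; column 1 is ignored by the model.
X = [[float(i), float((i * 7) % 11)] for i in range(40)]
y = [2.0 * row[0] for row in X]
predict = lambda row: 2.0 * row[0]
imp = permutation_importance(predict, X, y)
```

The unused column scores exactly zero because shuffling it cannot change any prediction, which is the property that makes the measure usable on high-cardinality ordinal features.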
Sensitivity block — the outlier sweep. A reader-prompted hypothesis was that the model's residual error concentrates at the extremes of the value distribution. The honest test is to trim N players from the top OR bottom of the TM-value-sorted corpus, re-train AND re-test on the trimmed frame, and report four metrics — R², RMSE-log, MAE-log, median APE on raw €. The four-metric picture from section_c_outlier_sweep.csv:
| Cut | n used | R² | RMSE-log | MAE-log | Median APE |
|---|---|---|---|---|---|
| Baseline (full) | 6,729 | 0.717 | 0.363 | 0.280 | 47.1% |
| Remove top-50 | 6,679 | 0.702 | 0.364 | 0.281 | 47.6% |
| Remove top-1000 | 5,729 | 0.544 | 0.350 | 0.272 | 46.4% |
| Remove bottom-50 | 6,679 | 0.714 | 0.358 | 0.278 | 47.7% |
| Remove bottom-1000 | 5,729 | 0.699 | 0.320 | 0.252 | 45.2% |
Removing the top tail makes R² fall sharply (0.717 → 0.544 at N=1,000) but raw error metrics stay flat or improve slightly (MAE-log 0.280 → 0.272) — the R²-mechanics effect: SS_total shrinks faster than SS_residual when the upper tail is cut. Removing the bottom tail tells the opposite story: R² is essentially flat (0.717 → 0.699 at N=1,000) but RMSE-log drops 12%, MAE-log drops 10%, and median APE drops to 45.2%. The bottom of the distribution — the €10K–€50K floor where TM round-numbers dominate any attribute signal — carries most of the absolute error budget. The top-50 are not noise but informative high-end anchors.
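The R²-mechanics effect can be made explicit with a short worked calculation. Assuming the tabled R² and RMSE-log are computed on the same pooled out-of-fold predictions (an assumption; with fold-averaged metrics the identity holds only approximately), the implied variance of the log-target follows from the definition of R²:

$$ R^2 = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}} = 1 - \frac{\text{RMSE}^2}{\operatorname{Var}(y)} \quad\Longrightarrow\quad \operatorname{Var}(y) = \frac{\text{RMSE}^2}{1 - R^2} $$

$$ \text{Baseline: } \frac{0.363^2}{1 - 0.717} \approx 0.466, \qquad \text{top-1000 removed: } \frac{0.350^2}{1 - 0.544} \approx 0.269, \qquad \text{bottom-1000 removed: } \frac{0.320^2}{1 - 0.699} \approx 0.340 $$

On these figures, cutting the top tail shrinks the target variance by about 42% while the squared error falls by only about 7%, so R² drops. Cutting the bottom tail shrinks the variance by about 27% and the squared error by about 22%, nearly in proportion, so R² barely moves even though the absolute error improves.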
Harmonisation audit. The match-quality breakdown of the 12,456 EA↔TM matched men: 92.5% exact name + DOB match (the strongest possible), 6.5% fuzzy name + DOB tolerance, 0.6% last-name + DOB + nationality fallback, 0.3% first-name + DOB + nationality fallback, 0.1% token-overlap + DOB + nationality fallback. Name-score median 100, mean 99.5 of 100. DOB year-difference exactly zero for 12,274 of 12,456 rows (98.5%). Match quality is uniform across the value-quartile distribution: bottom-quartile (€10K–€350K) is 92.2% exact-match, top-quartile (€2M+) is 94.3% exact-match — no quality drift along the dimension the analysis sweeps. The outlier-sweep findings are not a harmonisation artefact.
The OVR reverse-engineering result. Position-stratified OLS — OVR ~ all_sub_attrs standardised — recovers per-position R² of 0.96–0.998 (ovr_formula_weights.csv). EA's Overall is therefore a deterministic position-weighted linear combination of sub-attributes, exactly as EA's published methodology implies. Formally:
$$ \text{OVR}_{p}(x) \approx \text{round}\!\left( \mu_p + \sum_{i=1}^{n} w_{i,p} \cdot x_i \right), \quad R^2_p \in [0.96, 0.998] \quad \text{for } p \in \{\text{GK, DEF, MID, WIDE, FWD}\} $$
where $w_{i,p}$ is the OLS coefficient on standardised sub-attribute $i$ for position $p$, and $\mu_p$ is the per-position intercept. The recovered top-six weights: GK GK_Positioning 1.69 / GK_Reflexes 1.64 / GK_Diving 1.59 / GK_Handling 1.53 / Reactions 1.09 / GK_Kicking 0.37; DEF Standing Tackle 1.27 / Interceptions 0.97 / Reactions 0.79; MID Ball Control 1.87 / Reactions 1.65 / Short Passing 1.57; WIDE Dribbling 1.18 / Ball Control 1.05 / Short Passing 0.88; FWD Finishing 1.41 / Positioning 1.09 / Heading 0.91. The finding matters because it directly explains why position-specific sub-attribute differentials in the value model are real but small: the positional structure is already inside OVR, and a value model with OVR as the dominant feature is seeing the positional regime through OVR rather than missing it.
A5 — Cross-edition harmonisation
The seven Football Manager editions on disk span the FM2016 → FM26 window. The companion document fm_schema_evolution.md enumerates the attribute-name stability across this window in full; the headline finding is that visible attribute names have been remarkably stable, with no technical, mental, or physical attribute renamed or removed across FM20–FM26. The only nominal schema change is Weight (a profile field, never a gameplay attribute) being dropped in FM26 [Football Gaming Zone, 2025].
The harmonisation problem is therefore not nominal but distributional. From historical_drift_rescaling_flags.csv, for each consecutive edition pair on players present in both editions, the mean delta of every shared attribute divided by the standard deviation of the per-player delta distribution:
| Transition | Attrs tested | Attrs shifted >0.5σ | Direction | Flag |
|---|---|---|---|---|
| FM20 → FM21 | 48 | 0% | mixed | player movement |
| FM21 → FM22 | 48 | 0% | mixed | player movement |
| FM22 → FM23 | 47 | 51% | mixed | partial rescaling (GK only) |
| FM23 → FM24 | 32 | 0% | mixed | player movement |
| FM24 → FM26 | 44 | 97.7% | up | systematic rescaling |
Three of the four interior transitions show zero attributes crossing the threshold, which is the signature of normal inter-edition turnover: individual players moving by amounts consistent with within-edition revision noise. The FM22 → FM23 row shows partial rescaling concentrated in goalkeeper attributes: FM Scout's CA guide explicitly notes weight tweaks to GK Passing and First Touch at this boundary, with no outfield change [FM Scout CA Guide, 2024]. The FM24 → FM26 row is categorically different. Of the 44 tested attributes, 43 (97.7%) crossed the 0.5-SD threshold, and the movement is uniformly in the same direction.
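The drift statistic behind the table can be sketched as follows: for each shared attribute, the mean per-player delta is divided by the standard deviation of the per-player deltas, and the attribute is flagged when the absolute value exceeds 0.5. The nested-dictionary layout (attribute, then player id, then value) and the toy numbers are hypothetical.

```python
from statistics import mean, stdev

def drift_score(old, new):
    """Mean per-player delta divided by the SD of the per-player deltas."""
    deltas = [new[p] - old[p] for p in old.keys() & new.keys()]
    return mean(deltas) / stdev(deltas)

def flag_transition(old_attrs, new_attrs, threshold=0.5):
    """Score every shared attribute and return the share that crossed the threshold."""
    shared = old_attrs.keys() & new_attrs.keys()
    scores = {a: drift_score(old_attrs[a], new_attrs[a]) for a in shared}
    shifted = [a for a, s in scores.items() if abs(s) > threshold]
    return scores, len(shifted) / len(scores)

# Hypothetical pair of editions, four players present in both.
old_attrs = {
    "Pace": {"p1": 10, "p2": 12, "p3": 14, "p4": 11},
    "Passing": {"p1": 10, "p2": 12, "p3": 14, "p4": 11},
}
new_attrs = {
    "Pace": {"p1": 11, "p2": 11, "p3": 14, "p4": 12},      # individual revisions
    "Passing": {"p1": 15, "p2": 16, "p3": 20, "p4": 15},   # everyone moves up together
}
scores, share_shifted = flag_transition(old_attrs, new_attrs)
```

Individual revisions leave the score near zero because positive and negative deltas offset; a rescaling moves every player the same way, so the mean delta is large relative to its spread.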
The individual-player trace from elite_ca_matrix.csv is the clinching evidence: Mbappé CA 188 → 98 (Δ −90), Messi 185 → 90 (Δ −95), Salah 180 → 93 (Δ −87), Vinícius 181 → 91 (Δ −90), Bellingham 168 → 91 (Δ −77). The same internal SI scale, the same researcher network, the same real-world football season — and the rating roughly halves. Either every elite player simultaneously lost half their footballing ability in twelve months, or the underlying CA scale was structurally rebased.
The mechanism is three structural changes hitting the FM26 boundary simultaneously. First, an engine rewrite: FM25 was cancelled in February 2025, with SI publicly conceding the project "did not meet internal quality standards"; the FM26 reveal followed in March 2025 with confirmation of a full Unity migration — the first wholesale engine replacement since the 2004 Championship Manager fork [PC Gamer, 2025; ESPN, 2025; VGC, 2025]. Second, a role-system overhaul: FM26 collapsed approximately sixty named roles into a dual In-Possession / Out-of-Possession structure, with Mezzala, Enganche, Trequartista, Segundo Volante, and Carrilero removed as named roles. Because CA is mechanically a weighted sum of attributes against the best-role weight vector, reshaping the role table reshapes the CA function even with identical attribute values [FM Scout FM26 Player Roles, 2025]. Third, the women's database: FM26 was the first edition to ship women's football at launch, and SI's own guidance is that "the women's database retains the same 20-point scale but is calibrated relative to each side" [Fuller FM, 2025]. Some downward compression of the men's distribution is almost forced by the constraint that both databases share a visible 1–20 surface scale.
We cannot disentangle which of the three drove the rescaling; SI shipped three structural changes simultaneously. The pipeline rule of thumb that follows is unambiguous: build features from raw 1–20 attributes, never from CA/PA, when joining across editions; treat the FM26 women's data as a separate cohort and normalise within women's database; if CA must be used, z-score within (edition × gender database) before pooling.
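The last clause of the rule, z-scoring CA within each (edition × gender database) cell before pooling, can be sketched as below. The top value in each cell is the Mbappé CA quoted above (188 in FM24, 98 in FM26); the other rows, the field names, and the use of the population standard deviation are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

def zscore_within(rows, value="ca", keys=("edition", "database")):
    """Standardise `value` inside each group defined by `keys`."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[k] for k in keys)].append(r[value])
    stats = {g: (mean(v), pstdev(v)) for g, v in groups.items()}
    out = []
    for r in rows:
        mu, sd = stats[tuple(r[k] for k in keys)]
        out.append({**r, f"{value}_z": (r[value] - mu) / sd})
    return out

rows = [
    {"edition": "FM24", "database": "men", "ca": 188},
    {"edition": "FM24", "database": "men", "ca": 150},
    {"edition": "FM24", "database": "men", "ca": 112},
    {"edition": "FM26", "database": "men", "ca": 98},
    {"edition": "FM26", "database": "men", "ca": 79},
    {"edition": "FM26", "database": "men", "ca": 60},
]
z = zscore_within(rows)
```

In this toy frame the rebase is a pure rescaling, so the top player lands on the same z-score in both editions even though the raw CA nearly halves.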
A6 — Supporting findings retained from Wave 1
The five supporting findings below are load-bearing for Wave 1's design decisions and are retained verbatim — with light editorial framing — into the unified appendix as honest provenance. The original analyses live in FINAL/4_analysis/phase2_supporting/.
S1 — EA value_eur is informationally subsumed by TM
Before any modelling, we audited EA's published value_eur against Transfermarkt market values and realised fees by fitting three 5-fold CV models on the 303-row realised-fees triangle: EA-only Ridge (Spearman 0.564 vs realised fee, median APE 62.6%); TM-only Ridge (Spearman 0.735, MAPE 45.2%); and a hybrid of EA + TM + age + position + league (Spearman 0.725–0.739, MAPE 47.9–54.9%). The hybrid adds no lift over TM-only: its Spearman range brackets the TM-only figure and its error is no lower. EA's value_eur is therefore informationally subsumed by Transfermarkt market value, and it is itself a deterministic function of OVR + age + reputation per EA's internal formula. Feeding it into the value model would be circular, since value_eur only re-encodes OVR, age, and reputation. This is the load-bearing methodological decision behind the framing of the entire unified piece: EA's contribution is the attribute schema, not the value field.
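For readers who want to reproduce the scoring, the two metrics quoted above can be sketched as follows. This assumes Spearman is computed as the Pearson correlation of ranks and that the percentage error is the median of |prediction − fee| / fee; the function names and the toy fees are hypothetical and are not rows from the realised-fees triangle.

```python
import numpy as np
import pandas as pd


def spearman(pred, actual):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    return float(pd.Series(pred).rank().corr(pd.Series(actual).rank()))


def median_ape(pred, actual):
    """Median absolute percentage error, returned in percent."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.median(np.abs(pred - actual) / actual) * 100)


# Toy fees in EUR millions; illustrative only.
realised_fee = [10.0, 20.0, 40.0, 80.0]
model_pred = [12.0, 15.0, 50.0, 60.0]

rho = spearman(model_pred, realised_fee)
err = median_ape(model_pred, realised_fee)
print(round(rho, 3), round(err, 1))  # 1.0 25.0
```

The toy example also shows why both metrics are reported: a model can rank every transfer correctly (Spearman 1.0) while still missing the fee level by a quarter on the typical deal.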
S2 — The canonical age curve, observed in motion
A 3,591-player panel matched across two TM snapshots (Sep 2025 + Feb 2026) recovered a clean monotone age curve from real market dynamics:
| Age | n | Mean Δlog TM (6 mo) | % rising |
|---|---|---|---|
| ≤21 | 412 | +0.204 | 52% |
| 22–25 | 1,108 | +0.034 | 38% |
| 26–29 | 1,103 | −0.013 | 27% |
| 30–33 | 678 | −0.087 | 4% |
| 34+ | 292 | −0.104 | 1% |
The panel validates the age + age² feature design and surfaces the Ligue 1 +9% rebound and Saudi −5.6% correction as live market signals.
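A table of this shape can be produced from a two-snapshot panel with a short pandas aggregation. The sketch below is illustrative only: the column names and the six toy players are hypothetical, and it assumes Δlog means the difference of natural logs between the two snapshots.

```python
import numpy as np
import pandas as pd

# Hypothetical matched panel: one row per player, values in EUR.
panel = pd.DataFrame({
    "age": [19, 20, 24, 27, 31, 35],
    "tm_sep_2025": [1_000_000, 2_000_000, 5_000_000, 8_000_000, 4_000_000, 1_000_000],
    "tm_feb_2026": [2_000_000, 2_000_000, 5_000_000, 4_000_000, 2_000_000, 500_000],
})

# Six-month change in log market value (natural log is an assumption).
panel["dlog_tm"] = np.log(panel["tm_feb_2026"]) - np.log(panel["tm_sep_2025"])
panel["rising"] = panel["dlog_tm"] > 0

# Right-closed bins reproduce the age bands used in the table.
bins = [0, 21, 25, 29, 33, 120]
labels = ["<=21", "22-25", "26-29", "30-33", "34+"]
panel["age_band"] = pd.cut(panel["age"], bins=bins, labels=labels)

summary = panel.groupby("age_band", observed=True).agg(
    n=("age", "size"),
    mean_dlog=("dlog_tm", "mean"),
    pct_rising=("rising", "mean"),
)
summary["pct_rising"] *= 100  # share of players whose value rose, in percent
print(summary)
```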
S3 — Position-specific pricing regimes exist
The 303-row realised-fees triangle, stratified by position group, shows three things. GK is the worst-priced position for both EA and TM (median APE 214% and 71% respectively, on a small sample of n=7). WIDE players give EA its best rank-order (Spearman 0.637), consistent with pace and dribbling being EA's modelling strengths. TM beats EA at every position, and the gap is largest at GK. This drove the Phase 3 decision to fit position-specific models when reporting sub-attribute importance and to reverse-engineer EA's positional OVR formula (A4).
S4 — Women's Transfermarkt coverage is zero
A marquee TM-history scrape attempted to pull historical valuations for Aitana Bonmatí and Alexia Putellas, and both returned no data. The gap was confirmed on the sister site soccerdonna.de, where profile pages exist but no time series is available for any female player. This shaped the cross-gender experimental design: the joint-gender model framework of [Bransen et al., 2024] requires women's targets, which do not exist. The only feasible test is naive transfer: train on men, apply to women, validate qualitatively. The finding recurs as the binding constraint on §8's calibration story (the 27-row women's transfer-fee corpus that the FM-trained R²=0.91 within-gender model still cannot anchor to a euro curve).
S5 — The "breakout premium" hypothesis falsified
An earlier phase claimed that young (≤25) players moving from lower-tier leagues into the top five leagues cleared at 5–20× their pre-transfer TM value, an arbitrage thesis for sports-tech investment. On audit, the median fee/TM ratio across 107 such transfers was 1.00, with a P90 of 1.81, so the "breakout multiplier" was overclaimed. Bundesliga clubs do pay the steepest prospect premium (median 1.80×) and Italian clubs systematically under-pay (median 0.91×), but there is no large general arbitrage. The finding is retained as honest provenance: a hypothesis the data falsified, retired transparently.
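The audit reduces to three summary statistics on the fee/TM ratio. The sketch below shows the calculation on five toy transfers; the numbers are hypothetical and are not drawn from the 107-transfer sample.

```python
import numpy as np

# Toy (fee, pre-transfer TM value) pairs in EUR millions; illustrative only.
fees = np.array([5.0, 8.0, 12.0, 20.0, 9.0])
tm_values = np.array([5.0, 10.0, 10.0, 10.0, 9.0])

ratios = fees / tm_values

median_ratio = float(np.median(ratios))
p90_ratio = float(np.percentile(ratios, 90))  # linear interpolation by default
share_at_5x = float(np.mean(ratios >= 5.0))   # share clearing the claimed 5x floor

print(median_ratio, round(p90_ratio, 2), share_at_5x)  # 1.0 1.68 0.0
```

A thesis of 5–20× multiples would need a substantial share of ratios at or above 5; a median near 1 with a P90 below 2 rules that out.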
A7 — Literature review
The unified piece sits in a roughly decade-long lineage of papers regressing Transfermarkt market values on FIFA/EA attributes. The headline empirical fact is consistent across the literature: tree-based ensembles (Random Forest, XGBoost, LightGBM, CatBoost, GBDT) reach R² ≈ 0.85–0.90 when predicting log-TM-value from FIFA Overall + roughly 30 sub-attributes + age + position + league + international caps. Linear models (OLS, Ridge) top out at R² 0.70–0.78 because the value-to-rating relationship is convex.
The benchmark study is [McHale & Holmes, 2023] — Estimating transfer fees of professional footballers using advanced performance metrics and machine learning, European Journal of Operational Research 306(1), 389–399. McHale and Holmes train on realised transfer fees rather than TM values and show that combining FIFA expert ratings with VAEP-style action values plus xG-based plus/minus beats TM values on average, with TM still winning for fees above €20M (the superstar tail). [Yang, 2025] — GBDT-based player valuation, Decision Analytics Journal — runs a six-model bake-off on FIFA-22 with a 16k-player sample comparable to ours and reports GBDT R² = 0.901. The most recent entry is [Van Damme et al., 2025] — Forecasting football transfer market values with random forest and XGBoost, arXiv:2502.07528 — which adds 1-year-ahead forecasts using a randomised feature ensemble and lands near the same ceiling. [Li et al., 2026] — SHAP-based interpretation of player valuation features, Sport, Business and Management 16(2) — sharpens the feature-attribution side and confirms OVR's dominance in the SHAP decomposition.
The cross-gender thread is thinner. [Bransen et al., 2024] — An xG of Their Own, Journal of Sport Management 38(2) — establishes the cleanest "gender-aware feature beats single-gender training" result on shot data. [Pappalardo et al., 2021] — Understanding gender differences in professional European football through machine learning interpretability, Scientific Reports — finds feature-importance rankings transfer across genders while magnitudes do not. [Coates & Webber, 2023] — Pay and Performance in Men's and Women's Football: Comparing the MLS and NWSL, International Journal of Sport Finance 18(4) — is the closest analogue to the unified piece's economic question, finding determinants of win production statistically identical across genders but wage-elasticity differing by an order of magnitude. The Wave 1 42× shrink finding and the Wave 2 R²=−0.10 cross-gender EA result are both direct empirical quantifications of the Coates-Webber prediction.
On the structural-bias side, [Ezzeddine et al., 2025] — Pricing football transfers using video gaming data, Journal of Sports Analytics — argues that EA's ratings are calibrated against broadcast/media consensus while FM's are calibrated against match-engine outcomes; the two priors → two independent measurements claim is the load-bearing philosophical underpinning of the bake-off in §9.
For league-effect modelling, the CIES Football Observatory tradition (Poli, Ravenel, Besson) sits at ~85% correlation with realised fees and is the proprietary academic ceiling the unified piece compares against in §1. Sports Interactive's published women's-database guidance [Fuller FM, 2025; footballmanager.com, 2025] — explicit that the 1–20 scale is calibrated within-gender, that "a 20-Pace woman is the fastest in women's football, not the fastest in football overall" — is the methodological warning the unified piece's §6 retraction sidebar respects.
A8 — Artifact index
Every numeric claim in the unified piece is traceable to an artifact CSV. The complete artifact corpus lives in two folders.
Wave 2 measurement artifacts (/Users/pyl/Desktop/eafc26_project/wave2_fm/3_artifacts/):
Section A — lineage: section_a_edition_timeline.csv, section_a_schema_growth.csv, section_a_fm26_ca_dist.csv, section_a_fm26_age_dist.csv, section_a_league_coverage_men.csv, section_a_completeness.csv, section_a_match_coverage.csv.
Section B — head-to-head: section_b_corr_matrix.csv, section_b_coverage.csv, section_b_disagreement_ea_overrates.csv, section_b_disagreement_fm_overrates.csv, section_b_distribution.csv, section_b_league_position_overrate.csv, section_b_per_position_corr.csv, section_b_position_value.csv, section_b_shape_stats.csv.
Section C — decision: section_c_bakeoff_folds.csv, section_c_parity_predictions.csv, section_c_mape_by_ovr.csv, section_c_marquee.csv, section_c_feature_importance_top20.csv, section_c_outlier_sweep.csv, section_c_calibration_deciles.csv.
Section D — overflow: drift_fm24_to_fm26.csv, historical_drift_rescaling_flags.csv, elite_ca_matrix.csv, drift_vs_tm_change.csv, bake_off_hidden_mentals.csv, hidden_mental_drop_one.csv, bigmatch_deciles.csv, versatility_buckets.csv, archetype_centroids.csv, archetype_centroids_z.csv, archetype_summary.csv, archetype_pca_projection.csv.
Section E — women's: section_e_within_model.csv, section_e_ovr_ca_scatter.csv, section_e_parity.csv, section_e_calibration.csv, section_e_marquee_top20.csv, women_predictions_ea.csv, women_predictions_fm26.csv, women_rank_compare.csv.
Wave 1 measurement artifacts (/Users/pyl/Desktop/eafc26_project/FINAL/5_artifacts/): model_cv_results.csv, feature_importance.csv, ovr_formula_weights.csv, panel_by_age.csv, panel_by_league.csv, phase2_hybrid_cv_results.csv, position_subattr_differential.csv, position_subattr_raw.csv, position_summary_phase2.csv, league_exchange_rates.csv, league_trajectory_6mo.csv, breakout_summary.csv, multi_league_validation.csv, plus the trained model model_men.pkl.
Analysis scripts (/Users/pyl/Desktop/eafc26_project/wave2_fm/2_analysis/): 01_fetch_efem_sitemap.py → 03_scrape_efem.py and 04_scrape_futek.py (acquisition); 07_join_corpus.py (matching); 08_bake_off.py (§C); 09_drift_fm24_fm26.py, 14b_elite_trajectory_v2.py, 15_historical_drift_table.py (§D); 10_cross_gender_fm.py, 26_section_e_analysis.py (§8); 18_hidden_mentals_analysis.py (hidden block bake-off); 20_section_a_analysis.py through 33_section_c_sweep_plotly.py (figure builders). All scripts run end-to-end on Python 3.14 with scikit-learn 1.8, pandas 2.x, matplotlib, plotly, and kaleido.
End of document.