About numbertwenty

The Name

numbertwenty is named in tribute to Diogo Jota, Liverpool's Portuguese forward, who wore the number 20 shirt and played a key role in delivering the club's historic 20ᵗʰ league title just weeks before his tragic passing in a car accident. He died together with his brother, André Silva.

This project is dedicated to their memory, celebrating the attacking spirit and relentless energy Diogo brought to the game.

The Idea

Football debates often revolve around deserved results. Supporters have strong opinions, but it is mathematically difficult to quantify the impression a match leaves. How can we provide an objective answer to a debate that is often deeply subjective?

Beyond xG: Limitations

Expected goals (xG) provide valuable long-term insights into team performance, but they sometimes fail to capture the true texture of a single match.

For instance, a model that strictly relies on xG and derived Poisson estimates will always point to a deserved winner — the team with the higher xG. In reality, roughly 27–32% of matches end in a draw, depending on the competition.

Statistical Similarity

To go beyond xG, numbertwenty searches for statistical neighbors — past matches with similar profiles within the same or related league(s). By comparing a match to its closest historical equivalents, the system better captures local context and subtle dynamics that xG alone may miss. This approach aims to account for the aleatoric uncertainty of football, the irreducible randomness of any single match outcome, rather than average it away.

The model focuses on features like big opportunities, shots on target, offensive possessions, passing sequences, defensive actions and more — providing a compact yet powerful summary of match outcome and style. While xG helps calibrate the features, the system captures the nuances of each match more faithfully, allowing it to reflect draws, home or away wins, and rare match patterns more realistically.

How Similarity is Measured

For each match, the model does not simply compare raw statistics between teams. Instead, it constructs three complementary views for each feature pair (team vs. opponent):

  • Difference: the raw gap between the two sides (e.g., shots on target team 1 − shots on target team 2). This indicates direct offensive or defensive dominance.
  • Rate: the signed share, defined as (team 1 − team 2) / (team 1 + team 2). This normalises for match intensity and indicates the direction of dominance regardless of scale.
  • Sum: the total volume produced by both sides combined. This captures the overall intensity of the match.

Combining these three dimensions for each statistic gives a richer fingerprint of a match than any single number could.
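The three views described above can be sketched in a few lines. This is a minimal illustration; the function name and dictionary keys are invented for the example, not the project's actual API.

```python
# A minimal sketch of the three feature views for one statistic
# (e.g., shots on target). Names and keys are illustrative.
def feature_views(team: float, opp: float) -> dict:
    total = team + opp
    return {
        "diff": team - opp,                              # direct dominance
        "rate": (team - opp) / total if total else 0.0,  # signed share in [-1, 1]
        "sum": total,                                    # overall match intensity
    }

# 7 shots on target against 3 gives diff = 4, rate = 0.4, sum = 10.
```

The rate view is what lets two matches of very different intensity (say, 14 shots vs 6 and 7 shots vs 3) still look alike in terms of dominance.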

Adaptive Feature Weighting

Feature importance is not fixed across time. The model adjusts the weight of each feature using a causal temporal correlation engine: at any given moment, features are weighted by their recent correlation with match outcomes, measured on all past matches up to that point only. Weights therefore drift over time, reflecting how certain metrics become more or less relevant with recent match trends.

To avoid overfitting to short competition histories, feature weights are computed at three nested levels and then blended via shrinkage:

  1. Global level: a baseline computed across all competitions in the database. Serves as the final anchor for any competition with very little history.
  2. Similar-competition level: a peer-group baseline computed from related competitions. This intermediate anchor is blended in when the competition has fewer matches.
  3. Competition level: raw weighted correlation computed on the competition's own historical matches. Used fully once the competition has accumulated enough data (≥ 1 500 match-observations).
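The exact shrinkage schedule is not documented here, so the sketch below uses a simple linear blend by sample size with hypothetical weight values; it only illustrates the idea of nesting competition, peer, and global anchors.

```python
def blended_weight(n_matches: int, comp_w: float, peer_w: float,
                   global_w: float, full_data: int = 1500) -> float:
    """Shrink a competition's own feature weight toward its anchors.

    Linear-blend sketch: trust the competition's raw correlation in
    proportion to how much history it has, and fall back on an equal
    (assumed) mix of the peer-group and global baselines otherwise.
    """
    lam = min(n_matches / full_data, 1.0)   # trust in the competition's own data
    anchor = 0.5 * peer_w + 0.5 * global_w  # assumed mix of the two fallbacks
    return lam * comp_w + (1.0 - lam) * anchor
```

A brand-new competition (`n_matches = 0`) gets pure anchor weights; at 1 500 match-observations the raw competition-level correlation is used in full.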

Finding the Neighbors: a Causal Index

For each match, the model retrieves the k closest past matches using the Mahalanobis distance in the transformed feature space described above. In practice, the search is an exact L2 (Euclidean) search over Mahalanobis-whitened vectors, operating on the feature-weighted and causally scaled representation.

Causality is a hard constraint: a match can only be compared to matches that happened before it. The scaling (standardisation) applied to features is itself recomputed at each point in time using only past data, so that no future information ever leaks into the neighbor search.
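A sketch of the causal search, assuming a diagonal covariance (under which the Mahalanobis distance reduces to L2 on standardised features); the full model presumably whitens with the complete covariance and applies its feature weights on top.

```python
import numpy as np

def causal_neighbors(X: np.ndarray, query_idx: int, k: int) -> np.ndarray:
    """Indices of the k nearest matches played strictly before X[query_idx].

    Rows of X are assumed to be in chronological order.
    """
    past = X[:query_idx]                          # hard causality constraint
    mu = past.mean(axis=0)
    sigma = past.std(axis=0) + 1e-9               # scaling fit on past data only
    z_past = (past - mu) / sigma
    z_query = (X[query_idx] - mu) / sigma
    d = np.linalg.norm(z_past - z_query, axis=1)  # exact L2 on scaled vectors
    return np.argsort(d)[:k]
```

Because `mu` and `sigma` are recomputed from the slice `X[:query_idx]`, no statistic from a later match can influence the search, mirroring the no-leakage guarantee described above.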

Neighbors are also drawn not only from the same competition, but from a configured set of similar competitions, weighted and scaled consistently. This widens the pool of comparable matches when a competition is young or inherently uncommon, without sacrificing relevance. For competitions with a large volume of matches (e.g., the Premier League), neighbors are drawn exclusively from within the same competition.

From Neighbors to Probabilities

Once the k nearest neighbors of a match are identified, each neighbor votes for an outcome (team 1 win, draw, or team 2 win) according to what actually happened in that match. Votes are weighted by a softmax of the distances: in plain terms, closer neighbors carry exponentially more weight.
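In code, the softmax vote looks roughly like this; the temperature `tau` and the outcome labels are assumptions for the sketch.

```python
import math

def vote_probabilities(distances, outcomes, tau=1.0):
    """Softmax over negative distances: closer neighbors weigh exponentially more."""
    w = [math.exp(-d / tau) for d in distances]   # unnormalised softmax weights
    total = sum(w)
    probs = {"team1": 0.0, "draw": 0.0, "team2": 0.0}
    for weight, outcome in zip(w, outcomes):
        probs[outcome] += weight / total          # each neighbor votes its result
    return probs
```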

The raw probability estimates are then calibrated to ensure that their predicted distribution of outcomes matches the real observed distribution in that competition. This prevents the overrepresentation of the most common outcomes, such as home wins, when using many neighbors, while still respecting the real-world proportions of wins, draws, and losses — without flattening the variation between individual games. This calibration is updated causally week by week.
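The site's exact calibration method is not documented. One simple way to match the observed outcome distribution while keeping per-match variation is a per-class rescaling followed by renormalisation, sketched here with hypothetical outcome keys.

```python
def calibrate(match_probs, observed_rates):
    """Rescale each outcome class so the average predicted rate roughly
    matches the competition's observed rate, then renormalise each match.

    An illustrative sketch, not the project's confirmed procedure.
    """
    n = len(match_probs)
    avg = {o: sum(p[o] for p in match_probs) / n for o in observed_rates}
    scale = {o: observed_rates[o] / avg[o] for o in observed_rates}
    calibrated = []
    for p in match_probs:
        q = {o: p[o] * scale[o] for o in p}   # shift toward observed rates
        z = sum(q.values())
        calibrated.append({o: v / z for o, v in q.items()})
    return calibrated
```

Run causally, this would use only the outcome rates observed up to the current week, consistent with the week-by-week updates described above.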

Interpretation

In practice, the model aims to quantify actual offensive dominance. It answers the question: given my offensive production and that of my opponent, was my team's performance sufficient — and what result would have been most deserved?

The model relies on full-match aggregated statistics and can evaluate whether, given the number of big chances, shots inside the opponent box, and similar metrics, the final score aligns with what a comparable body of matches would suggest.

However, it does not capture minute-by-minute evolution: it cannot know if a goal came early or late, or how match momentum shifted. This is a known limitation, but it also reflects actual offensive output: a team that sat back and defended after scoring early still produced — or failed to produce — over 90 minutes of football, and the model reads that honestly.

Analysis shows that roughly 34% of matches, about a third, end with a result that goes against what the statistics would suggest. Allowing a 5% margin around the most probable outcome, this falls to 26%. Football is inherently random, and the model is designed to surface that reality rather than hide it.

Statistics Bars & Percentiles

On each match page (and in the match comparator), a set of statistic bars displays each team's performance for the key metrics. The fill of each bar does not represent the raw value — it represents the percentile of that performance relative to a reference scope and period chosen by the user: the competition only, all competitions, the last 365 days, or the full historical dataset.

Crucially, these percentiles are computed up to the date of the match using only data available at that point in time. This means two matches from the same competition with identical raw statistics can show different percentile fills if they happened at different moments in the season — because the pool of comparison matches grows as the season progresses. This is an intentional design choice: it ensures percentiles are always fair and historically grounded.
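A causal percentile of this kind can be computed against the sorted pool of values from earlier matches only; the 50.0 default for an empty pool is an assumption of this sketch.

```python
import bisect

def causal_percentile(value: float, history: list[float]) -> float:
    """Percentile of `value` against sorted stats from matches played earlier."""
    if not history:
        return 50.0                    # assumed default with no reference pool
    rank = bisect.bisect_left(history, value)
    return 100.0 * rank / len(history)
```

The growing `history` pool is exactly why identical raw statistics can fill the bars differently at different points in the season.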

This also explains why the nearest neighbors of a match can appear to have different statistics, whether in raw values or in percentile bars. The neighbor search operates on the transformed feature space (differences, rates, sums — scaled and weighted) rather than on raw counts. Two matches with different absolute shot tallies can be very close neighbors if their rates and differences are similar.

Fair Elo

Unlike classical rankings that award all points to the winner, the Fair Elo system distributes points according to the fair probabilities of the match evaluated after the game.

This means that even if a team wins, it may gain fewer points than its opponent if the statistical evidence suggests it was dominated. Conversely, a team that loses can still earn points if the fair probabilities indicate it was the stronger side on the day. The system thus rewards merit rather than raw results.

Formally, the rating update follows the Glicko-2 framework extended with a Bradley-Terry-Davidson model for three outcomes (win, draw, loss). Because Glicko-2 tracks each team's rating deviation and volatility, a team on an erratic run will see larger rating swings than a consistent one with the same average level.
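As a reference point, the Bradley-Terry-Davidson outcome probabilities for two Elo-style ratings look like this. The draw-propensity parameter `nu` and the 400-point logistic scale are standard conventions, not the project's confirmed values.

```python
import math

def btd_probs(r1: float, r2: float, nu: float = 1.0):
    """(win, draw, loss) probabilities under a Bradley-Terry-Davidson model."""
    p1 = 10 ** (r1 / 400)              # Elo-style strength parameters
    p2 = 10 ** (r2 / 400)
    draw_term = nu * math.sqrt(p1 * p2)
    z = p1 + p2 + draw_term
    return p1 / z, draw_term / z, p2 / z
```

With equal ratings and `nu = 1`, each outcome gets probability 1/3; raising `nu` inflates the draw share. Because the draw term is never zero, no outcome ever reaches probability 1, which is what compresses the Elo range in extreme results.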

This dynamic ensures that no team ever receives all points or zero points outright, even in extreme results. The system naturally compresses the Elo range: exceptionally strong teams almost never capture 100% of the points, and weaker teams almost never capture zero.

Matches between teams from different competitions — such as UEFA club competitions — help calibrate inter-league transfers, propagating rating information across domestic boundaries. International matches between confederations (e.g., World Cup) further calibrate confederation-level ratings, notably for national teams.

Power Ranking: A Power Ranking table combines Fair Elo with current form to provide a more dynamic snapshot of team strength. This compensates for the fact that Elo ratings update gradually and can lag behind sharp performance trends.

Competitions & Rankings

On each competition page, team rankings are presented in three modes:

  • Actual: based on real match results as they stand in the official table.
  • Expected: points calculated from fair probabilities — specifically 3 × P(win) + 1 × P(draw) — for each played match, regardless of the actual result.
  • Projected: end-of-season projection combining current actual standings with predicted outcomes for all remaining fixtures.

For the projected rankings, two display modes are available. The continuous mode (often seen with xG tables) distributes fractional points proportionally according to outcome probabilities, giving a precise probabilistic reflection of expected final standings. The majority result mode assigns full points to the single most probable outcome per match, producing a more discrete ranking with starker values — useful for ordinal comparisons but less realistic as a representation of uncertainty.
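Both display modes reduce to a few lines per remaining fixture. The function below is an illustrative sketch with assumed home/away labels; its continuous branch is also the 3 × P(win) + 1 × P(draw) formula used for the Expected ranking.

```python
def project_points(p_win, p_draw, p_loss, mode="continuous"):
    """Points added to (home, away) for one unplayed fixture."""
    if mode == "continuous":
        # fractional points, probability-weighted: 3*P(win) + 1*P(draw)
        return 3 * p_win + p_draw, 3 * p_loss + p_draw
    # "majority": full points to the single most probable outcome
    best = max((p_win, "home"), (p_draw, "draw"), (p_loss, "away"))[1]
    return {"home": (3, 0), "draw": (1, 1), "away": (0, 3)}[best]
```

Summing the chosen branch over all remaining fixtures and adding the current actual standings yields the projected table.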

Match Analysis

Each match page includes a short textual analysis of the game, highlighting the strengths and weaknesses of each side, their style of play, and the key moments that shaped the contest. The goal is to give a concise, readable summary useful for anyone who did not watch the match.

These analyses are generated automatically by a local language model. Like any generative system, it can occasionally produce small factual inaccuracies or unusual phrasings. Additionally, the textual analysis is generated independently from the fair probability model: the LLM does not receive the computed probabilities as input, so its conclusions may sometimes diverge from the fair probabilities shown on the card. Both perspectives are complementary and each has its own basis.

Transparency

numbertwenty is not a prediction engine. It illustrates how outcomes can vary widely even between games with nearly identical stats. Football is inherently chaotic, and randomness is part of its beauty.

Each match is presented through a match card. Cards display fair probabilities — either predicted before the match (for upcoming games) or observed after the match (for past games). Two badges are shown:

  • Uniqueness Score (0–10): reflects the average similarity between the match and its closest historical equivalents. A high score means the match looks very much like others in the competition — a standard game of football. A low score signals a rare or atypical match in the competition's history.
  • Predictability Score (0–10): measures whether the pre-match predicted probabilities (based on team ratings and form) align with the fair probabilities computed post-match from the actual statistics. A high value indicates the observed dominance was consistent with expectations; a low value highlights a surprising gap between prediction and reality. This does not imply the real outcome was predictable — only that the offensive balance of the match was, or was not, in line with what the pre-match model anticipated.

Taken together, these two scores offer a quick read on whether a result stands within the statistical reality of the match, or emerges as a genuine outlier.

The model is fully explainable — no black-box neural networks, every calculation can be traced. It remains data-driven while embracing the unpredictability that makes football fascinating (and frustrating).

Known Limitations

Venue Information

Evolving Model

The current post-match analysis model is not yet the definitive version. Both the set of features used and their temporal weights will continue to be refined as internal research progresses. As a result, the fair probabilities shown today may shift slightly in future updates as the model improves. All historical values will be recomputed consistently whenever a meaningful update is deployed.

Minute-by-Minute Dynamics

The model works on full-match aggregated statistics and has no visibility into how a match unfolded over time. It cannot distinguish a goal scored in the 2nd minute from one in the 90th, nor can it track momentum shifts, red-card effects, or tactical adjustments mid-game. This is an inherent limitation of aggregate statistics, and one the model makes no attempt to conceal.

Diogo Jota, Liverpool's number 20, commemorated in the football analytics project 'numbertwenty'

Forever our number 20