How Ball-by-Ball Prediction Works
A comprehensive guide to understanding the machine learning system that predicts cricket matches through individual ball outcomes.
Why Predict Individual Balls?
The fundamental insight that makes this system possible
The Data Problem
Traditional prediction models try to directly forecast match winners. But with only 15,000 historical T20 matches, there isn't enough data to learn the nuanced patterns that determine outcomes.
- Limited training examples
- High variance predictions
- No uncertainty quantification
The Ball-by-Ball Solution
Instead of predicting matches, we predict individual ball outcomes. This gives us 4 million training examples, about 266x more data to learn from than the 15,000 match results.
- Abundant training data
- Rich pattern learning
- Natural uncertainty via simulation
Six Possible Ball Outcomes
The Prediction Pipeline
From raw match data to win probabilities in five steps
Data Collection & Processing
We start with ball-by-ball data from 8,341 T20 matches sourced from Cricsheet. Each delivery is parsed to extract the outcome (runs scored or wicket) along with contextual information.
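As a rough illustration of the parsing step, the sketch below flattens one match into per-delivery records. It assumes the nested innings, overs, deliveries layout of Cricsheet's JSON files. The sample match, the helper name iter_balls, and the choice to encode the outcome as either "W" or the total runs off the ball are assumptions made for this example, not the system's actual parser.

```python
from typing import Any, Dict, Iterator

# Tiny hand-written sample in the nested innings -> overs -> deliveries
# layout. Player names and numbers are made up for illustration.
SAMPLE_MATCH: Dict[str, Any] = {
    "innings": [
        {
            "team": "Team A",
            "overs": [
                {
                    "over": 0,
                    "deliveries": [
                        {"batter": "A1", "bowler": "B9",
                         "runs": {"batter": 4, "extras": 0, "total": 4}},
                        {"batter": "A1", "bowler": "B9",
                         "runs": {"batter": 0, "extras": 0, "total": 0},
                         "wickets": [{"player_out": "A1", "kind": "bowled"}]},
                    ],
                }
            ],
        }
    ]
}


def iter_balls(match: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per delivery, in match order."""
    for inning_idx, innings in enumerate(match["innings"]):
        for over in innings["overs"]:
            for ball_idx, delivery in enumerate(over["deliveries"]):
                is_wicket = bool(delivery.get("wickets"))
                total_runs = delivery["runs"]["total"]
                yield {
                    "inning_idx": inning_idx,
                    "over": over["over"],
                    "ball_in_over": ball_idx,
                    "batter": delivery["batter"],
                    "bowler": delivery["bowler"],
                    "runs": total_runs,
                    # Assumed encoding: a wicket, or the runs off the ball.
                    "outcome": "W" if is_wicket else str(total_runs),
                }


balls = list(iter_balls(SAMPLE_MATCH))
```

Each record keeps the innings index, over, and ball position, so later steps can rebuild the match state at any delivery.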
Feature Engineering
Each ball is described by 63 features across 6 categories. These capture everything from the current match state to historical player performance and tactical matchups.
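To make the Match State category concrete, here is a minimal sketch that derives those features from a running scoreboard. The feature names come from this guide, but the formulas are assumptions: in particular, the powerplay is taken as overs 1-6 and the death overs as overs 17-20, which may not match the definitions the real pipeline uses.

```python
from typing import Dict

BALLS_PER_INNINGS = 120  # 20 overs of 6 legal balls in a T20 innings


def match_state_features(score: int, wickets: int, balls_bowled: int,
                         inning_idx: int) -> Dict[str, float]:
    """Build match-state features from the current scoreboard."""
    overs_bowled = balls_bowled / 6
    return {
        "score": score,
        "wickets": wickets,
        "balls_bowled": balls_bowled,
        # Runs per over so far; 0.0 before the first ball is bowled.
        "run_rate": score / overs_bowled if balls_bowled else 0.0,
        "is_powerplay": int(balls_bowled < 36),  # assumed: overs 1-6
        "is_death": int(balls_bowled >= 96),     # assumed: overs 17-20
        "balls_remaining": BALLS_PER_INNINGS - balls_bowled,
        "wickets_in_hand": 10 - wickets,
        "inning_idx": inning_idx,
    }


# Example: 84 for 2 after 10 overs of the first innings.
features = match_state_features(score=84, wickets=2, balls_bowled=60, inning_idx=0)
```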
XGBoost Model
An XGBoost classifier predicts the probability distribution over the 6 possible outcomes for each ball. The model was tuned using Optuna across 50 hyperparameter trials.
Monte Carlo Simulation
To predict a match, we simulate it 1,000 times. Each simulation plays out ball by ball: predict the outcome probabilities, sample an outcome, update the match state, and repeat until both innings end (up to 240 balls in total, fewer if a side is bowled out or the chase is completed).
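The simulation loop can be sketched with the standard library alone. Everything model-related here is a placeholder: predict_probs is a hypothetical stand-in that returns one fixed distribution instead of calling the trained classifier, and the six class labels and their probabilities are assumptions chosen for illustration. Extras such as wides and no-balls are ignored.

```python
import random
from typing import Dict, List, Optional

OUTCOMES = ["0", "1", "2", "4", "6", "W"]                  # assumed class labels
PLACEHOLDER_PROBS = [0.35, 0.38, 0.08, 0.11, 0.04, 0.04]   # illustrative only


def predict_probs(state: Dict[str, int]) -> List[float]:
    """Hypothetical model call; this sketch ignores the match state."""
    return PLACEHOLDER_PROBS


def simulate_innings(rng: random.Random, target: Optional[int] = None) -> int:
    """Play one innings ball by ball and return the final score."""
    state = {"score": 0, "wickets": 0, "balls_bowled": 0}
    while state["balls_bowled"] < 120 and state["wickets"] < 10:
        if target is not None and state["score"] >= target:
            break  # chase completed, innings ends early
        # Sample one outcome from the predicted distribution.
        outcome = rng.choices(OUTCOMES, weights=predict_probs(state), k=1)[0]
        if outcome == "W":
            state["wickets"] += 1
        else:
            state["score"] += int(outcome)
        state["balls_bowled"] += 1
    return state["score"]


def simulate_match(rng: random.Random) -> Dict[str, int]:
    """Simulate both innings; the second side chases first + 1."""
    first = simulate_innings(rng)
    second = simulate_innings(rng, target=first + 1)
    return {"first": first, "second": second}


def run_simulations(n: int = 1000, seed: int = 0) -> List[Dict[str, int]]:
    rng = random.Random(seed)  # seeded so runs are reproducible
    return [simulate_match(rng) for _ in range(n)]


results = run_simulations()
```

In a real run, predict_probs would rebuild the feature vector from the updated state before every ball, which is what lets momentum and pressure carry through a simulated innings.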
Win Probabilities
After 1,000 simulations, we aggregate the results. If Team A wins 620 simulations, their win probability is 62%. We also get score distributions and confidence intervals.
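The aggregation step is simple arithmetic, sketched below on synthetic results that reproduce the 620-out-of-1,000 example. The scores are made up, and the interval shown is a normal-approximation interval on the win probability, one common choice rather than necessarily the method the system uses.

```python
import math
import statistics
from typing import Dict, List

# Synthetic results: Team A bats first and wins 620 of 1,000 simulations.
results: List[Dict[str, int]] = (
    [{"first": 160, "second": 150}] * 620
    + [{"first": 150, "second": 151}] * 380
)


def summarize(results: List[Dict[str, int]]) -> Dict[str, float]:
    """Turn simulated scorecards into a win probability and score range."""
    n = len(results)
    wins_a = sum(1 for r in results if r["first"] > r["second"])
    ties = sum(1 for r in results if r["first"] == r["second"])
    p = wins_a / n
    # Standard error of a proportion estimated from n independent simulations.
    std_err = math.sqrt(p * (1 - p) / n)
    first_scores = sorted(r["first"] for r in results)
    # 19 cut points at 5%, 10%, ..., 95% of the first-innings score distribution.
    cuts = statistics.quantiles(first_scores, n=20)
    return {
        "win_prob_a": p,
        "tie_prob": ties / n,
        "win_prob_low": p - 1.96 * std_err,
        "win_prob_high": p + 1.96 * std_err,
        "score_p05": cuts[0],
        "score_p95": cuts[-1],
    }


summary = summarize(results)
```

Because the standard error shrinks with the square root of the number of simulations, running more simulations narrows the interval around the estimate but does not make the underlying model any more accurate.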
63 Features in 6 Categories
Match State
Where is the game right now?
score, wickets, balls_bowled, run_rate, is_powerplay, is_death, balls_remaining, wickets_in_hand, inning_idx
Player Stats
Historical batter & bowler performance
batter_avg, batter_sr, batter_recent_avg, bowler_avg, bowler_economy, h2h_avg, h2h_sr, batter_vs_pace_avg, batter_vs_spin_sr
Momentum & Pressure
What just happened in the match?
last_5_balls_runs, last_10_balls_runs, balls_since_boundary, last_10_dots, partnership_runs, pressure_index
Chase Dynamics
Second innings target pursuit
chase_target, required_rate, lead_gap, venue_avg_score
Tactical Matchups
Batter vs bowler type advantages
spin_matchup_advantage, same_arm_matchup, matchup_type, batter_hand, bowler_type
Match Context
Venue, toss, and situational factors
venue_encoded, is_batting_first, is_toss_winner, balls_in_over, batter_age, bowler_age
Temporal Integrity
The most important engineering decision in the entire system
The Problem: Data Leakage
When predicting a match from June 2024, what player statistics should we use? If we accidentally include performances from July 2024, we're "cheating" - using future information that wouldn't be available to a real predictor.
Leaky approach: use a player's complete career stats (includes future matches) → artificially inflated accuracy
Correct approach: take a stat snapshot BEFORE each match date → realistic, production-valid predictions
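The difference between the two approaches is easy to see on a toy example. The sketch below computes a batting average (runs divided by dismissals) from only the innings played strictly before a given match date. The function name batter_avg_before, the record layout, and the numbers are all hypothetical.

```python
from datetime import date
from typing import List, Optional, Tuple

# (match_date, runs_scored, was_dismissed) for one batter; values are made up.
Innings = Tuple[date, int, bool]

HISTORY: List[Innings] = [
    (date(2024, 5, 10), 30, True),
    (date(2024, 6, 1), 50, True),
    (date(2024, 7, 4), 100, False),  # played AFTER the match being predicted
]


def batter_avg_before(history: List[Innings], match_date: date) -> Optional[float]:
    """Batting average from innings played strictly before match_date."""
    past = [inn for inn in history if inn[0] < match_date]
    runs = sum(r for _, r, _ in past)
    dismissals = sum(1 for _, _, out in past if out)
    # Undefined until the batter has been dismissed at least once.
    return runs / dismissals if dismissals else None


# Predicting a match played on 15 June 2024:
leaky = batter_avg_before(HISTORY, date.max)          # whole career: 90.0
safe = batter_avg_before(HISTORY, date(2024, 6, 15))  # snapshot: 40.0
```

The July century more than doubles the leaky average, which is exactly the kind of future information a real predictor would not have had in June.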
Implementation: Chunked Stats Cache
To achieve temporal integrity efficiently, we built a chunked stats cache system.