Prediction Engine Overview

How Ball-by-Ball Prediction Works

A comprehensive guide to understanding the machine learning system that predicts cricket matches through individual ball outcomes.

Why Predict Individual Balls?

The fundamental insight that makes this system possible

The Data Problem

Traditional prediction models try to directly forecast match winners. But with only 15,000 historical T20 matches, there isn't enough data to learn the nuanced patterns that determine outcomes.

  • Limited training examples
  • High variance predictions
  • No uncertainty quantification

The Ball-by-Ball Solution

Instead of predicting matches, we predict individual ball outcomes. This gives us about 4 million training examples, roughly 266 times as many as the 15,000 matches.

  • Abundant training data
  • Rich pattern learning
  • Natural uncertainty via simulation

Six Possible Ball Outcomes

  • 0 (Dot Ball): ~35%
  • 1 (Single): ~40%
  • 2 (Double): ~5%
  • 4 (Boundary): ~12%
  • 6 (Six): ~5%
  • W (Wicket): ~3%
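
To make the idea of a six-way outcome distribution concrete, the sketch below draws outcomes from the approximate base rates listed above. The names OUTCOMES, BASE_RATES and sample_outcome are illustrative and not taken from the engine; in the real system the probabilities would come from the model for each ball rather than from fixed base rates.

```python
import random

# Approximate base rates from the list above (they sum to 1.0).
OUTCOMES = ["0", "1", "2", "4", "6", "W"]
BASE_RATES = [0.35, 0.40, 0.05, 0.12, 0.05, 0.03]

def sample_outcome(probs, rng):
    """Draw one ball outcome according to the given probabilities."""
    return rng.choices(OUTCOMES, weights=probs, k=1)[0]

rng = random.Random(42)  # seeded so the run is repeatable
draws = [sample_outcome(BASE_RATES, rng) for _ in range(10_000)]
share_dots = draws.count("0") / len(draws)
print(f"Share of dot balls in 10,000 draws: {share_dots:.3f}")
```

Over many draws the observed shares settle close to the input probabilities, which is the property the simulation step relies on.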

The Prediction Pipeline

From raw match data to win probabilities in five steps

1. Data Collection & Processing

We start with ball-by-ball data from 8,341 T20 matches sourced from Cricsheet. Each delivery is parsed to extract the outcome (runs scored or wicket) along with contextual information.

  • Matches: 8,341
  • Ball Records: 4M+
  • Players: 7,223
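
As a minimal sketch of the parsing step, the function below maps one delivery to an outcome label. It assumes the delivery layout used in Cricsheet's JSON files (a "runs" object with a "batter" count, an optional "extras" object, and an optional "wickets" list). The function name ball_outcome is hypothetical, and because the source does not say how extras or run values outside the six classes are handled, the sketch simply returns None for them.

```python
def ball_outcome(delivery):
    """Map one Cricsheet-style delivery dict to an outcome label, or None."""
    if delivery.get("wickets"):
        return "W"
    if delivery.get("extras"):
        # Wides, no-balls, byes: handling is not specified in the source.
        return None
    runs = delivery["runs"]["batter"]
    # Only the six modelled classes are labelled; 3s and 5s are left unmapped.
    return str(runs) if runs in (0, 1, 2, 4, 6) else None

four = {"batter": "A", "bowler": "B", "runs": {"batter": 4, "extras": 0, "total": 4}}
out = {
    "batter": "A",
    "bowler": "B",
    "runs": {"batter": 0, "extras": 0, "total": 0},
    "wickets": [{"kind": "bowled", "player_out": "A"}],
}
wide = {
    "batter": "A",
    "bowler": "B",
    "runs": {"batter": 0, "extras": 1, "total": 1},
    "extras": {"wides": 1},
}
print(ball_outcome(four), ball_outcome(out), ball_outcome(wide))
```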

2. Feature Engineering

Each ball is described by 63 features across 6 categories. These capture everything from the current match state to historical player performance and tactical matchups.

3. XGBoost Model

An XGBoost classifier predicts the probability distribution over the 6 possible outcomes for each ball. The model was tuned using Optuna across 50 hyperparameter trials.

  • Estimators: 444
  • Max Depth: 10
  • Learning Rate: 0.24
  • Accuracy: 55%
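
For orientation, the reported hyperparameters could be gathered into a settings dictionary like the one below. The key names follow the scikit-learn style parameters that xgboost.XGBClassifier accepts. The "multi:softprob" objective is an assumption based on the model returning a probability distribution; the source does not state it.

```python
# Hyperparameters reported for this step, keyed by XGBClassifier-style names.
XGB_PARAMS = {
    "n_estimators": 444,
    "max_depth": 10,
    "learning_rate": 0.24,
    "objective": "multi:softprob",  # assumption: not stated in the source
}

N_CLASSES = 6  # 0, 1, 2, 4, 6, W

# With xgboost installed, the settings could be passed as:
#   model = xgboost.XGBClassifier(**XGB_PARAMS)
print(XGB_PARAMS)
```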

4. Monte Carlo Simulation

To predict a match, we simulate it 1,000 times. Each simulation plays out ball-by-ball: predict probabilities, sample an outcome, update match state, repeat for all 240 balls (both innings).

# Simplified simulation loop
for sim in range(1000):
    for ball in range(240):
        probs = model.predict(features)
        outcome = sample(probs)
        update_match_state(outcome)
    record_winner()
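
A runnable version of the loop above might look like the following sketch. It is not the engine's code: stub_probs stands in for the trained model and returns fixed base rates, all function names are hypothetical, and the early exits (all out, target reached) are ordinary cricket rules added for the sketch rather than details taken from the source.

```python
import random

OUTCOMES = ["0", "1", "2", "4", "6", "W"]
RUNS = {"0": 0, "1": 1, "2": 2, "4": 4, "6": 6, "W": 0}

def stub_probs(state):
    """Stand-in for the trained model: fixed base rates, ignores the state."""
    return [0.35, 0.40, 0.05, 0.12, 0.05, 0.03]

def simulate_innings(rng, target=None, max_balls=120):
    """Play one innings ball by ball and return the runs scored."""
    score, wickets = 0, 0
    for ball in range(max_balls):
        state = {"score": score, "wickets": wickets, "balls_bowled": ball}
        outcome = rng.choices(OUTCOMES, weights=stub_probs(state), k=1)[0]
        if outcome == "W":
            wickets += 1
            if wickets == 10:  # all out
                break
        else:
            score += RUNS[outcome]
        if target is not None and score >= target:  # chase completed
            break
    return score

def simulate_match(rng):
    """Return 'A' if the side batting first wins, 'B' if the chasers win, else 'tie'."""
    first = simulate_innings(rng)
    second = simulate_innings(rng, target=first + 1)
    if second > first:
        return "B"
    return "A" if first > second else "tie"

def win_probabilities(n_sims=1000, seed=0):
    """Simulate n_sims matches and return the share won by each side."""
    rng = random.Random(seed)
    results = [simulate_match(rng) for _ in range(n_sims)]
    return {side: results.count(side) / n_sims for side in ("A", "B", "tie")}

probs = win_probabilities()
print(probs)
```

Because the stub ignores the match state, both sides are close to evenly matched here; with a real model the state passed to it would drive the differences between teams.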

5. Win Probabilities

After 1,000 simulations, we aggregate the results. If Team A wins 620 simulations, their win probability is 62%. We also get score distributions and confidence intervals.

Example Output

  • India: 52.1%
  • South Africa: 47.9%

Score Distribution

  • Mean Score: 158.3
  • 95% CI: 138 - 180
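
The aggregation step can be sketched as follows. The function name summarize and the toy inputs are hypothetical, and the interval is computed as the empirical 2.5th to 97.5th percentile of simulated scores, which is an assumption about how the reported 95% CI is derived.

```python
import statistics

def summarize(winners, scores):
    """Aggregate simulation results into win shares and a score summary."""
    n = len(winners)
    win_prob = {team: winners.count(team) / n for team in sorted(set(winners))}
    ordered = sorted(scores)
    # Empirical 2.5th and 97.5th percentiles, taken by index into the sorted list.
    lo = ordered[int(0.025 * (len(ordered) - 1))]
    hi = ordered[int(0.975 * (len(ordered) - 1))]
    return win_prob, statistics.mean(scores), (lo, hi)

# Toy inputs: 620 wins for Team A out of 1,000, and made-up first-innings scores.
winners = ["Team A"] * 620 + ["Team B"] * 380
scores = [140 + (i % 41) for i in range(1000)]

win_prob, mean_score, interval = summarize(winners, scores)
print(win_prob, round(mean_score, 1), interval)
```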

63 Features in 6 Categories

Match State (12 features)

Where is the game right now?

  • score: Current team score
  • wickets: Wickets lost
  • balls_bowled: Balls in innings
  • run_rate: Current run rate
  • is_powerplay: Overs 1-6
  • is_death: Overs 16-20
  • balls_remaining: Balls left
  • wickets_in_hand: Wickets remaining
  • inning_idx: 1st or 2nd innings
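
The match-state features listed here can all be derived from three running counters. The sketch below is one way to do that; the function name, the 0/1 encoding of the flags, and the default of 120 balls per innings are assumptions made for illustration.

```python
def match_state_features(score, wickets, balls_bowled, inning_idx, total_balls=120):
    """Derive match-state features from the running score, wickets and ball count."""
    over = balls_bowled // 6  # zero-based index of the over the next ball belongs to
    return {
        "score": score,
        "wickets": wickets,
        "balls_bowled": balls_bowled,
        "run_rate": 6 * score / balls_bowled if balls_bowled else 0.0,
        "is_powerplay": int(over < 6),   # overs 1-6
        "is_death": int(over >= 15),     # overs 16-20
        "balls_remaining": total_balls - balls_bowled,
        "wickets_in_hand": 10 - wickets,
        "inning_idx": inning_idx,
    }

state = match_state_features(score=45, wickets=1, balls_bowled=30, inning_idx=1)
print(state)
```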

Player Stats (20 features)

Historical batter & bowler performance

  • batter_avg: Career batting average
  • batter_sr: Career strike rate
  • batter_recent_avg: Last 10 innings avg
  • bowler_avg: Career bowling average
  • bowler_economy: Runs per over
  • h2h_avg: Head-to-head average
  • h2h_sr: H2H strike rate
  • batter_vs_pace_avg: Avg vs pace bowlers
  • batter_vs_spin_sr: SR vs spin bowlers

Momentum & Pressure (10 features)

What just happened in the match?

  • last_5_balls_runs: Runs in last 5 balls
  • last_10_balls_runs: Runs in last 10 balls
  • balls_since_boundary: Dot ball pressure
  • last_10_dots: Dots in last 10
  • partnership_runs: Current partnership
  • pressure_index: Combined pressure metric
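
Several of these momentum features are simple rolling counts over the recent ball history. The sketch below assumes the history is a list of outcome labels ("0", "1", "2", "4", "6", "W"), counts only "0" as a dot ball, and leaves out partnership_runs and pressure_index because the source does not define how they are computed. The function name is hypothetical.

```python
def momentum_features(history):
    """Rolling features from the outcomes bowled so far, oldest first."""
    runs = [0 if o == "W" else int(o) for o in history]
    # Count balls back from the latest one until a 4 or 6 is found.
    balls_since_boundary = 0
    for o in reversed(history):
        if o in ("4", "6"):
            break
        balls_since_boundary += 1
    return {
        "last_5_balls_runs": sum(runs[-5:]),
        "last_10_balls_runs": sum(runs[-10:]),
        "balls_since_boundary": balls_since_boundary,
        "last_10_dots": sum(1 for o in history[-10:] if o == "0"),
    }

feats = momentum_features(["1", "0", "4", "0", "0", "1", "W", "2"])
print(feats)
```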

Chase Dynamics (4 features)

Second innings target pursuit

  • chase_target: Runs needed to win
  • required_rate: RRR to win
  • lead_gap: Runs behind/ahead
  • venue_avg_score: Venue par score
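
The required run rate is the number of runs still needed, expressed per six-ball over. A small sketch follows; the function name and the handling of the zero-balls-left edge case are assumptions.

```python
def required_rate(target, score, balls_remaining):
    """Runs per over still needed to reach the target."""
    runs_needed = max(target - score, 0)
    if balls_remaining == 0:
        # No balls left: the rate is undefined unless the target is already met.
        return float("inf") if runs_needed else 0.0
    return 6 * runs_needed / balls_remaining

# Chasing 160, currently 100 with 6 overs (36 balls) left.
rrr = required_rate(target=160, score=100, balls_remaining=36)
print(rrr)
```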

Tactical Matchups (5 features)

Batter vs bowler type advantages

  • spin_matchup_advantage: Batter vs spin compatibility
  • same_arm_matchup: Same arm advantage
  • matchup_type: RHB vs RF, LHB vs LS, etc.
  • batter_hand: Left or right handed
  • bowler_type: Pace or spin

Match Context (12 features)

Venue, toss, and situational factors

  • venue_encoded: Stadium identifier
  • is_batting_first: Setting or chasing
  • is_toss_winner: Won the toss
  • balls_in_over: Position in over (0-5)
  • batter_age: Player age at match
  • bowler_age: Bowler age at match

Critical Engineering

Temporal Integrity

The most important engineering decision in the entire system

The Problem: Data Leakage

When predicting a match from June 2024, what player statistics should we use? If we accidentally include performances from July 2024, we're "cheating" - using future information that wouldn't be available to a real predictor.

Wrong Approach

Use player's complete career stats (includes future matches) → Artificially inflated accuracy

Our Solution

Take a snapshot of each player's stats using only matches played BEFORE the match date → Realistic, production-valid predictions
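
The difference between the two approaches can be shown with a toy example. The innings log, field layout and function name below are hypothetical; the point is only that the snapshot filters on the match date before computing anything.

```python
from datetime import date

# Hypothetical innings log for one batter: (match_date, runs, balls_faced, was_out)
INNINGS = [
    (date(2024, 5, 20), 30, 20, True),
    (date(2024, 6, 2), 45, 30, True),
    (date(2024, 7, 10), 90, 40, False),  # played after the match being predicted
]

def batter_snapshot(innings, match_date):
    """Batting average and strike rate using only innings before match_date."""
    past = [row for row in innings if row[0] < match_date]
    runs = sum(r for _, r, _, _ in past)
    balls = sum(b for _, _, b, _ in past)
    outs = sum(1 for _, _, _, out in past if out)
    return {
        "batter_avg": runs / outs if outs else None,
        "batter_sr": 100 * runs / balls if balls else None,
    }

# Predicting a match on 15 June 2024: the July innings must not be visible.
snapshot = batter_snapshot(INNINGS, date(2024, 6, 15))
# Leaky version for comparison: every innings, including the future one.
leaky = batter_snapshot(INNINGS, date.max)
print(snapshot, leaky)
```

In this toy log the leaky average is more than double the snapshot average, which is the kind of inflation the temporal snapshot is meant to prevent.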

Implementation: Chunked Stats Cache

To achieve temporal integrity efficiently, we built a chunked stats cache system:

  • Total cache size: 7.6GB
  • Chunks (LRU): 69
  • Memory usage: 550MB
  • Query time: <10ms
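
The source gives only the headline numbers for the cache, so the following is a generic sketch of least-recently-used chunk eviction rather than the actual implementation. The class name, the loader callback, and the idea of addressing chunks by an integer id are all assumptions.

```python
from collections import OrderedDict

class ChunkCache:
    """Keep at most max_chunks loaded; evict the least recently used one."""

    def __init__(self, load_chunk, max_chunks):
        self._load_chunk = load_chunk  # callable: chunk_id -> dict of stats
        self._max_chunks = max_chunks
        self._chunks = OrderedDict()
        self.loads = 0  # number of times a chunk was loaded from storage

    def get(self, chunk_id, key):
        if chunk_id in self._chunks:
            self._chunks.move_to_end(chunk_id)  # mark as most recently used
        else:
            self._chunks[chunk_id] = self._load_chunk(chunk_id)
            self.loads += 1
            if len(self._chunks) > self._max_chunks:
                self._chunks.popitem(last=False)  # drop the least recently used
        return self._chunks[chunk_id].get(key)

def fake_loader(chunk_id):
    # Stand-in for reading one chunk of player stats from disk.
    return {"player_1": {"batter_avg": 30.0 + chunk_id}}

cache = ChunkCache(fake_loader, max_chunks=2)
cache.get(0, "player_1")
cache.get(1, "player_1")
cache.get(0, "player_1")  # hit: chunk 0 becomes most recent
cache.get(2, "player_1")  # miss: chunk 1 is evicted
cache.get(0, "player_1")  # still a hit
print(cache.loads)
```

Holding only a few chunks in memory at a time is what lets a cache that is several gigabytes on disk run within a much smaller memory budget.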
