Prediction Engine Overview

How Ball-by-Ball Prediction Works

A guide to the machine learning system that predicts cricket matches by modeling individual ball outcomes.

Why Predict Individual Balls?

The fundamental insight that makes this system possible

The Data Problem

Traditional prediction models try to forecast match winners directly. But with only 15,000 historical T20 matches, there are too few examples to learn the nuanced patterns that determine outcomes.

Limited training examples
High variance predictions
No uncertainty quantification

The Ball-by-Ball Solution

Instead of predicting matches, we predict individual ball outcomes. This gives us about 4 million training examples, roughly 266 times as many as the 15,000 match results.

Abundant training data
Rich pattern learning
Natural uncertainty via simulation

Six Possible Ball Outcomes

Dot Ball: ~35%
Single: ~40%
Double: ~5%
Boundary: ~12%
Six: ~5%
Wicket: ~3%
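
To make the six-way outcome distribution concrete, here is a minimal sketch that draws ball outcomes from the approximate base rates listed above. The names `BASE_RATES` and `sample_outcome` are illustrative and not part of the system described here; the real model replaces these fixed rates with probabilities conditioned on the match situation.

```python
import random

# Approximate base rates taken from the list above. A trained model would
# replace these with probabilities that depend on the match situation.
BASE_RATES = {
    "dot": 0.35,
    "single": 0.40,
    "double": 0.05,
    "boundary": 0.12,
    "six": 0.05,
    "wicket": 0.03,
}


def sample_outcome(probs, rng):
    """Draw one outcome label according to its probability."""
    labels = list(probs)
    weights = [probs[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [sample_outcome(BASE_RATES, rng) for _ in range(10_000)]
dot_share = draws.count("dot") / len(draws)
```

Over many draws, the share of each label approaches its base rate, which is the property the simulation step relies on.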

The Prediction Pipeline

From raw match data to win probabilities in five steps

Data Collection & Processing

We start with ball-by-ball data from 8,341 T20 matches sourced from Cricsheet. Each delivery is parsed to extract the outcome (runs scored or wicket) along with contextual information.

Matches: 8,341
Ball Records: 4M+
Players: 7,223
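
As an illustration of the parsing step, the sketch below maps one delivery record to one of the six outcome labels. It assumes a Cricsheet-style dictionary with a `runs` entry holding the runs off the bat and an optional `wickets` list. The function name `classify_delivery` and the handling of threes and fives are assumptions made for this example, since the page does not say how those rarer cases are labelled.

```python
def classify_delivery(delivery):
    """Map one delivery record to one of the six outcome labels."""
    if delivery.get("wickets"):
        return "wicket"
    runs = delivery["runs"]["batter"]  # runs off the bat, extras excluded
    if runs == 0:
        return "dot"
    if runs == 1:
        return "single"
    if runs in (2, 3):
        # Assumption: threes are rare and are folded into "double".
        return "double"
    if runs in (4, 5):
        # Assumption: fives (e.g. overthrows) are folded into "boundary".
        return "boundary"
    return "six"


example = {
    "batter": "A",
    "bowler": "B",
    "runs": {"batter": 4, "extras": 0, "total": 4},
}
label = classify_delivery(example)
```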

Feature Engineering

Each ball is described by 63 features across 6 categories. These capture everything from the current match state to historical player performance and tactical matchups.

XGBoost Model

An XGBoost classifier predicts the probability distribution over the 6 possible outcomes for each ball. The model was tuned using Optuna across 50 hyperparameter trials.

Estimators: 444
Max Depth: (value missing)
Learning Rate: 0.24
Accuracy: 55%

Monte Carlo Simulation

To predict a match, we simulate it 1,000 times. Each simulation plays out ball by ball: predict the outcome probabilities, sample an outcome, update the match state, and repeat for all 240 balls (120 per innings).

# Simplified simulation loop
for sim in range(1000):
    for ball in range(240):
        probs = model.predict(features)
        outcome = sample(probs)
        update_match_state(outcome)
    record_winner()
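
The simplified loop can be fleshed out into a runnable sketch. Everything here is illustrative: `stub_predict` stands in for the trained model and returns the same base rates for every ball, both teams are treated as identical, and ending an innings early at ten wickets or once the target is reached is an assumption based on the rules of T20 cricket rather than something this page confirms about the implementation.

```python
import random

OUTCOMES = ["dot", "single", "double", "boundary", "six", "wicket"]
RUNS = {"dot": 0, "single": 1, "double": 2, "boundary": 4, "six": 6, "wicket": 0}


def stub_predict(state):
    """Stand-in for the trained model: fixed base rates for every ball."""
    return [0.35, 0.40, 0.05, 0.12, 0.05, 0.03]


def simulate_innings(rng, target=None, balls=120):
    """Play one innings ball by ball and return the final score."""
    state = {"score": 0, "wickets": 0, "balls_bowled": 0}
    for _ in range(balls):
        probs = stub_predict(state)
        outcome = rng.choices(OUTCOMES, weights=probs, k=1)[0]
        state["score"] += RUNS[outcome]
        if outcome == "wicket":
            state["wickets"] += 1
        state["balls_bowled"] += 1
        all_out = state["wickets"] == 10
        chased = target is not None and state["score"] >= target
        if all_out or chased:
            break  # the innings ends early
    return state["score"]


def simulate_match(rng):
    """Return 'A' if the side batting first wins, 'B' if the chasers win, else 'tie'."""
    first = simulate_innings(rng)
    second = simulate_innings(rng, target=first + 1)
    if second > first:
        return "B"
    return "A" if first > second else "tie"


rng = random.Random(42)  # fixed seed so the sketch is repeatable
results = [simulate_match(rng) for _ in range(1000)]
win_prob_a = results.count("A") / len(results)
```

Because both sides use the same stub probabilities, `win_prob_a` lands near one half; with a real model, the features for each ball would be rebuilt from the updated state before every prediction.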

Win Probabilities

After 1,000 simulations, we aggregate the results. If Team A wins 620 of them, its win probability is 62%. The simulated scores also give us score distributions and confidence intervals.

Example Output

India: 52.1%
South Africa: 47.9%

Score Distribution

Mean Score: 158.3
95% CI: 138 to 180
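
The aggregation step can be sketched as follows. This assumes the interval is a simple percentile interval over the simulated scores, which the page does not confirm; the function name `summarize` and the input data are made up for illustration.

```python
import statistics


def summarize(winners, scores):
    """Aggregate simulation results into win probabilities and a score interval."""
    win_probs = {team: winners.count(team) / len(winners) for team in set(winners)}
    ordered = sorted(scores)
    # Percentile interval: roughly the 2.5th and 97.5th percentiles.
    lower = ordered[int(0.025 * (len(ordered) - 1))]
    upper = ordered[int(0.975 * (len(ordered) - 1))]
    return win_probs, statistics.mean(scores), (lower, upper)


# Made-up inputs: Team A wins 620 of 1,000 simulations.
winners = ["A"] * 620 + ["B"] * 380
scores = [140 + (i % 41) for i in range(1000)]
win_probs, mean_score, interval = summarize(winners, scores)
```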

63 Features in 6 Categories

Match State

Where is the game right now?

score: Current team score
wickets: Wickets lost
balls_bowled: Balls in innings
run_rate: Current run rate
is_powerplay: Overs 1-6
is_death: Overs 16-20
balls_remaining: Balls left
wickets_in_hand: Wickets remaining
inning_idx: 1st or 2nd innings
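
Most match-state features follow directly from the running score, wickets, and balls bowled. The sketch below shows one way to derive them, assuming a 120-ball innings with ten wickets and the powerplay and death-over ranges listed above. The function name `match_state_features` and the choice of 0.0 for the run rate before the first ball are assumptions made for this example.

```python
def match_state_features(score, wickets, balls_bowled, inning_idx, total_balls=120):
    """Derive the match-state features for the next delivery."""
    over_number = balls_bowled // 6 + 1  # 1-based over of the next delivery
    return {
        "score": score,
        "wickets": wickets,
        "balls_bowled": balls_bowled,
        # Runs per over so far; defined as 0.0 before the first ball.
        "run_rate": 6 * score / balls_bowled if balls_bowled else 0.0,
        "is_powerplay": int(over_number <= 6),
        "is_death": int(over_number >= 16),
        "balls_remaining": total_balls - balls_bowled,
        "wickets_in_hand": 10 - wickets,
        "inning_idx": inning_idx,
    }


# 45 for 1 after five overs of the first innings.
features = match_state_features(score=45, wickets=1, balls_bowled=30, inning_idx=1)
```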

Player Stats

Historical batter & bowler performance

batter_avg: Career batting average
batter_sr: Career strike rate
batter_recent_avg: Last 10 innings avg
bowler_avg: Career bowling average
bowler_economy: Runs per over
h2h_avg: Head-to-head average
h2h_sr: H2H strike rate
batter_vs_pace_avg: Avg vs pace bowlers
batter_vs_spin_sr: SR vs spin bowlers

Momentum & Pressure

What just happened in the match?

last_5_balls_runs: Runs in last 5 balls
last_10_balls_runs: Runs in last 10 balls
balls_since_boundary: Balls since the last boundary (dot-ball pressure)
last_10_dots: Dots in last 10 balls
partnership_runs: Current partnership
pressure_index: Combined pressure metric
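
Several of the momentum features are rolling counts over the most recent deliveries. A minimal sketch, assuming the innings history is available as a list of runs per ball (oldest first) and that a wicket ball is recorded as zero runs, could look like this. The function name `momentum_features` is hypothetical, and `partnership_runs` and `pressure_index` are left out because their formulas are not given here.

```python
def momentum_features(recent_runs):
    """Compute momentum features from per-ball runs, oldest first."""
    last_5 = recent_runs[-5:]
    last_10 = recent_runs[-10:]
    # Count deliveries back from the latest until a four or six is found.
    balls_since_boundary = 0
    for runs in reversed(recent_runs):
        if runs >= 4:
            break
        balls_since_boundary += 1
    return {
        "last_5_balls_runs": sum(last_5),
        "last_10_balls_runs": sum(last_10),
        "balls_since_boundary": balls_since_boundary,
        "last_10_dots": sum(1 for runs in last_10 if runs == 0),
    }


history = [1, 0, 4, 0, 0, 1, 2, 0, 6, 0, 1, 0]  # made-up two overs
momentum = momentum_features(history)
```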

Chase Dynamics

Second innings target pursuit

chase_target: Runs needed to win
required_rate: RRR to win
lead_gap: Runs behind/ahead
venue_avg_score: Venue par score
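
The required run rate is the standard cricket calculation: runs still needed, divided by balls remaining, scaled to a six-ball over. The sketch below reads `chase_target` as the runs still needed, following the description above; that reading, the function name `chase_features`, and the handling of an exhausted innings are assumptions made for this example.

```python
def chase_features(winning_score, score, balls_remaining):
    """Second-innings features; winning_score is the total needed to win."""
    runs_needed = max(winning_score - score, 0)
    if balls_remaining > 0:
        required_rate = 6 * runs_needed / balls_remaining
    else:
        # No balls left: the rate is unreachable unless the chase is done.
        required_rate = float("inf") if runs_needed else 0.0
    return {"chase_target": runs_needed, "required_rate": required_rate}


# Chasing 160, currently 100 with six overs (36 balls) left.
chase = chase_features(winning_score=160, score=100, balls_remaining=36)
```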

Tactical Matchups

Batter vs bowler type advantages

spin_matchup_advantage: Batter vs spin compatibility
same_arm_matchup: Same arm advantage
matchup_type: RHB vs RF, LHB vs LS, etc.
batter_hand: Left or right handed
bowler_type: Pace or spin

Match Context

Venue, toss, and situational factors

venue_encoded: Stadium identifier
is_batting_first: Setting or chasing
is_toss_winner: Won the toss
balls_in_over: Position in over (0-5)
batter_age: Batter age at match
bowler_age: Bowler age at match

Critical Engineering

Temporal Integrity

The most important engineering decision in the entire system

The Problem: Data Leakage

When predicting a match from June 2024, what player statistics should we use? If we accidentally include performances from July 2024, we're "cheating" - using future information that wouldn't be available to a real predictor.

Wrong Approach

Use player's complete career stats (includes future matches) → Artificially inflated accuracy

Our Solution

Take a stats snapshot from BEFORE each match date → Realistic, production-valid predictions
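
The difference between the two approaches can be shown with a small point-in-time calculation. This is a sketch under stated assumptions, not the system's actual code: the function name `batter_avg_before`, the record layout, and the innings data are all invented for illustration.

```python
from datetime import date


def batter_avg_before(innings, match_date):
    """Batting average using only innings played strictly before match_date."""
    past = [inn for inn in innings if inn["date"] < match_date]
    runs = sum(inn["runs"] for inn in past)
    dismissals = sum(1 for inn in past if inn["out"])
    return runs / dismissals if dismissals else None


# Made-up innings for one batter.
innings = [
    {"date": date(2024, 5, 10), "runs": 30, "out": True},
    {"date": date(2024, 6, 1), "runs": 50, "out": True},
    {"date": date(2024, 7, 4), "runs": 100, "out": True},  # after the June match
]

snapshot = batter_avg_before(innings, date(2024, 6, 15))  # excludes July
leaky = batter_avg_before(innings, date(2025, 1, 1))      # includes July
```

For the June match, the snapshot average is 40.0, while the full-career figure of 60.0 already includes the July century and would leak future information into training.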

Implementation: Chunked Stats Cache

To achieve temporal integrity efficiently, we built a chunked stats cache system:

Total cache size: 7.6GB
Chunks (LRU): (value missing)
Memory usage: 550MB
Query time: <10ms
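
One way to keep only part of a large cache in memory is a least-recently-used (LRU) policy over chunks. The sketch below illustrates that idea with an `OrderedDict`; the class name `ChunkCache`, the loader callback, and keying chunks by month are assumptions for this example and do not describe the actual cache layout.

```python
from collections import OrderedDict


class ChunkCache:
    """Hold at most max_chunks chunks in memory, evicting the least recently used."""

    def __init__(self, load_chunk, max_chunks):
        self._load_chunk = load_chunk  # hypothetical loader: chunk_id -> stats dict
        self._max_chunks = max_chunks
        self._chunks = OrderedDict()

    def get(self, chunk_id):
        if chunk_id in self._chunks:
            self._chunks.move_to_end(chunk_id)  # mark as most recently used
            return self._chunks[chunk_id]
        chunk = self._load_chunk(chunk_id)
        self._chunks[chunk_id] = chunk
        if len(self._chunks) > self._max_chunks:
            self._chunks.popitem(last=False)  # evict the least recently used
        return chunk


loads = []


def fake_loader(chunk_id):
    """Stand-in for reading a chunk from disk; records each load."""
    loads.append(chunk_id)
    return {"chunk": chunk_id}


cache = ChunkCache(fake_loader, max_chunks=2)
cache.get("2024-05")
cache.get("2024-06")
cache.get("2024-05")  # cache hit, no load
cache.get("2024-07")  # evicts "2024-06"
cache.get("2024-06")  # loaded again
```

Because matches are usually processed in date order, consecutive lookups tend to hit the same few chunks, which is what makes a small in-memory window over a much larger on-disk cache workable.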
