Machine Learning + Cricket Analytics

Predicting Cricket
One Ball at a Time

A machine learning system that predicts 4 million individual ball outcomes to forecast T20 match results with Monte Carlo simulation.

4M+ balls analyzed · 63 features per ball · 55% ball accuracy · 29% market edge
The Challenge

Why Traditional Prediction Fails

Direct match prediction suffers from a fundamental data problem

Traditional Approach

15,000
Historical T20 matches

Predicting match winners directly gives you limited training data. Not enough to learn nuanced patterns.

Our Approach

4,000,000
Individual ball outcomes

By predicting each ball, we get 266x more training data. Rich patterns emerge from abundant examples.

The Key Insight

Each ball has an outcome: dot, single, double, boundary, six, or wicket. By simulating entire matches ball-by-ball using Monte Carlo methods, we capture the natural uncertainty of cricket while leveraging abundant training data.
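The ball-by-ball simulation idea can be sketched in a few lines. Everything below is illustrative: the outcome probabilities are hard-coded stand-ins for what the per-ball classifier would emit, and the real simulator would update them with match state on every delivery.

```python
import random

# The six outcome classes named above: dot, single, double, boundary (4), six, wicket.
OUTCOMES = ["dot", "single", "double", "four", "six", "wicket"]
RUNS = {"dot": 0, "single": 1, "double": 2, "four": 4, "six": 6, "wicket": 0}

def simulate_innings(outcome_probs, balls=120, max_wickets=10, rng=random):
    """Sample one T20 innings ball-by-ball from a fixed outcome distribution."""
    score, wickets = 0, 0
    for _ in range(balls):
        outcome = rng.choices(OUTCOMES, weights=outcome_probs)[0]
        if outcome == "wicket":
            wickets += 1
            if wickets == max_wickets:
                break  # all out
        else:
            score += RUNS[outcome]
    return score

def win_probability(probs_a, probs_b, n_sims=1000, seed=0):
    """Estimate P(team A outscores team B) over n_sims simulated matches."""
    rng = random.Random(seed)
    wins = sum(
        simulate_innings(probs_a, rng=rng) > simulate_innings(probs_b, rng=rng)
        for _ in range(n_sims)
    )
    return wins / n_sims
```

Because each innings is sampled rather than averaged, the spread of simulated scores carries cricket's natural uncertainty straight into the final win probability.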

The Solution

How It Works

A five-step pipeline from raw data to match predictions

1. Data: 8,341 T20 matches → 4M balls
2. Features: 63 features per ball
3. Model: XGBoost classifier
4. Simulate: 1,000 Monte Carlo runs
5. Predict: win probabilities

63 Features Across 6 Categories

Match State (12): score, wickets, phase
Player Stats (20): averages, strike rates
Momentum (10): recent balls, pressure
Chase (4): target, required rate
Matchups (5): batter vs bowler type
Context (12): venue, batting order
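A minimal sketch of how the six groups could be assembled into one 63-dimensional vector. The group names and widths come from the list above, but the example feature names in the comments are hypothetical.

```python
# Group widths from the list above (12 + 20 + 10 + 4 + 5 + 12 = 63);
# the commented examples are hypothetical feature names, not the real ones.
FEATURE_GROUPS = {
    "match_state": 12,   # e.g. score, wickets, phase
    "player_stats": 20,  # e.g. averages, strike rates
    "momentum": 10,      # e.g. recent balls, pressure
    "chase": 4,          # e.g. target, required rate
    "matchups": 5,       # e.g. batter vs bowler type
    "context": 12,       # e.g. venue, batting order
}

def feature_vector(groups):
    """Concatenate per-group value lists into one flat 63-dim vector,
    validating each group's width against the spec above."""
    vec = []
    for name, width in FEATURE_GROUPS.items():
        values = groups[name]
        if len(values) != width:
            raise ValueError(f"{name}: expected {width} values, got {len(values)}")
        vec.extend(values)
    return vec
```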

What Drives the Predictions?

Top 10 features by XGBoost gain importance
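Gain importance ranks features by the total loss reduction contributed by splits on each feature. A sketch of that ranking step, using made-up gain values rather than the model's actual top 10:

```python
def top_features(gain_scores, k=10):
    """Rank features by total gain: the loss reduction summed over every
    tree split that uses the feature (the statistic xgboost reports via
    Booster.get_score(importance_type="gain"))."""
    return sorted(gain_scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Made-up gain values for illustration only.
example_gains = {
    "required_run_rate": 91.2,
    "wickets_in_hand": 74.5,
    "batter_strike_rate": 52.3,
    "balls_remaining": 40.1,
}
```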

Model Comparison

Model Leaderboard

Four different architectures tested on T20 World Cup 2024 matches

Rank  Model                            Log Loss  Brier Score  Market Edge  Speed
1     XGBoost (Gradient Boosting)      0.655     0.219        29.4%        ~346s
2     MLP (Neural Network)             0.707     0.254        27.0%        ~75s (fastest)
3     LSTM (Recurrent Network)         0.721     0.261        25.8%        ~420s
4     Fine-tuned LLM (Language Model)  0.748     0.278        24.1%        ~890s

XGBoost (Best)

Gradient boosted trees with Optuna hyperparameter tuning. 444 estimators, max depth 10.

46 features · 6-class
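Only the estimator count and tree depth are stated; the remaining settings below are plausible assumptions for a 6-class ball-outcome classifier, not the Optuna-tuned values.

```python
# Stated in the text: 444 estimators, max depth 10, tuned with Optuna.
# Everything else is an assumed, typical configuration.
xgb_params = {
    "objective": "multi:softprob",  # probabilities over the 6 outcome classes
    "n_estimators": 444,            # from the text
    "max_depth": 10,                # from the text
    "learning_rate": 0.05,          # assumption
    "subsample": 0.8,               # assumption
    "tree_method": "hist",          # assumption
}

# With xgboost installed, training would look roughly like:
#   import xgboost as xgb
#   model = xgb.XGBClassifier(**xgb_params)
#   model.fit(X_train, y_train)  # X: per-ball features, y: outcome class 0..5
```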

MLP (Fast)

3-layer feedforward network with BatchNorm and dropout. Focal loss for class imbalance.

256→128→64 · ReLU

LSTM (Sequence)

2-layer LSTM with player embeddings. Captures sequential patterns from 10-ball windows.

256→128 · Embeddings

Fine-tuned LLM (Experimental)

Transformer-based model fine-tuned on cricket commentary and match state descriptions.

GPT-style · LoRA

Why XGBoost Wins

Despite being a simpler architecture, XGBoost outperforms neural networks on this task. The tabular nature of cricket statistics (player averages, match state) plays to XGBoost's strengths. Neural networks like LSTM are better suited for sequential patterns, but the additional complexity doesn't translate to improved match predictions.

Results

Tested Against Betting Markets

Evaluated on 44 T20 World Cup 2024 matches with real betting odds

Log Loss: 0.73 (lower is better)
Brier Score: 0.26 (calibration metric)
Average Edge: 29% (vs betting markets)
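Both metrics are straightforward to compute from match-level win probabilities. A self-contained sketch, assuming a binary win/loss framing:

```python
import math

def log_loss(y_true, p_win):
    """Mean negative log-likelihood of the predicted win probabilities;
    lower means the model put more probability on what actually happened."""
    eps = 1e-12  # clip to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, p_win):
        total -= math.log(max(p if y == 1 else 1.0 - p, eps))
    return total / len(y_true)

def brier_score(y_true, p_win):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_win)) / len(y_true)
```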

What This Means

Market Disagreement

The model finds significant edge opportunities on every match, identifying where betting markets may be mispricing probabilities.

Calibration

The Brier score of 0.26 indicates reasonable probability calibration. When the model predicts 60% win probability, teams win approximately 60% of the time.
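One way to check that claim is a reliability table: bucket predictions by predicted probability and compare against observed win rates. A minimal sketch:

```python
def reliability_bins(y_true, p_win, n_bins=10):
    """Bucket predictions by predicted probability; for each non-empty
    bucket return (mean predicted prob, observed win rate, count).
    Well-calibrated forecasts have the first two numbers close together."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_win):
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    out = []
    for bucket in bins:
        if bucket:
            pred = sum(p for _, p in bucket) / len(bucket)
            obs = sum(y for y, _ in bucket) / len(bucket)
            out.append((round(pred, 2), round(obs, 2), len(bucket)))
    return out
```

On a perfectly calibrated forecaster, every row's predicted and observed values match; a 60%-bucket row showing roughly 0.60 observed is what the paragraph above describes.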

About This Project

CricML is a personal project exploring the intersection of machine learning and cricket analytics. It demonstrates production-grade ML engineering with proper temporal data handling, efficient memory management, and rigorous evaluation against real-world betting markets.
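The "proper temporal data handling" mentioned here typically means splitting by match date so the model never trains on matches played after the ones it is evaluated on. A minimal sketch, assuming each ball record carries an ISO date string:

```python
def temporal_split(ball_records, cutoff_date):
    """Chronological train/test split: train strictly before the cutoff,
    test on or after it, so no information leaks backwards in time."""
    train = [b for b in ball_records if b["date"] < cutoff_date]
    test = [b for b in ball_records if b["date"] >= cutoff_date]
    return train, test
```

With a cutoff just before the T20 World Cup 2024, every evaluation match would sit entirely in the test partition.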