Predicting Cricket, One Ball at a Time
A machine learning system trained on 4 million individual ball outcomes that forecasts T20 match results with Monte Carlo simulation.
Why Traditional Prediction Fails
a data problem
Traditional Approach
Historical T20 matches
Predicting match winners directly gives limited training data — not enough to learn nuanced patterns.
Our Approach
Individual ball outcomes
By predicting each ball we get 266× more training data. Rich patterns emerge from abundant examples.
The Key Insight
Each ball has an outcome: dot, single, double, boundary, six, or wicket. By simulating entire matches ball-by-ball with Monte Carlo methods, we capture the natural uncertainty of cricket while leveraging abundant training data.
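The idea above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the outcome labels follow the text (with "boundary" taken to mean a four), while the fixed probabilities, the 120-ball innings with no extras, and the name `simulate_innings` are assumptions made for the example. In the real system the probabilities would come from the trained model and vary with match state.

```python
import random

# Outcome labels follow the text; the probabilities are illustrative
# placeholders, not values produced by the trained model.
OUTCOMES = ["dot", "single", "double", "four", "six", "wicket"]
RUNS = {"dot": 0, "single": 1, "double": 2, "four": 4, "six": 6}
PROBS = [0.35, 0.35, 0.08, 0.12, 0.05, 0.05]


def simulate_innings(rng, balls=120, wickets=10):
    """Play one innings ball by ball and return the total runs scored."""
    runs = 0
    wickets_left = wickets
    for _ in range(balls):
        outcome = rng.choices(OUTCOMES, weights=PROBS, k=1)[0]
        if outcome == "wicket":
            wickets_left -= 1
            if wickets_left == 0:
                break  # all out
        else:
            runs += RUNS[outcome]
    return runs


rng = random.Random(42)  # seeded so the run is repeatable
totals = [simulate_innings(rng) for _ in range(1000)]
mean_total = sum(totals) / len(totals)
print(f"mean simulated total: {mean_total:.1f}")
```

Repeating the innings many times gives a distribution of totals rather than a single number, which is where the uncertainty in the final forecast comes from.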
How It Works
data to predictions
Data
8,341 matches → 4M balls
Features
63 features per ball
Model
XGBoost classifier
Simulate
1,000 Monte Carlo runs
Predict
Win probabilities
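The last two steps of the pipeline (simulate, then predict) might look roughly like the sketch below. Everything named here is hypothetical: `predict_ball_probs` is a toy stand-in for the trained classifier, the `strength` parameter replaces the 63 real features, and the chase logic ignores extras and tie-breakers.

```python
import random

OUTCOMES = [0, 1, 2, 4, 6, "W"]  # runs scored, or "W" for a wicket


def predict_ball_probs(balls_left, wickets_left, strength):
    """Hypothetical stand-in for the classifier: one probability per outcome.

    This toy version reacts only to balls_left and a batting-strength
    multiplier; wickets_left is accepted but ignored.
    """
    death_boost = 0.02 if balls_left <= 30 else 0.0  # more hitting late on
    four = 0.11 * strength + death_boost
    six = 0.05 * strength + death_boost
    single, double, wicket = 0.35, 0.08, 0.05
    dot = 1.0 - (single + double + four + six + wicket)
    return [dot, single, double, four, six, wicket]


def simulate_innings(rng, strength, target=None, balls=120):
    """Simulate one innings; stop early if a chase target is reached."""
    runs, wickets_left = 0, 10
    for ball in range(balls):
        probs = predict_ball_probs(balls - ball, wickets_left, strength)
        outcome = rng.choices(OUTCOMES, weights=probs, k=1)[0]
        if outcome == "W":
            wickets_left -= 1
            if wickets_left == 0:
                break
        else:
            runs += outcome
            if target is not None and runs >= target:
                break  # chase completed
    return runs


def win_probability(strength_a, strength_b, n_sims=1000, seed=0):
    """Fraction of simulated matches won by team A, batting first."""
    rng = random.Random(seed)
    wins_a = 0
    for _ in range(n_sims):
        first = simulate_innings(rng, strength_a)
        second = simulate_innings(rng, strength_b, target=first + 1)
        if first > second:  # ties are not counted as wins for team A
            wins_a += 1
    return wins_a / n_sims


p_even = win_probability(1.0, 1.0)
p_strong = win_probability(1.2, 1.0)
print(p_even, p_strong)
```

The win probability is simply the share of simulated matches a team wins, so its precision depends on the number of runs: with 1,000 runs, the standard error of an estimate near 50% is about 1.6 percentage points.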
63 Features Across 6 Categories
What Drives the Predictions?
Top 10 features by XGBoost gain importance
Model Leaderboard
four architectures, one task
Tested on T20 World Cup 2024 matches with real betting odds. Lower log loss and Brier are better; market edge is disagreement with the book.
| Rank | Model | Log Loss | Brier | Market Edge | Speed |
|---|---|---|---|---|---|
| 1 | XGBoost · Gradient Boosting | 0.655 | 0.219 | 29.4% | ~346s |
| 2 | MLP · Neural Net | 0.707 | 0.254 | 27.0% | ~75s |
| 3 | LSTM · Recurrent | 0.721 | 0.261 | 25.8% | ~420s |
| 4 | Fine-tuned LLM · Transformer | 0.748 | 0.278 | 24.1% | ~890s |
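For readers unfamiliar with the two scoring columns, here is a small sketch of how log loss and Brier score are computed for binary match forecasts. The forecasts and results below are made up for illustration and are not taken from the evaluation.

```python
import math


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes (1 = forecast team won)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Made-up forecasts for four matches
probs = [0.7, 0.4, 0.55, 0.2]
outcomes = [1, 0, 1, 0]

# Reference point: always forecasting 50%
coin_flip = [0.5] * len(outcomes)

print(log_loss(probs, outcomes), brier_score(probs, outcomes))
print(log_loss(coin_flip, outcomes), brier_score(coin_flip, outcomes))
```

A constant 50% forecast scores ln 2 ≈ 0.693 on log loss and 0.25 on Brier, which gives a useful baseline when reading the table.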
XGBoost
Gradient-boosted trees, Optuna-tuned. 444 estimators, max depth 10.
MLP
3-layer feedforward (256→128→64), BatchNorm, focal loss.
LSTM
2-layer with player embeddings over 10-ball windows.
Fine-tuned LLM
GPT-style transformer with LoRA on cricket commentary.
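The MLP entry above mentions focal loss. Focal loss scales cross-entropy by (1 - p_t)^gamma, so examples the model already predicts well contribute less; it is commonly used when some classes are rare, which is plausibly the case for wickets and sixes here. The per-example sketch below is illustrative only: `gamma=2.0` is a common choice rather than a documented project setting, and the probability vectors are invented.

```python
import math


def cross_entropy(probs, true_index):
    """Standard cross-entropy for one example."""
    return -math.log(probs[true_index])


def focal_loss(probs, true_index, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t)."""
    p_t = probs[true_index]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Invented class probabilities: dot, single, double, four, six, wicket
easy = [0.80, 0.10, 0.03, 0.04, 0.02, 0.01]  # true class: dot (index 0)
hard = [0.45, 0.35, 0.06, 0.07, 0.02, 0.05]  # true class: wicket (index 5)

for name, probs, idx in [("easy", easy, 0), ("hard", hard, 5)]:
    ce = cross_entropy(probs, idx)
    fl = focal_loss(probs, idx)
    print(f"{name}: cross-entropy={ce:.3f} focal={fl:.3f} ratio={fl / ce:.3f}")
```

With gamma set to 0 the expression reduces to ordinary cross-entropy; larger values shift more of the training signal toward poorly predicted balls.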
Why XGBoost Wins
The tabular nature of cricket statistics — player averages, match state — plays to gradient boosting's strengths. Neural nets like LSTM suit sequential patterns, but the added complexity doesn't translate into better match predictions here.
Tested Against the Market
44 World Cup matches
Log Loss
Lower is better
Brier Score
Calibration metric
Average Edge
vs betting markets
Market Disagreement
The model's probabilities differ from the market's on every match, flagging where betting markets may misprice outcomes.
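One way to quantify that disagreement is to convert the bookmaker's decimal odds into implied probabilities and subtract them from the model's forecast. The page does not say how its edge figures are calculated, so the definition below (model probability minus margin-free implied probability, with the margin removed by proportional normalisation) is an assumption, and the odds are made up.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to probabilities, normalising out the book's margin."""
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)  # exceeds 1 when the book carries a margin
    return [r / total for r in raw]


def edge(model_prob, market_prob):
    """Assumed definition: model probability minus market-implied probability."""
    return model_prob - market_prob


odds = [1.80, 2.10]  # made-up decimal odds for team A and team B
market = implied_probabilities(odds)
model_p = 0.60  # hypothetical model forecast for team A
e = edge(model_p, market[0])
print(f"market: {market[0]:.3f}, model: {model_p:.3f}, edge: {e:+.3f}")
```

A positive edge means the model rates the team more highly than the market does; whether that is exploitable depends on how well calibrated the model is.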
Calibration
XGBoost's Brier score of 0.219 on the leaderboard sits below the 0.25 that a constant 50% forecast would earn, which indicates reasonable calibration: when the model says 60%, teams win about 60% of the time.
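The "says 60%, wins 60%" claim can be checked directly by binning forecasts and comparing each bin's average forecast with its observed win rate. The sketch below uses invented forecasts and results; the function name `calibration_table` and the choice of five equal-width bins are assumptions for illustration.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Group forecasts into equal-width bins and compare forecast with win rate.

    Returns (mean forecast, observed win rate, count) for each non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        win_rate = sum(y for _, y in members) / len(members)
        rows.append((round(mean_p, 3), round(win_rate, 3), len(members)))
    return rows


# Invented forecasts and results (1 = forecast team won)
probs = [0.62, 0.66, 0.61, 0.65, 0.71, 0.30, 0.35, 0.25, 0.38, 0.32]
outcomes = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

rows = calibration_table(probs, outcomes)
for mean_p, win_rate, n in rows:
    print(f"forecast {mean_p:.2f} -> won {win_rate:.2f} (n={n})")
```

With only 44 matches each bin holds a handful of games, so a table like this is noisy and best read alongside the Brier score rather than in place of it.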
About This Project
CricML is a personal project exploring the intersection of machine learning and cricket analytics. It demonstrates production-grade ML engineering with proper temporal data handling, efficient memory management, and rigorous evaluation against real-world betting markets.