Technical Deep Dive
Complete technical documentation of the CricML prediction system, including architecture, code patterns, and engineering decisions.
1 System Architecture
The system is organized into three main phases: training, simulation, and evaluation. Each phase is designed to be modular and reproducible.
End-to-End Pipeline
Key Files
- scripts/parsing_v2.py
- scripts/xgboost_v2.py
- scripts/sim_v1_2.py
- scripts/stats_provider.py
- scripts/sim_eval/match_evaluator.py
Technology Stack
- Python 3.11+
- XGBoost 2.0+ (gradient boosting)
- Pandas / NumPy (data processing)
- Optuna (hyperparameter optimization)
- Parquet (columnar storage)
2 Data Engineering
The most critical engineering challenge is maintaining temporal integrity - ensuring we never use future information when making predictions. This is achieved through a chunked stats cache system.
Chunked Stats Cache
The complete stats cache is 7.6 GB, too large to hold in memory at once. We split it into 69 chunks and keep only a few loaded at a time (5 by default), evicting the Least Recently Used (LRU) chunk when a new one is needed.
# stats_provider.py - Temporal stats lookup
from collections import OrderedDict

class StatsProvider:
    def __init__(self, cache_dir, max_chunks=5):
        self.cache_dir = cache_dir
        self.max_chunks = max_chunks        # max chunks held in memory
        self.loaded_chunks = OrderedDict()  # LRU cache, oldest entry first
        self.metadata = self._load_metadata()

    def get_stats(self, player_id, match_date):
        """Get player stats as of match_date (temporal lookup)"""
        # Binary search for most recent snapshot <= match_date
        chunk_idx = self._find_chunk(match_date)
        # Load chunk if not in memory (with LRU eviction)
        if chunk_idx not in self.loaded_chunks:
            self._load_chunk(chunk_idx)
        # Mark the chunk as most recently used so it is evicted last
        self.loaded_chunks.move_to_end(chunk_idx)
        # Return stats snapshot
        return self.loaded_chunks[chunk_idx].get(player_id, DEFAULT_STATS)
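The listing above calls `_find_chunk` and `_load_chunk` without showing them. The following is a minimal sketch of how the binary search and the LRU eviction could fit together. It is not the project's code: the `ChunkIndex` class, its `loader` callable, and the idea that each chunk is identified by a start date are all assumptions made for illustration.

```python
import bisect
from collections import OrderedDict
from datetime import date


class ChunkIndex:
    """Maps a match date to the latest chunk starting on or before that date."""

    def __init__(self, chunk_start_dates, loader, max_chunks=5):
        self.chunk_start_dates = sorted(chunk_start_dates)  # one start date per chunk
        self.loader = loader          # hypothetical callable: chunk_idx -> dict of player stats
        self.max_chunks = max_chunks
        self.loaded = OrderedDict()   # least recently used chunk sits at the front

    def find_chunk(self, match_date):
        # bisect_right returns the insertion point after any equal dates,
        # so subtracting 1 gives the last chunk starting on or before match_date.
        idx = bisect.bisect_right(self.chunk_start_dates, match_date) - 1
        if idx < 0:
            raise KeyError(f"no snapshot on or before {match_date}")
        return idx

    def get_chunk(self, chunk_idx):
        if chunk_idx in self.loaded:
            self.loaded.move_to_end(chunk_idx)     # refresh recency on a hit
        else:
            if len(self.loaded) >= self.max_chunks:
                self.loaded.popitem(last=False)    # evict the least recently used chunk
            self.loaded[chunk_idx] = self.loader(chunk_idx)
        return self.loaded[chunk_idx]
```

Because matches are usually processed in date order, consecutive lookups tend to hit the same chunk, which is why a small LRU window can serve a cache far larger than memory.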
Data Sources
Cricsheet
Ball-by-ball match data in JSON format
- 8,341 T20 matches
- 7,223 unique players
- Full delivery records
Player Metadata
Enriched player information
- Batting hand (L/R)
- Bowling arm & style
- Date of birth
3 Feature Engineering
Each ball is described by 63 features organized into a three-tier structure. The features capture match state, player performance, momentum, and tactical matchups.
Feature Tiers
Tier 1: Player Metadata (8 features)
Static player attributes: batting hand, bowling arm, bowling style, age at match date
Tier 2: Matchup Features (5 features)
Tactical interactions: spin matchup advantage, same arm matchup, matchup type encoding
Tier 3: Type-Based Historical (8 features)
Performance by type: batter avg/SR vs pace/spin, bowler metrics vs left/right-handers
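As an illustration of the simpler tier features, here is a small sketch of two of them: age at match date (Tier 1) and the same-arm matchup flag (Tier 2). The function names and the 'L'/'R' encoding are assumptions; the source names these features but does not show how they are computed.

```python
from datetime import date


def age_at_match(date_of_birth, match_date):
    """Age in whole years on the day of the match."""
    had_birthday = (match_date.month, match_date.day) >= (date_of_birth.month, date_of_birth.day)
    return match_date.year - date_of_birth.year - (0 if had_birthday else 1)


def same_arm_matchup(batting_hand, bowling_arm):
    """1 if the batter's hand matches the bowler's arm ('L' or 'R'), else 0."""
    return 1 if batting_hand == bowling_arm else 0
```

Computing age from the match date rather than from today's date keeps the feature consistent with the temporal-integrity rule described earlier.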
Feature Extraction Code
# parsing_v2.py - Feature extraction for each ball
def extract_ball_features(ball, match_state, stats_provider, match_date):
    features = {}

    # Match State Features
    features['score'] = match_state.runs
    features['wickets'] = match_state.wickets
    features['balls_bowled'] = match_state.balls
    # Runs per over; 0.0 before the first ball to avoid dividing by zero
    overs = match_state.balls / 6
    features['run_rate'] = match_state.runs / overs if overs > 0 else 0.0
    features['is_powerplay'] = 1 if match_state.balls < 36 else 0  # first 6 overs

    # Player Stats (temporal lookup - CRITICAL)
    batter_stats = stats_provider.get_stats(ball.batter_id, match_date)
    bowler_stats = stats_provider.get_stats(ball.bowler_id, match_date)
    features['batter_avg'] = batter_stats['batting_avg']
    features['batter_sr'] = batter_stats['strike_rate']
    features['bowler_economy'] = bowler_stats['economy']

    # Momentum Features
    features['last_5_runs'] = sum(match_state.recent_balls[-5:])
    features['balls_since_boundary'] = match_state.balls_since_boundary
    features['pressure_index'] = calculate_pressure(match_state)

    # Matchup Features
    features['spin_matchup'] = get_spin_advantage(batter_stats, bowler_stats)

    return features
Complete Feature List
| Category | Count | Key Features |
|---|---|---|
| Match State | 12 | score, wickets, balls_bowled, run_rate, is_powerplay, is_death |
| Player Stats | 20 | batter_avg, batter_sr, bowler_economy, h2h_avg, recent_form |
| Momentum | 10 | last_5_runs, balls_since_boundary, partnership, pressure_index |
| Chase | 4 | target, required_rate, lead_gap, venue_avg_score |
| Matchups | 5 | spin_matchup, same_arm, matchup_type, batter_hand |
| Context | 12 | venue_encoded, batting_first, toss_winner, balls_in_over |
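The category counts in the table add up to the 63 features quoted at the start of this section. A short sketch of that check follows, with a hypothetical guard that could catch a missing or extra feature at extraction time; `check_feature_count` is not part of the project.

```python
# Category counts copied from the table above.
FEATURE_COUNTS = {
    "Match State": 12,
    "Player Stats": 20,
    "Momentum": 10,
    "Chase": 4,
    "Matchups": 5,
    "Context": 12,
}

EXPECTED_TOTAL = sum(FEATURE_COUNTS.values())  # 63


def check_feature_count(features, expected=EXPECTED_TOTAL):
    """Raise if a feature dict does not have the expected number of entries."""
    if len(features) != expected:
        raise ValueError(f"expected {expected} features, got {len(features)}")
```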
4 Model Training
The core model is an XGBoost multi-class classifier predicting 6 outcomes: dot ball (0), single (1), double (2), boundary (4), six (6), and wicket.
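XGBoost's classifier expects integer class labels running from 0 to one less than the number of classes, so the raw outcomes (0, 1, 2, 4, 6, wicket) need to be mapped to indices 0 to 5 before training. A minimal sketch of that mapping is below. The ordering is an assumption taken from the `outcomes` list in the simulation code, and the helper names are hypothetical.

```python
# Assumed class order; it mirrors the outcomes list used by the simulator.
OUTCOMES = [0, 1, 2, 4, 6, "W"]
OUTCOME_TO_CLASS = {outcome: idx for idx, outcome in enumerate(OUTCOMES)}


def encode_outcomes(raw_outcomes):
    """Map raw ball outcomes to contiguous class indices 0..5."""
    return [OUTCOME_TO_CLASS[o] for o in raw_outcomes]


def decode_class(class_idx):
    """Map a class index back to the ball outcome it represents."""
    return OUTCOMES[class_idx]
```

Keeping one shared `OUTCOMES` list for both training and simulation avoids the two sides disagreeing about which probability column means which outcome.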
Hyperparameter Optimization
Optuna was used for Bayesian optimization across 50 trials with 7 hyperparameters.
# xgboost_v2.py - Optuna hyperparameter search
from sklearn.metrics import log_loss
from xgboost import XGBClassifier

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 1),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 1),
    }
    model = XGBClassifier(**params, objective='multi:softprob')
    # sample_weight takes one weight per training row (see Class Imbalance Handling)
    model.fit(X_train, y_train, sample_weight=class_weights)
    # Validation log loss is the value Optuna minimizes
    return log_loss(y_val, model.predict_proba(X_val))
Final Hyperparameters
Class Imbalance Handling
Ball outcomes are heavily imbalanced. Dot balls and singles dominate, while sixes and wickets are rare.
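One common way to counter this imbalance is to weight each training row inversely to the frequency of its class, so rare outcomes such as sixes and wickets contribute as much to the loss as common ones. The sketch below uses the "balanced" formula n_samples / (n_classes * class_count). The source does not state which weighting scheme was actually used, so treat this as an assumption.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """One weight per row: n_samples / (n_classes * count of that row's class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    class_weight = {c: n_samples / (n_classes * k) for c, k in counts.items()}
    return [class_weight[y] for y in labels]
```

The returned list has one entry per training row, which is the shape the `sample_weight` argument of `fit` expects. With this formula the weights sum to the number of rows, so the overall scale of the loss is unchanged.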
5 Simulation Engine
The simulation engine runs 1,000 Monte Carlo simulations of each match. Each simulation plays out ball-by-ball, updating match state and sampling outcomes from the model's probability distribution.
# sim_v1_2.py - Core simulation logic (simplified)
import numpy as np

class SimulationEngine:
    def simulate_match(self, match_info, n_sims=1000):
        results = []
        for _ in range(n_sims):
            state = MatchState(match_info)

            # First innings
            while not state.innings_complete():
                features = self.extract_features(state)
                probs = self.model.predict_proba(features)
                outcome = self.sample_outcome(probs)
                state.update(outcome)

            state.start_second_innings()

            # Second innings (chase)
            while not state.match_complete():
                features = self.extract_features(state)
                probs = self.model.predict_proba(features)
                outcome = self.sample_outcome(probs)
                state.update(outcome)

            results.append({
                'winner': state.winner,
                'team1_score': state.team1_runs,
                'team2_score': state.team2_runs,
            })
        return self.aggregate_results(results)

    def sample_outcome(self, probs):
        """Sample from probability distribution"""
        outcomes = [0, 1, 2, 4, 6, 'W']
        # predict_proba returns one row per input; flatten the single row
        probs = np.asarray(probs).ravel()
        # Sample an index so run values stay ints (np.random.choice on a
        # mixed int/str list would return strings)
        return outcomes[np.random.choice(len(outcomes), p=probs)]
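The listing ends by calling `aggregate_results`, which is not shown. A plausible minimal version, assuming each result is a dict with the `winner`, `team1_score`, and `team2_score` keys used above, would turn the simulated matches into win probabilities and average scores:

```python
from collections import Counter
from statistics import mean


def aggregate_results(results):
    """Turn per-simulation results into win probabilities and average scores."""
    n = len(results)
    wins = Counter(r["winner"] for r in results)
    return {
        "win_prob": {team: count / n for team, count in wins.items()},
        "avg_team1_score": mean(r["team1_score"] for r in results),
        "avg_team2_score": mean(r["team2_score"] for r in results),
    }
```

With 1,000 simulations, the sampling error on an estimated win probability p is sqrt(p(1-p)/1000), which is at most about 1.6 percentage points (at p = 0.5).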
Match State Management
The MatchState class tracks everything needed to simulate a match:
Score State
- Current score
- Wickets fallen
- Balls bowled
- Target (2nd innings)
Player State
- Current batter
- Non-striker
- Current bowler
- Batting order queue
History
- Recent ball outcomes
- Partnership runs
- Bowler spells
- Phase transitions
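To make the state categories above concrete, here is a simplified sketch of such a class. It is not the project's `MatchState`: the class name, field names, and update rules are assumptions, and it leaves out bowler spells, phase transitions, and extras.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MatchStateSketch:
    # Score state
    runs: int = 0
    wickets: int = 0
    balls: int = 0
    target: Optional[int] = None      # set only in the second innings
    # Player state
    striker: str = ""
    non_striker: str = ""
    bowler: str = ""
    batting_order: List[str] = field(default_factory=list)
    # History
    recent_balls: List[int] = field(default_factory=list)
    partnership_runs: int = 0

    def update(self, outcome):
        """Apply one legal delivery: a run value (0, 1, 2, 4, 6) or 'W'."""
        self.balls += 1
        if outcome == "W":
            self.wickets += 1
            self.partnership_runs = 0
            self.recent_balls.append(0)
            # Next batter comes in, if any remain
            self.striker = self.batting_order.pop(0) if self.batting_order else ""
        else:
            self.runs += outcome
            self.partnership_runs += outcome
            self.recent_balls.append(outcome)
            if outcome % 2 == 1:          # odd runs swap the batters
                self.striker, self.non_striker = self.non_striker, self.striker
        if self.balls % 6 == 0:           # end of over: batters change ends
            self.striker, self.non_striker = self.non_striker, self.striker

    def innings_complete(self, max_balls=120, max_wickets=10):
        chased = self.target is not None and self.runs >= self.target
        return self.balls >= max_balls or self.wickets >= max_wickets or chased
```

Tracking who is on strike matters because the player-level features depend on the current batter and bowler, so a strike rotation changes the model's input for the next ball.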
6 Evaluation Framework
The model is evaluated against real betting market odds from 44 T20 World Cup 2024 matches. This provides a rigorous benchmark since betting markets aggregate information from many sophisticated predictors.
Metrics
Log Loss
Measures probability calibration. Formula: -log(P(actual winner))
Brier Score
Mean squared error of probabilities. Formula: (P - actual)^2
Market Edge
Average difference from betting market probabilities
Probability Calibration
A well-calibrated model predicts probabilities that match observed frequencies. If the model predicts a 60% win probability in 100 matches, those teams should win approximately 60 of them. The Brier score (0.26) summarizes this reliability in a single number, where lower is better.
Calibration chart: tested on 44 T20 World Cup matches. Note: Plotly is used for the interactive Python version; this is a web representation.
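A calibration check of this kind can be sketched by grouping predictions into probability bins and comparing the average prediction in each bin with the observed win rate. The function below is an illustrative sketch, not the project's plotting code. The bin count and the function name are assumptions.

```python
def calibration_table(predicted, won, n_bins=5):
    """Group predictions into equal-width bins and compare the mean
    predicted probability with the observed win rate in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, w in zip(predicted, won):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        bins[idx].append((p, w))
    rows = []
    for idx, items in enumerate(bins):
        if not items:
            continue                              # skip empty bins
        mean_pred = sum(p for p, _ in items) / len(items)
        win_rate = sum(w for _, w in items) / len(items)
        rows.append((idx / n_bins, (idx + 1) / n_bins, len(items), mean_pred, win_rate))
    return rows
```

Each row holds the bin's lower and upper edge, the number of predictions in it, the mean predicted probability, and the observed win rate. With only 44 matches each bin holds few predictions, so the observed rates are noisy and a coarse binning is the safer choice.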
# match_evaluator.py - Evaluation metrics
import numpy as np

def evaluate_match(model_prob, market_prob, actual_winner):
    # Log Loss
    log_loss = -np.log(model_prob[actual_winner])

    # Brier Score (summed over both teams)
    brier = sum((model_prob[team] - (team == actual_winner))**2
                for team in model_prob)

    # Edge vs Market
    edge = {team: model_prob[team] - market_prob[team]
            for team in model_prob}

    return {
        'log_loss': log_loss,
        'brier_score': brier,
        'edge': edge,
    }
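Written out over a set of N matches, with p_i denoting the model's probability for the team that actually won match i, the two scoring rules take the form below. Averaging over matches is the standard convention and is assumed here; the source gives only the per-match formulas.

```latex
\text{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N} \log p_i ,
\qquad
\text{Brier} = \frac{1}{N}\sum_{i=1}^{N} (p_i - 1)^2

% evaluate_match sums the squared error over both teams. If the two
% probabilities sum to 1, that doubles the single-team value:
(p_i - 1)^2 + \bigl((1 - p_i) - 0\bigr)^2 = 2\,(1 - p_i)^2
```

For example, a prediction of p = 0.6 for the eventual winner gives a log loss of -ln 0.6 ≈ 0.511 and a single-team Brier term of 0.16, or 0.32 when summed over both teams as in the code. A coin-flip prediction (p = 0.5) scores ln 2 ≈ 0.693 and 0.25 (0.5 summed), which are useful baselines when reading the reported numbers.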