For Data Scientists & Engineers

Technical Deep Dive

Complete technical documentation of the CricML prediction system, including architecture, code patterns, and engineering decisions.

1 System Architecture

The system is organized into three main phases: training, simulation, and evaluation. Each phase is designed to be modular and reproducible.

End-to-End Pipeline

Training Phase
  • Input: raw JSON data (8,341 matches)
  • Feature engineering (parsing_v2.py): 63 features per ball, temporal stats lookup
  • 7.6GB chunked cache: 69 chunks, LRU eviction
  • 4M+ training records in Parquet format
  • Model training (xgboost_v2.py): Optuna, 50 trials, class weights
  • Output: trained XGBoost model (128MB + encoders)

Simulation Phase
  • Input: test match + lineups
  • Monte Carlo engine (sim_v1_2.py): 1,000 simulations, ball-by-ball (predict → sample → update)
  • Output: win probabilities + score distributions

Evaluation Phase
  • Inputs: model predictions + betting market odds
  • Metrics calculator (match_evaluator.py): 44 World Cup matches
  • Results: log loss, Brier score, edge, calibration analysis
Key Files
  • scripts/parsing_v2.py
  • scripts/xgboost_v2.py
  • scripts/sim_v1_2.py
  • scripts/stats_provider.py
  • scripts/sim_eval/match_evaluator.py
Technology Stack
  • Python 3.11+
  • XGBoost 2.0+ (gradient boosting)
  • Pandas / NumPy (data processing)
  • Optuna (hyperparameter optimization)
  • Parquet (columnar storage)

2 Data Engineering

The most critical engineering challenge is maintaining temporal integrity: ensuring we never use future information when making predictions. This is achieved through a chunked stats cache system.

Chunked Stats Cache

The complete stats cache is 7.6GB, too large to fit in memory, so we split it into 69 chunks with LRU (Least Recently Used) eviction.

# stats_provider.py - Temporal stats lookup

from collections import OrderedDict

class StatsProvider:
    def __init__(self, cache_dir, max_chunks=5):
        self.cache_dir = cache_dir
        self.max_chunks = max_chunks
        self.loaded_chunks = OrderedDict()  # LRU cache
        self.metadata = self._load_metadata()

    def get_stats(self, player_id, match_date):
        """Get player stats as of match_date (temporal lookup)"""

        # Binary search for most recent snapshot <= match_date
        chunk_idx = self._find_chunk(match_date)

        # Load chunk if not in memory (with LRU eviction)
        if chunk_idx not in self.loaded_chunks:
            self._load_chunk(chunk_idx)

        # Return stats snapshot
        return self.loaded_chunks[chunk_idx].get(player_id, DEFAULT_STATS)
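The _load_chunk path is elided above, but the eviction policy itself is only a few lines of OrderedDict bookkeeping. A minimal sketch of the LRU behaviour, with load_fn standing in for the real on-disk chunk reader (an assumption, not the actual loader):

```python
from collections import OrderedDict

class ChunkCache:
    """Minimal sketch of the LRU chunk loading used by StatsProvider.
    load_fn stands in for reading a chunk file from disk."""

    def __init__(self, load_fn, max_chunks=5):
        self.load_fn = load_fn
        self.max_chunks = max_chunks
        self.chunks = OrderedDict()  # insertion order doubles as recency order

    def get(self, chunk_idx):
        if chunk_idx in self.chunks:
            # Mark as most recently used
            self.chunks.move_to_end(chunk_idx)
            return self.chunks[chunk_idx]
        # Evict the least recently used chunk if at capacity
        if len(self.chunks) >= self.max_chunks:
            self.chunks.popitem(last=False)
        self.chunks[chunk_idx] = self.load_fn(chunk_idx)
        return self.chunks[chunk_idx]
```

With max_chunks=5 and roughly 110MB per chunk, this is what keeps the resident footprint near 550MB despite the 7.6GB total.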
  • 7.6GB total size
  • 69 chunks
  • 550MB memory footprint
  • 95%+ cache hit rate

Data Sources

Cricsheet

Ball-by-ball match data in JSON format

  • 8,341 T20 matches
  • 7,223 unique players
  • Full delivery records
Player Metadata

Enriched player information

  • Batting hand (L/R)
  • Bowling arm & style
  • Date of birth

3 Feature Engineering

Each ball is described by 63 features capturing match state, player performance, momentum, and tactical matchups. The player- and matchup-specific features are organized into a three-tier structure.

Feature Tiers

Tier 1: Player Metadata (8 features)

Static player attributes: batting hand, bowling arm, bowling style, age at match date

Tier 2: Matchup Features (5 features)

Tactical interactions: spin matchup advantage, same arm matchup, matchup type encoding

Tier 3: Type-Based Historical (8 features)

Performance by type: batter avg/SR vs pace/spin, bowler metrics vs left/right-handers
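A Tier 2 matchup feature such as spin advantage can be sketched as below. The key names and the set of spin styles are illustrative; the real cache schema in parsing_v2.py may differ.

```python
SPIN_STYLES = {"legbreak", "offbreak", "slow left-arm orthodox", "left-arm wrist spin"}

def get_spin_advantage(batter_stats, bowler_stats):
    """Illustrative matchup feature: how much better (or worse) the
    batter strikes against this bowler's type than overall."""
    if bowler_stats.get("bowling_style") in SPIN_STYLES:
        type_sr = batter_stats.get("sr_vs_spin", batter_stats["strike_rate"])
    else:
        type_sr = batter_stats.get("sr_vs_pace", batter_stats["strike_rate"])
    # Positive: batter is stronger than usual against this bowling type
    return type_sr - batter_stats["strike_rate"]
```

Falling back to the overall strike rate when a type-split is missing keeps the feature defined for players with thin histories.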

Feature Extraction Code

# parsing_v2.py - Feature extraction for each ball

def extract_ball_features(ball, match_state, stats_provider, match_date):
    features = {}

    # Match State Features
    features['score'] = match_state.runs
    features['wickets'] = match_state.wickets
    features['balls_bowled'] = match_state.balls
    features['run_rate'] = match_state.runs / max(match_state.balls / 6, 1e-9)  # guard against 0 balls
    features['is_powerplay'] = 1 if match_state.balls < 36 else 0

    # Player Stats (temporal lookup - CRITICAL)
    batter_stats = stats_provider.get_stats(ball.batter_id, match_date)
    bowler_stats = stats_provider.get_stats(ball.bowler_id, match_date)

    features['batter_avg'] = batter_stats['batting_avg']
    features['batter_sr'] = batter_stats['strike_rate']
    features['bowler_economy'] = bowler_stats['economy']

    # Momentum Features
    features['last_5_runs'] = sum(match_state.recent_balls[-5:])
    features['balls_since_boundary'] = match_state.balls_since_boundary
    features['pressure_index'] = calculate_pressure(match_state)

    # Matchup Features
    features['spin_matchup'] = get_spin_advantage(batter_stats, bowler_stats)

    return features
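The calculate_pressure helper referenced above is not shown; a hedged sketch of one plausible formulation, combining wickets lost with the required-rate deficit in a chase (the exact weighting in parsing_v2.py may differ):

```python
def calculate_pressure(match_state):
    """Illustrative pressure index: wickets lost plus, in a chase,
    how far the required rate has climbed above the current rate."""
    balls_left = 120 - match_state.balls
    wicket_pressure = match_state.wickets / 10
    # First innings (no target) or innings over: wickets only
    if getattr(match_state, "target", None) is None or balls_left <= 0:
        return wicket_pressure
    required_rate = (match_state.target - match_state.runs) / (balls_left / 6)
    current_rate = match_state.runs / max(match_state.balls / 6, 1e-9)
    rate_pressure = max(required_rate - current_rate, 0) / 6  # rough normalisation
    return wicket_pressure + rate_pressure
```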

Complete Feature List

Category      Count  Key Features
Match State   12     score, wickets, balls_bowled, run_rate, is_powerplay, is_death
Player Stats  20     batter_avg, batter_sr, bowler_economy, h2h_avg, recent_form
Momentum      10     last_5_runs, balls_since_boundary, partnership, pressure_index
Chase         4      target, required_rate, lead_gap, venue_avg_score
Matchups      5      spin_matchup, same_arm, matchup_type, batter_hand
Context       12     venue_encoded, batting_first, toss_winner, balls_in_over
Total         63

4 Model Training

The core model is an XGBoost multi-class classifier predicting 6 outcomes: dot ball (0), single (1), double (2), boundary (4), six (6), and wicket.
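Because the simulation engine (section 5) indexes predict_proba columns positionally, the six outcomes need a fixed class order. A minimal sketch of the encoding, where the ordering is an assumption consistent with sample_outcome below and rare outcomes (threes, extras) are presumed bucketed before this step:

```python
# Fixed outcome ordering assumed throughout training and simulation;
# wickets are encoded as 'W'.
OUTCOME_CLASSES = ['0', '1', '2', '4', '6', 'W']
CLASS_TO_INDEX = {c: i for i, c in enumerate(OUTCOME_CLASSES)}

def encode_outcome(runs, is_wicket):
    """Map a raw delivery to its class index (illustrative)."""
    if is_wicket:
        return CLASS_TO_INDEX['W']
    return CLASS_TO_INDEX[str(runs)]
```

Persisting this mapping alongside the model (the "+ Encoders" in the pipeline output) guarantees the probability columns mean the same thing at simulation time.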

Hyperparameter Optimization

Optuna was used for Bayesian optimization across 50 trials with 7 hyperparameters.

# xgboost_v2.py - Optuna hyperparameter search

from sklearn.metrics import log_loss
from xgboost import XGBClassifier

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 1),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 1),
    }

    model = XGBClassifier(**params, objective='multi:softprob')
    model.fit(X_train, y_train, sample_weight=class_weights)

    return log_loss(y_val, model.predict_proba(X_val))

Final Hyperparameters

  • n_estimators: 444
  • max_depth: 10
  • learning_rate: 0.240
  • subsample: 0.878
  • colsample_bytree: 0.742
  • reg_alpha: 0.850
  • reg_lambda: 0.182
  • Accuracy: 55-60%

Class Imbalance Handling

Ball outcomes are heavily imbalanced. Dot balls and singles dominate, while sixes and wickets are rare.

  • Dot: ~35%
  • Single: ~40%
  • Double: ~5%
  • Four: ~12%
  • Six: ~5%
  • Wicket: ~3%
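The class weights passed to model.fit can be derived from these frequencies. A minimal inverse-frequency sketch; the exact scheme in xgboost_v2.py is not shown, so this is illustrative:

```python
import numpy as np

def inverse_frequency_weights(y):
    """Per-sample weights inversely proportional to class frequency,
    so rare outcomes (sixes, wickets) are not drowned out by dots
    and singles. Illustrative; xgboost_v2.py may weight differently."""
    classes, counts = np.unique(y, return_counts=True)
    freq = counts / counts.sum()
    class_weight = {c: 1.0 / f for c, f in zip(classes, freq)}
    # Normalise so the average weight is 1 and total loss scale is unchanged
    mean_w = np.mean([class_weight[c] for c in y])
    return np.array([class_weight[c] / mean_w for c in y])
```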

5 Simulation Engine

The simulation engine runs 1,000 Monte Carlo simulations of each match. Each simulation plays out ball-by-ball, updating match state and sampling outcomes from the model's probability distribution.

# sim_v1_2.py - Core simulation logic (simplified)

import numpy as np

class SimulationEngine:
    def simulate_match(self, match_info, n_sims=1000):
        results = []

        for _ in range(n_sims):
            state = MatchState(match_info)

            # First innings
            while not state.innings_complete():
                features = self.extract_features(state)
                probs = self.model.predict_proba(features)
                outcome = self.sample_outcome(probs)
                state.update(outcome)

            state.start_second_innings()

            # Second innings (chase)
            while not state.match_complete():
                features = self.extract_features(state)
                probs = self.model.predict_proba(features)
                outcome = self.sample_outcome(probs)
                state.update(outcome)

            results.append({
                'winner': state.winner,
                'team1_score': state.team1_runs,
                'team2_score': state.team2_runs,
            })

        return self.aggregate_results(results)

    def sample_outcome(self, probs):
        """Sample from the model's probability distribution"""
        outcomes = [0, 1, 2, 4, 6, 'W']
        # Sample an index rather than the values: np.random.choice would
        # coerce the mixed int/str list to an array of strings.
        idx = np.random.choice(len(outcomes), p=probs)
        return outcomes[idx]
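The aggregate_results step is not shown above. A minimal version that collapses the 1,000 per-simulation records into win probabilities and score distributions, assuming the record keys produced by simulate_match:

```python
import numpy as np

def aggregate_results(results):
    """Collapse per-simulation records into win probabilities and
    first/second-innings score summaries (illustrative)."""
    n = len(results)
    winners = [r['winner'] for r in results]
    team1_scores = np.array([r['team1_score'] for r in results])
    team2_scores = np.array([r['team2_score'] for r in results])

    def summarize(scores):
        # Mean plus a 90% interval from the empirical distribution
        return {'mean': scores.mean(),
                'p5': np.percentile(scores, 5),
                'p95': np.percentile(scores, 95)}

    return {
        'win_prob': {team: winners.count(team) / n for team in set(winners)},
        'team1_score': summarize(team1_scores),
        'team2_score': summarize(team2_scores),
    }
```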

Match State Management

The MatchState class tracks everything needed to simulate a match:

Score State
  • Current score
  • Wickets fallen
  • Balls bowled
  • Target (2nd innings)
Player State
  • Current batter
  • Non-striker
  • Current bowler
  • Batting order queue
History
  • Recent ball outcomes
  • Partnership runs
  • Bowler spells
  • Phase transitions

6 Evaluation Framework

The model is evaluated against real betting market odds from 44 T20 World Cup 2024 matches. This provides a rigorous benchmark since betting markets aggregate information from many sophisticated predictors.
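Comparing against the market requires turning quoted odds into probabilities and removing the bookmaker's margin (overround). A standard normalisation sketch; the evaluator's actual odds handling is not shown, so this is illustrative:

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to implied win probabilities, removing
    the bookmaker's margin by normalising the raw inverses to 1."""
    raw = {team: 1.0 / o for team, o in decimal_odds.items()}
    total = sum(raw.values())  # > 1 because of the bookmaker's margin
    return {team: p / total for team, p in raw.items()}
```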

Metrics

Log Loss: 0.73
Measures probability calibration. Formula: -log(P(actual winner))

Brier Score: 0.26
Mean squared error of probabilities. Formula: (P - actual)^2

Market Edge: 29%
Average difference from betting market probabilities

Probability Calibration

A well-calibrated model predicts probabilities that match observed frequencies: if the model assigns a 60% win probability across 100 matches, those teams should win roughly 60 of them. The Brier score (0.26) summarizes this reliability, together with sharpness, in a single number.
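Calibration can be checked directly by binning predictions and comparing each bin's mean predicted probability with the empirical win rate. A minimal sketch (the evaluator's actual binning is not shown; on the web page this is rendered as an interactive chart):

```python
import numpy as np

def calibration_table(pred_probs, outcomes, n_bins=5):
    """Bin predicted win probabilities and compare each bin's mean
    prediction with the empirical win rate. Perfect calibration makes
    the two columns equal."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
        # Last bin is closed on the right so a prediction of exactly 1.0 is counted
        if i == n_bins - 1:
            mask = (pred_probs >= lo) & (pred_probs <= hi)
        else:
            mask = (pred_probs >= lo) & (pred_probs < hi)
        if mask.any():
            rows.append((pred_probs[mask].mean(), outcomes[mask].mean(), int(mask.sum())))
    return rows  # (mean predicted, empirical rate, n) per non-empty bin
```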

Testing on 44 T20 World Cup matches.

# match_evaluator.py - Evaluation metrics

import numpy as np

def evaluate_match(model_prob, market_prob, actual_winner):
    # Log loss: negative log probability assigned to the actual winner
    log_loss = -np.log(model_prob[actual_winner])

    # Brier Score
    brier = sum((model_prob[team] - (team == actual_winner))**2
                for team in model_prob)

    # Edge vs Market
    edge = {team: model_prob[team] - market_prob[team]
            for team in model_prob}

    return {
        'log_loss': log_loss,
        'brier_score': brier,
        'edge': edge,
    }

Ready to See It in Action?

Try the live prediction demo to see the model make real-time predictions.

Try Live Demo