For Data Scientists & Engineers

Technical Deep Dive

Complete technical documentation of the CricML prediction system, including architecture, code patterns, and engineering decisions.

1 System Architecture

The system is organized into three main phases: training, simulation, and evaluation. Each phase is designed to be modular and reproducible.

End-to-End Pipeline

Training Phase
  • Input: raw JSON data (8,341 matches)
  • Feature engineering (parsing_v2.py): 63 features per ball, temporal stats lookup
  • 7.6GB chunked cache: 69 chunks, LRU eviction
  • 4M+ training records in Parquet format
  • Model training (xgboost_v2.py): Optuna, 50 trials, class weights
  • Output: trained XGBoost model (128MB + encoders)

Simulation Phase
  • Input: test match + lineups
  • Monte Carlo engine (sim_v1_2.py): 1,000 simulations, ball-by-ball (predict → sample → update)
  • Output: win probabilities + score distributions

Evaluation Phase
  • Inputs: model predictions + betting market odds
  • Metrics calculator (match_evaluator.py): 44 World Cup matches
  • Results: log loss, Brier score, edge, calibration analysis
Key Files
  • scripts/parsing_v2.py
  • scripts/xgboost_v2.py
  • scripts/sim_v1_2.py
  • scripts/stats_provider.py
  • scripts/sim_eval/match_evaluator.py
Technology Stack
  • Python 3.11+
  • XGBoost 2.0+ (gradient boosting)
  • Pandas / NumPy (data processing)
  • Optuna (hyperparameter optimization)
  • Parquet (columnar storage)

2 Data Engineering

The most critical engineering challenge is maintaining temporal integrity: ensuring we never use future information when making predictions. This is achieved through a chunked stats cache system.

Chunked Stats Cache

The complete stats cache is 7.6GB, too large to fit in memory, so we split it into 69 chunks with LRU (Least Recently Used) eviction.

# stats_provider.py - Temporal stats lookup

from collections import OrderedDict

class StatsProvider:
    def __init__(self, cache_dir, max_chunks=5):
        self.cache_dir = cache_dir
        self.max_chunks = max_chunks
        self.loaded_chunks = OrderedDict()  # LRU cache
        self.metadata = self._load_metadata()

    def get_stats(self, player_id, match_date):
        """Get player stats as of match_date (temporal lookup)"""

        # Binary search for most recent snapshot <= match_date
        chunk_idx = self._find_chunk(match_date)

        # Load chunk if not in memory (with LRU eviction)
        if chunk_idx not in self.loaded_chunks:
            self._load_chunk(chunk_idx)

        # Return stats snapshot
        return self.loaded_chunks[chunk_idx].get(player_id, DEFAULT_STATS)
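The _load_chunk path is elided above, but the eviction policy itself is only a few lines of OrderedDict bookkeeping. A minimal sketch of the LRU behaviour, with load_fn standing in for the real on-disk chunk reader (an assumption, not the actual loader):

```python
from collections import OrderedDict

class ChunkCache:
    """Minimal sketch of the LRU chunk loading used by StatsProvider.
    load_fn stands in for reading a chunk file from disk."""

    def __init__(self, load_fn, max_chunks=5):
        self.load_fn = load_fn
        self.max_chunks = max_chunks
        self.chunks = OrderedDict()  # insertion order doubles as recency order

    def get(self, chunk_idx):
        if chunk_idx in self.chunks:
            # Mark as most recently used
            self.chunks.move_to_end(chunk_idx)
            return self.chunks[chunk_idx]
        # Evict the least recently used chunk if at capacity
        if len(self.chunks) >= self.max_chunks:
            self.chunks.popitem(last=False)
        self.chunks[chunk_idx] = self.load_fn(chunk_idx)
        return self.chunks[chunk_idx]
```

With max_chunks=5 and roughly 110MB per chunk, this is what keeps the resident footprint near 550MB despite the 7.6GB total.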
  • 7.6GB total size
  • 69 chunks
  • 550MB memory footprint
  • 95%+ cache hit rate

Data Sources

Cricsheet

Ball-by-ball match data in JSON format

  • 8,341 T20 matches
  • 7,223 unique players
  • Full delivery records
Player Metadata

Enriched player information

  • Batting hand (L/R)
  • Bowling arm & style
  • Date of birth

3 Feature Engineering

Each ball is described by 63 features capturing match state, player performance, momentum, and tactical matchups. The player- and matchup-specific features are organized into a three-tier structure.

Feature Tiers

Tier 1: Player Metadata (8 features)

Static player attributes: batting hand, bowling arm, bowling style, age at match date

Tier 2: Matchup Features (5 features)

Tactical interactions: spin matchup advantage, same arm matchup, matchup type encoding

Tier 3: Type-Based Historical (8 features)

Performance by type: batter avg/SR vs pace/spin, bowler metrics vs left/right-handers
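A Tier 2 matchup feature such as spin advantage can be sketched as below. The key names and the set of spin styles are illustrative; the real cache schema in parsing_v2.py may differ.

```python
SPIN_STYLES = {"legbreak", "offbreak", "slow left-arm orthodox", "left-arm wrist spin"}

def get_spin_advantage(batter_stats, bowler_stats):
    """Illustrative matchup feature: how much better (or worse) the
    batter strikes against this bowler's type than overall."""
    if bowler_stats.get("bowling_style") in SPIN_STYLES:
        type_sr = batter_stats.get("sr_vs_spin", batter_stats["strike_rate"])
    else:
        type_sr = batter_stats.get("sr_vs_pace", batter_stats["strike_rate"])
    # Positive: batter is stronger than usual against this bowling type
    return type_sr - batter_stats["strike_rate"]
```

Falling back to the overall strike rate when a type-split is missing keeps the feature defined for players with thin histories.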

Feature Extraction Code

# parsing_v2.py - Feature extraction for each ball

def extract_ball_features(ball, match_state, stats_provider, match_date):
    features = {}

    # Match State Features
    features['score'] = match_state.runs
    features['wickets'] = match_state.wickets
    features['balls_bowled'] = match_state.balls
    features['run_rate'] = match_state.runs / max(match_state.balls / 6, 1e-9)  # guard against 0 balls
    features['is_powerplay'] = 1 if match_state.balls < 36 else 0

    # Player Stats (temporal lookup - CRITICAL)
    batter_stats = stats_provider.get_stats(ball.batter_id, match_date)
    bowler_stats = stats_provider.get_stats(ball.bowler_id, match_date)

    features['batter_avg'] = batter_stats['batting_avg']
    features['batter_sr'] = batter_stats['strike_rate']
    features['bowler_economy'] = bowler_stats['economy']

    # Momentum Features
    features['last_5_runs'] = sum(match_state.recent_balls[-5:])
    features['balls_since_boundary'] = match_state.balls_since_boundary
    features['pressure_index'] = calculate_pressure(match_state)

    # Matchup Features
    features['spin_matchup'] = get_spin_advantage(batter_stats, bowler_stats)

    return features
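The calculate_pressure helper referenced above is not shown; a hedged sketch of one plausible formulation, combining wickets lost with the required-rate deficit in a chase (the exact weighting in parsing_v2.py may differ):

```python
def calculate_pressure(match_state):
    """Illustrative pressure index: wickets lost plus, in a chase,
    how far the required rate has climbed above the current rate."""
    balls_left = 120 - match_state.balls
    wicket_pressure = match_state.wickets / 10
    # First innings (no target) or innings over: wickets only
    if getattr(match_state, "target", None) is None or balls_left <= 0:
        return wicket_pressure
    required_rate = (match_state.target - match_state.runs) / (balls_left / 6)
    current_rate = match_state.runs / max(match_state.balls / 6, 1e-9)
    rate_pressure = max(required_rate - current_rate, 0) / 6  # rough normalisation
    return wicket_pressure + rate_pressure
```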

Complete Feature List

Category      Count  Key Features
Match State   12     score, wickets, balls_bowled, run_rate, is_powerplay, is_death
Player Stats  20     batter_avg, batter_sr, bowler_economy, h2h_avg, recent_form
Momentum      10     last_5_runs, balls_since_boundary, partnership, pressure_index
Chase         4      target, required_rate, lead_gap, venue_avg_score
Matchups      5      spin_matchup, same_arm, matchup_type, batter_hand
Context       12     venue_encoded, batting_first, toss_winner, balls_in_over
Total         63

4 Model Training

The core model is an XGBoost multi-class classifier predicting 6 outcomes: dot ball (0), single (1), double (2), boundary (4), six (6), and wicket.
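Because the simulation engine (section 5) indexes predict_proba columns positionally, the six outcomes need a fixed class order. A minimal sketch of the encoding, where the ordering is an assumption consistent with sample_outcome below and rare outcomes (threes, extras) are presumed bucketed before this step:

```python
# Fixed outcome ordering assumed throughout training and simulation;
# wickets are encoded as 'W'.
OUTCOME_CLASSES = ['0', '1', '2', '4', '6', 'W']
CLASS_TO_INDEX = {c: i for i, c in enumerate(OUTCOME_CLASSES)}

def encode_outcome(runs, is_wicket):
    """Map a raw delivery to its class index (illustrative)."""
    if is_wicket:
        return CLASS_TO_INDEX['W']
    return CLASS_TO_INDEX[str(runs)]
```

Persisting this mapping alongside the model (the "+ Encoders" in the pipeline output) guarantees the probability columns mean the same thing at simulation time.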

Hyperparameter Optimization

Optuna was used for Bayesian optimization across 50 trials with 7 hyperparameters.

# xgboost_v2.py - Optuna hyperparameter search

from sklearn.metrics import log_loss
from xgboost import XGBClassifier

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 1),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 1),
    }

    model = XGBClassifier(**params, objective='multi:softprob')
    model.fit(X_train, y_train, sample_weight=class_weights)

    return log_loss(y_val, model.predict_proba(X_val))

Final Hyperparameters

  • n_estimators: 444
  • max_depth: 10
  • learning_rate: 0.240
  • subsample: 0.878
  • colsample_bytree: 0.742
  • reg_alpha: 0.850
  • reg_lambda: 0.182
  • Accuracy: 55-60%

Class Imbalance Handling

Ball outcomes are heavily imbalanced. Dot balls and singles dominate, while sixes and wickets are rare.

  • Dot: ~35%
  • Single: ~40%
  • Double: ~5%
  • Four: ~12%
  • Six: ~5%
  • Wicket: ~3%
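The class weights passed to model.fit can be derived from these frequencies. A minimal inverse-frequency sketch; the exact scheme in xgboost_v2.py is not shown, so this is illustrative:

```python
import numpy as np

def inverse_frequency_weights(y):
    """Per-sample weights inversely proportional to class frequency,
    so rare outcomes (sixes, wickets) are not drowned out by dots
    and singles. Illustrative; xgboost_v2.py may weight differently."""
    classes, counts = np.unique(y, return_counts=True)
    freq = counts / counts.sum()
    class_weight = {c: 1.0 / f for c, f in zip(classes, freq)}
    # Normalise so the average weight is 1 and total loss scale is unchanged
    mean_w = np.mean([class_weight[c] for c in y])
    return np.array([class_weight[c] / mean_w for c in y])
```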

5 Simulation Engine

The simulation engine runs 1,000 Monte Carlo simulations of each match. Each simulation plays out ball-by-ball, updating match state and sampling outcomes from the model's probability distribution.

# sim_v1_2.py - Core simulation logic (simplified)

import numpy as np

class SimulationEngine:
    def simulate_match(self, match_info, n_sims=1000):
        results = []

        for _ in range(n_sims):
            state = MatchState(match_info)

            # First innings
            while not state.innings_complete():
                features = self.extract_features(state)
                probs = self.model.predict_proba(features)
                outcome = self.sample_outcome(probs)
                state.update(outcome)

            state.start_second_innings()

            # Second innings (chase)
            while not state.match_complete():
                features = self.extract_features(state)
                probs = self.model.predict_proba(features)
                outcome = self.sample_outcome(probs)
                state.update(outcome)

            results.append({
                'winner': state.winner,
                'team1_score': state.team1_runs,
                'team2_score': state.team2_runs,
            })

        return self.aggregate_results(results)

    def sample_outcome(self, probs):
        """Sample from the model's probability distribution"""
        outcomes = [0, 1, 2, 4, 6, 'W']
        # Sample an index rather than the values: np.random.choice would
        # coerce the mixed int/str list to an array of strings.
        idx = np.random.choice(len(outcomes), p=probs)
        return outcomes[idx]
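The aggregate_results step is not shown above. A minimal version that collapses the 1,000 per-simulation records into win probabilities and score distributions, assuming the record keys produced by simulate_match:

```python
import numpy as np

def aggregate_results(results):
    """Collapse per-simulation records into win probabilities and
    first/second-innings score summaries (illustrative)."""
    n = len(results)
    winners = [r['winner'] for r in results]
    team1_scores = np.array([r['team1_score'] for r in results])
    team2_scores = np.array([r['team2_score'] for r in results])

    def summarize(scores):
        # Mean plus a 90% interval from the empirical distribution
        return {'mean': scores.mean(),
                'p5': np.percentile(scores, 5),
                'p95': np.percentile(scores, 95)}

    return {
        'win_prob': {team: winners.count(team) / n for team in set(winners)},
        'team1_score': summarize(team1_scores),
        'team2_score': summarize(team2_scores),
    }
```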

Match State Management

The MatchState class tracks everything needed to simulate a match:

Score State
  • Current score
  • Wickets fallen
  • Balls bowled
  • Target (2nd innings)
Player State
  • Current batter
  • Non-striker
  • Current bowler
  • Batting order queue
History
  • Recent ball outcomes
  • Partnership runs
  • Bowler spells
  • Phase transitions

6 Evaluation Framework

The model is evaluated against real betting market odds from 44 T20 World Cup 2024 matches. This provides a rigorous benchmark since betting markets aggregate information from many sophisticated predictors.
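Comparing against the market requires turning quoted odds into probabilities and removing the bookmaker's margin (overround). A standard normalisation sketch; the evaluator's actual odds handling is not shown, so this is illustrative:

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to implied win probabilities, removing
    the bookmaker's margin by normalising the raw inverses to 1."""
    raw = {team: 1.0 / o for team, o in decimal_odds.items()}
    total = sum(raw.values())  # > 1 because of the bookmaker's margin
    return {team: p / total for team, p in raw.items()}
```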

Metrics

Log Loss: 0.73
Measures probability calibration. Formula: -log(P(actual winner))

Brier Score: 0.26
Mean squared error of probabilities. Formula: (P - actual)^2

Market Edge: 29%
Average difference from betting market probabilities

Probability Calibration

A well-calibrated model predicts probabilities that match observed frequencies: if the model assigns a 60% win probability across 100 matches, those teams should win roughly 60 of them. The Brier score (0.26) summarizes this reliability, together with sharpness, in a single number.
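Calibration can be checked directly by binning predictions and comparing each bin's mean predicted probability with the empirical win rate. A minimal sketch (the evaluator's actual binning is not shown; on the web page this is rendered as an interactive chart):

```python
import numpy as np

def calibration_table(pred_probs, outcomes, n_bins=5):
    """Bin predicted win probabilities and compare each bin's mean
    prediction with the empirical win rate. Perfect calibration makes
    the two columns equal."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
        # Last bin is closed on the right so a prediction of exactly 1.0 is counted
        if i == n_bins - 1:
            mask = (pred_probs >= lo) & (pred_probs <= hi)
        else:
            mask = (pred_probs >= lo) & (pred_probs < hi)
        if mask.any():
            rows.append((pred_probs[mask].mean(), outcomes[mask].mean(), int(mask.sum())))
    return rows  # (mean predicted, empirical rate, n) per non-empty bin
```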

Testing on 44 T20 World Cup matches.

# match_evaluator.py - Evaluation metrics

import numpy as np

def evaluate_match(model_prob, market_prob, actual_winner):
    # Log loss: negative log probability assigned to the actual winner
    log_loss = -np.log(model_prob[actual_winner])

    # Brier Score
    brier = sum((model_prob[team] - (team == actual_winner))**2
                for team in model_prob)

    # Edge vs Market
    edge = {team: model_prob[team] - market_prob[team]
            for team in model_prob}

    return {
        'log_loss': log_loss,
        'brier_score': brier,
        'edge': edge,
    }

Ready to See It in Action?

Try the live prediction demo to see the model make real-time predictions.

Try Live Demo