Remark
Please be aware that these lecture notes are accessible online in an ‘early access’ format. They are actively being developed, and certain sections will be further enriched to provide a comprehensive understanding of the subject matter.
1.3. Time Series Cross-Validation#
1.3.1. Cross-Validation#
Cross-validation is a powerful statistical technique used in machine learning for model selection and performance evaluation. It is particularly useful for smaller datasets and helps ensure that models generalize well to unseen data. By systematically partitioning the data into subsets for training and validation, cross-validation provides a more robust assessment of model performance than simple train-test splits [James et al., 2023].
Purpose and Benefits
Model Selection:
Helps choose the best model or hyperparameters for a given task.
Allows comparison of different models or algorithms on the same data.
Provides a systematic way to tune hyperparameters for optimal performance [Refaeilzadeh et al., 2009].
Performance Estimation:
Provides a more reliable estimate of model performance on unseen data.
Reduces the impact of a lucky or unlucky data partition by averaging over multiple train-test splits.
Offers insights into the model’s stability and sensitivity to data variations [James et al., 2023].
Overfitting Prevention:
Reduces the risk of overfitting by using multiple subsets of data for training and validation.
Helps identify models that are too complex or too simple for the given data.
Encourages the development of more generalizable models [Refaeilzadeh et al., 2009].
1.3.2. Why Standard Cross-Validation Fails for Time Series#
Traditional k-fold cross-validation randomly partitions data into folds, which violates the temporal structure of time series data. This can lead to:
Data leakage: Using future information to predict past events
Unrealistic performance estimates: Models appear more accurate than they would be in practice
Violation of temporal dependencies: Breaking the sequential nature that drives time series patterns
Example: If we randomly split stock price data, we might train on 2023 data and test on 2021 data, which is impossible in real forecasting scenarios.
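The leak is easy to demonstrate in code. A minimal sketch, contrasting scikit-learn's shuffled KFold with TimeSeriesSplit on integer positions that stand in for chronological order:

```python
# A minimal sketch of the leak, using integer positions as a stand-in
# for chronological order.
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # 100 observations in time order

# Shuffled K-fold: training indices routinely fall AFTER test indices,
# i.e. the model is fit on the "future" of the fold it is tested on.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
train_idx, test_idx = next(kf.split(X))
print("KFold trains on the future:", train_idx.max() > test_idx.min())

# TimeSeriesSplit: every training index precedes every test index.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    assert train_idx.max() < test_idx.min()
print("TimeSeriesSplit preserves temporal order")
```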
1.3.3. Time Series Cross-Validation#
Time Series Cross-Validation is a specialized technique designed for evaluating models on time-dependent data. It preserves the chronological order of observations, which is essential in time series analysis. Its key elements are:
Forward-Moving Window Approach:
Uses a window that moves forward in time for training and testing.
In the default (expanding) variant, the training set grows with each iteration while the test set keeps a fixed size.
This approach mimics real-world forecasting scenarios where we use past data to predict future values.
Maintaining Temporal Order:
Preserves the time-based dependencies in the data.
Ensures that future information is not used to predict past events, avoiding data leakage.
Multiple Train-Test Splits:
Creates several train-test splits, each representing a different point in time.
Allows for assessing model performance across various temporal segments of the data.
Forecasting Performance Assessment:
Evaluates how well the model generalizes to future, unseen data.
Particularly useful for assessing the stability of model performance over time.
Adaptability to Changing Patterns:
Helps in understanding how model performance might change as new data becomes available.
Useful for detecting concept drift in time series.
Fixed Window (Rolling Origin):
Training window size remains constant as it moves forward
Useful when recent patterns are more relevant than distant history
Example: Always use exactly 252 trading days (1 year) for training
Expanding Window (Used in our example):
Training window grows with each split
Incorporates all available historical information
Better for capturing long-term trends and structural changes
Blocked Cross-Validation:
Introduces gaps between training and test sets
Accounts for temporal correlation that might persist beyond immediate neighbors
Useful when autocorrelation extends over multiple periods
Example using NVIDIA (NVDA) stock price data:
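The rest of this section illustrates these methods on that dataset. A minimal loading sketch, assuming the yfinance package as the data source (the notes' actual loading code may differ):

```python
# A minimal data-loading sketch, assuming the yfinance package as the
# data source (the notes' actual loading code may differ).
import yfinance as yf

# Daily NVDA prices over the period used in the figures below.
nvda = yf.download("NVDA", start="2020-11-30", end="2025-11-29",
                   auto_adjust=True)
# squeeze() handles yfinance versions that return one-column frames.
prices = nvda["Close"].squeeze().dropna()
print(prices.head())
```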
1.3.4. Types of Time Series Cross-Validation#
The three main approaches to time series cross-validation differ fundamentally in how they handle training data over time. The timeline visualization below illustrates the key distinctions between expanding window, rolling window, and blocked cross-validation methods:
Fig. 1.2 Timeline comparison of three time series cross-validation methods. The visualization shows how each method handles training data (green), test data (blue), and gaps (orange) across multiple splits over time. Expanding window grows training sets progressively, rolling window maintains fixed training sizes, and blocked CV introduces purged gaps to prevent data leakage.#
Method Comparison Overview:
Expanding Window Cross-Validation (Top section):
Training sets grow progressively with each split, accumulating all historical data
Maximizes available information for model training
Best for capturing long-term trends and when all historical context is valuable
Each split builds upon all previous knowledge, creating increasingly comprehensive training sets
Rolling Window Cross-Validation (Middle section):
Fixed training window size that slides forward through time
Maintains consistent training periods (e.g., always 200 days)
Emphasizes recent patterns while discarding older information
Ideal for non-stationary series where recent data is more predictive than distant history
Blocked Cross-Validation (Bottom section):
Introduces purged gaps (orange) between training and test periods
Prevents data leakage from autocorrelation and temporal dependencies
Training sets expand but stop before test periods with buffer zones
Essential for financial data and high-frequency time series with strong serial correlation
Key Temporal Insights:
The timeline clearly shows how each method treats the temporal boundary between training and testing:
Expanding: Direct transition from training to testing (no gaps)
Rolling: Direct transition but with limited historical memory
Blocked: Intentional gaps to break temporal correlations
Selection Criteria:
Choose Expanding Window for trend-following models and when historical patterns remain relevant
Select Rolling Window for adaptive models in changing environments with structural breaks
Use Blocked CV when autocorrelation is present and realistic performance estimates are critical
Each method serves different analytical needs and data characteristics, making the choice dependent on your specific time series properties and forecasting objectives.
1.3.4.1. Expanding Window#
The Expanding Window approach (also known as “growing window” or “anchored walk-forward”) is the default method in scikit-learn’s TimeSeriesSplit. This technique progressively increases the training window size while maintaining consistent test set sizes, making it particularly valuable for capturing evolving patterns in time series data.
Key Characteristics
Progressive Data Accumulation: Each subsequent split incorporates all previous training data plus additional historical observations, creating an ever-expanding knowledge base for the model. This cumulative approach ensures that valuable historical patterns and long-term trends are preserved across all validation splits.
Mimicking Real-World Scenarios: In practice, analysts and traders typically have access to all historical data when making forecasts. The expanding window method closely replicates this reality, where decision-makers use the entirety of available historical information rather than discarding older data.
Long-term Pattern Recognition: By retaining all historical data, this method is particularly effective at capturing:
Structural changes that occur gradually over time
Long-term cycles and secular trends
Regime changes in market conditions or business environments
Seasonal patterns that may evolve or strengthen over time
The NVIDIA example below demonstrates these concepts in action, showing how each split builds upon previous knowledge while testing the model’s ability to generalize to future periods.
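A sketch that reproduces the split report below with scikit-learn's TimeSeriesSplit, whose default behaviour is the expanding window; it assumes the `prices` Series loaded earlier:

```python
# Expanding-window splits with scikit-learn's TimeSeriesSplit
# (expanding is its default behaviour). Assumes the `prices` Series
# of daily NVDA closes loaded earlier.
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for i, (train_idx, test_idx) in enumerate(tscv.split(prices), start=1):
    print(f"Split {i}:")
    print(f"Train set: From {prices.index[train_idx[0]].date()} "
          f"To {prices.index[train_idx[-1]].date()}")
    print(f"Test set: From {prices.index[test_idx[0]].date()} "
          f"To {prices.index[test_idx[-1]].date()}")
```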
Split 1:
Train set: From 2020-11-30 To 2021-09-30
Test set: From 2021-10-01 To 2022-08-01
Split 2:
Train set: From 2020-11-30 To 2022-08-01
Test set: From 2022-08-02 To 2023-05-31
Split 3:
Train set: From 2020-11-30 To 2023-05-31
Test set: From 2023-06-01 To 2024-04-01
Split 4:
Train set: From 2020-11-30 To 2024-04-01
Test set: From 2024-04-02 To 2025-01-30
Split 5:
Train set: From 2020-11-30 To 2025-01-30
Test set: From 2025-01-31 To 2025-11-28
Fig. 1.3 Time Series Cross-Validation splits for NVIDIA stock price data (2020-2025). The visualization shows five sequential splits with expanding training windows (green) and fixed-size test sets (blue). Each split demonstrates the forward-moving nature of the validation process, with the gray line representing the complete dataset. The stock shows significant growth, particularly in 2023-2025, with split-adjusted prices climbing from the low teens to around $150.#
Split Structure
Each split shows:
Training data (green shaded area)
Test data (blue shaded area)
Full dataset line (gray)
The training window expands with each subsequent split
Test set size remains consistent across all splits
Progressive Nature
Split 1: Uses minimal training data with early test period
Split 2: Expands training data forward
Split 3: Further expansion of training window
Split 4: Includes significant price increase period
Split 5: Uses maximum training data, testing on most recent period
1.3.4.2. Rolling/Sliding Window Cross-Validation#
Definition: Rolling Window Cross-Validation (also known as Fixed Window or Sliding Window) maintains a constant training window size that slides forward with each split, emphasizing recent data patterns over historical information.
Key Characteristics:
Training window size remains fixed as it moves forward
Useful when recent patterns are more relevant than distant history
Also called “Rolling Origin” validation
Preferable when older history may not be relevant due to structural changes
When to Use:
Large datasets with sufficient observations
When structural changes occur over time (e.g., market regime shifts)
When you want to simulate real-world scenarios where only recent data is available
Example: always training on exactly the most recent 252 trading days (one year), as in the sketch below
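scikit-learn supports this directly through TimeSeriesSplit's max_train_size argument, which caps (and thus slides) the training window. A minimal sketch, using 200-day windows and 50-day tests to match Fig. 1.4 and assuming the `prices` Series loaded earlier:

```python
# A minimal rolling-window sketch: max_train_size caps the training
# window so it slides forward instead of expanding. The 200-day windows
# and 50-day tests match Fig. 1.4, which applies this to a 2020-2022
# subset of the data. Assumes the `prices` Series loaded earlier.
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=4, max_train_size=200, test_size=50)
for train_idx, test_idx in tscv.split(prices):
    # Each training window keeps only the 200 most recent observations.
    print(f"train: {prices.index[train_idx[0]].date()} -> "
          f"{prices.index[train_idx[-1]].date()} ({len(train_idx)} days), "
          f"test: {len(test_idx)} days")
```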
Fig. 1.4 Rolling Window Cross-Validation for NVIDIA stock price data (2020-2022). The visualization demonstrates four sequential splits with fixed-size training windows (green) that slide forward through time, maintaining consistent 200-day training periods. Unlike expanding windows, each training set discards older data as it moves forward, focusing on the most recent 200 days of market behavior. Test sets (blue) remain consistent at 50 days each. This approach is particularly effective for capturing evolving market dynamics during NVIDIA’s transition period from gaming-focused to AI-infrastructure company.#
Split Structure Comparison with Expanding Window:
Training Data: Green shaded areas maintain constant size (200 days each), sliding forward through time
Data Focus: Each split emphasizes recent patterns by discarding older historical data
Temporal Progression: Windows move forward systematically, maintaining the same “memory length”
Key Observations from Rolling Window Method:
Rolling Window Advantages Demonstrated:
Adaptability: Each model focuses on the most recent 200-day pattern, quickly adapting to new market conditions
Structural Change Detection: Better captures the evolving nature of NVIDIA’s business model transformation
Reduced Historical Bias: Avoids influence from outdated market patterns (e.g., pre-AI era pricing)
Consistent Memory: All models use identical training periods, enabling fair performance comparisons
Market Context Analysis:
Early Splits: Focus on gaming and cryptocurrency mining demand cycles
Later Splits: Begin incorporating early AI/data center growth signals
Pattern Evolution: Shows how rolling windows capture the company’s strategic pivot more dynamically than expanding windows
Advantages:
Adapts to new data trends through the rolling mechanism
Provides more robust forecasts in dynamic environments
Prevents overfitting through consistent window size
Better for non-stationary time series with structural breaks
Disadvantages:
May discard valuable historical information
Requires sufficient data for meaningful window sizes
Less suitable for capturing long-term patterns
1.3.4.3. Blocked Cross-Validation#
Definition: Blocked Cross-Validation introduces gaps (blocking periods) between training and test sets to reduce temporal correlation and minimize data leakage due to autocorrelation.
Key Concepts:
Purging: Eliminates observations from the training set that have labels overlapping in time with the test set
Embargo: Adds an additional buffer period after the test set to eliminate serial correlation between consecutive periods
Why It’s Needed:
Time series data exhibits autocorrelation: observations close in time are correlated
Even without direct data leakage, temporal dependencies can leak future information into training
Traditional CV methods can underestimate prediction errors due to these dependencies
Blocked/Purged Cross-Validation:
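A sketch of producing such splits, assuming scikit-learn's TimeSeriesSplit with its gap argument, which drops the given number of observations between each training window and its test set (purging). Note that TimeSeriesSplit has no embargo after the test set; that would need custom logic.

```python
# Blocked (purged) splits via TimeSeriesSplit's `gap` argument: `gap`
# observations between each training window and its test set are
# dropped. No embargo after the test set is applied here. Assumes the
# `prices` Series loaded earlier.
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=4, gap=25, test_size=251)
for i, (train_idx, test_idx) in enumerate(tscv.split(prices), start=1):
    purged = test_idx[0] - train_idx[-1] - 1  # observations dropped
    print(f"Split {i}: train {len(train_idx)} days, "
          f"test {len(test_idx)} days, gap {purged} days (purged)")
```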
Split 1:
Train set: From 2020-11-30 To 2021-10-21 (226 days)
Test set: From 2021-11-29 To 2022-11-25 (251 days)
Gap: 25 days (purged)
Split 2:
Train set: From 2020-11-30 To 2022-10-20 (477 days)
Test set: From 2022-11-28 To 2023-11-27 (251 days)
Gap: 25 days (purged)
Split 3:
Train set: From 2020-11-30 To 2023-10-20 (728 days)
Test set: From 2023-11-28 To 2024-11-25 (251 days)
Gap: 25 days (purged)
Split 4:
Train set: From 2020-11-30 To 2024-10-21 (979 days)
Test set: From 2024-11-26 To 2025-11-26 (251 days)
Gap: 25 days (purged)
Fig. 1.5 Blocked (Purged) Cross-Validation for NVIDIA stock price data (2020-2025). This visualization demonstrates the critical concept of temporal gaps in time series validation. Training sets (green) grow over time but are separated from test sets (blue) by purged gaps (red) that prevent data leakage from autocorrelation. The 25-day gaps ensure that temporal dependencies between consecutive observations don’t artificially inflate model performance estimates. This method provides more realistic performance estimates for financial time series where price movements exhibit strong serial correlation.#
Critical Features of Blocked Cross-Validation Visualization:
Purged Gaps (Red Areas):
25-day buffer zones between training and test sets
Explicitly removed from both training and testing to break autocorrelation chains
Prevent information leakage from temporally correlated observations
Size determined by autocorrelation analysis (2% of total data in this example)
Expanding Training with Gaps:
Training sets (green) still grow over time like expanding window CV
But critically stop before test periods to maintain temporal independence
Each split incorporates more historical knowledge while respecting temporal boundaries
Realistic Performance Assessment:
Split 1: Tests model on 2021-2022 data using only pre-gap 2020-2021 training
Split 2: Tests on 2022-2023 period with expanded training up to mid-2022 gap
Split 3: Evaluates 2023-2024 AI boom period with comprehensive pre-2023 training
Split 4: Tests most recent period (2024-2025) with maximum historical context
Applications:
Financial Time Series: Where price movements exhibit strong autocorrelation
High-frequency Data: Where serial correlation extends over multiple periods
Label Overlap: When features are constructed using forward-looking windows
Regulatory Compliance: When avoiding look-ahead bias is critical
Benefits:
More accurate error estimates than standard CV
Reduces data leakage from temporal dependencies
Better model selection in presence of autocorrelation
Provides more conservative (realistic) performance estimates
Practical Considerations:
Gap Size: Should be based on autocorrelation function analysis (see the sketch after this list)
Embargo Period: Typically 5-10% of total sample size
Trade-off: Larger gaps provide better independence but reduce training data
Computational Cost: More complex than standard CV but essential for time series
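The gap-size guidance above can be made concrete. A hypothetical sketch, assuming statsmodels and one common heuristic (the first lag at which the absolute autocorrelation of returns drops below 0.05); the function name `suggest_gap` and the threshold are illustrative choices, not the notes' own procedure:

```python
# A hypothetical sketch of choosing the gap size from the sample
# autocorrelation function, assuming statsmodels. The heuristic and
# the name `suggest_gap` are illustrative, not the notes' procedure.
import numpy as np
from statsmodels.tsa.stattools import acf

def suggest_gap(series, max_lag=60, threshold=0.05):
    # Use log returns: raw prices are so persistent that their ACF
    # decays too slowly to give a useful gap size.
    returns = np.diff(np.log(series))
    autocorr = acf(returns, nlags=max_lag)
    for lag in range(1, max_lag + 1):
        # First lag (skipping lag 0) with negligible autocorrelation.
        if abs(autocorr[lag]) < threshold:
            return lag
    return max_lag

# e.g. gap = suggest_gap(prices.to_numpy())
```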
1.3.5. Practical Considerations for Time Series CV#
Choosing the Number of Splits:
More splits provide better performance estimates but increase computational cost
Consider the trade-off between statistical reliability and practical constraints
Typically 3-10 splits depending on dataset size and computational resources
Test Set Size:
Should reflect the actual forecasting horizon of interest
For daily data: roughly 30-90 days for monthly-to-quarterly horizons, and about 252 trading days (one year) for annual forecasting
Balance between having enough test data for reliable evaluation and sufficient training data
Temporal Alignment:
Ensure splits align with natural time boundaries (e.g., month/quarter ends for business data)
Consider seasonal patterns when determining split points
Account for calendar effects (holidays, weekends) in financial data
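To tie these considerations together, a minimal end-to-end sketch of scoring a model with time-series-aware CV; the lag features, the Ridge model, and the 63-day test horizon are illustrative assumptions, not the notes' own pipeline, and the `prices` Series is reused from earlier:

```python
# A minimal end-to-end sketch: scoring a simple model with
# time-series-aware CV. The lag features, Ridge model, and 63-day
# test horizon are illustrative assumptions, not the notes' pipeline.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Lagged log-return features built from the `prices` Series loaded earlier.
returns = np.log(prices).diff().dropna()
frame = pd.DataFrame({f"lag_{k}": returns.shift(k) for k in range(1, 6)})
frame["target"] = returns
frame = frame.dropna()
X, y = frame.drop(columns="target"), frame["target"]

# 5 splits, a roughly quarterly test horizon, and a small purge gap.
tscv = TimeSeriesSplit(n_splits=5, test_size=63, gap=5)
scores = cross_val_score(Ridge(), X, y, cv=tscv,
                         scoring="neg_mean_squared_error")
print("MSE per split:", -scores)
```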