Remark

Please be aware that these lecture notes are accessible online in an ‘early access’ format. They are actively being developed, and certain sections will be further enriched to provide a comprehensive understanding of the subject matter.

1.6. Cross-Correlation in Time Series

In this section, we explore Cross-Correlation, a metric used to measure the similarity between two time series as a function of the time lag applied to one of them. This is a fundamental tool in geophysical and environmental data science for identifying “lead-lag” relationships—determining if one signal (e.g., temperature in your previous home of Calgary) has a predictive relationship with another (e.g., temperature in your new home of Columbia).

1.6.1. Mathematical Background

Cross-correlation measures the similarity between two sets of data when one is shifted in time relative to the other. It tells us whether a change in one signal precedes or follows a change in another, and it quantifies the strength of that relationship. Beyond geophysics and environmental data science, the same idea is used in domains such as finance.

1.6.1.1. Time Domain Definition

Mathematically, the raw cross-correlation \(R_{xy}(\tau)\) of two sequences \(x_t\) and \(y_t\) at lag \(\tau\) is given by the sliding dot product:

(1.2) \[R_{xy}(\tau) = \sum_{t} x_t \, y_{t + \tau}\]
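As a quick illustration of the sliding dot product in (1.2), the following sketch evaluates the sum directly for two short, equal-length sequences. The function name `raw_cross_correlation` and the toy arrays are illustrative assumptions, not part of the lecture code.

```python
import numpy as np

def raw_cross_correlation(x, y, tau):
    """Raw cross-correlation R_xy(tau) = sum_t x[t] * y[t + tau] over the overlap.

    Assumes x and y have the same length n and that |tau| < n.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    if tau >= 0:
        # Pair x[0 .. n-tau-1] with y[tau .. n-1]
        return float(np.dot(x[: n - tau], y[tau:]))
    # Negative lag: pair x[-tau .. n-1] with y[0 .. n+tau-1]
    return float(np.dot(x[-tau:], y[: n + tau]))

x_demo = np.array([1.0, 2.0, 3.0, 0.0])
y_demo = np.array([0.0, 1.0, 2.0, 3.0])  # x_demo delayed by one step

raw_by_lag = {tau: raw_cross_correlation(x_demo, y_demo, tau) for tau in range(-3, 4)}
print(raw_by_lag)
```

Because `y_demo` is `x_demo` delayed by one step, the largest value (14) occurs at \(\tau = 1\).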

However, to make the metric comparable across different datasets, we typically use the Normalized Cross-Correlation Function. According to [Brockwell and Davis, 2016], the sample cross-correlation \(\hat{R}_{xy}(\tau)\) is defined as:

(1.3) \[\hat{R}_{xy}(\tau) = \dfrac{1}{N} \sum_{t} \dfrac{(x_t - \mu_x)}{\sigma_x} \, \dfrac{(y_{t+\tau} - \mu_y)}{\sigma_y}\]

Where:

  • \(\mu_x, \mu_y\): Means of the series.

  • \(\sigma_x, \sigma_y\): Standard deviations of the series.

  • \(\tau\): The lag (shift) in time steps.

  • \(N\): The number of observations.

Interpretation: The resulting value ranges from \(-1\) to \(+1\):

  • Peak at \(\tau = 0\): The series are synchronized.

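  • Peak at \(\tau < 0\): \(y_t\) leads \(x_t\) under definition (1.2), since \(y\) must be taken from an earlier time to match \(x\). Software conventions differ in argument order, so the sign should always be checked on a known example.

  • Peak at \(\tau > 0\): \(x_t\) leads \(y_t\); the pattern in \(x\) reappears in \(y\) about \(\tau\) steps later.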
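The normalized form in (1.3) and its sign convention can be checked on synthetic data. The sketch below is a direct (loop-based) implementation under the assumption of equal-length series; `normalized_ccf`, `x_sim`, and `y_sim` are hypothetical names introduced for illustration.

```python
import numpy as np

def normalized_ccf(x, y, max_lag):
    """Sample cross-correlation following (1.3): z-score both series, then
    average the lagged products with a fixed 1/N factor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    lags = np.arange(-max_lag, max_lag + 1)
    values = np.empty(len(lags))
    for i, tau in enumerate(lags):
        if tau >= 0:
            values[i] = np.dot(zx[: n - tau], zy[tau:]) / n
        else:
            values[i] = np.dot(zx[-tau:], zy[: n + tau]) / n
    return lags, values

rng = np.random.default_rng(0)
base = rng.standard_normal(503)
x_sim = base[3:]    # x_t
y_sim = base[:-3]   # y_{t+3} equals x_t, so x leads y by 3 steps

lags_sim, ccf_sim = normalized_ccf(x_sim, y_sim, max_lag=10)
peak_lag_sim = int(lags_sim[np.argmax(ccf_sim)])
print(peak_lag_sim, round(float(ccf_sim.max()), 3))
```

With \(y\) constructed as a delayed copy of \(x\), the peak lands at a positive lag, matching the reading of (1.2) in which \(x_t\) is paired with \(y_{t+\tau}\).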

1.6.1.2. Frequency Domain (Coherence)

While cross-correlation measures similarity in the time domain, Coherence measures the correlation between two signals as a function of frequency. It is analogous to a squared correlation coefficient for each frequency component.

The Magnitude Squared Coherence \(C_{xy}(f)\) is defined as:

(1.4) \[C_{xy}(f) = \dfrac{|P_{xy}(f)|^2}{P_{xx}(f) P_{yy}(f)}\]

Where:

  • \(P_{xy}(f)\): Cross-spectral density between \(x\) and \(y\).

  • \(P_{xx}(f), P_{yy}(f)\): Power spectral densities of \(x\) and \(y\) respectively.

Values of \(C_{xy}(f)\) near 1 indicate a strong linear relationship at that frequency, while values near 0 indicate no relationship.
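To make (1.4) concrete, the following sketch estimates coherence for two synthetic signals that share a single sinusoid but have independent noise. All signal parameters (length, amplitudes, the shared frequency, `nperseg=256`) are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(42)
n_samples = 4096
t = np.arange(n_samples)
shared_freq = 0.125  # cycles per sample; equals 32/256, so it sits on a Welch bin

shared = np.sin(2 * np.pi * shared_freq * t)
x_sig = shared + rng.standard_normal(n_samples)        # shared tone + own noise
y_sig = 0.8 * shared + rng.standard_normal(n_samples)  # same tone, independent noise

freqs, coh = signal.coherence(x_sig, y_sig, fs=1.0, nperseg=256)

idx_shared = int(np.argmin(np.abs(freqs - shared_freq)))
coh_at_shared = float(coh[idx_shared])
coh_typical = float(np.median(coh))  # dominated by noise-only frequencies
print(f"coherence at shared frequency: {coh_at_shared:.2f}")
print(f"median coherence elsewhere:    {coh_typical:.2f}")
```

The coherence should be close to 1 at the shared frequency and small at frequencies where only the independent noise contributes.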

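Cross-correlation can also be computed in the frequency domain: by the cross-correlation theorem, it equals the inverse Fourier transform of the product of one signal's transform with the complex conjugate of the other's. With the FFT this costs \(O(N \log N)\) operations instead of the \(O(N^2)\) of direct summation. This efficiency is why standard libraries (such as scipy.signal.correlate) often use FFT-based methods (method='fft') for larger arrays.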
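A small numerical check, sketched here on random data, suggests that the direct and FFT-based paths of `scipy.signal.correlate` agree to floating-point precision, so the choice of `method` is about speed rather than results. The array length of 2000 is an arbitrary assumption.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(1)
a = rng.standard_normal(2000)
b = rng.standard_normal(2000)

# Same cross-correlation, computed by direct summation and via the FFT
corr_direct = signal.correlate(a, b, mode="full", method="direct")
corr_fft = signal.correlate(a, b, mode="full", method="fft")

max_abs_diff = float(np.max(np.abs(corr_direct - corr_fft)))
print(f"output length: {len(corr_fft)}, max |difference|: {max_abs_diff:.2e}")
```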

1.6.2. Example: Columbia vs. Calgary Temperatures

To illustrate cross-correlation and coherence, we analyze daily mean temperatures from Columbia, Missouri (target, at 92°W) and Calgary, Alberta (source, at 114°W) during 2024. These cities are separated by over 2,000 km with distinct climates: Calgary is colder year-round due to higher latitude and continental exposure, while Columbia experiences milder winters and more humid summers. The data is sourced from https://open-meteo.com/, with daily mean temperatures measured in degrees Celsius.

The key scientific question is: Do weather systems that affect Calgary (west) influence Columbia (east) with a detectable time delay? North American weather systems typically propagate eastward, so we hypothesize that Calgary’s temperature anomalies (unusually warm or cold days) may “lead” Columbia’s by a few days. Cross-correlation will quantify this lead-lag timing, and coherence will tell us whether this relationship is strongest at the seasonal scale, the synoptic (3–7 day) scale, or both.

1.6.2.1. Visualizing and Comparing the Series

We begin by loading the two daily mean temperature series for 2024 and aligning them on the dates for which both cities have data.

import pandas as pd

# 1. Load the datasets
como_df = pd.read_csv('chapter1_data/daily_temp_como.csv', parse_dates=['date'], index_col='date')
calgary_df = pd.read_csv('chapter1_data/daily_temp_calgary.csv', parse_dates=['date'], index_col='date')

# 2. Filter for a complete overlapping period (Year 2024)
start_date = '2024-01-01'
end_date = '2024-12-31'

ts_target = como_df.loc[start_date:end_date, 'temperature_2m_mean (°C)']
ts_source = calgary_df.loc[start_date:end_date, 'temperature_2m_mean (°C)']

# 3. Align data (Inner Join)
# This ensures we only compare days where both cities have data
df_combined = pd.concat([ts_target, ts_source], axis=1, keys=['Columbia', 'Calgary']).dropna()

Before diving into mathematical metrics, it is essential to visually inspect the time series. Here, we compare the daily mean temperatures of Calgary, Canada, and Columbia, Missouri.

Although these two cities are geographically distinct—separated by over 2,000 km and different climate zones—they are both located in the Northern Hemisphere. Therefore, we expect them to share a strong seasonal component (the annual temperature cycle).

../_images/como_calgary_daily_mean_temp.png

Fig. 1.11 Daily mean temperature in Columbia, MO (solid, purple) and Calgary, AB (dashed, red) for 2024. The series share a clear annual seasonal cycle (warming in summer, cooling in winter), but differ in amplitude and baseline. Notably, many warm and cold spells appear to occur nearly simultaneously, though slight timing offsets may exist within short intervals. This visual similarity motivates the search for an optimal time lag.

Visual observations:

  • Both series follow a sinusoidal annual rhythm: winter minima (January, December) and summer maxima (July, August) occur at nearly the same calendar dates.

  • Baseline offset: Calgary’s curve swings from roughly −30°C to +20°C, while Columbia ranges from −20°C to +30°C. Both ranges span about 50°C, so on these rough numbers the curves differ mainly in baseline (Calgary runs about 10°C colder) rather than in the size of the annual swing.

  • Apparent synchrony: Many rapid warm-ups and cold snaps seem to happen together (e.g., both cities warm in April, cool in September), suggesting shared synoptic weather drivers.

This visual alignment is reassuring—it tells us the two locations are not independent—but it doesn’t yet answer: Is there a small time delay? And which timescale (seasonal vs. weekly) matters most?

1.6.2.2. Part 1: Time‑Domain Analysis – When Does One Series Lead the Other?

To answer “When?”, we compute the normalized cross-correlation function (CCF) defined in (1.3) across a range of lags, with Columbia as the target and Calgary as the source.

Interpretation logic:

  • If the CCF peaks at \(\tau = +2\), this means: shift Calgary forward by 2 days, and the two series match best. In meteorological terms, Calgary’s temperature anomalies arrive 2 days before Columbia’s. Put plainly: a cold spell in Calgary on January 10 tends to reach Columbia around January 12.

  • If the peak were at \(\tau = -2\), it would mean Columbia leads Calgary—physically implausible for eastward-moving systems.

  • A peak near \(\tau = 0\) would mean the cities respond simultaneously to the same large-scale weather driver (e.g., the seasonal cycle), without a propagation delay.
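The sign convention in the bullets above can be verified on a toy example before trusting it on real data. In this sketch the "target" is built as a 2-step delayed copy of the "source"; the names `source_sim`, `target_sim`, and `peak_lag` are assumptions made for the illustration.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(7)
true_delay = 2
base = rng.standard_normal(400 + true_delay)
source_sim = base[true_delay:]    # upstream series
target_sim = base[:-true_delay]   # target_sim[t + 2] == source_sim[t]

def peak_lag(first, second):
    """Lag at which signal.correlate(first, second) is largest."""
    corr = signal.correlate(first, second, mode="full", method="fft")
    lags = signal.correlation_lags(len(first), len(second), mode="full")
    return int(lags[np.argmax(corr)])

lag_target_first = peak_lag(target_sim, source_sim)  # target passed first
lag_source_first = peak_lag(source_sim, target_sim)  # arguments swapped
print(lag_target_first, lag_source_first)
```

With the target passed first, a positive peak lag means the source leads; swapping the arguments flips the sign, which is why the argument order matters when reading the CCF.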

import numpy as np
from scipy import signal

def analyze_cross_correlation(target_series, source_series):
    """
    Computes the normalized cross-correlation between a target and a source series.
    With signal.correlate(target, source), a positive lag means the source
    leads the target by that many samples.
    Returns:
        lags (np.array): Array of lag indices.
        corr (np.array): Normalized cross-correlation values.
        best_lag (int): The lag (in samples, here days) where correlation is maximized.
        max_corr (float): The maximum correlation value.
    """
    # 1. Normalize the signals (Z-score normalization)
    # This ensures the output is a correlation coefficient (-1 to 1)
    target_norm = (target_series - target_series.mean()) / target_series.std()
    source_norm = (source_series - source_series.mean()) / source_series.std()
    
    # 2. Compute Cross-Correlation using FFT
    # 'full' mode returns valid cross-correlation for all possible overlaps
    correlation = signal.correlate(target_norm, source_norm, mode='full', method='fft')
    
    # 3. Divide by the series length to approximate the Pearson coefficient
    # Note: this is an approximation for large N, because pandas .std() uses
    # ddof=1 and only N - |lag| pairs overlap at a nonzero lag.
    correlation /= len(target_series)
    
    # 4. Get the lags associated with the correlation array
    lags = signal.correlation_lags(len(target_series), len(source_series), mode='full')
    
    # 5. Find the optimal lag (max correlation)
    max_idx = np.argmax(correlation)
    best_lag = lags[max_idx]
    max_corr = correlation[max_idx]
    
    return lags, correlation, best_lag, max_corr

# Run the analysis
lags, ccf_values, best_lag, max_corr = analyze_cross_correlation(
    df_combined['Columbia'], 
    df_combined['Calgary']
)

print(f"Optimal Lag: {best_lag} days")
print(f"Max Correlation: {max_corr:.4f}")

# Verify Improvement
original_corr = df_combined['Columbia'].corr(df_combined['Calgary'])
print(f"Original Correlation (Lag 0): {original_corr:.4f}")
Optimal Lag: 2 days
Max Correlation: 0.8317
Original Correlation (Lag 0): 0.7813
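Because the FFT-based estimate divides by a fixed \(N\), it only approximates the Pearson coefficient at nonzero lags. As a cross-check, the exact Pearson correlation at each lag can be computed by shifting one series with pandas. The sketch below uses synthetic data; `lagged_pearson` and the demo series are hypothetical names, not part of the lecture code.

```python
import numpy as np
import pandas as pd

def lagged_pearson(target, source, max_lag):
    """Pearson r between target[t] and source[t - k] for k = -max_lag..max_lag.

    A positive k means the source is compared k steps earlier, i.e. it leads.
    """
    return pd.Series(
        {k: target.corr(source.shift(k)) for k in range(-max_lag, max_lag + 1)}
    )

rng = np.random.default_rng(3)
dates = pd.date_range("2024-01-01", periods=366, freq="D")
source_demo = pd.Series(rng.standard_normal(366), index=dates)
# Target follows the source two days later, plus independent noise
target_demo = source_demo.shift(2) + 0.3 * rng.standard_normal(366)

r_by_lag = lagged_pearson(target_demo, source_demo, max_lag=10)
best_k = int(r_by_lag.idxmax())
print(best_k, round(float(r_by_lag.loc[best_k]), 3))
```

`Series.corr` drops the missing values introduced by the shift, so each lag uses only the overlapping days.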

Next, we move from what the series look like to when one location tends to follow the other, examining the cross‑correlation and a lagged overlay of the two series to confirm the relationship. With the argument order used above, a positive lag implies the “Source” (Calgary) leads the “Target” (Columbia).

../_images/como_calgary_cross_corr_aligned.png

Fig. 1.12 (a) Normalized cross-correlation between Calgary and Columbia daily temperatures for lags from −50 to +50 days. The peak occurs at \(\tau = +2\) days with a correlation of 0.83, substantially higher than the lag-0 correlation of 0.78. The dashed red line marks the optimal lag; the vertical black line shows lag zero for reference. (b) Time series overlay after shifting Calgary by +2 days. The two curves now track closely, confirming visual alignment. Deviations reflect either local weather noise or synoptic features that do not propagate uniformly across the distance.

Interpretation:

The 2-day lead makes meteorological sense. Weather systems (cold fronts, high-pressure ridges) that originate in the Pacific or Rocky Mountain region reach Calgary first, then propagate eastward. Covering the more than 2,000 km to Columbia in about 2 days implies a typical speed of roughly 1,000 km per day (about 12 m/s). This is a plausible speed for synoptic systems, which generally travel more slowly than the winds that steer them (~30–40 m/s at jet stream altitudes).
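A back-of-envelope check of the propagation speed implied by a 2-day lag can be sketched as follows. The city-centre latitudes and longitudes are approximate values assumed for this illustration (the text above only quotes the longitudes), and the straight great-circle path is itself a simplification of how weather systems travel.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Approximate coordinates (assumed for illustration)
calgary = (51.05, -114.07)
columbia = (38.95, -92.33)

distance_km = haversine_km(*calgary, *columbia)
lag_days = 2  # CCF peak reported above
speed_km_per_day = distance_km / lag_days
speed_m_per_s = speed_km_per_day * 1000 / 86400

print(f"distance: {distance_km:.0f} km")
print(f"implied speed: {speed_km_per_day:.0f} km/day = {speed_m_per_s:.1f} m/s")
```

Under these assumptions the separation comes out a little over 2,000 km, so a 2-day lag corresponds to roughly 1,000 km per day.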

The correlation magnitude (0.83) is high but not perfect (1.0). Why? Because:

  1. Local effects (urban heat island, local orography) introduce noise at each location.

  2. Some weather systems weaken or split during eastward travel.

  3. The 2-day lag is an average; some events lead by 1 day, others by 3, creating scatter.

Note

Lag answers the question: “How many days early does Location A predict Location B’s anomalies?” A positive lag means the source (Calgary) is predictive of the future target (Columbia); we are not claiming causality, only that they share a time-shifted correlation structure consistent with eastward weather propagation.

1.6.2.3. Part 2: Frequency‑Domain Analysis – At What Time‑Scales Are the Cities Linked?

Now we ask a complementary question: “Which cycle lengths matter?” The annual cycle is trivial (both locations have summer and winter); the coherence function reveals which other timescales drive the relationship.

Conceptual setup:

Imagine decomposing each time series into “layers” at different frequencies:

  • Very low frequency (annual, 1 cycle per year): the smooth seasonal envelope.

  • Intermediate frequencies (3–7 day timescales): synoptic weather systems (fronts, high/low pressure centers).

  • High frequencies (periods of 2–3 days, up to the Nyquist limit of 0.5 cycles/day for daily data): local weather noise and other short-lived fluctuations.

Coherence measures, for each frequency band, how tightly the two cities’ variations are linked. A value near 1.0 means at that frequency, the two cities move in lockstep; a value near 0.0 means they vary independently at that frequency.

from scipy import signal

# Compute magnitude-squared coherence using Welch's method
# (.values converts each pandas Series to a NumPy array)
f, Cxy = signal.coherence(df_combined['Columbia'].values, 
                          df_combined['Calgary'].values, 
                          fs=1.0, nperseg=100)
# f: frequency (cycles/day), since fs=1.0 sample per day
# Cxy: coherence (0 to 1)
../_images/como_calgary_freq_cross_corr.png

Fig. 1.13 Magnitude-squared coherence between Calgary and Columbia daily temperatures versus frequency (in cycles per day). Two vertical dashed lines mark reference timescales: the annual cycle at 1/365 ≈ 0.0027 cycles/day (red), and the weekly cycle at 1/7 ≈ 0.143 cycles/day (green). Coherence peaks sharply near the annual frequency (as expected), but importantly remains high in the 3–7 day band (around 0.15 cycles/day), indicating that synoptic weather patterns drive a substantial portion of the shared variability beyond simple seasonality.
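When reading the low-frequency end of such a plot, it helps to know which frequencies the Welch estimate actually returns. The sketch below inspects the frequency grid produced by `nperseg=100` on a year of daily samples; the random input arrays are placeholders, since only the grid matters here.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
n_days = 366
u = rng.standard_normal(n_days)  # placeholder data; only the frequency grid is of interest
v = rng.standard_normal(n_days)

f_100, _ = signal.coherence(u, v, fs=1.0, nperseg=100)

bin_spacing = float(f_100[1] - f_100[0])   # equals fs / nperseg
longest_period = 1.0 / float(f_100[1])     # period of the lowest nonzero bin, in days
nyquist = float(f_100[-1])                 # highest frequency, fs / 2
annual_freq = 1.0 / 365.0

print(f"bin spacing: {bin_spacing:.3f} cycles/day")
print(f"lowest nonzero bin: {longest_period:.0f}-day period")
print(f"annual frequency below first nonzero bin: {annual_freq < f_100[1]}")
```

With this segment length the bins are 0.01 cycles/day apart, so the annual line at 1/365 lies below the first nonzero bin, and the leftmost points are best read as "seasonal and slower" variability rather than a sharply resolved annual peak. A longer `nperseg` would refine the grid at the cost of fewer segments to average.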

Interpretation:

  1. Low frequencies (left side, near 0): Coherence ≈ 1.0. This is expected and almost trivial: both locations experience winter in January and summer in July. The annual cycle dominates the temperature variance and is perfectly coherent between them.

  2. Intermediate frequencies (the “synoptic band,” 1/7 to 1/3 cycles/day, or 3–7 day timescales): Coherence remains elevated (0.4–0.6 or higher). This is the key physical insight. Cold fronts, high-pressure ridges, and other synoptic disturbances that affect Calgary on, say, Monday tend to affect Columbia by Wednesday or Thursday. Because these systems propagate together across both locations, the cities show coherence at the 3–7 day timescale.

  3. High frequencies (right side, above about 1/3 ≈ 0.33 cycles/day, or sub-3-day scales): Coherence drops. This makes sense: brief local events (afternoon thunderstorms, local wind shifts) occur at different times in different places and are not propagating organized systems.
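The band-by-band reading above can be turned into numbers by averaging coherence over a range of periods. The following is a minimal sketch on synthetic data, in which two series share a band-limited component with periods of 3 to 7 days plus independent noise; the helper `band_mean_coherence`, the filter order, and the noise level are all assumptions made for this illustration.

```python
import numpy as np
from scipy import signal

def band_mean_coherence(freqs, coh, min_period, max_period):
    """Mean coherence over periods in [min_period, max_period] (units of 1/fs)."""
    mask = (freqs >= 1.0 / max_period) & (freqs <= 1.0 / min_period)
    return float(coh[mask].mean())

rng = np.random.default_rng(11)
n = 3000

# Shared "synoptic" component: white noise band-passed to 3-7 day periods
sos = signal.butter(4, [1.0 / 7.0, 1.0 / 3.0], btype="bandpass", fs=1.0, output="sos")
shared = signal.sosfiltfilt(sos, rng.standard_normal(n))

# Each site adds its own independent local noise
site_a = shared + 0.4 * rng.standard_normal(n)
site_b = shared + 0.4 * rng.standard_normal(n)

freqs, coh = signal.coherence(site_a, site_b, fs=1.0, nperseg=100)

synoptic = band_mean_coherence(freqs, coh, min_period=3, max_period=7)
fast = band_mean_coherence(freqs, coh, min_period=2, max_period=3)
print(f"mean coherence, 3-7 day band: {synoptic:.2f}")
print(f"mean coherence, 2-3 day band: {fast:.2f}")
```

In this construction the 3 to 7 day band should show clearly higher mean coherence than the 2 to 3 day band, mirroring the pattern described for the two cities.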

Note

Coherence answers the question: “At what frequencies do the two signals behave like a single, coordinated system?” High annual coherence is expected (shared seasons). High synoptic coherence reveals that the cities are coupled by large-scale weather systems that propagate between them. Low high-frequency coherence reflects local independence. Together, these tell us the dominant physical mechanisms linking the two sites.

1.6.2.4. Synthesis: Lag and Coherence

The cross‑correlation and coherence results answer complementary questions in a compact way:

| Aspect | Question | Answer from Data |
| --- | --- | --- |
| Timing (Time Domain) | How many days does Calgary lead Columbia? | ≈2 days (CCF peak at \(\tau = +2\)) |
| Correlation strength (Time Domain) | How strong is the link at that lag? | \(r \approx 0.83\) (higher than lag 0) |
| Seasonal linkage (Frequency Domain) | Are annual cycles shared? | Yes, coherence ≈ 1 at 1/365 cycles/day |
| Synoptic linkage (Frequency Domain) | Are 3–7 day systems shared? | Yes, coherence stays elevated near 1/7 cycles/day |
| Local variability (Frequency Domain) | Are sub‑3‑day fluctuations shared? | Mostly no, coherence drops at high frequencies |

Notes

  1. Why do we need both analyses?
    Cross-correlation reveals when one location predicts another (the 2-day lead); coherence reveals which timescales matter (synoptic systems contribute, not just annual seasonality). Together, they point to the mechanisms linking the two cities: large-scale weather systems propagating eastward at roughly 1,000 km per day, while local boundary-layer processes remain independent.

  2. Note on Practical Use: In operations, knowing that Calgary temperature anomalies lead Columbia by ~2 days allows meteorologists to use Calgary observations to improve short-range forecasts for Columbia. However, the imperfect correlation (0.83, not 1.0) reflects real variability in propagation speed and system intensity—a useful reminder that statistical relationships never replace physical reasoning.
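As a rough illustration of how a known lead could feed a short-range forecast, the sketch below fits a simple lagged linear regression on synthetic data. It is not an operational method: the series, the 2-day lead, and the coefficients (10.0 and 0.8) are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
lead_days = 2
n_days = 400

# Hypothetical upstream series and a downstream series that follows it 2 days later
upstream = 5.0 * rng.standard_normal(n_days)
downstream = np.full(n_days, np.nan)
downstream[lead_days:] = (
    10.0 + 0.8 * upstream[:-lead_days] + rng.standard_normal(n_days - lead_days)
)

# Fit downstream[t] ~ intercept + slope * upstream[t - lead_days]
slope, intercept = np.polyfit(upstream[:-lead_days], downstream[lead_days:], deg=1)

# The last `lead_days` upstream values have no downstream partner yet,
# so they give predictions for the next `lead_days` days
forecast = intercept + slope * upstream[-lead_days:]
print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")
print("forecast for the next days:", np.round(forecast, 1))
```

The fitted slope and intercept recover the values used to generate the data only approximately, which echoes the point that an imperfect correlation limits how much the upstream series can tell us.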