From Thousands to Pairs: Statistical Arbitrage Screening with Cointegration and Kalman Filters | API Guide

"Two stocks move together because they share a common hidden cause. Find that cause, and you find the trade."

By August 2007, the quant fund Long-Term Capital Management had collapsed nearly a decade earlier, but the lesson endured. Two stocks in the same sector should not be treated as independent instruments. When Goldman Sachs and Morgan Stanley both trade on Wall Street, they inhale and exhale together. The arbitrageur's job is not to predict direction. It is to measure the distance between the two, wait for that distance to exceed a statistical threshold, and bet on convergence.

Pairs trading remains one of the few strategies that is genuinely market-neutral in theory, and one of the most treacherous to execute in practice. The gap between textbook and production is wide. This article builds the screening pipeline from the ground up: how to start with thousands of instruments, apply econometric filters, estimate mean-reversion half-life, and implement a dynamic Kalman filter to track the hedge ratio in real time. All code is Python written in a production style, with error handling, retry logic with exponential backoff, and environment-variable-based authentication.

1. Why Cointegration Beats Correlation

Every newcomer to pairs trading starts with correlation. Correlation measures whether two series move in the same direction. Cointegration measures whether two series are pulled together by a long-run force despite short-run deviations.

The distinction matters enormously. Consider SPY and QQQ. Their 252-day correlation exceeds 0.97. But they are both trending upward over multi-year horizons. They share a common trend, not a mean-reverting relationship. Correlation is a short-run property. Cointegration is a long-run equilibrium condition.

Formally, two price series $X_t$ and $Y_t$, each non-stationary on its own, are cointegrated if there exists a coefficient $\beta$ such that the residual $Z_t = Y_t - \beta X_t$ is stationary, meaning $Z_t$ has a constant mean, constant variance, and autocovariance that depends only on lag, not on absolute time.

$$Z_t = Y_t - \beta X_t \sim I(0)$$

Stationarity is the whole game. If $Z_t$ is stationary, deviations from its mean are temporary: the spread is pulled back toward that mean instead of wandering off. The spread is mean-reverting, and we have a trade.

Correlation can be high while cointegration fails. Two random walks can drift apart forever. Correlation tells you about the joint distribution of returns. Cointegration tells you about the equilibrium relationship between levels. For pairs trading, you need the latter.
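
The difference is easy to see on simulated data. The sketch below is illustrative only: the two synthetic pairs, the seed, and the helper name `return_corr` are assumptions made for this example, not part of the pipeline. Pair A is cointegrated by construction yet has only moderate return correlation; pair B has highly correlated returns but no force tying the levels together.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Random-walk price path for the first asset.
x = 100 + np.cumsum(rng.normal(0, 1, n))

# Pair A: cointegrated with x. y_a never strays far from x, but its
# day-to-day changes are noisy, so return correlation is only moderate.
y_a = x + rng.normal(0, 1, n)

# Pair B: a separate random walk whose daily changes are 0.9-correlated with
# those of x. Returns move together, yet nothing ties the levels together.
dx = np.diff(x, prepend=100.0)
y_b = 100 + np.cumsum(0.9 * dx + np.sqrt(1 - 0.9**2) * rng.normal(0, 1, n))


def return_corr(a, b):
    """Correlation of first differences (a stand-in for daily returns)."""
    return np.corrcoef(np.diff(a), np.diff(b))[0, 1]


corr_a, corr_b = return_corr(x, y_a), return_corr(x, y_b)
spread_std_a, spread_std_b = np.std(y_a - x), np.std(y_b - x)

print(f"Pair A: return corr {corr_a:.2f}, spread std {spread_std_a:.2f}")
print(f"Pair B: return corr {corr_b:.2f}, spread std {spread_std_b:.2f}")
```

Pair B wins on correlation, but its spread is itself a random walk whose dispersion grows with the sample length, so there is no stable mean to trade against.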

2. The Cointegration Testing Pipeline

2.1 Step 1: Pre-Screening with Correlation

Testing every possible pair among 3,000 US equities for cointegration is computationally expensive. The standard approach is a two-stage filter:

Correlation filter: Discard pairs with rolling correlation below 0.80 over a 252-day window. This eliminates obvious non-pairs.
Cointegration test: Apply the Engle-Granger or Johansen test to the surviving pairs.

The correlation threshold of 0.80 is an operational choice. Lower thresholds pass too many pairs into cointegration testing, dramatically increasing computation. Higher thresholds risk missing genuine cointegrated pairs in sectors with low inter-stock correlation.
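
As a rough illustration of the first stage, candidate pairs can be read off the upper triangle of a correlation matrix in one vectorized pass. This is a minimal sketch, not the screening function used later: `candidate_pairs` and its arguments are hypothetical names, and it assumes the caller passes a T x N array of prices (or returns) with one column per symbol.

```python
import numpy as np


def candidate_pairs(prices, symbols, min_corr=0.80, window=252):
    """Return (sym_i, sym_j, corr) for every pair whose correlation over the
    last `window` rows is at least `min_corr`."""
    recent = np.asarray(prices, dtype=float)[-window:]
    corr = np.corrcoef(recent, rowvar=False)       # N x N correlation matrix
    i, j = np.triu_indices(len(symbols), k=1)      # upper triangle, no diagonal
    keep = corr[i, j] >= min_corr
    return [(symbols[a], symbols[b], float(corr[a, b]))
            for a, b in zip(i[keep], j[keep])]


# Toy data: B is an exact linear function of A; C is out of phase with both.
t = np.arange(300)
a = np.sin(t / 10.0)
demo = np.column_stack([a, 2 * a + 1, np.cos(t / 10.0)])
pairs = candidate_pairs(demo, ["A", "B", "C"])
print(pairs)
```

Using `np.triu_indices` with `k=1` visits each unordered pair exactly once, which is the same set the nested loops in the screening function walk through.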

import os
import time
import random
import numpy as np
import pandas as pd
import requests
from datetime import datetime, timedelta
from itertools import combinations
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller, coint

# Environment variable authentication pattern
API_KEY = os.environ.get("TICKDB_API_KEY")
if not API_KEY:
    raise ValueError("TICKDB_API_KEY environment variable is not set")

BASE_URL = "https://api.tickdb.ai/v1"

def fetch_historical_kline(symbol: str, interval: str = "1d", limit: int = 500) -> pd.DataFrame:
    """Fetch historical OHLCV data for a given symbol.
    
    Uses the TickDB /v1/market/kline endpoint. For pairs trading backtests,
    use at least 500 daily bars to ensure statistical significance of
    cointegration tests. 2 years (≈504 trading days) is preferred.
    
    Args:
        symbol: Exchange symbol, e.g. "AAPL.US"
        interval: Candle interval, defaults to "1d" for daily analysis
        limit: Number of bars to fetch, defaults to 500 (≈2 years of daily data)
    
    Returns:
        DataFrame with columns: timestamp, open, high, low, close, volume
    """
    url = f"{BASE_URL}/market/kline"
    headers = {"X-API-Key": API_KEY}
    params = {"symbol": symbol, "interval": interval, "limit": limit}
    
    max_retries = 3
    base_delay = 1.0
    
    for attempt in range(max_retries):
        try:
            response = requests.get(
                url,
                headers=headers,
                params=params,
                timeout=(3.05, 10)
            )
            data = response.json()
            
            if data.get("code") == 0:
                df = pd.DataFrame(data["data"])
                df["timestamp"] = pd.to_datetime(df["ts"], unit="ms")
                df = df.sort_values("timestamp").reset_index(drop=True)
                return df[["timestamp", "open", "high", "low", "close", "volume"]]
            
            # Rate limit handling
            elif data.get("code") == 3001:
                retry_after = int(response.headers.get("Retry-After", 5))
                print(f"Rate limited. Waiting {retry_after} seconds.")
                time.sleep(retry_after)
                continue
            
            # Symbol not found
            elif data.get("code") == 2002:
                print(f"Symbol {symbol} not found. Verify via /v1/symbols/available")
                return pd.DataFrame()
            
            else:
                raise RuntimeError(f"API error {data.get('code')}: {data.get('message')}")
                
        except requests.exceptions.Timeout:
            delay = min(base_delay * (2 ** attempt), 30)
            jitter = random.uniform(0, delay * 0.1)
            print(f"Timeout on attempt {attempt + 1}. Retrying in {delay + jitter:.1f}s")
            time.sleep(delay + jitter)
            continue
            
        except requests.exceptions.RequestException as e:
            raise RuntimeError(f"Request failed: {e}")
    
    raise RuntimeError(f"Failed after {max_retries} attempts")

2.2 Step 2: The Engle-Granger Cointegration Test

The Engle-Granger two-step method is the workhorse of cointegration testing. It proceeds as follows:

Regress $Y_t$ on $X_t$ via OLS to estimate the hedge ratio $\beta$.
Compute the residuals $Z_t = Y_t - \beta X_t$.
Apply the Augmented Dickey-Fuller (ADF) test to $Z_t$. If the null hypothesis of a unit root is rejected, the residuals are stationary, and the pair is cointegrated.

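Because $\beta$ is itself estimated from the data, the test statistic follows neither the standard normal nor the ordinary Dickey-Fuller distribution, which is why statsmodels.tsa.stattools.coint, which applies critical values tabulated for estimated residuals, is the right tool here.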

def test_cointegration(series1: pd.Series, series2: pd.Series) -> dict:
    """Engle-Granger cointegration test.
    
    Returns a dict with the test statistic, p-value, and critical values.
    A p-value below 0.05 indicates cointegration at the 5% significance level.
    
    Args:
        series1: First price series (e.g., AAPL close prices); the regressor X
        series2: Second price series (e.g., MSFT close prices); the dependent Y
    
    Returns:
        dict with keys: t_stat, p_value, crit_values, is_cointegrated,
        hedge_ratio, n_observations
    """
    # Align series: drop rows with NaN in either series
    aligned = pd.DataFrame({"s1": series1, "s2": series2}).dropna()
    
    if len(aligned) < 252:
        return {"is_cointegrated": False, "reason": "Insufficient data (< 252 points)"}
    
    # Step 1: OLS regression of Y (s2) on X (s1) to find the hedge ratio
    X = sm.add_constant(aligned["s1"])
    model = sm.OLS(aligned["s2"], X).fit()
    hedge_ratio = model.params["s1"]
    
    # Steps 2-3: Engle-Granger test. coint() runs the same regression of s2 on s1
    # and tests the residuals using critical values for *estimated* residuals.
    # A plain adfuller() p-value on the residuals would reject too often.
    t_stat, p_value, crit = coint(aligned["s2"], aligned["s1"], trend="c")
    crit_values = dict(zip(("1%", "5%", "10%"), crit))
    
    is_cointegrated = p_value < 0.05
    
    return {
        "t_stat": t_stat,
        "p_value": p_value,
        "crit_values": crit_values,
        "is_cointegrated": is_cointegrated,
        "hedge_ratio": hedge_ratio,
        "n_observations": len(aligned)
    }


def screen_pairs(stock_list: list[str], min_correlation: float = 0.80) -> pd.DataFrame:
    """Screen a list of stocks for cointegrated pairs.
    
    Stage 1: Correlation filter (252-day rolling window).
    Stage 2: Engle-Granger cointegration test on surviving pairs.
    
    ⚠️ This function is computationally intensive for large stock lists.
    For 1,000 stocks, it generates ~500,000 pairs. Consider using
    multiprocessing.Pool for parallel computation.
    
    Args:
        stock_list: List of TickDB symbols, e.g. ["AAPL.US", "MSFT.US"]
        min_correlation: Minimum 252-day correlation to proceed to cointegration test
    
    Returns:
        DataFrame of cointegrated pairs with test statistics
    """
    print(f"Fetching price data for {len(stock_list)} stocks...")
    
    # Fetch all price data
    price_data = {}
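    for symbol in stock_list:
        df = fetch_historical_kline(symbol, interval="1d", limit=500)
        if not df.empty:
            # Index by timestamp so the price matrix aligns on dates, not row position
            price_data[symbol] = df.set_index("timestamp")["close"].astype(float)
            print(f"  Fetched {symbol}: {len(df)} bars")
        time.sleep(0.1)  # Respect rate limits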
    
    # Align all series to a common date index
    price_df = pd.DataFrame(price_data)
    price_df = price_df.ffill().dropna()
    
    print(f"\nAligned price matrix: {price_df.shape[0]} trading days x {price_df.shape[1]} stocks")
    
    # Stage 1: Correlation filter over the most recent 252 trading days
    print("Stage 1: Computing correlation matrix...")
    corr_matrix = price_df.tail(252).corr()
    
    print("Stage 2: Testing cointegration on candidate pairs...")
    results = []
    stock_symbols = list(price_df.columns)
    
    # Generate pairs (skip symmetric duplicates)
    pair_count = 0
    tested_count = 0
    
    for i in range(len(stock_symbols)):
        for j in range(i + 1, len(stock_symbols)):
            sym1, sym2 = stock_symbols[i], stock_symbols[j]
            pair_count += 1
            
            # Correlation filter
            corr = corr_matrix.loc[sym1, sym2]
            if corr < min_correlation:
                continue
            
            # Cointegration test
            tested_count += 1
            result = test_cointegration(price_df[sym1], price_df[sym2])
            
            if result["is_cointegrated"]:
                # Rebuild the spread Y - beta * X from the fitted hedge ratio
                spread = price_df[sym2] - result["hedge_ratio"] * price_df[sym1]
                half_life = calculate_half_life(spread)
                results.append({
                    "stock1": sym1,
                    "stock2": sym2,
                    "correlation": corr,
                    "hedge_ratio": result["hedge_ratio"],
                    "p_value": result["p_value"],
                    "t_stat": result["t_stat"],
                    "half_life_days": half_life
                })
                print(f"  ✓ Cointegrated pair found: {sym1}/{sym2} (p={result['p_value']:.4f})")
    
    print(f"\nPair screening complete: {pair_count} total pairs, {tested_count} tested, {len(results)} cointegrated")
    
    return pd.DataFrame(results)

3. Mean-Reversion Half-Life

Once you confirm cointegration, the next question is: how fast does the spread mean-revert? This determines your holding period, your position sizing, and whether the strategy is economically viable after transaction costs.

The Ornstein-Uhlenbeck process models the spread as a mean-reverting process:

$$dZ_t = \lambda (\mu - Z_t)\, dt + \sigma\, dW_t$$

where $\mu$ is the long-run mean of the spread, $\sigma$ is its volatility, and $\lambda > 0$ controls the speed of mean reversion. The half-life of this process is:

$$\text{half-life} = \frac{\ln 2}{|\lambda|}$$
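
For readers who want the missing step: the half-life expression follows from the expected path of the process. Taking expectations of the OU dynamics removes the noise term and leaves exponential decay toward the mean (here $Z_0$ denotes the spread at the moment of observation, a symbol introduced only for this derivation):

$$\mathbb{E}[Z_t] - \mu = (Z_0 - \mu)\, e^{-\lambda t}$$

Setting the remaining deviation to half of its starting value gives

$$e^{-\lambda t_{1/2}} = \tfrac{1}{2} \quad\Longrightarrow\quad t_{1/2} = \frac{\ln 2}{\lambda}$$

As a worked number, $\lambda = 0.05$ per day corresponds to a half-life of $\ln 2 / 0.05 \approx 13.9$ days.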

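We estimate $\lambda$ from the same regression that underlies the ADF test. Writing the spread as an AR(1) process $Z_t = c + \phi Z_{t-1} + \varepsilon_t$ and subtracting $Z_{t-1}$ from both sides gives $\Delta Z_t = c + \theta Z_{t-1} + \varepsilon_t$ with $\theta = \phi - 1$. For a mean-reverting spread, $0 < \phi < 1$ and therefore $\theta < 0$. With one bar as the time unit, $\lambda \approx 1 - \phi = -\theta$, so the half-life is approximately $-\ln 2 / \theta$: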

def calculate_half_life(spread: pd.Series) -> float:
    """Calculate the Ornstein-Uhlenbeck half-life of a mean-reverting spread.
    
    The half-life tells us approximately how many periods it takes for the spread
    to revert halfway back to its mean. Pairs with half-lives between 5 and 60 days
    are typically the most tradeable — short enough to cycle capital efficiently,
    long enough to absorb transaction costs.
    
    Half-lives under 5 days may incur excessive brokerage commissions.
    Half-lives over 120 days may not generate sufficient annual returns.
    
    Args:
        spread: Stationary residuals from the cointegration regression
    
    Returns:
        Half-life in periods (days for daily data)
    """
    spread_lag = spread.shift(1).dropna()
    delta_spread = spread.diff().dropna()
    
    # Align the series
    common_idx = spread_lag.index.intersection(delta_spread.index)
    spread_lag = spread_lag.loc[common_idx]
    delta_spread = delta_spread.loc[common_idx]
    
    # Regress ΔZ_t on Z_{t-1}
    X = sm.add_constant(spread_lag)
    model = sm.OLS(delta_spread, X).fit()
    theta = model.params.iloc[1]  # Slope on Z_{t-1}: theta = phi - 1, negative if mean-reverting
    
    if theta >= 0:
        return float("inf")  # Not mean-reverting
    
    half_life = -np.log(2) / theta
    return half_life
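
One way to gain confidence in the estimator is to run it on a simulated AR(1) spread whose true coefficient is known. The sketch below is a standalone check, not part of the pipeline: `half_life_ar1` is a hypothetical NumPy-only re-implementation of the same regression, and the value $\phi = 0.95$ and the seed are arbitrary choices.

```python
import numpy as np


def half_life_ar1(z):
    """Half-life from the slope of dz_t regressed on z_{t-1} (with intercept)."""
    z = np.asarray(z, dtype=float)
    theta, _intercept = np.polyfit(z[:-1], np.diff(z), 1)
    return np.inf if theta >= 0 else -np.log(2) / theta


# Simulate Z_t = phi * Z_{t-1} + eps_t with a known phi.
rng = np.random.default_rng(42)
phi, n = 0.95, 20_000
z = np.zeros(n)
for t in range(1, n):
    z[t] = phi * z[t - 1] + rng.normal()

estimated = half_life_ar1(z)
approx = np.log(2) / (1 - phi)      # first-order approximation, lambda ~ 1 - phi
exact = -np.log(2) / np.log(phi)    # exact half-life of a discrete AR(1)

print(f"estimated {estimated:.1f}, approximation {approx:.1f}, exact {exact:.1f}")
```

The approximation and the exact value differ by a few percent at $\phi = 0.95$; the gap widens as $\phi$ moves away from 1, which matters mostly for very fast-reverting spreads.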

Practical half-life interpretation:

Half-life	Interpretation	Trade suitability
< 5 days	Very fast reversion	High transaction costs may eliminate edge
5–20 days	Fast reversion	Good for high-frequency capital cycling
20–60 days	Moderate	Standard pairs trading territory
60–120 days	Slow	Viable for larger portfolios with lower turnover
> 120 days	Very slow	Unlikely to be economically viable after costs

A pair with a half-life of 15 days and a standard deviation of the spread of 2.5% generates roughly one round-trip trade per month. If your round-trip transaction cost is 0.10% (bid-ask + slippage), you need the expected reversion magnitude to exceed your cost threshold consistently.
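
The arithmetic behind that statement can be sketched under a strong simplifying assumption: that the deviation decays deterministically along the OU mean path, $z(t) = z_0 \cdot 2^{-t/\text{half-life}}$. The function name and the entry/exit levels of 2.0 and 0.5 standard deviations (the defaults used in the backtest later in this article) are assumptions for the example; real holding times are random and can be much longer.

```python
import math


def reversion_trade_estimate(half_life_days, spread_std, entry_z, exit_z,
                             round_trip_cost):
    """Rough economics of one reversion trade under deterministic OU decay."""
    # Time for the expected deviation to shrink from entry_z to exit_z.
    holding_days = half_life_days * math.log2(entry_z / exit_z)
    # Captured move, in the same fractional units as spread_std.
    gross = (entry_z - exit_z) * spread_std
    net = gross - round_trip_cost
    return holding_days, gross, net


days, gross, net = reversion_trade_estimate(
    half_life_days=15, spread_std=0.025, entry_z=2.0, exit_z=0.5,
    round_trip_cost=0.0010,
)
print(f"holding ~{days:.0f} days, gross {gross:.2%}, net {net:.2%}")
```

Falling from 2.0 to 0.5 standard deviations takes two half-lives on the mean path, which is where the "roughly one round trip per month" figure for a 15-day half-life comes from.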

4. Kalman Filter for Dynamic Hedge Ratio

The static hedge ratio from OLS assumes $\beta$ is constant over time. It is not. In practice, the fundamental relationship between two stocks drifts. A static hedge ratio computed over two years of data will be wrong six months from now if the two companies' business dynamics have diverged.

The Kalman filter solves this by updating the hedge ratio recursively as new data arrives. It treats $\beta$ as a hidden state that evolves over time according to a random walk, and updates it based on each new observation of the spread.

State-space model:

State equation: $\beta_t = \beta_{t-1} + w_t$, where $w_t \sim N(0, Q)$
Observation equation: $y_t = \beta_t x_t + v_t$, where $v_t \sim N(0, R)$

Here, $y_t$ is the price of the dependent stock, $x_t$ is the price of the independent stock, and $\beta_t$ is the time-varying hedge ratio. $Q$ (process noise) and $R$ (observation noise) are hyperparameters that control how quickly the hedge ratio adapts.
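
Before wiring the filter to live prices, it can help to see the recursion implied by these two equations on synthetic data where the true $\beta_t$ is known. The following is a minimal standalone sketch: `kalman_beta`, the values of `q` and `r`, and the simulated price paths are all assumptions chosen for the example, with `q` and `r` playing the roles of $Q$ and $R$ above.

```python
import numpy as np


def kalman_beta(x, y, q=1e-6, r=0.25, beta0=0.0, p0=1.0):
    """Scalar Kalman filter for y_t = beta_t * x_t + v_t, beta_t a random walk."""
    beta, p = beta0, p0
    betas = np.empty(len(x))
    for t, (xt, yt) in enumerate(zip(x, y)):
        p = p + q                       # predict: state variance grows by Q
        innovation = yt - beta * xt     # forecast error of the observation
        s = xt * xt * p + r             # innovation variance
        k = p * xt / s                  # Kalman gain
        beta = beta + k * innovation    # update the state estimate
        p = (1.0 - k * xt) * p          # update the state variance
        betas[t] = beta
    return betas


# Synthetic pair: the true hedge ratio drifts linearly from 1.0 to 1.5.
rng = np.random.default_rng(7)
n = 1000
t = np.arange(n)
x = 100 + 10 * np.sin(t / 50) + rng.normal(0, 1, n)
true_beta = np.linspace(1.0, 1.5, n)
y = true_beta * x + rng.normal(0, 0.5, n)

betas = kalman_beta(x, y)
max_error = np.max(np.abs(betas[50:] - true_beta[50:]))  # skip burn-in
print(f"final beta {betas[-1]:.3f}, max tracking error {max_error:.4f}")
```

Raising `q` relative to `r` makes the estimate follow the drift more tightly at the cost of a noisier hedge ratio; lowering it does the opposite.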

import requests

class KalmanFilterHedgeRatio:
    """Dynamic hedge ratio estimation using a Kalman filter.
    
    This class implements a 1D Kalman filter to track the time-varying
    hedge ratio between two assets. Unlike OLS, which assumes a constant
    beta, the Kalman filter allows beta to drift smoothly over time.
    
    Key parameters:
        delta: Sets the process noise variance, Q = delta^2
               Smaller delta → slower beta adaptation
               Larger delta → faster beta adaptation
        phi: State transition coefficient (default 1.0 = random walk)
               phi < 1 adds mean-reversion to beta (more stable estimates)
    
    ⚠️ For live trading, re-initialize the filter after a corporate action
    (stock split, merger, dividend) to avoid contaminating the estimate.
    
    Args:
        delta: Process noise parameter (controls adaptation speed)
        phi: State transition coefficient (default 1.0 for random walk)
        R: Observation noise variance (default 1e-3)
    """
    
    def __init__(self, delta: float = 1e-4, phi: float = 1.0, R: float = 1e-3):
        self.delta = delta
        self.phi = phi
        self.R = R
        
        # State: [hedge_ratio]
        self.beta = 0.0
        # State covariance
        self.P = 1.0
        # Running residuals for spread analysis
        self.residuals = []
        self.hedge_ratios = []
    
    def update(self, x: float, y: float) -> tuple[float, float, float]:
        """Update the hedge ratio with a new observation.
        
        Args:
            x: Price of the independent asset (e.g., MSFT)
            y: Price of the dependent asset (e.g., AAPL)
        
        Returns:
            (predicted_spread, observed_spread, updated_beta)
        """
        # Prediction step
        beta_pred = self.phi * self.beta
        P_pred = self.phi ** 2 * self.P + self.delta ** 2
        
        # Innovation: observed y minus the predicted observation beta_pred * x.
        # This is also the spread predicted before today's update.
        z_pred = y - beta_pred * x
        
        # Innovation variance and Kalman gain
        S = x ** 2 * P_pred + self.R
        K = P_pred * x / S
        
        # Update step: move beta in proportion to the forecast error
        self.beta = beta_pred + K * z_pred
        self.P = (1 - K * x) * P_pred
        
        # Store the post-update spread for monitoring
        spread = y - self.beta * x
        self.residuals.append(spread)
        self.hedge_ratios.append(self.beta)
        
        return z_pred, spread, self.beta
    
    def get_zscore(self, lookback: int = 20) -> float | None:
        """Calculate the z-score of the current spread vs. a rolling mean.
        
        Args:
            lookback: Number of periods for rolling mean and std estimation
        
        Returns:
            Z-score of the current spread, or None if insufficient data
        """
        if len(self.residuals) < lookback:
            return None
        
        recent = np.array(self.residuals[-lookback:])
        current = self.residuals[-1]
        mean = np.mean(recent)
        std = np.std(recent)
        
        if std < 1e-10:
            return None
        
        return (current - mean) / std


def kalman_filter_pairs_trading(stock1: str, stock2: str, entry_threshold: float = 2.0,
                                 exit_threshold: float = 0.5, lookback: int = 20) -> dict:
    """Walk-forward backtest of a Kalman filter-based pairs trading strategy.
    
    This function simulates the strategy using historical data:
    - Go long the spread when z-score < -entry_threshold (spread too low → long stock2, short stock1)
    - Go short the spread when z-score > +entry_threshold (spread too high → short stock2, long stock1)
    - Exit when |z-score| < exit_threshold
    
    ⚠️ This backtest uses static entry/exit thresholds and does NOT account for:
    - Transaction costs (brokerage commissions + bid-ask spread)
    - Slippage and market impact
    - Overnight gap risk
    - Corporate actions (splits, mergers, dividends)
    A production backtest should incorporate a cost model and out-of-sample validation.
    
    Args:
        stock1: Independent asset symbol (e.g., "MSFT.US")
        stock2: Dependent asset symbol (e.g., "AAPL.US")
        entry_threshold: Z-score threshold to enter a position (default 2.0)
        exit_threshold: Z-score threshold to exit (default 0.5)
        lookback: Periods for z-score rolling window
    
    Returns:
        Dictionary with performance metrics and trade log
    """
    # Fetch historical data for both stocks
    df1 = fetch_historical_kline(stock1, interval="1d", limit=500)
    df2 = fetch_historical_kline(stock2, interval="1d", limit=500)
    
    # Align on common dates
    merged = pd.merge(df1[["timestamp", "close"]], df2[["timestamp", "close"]],
                      on="timestamp", suffixes=("_1", "_2")).dropna()
    merged.columns = ["timestamp", "price1", "price2"]
    
    if len(merged) < 100:
        raise ValueError(f"Insufficient data for {stock1}/{stock2} pair")
    
    # Initialize Kalman filter
    kf = KalmanFilterHedgeRatio(delta=1e-4)
    
    # Trading simulation
    position = 0          # +1 = long spread, -1 = short spread, 0 = flat
    entry_beta = 0.0      # hedge ratio frozen at entry
    entry_notional = 1.0  # gross dollars per unit of spread at entry
    entry_equity = 1.0    # equity when the current trade was opened
    prev_value = 0.0      # value of the held spread at the previous bar
    trades = []
    equity_curve = [1.0]
    
    for i in range(1, len(merged)):
        x = merged["price1"].iloc[i]
        y = merged["price2"].iloc[i]
        timestamp = merged["timestamp"].iloc[i]
        
        z_pred, spread, beta = kf.update(x, y)
        zscore = kf.get_zscore(lookback=lookback)
        
        # Mark the open position to market using the hedge ratio held since entry
        daily_return = 0.0
        if position != 0:
            value = y - entry_beta * x
            daily_return = position * (value - prev_value) / entry_notional
            prev_value = value
        equity_curve.append(equity_curve[-1] * (1 + daily_return))
        
        if zscore is None:
            continue
        
        # Entry logic (fills assumed at this bar's close)
        if position == 0:
            if zscore < -entry_threshold:
                position = 1
            elif zscore > entry_threshold:
                position = -1
            if position != 0:
                entry_beta = beta
                prev_value = y - beta * x
                entry_notional = abs(y) + abs(beta * x)
                entry_equity = equity_curve[-1]
                trades.append({"date": timestamp,
                               "action": "long_spread" if position == 1 else "short_spread",
                               "zscore": zscore, "beta": beta})
        
        # Exit logic
        elif ((position == 1 and zscore > -exit_threshold) or
              (position == -1 and zscore < exit_threshold)):
            pnl = equity_curve[-1] / entry_equity - 1
            trades.append({"date": timestamp,
                           "action": "exit_long_spread" if position == 1 else "exit_short_spread",
                           "zscore": zscore, "pnl": pnl})
            position = 0
    
    equity_series = pd.Series(equity_curve)
    returns = equity_series.pct_change().dropna()
    
    total_return = equity_curve[-1] - 1.0
    sharpe = returns.mean() / returns.std() * np.sqrt(252) if returns.std() > 0 else 0.0
    max_dd = (equity_series / equity_series.cummax() - 1).min()
    
    return {
        "pair": f"{stock1}/{stock2}",
        "total_return": total_return,
        "sharpe_ratio": sharpe,
        "max_drawdown": max_dd,
        "n_trades": len(trades),
        "equity_curve": equity_curve,
        "trades": pd.DataFrame(trades)
    }
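
The entry and exit rules described in the docstring can also be isolated as a pure function, which makes them easy to unit-test without any price data. This is a small sketch with a hypothetical name, `positions_from_zscores`; it mirrors the thresholds above and ignores bars where the z-score is unavailable.

```python
def positions_from_zscores(zscores, entry_threshold=2.0, exit_threshold=0.5):
    """Map a z-score sequence to positions: +1 long spread, -1 short, 0 flat."""
    position, out = 0, []
    for z in zscores:
        if position == 0:
            if z < -entry_threshold:
                position = 1        # spread unusually low: buy it
            elif z > entry_threshold:
                position = -1       # spread unusually high: sell it
        elif position == 1 and z > -exit_threshold:
            position = 0            # long spread has reverted far enough
        elif position == -1 and z < exit_threshold:
            position = 0            # short spread has reverted far enough
        out.append(position)
    return out


demo = positions_from_zscores([0.0, -2.5, -1.0, -0.3, 0.1, 2.2, 1.0, 0.4])
print(demo)  # [0, 1, 1, 0, 0, -1, -1, 0]
```

The gap between the entry and exit thresholds acts as hysteresis: a position opened at 2.0 is not closed by a small wobble back to 1.9.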

5. Full Pairs Screening Implementation

Putting it all together, here is the complete screening pipeline that fetches data, applies the correlation filter, runs cointegration tests, calculates half-life, and outputs a ranked list of tradeable pairs:

def run_pairs_screening_pipeline(
    universe: list[str],
    min_correlation: float = 0.80,
    max_half_life: int = 120,
    min_half_life: int = 5,
    top_n: int = 20
) -> pd.DataFrame:
    """Complete pairs trading screening pipeline.
    
    Workflow:
    1. Fetch 500-day OHLCV data for all stocks in the universe
    2. Compute 252-day rolling correlation matrix
    3. Apply correlation filter (keep pairs above threshold)
    4. Run Engle-Granger cointegration tests on survivors
    5. Calculate Ornstein-Uhlenbeck half-life for cointegrated pairs
    6. Filter by half-life range (5–120 days is typically most tradeable)
    7. Rank by p-value (strongest cointegration first)
    
    ⚠️ For large universes (1,000+ stocks), this function can take 30–60 minutes
    due to the O(N²) pair generation. Consider parallelizing with multiprocessing.
    
    Args:
        universe: List of TickDB symbols, e.g. ["AAPL.US", "MSFT.US", "GOOGL.US"]
        min_correlation: Minimum 252-day correlation to proceed to cointegration test
        max_half_life: Maximum half-life in days (pairs reverting slower are discarded)
        min_half_life: Minimum half-life in days (pairs reverting too fast may be noisy)
        top_n: Return the top N pairs by p-value
    
    Returns:
        DataFrame of ranked pairs with correlation, p-value, hedge ratio, half-life
    """
    print("=" * 60)
    print("PAIRS TRADING SCREENING PIPELINE")
    print("=" * 60)
    
    # Step 1: Screen pairs
    pairs_df = screen_pairs(universe, min_correlation=min_correlation)
    
    if pairs_df.empty:
        print("No cointegrated pairs found.")
        return pd.DataFrame()
    
    # Step 2: Filter by half-life
    pairs_df = pairs_df[
        (pairs_df["half_life_days"] >= min_half_life) &
        (pairs_df["half_life_days"] <= max_half_life)
    ]
    
    if pairs_df.empty:
        print(f"No pairs found with half-life between {min_half_life} and {max_half_life} days.")
        return pd.DataFrame()
    
    # Step 3: Rank by p-value (strongest cointegration first)
    pairs_df = pairs_df.sort_values("p_value").head(top_n)
    pairs_df = pairs_df.reset_index(drop=True)
    
    # Display summary table
    print("\nTOP PAIRS (ranked by cointegration p-value):")
    print("-" * 80)
    print(f"{'Rank':<5} {'Pair':<20} {'Corr':<8} {'p-value':<10} {'Hedge β':<10} {'Half-life':<10}")
    print("-" * 80)
    for i, row in pairs_df.iterrows():
        pair = f"{row['stock1']}/{row['stock2']}"  # pad the whole pair so columns line up
        print(f"{i+1:<5} {pair:<20} "
              f"{row['correlation']:<8.3f} {row['p_value']:<10.4f} "
              f"{row['hedge_ratio']:<10.4f} {row['half_life_days']:<10.1f}")
    
    return pairs_df


# Example: Screen a basket of tech and financial stocks
if __name__ == "__main__":
    universe = [
        "AAPL.US", "MSFT.US", "GOOGL.US", "AMZN.US", "META.US",
        "NVDA.US", "JPM.US", "BAC.US", "GS.US", "MS.US",
        "XOM.US", "CVX.US", "JNJ.US", "PFE.US", "UNH.US"
    ]
    
    top_pairs = run_pairs_screening_pipeline(
        universe,
        min_correlation=0.80,
        max_half_life=90,
        min_half_life=5,
        top_n=10
    )

6. Common Pitfalls and Production Warnings

Overfitting the cointegration test. Even if none of 500,000 tested pairs were truly cointegrated, a 5% significance threshold would still flag roughly 25,000 of them (500,000 × 0.05) by chance alone. The correct approach is to apply in-sample/out-of-sample validation: test cointegration on the first 300 days, then verify that the spread remains mean-reverting over the next 200 days. If it does not, discard the pair.
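
A minimal sketch of that split follows, under stated assumptions. The hedge ratio is fitted on the first 300 bars only, frozen, and the resulting spread is checked on the remaining bars with a hand-rolled Dickey-Fuller t-statistic (no lag terms). The function names are hypothetical, and the cutoff of -2.86 is the approximate large-sample 5% Dickey-Fuller critical value for a regression with a constant, used here as a rough screen and not as a substitute for a full test.

```python
import numpy as np


def df_tstat(z):
    """t-statistic of theta in dz_t = c + theta * z_{t-1} + e_t."""
    z = np.asarray(z, dtype=float)
    lagged, dz = z[:-1], np.diff(z)
    X = np.column_stack([np.ones_like(lagged), lagged])
    coef, *_ = np.linalg.lstsq(X, dz, rcond=None)
    resid = dz - X @ coef
    sigma2 = resid @ resid / (len(dz) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1] / np.sqrt(cov[1, 1])


def holds_out_of_sample(y, x, split=300, crit=-2.86):
    """Fit the hedge ratio in-sample, then test the frozen spread out-of-sample."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    beta, _alpha = np.polyfit(x[:split], y[:split], 1)   # in-sample OLS only
    oos_spread = y[split:] - beta * x[split:]
    t = df_tstat(oos_spread)
    return bool(t < crit), float(t)


# Synthetic pair that is cointegrated by construction.
rng = np.random.default_rng(1)
x = 100 + np.cumsum(rng.normal(0, 1, 500))
y = 5 + 1.2 * x + rng.normal(0, 1, 500)

ok, t_stat = holds_out_of_sample(y, x)
print(f"passes out-of-sample check: {ok} (t = {t_stat:.1f})")
```

Because the hedge ratio is not re-estimated on the held-out bars, a pair that only looked cointegrated through in-sample fitting has no second chance to fit the noise.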

Ignoring regime changes. A pair that is cointegrated in 2020–2022 may not be cointegrated in 2023–2024 if the fundamental relationship breaks. Monitor the rolling p-value of the cointegration test. If the 60-day rolling p-value consistently exceeds 0.10, suspend trading that pair.

Static transaction cost assumptions. The textbook strategy assumes negligible transaction costs. In production, costs for US equities are typically 0.05–0.15% per side (bid-ask + commission + short borrow), and a pairs trade pays them on two legs at both entry and exit. A pair with an expected reversion of 1.5% and a half-life of 20 days looks profitable on paper but may be marginal after costs. Always build in a cost buffer.
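
To make the buffer concrete, the sketch below nets the 1.5% expected reversion against the quoted per-side cost range. It assumes, for illustration only, that the cost is charged on each of the two legs at entry and again at exit (four executions) and that the reversion and the costs are measured on the same notional base; `net_edge` is a hypothetical helper.

```python
def net_edge(expected_reversion, cost_per_side, legs=2, sides_per_leg=2):
    """Expected reversion minus total execution cost for one pairs trade."""
    total_cost = cost_per_side * legs * sides_per_leg
    return expected_reversion - total_cost, total_cost


results = {}
for cost in (0.0005, 0.0010, 0.0015):
    edge, total = net_edge(0.015, cost)
    results[cost] = edge
    print(f"per-side cost {cost:.2%}: total cost {total:.2%}, net edge {edge:.2%}")
```

At the top of the range, costs consume 40% of the gross edge before any slippage beyond the quoted figures or any trade that fails to revert.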

Corporate actions. Stock splits, mergers, spin-offs, and large dividends change the price series in ways that break the cointegration relationship. Recompute the hedge ratio after any corporate action affecting either leg of the pair.

Spread normalization. The raw spread $Z_t = Y_t - \beta X_t$ has units of dollars. A spread value of 5.0 means different things for a pair with prices around $50 versus a pair with prices around $500. Always use z-score normalization for entry/exit thresholds to make them comparable across pairs and markets.
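
A toy comparison shows why. The two spread histories below are made up for the example, and the helper `zscore` measures the current value against a history that excludes it (a slight simplification of a rolling window that includes the latest bar).

```python
import numpy as np


def zscore(current, history):
    """Standardize `current` against the mean and std of `history`."""
    history = np.asarray(history, dtype=float)
    return (current - history.mean()) / history.std()


low_priced_history = [0.5, -0.5] * 10     # spread of a pair trading near $50
high_priced_history = [5.0, -5.0] * 10    # spread of a pair trading near $500

z_low = zscore(5.0, low_priced_history)
z_high = zscore(5.0, high_priced_history)
print(f"same $5.00 spread: z = {z_low:.1f} vs z = {z_high:.1f}")
```

The same five-dollar spread is a ten-sigma event for the first pair and an ordinary one-sigma move for the second, so a single dollar threshold cannot serve both.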

7. Reference Ticker Universe for US Equities

The following table provides a starting universe organized by sector. These instruments are selected for high liquidity, active options markets, and sufficient price history for statistical analysis.

Sector	Tickers	Rationale
Large-cap Technology	AAPL.US, MSFT.US, GOOGL.US, AMZN.US, META.US, NVDA.US	High correlation within sector; strong cointegration candidates
Investment Banking	GS.US, MS.US, JPM.US, BAC.US, C.US	Shared revenue sensitivity to rate environment
Integrated Energy	XOM.US, CVX.US, COP.US	Commodity-price co-movement; sector-wide factor exposure
Major Airlines	DAL.US, UAL.US, AAL.US, LUV.US	High operational correlation; capacity decisions ripple across the sector
US Index ETFs	SPY.US, QQQ.US, IWM.US	Liquid proxies for sector pairs; useful as benchmark or hedge instruments

For institutional-grade backtesting, the TickDB /v1/market/kline endpoint provides 10+ years of cleaned, time-aligned OHLCV data across all US equity symbols. Set the limit parameter to 500 for two-year windows or 1,000 for four-year windows to ensure your cointegration tests have sufficient statistical power.

8. Closing

The hardest part of pairs trading is not the math. It is the discipline to let the process run: to generate a ranked universe of pairs, pick the top 10, execute with consistent position sizing, and measure performance over quarters—not days.

Cointegration tells you the pair has a long-run equilibrium. Kalman filtering tells you the current hedge ratio without needing to re-estimate over the full window every time. Half-life tells you whether the pair is fast enough to trade given your cost structure. Together, they form a defensible screening pipeline that can survive contact with a live market.

The signal is in the spread. The discipline is in the system.

Next Steps

If you want to run this screening pipeline yourself: Sign up at tickdb.ai for a free API key (no credit card required), set the TICKDB_API_KEY environment variable, and copy the code from this article. The /v1/market/kline endpoint provides 500+ days of daily OHLCV data for US equities across 15,000+ symbols.

If you're building a real-time monitoring system: The TickDB WebSocket API supports live price streams for the instruments in your pairs universe. Update the Kalman filter hedge ratio in real time and trigger alerts when the z-score crosses your entry threshold.

If you need a complete historical backtest: The /v1/market/kline endpoint supports intervals from 1m to 1d. For intraday pairs trading, fetch 1h or 5m bars, apply the same cointegration pipeline, and measure whether the strategy's edge survives at higher frequencies.

If you use AI coding assistants: Search for and install the tickdb-market-data SKILL in your AI tool's marketplace to get TickDB API access integrated directly into your development workflow.

This article does not constitute investment advice. Pairs trading involves substantial risk including but not limited to market risk, liquidity risk, and model risk. Backtested results do not guarantee future performance. Past performance does not guarantee future results.