"Two stocks move together because they share a common hidden cause. Find that cause, and you find the trade."
By August 2007, the quant fund Long-Term Capital Management had been gone for nearly a decade, but the lesson of its collapse endured. Two stocks in the same sector should not be treated as independent instruments. When Goldman Sachs and Morgan Stanley both trade on Wall Street, they inhale and exhale together. The arbitrageur's job is not to predict direction. It is to measure the distance between the two, wait for that distance to exceed a statistical threshold, and bet on convergence.
Pairs trading remains one of the few strategies that is genuinely market-neutral in theory, and one of the most treacherous to execute in practice. The gap between textbook and production is wide. This article builds the screening pipeline from the ground up: how to start with thousands of instruments, apply econometric filters, estimate mean-reversion half-life, and implement a dynamic Kalman filter to track the hedge ratio in real time. All code is Python with explicit error handling, retry logic, and environment-variable-based authentication.
1. Why Cointegration Beats Correlation
Every newcomer to pairs trading starts with correlation. Correlation measures whether two series move in the same direction. Cointegration measures whether two series are pulled together by a long-run force despite short-run deviations.
The distinction matters enormously. Consider SPY and QQQ. Their 252-day correlation exceeds 0.97. But both trend upward over multi-year horizons, and a shared upward drift is enough to produce a high correlation. It does not imply that the spread between them is mean-reverting. Correlation describes short-run co-movement. Cointegration is a long-run equilibrium condition.
Formally, two price series $X_t$ and $Y_t$ are cointegrated if there exists a coefficient $\beta$ such that the residual $Z_t = Y_t - \beta X_t$ is stationary, meaning $Z_t$ has a constant mean, constant variance, and autocovariance that depends only on lag, not on absolute time.
$$Z_t = Y_t - \beta X_t \sim I(0)$$
Stationarity is the whole game. If $Z_t$ is stationary, deviations from its mean are temporary: the spread tends to revert rather than wander off. That mean reversion is what makes the pair tradeable.
Correlation can be high while cointegration fails. Two random walks can drift apart forever. Correlation tells you about the joint distribution of returns. Cointegration tells you about the equilibrium relationship between levels. For pairs trading, you need the latter.
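A small simulation makes the difference concrete. The sketch below uses synthetic data only; the seed, series length, and noise levels are arbitrary assumptions, not market data. Pair A shares a common random-walk level, while pair B has highly correlated daily changes but nothing tying the levels together.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Pair A: both series = one shared random walk + independent stationary noise.
# Their difference is pure noise, so the pair is cointegrated with beta = 1.
common_level = np.cumsum(rng.normal(0.0, 1.0, n))
a_x = 100 + common_level + rng.normal(0.0, 1.0, n)
a_y = 100 + common_level + rng.normal(0.0, 1.0, n)

# Pair B: two random walks whose daily shocks are 0.9-correlated.
# Nothing anchors the levels, so the gap between them is itself a random walk.
shocks = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=n)
b_x = 100 + np.cumsum(shocks[:, 0])
b_y = 100 + np.cumsum(shocks[:, 1])


def change_correlation(p, q):
    """Correlation of day-over-day changes."""
    return np.corrcoef(np.diff(p), np.diff(q))[0, 1]


corr_a = change_correlation(a_x, a_y)
corr_b = change_correlation(b_x, b_y)
spread_a_std = np.std(a_y - a_x)
spread_b_std = np.std(b_y - b_x)

print(f"Pair A: change correlation {corr_a:.2f}, spread std {spread_a_std:.2f}")
print(f"Pair B: change correlation {corr_b:.2f}, spread std {spread_b_std:.2f}")
```

Under these assumptions pair B should show the higher correlation of changes and also the wider spread, because its spread accumulates shocks instead of reverting. Pair A is the one with a tradeable, bounded spread.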
2. The Cointegration Testing Pipeline
2.1 Step 1: Pre-Screening with Correlation
Testing every possible pair among 3,000 US equities for cointegration is computationally expensive. The standard approach is a two-stage filter:
- Correlation filter: Discard pairs with rolling correlation below 0.80 over a 252-day window. This eliminates obvious non-pairs.
- Cointegration test: Apply the Engle-Granger or Johansen test to the surviving pairs.
The correlation threshold of 0.80 is an operational choice. Lower thresholds pass too many pairs into cointegration testing, dramatically increasing computation. Higher thresholds risk missing genuine cointegrated pairs in sectors with low inter-stock correlation.
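To see why pre-screening matters, it helps to count the candidates. The helper below is illustrative (the name `pair_count` is not part of the pipeline) and simply evaluates n(n-1)/2 for a few universe sizes.

```python
from math import comb


def pair_count(n_symbols: int) -> int:
    """Number of unordered pairs among n_symbols instruments: n * (n - 1) / 2."""
    return comb(n_symbols, 2)


for n in (100, 1_000, 3_000):
    print(f"{n:>5} symbols -> {pair_count(n):>9,} candidate pairs")
```

The count grows quadratically, so tripling the universe from 1,000 to 3,000 symbols multiplies the number of cointegration tests by roughly nine.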
import os
import time
import random
import numpy as np
import pandas as pd
import requests  # used by fetch_historical_kline below
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller, coint
# Environment variable authentication pattern
API_KEY = os.environ.get("TICKDB_API_KEY")
if not API_KEY:
    raise ValueError("TICKDB_API_KEY environment variable is not set")
BASE_URL = "https://api.tickdb.ai/v1"
def fetch_historical_kline(symbol: str, interval: str = "1d", limit: int = 500) -> pd.DataFrame:
    """Fetch historical OHLCV data for a given symbol.
    Uses the TickDB /v1/market/kline endpoint. For pairs trading backtests,
    use at least 500 daily bars to ensure statistical significance of
    cointegration tests. 2 years (≈504 trading days) is preferred.
    Args:
        symbol: Exchange symbol, e.g. "AAPL.US"
        interval: Candle interval, defaults to "1d" for daily analysis
        limit: Number of bars to fetch, defaults to 500 (≈2 years of daily data)
    Returns:
        DataFrame with columns: timestamp, open, high, low, close, volume
    """
    url = f"{BASE_URL}/market/kline"
    headers = {"X-API-Key": API_KEY}
    params = {"symbol": symbol, "interval": interval, "limit": limit}
    max_retries = 3
    base_delay = 1.0
    for attempt in range(max_retries):
        try:
            response = requests.get(
                url,
                headers=headers,
                params=params,
                timeout=(3.05, 10)
            )
            data = response.json()
            if data.get("code") == 0:
                df = pd.DataFrame(data["data"])
                df["timestamp"] = pd.to_datetime(df["ts"], unit="ms")
                df = df.sort_values("timestamp").reset_index(drop=True)
                return df[["timestamp", "open", "high", "low", "close", "volume"]]
            elif data.get("code") == 3001:
                # Rate limit handling
                retry_after = int(response.headers.get("Retry-After", 5))
                print(f"Rate limited. Waiting {retry_after} seconds.")
                time.sleep(retry_after)
                continue
            elif data.get("code") == 2002:
                # Symbol not found
                print(f"Symbol {symbol} not found. Verify via /v1/symbols/available")
                return pd.DataFrame()
            else:
                raise RuntimeError(f"API error {data.get('code')}: {data.get('message')}")
        except requests.exceptions.Timeout:
            # Exponential backoff with jitter, capped at 30 seconds
            delay = min(base_delay * (2 ** attempt), 30)
            jitter = random.uniform(0, delay * 0.1)
            print(f"Timeout on attempt {attempt + 1}. Retrying in {delay + jitter:.1f}s")
            time.sleep(delay + jitter)
            continue
        except requests.exceptions.RequestException as e:
            raise RuntimeError(f"Request failed: {e}")
    raise RuntimeError(f"Failed after {max_retries} attempts")
2.2 Step 2: The Engle-Granger Cointegration Test
The Engle-Granger two-step method is the workhorse of cointegration testing. It proceeds as follows:
- Regress $Y_t$ on $X_t$ via OLS to estimate the hedge ratio $\beta$.
- Compute the residuals $Z_t = Y_t - \beta X_t$.
- Apply the Augmented Dickey-Fuller (ADF) test to $Z_t$. If the null hypothesis of a unit root is rejected, the residuals are stationary, and the pair is cointegrated.
The test statistic follows a specific distribution (not the standard normal), which is why we use statsmodels.tsa.stattools.coint, which handles this correctly.
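To make the mechanics of the three steps concrete, here is a deliberately simplified NumPy-only sketch. It is not the statsmodels implementation: it omits the lagged difference terms of the augmented test and compares against a hard-coded approximate 5% critical value of about -3.34 (the large-sample MacKinnon value for two series with a constant). The function name, the simulated data, and the seed are all assumptions for illustration.

```python
import numpy as np


def engle_granger_sketch(y, x):
    """Return (beta, t_stat) from a bare-bones two-step Engle-Granger procedure."""
    # Step 1: OLS of y on x with an intercept (polyfit returns slope, then intercept)
    beta, alpha = np.polyfit(x, y, 1)
    # Step 2: residuals of the cointegrating regression
    resid = y - (alpha + beta * x)
    # Step 3: Dickey-Fuller regression without lagged differences:
    #         delta_z_t = gamma * z_{t-1} + error_t
    z_lag = resid[:-1]
    delta_z = np.diff(resid)
    gamma = (z_lag @ delta_z) / (z_lag @ z_lag)
    errors = delta_z - gamma * z_lag
    std_err = np.sqrt((errors @ errors) / (len(delta_z) - 1) / (z_lag @ z_lag))
    return beta, gamma / std_err


# Approximate 5% critical value for residual-based tests (two series, constant).
# It is more negative than the ordinary Dickey-Fuller value because beta is estimated.
APPROX_CRIT_5PCT = -3.34

# Synthetic cointegrated pair: y = 5 + 2x + stationary noise
rng = np.random.default_rng(1)
x = 50 + np.cumsum(rng.normal(0.0, 1.0, 1000))
y = 5 + 2.0 * x + rng.normal(0.0, 1.0, 1000)

beta_hat, t_stat = engle_granger_sketch(y, x)
print(f"beta = {beta_hat:.3f}, t-stat = {t_stat:.1f}, reject unit root: {t_stat < APPROX_CRIT_5PCT}")
```

For real screening, a library routine that handles lag selection and finite-sample critical values is the safer choice; the sketch only shows where the hedge ratio and the test statistic come from.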
def test_cointegration(series1: pd.Series, series2: pd.Series) -> dict:
    """Engle-Granger cointegration test.
    Returns a dict with the test statistic, p-value, and critical values.
    A p-value below 0.05 indicates cointegration at the 5% significance level.
    Args:
        series1: First price series (e.g., AAPL close prices)
        series2: Second price series (e.g., MSFT close prices)
    Returns:
        dict with keys: t_stat, p_value, crit_values, is_cointegrated,
        hedge_ratio, residuals, n_observations
    """
    # Align series: drop rows with NaN in either series
    aligned = pd.DataFrame({"s1": series1, "s2": series2}).dropna()
    if len(aligned) < 252:
        return {"is_cointegrated": False, "reason": "Insufficient data (< 252 points)"}
    # Step 1: OLS regression of series2 on series1 to find the hedge ratio
    X = sm.add_constant(aligned["s1"])
    model = sm.OLS(aligned["s2"], X).fit()
    hedge_ratio = model.params["s1"]
    # Step 2: Residuals = Y - beta * X
    residuals = aligned["s2"] - hedge_ratio * aligned["s1"]
    # Step 3: Engle-Granger test via coint(). It reruns the same regression and
    # uses critical values for estimated residuals; calling adfuller() on the
    # residuals directly would apply the wrong distribution and reject too often.
    t_stat, p_value, crit = coint(aligned["s2"], aligned["s1"], trend="c")
    crit_values = dict(zip(["1%", "5%", "10%"], crit))
    is_cointegrated = p_value < 0.05
    return {
        "t_stat": t_stat,
        "p_value": p_value,
        "crit_values": crit_values,
        "is_cointegrated": is_cointegrated,
        "hedge_ratio": hedge_ratio,
        "residuals": residuals,
        "n_observations": len(aligned)
    }
def screen_pairs(stock_list: list[str], min_correlation: float = 0.80) -> pd.DataFrame:
    """Screen a list of stocks for cointegrated pairs.
    Stage 1: Correlation filter (closing prices, most recent 252 trading days).
    Stage 2: Engle-Granger cointegration test on surviving pairs.
    ⚠️ This function is computationally intensive for large stock lists.
    For 1,000 stocks, it generates ~500,000 pairs. Consider using
    multiprocessing.Pool for parallel computation.
    Args:
        stock_list: List of TickDB symbols, e.g. ["AAPL.US", "MSFT.US"]
        min_correlation: Minimum 252-day correlation to proceed to cointegration test
    Returns:
        DataFrame of cointegrated pairs with test statistics
    """
    print(f"Fetching price data for {len(stock_list)} stocks...")
    # Fetch all price data, indexed by timestamp so the series align on dates
    price_data = {}
    for symbol in stock_list:
        df = fetch_historical_kline(symbol, interval="1d", limit=500)
        if not df.empty:
            price_data[symbol] = df.set_index("timestamp")["close"]
            print(f"  Fetched {symbol}: {len(df)} bars")
        time.sleep(0.1)  # Respect rate limits
    # Align all series to a common date index
    price_df = pd.DataFrame(price_data).sort_index()
    price_df = price_df.ffill().dropna()
    print(f"\nAligned price matrix: {price_df.shape[0]} trading days x {price_df.shape[1]} stocks")
    # Stage 1: Correlation filter over the most recent 252 trading days
    print("Stage 1: Computing correlation matrix...")
    corr_matrix = price_df.tail(252).corr()
    print("Stage 2: Testing cointegration on candidate pairs...")
    results = []
    stock_symbols = list(price_df.columns)
    # Generate pairs (skip symmetric duplicates)
    pair_count = 0
    tested_count = 0
    for i in range(len(stock_symbols)):
        for j in range(i + 1, len(stock_symbols)):
            sym1, sym2 = stock_symbols[i], stock_symbols[j]
            pair_count += 1
            # Correlation filter
            corr = corr_matrix.loc[sym1, sym2]
            if corr < min_correlation:
                continue
            # Cointegration test
            tested_count += 1
            result = test_cointegration(price_df[sym1], price_df[sym2])
            if result["is_cointegrated"]:
                # Rebuild the spread here so the half-life is always computed,
                # regardless of which keys test_cointegration returns
                spread = price_df[sym2] - result["hedge_ratio"] * price_df[sym1]
                half_life = calculate_half_life(spread)
                results.append({
                    "stock1": sym1,
                    "stock2": sym2,
                    "correlation": corr,
                    "hedge_ratio": result["hedge_ratio"],
                    "p_value": result["p_value"],
                    "t_stat": result["t_stat"],
                    "half_life_days": half_life
                })
                print(f"  ✓ Cointegrated pair found: {sym1}/{sym2} (p={result['p_value']:.4f})")
    print(f"\nPair screening complete: {pair_count} total pairs, {tested_count} tested, {len(results)} cointegrated")
    return pd.DataFrame(results)
3. Mean-Reversion Half-Life
Once you confirm cointegration, the next question is: how fast does the spread mean-revert? This determines your holding period, your position sizing, and whether the strategy is economically viable after transaction costs.
The Ornstein-Uhlenbeck process models the spread as a mean-reverting process:
$$dZ_t = \lambda (\mu - Z_t)\,dt + \sigma\,dW_t$$
where $\mu$ is the long-run mean of the spread, $\sigma$ scales the random shocks, and $\lambda > 0$ controls the speed of mean reversion. The half-life of this process is:
$$\text{half-life} = \frac{\ln 2}{|\lambda|}$$
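We estimate $\lambda$ from a Dickey-Fuller-style regression on daily data. Writing the spread as an AR(1) process $Z_t = c + \phi Z_{t-1} + \varepsilon_t$ and subtracting $Z_{t-1}$ from both sides gives $\Delta Z_t = c + \theta Z_{t-1} + \varepsilon_t$ with $\theta = \phi - 1$. With a time step of one day, the discretized OU process implies $\phi = e^{-\lambda}$, so $\theta \approx -\lambda$ when reversion is slow, and a mean-reverting spread has $\theta < 0$. Substituting into the half-life formula, we have:
$$\text{half-life} \approx -\frac{\ln 2}{\theta}$$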
def calculate_half_life(spread: pd.Series) -> float:
    """Calculate the Ornstein-Uhlenbeck half-life of a mean-reverting spread.
    The half-life tells us approximately how many periods it takes for the spread
    to revert halfway back to its mean. Pairs with half-lives between 5 and 60 days
    are typically the most tradeable: short enough to cycle capital efficiently,
    long enough to absorb transaction costs.
    Half-lives under 5 days may incur excessive brokerage commissions.
    Half-lives over 120 days may not generate sufficient annual returns.
    Args:
        spread: Stationary residuals from the cointegration regression
    Returns:
        Half-life in periods (days for daily data)
    """
    spread_lag = spread.shift(1).dropna()
    delta_spread = spread.diff().dropna()
    # Align the series
    common_idx = spread_lag.index.intersection(delta_spread.index)
    spread_lag = spread_lag.loc[common_idx]
    delta_spread = delta_spread.loc[common_idx]
    # Regress ΔZ_t on Z_{t-1}
    X = sm.add_constant(spread_lag)
    model = sm.OLS(delta_spread, X).fit()
    # Slope on the lagged spread (position 0 is the constant).
    # This is -(1 - phi) in OU formulation.
    theta = model.params.iloc[1]
    if theta >= 0:
        return float("inf")  # Not mean-reverting
    half_life = -np.log(2) / theta
    return half_life
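The regression-based estimate above uses $-\ln 2 / \theta$ with $\theta = \phi - 1$. For an AR(1) spread the exact discrete-time half-life is $-\ln 2 / \ln \phi$, and the two agree closely when reversion is slow. The short comparison below uses illustrative values of $\phi$ and hypothetical helper names.

```python
import math


def half_life_exact(phi: float) -> float:
    """Half-life in periods for an AR(1) spread with coefficient 0 < phi < 1."""
    return -math.log(2) / math.log(phi)


def half_life_approx(phi: float) -> float:
    """First-order approximation -ln(2) / theta, with theta = phi - 1."""
    return -math.log(2) / (phi - 1)


for phi in (0.50, 0.90, 0.98):
    print(f"phi={phi:.2f}: exact {half_life_exact(phi):6.2f}, "
          f"approx {half_life_approx(phi):6.2f} periods")
```

The approximation overstates the half-life, noticeably so for very fast reversion, so treat estimates of only a few days as rough.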
Practical half-life interpretation:
| Half-life | Interpretation | Trade suitability |
|---|---|---|
| < 5 days | Very fast reversion | High transaction costs may eliminate edge |
| 5–20 days | Fast reversion | Good for high-frequency capital cycling |
| 20–60 days | Moderate | Standard pairs trading territory |
| 60–120 days | Slow | Viable for larger portfolios with lower turnover |
| > 120 days | Very slow | Unlikely to be economically viable after costs |
A pair with a half-life of 15 days and a standard deviation of the spread of 2.5% generates roughly one round-trip trade per month. If your round-trip transaction cost is 0.10% (bid-ask + slippage), you need the expected reversion magnitude to exceed your cost threshold consistently.
4. Kalman Filter for Dynamic Hedge Ratio
The static hedge ratio from OLS assumes $\beta$ is constant over time. It is not. In practice, the fundamental relationship between two stocks drifts. A static hedge ratio computed over two years of data will be wrong six months from now if the two companies' business dynamics have diverged.
The Kalman filter solves this by updating the hedge ratio recursively as new data arrives. It treats $\beta$ as a hidden state that evolves over time according to a random walk, and updates it based on each new observation of the spread.
State-space model:
- State equation: $\beta_t = \beta_{t-1} + w_t$, where $w_t \sim N(0, Q)$
- Observation equation: $y_t = \beta_t x_t + v_t$, where $v_t \sim N(0, R)$
Here, $y_t$ is the price of the dependent stock, $x_t$ is the price of the independent stock, and $\beta_t$ is the time-varying hedge ratio. $Q$ (process noise) and $R$ (observation noise) are hyperparameters that control how quickly the hedge ratio adapts.
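Before the full class, a stripped-down sketch of the recursion on synthetic data may help. Everything here is an assumption made for illustration: the function name, the values of `q` and `r`, and a true hedge ratio that drifts linearly from 1.0 to 1.5.

```python
import numpy as np


def track_hedge_ratio(x, y, q=1e-5, r=0.25, beta0=0.0, p0=1.0):
    """Scalar Kalman filter for y_t = beta_t * x_t + v_t with a random-walk beta."""
    beta, p = beta0, p0
    estimates = np.empty(len(x))
    for t in range(len(x)):
        # Predict: a random-walk state keeps its mean; its variance grows by Q
        p_pred = p + q
        # Innovation: observed y minus the value predicted from the prior beta
        innovation = y[t] - beta * x[t]
        s = x[t] ** 2 * p_pred + r          # innovation variance
        gain = p_pred * x[t] / s            # Kalman gain
        # Update: move beta in proportion to the prediction error
        beta = beta + gain * innovation
        p = (1.0 - gain * x[t]) * p_pred
        estimates[t] = beta
    return estimates


rng = np.random.default_rng(2)
n = 500
x = 100 + np.cumsum(rng.normal(0.0, 1.0, n))
true_beta = np.linspace(1.0, 1.5, n)            # slowly drifting relationship
y = true_beta * x + rng.normal(0.0, 0.5, n)

kalman_beta = track_hedge_ratio(x, y)
static_beta = (x @ y) / (x @ x)                  # one fixed slope for the whole sample

print(f"true final beta:   {true_beta[-1]:.3f}")
print(f"Kalman final beta: {kalman_beta[-1]:.3f}")
print(f"static OLS beta:   {static_beta:.3f}")
```

The static slope averages over the whole window, so it should sit well below the final true value, while the filtered estimate follows the drift. How closely it follows depends on the ratio of `q` to `r`.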
import requests
class KalmanFilterHedgeRatio:
    """Dynamic hedge ratio estimation using a Kalman filter.
    This class implements a 1D Kalman filter to track the time-varying
    hedge ratio between two assets. Unlike OLS, which assumes a constant
    beta, the Kalman filter allows beta to drift smoothly over time.
    Key parameters:
        delta: Standard deviation of the process noise, so Q = delta^2
            Smaller delta → slower beta adaptation
            Larger delta → faster beta adaptation
        phi: State transition coefficient (default 1.0 = random walk)
            phi < 1 adds mean-reversion to beta (more stable estimates)
    ⚠️ For live trading, re-initialize the filter after a corporate action
    (stock split, merger, dividend) to avoid contaminating the estimate.
    Args:
        delta: Process noise parameter (controls adaptation speed)
        phi: State transition coefficient (default 1.0 for random walk)
        R: Observation noise variance (default 1e-3)
    """
    def __init__(self, delta: float = 1e-4, phi: float = 1.0, R: float = 1e-3):
        self.delta = delta
        self.phi = phi
        self.R = R
        # State: [hedge_ratio]
        self.beta = 0.0
        # State covariance
        self.P = 1.0
        # Running residuals for spread analysis
        self.residuals = []
        self.hedge_ratios = []
    def update(self, x: float, y: float) -> tuple[float, float, float]:
        """Update the hedge ratio with a new observation.
        Args:
            x: Price of the independent asset (e.g., MSFT)
            y: Price of the dependent asset (e.g., AAPL)
        Returns:
            (spread under the predicted beta, spread under the updated beta, updated_beta)
        """
        # Prediction step
        beta_pred = self.phi * self.beta
        P_pred = self.phi ** 2 * self.P + self.delta ** 2
        # Innovation: observed y minus the prediction beta_pred * x
        z_pred = y - beta_pred * x
        # Kalman gain
        S = x ** 2 * P_pred + self.R
        K = P_pred * x / S
        # Update step: move beta in proportion to the innovation.
        # (Differencing two spreads built from the same beta gives zero when
        # phi = 1, which would leave beta frozen at its initial value.)
        self.beta = beta_pred + K * z_pred
        self.P = (1 - K * x) * P_pred
        # Store for spread monitoring
        spread = y - self.beta * x
        self.residuals.append(spread)
        self.hedge_ratios.append(self.beta)
        return z_pred, spread, self.beta
    def get_zscore(self, lookback: int = 20) -> float | None:
        """Calculate the z-score of the current spread vs. a rolling mean.
        Args:
            lookback: Number of periods for rolling mean and std estimation
        Returns:
            Z-score of the current spread, or None if insufficient data
        """
        if len(self.residuals) < lookback:
            return None
        recent = np.array(self.residuals[-lookback:])
        current = self.residuals[-1]
        mean = np.mean(recent)
        std = np.std(recent)
        if std < 1e-10:
            return None
        return (current - mean) / std
def kalman_filter_pairs_trading(stock1: str, stock2: str, entry_threshold: float = 2.0,
                                exit_threshold: float = 0.5, lookback: int = 20) -> dict:
    """Walk-forward backtest of a Kalman filter-based pairs trading strategy.
    This function simulates the strategy using historical data:
    - Go long the spread when z-score < -entry_threshold (spread too low → long stock2, short stock1)
    - Go short the spread when z-score > +entry_threshold (spread too high → short stock2, long stock1)
    - Exit when |z-score| < exit_threshold
    ⚠️ This backtest uses static entry/exit thresholds and does NOT account for:
    - Transaction costs (brokerage commissions + bid-ask spread)
    - Slippage and market impact
    - Overnight gap risk
    - Corporate actions (splits, mergers, dividends)
    A production backtest should incorporate a cost model and out-of-sample validation.
    Args:
        stock1: Independent asset symbol (e.g., "MSFT.US")
        stock2: Dependent asset symbol (e.g., "AAPL.US")
        entry_threshold: Z-score threshold to enter a position (default 2.0)
        exit_threshold: Z-score threshold to exit (default 0.5)
        lookback: Periods for z-score rolling window
    Returns:
        Dictionary with performance metrics and trade log
    """
    # Fetch historical data for both stocks
    df1 = fetch_historical_kline(stock1, interval="1d", limit=500)
    df2 = fetch_historical_kline(stock2, interval="1d", limit=500)
    if df1.empty or df2.empty:
        raise ValueError(f"No data returned for {stock1}/{stock2} pair")
    # Align on common dates
    merged = pd.merge(df1[["timestamp", "close"]], df2[["timestamp", "close"]],
                      on="timestamp", suffixes=("_1", "_2")).dropna()
    merged.columns = ["timestamp", "price1", "price2"]
    if len(merged) < 100:
        raise ValueError(f"Insufficient data for {stock1}/{stock2} pair")
    # Initialize Kalman filter
    kf = KalmanFilterHedgeRatio(delta=1e-4)
    # Trading simulation
    position = 0  # +1 = long spread, -1 = short spread, 0 = flat
    entry_beta = 0.0      # hedge ratio frozen at entry; open trades are not rebalanced
    entry_notional = 1.0  # gross value of both legs at entry, used to scale P&L
    trade_pnl = 0.0
    prev_x = prev_y = None
    trades = []
    equity_curve = [1.0]  # one point per bar
    for i in range(len(merged)):
        x = merged["price1"].iloc[i]
        y = merged["price2"].iloc[i]
        timestamp = merged["timestamp"].iloc[i]
        # Mark any open position to market using the hedge ratio fixed at entry
        daily_pnl = 0.0
        if position != 0:
            daily_pnl = position * ((y - prev_y) - entry_beta * (x - prev_x)) / entry_notional
            trade_pnl += daily_pnl
        if i > 0:
            equity_curve.append(equity_curve[-1] * (1 + daily_pnl))
        prev_x, prev_y = x, y
        z_pred, spread, beta = kf.update(x, y)
        zscore = kf.get_zscore(lookback=lookback)
        if zscore is None:
            continue
        # Entry logic (signal and fill both use today's close, an optimistic assumption)
        if position == 0:
            if abs(zscore) > entry_threshold:
                position = 1 if zscore < 0 else -1
                entry_beta = beta
                entry_notional = abs(y) + abs(beta * x)
                trade_pnl = 0.0
                action = "long_spread" if position == 1 else "short_spread"
                trades.append({"date": timestamp, "action": action,
                               "zscore": zscore, "beta": beta})
        # Exit logic
        elif (position == 1 and zscore > -exit_threshold) or \
                (position == -1 and zscore < exit_threshold):
            action = "exit_long_spread" if position == 1 else "exit_short_spread"
            trades.append({"date": timestamp, "action": action,
                           "zscore": zscore, "pnl": trade_pnl})
            position = 0
    equity_series = pd.Series(equity_curve)
    returns = equity_series.pct_change().dropna()
    total_return = equity_curve[-1] - 1.0
    sharpe = returns.mean() / returns.std() * np.sqrt(252) if returns.std() > 0 else 0.0
    max_dd = (equity_series / equity_series.cummax() - 1).min()
    return {
        "pair": f"{stock1}/{stock2}",
        "total_return": total_return,
        "sharpe_ratio": sharpe,
        "max_drawdown": max_dd,
        # Completed round trips (the trade log holds entries and exits separately)
        "n_trades": sum(1 for t in trades if t["action"].startswith("exit")),
        "equity_curve": equity_curve,
        "trades": pd.DataFrame(trades)
    }
5. Full Pairs Screening Implementation
Putting it all together, here is the complete screening pipeline that fetches data, applies the correlation filter, runs cointegration tests, calculates half-life, and outputs a ranked list of tradeable pairs:
def run_pairs_screening_pipeline(
    universe: list[str],
    min_correlation: float = 0.80,
    max_half_life: int = 120,
    min_half_life: int = 5,
    top_n: int = 20
) -> pd.DataFrame:
    """Complete pairs trading screening pipeline.
    Workflow:
    1. Fetch 500-day OHLCV data for all stocks in the universe
    2. Compute 252-day rolling correlation matrix
    3. Apply correlation filter (keep pairs above threshold)
    4. Run Engle-Granger cointegration tests on survivors
    5. Calculate Ornstein-Uhlenbeck half-life for cointegrated pairs
    6. Filter by half-life range (5–120 days is typically most tradeable)
    7. Rank by p-value (strongest cointegration first)
    ⚠️ For large universes (1,000+ stocks), this function can take 30–60 minutes
    due to the O(N²) pair generation. Consider parallelizing with multiprocessing.
    Args:
        universe: List of TickDB symbols, e.g. ["AAPL.US", "MSFT.US", "GOOGL.US"]
        min_correlation: Minimum 252-day correlation to proceed to cointegration test
        max_half_life: Maximum half-life in days (pairs reverting slower are discarded)
        min_half_life: Minimum half-life in days (pairs reverting too fast may be noisy)
        top_n: Return the top N pairs by p-value
    Returns:
        DataFrame of ranked pairs with correlation, p-value, hedge ratio, half-life
    """
    print("=" * 60)
    print("PAIRS TRADING SCREENING PIPELINE")
    print("=" * 60)
    # Step 1: Screen pairs
    pairs_df = screen_pairs(universe, min_correlation=min_correlation)
    if pairs_df.empty:
        print("No cointegrated pairs found.")
        return pd.DataFrame()
    # Step 2: Filter by half-life (missing half-lives become NaN and are dropped by the filter)
    pairs_df["half_life_days"] = pd.to_numeric(pairs_df["half_life_days"], errors="coerce")
    pairs_df = pairs_df[
        (pairs_df["half_life_days"] >= min_half_life) &
        (pairs_df["half_life_days"] <= max_half_life)
    ]
    if pairs_df.empty:
        print(f"No pairs found with half-life between {min_half_life} and {max_half_life} days.")
        return pd.DataFrame()
    # Step 3: Rank by p-value (strongest cointegration first)
    pairs_df = pairs_df.sort_values("p_value").head(top_n)
    pairs_df = pairs_df.reset_index(drop=True)
    # Display summary table
    print("\nTOP PAIRS (ranked by cointegration p-value):")
    print("-" * 80)
    print(f"{'Rank':<5} {'Pair':<20} {'Corr':<8} {'p-value':<10} {'Hedge β':<10} {'Half-life':<10}")
    print("-" * 80)
    for i, row in pairs_df.iterrows():
        print(f"{i+1:<5} {row['stock1']}/{row['stock2']:<12} "
              f"{row['correlation']:<8.3f} {row['p_value']:<10.4f} "
              f"{row['hedge_ratio']:<10.4f} {row['half_life_days']:<10.1f}")
    return pairs_df
# Example: Screen a basket of tech and financial stocks
if __name__ == "__main__":
    universe = [
        "AAPL.US", "MSFT.US", "GOOGL.US", "AMZN.US", "META.US",
        "NVDA.US", "JPM.US", "BAC.US", "GS.US", "MS.US",
        "XOM.US", "CVX.US", "JNJ.US", "PFE.US", "UNH.US"
    ]
    top_pairs = run_pairs_screening_pipeline(
        universe,
        min_correlation=0.80,
        max_half_life=90,
        min_half_life=5,
        top_n=10
    )
6. Common Pitfalls and Production Warnings
Overfitting the cointegration test. Testing 500,000 pairs at a 5% significance level will flag roughly 25,000 pairs even if none of them is truly cointegrated, because 5% of true nulls are rejected by construction. The remedy is in-sample/out-of-sample validation: test cointegration on the first 300 days, then verify that the spread remains mean-reverting over the next 200 days. If it does not, discard the pair.
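One way to sketch that split, assuming 500 aligned daily observations: fit the hedge ratio on the first 300 days only, freeze it, and check whether the spread over the remaining 200 days still has a finite half-life. The function name, the synthetic data, and the 120-day cutoff (borrowed from the pipeline's default `max_half_life`) are illustrative choices, and the half-life check stands in for a full hold-out cointegration test.

```python
import numpy as np


def out_of_sample_half_life(y, x, split=300):
    """Fit beta on the first `split` points, then measure reversion on the rest."""
    # In-sample: hedge ratio and intercept from OLS (polyfit returns slope, intercept)
    beta, alpha = np.polyfit(x[:split], y[:split], 1)
    # Out-of-sample: apply the frozen parameters to unseen data
    spread = y[split:] - (alpha + beta * x[split:])
    z_lag = spread[:-1] - spread[:-1].mean()
    delta = np.diff(spread)
    # Slope of the regression of delta_z on the lagged spread
    theta = (z_lag @ (delta - delta.mean())) / (z_lag @ z_lag)
    half_life = -np.log(2) / theta if theta < 0 else float("inf")
    return beta, half_life


# Synthetic pair whose spread is an AR(1) process with coefficient 0.5
rng = np.random.default_rng(3)
n = 500
x = 80 + np.cumsum(rng.normal(0.0, 1.0, n))
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.5 * noise[t - 1] + rng.normal(0.0, 1.0)
y = 10 + 1.5 * x + noise

beta_in, hl_out = out_of_sample_half_life(y, x)
keep_pair = bool(np.isfinite(hl_out) and hl_out <= 120)
print(f"in-sample beta {beta_in:.3f}, out-of-sample half-life {hl_out:.1f} days, keep: {keep_pair}")
```

A pair that passed only by chance in-sample will usually show no reversion, or an implausibly long half-life, once the hedge ratio is frozen and applied to the hold-out window.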
Ignoring regime changes. A pair that is cointegrated in 2020–2022 may not be cointegrated in 2023–2024 if the fundamental relationship breaks. Monitor the rolling p-value of the cointegration test. If the 60-day rolling p-value consistently exceeds 0.10, suspend trading that pair.
Static transaction cost assumptions. The textbook strategy assumes negligible transaction costs. In production, costs for US equities are typically 0.05–0.15% per side (bid-ask + commission + short borrow), and a pairs trade pays them on two legs at both entry and exit. A pair with an expected reversion of 1.5% and a half-life of 20 days looks profitable on paper but may be marginal after costs. Always build in a cost buffer.
Corporate actions. Stock splits, mergers, spin-offs, and large dividends change the price series in ways that break the cointegration relationship. Recompute the hedge ratio after any corporate action affecting either leg of the pair.
Spread normalization. The raw spread $Z_t = Y_t - \beta X_t$ has units of dollars. A spread value of 5.0 means different things for a pair with prices around $50 versus a pair with prices around $500. Always use z-score normalization for entry/exit thresholds to make them comparable across pairs and markets.
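The scale-invariance claim is easy to check. The sketch below computes a trailing z-score with the same convention as `get_zscore` (the window includes the current point) and applies it to a synthetic spread and to the same spread multiplied by ten, standing in for a higher-priced pair. The function name and data are assumptions for illustration.

```python
import numpy as np


def rolling_zscore(spread, lookback=20):
    """Z-score of each point against the trailing `lookback` window ending at it."""
    spread = np.asarray(spread, dtype=float)
    z = np.full(len(spread), np.nan)  # NaN until a full window is available
    for t in range(lookback - 1, len(spread)):
        window = spread[t - lookback + 1 : t + 1]
        std = window.std()
        if std > 1e-10:
            z[t] = (spread[t] - window.mean()) / std
    return z


rng = np.random.default_rng(4)
spread_low_priced = rng.normal(0.0, 0.5, 250)      # dollars, pair trading near $50
spread_high_priced = 10.0 * spread_low_priced      # same shape, pair trading near $500

z_low = rolling_zscore(spread_low_priced)
z_high = rolling_zscore(spread_high_priced)
print("max |difference| in z-scores:", np.nanmax(np.abs(z_low - z_high)))
```

Because both the deviation and the standard deviation scale by the same factor, a threshold such as 2.0 carries the same meaning for both pairs.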
7. Reference Ticker Universe for US Equities
The following table provides a starting universe organized by sector. These instruments are selected for high liquidity, active options markets, and sufficient price history for statistical analysis.
| Sector | Tickers | Rationale |
|---|---|---|
| Large-cap Technology | AAPL.US, MSFT.US, GOOGL.US, AMZN.US, META.US, NVDA.US | High correlation within sector; strong cointegration candidates |
| Investment Banking | GS.US, MS.US, JPM.US, BAC.US, C.US | Shared revenue sensitivity to rate environment |
| Integrated Energy | XOM.US, CVX.US, COP.US | Commodity-price co-movement; sector-wide factor exposure |
| Major Airlines | DAL.US, UAL.US, AAL.US, LUV.US | High operational correlation; capacity decisions ripple across the sector |
| US Index ETFs | SPY.US, QQQ.US, IWM.US | Liquid proxies for sector pairs; useful as benchmark or hedge instruments |
For institutional-grade backtesting, the TickDB /v1/market/kline endpoint provides 10+ years of cleaned, time-aligned OHLCV data across all US equity symbols. Set the limit parameter to 500 for two-year windows or 1,000 for four-year windows to ensure your cointegration tests have sufficient statistical power.
8. Closing
The hardest part of pairs trading is not the math. It is the discipline to let the process run: to generate a ranked universe of pairs, pick the top 10, execute with consistent position sizing, and measure performance over quarters—not days.
Cointegration tells you the pair has a long-run equilibrium. Kalman filtering tells you the current hedge ratio without needing to re-estimate over the full window every time. Half-life tells you whether the pair is fast enough to trade given your cost structure. Together, they form a defensible screening pipeline that can survive contact with a live market.
The signal is in the spread. The discipline is in the system.
Next Steps
If you want to run this screening pipeline yourself: Sign up at tickdb.ai for a free API key (no credit card required), set the TICKDB_API_KEY environment variable, and copy the code from this article. The /v1/market/kline endpoint provides 500+ days of daily OHLCV data for US equities across 15,000+ symbols.
If you're building a real-time monitoring system: The TickDB WebSocket API supports live price streams for the instruments in your pairs universe. Update the Kalman filter hedge ratio in real time and trigger alerts when the z-score crosses your entry threshold.
If you need a complete historical backtest: The /v1/market/kline endpoint supports intervals from 1m to 1d. For intraday pairs trading, fetch 1h or 5m bars, apply the same cointegration pipeline, and measure whether the strategy's edge survives at higher frequencies.
If you use AI coding assistants: Search for and install the tickdb-market-data SKILL in your AI tool's marketplace to get TickDB API access integrated directly into your development workflow.
This article does not constitute investment advice. Pairs trading involves substantial risk including but not limited to market risk, liquidity risk, and model risk. Backtested results do not guarantee future performance. Past performance does not guarantee future results.