Raw market data is full of traps. A stock split that goes unaccounted for turns a $100 price into a $20 phantom profit overnight. A UTC timestamp fed into a system expecting NYSE local time produces a 5-hour shift in your feature matrix. A single corrupted tick — a typo in an exchange's matching engine — spikes your volatility estimate by 300%. None of these errors appear in official documentation. They compound silently inside every backtest that skips the cleaning step.
The engineering behind data cleaning and alignment is unglamorous by design. Its job is to disappear — to produce a dataset so internally consistent that the quant researcher never needs to think about it. When it works, nobody notices. When it fails, every strategy breaks in a different way.
This article dissects the five dimensions of the cleaning and alignment pipeline: price continuity across corporate actions, timestamp normalization, volume anomaly detection, exchange-mandated halt handling, and ticker symbol continuity. Each dimension includes a concrete example, the specific algorithm or rule applied, and a code walkthrough of what the pipeline actually does under the hood.
Why Cleaning Matters More Than Strategy Design
Before diving into the pipeline, it is worth establishing why this step is non-negotiable for any serious backtesting effort.
Consider a naive momentum strategy run on unadjusted close prices for a dividend-paying stock. The strategy ranks stocks by their 20-day return and rebalances monthly. At the ex-dividend date, the stock price drops by the dividend amount — typically 1–3% for a large-cap stock. This drop registers as a negative return in your feature matrix, which is correct in isolation. But it is not a momentum signal. It is an accounting artifact. Your strategy now systematically sells stocks that just went ex-dividend, not because they are losing momentum, but because the data told it to.
This is one example. There are five major categories of data contamination, each with a different remediation strategy. Quant teams that skip this step produce strategies that work on paper and fail in live trading — not because the strategy is wrong, but because the data it learned from was lying.
Dimension 1: Price Continuity and Ex-Rights Adjustment
The Problem
Corporate actions — stock splits, reverse splits, stock dividends, spin-offs, rights issues — alter the nominal price of a security without changing its underlying value. A 2-for-1 split halves the price and doubles the share count. A 10% stock dividend increases the share count by 10% and reduces the price by approximately 9.1%. If your backtest ingests raw prices during these events, it will register catastrophic losses or gains that are purely administrative.
The Adjustment Standard
The industry standard for ex-rights adjustment is Cumulative Adjustment Factor (CAF). This approach multiplies all historical prices by a chain-linked adjustment factor that is set to 1.0 at the current date and grows backward:
Adjusted_Price(t) = Raw_Price(t) × CAF(t)
Where CAF(t) is computed as the product of all adjustment factors between t and today:
CAF(t) = ∏(1 / split_ratio_i) for all actions i occurring after t
For stock dividends, the factor is (old_shares / new_shares). For spin-offs, the factor incorporates the allocation ratio of the parent to the subsidiary.
Critical Decision: Forward vs. Backward Adjustment
There are two schools of thought. Forward adjustment scales all historical prices down to today's nominal scale. Backward adjustment scales all current and future prices up to a pre-split nominal scale.
Forward adjustment is the industry standard for backtesting because it keeps prices in familiar dollar ranges and prevents floating-point precision issues with very small fractional prices. Backward adjustment preserves the absolute price level, which some practitioners prefer for regime analysis.
TickDB uses forward adjustment.
Code Implementation
from datetime import datetime
from typing import Dict, List, Tuple
import numpy as np
class AdjustmentFactorCalculator:
"""Computes cumulative adjustment factors for a given security.
Uses chain-linked forward adjustment: all historical prices are
scaled down to the current nominal scale. CAF(today) = 1.0.
"""
def __init__(self, corporate_actions: List[Dict]):
"""Initialize with a list of corporate action events.
Args:
corporate_actions: List of dicts with keys:
- 'effective_date': str in YYYY-MM-DD format
- 'action_type': 'split' | 'stock_dividend' | 'spinoff' | 'rights'
- 'ratio': float (e.g., 2.0 for 2-for-1 split means 2 old shares → 1 new)
- 'additional_data': dict for spin-off allocation ratios
"""
self.actions = sorted(
corporate_actions,
key=lambda x: x['effective_date'],
reverse=True # Most recent first for forward chaining
)
def compute_caf_at_date(self, target_date: str) -> float:
"""Compute the cumulative adjustment factor for a specific date."""
target_dt = datetime.strptime(target_date, '%Y-%m-%d')
caf = 1.0
for action in self.actions:
action_dt = datetime.strptime(action['effective_date'], '%Y-%m-%d')
if action_dt >= target_dt:
ratio = action['ratio']
if action['action_type'] == 'split':
# For a 2-for-1 split: 2 old shares → 1 new share
# Price halves. Factor = 1/2.
caf *= (1.0 / ratio)
elif action['action_type'] == 'stock_dividend':
# 10% stock dividend: 1.10 new shares per old share
caf *= (1.0 / ratio)
elif action['action_type'] == 'spinoff':
# Spin-off: allocation ratio determines parent adjustment
alloc = action.get('additional_data', {}).get('allocation_ratio', 1.0)
caf *= (1.0 / alloc)
elif action['action_type'] == 'rights':
# Rights issue: adjusts for dilution
rights_ratio = action.get('additional_data', {}).get('rights_ratio', ratio)
caf *= (1.0 / rights_ratio)
return caf
def adjust_price_series(
self,
raw_prices: List[float],
dates: List[str]
) -> List[float]:
"""Adjust an entire price series using forward (backward-chain) CAF."""
if len(raw_prices) != len(dates):
raise ValueError("Price series and date list must have equal length")
adjusted = []
for date in dates:
caf = self.compute_caf_at_date(date)
adjusted.append(raw_prices[dates.index(date)] * caf)
return adjusted
# Example: Apple 2020 4-for-1 split
actions = [
{
'effective_date': '2020-08-31',
'action_type': 'split',
'ratio': 4.0,
'additional_data': {}
}
]
calculator = AdjustmentFactorCalculator(actions)
# Pre-split price on 2020-08-28: ~$500
# CAF at 2020-08-28: 1/4 = 0.25
caf = calculator.compute_caf_at_date('2020-08-28')
print(f"CAF on 2020-08-28: {caf}") # Expected: 0.25
# Pre-split $500 × 0.25 = $125, which is the post-split equivalent price
Dimension 2: Timestamp Normalization and Timezone Alignment
The Problem
Global markets operate across multiple timezones. NYSE closes at 4:00 PM ET, which is 9:00 PM UTC. HKEX closes at 4:00 PM HKT, which is 8:00 AM UTC. Crypto markets run 24/7/365 with no closing bell. If a dataset mixes raw timestamps without normalization, any multi-market strategy that computes cross-correlation or arbitrage signals will produce garbage.
A concrete example: you want to correlate SPY's closing prices with BTC's 24-hour rolling close. SPY closes at 4:00 PM ET (20:00 UTC). BTC has no close — but most data vendors construct a "daily close" at UTC midnight. If you merge these two series without alignment, your correlation estimate is off by a full trading day for SPY's window.
The Normalization Standard
TickDB normalizes all timestamps to UTC as the canonical storage timezone. Every timestamp stored in the TickDB database is UTC. The API accepts timestamps in any timezone but converts them internally before storage.
For session boundary definitions, the system uses market-native session windows:
| Market | Session window | UTC offset |
|---|---|---|
| US Equities | 09:30–16:00 ET | UTC-5 (EST) / UTC-4 (EDT) |
| HK Equities | 09:30–12:00 HKT, 13:00–16:00 HKT | UTC+8 |
| A-Shares | 09:30–11:30 CST, 13:00–15:00 CST | UTC+8 |
| Crypto | 24/7 — uses UTC midnight as daily anchor | UTC+0 |
DST transitions are handled explicitly. The system maintains an IANA timezone database and applies the correct offset at each date boundary.
Timestamp Alignment Algorithm
from datetime import datetime, timezone
from zoneinfo import ZoneInfo
from typing import Optional
import pandas as pd
class TimestampNormalizer:
"""Normalizes heterogeneous market timestamps to UTC.
Handles DST transitions, market-specific session windows,
and crypto's 24/7 continuous trading.
"""
MARKET_TZ = {
'US': 'America/New_York',
'HK': 'Asia/Hong_Kong',
'A': 'Asia/Shanghai',
'CRYPTO': 'UTC'
}
def normalize_to_utc(
self,
timestamp: datetime,
market: str,
target_tz: Optional[timezone] = None
) -> datetime:
"""Convert a market-local timestamp to UTC.
Args:
timestamp: datetime object (tz-aware or naive)
market: market code ('US', 'HK', 'A', 'CRYPTO')
target_tz: optional override timezone (default: UTC)
Returns:
UTC-normalized datetime object
"""
if target_tz is None:
target_tz = timezone.utc
market_tz = ZoneInfo(self.MARKET_TZ[market])
# If naive, localize to market timezone
if timestamp.tzinfo is None:
timestamp = timestamp.replace(tzinfo=market_tz)
else:
# Convert to market timezone first to apply correct DST offset
timestamp = timestamp.astimezone(market_tz)
return timestamp.astimezone(target_tz)
def compute_daily_close_anchor(self, date: str, market: str) -> datetime:
"""Compute the canonical daily close timestamp for a given market.
For US/HK/A markets: anchor is market close.
For crypto: anchor is UTC midnight.
"""
date_dt = datetime.strptime(date, '%Y-%m-%d')
if market == 'CRYPTO':
# Crypto daily close = UTC midnight of the calendar day
return date_dt.replace(tzinfo=timezone.utc)
market_tz = ZoneInfo(self.MARKET_TZ[market])
if market == 'US':
# US close = 16:00 ET
close_hour, close_minute = 16, 0
elif market == 'HK':
# HK close = 16:00 HKT (last session close)
close_hour, close_minute = 16, 0
elif market == 'A':
# A-share close = 15:00 CST
close_hour, close_minute = 15, 0
else:
close_hour, close_minute = 23, 59
market_close = date_dt.replace(
hour=close_hour,
minute=close_minute,
tzinfo=market_tz
)
return market_close.astimezone(timezone.utc)
normalizer = TimestampNormalizer()
# SPY close on a specific date
spy_close = datetime(2024, 3, 15, 16, 0) # Naive
utc_spy = normalizer.normalize_to_utc(spy_close, market='US')
print(f"SPY close in UTC: {utc_spy.isoformat()}")
# Output: 2024-03-15T20:00:00+00:00 (16:00 ET = 20:00 UTC ✓)
# Crypto daily anchor
crypto_anchor = normalizer.compute_daily_close_anchor('2024-03-15', 'CRYPTO')
print(f"Crypto daily close anchor: {crypto_anchor.isoformat()}")
# Output: 2024-03-15T00:00:00+00:00
Dimension 3: Volume Anomaly Detection
The Problem
Exchange data pipelines occasionally produce anomalous volume readings. These fall into three categories:
- ** Transcription errors **: A fat-fingered volume entry, typically 10–100× the normal level.
- ** Exchange maintenance fills **: Some venues inject synthetic volume during system maintenance windows, which must be flagged and excluded.
- ** Corporate action wash volume **: Certain events generate volume spikes that are not representative of genuine market activity.
Undetected, these anomalies inflate your volume signal, distort your liquidity metrics, and generate false momentum signals.
The Detection Algorithm
TickDB uses a rolling-window Z-score detector with adaptive thresholds:
Z(t) = (Volume(t) − μ_rolling) / σ_rolling
Where μ_rolling and σ_rolling are computed over a 20-day lookback window excluding the current observation. Points where |Z(t)| > 3.5 are flagged as anomalies.
The 3.5 threshold was chosen empirically: at 3.0, false positives begin to appear around earnings-adjacent volume spikes. At 4.0, real anomalies are missed. 3.5 is the inflection point.
Flagged points are not deleted — they are winsorized to the boundary value (μ + 3.5σ or μ − 3.5σ) to preserve the time series continuity while eliminating the outlier influence.
import numpy as np
from typing import List, Tuple
class VolumeAnomalyDetector:
"""Rolling Z-score detector for volume anomalies.
Uses an adaptive 20-day rolling window. Flags points where
|Z-score| > 3.5, then winsorizes to boundary values to
preserve series continuity.
"""
def __init__(self, lookback_days: int = 20, z_threshold: float = 3.5):
self.lookback_days = lookback_days
self.z_threshold = z_threshold
def detect_and_winsorize(
self,
volumes: List[float],
window: int
) -> Tuple[List[float], List[int]]:
"""Detect anomalies and winsorize the volume series.
Args:
volumes: Raw volume series (daily or intraday)
window: Number of observations in the rolling window
Returns:
Tuple of (winsorized_volumes, anomaly_indices)
"""
if len(volumes) < window + 1:
raise ValueError(
f"Volume series too short. Need at least {window + 1} "
f"observations, got {len(volumes)}"
)
winsorized = volumes.copy()
anomaly_indices = []
for i in range(window, len(volumes)):
# Rolling window: exclude current observation
window_data = volumes[i - window:i]
mu = np.mean(window_data)
sigma = np.std(window_data)
if sigma < 1e-9: # Near-zero volatility — skip
continue
z = (volumes[i] - mu) / sigma
if abs(z) > self.z_threshold:
anomaly_indices.append(i)
# Winsorize: cap to boundary value
upper = mu + self.z_threshold * sigma
lower = mu - self.z_threshold * sigma
winsorized[i] = np.clip(volumes[i], lower, upper)
return winsorized, anomaly_indices
# Example: Synthetic volume spike detection
raw_volumes = [
1_200_000, 1_350_000, 1_100_000, 1_400_000, 1_250_000, # Normal range
98_000_000, # ← Anomaly: 70× the mean
1_300_000, 1_180_000, 1_420_000, 1_090_000
]
detector = VolumeAnomalyDetector(lookback_days=5, z_threshold=3.5)
winsorized, indices = detector.detect_and_winsorize(raw_volumes, window=5)
print(f"Anomaly detected at indices: {indices}")
print(f"Original value at index 5: {raw_volumes[5]:,}")
print(f"Winsorized value at index 5: {winsorized[5]:,.0f}")
# The spike is capped to a statistically plausible upper bound
Dimension 4: Exchange Halt and Trading Pause Handling
The Problem
Exchanges halt trading during corporate actions, regulatory reviews, and technical incidents. During a halt, there is no price discovery — but the data feed continues emitting the last traded price. A naive backtest will treat the halt period as a flat candle, which distorts volatility estimates and produces false breakouts when trading resumes.
For example, on March 18, 2024, multiple regional bank stocks were halted for hours following a news event. A momentum strategy that used raw prices during the halt window would have computed zero return (correct) but also zero volatility contribution (incorrect — the market was pricing in significant news during the halt).
The Handling Standard
TickDB classifies trading interruptions into three severity tiers and handles each differently:
| Tier | Trigger | Handling |
|---|---|---|
| T1: Full halt | Exchange-mandated trading halt | Exclude from returns; include in volume as zero; mark candle as halted: true |
| T2: Pause / limit state | Limit-up / limit-down | Use as-is but flag in metadata; apply ignore_for_volatility: true |
| T3: Thin period | Pre/post-market thin volume | Include in data but flag session: 'extended' |
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import pandas as pd
class HaltTier(Enum):
FULL_HALT = 1
LIMIT_STATE = 2
THIN_PERIOD = 3
@dataclass
class HaltRecord:
symbol: str
start_time_utc: pd.Timestamp
end_time_utc: pd.Timestamp
tier: HaltTier
reason: Optional[str] = None
class HaltAwarePriceProcessor:
"""Handles trading halts and pauses with appropriate data exclusions.
Tier 1 (full halt): excluded from return calculations.
Tier 2 (limit state): included but flagged for volatility treatment.
Tier 3 (thin period): included with session flag.
"""
def __init__(self, halt_records: list[HaltRecord]):
self.halt_index = pd.DatetimeIndex([
r.start_time_utc for r in halt_records
])
def process_candle(self, candle: dict) -> dict:
"""Apply halt-aware processing to a single OHLCV candle."""
candle_ts = pd.Timestamp(candle['timestamp_utc'])
# Check for T1 halt overlap
for record in self._get_overlapping_halts(candle_ts, HaltTier.FULL_HALT):
# Replace volume with 0, mark as halted
candle['volume'] = 0
candle['halted'] = True
candle['halt_reason'] = record.reason
# Check for T2 limit state
for record in self._get_overlapping_halts(candle_ts, HaltTier.LIMIT_STATE):
candle['limit_state'] = True
candle['ignore_for_volatility'] = True
# Check for T3 thin period
for record in self._get_overlapping_halts(candle_ts, HaltTier.THIN_PERIOD):
candle['session'] = 'extended'
return candle
def _get_overlapping_halts(
self,
ts: pd.Timestamp,
tier: HaltTier
) -> list[HaltRecord]:
# Placeholder — in production this is an interval tree query
return []
Dimension 5: Ticker Symbol Continuity
The Problem
Securities change tickers. A company renames itself (e.g., Facebook → Meta), spins off a subsidiary, merges with another entity, or migrates from one exchange to another. If your backtest uses raw ticker strings, you will encounter missing data windows that are not missing — they are the same company under a different name.
The Continuity Standard
TickDB maintains a ticker mapping table that links all historical ticker variants to a single canonical symbol. The canonical symbol is the current active ticker. All historical OHLCV data is back-filled and linked to this canonical symbol.
| Event | Pre-event ticker | Post-event ticker | Canonical |
|---|---|---|---|
| Renaming | FB | META | META |
| Merger | TCO | WELL | WELL |
| Exchange transfer | BRK.A | BRK.A | BRK.A |
| Spin-off | UPS | UPS + XPO | UPS (parent), XPO (child) |
When a spin-off occurs, both the parent and child get independent canonical symbols with complete historical records. The parent retains its pre-spin-off history; the child is created fresh at the spin-off date.
The Pipeline in Production: End-to-End Flow
Putting all five dimensions together, the data cleaning pipeline follows this sequence:
Raw Exchange Feed
↓
[1] Timestamp Normalization → UTC conversion, DST handling
↓
[2] Ticker Symbol Continuity → Historical ticker → canonical mapping
↓
[3] Price Continuity (CAF) → Ex-rights adjustment factors applied
↓
[4] Volume Anomaly Detection → Z-score filtering, winsorization
↓
[5] Halt / Pause Handling → Tier classification, metadata flags
↓
Cleaned, Aligned Dataset
Each stage is idempotent and auditable. Every adjustment factor applied in Stage 3 is logged with a reason code. Every anomaly winsorized in Stage 4 is recorded with its Z-score. Every halt classified in Stage 5 carries a reference to the exchange announcement.
Comparison: Raw Data vs. Cleaned Data
| Dimension | Raw data | Cleaned data |
|---|---|---|
| Price continuity | Breaks at every split — $500 then $125 for AAPL post-split | Continuous series — all prices adjusted to post-split nominal |
| Timestamp alignment | Mixed: some ET, some HKT, some UTC | All timestamps normalized to UTC with DST-aware session anchors |
| Volume anomalies | Unfiltered spikes from transcription errors | Winsorized at 3.5σ — preserves magnitude without distorting statistics |
| Trading halts | Silent — last price repeated, volume nonzero | Flagged by tier — halt periods excluded from return calculations |
| Ticker continuity | Historical gaps at ticker changes | Full history under canonical symbol — no gaps |
Engineering Trade-offs and Known Limitations
No cleaning pipeline is perfect. The following limitations are documented:
CAF forward adjustment assumes no survivorship bias. If a company delists and its data is removed from the dataset, the CAF chain is broken for that security. Survivorship-bias-corrected datasets add a separate delisting return for securities that ceased trading before the dataset's end date.
Z-score anomaly detection is window-sensitive. In high-volatility regimes (e.g., March 2020), the rolling window expands the threshold adaptively, which may flag too many points. TickDB applies a minimum σ floor of 10% of the mean to prevent the threshold from collapsing during volatile periods.
Halt detection relies on exchange-published halt records. If an exchange fails to publish the halt record (a rare but documented occurrence), the system falls back to volume-zero detection — which is less reliable for T2 limit states where trading continues at the limit price.
Ticker continuity mapping is reviewed annually. Ticker changes that occur between reviews are applied retrospectively but with a lag. For the most current mapping, users should verify against the /v1/symbols/available endpoint.
Practical Implication for Backtesting
A backtest run on raw data is not a backtest — it is a simulation of a fictional market that never existed. The five cleaning dimensions described above are the minimum viable set to produce results that are worth trusting.
Concretely: a momentum strategy tested on unadjusted AAPL prices in 2020 would show a 70% drawdown on August 28, 2020 — not because the market crashed, but because a 4-for-1 stock split occurred. An overnight gap strategy run on NYSE data without halt handling would generate false signals from the 20-minute gap at the open following a circuit-breaker pause. A volatility-targeting strategy that uses unfiltered volume would systematically underexpose during post-halt sessions when volume should be zero.
These are not edge cases. They are the baseline conditions of every major market, every year.
Closing
The cleaning pipeline is the unsexy foundation that makes every strategy that comes after it legible. It converts a noisy, timezone-mismatched, split-contaminated, halt-ridden raw feed into a dataset where the only remaining variation is genuine market signal.
For quant researchers who want to focus on strategy design rather than data plumbing, TickDB's cleaned historical OHLCV data provides this pipeline as a managed service. The five dimensions — ex-rights adjustment, timestamp normalization, volume anomaly detection, halt handling, and ticker continuity — are all applied and auditable.
If you're building a backtest and spending time normalizing data instead of designing alpha, you are solving the wrong problem.
Next steps:
- Sign up at tickdb.ai (free API key, no credit card required)
- Query
/v1/market/klinewith any US equity symbol and inspect theadjusted_closefield versusclose - Verify timestamp alignment by comparing
timestamp_utcagainst your local timezone expectations - Review the
/v1/symbols/availableendpoint to confirm ticker continuity mappings for your coverage universe
This article does not constitute investment advice. Markets involve risk; past performance does not guarantee future results.