Data Cleaning Behind the Scenes: How Historical Market Data Gets Backtest-Ready | API Guide

Raw market data is full of traps. A stock split that goes unaccounted for turns a $100 price into a $20 phantom profit overnight. A UTC timestamp fed into a system expecting NYSE local time produces a 5-hour shift in your feature matrix. A single corrupted tick — a typo in an exchange's matching engine — spikes your volatility estimate by 300%. None of these errors appear in official documentation. They compound silently inside every backtest that skips the cleaning step.

The engineering behind data cleaning and alignment is unglamorous by design. Its job is to disappear — to produce a dataset so internally consistent that the quant researcher never needs to think about it. When it works, nobody notices. When it fails, every strategy breaks in a different way.

This article dissects the five dimensions of the cleaning and alignment pipeline: price continuity across corporate actions, timestamp normalization, volume anomaly detection, exchange-mandated halt handling, and ticker symbol continuity. Each dimension includes a concrete example, the specific algorithm or rule applied, and a code walkthrough of what the pipeline actually does under the hood.

Why Cleaning Matters More Than Strategy Design

Before diving into the pipeline, it is worth establishing why this step is non-negotiable for any serious backtesting effort.

Consider a naive momentum strategy run on unadjusted close prices for a dividend-paying stock. The strategy ranks stocks by their 20-day return and rebalances monthly. At the ex-dividend date, the stock price drops by the dividend amount — typically 1–3% for a large-cap stock. This drop registers as a negative return in your feature matrix, which is correct in isolation. But it is not a momentum signal. It is an accounting artifact. Your strategy now systematically sells stocks that just went ex-dividend, not because they are losing momentum, but because the data told it to.

This is one example. There are five major categories of data contamination, each with a different remediation strategy. Quant teams that skip this step produce strategies that work on paper and fail in live trading — not because the strategy is wrong, but because the data it learned from was lying.

Dimension 1: Price Continuity and Ex-Rights Adjustment

The Problem

Corporate actions — stock splits, reverse splits, stock dividends, spin-offs, rights issues — alter the nominal price of a security without changing its underlying value. A 2-for-1 split halves the price and doubles the share count. A 10% stock dividend increases the share count by 10% and reduces the price by approximately 9.1%. If your backtest ingests raw prices during these events, it will register catastrophic losses or gains that are purely administrative.

The Adjustment Standard

The industry standard for ex-rights adjustment is Cumulative Adjustment Factor (CAF). This approach multiplies all historical prices by a chain-linked adjustment factor that is set to 1.0 at the current date and grows backward:

Adjusted_Price(t) = Raw_Price(t) × CAF(t)

Where CAF(t) is computed as the product of all adjustment factors between t and today:

CAF(t) = ∏(1 / split_ratio_i)  for all actions i occurring after t

For stock dividends, the factor is (old_shares / new_shares). For spin-offs, the factor incorporates the allocation ratio of the parent to the subsidiary.

Critical Decision: Forward vs. Backward Adjustment

There are two schools of thought. Forward adjustment scales all historical prices down to today's nominal scale. Backward adjustment scales all current and future prices up to a pre-split nominal scale.

Forward adjustment is the industry standard for backtesting because it keeps prices in familiar dollar ranges and prevents floating-point precision issues with very small fractional prices. Backward adjustment preserves the absolute price level, which some practitioners prefer for regime analysis.

TickDB uses forward adjustment.

Code Implementation

from datetime import datetime
from typing import Dict, List, Tuple
import numpy as np

class AdjustmentFactorCalculator:
    """Computes cumulative adjustment factors for a given security.
    
    Uses chain-linked forward adjustment: all historical prices are
    scaled down to the current nominal scale. CAF(today) = 1.0.
    """

    def __init__(self, corporate_actions: List[Dict]):
        """Initialize with a list of corporate action events.
        
        Args:
            corporate_actions: List of dicts with keys:
                - 'effective_date': str in YYYY-MM-DD format
                - 'action_type': 'split' | 'stock_dividend' | 'spinoff' | 'rights'
                - 'ratio': float (e.g., 2.0 for 2-for-1 split means 2 old shares → 1 new)
                - 'additional_data': dict for spin-off allocation ratios
        """
        self.actions = sorted(
            corporate_actions,
            key=lambda x: x['effective_date'],
            reverse=True  # Most recent first for forward chaining
        )

    def compute_caf_at_date(self, target_date: str) -> float:
        """Compute the cumulative adjustment factor for a specific date."""
        target_dt = datetime.strptime(target_date, '%Y-%m-%d')
        caf = 1.0
        
        for action in self.actions:
            action_dt = datetime.strptime(action['effective_date'], '%Y-%m-%d')
            if action_dt >= target_dt:
                ratio = action['ratio']
                if action['action_type'] == 'split':
                    # For a 2-for-1 split: 2 old shares → 1 new share
                    # Price halves. Factor = 1/2.
                    caf *= (1.0 / ratio)
                elif action['action_type'] == 'stock_dividend':
                    # 10% stock dividend: 1.10 new shares per old share
                    caf *= (1.0 / ratio)
                elif action['action_type'] == 'spinoff':
                    # Spin-off: allocation ratio determines parent adjustment
                    alloc = action.get('additional_data', {}).get('allocation_ratio', 1.0)
                    caf *= (1.0 / alloc)
                elif action['action_type'] == 'rights':
                    # Rights issue: adjusts for dilution
                    rights_ratio = action.get('additional_data', {}).get('rights_ratio', ratio)
                    caf *= (1.0 / rights_ratio)
        
        return caf

    def adjust_price_series(
        self,
        raw_prices: List[float],
        dates: List[str]
    ) -> List[float]:
        """Adjust an entire price series using forward (backward-chain) CAF."""
        if len(raw_prices) != len(dates):
            raise ValueError("Price series and date list must have equal length")
        
        adjusted = []
        for date in dates:
            caf = self.compute_caf_at_date(date)
            adjusted.append(raw_prices[dates.index(date)] * caf)
        
        return adjusted


# Example: Apple 2020 4-for-1 split
actions = [
    {
        'effective_date': '2020-08-31',
        'action_type': 'split',
        'ratio': 4.0,
        'additional_data': {}
    }
]

calculator = AdjustmentFactorCalculator(actions)

# Pre-split price on 2020-08-28: ~$500
# CAF at 2020-08-28: 1/4 = 0.25
caf = calculator.compute_caf_at_date('2020-08-28')
print(f"CAF on 2020-08-28: {caf}")  # Expected: 0.25

# Pre-split $500 × 0.25 = $125, which is the post-split equivalent price

Dimension 2: Timestamp Normalization and Timezone Alignment

The Problem

Global markets operate across multiple timezones. NYSE closes at 4:00 PM ET, which is 9:00 PM UTC. HKEX closes at 4:00 PM HKT, which is 8:00 AM UTC. Crypto markets run 24/7/365 with no closing bell. If a dataset mixes raw timestamps without normalization, any multi-market strategy that computes cross-correlation or arbitrage signals will produce garbage.

A concrete example: you want to correlate SPY's closing prices with BTC's 24-hour rolling close. SPY closes at 4:00 PM ET (20:00 UTC). BTC has no close — but most data vendors construct a "daily close" at UTC midnight. If you merge these two series without alignment, your correlation estimate is off by a full trading day for SPY's window.

The Normalization Standard

TickDB normalizes all timestamps to UTC as the canonical storage timezone. Every timestamp stored in the TickDB database is UTC. The API accepts timestamps in any timezone but converts them internally before storage.

For session boundary definitions, the system uses market-native session windows:

Market	Session window	UTC offset
US Equities	09:30–16:00 ET	UTC-5 (EST) / UTC-4 (EDT)
HK Equities	09:30–12:00 HKT, 13:00–16:00 HKT	UTC+8
A-Shares	09:30–11:30 CST, 13:00–15:00 CST	UTC+8
Crypto	24/7 — uses UTC midnight as daily anchor	UTC+0

DST transitions are handled explicitly. The system maintains an IANA timezone database and applies the correct offset at each date boundary.

Timestamp Alignment Algorithm

from datetime import datetime, timezone
from zoneinfo import ZoneInfo
from typing import Optional
import pandas as pd

class TimestampNormalizer:
    """Normalizes heterogeneous market timestamps to UTC.
    
    Handles DST transitions, market-specific session windows,
    and crypto's 24/7 continuous trading.
    """

    MARKET_TZ = {
        'US': 'America/New_York',
        'HK': 'Asia/Hong_Kong',
        'A': 'Asia/Shanghai',
        'CRYPTO': 'UTC'
    }

    def normalize_to_utc(
        self,
        timestamp: datetime,
        market: str,
        target_tz: Optional[timezone] = None
    ) -> datetime:
        """Convert a market-local timestamp to UTC.
        
        Args:
            timestamp: datetime object (tz-aware or naive)
            market: market code ('US', 'HK', 'A', 'CRYPTO')
            target_tz: optional override timezone (default: UTC)
        
        Returns:
            UTC-normalized datetime object
        """
        if target_tz is None:
            target_tz = timezone.utc
        
        market_tz = ZoneInfo(self.MARKET_TZ[market])
        
        # If naive, localize to market timezone
        if timestamp.tzinfo is None:
            timestamp = timestamp.replace(tzinfo=market_tz)
        else:
            # Convert to market timezone first to apply correct DST offset
            timestamp = timestamp.astimezone(market_tz)
        
        return timestamp.astimezone(target_tz)

    def compute_daily_close_anchor(self, date: str, market: str) -> datetime:
        """Compute the canonical daily close timestamp for a given market.
        
        For US/HK/A markets: anchor is market close.
        For crypto: anchor is UTC midnight.
        """
        date_dt = datetime.strptime(date, '%Y-%m-%d')
        
        if market == 'CRYPTO':
            # Crypto daily close = UTC midnight of the calendar day
            return date_dt.replace(tzinfo=timezone.utc)
        
        market_tz = ZoneInfo(self.MARKET_TZ[market])
        
        if market == 'US':
            # US close = 16:00 ET
            close_hour, close_minute = 16, 0
        elif market == 'HK':
            # HK close = 16:00 HKT (last session close)
            close_hour, close_minute = 16, 0
        elif market == 'A':
            # A-share close = 15:00 CST
            close_hour, close_minute = 15, 0
        else:
            close_hour, close_minute = 23, 59
        
        market_close = date_dt.replace(
            hour=close_hour,
            minute=close_minute,
            tzinfo=market_tz
        )
        return market_close.astimezone(timezone.utc)


normalizer = TimestampNormalizer()

# SPY close on a specific date
spy_close = datetime(2024, 3, 15, 16, 0)  # Naive
utc_spy = normalizer.normalize_to_utc(spy_close, market='US')
print(f"SPY close in UTC: {utc_spy.isoformat()}")
# Output: 2024-03-15T20:00:00+00:00 (16:00 ET = 20:00 UTC ✓)

# Crypto daily anchor
crypto_anchor = normalizer.compute_daily_close_anchor('2024-03-15', 'CRYPTO')
print(f"Crypto daily close anchor: {crypto_anchor.isoformat()}")
# Output: 2024-03-15T00:00:00+00:00

Dimension 3: Volume Anomaly Detection

The Problem

Exchange data pipelines occasionally produce anomalous volume readings. These fall into three categories:

** Transcription errors **: A fat-fingered volume entry, typically 10–100× the normal level.
** Exchange maintenance fills **: Some venues inject synthetic volume during system maintenance windows, which must be flagged and excluded.
** Corporate action wash volume **: Certain events generate volume spikes that are not representative of genuine market activity.

Undetected, these anomalies inflate your volume signal, distort your liquidity metrics, and generate false momentum signals.

The Detection Algorithm

TickDB uses a rolling-window Z-score detector with adaptive thresholds:

Z(t) = (Volume(t) − μ_rolling) / σ_rolling

Where μ_rolling and σ_rolling are computed over a 20-day lookback window excluding the current observation. Points where |Z(t)| > 3.5 are flagged as anomalies.

The 3.5 threshold was chosen empirically: at 3.0, false positives begin to appear around earnings-adjacent volume spikes. At 4.0, real anomalies are missed. 3.5 is the inflection point.

Flagged points are not deleted — they are winsorized to the boundary value (μ + 3.5σ or μ − 3.5σ) to preserve the time series continuity while eliminating the outlier influence.

import numpy as np
from typing import List, Tuple

class VolumeAnomalyDetector:
    """Rolling Z-score detector for volume anomalies.
    
    Uses an adaptive 20-day rolling window. Flags points where
    |Z-score| > 3.5, then winsorizes to boundary values to
    preserve series continuity.
    """

    def __init__(self, lookback_days: int = 20, z_threshold: float = 3.5):
        self.lookback_days = lookback_days
        self.z_threshold = z_threshold

    def detect_and_winsorize(
        self,
        volumes: List[float],
        window: int
    ) -> Tuple[List[float], List[int]]:
        """Detect anomalies and winsorize the volume series.
        
        Args:
            volumes: Raw volume series (daily or intraday)
            window: Number of observations in the rolling window
        
        Returns:
            Tuple of (winsorized_volumes, anomaly_indices)
        """
        if len(volumes) < window + 1:
            raise ValueError(
                f"Volume series too short. Need at least {window + 1} "
                f"observations, got {len(volumes)}"
            )

        winsorized = volumes.copy()
        anomaly_indices = []

        for i in range(window, len(volumes)):
            # Rolling window: exclude current observation
            window_data = volumes[i - window:i]
            mu = np.mean(window_data)
            sigma = np.std(window_data)

            if sigma < 1e-9:  # Near-zero volatility — skip
                continue

            z = (volumes[i] - mu) / sigma

            if abs(z) > self.z_threshold:
                anomaly_indices.append(i)
                # Winsorize: cap to boundary value
                upper = mu + self.z_threshold * sigma
                lower = mu - self.z_threshold * sigma
                winsorized[i] = np.clip(volumes[i], lower, upper)

        return winsorized, anomaly_indices


# Example: Synthetic volume spike detection
raw_volumes = [
    1_200_000, 1_350_000, 1_100_000, 1_400_000, 1_250_000,  # Normal range
    98_000_000,  # ← Anomaly: 70× the mean
    1_300_000, 1_180_000, 1_420_000, 1_090_000
]

detector = VolumeAnomalyDetector(lookback_days=5, z_threshold=3.5)
winsorized, indices = detector.detect_and_winsorize(raw_volumes, window=5)

print(f"Anomaly detected at indices: {indices}")
print(f"Original value at index 5: {raw_volumes[5]:,}")
print(f"Winsorized value at index 5: {winsorized[5]:,.0f}")
# The spike is capped to a statistically plausible upper bound

Dimension 4: Exchange Halt and Trading Pause Handling

The Problem

Exchanges halt trading during corporate actions, regulatory reviews, and technical incidents. During a halt, there is no price discovery — but the data feed continues emitting the last traded price. A naive backtest will treat the halt period as a flat candle, which distorts volatility estimates and produces false breakouts when trading resumes.

For example, on March 18, 2024, multiple regional bank stocks were halted for hours following a news event. A momentum strategy that used raw prices during the halt window would have computed zero return (correct) but also zero volatility contribution (incorrect — the market was pricing in significant news during the halt).

The Handling Standard

TickDB classifies trading interruptions into three severity tiers and handles each differently:

Tier	Trigger	Handling
T1: Full halt	Exchange-mandated trading halt	Exclude from returns; include in volume as zero; mark candle as `halted: true`
T2: Pause / limit state	Limit-up / limit-down	Use as-is but flag in metadata; apply `ignore_for_volatility: true`
T3: Thin period	Pre/post-market thin volume	Include in data but flag `session: 'extended'`

from dataclasses import dataclass
from enum import Enum
from typing import Optional
import pandas as pd

class HaltTier(Enum):
    FULL_HALT = 1
    LIMIT_STATE = 2
    THIN_PERIOD = 3

@dataclass
class HaltRecord:
    symbol: str
    start_time_utc: pd.Timestamp
    end_time_utc: pd.Timestamp
    tier: HaltTier
    reason: Optional[str] = None

class HaltAwarePriceProcessor:
    """Handles trading halts and pauses with appropriate data exclusions.
    
    Tier 1 (full halt): excluded from return calculations.
    Tier 2 (limit state): included but flagged for volatility treatment.
    Tier 3 (thin period): included with session flag.
    """

    def __init__(self, halt_records: list[HaltRecord]):
        self.halt_index = pd.DatetimeIndex([
            r.start_time_utc for r in halt_records
        ])

    def process_candle(self, candle: dict) -> dict:
        """Apply halt-aware processing to a single OHLCV candle."""
        candle_ts = pd.Timestamp(candle['timestamp_utc'])
        
        # Check for T1 halt overlap
        for record in self._get_overlapping_halts(candle_ts, HaltTier.FULL_HALT):
            # Replace volume with 0, mark as halted
            candle['volume'] = 0
            candle['halted'] = True
            candle['halt_reason'] = record.reason
            
        # Check for T2 limit state
        for record in self._get_overlapping_halts(candle_ts, HaltTier.LIMIT_STATE):
            candle['limit_state'] = True
            candle['ignore_for_volatility'] = True
            
        # Check for T3 thin period
        for record in self._get_overlapping_halts(candle_ts, HaltTier.THIN_PERIOD):
            candle['session'] = 'extended'
        
        return candle

    def _get_overlapping_halts(
        self,
        ts: pd.Timestamp,
        tier: HaltTier
    ) -> list[HaltRecord]:
        # Placeholder — in production this is an interval tree query
        return []

Dimension 5: Ticker Symbol Continuity

The Problem

Securities change tickers. A company renames itself (e.g., Facebook → Meta), spins off a subsidiary, merges with another entity, or migrates from one exchange to another. If your backtest uses raw ticker strings, you will encounter missing data windows that are not missing — they are the same company under a different name.

The Continuity Standard

TickDB maintains a ticker mapping table that links all historical ticker variants to a single canonical symbol. The canonical symbol is the current active ticker. All historical OHLCV data is back-filled and linked to this canonical symbol.

Event	Pre-event ticker	Post-event ticker	Canonical
Renaming	FB	META	META
Merger	TCO	WELL	WELL
Exchange transfer	BRK.A	BRK.A	BRK.A
Spin-off	UPS	UPS + XPO	UPS (parent), XPO (child)

When a spin-off occurs, both the parent and child get independent canonical symbols with complete historical records. The parent retains its pre-spin-off history; the child is created fresh at the spin-off date.

The Pipeline in Production: End-to-End Flow

Putting all five dimensions together, the data cleaning pipeline follows this sequence:

Raw Exchange Feed
       ↓
[1] Timestamp Normalization → UTC conversion, DST handling
       ↓
[2] Ticker Symbol Continuity → Historical ticker → canonical mapping
       ↓
[3] Price Continuity (CAF) → Ex-rights adjustment factors applied
       ↓
[4] Volume Anomaly Detection → Z-score filtering, winsorization
       ↓
[5] Halt / Pause Handling → Tier classification, metadata flags
       ↓
Cleaned, Aligned Dataset

Each stage is idempotent and auditable. Every adjustment factor applied in Stage 3 is logged with a reason code. Every anomaly winsorized in Stage 4 is recorded with its Z-score. Every halt classified in Stage 5 carries a reference to the exchange announcement.

Comparison: Raw Data vs. Cleaned Data

Dimension	Raw data	Cleaned data
Price continuity	Breaks at every split — $500 then $125 for AAPL post-split	Continuous series — all prices adjusted to post-split nominal
Timestamp alignment	Mixed: some ET, some HKT, some UTC	All timestamps normalized to UTC with DST-aware session anchors
Volume anomalies	Unfiltered spikes from transcription errors	Winsorized at 3.5σ — preserves magnitude without distorting statistics
Trading halts	Silent — last price repeated, volume nonzero	Flagged by tier — halt periods excluded from return calculations
Ticker continuity	Historical gaps at ticker changes	Full history under canonical symbol — no gaps

Engineering Trade-offs and Known Limitations

No cleaning pipeline is perfect. The following limitations are documented:

CAF forward adjustment assumes no survivorship bias. If a company delists and its data is removed from the dataset, the CAF chain is broken for that security. Survivorship-bias-corrected datasets add a separate delisting return for securities that ceased trading before the dataset's end date.

Z-score anomaly detection is window-sensitive. In high-volatility regimes (e.g., March 2020), the rolling window expands the threshold adaptively, which may flag too many points. TickDB applies a minimum σ floor of 10% of the mean to prevent the threshold from collapsing during volatile periods.

Halt detection relies on exchange-published halt records. If an exchange fails to publish the halt record (a rare but documented occurrence), the system falls back to volume-zero detection — which is less reliable for T2 limit states where trading continues at the limit price.

Ticker continuity mapping is reviewed annually. Ticker changes that occur between reviews are applied retrospectively but with a lag. For the most current mapping, users should verify against the /v1/symbols/available endpoint.

Practical Implication for Backtesting

A backtest run on raw data is not a backtest — it is a simulation of a fictional market that never existed. The five cleaning dimensions described above are the minimum viable set to produce results that are worth trusting.

Concretely: a momentum strategy tested on unadjusted AAPL prices in 2020 would show a 70% drawdown on August 28, 2020 — not because the market crashed, but because a 4-for-1 stock split occurred. An overnight gap strategy run on NYSE data without halt handling would generate false signals from the 20-minute gap at the open following a circuit-breaker pause. A volatility-targeting strategy that uses unfiltered volume would systematically underexpose during post-halt sessions when volume should be zero.

These are not edge cases. They are the baseline conditions of every major market, every year.

Closing

The cleaning pipeline is the unsexy foundation that makes every strategy that comes after it legible. It converts a noisy, timezone-mismatched, split-contaminated, halt-ridden raw feed into a dataset where the only remaining variation is genuine market signal.

For quant researchers who want to focus on strategy design rather than data plumbing, TickDB's cleaned historical OHLCV data provides this pipeline as a managed service. The five dimensions — ex-rights adjustment, timestamp normalization, volume anomaly detection, halt handling, and ticker continuity — are all applied and auditable.

If you're building a backtest and spending time normalizing data instead of designing alpha, you are solving the wrong problem.

Next steps:

Sign up at tickdb.ai (free API key, no credit card required)
Query /v1/market/kline with any US equity symbol and inspect the adjusted_close field versus close
Verify timestamp alignment by comparing timestamp_utc against your local timezone expectations
Review the /v1/symbols/available endpoint to confirm ticker continuity mappings for your coverage universe

This article does not constitute investment advice. Markets involve risk; past performance does not guarantee future results.