What Does "Data Cleaning and Alignment" Actually Mean? A Five-Dimension Breakdown of How TickDB Processes Historical Market Data | API Guide

You pull your favorite stock's five-year daily OHLCV from a public data source. You run a moving average crossover backtest. The strategy returns 340% annualized. You are confident. You deploy capital.

Then you notice the price drops $0.82 on what looks like a random Tuesday. No earnings. No news. The stock simply went ex-dividend — and the raw data never adjusted for it.

This is the gap between raw financial data and backtest-ready financial data. Most retail-facing data vendors serve the raw form. TickDB invests its data engineering effort into the cleaned and aligned form.

"Cleaning and alignment" is shorthand for a multi-stage pipeline. This article decomposes that pipeline into five concrete dimensions. For each one, you will see the specific problem, the engineering approach TickDB applies, and the code-level logic you can reason about directly.

Dimension 1: Ex-Dividend and Split Adjustments — Restoring Price Continuity

The single most disruptive artifact in raw US equity price history is the unadjusted closing price. When a company pays a dividend or executes a stock split, the raw price series exhibits a discontinuity — not because the market moved, but because the corporate action changed the per-share denomination.

The Dividend Adjustment Mechanism

On the ex-dividend date, the stock price theoretically drops by the dividend amount. In raw data, this appears as a sudden gap. Left uncorrected, the gap understates returns for strategies that hold through the ex-date, because the price drop is booked as a loss while the dividend received is ignored, and it can trigger false breakdown or stop-loss signals on a day when nothing changed in the business.

TickDB applies a backward adjustment to the entire price series before the ex-dividend date. The formula is straightforward:

adjusted_close[t] = raw_close[t] - cumulative_dividend_from[t]

For a stock paying quarterly dividends over five years, the cumulative adjustment grows with each payment. A stock that paid $0.82 per share each quarter accumulates a $16.40 adjustment (20 payments × $0.82) against its oldest prices. Left unadjusted, closes from five years ago sit $16.40 above their adjusted values, so any return measured from them understates what a shareholder actually earned.
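The subtractive formula above can be sketched in a few lines. This is a minimal illustration, not TickDB's implementation: the function name, the `(date, close)` input shape, and the sample prices are all assumptions made for the example. It treats `cumulative_dividend_from[t]` as the sum of every dividend whose ex-date falls strictly after `t`.

```python
from datetime import date

def backward_adjust_dividends(closes, dividends):
    """Subtract from each close all dividends with an ex-date after it.

    closes: list of (date, raw_close), sorted ascending (hypothetical shape).
    dividends: dict mapping ex-dividend date -> cash amount per share.
    Returns a list of (date, adjusted_close).
    """
    adjusted = []
    cumulative = 0.0
    ex_dates = sorted(dividends, reverse=True)
    i = 0
    # Walk backward in time so the running total always holds every
    # dividend whose ex-date is strictly later than the current close.
    for day, raw_close in reversed(closes):
        while i < len(ex_dates) and ex_dates[i] > day:
            cumulative += dividends[ex_dates[i]]
            i += 1
        adjusted.append((day, round(raw_close - cumulative, 4)))
    adjusted.reverse()
    return adjusted

# Illustrative prices: the raw series gaps down $0.82 on the ex-date
closes = [
    (date(2026, 3, 9), 100.00),
    (date(2026, 3, 10), 99.18),  # ex-dividend date
    (date(2026, 3, 11), 99.50),
]
dividends = {date(2026, 3, 10): 0.82}

adjusted = backward_adjust_dividends(closes, dividends)
for day, close in adjusted:
    print(day.isoformat(), close)
```

In the adjusted series the close before the ex-date becomes 99.18, so the $0.82 gap disappears while prices on and after the ex-date are untouched.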

The Stock Split Mechanism

A 2-for-1 forward split at a price of $100 creates a $50 opening the next day. Raw historical prices show an artificial 50% drop. Backward-adjusted prices divide all pre-split closes by 2, making the series continuous.

The split adjustment formula:

adjusted_close[t] = raw_close[t] / split_ratio[t]

For a 3-for-1 split: every historical close is divided by 3. The resulting series reflects the price in post-split shares, which is what matters for strategy logic that compares entries to exits in real dollar terms.

Why This Matters for Backtesting

Consider a long-short strategy that ranks stocks by 52-week price momentum. If one stock in the universe split 3-for-1 last month, its raw momentum reading is meaningless: its raw price will appear to have fallen by roughly two-thirds relative to unsplit peers. Backward adjustment eliminates this artifact before the signal is computed.
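To see the size of the artifact, here is a small sketch comparing momentum on raw and split-adjusted closes. The helper names and the five-point price series are hypothetical, and momentum is simplified to last close over first close.

```python
def split_adjust(closes, split_index, ratio):
    """Divide every close before split_index by ratio (3.0 for a 3-for-1 split)."""
    return [c / ratio if i < split_index else c for i, c in enumerate(closes)]

def momentum(closes):
    """Simplified price momentum: last close relative to first close, minus 1."""
    return closes[-1] / closes[0] - 1.0

# Illustrative series: a 3-for-1 split takes effect at index 3
raw = [300.0, 306.0, 312.0, 104.0, 105.0]
adj = split_adjust(raw, split_index=3, ratio=3.0)

raw_mom = momentum(raw)
adj_mom = momentum(adj)
print(f"raw momentum: {raw_mom:+.2%}")
print(f"adjusted momentum: {adj_mom:+.2%}")
```

The raw series reports a 65% loss for a stock that actually gained 5%, which would push it to the bottom of any momentum ranking.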

TickDB sources its corporate action data from official exchange filings and validated third-party reference feeds, cross-referencing the ex-dividend and split dates against multiple providers to avoid mis-dating.

Dimension 2: Timestamp Normalization — One Time, One Zone, Always UTC

Raw data from global markets arrives with timestamps in at least four different conventions:

Market	Raw timestamp convention
US equities	Exchange native (ET), sometimes rounded to exchange session time
Hong Kong equities	HKT (UTC+8)
Crypto spot	UTC with millisecond precision
Futures	Exchange-defined session windows

When you pull OHLCV candles from multiple markets, a careless merge creates a 12- or 13-hour alignment error for US-HK pairs, depending on whether New York is on daylight or standard time. The candle that closes at "16:00" in your DataFrame means 4:00 PM ET for US stocks and 4:00 PM HKT for HK stocks, and 4:00 PM HKT is 3:00 AM (standard time) or 4:00 AM (daylight time) in New York.
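The offset is easy to check with Python's standard `zoneinfo` module. The sketch below converts the same wall-clock "16:00" in New York and Hong Kong to UTC; the helper name `to_utc` and the chosen dates are just for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_utc(naive_local: datetime, tz_name: str) -> datetime:
    """Interpret a naive wall-clock time in tz_name and convert it to UTC."""
    return naive_local.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)

# The same "16:00" label on the same calendar day, two different instants
us_close = to_utc(datetime(2026, 1, 15, 16, 0), "America/New_York")
hk_close = to_utc(datetime(2026, 1, 15, 16, 0), "Asia/Hong_Kong")
gap_hours = (us_close - hk_close).total_seconds() / 3600

# During US daylight saving time the New York close moves one hour earlier in UTC
us_close_summer = to_utc(datetime(2026, 7, 15, 16, 0), "America/New_York")

print(us_close.isoformat(), hk_close.isoformat(), gap_hours)
print(us_close_summer.isoformat())
```

In January the two closes are 13 hours apart (21:00 UTC versus 08:00 UTC); in July the New York close lands at 20:00 UTC, which narrows the gap to 12 hours.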

TickDB's Normalization Approach

TickDB converts every timestamp to UTC as the canonical storage format. All kline endpoint responses return timestamps in UTC, and the API documentation specifies this explicitly.

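For US equities, the conversion accounts for daylight saving time, so the UTC window shifts by one hour between standard and daylight time:

Regular trading session: 14:30–21:00 UTC during standard time, 13:30–20:00 UTC during daylight time (9:30 AM–4:00 PM ET)
Pre-market: ends at the regular open (14:30 UTC standard, 13:30 UTC daylight)
After-hours: begins at the regular close (21:00 UTC standard, 20:00 UTC daylight), with extended hours data available from the depth channel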

For HK equities, the regular session maps to 01:30–08:00 UTC.

This means a correlation analysis between NVDA and BTC needs no manual TZ handling — the candles align by construction.

import requests
import os
from datetime import datetime, timezone

API_KEY = os.environ.get("TICKDB_API_KEY")

def fetch_aligned_candles(symbol: str, interval: str, limit: int = 500):
    """
    Fetch kline data and confirm UTC normalization.
    Timestamps in response are UTC ISO-8601 strings.
    """
    headers = {"X-API-Key": API_KEY}
    params = {"symbol": symbol, "interval": interval, "limit": limit}

    response = requests.get(
        "https://api.tickdb.ai/v1/market/kline",
        headers=headers,
        params=params,
        timeout=(3.05, 10)
    )

    if response.status_code != 200:
        raise RuntimeError(f"Kline request failed: {response.status_code}")

    data = response.json()
    if data.get("code") != 0:
        raise RuntimeError(f"API error {data.get('code')}: {data.get('message')}")

    candles = data["data"]

    # Verify UTC normalization by parsing returned timestamps
    for candle in candles[-3:]:  # Check last 3 candles
        ts_str = candle["ts"]  # UTC ISO-8601
        dt_utc = datetime.fromisoformat(ts_str.replace("Z", "+00:00"))
        print(f"{symbol} | {interval} | {dt_utc.isoformat()} | close={candle['c']}")

    return candles

# Example usage
fetch_aligned_candles("AAPL.US", "1h", limit=10)

The output confirms UTC normalization:

AAPL.US | 1h | 2026-04-10T19:00:00+00:00 | close=182.34
AAPL.US | 1h | 2026-04-10T20:00:00+00:00 | close=182.87
AAPL.US | 1h | 2026-04-10T21:00:00+00:00 | close=183.12

Every candle timestamp is a UTC instant. Merging two symbols' candles into a single DataFrame on the ts column is safe — no session-offset arithmetic required.
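As a rough illustration of that merge without pandas, the sketch below inner-joins two candle lists on their parsed UTC timestamps. The `ts` and `c` field names follow the response shape used earlier; the function names and all prices are made up for the example.

```python
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    """Parse a UTC ISO-8601 string, accepting a trailing 'Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def inner_join_on_ts(candles_a, candles_b):
    """Pair closes from two candle lists that share the same UTC instant."""
    closes_b = {parse_ts(c["ts"]): c["c"] for c in candles_b}
    joined = []
    for candle in candles_a:
        instant = parse_ts(candle["ts"])
        if instant in closes_b:
            joined.append((instant, candle["c"], closes_b[instant]))
    return joined

# Illustrative candles only; prices are invented
nvda = [
    {"ts": "2026-04-10T19:00:00Z", "c": 110.0},
    {"ts": "2026-04-10T20:00:00Z", "c": 111.0},
]
btc = [
    {"ts": "2026-04-10T18:00:00Z", "c": 70000.0},
    {"ts": "2026-04-10T19:00:00Z", "c": 70100.0},
    {"ts": "2026-04-10T20:00:00Z", "c": 70250.0},
    {"ts": "2026-04-10T21:00:00Z", "c": 70200.0},
]

joined = inner_join_on_ts(nvda, btc)
for instant, nvda_close, btc_close in joined:
    print(instant.isoformat(), nvda_close, btc_close)
```

Joining on parsed datetimes rather than raw strings keeps the match correct even if one feed writes `Z` and another writes `+00:00`.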

Dimension 3: Outlier Detection — Identifying Bad Prints Without Destroying Signal

Raw market data contains bad prints — single price prints caused by exchange errors, fat-finger cancellations, or momentary liquidity vacuums. A stock trading at $150 might momentarily print at $149.30 or $151.10 due to a thin order book on a millisecond timescale. Detecting and flagging these is the third dimension of data cleaning.

Statistical Detection Approach

TickDB applies a rolling-window z-score filter. For each symbol, the pipeline computes a rolling mean and standard deviation over a configurable lookback window (default: 20 periods for daily data, 60 periods for intraday), then flags candles where the close deviates more than k standard deviations from the rolling mean.

z_score = (close - rolling_mean) / rolling_std
flagged = abs(z_score) > threshold

The threshold varies by asset class:

Asset class	Typical threshold (kσ)	Rationale
Large-cap US equity	4.0	High liquidity; price noise is low
Small-cap US equity	3.0	Wider natural spread; less liquidity
Crypto spot	3.5	24/7 trading; no open/close gaps

Why Context Adjusts the Threshold

A 5% price move in a micro-cap stock with a $50M market cap is less anomalous than a 5% move in a $2T mega-cap. TickDB's pipeline cross-references intraday volume alongside price deviation: a large price move accompanied by proportionally large volume is signal; the same move without volume confirmation is more likely a bad print.
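One way to express the volume-confirmation idea is a small classifier like the sketch below. The labels, the function name, and the `volume_multiple=2.0` cutoff are assumptions for illustration; the article does not state the rule TickDB actually uses.

```python
def classify_move(z_score, volume, median_volume, z_threshold=4.0, volume_multiple=2.0):
    """Label a candle by price deviation and volume confirmation.

    'normal'  : price deviation within the z-score threshold
    'signal'  : large deviation backed by unusually heavy volume
    'suspect' : large deviation on ordinary volume, a possible bad print
    """
    if abs(z_score) <= z_threshold:
        return "normal"
    if median_volume > 0 and volume >= volume_multiple * median_volume:
        return "signal"
    return "suspect"

# Illustrative inputs: the same 5.2-sigma move with and without heavy volume
print(classify_move(5.2, volume=3_600_000, median_volume=1_200_000))
print(classify_move(5.2, volume=1_100_000, median_volume=1_200_000))
print(classify_move(1.0, volume=1_100_000, median_volume=1_200_000))
```

A three-way label keeps the decision with the researcher: "suspect" candles are candidates for review, not automatic deletions.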

Practical Outlier Handling for Quants

Here is a production-grade detection routine that quant researchers can adapt:

import numpy as np
import pandas as pd

def detect_price_outliers(candles: list[dict], window: int = 20, threshold: float = 4.0) -> list[dict]:
    """
    Flag candles where the close deviates beyond `threshold` standard deviations
    from a rolling mean. Returns flagged candles with z-score metadata.

    Args:
        candles: List of OHLCV dicts with 'ts', 'o', 'h', 'l', 'c', 'v'
        window: Lookback period for rolling statistics
        threshold: Z-score threshold in standard deviations

    Returns:
        List of flagged candles with diagnostic metadata
    """
    if len(candles) < window + 1:
        raise ValueError(f"Need at least {window + 1} candles for outlier detection")

    df = pd.DataFrame(candles)

    # Compute rolling statistics
    rolling_mean = df["c"].rolling(window=window, min_periods=window).mean().shift(1)
    rolling_std = df["c"].rolling(window=window, min_periods=window).std().shift(1)

    # Z-score with epsilon guard against division by zero
    z_scores = (df["c"] - rolling_mean) / (rolling_std + 1e-8)

    # Flagged candles
    flagged_indices = df.index[abs(z_scores) > threshold].tolist()

    results = []
    for idx in flagged_indices:
        results.append({
            "ts": df.at[idx, "ts"],
            "close": df.at[idx, "c"],
            "z_score": round(z_scores.at[idx], 3),
            "rolling_mean": round(rolling_mean.at[idx], 4),
            "rolling_std": round(rolling_std.at[idx], 4),
            "deviation_bps": round((df.at[idx, "c"] - rolling_mean.at[idx]) / rolling_mean.at[idx] * 10000, 1)
        })

    return results

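# Example diagnostic output
# Closes cycle through a small deterministic range so the rolling std is non-zero
sample_candles = [
    {"ts": f"2026-04-{d:02d}T16:00:00Z", "o": 150.0, "h": 151.2, "l": 149.5,
     "c": round(150.0 + 0.1 * (d % 5), 2), "v": 1_200_000}
    for d in range(1, 30)
]
# Manually inject an anomaly after the 20-candle warm-up. Candles inside the
# warm-up have no rolling statistics (NaN z-score) and can never be flagged.
sample_candles[24]["c"] = 158.3  # ~5.4% spike, should flag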

outliers = detect_price_outliers(sample_candles, window=20, threshold=4.0)
for o in outliers:
    print(f"{o['ts']} | close={o['close']} | z={o['z_score']} | deviation={o['deviation_bps']} bps")

For strategy backtesting, flagged candles should not be deleted — they should be reviewed. In some strategies, a genuine outlier is the signal (a liquidity vacuum event). The pipeline flags them for human review; the quant researcher decides whether to include or exclude.

Dimension 4: OHLCV Consistency — Enforcing the High/Low Boundary

A subtle class of data quality issues involves internal OHLCV inconsistencies — where the reported high is below the open or close, or the low is above them. Different data vendors occasionally produce conflicting OHLCV records for the same instrument and time window, creating silent errors that corrupt volatility calculations and indicator outputs.

Three Consistency Rules

TickDB validates three constraints on every OHLCV record:

Rule 1 — High Boundary:

H >= max(O, C) and H >= L

Rule 2 — Low Boundary:

L <= min(O, C) and L <= H

Rule 3 — Volume Sanity:

CV > 0
V >= 0  (negative volume is never valid)
|V - rolling_median_V| < k * rolling_MAD  (contextual sanity check)

MAD stands for Median Absolute Deviation, which is more robust to outliers than standard deviation for volume sanity checks.
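A minimal sketch of the contextual volume check, using only the standard library. The window length, the `k=5.0` multiplier, and the function name are assumptions chosen for the example rather than TickDB's settings.

```python
from statistics import median

def mad_volume_flags(volumes, window=20, k=5.0):
    """Flag volumes that sit k or more MADs away from the trailing median.

    Uses only the previous `window` observations, so the value under test
    never contaminates its own baseline.
    """
    flags = []
    for i, v in enumerate(volumes):
        history = volumes[max(0, i - window):i]
        if len(history) < window:
            flags.append(False)  # not enough history to judge
            continue
        med = median(history)
        mad = median(abs(x - med) for x in history)
        # A zero MAD (perfectly constant history) gives no usable scale; skip it
        flags.append(mad > 0 and abs(v - med) >= k * mad)
    return flags

# Illustrative volumes with one injected spike
volumes = [1_000_000 + 10_000 * (i % 5) for i in range(25)]
volumes[22] = 9_000_000

flags = mad_volume_flags(volumes)
print([i for i, flagged in enumerate(flags) if flagged])
```

Because both the center and the scale are medians, the spike at index 22 barely moves the baseline used for the candles that follow it, which is the robustness property described above.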

Cross-Source Disagreement Detection

When TickDB ingests data from multiple providers for the same symbol, it runs a cross-validation pass. For any candle where two sources disagree on the high by more than ε basis points, the pipeline flags a discrepancy for human resolution rather than silently picking one source.
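The cross-validation pass might look something like the sketch below, which compares highs from two sources candle by candle. The function name, the midpoint-based basis-point calculation, and the 5 bps default for ε are illustrative assumptions; the article does not specify the tolerance TickDB uses.

```python
def high_discrepancies(source_a, source_b, epsilon_bps=5.0):
    """Return candles whose highs differ between sources by more than epsilon_bps."""
    highs_b = {c["ts"]: c["h"] for c in source_b}
    flagged = []
    for candle in source_a:
        other_high = highs_b.get(candle["ts"])
        if other_high is None:
            continue  # candle missing from source B; coverage gaps are a separate check
        midpoint = (candle["h"] + other_high) / 2
        diff_bps = abs(candle["h"] - other_high) / midpoint * 10_000
        if diff_bps > epsilon_bps:
            flagged.append({
                "ts": candle["ts"],
                "high_a": candle["h"],
                "high_b": other_high,
                "diff_bps": round(diff_bps, 2),
            })
    return flagged

# Illustrative data: the second candle disagrees by roughly 39 bps
provider_a = [
    {"ts": "2026-04-09T16:00:00Z", "h": 101.50},
    {"ts": "2026-04-10T16:00:00Z", "h": 102.00},
]
provider_b = [
    {"ts": "2026-04-09T16:00:00Z", "h": 101.51},
    {"ts": "2026-04-10T16:00:00Z", "h": 102.40},
]

discrepancies = high_discrepancies(provider_a, provider_b)
for d in discrepancies:
    print(d)
```

The function only reports disagreements; choosing which source to trust is left to a reviewer, in line with the flag-for-resolution approach described above.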

import numpy as np
import pandas as pd

def validate_ohlcv_consistency(candles: list[dict]) -> dict:
    """
    Validate OHLCV records against three consistency rules.

    Returns a diagnostic dict with counts of violations.
    """
    df = pd.DataFrame(candles)

    violations = {
        "high_below_open_or_close": 0,
        "low_above_open_or_close": 0,
        "high_below_low": 0,
        "negative_volume": 0,
        "zero_volume": 0,
        "flagged_records": []
    }

    for idx, row in df.iterrows():
        h, l, o, c, v = row["h"], row["l"], row["o"], row["c"], row["v"]

        if h < max(o, c) or h < l:
            violations["high_below_open_or_close"] += 1
            violations["flagged_records"].append({
                "ts": row["ts"], "type": "high_boundary", "values": {"h": h, "l": l, "o": o, "c": c}
            })

        if l > min(o, c) or l > h:
            violations["low_above_open_or_close"] += 1
            violations["flagged_records"].append({
                "ts": row["ts"], "type": "low_boundary", "values": {"h": h, "l": l, "o": o, "c": c}
            })

        if h < l:
            violations["high_below_low"] += 1
            violations["flagged_records"].append({
                "ts": row["ts"], "type": "high_below_low", "values": {"h": h, "l": l}
            })

        if v < 0:
            violations["negative_volume"] += 1

        if v == 0:
            violations["zero_volume"] += 1

    return {
        "total_candles": len(df),
        "total_violations": sum(v for k, v in violations.items() if k != "flagged_records"),
        "violations": violations
    }

# Example: validate a synthetic dataset with one bad record
test_candles = [
    {"ts": f"2026-04-{d:02d}T16:00:00Z", "o": 100.0, "h": 101.5, "l": 99.8, "c": 100.3, "v": 500_000}
    for d in range(1, 25)
]
test_candles[10] = {"ts": "2026-04-11T16:00:00Z", "o": 100.0, "h": 99.5, "l": 100.2, "c": 99.8, "v": 300_000}

result = validate_ohlcv_consistency(test_candles)
print(f"Candles checked: {result['total_candles']}")
print(f"Total violations: {result['total_violations']}")
for flag in result["violations"]["flagged_records"]:
    print(f"  {flag['ts']} | {flag['type']} | values={flag['values']}")

The output confirms the bad record is caught:

Candles checked: 24
Total violations: 3
  2026-04-11T16:00:00Z | high_boundary | values={'h': 99.5, 'l': 100.2, 'o': 100.0, 'c': 99.8}
  2026-04-11T16:00:00Z | low_boundary | values={'h': 99.5, 'l': 100.2, 'o': 100.0, 'c': 99.8}
  2026-04-11T16:00:00Z | high_below_low | values={'h': 99.5, 'l': 100.2}

In practice, OHLCV inconsistencies are rare in TickDB-sourced data, but checking for them before running a backtest costs almost nothing and protects against silent strategy corruption.

Dimension 5: Cross-Market Alignment — Session Boundaries and Overnight Gaps

The fifth and most systemically underestimated dimension is cross-market alignment — ensuring that data from different asset classes, traded across different session windows, can be compared, merged, and correlated without introducing non-economic artifacts.

The Problem

US equities trade from 9:30 AM to 4:00 PM ET, with pre-market and after-hours extensions. Crypto trades 24 hours a day, 365 days a year. A naïve hourly candle merge between BTC and AAPL leaves roughly 17 BTC candles per trading day (24 minus the seven that overlap the 6.5-hour regular session) with no economic counterpart in the AAPL series.

For correlation studies, this matters directly: the correlation coefficient between two assets is sensitive to the number of data points and the presence of uncorrelated noise. Excluding overnight gaps from the comparison window changes the computed correlation.

TickDB's Session-Aligned Resampling

TickDB normalizes candles to UTC session-aligned windows for cross-market work. The approach:

Convert all timestamps to UTC.
Map each market to its standard session window (in UTC).
Resample non-session periods to null rather than interpolating.
Align on the trading session, not the calendar day.
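The steps above can be sketched with the standard library. The `SESSIONS` table and both function names are hypothetical, and the sketch deliberately ignores exchange holidays, half-days, and the Hong Kong lunch break, all of which a real pipeline would need to handle.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

# Hypothetical session table: local time zone, regular open, regular close
SESSIONS = {
    "US": (ZoneInfo("America/New_York"), time(9, 30), time(16, 0)),
    "HK": (ZoneInfo("Asia/Hong_Kong"), time(9, 30), time(16, 0)),
}

def in_regular_session(ts_utc: datetime, market: str) -> bool:
    """True if a UTC instant falls inside the market's regular weekday session."""
    tz, open_t, close_t = SESSIONS[market]
    local = ts_utc.astimezone(tz)  # the zone database handles daylight saving
    if local.weekday() >= 5:       # Saturday or Sunday
        return False
    return open_t <= local.time() < close_t

def mask_to_session(candles, market):
    """Replace out-of-session candles with None instead of interpolating."""
    masked = []
    for candle in candles:
        ts = datetime.fromisoformat(candle["ts"].replace("Z", "+00:00"))
        masked.append(candle if in_regular_session(ts, market) else None)
    return masked

# Illustrative candles: 13:45 UTC is pre-market in January, 15:00 UTC is in session
candles = [
    {"ts": "2026-01-15T13:45:00Z", "c": 100.0},
    {"ts": "2026-01-15T15:00:00Z", "c": 100.4},
]
masked = mask_to_session(candles, "US")
print(masked)
```

Defining the session in local exchange time and converting each UTC instant back means the same table works across daylight saving changes, with no hard-coded UTC windows.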

For a correlation study between BTC and AAPL:

BTC hourly candles:    [00:00 UTC] [01:00] [02:00] ... [13:00] ... [23:00] (continuous)
AAPL hourly candles:   [null] [null] ... [14:00] [15:00] [16:00] ... [20:00] (US session only, standard time)
                        ↑ US pre-market        ↑ US regular session       ↑ US after-hours

The null periods are excluded from the correlation computation rather than filled with interpolated values that would introduce phantom signal.
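A small sketch of a correlation that skips null slots rather than filling them. The function names and return values are illustrative; each list is assumed to hold one return per aligned UTC hour, with `None` where the market was closed.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

def session_aligned_correlation(returns_a, returns_b):
    """Correlate two aligned return series, dropping slots where either is None."""
    pairs = [(a, b) for a, b in zip(returns_a, returns_b)
             if a is not None and b is not None]
    if len(pairs) < 3:
        raise ValueError("Need at least 3 overlapping observations")
    xs, ys = zip(*pairs)
    return pearson(xs, ys)

# Illustrative hourly returns; None marks hours with no AAPL session
btc_returns = [0.001, -0.002, 0.003, 0.001, -0.001, 0.002]
aapl_returns = [None, None, 0.002, 0.002, -0.001, None]

corr = session_aligned_correlation(btc_returns, aapl_returns)
print(round(corr, 4))
```

Only the three hours where both assets traded enter the calculation; the BTC returns from the closed-market hours are dropped instead of being paired with invented AAPL values.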

Overnight Gap Marking for Strategy Logic

Some strategies — particularly those trading mean reversion or overnight momentum — depend on the overnight return. TickDB's pipeline marks candles that span the overnight close-to-open transition, enabling strategies to compute overnight returns cleanly:

def compute_session_returns(candles: list[dict]) -> list[dict]:
    """
    Compute overnight and intraday session returns from aligned candles.
    Overnight return = (session_open - prior_session_close) / prior_session_close
    Intraday return = (session_close - session_open) / session_open
    """
    returns = []
    for i in range(1, len(candles)):
        prev = candles[i - 1]
        curr = candles[i]

        # Overnight return: from prior close to current session open
        overnight_ret = (curr["o"] - prev["c"]) / prev["c"]

        # Intraday return: open to close within session
        intraday_ret = (curr["c"] - curr["o"]) / curr["o"]

        returns.append({
            "ts_open": curr["ts"],
            "overnight_return_bps": round(overnight_ret * 10000, 2),
            "intraday_return_bps": round(intraday_ret * 10000, 2)
        })

    return returns

This function depends entirely on timestamp alignment being correct. A one-hour time zone error would convert the overnight return into an "intraday" return and corrupt any strategy that trades on overnight momentum.

The Engineering Stakes: Why Every Dimension Matters for Backtesting

Each of the five dimensions — dividend/split adjustment, timestamp normalization, outlier detection, OHLCV consistency, and session alignment — maps directly to a category of backtest failure modes:

Dimension	Backtest failure mode it prevents
Dividend/split adjustment	Artificially inflated momentum returns; incorrect Sharpe and win rate
Timestamp normalization	Cross-asset correlation computed on misaligned windows; phantom arbitrage signals
Outlier detection	Strategy overfits to bad prints; equity curve spikes that never repeat
OHLCV consistency	Silent corruption of volatility estimates; wrong high/low for breakout strategies
Session alignment	Overnight return attributed to intraday signal; session boundaries conflated with trend

A strategy that returns 340% on raw data and 12% on cleaned data is not a trading system — it is a data artifact. The cleaning pipeline is not a luxury feature. It is the boundary between simulation and reality.

Accessing TickDB's Cleaned Historical Data

TickDB's kline endpoint delivers pre-adjusted OHLCV data for US equities, HK equities, crypto, forex, and other asset classes. Historical coverage for US equities spans 10+ years of cleaned and aligned daily candles, suitable for cross-cycle backtesting.

import os
import requests

API_KEY = os.environ.get("TICKDB_API_KEY")

response = requests.get(
    "https://api.tickdb.ai/v1/market/kline",
    headers={"X-API-Key": API_KEY},
    params={"symbol": "SPY.US", "interval": "1d", "limit": 1000},
    timeout=(3.05, 10)
)

data = response.json()
if data["code"] == 0:
    print(f"Retrieved {len(data['data'])} daily candles for SPY.US")
    print(f"Latest close: {data['data'][-1]['c']} at {data['data'][-1]['ts']}")

For full historical access — 10+ years of cleaned US equity OHLCV for cross-cycle strategy validation — institutional plans are available via enterprise@tickdb.ai.

Key Takeaways

Dividend and split adjustments preserve price continuity across corporate actions. Without backward adjustment, momentum strategies will misstate returns and mean reversion strategies will misjudge drawdowns, because the gaps they react to never affected an actual shareholder.
UTC timestamp normalization is the foundation for every cross-market analysis. Merging US equity and crypto data without UTC alignment introduces a systematic correlation error.
Context-aware outlier detection (z-score with volume confirmation) flags bad prints without destroying genuine microstructure signals. Deleting outliers blindly is a different error.
OHLCV consistency validation catches internal contradictions in price records. A high below the low is always a data error, never a market event.
Session-aligned resampling prevents overnight gaps from contaminating intraday return calculations and enables clean cross-asset correlation studies.

The five dimensions above are not independent. They form a pipeline — output from each stage becomes input to the next. The final product is a dataset you can reason about, backtest against, and deploy strategies on without a mental checklist of known data quality issues.

Next steps:

If you want to explore the data: sign up at tickdb.ai to get a free API key (no credit card required) and start pulling kline data immediately.
If you need 10+ years of historical OHLCV for backtesting: contact enterprise@tickdb.ai for institutional data plans covering US equities, HK equities, and crypto.
If you use AI coding assistants: search for and install the tickdb-market-data SKILL in your AI tool's marketplace for integrated data access in your workflow.

This article does not constitute investment advice. Markets involve risk; past performance does not guarantee future results. All backtest results are historical simulations and reflect assumptions that may not hold in live trading conditions.