The first rule of quantitative research: garbage in, garbage out. Many quant traders who have watched a backtest look spectacular on paper and then hemorrhage money in live trading have traced the problem back to the same root cause: dirty data.
The specific failures are often hideously mundane. A dividend-adjusted close that was actually adjusted using the wrong split ratio. A 9:30:00 AM timestamp that actually represents 9:30:00 Pacific time in the summer when New York is on Eastern Daylight Time. A single corrupted tick that flips a volatility calculation by an order of magnitude. These are not exotic failure modes. They are the quiet default of raw market data.
This article dissects what happens inside TickDB's data pipeline before any OHLCV bar ever reaches your backtesting engine. It covers five distinct dimensions of cleaning and alignment: price adjustment standards, timestamp normalization, timezone handling, outlier detection, and cross-venue synchronization. For each dimension, we show what raw data looks like, what the cleaned output looks like, and — where relevant — the specific algorithm or standard TickDB applies.
The goal is not to sell you on TickDB's data quality. It is to give you a technical framework for evaluating any market data vendor's cleaning pipeline. If you finish this article and decide to build your own cleaning pipeline, this article has done its job.
Why Data Cleaning Is Non-Negotiable for Backtesting
Before examining the five dimensions, it is worth establishing why cleaning matters so disproportionately in quantitative work.
A backtester consumes a time series. It computes returns, drawdowns, Sharpe ratios, and factor exposures. Every downstream calculation is a function of the upstream data. If the input contains timestamp errors, the rolling window calculations produce wrong window boundaries. If the input contains unadjusted prices, a 4:1 stock split appears as a -75% one-day return. If the input contains outliers, your volatility estimate becomes unstable.
These errors compound. An incorrect volatility estimate feeds into position sizing, which feeds into max drawdown, which feeds into your entire risk management framework. A single bad bar can invalidate an entire strategy's risk profile.
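To make the compounding concrete, here is a minimal sketch using made-up prices (not TickDB output): a quiet series that ends with an unadjusted 4:1 split. The helper name `simple_returns` and every value are illustrative assumptions.

```python
import statistics


def simple_returns(closes):
    """Return simple one-period returns for a list of closing prices."""
    return [curr / prev - 1.0 for prev, curr in zip(closes, closes[1:])]


# Hypothetical closes: the last bar is the first session after a 4:1 split.
raw_closes = [400.0, 402.0, 398.0, 401.0, 100.5]
split_ratio = 4.0

# Backward-adjust every bar before the split so the series is continuous.
adjusted_closes = [c / split_ratio for c in raw_closes[:-1]] + [raw_closes[-1]]

raw_rets = simple_returns(raw_closes)
adj_rets = simple_returns(adjusted_closes)

# Sample standard deviation of returns as a crude volatility estimate.
raw_vol = statistics.stdev(raw_rets)
adj_vol = statistics.stdev(adj_rets)

print(f"last raw return:      {raw_rets[-1]:.2%}")
print(f"last adjusted return: {adj_rets[-1]:.2%}")
print(f"raw / adjusted volatility: {raw_vol / adj_vol:.0f}x")
```

Any position-sizing rule that scales inversely with volatility would inherit the inflated raw estimate, which is how a single bad bar propagates into the risk numbers.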
The financial industry has developed informal standards for what "clean" data means, but these standards are rarely documented, often contradictory, and almost never consistent across vendors. TickDB's pipeline attempts to codify these standards into explicit, auditable processing steps.
Dimension 1: Price Adjustment (Split and Dividend Alignment)
The Problem with Raw Corporate Action Data
Corporate actions (stock splits, reverse splits, dividends, spin-offs, and rights offerings) create discontinuities in price series that have nothing to do with market supply and demand. A 3:1 stock split divides the share price by three while multiplying the share count by three. If your backtest does not account for this, it sees a 66.7% one-day loss that no shareholder actually experienced.
Raw price data from exchanges arrives unadjusted. The exchange records what actually traded. This is correct behavior for the exchange — they are recording historical truth. But for backtesting, adjusted data is almost always the correct choice, because you are testing a strategy's logic, not its ability to survive phantom price discontinuities.
The Adjustment Standard: Which Prices Get Adjusted
TickDB applies backward-adjusted prices for historical OHLCV (kline) data. This means that all historical bars are adjusted to reflect the current capital structure. When you look at a price bar from 2019, the prices in that bar have been adjusted downward to account for splits and dividends that occurred after 2019.
The practical consequence: if you compute simple returns between adjacent bars using adjusted data, you get returns with the mechanical effect of corporate actions removed. A $1.00 dividend paid by a $50 stock produces a price drop of approximately $1.00 on the ex-dividend date in the raw data. In adjusted data, every bar before the ex-dividend date is scaled down by the adjustment factor (here 1 - 1.00/50 = 0.98), so the ex-date drop no longer shows up as a loss. Returns computed from adjusted prices therefore approximate total return, as if the dividend had been reinvested. That is usually the right behavior for testing price-based strategies, which should not be penalized for a payout the holder actually received.
Split Ratio Application
For stock splits, TickDB applies the split ratio uniformly across all fields that represent per-share quantities:
```text
adjusted_open   = raw_open   / split_ratio
adjusted_high   = raw_high   / split_ratio
adjusted_low    = raw_low    / split_ratio
adjusted_close  = raw_close  / split_ratio
adjusted_volume = raw_volume * split_ratio
```
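Volume is multiplied by the split ratio (not divided) because pre-split bars are restated in post-split share units: after a 2:1 split each old share corresponds to two new shares, so a pre-split volume of 10 million shares becomes 20 million. Price times volume (traded value) is unchanged by the adjustment.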
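The rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TickDB's implementation: the `Bar` type, the function name `backward_adjust_for_split`, and the sample bars are all hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date


@dataclass(frozen=True)
class Bar:
    day: date
    open: float
    high: float
    low: float
    close: float
    volume: float


def backward_adjust_for_split(bars, effective_date, split_ratio):
    """Rescale every bar dated before `effective_date`.

    Prices are divided by the ratio and volume is multiplied by it, so
    pre-split bars end up expressed in post-split share units.
    """
    adjusted = []
    for bar in bars:
        if bar.day < effective_date:
            bar = replace(
                bar,
                open=bar.open / split_ratio,
                high=bar.high / split_ratio,
                low=bar.low / split_ratio,
                close=bar.close / split_ratio,
                volume=bar.volume * split_ratio,
            )
        adjusted.append(bar)
    return adjusted


# Hypothetical bars around a 2:1 split effective on 2024-03-15.
bars = [
    Bar(date(2024, 3, 14), 123.00, 125.00, 122.00, 122.80, 10_000_000),
    Bar(date(2024, 3, 15), 61.20, 62.00, 60.50, 61.00, 21_000_000),
]
adjusted = backward_adjust_for_split(bars, date(2024, 3, 15), 2.0)
for bar in adjusted:
    print(bar.day, bar.close, bar.volume)
```

Bars on or after the effective date pass through untouched, which is what makes this a backward adjustment: the most recent prices always match what actually traded.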
Dividend Adjustment Factor
For dividends, TickDB calculates an adjustment factor for each ex-dividend date and applies it cumulatively:
```text
adjustment_factor = 1 - (dividend_per_share / close_price_before_ex_date)
```
Each historical bar is multiplied by the cumulative product of all subsequent adjustment factors. This is the standard practice in the industry, but the key subtlety is the denominator — TickDB uses the closing price immediately before the ex-dividend date, not the opening price on the ex-dividend date, to avoid incorporating the dividend-induced price drop itself into the denominator.
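As a worked illustration of the cumulative factor, the sketch below walks a short, made-up close series from newest to oldest. The function names and the index-keyed `dividends` mapping are assumptions chosen for brevity, not TickDB's data model.

```python
def dividend_factor(dividend_per_share, close_before_ex_date):
    """Adjustment factor for one dividend, using the close before the ex-date."""
    return 1.0 - dividend_per_share / close_before_ex_date


def backward_adjust_for_dividends(closes, dividends):
    """Backward-adjust closes (oldest first) for cash dividends.

    `dividends` maps the index of each ex-dividend bar to the dividend
    per share. Every bar before an ex-date is multiplied by that factor
    and by the factors of all later dividends.
    """
    adjusted = [0.0] * len(closes)
    cumulative = 1.0
    # Walk from newest to oldest so each bar picks up every later factor.
    for i in range(len(closes) - 1, -1, -1):
        adjusted[i] = closes[i] * cumulative
        if i in dividends and i > 0:
            # Bar i is the ex-date; the factor applies to bars before it.
            cumulative *= dividend_factor(dividends[i], closes[i - 1])
    return adjusted


# Hypothetical series: index 1 is the ex-date of a $1.00 dividend.
raw = [50.00, 49.20, 49.50]
adj = backward_adjust_for_dividends(raw, {1: 1.00})

raw_ex_return = raw[1] / raw[0] - 1.0
adj_ex_return = adj[1] / adj[0] - 1.0
print(adj, f"{raw_ex_return:.2%}", f"{adj_ex_return:.2%}")
```

Using the prior close in the denominator keeps the factor independent of how far the price actually fell on the ex-date, which matches the subtlety described above.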
Data Table: Split-Adjusted vs. Unadjusted Comparison
The following table illustrates the difference for a hypothetical 2:1 split on March 15, 2024:
| Date | Unadjusted Close | Split Ratio | Adjusted Close | Volume (Adj) |
|---|---|---|---|---|
| Mar 13, 2024 | $124.50 | 2.0 | $62.25 | 45,200,000 |
| Mar 14, 2024 | $122.80 | 2.0 | $61.40 | 38,100,000 |
| Mar 15, 2024 (split effective) | $61.00 | 1.0 | $61.00 | 52,300,000 |
| Mar 18, 2024 (post-split) | $61.50 | 1.0 | $61.50 | 104,600,000 |
Note that on March 15, 2024, the unadjusted close of $61.00 appears far lower than the previous day's $122.80, a 50.3% apparent drop, but the adjusted series shows a smooth continuation (from $61.40 to $61.00, about -0.65%) because the entire pre-split history has been scaled down by the split ratio.
Dimension 2: Timestamp Normalization
The Problem with Exchange Timestamps
Market data arrives with timestamps, but those timestamps are not standardized. Exchanges in New York, Hong Kong, Tokyo, and Shanghai each emit data in their local time zone, often with ambiguity about whether the timestamp represents the exchange's local time or UTC. Some venues include daylight saving time transitions; others do not. Some APIs return millisecond-resolution timestamps; others return only second-level or even minute-level bars.
When you are building a cross-asset strategy that combines US equity data with options data, futures data, and foreign exchange data, timestamp misalignment is not a cosmetic issue. A 1-second discrepancy in timestamp alignment between a US equity bar and an options bar can produce a materially different implied volatility surface reconstruction.
TickDB's Timestamp Standard
TickDB normalizes all timestamps to UTC at the point of ingestion. This is the industry-accepted standard for cross-asset data pipelines because UTC is unambiguous and does not observe daylight saving time.
The ingestion pipeline applies market-specific correction rules:
| Market | Exchange timezone | Trading hours (local) | UTC offset behavior |
|---|---|---|---|
| US Equities | America/New_York | 9:30–16:00 ET | Observes EDT/EST transitions |
| HK Stocks | Asia/Hong_Kong | 9:30–16:00 HKT | No DST; fixed +08:00 |
| A-Stocks | Asia/Shanghai | 9:30–11:30, 13:00–15:00 CST | No DST; fixed +08:00 |
| Crypto | UTC | 24/7 | Fixed UTC |
For US equities, the pipeline must account for the fact that from the second Sunday in March to the first Sunday in November, New York is on Eastern Daylight Time (EDT, UTC-4), while for the rest of the year it is on Eastern Standard Time (EST, UTC-5). In 2024 the switch to EDT happened on Sunday, March 10, so a bar timestamp of "2024-03-08 09:30:00" (Friday, still EST) must be converted to "2024-03-08 14:30:00 UTC", while "2024-03-11 09:30:00" (Monday, now EDT) must be converted to "2024-03-11 13:30:00 UTC".
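A minimal sketch of this conversion using Python's standard-library `zoneinfo` module (Python 3.9+, which reads the IANA database from the system or the `tzdata` package). The helper name `exchange_local_to_utc` is hypothetical; the two dates are the sessions on either side of the March 2024 transition.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")


def exchange_local_to_utc(naive_local, tz):
    """Attach the exchange timezone to a naive timestamp and convert to UTC."""
    return naive_local.replace(tzinfo=tz).astimezone(timezone.utc)


# Friday before the 2024 spring transition: New York is still on EST (UTC-5).
before = exchange_local_to_utc(datetime(2024, 3, 8, 9, 30), NEW_YORK)
# Monday after the transition: New York is on EDT (UTC-4).
after = exchange_local_to_utc(datetime(2024, 3, 11, 9, 30), NEW_YORK)

print(before.isoformat())
print(after.isoformat())
```

The same 9:30 wall-clock open maps to two different UTC times, which is why a fixed offset cannot be used for ingestion.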
Bar Aggregation Boundary Alignment
For aggregated bars (1-minute, 5-minute, 1-hour, 1-day), TickDB aligns bar boundaries to the exchange session start time in the exchange's local timezone, then converts the boundary timestamps to UTC. This means a 1-hour bar in US equities starts at 9:30, 10:30, 11:30 ET and so on, a whole number of intervals from the session open, rather than at 10:00, 11:00, 12:00 ET as it would if boundaries were counted from midnight UTC.
This matters for strategies that generate signals based on bar count within a session — for example, a strategy that enters a position on the 30th 5-minute bar of the session. If bar boundaries are misaligned, the 30th bar does not correspond to the intended time in the session.
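The sketch below shows one way to generate session-anchored bar boundaries and locate the 30th 5-minute bar. It assumes a regular 9:30 to 16:00 session with no half days; the function name `session_bar_starts` and its defaults are illustrative, not part of any TickDB API.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")


def session_bar_starts(session_day, tz, interval_minutes,
                       session_open=time(9, 30), session_close=time(16, 0)):
    """Return UTC start times of bars anchored at the local session open."""
    # Build the session boundaries in exchange-local time, then convert once.
    open_utc = datetime.combine(session_day, session_open, tzinfo=tz).astimezone(timezone.utc)
    close_utc = datetime.combine(session_day, session_close, tzinfo=tz).astimezone(timezone.utc)
    step = timedelta(minutes=interval_minutes)
    starts = []
    current = open_utc
    while current < close_utc:
        starts.append(current)
        current += step
    return starts


five_min = session_bar_starts(date(2024, 7, 15), NEW_YORK, 5)
hourly = session_bar_starts(date(2024, 7, 15), NEW_YORK, 60)

thirtieth = five_min[29]  # 30th bar of the session (zero-based index 29)
print(len(five_min), thirtieth.astimezone(NEW_YORK).time())
print([t.astimezone(NEW_YORK).strftime("%H:%M") for t in hourly])
```

Stepping in UTC after converting the open avoids any ambiguity from wall-clock arithmetic, and the hourly boundaries land on the half hour because they are counted from 9:30 rather than from midnight.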
Dimension 3: Timezone Handling for Strategy Execution
The Distinction Between Data Storage and Data Presentation
TickDB stores all timestamps in UTC internally. This is an implementation decision that guarantees consistency across the entire dataset. However, the API supports returning timestamps in the user's preferred timezone via the tz parameter.
When a client requests US equity kline data with tz=America/New_York, the API returns bar boundaries expressed in Eastern time, accounting for DST transitions. The bar content — the OHLCV values — is identical regardless of the timezone parameter. Only the timestamp formatting changes.
This distinction matters because it means you can request data in your local timezone for display purposes without affecting the numeric integrity of the price and volume data.
Practical Example: DST Transition Handling
Consider a strategy that monitors the 9:30–10:00 ET window for opening volatility. During the DST transition weekend in March, US equity markets operate on EST (UTC-5) through the Friday before the transition and switch to EDT (UTC-4) on the following Monday. A hardcoded UTC offset of -5 would produce timestamps that are off by one hour for the entire EDT period.
TickDB's pipeline uses the IANA timezone database (tzdata) to determine the correct offset for any given date. The same bar, requested in different timezones, produces correctly offset timestamps:
```python
import os

import requests

# Read the API key from the environment rather than hardcoding it
API_KEY = os.environ.get("TICKDB_API_KEY")
headers = {"X-API-Key": API_KEY}

# Request data in Eastern time (handles DST automatically)
params = {
    "symbol": "AAPL.US",
    "interval": "5m",
    "limit": 20,
    "tz": "America/New_York",
}
response = requests.get(
    "https://api.tickdb.ai/v1/market/kline/latest",
    headers=headers,
    params=params,
    timeout=(3.05, 10),  # (connect, read) timeouts in seconds
)
response.raise_for_status()  # fail on HTTP errors instead of parsing an error body
data = response.json()
print(data["data"]["klines"][0]["timestamp"])
# Example for a bar during EDT: "2024-07-15T09:35:00-04:00"
# During EST the same wall-clock time would carry a -05:00 offset
```
Dimension 4: Outlier Detection and Anomaly Handling
The Taxonomy of Data Anomalies
Raw market data contains several categories of anomalies that must be detected and handled before data is served to clients:
Category 1: Exchange-originated errors. These include misreported prices (a stock trading at $150 that is reported as $1.50 due to a data entry error), duplicated bars, missing bars, and bars with obviously wrong volumes. These are rare but devastating if not caught.
Category 2: Survivorship bias artifacts. When a company delists, its historical data may disappear from some data feeds, creating gaps that bias backtests toward surviving companies. TickDB maintains a complete historical constituent list and ensures that delisted securities retain their historical data with appropriate end-of-trading annotations.
Category 3: Corporate action processing lag. If a split announcement is made but the exchange data has not yet been updated to reflect the split, the pipeline may receive both pre-split and post-split prices within the same trading day. The cleaning pipeline must resolve these conflicts using the effective split date, not the announcement date.
Category 4: Venue-specific microstructure artifacts. OTC markets, dark pools, and certain international venues sometimes produce price prints that are clearly erroneous — trades at $0.0001 or volume prints that exceed the total shares outstanding. These require domain-specific thresholds.
The Anomaly Detection Pipeline
TickDB applies a multi-stage outlier detection pipeline:
Stage 1: Range validation. Each price is checked against a dynamically computed acceptable range. For equities, the range is derived from the security's historical volatility over a rolling window, with a half-width of a configurable number of standard deviations (default: 5σ). A price outside this range is flagged as a potential anomaly.
Stage 2: Percentage change filtering. Within each bar, the high-low range is compared against a historical baseline. A bar where (high - low) exceeds 3× the security's average true range, with both sides measured in price units (or both normalized by the close), is flagged for review.
Stage 3: Volume anomaly detection. For each security, a rolling median volume is computed. Any bar where volume deviates by more than 10× the interquartile range from the rolling median is flagged.
Stage 4: Duplicate bar detection. Adjacent bars with identical OHLCV values are flagged. A small number of identical bars is expected during low-volume periods; an excessive run of identical bars suggests data stagnation rather than genuine trading activity.
Stage 5: Survivorship gap filling. When a security transitions from active to delisted status, the pipeline inserts an explicit end-of-trading marker rather than allowing the data to simply stop. This prevents backtesters from accidentally extending a delisted security's price forward in time.
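The stages above can be approximated with the standard library. The sketch below is one plausible reading, not TickDB's code: the 5σ, 3× ATR, and 10× IQR thresholds come from the text, while the 20-bar window, the use of price levels for the σ band, and every function name are assumptions.

```python
import statistics


def flag_price_outliers(closes, window=20, n_sigma=5.0):
    """Stage 1 sketch: flag closes outside mean +/- n_sigma * stdev of the trailing window."""
    flags = [False] * len(closes)
    for i in range(window, len(closes)):
        history = closes[i - window:i]
        mean = statistics.fmean(history)
        sigma = statistics.stdev(history)
        if sigma > 0 and abs(closes[i] - mean) > n_sigma * sigma:
            flags[i] = True
    return flags


def flag_wide_range(high, low, atr, multiple=3.0):
    """Stage 2 sketch: flag a bar whose high-low range exceeds multiple * ATR."""
    return (high - low) > multiple * atr


def flag_volume_outliers(volumes, window=20, iqr_multiple=10.0):
    """Stage 3 sketch: flag volumes far from the trailing median, scaled by the IQR."""
    flags = [False] * len(volumes)
    for i in range(window, len(volumes)):
        history = volumes[i - window:i]
        q1, _, q3 = statistics.quantiles(history, n=4)
        median = statistics.median(history)
        iqr = q3 - q1
        if iqr > 0 and abs(volumes[i] - median) > iqr_multiple * iqr:
            flags[i] = True
    return flags


def longest_identical_run(bars):
    """Stage 4 sketch: length of the longest run of adjacent identical OHLCV tuples."""
    longest = current = 1 if bars else 0
    for prev, curr in zip(bars, bars[1:]):
        current = current + 1 if curr == prev else 1
        longest = max(longest, current)
    return longest


# Hypothetical data: a $150 stock with one close misreported as $1.50,
# one volume spike, and a short run of duplicated bars.
closes = [150.0 + 0.5 * (i % 5) for i in range(25)]
closes[22] = 1.50
volumes = [1_000_000 + 10_000 * (i % 7) for i in range(25)]
volumes[23] = 50_000_000
dupes = [(1.0, 2.0, 0.5, 1.5, 100)] * 3 + [(1.5, 2.0, 1.0, 1.8, 120)]

price_flags = flag_price_outliers(closes)
volume_flags = flag_volume_outliers(volumes)
print(price_flags.index(True), volume_flags.index(True), longest_identical_run(dupes))
```

One design caveat: a flagged value still enters later trailing windows and widens them, so a production pipeline would likely exclude flagged bars from the baseline or use robust statistics throughout.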
Handling Strategy: Correction vs. Flagging
Not all anomalies are corrected. The pipeline distinguishes between two handling modes:
Correction: Applied when the anomaly can be confidently identified as a data error and a reliable replacement value can be computed. Example: a single corrupted bar in the middle of an otherwise normal trading day, where the surrounding bars provide a reliable interpolation context.
Flagging: Applied when the anomaly cannot be corrected with high confidence. The raw value is retained but annotated with a quality flag that clients can inspect. This preserves data integrity while ensuring that clients are aware of potential quality issues.
The flagging approach is philosophically important: TickDB does not silently discard data that might be inconvenient. If a bar has a quality flag, the data is still available, but the client can choose whether to include it in calculations.
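A sketch of how the two handling modes might be separated in code. The flag values ("ok", "corrected", "suspect"), the 2% neighbor-agreement threshold, and the function name are hypothetical; the source does not specify TickDB's actual flag vocabulary or interpolation rule.

```python
def handle_anomalies(closes, suspect, max_neighbor_gap=0.02):
    """Return (values, flags) for a close series.

    An isolated suspect bar whose neighbors agree closely is replaced by
    the neighbor midpoint and flagged "corrected". Any other suspect bar
    keeps its raw value and is flagged "suspect".
    """
    values = list(closes)
    flags = ["ok"] * len(closes)
    for i in sorted(suspect):
        has_clean_neighbors = (
            0 < i < len(closes) - 1
            and (i - 1) not in suspect
            and (i + 1) not in suspect
        )
        if has_clean_neighbors:
            left, right = closes[i - 1], closes[i + 1]
            if abs(right - left) / left <= max_neighbor_gap:
                values[i] = (left + right) / 2.0
                flags[i] = "corrected"
                continue
        # Not enough context to repair with confidence: keep raw, annotate.
        flags[i] = "suspect"
    return values, flags


# Hypothetical series: index 2 is a misplaced decimal, index 5 is the final
# bar and has no right-hand neighbor to interpolate from.
closes = [150.0, 150.4, 1.50, 150.8, 151.0, 175.0]
values, flags = handle_anomalies(closes, suspect={2, 5})

# A client can decide whether to include flagged bars in calculations.
usable = [v for v, f in zip(values, flags) if f != "suspect"]
print(values, flags, len(usable))
```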
Dimension 5: Cross-Venue Data Alignment
The Challenge of Multi-Venue Markets
US equities trade across more than a dozen registered exchanges plus numerous dark pools and other ATS venues. The consolidated tape (distributed under the CTA plan for Tapes A and B and the UTP plan for Tape C) provides a best-effort cross-venue composite, but the underlying venues report with different latencies, and the consolidation process introduces its own artifacts.
The most common artifact is the "late print": a trade that executed on Venue B at 9:30:00.001 but was reported to the consolidated tape at 9:30:00.850 due to Venue B's reporting latency. If the raw consolidated tape is ordered by report time, this trade appears alongside the Venue A trades from 9:30:00.850, even though it actually executed 849 milliseconds earlier.
For daily bar aggregation, this sub-second discrepancy is irrelevant. For tick-level analysis or for strategies that compute metrics based on intrabar time sequences, it can produce artifacts.
TickDB's Alignment Approach
For its kline (OHLCV) endpoint, TickDB derives bar boundaries from the official exchange-native session open and close timestamps in the exchange's local timezone. The Open, High, Low, Close, and Volume values within each bar are computed from the trades and quotes that occurred within those boundaries.
The key properties of this approach:
Reproducibility: Given the same raw tick data, the same bar will always be produced. There is no ambiguity about which trades belong to which bar.
Consistency: Bar boundaries are deterministic and aligned to the session clock, not to when data arrived at the tape.
Auditability: Each bar can be traced back to the constituent trades that produced it.
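The reproducibility and consistency properties can be illustrated with a small aggregation sketch. It assumes each trade carries both an execution time and a report time, expressed in milliseconds since the session open; the tuple layout and the function name `build_bars` are hypothetical.

```python
def build_bars(ticks, interval_ms):
    """Aggregate (time_ms, price, size) ticks into OHLCV bars keyed by bar index.

    time_ms is milliseconds since the session open, so bar 0 starts at the open.
    """
    bars = {}
    # Sorting makes the output independent of arrival order; ties on time
    # fall back to price and size, which is arbitrary but deterministic.
    for time_ms, price, size in sorted(ticks):
        index = time_ms // interval_ms
        bar = bars.get(index)
        if bar is None:
            bars[index] = {"open": price, "high": price, "low": price,
                           "close": price, "volume": size}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price
            bar["volume"] += size
    return bars


# Hypothetical trades: (exec_ms, report_ms, price, size).
trades = [
    (0, 5, 100.00, 100),
    (30_000, 30_004, 100.50, 300),
    (59_999, 60_850, 100.20, 200),  # late print: executed in bar 0, reported in bar 1
    (60_000, 60_003, 100.10, 150),
    (90_000, 90_002, 99.90, 50),
]
by_exec = [(e, p, s) for e, _r, p, s in trades]
by_report = [(r, p, s) for _e, r, p, s in trades]

exec_bars = build_bars(by_exec, 60_000)
report_bars = build_bars(by_report, 60_000)
print(exec_bars[0]["close"], report_bars[0]["close"])
```

Bucketing on execution time puts the late print in the minute where it actually traded, and feeding the same trades in a different order yields identical bars.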
Handling Cross-Market Calendar Alignment
For strategies that span multiple markets with different trading hours (US equities, HK stocks, crypto), the pipeline provides explicit calendar metadata so that clients can align bars to the correct session boundaries. A US equity 5-minute bar and a crypto 5-minute bar are both available with their respective session metadata, but the pipeline does not assume they share a common calendar.
```python
import os

import requests

# Read the API key from the environment rather than hardcoding it
API_KEY = os.environ.get("TICKDB_API_KEY")
headers = {"X-API-Key": API_KEY}

# Request kline data with explicit calendar metadata
params = {
    "symbol": "AAPL.US",
    "interval": "5m",
    "limit": 100,
    "include_calendar": "true",
}
response = requests.get(
    "https://api.tickdb.ai/v1/market/kline",
    headers=headers,
    params=params,
    timeout=(3.05, 10),  # (connect, read) timeouts in seconds
)
response.raise_for_status()  # fail on HTTP errors instead of parsing an error body
data = response.json()

# Each bar includes session metadata; .get() tolerates missing fields
for bar in data["data"]["klines"]:
    print(f"Bar {bar['timestamp']}: session_open={bar.get('session_open')}, "
          f"is_auction={bar.get('is_auction')}, quality_flag={bar.get('quality_flag')}")
```
Comparison: Raw vs. Cleaned Data — A Concrete Example
The following table illustrates what the five cleaning dimensions produce when applied to a real historical event: Apple's (AAPL) 4:1 stock split, which was announced on July 30, 2020 and took effect at the open on August 31, 2020. Prices shown are split-adjusted only, and volumes are approximate.
| Property | Raw Data | Cleaned Data (TickDB) |
|---|---|---|
| Last pre-split close (Aug 28) | $499.23 | $124.81 |
| First post-split close (Aug 31) | $129.04 | $129.04 |
| Volume (Aug 28) | ~47M shares | ~188M shares (split-adjusted) |
| Timestamp (Aug 31 open) | "09:30:00" (ambiguous TZ) | "2020-08-31T13:30:00Z" (UTC) |
| Outlier detection | None | No anomalies on this date |
| Session boundary | N/A | Aligned to NYSE open/close |
The cleaned data provides a continuous, split-adjusted time series with unambiguous UTC timestamps, where each bar is traceable to its constituent trades and validated against anomaly detection thresholds.
How to Inspect Data Quality in Your Own Pipeline
Understanding TickDB's cleaning pipeline is valuable even if you are not using TickDB. The principles apply universally. Here is a checklist you can apply to any market data vendor:
For price adjustment:
- Request the same security's data before and after a known split date. Compute the return across the split date using the vendor's data. It should be a smooth continuation, not a catastrophic drop.
- Verify that volume is multiplied (not divided) by the split ratio for pre-split bars, and that post-split bars are left unchanged.
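The first check can be automated with a simple scan for day-over-day price ratios that sit close to a common split ratio. This is a heuristic sketch: the candidate ratios, the 3% tolerance, and the function name are assumptions, and the sample closes reuse the hypothetical 2:1 split table from earlier in the article.

```python
def find_unadjusted_split_gaps(closes, candidate_ratios=(2, 3, 4, 5, 10), tolerance=0.03):
    """Return (index, ratio) pairs where close[i-1] / close[i] is near a split ratio."""
    hits = []
    for i in range(1, len(closes)):
        ratio = closes[i - 1] / closes[i]
        for candidate in candidate_ratios:
            if abs(ratio / candidate - 1.0) <= tolerance:
                hits.append((i, candidate))
                break
    return hits


unadjusted = [124.50, 122.80, 61.00, 61.50]
adjusted = [62.25, 61.40, 61.00, 61.50]

unadjusted_hits = find_unadjusted_split_gaps(unadjusted)
adjusted_hits = find_unadjusted_split_gaps(adjusted)
print(unadjusted_hits, adjusted_hits)
```

A hit on a vendor's "adjusted" series around a known split date suggests the adjustment was not applied; genuine crashes of exactly these magnitudes are rare but possible, so hits are leads to investigate rather than proof.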
For timestamp normalization:
- Request data for a US equity and a HK stock for the same UTC time window. Verify that the timestamps are in UTC and that the market-specific session boundaries are correct.
- Test DST transitions explicitly. Request data for the two weeks before and after a March DST transition for a US security. Verify that the UTC timestamp of the 9:30 ET opening bar moves from 14:30 to 13:30 once New York switches from EST to EDT.
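A sketch of the DST check, assuming you have already extracted the UTC timestamp of the first bar of each session from the vendor's data. The function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


def misaligned_session_opens(first_bar_utc_times, tz_name="America/New_York",
                             expected_open=time(9, 30)):
    """Return the UTC timestamps whose exchange-local time is not the expected open."""
    tz = ZoneInfo(tz_name)
    return [ts for ts in first_bar_utc_times
            if ts.astimezone(tz).time() != expected_open]


utc = timezone.utc
# Correct vendor: the open moves from 14:30 UTC to 13:30 UTC across the transition.
good = [datetime(2024, 3, 8, 14, 30, tzinfo=utc), datetime(2024, 3, 11, 13, 30, tzinfo=utc)]
# Vendor with a hardcoded UTC-5 offset: the Monday open is an hour late.
bad = [datetime(2024, 3, 8, 14, 30, tzinfo=utc), datetime(2024, 3, 11, 14, 30, tzinfo=utc)]

good_issues = misaligned_session_opens(good)
bad_issues = misaligned_session_opens(bad)
print(good_issues, bad_issues)
```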
For outlier detection:
- Request data for a known high-volatility event (earnings release, economic announcement). Check for quality flags on bars during the event window. Ask the vendor what their threshold is and whether they correct or flag anomalies.
For cross-venue alignment:
- If you are building a multi-asset strategy, verify that session metadata is included with each bar. A US equity bar and a crypto bar should not share the same session_open value.
The Engineering Discipline Behind Data Quality
Data cleaning is not a feature. It is an engineering discipline that requires explicit decisions at every step: which adjustment standard, which timezone database version, which outlier threshold, which handling mode (correct vs. flag). Every decision is a trade-off between data fidelity and data usability.
Raw data is historically accurate — it records what actually happened. Adjusted data is analytically useful — it enables strategies that should work in a world without corporate action artifacts. Neither is universally correct. The value of a data pipeline is in making these decisions explicitly, documenting them consistently, and giving clients the tools to inspect and override them when needed.
TickDB's pipeline makes these decisions in favor of adjusted, normalized, flagged data with full auditability. This is the appropriate default for quantitative research. If your strategy explicitly requires unadjusted data (for example, a dividend capture strategy, or a merger arbitrage strategy that compares quoted prices against deal terms), the API supports requesting the raw data fields alongside the adjusted fields.
Next Steps
If you want to inspect TickDB's data quality for yourself, sign up at tickdb.ai to access the free API tier. The /v1/market/kline endpoint returns quality_flag and adjustment_factor fields that expose the pipeline's annotations on each bar.
If you are evaluating TickDB against another vendor, use the five-dimension framework in this article to audit their cleaning pipeline. Ask specifically about split adjustment methodology (backward vs. forward), timezone handling (IANA database vs. fixed offsets), and outlier handling (correction vs. flagging). The answers will reveal whether their pipeline is a production-grade engineering system or a black box with unknown internals.
If you need institutional-grade historical data with full audit trails, contact enterprise@tickdb.ai for access to the extended metadata fields — including per-bar constituent trade counts, adjustment factor histories, and DST transition annotations.
This article does not constitute investment advice. Market data is an input to quantitative analysis; the quality of analysis depends on the quality of data and the rigor of the methodology applied to it.