"Someone is buying 30 shares of Apple at 3:47 PM. Not 100. Not 1,000. Thirty. It's the kind of order that hides in plain sight: an odd lot, overlooked by most public tape readers, silently moving through a venue that never showed up on the consolidated quote."
In 2024, approximately 47% of US equity volume traded in venues that do not display their quotes on the public consolidated tape. These are dark pools and off-exchange venues. For quant traders building order flow models, the implication is stark: a strategy trained only on exchange-reported trades operates on a partial picture of reality. Meanwhile, odd lots — trades smaller than the standard 100-share round lot — have grown from a statistical nuisance into a dominant feature of modern equity markets, now representing nearly 30% of all trades by count.
This article dissects two related problems that sit at the foundation of reliable US equity order flow analysis. First, how to identify and tag dark pool executions from trade data. Second, how odd-lot trades distort aggregated K-line constructs and what filtering logic recovers a clean signal. Both topics share a common root: the sale condition codes embedded in every Trade Report (FINRA TRF / Nasdaq UTP).
The Sale Condition Code Infrastructure
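Before identifying dark pools or filtering odd lots, you need a precise mental model of what trade data actually contains. A FINRA-formatted Trade Report carries more than price and size. It includes a set of sale condition codes, delivered as a short string of single-character flags (some vendors re-encode it as an integer bitmask), that describe how and where the trade was executed.
The most consequential codes for our purposes are listed below. Code letters differ between feeds and vendors, and the mapping here is the working convention used throughout this article, so verify each letter against your vendor's specification before relying on it. In the CTA/UTP SIP specifications, for example, an odd lot is flagged I, an intermarket sweep F, a derivatively priced trade 4, and an opening print O, while off-exchange prints are identified by the reporting facility (a FINRA TRF) rather than by a sale condition.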
| Code | Meaning | Relevance |
|---|---|---|
| T | Trade reported late (off-hours) | Exclude from regular-session analysis |
| O | Odd lot trade | Critical for K-line aggregation |
| D | Derivatively priced | Exclude from price-based signals |
| M | Closing sale only | Relevant for close-end strategies |
| N | Next-day sale | Irregular settlement |
| R | Seller's sale (opening/reopening) | Context-dependent |
| Z | Inter-market sweep order (ISO) | High-velocity, directional signal |
| 4 | Dark pool / non-TP (non-published) trade | Dark pool identification |
| 9 | No remote inter-market sweep | Soft indicator |
The code 4 (hex 0x34) is the most direct dark pool indicator. When present, the trade occurred on a non-public venue — typically a dark pool, internalizer, or wholesale market maker — and did not contribute to the national best bid and offer (NBBO). Trades without a corresponding NBBO quote are sometimes called "print-feel" trades and can deviate meaningfully from mid-price at the time of execution.
Here is the parsing logic for extracting these codes from a raw trade record. This example assumes a JSON trade feed served over REST (a common delivery format from providers like Polygon, Databento, or TickDB's trades endpoint for supported markets):
import os
import time
import requests
import logging
from dataclasses import dataclass
from typing import Optional
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Configuration
API_KEY = os.environ.get("TICKDB_API_KEY")
BASE_URL = "https://api.tickdb.ai/v1"
# NOTE: The `trades` endpoint does not support US equities.
# For US equity analysis, use the kline endpoint (historical OHLCV).
# This code is provided as a reference implementation for markets
# where tick trades ARE available (e.g., HK equities, crypto).
@dataclass
class TradeRecord:
    """Parsed trade record with sale condition decoding."""
    symbol: str
    price: float
    size: int
    timestamp: int        # milliseconds since epoch
    exchange: str
    sale_conditions: str  # Raw condition string, e.g., "T|O|D"

    @property
    def is_odd_lot(self) -> bool:
        """Odd lots are trades smaller than 100 shares."""
        return 0 < self.size < 100

    @property
    def is_dark_pool(self) -> bool:
        """Code '4' indicates a non-TP (non-published) venue trade."""
        return '4' in self.sale_conditions

    @property
    def is_late_report(self) -> bool:
        """Code 'T' indicates an off-hours late report."""
        return 'T' in self.sale_conditions

    @property
    def is_derivative_priced(self) -> bool:
        """Code 'D' indicates a derivatively priced transaction."""
        return 'D' in self.sale_conditions

def parse_sale_conditions(raw) -> str:
    """
    Normalize the sale condition field from various data providers.

    Providers format this field differently: some use 'T|O|4',
    others use bitmask integers. This normalizer handles common cases.
    """
    if raw is None or raw == "":
        return ""

    # Integer bitmask (e.g., 2 = odd lot only, 18 = odd lot + dark pool).
    # The bit assignments below are illustrative; take the real layout
    # from your data vendor's documentation.
    if isinstance(raw, int):
        code_map = {1: 'R', 2: 'O', 4: 'D', 8: 'Z', 16: '4', 32: 'T', 64: 'M'}
        return '|'.join(label for bit, label in code_map.items() if raw & bit)

    raw = str(raw)
    # Already delimited; '@' is treated as an alternative separator
    if '|' in raw or '@' in raw:
        return raw.replace('@', '|')

    # Bare code string such as '4' or 'T'. Return it unchanged so the
    # condition code '4' is never misread as the bitmask value 4.
    # If your vendor sends bitmasks as strings, cast to int before calling.
    return raw

def fetch_trades_batch(symbol: str, start_ts: int, end_ts: int,
                       retry_count: int = 3) -> list[TradeRecord]:
    """
    Fetch a batch of trade records for a given symbol and time window.
    Implements exponential backoff with jitter for rate-limit resilience.

    NOTE: This endpoint pattern is shown for markets where it is supported.
    Replace with your actual data source for US equity tick data.
    """
    url = f"{BASE_URL}/market/trades"
    headers = {"X-API-Key": API_KEY}
    params = {
        "symbol": symbol,
        "start_time": start_ts,
        "end_time": end_ts,
        "limit": 1000,
    }

    for attempt in range(retry_count):
        try:
            response = requests.get(url, headers=headers, params=params,
                                    timeout=(3.05, 15))
            response.raise_for_status()
            data = response.json()

            if data.get("code") == 0:
                records = []
                for item in data.get("data", []):
                    trade = TradeRecord(
                        symbol=item["symbol"],
                        price=float(item["price"]),
                        size=int(item["size"]),
                        timestamp=int(item["ts"]),
                        exchange=item.get("exchange", "UNKNOWN"),
                        sale_conditions=parse_sale_conditions(
                            item.get("conditions", "")
                        ),
                    )
                    records.append(trade)
                return records

            elif data.get("code") == 3001:
                # Rate limit handling (code 3001): exponential backoff,
                # with the fractional second of the clock as cheap jitter
                retry_after = int(response.headers.get("Retry-After", 5))
                wait = retry_after * (2 ** attempt) + time.time() % 1
                logger.warning(f"Rate limited. Waiting {wait:.1f}s before retry.")
                time.sleep(wait)
                continue

            else:
                logger.error(f"API error {data.get('code')}: {data.get('message')}")
                return []

        except requests.exceptions.Timeout:
            logger.warning(f"Timeout on attempt {attempt + 1}. Retrying...")
            time.sleep(1 * (2 ** attempt))  # Simple backoff
        except requests.exceptions.RequestException as e:
            logger.error(f"Request failed: {e}")
            if attempt == retry_count - 1:
                raise

    return []

# Example: Analyze a batch of trades for a supported symbol
if __name__ == "__main__":
    if not API_KEY:
        raise ValueError("Set TICKDB_API_KEY environment variable")

    # Fetch HK equity trades (where tick data is supported).
    # "700.HK" is an illustrative symbol; check your vendor's symbol format.
    now = int(time.time() * 1000)
    trades = fetch_trades_batch("700.HK", now - 3600_000, now)

    if not trades:
        # Avoid dividing by zero when the window returns nothing
        logger.info("No trades returned for the requested window.")
    else:
        dark_pool_count = sum(1 for t in trades if t.is_dark_pool)
        odd_lot_count = sum(1 for t in trades if t.is_odd_lot)
        logger.info(f"Fetched {len(trades)} trades. "
                    f"Dark pool: {dark_pool_count} ({100*dark_pool_count/len(trades):.1f}%), "
                    f"odd lots: {odd_lot_count} ({100*odd_lot_count/len(trades):.1f}%)")
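The substring checks in TradeRecord work for single-character codes, but they would misfire if a vendor emitted multi-character tokens in the same field. A minimal sketch of a stricter alternative follows, assuming a pipe-delimited condition string; `decode_conditions` and `has_condition` are hypothetical helper names, not part of any vendor API.

```python
def decode_conditions(raw: str, sep: str = "|") -> frozenset[str]:
    """Split a delimited condition string into a set of exact tokens."""
    if not raw:
        return frozenset()
    return frozenset(tok.strip() for tok in raw.split(sep) if tok.strip())


def has_condition(raw: str, code: str) -> bool:
    """Exact-token membership test (no substring matching)."""
    return code in decode_conditions(raw)


# '4' is present as its own token in the first string only;
# in the second string the token is '14', which is a different code.
print(has_condition("T|O|4", "4"))
print(has_condition("T|O|14", "4"))
```

Decoding once into a set also makes it cheap to test several codes per trade, at the cost of requiring that the delimiter is known up front.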
Dark Pool Identification: Beyond the 4 Code
The 4 condition code is a binary signal — trade came from a non-public venue, or it did not. In practice, dark pool identification requires a more layered approach, because some off-exchange activity carries legitimate informational value while other activity is noise.
Tier 1: Categorical Identification
| Venue type | Condition code | Execution quality concern |
|---|---|---|
| Dark pool (broker) | 4 | No NBBO contribution; price may lag |
| Internalizer / wholesaler | 4 + D | Derivatively priced; may use midpoint |
| Exchange (lit) | Absent | NBBO-aligned, full tape |
| Closing auction | M | Concentrated volume, special closing price |
| Opening auction | O | Opening auction price |
A high 4-to-total ratio on a symbol is a direct measure of dark pool activity. In 2024, the average dark pool percentage for US large-caps was 37–42% by volume, but this varies significantly by symbol. Apple (AAPL) and Tesla (TSLA) regularly exceed 45%, while symbols like BRK.B (low retail ownership) may run under 20%.
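The 4-to-total ratio can be sketched as a per-symbol volume share. The example below is a minimal illustration, assuming trades arrive as (symbol, size, conditions) tuples with pipe-delimited conditions; the sample values are made up and `dark_pool_share` is a hypothetical helper name.

```python
from collections import defaultdict

# (symbol, size, conditions); values are invented for illustration only.
trades = [
    ("AAPL", 500, "4"),
    ("AAPL", 300, ""),
    ("AAPL", 200, "4|D"),
    ("BRK.B", 100, "4"),
    ("BRK.B", 900, ""),
]


def dark_pool_share(trades, dark_code="4"):
    """Return {symbol: flagged volume / total volume} using exact code tokens."""
    dark = defaultdict(int)
    total = defaultdict(int)
    for symbol, size, conditions in trades:
        total[symbol] += size
        if dark_code in conditions.split("|"):
            dark[symbol] += size
    return {s: dark[s] / total[s] for s in total if total[s] > 0}


shares = dark_pool_share(trades)
for symbol, share in sorted(shares.items()):
    print(f"{symbol}: {share:.0%} of volume flagged dark")
```

Measuring the ratio by volume rather than by trade count keeps a handful of large blocks from being swamped by many small prints.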
Tier 2: Price Deviation Filtering
Not all dark pool prints are equal. A dark pool trade priced at the NBBO midpoint is a reasonable execution. A dark pool trade 10 bps away from mid during normal conditions is a signal worth investigating.
def classify_dark_pool_quality(trade: TradeRecord,
                               best_bid: float,
                               best_ask: float) -> str:
    """
    Classify a dark pool trade by its execution quality relative to NBBO.

    Returns: 'lit' | 'high_quality' | 'moderate_drift' | 'anomalous'
    """
    if not trade.is_dark_pool:
        return "lit"

    mid = (best_bid + best_ask) / 2
    drift_bps = abs(trade.price - mid) / mid * 10_000

    if drift_bps < 1.0:  # Less than 1 basis point from mid
        return "high_quality"
    elif drift_bps < 5.0:
        return "moderate_drift"
    else:
        return "anomalous"

def vol_weighted_price(trades: list[TradeRecord]) -> float:
    """Volume-weighted average price of a list of trades (0.0 if empty)."""
    total_dollars = sum(t.size * t.price for t in trades)
    total_shares = sum(t.size for t in trades)
    return total_dollars / total_shares if total_shares else 0.0


def dark_pool_activity_report(trades: list[TradeRecord],
                              best_bids: dict[int, float],
                              best_asks: dict[int, float]) -> dict:
    """
    Aggregate dark pool activity metrics from a batch of trades.
    Groups trades by quality tier and computes volume-weighted stats.
    Quotes are keyed by 1-minute bucket (timestamp // 60_000).
    """
    high_q, mod_drift, anomalous = [], [], []
    unquoted = 0

    for trade in trades:
        if not trade.is_dark_pool:
            continue

        ts_bucket = trade.timestamp // 60_000  # 1-minute bucket
        bid = best_bids.get(ts_bucket)
        ask = best_asks.get(ts_bucket)
        if bid is None or ask is None:
            # No quote for this bucket. Falling back to the trade price would
            # yield zero drift and mislabel the print as high quality.
            unquoted += 1
            continue

        quality = classify_dark_pool_quality(trade, bid, ask)
        if quality == "high_quality":
            high_q.append(trade)
        elif quality == "moderate_drift":
            mod_drift.append(trade)
        else:
            anomalous.append(trade)

    classified = len(high_q) + len(mod_drift) + len(anomalous)
    return {
        "total_dark_trades": classified + unquoted,
        "high_quality_count": len(high_q),
        "moderate_drift_count": len(mod_drift),
        "anomalous_count": len(anomalous),
        "unquoted_count": unquoted,
        "anomalous_pct": len(anomalous) / max(classified, 1) * 100,
        "anomalous_vwap": vol_weighted_price(anomalous),
    }
Tier 3: Venue Fingerprinting
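For more sophisticated analysis, the exchange field in trade records can be used to fingerprint the executing venue, provided your data source exposes venue-level attribution (the consolidated tape reports off-exchange prints through a FINRA TRF without naming the individual pool). Major dark pool operators include: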
- ITG Posit (now part of Virtu)
- Liquidnet
- Goldman Sachs Sigma X
- Morgan Stanley MS Pool
- CBOE BIDS
- IEX (technically a lit exchange, but with a speed bump, so it is treated differently)
A venue with consistently high fill rates but low price impact on the lit market may represent informed institutional flow — valuable signal. A venue with erratic pricing relative to mid may be noise or predatory internalization. The distinction matters for order flow toxicity models.
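One way to make the "erratic pricing relative to mid" comparison concrete is a per-venue drift profile. The sketch below assumes each print already carries a venue label and the NBBO mid at execution; the venue names and prices are placeholders, and `venue_fingerprint` is a hypothetical helper rather than an established metric.

```python
import statistics
from collections import defaultdict

# (venue label, trade price, NBBO mid at execution); illustrative values only.
prints = [
    ("VENUE_A", 185.210, 185.210),
    ("VENUE_A", 185.212, 185.210),
    ("VENUE_A", 185.208, 185.210),
    ("VENUE_B", 185.260, 185.210),
    ("VENUE_B", 185.150, 185.210),
    ("VENUE_B", 185.215, 185.210),
]


def venue_fingerprint(prints):
    """Per venue: print count, mean absolute drift and drift dispersion (bps)."""
    drifts = defaultdict(list)
    for venue, price, mid in prints:
        drifts[venue].append((price - mid) / mid * 10_000)  # signed drift in bps
    report = {}
    for venue, d in drifts.items():
        report[venue] = {
            "prints": len(d),
            "mean_abs_drift_bps": statistics.mean([abs(x) for x in d]),
            "drift_stdev_bps": statistics.pstdev(d),
        }
    return report


fp = venue_fingerprint(prints)
for venue, stats in sorted(fp.items()):
    print(venue, {k: round(v, 3) for k, v in stats.items()})
```

In this toy data VENUE_B shows both a larger average drift and a wider dispersion than VENUE_A, which is the pattern the paragraph above describes as erratic. Fill rates and lit-market impact would need order-level data that this sketch does not model.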
Odd-Lot Trades: The Invisible Volume Problem
An odd lot is any trade size below 100 shares. While individual odd lots are small, their aggregate market impact is substantial: by trade count, odd lots represent roughly 28–32% of all US equity trades. By volume, they represent 8–12%, driven by the long tail of very small orders.
The problem for K-line aggregation is structural. When you construct a 1-minute OHLCV candle from trade data, each trade contributes its price and volume. An odd lot trade at $185.01, size 43, executed between the bid and ask, will be included in the volume component of the candle but may not reflect genuine market depth. If your aggregation algorithm does not filter odd lots, two distortions emerge:
1. Volume inflation without price certainty: A candle with 50,000 shares of volume may include 15,000 shares of odd-lot noise that represents market-maker hedging flow rather than directional aggression.
2. High-low range contamination: Odd lots that execute at prices slightly off the NBBO mid can artificially widen the high-low range of a candle. A series of 10-share prints at $185.25 when the bid-ask is $185.20–$185.22 creates a phantom high that was never a tradeable price at any meaningful size.
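Both distortions can be reproduced with a handful of prints. The following sketch uses the $185.20–$185.22 quote and the 10-share prints at $185.25 from the second point; the remaining prints and sizes are invented for illustration, and `high_low_volume` is a hypothetical helper.

```python
# (price, size) prints inside one 1-minute bucket.
prints = [
    (185.20, 400),
    (185.22, 300),
    (185.25, 10),   # odd lot above the quoted range
    (185.25, 10),   # second odd lot at the same off-quote price
    (185.21, 500),
]


def high_low_volume(prints, min_size=1):
    """Return (high, low, volume) over prints of at least min_size shares."""
    kept = [(price, size) for price, size in prints if size >= min_size]
    if not kept:
        raise ValueError("no prints left after filtering")
    prices = [price for price, _ in kept]
    return max(prices), min(prices), sum(size for _, size in kept)


raw_high, raw_low, raw_vol = high_low_volume(prints)
clean_high, clean_low, clean_vol = high_low_volume(prints, min_size=100)

print(f"unfiltered: high={raw_high} low={raw_low} volume={raw_vol}")
print(f"filtered:   high={clean_high} low={clean_low} volume={clean_vol}")
```

Twenty shares of odd-lot volume barely change the volume figure, yet they move the candle high from 185.22 to 185.25, which is why range-based indicators are more sensitive to odd lots than volume-based ones.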
Filtering Odd Lots: The Minimum Size Threshold
The simplest filter is a minimum size threshold. The industry convention treats 100 shares as the round-lot boundary, so a threshold of 100 or higher is standard. However, sophisticated models may use dynamic thresholds based on average daily volume (ADV):
def filter_odd_lots(trades: list[TradeRecord],
                    min_size: int = 100,
                    dynamic_adv_fraction: Optional[float] = None,
                    adv: Optional[float] = None) -> list[TradeRecord]:
    """
    Filter trades to exclude odd lots based on a fixed or dynamic threshold.

    Args:
        trades: List of parsed trade records
        min_size: Fixed minimum size threshold (default: 100 for round lots)
        dynamic_adv_fraction: If set, compute threshold as ADV * fraction
        adv: Average daily volume (required if dynamic_adv_fraction is set)
    """
    if dynamic_adv_fraction is not None and adv is None:
        raise ValueError("ADV must be provided for dynamic threshold")

    threshold = min_size
    if dynamic_adv_fraction is not None:
        # Never let the dynamic threshold drop below the fixed floor
        threshold = max(min_size, int(adv * dynamic_adv_fraction))

    filtered = [t for t in trades if t.size >= threshold]
    logger.info(f"Filtered {len(trades) - len(filtered)} odd-lot trades "
                f"(threshold: {threshold} shares). "
                f"Retained {len(filtered)} trades.")
    return filtered

def build_clean_kline_from_trades(trades: list[TradeRecord],
                                  interval_ms: int = 60_000) -> list[dict]:
    """
    Build OHLCV candles from a list of trades, with odd-lot filtering
    and dark pool flagging per candle.

    Returns a list of candle dicts with open, high, low, close, volume,
    and dark_pool_pct fields.
    """
    # Filter odd lots first, then sort by time so that open and close
    # are the first and last trade of each bucket regardless of input order
    clean_trades = sorted(filter_odd_lots(trades, min_size=100),
                          key=lambda t: t.timestamp)

    # Group by interval
    buckets: dict[int, list[TradeRecord]] = {}
    for t in clean_trades:
        buckets.setdefault(t.timestamp // interval_ms, []).append(t)

    candles = []
    for bucket_ts in sorted(buckets):
        bucket_trades = buckets[bucket_ts]
        prices = [t.price for t in bucket_trades]
        volume = sum(t.size for t in bucket_trades)
        dark_volume = sum(t.size for t in bucket_trades if t.is_dark_pool)
        dark_pool_pct = dark_volume / volume * 100 if volume > 0 else 0

        candles.append({
            "timestamp": bucket_ts * interval_ms,
            "open": round(prices[0], 4),
            "high": round(max(prices), 4),
            "low": round(min(prices), 4),
            "close": round(prices[-1], 4),
            "volume": volume,
            "dark_pool_pct": round(dark_pool_pct, 2),
            "trade_count": len(bucket_trades),
        })
    return candles
Cross-Validation: TickDB Kline as Ground Truth for Aggregate Behavior
When you cannot access tick-level US equity trades directly, the kline endpoint provides a reliable alternative for studying aggregate price behavior. TickDB's US equity kline data is cleaned and aligned across venues, which makes it suitable for validating whether your odd-lot filtering model is producing the correct candle structure:
def fetch_historical_klines(symbol: str,
                            interval: str = "1m",
                            start_ts: Optional[int] = None,
                            limit: int = 500) -> list[dict]:
    """
    Fetch historical OHLCV klines from TickDB for US equity symbols.
    These are cleaned, venue-aligned klines suitable for backtesting.

    NOTE: For tick-level (per-trade) data on US equities, a dedicated
    tick data vendor (e.g., Databento, Polygon) is required. TickDB's
    kline endpoint does not provide per-trade granularity.
    """
    url = f"{BASE_URL}/market/kline"
    headers = {"X-API-Key": API_KEY}
    params = {
        "symbol": symbol,
        "interval": interval,
        "limit": limit,
    }
    if start_ts is not None:  # a start of 0 is still a valid timestamp
        params["start_time"] = start_ts

    response = requests.get(url, headers=headers, params=params,
                            timeout=(3.05, 10))
    response.raise_for_status()
    data = response.json()

    if data.get("code") == 0:
        return data.get("data", [])
    raise RuntimeError(f"Kline fetch failed: {data}")

def compare_kline_vs_tick_model(kline_candles: list[dict],
                                tick_model_candles: list[dict]) -> dict:
    """
    Compare candles built from filtered tick data against TickDB's
    cleaned klines to validate your filtering model.

    Returns aggregate divergence statistics over overlapping candles.
    """
    import statistics

    kline_map = {c["timestamp"]: c for c in kline_candles}
    tick_map = {c["timestamp"]: c for c in tick_model_candles}
    common_ts = set(kline_map) & set(tick_map)

    if not common_ts:
        return {"error": "No overlapping timestamps"}

    close_diffs, volume_diffs, high_diffs, low_diffs = [], [], [], []
    for ts in common_ts:
        k = kline_map[ts]
        t = tick_map[ts]
        # Close and volume divergence are expressed in percent
        close_diffs.append(abs(k["close"] - t["close"]) / k["close"] * 100)
        volume_diffs.append(abs(k["volume"] - t["volume"]) / max(k["volume"], 1) * 100)
        # High and low divergence are absolute price differences
        high_diffs.append(abs(k["high"] - t["high"]))
        low_diffs.append(abs(k["low"] - t["low"]))

    return {
        "overlapping_candles": len(common_ts),
        "avg_close_divergence_bps": statistics.mean(close_diffs) * 100,  # 1% = 100 bps
        "max_close_divergence_bps": max(close_diffs) * 100,
        "avg_volume_divergence_pct": statistics.mean(volume_diffs),
        "avg_high_low_noise": statistics.mean(high_diffs) + statistics.mean(low_diffs),
    }
Order Flow Toxicity: Combining Dark Pool and Odd-Lot Signals
Separately, dark pool activity and odd-lot volume are interesting. Together, they form the foundation of an order flow toxicity score. The logic: informed traders prefer dark pools (to minimize market impact), while noise traders (retail, HFT arbitrage) often use odd lots. A symbol with high dark pool volume and low odd-lot volume likely has institutional flow. The inverse suggests retail-dominated flow.
A simple toxicity metric:
Toxicity Score = (Dark Pool Volume % × Volume) / (Odd Lot Volume % × Volume + 1)
A score above 2.0 on a 5-minute rolling window is a weak signal of informed trading. Combine this with short-term VWAP deviation and you have a defensible entry filter for mean-reversion strategies.
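The formula above can be sketched for a single window of trades. This is one reading of it, assuming the two percentages are fractions of window volume, so the score reduces to dark volume divided by odd-lot volume plus one share; `toxicity_score` and the sample window are illustrative, not a reference implementation.

```python
def toxicity_score(trades, odd_lot_size=100, dark_code="4"):
    """
    Score = (dark % * volume) / (odd-lot % * volume + 1) for one window.
    trades: iterable of (size, conditions) with pipe-delimited conditions.
    Percentages are treated as fractions of total window volume (assumption).
    """
    trades = list(trades)
    total = sum(size for size, _ in trades)
    if total == 0:
        return 0.0
    dark_volume = sum(size for size, cond in trades if dark_code in cond.split("|"))
    odd_volume = sum(size for size, _ in trades if 0 < size < odd_lot_size)
    dark_pct = dark_volume / total
    odd_pct = odd_volume / total
    return (dark_pct * total) / (odd_pct * total + 1)


# One illustrative 5-minute window: 1,900 shares in total,
# 1,560 of them flagged dark and 100 of them in odd lots.
window = [(500, "4"), (300, ""), (40, ""), (60, "4"), (1000, "4")]
score = toxicity_score(window)
print(f"toxicity score: {score:.2f}")
```

Under this reading the score is measured in shares per share, so the +1 only guards against division by zero and the 2.0 threshold would need recalibrating if percentages were entered as whole numbers. The score also has to be computed on unfiltered trades, since an odd-lot filter applied upstream would zero out the denominator.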
Practical Limitations and Backtest Disclosure
The analysis above reflects idealized trade data conditions. Real-world implementation faces constraints that affect backtest reliability:
| Limitation | Impact | Mitigation |
|---|---|---|
| Sale condition availability varies by data vendor | Some vendors suppress individual condition codes | Cross-validate against exchange direct feeds |
| Odd-lot classification may use a 100-share threshold, but some venues treat "mixed lots" differently | 100-share round lots with non-standard settlement may be mislabeled | Check venue-specific documentation |
| Dark pool identification via code 4 does not include "off-exchange" trades that carry no condition code | Latency-arbitrage flow on internalizers may be invisible | Supplement with venue fingerprinting |
| NBBO drift measurement requires co-located quote data | Without real-time NBBO, the quality classification is approximate | Use a static spread model for backtesting; real-time NBBO for live |
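The static spread model in the last row can be as simple as a fixed-width quote centered on a reference price. The sketch below assumes the reference is something like the previous candle close and that a 2 bps spread is representative for the symbol; both are assumptions to calibrate, and `synthetic_nbbo` and `drift_from_mid_bps` are hypothetical helpers.

```python
def synthetic_nbbo(reference_price: float,
                   spread_bps: float = 2.0) -> tuple[float, float]:
    """Return a (bid, ask) pair of fixed width centered on reference_price."""
    half_spread = reference_price * spread_bps / 10_000 / 2
    return reference_price - half_spread, reference_price + half_spread


def drift_from_mid_bps(trade_price: float, bid: float, ask: float) -> float:
    """Absolute distance of a trade price from the quote midpoint, in bps."""
    mid = (bid + ask) / 2
    return abs(trade_price - mid) / mid * 10_000


# Reference price and trade price are illustrative values.
bid, ask = synthetic_nbbo(185.21, spread_bps=2.0)
drift = drift_from_mid_bps(185.25, bid, ask)
print(f"bid={bid:.4f} ask={ask:.4f} drift={drift:.2f} bps")
```

Because the midpoint of a symmetric synthetic quote is just the reference price, the drift tiers in a backtest depend entirely on how stale that reference is, which is the approximation the table warns about.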
Backtest limitations: The order flow toxicity model and odd-lot filtering strategy described here have been validated on synthetic and historical data but have not been live-tested. Backtested results reflect assumptions about data completeness and execution costs. Fixed slippage of 5 bps is assumed; actual execution costs during high-volatility periods (earnings, macroeconomic releases) will be materially higher. The dark pool percentage metric is sourced from FINRA public data (2024); individual venue volumes fluctuate.
Closing
The two problems, dark pool identification and odd-lot filtering, share a deeper lesson: public market data is a curated abstraction of a more complex reality. Every candle, every volume bar, every VWAP calculation is built on a filtered view of a market where roughly half the volume executes away from displayed quotes. Understanding what is filtered, why, and how to recover the signal that matters is what separates a backtest that survives live deployment from one that does not.
For US equity K-line analysis, TickDB provides a reliable foundation of cleaned, aligned OHLCV data covering 10+ years of history — sufficient for building and validating microstructure models across bull and bear regimes. Pair that foundation with the filtering logic above, and you have a defensible pipeline from raw data to clean signal.
Next Steps
If you are building a backtesting pipeline for US equity strategies: Sign up at tickdb.ai to access 10+ years of cleaned US equity kline data across all major exchanges. The kline endpoint supports intervals from 1m to 1M, with full OHLCV and volume fields.
If you need tick-level (per-trade) granularity for US equities: The trades endpoint does not currently support US equity tick data. For tick-level US equity analysis, consult a specialized provider such as Databento, Polygon, or Refinitiv. Use TickDB's kline endpoint for your backtesting OHLCV foundation and validate your odd-lot filtering model against those providers' tick feeds.
If you are integrating with AI-assisted analysis tools: The tickdb-market-data SKILL on ClawHub provides a structured interface for pulling TickDB klines directly into your AI pipeline for microstructure analysis, earnings reaction modeling, and strategy validation.
If you are an institutional team: Contact enterprise@tickdb.ai for access to extended historical data, custom market coverage, and dedicated API support for cross-asset strategy research.
This article does not constitute investment advice. Market data analysis involves inherent data quality limitations; backtested results do not guarantee future performance. Trading involves risk, including the potential loss of principal.