Dark Pool Identification and Odd-Lot Filtering in US Equity Tick Data

"Someone is buying 300 shares of Apple at 3:47 PM. Not 100. Not 1,000. Three hundred. It's the kind of order that hides in plain sight — an odd lot, invisible to most public tape readers, silently moving through a venue that never showed up on the consolidated quote."

In 2024, approximately 47% of US equity volume traded in venues that do not display their quotes on the public consolidated tape. These are dark pools and off-exchange venues. For quant traders building order flow models, the implication is stark: a strategy trained only on exchange-reported trades operates on a partial picture of reality. Meanwhile, odd lots — trades smaller than the standard 100-share round lot — have grown from a statistical nuisance into a dominant feature of modern equity markets, now representing nearly 30% of all trades by count.

This article dissects two related problems that sit at the foundation of reliable US equity order flow analysis. First, how to identify and tag dark pool executions from trade data. Second, how odd-lot trades distort aggregated K-line constructs and what filtering logic recovers a clean signal. Both topics share a common root: the sale condition codes embedded in every Trade Report (FINRA TRF / Nasdaq UTP).

The Sale Condition Code Infrastructure

Before identifying dark pools or filtering odd lots, you need a precise mental model of what trade data actually contains. A FINRA-formatted Trade Report carries more than price and size. It includes a set of sale condition codes, a short character field (up to four characters on the CTA and UTP consolidated feeds; some vendors re-encode it as a bitmask or a list), that describe how and where the trade was executed.

The most consequential codes for our purposes:

Code	Meaning	Relevance
`T`	Trade reported late (off-hours)	Exclude from regular-session analysis
`O`	Odd lot trade	Critical for K-line aggregation
`D`	Derivatively priced	Exclude from price-based signals
`M`	Closing sale only	Relevant for close-end strategies
`N`	Next-day sale	Irregular settlement
`R`	Seller's sale (opening/reopening)	Context-dependent
`Z`	Inter-market sweep order (ISO)	High-velocity, directional signal
`4`	Dark pool / non-TP (non-published) trade	Dark pool identification
`9`	No remote inter-market sweep	Soft indicator

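The code 4 (ASCII 0x34) is the most direct dark pool indicator in the normalized scheme used here. When present, the trade occurred on a non-public venue (typically a dark pool, internalizer, or wholesale market maker) and did not contribute to the national best bid and offer (NBBO). Trades without a corresponding NBBO quote are sometimes called "print-feel" trades and can deviate meaningfully from mid-price at the time of execution. Treat the letter assignments in the table as vendor-specific: on the raw CTA and UTP consolidated feeds, `4` marks a derivatively priced trade, `I` an odd lot, and `F` an intermarket sweep, while off-exchange prints are identified by the reporting market center (a FINRA TRF) rather than by a sale condition. Confirm the mapping against your vendor's documentation before relying on it.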

Here is the parsing logic for extracting these codes from a raw trade record. This example assumes a JSON trade feed served over REST (a common delivery format from providers like Polygon, Databento, or TickDB's trades endpoint for supported markets):

import os
import time
import requests
import logging
from dataclasses import dataclass
from typing import Optional

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Configuration
API_KEY = os.environ.get("TICKDB_API_KEY")
BASE_URL = "https://api.tickdb.ai/v1"

# NOTE: The `trades` endpoint does not support US equities.
# For US equity analysis, use the kline endpoint (historical OHLCV).
# This code is provided as a reference implementation for markets
# where tick trades ARE available (e.g., HK equities, crypto).

@dataclass
class TradeRecord:
    """Parsed trade record with sale condition decoding."""
    symbol: str
    price: float
    size: int
    timestamp: int  # milliseconds since epoch
    exchange: str
    sale_conditions: str  # Raw condition string, e.g., "T|O|D"
    
    @property
    def is_odd_lot(self) -> bool:
        """Odd lots are trades smaller than 100 shares."""
        return self.size > 0 and self.size < 100
    
    @property
    def is_dark_pool(self) -> bool:
        """Code '4' indicates a non-TP (non-published) venue trade."""
        return '4' in self.sale_conditions
    
    @property
    def is_late_report(self) -> bool:
        """Code 'T' indicates an off-hours late report."""
        return 'T' in self.sale_conditions
    
    @property
    def is_derivative_priced(self) -> bool:
        """Code 'D' indicates a derivatively priced transaction."""
        return 'D' in self.sale_conditions


def parse_sale_conditions(raw: str) -> str:
    """
    Normalize sale condition string from various data providers.
    Providers format this field differently: some use 'T|O|4',
    others use bitmask integers. This normalizer handles common cases.
    """
    if not raw:
        return ""
    
    # Already pipe-delimited
    if '|' in raw or '@' in raw:
        return raw.replace('@', '|')
    
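    # Numeric bitmask (with the map below: 2 = odd lot only,
    # 18 = odd lot + dark pool). The bit layout is vendor-specific and
    # requires documentation from your data vendor.
    # Caution: a bare single-digit code such as "4" is also parsed as a
    # bitmask here (4 -> 'D'), so literal single codes need a delimiter
    # (e.g. "4|") or separate handling.
    # Example normalization for a common format: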
    try:
        mask = int(raw)
        codes = []
        code_map = {1: 'R', 2: 'O', 4: 'D', 8: 'Z', 16: '4', 32: 'T', 64: 'M'}
        for bit, label in code_map.items():
            if mask & bit:
                codes.append(label)
        return '|'.join(codes)
    except (ValueError, TypeError):
        return str(raw)


def fetch_trades_batch(symbol: str, start_ts: int, end_ts: int, 
                       retry_count: int = 3) -> list[TradeRecord]:
    """
    Fetch a batch of trade records for a given symbol and time window.
    Implements exponential backoff with jitter for rate-limit resilience.
    
    NOTE: This endpoint pattern is shown for markets where it is supported.
    Replace with your actual data source for US equity tick data.
    """
    url = f"{BASE_URL}/market/trades"
    headers = {"X-API-Key": API_KEY}
    params = {
        "symbol": symbol,
        "start_time": start_ts,
        "end_time": end_ts,
        "limit": 1000
    }
    
    for attempt in range(retry_count):
        try:
            response = requests.get(url, headers=headers, params=params, 
                                   timeout=(3.05, 15))
            response.raise_for_status()
            data = response.json()
            
            if data.get("code") == 0:
                records = []
                for item in data.get("data", []):
                    trade = TradeRecord(
                        symbol=item["symbol"],
                        price=float(item["price"]),
                        size=int(item["size"]),
                        timestamp=int(item["ts"]),
                        exchange=item.get("exchange", "UNKNOWN"),
                        sale_conditions=parse_sale_conditions(
                            item.get("conditions", "")
                        )
                    )
                    records.append(trade)
                return records
            
            # Rate limit handling (code 3001)
            elif data.get("code") == 3001:
                retry_after = int(response.headers.get("Retry-After", 5))
                wait = retry_after * (2 ** attempt) + time.time() % 1
                logger.warning(f"Rate limited. Waiting {wait:.1f}s before retry.")
                time.sleep(wait)
                continue
            
            else:
                logger.error(f"API error {data.get('code')}: {data.get('message')}")
                return []
                
        except requests.exceptions.Timeout:
            logger.warning(f"Timeout on attempt {attempt + 1}. Retrying...")
            time.sleep(1 * (2 ** attempt))  # Simple backoff
        except requests.exceptions.RequestException as e:
            logger.error(f"Request failed: {e}")
            if attempt == retry_count - 1:
                raise
    
    return []


# Example: Analyze a batch of trades for a supported symbol
if __name__ == "__main__":
    if not API_KEY:
        raise ValueError("Set TICKDB_API_KEY environment variable")
    
    # Fetch HK equity trades (where tick data is supported).
    # "700.HK" is an illustrative symbol; use your vendor's symbol format.
    now = int(time.time() * 1000)
    trades = fetch_trades_batch("700.HK", now - 3600_000, now)
    
    dark_pool_count = sum(1 for t in trades if t.is_dark_pool)
    odd_lot_count = sum(1 for t in trades if t.is_odd_lot)
    
    # Guard against an empty batch to avoid dividing by zero.
    if trades:
        logger.info(f"Fetched {len(trades)} trades. "
                    f"Dark pool: {dark_pool_count} ({100*dark_pool_count/len(trades):.1f}%), "
                    f"odd lots: {odd_lot_count} ({100*odd_lot_count/len(trades):.1f}%)")
    else:
        logger.info("No trades returned for the requested window.")
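
The bitmask branch of the normalizer is easier to reason about with a worked case. The sketch below is self-contained and reuses the same bit layout as `code_map` above, which is a hypothetical layout rather than any specific vendor's; `decode_condition_mask` and `has_condition` are illustrative helper names, not part of any library.

```python
# Hypothetical bit-to-code mapping, copied from the normalizer above.
# Real vendors define their own bit layout; check their documentation.
CODE_MAP = {1: "R", 2: "O", 4: "D", 8: "Z", 16: "4", 32: "T", 64: "M"}


def decode_condition_mask(mask: int) -> list[str]:
    """Return the condition codes whose bits are set in `mask`."""
    return [label for bit, label in CODE_MAP.items() if mask & bit]


def has_condition(conditions: str, code: str) -> bool:
    """Exact-token match on a pipe-delimited condition string."""
    return code in conditions.split("|")


# Odd lot (bit 2) combined with dark pool (bit 16) gives mask 18.
odd_lot_dark = "|".join(decode_condition_mask(2 | 16))
print(odd_lot_dark)                      # O|4
print(has_condition(odd_lot_dark, "4"))  # True

# The integer 4 is the 'D' bit, not the literal code '4'.
print(decode_condition_mask(4))          # ['D']
```

The last line shows why a feed that mixes literal codes and integer masks in the same field needs an explicit format flag: the integer 4 and the character `4` mean different things.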

Dark Pool Identification: Beyond the `4` Code

The 4 condition code is a binary signal — trade came from a non-public venue, or it did not. In practice, dark pool identification requires a more layered approach, because some off-exchange activity carries legitimate informational value while other activity is noise.

Tier 1: Categorical Identification

Venue type	Condition code	Execution quality concern
Dark pool (broker)	`4`	No NBBO contribution; price may lag
Internalizer / wholesaler	`4` + `D`	Derivatively priced; may use midpoint
Exchange (lit)	Absent	NBBO-aligned, full tape
Closing auction	`M`	Concentrated volume, special closing price
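Opening auction	`O`	Opening print, single auction price (this `O` collides with the odd-lot `O` in the code table above; check your vendor's mapping)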

A high 4-to-total ratio on a symbol is a direct measure of dark pool activity. In 2024, the average dark pool percentage for US large-caps was 37–42% by volume, but this varies significantly by symbol. Apple (AAPL) and Tesla (TSLA) regularly exceed 45%, while symbols like BRK.B (low retail ownership) may run under 20%.
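
As a rough illustration of the ratio described above, the following sketch computes the share of volume flagged with `4` per symbol. The sample rows and the function name `dark_share_by_symbol` are made up for the example, and the condition strings are assumed to be already normalized to the pipe-delimited form.

```python
from collections import defaultdict

# Hypothetical sample: (symbol, size, normalized condition string).
trades = [
    ("AAPL", 200, "4"),
    ("AAPL", 300, ""),
    ("AAPL", 50, "O|4"),
    ("BRK.B", 100, ""),
    ("BRK.B", 400, ""),
    ("BRK.B", 100, "4"),
]


def dark_share_by_symbol(rows):
    """Return {symbol: dark-pool share of volume, in percent}."""
    dark = defaultdict(int)
    total = defaultdict(int)
    for symbol, size, conditions in rows:
        total[symbol] += size
        if "4" in conditions.split("|"):
            dark[symbol] += size
    return {s: 100 * dark[s] / total[s] for s in total if total[s] > 0}


shares = dark_share_by_symbol(trades)
for symbol, pct in shares.items():
    print(f"{symbol}: {pct:.1f}% of volume flagged dark")
```

Measuring by volume rather than by trade count keeps a handful of large blocks from being outweighed by many small prints; either basis is defensible as long as it is applied consistently across symbols.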

Tier 2: Price Deviation Filtering

Not all dark pool prints are equal. A dark pool trade printed at the midpoint of the NBBO quoted on lit exchanges is a reasonable execution. A dark pool trade 10 bps away from mid during normal conditions is a signal worth investigating.

def classify_dark_pool_quality(trade: TradeRecord, 
                               best_bid: float, 
                               best_ask: float) -> str:
    """
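    Classify a trade by its execution quality relative to the NBBO.
    Returns: 'lit' | 'high_quality' | 'moderate_drift' | 'anomalous'
    """
    if not trade.is_dark_pool:
        return "lit"
    
    mid = (best_bid + best_ask) / 2
    if mid <= 0:
        # A non-positive mid means the quote inputs are unusable.
        raise ValueError("best_bid and best_ask must be positive")
    drift_bps = abs(trade.price - mid) / mid * 10_000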
    
    if drift_bps < 1.0:  # Less than 1 basis point from mid
        return "high_quality"
    elif drift_bps < 5.0:
        return "moderate_drift"
    else:
        return "anomalous"


def dark_pool_activity_report(trades: list[TradeRecord],
                               best_bids: dict[int, float],
                               best_asks: dict[int, float]) -> dict:
    """
    Aggregate dark pool activity metrics from a batch of trades.
    Groups trades by quality tier and computes volume-weighted stats.
    """
    high_q, mod_drift, anomalous = [], [], []
    
    for trade in trades:
        if not trade.is_dark_pool:
            continue
        
        ts_bucket = trade.timestamp // 60_000  # 1-minute bucket
        bid = best_bids.get(ts_bucket, trade.price)
        ask = best_asks.get(ts_bucket, trade.price)
        quality = classify_dark_pool_quality(trade, bid, ask)
        
        if quality == "high_quality":
            high_q.append(trade)
        elif quality == "moderate_drift":
            mod_drift.append(trade)
        else:
            anomalous.append(trade)
    
    def vol_weighted_price(bucket: list[TradeRecord]) -> float:
        # VWAP = sum(price * size) / sum(size)
        total_dollars = sum(t.size * t.price for t in bucket)
        total_vol = sum(t.size for t in bucket)
        return total_dollars / total_vol if total_vol else 0.0
    
    return {
        "total_dark_trades": len(high_q) + len(mod_drift) + len(anomalous),
        "high_quality_count": len(high_q),
        "moderate_drift_count": len(mod_drift),
        "anomalous_count": len(anomalous),
        "anomalous_pct": len(anomalous) / max(len(high_q) + len(mod_drift) + len(anomalous), 1) * 100,
        "anomalous_vwap": vol_weighted_price(anomalous),
    }

Tier 3: Venue Fingerprinting

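For more sophisticated analysis, the exchange field in trade records can be used to fingerprint the executing venue, provided your data source populates it with a venue identifier (the consolidated tape itself reports off-exchange prints under a FINRA TRF rather than the individual pool). Major dark pool operators include: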

ITG Posit (now part of Virtu)
Liquidnet
Goldman Sachs Sigma X
Morgan Stanley MS Pool
CBOE BIDS
IEX (technically a lit exchange, but with speed bump — treated differently)

A venue with consistently high fill rates but low price impact on the lit market may represent informed institutional flow — valuable signal. A venue with erratic pricing relative to mid may be noise or predatory internalization. The distinction matters for order flow toxicity models.
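
One way to start on the distinction drawn above is to summarize each venue by its volume and its typical distance from mid. This is a minimal sketch under stated assumptions: the venue labels, the sample fills, and the function name `venue_fingerprint` are hypothetical, and the NBBO mid at execution time is assumed to be available from a separate quote source.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (venue label, trade price, NBBO mid at execution, size).
# The venue label is assumed to come from a data source that supplies it.
fills = [
    ("VENUE_A", 185.210, 185.210, 500),
    ("VENUE_A", 185.215, 185.210, 300),
    ("VENUE_B", 185.300, 185.210, 200),
    ("VENUE_B", 185.120, 185.215, 100),
]


def venue_fingerprint(rows):
    """Per venue: total volume and mean absolute drift from mid, in bps."""
    drifts = defaultdict(list)
    volume = defaultdict(int)
    for venue, price, mid, size in rows:
        drifts[venue].append(abs(price - mid) / mid * 10_000)
        volume[venue] += size
    return {
        v: {"volume": volume[v], "mean_drift_bps": mean(drifts[v])}
        for v in volume
    }


fingerprints = venue_fingerprint(fills)
for venue, stats in fingerprints.items():
    print(venue, stats["volume"], round(stats["mean_drift_bps"], 2))
```

A real toxicity model would add a post-trade price impact measure per venue; mean drift alone only captures how far prints land from mid at the moment of execution.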

Odd-Lot Trades: The Invisible Volume Problem

An odd lot is any trade size below 100 shares. While individual odd lots are small, their aggregate market impact is substantial: by trade count, odd lots represent roughly 28–32% of all US equity trades. By volume, they represent 8–12%, driven by the long tail of very small orders.

The problem for K-line aggregation is structural. When you construct a 1-minute OHLCV candle from trade data, each trade contributes its price and volume. An odd lot trade at $185.01, size 43, executed between the bid and ask, will be included in the volume component of the candle but may not reflect genuine market depth. If your aggregation algorithm does not filter odd lots, two distortions emerge:

1. Volume inflation without price certainty: A candle with 50,000 shares of volume may include 15,000 shares of odd-lot noise that represents market-maker hedging flow rather than directional aggression.

2. High-low range contamination: Odd lots that execute at prices slightly off the NBBO mid can artificially widen the high-low range of a candle. A series of 10-share prints at $185.25 when the bid-ask is $185.20–$185.22 creates a phantom high that was never a tradeable price at any meaningful size.
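
The second distortion can be reproduced with a handful of prints. The sketch below uses made-up prices matching the example in the list above, and a hypothetical helper `candle_range`; it assumes at least one print survives the size filter.

```python
# Hypothetical one-minute window of (price, size) prints.
prints = [
    (185.21, 500),
    (185.20, 300),
    (185.25, 10),   # odd lot printed above the 185.20-185.22 quote
    (185.25, 10),
    (185.22, 400),
]


def candle_range(rows, min_size=1):
    """High, low, and volume over prints of at least `min_size` shares."""
    kept = [(p, s) for p, s in rows if s >= min_size]
    prices = [p for p, _ in kept]
    return max(prices), min(prices), sum(s for _, s in kept)


raw_high, raw_low, raw_vol = candle_range(prints)
clean_high, clean_low, clean_vol = candle_range(prints, min_size=100)

print(raw_high, raw_low, raw_vol)        # 185.25 185.2 1220
print(clean_high, clean_low, clean_vol)  # 185.22 185.2 1200
```

Only 20 shares are removed, yet the candle high drops by three cents: the range is far more sensitive to odd lots than the volume is.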

Filtering Odd Lots: The Minimum Size Threshold

The simplest filter is a minimum size threshold. The industry convention treats 100 shares as the round-lot boundary, so a threshold of 100 or higher is standard. However, sophisticated models may use dynamic thresholds based on average daily volume (ADV):

def filter_odd_lots(trades: list[TradeRecord], 
                    min_size: int = 100,
                    dynamic_adv_fraction: Optional[float] = None,
                    adv: Optional[float] = None) -> list[TradeRecord]:
    """
    Filter trades to exclude odd lots based on a fixed or dynamic threshold.
    
    Args:
        trades: List of parsed trade records
        min_size: Fixed minimum size threshold (default: 100 for round lots)
        dynamic_adv_fraction: If set, compute threshold as ADV * fraction
        adv: Average daily volume (required if dynamic_adv_fraction is set)
    """
    if dynamic_adv_fraction is not None and adv is None:
        raise ValueError("ADV must be provided for dynamic threshold")
    
    threshold = min_size
    if dynamic_adv_fraction is not None:
        threshold = max(min_size, int(adv * dynamic_adv_fraction))
    
    filtered = [t for t in trades if t.size >= threshold]
    
    logger.info(f"Filtered {len(trades) - len(filtered)} odd-lot trades "
                f"(threshold: {threshold} shares). "
                f"Retained {len(filtered)} trades.")
    
    return filtered


def build_clean_kline_from_trades(trades: list[TradeRecord],
                                  interval_ms: int = 60_000) -> list[dict]:
    """
    Build an OHLCV candle from a list of trades, with odd-lot filtering
    and dark pool flagging per candle.
    
    Returns a list of candle dicts with open, high, low, close, volume,
    and dark_pool_pct fields.
    """
    # Filter odd lots first
    clean_trades = filter_odd_lots(trades, min_size=100)
    
    # Group by interval
    buckets = {}
    for t in clean_trades:
        bucket = t.timestamp // interval_ms
        if bucket not in buckets:
            buckets[bucket] = []
        buckets[bucket].append(t)
    
    candles = []
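    for bucket_ts in sorted(buckets.keys()):
        # Sort by timestamp so open/close are the first/last prints in time,
        # even if the input batch arrived out of order.
        bucket_trades = sorted(buckets[bucket_ts], key=lambda t: t.timestamp)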
        
        prices = [t.price for t in bucket_trades]
        volumes = [t.size for t in bucket_trades]
        
        open_price = prices[0]
        high = max(prices)
        low = min(prices)
        close = prices[-1]
        volume = sum(volumes)
        
        dark_volume = sum(t.size for t in bucket_trades if t.is_dark_pool)
        dark_pool_pct = dark_volume / volume * 100 if volume > 0 else 0
        
        candles.append({
            "timestamp": bucket_ts * interval_ms,
            "open": round(open_price, 4),
            "high": round(high, 4),
            "low": round(low, 4),
            "close": round(close, 4),
            "volume": volume,
            "dark_pool_pct": round(dark_pool_pct, 2),
            "trade_count": len(bucket_trades),
        })
    
    return candles

Cross-Validation: TickDB Kline as Ground Truth for Aggregate Behavior

When you cannot access tick-level US equity trades directly, the kline endpoint provides a reliable alternative for studying aggregate price behavior. TickDB's US equity kline data is cleaned and aligned across venues, which makes it suitable for validating whether your odd-lot filtering model is producing the correct candle structure:

def fetch_historical_klines(symbol: str, 
                            interval: str = "1m",
                            start_ts: Optional[int] = None,
                            limit: int = 500) -> list[dict]:
    """
    Fetch historical OHLCV klines from TickDB for US equity symbols.
    These are cleaned, venue-aligned klines suitable for backtesting.
    
    NOTE: For tick-level (per-trade) data on US equities, a dedicated
    tick data vendor (e.g., Databento, Polygon) is required. TickDB's
    kline endpoint does not provide per-trade granularity.
    """
    url = f"{BASE_URL}/market/kline"
    headers = {"X-API-Key": API_KEY}
    params = {
        "symbol": symbol,
        "interval": interval,
        "limit": limit,
    }
    if start_ts is not None:
        params["start_time"] = start_ts
    
    response = requests.get(url, headers=headers, params=params, 
                           timeout=(3.05, 10))
    response.raise_for_status()
    data = response.json()
    
    if data.get("code") == 0:
        return data.get("data", [])
    else:
        raise RuntimeError(f"Kline fetch failed: {data}")


def compare_kline_vs_tick_model(kline_candles: list[dict],
                                 tick_model_candles: list[dict]) -> dict:
    """
    Compare candles built from filtered tick data against TickDB's
    cleaned klines to validate your filtering model.
    
    Returns per-candle divergence metrics and aggregate statistics.
    """
    kline_map = {c["timestamp"]: c for c in kline_candles}
    tick_map = {c["timestamp"]: c for c in tick_model_candles}
    
    common_ts = set(kline_map.keys()) & set(tick_map.keys())
    if not common_ts:
        return {"error": "No overlapping timestamps"}
    
    close_diffs, volume_diffs, high_diffs, low_diffs = [], [], [], []
    
    for ts in common_ts:
        k = kline_map[ts]
        t = tick_map[ts]
        
        close_diff_pct = abs(k["close"] - t["close"]) / k["close"] * 100
        volume_diff_pct = abs(k["volume"] - t["volume"]) / max(k["volume"], 1) * 100
        high_diff = max(k["high"], t["high"]) - min(k["high"], t["high"])
        low_diff = max(k["low"], t["low"]) - min(k["low"], t["low"])
        
        close_diffs.append(close_diff_pct)
        volume_diffs.append(volume_diff_pct)
        high_diffs.append(high_diff)
        low_diffs.append(low_diff)
    
    import statistics
    return {
        "overlapping_candles": len(common_ts),
        "avg_close_divergence_bps": statistics.mean(close_diffs) * 100,
        "max_close_divergence_bps": max(close_diffs) * 100,
        "avg_volume_divergence_pct": statistics.mean(volume_diffs),
        "avg_high_low_noise": statistics.mean(high_diffs) + statistics.mean(low_diffs),
    }

Order Flow Toxicity: Combining Dark Pool and Odd-Lot Signals

Separately, dark pool activity and odd-lot volume are interesting. Together, they form the foundation of an order flow toxicity score. The logic: informed traders prefer dark pools (to minimize market impact), while noise traders (retail, HFT arbitrage) often use odd lots. A symbol with high dark pool volume and low odd-lot volume likely has institutional flow. The inverse suggests retail-dominated flow.

A simple toxicity metric:

Toxicity Score = (Dark Pool Volume % × Volume) / (Odd Lot Volume % × Volume + 1)

A score above 2.0 on a 5-minute rolling window is a weak signal of informed trading. Combine this with short-term VWAP deviation and you have a defensible entry filter for mean-reversion strategies.
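
A minimal sketch of the score follows, under two assumptions that the formula leaves open: the percentages are treated as fractions, so each term reduces to a share volume, and the 5-minute window is approximated with fixed (tumbling) buckets rather than a true rolling window. The score has to be computed on unfiltered trades, since a feed with odd lots already removed would make the denominator trivially 1. The sample data and the name `toxicity_by_window` are hypothetical.

```python
from collections import defaultdict

WINDOW_MS = 5 * 60_000  # fixed 5-minute buckets, not a true rolling window

# Hypothetical unfiltered trades: (timestamp_ms, size, is_dark_pool).
trades = [
    (0, 1_000, True),
    (30_000, 50, False),
    (90_000, 400, False),
    (310_000, 200, True),
    (320_000, 80, False),
    (330_000, 90, False),
]


def toxicity_by_window(rows, window_ms=WINDOW_MS):
    """Score = dark-pool volume / (odd-lot volume + 1) per time bucket."""
    dark = defaultdict(int)
    odd = defaultdict(int)
    for ts, size, is_dark in rows:
        bucket = ts // window_ms
        if is_dark:
            dark[bucket] += size
        if 0 < size < 100:
            odd[bucket] += size
    buckets = sorted(set(dark) | set(odd))
    return {b * window_ms: dark[b] / (odd[b] + 1) for b in buckets}


scores = toxicity_by_window(trades)
for start_ms, score in scores.items():
    flag = "informed-flow candidate" if score > 2.0 else "no signal"
    print(start_ms, round(score, 2), flag)
```

In this sample the first bucket scores well above 2.0 (1,000 dark shares against 50 odd-lot shares) while the second falls below it, which is the contrast the threshold is meant to pick up.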

Practical Limitations and Backtest Disclosure

The analysis above reflects idealized trade data conditions. Real-world implementation faces constraints that affect backtest reliability:

Limitation	Impact	Mitigation
Sale condition availability varies by data vendor	Some vendors suppress individual condition codes	Cross-validate against exchange direct feeds
Odd-lot classification may use a 100-share threshold, but some venues treat "mixed lots" differently	100-share round lots with non-standard settlement may be mislabeled	Check venue-specific documentation
Dark pool identification via code `4` does not include "off-exchange" trades that carry no condition code	Latency-arbitrage flow on internalizers may be invisible	Supplement with venue fingerprinting
NBBO drift measurement requires co-located quote data	Without real-time NBBO, the quality classification is approximate	Use a static spread model for backtesting; real-time NBBO for live
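
The static spread mitigation in the last row can be sketched as follows. This is an assumption-heavy approximation, not a calibrated model: the reference price is taken to be the most recent lit print, the 2 bps spread is an arbitrary placeholder, and both function names are hypothetical.

```python
def static_spread_quote(reference_price: float, spread_bps: float = 2.0):
    """Synthesize (bid, ask) centered on a reference price.

    `spread_bps` is the full quoted spread in basis points; 2.0 is an
    arbitrary placeholder, not a calibrated value.
    """
    half = reference_price * spread_bps / 10_000 / 2
    return reference_price - half, reference_price + half


def drift_from_mid_bps(price: float, bid: float, ask: float) -> float:
    """Absolute distance of a print from the quote midpoint, in bps."""
    mid = (bid + ask) / 2
    return abs(price - mid) / mid * 10_000


# Hypothetical: last lit print at 185.20, dark print at 185.26.
bid, ask = static_spread_quote(185.20, spread_bps=2.0)
dark_price = 185.26
inside_quote = bid <= dark_price <= ask

print(round(bid, 4), round(ask, 4))
print(round(drift_from_mid_bps(dark_price, bid, ask), 2), inside_quote)
```

Because the synthetic quote is centered on the reference price, the spread width affects only the inside-or-outside test, not the drift from mid; the drift figure is only as good as the reference price itself.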

Backtest limitations: The order flow toxicity model and odd-lot filtering strategy described here have been validated on synthetic and historical data but have not been live-tested. Backtested results reflect assumptions about data completeness and execution costs. Fixed slippage of 5 bps is assumed; actual execution costs during high-volatility periods (earnings, macroeconomic releases) will be materially higher. The dark pool percentage metric is sourced from FINRA public data (2024); individual venue volumes fluctuate.

Closing

The two problems, dark pool identification and odd-lot filtering, share a deeper lesson: public market data is a curated abstraction of a more complex reality. Every candle, every volume bar, every VWAP calculation is built on a filtered view of a market where roughly half the volume executed on venues that never displayed a quote. Understanding what is filtered, why, and how to recover the signal that matters is what separates a backtest that survives live deployment from one that does not.

For US equity K-line analysis, TickDB provides a reliable foundation of cleaned, aligned OHLCV data covering 10+ years of history — sufficient for building and validating microstructure models across bull and bear regimes. Pair that foundation with the filtering logic above, and you have a defensible pipeline from raw data to clean signal.

Next Steps

If you are building a backtesting pipeline for US equity strategies: Sign up at tickdb.ai to access 10+ years of cleaned US equity kline data across all major exchanges. The kline endpoint supports intervals from 1m to 1M, with full OHLCV and volume fields.

If you need tick-level (per-trade) granularity for US equities: The trades endpoint does not currently support US equity tick data. For tick-level US equity analysis, consult a specialized provider such as Databento, Polygon, or Refinitiv. Use TickDB's kline endpoint for your backtesting OHLCV foundation and validate your odd-lot filtering model against those providers' tick feeds.

If you are integrating with AI-assisted analysis tools: The tickdb-market-data SKILL on ClawHub provides a structured interface for pulling TickDB klines directly into your AI pipeline for microstructure analysis, earnings reaction modeling, and strategy validation.

If you are an institutional team: Contact enterprise@tickdb.ai for access to extended historical data, custom market coverage, and dedicated API support for cross-asset strategy research.

This article does not constitute investment advice. Market data analysis involves inherent data quality limitations; backtested results do not guarantee future performance. Trading involves risk, including the potential loss of principal.