From Paper to Code: A Practical Guide to Reproducing Academic Quant Strategies

The 3 AM Realization

At 3:14 AM on a Tuesday, you find it. The paper that promises everything: a novel alpha factor with a Sharpe ratio of 2.3, published in a peer-reviewed journal, backed by five years of out-of-sample testing. You read it twice. Then a third time. The mathematics are sound. The intuition is elegant. The results seem... achievable.

You spend the next six weeks trying to reproduce it. You request the same dataset. You implement the exact formulas from the paper. You run the backtest. The Sharpe ratio comes back as 0.4.

What went wrong?

This scenario plays out in quant teams around the world every week. Academic papers are written to communicate ideas, not to provide production-ready implementations. The gap between "readable math" and "runnable code" is filled with implicit knowledge, data selection decisions, transaction cost assumptions, and implementation choices that authors rarely document.

This article provides a systematic framework for bridging that gap. We will walk through the complete pipeline: how to read a quant paper critically, how to acquire the data you actually need versus the data the authors claim to use, how to structure your backtest to isolate the strategy's true signal from overfitting artifacts, and how to diagnose why your results diverge from the original.

Along the way, we will demonstrate each step with production-grade Python code using TickDB's API for data acquisition, because a reproducibility pipeline is only as reliable as its data infrastructure.

Module 1: Reading Papers as Engineers

Most quant researchers read papers to understand the strategy. Reproducing a paper requires reading it to identify implementation requirements. These are different tasks, and they demand different reading strategies.

1.1 The First Pass: Identify the Signal Architecture

On your first read, your goal is to understand the strategy's core logic. Ask these questions:

What is the input data? (Price? Volume? Order book? Macroeconomic?)
What is the transformation? (Ranking? Normalization? Signal construction?)
What is the output? (Alpha score? Ranking? Directional forecast?)

Do not get bogged down in the mathematical proofs. Those are important for understanding why the strategy works, but they are not what prevents you from reproducing the results.

1.2 The Second Pass: Catalog Every Data Dependency

This is where most reproducibility efforts fail. Authors describe their data in broad strokes ("we use daily US equity data from 2000 to 2020") but omit critical specifics. Your second pass must extract every data dependency with precision.

Create a data dependency table. For each data series the paper uses, document:

Data field	Explicit in paper?	Assumptions needed
Adjusted close vs. unadjusted close	Explicit in some papers	Often implied, never stated
Split adjustments	Usually not mentioned	CRSP-adjusted vs. raw
Dividend adjustments	Varies widely	Price returns vs. total returns
Survivorship bias	Almost never discussed	Critical for equity long-short
Market capitalization	Sometimes used for weighting	Float-adjusted vs. total shares
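
One way to keep this table from going stale is to record it as data next to the backtest code, so that every unresolved assumption is visible at review time. The sketch below is illustrative only: `DataDependency`, its fields, and the example rows are hypothetical names, not part of any library or of a specific paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataDependency:
    """One row of the data dependency table (hypothetical structure)."""
    field: str
    stated_in_paper: bool   # did the authors say which variant they used?
    assumption: str         # what we assume when the paper is silent


def unresolved(deps: list[DataDependency]) -> list[str]:
    """Return the fields where we rely on our own assumption."""
    return [d.field for d in deps if not d.stated_in_paper]


# Example rows; the True/False values are placeholders for a specific paper.
deps = [
    DataDependency("close price", True, "adjusted close"),
    DataDependency("dividends", False, "price returns, dividends excluded"),
    DataDependency("delisted stocks", False, "included until delisting date"),
]

print(unresolved(deps))
```

Printing the unresolved fields at the start of every backtest run is a cheap reminder of which results rest on your own choices rather than on the paper's text.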

1.3 The Third Pass: Reverse-Engineer the Backtest Design

Academic papers optimize for readability, not for reproducible engineering. You need to reconstruct the backtest design from fragments. Key questions:

Universe definition: Which stocks? How many? What exclusion rules (financials, utilities, ADRs)?

Rebalancing frequency: Daily? Weekly? Monthly? And at what time of day?

Transaction cost model: Fixed commission? Percentage of notional? Spread cost? Many papers use a flat placeholder such as 0.1% one-way, and this assumption alone can turn a profitable strategy into a breakeven one once it is multiplied by the strategy's turnover.

Long/short construction: Equal weight? Value-weighted? Factor-neutral? The long-short construction often accounts for more of the returns than the alpha signal itself.

Risk management: Are stops used? Position limits? Leverage constraints?
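
The transaction cost question above is easier to reason about with the arithmetic written out. The following is a rough sketch, assuming costs scale linearly with traded notional; `annual_cost_drag` and its inputs are hypothetical names and numbers chosen for illustration, not values from any paper.

```python
def annual_cost_drag(
    one_way_cost: float,
    turnover_per_rebalance: float,
    rebalances_per_year: int,
) -> float:
    """
    Approximate annual return lost to trading costs.

    turnover_per_rebalance is traded notional (buys plus sells) as a
    fraction of portfolio value, so replacing every holding is 2.0.
    """
    return one_way_cost * turnover_per_rebalance * rebalances_per_year


# Hypothetical inputs: 0.1% one-way cost, half the book replaced each month
# (sell 50% and buy 50% of portfolio value, so turnover is 1.0).
drag = annual_cost_drag(0.001, 1.0, 12)
print(f"{drag:.2%}")
```

With these inputs the drag is about 1.2% per year, which is why the rebalancing frequency and the cost assumption have to be read together.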

Module 2: Data Acquisition Infrastructure

With your data dependency table in hand, you can now build the data acquisition layer. This is where TickDB's API becomes essential: it provides 10+ years of cleaned, aligned US equity OHLCV data suitable for cross-cycle backtesting.

2.1 Setting Up the Data Pipeline

Production-grade data acquisition requires retry logic with backoff, rate-limit handling, request timeouts, and environment-variable-based authentication. Here is the foundational client:

import os
import time
import random
import logging
from typing import Optional

import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class TickDBClient:
    """
    Production-grade TickDB API client with:
    - Exponential backoff + jitter on reconnection
    - Rate-limit handling (code 3001 + Retry-After)
    - Environment-variable-based authentication
    - Timeout on all HTTP requests
    """

    def __init__(self, api_key: Optional[str] = None):
        self.api_key = api_key or os.environ.get("TICKDB_API_KEY")
        if not self.api_key:
            raise ValueError(
                "TickDB API key not found. Set TICKDB_API_KEY environment variable."
            )
        self.base_url = "https://api.tickdb.ai/v1"
        self.max_retries = 5
        self.base_delay = 1.0
        self.max_delay = 32.0

    def _request_with_retry(self, method: str, endpoint: str, **kwargs) -> dict:
        """
        Execute HTTP request with exponential backoff, jitter, and rate-limit handling.
        """
        timeout = kwargs.pop("timeout", (3.05, 10))
        retry_count = 0

        while retry_count <= self.max_retries:
            try:
                response = requests.request(
                    method=method,
                    url=f"{self.base_url}{endpoint}",
                    headers={
                        "X-API-Key": self.api_key,
                        "Content-Type": "application/json",
                    },
                    timeout=timeout,
                    **kwargs,
                )

                data = response.json() if response.content else {}

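                # Handle rate limiting; each wait counts toward max_retries so a
                # persistent 3001 response cannot loop forever
                if data.get("code") == 3001:
                    retry_count += 1
                    retry_after = int(
                        response.headers.get("Retry-After", 5)
                    )
                    logger.warning(
                        f"Rate limited. Retrying after {retry_after}s "
                        f"(attempt {retry_count}/{self.max_retries})."
                    )
                    time.sleep(retry_after)
                    continue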

                # Handle authentication errors
                if data.get("code") in (1001, 1002):
                    raise ValueError(
                        f"Authentication error ({data.get('code')}): "
                        f"Invalid API key — check TICKDB_API_KEY env var."
                    )

                # Handle symbol not found
                if data.get("code") == 2002:
                    raise KeyError(
                        "Symbol not found: verify via /v1/symbols/available."
                    )

                if data.get("code") == 0:
                    return data.get("data", {})
                else:
                    raise RuntimeError(
                        f"API error {data.get('code')}: {data.get('message')}"
                    )

            except requests.exceptions.Timeout:
                retry_count += 1
                delay = min(self.base_delay * (2 ** retry_count), self.max_delay)
                jitter = random.uniform(0, delay * 0.1)
                logger.warning(
                    f"Request timeout. Retrying in {delay + jitter:.2f}s "
                    f"(attempt {retry_count}/{self.max_retries})."
                )
                time.sleep(delay + jitter)

            except requests.exceptions.RequestException as e:
                retry_count += 1
                delay = min(self.base_delay * (2 ** retry_count), self.max_delay)
                jitter = random.uniform(0, delay * 0.1)
                logger.warning(
                    f"Request failed: {e}. Retrying in {delay + jitter:.2f}s "
                    f"(attempt {retry_count}/{self.max_retries})."
                )
                time.sleep(delay + jitter)

        raise RuntimeError(
            f"Max retries ({self.max_retries}) exceeded for {endpoint}."
        )

    def get_kline(
        self,
        symbol: str,
        interval: str = "1d",
        limit: int = 500,
        start_time: Optional[int] = None,
        end_time: Optional[int] = None,
    ) -> list:
        """
        Fetch historical OHLCV (kline) data for backtesting.

        Args:
            symbol: Trading symbol (e.g., "AAPL.US", "NVDA.US")
            interval: Candle interval ("1d", "1h", "5m", etc.)
            limit: Number of candles to fetch (max 1000 per request)
            start_time: Unix timestamp (ms) for range start
            end_time: Unix timestamp (ms) for range end

        Returns:
            List of OHLCV candles, sorted oldest to newest
        """
        params = {"symbol": symbol, "interval": interval, "limit": limit}
        # Compare against None so that a timestamp of 0 is still sent
        if start_time is not None:
            params["start_time"] = start_time
        if end_time is not None:
            params["end_time"] = end_time

        return self._request_with_retry("GET", "/market/kline", params=params)

    def get_symbols(self, market: str = "US") -> list:
        """
        List available symbols for a given market.

        Args:
            market: Market code ("US", "HK", "CRYPTO", etc.)

        Returns:
            List of available trading symbols
        """
        return self._request_with_retry(
            "GET", "/symbols/available", params={"market": market}
        )


# ⚠️ Engineering warning:
# For high-frequency live trading systems, replace this synchronous client
# with an async implementation using aiohttp.
# The above is designed for backtesting and lower-frequency strategies.
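
To see what the retry settings mean in wall-clock terms, the delay schedule can be tabulated separately from the client. This is a small sketch that mirrors the expression `min(base_delay * 2 ** retry_count, max_delay)` used above, with jitter left out; `backoff_schedule` is a hypothetical helper, not part of the TickDB API.

```python
def backoff_schedule(
    base_delay: float = 1.0,
    max_delay: float = 32.0,
    max_retries: int = 5,
) -> list[float]:
    """
    Delays (seconds, before jitter) slept after each failed attempt.

    The client increments its counter before computing the delay and loops
    while the counter is <= max_retries, so it makes max_retries + 1
    attempts and the first delay is 2 * base_delay.
    """
    return [
        min(base_delay * (2 ** attempt), max_delay)
        for attempt in range(1, max_retries + 2)
    ]


schedule = backoff_schedule()
print(schedule)       # one delay per failed attempt
print(sum(schedule))  # worst-case total wait before jitter
```

With the defaults, a request that keeps timing out waits roughly a minute and a half in total before the client gives up, which matters when a universe fetch loops over hundreds of symbols.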

2.2 Fetching the Backtest Universe

With the client in place, you can now construct your universe data. For a typical US equity strategy, you need to fetch a broad universe of stocks. The following function builds a daily OHLCV DataFrame for a list of symbols over a given date range:

import pandas as pd
from datetime import datetime, timedelta
from tqdm import tqdm


def fetch_universe_ohlcv(
    client: TickDBClient,
    symbols: list[str],
    start_date: str,
    end_date: str,
    interval: str = "1d",
) -> pd.DataFrame:
    """
    Fetch OHLCV data for a universe of symbols.

    Args:
        client: TickDBClient instance
        symbols: List of tickers (e.g., ["AAPL.US", "MSFT.US"])
        start_date: Start date string (YYYY-MM-DD)
        end_date: End date string (YYYY-MM-DD)
        interval: Candle interval

    Returns:
        DataFrame with columns: timestamp, symbol, open, high, low, close, volume
    """
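    # Interpret the dates as UTC midnight so the range does not depend on the
    # machine's local timezone (naive datetime.timestamp() uses local time)
    start_ts = int(pd.Timestamp(start_date, tz="UTC").timestamp() * 1000)
    end_ts = int(pd.Timestamp(end_date, tz="UTC").timestamp() * 1000)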

    all_candles = []

    for symbol in tqdm(symbols, desc="Fetching OHLCV data"):
        try:
            candles = client.get_kline(
                symbol=symbol,
                interval=interval,
                limit=1000,
                start_time=start_ts,
                end_time=end_ts,
            )

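            if not candles:
                continue

            # A full page suggests the range was truncated at the 1000-candle
            # request limit; split the date range into smaller requests
            if len(candles) >= 1000:
                logger.warning(
                    f"{symbol}: received {len(candles)} candles, the per-request "
                    f"maximum. Part of the requested range may be missing."
                )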

            df = pd.DataFrame(candles)
            df["symbol"] = symbol
            df["timestamp"] = pd.to_datetime(df["t"], unit="ms")
            df = df.rename(columns={
                "o": "open",
                "h": "high",
                "l": "low",
                "c": "close",
                "v": "volume",
            })
            all_candles.append(df[["timestamp", "symbol", "open", "high", "low", "close", "volume"]])

        except Exception as e:
            logger.warning(f"Failed to fetch {symbol}: {e}")
            continue

    if not all_candles:
        return pd.DataFrame()

    combined = pd.concat(all_candles, ignore_index=True)
    combined = combined.sort_values(["symbol", "timestamp"]).reset_index(drop=True)

    logger.info(
        f"Fetched {len(combined):,} candles for {combined['symbol'].nunique()} symbols "
        f"from {start_date} to {end_date}."
    )
    return combined


# Usage example:
# client = TickDBClient()
# universe_data = fetch_universe_ohlcv(
#     client=client,
#     symbols=["AAPL.US", "MSFT.US", "GOOGL.US", "AMZN.US", "META.US"],
#     start_date="2015-01-01",
#     end_date="2024-12-31",
# )
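
Two details of the fetch step are worth making explicit: date strings should be converted to timestamps in a fixed timezone, and a ten-year daily range holds roughly 2,500 sessions, more than the 1000-candle limit of a single request. The sketch below shows one way to handle both with the standard library. `to_utc_ms` and `chunk_range` are hypothetical helpers, and whether the API treats range bounds as inclusive is an assumption you would need to check.

```python
from datetime import datetime, timezone

DAY_MS = 86_400_000  # milliseconds in one day


def to_utc_ms(date_str: str) -> int:
    """Convert YYYY-MM-DD to a Unix timestamp in ms, read as UTC midnight."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def chunk_range(
    start_ms: int,
    end_ms: int,
    max_candles: int = 1000,
    interval_ms: int = DAY_MS,
) -> list[tuple[int, int]]:
    """Split [start_ms, end_ms] into windows of at most max_candles intervals."""
    span = max_candles * interval_ms
    chunks = []
    cursor = start_ms
    while cursor < end_ms:
        chunk_end = min(cursor + span, end_ms)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end  # adjacent windows share a boundary timestamp
    return chunks


start = to_utc_ms("2015-01-01")
end = to_utc_ms("2024-12-31")
windows = chunk_range(start, end)
print(len(windows))  # number of requests needed per symbol
```

Each window would be passed to `get_kline` as `start_time` and `end_time`. Because adjacent windows share a boundary, drop duplicate timestamps after concatenating the results if the bounds turn out to be inclusive.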

Module 3: Signal Construction and Factor Implementation

With clean data in hand, you can now implement the strategy itself. This is where precision matters most: a single normalization step, one missing adjustment, or a data alignment error will corrupt your entire backtest.

3.1 Reconstructing the Factor from the Paper

Academic papers typically describe factors using mathematical notation that maps directly to pandas operations. The challenge is translating that notation into code that handles edge cases gracefully.

Consider a paper that describes a momentum factor as follows: "We rank stocks by their 12-month return, skipping the most recent month, and go long the top decile and short the bottom decile, rebalancing monthly."

This description requires several implementation decisions that the paper leaves implicit:

import numpy as np


def compute_momentum_factor(
    prices: pd.DataFrame,
    lookback_months: int = 12,
    skip_months: int = 1,
    min_lookback_days: int = 200,
) -> pd.DataFrame:
    """
    Compute the classic momentum factor: 12-month return, skipping the last month.

    The paper specifies: rank stocks by their cumulative return over months T-12
    to T-1, excluding the most recent month (T).

    Implementation decisions (not specified in the paper):
    - We use trading days as a proxy for months (252 trading days/year)
    - We require at least min_lookback_days of data to compute the factor
    - Returns are computed as log returns for continuous compounding
    - NaN values are excluded from ranking (stocks without sufficient history)

    Args:
        prices: DataFrame with columns [timestamp, symbol, close]
        lookback_months: Number of months for momentum lookback (converted to trading days)
        skip_months: Number of months to skip at the end (momentum skip period)
        min_lookback_days: Minimum trading days required for a valid factor value

    Returns:
        DataFrame with columns [timestamp, symbol, momentum_factor]
    """
    df = prices.copy()
    df = df.sort_values(["symbol", "timestamp"])

    # Convert month specifications to trading day approximations
    lookback_days = int(lookback_months * (252 / 12))
    skip_days = int(skip_months * (252 / 12))

    # Compute log returns
    df["log_return"] = df.groupby("symbol")["close"].transform(
        lambda x: np.log(x / x.shift(1))
    )

    # Cumulative log return over the formation window, which ends skip_days ago:
    # sum (lookback_days - skip_days) daily returns, then shift by skip_days
    def rolling_cumulative_return(series, window: int) -> pd.Series:
        return series.rolling(window=window, min_periods=int(window * 0.75)).sum()

    df["momentum_raw"] = df.groupby("symbol")["log_return"].transform(
        lambda x: rolling_cumulative_return(x, lookback_days - skip_days).shift(skip_days)
    )

    # Drop each symbol's first min_lookback_days rows (insufficient history)
    df = df[df.groupby("symbol").cumcount() >= min_lookback_days].reset_index(drop=True)

    # Rank within each timestamp (cross-sectional ranking)
    # Rank 1 = best momentum (highest return)
    df["momentum_factor"] = df.groupby("timestamp")["momentum_raw"].rank(
        pct=True, ascending=True
    )

    result = df[["timestamp", "symbol", "momentum_factor"]].copy()
    return result


def construct_long_short_portfolios(
    factor_df: pd.DataFrame,
    top_pct: float = 0.1,
    bottom_pct: float = 0.1,
) -> pd.DataFrame:
    """
    Construct long-short portfolios from a factor ranking.

    The paper specifies: go long the top decile, short the bottom decile.
    This implementation generalizes to any top/bottom percentile.

    Args:
        factor_df: DataFrame with columns [timestamp, symbol, factor_value]
                   (rename momentum_factor to factor_value when passing the
                   output of compute_momentum_factor)
        top_pct: Fraction of universe to go long (e.g., 0.1 = top decile)
        bottom_pct: Fraction of universe to go short (e.g., 0.1 = bottom decile)

    Returns:
        DataFrame with columns [timestamp, symbol, position]
        where position = 1 (long), -1 (short), or 0 (no position)
    """
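    # Drop rows without a factor value; otherwise sort_values places NaNs last
    # and they would be assigned to the short leg
    df = factor_df.dropna(subset=["factor_value"]).copy()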

    def assign_positions(group):
        n = len(group)
        top_n = int(np.ceil(n * top_pct))
        bottom_n = int(np.ceil(n * bottom_pct))

        # Sort by factor value descending (highest factor = best)
        group = group.sort_values("factor_value", ascending=False)

        group["position"] = 0
        group.iloc[:top_n, group.columns.get_loc("position")] = 1
        group.iloc[-bottom_n:, group.columns.get_loc("position")] = -1

        return group

    result = df.groupby("timestamp", group_keys=False).apply(assign_positions)
    return result.reset_index(drop=True)
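
The paper's description says "rebalancing monthly", while the functions above produce a ranking and a position for every trading day. One way to close that gap is to keep only the signals on each month's last trading date. The sketch below shows the date selection with the standard library; `month_end_dates` is a hypothetical helper, and treating the last available session as the rebalance date is an assumption the paper may not share.

```python
from datetime import date


def month_end_dates(trading_dates: list[date]) -> list[date]:
    """Keep the last available trading date of each calendar month."""
    last_by_month: dict[tuple[int, int], date] = {}
    for d in sorted(trading_dates):
        # Later dates in the same month overwrite earlier ones
        last_by_month[(d.year, d.month)] = d
    return list(last_by_month.values())


dates = [
    date(2024, 1, 30),
    date(2024, 1, 31),
    date(2024, 2, 28),
    date(2024, 2, 29),
    date(2024, 3, 1),
]
rebalance_dates = month_end_dates(dates)
print(rebalance_dates)
```

Note that the final month in the sample is usually incomplete, so its "month end" is simply the last date seen (March 1 in this example). Drop it if the paper rebalances only on completed months.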

3.2 Handling the Data Adjustments the Paper Forgot

This is where your results will most likely diverge from the paper's reported performance. Three common issues:

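Survivorship bias: Your universe and the paper's may treat delisted stocks differently. Research databases such as CRSP retain delisted securities, whereas a list of currently tradable symbols contains only survivors, and papers built on other sources may themselves be survivor-only. Build your backtest on a point-in-time universe that includes stocks up to their delisting date, or at minimum quantify how many names are missing.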

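Adjustment methodology: Dividend-adjusted (total) returns exceed price-only returns by roughly the dividend yield, which is on the order of 1–3% per year for broad US equity universes. Using a different adjustment than the paper can shift your factor returns by a meaningful amount, and unadjusted prices will also register splits as large spurious returns.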

Data cleaning: Academic datasets often exclude penny stocks, stocks below a price threshold, or stocks with insufficient liquidity. If the paper does not specify these filters, you need to decide whether to apply them conservatively (fewer exclusions, closer to reality for a live strategy) or match the paper exactly (more exclusions, for accurate comparison).

def apply_universe_filters(
    prices: pd.DataFrame,
    min_price: float = 5.0,
    min_volume: float = 100_000,
    min_volume_days_pct: float = 0.8,
) -> pd.DataFrame:
    """
    Apply conservative liquidity and price filters.

    These filters are not specified in most academic papers.
    They represent common-sense risk management for live deployment.

    Args:
        prices: DataFrame with columns [timestamp, symbol, close, volume]
        min_price: Minimum average price over the lookback window
        min_volume: Minimum average daily volume
        min_volume_days_pct: Fraction of days that must meet the min_volume threshold

    Returns:
        Filtered DataFrame
    """
    df = prices.copy()

    # Compute trailing 30-day average price and volume per symbol
    df["avg_price_30d"] = df.groupby("symbol")["close"].transform(
        lambda x: x.rolling(window=30, min_periods=20).mean()
    )
    df["avg_volume_30d"] = df.groupby("symbol")["volume"].transform(
        lambda x: x.rolling(window=30, min_periods=20).mean()
    )

    # Fraction of the previous 60 sessions whose volume exceeded the threshold
    # (shifted by one day so the current session is not used)
    df["volume_compliance"] = df.groupby("symbol")["volume"].transform(
        lambda v: (v > min_volume).astype(float).shift(1)
        .rolling(window=60, min_periods=30).mean()
    )

    # Apply filters
    mask = (
        (df["avg_price_30d"] >= min_price) &
        (df["avg_volume_30d"] >= min_volume) &
        (df["volume_compliance"] >= min_volume_days_pct)
    )
    filtered = df[mask].copy()

    logger.info(
        f"Universe filter: {df['symbol'].nunique()} symbols before → "
        f"{filtered['symbol'].nunique()} symbols after. "
        f"Rows removed: {len(df) - len(filtered):,} "
        f"({100 * (1 - len(filtered)/len(df)):.1f}%)."
    )

    return filtered
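
The adjustment-methodology point in this section can be made concrete with a single holding period. The sketch below contrasts a price-only return with a total return that includes a cash dividend; the function names and the numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def price_return(p_start: float, p_end: float) -> float:
    """Return from price change alone."""
    return p_end / p_start - 1


def total_return(p_start: float, p_end: float, dividends: float) -> float:
    """Return including cash dividends paid during the period."""
    return (p_end + dividends) / p_start - 1


# Hypothetical year: the price moves from 100 to 104 and the stock pays 2.00
pr = price_return(100.0, 104.0)
tr = total_return(100.0, 104.0, 2.0)
print(f"price return {pr:.2%}, total return {tr:.2%}, gap {tr - pr:.2%}")
```

Here the gap equals the 2% dividend yield. For a long-short factor, what matters is the difference in yield between the two legs, so the effect on the spread can be smaller or larger than the headline figure.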

Module 4: Backtesting Engine with Transaction Cost Modeling

A backtest without realistic transaction costs is not a strategy evaluation. It is a fantasy. Your backtest engine must model the costs that exist in live trading, not the idealized costs that make the paper look good.

4.1 The Transaction Cost Model

Transaction costs in equity markets consist of three components:

Component	Typical magnitude	How to model
Commission	About $0.005 per share	Fixed per-share charge
Spread cost	0.5–2 bps for liquid stocks	Half-spread × position size
Market impact	Non-linear, depends on order size	Assumed proportional for small orders

For a liquid US equity strategy trading the top/bottom decile of a 500-stock universe, a reasonable cost model is:

class TransactionCostModel:
    """
    Three-component transaction cost model.

    Model components:
    1. Commission: $0.005 per share (e.g., Interactive Brokers tier)
    2. Half-spread: 0.5 bps for liquid stocks (configurable)
    3. Market impact: 0.25 bps (conservative for decile-portfolio turnover)

    Total one-way cost = 0.75 bp plus commission. At a $50 share price the
    commission adds 1.0 bp, for about 1.75 bp one-way.
    """

    def __init__(
        self,
        commission_per_share: float = 0.005,
        half_spread_bps: float = 0.5,
        market_impact_bps: float = 0.25,
    ):
        self.commission_per_share = commission_per_share
        self.half_spread_bps = half_spread_bps
        self.market_impact_bps = market_impact_bps

    def one_way_cost_bps(self, price: float, shares: int) -> float:
        """
        Compute one-way transaction cost in basis points.

        Args:
            price: Execution price per share
            shares: Number of shares traded

        Returns:
            One-way cost in basis points
        """
        notional = price * shares
        if notional <= 0:
            return 0.0  # nothing traded, so no cost and no division by zero
        commission_cost = shares * self.commission_per_share
        spread_cost = notional * (self.half_spread_bps / 10_000)
        impact_cost = notional * (self.market_impact_bps / 10_000)

        total_cost = commission_cost + spread_cost + impact_cost
        cost_bps = (total_cost / notional) * 10_000

        return cost_bps

    def round_trip_cost_bps(self, price: float, shares: int) -> float:
        """Round-trip cost (entry + exit) in basis points."""
        return 2 * self.one_way_cost_bps(price, shares)


def run_backtest(
    prices: pd.DataFrame,
    positions: pd.DataFrame,
    tcm: TransactionCostModel,
    initial_capital: float = 10_000_000,
) -> pd.DataFrame:
    """
    Run a backtest with transaction costs and portfolio-level P&L.

    Args:
        prices: DataFrame with columns [timestamp, symbol, close, volume]
        positions: DataFrame with columns [timestamp, symbol, position]
                   (position = 1 for long, -1 for short, 0 for no position)
        tcm: TransactionCostModel instance
        initial_capital: Starting portfolio value

    Returns:
        DataFrame with daily portfolio returns and cumulative performance
    """
    # Align prices and positions by timestamp
    prices_sorted = prices.sort_values(["timestamp", "symbol"])
    positions_sorted = positions.sort_values(["timestamp", "symbol"])

    # Compute daily returns
    prices_sorted["daily_return"] = prices_sorted.groupby("symbol")["close"].pct_change()

    # Merge positions with daily returns
    merged = pd.merge(
        prices_sorted[["timestamp", "symbol", "close", "daily_return"]],
        positions_sorted[["timestamp", "symbol", "position"]],
        on=["timestamp", "symbol"],
        how="left",
    )
    merged["position"] = merged["position"].fillna(0)

    # Equal-weight the active positions: 90% gross exposure split across names
    merged["n_positions"] = merged.groupby("timestamp")["position"].transform(
        lambda x: (x != 0).sum()
    )
    merged["weight"] = merged["position"] * 0.9 / merged["n_positions"].replace(0, 1)

    # Weight held coming into each day, and the change traded at that day's close
    merged = merged.sort_values(["symbol", "timestamp"])
    merged["prev_weight"] = merged.groupby("symbol")["weight"].shift(1).fillna(0)
    merged["trade_abs"] = (merged["weight"] - merged["prev_weight"]).abs()

    # Share count per position, used only to convert the per-share commission to bps
    merged["position_value"] = (initial_capital * 0.9) / merged["n_positions"].replace(0, 1)
    merged["shares"] = (merged["position_value"] / merged["close"]).fillna(0).astype(int)

    # One-way cost per trade; the exit is charged separately when it occurs,
    # so charging a round trip here would double count
    merged["trade_cost_bps"] = merged.apply(
        lambda row: tcm.one_way_cost_bps(row["close"], row["shares"])
        if row["trade_abs"] > 0 and row["shares"] > 0 else 0.0,
        axis=1,
    )

    # Daily P&L: the weight held since the prior close earns today's return,
    # which keeps the signal date strictly before the return it is paid on
    merged["daily_pnl_pct"] = merged["prev_weight"] * merged["daily_return"].fillna(0)
    merged["cost_drag_pct"] = -merged["trade_abs"] * merged["trade_cost_bps"] / 10_000

    # Aggregate to portfolio level
    portfolio = merged.groupby("timestamp").agg(
        gross_return_pct=("daily_pnl_pct", "sum"),
        cost_drag_pct=("cost_drag_pct", "sum"),
        n_positions=("n_positions", "first"),
    ).reset_index()

    portfolio["net_return_pct"] = portfolio["gross_return_pct"] + portfolio["cost_drag_pct"]
    portfolio["cumulative_return"] = (1 + portfolio["net_return_pct"]).cumprod() - 1
    portfolio["portfolio_value"] = initial_capital * (1 + portfolio["cumulative_return"])

    return portfolio
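
Because the commission in the cost model above is charged per share, the one-way cost in basis points depends on the share price. The standalone sketch below repeats the model's arithmetic outside the class so the effect is easy to see; the function is a simplified restatement with the same default parameters, and the three prices are hypothetical.

```python
def one_way_cost_bps(
    price: float,
    commission_per_share: float = 0.005,
    half_spread_bps: float = 0.5,
    market_impact_bps: float = 0.25,
) -> float:
    """One-way cost in bps; the share count cancels out of the ratio."""
    commission_bps = commission_per_share / price * 10_000
    return commission_bps + half_spread_bps + market_impact_bps


for price in (10.0, 50.0, 200.0):
    print(f"${price:>6.2f}: {one_way_cost_bps(price):.2f} bps one-way")
```

With these defaults a $10 stock costs 5.75 bps one-way, a $50 stock 1.75 bps, and a $200 stock 1.00 bps, so a universe tilted toward low-priced names pays noticeably more than the headline assumption suggests.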

4.2 Walk-Forward Validation

Many academic papers report results over the same sample that was used to design the strategy. A production-quality backtest requires out-of-sample validation. The walk-forward approach computes the signal on a rolling window of past data and evaluates it on the subsequent period:

def walk_forward_backtest(
    prices: pd.DataFrame,
    factor_func: callable,
    train_months: int = 36,
    test_months: int = 12,
    rebalance_days: int = 21,
) -> pd.DataFrame:
    """
    Walk-forward backtest with rolling train/test windows.

    Args:
        prices: DataFrame with columns [timestamp, symbol, close, volume]
        factor_func: Function that takes the training-window prices and returns
                     a DataFrame with columns [timestamp, symbol, position]
        train_months: Training window in months (trading days = months * 21)
        test_months: Testing window in months
        rebalance_days: Rebalance frequency within the test window

    Returns:
        DataFrame with out-of-sample performance metrics
    """
    train_days = train_months * 21
    test_days = test_months * 21
    step_days = test_days  # Non-overlapping test windows

    all_results = []

    # Get sorted timestamps
    timestamps = sorted(prices["timestamp"].unique())
    start_idx = train_days

    while start_idx + test_days <= len(timestamps):
        test_end = start_idx + test_days
        test_dates = timestamps[start_idx:test_end]
        test_data = prices[prices["timestamp"].isin(test_dates)].copy()

        position_frames = []
        for rebalance_idx in range(start_idx, test_end, rebalance_days):
            # Rolling window: the train_days sessions strictly before the
            # rebalance date, so no test-period data reaches the signal
            window = timestamps[rebalance_idx - train_days:rebalance_idx]
            train_data = prices[prices["timestamp"].isin(window)].copy()

            signal = factor_func(train_data)
            latest = signal[signal["timestamp"] == window[-1]]

            # Hold the latest positions until the next rebalance date
            hold_end = min(rebalance_idx + rebalance_days, test_end)
            for hold_date in timestamps[rebalance_idx:hold_end]:
                position_frames.append(
                    latest[["symbol", "position"]].assign(timestamp=hold_date)
                )

        test_positions = pd.concat(position_frames, ignore_index=True)

        # Run backtest
        tcm = TransactionCostModel()
        result = run_backtest(test_data, test_positions, tcm)

        if len(result) > 0:
            # Index test_end - 1 is the last test session; test_end itself can
            # fall one past the end of the list
            result["train_end"] = timestamps[start_idx - 1]
            result["test_end"] = timestamps[test_end - 1]
            all_results.append(result)

        start_idx += step_days

    if not all_results:
        return pd.DataFrame()

    combined = pd.concat(all_results, ignore_index=True)
    return combined
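
Off-by-one errors in window boundaries are a common source of look-ahead bias, so it can help to generate and inspect the index ranges before running anything expensive. The sketch below produces the train and test index ranges for non-overlapping test windows. `walk_forward_windows` is a hypothetical helper written for this illustration, and the five-year sample length is an assumed example.

```python
def walk_forward_windows(
    n_sessions: int,
    train_days: int,
    test_days: int,
) -> list[tuple[int, int, int]]:
    """
    Return (train_start, test_start, test_end) index triples.

    Training covers [train_start, test_start) and testing covers
    [test_start, test_end), so the two ranges never overlap.
    """
    windows = []
    test_start = train_days
    while test_start + test_days <= n_sessions:
        windows.append((test_start - train_days, test_start, test_start + test_days))
        test_start += test_days  # step by one full test window
    return windows


# Assumed example: 5 years of daily data, 36-month train, 12-month test
windows = walk_forward_windows(n_sessions=5 * 252, train_days=36 * 21, test_days=12 * 21)
for train_start, test_start, test_end in windows:
    print(f"train [{train_start}, {test_start})  test [{test_start}, {test_end})")
```

Five years of data yields only two out-of-sample windows under these settings, which is a useful sanity check on how much evidence the validation can actually provide.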

Module 5: Results Comparison and Gap Analysis

Once you have your backtest results, you need to compare them to the paper's reported results and diagnose any discrepancies. This is both a debugging exercise and a learning exercise: the gaps often reveal hidden assumptions in the paper's design.

5.1 Standard Comparison Metrics

def compute_performance_metrics(returns: pd.Series) -> dict:
    """
    Compute a comprehensive set of performance metrics.

    These metrics allow direct comparison with the paper's reported results.
    """
    n = len(returns)
    if n < 20:
        return {"error": "Insufficient data points"}

    excess_returns = returns
    annualized_return = excess_returns.mean() * 252
    annualized_vol = excess_returns.std() * np.sqrt(252)
    sharpe = annualized_return / annualized_vol if annualized_vol > 0 else 0

    # Downside deviation (Sortino denominator)
    downside_returns = excess_returns[excess_returns < 0]
    downside_vol = (
        downside_returns.std() * np.sqrt(252) if len(downside_returns) > 0 else 0
    )
    sortino = annualized_return / downside_vol if downside_vol > 0 else 0

    # Max drawdown
    cumulative = (1 + excess_returns).cumprod()
    running_max = cumulative.expanding().max()
    drawdown = (cumulative - running_max) / running_max
    max_drawdown = drawdown.min()

    # Win rate
    win_rate = (excess_returns > 0).mean()

    # Profit factor
    gross_profits = excess_returns[excess_returns > 0].sum()
    gross_losses = abs(excess_returns[excess_returns < 0].sum())
    profit_factor = gross_profits / gross_losses if gross_losses > 0 else np.inf

    # Average win / average loss
    avg_win = excess_returns[excess_returns > 0].mean() if len(excess_returns[excess_returns > 0]) > 0 else 0
    avg_loss = excess_returns[excess_returns < 0].mean() if len(excess_returns[excess_returns < 0]) > 0 else 0

    return {
        "annualized_return": annualized_return,
        "annualized_volatility": annualized_vol,
        "sharpe_ratio": sharpe,
        "sortino_ratio": sortino,
        "max_drawdown": max_drawdown,
        "win_rate": win_rate,
        "profit_factor": profit_factor,
        "avg_win": avg_win,
        "avg_loss": avg_loss,
        "n_observations": n,
        "n_trading_days": n,
    }


def compare_with_paper(
    your_metrics: dict,
    paper_metrics: dict,
    tolerance_bps: dict = None,
) -> pd.DataFrame:
    """
    Compare your backtest results against the paper's reported metrics.

    Args:
        your_metrics: Dict of metrics from compute_performance_metrics()
        paper_metrics: Dict of reported metrics from the paper
        tolerance_bps: Allowed absolute difference per metric, in the metric's own
                       units (0.02 = 2 percentage points of return), not basis points

    Returns:
        DataFrame showing comparison with pass/fail status
    """
    if tolerance_bps is None:
        # Default tolerance: 2% annualized return, 0.2 Sharpe, 5% max drawdown
        tolerance_bps = {
            "annualized_return": 0.02,  # 2% absolute difference
            "sharpe_ratio": 0.2,
            "max_drawdown": 0.05,
        }

    comparison = []
    for metric, your_value in your_metrics.items():
        if metric == "error":
            continue
        paper_value = paper_metrics.get(metric, None)
        if paper_value is None:
            continue

        if isinstance(your_value, (int, float)) and isinstance(paper_value, (int, float)):
            diff = your_value - paper_value
            tolerance = tolerance_bps.get(metric, 0.1)
            passed = abs(diff) <= tolerance
            comparison.append({
                "metric": metric,
                "your_value": round(your_value, 4),
                "paper_value": round(paper_value, 4),
                "difference": round(diff, 4),
                "status": "✅ Pass" if passed else "❌ Gap",
            })

    return pd.DataFrame(comparison)
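
Before comparing Sharpe ratios, check that both numbers are annualized from the same return frequency. Papers often work with monthly returns, while the metrics above assume daily data and a factor of 252. The sketch below shows the scaling under the usual assumption of independent, identically distributed returns; `annualized_sharpe` is a hypothetical helper and the return series is made up.

```python
import math
import statistics


def annualized_sharpe(returns: list[float], periods_per_year: int) -> float:
    """Mean over sample standard deviation, scaled by sqrt(periods per year)."""
    vol = statistics.stdev(returns)
    if vol == 0:
        return 0.0
    return statistics.fmean(returns) / vol * math.sqrt(periods_per_year)


# Made-up series, read first as monthly and then (incorrectly) as daily
sample = [0.02, -0.01, 0.03, 0.01, -0.02, 0.015]
as_monthly = annualized_sharpe(sample, 12)
as_daily = annualized_sharpe(sample, 252)
print(f"monthly convention: {as_monthly:.2f}, daily convention: {as_daily:.2f}")
```

Applying the daily factor to monthly returns inflates the ratio by sqrt(21), about 4.6 times, so a frequency mismatch can explain a large part of an apparent gap on its own.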

5.2 Diagnostic Checklist for Diverging Results

When your Sharpe ratio comes back as 0.4 instead of 2.3, systematically eliminate the following sources of divergence:

Check	How to verify	Common fix
Data source mismatch	Compare your return distribution to the paper's	Use the same adjusted close data the paper references
Lookback window definition	Print the dates of the first and last factor signals	Check whether the paper counts calendar days or trading days
Survivorship bias	Compare your universe size to the paper's	Acquire delisted stock data or use a point-in-time survivorship-free dataset
Cost assumptions	Run the backtest with zero costs	If results still diverge at zero cost, costs are not the main cause; move on to the other checks
Long-short construction	Check whether your long and short legs are equally weighted	Many papers implicitly use a 50/50 long/short split; verify this
Exclusion filters	Count stocks excluded by your price/volume filters	Try running without filters to match the paper's implicit universe
Rebalancing timing	Check whether the paper rebalances at open, close, or some other time	This alone can shift annual returns by 2–4%
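
The lookback-window check in the table is easy to quantify. A 12-month window is 365 calendar days but far fewer trading sessions, and the figure of 252 used in this article's momentum code is a convention rather than an exact count. The sketch below counts weekdays with the standard library; `weekday_count` is a hypothetical helper and it ignores exchange holidays.

```python
from datetime import date, timedelta


def weekday_count(start: date, end: date) -> int:
    """Count Monday-to-Friday dates in [start, end). Holidays are ignored."""
    days = (end - start).days
    return sum(
        1 for i in range(days)
        if (start + timedelta(days=i)).weekday() < 5
    )


calendar_days = (date(2024, 1, 1) - date(2023, 1, 1)).days
weekdays = weekday_count(date(2023, 1, 1), date(2024, 1, 1))
print(calendar_days, weekdays)
```

If a paper's "250-day lookback" means calendar days while your code reads it as trading days, the two windows differ by several months, which is more than enough to change a momentum ranking.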

Module 6: Closing — The Honest Answer

There is a version of this story that ends with you building a profitable, live strategy. There is also a version that ends with you publishing a critical review of the paper's methodology. Both are valid outcomes. The difference lies not in your coding ability, but in your relationship with the data.

Reproducibility is not about matching a number. It is about understanding the system that produced the number. Every time you run a backtest that disagrees with the paper, you have an opportunity to learn something — about the strategy, about the data, about your own assumptions.

The pipeline we have built in this article is designed to make that learning systematic. Use TickDB's historical OHLCV data to establish a clean, reproducible data foundation. Use the factor construction framework to implement strategies with full transparency about your assumptions. Use the transaction cost model to ground your results in economic reality. Use the walk-forward validation to test whether the strategy generalizes beyond the sample period.

And when your Sharpe ratio comes back as 0.4 instead of 2.3, run the diagnostic checklist before you abandon the strategy. The gap is not always a failure of the strategy. Sometimes it is a failure of the paper's design. Sometimes it is an opportunity to build something better.

Next Steps

If you want to reproduce academic strategies with institutional-grade data, sign up at tickdb.ai for a free API key and access 10+ years of cleaned US equity OHLCV data for backtesting.

If you need to compare multiple data sources or validate your results across vendors, TickDB provides a unified API for cross-market data acquisition, covering US equities, HK equities, and crypto.

If you are working with a quant team, reach out to enterprise@tickdb.ai for institutional plans with higher rate limits, dedicated support, and extended historical coverage.

If you use AI coding assistants, search for and install the tickdb-market-data SKILL in your AI tool's marketplace to get TickDB API integration directly in your development environment.

This article does not constitute investment advice. Backtested results are based on historical simulation and do not guarantee future performance. Transaction costs, slippage, and market impact are approximated and may differ from actual live trading conditions. Always conduct out-of-sample validation and paper trading before deploying any strategy with real capital.