"Price data feeds went dark at 09:47:12 UTC. By 09:47:43, the backup source was live."
On December 7, 2021, the AWS us-east-1 region experienced a significant outage that cascaded into multi-hour disruptions for dozens of financial services firms. Order flow stalled. Real-time dashboards froze. Algorithmic trading systems with hardcoded single-region dependencies tripped their circuit breakers and went idle. The firms that recovered within seconds had one thing in common: they had designed their data infrastructure for failure from the start.
This article examines the engineering architecture required to achieve sub-30-second failover between data sources — specifically, how to structure your TickDB integration so that when your primary data source degrades or fails, your trading infrastructure switches to a redundant source without human intervention and without data gaps that could corrupt backtests or live positions.
The core technical challenge is not the failover itself. Failover is a straightforward conditional branch. The hard problem is failure detection latency — how quickly your system can distinguish between a transient network hiccup (which should be retried, not failed over) and a genuine infrastructure outage (which requires immediate redirection). We will address both the architectural patterns and the production-grade implementation code required to solve this problem.
The Failure Mode Taxonomy
Before designing a failover system, you must categorize the failure modes your architecture must handle. Not all failures are equal, and treating a 200ms timeout the same as a regional outage produces brittle systems.
| Failure Mode | Detection Signal | Correct Response | Incorrect Response |
|---|---|---|---|
| Transient network latency | Single request timeout | Retry with backoff | Failover |
| Primary endpoint degradation | Sustained elevated latency (>500ms) | Alert + monitor | Failover |
| Primary host failure | Connection refused, DNS resolution failure | Immediate failover | Retry |
| Regional outage | All endpoints in region returning errors | Failover to cross-region | Retry |
| Authentication failure | HTTP 401/403 on valid credentials | Alert + human review | Failover |
| Rate limit exceeded | HTTP 429 (or API error code 3001) | Honor Retry-After header | Failover |
The critical insight here is that your failover trigger should not be a single failed request. A robust health check system must evaluate a rolling window of request outcomes and apply a decision matrix before initiating failover. This prevents the pathological case where a temporary network glitch causes a cascade of unnecessary failovers — a behavior that itself becomes a denial-of-service against your own redundancy infrastructure.
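As a rough illustration of that decision matrix, the sketch below maps the signals from the table to a response. The `Action` enum, the `decide_action` function, and its parameters are hypothetical names introduced here, not part of any TickDB SDK. It assumes rate limiting surfaces as HTTP 429, and the thresholds (3 timeouts in the window, 500 ms) are illustrative defaults.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    PROCEED = "proceed"
    RETRY = "retry_with_backoff"
    MONITOR = "alert_and_monitor"
    WAIT = "honor_retry_after"
    HUMAN_REVIEW = "alert_human_review"
    FAILOVER = "failover"


def decide_action(
    http_status: Optional[int],
    connection_refused: bool,
    timeouts_in_window: int,
    avg_latency_ms: float,
    timeout_threshold: int = 3,          # assumed value
    latency_warning_ms: float = 500.0,   # assumed value
) -> Action:
    """Map observed signals to a response, following the failure mode table."""
    if http_status in (401, 403):
        # A credentials problem follows you to every endpoint
        return Action.HUMAN_REVIEW
    if http_status == 429:
        return Action.WAIT
    if connection_refused or timeouts_in_window >= timeout_threshold:
        return Action.FAILOVER
    if timeouts_in_window > 0:
        # Isolated timeouts are treated as transient
        return Action.RETRY
    if avg_latency_ms > latency_warning_ms:
        return Action.MONITOR
    return Action.PROCEED
```

The ordering of the checks is the design choice that matters: authentication and rate-limit responses are handled before any failover rule, so neither can ever trigger a switch to the secondary source.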
Architecture Overview: The Three-Layer Failover Model
A production-grade failover architecture operates across three independent layers, each with its own failure detection and recovery mechanisms.
Layer 1: Transport Layer — DNS-Based Routing
The outermost layer routes traffic based on health signals. Route 53 health checks probe your primary endpoint at either a 30-second (standard) or 10-second (fast) request interval. When the endpoint fails a configured number of consecutive checks (the failure threshold, 3 by default), Route 53 stops returning the primary record and answers with the secondary instead. Resolvers that cached the old answer keep using it until the record's TTL expires, which adds up to one full TTL (commonly 30–60 seconds) in the worst case. That is why this layer alone is insufficient for financial trading applications.
The key configuration parameters are the request interval and the failure threshold. The fast 10-second interval enables faster failover but costs more than the standard interval and generates more health check traffic. For trading systems where 30-second data gaps have material impact, a 10-second interval with a failure threshold of 3 is a reasonable balance: the failure is detected within roughly 30 seconds, a single missed check cannot trigger a failover, and clients follow as soon as their cached DNS answer expires.
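The arithmetic behind the DNS layer's worst case can be sketched as follows. This is a simplified upper-bound estimate (detection time plus one full DNS TTL), and the function name and inputs are illustrative assumptions rather than measured values.

```python
def worst_case_dns_failover_seconds(
    check_interval_s: float,
    failure_threshold: int,
    dns_ttl_s: float,
) -> float:
    """Estimate: time to observe enough failed checks, plus time for
    cached DNS answers to expire."""
    detection_s = check_interval_s * failure_threshold
    return detection_s + dns_ttl_s


# Illustrative inputs, not measurements
print(worst_case_dns_failover_seconds(10, 3, 60))  # 90
print(worst_case_dns_failover_seconds(30, 3, 60))  # 150
```

Even with the shortest interval, the TTL term dominates, which is the reason the client-side layer described next is needed for sub-30-second recovery.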
Layer 2: Application Layer — Client-Side Failover with Health Tracking
Your application code maintains a ranked list of available data sources and continuously evaluates their health based on recent request outcomes. When the primary source accumulates more than a threshold number of failures within a sliding window, the client promotes the next available source in the ranking.
This layer operates independently of DNS failover. It handles the case where DNS failover has not yet propagated or where only a single endpoint within a region has failed (e.g., one of multiple TickDB API nodes). The client-side layer provides sub-second failover once the decision threshold is crossed.
Layer 3: Data Integrity Layer — Gap Detection and Recovery
After failover, your system must detect and handle data gaps caused by the failover latency window. If you fail over from TickDB US-East to TickDB US-West at T=0, you may have missed 15–45 seconds of market data. Your system must do one of the following:
- Accept the gap and flag affected data windows as unreliable for backtesting and live trading.
- Reconstruct the gap by querying a historical endpoint (TickDB /v1/market/kline) for the missing time range and merging reconstructed bars with live data.
- Prevent the gap by maintaining a persistent WebSocket connection to both primary and secondary sources simultaneously and buffering data from both, discarding duplicates on failover.
Option 3 provides the best data integrity but doubles your API consumption. Option 2 is the most common production choice for trading systems where the gap window is small relative to strategy timeframes.
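A minimal sketch of the gap detection and merge steps is shown below. It assumes bars are dictionaries keyed by a `ts` field holding the bar's open time in epoch milliseconds; that shape, and the names `find_gaps` and `merge_bars`, are assumptions made for illustration, not the TickDB response format.

```python
from typing import Any, Dict, List, Tuple

Bar = Dict[str, Any]  # assumed shape: {"ts": <epoch ms>, "close": ..., ...}


def find_gaps(timestamps_ms: List[int], interval_ms: int) -> List[Tuple[int, int]]:
    """Return (first_missing_ts, last_missing_ts) for every hole in a
    sorted series of bar timestamps."""
    gaps = []
    for prev, curr in zip(timestamps_ms, timestamps_ms[1:]):
        if curr - prev > interval_ms:
            gaps.append((prev + interval_ms, curr - interval_ms))
    return gaps


def merge_bars(live: List[Bar], reconstructed: List[Bar]) -> List[Bar]:
    """Merge two bar lists by timestamp. When both contain the same
    timestamp, the live bar wins (a design choice, not a requirement)."""
    by_ts = {bar["ts"]: bar for bar in reconstructed}
    by_ts.update({bar["ts"]: bar for bar in live})
    return [by_ts[ts] for ts in sorted(by_ts)]


# One-minute bars with two bars missing after a failover
seen = [0, 60_000, 240_000, 300_000]
print(find_gaps(seen, 60_000))  # [(120000, 180000)]
```

Each gap tuple can be passed as the start and end of a historical query; keying the merge on timestamp is also what discards duplicates in the dual-subscription variant.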
Production-Grade Implementation
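The following implementation demonstrates a failover-capable TickDB client in Python. It implements the application layer (Layer 2) of the failover model: per-source health tracking, retry with exponential backoff, and automatic promotion of the next source. DNS routing (Layer 1) is configured outside the application, and gap reconstruction (Layer 3) can be built on top of the get_kline method. Throughout, the code pays explicit attention to the engineering details that separate production code from tutorial examples.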
"""
TickDB Multi-Source Failover Client
Production-grade implementation with health tracking, exponential backoff,
and automatic failover.
"""
import os
import time
import random
import logging
import threading
import statistics
from datetime import datetime, timezone
from typing import Optional, List, Dict, Any, Tuple
from dataclasses import dataclass, field
from enum import Enum
from collections import deque

import requests

# ⚠️ For production HFT workloads, consider aiohttp/asyncio for concurrent
# connection management and non-blocking I/O. This synchronous implementation
# is suitable for strategies operating on 1-second or slower update frequencies.

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)
logger = logging.getLogger("tickdb_failover")


class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class DataSource:
    """Represents a single TickDB data source endpoint."""
    name: str
    base_url: str
    api_key: str = field(repr=False)  # Keep the key out of logs and reprs
    priority: int = 0  # Lower = higher priority
    region: str = "unknown"

    def __post_init__(self):
        self.health_history: deque = deque(maxlen=30)
        self.failure_count: int = 0
        self.last_success: Optional[datetime] = None
        self.last_failure: Optional[datetime] = None
        self.current_latency_ms: float = float('inf')
        self.status: HealthStatus = HealthStatus.HEALTHY


@dataclass
class RequestOutcome:
    """Records the outcome of a single API request for health tracking."""
    timestamp: datetime
    source_name: str
    success: bool
    latency_ms: float
    error_code: Optional[int] = None
    error_message: Optional[str] = None


@dataclass
class FailoverConfig:
    """Configuration parameters for the failover system."""
    # Health check thresholds
    failure_threshold: int = 3            # Failures within window to trigger failover
    health_window_seconds: float = 60.0   # Rolling window for failure counting
    latency_warning_ms: float = 500.0     # Avg latency above this triggers degraded status
    latency_critical_ms: float = 2000.0   # Avg latency above this marks the source unhealthy
    # Retry parameters
    base_retry_delay: float = 1.0         # Base delay for exponential backoff
    max_retry_delay: float = 30.0         # Maximum delay between retries
    max_retries: int = 3                  # Retries per request before giving up on a source
    # Timing
    health_check_interval: float = 5.0    # Seconds between proactive health checks


class TickDBFailoverClient:
    """
    Multi-source TickDB client with automatic failover.

    Maintains health statistics for each configured data source and
    automatically routes requests to the healthiest available source.
    """

    def __init__(
        self,
        config: Optional[FailoverConfig] = None,
        api_key: Optional[str] = None
    ):
        self.config = config or FailoverConfig()
        self.api_key = api_key or os.environ.get("TICKDB_API_KEY")
        if not self.api_key:
            raise ValueError(
                "TickDB API key required. Set TICKDB_API_KEY environment variable "
                "or pass api_key parameter."
            )
        self.sources: List[DataSource] = []
        self.outcome_log: deque = deque(maxlen=1000)
        self._lock = threading.RLock()
        self._current_source_idx: int = 0
        # Background health check thread
        self._health_thread: Optional[threading.Thread] = None
        self._shutdown_event: threading.Event = threading.Event()

    def add_source(
        self,
        name: str,
        base_url: str,
        priority: int = 0,
        region: str = "unknown"
    ) -> None:
        """Register a data source with the client."""
        source = DataSource(
            name=name,
            base_url=base_url.rstrip("/"),
            api_key=self.api_key,
            priority=priority,
            region=region
        )
        with self._lock:
            self.sources.append(source)
            # Sort by priority (lower number = higher priority)
            self.sources.sort(key=lambda s: s.priority)
        logger.info(
            f"Registered data source: {name} at {base_url} "
            f"(priority={priority}, region={region})"
        )

    def _get_request_headers(self) -> Dict[str, str]:
        """Standard request headers for TickDB API."""
        return {"X-API-Key": self.api_key}

    def _make_request(
        self,
        source: DataSource,
        method: str,
        endpoint: str,
        params: Optional[Dict[str, Any]] = None,
        timeout: Tuple[float, float] = (3.05, 10),
        retry_count: int = 0
    ) -> Tuple[bool, Any, Optional[str]]:
        """
        Execute a single request to a data source.

        Returns:
            Tuple of (success, response_data, error_message)
        """
        url = f"{source.base_url}{endpoint}"
        try:
            start_time = time.perf_counter()
            if method.upper() == "GET":
                response = requests.get(
                    url,
                    headers=self._get_request_headers(),
                    params=params,
                    timeout=timeout
                )
            elif method.upper() == "POST":
                response = requests.post(
                    url,
                    headers=self._get_request_headers(),
                    json=params,
                    timeout=timeout
                )
            else:
                return False, None, f"Unsupported HTTP method: {method}"
            latency_ms = (time.perf_counter() - start_time) * 1000

            # Rate limiting (HTTP 429 only): honor Retry-After and retry the
            # same source. A rate limit is not an outage, so it is not
            # recorded against the source's health and never causes failover.
            if response.status_code == 429:
                try:
                    retry_after = float(response.headers.get("Retry-After", 5))
                except ValueError:
                    retry_after = 5.0  # Retry-After may also be an HTTP date
                retry_after = min(max(retry_after, 0.0), self.config.max_retry_delay)
                if retry_count < self.config.max_retries:
                    logger.warning(
                        f"Rate limited by {source.name}, waiting {retry_after:.1f}s"
                    )
                    time.sleep(retry_after)
                    return self._make_request(
                        source, method, endpoint, params, timeout, retry_count + 1
                    )
                return False, None, "Rate limited"

            # Authentication errors need human review. They are logged but not
            # counted as source failures: a bad key is not an outage.
            if response.status_code in (401, 403):
                error_msg = f"Authentication failed: HTTP {response.status_code}"
                logger.error(f"{source.name}: {error_msg}")
                return False, None, error_msg

            # Remaining HTTP errors. Only server-side (5xx) errors count
            # against the source's health; other 4xx are caller errors.
            if response.status_code >= 400:
                error_msg = f"HTTP {response.status_code} from {source.name}"
                if response.status_code >= 500:
                    self._record_outcome(
                        source, False, latency_ms, response.status_code, error_msg
                    )
                return False, None, error_msg

            # Parse response. requests raises a ValueError subclass on bad JSON.
            try:
                data = response.json()
            except ValueError:
                error_msg = f"Invalid JSON response from {source.name}"
                self._record_outcome(source, False, latency_ms, response.status_code, error_msg)
                return False, None, error_msg

            # Check for TickDB error codes (a missing "code" field is treated as success)
            if isinstance(data, dict) and data.get("code", 0) != 0:
                error_code = data.get("code")
                error_message = data.get("message", "Unknown error")
                # 1001/1002: Invalid API key - do not retry
                if error_code in (1001, 1002):
                    logger.error(f"Invalid API key configuration: {error_message}")
                    return False, None, error_message
                # 2002: Symbol not found - the source itself answered correctly
                if error_code == 2002:
                    self._record_outcome(source, True, latency_ms)
                    return True, data, None
                # Other errors - record but do not retry
                self._record_outcome(source, False, latency_ms, error_code, error_message)
                return False, data, error_message

            self._record_outcome(source, True, latency_ms)
            return True, data, None

        except requests.exceptions.Timeout:
            latency_ms = timeout[1] * 1000
            error_msg = f"Request timeout after {timeout[1]}s"
            self._record_outcome(source, False, latency_ms, error_message=error_msg)
            # Retry with exponential backoff
            if retry_count < self.config.max_retries:
                delay = min(
                    self.config.base_retry_delay * (2 ** retry_count),
                    self.config.max_retry_delay
                )
                # Add up to 10% random jitter to prevent thundering herd
                sleep_time = delay + random.uniform(0, delay * 0.1)
                logger.warning(
                    f"Request to {source.name} timed out, retrying in "
                    f"{sleep_time:.2f}s (attempt {retry_count + 1}/{self.config.max_retries})"
                )
                time.sleep(sleep_time)
                return self._make_request(
                    source, method, endpoint, params, timeout, retry_count + 1
                )
            return False, None, error_msg

        except requests.exceptions.ConnectionError as e:
            # Connection refused / DNS failure: no retry, fail over immediately
            error_msg = f"Connection error: {str(e)}"
            self._record_outcome(source, False, float('inf'), error_message=error_msg)
            return False, None, error_msg

        except Exception as e:
            error_msg = f"Unexpected error: {str(e)}"
            self._record_outcome(source, False, float('inf'), error_message=error_msg)
            logger.exception(f"Unexpected error during request to {source.name}")
            return False, None, error_msg

    def _record_outcome(
        self,
        source: DataSource,
        success: bool,
        latency_ms: float,
        error_code: Optional[int] = None,
        error_message: Optional[str] = None
    ) -> None:
        """Record a request outcome for health tracking."""
        now = datetime.now(timezone.utc)
        outcome = RequestOutcome(
            timestamp=now,
            source_name=source.name,
            success=success,
            latency_ms=latency_ms,
            error_code=error_code,
            error_message=error_message
        )
        with self._lock:
            self.outcome_log.append(outcome)
            source.health_history.append(outcome)
            source.current_latency_ms = latency_ms
            if success:
                source.last_success = now
                source.failure_count = 0
            else:
                source.last_failure = now
                source.failure_count += 1
            # Update source health status
            self._update_health_status(source)

    def _update_health_status(self, source: DataSource) -> None:
        """Update a source's health status based on recent outcomes."""
        with self._lock:
            now = datetime.now(timezone.utc)
            recent_outcomes = [
                o for o in source.health_history
                if (now - o.timestamp).total_seconds()
                <= self.config.health_window_seconds
            ]
            recent_failures = sum(1 for o in recent_outcomes if not o.success)
            recent_latencies = [
                o.latency_ms for o in recent_outcomes
                if o.success and o.latency_ms < float('inf')
            ]
            # None means "no successful request in the window". The latency
            # rules are skipped in that case, so a single failure cannot mark
            # the source UNHEALTHY before failure_threshold is reached.
            avg_latency = statistics.mean(recent_latencies) if recent_latencies else None

            # Determine status
            if recent_failures >= self.config.failure_threshold:
                new_status = HealthStatus.UNHEALTHY
                reason = (
                    f"{recent_failures} failures in "
                    f"{self.config.health_window_seconds}s window"
                )
            elif avg_latency is not None and avg_latency > self.config.latency_critical_ms:
                new_status = HealthStatus.UNHEALTHY
                reason = f"avg latency {avg_latency:.0f}ms exceeds critical threshold"
            elif avg_latency is not None and avg_latency > self.config.latency_warning_ms:
                new_status = HealthStatus.DEGRADED
                reason = f"avg latency {avg_latency:.0f}ms exceeds warning threshold"
            else:
                new_status = HealthStatus.HEALTHY
                reason = "recent requests within thresholds"

            # Log only on transitions to avoid flooding the log
            if new_status != source.status:
                level = (
                    logging.WARNING if new_status == HealthStatus.UNHEALTHY
                    else logging.INFO
                )
                logger.log(
                    level,
                    f"{source.name} marked {new_status.name}: {reason}"
                )
            source.status = new_status

    def _refresh_health(self) -> None:
        """Re-evaluate all sources so failures that aged out of the window
        stop counting. Without this, an UNHEALTHY source could never recover
        unless the background health monitor is running."""
        with self._lock:
            for source in self.sources:
                self._update_health_status(source)

    def _select_healthy_source(self) -> Optional[DataSource]:
        """Select the highest-priority healthy data source."""
        self._refresh_health()
        with self._lock:
            healthy_sources = [
                s for s in self.sources
                if s.status in (HealthStatus.HEALTHY, HealthStatus.DEGRADED)
            ]
            if not healthy_sources:
                logger.error("No healthy data sources available")
                return None
            return healthy_sources[0]

    def request(
        self,
        method: str,
        endpoint: str,
        params: Optional[Dict[str, Any]] = None,
        failover: bool = True
    ) -> Tuple[bool, Any, Optional[str], Optional[str]]:
        """
        Execute a request with automatic failover.

        Args:
            method: HTTP method (GET, POST)
            endpoint: API endpoint path
            params: Request parameters
            failover: Whether to attempt failover on failure

        Returns:
            Tuple of (success, data, error_message, source_used)
        """
        if not failover:
            source = self._select_healthy_source()
            if not source:
                return False, None, "No healthy sources available", None
            success, data, error = self._make_request(source, method, endpoint, params)
            return success, data, error, source.name

        # Try all sources in priority order, skipping those marked UNHEALTHY
        self._refresh_health()
        with self._lock:
            ordered_sources = [
                s for s in self.sources
                if s.status != HealthStatus.UNHEALTHY
            ]
        last_error = None
        for source in ordered_sources:
            success, data, error = self._make_request(source, method, endpoint, params)
            if success:
                # Log successful source for monitoring
                logger.debug(f"Request succeeded via {source.name}")
                return True, data, None, source.name
            last_error = error
            logger.warning(
                f"Request to {source.name} failed: {error}. "
                f"Trying next available source."
            )
            # Re-evaluate health after this failure
            self._update_health_status(source)
        return False, None, f"All sources exhausted. Last error: {last_error}", None

    def get_kline(
        self,
        symbol: str,
        interval: str = "1h",
        limit: int = 100,
        start_time: Optional[int] = None,
        end_time: Optional[int] = None
    ) -> Tuple[bool, Any, Optional[str], Optional[str]]:
        """Fetch OHLCV kline data with failover support."""
        params = {
            "symbol": symbol,
            "interval": interval,
            "limit": limit
        }
        # Compare against None so that a timestamp of 0 is still sent
        if start_time is not None:
            params["start_time"] = start_time
        if end_time is not None:
            params["end_time"] = end_time
        return self.request("GET", "/v1/market/kline", params)

    def get_latest_kline(self, symbol: str, interval: str = "1h") -> Tuple[bool, Any, Optional[str], Optional[str]]:
        """Fetch the latest kline candle with failover support."""
        params = {"symbol": symbol, "interval": interval}
        return self.request("GET", "/v1/market/kline/latest", params)

    def get_depth(self, symbol: str, limit: int = 10) -> Tuple[bool, Any, Optional[str], Optional[str]]:
        """Fetch order book depth data with failover support."""
        params = {"symbol": symbol, "limit": limit}
        return self.request("GET", "/v1/market/depth", params)

    def start_health_monitor(self) -> None:
        """Start background health monitoring thread."""
        if self._health_thread and self._health_thread.is_alive():
            logger.warning("Health monitor already running")
            return
        self._shutdown_event.clear()
        self._health_thread = threading.Thread(
            target=self._health_monitor_loop,
            daemon=True,
            name="TickDB-HealthMonitor"
        )
        self._health_thread.start()
        logger.info("Health monitoring thread started")

    def _health_monitor_loop(self) -> None:
        """Background loop for proactive health checks."""
        while not self._shutdown_event.is_set():
            try:
                self._proactive_health_check()
            except Exception as e:
                logger.exception(f"Error in health check loop: {e}")
            # Wait for shutdown or next interval
            self._shutdown_event.wait(timeout=self.config.health_check_interval)

    def _proactive_health_check(self) -> None:
        """Execute proactive health check against all sources.

        UNHEALTHY sources are probed too; successful probes are how they
        return to service once old failures age out of the window.
        """
        with self._lock:
            sources = list(self.sources)  # Snapshot: add_source may run concurrently
        for source in sources:
            try:
                # Lightweight health check: fetch latest kline for a common symbol
                success, data, error = self._make_request(
                    source,
                    "GET",
                    "/v1/market/kline/latest",
                    {"symbol": "BTC.USDT", "interval": "1m"},
                    timeout=(3.05, 5)
                )
                if success:
                    logger.debug(
                        f"Proactive health check passed for {source.name} "
                        f"(latency: {source.current_latency_ms:.0f}ms)"
                    )
                else:
                    logger.warning(
                        f"Proactive health check failed for {source.name}: {error}"
                    )
            except Exception as e:
                logger.warning(f"Proactive health check error for {source.name}: {e}")

    def stop(self) -> None:
        """Stop the client and health monitoring."""
        self._shutdown_event.set()
        if self._health_thread:
            self._health_thread.join(timeout=5.0)
        logger.info("TickDB failover client stopped")

    def get_health_summary(self) -> Dict[str, Any]:
        """Get a summary of all source health statuses."""
        with self._lock:
            return {
                "sources": [
                    {
                        "name": s.name,
                        "region": s.region,
                        "status": s.status.value,
                        "current_latency_ms": (
                            round(s.current_latency_ms, 2)
                            if s.current_latency_ms < float('inf')
                            else None
                        ),
                        "failure_count": s.failure_count,
                        "last_success": (
                            s.last_success.isoformat()
                            if s.last_success else None
                        ),
                        "last_failure": (
                            s.last_failure.isoformat()
                            if s.last_failure else None
                        )
                    }
                    for s in self.sources
                ],
                "total_outcomes_logged": len(self.outcome_log)
            }
Deployment Configuration by Scenario
The optimal failover architecture depends on your latency tolerance, budget constraints, and the criticality of data continuity to your trading strategy. The following table provides deployment recommendations across three tiers.
| Scenario | Architecture | Expected Failover Time | Cost Profile | Suitable For |
|---|---|---|---|---|
| Individual trader | Single TickDB instance + client-side retry with backoff | 10–30 seconds | Low | Strategies on 1-minute+ bars; non-time-critical analysis |
| Professional team | Primary + secondary TickDB region + Route 53 health checks | 30–90 seconds | Medium | Day trading; intraday strategies with <5-min SL triggers |
| Institutional | Multi-region active-active + WebSocket dual-subscription + gap reconstruction | <5 seconds | High | HFT; statistical arbitrage; any strategy where 30s gap causes meaningful P&L impact |
For most systematic trading strategies operating on timeframes of 5 minutes or longer, the Tier 2 architecture (the professional-team row in the table) provides an appropriate balance between cost and resilience. The key configuration parameters to tune are the FailoverConfig values: specifically failure_threshold (the number of failures within the rolling window required to mark a source unhealthy and trigger failover; the failures do not need to be consecutive) and health_window_seconds (the length of the rolling window over which failures are counted).
A more aggressive configuration with failure_threshold=2 and health_window_seconds=30 will detect failures faster but risks unnecessary failovers during transient network issues. A conservative configuration with failure_threshold=5 and health_window_seconds=120 will be more stable but may allow your system to operate against a degraded data source for longer before failover.
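The trade-off can be made concrete with a small sliding-window simulation. This is a standalone sketch, independent of the client class: `first_trigger_time` is a hypothetical helper, and the failure timestamps are invented inputs chosen to contrast a two-failure blip with a sustained outage.

```python
from collections import deque
from typing import Iterable, Optional


def first_trigger_time(
    failure_times_s: Iterable[float],
    failure_threshold: int,
    window_s: float,
) -> Optional[float]:
    """Return the time at which the number of failures inside the rolling
    window first reaches the threshold, or None if it never does."""
    recent: deque = deque()
    for t in failure_times_s:
        recent.append(t)
        # Drop failures that have aged out of the window
        while recent and t - recent[0] > window_s:
            recent.popleft()
        if len(recent) >= failure_threshold:
            return t
    return None


transient_blip = [0.0, 4.0]                       # two failures, then recovery
sustained_outage = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0]

print(first_trigger_time(transient_blip, 2, 30))      # 4.0  (aggressive: fails over)
print(first_trigger_time(transient_blip, 5, 120))     # None (conservative: rides it out)
print(first_trigger_time(sustained_outage, 2, 30))    # 5.0
print(first_trigger_time(sustained_outage, 5, 120))   # 20.0
```

With these inputs the aggressive settings react 15 seconds sooner to the real outage, at the price of failing over on the blip.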
Testing Your Failover System
A failover system that has never been tested is not a failover system; it is a hypothesis. You must validate your implementation against realistic failure scenarios before deploying it in a live trading environment.
Test 1: Simulated Timeout Cascade
Configure your primary source to drop 100% of requests for 60 seconds (using a firewall rule or proxy configuration). Verify that:
- The client detects the failure within failure_threshold requests.
- Failover to the secondary source completes without application errors.
- The application continues operating with data from the secondary source.
- When the primary recovers, the client does not immediately fail back (preventing flapping).
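One common way to satisfy the last check is hysteresis: require several consecutive successful probes of the primary before routing traffic back to it. The sketch below illustrates that idea only. `FailbackGate` is a hypothetical helper that the client in this article does not include, and the default of 5 probes is an arbitrary assumption.

```python
class FailbackGate:
    """Allow failback only after N consecutive successful probes."""

    def __init__(self, required_successes: int = 5) -> None:
        if required_successes < 1:
            raise ValueError("required_successes must be >= 1")
        self.required_successes = required_successes
        self._streak = 0

    def record_probe(self, success: bool) -> bool:
        """Record one probe result; return True once failback is allowed."""
        # Any failed probe resets the streak to zero
        self._streak = self._streak + 1 if success else 0
        return self._streak >= self.required_successes


gate = FailbackGate(required_successes=3)
probes = [True, True, False, True, True, True]
print([gate.record_probe(p) for p in probes])
# [False, False, False, False, False, True]
```

In a test, feeding the gate a recovering-then-flapping probe sequence like the one above is a quick way to assert that a single good probe never triggers failback.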
Test 2: Partial Degradation
Configure your primary source to introduce 3-second artificial latency on 50% of requests. Verify that:
- The client correctly identifies the source as DEGRADED rather than UNHEALTHY.
- Request success rate remains acceptable during degraded operation.
- Alerting fires to notify operators of degraded performance.
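Whether this test lands on DEGRADED or UNHEALTHY depends on the average latency the health tracker sees, so it helps to work out the expected value before running it. The sketch below assumes the slow requests still complete successfully (3 seconds is below the client's read timeout), an 80 ms baseline latency chosen purely for illustration, and the 500 ms and 2000 ms thresholds from FailoverConfig. Both function names are hypothetical.

```python
def expected_avg_latency_ms(
    baseline_ms: float,
    injected_ms: float,
    injected_fraction: float,
) -> float:
    """Mean latency when a fraction of requests gets extra delay added
    on top of the baseline."""
    return baseline_ms + injected_fraction * injected_ms


def classify(
    avg_latency_ms: float,
    warning_ms: float = 500.0,
    critical_ms: float = 2000.0,
) -> str:
    """Apply the same ordering as the health tracker: critical first."""
    if avg_latency_ms > critical_ms:
        return "unhealthy"
    if avg_latency_ms > warning_ms:
        return "degraded"
    return "healthy"


half = expected_avg_latency_ms(baseline_ms=80.0, injected_ms=3000.0, injected_fraction=0.5)
print(half, classify(half))  # 1580.0 degraded

# Raising the slow fraction to 70% pushes the average past the critical threshold
print(classify(expected_avg_latency_ms(80.0, 3000.0, 0.7)))  # unhealthy
```

At a 50% slow fraction the expected average sits between the two thresholds, which is why DEGRADED is the correct outcome; a noticeably higher fraction would legitimately produce UNHEALTHY instead.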
Test 3: Gap Reconstruction
After a simulated failover, verify that your gap reconstruction logic correctly identifies the missing time window and can fetch complete historical data from the /v1/market/kline endpoint to fill the gap.
Closing
The architecture described in this article is not exotic or experimental. It is the standard pattern used by any production system where data continuity has material value. The core principles — layered failure detection, client-side routing intelligence, and gap-aware data reconstruction — apply equally to trading systems, financial data pipelines, and any application where downtime has a measurable cost.
What makes TickDB suitable as a backup source for your trading infrastructure is not any single feature but the combination of: broad multi-market coverage (US equities, HK equities, crypto, forex) available through a consistent API; sufficient historical depth to support gap reconstruction; and WebSocket push capabilities for real-time depth and trade data.
If you are currently running a single-region data architecture, the incremental engineering cost of adding a secondary source is modest relative to the risk of extended downtime during the next major cloud provider incident.
Next Steps
If you're building a personal trading system and want basic resilience:
- Sign up at tickdb.ai (free, no credit card required)
- Set the TICKDB_API_KEY environment variable
- Implement the TickDBFailoverClient class from this article as your data access layer
If you're on a professional team evaluating multi-region redundancy:
- Review the Tier 2 architecture in the deployment table above
- Consider Route 53 health checks for DNS-level failover in addition to client-side routing
- Contact enterprise@tickdb.ai for information on dedicated endpoints and SLA guarantees
If you're building a production trading system requiring sub-5-second failover:
- Evaluate the active-active architecture with WebSocket dual-subscription
- Implement gap detection using timestamp continuity validation
- Reach out to discuss enterprise data redundancy options
This article does not constitute investment advice. Trading and systematic strategy development involve risk; past performance does not guarantee future results. System architecture recommendations are general guidance and should be adapted to your specific operational requirements and risk tolerance.