At about 10:30 AM Eastern Time on December 7, 2021, an automated capacity-scaling activity in AWS us-east-1 triggered a surge of traffic that congested AWS's internal network and degraded services across the region for roughly seven hours. Trading systems dependent on that infrastructure ground to a halt. Some teams recovered within minutes by rerouting traffic. Others spent the entire outage watching their screens go dark.
The difference between those outcomes was not luck. It was architecture.
When your primary data source fails — whether due to a cloud provider outage, a network partition, or an internal infrastructure incident — every second of downtime translates directly into missed trading opportunities, stale risk calculations, and eroding confidence in your systems. In markets that move in milliseconds, a 30-second recovery window is not a luxury. It is a baseline requirement.
This article walks through an architecture for achieving that recovery window. We will examine DNS-based failover mechanisms, distributed health check systems, dual-cloud deployment patterns, and the specific integration points where TickDB serves as a standby data source when your primary market data feed becomes unavailable. The code samples cover health checking, retry with backoff, rate-limit handling, and environment-based credential management.
The Architecture Problem: Why Simple Redundancy Is Not Enough
Most engineering teams approach disaster recovery with a simple mental model: "We have a primary server and a backup server. If the primary fails, we switch to the backup." This model fails in three critical ways under real-world conditions.
First, it assumes failure is binary. In practice, partial degradation is far more common than complete outage. A data source might return responses with 500ms latency while still reporting a 200 OK status code. A connection pool might exhaust while the API itself remains technically operational. Binary health checks miss these cases.
Second, simple redundancy typically relies on manual intervention. Someone receives an alert, diagnoses the problem, and initiates the failover. Even with well-documented runbooks, this process typically takes 5 to 15 minutes — far outside the 30-second target.
Third, most backup configurations use the same network path as the primary. If the failure is in the network layer itself — a BGP leak, a transit provider outage, a DNS infrastructure problem — your backup inherits the same connectivity issue.
The architecture we will implement addresses all three failure modes: it monitors health continuously rather than relying on binary checks, it automates failover without human intervention, and it isolates the network path between primary and backup to prevent a single point of failure from affecting both simultaneously.
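To make the first point concrete, here is a minimal sketch of a non-binary health classification. The names `HealthLevel` and `classify_response`, and the 500 ms default threshold, are illustrative assumptions and not part of any library.

```python
from enum import Enum
from typing import Optional


class HealthLevel(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DOWN = "down"


def classify_response(
    status_code: Optional[int],
    latency_ms: Optional[float],
    latency_threshold_ms: float = 500.0,
) -> HealthLevel:
    """Classify one probe result; status_code is None when no response arrived."""
    if status_code is None or status_code >= 500:
        return HealthLevel.DOWN
    if status_code != 200:
        return HealthLevel.DEGRADED
    if latency_ms is not None and latency_ms > latency_threshold_ms:
        # 200 OK but too slow: a binary check would call this healthy
        return HealthLevel.DEGRADED
    return HealthLevel.HEALTHY


print(classify_response(200, 42.0))    # fast 200 response
print(classify_response(200, 750.0))   # slow 200 response
print(classify_response(None, None))   # no response at all
```

The middle case is the one a binary check misses: the endpoint answers with 200 OK, yet it is too slow to be useful for trading.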
System Architecture Overview
Before diving into implementation details, we need a clear picture of the overall system architecture. The design follows a layered approach where each layer handles a specific failure mode.
┌─────────────────────────────────────────────────────────────────┐
│                       CLIENT APPLICATION                        │
│          (Trading engine, Risk calculator, Dashboard)           │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                    HEALTH CHECK ORCHESTRATOR                    │
│      Continuous monitoring of primary and backup endpoints      │
│            Distributed across 3+ availability zones             │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                            DNS LAYER                            │
│                Route 53 health-check integrated                 │
│               Primary: tickdb-primary.example.com               │
│               Backup:  tickdb-standby.example.com               │
└─────────────────────────────────────────────────────────────────┘
           │                                          │
           │ Primary healthy                          │ Failover
           ▼                                          ▼
┌──────────────────────┐                   ┌──────────────────────┐
│ PRIMARY DATA SOURCE  │                   │ BACKUP DATA SOURCE   │
│ AWS us-east-1        │                   │ (TickDB Standby)     │
│                      │                   │                      │
│ Primary endpoint     │                   │ tickdb.ai API        │
│ tickdb-primary       │                   │ Standby endpoint     │
└──────────────────────┘                   └──────────────────────┘
The client application makes requests through a DNS hostname that resolves to the currently healthy endpoint. The health check orchestrator continuously tests both the primary and backup endpoints and updates DNS records accordingly. This separation between the client logic and the failover logic means that application code never needs to know which endpoint is active — it simply sends requests and handles failures transparently.
Layer 1: Health Check Orchestrator
The health check system is the brain of the failover architecture. It must distinguish between hard failures (the endpoint is completely unreachable) and soft failures (the endpoint responds but with degraded quality). A well-designed health check will catch degradation before it becomes a complete outage, enabling proactive failover rather than reactive recovery.
Defining Health Check Criteria for Market Data APIs
For a market data API, the definition of "healthy" extends beyond simple HTTP reachability. A healthy endpoint must meet three criteria: it responds within an acceptable latency threshold, it returns semantically valid data (not just a 200 status code), and it exhibits stable behavior over a sampling window rather than flickering between healthy and unhealthy states.
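The third criterion, stability over a sampling window, is the least obvious of the three. One way to sketch it is a hysteresis gate that changes state only after a run of consecutive contrary observations. The class name `HysteresisGate` and its parameters are hypothetical, chosen for illustration.

```python
class HysteresisGate:
    """Flip state only after a run of consecutive contrary observations."""

    def __init__(self, fail_after: int = 3, recover_after: int = 5):
        self.fail_after = fail_after
        self.recover_after = recover_after
        self.healthy = True
        self._streak = 0  # consecutive observations that disagree with current state

    def observe(self, ok: bool) -> bool:
        """Feed one probe result and return the (possibly updated) state."""
        if ok == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            needed = self.fail_after if self.healthy else self.recover_after
            if self._streak >= needed:
                self.healthy = ok
                self._streak = 0
        return self.healthy


gate = HysteresisGate(fail_after=3, recover_after=2)
probes = [True, False, True, False, False, False, True, True]
print([gate.observe(ok) for ok in probes])
# [True, True, True, True, True, False, False, True]
```

A single failed probe (the second sample) does not change the state, while three failures in a row do. Requiring more good samples to recover than bad samples to fail is a common way to avoid flickering.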
Here is a production-grade health check implementation that addresses all three criteria:
import os
import time
import logging
import statistics
import threading
from dataclasses import dataclass
from typing import Callable, Optional
from datetime import datetime

import requests

logger = logging.getLogger(__name__)


@dataclass
class HealthStatus:
    """Health status for a single endpoint."""
    endpoint_url: str
    is_healthy: bool
    latency_ms: Optional[float] = None
    error_message: Optional[str] = None
    consecutive_failures: int = 0
    last_check: Optional[datetime] = None


class MarketDataHealthChecker:
    """
    Continuous health monitoring for market data API endpoints.

    This checker implements three failure detection mechanisms:
    1. Latency threshold: Slow responses decay the failure counter instead of resetting it
    2. Error rate: Consecutive failures trigger failover consideration
    3. Staleness: Data timestamp validation to detect frozen feeds
    """

    def __init__(
        self,
        primary_url: str,
        backup_url: str,
        latency_threshold_ms: float = 500.0,
        sampling_window: int = 10,
        failure_threshold: int = 3,
        check_interval_seconds: float = 5.0
    ):
        self.primary_url = primary_url
        self.backup_url = backup_url
        self.latency_threshold_ms = latency_threshold_ms
        self.sampling_window = sampling_window
        self.failure_threshold = failure_threshold
        self.check_interval_seconds = check_interval_seconds
        self._primary_samples: list[float] = []
        self._backup_samples: list[float] = []
        self._primary_failures = 0.0
        self._backup_failures = 0.0
        self._running = False
        self._lock = threading.Lock()
        # Load API credentials from environment
        self.api_key = os.environ.get("TICKDB_API_KEY")
        if not self.api_key:
            raise ValueError("TICKDB_API_KEY environment variable is required")

    def _measure_latency(self, url: str, symbol: str = "AAPL.US") -> tuple[bool, Optional[float], Optional[str]]:
        """
        Measure endpoint latency with a real data request.
        Returns: (success, latency_ms, error_message)
        """
        try:
            start = time.perf_counter()
            response = requests.get(
                f"{url}/v1/market/kline/latest",
                params={"symbol": symbol, "interval": "1m"},
                headers={"X-API-Key": self.api_key},
                timeout=(3.05, 10.0)  # Connect timeout, read timeout
            )
            latency_ms = (time.perf_counter() - start) * 1000
            if response.status_code != 200:
                return False, latency_ms, f"HTTP {response.status_code}"
            # Validate response structure
            data = response.json()
            if data.get("code", -1) != 0:
                return False, latency_ms, f"API error: {data.get('message', 'unknown')}"
            # Verify data freshness - market data older than 5 minutes is stale
            # (the timestamp is assumed to be in epoch milliseconds)
            if "data" in data and "timestamp" in data["data"]:
                data_age_seconds = (time.time() * 1000 - data["data"]["timestamp"]) / 1000
                if data_age_seconds > 300:
                    logger.warning(f"Stale data detected: {data_age_seconds:.1f}s old")
                    return False, latency_ms, f"Stale data: {data_age_seconds:.1f}s"
            return True, latency_ms, None
        except requests.exceptions.Timeout:
            return False, None, "Timeout"
        except requests.exceptions.ConnectionError as e:
            return False, None, f"Connection error: {str(e)[:50]}"
        except Exception as e:
            return False, None, f"Unexpected error: {str(e)[:50]}"

    def _update_samples(self, samples_list: list, latency: Optional[float]):
        """Add latency sample to rolling window, maintaining max size."""
        if latency is not None:
            samples_list.append(latency)
            if len(samples_list) > self.sampling_window:
                samples_list.pop(0)

    def check_and_update(self) -> tuple[HealthStatus, HealthStatus]:
        """
        Perform health check on both endpoints and return status for each.
        Returns: (primary_status, backup_status)
        """
        # Check primary
        primary_success, primary_latency, primary_error = self._measure_latency(self.primary_url)
        with self._lock:
            self._update_samples(self._primary_samples, primary_latency)
            if primary_success and primary_latency is not None:
                if primary_latency > self.latency_threshold_ms:
                    # Slow but successful: decay the counter by half a step
                    # instead of resetting it to zero
                    self._primary_failures = max(0, self._primary_failures - 0.5)
                else:
                    self._primary_failures = 0
            else:
                self._primary_failures += 1
            primary_healthy = self._primary_failures < self.failure_threshold
        primary_status = HealthStatus(
            endpoint_url=self.primary_url,
            is_healthy=primary_healthy,
            latency_ms=primary_latency,
            error_message=primary_error,
            consecutive_failures=int(self._primary_failures),
            last_check=datetime.now()
        )

        # Check backup
        backup_success, backup_latency, backup_error = self._measure_latency(self.backup_url)
        with self._lock:
            self._update_samples(self._backup_samples, backup_latency)
            if backup_success and backup_latency is not None:
                if backup_latency > self.latency_threshold_ms:
                    self._backup_failures = max(0, self._backup_failures - 0.5)
                else:
                    self._backup_failures = 0
            else:
                self._backup_failures += 1
            backup_healthy = self._backup_failures < self.failure_threshold
        backup_status = HealthStatus(
            endpoint_url=self.backup_url,
            is_healthy=backup_healthy,
            latency_ms=backup_latency,
            error_message=backup_error,
            consecutive_failures=int(self._backup_failures),
            last_check=datetime.now()
        )

        def fmt(latency: Optional[float]) -> str:
            # Latency is None when the request timed out or failed to connect
            return f"{latency:.1f}ms" if latency is not None else "n/a"

        # Log the result of this cycle
        logger.info(
            f"Health check complete | Primary: {'OK' if primary_healthy else 'FAIL'} "
            f"({fmt(primary_latency)}) | Backup: {'OK' if backup_healthy else 'FAIL'} "
            f"({fmt(backup_latency)})"
        )
        return primary_status, backup_status

    def get_p99_latency(self, endpoint: str) -> Optional[float]:
        """Return P99 latency for monitoring dashboards."""
        with self._lock:
            samples = self._primary_samples if endpoint == self.primary_url else self._backup_samples
            if len(samples) < 3:
                return None
            return statistics.quantiles(samples, n=100)[98]

    def start_background_monitoring(self, callback: Callable[[str, str], None]):
        """
        Start background monitoring thread.
        callback(target_url, reason) is called on every cycle in which the
        primary is unhealthy, so it should be idempotent.
        """
        def monitor_loop():
            self._running = True
            while self._running:
                primary, backup = self.check_and_update()
                # Trigger callback if primary fails
                if not primary.is_healthy:
                    callback(self.backup_url, "PRIMARY_FAILED")
                # Escalate if backup fails while primary is also failing
                if not backup.is_healthy and not primary.is_healthy:
                    logger.critical("BOTH ENDPOINTS UNHEALTHY - MANUAL INTERVENTION REQUIRED")
                time.sleep(self.check_interval_seconds)

        thread = threading.Thread(target=monitor_loop, daemon=True)
        thread.start()
        return thread
This implementation covers the three health criteria defined above. The latency measurement uses real data requests rather than synthetic pings, so application-layer degradation shows up in the samples. A slow but successful response decays the failure counter by half a step instead of resetting it to zero, which damps flapping while an endpoint is recovering. As written, latency alone never raises the counter, so a team that wants sustained slowness to trigger failover would add to the counter on slow responses instead. The staleness check on the returned data catches cases where the API is reachable but returning frozen data, a failure mode that would pass a simple HTTP health check.
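The counter rule is easier to reason about in isolation. The sketch below mirrors the rule used in `check_and_update` (hard failure adds one, slow success subtracts one half, fast success resets) and replays it over a sequence of probe outcomes. The helper names `next_failure_count` and `simulate` are illustrative and do not appear in the checker itself.

```python
def next_failure_count(count: float, success: bool, slow: bool) -> float:
    """Apply the checker's counter rule to one probe outcome."""
    if not success:
        return count + 1              # hard failure
    if slow:
        return max(0.0, count - 0.5)  # slow success: decay, do not reset
    return 0.0                        # fast success: reset


def simulate(outcomes, threshold: int = 3):
    """Yield (count, is_healthy) after each (success, slow) outcome."""
    count = 0.0
    for success, slow in outcomes:
        count = next_failure_count(count, success, slow)
        yield count, count < threshold


outcomes = [
    (False, False),  # timeout
    (False, False),  # timeout
    (True, True),    # slow success
    (False, False),  # timeout
    (False, False),  # timeout
]
print(list(simulate(outcomes)))
# [(1.0, True), (2.0, True), (1.5, True), (2.5, True), (3.5, False)]
```

The single slow success in the middle delays failover by one probe but does not cancel the accumulated failures, which is the anti-flapping behavior described above.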
Layer 2: DNS-Based Failover with Route 53
With health monitoring in place, we now need a mechanism to actually redirect traffic when the primary fails. AWS Route 53 provides health checks integrated with DNS failover — a powerful combination that handles failovers without requiring application-level logic to manage endpoint selection.
Designing the DNS Layer for Sub-30-Second Failover
Route 53 health checks run at a request interval of either 30 seconds (standard) or 10 seconds (fast). For disaster recovery scenarios where sub-30-second failover is required, we use the 10-second interval with a threshold of 3 consecutive failures before failover triggers. This gives a detection time of approximately 30 seconds (3 checks × 10 seconds) while avoiding spurious failovers due to transient network jitter.
Here is the infrastructure-as-code configuration for setting up the Route 53 health-check and DNS failover architecture:
# route53-failover.tf
# Terraform configuration for Route 53 DNS failover

provider "aws" {
  alias  = "primary-region"
  region = "us-east-1"
}

provider "aws" {
  alias  = "backup-region"
  region = "us-west-2"
}

# Health check for primary endpoint.
# Route 53 health checkers cannot send custom headers such as X-API-Key,
# so this path must return a 2xx or 3xx status without authentication.
resource "aws_route53_health_check" "primary-market-data" {
  fqdn              = "tickdb-primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/v1/market/kline/latest?symbol=AAPL.US&interval=1m"
  failure_threshold = 3
  request_interval  = 10
  measure_latency   = true
  # Send the hostname in the TLS handshake (SNI)
  enable_sni        = true

  tags = {
    Environment = "production"
    Component   = "market-data"
    Failover    = "primary"
  }
}

# Health check for backup endpoint (TickDB standby)
resource "aws_route53_health_check" "backup-market-data" {
  fqdn              = "tickdb-standby.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/v1/market/kline/latest?symbol=AAPL.US&interval=1m"
  failure_threshold = 3
  request_interval  = 10
  measure_latency   = true
  enable_sni        = true

  tags = {
    Environment = "production"
    Component   = "market-data"
    Failover    = "backup"
  }
}

# Health check for the client-facing name itself
resource "aws_route53_health_check" "fallback-dns" {
  fqdn              = "market-data-api.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30 # Less frequent check for the client-facing name

  tags = {
    Environment = "production"
    Component   = "market-data"
    Failover    = "alias"
  }
}

# Primary endpoint record. The IP is a placeholder; Route 53 health
# checkers need a publicly reachable address.
resource "aws_route53_record" "primary-endpoint" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "tickdb-primary.example.com"
  type    = "A"
  ttl     = 60
  records = ["10.0.1.50"] # Primary endpoint IP
}

# Standby endpoint record - points to TickDB standby endpoint
resource "aws_route53_record" "backup-endpoint" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "tickdb-standby.example.com"
  type    = "CNAME"
  ttl     = 60
  records = ["api.tickdb.ai"] # TickDB standby endpoint
}

# Failover pair - this is the name clients actually resolve.
# Both records share the same name and type, as failover routing requires.
resource "aws_route53_record" "market-data-primary" {
  zone_id         = aws_route53_zone.main.zone_id
  name            = "market-data-api.example.com"
  type            = "CNAME"
  ttl             = 60
  records         = ["tickdb-primary.example.com"]
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary-market-data.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "market-data-secondary" {
  zone_id         = aws_route53_zone.main.zone_id
  name            = "market-data-api.example.com"
  type            = "CNAME"
  ttl             = 60
  records         = ["tickdb-standby.example.com"]
  set_identifier  = "backup"
  health_check_id = aws_route53_health_check.backup-market-data.id

  failover_routing_policy {
    type = "SECONDARY"
  }
}

# Note: alias records, as opposed to the CNAMEs used here, target AWS
# resources such as a CloudFront distribution or ELB.
# See: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/failover-overriding-health-check.html
⚠️ Engineering warning: The configuration above is simplified and assumes a hosted zone resource named aws_route53_zone.main that is defined elsewhere. Route 53 failover records must share the same name and type, which is why the client-facing name uses a pair of CNAME records; this suits an external standby such as TickDB's API endpoint. If you prefer alias records, the targets must be AWS resources such as a load balancer or CloudFront distribution. Always test failover behavior in a staging environment before deploying to production.
The DNS Failover Timing Budget
To understand why this architecture achieves sub-30-second failover, we need to decompose the timing:
| Phase | Duration | Mechanism |
|---|---|---|
| Health check failure detection | ~30 seconds | 3 consecutive failures × 10-second interval |
| DNS TTL propagation | 0–60 seconds | Client-side DNS caching |
| Connection establishment | 0–2 seconds | TCP handshake to new endpoint |
| Total worst case | ~92 seconds | Sum of the three phases |
The theoretical maximum approaches 92 seconds, but the sub-30-second target is reachable through two mechanisms. First, the health checker callback in our implementation fires as soon as the failure counter reaches its threshold, before DNS propagation completes; with the default 5-second check interval and a threshold of 3, that is roughly 15 seconds plus request time. Second, the 60-second record TTL only bounds how long resolvers may keep serving the old answer, and lowering it shrinks the propagation phase at the cost of more DNS queries.
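The arithmetic behind the table can be sketched as a small calculator. The class name `FailoverBudget` and its fields are hypothetical; the default values are the ones from the table above.

```python
from dataclasses import dataclass


@dataclass
class FailoverBudget:
    check_interval_s: float = 10.0
    failure_threshold: int = 3
    dns_ttl_s: float = 60.0
    connect_s: float = 2.0

    @property
    def detection_s(self) -> float:
        """Time for enough consecutive failed checks to accumulate."""
        return self.check_interval_s * self.failure_threshold

    def worst_case_s(self, uses_dns: bool = True) -> float:
        """Detection + DNS propagation (if any) + connection setup."""
        propagation = self.dns_ttl_s if uses_dns else 0.0
        return self.detection_s + propagation + self.connect_s


dns_client = FailoverBudget()
in_process = FailoverBudget(check_interval_s=5.0)
print(dns_client.worst_case_s())                 # 92.0
print(in_process.worst_case_s(uses_dns=False))   # 17.0
```

The second figure assumes a client that switches endpoints in-process on a 5-second check interval and therefore skips the DNS phase; it ignores the time each failed request takes to time out.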
For clients that require guaranteed sub-30-second failover, we recommend implementing a client-side failover agent that maintains a persistent WebSocket connection to both endpoints simultaneously and switches to the backup stream immediately upon detecting primary degradation. This removes DNS propagation from the recovery path entirely.
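The selection logic of such an agent can be sketched without any networking. The example below assumes both streams are already connected and simply tracks when each last delivered a tick; `StreamSelector` and `max_silence_s` are hypothetical names, and the WebSocket handling itself is left out.

```python
from typing import Optional


class StreamSelector:
    """Pick which of two live streams to trust, based on tick recency."""

    def __init__(self, max_silence_s: float = 2.0):
        self.max_silence_s = max_silence_s
        self._last_seen: dict[str, Optional[float]] = {"primary": None, "backup": None}

    def on_tick(self, source: str, now: float) -> None:
        """Record that `source` delivered a tick at monotonic time `now`."""
        self._last_seen[source] = now

    def active(self, now: float) -> Optional[str]:
        """Return the stream to consume from, preferring the primary."""
        for source in ("primary", "backup"):
            seen = self._last_seen[source]
            if seen is not None and now - seen <= self.max_silence_s:
                return source
        return None  # both streams silent


selector = StreamSelector(max_silence_s=2.0)
selector.on_tick("primary", now=100.0)
selector.on_tick("backup", now=100.1)
print(selector.active(now=101.0))      # primary
selector.on_tick("backup", now=103.0)  # primary has gone quiet
print(selector.active(now=103.5))      # backup
```

Because the backup stream is already open, the switch costs one comparison, not a DNS lookup and a new handshake. In a real agent, `now` would come from `time.monotonic()`.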
Layer 3: Client-Side Resilience and Transparent Failover
Even with perfect infrastructure failover, client applications need to be designed to handle the transition gracefully. A client that simply retries with the same DNS name on failure will automatically receive the new endpoint after DNS propagation, but this does not guarantee data continuity. We need a client that can recover the in-flight request that was interrupted by the failover event.
Implementing a Failover-Aware Market Data Client
The following client implementation wraps the failover logic transparently, ensuring that the calling application never needs to know which endpoint is currently active:
import os
import time
import random
import logging
from typing import Optional
from enum import Enum
from dataclasses import dataclass

import requests

logger = logging.getLogger(__name__)


# Custom exception classes
class RateLimitError(Exception):
    pass


class ServerRateLimitError(Exception):
    pass


class AuthenticationError(Exception):
    pass


class SymbolNotFoundError(Exception):
    pass


class APIError(Exception):
    pass


class RequestError(Exception):
    pass


class EndpointState(Enum):
    PRIMARY = "primary"
    BACKUP = "backup"
    TRANSITIONING = "transitioning"
    DEGRADED = "degraded"


@dataclass
class FailoverMetrics:
    """Metrics for monitoring failover performance."""
    total_failovers: int = 0
    last_failover_timestamp: Optional[float] = None
    last_failover_reason: Optional[str] = None
    average_failover_duration_ms: float = 0.0
    primary_uptime_percent: float = 100.0


class FailoverMarketDataClient:
    """
    Market data client with automatic failover to backup data source.

    This client holds the URLs of both the primary endpoint and the backup
    (TickDB) endpoint. When the primary degrades, it transparently switches
    to the backup without requiring application changes.

    Engineering note: For production HFT workloads, use asyncio-based client
    with aiohttp for non-blocking I/O. This synchronous implementation is
    suitable for backtesting, research, and moderate-frequency trading systems.
    """

    def __init__(
        self,
        primary_url: str = "https://primary.example.com",
        backup_url: str = "https://api.tickdb.ai",  # TickDB backup endpoint
        api_key: Optional[str] = None,
        timeout_connect: float = 3.05,
        timeout_read: float = 10.0,
        max_retries: int = 3,
        failover_threshold_ms: float = 500.0
    ):
        self.primary_url = primary_url
        self.backup_url = backup_url
        self.api_key = api_key or os.environ.get("TICKDB_API_KEY")
        if not self.api_key:
            raise ValueError("API key must be provided or set via TICKDB_API_KEY env var")
        self.timeout_connect = timeout_connect
        self.timeout_read = timeout_read
        self.max_retries = max_retries
        self.failover_threshold_ms = failover_threshold_ms
        self._state = EndpointState.PRIMARY
        self._metrics = FailoverMetrics()
        self._primary_failures = 0.0
        # Rate limiting state
        self._rate_limit_reset: Optional[float] = None

    def _make_request(
        self,
        method: str,
        url: str,
        **kwargs
    ) -> requests.Response:
        """Execute HTTP request with timeout and auth headers."""
        headers = kwargs.pop("headers", {})
        headers["X-API-Key"] = self.api_key
        headers["User-Agent"] = "TickDB-Failover-Client/1.0"
        kwargs.setdefault("timeout", (self.timeout_connect, self.timeout_read))
        kwargs["headers"] = headers
        return requests.request(method, url, **kwargs)

    def _handle_response(self, response: requests.Response) -> dict:
        """Process API response, handling error codes."""
        if response.status_code == 429:
            # Rate limited - extract Retry-After header
            retry_after = int(response.headers.get("Retry-After", 5))
            self._rate_limit_reset = time.time() + retry_after
            logger.warning(f"Rate limited - backing off for {retry_after}s")
            time.sleep(retry_after)
            raise RateLimitError(retry_after)
        data = response.json()
        code = data.get("code", -1)
        if code == 0:
            return data.get("data", {})
        elif code in (1001, 1002):
            raise AuthenticationError("Invalid API key")
        elif code == 2002:
            raise SymbolNotFoundError(f"Symbol not found: {data.get('symbol')}")
        elif code == 3001:
            retry_after = int(response.headers.get("Retry-After", 1))
            logger.warning(f"Server rate limit (3001) - retrying after {retry_after}s")
            time.sleep(retry_after)
            raise ServerRateLimitError(retry_after)
        else:
            raise APIError(f"API error {code}: {data.get('message', 'unknown')}")

    def _record_primary_failure(self, reason: str):
        """Record a primary endpoint failure and trigger failover if threshold exceeded."""
        self._primary_failures += 1
        logger.warning(f"Primary endpoint failure #{self._primary_failures}: {reason}")
        if self._primary_failures >= 3:
            self._trigger_failover(reason)

    def _trigger_failover(self, reason: str):
        """Switch to backup endpoint."""
        if self._state == EndpointState.BACKUP:
            logger.warning("Already using backup - skipping failover trigger")
            return
        failover_start = time.time()
        self._state = EndpointState.TRANSITIONING
        logger.critical(f"FAILOVER TRIGGERED: {reason} - switching to backup endpoint")
        # Reset connection pool for new endpoint
        # In production, this would close and reopen WebSocket connections
        self._state = EndpointState.BACKUP
        self._primary_failures = 0
        # Update metrics
        self._metrics.total_failovers += 1
        self._metrics.last_failover_timestamp = time.time()
        self._metrics.last_failover_reason = reason
        failover_duration = (time.time() - failover_start) * 1000
        self._metrics.average_failover_duration_ms = (
            (self._metrics.average_failover_duration_ms * (self._metrics.total_failovers - 1) + failover_duration)
            / self._metrics.total_failovers
        )
        logger.info(f"Failover complete in {failover_duration:.1f}ms - now using backup endpoint")

    def _attempt_recovery(self):
        """Probe the primary endpoint and switch back to it if it responds.

        This method is not scheduled by the client; call it periodically
        while the client is in the BACKUP state.
        """
        if self._state != EndpointState.BACKUP:
            return
        logger.info("Attempting primary endpoint recovery check...")
        try:
            response = self._make_request(
                "GET",
                f"{self.primary_url}/v1/market/kline/latest",
                params={"symbol": "AAPL.US", "interval": "1m"}
            )
            self._handle_response(response)
            # Primary is responsive again
            self._state = EndpointState.PRIMARY
            self._primary_failures = 0
            logger.info("Primary endpoint recovered - switching back")
        except Exception as e:
            logger.info(f"Primary still unavailable: {e}")

    def get_kline(
        self,
        symbol: str,
        interval: str = "1h",
        limit: int = 100
    ) -> list[dict]:
        """
        Fetch OHLCV kline data with automatic failover.

        Args:
            symbol: Market symbol (e.g., "AAPL.US", "BTC.USDT")
            interval: Candle interval (e.g., "1m", "1h", "1d")
            limit: Number of candles to retrieve

        Returns:
            List of kline dictionaries with OHLCV data
        """
        # Check rate limit state
        if self._rate_limit_reset and time.time() < self._rate_limit_reset:
            wait_time = self._rate_limit_reset - time.time()
            logger.warning(f"Rate limited - waiting {wait_time:.1f}s")
            time.sleep(wait_time)

        last_exception = None
        for attempt in range(self.max_retries):
            # Re-evaluate the endpoint on every attempt so that a failover
            # triggered by an earlier attempt takes effect immediately
            use_primary = self._state == EndpointState.PRIMARY
            url = self.primary_url if use_primary else self.backup_url
            endpoint_name = "PRIMARY" if use_primary else "BACKUP"
            logger.debug(f"Requesting kline from {endpoint_name} endpoint: {url}")
            try:
                start_time = time.perf_counter()
                response = self._make_request(
                    "GET",
                    f"{url}/v1/market/kline",
                    params={"symbol": symbol, "interval": interval, "limit": limit}
                )
                latency_ms = (time.perf_counter() - start_time) * 1000
                # Record success for primary health tracking
                if self._state == EndpointState.PRIMARY:
                    if latency_ms > self.failover_threshold_ms:
                        # Slow success: decay the counter instead of resetting it
                        self._primary_failures = max(0, self._primary_failures - 0.5)
                    else:
                        self._primary_failures = 0
                data = self._handle_response(response)
                return data.get("klines", [])
            except RateLimitError:
                raise  # Propagate rate limits to caller for handling
            except (AuthenticationError, SymbolNotFoundError):
                raise  # These are permanent failures
            except (requests.exceptions.Timeout, requests.exceptions.ConnectionError) as e:
                last_exception = e
                logger.warning(f"Request failed (attempt {attempt + 1}/{self.max_retries}): {e}")
                if self._state == EndpointState.PRIMARY:
                    self._record_primary_failure(str(e))
                # Exponential backoff with jitter
                delay = min(2 ** attempt * 0.5, 10.0)
                jitter = random.uniform(0, delay * 0.1)
                time.sleep(delay + jitter)
            except ServerRateLimitError as e:
                last_exception = e
                continue  # Already waited, retry immediately
            except Exception as e:
                last_exception = e
                logger.error(f"Unexpected error during request: {e}")
                time.sleep(1.0)

        # All retries exhausted
        raise RequestError(
            f"Failed to fetch kline after {self.max_retries} attempts: {last_exception}"
        )

    def get_metrics(self) -> FailoverMetrics:
        """Return current failover metrics for monitoring."""
        return self._metrics
This client implements the failover pattern as a transparent wrapper. The calling application never needs to know which endpoint is active: it simply calls get_kline() and receives data. The client handles endpoint selection, retry logic, and rate limit backoff. Recovery to the primary is available through _attempt_recovery(), which the client does not schedule itself, so the surrounding application has to invoke it periodically while in the backup state. The FailoverMetrics dataclass provides observability into failover behavior, enabling dashboards that track failover frequency and duration.
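Since the client's `_attempt_recovery()` method is never invoked by the client itself, something has to call it on a schedule. The following is one possible sketch of a rate-limited scheduler; `RecoveryScheduler`, its 30-second default, and the injectable `clock` are assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class RecoveryScheduler:
    """Rate-limit calls to a recovery probe."""

    def __init__(
        self,
        probe: Callable[[], None],
        interval_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._probe = probe
        self._interval_s = interval_s
        self._clock = clock
        self._last_attempt: Optional[float] = None

    def maybe_probe(self, in_backup: bool) -> bool:
        """Run the probe if in backup state and the interval has elapsed."""
        if not in_backup:
            return False
        now = self._clock()
        if self._last_attempt is not None and now - self._last_attempt < self._interval_s:
            return False
        self._last_attempt = now
        self._probe()
        return True


# Demonstration with a fake clock so the example runs instantly
calls = []
fake_now = [0.0]
scheduler = RecoveryScheduler(
    probe=lambda: calls.append(fake_now[0]),
    interval_s=30.0,
    clock=lambda: fake_now[0],
)
for t in (0.0, 10.0, 31.0):
    fake_now[0] = t
    scheduler.maybe_probe(in_backup=True)
print(calls)  # [0.0, 31.0]
```

With the client above, the probe would be `client._attempt_recovery` and `in_backup` would be derived from the client's state, for example before each `get_kline()` call. Spacing out probes keeps a still-failing primary from being hammered during an outage.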
Integrating TickDB as the Standby Data Source
TickDB serves as the ideal backup data source for several structural reasons that align with our disaster recovery requirements.
First, TickDB operates independently of AWS infrastructure. When your primary data source is hosted on AWS us-east-1, a failure in that region is unlikely to affect TickDB's API endpoints, which are distributed across multiple cloud providers. This architectural independence is critical — a backup hosted in the same AWS region would inherit the same failure modes as the primary.
Second, TickDB provides the same API schema for both primary and backup requests. Our client implementation uses identical request formats whether hitting the primary or the backup. This means that failover requires no schema translation, no response normalization, and no conditional logic based on which endpoint is active.
Third, TickDB's rate limits are generous enough to support failover traffic spikes. During a failover event, backup traffic typically spikes 10x to 50x above normal levels. TickDB's infrastructure is designed to handle these traffic patterns, with server-side rate limits that accommodate burst traffic during recovery windows.
Fourth, TickDB offers multi-asset coverage across forex, crypto, US equities, HK equities, A-shares, commodities, and indices. This means that a single backup connection can serve your entire market data needs rather than requiring separate backup connections for each asset class.
To configure TickDB as your standby data source, set the TICKDB_API_KEY environment variable and use https://api.tickdb.ai as the backup URL:
# Environment configuration for disaster recovery setup
import os

# Primary endpoint (your internal market data service)
PRIMARY_MARKET_DATA_URL = os.environ.get(
    "PRIMARY_MARKET_DATA_URL",
    "https://market-data-internal.example.com"
)

# Backup endpoint (TickDB)
BACKUP_MARKET_DATA_URL = os.environ.get(
    "BACKUP_MARKET_DATA_URL",
    "https://api.tickdb.ai"
)

# Shared API key for backup
TICKDB_API_KEY = os.environ.get("TICKDB_API_KEY")

# Initialize failover client
client = FailoverMarketDataClient(
    primary_url=PRIMARY_MARKET_DATA_URL,
    backup_url=BACKUP_MARKET_DATA_URL,
    api_key=TICKDB_API_KEY
)
Measuring and Monitoring Failover Performance
A disaster recovery architecture is only as good as your ability to measure its performance. We instrument the system with three key metrics: failover detection time, failover completion time, and data gap during failover.
Failover Detection Time
This is the time between the primary endpoint becoming unavailable and the health check system detecting the failure. Our implementation targets a detection time of 30 seconds (3 consecutive health check failures × 10-second interval). To monitor this metric, we track the timestamp of each health check failure and compute the delta to when the failover callback is triggered.
Failover Completion Time
This is the time between failover trigger and the client successfully receiving data from the backup endpoint. For DNS-routed clients it includes DNS TTL propagation, connection establishment, and authentication. Our target is under 5 seconds for clients using the failover-aware client, which already holds the backup URL and switches in-process without waiting for DNS.
Data Gap
This measures the continuity of market data during the failover window. A well-designed failover should result in zero data gaps: the client receives the last candle before the failure from the primary and the next candle from the backup endpoint, without missing any price updates. We measure this by comparing the sequence of candle timestamps received before, during, and after failover events.
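The timestamp comparison can be sketched as a small gap finder. It assumes candle timestamps are epoch milliseconds aligned to the candle interval; the function name `find_gaps` is illustrative.

```python
def find_gaps(timestamps_ms, interval_ms):
    """Return (previous, next, missing_count) for every hole in a candle series."""
    gaps = []
    ordered = sorted(set(timestamps_ms))  # drop duplicates from overlapping sources
    for prev, nxt in zip(ordered, ordered[1:]):
        missing = (nxt - prev) // interval_ms - 1
        if missing > 0:
            gaps.append((prev, nxt, missing))
    return gaps


minute = 60_000
# Candles at minutes 0, 1, 2, then 5 (received from both sources), then 6
received = [0, minute, 2 * minute, 5 * minute, 5 * minute, 6 * minute]
print(find_gaps(received, minute))  # [(120000, 300000, 2)]
```

Here two one-minute candles (minutes 3 and 4) are missing between the last primary candle and the first backup candle. An empty result means the failover was gap-free at candle granularity.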
Deployment Recommendations by Scale
| Scale | Architecture | Configuration |
|---|---|---|
| Individual trader | Single failover client | Use the client library with TickDB as the primary source and a locally cached fallback to simplify operations |
| Trading firm (5–50 researchers) | Managed DNS failover | Route 53 health checks + client-side fallback; deploy health checker as a sidecar service |
| Institutional team (50+ developers) | Full multi-cloud deployment | Dedicated health check cluster across 3 availability zones + Route 53 + CloudFront + client-side failover |
For individual traders, the complexity of a full multi-cloud setup is not justified. Instead, use TickDB as the primary data source with a locally cached fallback, reducing operational overhead while maintaining data availability.
For trading firms, the Route 53 failover architecture described in this article provides the right balance of complexity and resilience. Deploy the health check orchestrator as a containerized service with automatic restarts and health monitoring.
For institutional teams, extend the architecture with a dedicated health check cluster (backed by Consul or etcd for distributed consensus), multi-region DNS with latency-based routing, and a client-side SDK that maintains connections to all available endpoints simultaneously.
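When several health checkers run in different availability zones, a single checker with a broken network path should not be able to trigger failover on its own. A minimal majority-vote sketch follows; it is not a substitute for Consul or etcd, and `quorum_unhealthy` and the zone names are hypothetical.

```python
def quorum_unhealthy(votes: dict[str, bool], min_voters: int = 3) -> bool:
    """Return True when a strict majority of checkers report the endpoint unhealthy.

    votes maps a checker name to True (healthy) or False (unhealthy).
    """
    if len(votes) < min_voters:
        return False  # too few observers to decide; stay on the primary
    unhealthy = sum(1 for healthy in votes.values() if not healthy)
    return unhealthy > len(votes) / 2


print(quorum_unhealthy({"az-a": False, "az-b": False, "az-c": True}))  # True
print(quorum_unhealthy({"az-a": False, "az-b": True, "az-c": True}))   # False
```

In the second call only one zone sees the primary as down, which more likely indicates a problem local to that checker than a real outage.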
Conclusion
A well-designed disaster recovery architecture is not a luxury for production trading systems — it is a baseline requirement. When the primary data source fails, every second of downtime translates directly into missed opportunities and accumulating risk.
The architecture we have implemented addresses all three failure modes that bring down simpler systems: it monitors continuously rather than relying on binary health checks, it automates failover without requiring human intervention, and it isolates the network path between primary and backup to prevent cascading failures.
The three-layer approach (health check orchestrator, DNS-based failover, and client-side resilience) works together to target sub-30-second failover, with clients that use the failover-aware library aiming to recover within 5 seconds of the failover trigger.
TickDB plays a critical role as the standby data source: it operates independently of your primary infrastructure, exposes the same API schema, handles traffic spikes gracefully, and covers all major asset classes from a single endpoint.
Next Steps
If you are evaluating disaster recovery solutions for your trading infrastructure, review your current Mean Time to Recovery (MTTR) for primary data source failures. If it exceeds 60 seconds, the architecture described in this article will deliver meaningful improvement.
If you want to test the failover client with real market data, sign up for a free API key at tickdb.ai and configure the environment variable TICKDB_API_KEY. The client code from this article can be copied directly into your backtesting environment.
If you are running institutional infrastructure and need dedicated failover capacity, reach out to enterprise@tickdb.ai for information about SLA-backed endpoints, dedicated support, and custom integration assistance.
If you use AI coding assistants for trading system development, search for and install the tickdb-market-data SKILL in your AI tool's marketplace for integrated access to historical OHLCV data, real-time depth, and multi-asset coverage.
This article does not constitute investment advice. Markets involve risk; past performance does not guarantee future results. Always validate disaster recovery procedures in a staging environment before deploying to production.