On December 7, 2021, an automated capacity-scaling activity in AWS us-east-1 triggered a surge of traffic on AWS's internal network. The resulting congestion impaired services across the region, including the management console, API Gateway, and the EC2 APIs, for several hours. Trading systems dependent on that infrastructure ground to a halt. Some teams recovered within minutes by rerouting traffic. Others spent the entire outage watching their screens go dark.

The difference between those outcomes was not luck. It was architecture.

When your primary data source fails — whether due to a cloud provider outage, a network partition, or an internal infrastructure incident — every second of downtime translates directly into missed trading opportunities, stale risk calculations, and eroding confidence in your systems. In markets that move in milliseconds, a 30-second recovery window is not a luxury. It is a baseline requirement.

This article walks through the complete architecture for achieving that recovery window. We will examine DNS-based failover mechanisms, distributed health check systems, dual-cloud deployment patterns, and the specific integration points where TickDB serves as a reliable standby data source when your primary market data feed becomes unavailable. All code samples are production-grade, including heartbeat, reconnection, and proper environment-based credential management.

The Architecture Problem: Why Simple Redundancy Is Not Enough

Most engineering teams approach disaster recovery with a simple mental model: "We have a primary server and a backup server. If the primary fails, we switch to the backup." This model fails in three critical ways under real-world conditions.

First, it assumes failure is binary. In practice, partial degradation is far more common than complete outage. A data source might return responses with 500ms latency while still reporting a 200 OK status code. A connection pool might exhaust while the API itself remains technically operational. Binary health checks miss these cases.

Second, simple redundancy usually relies on manual intervention. Someone receives an alert, diagnoses the problem, and initiates the failover. Even with well-documented runbooks, this process typically takes 5 to 15 minutes, far outside the 30-second target.

Third, most backup configurations use the same network path as the primary. If the failure is in the network layer itself — a BGP leak, a transit provider outage, a DNS infrastructure problem — your backup inherits the same connectivity issue.

The architecture we will implement addresses all three failure modes: it monitors health continuously rather than relying on binary checks, it automates failover without human intervention, and it isolates the network path between primary and backup so that a single point of failure cannot affect both simultaneously.
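
To make the first failure mode concrete, here is a minimal sketch of a non-binary health classification. The names Health, Probe, and classify are illustrative and not part of any library, and the 500 ms threshold is simply the figure used in the example above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"  # reachable, but too slow or returning bad data
    DOWN = "down"          # unreachable or failing server-side


@dataclass
class Probe:
    """Result of one probe. status_code is None if the request never completed."""
    status_code: Optional[int]
    latency_ms: Optional[float]
    payload_valid: bool = True


def classify(probe: Probe, latency_threshold_ms: float = 500.0) -> Health:
    """Grade a probe instead of collapsing it to up/down."""
    if probe.status_code is None or probe.status_code >= 500:
        return Health.DOWN
    if probe.status_code != 200 or not probe.payload_valid:
        return Health.DEGRADED
    if probe.latency_ms is not None and probe.latency_ms > latency_threshold_ms:
        return Health.DEGRADED
    return Health.HEALTHY
```

A binary check would report the 200 OK response with 800 ms latency as healthy; here it is graded as degraded, which gives the failover logic something to act on before the endpoint goes down completely.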

System Architecture Overview

Before diving into implementation details, we need a clear picture of the overall system architecture. The design follows a layered approach where each layer handles a specific failure mode.

┌─────────────────────────────────────────────────────────────────┐
│                     CLIENT APPLICATION                          │
│         (Trading engine, Risk calculator, Dashboard)            │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    HEALTH CHECK ORCHESTRATOR                    │
│     Continuous monitoring of primary and backup endpoints       │
│     Distributed across 3+ availability zones                    │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                            DNS LAYER                            │
│             Route 53 health-check integrated                    │
│         Primary: tickdb-primary.example.com                     │
│         Backup:  tickdb-standby.example.com                     │
└─────────────────────────────────────────────────────────────────┘
          │                                           │
          │ Primary healthy                           │ Failover
          ▼                                           ▼
┌──────────────────────┐              ┌──────────────────────┐
│  PRIMARY DATA SOURCE │              │  BACKUP DATA SOURCE  │
│   AWS us-east-1      │              │   (TickDB Standby)   │
│                      │              │                      │
│   Primary endpoint   │              │   tickdb.ai API      │
│   tickdb-primary     │              │   Standby endpoint   │
└──────────────────────┘              └──────────────────────┘

The client application makes requests through a DNS hostname that resolves to the currently healthy endpoint. The health check orchestrator continuously tests both the primary and backup endpoints and updates DNS records accordingly. This separation between the client logic and the failover logic means that application code never needs to know which endpoint is active — it simply sends requests and handles failures transparently.

Layer 1: Health Check Orchestrator

The health check system is the brain of the failover architecture. It must distinguish between hard failures (the endpoint is completely unreachable) and soft failures (the endpoint responds but with degraded quality). A well-designed health check will catch degradation before it becomes a complete outage, enabling proactive failover rather than reactive recovery.

Defining Health Check Criteria for Market Data APIs

For a market data API, the definition of "healthy" extends beyond simple HTTP reachability. A healthy endpoint must meet three criteria: it responds within an acceptable latency threshold, it returns semantically valid data (not just a 200 status code), and it exhibits stable behavior over a sampling window rather than flickering between healthy and unhealthy states.
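
The third criterion, stability over a sampling window, is essentially hysteresis. As a minimal sketch (the StableHealth class and its fall and rise parameters are illustrative names, not part of any library), a debounced flag only changes state after several consecutive checks agree:

```python
class StableHealth:
    """Debounced health flag: flips only after enough consecutive agreeing checks."""

    def __init__(self, fall: int = 3, rise: int = 2, healthy: bool = True):
        self.fall = fall        # consecutive failures needed to become unhealthy
        self.rise = rise        # consecutive successes needed to become healthy again
        self.healthy = healthy
        self._streak = 0        # length of the current run that disagrees with self.healthy

    def observe(self, check_ok: bool) -> bool:
        """Feed one check result and return the debounced state."""
        if check_ok == self.healthy:
            self._streak = 0    # the check agrees with the current state
            return self.healthy
        self._streak += 1
        needed = self.rise if check_ok else self.fall
        if self._streak >= needed:
            self.healthy = check_ok
            self._streak = 0
        return self.healthy


flag = StableHealth(fall=3, rise=2)
history = [flag.observe(ok) for ok in [False, True, False, False, False, True, True]]
print(history)  # [True, True, True, True, False, False, True]
```

The single failure at the start does not flip the state; only the run of three failures does, and two consecutive successes are required before the endpoint counts as healthy again.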

Here is a production-grade health check implementation that addresses all three criteria:

import os
import time
import logging
import statistics
import threading
from dataclasses import dataclass
from typing import Callable, Optional
from datetime import datetime

import requests

logger = logging.getLogger(__name__)


@dataclass
class HealthStatus:
    """Health status for a single endpoint."""
    endpoint_url: str
    is_healthy: bool
    latency_ms: Optional[float] = None
    error_message: Optional[str] = None
    consecutive_failures: int = 0
    last_check: Optional[datetime] = None


class MarketDataHealthChecker:
    """
    Continuous health monitoring for market data API endpoints.

    This checker implements three failure detection mechanisms:
    1. Latency threshold: a slow but successful response counts as half a failure
    2. Error rate: each hard failure counts as one; failure_threshold marks unhealthy
    3. Staleness: data timestamp validation to detect frozen feeds
    """

    def __init__(
        self,
        primary_url: str,
        backup_url: str,
        latency_threshold_ms: float = 500.0,
        sampling_window: int = 10,
        failure_threshold: int = 3,
        check_interval_seconds: float = 5.0
    ):
        self.primary_url = primary_url
        self.backup_url = backup_url
        self.latency_threshold_ms = latency_threshold_ms
        self.sampling_window = sampling_window
        self.failure_threshold = failure_threshold
        self.check_interval_seconds = check_interval_seconds

        # Rolling latency samples and weighted failure scores, keyed by endpoint URL
        self._samples: dict[str, list[float]] = {primary_url: [], backup_url: []}
        self._failures: dict[str, float] = {primary_url: 0.0, backup_url: 0.0}
        self._stop_event = threading.Event()
        self._lock = threading.Lock()

        # Load API credentials from environment
        self.api_key = os.environ.get("TICKDB_API_KEY")
        if not self.api_key:
            raise ValueError("TICKDB_API_KEY environment variable is required")

    def _measure_latency(self, url: str, symbol: str = "AAPL.US") -> tuple[bool, Optional[float], Optional[str]]:
        """
        Measure endpoint latency with a real data request.

        Returns: (success, latency_ms, error_message)
        """
        try:
            start = time.perf_counter()
            response = requests.get(
                f"{url}/v1/market/kline/latest",
                params={"symbol": symbol, "interval": "1m"},
                headers={"X-API-Key": self.api_key},
                timeout=(3.05, 10.0)  # Connect timeout, read timeout
            )
            latency_ms = (time.perf_counter() - start) * 1000

            if response.status_code != 200:
                return False, latency_ms, f"HTTP {response.status_code}"

            # Validate response structure
            data = response.json()
            if data.get("code", -1) != 0:
                return False, latency_ms, f"API error: {data.get('message', 'unknown')}"

            # Verify data freshness - market data older than 5 minutes is stale.
            # Timestamps are assumed to be epoch milliseconds. Outside trading
            # hours the latest candle is legitimately old, so gate this check
            # on market hours for instruments that do not trade around the clock.
            if "data" in data and "timestamp" in data["data"]:
                data_age_seconds = (time.time() * 1000 - data["data"]["timestamp"]) / 1000
                if data_age_seconds > 300:
                    logger.warning(f"Stale data detected: {data_age_seconds:.1f}s old")
                    return False, latency_ms, f"Stale data: {data_age_seconds:.1f}s"

            return True, latency_ms, None

        except requests.exceptions.Timeout:
            return False, None, "Timeout"
        except requests.exceptions.ConnectionError as e:
            return False, None, f"Connection error: {str(e)[:50]}"
        except Exception as e:
            return False, None, f"Unexpected error: {str(e)[:50]}"

    def _check_endpoint(self, url: str) -> HealthStatus:
        """Probe one endpoint and fold the result into its failure score."""
        success, latency, error = self._measure_latency(url)

        with self._lock:
            # Add latency sample to rolling window, maintaining max size
            samples = self._samples[url]
            if latency is not None:
                samples.append(latency)
                if len(samples) > self.sampling_window:
                    samples.pop(0)

            if not success:
                self._failures[url] += 1.0
            elif latency > self.latency_threshold_ms:
                # Latency degradation - count as half a failure, so sustained
                # slowness trips the threshold but one slow response does not
                self._failures[url] += 0.5
                error = f"Slow response: {latency:.1f}ms"
            else:
                self._failures[url] = 0.0

            failures = self._failures[url]

        return HealthStatus(
            endpoint_url=url,
            is_healthy=failures < self.failure_threshold,
            latency_ms=latency,
            error_message=error,
            consecutive_failures=int(failures),
            last_check=datetime.now()
        )

    @staticmethod
    def _format_latency(latency_ms: Optional[float]) -> str:
        """Latency is None when the request timed out or never connected."""
        return f"{latency_ms:.1f}ms" if latency_ms is not None else "n/a"

    def check_and_update(self) -> tuple[HealthStatus, HealthStatus]:
        """
        Perform health check on both endpoints and return status for each.

        Returns: (primary_status, backup_status)
        """
        primary_status = self._check_endpoint(self.primary_url)
        backup_status = self._check_endpoint(self.backup_url)

        logger.info(
            f"Health check complete | Primary: {'OK' if primary_status.is_healthy else 'FAIL'} "
            f"({self._format_latency(primary_status.latency_ms)}) | "
            f"Backup: {'OK' if backup_status.is_healthy else 'FAIL'} "
            f"({self._format_latency(backup_status.latency_ms)})"
        )

        return primary_status, backup_status

    def get_p99_latency(self, endpoint: str) -> Optional[float]:
        """Return P99 latency for monitoring dashboards (None until 3 samples exist)."""
        with self._lock:
            samples = list(self._samples.get(endpoint, []))
        if len(samples) < 3:
            return None
        return statistics.quantiles(samples, n=100)[98]

    def start_background_monitoring(self, callback: Callable[[str, bool], None]) -> threading.Thread:
        """
        Start background monitoring thread.

        callback(endpoint_url, is_healthy) is called with the primary URL
        whenever the primary's health status changes.
        """
        def monitor_loop():
            primary_was_healthy = True
            while not self._stop_event.is_set():
                try:
                    primary, backup = self.check_and_update()

                    # Trigger callback only on a status change, not on every check
                    if primary.is_healthy != primary_was_healthy:
                        primary_was_healthy = primary.is_healthy
                        callback(self.primary_url, primary.is_healthy)

                    if not backup.is_healthy and not primary.is_healthy:
                        logger.critical("BOTH ENDPOINTS UNHEALTHY - MANUAL INTERVENTION REQUIRED")
                except Exception:
                    # Keep the monitor alive if a check or the callback raises
                    logger.exception("Health check iteration failed")

                # wait() returns early when stop() is called
                self._stop_event.wait(self.check_interval_seconds)

        self._stop_event.clear()
        thread = threading.Thread(target=monitor_loop, daemon=True)
        thread.start()
        return thread

    def stop(self):
        """Stop the background monitoring thread."""
        self._stop_event.set()

This implementation addresses the three failure modes we identified earlier. Latency is measured with real data requests rather than synthetic pings, so application-layer degradation is caught before it becomes critical. The failure score counts a slow but successful response as half a failure, which prevents flapping on a single slow response while still tripping the threshold under sustained degradation. The staleness check on the returned data catches cases where the API is reachable but returning frozen data, a failure mode that would pass a simple HTTP health check.
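
The staleness rule is easier to test when it is pulled out into a pure function. The sketch below assumes epoch-millisecond timestamps, as the checker above does. The market_open flag is a hypothetical addition: outside trading hours the latest candle is legitimately old, so the caller is expected to supply that information.

```python
import time
from typing import Optional


def data_age_seconds(timestamp_ms: float, now_ms: Optional[float] = None) -> float:
    """Age of a data point in seconds, given its epoch-millisecond timestamp."""
    if now_ms is None:
        now_ms = time.time() * 1000
    return (now_ms - timestamp_ms) / 1000


def is_stale(
    timestamp_ms: float,
    max_age_seconds: float = 300.0,
    market_open: bool = True,
    now_ms: Optional[float] = None,
) -> bool:
    """True if the feed looks frozen. Never reports stale while the market is closed."""
    if not market_open:
        return False
    return data_age_seconds(timestamp_ms, now_ms) > max_age_seconds


now = 1_700_000_000_000  # fixed reference time in epoch milliseconds
print(is_stale(now - 301_000, now_ms=now))                        # True
print(is_stale(now - 60_000, now_ms=now))                         # False
print(is_stale(now - 3_600_000, market_open=False, now_ms=now))   # False
```

Passing now_ms explicitly keeps the function deterministic in tests; in production the default falls back to the wall clock.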

Layer 2: DNS-Based Failover with Route 53

With health monitoring in place, we now need a mechanism to actually redirect traffic when the primary fails. AWS Route 53 provides health checks integrated with DNS failover — a powerful combination that handles failovers without requiring application-level logic to manage endpoint selection.

Designing the DNS Layer for Sub-30-Second Failover

Route 53 health checks run at one of two request intervals: 30 seconds (standard) or 10 seconds (fast). For a sub-30-second target we use the 10-second interval with a failure threshold of 3, so Route 53 marks the endpoint unhealthy roughly 30 seconds (3 checks × 10 seconds) after it stops responding, while a single failed probe caused by transient network jitter does not trigger a failover. That figure covers detection only; client-side DNS caching adds to it, as the timing budget later in this section shows.

Here is the infrastructure-as-code configuration for setting up the Route 53 health-check and DNS failover architecture:

# route53-failover.tf
# Terraform configuration for Route 53 DNS failover.
# Assumes the hosted zone aws_route53_zone.main is defined elsewhere.

provider "aws" {
  region = "us-east-1"  # Route 53 is a global service, so one provider is enough
}

# Health check for primary endpoint.
# Route 53 health checkers cannot send custom headers such as X-API-Key,
# so this path must return a 2xx or 3xx status without authentication.
resource "aws_route53_health_check" "primary-market-data" {
  fqdn              = "tickdb-primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/v1/market/kline/latest?symbol=AAPL.US&interval=1m"
  failure_threshold = 3
  request_interval  = 10
  measure_latency   = true
  enable_sni        = true  # Send the hostname during the TLS handshake

  tags = {
    Environment = "production"
    Component   = "market-data"
    Failover    = "primary"
  }
}

# Health check for backup endpoint (TickDB standby)
resource "aws_route53_health_check" "backup-market-data" {
  fqdn              = "tickdb-standby.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/v1/market/kline/latest?symbol=AAPL.US&interval=1m"
  failure_threshold = 3
  request_interval  = 10
  measure_latency   = true
  enable_sni        = true

  tags = {
    Environment = "production"
    Component   = "market-data"
    Failover    = "backup"
  }
}

# Health check for the client-facing name itself (end-to-end probe)
resource "aws_route53_health_check" "fallback-dns" {
  fqdn              = "market-data-api.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30  # Less frequent check for the client-facing name

  tags = {
    Environment = "production"
    Component   = "market-data"
    Failover    = "alias"
  }
}

# Primary A record - points to primary data center IP
resource "aws_route53_record" "primary-endpoint" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "tickdb-primary.example.com"
  type    = "A"
  ttl     = 60
  records = ["203.0.113.50"]  # Placeholder; must be publicly reachable by Route 53 health checkers
}

# Backup CNAME record - points to TickDB standby endpoint
resource "aws_route53_record" "backup-endpoint" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "tickdb-standby.example.com"
  type    = "CNAME"
  ttl     = 60
  records = ["api.tickdb.ai"]  # TickDB standby endpoint
}

# Failover pair - this is the name clients actually resolve.
# Both records must share the same name and type; set_identifier tells them apart.
resource "aws_route53_record" "market-data-primary" {
  zone_id         = aws_route53_zone.main.zone_id
  name            = "market-data-api.example.com"
  type            = "CNAME"
  ttl             = 60
  records         = ["tickdb-primary.example.com"]
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary-market-data.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

# Served only while the primary health check is failing
resource "aws_route53_record" "market-data-secondary" {
  zone_id         = aws_route53_zone.main.zone_id
  name            = "market-data-api.example.com"
  type            = "CNAME"
  ttl             = 60
  records         = ["tickdb-standby.example.com"]
  set_identifier  = "backup"
  health_check_id = aws_route53_health_check.backup-market-data.id

  failover_routing_policy {
    type = "SECONDARY"
  }
}

⚠️ Engineering warning: Treat this configuration as a starting point. Route 53 health checkers probe from the public internet and cannot send custom headers such as X-API-Key, so the health check path must be publicly reachable and must succeed without authentication. Clients connect using the name market-data-api.example.com, so whichever endpoint the CNAME resolves to must present a TLS certificate valid for that name; a third-party API host generally will not, unless you place a proxy you control in front of it. Always test failover behavior in a staging environment before deploying to production.

The DNS Failover Timing Budget

To see where the failover time goes on the DNS path, we need to decompose the timing:

| Phase | Duration | Mechanism |
|---|---|---|
| Health check failure detection | ~30 seconds | 3 consecutive failures × 10-second interval |
| DNS TTL propagation | 0–60 seconds | Client-side DNS caching |
| Connection establishment | 0–2 seconds | TCP handshake to new endpoint |
| Total worst case | ~92 seconds | |

The worst case on the DNS path approaches 92 seconds, so DNS alone does not meet a 30-second target. Two things close the gap. First, the in-process health checker does not wait for DNS: with its default 5-second interval and threshold of 3 failures, it detects a hard failure in roughly 15 seconds, and its callback can switch endpoints immediately. Second, the 60-second TTL is an upper bound, not a typical value. A cached record is on average halfway through its lifetime when the failover happens, and lowering the TTL shrinks this term further at the cost of more DNS queries.
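
The arithmetic behind these figures is simple enough to keep next to the configuration. The following sketch uses a hypothetical FailoverBudget helper (not part of any library) to reproduce the 92-second DNS worst case and compare it with an in-process checker that bypasses DNS, using the 5-second interval and threshold of 3 from the health checker defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverBudget:
    """Worst-case failover time as the sum of detection, DNS caching, and reconnect."""
    check_interval_s: float
    failure_threshold: int
    dns_ttl_s: float        # 0 when the client switches endpoints without DNS
    connect_s: float = 2.0  # upper bound for establishing the new connection

    @property
    def detection_s(self) -> float:
        return self.check_interval_s * self.failure_threshold

    @property
    def worst_case_s(self) -> float:
        return self.detection_s + self.dns_ttl_s + self.connect_s


dns_path = FailoverBudget(check_interval_s=10, failure_threshold=3, dns_ttl_s=60)
in_process = FailoverBudget(check_interval_s=5, failure_threshold=3, dns_ttl_s=0)

print(dns_path.worst_case_s)    # 92.0
print(in_process.worst_case_s)  # 17.0
```

Only the in-process path fits inside a 30-second budget in the worst case; the DNS path is best treated as a safety net for clients that cannot switch endpoints themselves.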

For clients that require guaranteed sub-30-second failover, we recommend implementing a client-side failover agent that maintains persistent WebSocket connections to both endpoints simultaneously and switches to the backup stream as soon as it detects primary degradation. This removes DNS propagation from the failover path entirely.
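
The core of such an agent is the decision about which stream to trust. Here is a minimal sketch of that decision, with the network layer left out. DualFeedSelector and max_silence_s are illustrative names, and the rule used (prefer the primary unless it has been silent for too long) is one reasonable choice rather than the only one.

```python
import time
from typing import Optional


class DualFeedSelector:
    """Chooses which of two live feeds to trust, based on how recently each delivered data."""

    def __init__(self, max_silence_s: float = 2.0):
        self.max_silence_s = max_silence_s
        self._last_seen = {"primary": None, "backup": None}

    def on_message(self, source: str, now: Optional[float] = None) -> None:
        """Record that a message (or heartbeat) arrived from the given feed."""
        self._last_seen[source] = time.monotonic() if now is None else now

    def _is_live(self, source: str, now: float) -> bool:
        seen = self._last_seen[source]
        return seen is not None and (now - seen) <= self.max_silence_s

    def active_source(self, now: Optional[float] = None) -> Optional[str]:
        """Prefer the primary; fall back to the backup; None if both are silent."""
        now = time.monotonic() if now is None else now
        if self._is_live("primary", now):
            return "primary"
        if self._is_live("backup", now):
            return "backup"
        return None


selector = DualFeedSelector(max_silence_s=2.0)
selector.on_message("primary", now=100.0)
selector.on_message("backup", now=100.5)
print(selector.active_source(now=101.0))  # primary

selector.on_message("backup", now=103.0)  # primary has been silent since t=100.0
print(selector.active_source(now=103.5))  # backup
```

In a real agent, each WebSocket handler would call on_message() for every tick or heartbeat, and the consumer would read from whichever feed active_source() names. Because the primary is preferred whenever it is live, the switch back after recovery happens automatically.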

Layer 3: Client-Side Resilience and Transparent Failover

Even with perfect infrastructure failover, client applications need to be designed to handle the transition gracefully. A client that simply retries with the same DNS name on failure will automatically receive the new endpoint after DNS propagation, but this does not guarantee data continuity. We need a client that can recover the in-flight request that was interrupted by the failover event.

Implementing a Failover-Aware Market Data Client

The following client implementation wraps the failover logic transparently, ensuring that the calling application never needs to know which endpoint is currently active:

import os
import time
import logging
from typing import Optional, Any
from enum import Enum
from dataclasses import dataclass
import requests

logger = logging.getLogger(__name__)


class EndpointState(Enum):
    PRIMARY = "primary"
    BACKUP = "backup"
    TRANSITIONING = "transitioning"
    DEGRADED = "degraded"


@dataclass
class FailoverMetrics:
    """Metrics for monitoring failover performance."""
    total_failovers: int = 0
    last_failover_timestamp: Optional[float] = None
    last_failover_reason: Optional[str] = None
    average_failover_duration_ms: float = 0.0
    primary_uptime_percent: float = 100.0


class FailoverMarketDataClient:
    """
    Market data client with automatic failover to backup data source.
    
    This client tracks which endpoint is currently active and sends each
    request there: the primary by default, the backup (TickDB) after a
    failover. When the primary degrades, it switches to the backup without
    requiring application changes.
    
    Engineering note: For production HFT workloads, use asyncio-based client
    with aiohttp for non-blocking I/O. This synchronous implementation is
    suitable for backtesting, research, and moderate-frequency trading systems.
    """
    
    def __init__(
        self,
        primary_url: str = "https://primary.example.com",
        backup_url: str = "https://api.tickdb.ai",  # TickDB backup endpoint
        api_key: Optional[str] = None,
        timeout_connect: float = 3.05,
        timeout_read: float = 10.0,
        max_retries: int = 3,
        failover_threshold_ms: float = 500.0
    ):
        self.primary_url = primary_url
        self.backup_url = backup_url
        self.api_key = api_key or os.environ.get("TICKDB_API_KEY")
        
        if not self.api_key:
            raise ValueError("API key must be provided or set via TICKDB_API_KEY env var")
        
        self.timeout_connect = timeout_connect
        self.timeout_read = timeout_read
        self.max_retries = max_retries
        self.failover_threshold_ms = failover_threshold_ms
        
        self._state = EndpointState.PRIMARY
        self._metrics = FailoverMetrics()
        self._primary_failures = 0
        
        # Rate limiting state
        self._rate_limit_reset: Optional[float] = None
    
    def _make_request(
        self,
        method: str,
        url: str,
        **kwargs
    ) -> requests.Response:
        """Execute HTTP request with timeout and auth headers."""
        headers = kwargs.pop("headers", {})
        headers["X-API-Key"] = self.api_key
        headers["User-Agent"] = "TickDB-Failover-Client/1.0"
        
        kwargs.setdefault("timeout", (self.timeout_connect, self.timeout_read))
        kwargs["headers"] = headers
        
        return requests.request(method, url, **kwargs)
    
    def _handle_response(self, response: requests.Response) -> dict:
        """Process API response, handling error codes."""
        if response.status_code == 429:
            # Rate limited - extract Retry-After header
            retry_after = int(response.headers.get("Retry-After", 5))
            self._rate_limit_reset = time.time() + retry_after
            logger.warning(f"Rate limited - backing off for {retry_after}s")
            time.sleep(retry_after)
            raise RateLimitError(retry_after)
        
        data = response.json()
        code = data.get("code", -1)
        
        if code == 0:
            return data.get("data", {})
        elif code in (1001, 1002):
            raise AuthenticationError("Invalid API key")
        elif code == 2002:
            raise SymbolNotFoundError(f"Symbol not found: {data.get('symbol')}")
        elif code == 3001:
            retry_after = int(response.headers.get("Retry-After", 1))
            logger.warning(f"Server rate limit (3001) - retrying after {retry_after}s")
            time.sleep(retry_after)
            raise ServerRateLimitError(retry_after)
        else:
            raise APIError(f"API error {code}: {data.get('message', 'unknown')}")
    
    def _record_primary_failure(self, reason: str):
        """Record a primary endpoint failure and trigger failover if threshold exceeded."""
        self._primary_failures += 1
        logger.warning(f"Primary endpoint failure #{self._primary_failures}: {reason}")
        
        if self._primary_failures >= 3:
            self._trigger_failover(reason)
    
    def _trigger_failover(self, reason: str):
        """Switch to backup endpoint."""
        if self._state == EndpointState.BACKUP:
            logger.warning("Already using backup - skipping failover trigger")
            return
        
        failover_start = time.time()
        old_state = self._state
        self._state = EndpointState.TRANSITIONING
        
        logger.critical(f"FAILOVER TRIGGERED: {reason} - switching to backup endpoint")
        
        # Reset connection pool for new endpoint
        # In production, this would close and reopen WebSocket connections
        
        self._state = EndpointState.BACKUP
        self._primary_failures = 0
        
        # Update metrics
        self._metrics.total_failovers += 1
        self._metrics.last_failover_timestamp = time.time()
        self._metrics.last_failover_reason = reason
        failover_duration = (time.time() - failover_start) * 1000
        self._metrics.average_failover_duration_ms = (
            (self._metrics.average_failover_duration_ms * (self._metrics.total_failovers - 1) + failover_duration)
            / self._metrics.total_failovers
        )
        
        logger.info(f"Failover complete in {failover_duration:.1f}ms - now using backup endpoint")
    
    def _attempt_recovery(self):
        """Periodically attempt to recover to primary endpoint."""
        if self._state != EndpointState.BACKUP:
            return
        
        logger.info("Attempting primary endpoint recovery check...")
        
        try:
            response = self._make_request(
                "GET",
                f"{self.primary_url}/v1/market/kline/latest",
                params={"symbol": "AAPL.US", "interval": "1m"}
            )
            data = self._handle_response(response)
            
            # Primary is responsive again
            self._state = EndpointState.PRIMARY
            self._primary_failures = 0
            logger.info("Primary endpoint recovered - switching back")
            
        except Exception as e:
            logger.info(f"Primary still unavailable: {e}")
    
    def get_kline(
        self,
        symbol: str,
        interval: str = "1h",
        limit: int = 100
    ) -> list[dict]:
        """
        Fetch OHLCV kline data with automatic failover.
        
        Args:
            symbol: Market symbol (e.g., "AAPL.US", "BTC.USDT")
            interval: Candle interval (e.g., "1m", "1h", "1d")
            limit: Number of candles to retrieve
            
        Returns:
            List of kline dictionaries with OHLCV data
        """
        # Check rate limit state
        if self._rate_limit_reset and time.time() < self._rate_limit_reset:
            wait_time = self._rate_limit_reset - time.time()
            logger.warning(f"Rate limited - waiting {wait_time:.1f}s")
            time.sleep(wait_time)
        
        import random  # Jitter for retry backoff

        last_exception = None
        attempt = 0

        while attempt < self.max_retries:
            # Re-read the state on every attempt so that a failover triggered
            # by an earlier attempt sends the retry to the backup.
            on_primary = self._state == EndpointState.PRIMARY
            url = self.primary_url if on_primary else self.backup_url
            endpoint_name = "PRIMARY" if on_primary else "BACKUP"
            logger.debug(f"Requesting kline from {endpoint_name} endpoint: {url}")
            attempt += 1

            try:
                start_time = time.perf_counter()

                response = self._make_request(
                    "GET",
                    f"{url}/v1/market/kline",
                    params={"symbol": symbol, "interval": interval, "limit": limit}
                )

                latency_ms = (time.perf_counter() - start_time) * 1000
                data = self._handle_response(response)

                # Update primary health only after the response is validated
                if on_primary:
                    if latency_ms > self.failover_threshold_ms:
                        # Slow but successful: count as half a failure
                        self._primary_failures += 0.5
                        if self._primary_failures >= 3:
                            self._trigger_failover(f"latency {latency_ms:.0f}ms over threshold")
                    else:
                        self._primary_failures = 0

                return data.get("klines", [])

            except RateLimitError:
                raise  # Propagate rate limits to caller for handling
            except (AuthenticationError, SymbolNotFoundError):
                raise  # These are permanent failures
            except ServerRateLimitError as e:
                last_exception = e
                continue  # Already waited, retry immediately
            except (requests.exceptions.Timeout, requests.exceptions.ConnectionError) as e:
                last_exception = e
                logger.warning(f"Request failed (attempt {attempt}/{self.max_retries}): {e}")
                delay = min(2 ** (attempt - 1) * 0.5, 10.0)  # Exponential backoff
            except Exception as e:
                # Includes APIError and non-JSON bodies (for example, an HTML 5xx page)
                last_exception = e
                logger.error(f"Unexpected error during request: {e}")
                delay = 1.0

            if on_primary:
                self._record_primary_failure(str(last_exception))
                if self._state == EndpointState.BACKUP:
                    # This failure tripped the failover: retry on the backup
                    # immediately, with a fresh retry budget.
                    attempt = 0
                    continue

            time.sleep(delay + random.uniform(0, delay * 0.1))  # Backoff with jitter

        # All retries exhausted
        raise RequestError(
            f"Failed to fetch kline after {self.max_retries} attempts: {last_exception}"
        )
    
    def get_metrics(self) -> FailoverMetrics:
        """Return current failover metrics for monitoring."""
        return self._metrics


# Custom exception classes
class RateLimitError(Exception):
    pass

class ServerRateLimitError(Exception):
    pass

class AuthenticationError(Exception):
    pass

class SymbolNotFoundError(Exception):
    pass

class APIError(Exception):
    pass

class RequestError(Exception):
    pass

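This client implements the failover pattern as a transparent wrapper. The calling application never needs to know which endpoint is active; it simply calls get_kline() and receives data. The client handles endpoint selection, retry logic, and rate limit backoff. Recovery to the primary is implemented in _attempt_recovery(), but nothing in the class calls it, so the application has to invoke it periodically (from a timer, for example) to switch back. The FailoverMetrics dataclass provides observability into failover behavior: it records the failover count, the time and reason of the last failover, and the average failover duration, and it reserves a primary uptime field that the application must populate itself.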
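
Since _attempt_recovery() is defined in the client but never invoked by it, something has to schedule it. One lightweight option, sketched here under the assumption that a probe every 30 seconds is acceptable, is a throttle that the application calls before each request. RecoveryThrottle is a hypothetical helper, not part of the client above.

```python
import time
from typing import Callable, Optional


class RecoveryThrottle:
    """Runs a recovery probe at most once per min_interval_s."""

    def __init__(
        self,
        probe: Callable[[], None],
        min_interval_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._probe = probe
        self._min_interval_s = min_interval_s
        self._clock = clock  # injectable so tests do not need to sleep
        self._last_run: Optional[float] = None

    def maybe_probe(self) -> bool:
        """Run the probe if enough time has passed. Returns True if it ran."""
        now = self._clock()
        if self._last_run is not None and now - self._last_run < self._min_interval_s:
            return False
        self._last_run = now
        self._probe()
        return True
```

With the client above, this would be wired up as throttle = RecoveryThrottle(client._attempt_recovery), followed by throttle.maybe_probe() before each get_kline() call. The probe is a no-op while the client is on the primary, so the only cost during normal operation is one clock read.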

Integrating TickDB as the Standby Data Source

TickDB serves as the ideal backup data source for several structural reasons that align with our disaster recovery requirements.

First, TickDB operates independently of AWS infrastructure. When your primary data source is hosted on AWS us-east-1, a failure in that region is unlikely to affect TickDB's API endpoints, which are distributed across multiple cloud providers. This architectural independence is critical — a backup hosted in the same AWS region would inherit the same failure modes as the primary.

Second, the client uses one request format for both endpoints. Our implementation sends identical requests whether it is hitting the primary or the backup, which works when your primary service exposes the same API schema as TickDB. In that setup, failover requires no schema translation, no response normalization, and no conditional logic based on which endpoint is active.

Third, TickDB's rate limits are generous enough to support failover traffic spikes. During a failover event, backup traffic typically spikes 10x to 50x above normal levels. TickDB's infrastructure is designed to handle these traffic patterns, with server-side rate limits that accommodate burst traffic during recovery windows.

Fourth, TickDB offers multi-asset coverage across forex, crypto, US equities, HK equities, A-shares, commodities, and indices. This means that a single backup connection can serve your entire market data needs rather than requiring separate backup connections for each asset class.

To configure TickDB as your standby data source, set the TICKDB_API_KEY environment variable and use https://api.tickdb.ai as the backup URL:

# Environment configuration for disaster recovery setup
import os

# Primary endpoint (your internal market data service)
PRIMARY_MARKET_DATA_URL = os.environ.get(
    "PRIMARY_MARKET_DATA_URL", 
    "https://market-data-internal.example.com"
)

# Backup endpoint (TickDB)
BACKUP_MARKET_DATA_URL = os.environ.get(
    "BACKUP_MARKET_DATA_URL",
    "https://api.tickdb.ai"
)

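# API key for the backup endpoint; fail fast if it is missing
TICKDB_API_KEY = os.environ.get("TICKDB_API_KEY")
if not TICKDB_API_KEY:
    raise RuntimeError("TICKDB_API_KEY environment variable is not set")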

# Initialize failover client
client = FailoverMarketDataClient(
    primary_url=PRIMARY_MARKET_DATA_URL,
    backup_url=BACKUP_MARKET_DATA_URL,
    api_key=TICKDB_API_KEY
)

Measuring and Monitoring Failover Performance

A disaster recovery architecture is only as good as your ability to measure its performance. We instrument the system with three key metrics: failover detection time, failover completion time, and data gap during failover.

Failover Detection Time

This is the time between the primary endpoint becoming unavailable and the health check system detecting the failure. Our implementation targets a detection time of 30 seconds (3 consecutive health check failures × 10-second interval). To monitor this metric, we track the timestamp of each health check failure and compute the delta to when the failover callback is triggered.
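The detection arithmetic can be made concrete with a small tracker. This is a minimal sketch, not the orchestrator's actual code: the class name DetectionTimer and its record method are hypothetical. It remembers the first failure in a run of consecutive failures and reports the delta when the failure threshold is reached.

```python
from typing import Optional


class DetectionTimer:
    """Track consecutive health check failures and time-to-trigger (hypothetical helper)."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self._first_failure_at: Optional[float] = None
        self._consecutive_failures = 0

    def record(self, healthy: bool, timestamp: float) -> Optional[float]:
        """Record one check result.

        Returns the seconds between the first failure in the current run and
        the check that reaches the threshold; otherwise returns None.
        """
        if healthy:
            # Any success resets the run of failures
            self._first_failure_at = None
            self._consecutive_failures = 0
            return None
        if self._consecutive_failures == 0:
            self._first_failure_at = timestamp
        self._consecutive_failures += 1
        if self._consecutive_failures == self.failure_threshold:
            return timestamp - self._first_failure_at
        return None


timer = DetectionTimer(failure_threshold=3)
# Checks run every 10 seconds; the primary goes down just after t=0
checks = [(True, 0.0), (False, 10.0), (False, 20.0), (False, 30.0)]
results = [timer.record(ok, t) for ok, t in checks]
print(results)  # [None, None, None, 20.0]
```

Measured from the first failed check, the delta here is 20 seconds. Measured from the actual outage start just after t=0, it is close to 30 seconds, which is the worst case the 30-second target describes.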

Failover Completion Time

This is the time between the failover trigger and the client successfully receiving data from the backup endpoint. For clients that depend on DNS, it includes DNS TTL propagation, connection establishment, and authentication. Our target is under 5 seconds for clients using the failover-aware client, which maintains persistent connections to both endpoints and so avoids most of that overhead.
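One way to measure completion time is to start a clock at the trigger and stop it at the first successful backup response. The helper below is a sketch under assumptions: measure_failover_completion and the fetch_from_backup callable are hypothetical names, and the clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Optional


def measure_failover_completion(
    fetch_from_backup: Callable[[], object],
    timeout_seconds: float = 5.0,
    retry_interval: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds from failover trigger to first successful backup response.

    `fetch_from_backup` is a hypothetical callable that raises on failure.
    Returns None if no call succeeds within `timeout_seconds`.
    """
    triggered_at = clock()
    while clock() - triggered_at < timeout_seconds:
        try:
            fetch_from_backup()
            return clock() - triggered_at
        except Exception:
            # Backup not reachable yet; wait briefly and try again
            sleep(retry_interval)
    return None


# Example with a stub that succeeds immediately
elapsed = measure_failover_completion(lambda: {"close": 1.0})
print(f"completion time: {elapsed:.4f}s")
```

A monotonic clock is used because wall-clock time can jump during NTP adjustments, which would distort a measurement this short.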

Data Gap

This measures the continuity of market data during the failover window. A well-designed failover should result in zero data gaps: the client should receive the last candle before the failure from the primary and the next candle from the backup endpoint, without missing any price updates. We measure this by comparing the sequence of candle timestamps received before, during, and after failover events.
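Comparing candle timestamps comes down to checking that consecutive candles are exactly one interval apart. The sketch below assumes candle open times in epoch seconds, sorted ascending; the function name find_candle_gaps is illustrative.

```python
from typing import List, Tuple


def find_candle_gaps(timestamps: List[int], interval_seconds: int) -> List[Tuple[int, int]]:
    """Return (first_missing_timestamp, next_received_timestamp) for each gap.

    `timestamps` are candle open times in epoch seconds, sorted ascending.
    A gap exists wherever consecutive candles are more than one interval apart.
    """
    gaps = []
    for previous, current in zip(timestamps, timestamps[1:]):
        if current - previous > interval_seconds:
            gaps.append((previous + interval_seconds, current))
    return gaps


# One-minute candles; the candles at t=120 and t=180 were lost during failover
received = [0, 60, 240, 300]
print(find_candle_gaps(received, 60))  # [(120, 240)]
```

An empty result across a failover event indicates the zero-gap outcome described above.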

Deployment Recommendations by Scale

| Scale | Architecture | Configuration |
|---|---|---|
| Individual trader | Single failover client | Use the client library with TickDB as the primary and a locally cached fallback to simplify operations |
| Trading firm (5–50 researchers) | Managed DNS failover | Route 53 health checks + client-side fallback; deploy health checker as a sidecar service |
| Institutional team (50+ developers) | Full multi-cloud deployment | Dedicated health check cluster across 3 availability zones + Route 53 + CloudFront + client-side failover |

For individual traders, the complexity of a full multi-cloud setup is not justified. Instead, use TickDB as the primary data source with a locally cached fallback, reducing operational overhead while maintaining data availability.
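A locally cached fallback can be as simple as remembering the last good response per symbol. The following is a minimal sketch under assumptions: the class name CachedFallback, the fetch callable, and the five-minute default for max_age_seconds are all hypothetical and should be adapted to your own client.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CachedFallback:
    """Serve the last successful response when the live fetch fails (hypothetical helper)."""

    def __init__(self, fetch: Callable[[str], Any], max_age_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._max_age = max_age_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, Any]] = {}

    def get(self, symbol: str) -> Tuple[Any, bool]:
        """Return (data, is_stale).

        Re-raises the fetch error when no cached entry younger than
        `max_age_seconds` exists for the symbol.
        """
        try:
            data = self._fetch(symbol)
        except Exception:
            cached = self._cache.get(symbol)
            if cached is not None and self._clock() - cached[0] <= self._max_age:
                return cached[1], True
            raise
        # Live fetch succeeded: refresh the cache
        self._cache[symbol] = (self._clock(), data)
        return data, False
```

The is_stale flag is deliberate: cached candles are acceptable for display or research, but the caller should be able to refuse to trade on them.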

For trading firms, the Route 53 failover architecture described in this article provides the right balance of complexity and resilience. Deploy the health check orchestrator as a containerized service with automatic restarts and health monitoring.

For institutional teams, extend the architecture with a dedicated health check cluster (Consul- or etcd-backed for distributed consensus), multi-region DNS with latency-based routing, and a client-side SDK that maintains connections to all available endpoints simultaneously.

Conclusion

A well-designed disaster recovery architecture is not a luxury for production trading systems — it is a baseline requirement. When the primary data source fails, every second of downtime translates directly into missed opportunities and accumulating risk.

The architecture we have implemented addresses all three failure modes that bring down simpler systems: it monitors continuously rather than relying on binary health checks, it automates failover without requiring human intervention, and it isolates the network path between primary and backup to prevent cascading failures.

The three-layer approach (health check orchestrator, DNS-based failover, and client-side resilience) works together to target failover within roughly 30 seconds in the worst case, the bound set by three failed health checks at 10-second intervals. Clients that use the failover-aware library typically recover within 5 seconds of primary degradation.

TickDB plays a critical role as the standby data source: it operates independently of your primary infrastructure, exposes the same API schema, handles traffic spikes gracefully, and covers all major asset classes from a single endpoint.

Next Steps

If you are evaluating disaster recovery solutions for your trading infrastructure, review your current Mean Time to Recovery (MTTR) for primary data source failures. If it exceeds 60 seconds, the architecture described in this article will deliver meaningful improvement.
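If you keep an incident log, MTTR is the average of the recovery durations. The sketch below assumes each incident is recorded as a (failed_at, recovered_at) pair in seconds; the function name and the sample values are illustrative, not real measurements.

```python
from statistics import mean
from typing import List, Tuple


def mean_time_to_recovery(incidents: List[Tuple[float, float]]) -> float:
    """Average recovery duration in seconds over (failed_at, recovered_at) pairs."""
    if not incidents:
        raise ValueError("no incidents recorded")
    return mean(recovered - failed for failed, recovered in incidents)


# Made-up incident log for illustration only
incidents = [(0.0, 45.0), (1000.0, 1090.0), (5000.0, 5030.0)]
mttr = mean_time_to_recovery(incidents)
print(mttr, mttr > 60)  # 55.0 False
```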

If you want to test the failover client with real market data, sign up for a free API key at tickdb.ai and configure the environment variable TICKDB_API_KEY. The client code from this article can be copied directly into your backtesting environment.

If you are running institutional infrastructure and need dedicated failover capacity, reach out to enterprise@tickdb.ai for information about SLA-backed endpoints, dedicated support, and custom integration assistance.

If you use AI coding assistants for trading system development, search for and install the tickdb-market-data SKILL in your AI tool's marketplace for integrated access to historical OHLCV data, real-time depth, and multi-asset coverage.


This article does not constitute investment advice. Markets involve risk; past performance does not guarantee future results. Always validate disaster recovery procedures in a staging environment before deploying to production.