Multi-Cloud Disaster Recovery: Seamless Data Source Failover in Under 30 Seconds | API Guide

"Price data feeds went dark at 09:47:12 UTC. By 09:47:43, the backup source was live."

On December 7, 2021, the AWS US-EAST-1 region experienced a significant outage that cascaded into multi-hour disruptions for dozens of financial services firms. Order flow stalled. Real-time dashboards froze. Algorithmic trading systems that had hardcoded single-region dependencies triggered circuit breakers and went idle. The firms that recovered within seconds had one thing in common: they had designed their data infrastructure for failure from the start.

This article examines the engineering architecture required to achieve sub-30-second failover between data sources — specifically, how to structure your TickDB integration so that when your primary data source degrades or fails, your trading infrastructure switches to a redundant source without human intervention and without data gaps that could corrupt backtests or live positions.

The core technical challenge is not the failover itself. Failover is a straightforward conditional branch. The hard problem is failure detection latency — how quickly your system can distinguish between a transient network hiccup (which should be retried, not failed over) and a genuine infrastructure outage (which requires immediate redirection). We will address both the architectural patterns and the production-grade implementation code required to solve this problem.

The Failure Mode Taxonomy

Before designing a failover system, you must categorize the failure modes your architecture must handle. Not all failures are equal, and treating a 200ms timeout the same as a regional outage produces brittle systems.

Failure Mode	Detection Signal	Correct Response	Incorrect Response
Transient network latency	Single request timeout	Retry with backoff	Failover
Primary endpoint degradation	Sustained elevated latency (>500ms)	Alert + monitor	Failover
Primary host failure	Connection refused, DNS resolution failure	Immediate failover	Retry
Regional outage	All endpoints in region returning errors	Failover to cross-region	Retry
Authentication failure	HTTP 401/403 on valid credentials	Alert + human review	Failover
Rate limit exceeded	HTTP 429 (or API error code 3001)	Honor Retry-After header	Failover

The critical insight here is that your failover trigger should not be a single failed request. A robust health check system must evaluate a rolling window of request outcomes and apply a decision matrix before initiating failover. This prevents the pathological case where a temporary network glitch causes a cascade of unnecessary failovers — a behavior that itself becomes a denial-of-service against your own redundancy infrastructure.
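The taxonomy above can be read as a small decision function. The following is a minimal sketch, not part of any TickDB SDK: the `Action` enum and `classify` function are hypothetical names, rate limiting is assumed to surface as HTTP 429, and the 500 ms figure is the degradation threshold from the table.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "no action"
    RETRY = "retry with backoff"
    ALERT = "alert and monitor"
    FAILOVER = "fail over"
    HUMAN_REVIEW = "alert and request human review"
    HONOR_RETRY_AFTER = "wait for Retry-After"


def classify(
    status_code: Optional[int] = None,
    connection_refused: bool = False,
    timed_out: bool = False,
    sustained_latency_ms: Optional[float] = None,
) -> Action:
    """Map one observed signal to the response listed in the taxonomy table."""
    if connection_refused:
        return Action.FAILOVER            # host failure: retrying will not help
    if status_code in (401, 403):
        return Action.HUMAN_REVIEW        # credentials problem, not an outage
    if status_code == 429:
        return Action.HONOR_RETRY_AFTER   # rate limited (assumed to arrive as HTTP 429)
    if timed_out:
        return Action.RETRY               # a single timeout is treated as transient
    if sustained_latency_ms is not None and sustained_latency_ms > 500:
        return Action.ALERT               # degraded but still serving data
    return Action.NONE


print(classify(timed_out=True).value)           # retry with backoff
print(classify(connection_refused=True).value)  # fail over
```

A function like this only classifies a single signal. The failover decision itself should still be made over a rolling window of such classifications, as described above.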

Architecture Overview: The Three-Layer Failover Model

A production-grade failover architecture operates across three independent layers, each with its own failure detection and recovery mechanisms.

Layer 1: Transport Layer — DNS-Based Routing

The outermost layer routes traffic based on health signals. Route 53 health checks probe your primary endpoint every 30 seconds by default, or every 10 seconds with the fast-interval option. When checks fail for the configured failure threshold (3 consecutive failures by default), Route 53 failover routing stops returning the primary record and answers with the secondary instead. Resolvers that cached the old answer keep using it until the record's TTL expires, which adds 30–60 seconds in the worst case with typical TTLs. This is why DNS failover alone is insufficient for financial trading applications.

The key configuration parameter is the health check interval. The 10-second interval enables faster failover but increases your AWS bill and generates more health check traffic. For trading systems where 30-second data gaps have material impact, a 10-second interval with a failure threshold of 3 is a reasonable balance: the outage is detected within roughly 30 seconds while still requiring three consecutive failures before acting, and failover completes once cached DNS answers expire, so the worst case is about 30 seconds plus the record TTL.
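The worst-case timing for this layer is simple arithmetic: the time to accumulate enough failed checks, plus the time for cached DNS answers to expire. The sketch below is a simplification that ignores how the health checkers aggregate their results; the function name and the 60-second TTL are illustrative assumptions.

```python
def dns_failover_worst_case(interval_s: int, failure_threshold: int, ttl_s: int) -> int:
    """Upper bound in seconds: detection time plus DNS cache expiry."""
    detection_s = interval_s * failure_threshold   # consecutive failed checks needed
    return detection_s + ttl_s                     # resolvers may hold the old answer for a full TTL


print(dns_failover_worst_case(10, 3, 60))  # 90
print(dns_failover_worst_case(30, 3, 60))  # 150
```

Lowering the record TTL shrinks the second term, at the cost of more DNS queries reaching the authoritative servers.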

Layer 2: Application Layer — Client-Side Failover with Health Tracking

Your application code maintains a ranked list of available data sources and continuously evaluates their health based on recent request outcomes. When the primary source accumulates more than a threshold number of failures within a sliding window, the client promotes the next available source in the ranking.

This layer operates independently of DNS failover. It handles the case where DNS failover has not yet propagated or where only a single endpoint within a region has failed (e.g., one of multiple TickDB API nodes). The client-side layer provides sub-second failover once the decision threshold is crossed.
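As a minimal sketch of the sliding-window idea, independent of the full client later in this article: failures are timestamped per source, entries older than the window are discarded, and the router returns the highest-ranked source that is still under the threshold. The class name, source names, and the choice to pass `now` explicitly (which makes the logic easy to test) are assumptions for illustration.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class SlidingWindowRouter:
    """Promote the next source once failures inside the window reach the threshold."""

    def __init__(self, sources: List[str], threshold: int = 3, window_s: float = 60.0):
        self.sources = sources                      # ranked, best first
        self.threshold = threshold
        self.window_s = window_s
        self.failures: Dict[str, Deque[float]] = {s: deque() for s in sources}

    def record_failure(self, source: str, now: float) -> None:
        self.failures[source].append(now)

    def _recent_failures(self, source: str, now: float) -> int:
        q = self.failures[source]
        while q and now - q[0] > self.window_s:     # drop failures older than the window
            q.popleft()
        return len(q)

    def active(self, now: float) -> Optional[str]:
        """Return the highest-ranked source still under the failure threshold."""
        for s in self.sources:
            if self._recent_failures(s, now) < self.threshold:
                return s
        return None


router = SlidingWindowRouter(["us-east", "us-west"], threshold=3, window_s=60)
for t in (0.0, 5.0, 10.0):
    router.record_failure("us-east", t)
print(router.active(11.0))   # us-west
print(router.active(71.0))   # us-east
```

Note that in this sketch the primary becomes eligible again as soon as its failures age out of the window, which is also how fail-back happens implicitly.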

Layer 3: Data Integrity Layer — Gap Detection and Recovery

After failover, your system must detect and handle data gaps caused by the failover latency window. If you fail over from TickDB US-East to TickDB US-West at T=0, you may have missed 15–45 seconds of market data. Your system must either:

Option 1: Accept the gap and flag affected data windows as unreliable for backtesting and live trading.
Option 2: Reconstruct the gap by querying a historical endpoint (TickDB /v1/market/kline) for the missing time range and merging reconstructed bars with live data.
Option 3: Prevent the gap by maintaining a persistent WebSocket connection to both primary and secondary sources simultaneously and buffering data from both, discarding duplicates on failover.

Option 3 provides the best data integrity but doubles your API consumption. Option 2 is the most common production choice for trading systems where the gap window is small relative to strategy timeframes.
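Gap reconstruction reduces to two steps: find the holes in the bar timestamps, then merge backfilled bars into the live series. The sketch below assumes each bar is a dict with a `ts` field in epoch seconds; that shape is a placeholder, not the documented kline response format, and the fetch from the historical endpoint is left out.

```python
from typing import Dict, List, Tuple

Bar = Dict[str, float]  # assumed shape: {"ts": epoch_seconds, "close": price}


def find_gaps(bars: List[Bar], interval_s: int) -> List[Tuple[int, int]]:
    """Return (first_missing_ts, last_missing_ts) for every hole in a sorted bar series."""
    gaps = []
    for prev, cur in zip(bars, bars[1:]):
        step = int(cur["ts"] - prev["ts"])
        if step > interval_s:
            gaps.append((int(prev["ts"]) + interval_s, int(cur["ts"]) - interval_s))
    return gaps


def merge_bars(live: List[Bar], backfill: List[Bar]) -> List[Bar]:
    """Merge by timestamp; live bars win when both series contain the same ts."""
    by_ts = {b["ts"]: b for b in backfill}
    by_ts.update({b["ts"]: b for b in live})
    return [by_ts[ts] for ts in sorted(by_ts)]


live = [{"ts": 0, "close": 1.0}, {"ts": 60, "close": 1.1}, {"ts": 240, "close": 1.4}]
print(find_gaps(live, 60))   # [(120, 180)]
```

Each returned range would become the start and end of one historical query. Letting live bars take precedence on overlap is a design choice; some systems prefer the historical copy because it is final.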

Production-Grade Implementation

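The following implementation demonstrates a failover-capable TickDB client in Python. It implements the application layer (Layer 2) of the failover model: per-source health tracking, retry with exponential backoff, and automatic promotion of the next source. DNS routing (Layer 1) is configured outside the application, and gap handling (Layer 3) builds on the get_kline method shown here.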

"""
TickDB Multi-Source Failover Client
Production-grade implementation with health tracking, exponential backoff,
and automatic failover.
"""

import os
import time
import json
import logging
import threading
import statistics
from datetime import datetime, timedelta
from typing import Optional, List, Dict, Any, Tuple
from dataclasses import dataclass, field
from enum import Enum
from collections import deque
import requests

# ⚠️ For production HFT workloads, consider aiohttp/asyncio for concurrent
# connection management and non-blocking I/O. This synchronous implementation
# is suitable for strategies operating on 1-second or slower update frequencies.

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)
logger = logging.getLogger("tickdb_failover")


class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class DataSource:
    """Represents a single TickDB data source endpoint."""
    name: str
    base_url: str
    api_key: str
    priority: int = 0  # Lower = higher priority
    region: str = "unknown"
    
    def __post_init__(self):
        self.health_history: deque = deque(maxlen=30)
        self.failure_count: int = 0
        self.last_success: Optional[datetime] = None
        self.last_failure: Optional[datetime] = None
        self.current_latency_ms: float = float('inf')
        self.status: HealthStatus = HealthStatus.HEALTHY


@dataclass
class RequestOutcome:
    """Records the outcome of a single API request for health tracking."""
    timestamp: datetime
    source_name: str
    success: bool
    latency_ms: float
    error_code: Optional[int] = None
    error_message: Optional[str] = None


@dataclass
class FailoverConfig:
    """Configuration parameters for the failover system."""
    # Health check thresholds
    failure_threshold: int = 3          # Failures within window to trigger failover
    health_window_seconds: float = 60.0  # Rolling window for failure counting
    latency_warning_ms: float = 500.0     # Latency above this triggers degraded status
    latency_critical_ms: float = 2000.0   # Latency above this counts as failure
    
    # Retry parameters
    base_retry_delay: float = 1.0        # Base delay for exponential backoff
    max_retry_delay: float = 30.0        # Maximum delay between retries
    max_retries: int = 3                 # Retries before marking source unhealthy
    
    # Timing
    health_check_interval: float = 5.0   # Seconds between proactive health checks


class TickDBFailoverClient:
    """
    Multi-source TickDB client with automatic failover.
    
    Maintains health statistics for each configured data source and
    automatically routes requests to the healthiest available source.
    """
    
    def __init__(
        self,
        config: Optional[FailoverConfig] = None,
        api_key: Optional[str] = None
    ):
        self.config = config or FailoverConfig()
        self.api_key = api_key or os.environ.get("TICKDB_API_KEY")
        
        if not self.api_key:
            raise ValueError(
                "TickDB API key required. Set TICKDB_API_KEY environment variable "
                "or pass api_key parameter."
            )
        
        self.sources: List[DataSource] = []
        self.outcome_log: deque = deque(maxlen=1000)
        self._lock = threading.RLock()
        self._current_source_idx: int = 0
        
        # Background health check thread
        self._health_thread: Optional[threading.Thread] = None
        self._shutdown_event: threading.Event = threading.Event()
    
    def add_source(
        self,
        name: str,
        base_url: str,
        priority: int = 0,
        region: str = "unknown"
    ) -> None:
        """Register a data source with the client."""
        source = DataSource(
            name=name,
            base_url=base_url.rstrip("/"),
            api_key=self.api_key,
            priority=priority,
            region=region
        )
        with self._lock:
            self.sources.append(source)
            # Sort by priority (lower number = higher priority)
            self.sources.sort(key=lambda s: s.priority)
        
        logger.info(
            f"Registered data source: {name} at {base_url} "
            f"(priority={priority}, region={region})"
        )
    
    def _get_request_headers(self) -> Dict[str, str]:
        """Standard request headers for TickDB API."""
        return {"X-API-Key": self.api_key}
    
    def _make_request(
        self,
        source: DataSource,
        method: str,
        endpoint: str,
        params: Optional[Dict[str, Any]] = None,
        timeout: Tuple[float, float] = (3.05, 10),
        retry_count: int = 0
    ) -> Tuple[bool, Any, Optional[str]]:
        """
        Execute a single request to a data source.
        
        Returns:
            Tuple of (success, response_data, error_message)
        """
        url = f"{source.base_url}{endpoint}"
        
        try:
            start_time = time.perf_counter()
            
            if method.upper() == "GET":
                response = requests.get(
                    url,
                    headers=self._get_request_headers(),
                    params=params,
                    timeout=timeout
                )
            elif method.upper() == "POST":
                response = requests.post(
                    url,
                    headers=self._get_request_headers(),
                    json=params,
                    timeout=timeout
                )
            else:
                return False, None, f"Unsupported HTTP method: {method}"
            
            latency_ms = (time.perf_counter() - start_time) * 1000
            
            # Handle rate limiting with Retry-After header.
            # Only HTTP 429 is treated as rate limiting; other 4xx/5xx responses
            # fall through to the auth and payload checks below.
            if response.status_code == 429:
                try:
                    retry_after = int(response.headers.get("Retry-After", 5))
                except ValueError:
                    # Retry-After may also be an HTTP date; fall back to a default
                    retry_after = 5
                logger.warning(
                    f"Rate limited by {source.name}, waiting {retry_after}s"
                )
                time.sleep(retry_after)
                return False, None, "Rate limited"
            
            # Handle authentication errors
            if response.status_code in (401, 403):
                error_msg = f"Authentication failed: HTTP {response.status_code}"
                self._record_outcome(source, False, latency_ms, response.status_code, error_msg)
                return False, None, error_msg
            
            # Parse response
            try:
                data = response.json()
            except json.JSONDecodeError:
                error_msg = f"Invalid JSON response from {source.name}"
                self._record_outcome(source, False, latency_ms, response.status_code, error_msg)
                return False, None, error_msg
            
            # Check for TickDB error codes
            if isinstance(data, dict) and data.get("code") != 0:
                error_code = data.get("code")
                error_message = data.get("message", "Unknown error")
                
                # 1001/1002: Invalid API key - do not retry
                if error_code in (1001, 1002):
                    logger.error(f"Invalid API key configuration: {error_message}")
                    return False, None, error_message
                
                # 2002: Symbol not found - not a source failure
                if error_code == 2002:
                    return True, data, None
                
                # Other errors - record but do not retry
                self._record_outcome(source, False, latency_ms, error_code, error_message)
                return False, data, error_message
            
            self._record_outcome(source, True, latency_ms)
            return True, data, None
            
        except requests.exceptions.Timeout:
            latency_ms = timeout[1] * 1000
            error_msg = f"Request timeout after {timeout[1]}s"
            self._record_outcome(source, False, latency_ms, error_message=error_msg)
            
            # Retry with exponential backoff
            if retry_count < self.config.max_retries:
                delay = min(
                    self.config.base_retry_delay * (2 ** retry_count),
                    self.config.max_retry_delay
                )
                # Add jitter to prevent thundering herd
                jitter = delay * 0.1 * (hash(str(time.time())) % 100) / 100
                sleep_time = delay + jitter
                
                logger.warning(
                    f"Request to {source.name} timed out, retrying in "
                    f"{sleep_time:.2f}s (attempt {retry_count + 1}/{self.config.max_retries})"
                )
                time.sleep(sleep_time)
                return self._make_request(
                    source, method, endpoint, params, timeout, retry_count + 1
                )
            
            return False, None, error_msg
            
        except requests.exceptions.ConnectionError as e:
            error_msg = f"Connection error: {str(e)}"
            self._record_outcome(source, False, float('inf'), error_message=error_msg)
            return False, None, error_msg
            
        except Exception as e:
            error_msg = f"Unexpected error: {str(e)}"
            self._record_outcome(source, False, float('inf'), error_message=error_msg)
            logger.exception(f"Unexpected error during request to {source.name}")
            return False, None, error_msg
    
    def _record_outcome(
        self,
        source: DataSource,
        success: bool,
        latency_ms: float,
        error_code: Optional[int] = None,
        error_message: Optional[str] = None
    ) -> None:
        """Record a request outcome for health tracking."""
        outcome = RequestOutcome(
            timestamp=datetime.utcnow(),
            source_name=source.name,
            success=success,
            latency_ms=latency_ms,
            error_code=error_code,
            error_message=error_message
        )
        
        with self._lock:
            self.outcome_log.append(outcome)
            source.health_history.append(outcome)
            
            if success:
                source.last_success = datetime.utcnow()
                source.failure_count = 0
                source.current_latency_ms = latency_ms
            else:
                source.last_failure = datetime.utcnow()
                source.failure_count += 1
                source.current_latency_ms = latency_ms
        
        # Update source health status
        self._update_health_status(source)
    
    def _update_health_status(self, source: DataSource) -> None:
        """Update a source's health status based on recent outcomes."""
        with self._lock:
            recent_outcomes = [
                o for o in source.health_history
                if (datetime.utcnow() - o.timestamp).total_seconds() 
                   <= self.config.health_window_seconds
            ]
            
            if not recent_outcomes:
                source.status = HealthStatus.HEALTHY
                return
            
            recent_failures = sum(1 for o in recent_outcomes if not o.success)
            recent_latencies = [
                o.latency_ms for o in recent_outcomes 
                if o.success and o.latency_ms < float('inf')
            ]
            
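            # With no successful samples there is no latency evidence. Use 0.0 so
            # the failure count alone decides; otherwise a single failed request
            # would yield an infinite average and mark the source UNHEALTHY.
            avg_latency = statistics.mean(recent_latencies) if recent_latencies else 0.0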
            
            # Determine status
            if recent_failures >= self.config.failure_threshold:
                source.status = HealthStatus.UNHEALTHY
                logger.warning(
                    f"{source.name} marked UNHEALTHY: {recent_failures} failures "
                    f"in {self.config.health_window_seconds}s window"
                )
            elif avg_latency > self.config.latency_critical_ms:
                source.status = HealthStatus.UNHEALTHY
                logger.warning(
                    f"{source.name} marked UNHEALTHY: avg latency {avg_latency:.0f}ms "
                    f"exceeds critical threshold"
                )
            elif avg_latency > self.config.latency_warning_ms:
                source.status = HealthStatus.DEGRADED
                logger.info(
                    f"{source.name} marked DEGRADED: avg latency {avg_latency:.0f}ms "
                    f"exceeds warning threshold"
                )
            else:
                source.status = HealthStatus.HEALTHY
    
    def _select_healthy_source(self) -> Optional[DataSource]:
        """Select the highest-priority healthy data source."""
        with self._lock:
            healthy_sources = [
                s for s in self.sources
                if s.status in (HealthStatus.HEALTHY, HealthStatus.DEGRADED)
            ]
            
            if not healthy_sources:
                logger.error("No healthy data sources available")
                return None
            
            return healthy_sources[0]
    
    def request(
        self,
        method: str,
        endpoint: str,
        params: Optional[Dict[str, Any]] = None,
        failover: bool = True
    ) -> Tuple[bool, Any, Optional[str], Optional[str]]:
        """
        Execute a request with automatic failover.
        
        Args:
            method: HTTP method (GET, POST)
            endpoint: API endpoint path
            params: Request parameters
            failover: Whether to attempt failover on failure
            
        Returns:
            Tuple of (success, data, error_message, source_used)
        """
        if not failover:
            source = self._select_healthy_source()
            if not source:
                return False, None, "No healthy sources available", None
            
            success, data, error = self._make_request(source, method, endpoint, params)
            return success, data, error, source.name
        
        # Try all sources in priority order
        with self._lock:
            ordered_sources = [
                s for s in self.sources
                if s.status != HealthStatus.UNHEALTHY
            ]
        
        last_error = None
        for source in ordered_sources:
            success, data, error = self._make_request(source, method, endpoint, params)
            
            if success:
                # Log successful source for monitoring
                logger.debug(f"Request succeeded via {source.name}")
                return True, data, None, source.name
            
            last_error = error
            logger.warning(
                f"Request to {source.name} failed: {error}. "
                f"Trying next available source."
            )
            
            # Re-evaluate health after this failure
            self._update_health_status(source)
        
        return False, None, f"All sources exhausted. Last error: {last_error}", None
    
    def get_kline(
        self,
        symbol: str,
        interval: str = "1h",
        limit: int = 100,
        start_time: Optional[int] = None,
        end_time: Optional[int] = None
    ) -> Tuple[bool, Any, Optional[str], Optional[str]]:
        """Fetch OHLCV kline data with failover support."""
        params = {
            "symbol": symbol,
            "interval": interval,
            "limit": limit
        }
        # Compare against None so that a timestamp of 0 is still sent
        if start_time is not None:
            params["start_time"] = start_time
        if end_time is not None:
            params["end_time"] = end_time
        
        return self.request("GET", "/v1/market/kline", params)
    
    def get_latest_kline(self, symbol: str, interval: str = "1h") -> Tuple[bool, Any, Optional[str], Optional[str]]:
        """Fetch the latest kline candle with failover support."""
        params = {"symbol": symbol, "interval": interval}
        return self.request("GET", "/v1/market/kline/latest", params)
    
    def get_depth(self, symbol: str, limit: int = 10) -> Tuple[bool, Any, Optional[str], Optional[str]]:
        """Fetch order book depth data with failover support."""
        params = {"symbol": symbol, "limit": limit}
        return self.request("GET", "/v1/market/depth", params)
    
    def start_health_monitor(self) -> None:
        """Start background health monitoring thread."""
        if self._health_thread and self._health_thread.is_alive():
            logger.warning("Health monitor already running")
            return
        
        self._shutdown_event.clear()
        self._health_thread = threading.Thread(
            target=self._health_monitor_loop,
            daemon=True,
            name="TickDB-HealthMonitor"
        )
        self._health_thread.start()
        logger.info("Health monitoring thread started")
    
    def _health_monitor_loop(self) -> None:
        """Background loop for proactive health checks."""
        while not self._shutdown_event.is_set():
            try:
                self._proactive_health_check()
            except Exception as e:
                logger.exception(f"Error in health check loop: {e}")
            
            # Wait for shutdown or next interval
            self._shutdown_event.wait(timeout=self.config.health_check_interval)
    
    def _proactive_health_check(self) -> None:
        """Execute proactive health check against all sources."""
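        # Snapshot under the lock so add_source() cannot mutate the list mid-iteration
        with self._lock:
            sources = list(self.sources)
        for source in sources: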
            try:
                # Lightweight health check: fetch latest kline for a common symbol
                success, data, error = self._make_request(
                    source,
                    "GET",
                    "/v1/market/kline/latest",
                    {"symbol": "BTC.USDT", "interval": "1m"},
                    timeout=(3.05, 5)
                )
                
                if success:
                    logger.debug(
                        f"Proactive health check passed for {source.name} "
                        f"(latency: {source.current_latency_ms:.0f}ms)"
                    )
                else:
                    logger.warning(
                        f"Proactive health check failed for {source.name}: {error}"
                    )
            except Exception as e:
                logger.warning(f"Proactive health check error for {source.name}: {e}")
    
    def stop(self) -> None:
        """Stop the client and health monitoring."""
        self._shutdown_event.set()
        if self._health_thread:
            self._health_thread.join(timeout=5.0)
        logger.info("TickDB failover client stopped")
    
    def get_health_summary(self) -> Dict[str, Any]:
        """Get a summary of all source health statuses."""
        with self._lock:
            return {
                "sources": [
                    {
                        "name": s.name,
                        "region": s.region,
                        "status": s.status.value,
                        "current_latency_ms": (
                            round(s.current_latency_ms, 2) 
                            if s.current_latency_ms < float('inf') 
                            else None
                        ),
                        "failure_count": s.failure_count,
                        "last_success": (
                            s.last_success.isoformat() 
                            if s.last_success else None
                        ),
                        "last_failure": (
                            s.last_failure.isoformat() 
                            if s.last_failure else None
                        )
                    }
                    for s in self.sources
                ],
                "total_outcomes_logged": len(self.outcome_log)
            }

Deployment Configuration by Scenario

The optimal failover architecture depends on your latency tolerance, budget constraints, and the criticality of data continuity to your trading strategy. The following table provides deployment recommendations across three tiers.

Scenario	Architecture	Expected Failover Time	Cost Profile	Suitable For
Individual trader	Single TickDB instance + client-side retry with backoff	10–30 seconds	Low	Strategies on 1-minute+ bars; non-time-critical analysis
Professional team	Primary + secondary TickDB region + Route 53 health checks	30–90 seconds	Medium	Day trading; intraday strategies with stop-loss triggers under 5 minutes
Institutional	Multi-region active-active + WebSocket dual-subscription + gap reconstruction	<5 seconds	High	HFT; statistical arbitrage; any strategy where 30s gap causes meaningful P&L impact

For most systematic trading strategies operating on timeframes of 5 minutes or longer, the professional-team architecture (the second row of the table) provides an appropriate balance between cost and resilience. The key configuration parameters to tune are the FailoverConfig values: specifically failure_threshold (the number of failures within the rolling window required to mark a source unhealthy and trigger failover) and health_window_seconds (the rolling window over which failures are counted).

A more aggressive configuration with failure_threshold=2 and health_window_seconds=30 will detect failures faster but risks unnecessary failovers during transient network issues. A conservative configuration with failure_threshold=5 and health_window_seconds=120 will be more stable but may allow your system to operate against a degraded data source for longer before failover.
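The trade-off can be made concrete by replaying the same failure timestamps through both configurations. This is a minimal sketch: `first_trigger` is a hypothetical helper, and the failure times assume one proactive check every 5 seconds, matching the default health_check_interval.

```python
from typing import List, Optional


def first_trigger(failure_times: List[float], threshold: int, window_s: float) -> Optional[float]:
    """Return the time at which `threshold` failures first fall inside one window, else None."""
    times = sorted(failure_times)
    for i in range(threshold - 1, len(times)):
        # The last `threshold` failures ending at index i span times[i - threshold + 1]..times[i]
        if times[i] - times[i - threshold + 1] <= window_s:
            return times[i]
    return None


glitch = [0.0, 5.0]                          # two failed checks, then recovery
outage = [5.0 * i for i in range(12)]        # every check fails for a minute

print(first_trigger(glitch, 2, 30))    # 5.0  (aggressive: fails over on the glitch)
print(first_trigger(glitch, 5, 120))   # None (conservative: rides it out)
print(first_trigger(outage, 2, 30))    # 5.0
print(first_trigger(outage, 5, 120))   # 20.0
```

In this replay the conservative settings avoid the spurious failover but take 15 seconds longer to react to the real outage.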

Testing Your Failover System

A failover system that has never been exercised against real failures is not a failover system; it is a hypothesis. You must validate your implementation against realistic failure scenarios before deploying it in a live trading environment.

Test 1: Simulated Timeout Cascade
Configure your primary source to drop 100% of requests for 60 seconds (using a firewall rule or proxy configuration). Verify that:

The client detects the failure within failure_threshold requests.
Failover to the secondary source completes without application errors.
The application continues operating with data from the secondary source.
When the primary recovers, the client does not immediately fail back (preventing flapping).
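The last check is the one most often missed. The client in this article returns to the primary once its failures age out of the health window; if you want a stricter guarantee, one option is an explicit hold-down period. The following sketch is an assumption about how that could look, with a hypothetical `FailbackGuard` class and a 120-second hold-down chosen only for illustration.

```python
from typing import Optional


class FailbackGuard:
    """Allow fail-back only after the primary has stayed healthy for a hold-down period."""

    def __init__(self, hold_down_s: float = 120.0):
        self.hold_down_s = hold_down_s
        self.healthy_since: Optional[float] = None   # start of the current success streak

    def observe(self, success: bool, now: float) -> None:
        if not success:
            self.healthy_since = None                # any failure resets the streak
        elif self.healthy_since is None:
            self.healthy_since = now

    def may_fail_back(self, now: float) -> bool:
        return (
            self.healthy_since is not None
            and now - self.healthy_since >= self.hold_down_s
        )


guard = FailbackGuard(hold_down_s=120)
guard.observe(True, now=60.0)
print(guard.may_fail_back(now=61.0))    # False
print(guard.may_fail_back(now=180.0))   # True
```

In a test, feed the guard the outcomes of the proactive health checks against the recovered primary and assert that traffic stays on the secondary until `may_fail_back` returns True.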

Test 2: Partial Degradation
Configure your primary source to introduce 3-second artificial latency on 50% of requests. Verify that:

The client correctly identifies the source as DEGRADED rather than UNHEALTHY.
Request success rate remains acceptable during degraded operation.
Alerting fires to notify operators of degraded performance.
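It helps to work out the expected classification before running this test. The sketch below mirrors the average-latency rule with the default thresholds from FailoverConfig (500 ms warning, 2000 ms critical); the 50 ms baseline latency is an assumption, and the function is a standalone illustration rather than the client's own method.

```python
import statistics
from typing import List


def classify_latency(
    latencies_ms: List[float],
    warning_ms: float = 500.0,
    critical_ms: float = 2000.0,
) -> str:
    """Classify a source from the mean latency of its successful requests."""
    avg = statistics.mean(latencies_ms)
    if avg > critical_ms:
        return "unhealthy"
    if avg > warning_ms:
        return "degraded"
    return "healthy"


# Half the requests carry 3 s of injected latency on top of an assumed 50 ms baseline.
half_slow = [3050.0, 50.0] * 10
print(statistics.mean(half_slow))     # 1550.0
print(classify_latency(half_slow))    # degraded
```

If the slow fraction rises enough to push the mean above 2000 ms, the same rule reports unhealthy, so the 50% figure matters for the expected result.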

Test 3: Gap Reconstruction
After a simulated failover, verify that your gap reconstruction logic correctly identifies the missing time window and can fetch complete historical data from the /v1/market/kline endpoint to fill the gap.
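One way to assert the result of this test is a continuity check over the merged bar timestamps. This is a minimal sketch with a hypothetical helper; it assumes timestamps are integer epoch seconds aligned to the bar interval.

```python
from typing import List


def check_continuity(timestamps: List[int], interval_s: int) -> List[str]:
    """Return a list of problems found; an empty list means the series is continuous."""
    problems = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        step = cur - prev
        if step == interval_s:
            continue                                  # expected spacing
        if step == 0:
            problems.append(f"duplicate bar at {cur}")
        elif step < 0:
            problems.append(f"out-of-order bar at {cur}")
        elif step % interval_s == 0:
            problems.append(f"{step // interval_s - 1} missing bar(s) after {prev}")
        else:
            problems.append(f"misaligned bar at {cur}")
    return problems


print(check_continuity([0, 60, 120, 180], 60))   # []
print(check_continuity([0, 60, 60, 180], 60))    # ['duplicate bar at 60', '1 missing bar(s) after 60']
```

Running it on the series before and after reconstruction shows both that the gap was detected and that the backfill closed it without introducing duplicates.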

Closing

The architecture described in this article is not exotic or experimental. It is the standard pattern used by any production system where data continuity has material value. The core principles — layered failure detection, client-side routing intelligence, and gap-aware data reconstruction — apply equally to trading systems, financial data pipelines, and any application where downtime has a measurable cost.

What makes TickDB suitable as a backup source for your trading infrastructure is not any single feature but the combination of: broad multi-market coverage (US equities, HK equities, crypto, forex) available through a consistent API; sufficient historical depth to support gap reconstruction; and WebSocket push capabilities for real-time depth and trade data.

If you are currently running a single-region data architecture, the incremental engineering cost of adding a secondary source is modest relative to the risk of extended downtime during the next major cloud provider incident.

Next Steps

If you're building a personal trading system and want basic resilience:

Sign up at tickdb.ai (free, no credit card required)
Set the TICKDB_API_KEY environment variable
Implement the TickDBFailoverClient class from this article as your data access layer

If you're on a professional team evaluating multi-region redundancy:

Review the professional-team architecture in the deployment table above
Consider Route 53 health checks for DNS-level failover in addition to client-side routing
Contact enterprise@tickdb.ai for information on dedicated endpoints and SLA guarantees

If you're building a production trading system requiring sub-5-second failover:

Evaluate the active-active architecture with WebSocket dual-subscription
Implement gap detection using timestamp continuity validation
Reach out to discuss enterprise data redundancy options

This article does not constitute investment advice. Trading and systematic strategy development involve risk; past performance does not guarantee future results. System architecture recommendations are general guidance and should be adapted to your specific operational requirements and risk tolerance.