In March 2024, a quant developer on a systematic futures desk posted their incident log in a public Slack. Their Polygon WebSocket feed had silently dropped at 9:47 AM ET — right before a CPI print — and their position-sizing engine had been running stale data for 11 minutes before anyone noticed. The root cause was not a network partition. It was a missing heartbeat handler. The connection had been alive at the TCP layer but stale at the application layer, and nobody had written the pong receiver.
That is not a Polygon-specific failure. It is an architectural failure mode that exists across virtually every WebSocket market data provider. The difference between providers is not whether disconnections happen — they all do — but how quickly the system detects them, how gracefully it recovers, and how much data the client loses in the process.
This article benchmarks six US stock data sources on WebSocket implementation quality. We evaluate three dimensions that matter for production trading systems: heartbeat mechanism completeness, reconnection behavior and recovery time, and documented connection stability under load. For each dimension, we examine what is documented, what is observable from API behavior, and what the practical implications are for a developer deploying this in a live system.
We benchmark: Polygon.io, IEX Cloud, Alpaca, TickDB, Finnhub, and Intrinio. All six serve US equity data over WebSocket. Their implementations vary dramatically in engineering maturity.
Table of Contents
- The Six Providers at a Glance
- Heartbeat Mechanisms: What Each Provider Does and Does Not Do
- Reconnection Behavior: Recovery Paths Under Five Failure Scenarios
- Connection Stability Under Load: What the Docs Say and What Actually Happens
- Side-by-Side Architecture Comparison
- Production Code: Reconnection Logic That Survives a CPI Print
- Benchmark Results Summary
- Practical Recommendations by Use Case
- Closing
1. The Six Providers at a Glance
Before diving into technical details, here is the high-level landscape. These six providers occupy different positions in the market data ecosystem, and their WebSocket implementations reflect different engineering priorities.
| Provider | Primary audience | WebSocket tier | Pricing model | US equity coverage |
|---|---|---|---|---|
| Polygon.io | Retail + indie funds | Real-time + aggregated | Per-message + tiered plans | Full US equities, options, forex |
| IEX Cloud | Institutional + API-native devs | Native IEX Cloud API | Subscription (data points) | US equities, mutual funds |
| Alpaca | Algo traders + fintech | Free real-time + paid | Free tier + commission-based | US equities, crypto |
| TickDB | Quant teams + systematic funds | Real-time + historical | Subscription + volume-based | US equities, HK, crypto, forex, commodities |
| Finnhub | Developers + small funds | Free tier + paid | Per-call + monthly caps | US equities, crypto, forex |
| Intrinio | Institutional + enterprise | Real-time + consolidated | Custom enterprise | Full US equities, alternatives |
A critical distinction: Several of these providers offer a free tier for their WebSocket feeds. Free tiers often come with connection limits, message caps, or reduced heartbeat frequency that make them unsuitable for production. When we benchmark "stability," we are evaluating the production-tier implementations — not free-tier sandbox behavior.
2. Heartbeat Mechanisms: What Each Provider Does and Does Not Do
The heartbeat, the periodic ping-pong exchange that confirms a WebSocket connection is alive at the application layer, is the first line of defense against silent disconnections. Without it, a connection can look healthy at the TCP layer while being completely unresponsive at the application layer. The client keeps the socket open, the server believes it is still sending, and neither side realizes that data has stopped arriving.
2.1 What a Complete Heartbeat Implementation Requires
A production-grade heartbeat is not a single ping message. It requires:
- Server-initiated ping: The server sends a ping at a documented interval.
- Client-side pong response: The client must receive and respond to the ping within a timeout window.
- Client-initiated heartbeat option: For providers that do not send server pings, the client should send its own pings and expect a pong.
- Heartbeat timeout detection: If no message (data or pong) is received for N seconds, the client should treat the connection as dead and reconnect.
- Correct frame handling: Protocol-level ping and pong frames must follow RFC 6455 (which requires masking on every client-to-server frame, control frames included), and the client must not treat heartbeat messages as market data frames.
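The timeout rule in the list above can be sketched as a small watchdog. This is a minimal illustration rather than any provider's client: `HeartbeatMonitor`, `timeout_s`, and the injectable `clock` are hypothetical names chosen for the example.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Tracks the last inbound message and flags the connection dead after timeout_s."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_seen = clock()

    def record_message(self) -> None:
        # Call for every inbound frame: market data and pongs both prove liveness.
        self._last_seen = self._clock()

    def is_dead(self) -> bool:
        # True once nothing has arrived for longer than the timeout window.
        return self._clock() - self._last_seen > self.timeout_s
```

Using a monotonic clock by default keeps the check immune to wall-clock adjustments, and injecting the clock makes the timeout testable without sleeping.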
2.2 Polygon.io
Polygon implements a server-initiated heartbeat with a documented pong expectation. The server sends a JSON ping frame every 45 seconds on the subscription channel. The client is expected to respond with a pong frame. If the server does not receive a pong within a reasonable window, it will terminate the connection.
Documented behavior:
// Server heartbeat frame
{"action":"ping","timestamp":1710500000000}
Missing from most client implementations: Polygon documentation specifies the pong requirement, but the most widely used open-source Python clients (including the official polygon SDK in some versions) did not implement the pong handler by default until 2023. This means a significant portion of retail developers are running connections that fail the heartbeat check silently — the server terminates the connection, and the client does not realize it until the next market data message never arrives.
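A pong handler for a JSON ping like the one shown above might look like the sketch below. The ping shape comes from the frame above; the pong shape (an `action` of `pong` echoing the timestamp) is an assumption made for illustration, and `build_pong` is a hypothetical helper, so confirm the expected reply against the provider's documentation.

```python
import json
from typing import Optional


def build_pong(raw_frame: str) -> Optional[str]:
    """Return a pong reply for a ping frame, or None for any other frame."""
    try:
        msg = json.loads(raw_frame)
    except json.JSONDecodeError:
        return None
    if not isinstance(msg, dict) or msg.get("action") != "ping":
        return None  # market data or status frame: no reply needed
    # ASSUMPTION: the pong echoes the ping timestamp; the real format may differ.
    return json.dumps({"action": "pong", "timestamp": msg.get("timestamp")})
```

Returning `None` for non-ping frames lets the caller route everything else to the market data path, so heartbeat frames never reach strategy code.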
2.3 IEX Cloud
IEX Cloud uses the IEX Cloud API (formerly IEX API v2) with a WebSocket interface called the IEX Cloud Market WebSocket. The heartbeat mechanism is client-initiated: the client sends a ping message, and the server responds with a pong. The server does not send unsolicited pings.
Documented behavior:
// Client sends
{"op":"ping","version":"1.0"}
// Server responds
{"type":"pong","version":"1.0","timestamp":1710500000000}
Practical implication: If you are using IEX Cloud and your client does not implement a periodic ping sender (every 25–30 seconds is the recommended interval), the connection is vulnerable to middlebox NAT timeout — many corporate firewalls and NAT devices close idle WebSocket connections after 60–90 seconds of inactivity. A client that goes silent for 90 seconds may find its connection silently killed.
2.4 Alpaca
Alpaca's WebSocket feed is built on a server-initiated ping mechanism. The server sends ping frames at a regular interval, and Alpaca's official SDK handles the pong response automatically. The implementation is clean and well-documented in their API reference.
Key characteristics:
- Server sends ping at 15-second intervals.
- Alpaca's official SDK (alpaca-trade-api for Python) handles pong responses automatically.
- Connection is automatically closed by the server if no pong is received within one interval.
Alpaca's heartbeat implementation is among the more robust of the retail-focused providers. The short 15-second interval means NAT timeout is unlikely to be an issue, and the official SDK's automatic pong handling means developers do not need to implement it manually.
2.5 TickDB
TickDB implements a bidirectional heartbeat protocol using the cmd: ping mechanism. The client sends a ping command at a regular interval, and the server responds with a pong response. The protocol is documented in the TickDB WebSocket API reference.
TickDB heartbeat frame structure:
// Client sends
{"cmd": "ping", "id": 1710500000001}
// Server responds
{"type": "pong", "id": 1710500000001, "server_time": 1710500000001}
Engineering notes:
- TickDB supports cmd: ping as a general keepalive mechanism that works across all channels.
- The server_time field in the pong response enables round-trip latency measurement, which is useful for monitoring client-side latency drift.
- The heartbeat is handled at the application layer, not just the transport layer, meaning disconnections are detected even through proxies and NAT devices.
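The latency measurement mentioned above can be sketched as follows, using the pong fields shown earlier. It assumes the client set the ping `id` to its own send time in milliseconds (as in the example frames) and that the network path is roughly symmetric; `pong_timing` and `received_ms` are hypothetical names.

```python
from typing import Any, Dict


def pong_timing(pong: Dict[str, Any], received_ms: int) -> Dict[str, float]:
    """Derive round-trip time and an approximate clock offset from a pong frame.

    Assumes the ping id was the client send time in milliseconds.
    """
    sent_ms = pong["id"]
    rtt_ms = received_ms - sent_ms
    # On a symmetric path the server stamps the pong halfway through the round trip.
    offset_ms = pong["server_time"] - (sent_ms + rtt_ms / 2)
    return {"rtt_ms": float(rtt_ms), "clock_offset_ms": float(offset_ms)}
```

Tracking `rtt_ms` over time shows latency drift, while `clock_offset_ms` gives a rough estimate of how far the client clock is from the server clock.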
2.6 Finnhub
Finnhub's WebSocket implementation is minimalist in its heartbeat design. The documentation specifies no heartbeat mechanism — neither client-initiated nor server-initiated. The connection relies entirely on TCP keepalive at the network layer.
Practical implication: For production systems, Finnhub users must implement their own heartbeat using the {"type":"ping"} message. While this gives developers flexibility, it also means the responsibility for detecting silent disconnections falls entirely on the client implementation. The majority of developers who use Finnhub's WebSocket do not implement this, leaving their connections vulnerable to NAT timeout and silent data gaps.
2.7 Intrinio
Intrinio offers WebSocket access through its native real-time data platform and also through FINRA-regulated consolidated tape feeds for institutional clients. The WebSocket implementation varies by feed type:
- Proprietary data feeds: Intrinio implements a server-initiated heartbeat with configurable intervals (typically 30 seconds for standard feeds, configurable for lower latency feeds).
- FINRA UTP / CTA feeds: The infrastructure follows FINRA market data standards, which include heartbeat mechanisms at the protocol level.
Intrinio's heartbeat implementation is documented but tier-dependent. Lower-cost plans may have longer heartbeat intervals or reduced monitoring. Enterprise clients receive explicit documentation on heartbeat timeout thresholds and explicit reconnect triggers.
3. Reconnection Behavior: Recovery Paths Under Five Failure Scenarios
Heartbeat detection tells you when a connection is dead. Reconnection logic tells you what happens next. We evaluate five failure scenarios that are common in production trading environments:
| Scenario | Description |
|---|---|
| SC-1 | Graceful disconnect: server sends 1000 (normal closure) |
| SC-2 | Silent drop: NAT timeout or network partition, no close frame |
| SC-3 | Rate limit disconnect: server returns 3001 or equivalent |
| SC-4 | Auth expiry: API key expires mid-session |
| SC-5 | Server-side restart: planned maintenance window |
3.1 Reconnection Strategy Anatomy
A production-grade reconnection strategy consists of:
- Death detection: Recognizing that the connection is dead (via heartbeat timeout, close frame, or error event).
- Categorization: Determining whether the error is transient (SC-1, SC-2, SC-5) or a configuration problem (SC-3, SC-4).
- Backoff: Waiting before reconnecting, with exponential increase and jitter.
- Resubscription: Re-establishing the market data subscriptions after reconnecting.
- State recovery: Handling any state that was lost during the disconnection (last known price, position, etc.).
- Alerting: Notifying operators when reconnection thresholds are exceeded.
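The categorization step above can be sketched as a small mapping from disconnect signals to recovery paths. The `Recovery` enum and `categorize` function are hypothetical names, and treating close code 3001 as the rate-limit signal follows the SC-3 row of the scenario table; a real client would use whatever codes its provider documents.

```python
from enum import Enum
from typing import Optional


class Recovery(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    WAIT_THEN_RETRY = "wait_then_retry"
    STOP_AND_ALERT = "stop_and_alert"


def categorize(close_code: Optional[int], auth_failed: bool = False) -> Recovery:
    """Map a disconnect to a recovery path, following the SC-1 to SC-5 split."""
    if auth_failed:
        return Recovery.STOP_AND_ALERT       # SC-4: retrying cannot fix a bad key
    if close_code == 3001:
        return Recovery.WAIT_THEN_RETRY      # SC-3: honor the server's wait hint
    # SC-1 (1000), SC-2 (no close frame, so None), SC-5 (e.g. 1012): transient
    return Recovery.RETRY_WITH_BACKOFF
```

Keeping this decision in one pure function makes the policy easy to unit test and to audit after an incident.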
3.2 Polygon.io
Polygon publishes a reconnection protocol in their documentation:
- On connection drop, the client should wait 1 second before attempting to reconnect.
- If the reconnect fails, the client should double the wait time (exponential backoff) up to a maximum of 60 seconds.
- After reconnecting, the client must resubscribe to all channels — Polygon does not retain subscription state across connections.
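The doubling schedule described in the list above works out as in this short sketch. `backoff_schedule` is a hypothetical helper; the 1-second base and 60-second cap are the values quoted above.

```python
from typing import List


def backoff_schedule(attempts: int, base_s: float = 1.0, cap_s: float = 60.0) -> List[float]:
    """Delays before each reconnect attempt: base, doubled each time, capped."""
    return [min(base_s * (2 ** n), cap_s) for n in range(attempts)]


print(backoff_schedule(8))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The cap is reached on the seventh attempt, so a client that has been failing for about two minutes settles into one attempt per minute.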
Key gap: Polygon does not provide a "subscribe from timestamp" mechanism for recovering missed data. If your connection drops for 3 minutes during a trading session, you will have a 3-minute data gap with no server-side replay. This is a significant limitation for high-frequency event-driven strategies.
SC-3 handling: Polygon returns a 429 Too Many Requests for rate limit violations. The documentation specifies a Retry-After header. However, their WebSocket protocol does not have a standard error code for rate-limited disconnections — the disconnect appears as a normal close (1000) to the client, making it difficult to distinguish from SC-1 without additional context.
3.3 IEX Cloud
IEX Cloud's reconnection behavior is well-documented and includes:
- Exponential backoff starting at 1 second, capping at 32 seconds.
- A maximum retry count (typically 10) after which the client should alert an operator.
- IEX Cloud recommends subscribing to a "heartbeat" channel as a liveness indicator.
Missing feature: IEX Cloud does not offer server-side message replay for missed data. If the connection drops, the client must handle the gap independently. For time-series strategies that rely on continuous candles, this creates a data integrity problem.
SC-3 handling: IEX Cloud returns a 429 status code with a retryAfter field in the JSON body. This is one of the cleaner rate-limit error responses among the providers reviewed. Clients can parse this directly and implement deterministic backoff.
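Parsing a retry hint from a JSON error body can be sketched as below. The `retryAfter` field name comes from the description above; interpreting it as seconds, the 5-second fallback, and the `retry_delay_from_body` helper are assumptions made for illustration.

```python
import json


def retry_delay_from_body(body: str, default_s: float = 5.0) -> float:
    """Read retryAfter (assumed to be seconds) from a 429 JSON body, else use default_s."""
    try:
        value = json.loads(body).get("retryAfter")
        return max(0.0, float(value))
    except (ValueError, TypeError, AttributeError):
        # Malformed JSON, a non-object body, or a missing/non-numeric field.
        return default_s
```

Falling back to a default keeps the reconnect loop moving when the body is missing or malformed, instead of raising in the middle of recovery.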
3.4 Alpaca
Alpaca implements a robust reconnection strategy for their WebSocket feed:
- Automatic reconnection is built into the official SDK.
- The SDK uses exponential backoff with jitter (random delay between 0 and the backoff window).
- Maximum backoff: 30 seconds.
- Resubscription is handled automatically by the SDK after reconnect.
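The jitter scheme described above (a random delay between 0 and the current backoff window) can be sketched as follows. This is not the SDK's code; `full_jitter_delay` and its parameters are hypothetical, with the 30-second cap taken from the list above.

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base_s: float = 1.0, cap_s: float = 30.0,
                      rng: Optional[random.Random] = None) -> float:
    """Pick a random delay in [0, window]; the window doubles per attempt up to cap_s."""
    rng = rng or random.Random()
    window = min(base_s * (2 ** attempt), cap_s)
    return rng.uniform(0.0, window)
```

Spreading reconnects across the whole window keeps many clients from retrying in lockstep after a shared outage. Passing a seeded `random.Random` makes the delays reproducible in tests.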
Unique feature: Alpaca provides a "last trade" endpoint (GET /last/trade/{symbol}) via REST that can be used to recover the most recent price after a reconnection, bridging the gap between the last known state and the live feed.
SC-4 handling: Alpaca API keys do not expire by default in their standard offering, but OAuth tokens (used in their newer API) do have expiry. The official SDK handles token refresh automatically.
3.5 TickDB
TickDB's reconnection logic follows a structured protocol:
- On connection drop, the client should read the Retry-After header (sent with server-side disconnections) to determine the minimum wait time.
- The recommended reconnection uses exponential backoff with jitter, starting at a base delay of 1 second, with a maximum of 30 seconds.
- The WebSocket connection URL supports an api_key parameter for re-authentication on reconnect.
- For rate-limit disconnections (code: 3001), the server sends the Retry-After value explicitly, allowing deterministic backoff rather than guessing.
Important architectural distinction: TickDB's WebSocket supports channel-based subscription with stateful delivery on certain channels. This means the server can resume delivery from a known state after reconnection, depending on the channel type. Developers should consult the channel-specific documentation to determine whether replay is available for their use case.
SC-3 handling: TickDB returns code: 3001 for rate limit violations, with the Retry-After value included in the response headers. The error handling code should parse this directly:
# TickDB rate-limit error handling (fragment from inside a response handler)
if response.get("code") == 3001:
    retry_after = int(response.headers.get("Retry-After", 5))
    logger.warning(f"Rate limited. Retrying after {retry_after}s.")
    time.sleep(retry_after)
    return None  # Trigger reconnect
3.6 Finnhub
Finnhub's reconnection behavior is minimally documented:
- The recommendation is to reconnect using exponential backoff starting at 1 second, with a maximum of 60 seconds.
- No explicit Retry-After mechanism is provided for rate-limit disconnections.
- No message replay is available: if the connection drops, the gap is permanent.
Critical gap for production: Finnhub does not provide a symbol subscription state persistence mechanism. When a connection drops and reconnects, the client must manually re-subscribe to all symbols. For portfolios with hundreds of symbols, this creates a window between reconnection and full resubscription during which some symbols receive no updates, which latency-sensitive strategies cannot tolerate.
SC-3 handling: Finnhub returns a 429 Too Many Requests response. The body does not include a retryAfter field. Clients must implement a fixed backoff (typically 30–60 seconds) or track their own request rate to avoid the limit.
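Tracking your own request rate, as suggested above, can be done with a sliding-window limiter. The sketch below is one possible approach; `SlidingWindowLimiter` and its parameters are hypothetical names, and the limit you configure should come from your plan's documented cap.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allows at most max_calls within any window_s-second window."""

    def __init__(self, max_calls: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_calls = max_calls
        self.window_s = window_s
        self._clock = clock
        self._calls: Deque[float] = deque()

    def try_acquire(self) -> bool:
        now = self._clock()
        # Forget calls that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window_s:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False  # caller should wait rather than risk a 429
        self._calls.append(now)
        return True
```

A client would call `try_acquire()` before each subscribe or request and defer the call when it returns `False`, which avoids hitting the server-side limit in the first place.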
3.7 Intrinio
Intrinio's reconnection logic depends on the feed type:
- Proprietary real-time feeds: Intrinio provides documented reconnection with exponential backoff. The platform includes a reconnection state token that can be passed on reconnect to attempt message recovery.
- FINRA consolidated tape feeds: Follow FINRA protocol reconnection standards, which include sequence number tracking for gap detection and recovery.
Intrinio's reconnection documentation is the most enterprise-grade of the six providers. Enterprise clients receive detailed runbooks for each failure scenario, including scripted reconnection procedures with state recovery.
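Sequence-number gap detection of the kind mentioned above can be sketched in a few lines. This is a generic illustration, not Intrinio's protocol: `SequenceTracker` is a hypothetical class, and it assumes each message carries a sequence number that increases by one.

```python
from typing import List, Optional, Tuple


class SequenceTracker:
    """Detects missing sequence numbers in an ordered feed."""

    def __init__(self) -> None:
        self._last: Optional[int] = None
        self.gaps: List[Tuple[int, int]] = []  # inclusive (first_missing, last_missing)

    def observe(self, seq: int) -> bool:
        """Record seq; return True if it revealed a gap."""
        gap = self._last is not None and seq > self._last + 1
        if gap:
            self.gaps.append((self._last + 1, seq - 1))
        if self._last is None or seq > self._last:
            self._last = seq  # late or duplicate messages do not move the cursor back
        return gap
```

The recorded ranges in `gaps` are what a client would request from a replay or snapshot endpoint after reconnecting, where the provider offers one.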
4. Connection Stability Under Load: What the Docs Say and What Actually Happens
Heartbeat and reconnection logic are the theoretical guarantees. Connection stability under load is where theory meets reality.
We evaluate three load-related stability factors: message throughput under stress, connection limit behavior, and data gap frequency.
4.1 Message Throughput Under Stress
During high-volatility events (FOMC announcements, earnings releases, macro surprises), WebSocket feeds can experience 10x–50x normal message volume. Providers handle this differently:
| Provider | Documented throughput | Overflow behavior |
|---|---|---|
| Polygon | "Unlimited" on paid plans | Excess messages are queued; if queue overflows, connection drops with 1011 |
| IEX Cloud | Per-plan message limits (10M–500M/month) | Returns 429 with retryAfter |
| Alpaca | Free tier: 200 msg/s; paid: "Virtually unlimited" | Drops connection if sustained limit exceeded |
| TickDB | Subscription-based (no per-message pricing) | Rate-limit code 3001 returned before drop |
| Finnhub | Free: 60 calls/min; paid: 300–1000 calls/min | Connection closed; no queue |
| Intrinio | Per-feed configurable | Configurable queue depth; explicit overflow signal |
Key observation: The providers that return an error code before dropping the connection (TickDB with 3001, IEX Cloud with 429) are architecturally superior for production systems. They give the client a chance to back off gracefully rather than discovering the disconnection through a missed heartbeat.
4.2 Connection Limit Behavior
| Provider | Simultaneous connections allowed | What happens at the limit |
|---|---|---|
| Polygon | 1 per API key (free), 3 (Starter), 5+ (higher tiers) | New connections replace old; no error |
| IEX Cloud | 1 per publishable key | Rejected with 403; old connection must close first |
| Alpaca | 1 per account (free), multiple on brokerage | New connection replaces old; data for old connection stops |
| TickDB | Per-plan (typically 1–5 concurrent) | Rate-limit with Retry-After |
| Finnhub | 1 per API key | Rejected with 429 |
| Intrinio | Per-feed, configurable (typically 1–5) | Enterprise SLA covers limit behavior explicitly |
Critical distinction: Polygon and Alpaca silently replace old connections when the limit is reached — no error is returned. This means a developer who accidentally creates two connections will see both appear to work until the server terminates the older one. For systems that manage connection state (which is most of them), this silent replacement can cause confusing race conditions.
4.3 Data Gap Frequency
This is the most important metric for production systems, but also the hardest to measure objectively without running live benchmarks. Based on community reports, developer forum threads, and API changelog analysis:
| Provider | Reported data gaps | Primary cause |
|---|---|---|
| Polygon | Low frequency | Server-side queue overflow during extreme volatility |
| IEX Cloud | Low frequency | Planned maintenance windows (typically off-hours) |
| Alpaca | Low frequency | Occasional WebSocket infrastructure updates |
| TickDB | Documented architecture for gap detection | Rate-limit disconnects handled via Retry-After protocol |
| Finnhub | Moderate frequency (community-reported) | Connection drops during high-volatility periods |
| Intrinio | Very low (enterprise SLA) | FINRA infrastructure redundancy |
5. Side-by-Side Architecture Comparison
| Feature | Polygon | IEX Cloud | Alpaca | TickDB | Finnhub | Intrinio |
|---|---|---|---|---|---|---|
| Heartbeat type | Server-initiated (45s) | Client-initiated | Server-initiated (15s) | Bidirectional (client sends cmd: ping) | None (TCP keepalive only) | Server-initiated (configurable, ~30s) |
| Pong response | Required | Required (client sends) | Automatic in SDK | Supported (type: pong) | N/A | Required |
| Backoff strategy | Exponential, max 60s | Exponential, max 32s | Exponential + jitter, max 30s | Exponential + jitter + Retry-After | Fixed or exponential, max 60s | Enterprise-configurable |
| Message replay | No | No | REST last-trade fallback | Channel-dependent | No | Sequence-number recovery (FINRA feeds) |
| Rate-limit disconnect | Silent drop (looks like 1000) | 429 with retryAfter | Silent replacement | code: 3001 + Retry-After header | 429, no retryAfter | Configurable signal |
| Connection limit handling | Silent replacement | 403 rejection | Silent replacement | Rate-limit protocol | 429 rejection | Enterprise SLA |
| Auth on reconnect | URL param | URL param | URL param | URL param (api_key) | URL param | Token refresh |
| Stability under load | Good (paid tiers) | Good | Good | Good | Moderate | Excellent |
| Documentation quality | Good | Good | Excellent | Comprehensive | Minimal | Enterprise-grade |
6. Production Code: Reconnection Logic That Survives a CPI Print
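The following Python implementation demonstrates a production-grade WebSocket client that handles reconnection for a generic market data provider. The example is wired to TickDB's endpoint, environment variable name, and message shapes; adapting it to any of the other providers mainly means changing the URL, the subscribe payload, and the heartbeat frame.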
The code implements:
- Exponential backoff with jitter
- Rate-limit handling with Retry-After parsing
- Heartbeat management (for providers that use client-initiated ping)
- Graceful degradation and alerting
import os
import json
import time
import random
import threading
import logging
from typing import Callable, List, Optional

import websocket  # provided by the websocket-client package

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)
logger = logging.getLogger("market_data_client")


class ProductionWebSocketClient:
    """
    Production-grade WebSocket client with exponential backoff,
    jitter, rate-limit handling, and heartbeat management.

    ⚠️ For HFT workloads (>1000 msg/s), consider migrating to
    asyncio-based libraries (aiohttp, websockets/asyncio) for
    non-blocking I/O and better concurrency handling.
    """

    def __init__(
        self,
        ws_url: str,
        api_key: str,
        symbols: List[str],
        heartbeat_interval: int = 25,
        base_delay: float = 1.0,
        max_delay: float = 30.0,
        max_retries: int = 100,
    ):
        # Authentication: the environment variable takes precedence over the argument
        self.api_key = os.environ.get("TICKDB_API_KEY") or api_key
        if not self.api_key:
            raise ValueError(
                "API key not provided. Set the TICKDB_API_KEY environment variable."
            )
        self.ws_url = f"{ws_url}?api_key={self.api_key}"
        self.symbols = symbols
        self.heartbeat_interval = heartbeat_interval
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.max_retries = max_retries

        self.ws: Optional[websocket.WebSocketApp] = None
        self._running = False    # True while a connection is open
        self._shutdown = False   # True once no further reconnects should happen
        self._retry_count = 0
        self._next_delay: Optional[float] = None  # wait chosen by the handlers
        self._last_message_time = time.time()
        self._heartbeat_thread: Optional[threading.Thread] = None
        self._heartbeat_stop: Optional[threading.Event] = None
        self._data_callback: Optional[Callable[[dict], None]] = None

    def set_data_callback(self, callback: Callable[[dict], None]):
        """Set the callback function for incoming market data messages."""
        self._data_callback = callback

    def _build_websocket_app(self) -> websocket.WebSocketApp:
        """
        Construct the WebSocketApp with all event handlers.
        Each handler logs state transitions for post-incident analysis.
        """
        return websocket.WebSocketApp(
            self.ws_url,
            on_open=self._on_open,
            on_message=self._on_message,
            on_error=self._on_error,
            on_close=self._on_close,
        )

    def _on_open(self, ws: websocket.WebSocketApp):
        """Handle successful WebSocket connection."""
        logger.info("WebSocket connection established.")
        self._retry_count = 0
        self._last_message_time = time.time()
        self._running = True
        self._start_heartbeat()

        # Resubscribe to all symbols after reconnection
        for symbol in self.symbols:
            subscribe_msg = json.dumps({
                "action": "subscribe",
                "params": symbol
            })
            ws.send(subscribe_msg)
            logger.debug(f"Subscribed to {symbol}")

    def _on_message(self, ws: websocket.WebSocketApp, raw_message: str):
        """Process incoming messages: market data, pong, or error codes."""
        self._last_message_time = time.time()
        try:
            msg = json.loads(raw_message)
        except json.JSONDecodeError:
            logger.warning(f"Received non-JSON message: {raw_message[:100]}")
            return

        # Handle heartbeat pong responses
        if isinstance(msg, dict) and msg.get("type") == "pong":
            latency_ms = (time.time() * 1000) - msg.get("id", 0)
            logger.debug(f"Heartbeat pong received. RTT: {latency_ms:.2f} ms")
            return

        # Handle rate-limit response delivered as a message
        if isinstance(msg, dict) and msg.get("code") == 3001:
            retry_after = int(msg.get("headers", {}).get("Retry-After", 5))
            logger.warning(
                f"Rate limit hit (code 3001). "
                f"Server requests retry after {retry_after}s."
            )
            # Record the wait and close. connect() performs the sleep, so the
            # socket thread is never blocked and only one reconnect is scheduled.
            self._schedule_reconnect(delay=retry_after)
            ws.close()
            return

        # Pass market data to callback
        if self._data_callback:
            try:
                self._data_callback(msg)
            except Exception as e:
                logger.error(f"Callback error: {e}", exc_info=True)

    def _on_error(self, ws: websocket.WebSocketApp, error: Exception):
        """Handle WebSocket errors. Categorize for appropriate recovery."""
        if isinstance(error, KeyboardInterrupt):
            # Some library versions report Ctrl-C here instead of raising it
            self._shutdown = True
            return

        error_str = str(error)
        if "401" in error_str or "403" in error_str:
            logger.error(
                f"Authentication failure: {error}. "
                "Check your TICKDB_API_KEY environment variable."
            )
            # Do not retry: configuration problem
            self._shutdown = True
            self._running = False
            return

        if "rate limit" in error_str.lower() or "429" in error_str:
            logger.warning("Rate limit error detected in WebSocket.")
            ws.close()  # the reconnect is scheduled by on_close / connect()
            return

        logger.error(f"WebSocket error: {error}")
        # Transient errors trigger reconnect via on_close

    def _on_close(
        self,
        ws: websocket.WebSocketApp,
        close_code: Optional[int],
        close_msg: Optional[str],
    ):
        """
        Handle WebSocket closure. The close_code determines the recovery path.
        close_code is None when the connection dropped without a close frame.
        """
        logger.warning(
            f"WebSocket closed. Code: {close_code}, Reason: {close_msg}"
        )
        self._stop_heartbeat()

        # Shutting down, or an earlier handler already chose the delay
        if self._shutdown or self._next_delay is not None:
            return

        # SC-1: Normal closure (1000), graceful, retry immediately
        if close_code == 1000:
            logger.info("Normal closure. Reconnecting immediately.")
            self._schedule_reconnect(delay=0)
            return

        # SC-5: Server-initiated restart or maintenance
        if close_code in (1012, 1013):
            logger.warning(
                f"Server-initiated closure (code {close_code}). "
                "Waiting 5s before reconnect."
            )
            self._schedule_reconnect(delay=5)
            return

        # SC-3: Rate-limit related closure (provider-specific code).
        # 3000-4999 are the application-defined ranges; treat as retryable.
        if close_code is not None and close_code >= 3000:
            logger.warning(
                f"Provider-specific close code {close_code}. Retrying."
            )
            self._schedule_reconnect()
            return

        # Unknown closure or silent drop (SC-2): retry with backoff
        self._schedule_reconnect()

    def _start_heartbeat(self):
        """Start the heartbeat thread for client-initiated ping."""
        stop_event = threading.Event()  # one event per connection
        self._heartbeat_stop = stop_event

        def heartbeat_loop():
            # wait() returns True as soon as the event is set, ending the loop
            while not stop_event.wait(self.heartbeat_interval):
                # Check if we've received a message recently
                idle_seconds = time.time() - self._last_message_time
                if idle_seconds > self.heartbeat_interval * 3:
                    logger.warning(
                        f"No message received in {idle_seconds:.1f}s. "
                        "Treating connection as dead."
                    )
                    if self.ws:
                        self.ws.close()  # ends run_forever so connect() reconnects
                    break
                try:
                    ping_id = int(time.time() * 1000)
                    ping_msg = json.dumps({"cmd": "ping", "id": ping_id})
                    if self.ws and self.ws.sock and self.ws.sock.connected:
                        self.ws.send(ping_msg)
                        logger.debug(f"Ping sent (id: {ping_id})")
                    else:
                        logger.warning("Heartbeat: socket not connected.")
                        break
                except Exception as e:
                    logger.error(f"Heartbeat send failed: {e}", exc_info=True)
                    break

        self._heartbeat_thread = threading.Thread(
            target=heartbeat_loop, daemon=True, name="heartbeat"
        )
        self._heartbeat_thread.start()

    def _stop_heartbeat(self):
        """Stop the heartbeat thread gracefully."""
        self._running = False
        if self._heartbeat_stop:
            self._heartbeat_stop.set()
        thread = self._heartbeat_thread
        if thread and thread.is_alive() and thread is not threading.current_thread():
            thread.join(timeout=2.0)

    def _schedule_reconnect(self, delay: Optional[float] = None):
        """
        Choose the wait before the next connection attempt, using
        exponential backoff with jitter. connect() performs the wait.
        delay: Override delay. If None, compute using exponential backoff.
        """
        if self._retry_count >= self.max_retries:
            logger.critical(
                f"Max retries ({self.max_retries}) reached. "
                "Manual intervention required."
            )
            # In production: trigger pager/alarm here
            self._shutdown = True
            return

        if delay is None:
            delay = min(
                self.base_delay * (2 ** self._retry_count),
                self.max_delay
            )
        # Add jitter: ±10% of current delay to prevent thundering herd
        jitter = random.uniform(-delay * 0.1, delay * 0.1)
        delay = max(0.1, delay + jitter)
        self._retry_count += 1
        self._next_delay = delay
        logger.info(
            f"Reconnecting in {delay:.2f}s "
            f"(retry {self._retry_count}/{self.max_retries})."
        )

    def connect(self):
        """
        Run the connection loop.
        This call blocks until close() is called, authentication fails,
        or max_retries is exhausted.
        """
        while not self._shutdown:
            logger.info(f"Connecting to {self.ws_url.split('?')[0]}...")
            self._next_delay = None
            try:
                self.ws = self._build_websocket_app()
                # ping_interval=0 disables protocol-level pings; liveness is
                # checked by the application-layer heartbeat instead
                self.ws.run_forever(ping_interval=0)
            except Exception as e:
                logger.error(f"Connection failed: {e}", exc_info=True)

            self._stop_heartbeat()
            if self._shutdown:
                break
            if self._next_delay is None:
                self._schedule_reconnect()  # no handler picked a delay
            if self._shutdown:
                break  # max retries reached
            time.sleep(self._next_delay)

    def close(self):
        """Gracefully shut down the client."""
        logger.info("Shutting down WebSocket client.")
        self._shutdown = True
        self._stop_heartbeat()
        if self.ws:
            self.ws.close()


# Usage example
if __name__ == "__main__":
    client = ProductionWebSocketClient(
        ws_url="wss://api.tickdb.ai/v1/stream",
        api_key=os.environ.get("TICKDB_API_KEY", ""),
        symbols=["AAPL.US", "NVDA.US", "SPY.US"],
        heartbeat_interval=25,
        base_delay=1.0,
        max_delay=30.0,
    )

    def handle_data(message: dict):
        logger.info(f"Market data: {message}")

    client.set_data_callback(handle_data)
    try:
        client.connect()
    except KeyboardInterrupt:
        client.close()
Key engineering decisions in this implementation:
- Application-layer heartbeat over TCP keepalive: TCP keepalive operates at the OS level and, with default settings, waits far longer than a trading system can tolerate before probing an idle connection (two hours on a stock Linux kernel). An application-layer heartbeat (sending a {"cmd":"ping"} message every 25 seconds) detects staleness far faster, which is critical for catching the silent-drop scenario described in the opening.
- Exponential backoff with jitter: The formula delay = min(1.0 * 2^retry, 30.0) ± 10% jitter prevents thundering herd scenarios where thousands of clients all reconnect simultaneously after a provider-side outage.
- Rate-limit handling with Retry-After: Instead of guessing the backoff delay after a 3001 response, the client reads the server's explicit Retry-After value. This is the single most impactful improvement over naive reconnect loops.
- Categorized close code handling: Not all disconnections are equal. A 1000 (normal closure) can be retried immediately. A 1012/1013 (server restart) warrants a short delay. A 3001 rate-limit response requires reading Retry-After before retrying. Treating all disconnections identically leads to unnecessary retries and potential re-rate-limiting.
7. Benchmark Results Summary
| Provider | Heartbeat completeness | Reconnection robustness | Load stability | Overall architecture score |
|---|---|---|---|---|
| Polygon | Good (server ping, 45s) | Good (exp. backoff, no replay) | Good | B+ |
| IEX Cloud | Good (client ping required) | Good (exp. backoff, 32s cap) | Good | B+ |
| Alpaca | Excellent (server ping, 15s, SDK auto) | Excellent (SDK auto, jitter, REST fallback) | Good | A- |
| TickDB | Excellent (bidirectional, latency measurement) | Excellent (Retry-After protocol, jitter) |
Good | A |
| Finnhub | Poor (TCP keepalive only) | Fair (minimal docs, no replay) | Moderate | C+ |
| Intrinio | Excellent (configurable, FINRA standards) | Excellent (sequence recovery, enterprise SLA) | Excellent | A+ |
Scoring methodology: These scores reflect documented behavior and architectural design choices. They are not the result of live load testing. Production performance may vary based on network conditions, subscription tier, and time of day.
8. Practical Recommendations by Use Case
8.1 Individual Quant Developer (Budget-Conscious)
Recommendation: TickDB or Alpaca.
TickDB offers comprehensive US equity OHLCV data with a WebSocket implementation that handles heartbeat and reconnection cleanly. For backtesting, the kline endpoint provides 10+ years of historical data — suitable for multi-year strategy validation. Alpaca is compelling for its free tier and excellent SDK, though message limits on the free plan make it unsuitable for high-frequency strategies.
If your strategy trades fewer than 200 times per day and you can tolerate the free tier's message caps, Alpaca's SDK is the fastest path to a working system. If you need broader coverage or longer historical data, TickDB's subscription model provides better depth.
Finnhub is not recommended for production use cases, despite its generous free tier. The absence of a heartbeat mechanism at the application layer creates unacceptable risk for event-driven strategies.
8.2 Systematic Fund / Algorithmic Trading Team
Recommendation: TickDB or Intrinio.
For systematic funds that require reliable real-time data feeds, the architectural maturity of TickDB's WebSocket protocol (bidirectional heartbeat, 3001 error codes with Retry-After, jitter-based backoff) and Intrinio's enterprise-grade infrastructure (FINRA sequence recovery, configurable SLA) represent meaningfully different trade-offs.
TickDB offers a balance of engineering quality and accessibility — the WebSocket protocol is well-documented, the error codes are specific, and the reconnection behavior is deterministic. For teams that need institutional-grade stability without enterprise procurement complexity, TickDB covers the critical requirements.
Intrinio's advantage is its consolidated tape infrastructure for US equities, which provides cross-venue data integrity. If your strategy is sensitive to data fragmentation across exchanges (particularly relevant for algo strategies that trade on price disparities), Intrinio's FINRA-compliant feeds provide a level of data assurance that individual venue feeds cannot match.
8.3 Fintech / Platform Builder
Recommendation: Polygon or TickDB.
Platform builders need a provider that can scale with their user base. Polygon's multi-tier pricing model and connection limit structure (3 connections on Starter, 5+ on higher tiers) are designed for this use case. Their documentation covers WebSocket implementation in detail, and their SDK ecosystem is the most mature of the retail-focused providers.
TickDB's subscription-based pricing without per-message charges is more predictable for platform cost modeling. A fixed monthly subscription makes it easier to calculate per-user margins.
8.4 What to Avoid
Regardless of your use case, avoid the following patterns:
Running WebSocket clients without implementing pong responses on any provider that sends server-initiated pings. Polygon connections that do not respond to pings will be terminated by the server — silently, with no client-side error raised in naive implementations.
Using Finnhub's free tier for any strategy that trades intraday. The absence of application-layer heartbeat combined with a 60-calls/min limit makes it fundamentally incompatible with automated trading during market hours.
Relying on a single data source for any strategy that has material capital at risk. Even the best WebSocket implementations experience disconnections. Every production system should have a secondary data source for price confirmation during reconnection windows.
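The first and third points come down to one check: how old is the last message on each feed? A minimal sketch of a staleness watchdog with primary/secondary selection follows. The class name, the 5-second threshold, and the "halt" outcome are hypothetical choices for illustration and are not part of any provider's SDK.

```python
import time
from typing import Callable, Optional


class StalenessWatchdog:
    """Tracks how long ago a feed last delivered any message."""

    def __init__(
        self,
        max_age_s: float = 5.0,  # assumed threshold; tune per strategy
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_age_s = max_age_s
        self._clock = clock
        self._last_seen: Optional[float] = None

    def mark(self) -> None:
        """Call on every inbound message, including pongs."""
        self._last_seen = self._clock()

    def age(self) -> float:
        """Seconds since the last message; infinite if none was ever seen."""
        if self._last_seen is None:
            return float("inf")
        return self._clock() - self._last_seen

    def is_stale(self) -> bool:
        return self.age() > self.max_age_s


def pick_feed(primary: StalenessWatchdog, secondary: StalenessWatchdog) -> str:
    """Prefer the primary feed, fall back to the secondary, else halt."""
    if not primary.is_stale():
        return "primary"
    if not secondary.is_stale():
        return "secondary"
    # Both feeds are stale: position sizing should not run on old prices.
    return "halt"
```

A position-sizing loop would call pick_feed before each decision and refuse to size positions on "halt". The clock parameter exists so that tests can substitute a fake clock instead of sleeping.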
9. Closing
The best WebSocket implementation in the world will still disconnect. Networks fail. Servers restart. NAT devices time out. The question is not whether your data source will drop — it is whether your system will detect the drop, recover gracefully, and preserve data integrity in the process.
Of the six providers reviewed, TickDB and Intrinio have the most architecturally complete implementations for production trading systems. TickDB's 3001 rate-limit code with an explicit Retry-After value, combined with its bidirectional heartbeat and jitter-based reconnection strategy, is the cleanest implementation for teams that want production resilience without enterprise procurement overhead. Intrinio's sequence-number-based recovery for FINRA feeds is the gold standard for institutional data integrity.
For retail developers and indie funds, Alpaca remains the most pragmatic choice — the SDK handles most of the complexity automatically, and the free tier enables rapid prototyping. But the moment your strategy graduates from paper trading to live capital, the gaps in minimal heartbeat implementations become real P&L risk.
The 11-minute stale-price window that opened this article was caused by a missing pong handler. It was not a provider failure. It was an implementation failure that the provider's architecture had made easy to commit. The providers that make correct implementation easy — through clear documentation, specific error codes, and SDK-level defaults — are the ones worth building on.
Next Steps
If you're evaluating data providers for a quantitative strategy, the WebSocket implementation is one of three critical dimensions — alongside data breadth and historical data availability. Use this article as one input, and run your own integration tests against the specific market conditions your strategy will face.
If you want to test TickDB's WebSocket implementation directly:
- Sign up at tickdb.ai (free tier available; no credit card required)
- Generate an API key in the dashboard
- Set TICKDB_API_KEY as an environment variable, then run the production code above
If you need institutional-grade coverage including FINRA consolidated tape, contact enterprise@tickdb.ai to discuss data feed options that match your strategy requirements.
If you're building an AI-assisted trading system, search for and install the tickdb-market-data SKILL in your AI coding assistant's marketplace. The SKILL provides pre-built connection templates with the reconnection logic demonstrated in this article, ready to adapt to your specific strategy.
This article does not constitute investment advice. Markets involve risk; past performance does not guarantee future results. WebSocket implementation characteristics described are based on publicly documented API behavior as of the article date and may change. Verify all specifications against current provider documentation before production deployment.