Process Guardians for Quant Systems: supervisor and systemd for Crash Recovery and Auto-Restart

A production-grade trading strategy means nothing if it sits dead on a server at 3 AM.

In 2019, a well-capitalized systematic fund lost an estimated $400,000 in a single overnight session because their futures execution daemon crashed and nobody was watching. The process died. The server kept running. The strategy went dark. By the time the team arrived in the morning, the market had moved, and their positions had drifted into a 4% drawdown with no hedge in place. The strategy itself was sound. The infrastructure around it was not.

This is the gap this article addresses. Whether you run a single Python strategy connected to TickDB's WebSocket API or a full portfolio of microservices across multiple venues, you need two guarantees: if a process crashes, it is restarted automatically, and if the server reboots, your strategy comes back online before the market opens. Two standard tools cover both. systemd ships as the init system on most modern Linux distributions, and supervisord is a single package install away; neither requires a third-party monitoring agent.

This article walks through both systems from a quant engineer's perspective, with production-ready configurations, health check scripts, and the edge cases that separate a working demo from a system you can trust with real capital.

The Problem Space: What Kills Trading Processes

Before reaching for a solution, it is worth cataloging the failure modes. Trading processes die for reasons that fall into three categories.

Memory leaks and resource exhaustion. Python strategies running in tight loops accumulate garbage in pandas DataFrames or NumPy arrays held in memory. A process that runs fine for six hours and dies at hour seven has a slow leak. supervisor or systemd alone cannot fix a memory leak, but they can ensure the process is restarted before it consumes the swap and destabilizes the host.

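Signal-based termination. Unix processes receive signals from the kernel and from other processes. SIGTERM is a polite shutdown request that a process can catch and handle. SIGKILL cannot be caught; the process is removed immediately. The kernel's OOM killer uses SIGKILL, while service managers and orchestration tools usually send SIGTERM first. If your process is killed by the OOM killer, you want it restarted and you want an alert, because the restart alone will not fix the memory pressure. Both supervisor and systemd let you choose which signal is sent on stop, how long to wait before escalating to SIGKILL, and which exit statuses trigger a restart.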
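To make the distinction concrete, here is a minimal sketch of how a wrapper or monitor written in Python might classify a child's exit status. It relies on the POSIX convention that subprocess reports death by signal as a negative return code; the function name describe_exit is hypothetical.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Classify an exit status as reported by subprocess.Popen.returncode.

    On POSIX, a child killed by a signal is reported as the negative
    signal number (-9 for SIGKILL, -15 for SIGTERM).
    """
    if returncode == 0:
        return "clean exit"
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number not known to this platform's signal module
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"error exit (code {returncode})"
```

A monitor could log this string next to each restart, which makes it easy to tell an OOM kill (SIGKILL) apart from an unhandled exception (exit code 1) when reviewing an incident.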

External dependency failure. A strategy that depends on a WebSocket feed from TickDB, a database connection, or a FIX gateway will stall if that dependency goes away. The strategy process itself is healthy; it is waiting on I/O that will never arrive. Health checks that only verify "is the process running" miss this entirely. You need checks that verify "is the process doing what it is supposed to be doing."
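One way to turn "is the process doing what it is supposed to be doing" into something checkable is to track how long the data feed has been silent. The sketch below is an illustration only: the class name FeedLiveness and the threshold are assumptions, and the strategy would need to call mark_message() from its own message handler.

```python
import time


class FeedLiveness:
    """Tracks how long it has been since the last market-data message."""

    def __init__(self, max_silence_s: float, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_msg = clock()

    def mark_message(self) -> None:
        """Call this from the feed handler every time a message arrives."""
        self._last_msg = self._clock()

    def seconds_silent(self) -> float:
        return self._clock() - self._last_msg

    def is_stale(self) -> bool:
        """True when the feed has been silent longer than the threshold."""
        return self.seconds_silent() > self.max_silence_s
```

A health endpoint or watchdog heartbeat can then report unhealthy whenever is_stale() returns True during market hours, instead of merely confirming that the process exists.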

Part 1: supervisord — User-Space Process Management

supervisord is the right tool when you need cross-user process management, a web-based monitoring interface, or a configuration that travels with a specific application rather than the operating system. It is installed via a package manager and configured through INI-style files that most developers find immediately readable.

Installing and Initializing

On Debian/Ubuntu:

sudo apt-get update
sudo apt-get install -y supervisor

On RHEL/CentOS (the package comes from the EPEL repository):

sudo yum install -y epel-release
sudo yum install -y supervisor

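After installation on Debian/Ubuntu, the supervisor configuration lives in /etc/supervisor/. The main configuration file is supervisord.conf. Per-program files live in the conf.d/ directory. (RHEL-family packages typically use /etc/supervisord.conf and /etc/supervisord.d/*.ini instead.) A program definition looks like this: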

# /etc/supervisor/conf.d/trading-strategy.conf
[program:trading-strategy]
command=/opt/strategies/bin/run_strategy.sh
directory=/opt/strategies
user=quant
autostart=true
autorestart=true
startretries=5
exitcodes=0            ; "expected" exit codes; only consulted when autorestart=unexpected
stderr_logfile=/var/log/supervisor/trading-strategy.err.log
stdout_logfile=/var/log/supervisor/trading-strategy.out.log
stdout_logfile_maxbytes=50MB
stdout_logfile_backups=5

Decoding the Critical Settings

autorestart=true tells supervisor to restart the program whenever it exits, regardless of exit code. startretries=5 governs a different situation: a process that dies before it has stayed up for startsecs seconds (1 second by default) counts as a failed start, and after five consecutive failed starts supervisor gives up and marks the program FATAL. This is intentional: an endless loop of failed starts is worse than a dead process, because it generates enormous log volume and can hammer upstream APIs. A process that reaches the RUNNING state and crashes later is restarted without counting against startretries.

exitcodes is a subtle but important detail. It lists the exit codes supervisor considers expected, and it is only consulted when autorestart=unexpected; with autorestart=true it has no effect. Exit code 0 conventionally means the process exited cleanly: you called sys.exit(0) or the script completed. An unhandled Python exception exits with code 1. If you want supervisor to restart crashes but leave a deliberate shutdown alone, set autorestart=unexpected, have your shutdown handler exit with 0 (or another code you reserve for that purpose), and list only those codes in exitcodes.
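As a mental model, the restart decision supervisor documents for a process that was RUNNING and then exited can be written down in a few lines. This is a sketch for reasoning about the settings, not supervisor's source code; the function name should_restart is hypothetical.

```python
def should_restart(autorestart: str, exit_code: int, expected_codes=(0,)) -> bool:
    """Model of the autorestart policy for a process that exited after RUNNING.

    autorestart: "true", "false", or "unexpected" (the values supervisor accepts)
    expected_codes: the codes listed in the exitcodes= setting
    """
    if autorestart == "true":
        return True                       # always restart; exitcodes is ignored
    if autorestart == "false":
        return False                      # never restart automatically
    if autorestart == "unexpected":
        return exit_code not in expected_codes
    raise ValueError(f"unknown autorestart value: {autorestart!r}")
```

Under this model, a strategy that exits with code 0 after a deliberate shutdown stays down with autorestart=unexpected, while the same strategy dying with code 1 is restarted.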

Health Checks Beyond "Is It Running"

The configuration above will restart a crashed process. It will not restart a process that is alive but hung. To handle that case, you need a health check that runs outside the strategy: either an external script, or a supervisor event listener such as httpok from the superlance package, which polls a URL and restarts the program when the check fails.

A practical approach for a trading strategy is to have the strategy itself expose a lightweight HTTP health endpoint and write a monitor script that runs separately from it, via cron or as its own supervisor program:

#!/bin/bash
# /opt/strategies/bin/health_check.sh
# Run externally via a cron job or a separate supervisor program
# that sends alerts when the main strategy is unhealthy.

STRATEGY_HOST="localhost"
STRATEGY_PORT="8080"
SLACK_WEBHOOK="${SLACK_ALERT_WEBHOOK:-}"

response=$(curl -s -o /dev/null -w "%{http_code}" \
  --connect-timeout 3 \
  --max-time 5 \
  "http://${STRATEGY_HOST}:${STRATEGY_PORT}/health")

if [ "$response" != "200" ]; then
    timestamp=$(date '+%Y-%m-%d %H:%M:%S %Z')
    message="[ALERT] Trading strategy health check failed at ${timestamp}. HTTP status: ${response}"

    # Log locally
    echo "$message" >> /var/log/strategies/health_alerts.log

    # Send Slack alert if webhook is configured
    if [ -n "$SLACK_WEBHOOK" ]; then
        curl -s -X POST "$SLACK_WEBHOOK" \
          -H 'Content-type: application/json' \
          --data "{\"text\": \"${message}\"}" \
          --max-time 10
    fi

    # Signal supervisor to restart the process
    supervisorctl restart trading-strategy
fi

Run this check every 60 seconds via cron:

# /etc/cron.d/trading-health-check
* * * * * root /opt/strategies/bin/health_check.sh >> /var/log/strategies/health_check.log 2>&1

The health endpoint in your strategy should return meaningful diagnostics, not just a 200 OK:

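# Lightweight health endpoint: embed in your main strategy process
import json
import os
import time
from http.server import BaseHTTPRequestHandler

import psutil  # third-party package: pip install psutil


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            pid = os.getpid()
            proc = psutil.Process(pid)
            memory_mb = proc.memory_info().rss / (1024 * 1024)
            # psutil 6.0 renamed Process.connections() to net_connections()
            list_conns = getattr(proc, "net_connections", None) or proc.connections

            health = {
                # Static placeholder: in production, derive this from real
                # signals such as the time since the last market-data tick
                "status": "healthy",
                "pid": pid,
                "memory_mb": round(memory_mb, 2),
                "open_connections": len(list_conns()),
                "started_at": proc.create_time(),  # process start, epoch seconds
                "timestamp": time.time(),          # time of this check
            }

            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps(health).encode())
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        pass  # Suppress HTTP log noise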
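The handler still needs a server to run in, and the status should reflect real health rather than a constant. The following sketch shows one way to do both using only the standard library: it serves /health from a daemon thread and returns 503 when a caller-supplied check fails, which the curl-based monitor above treats as a failure. The names make_handler, start_health_server, and is_healthy are assumptions for illustration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(is_healthy):
    """Build a handler class whose /health status comes from a callable."""

    class _Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_response(404)
                self.end_headers()
                return
            ok = bool(is_healthy())
            body = json.dumps({"status": "healthy" if ok else "unhealthy"}).encode()
            # 503 lets status-code-only monitors detect a hung strategy
            self.send_response(200 if ok else 503)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the strategy log free of HTTP noise

    return _Handler


def start_health_server(is_healthy, host="127.0.0.1", port=8080):
    """Serve /health in a daemon thread so it never blocks the strategy."""
    server = ThreadingHTTPServer((host, port), make_handler(is_healthy))
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

The strategy would call start_health_server once at startup and pass a function that checks, for example, how long the data feed has been silent. Binding to 127.0.0.1 keeps the endpoint off the network.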

Running Multiple Strategies Under One Supervisor Instance

If you run several strategies on the same host — a mean-reversion strategy, an arbitrage monitor, and a volatility surface tracker — supervisor makes it straightforward to manage them as a group:

# /etc/supervisor/conf.d/strategies-group.conf
[group:quant-strategies]
programs=mean-reversion,arb-monitor,vol-surface

[program:mean-reversion]
command=/opt/strategies/mean_reversion/run.sh
directory=/opt/strategies/mean_reversion
user=quant
autostart=true
autorestart=true
startretries=3
exitcodes=0,2

[program:arb-monitor]
command=/opt/strategies/arb_monitor/run.sh
directory=/opt/strategies/arb_monitor
user=quant
autostart=true
autorestart=true
startretries=3
exitcodes=0,2

[program:vol-surface]
command=/opt/strategies/vol_surface/run.sh
directory=/opt/strategies/vol_surface
user=quant
autostart=true
autorestart=true
startretries=3
exitcodes=0,2

With a group defined, you can control all three strategies with a single command:

supervisorctl restart quant-strategies:
supervisorctl status quant-strategies:

The group restart is not atomic — supervisor restarts each program sequentially with a short delay between them. For a trading system, this is often acceptable because you typically want strategies to come online one at a time to avoid thundering-herd load on your data feeds.

The supervisor Web Interface

supervisord ships with an XML-RPC interface and a built-in web UI. Enable the web UI by adding this to your supervisord.conf:

[inet_http_server]
port=127.0.0.1:9001    ; loopback only; reach it through an SSH tunnel or VPN
username=quantadmin
password=SecurePasswordHere

[supervisorctl]
serverurl=unix:///var/run/supervisor.sock

The web interface gives you a live view of process state, stdout/stderr tailing, and one-click restart and log download. In a production environment, bind this to localhost only or protect it behind a VPN, because it exposes process control to anyone who can reach port 9001.

Part 2: systemd — The Operating System Layer

systemd is not just an init system. It is a full service management framework that runs as PID 1, the first process the kernel starts and the ancestor of everything else on the host. When your server boots, systemd is what brings the system to a usable state. When a service under its control dies, systemd decides what happens next. For any process that must survive a reboot, run as a system service, or participate in boot-order dependencies, systemd is the right tool.

Writing a systemd Unit File

systemd unit files live in /etc/systemd/system/ for locally defined services and in /lib/systemd/system/ (or /usr/lib/systemd/system/) for units shipped by installed packages. A trading strategy unit file looks like this:

# /etc/systemd/system/quant-trading-strategy.service
[Unit]
Description=Quantitative Trading Strategy — Mean Reversion
Documentation=https://internal-docs/trading-strategy
After=network-online.target postgresql.service
Wants=network-online.target postgresql.service

[Service]
Type=simple
User=quant
Group=quant
WorkingDirectory=/opt/strategies/mean_reversion
Environment="PYTHONPATH=/opt/strategies/lib"
Environment="TICKDB_API_KEY_FILE=/run/secrets/tickdb_api_key"
ExecStartPre=/bin/sleep 5
ExecStart=/opt/strategies/mean_reversion/run.sh
Restart=on-failure
RestartSec=10
TimeoutStartSec=30
TimeoutStopSec=60
StandardOutput=journal
StandardError=journal
SyslogIdentifier=mean-reversion-strategy
# Hard memory ceiling: the service is OOM-killed above this
MemoryMax=2G
# Throttle CPU so a runaway loop cannot starve the rest of the host
CPUQuota=80%

# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
# Allow writing to the strategy log directory only
ReadWritePaths=/opt/strategies/mean_reversion/logs

[Install]
WantedBy=multi-user.target

Understanding the Critical Directives

After=network-online.target postgresql.service is one of the most important directives for any trading service. A strategy that connects to TickDB's WebSocket API needs the network. A strategy that queries a local database needs the database. Without After=, systemd may start your strategy before its dependencies are ready, and you will spend your morning debugging connection refused errors that vanish when you simply restart the service after everything else has come up. Note that After= only controls ordering; the matching Wants= line is what pulls the dependencies into the boot transaction in the first place.

Restart=on-failure restarts the service after any unclean termination: a non-zero exit code, death by an unclean signal such as SIGKILL or SIGSEGV, a timeout, or a missed watchdog ping. This is the closest systemd equivalent of autorestart=unexpected in supervisor. on-abnormal is more restrictive: it covers signals, timeouts, and the watchdog but ignores non-zero exit codes, so a strategy that dies from an unhandled Python exception (exit code 1) would stay down. Restart=always goes further and also restarts after a clean exit.

TimeoutStartSec=30 and TimeoutStopSec=60 deserve attention. The start timeout prevents a hung process from blocking the boot sequence. The stop timeout is critical for trading systems: if your strategy has open positions when systemd sends SIGTERM, you need time for a graceful shutdown that closes positions or at least logs the state. Sixty seconds is a reasonable floor. If your shutdown handler needs longer, increase it — the cost is a slower shutdown, which is acceptable. The cost of a forced SIGKILL while positions are open is not.
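The stop timeout only helps if the strategy actually reacts to SIGTERM. A minimal sketch of that pattern follows, with the signal handler doing nothing but setting a flag and the main loop performing the cleanup. ShutdownFlag, run_until_stopped, and the step and flatten callables are hypothetical names standing in for your own trading loop and position-closing logic.

```python
import signal
import threading


class ShutdownFlag:
    """Records that a stop was requested; safe to set from a signal handler."""

    def __init__(self):
        self._event = threading.Event()

    def request(self, signum=None, frame=None):
        # Signature matches what signal.signal() expects from a handler
        self._event.set()

    def requested(self) -> bool:
        return self._event.is_set()

    def install(self, signals=(signal.SIGTERM, signal.SIGINT)):
        """Register the flag as handler; must be called from the main thread."""
        for sig in signals:
            signal.signal(sig, self.request)


def run_until_stopped(flag, step, flatten, max_steps=None):
    """Call step() until a stop is requested, then call flatten() once.

    Returns the number of steps executed. max_steps exists for testing.
    """
    steps = 0
    while not flag.requested() and (max_steps is None or steps < max_steps):
        step()
        steps += 1
    flatten()  # close or record positions before the process exits
    return steps
```

Keeping the handler trivial matters: the heavy work (cancelling orders, persisting state) runs in the normal control flow, and it has to finish within TimeoutStopSec before systemd escalates to SIGKILL.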

Health Checks with systemd

systemd has no built-in HTTP-style health check; ExecStartPre and ExecStartPost run only around startup. For a long-running trading strategy, the most reliable pattern is the watchdog timer. When WatchdogSec= is set, systemd expects the service to send a WATCHDOG=1 message through sd_notify at least once per interval. If the heartbeat stops, systemd treats the service as failed, terminates it, and restarts it according to the Restart= policy.

The heartbeat must come from the strategy's own main loop. A separate helper process would keep pinging while the strategy itself is hung, and with NotifyAccess=main systemd ignores messages from any process other than the main one. A small module the strategy can import:

#!/usr/bin/env python3
# /opt/strategies/bin/watchdog_heartbeat.py
# Import this from the main strategy and call ping_watchdog() once per
# loop iteration, so that a hung loop stops the heartbeat.

import os
import socket


def watchdog_interval_s(default_usec=30_000_000):
    """Recommended ping period in seconds: half of WATCHDOG_USEC (set by systemd)."""
    return int(os.environ.get("WATCHDOG_USEC", default_usec)) / 2_000_000


def ping_watchdog():
    """Send WATCHDOG=1 to systemd via the notify socket. Returns True if sent."""
    notify_socket = os.environ.get("NOTIFY_SOCKET")
    if not notify_socket:
        return False  # not running under systemd with notify access
    if notify_socket.startswith("@"):
        notify_socket = "\0" + notify_socket[1:]  # abstract socket namespace
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
            sock.sendto(b"WATCHDOG=1", notify_socket)
        return True
    except OSError:
        return False  # non-fatal: a missed ping must not crash the strategy


# In the strategy's main loop (process_next_tick is a placeholder for your own logic;
# make sure it returns at least every watchdog_interval_s() seconds, even in quiet markets):
#
#     while running:
#         process_next_tick()
#         ping_watchdog()

Enable watchdog in the unit file:

[Service]
WatchdogSec=30
NotifyAccess=main

Now systemd actively monitors your strategy. If the strategy process hangs and the watchdog heartbeat stops, systemd restarts it within 30 seconds — without requiring an external cron job or separate monitor process.

Reloading and Enabling the Service

After writing or modifying a unit file:

sudo systemctl daemon-reload
sudo systemctl enable quant-trading-strategy.service  # Start on boot
sudo systemctl start quant-trading-strategy.service   # Start now
sudo systemctl status quant-trading-strategy.service # Verify

To view real-time logs from the strategy, use journalctl:

# Tail logs for a specific service
journalctl -u quant-trading-strategy.service -f

# View logs since the last boot (-b selects the current boot, not the last service restart)
journalctl -u quant-trading-strategy.service -b

# Search for errors
journalctl -u quant-trading-strategy.service -p err

journalctl is one of systemd's strongest advantages over supervisor. It captures logs in a structured, queryable format with timestamps, process IDs, and syslog identifiers — invaluable when debugging a crash that happened at 2 AM three days ago.

Part 3: Choosing Between supervisor and systemd

The decision between supervisor and systemd is not either/or for most quant infrastructure. The right approach is to use systemd for host-level services and process survival, and supervisor for application-level process grouping and the web UI.

Dimension	supervisord	systemd
Scope	Per-application	Per-host
Boot integration	Indirect (supervisord itself must be started by the init system)	Yes (it is the init system)
Web UI	Yes (built-in)	No (requires external tools like cockpit)
Health check	DIY (external scripts or event listeners)	Native watchdog via sd_notify
Log rotation	Built-in, size-based (stdout_logfile_maxbytes, stdout_logfile_backups)	Automatic (journald)
Cross-user management	Built-in (user= per program when supervisord runs as root)	User= per unit; per-user instances possible but awkward
Dependency management	Start ordering only (priority=)	Native (After=, Wants=, Requires=)
Process groups	Built-in (group directive)	Via slice and scope units
Restart policy granularity	Per-program	Per-unit

A practical architecture looks like this: systemd manages the host-level services — your network mount, your database, your cron jobs. supervisor manages the application-level strategies on top of those services. When the server reboots, systemd brings the database online first, then supervisor starts, and supervisor starts each strategy. If a strategy crashes, supervisor handles the restart immediately. If the entire supervisor process dies, systemd restarts it.

Boot Order Verification

Regardless of which tool you use, verify your startup sequence before going live. The test is simple: reboot the server and watch the sequence.

# Log the startup sequence with systemd-analyze
systemd-analyze plot > boot_sequence.svg        # Visual timeline
systemd-analyze critical-chain                 # List services on the critical path

If your strategy starts before the network is ready, you will see connection errors in the first few seconds of the log. Adjust the After= directive in the systemd unit or the startup delay in the supervisor configuration until the sequence is clean.

Part 4: Production Configuration Patterns

Restart Throttling — Preventing the Crash Loop

A process that crashes and restarts immediately in a tight loop can cause cascading problems: log file exhaustion, API rate limit violations from reconnected clients hammering the same endpoint, and position state corruption if the restart happens mid-execution. Both tools can limit this, in different ways.

In supervisor:

startretries=5
autorestart=true
; supervisor waits a little longer before each successive failed start
; (the BACKOFF state) and marks the program FATAL once startretries is
; exhausted. A process that has reached RUNNING is restarted immediately
; when it exits, so add a delay in a wrapper script if you need one.
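The delay a wrapper script would apply between relaunches can be sketched as a capped exponential backoff with jitter. This is an illustration under stated assumptions, not a supervisor feature: the function name and the default parameters (1 second base, doubling, 60 second cap, 10% jitter) are arbitrary choices.

```python
import random


def backoff_delays(base_s=1.0, factor=2.0, cap_s=60.0, jitter=0.1, rng=random.random):
    """Yield an endless sequence of restart delays, in seconds.

    Delays grow geometrically from base_s up to cap_s. Each value is
    scaled by a random factor in [1 - jitter, 1 + jitter] so that several
    strategies restarting together do not reconnect in lockstep.
    """
    delay = base_s
    while True:
        spread = 1.0 + jitter * (2.0 * rng() - 1.0)
        yield min(delay, cap_s) * spread
        delay = min(delay * factor, cap_s)
```

A wrapper would call time.sleep(next(delays)) before each relaunch and create a fresh generator once the strategy has run cleanly for long enough to count as recovered.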

In systemd:

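# Both settings belong in the [Unit] section; comments must sit on their own lines
# Count start attempts over a 5-minute window
StartLimitIntervalSec=300
# Allow at most 3 starts within that window
StartLimitBurst=3

With StartLimitBurst=3, systemd allows three starts within 300 seconds. The next attempt inside that window is refused and the unit enters the failed state, where it stays until someone starts it manually or runs systemctl reset-failed. This stops the crash loop. Pair it with an OnFailure= unit or an external alert so that someone actually gets paged.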
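To build intuition for how the interval and burst settings interact, the limit can be modeled as a sliding window over recent start times. This is an approximation for reasoning about the settings; systemd's internal accounting may differ in detail, and the class name StartLimiter is hypothetical.

```python
from collections import deque


class StartLimiter:
    """Sliding-window model of an 'at most burst starts per interval' rule."""

    def __init__(self, interval_s: float, burst: int):
        self.interval_s = interval_s
        self.burst = burst
        self._starts = deque()  # timestamps of starts still inside the window

    def allow_start(self, now_s: float) -> bool:
        """Return True and record the start if the limit has not been hit."""
        # Forget starts that have aged out of the window
        while self._starts and now_s - self._starts[0] >= self.interval_s:
            self._starts.popleft()
        if len(self._starts) >= self.burst:
            return False
        self._starts.append(now_s)
        return True
```

With interval_s=300 and burst=3, starts at 0, 20, and 40 seconds are allowed and a fourth at 60 seconds is refused, which matches the behavior described above for a process that crashes shortly after every start.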

Resource Limits

For a trading strategy, resource limits are a safety net, not a performance tuning tool. Set them conservatively:

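# systemd ([Service] section; comments must sit on their own lines)
# Hard memory ceiling of 2 GB: the service is OOM-killed above it, which catches leaks
MemoryMax=2G
# Limit thread/process count
TasksMax=64
# Raise the file-descriptor limit to prevent fd exhaustion
LimitNOFILE=65536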

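For supervisor, child processes inherit their limits from supervisord itself. /etc/security/limits.conf is applied by PAM to login sessions and does not affect daemons, so raise the limits in the [supervisord] section of supervisord.conf instead. supervisor has no setting for memlock; if you need it and systemd starts supervisord, set LimitMEMLOCK= in that unit.

# /etc/supervisor/supervisord.conf
[supervisord]
minfds=65536      ; raise the file-descriptor limit before starting children
minprocs=200      ; minimum number of process slots that must be available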

Integration with TickDB

When your trading strategy consumes real-time market data from TickDB's WebSocket API, the restart sequence has a specific ordering requirement: the network must be established, then the WebSocket connection must be established, and only then should the strategy's trading logic begin processing orders. If the strategy restarts and immediately starts submitting orders before the data feed is confirmed, it operates blind for the first few seconds — which is precisely when microstructure volatility is highest.

A robust startup script that verifies the TickDB connection before starting the strategy:

#!/bin/bash
# /opt/strategies/bin/run_strategy.sh
# Wraps the strategy startup with a connection verification step.

set -e

TICKDB_API_KEY=$(cat /run/secrets/tickdb_api_key)
STRATEGY_SCRIPT="/opt/strategies/mean_reversion/main.py"

# Wait for network
until ping -c1 -W2 api.tickdb.ai &>/dev/null; do
    echo "[$(date)] Waiting for network..."
    sleep 5
done

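# Verify TickDB API key is valid before starting strategy.
# "|| true" keeps set -e from aborting the script when curl itself fails,
# so the FATAL message below is still printed (curl then reports status 000).
echo "[$(date)] Verifying TickDB API connection..."
http_code=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "X-API-Key: ${TICKDB_API_KEY}" \
    --connect-timeout 5 \
    --max-time 10 \
    "https://api.tickdb.ai/v1/symbols/available" || true)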

if [ "$http_code" != "200" ]; then
    echo "[$(date)] FATAL: TickDB API unreachable. HTTP code: ${http_code}. Strategy not starting."
    exit 1
fi

echo "[$(date)] TickDB connection verified. Starting strategy."
exec python3 "$STRATEGY_SCRIPT"

This script lives in ExecStart= for systemd or command= for supervisor. The API key verification adds approximately 1–2 seconds to the startup delay, which is negligible compared to the cost of a strategy running without a confirmed data connection.
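The same ordering rule applies inside the strategy process: order submission should stay blocked until the feed has delivered data, and should be blocked again if the feed drops. One possible shape for that gate is sketched below. The class DataReadyGate and the send callable are assumptions for illustration and are not part of any TickDB client.

```python
import threading


class DataReadyGate:
    """Blocks order submission until the market-data feed is confirmed."""

    def __init__(self):
        self._ready = threading.Event()

    def mark_ready(self) -> None:
        """Call when the first message arrives after (re)connecting."""
        self._ready.set()

    def mark_lost(self) -> None:
        """Call when the feed disconnects or goes stale."""
        self._ready.clear()

    def wait_until_ready(self, timeout_s=None) -> bool:
        """Block until the feed is confirmed; False if the timeout expires."""
        return self._ready.wait(timeout_s)

    def submit_order(self, send, order):
        """Pass the order to send() only while the feed is confirmed."""
        if not self._ready.is_set():
            raise RuntimeError("market data feed not confirmed; order blocked")
        return send(order)
```

Routing every order through submit_order means a restart cannot trade blind: until the feed handler calls mark_ready(), any attempt to send an order fails loudly instead of reaching the venue.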

Part 5: Alerting and Monitoring

Restarting a crashed process is necessary but not sufficient. You need to know when a restart happened, why it happened, and whether the underlying cause is resolved. A complete monitoring setup has three layers.

Layer 1 — Process state monitoring. supervisorctl status or systemctl status tells you whether the process is running. This is the baseline. If the process is not running, something is wrong.

Layer 2 — Health monitoring. The health check script or watchdog timer tells you whether the process is healthy. This catches the hung process that process monitoring misses.

Layer 3 — Alert routing. Any health check failure that triggers a restart should also send a notification. The Slack webhook in the health check script above is one approach. A more robust setup uses alertmanager or PagerDuty with deduplication — so that if the process restarts three times in five minutes, you get one page, not three.
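If you are not yet running a tool that deduplicates for you, the core idea is small enough to sketch: remember when each alert key was last sent and suppress repeats inside a window. The class name AlertDeduplicator and the choice of key are assumptions for illustration.

```python
class AlertDeduplicator:
    """Suppresses repeat alerts for the same key within a time window."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._last_sent = {}  # alert key -> timestamp of the last alert sent

    def should_send(self, key: str, now_s: float) -> bool:
        """Return True if an alert for this key should go out now."""
        last = self._last_sent.get(key)
        if last is not None and now_s - last < self.window_s:
            return False  # already alerted recently; stay quiet
        self._last_sent[key] = now_s
        return True
```

With a 300 second window and the strategy name as the key, three restarts in five minutes produce one notification, while a different strategy failing in the same period still alerts on its own.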

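A practical alert rule for Prometheus-compatible monitoring, assuming the strategy is scraped under job="trading-strategy" and exports a restart counter (process_restart_total is an example name, not a standard metric):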

# alert_rules.yml — for Prometheus + Alertmanager setup
groups:
  - name: quant_trading
    rules:
      - alert: StrategyDown
        expr: up{job="trading-strategy"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Trading strategy {{ $labels.instance }} is down"

      - alert: StrategyRestartSpree
        expr: changes(process_restart_total{job="trading-strategy"}[5m]) > 2
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Strategy restarted {{ $value }} times in 5 minutes"

Closing

The $400,000 loss from that 2019 incident was not caused by a bad alpha signal. It was caused by the absence of a process guardian. A single autorestart=true directive in a supervisor configuration, or a three-line addition to a systemd unit file, would have caught the crash within seconds and brought the strategy back online before the market moved far enough to matter.

Process supervision is infrastructure plumbing, not a research project. It does not improve your Sharpe ratio. It does not generate alpha. What it does is ensure that the Sharpe ratio you have calculated in your backtest is the Sharpe ratio your capital experiences in production. That guarantee is worth the two hours it takes to configure it properly.

Next Steps

If you are running a single strategy on a VPS or cloud instance, start with a systemd unit file. It survives reboots, integrates with the OS log system, and requires no additional packages.

If you run multiple strategies or need a web UI for your team, layer supervisord on top of systemd. The group management and process-level stdout capture are worth the additional complexity.

If you use AI coding assistants, search for and install the tickdb-market-data SKILL in your ClawHub-compatible AI tool to accelerate the integration between your strategy and TickDB's real-time data feeds.

If you need enterprise-grade historical data for backtesting across a multi-year lookback window, reach out to enterprise@tickdb.ai for institutional data plans that include 10+ years of cleaned US equity OHLCV data aligned across venues.

This article does not constitute investment advice. Markets involve risk; past performance does not guarantee future results. Trading strategies can incur losses, and no automated monitoring system eliminates the need for human oversight.