Process Guardians for Quant Systems: Auto-Restart, Health Checks, and Server Reboot Recovery

The 3 AM Wake-Up Call That Destroys Portfolios

At 3:47 AM on a Tuesday, a hardware thermal event triggers a kernel panic on your trading server. Your mean-reversion strategy, which was purring along with a 1.34 Sharpe ratio in live trading, dies silently. No alerts fire. No reconnection happens. By the time you check your dashboard at 7:30 AM, the market has moved, your positions have drifted, and your paper profit has evaporated — or worse, turned into a real loss.

This is not a hypothetical. It is the scenario that separates profitable quant strategies from costly experiments. The code itself may be flawless. The alpha may be real. But infrastructure without process guardianship is a time bomb with a random fuse.

This article is a production-grade guide to building an indestructible foundation for your trading strategies. We will cover three layers of defense:

Supervisor — a lightweight process manager for application-level restarts.
systemd — the Linux init system for boot recovery and system-level supervision.
Custom health check scripts — your custom guardrails for strategy-specific monitoring.

By the end, your trading processes will survive crashes, network hiccups, and server reboots without manual intervention.

The Supervision Stack: A Layered Defense Architecture

Before writing a single configuration line, you need to understand the architecture. The supervision stack is not a single tool — it is a layered system where each layer has a different responsibility and different failure scope.

┌─────────────────────────────────────────────┐
│  Layer 3: Health Check Scripts              │
│  Strategy-specific monitoring               │
│  Custom alerts + recovery logic             │
├─────────────────────────────────────────────┤
│  Layer 2: Process Supervisor (supervisor)   │
│  Per-process lifecycle management           │
│  Fast restarts, log rotation, stdout/stderr │
├─────────────────────────────────────────────┤
│  Layer 1: System Init (systemd)             │
│  Boot recovery, system-level resource mgmt  │
│  Process 1 on modern Linux                  │
├─────────────────────────────────────────────┤
│  Layer 0: Kernel + Hardware                 │
│  The foundation — you cannot supervise this │
└─────────────────────────────────────────────┘

Each layer catches failures at different timescales:

Layer	Restart latency	Scope	Best for
systemd	~5–30 sec (network-dependent)	System-wide, boot-triggered	Recovering from crashes, server reboots
supervisor	<1 sec	Per-process, fast	Crash recovery, config reload, log management
Health scripts	Configurable (10 sec–5 min)	Application logic	Strategy-specific validation, custom alerts

The key insight: do not rely on a single layer. Use systemd as the anchor (it is always running), supervisor as the workhorse (it is fast and flexible), and health scripts as the intelligence layer (they know what your strategy should look like when healthy).

Layer 1: systemd — The Boot Guardian

systemd is the init system on nearly all major Linux distributions. It is the first process the kernel starts after booting (PID 1), and it remains running for the entire lifetime of the system. This makes it the ideal anchor for process recovery: in the layered setup, systemd only has to step in when supervisord itself has died, which is a much rarer event than a simple application crash.

1.1 Why systemd for Trading Systems?

Three reasons make systemd the correct choice for trading system recovery:

Boot recovery: If the server reboots (planned maintenance or power failure), systemd starts your processes automatically. No manual SSH required.
Resource isolation: You can set CPU affinity, memory limits, and I/O priorities per service. This prevents a runaway backtest process from starving your live strategy.
Dependency management: You can declare that your trading service depends on the network being available, preventing the service from starting before the network is ready.

1.2 Writing a systemd Unit File for Your Strategy

Create a unit file at /etc/systemd/system/quant-strategy@.service:

[Unit]
Description=Quant Strategy: %i
Documentation=https://internal-docs/strategy-%i
After=network-online.target time-sync.target
Wants=network-online.target time-sync.target

# Wait for clock synchronization before starting (critical for
# time-sensitive trading). time-sync.target only blocks until the clock is
# actually synchronized if a wait service such as
# systemd-time-wait-sync.service or chrony-wait.service is enabled.

# Crash-loop protection (these settings belong in [Unit]): if the service
# is started more than 5 times within 60 seconds, stop trying
# (possible bad config)
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
Type=simple

# The working directory (change per environment)
WorkingDirectory=/opt/strategies/%i/

# The actual command. Always use absolute paths.
# Never rely on PATH for systemd services.
ExecStart=/opt/strategies/%i/venv/bin/python /opt/strategies/%i/main.py

# Restart policy: restart on crash (non-zero exit, fatal signal, or
# watchdog timeout), but not after a clean exit
Restart=on-failure

# Wait 10 seconds between restart attempts
RestartSec=10

# Resource limits — prevent memory leaks from killing the system
MemoryMax=4G
MemoryHigh=3G

# CPU affinity: keep strategy on dedicated cores if possible
# CPUAffinity=2,3,4,5

# Environment variables — API keys live here, NOT in code
EnvironmentFile=/etc/strategies/%i/env.conf

# Logging: systemd captures stdout/stderr automatically
StandardOutput=journal
StandardError=journal

# Security: run as non-root
User=quant-user
Group=quant-group

# Allow core dumps for debugging (infinity = no size cap; use a byte value to limit)
LimitCORE=infinity

[Install]
WantedBy=multi-user.target

This unit file defines a reusable service template. The @ in the file name marks it as a template, and %i expands to the instance name given after the @, so you can run multiple instances with different names:

# Start the mean-reversion strategy
sudo systemctl start quant-strategy@mean-reversion

# Start the momentum strategy
sudo systemctl start quant-strategy@momentum

# Check status
sudo systemctl status quant-strategy@mean-reversion

# View logs
sudo journalctl -u quant-strategy@mean-reversion -f

1.3 The Environment File: Where API Keys Belong

Never hardcode API keys in your code or your unit file. Create an environment file at /etc/strategies/mean-reversion/env.conf:

# /etc/strategies/mean-reversion/env.conf
# API Keys — this file should be readable only by quant-user
TICKDB_API_KEY=tk_live_xxxxxxxxxxxxxxxxxxxxx
ALPACA_API_KEY=PKXXXXXXXXXXXXXXXXXX
ALPACA_SECRET_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Strategy parameters
MAX_POSITION_SIZE=10000
MAX_DAILY_LOSS=500
LOG_LEVEL=INFO

# Network configuration — critical for trading systems
GATEWAY_CHECK_URL=https://api.tickdb.ai/v1/health
GATEWAY_CHECK_INTERVAL=30

Set the correct permissions immediately:

sudo chmod 600 /etc/strategies/mean-reversion/env.conf
sudo chown quant-user:quant-group /etc/strategies/mean-reversion/env.conf
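
As an extra guard, the strategy could refuse to start when its environment file is readable by anyone other than the owner. A minimal Python sketch, assuming a hypothetical helper named is_private_file (it is not part of systemd or any library):

```python
import os
import stat


def is_private_file(path: str) -> bool:
    """Return True if no group/other permission bits are set (e.g. mode 600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # 0o077 masks the group and other read/write/execute bits
    return mode & 0o077 == 0
```

Calling is_private_file("/etc/strategies/mean-reversion/env.conf") at startup and exiting on False would turn a permissions mistake into an immediate, visible failure.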

1.4 Boot Recovery: Making Sure Your Strategy Starts After Reboot

The [Install] section with WantedBy=multi-user.target ensures the service starts at boot. To enable it:

# Enable auto-start at boot
sudo systemctl enable quant-strategy@mean-reversion

# Verify it is enabled
sudo systemctl is-enabled quant-strategy@mean-reversion
# Output: enabled

# If you need to reboot the server safely, use this sequence
sudo systemctl reboot
# systemd will stop services gracefully, then reboot

1.5 The Watchdog Mechanism: Detecting Silent Deaths

systemd supports a hardware or software watchdog. If your process fails to send a heartbeat within the timeout, systemd assumes it is hung and restarts it.

# Add to [Service] section
WatchdogSec=30
Restart=on-failure

In your strategy code, you need to send watchdog notifications:

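import os
import socket

def send_watchdog_heartbeat():
    """Send a watchdog keep-alive to systemd.

    systemd passes the address of a Unix datagram socket in NOTIFY_SOCKET;
    sending "WATCHDOG=1" to it is what sd_notify() does under the hood.
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return  # Not started by systemd with notifications enabled
    if addr.startswith("@"):
        addr = "\0" + addr[1:]  # Abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"WATCHDOG=1", addr)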

Or use the proper Python interface:

import time

import sdnotify

def main():
    n = sdnotify.SystemdNotifier()
    n.notify("READY=1")
    
    while True:
        # Your trading logic here (process_signals is your own function)
        process_signals()
        time.sleep(10)
        # Send heartbeat every 10 seconds, well inside WatchdogSec=30
        n.notify("WATCHDOG=1")

Install the dependency:

pip install sdnotify
# Alternative: the systemd-python bindings, which provide systemd.daemon.notify()
pip install systemd-python
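
Rather than hardcoding the heartbeat period, the strategy can derive it from WATCHDOG_USEC, the environment variable systemd sets (in microseconds) when WatchdogSec= is configured. The sketch below follows the common convention of pinging at half the timeout; the function name watchdog_interval_seconds is an assumption made for illustration:

```python
import os


def watchdog_interval_seconds(environ=os.environ):
    """Return half the watchdog timeout in seconds, or None if no watchdog is set."""
    raw = environ.get("WATCHDOG_USEC")
    if not raw:
        return None  # WatchdogSec= not configured, or not running under systemd
    try:
        timeout_usec = int(raw)
    except ValueError:
        return None  # Malformed value; treat as "no watchdog"
    if timeout_usec <= 0:
        return None
    return timeout_usec / 1_000_000 / 2
```

With WatchdogSec=30 this yields 15.0, so the main loop could sleep for that long between notifications instead of a fixed 10 seconds.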

Layer 2: supervisor — The Application Workhorse

While systemd is the anchor, supervisor is the workhorse for application-level process management. Supervisor excels at:

Fast restarts (< 1 second), compared with the 10-second RestartSec delay configured in the systemd unit above.
Real-time log management with automatic rotation.
Program group management — start, stop, and restart multiple related processes as a unit.
XML-RPC interface for programmatic control, useful for deployment scripts.
Process status web UI for quick diagnostics.
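
As an illustration of the XML-RPC interface, a deployment script might ask supervisord which processes are not running. This is a sketch under assumptions: the URL assumes the HTTP server listens on 127.0.0.1:9001 without credentials (if a username and password are set, they would go in the URL as http://user:password@127.0.0.1:9001/RPC2), and the helper names not_running and check_supervisor are hypothetical. supervisor.getAllProcessInfo is part of supervisor's documented RPC API.

```python
from xmlrpc.client import ServerProxy


def not_running(process_info):
    """Return 'group:name' for every process whose statename is not RUNNING."""
    return [
        f"{p['group']}:{p['name']}"
        for p in process_info
        if p.get("statename") != "RUNNING"
    ]


def check_supervisor(url="http://127.0.0.1:9001/RPC2"):
    """Query supervisord over XML-RPC (URL is an assumption, see above)."""
    server = ServerProxy(url)
    return not_running(server.supervisor.getAllProcessInfo())
```

A deploy script could call check_supervisor() after a rollout and abort if the returned list is non-empty.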

2.1 Installing and Configuring supervisor

# Debian/Ubuntu
sudo apt-get install supervisor

# RHEL/CentOS
sudo yum install supervisor

# Verify installation
supervisord --version

Create the main configuration file at /etc/supervisor/supervisord.conf:

[supervisord]
logfile=/var/log/supervisor/supervisord.log
logfile_maxbytes=50MB
logfile_backups=10
loglevel=info
pidfile=/var/run/supervisord.pid
nodaemon=false
minfds=1024
minprocs=200

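; Unix socket used by supervisorctl (must match serverurl below)
[unix_http_server]
file=/var/run/supervisor.sock
chmod=0700

; HTTP server for remote control. Bind to localhost only, and remove this
; section entirely in production if it is not needed.
[inet_http_server]
port=127.0.0.1:9001
username=supervisor
password=CHANGE_THIS_PASSWORD_IN_PRODUCTION

[supervisorctl]
serverurl=unix:///var/run/supervisor.sock

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

; Load the per-strategy program files from conf.d
[include]
files = /etc/supervisor/conf.d/*.conf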

2.2 Writing a Supervisor Program Configuration

For each trading strategy, create a dedicated config in /etc/supervisor/conf.d/. Create /etc/supervisor/conf.d/mean-reversion.conf:

[program:mean-reversion]
; Command — always use absolute paths
command=/opt/strategies/mean-reversion/venv/bin/python /opt/strategies/mean-reversion/main.py

; Directory — supervisor will cd here before running
directory=/opt/strategies/mean-reversion/

; User to run as — must match the systemd service user
user=quant-user

; Auto-restart settings
autorestart=true
startsecs=10
startretries=3

; Expected exit codes (only consulted when autorestart=unexpected)
exitcodes=0
stopsignal=TERM

; stdout redirection — critical for debugging
stdout_logfile=/var/log/supervisor/mean-reversion-stdout.log
stdout_logfile_maxbytes=100MB
stdout_logfile_backups=10
stderr_logfile=/var/log/supervisor/mean-reversion-stderr.log
stderr_logfile_maxbytes=100MB
stderr_logfile_backups=10

; Environment variables
environment=PYTHONUNBUFFERED="1",TICKDB_API_KEY="%(ENV_TICKDB_API_KEY)s"

; Priority: lower numbers start first and shut down last
priority=999

; Process naming for the status command
process_name=%(program_name)s
numprocs=1

Critical configuration notes:

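PYTHONUNBUFFERED=1: Without this, Python block-buffers stdout when it is not attached to a terminal, so the last lines written before a crash may never reach the log file.
%(ENV_VAR_NAME)s: Supervisor expands these from the environment that supervisord itself was started with, so the variable must be present there (for example, through an EnvironmentFile= line in the systemd unit that launches supervisord).
autorestart=true: This is the core feature; the process is restarted whenever it exits on its own. Set it to unexpected to restart only when the exit code is not listed in exitcodes. A process stopped manually with supervisorctl stop is not restarted in either mode.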
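
The interaction between autorestart and exitcodes is easy to misread, so here is a small Python model of the decision. It is a simplified illustration, not supervisor's actual code: it covers only a process that exits on its own after reaching the RUNNING state, and the function name should_autorestart is hypothetical.

```python
def should_autorestart(autorestart: str, exit_code: int, exitcodes=(0,)) -> bool:
    """Model supervisor's restart decision for a process that exited by itself."""
    if autorestart == "true":
        return True  # Always restart, whatever the exit code
    if autorestart == "unexpected":
        return exit_code not in exitcodes  # Restart only on unlisted exit codes
    if autorestart == "false":
        return False
    raise ValueError(f"unknown autorestart value: {autorestart!r}")
```

Under this model, a strategy that exits with code 0 at the end of the trading day is restarted with autorestart=true but left alone with autorestart=unexpected and exitcodes=0.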

2.3 The supervisor reload Trick for Hot Configuration Updates

Supervisor can apply configuration changes to a single program without restarting the programs whose configuration did not change:

# After editing a .conf file
sudo supervisorctl reread
# Output: mean-reversion: changed

sudo supervisorctl update mean-reversion
# Output: mean-reversion: stopped
# Output: mean-reversion: updated process group

# To restart a program without changing its configuration:
sudo supervisorctl restart mean-reversion
# Note: a plain restart (or a SIGTERM followed by autorestart) reuses the
# configuration already loaded; only "update" applies edits found by reread

2.4 Program Groups: Managing Multiple Related Processes

A trading system rarely consists of just one process. You typically have:

The main strategy engine.
A data feed connector.
A risk management module.
A notification sender.

Supervisor lets you manage these as a group:

# /etc/supervisor/conf.d/trading-system.conf

[group:trading-system]
programs=mean-reversion-engine,data-feed,risk-manager,notification-sender
priority=999

Now you can control all four processes with one command:

# Start all processes in the group
sudo supervisorctl start trading-system:*

# Stop all processes in the group
sudo supervisorctl stop trading-system:*

# Restart all processes in the group
sudo supervisorctl restart trading-system:*

# Check status of all group members
sudo supervisorctl status trading-system:*

2.5 Integrating supervisor with systemd

The proper layering is:

systemd → starts and monitors supervisord (the supervisor daemon).
supervisor → manages individual trading processes.
Health check scripts → run inside the processes or as separate monitors.

systemd unit for supervisor itself:

# /etc/systemd/system/supervisor.service
[Unit]
Description=Supervisor Process Manager
Documentation=http://supervisord.org/
After=network.target

[Service]
Type=forking
PIDFile=/var/run/supervisord.pid
ExecStart=/usr/bin/supervisord -c /etc/supervisor/supervisord.conf
ExecStop=/usr/bin/supervisorctl $SUPERVISOR_OPTIONS shutdown
ExecReload=/usr/bin/supervisorctl $SUPERVISOR_OPTIONS reload
KillMode=process
Restart=on-failure
RestartSec=10
User=root

[Install]
WantedBy=multi-user.target

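This creates a supervisor service that systemd starts at boot and restarts on failure; supervisord in turn manages your trading processes. With this layering, each strategy should be started by exactly one manager: if supervisor runs mean-reversion, do not also enable quant-strategy@mean-reversion, or two copies of the strategy will trade at once.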

Layer 3: Health Check Scripts — The Custom Intelligence

Both systemd and supervisor handle generic crash recovery. But your trading strategy has specific health criteria that generic tools cannot know:

Is the order book feed receiving data?
Is the strategy actively trading or sitting idle after hours?
Are positions aligned with the expected portfolio state?
Is memory usage normal or is there a leak?
Are there any positions in error state?

This is where custom health check scripts become essential.

3.1 The Health Check Architecture

┌──────────────────────────────────────────────────┐
│              Health Check Scheduler               │
│         (runs every N seconds via cron)          │
└────────────────────────┬─────────────────────────┘
                         │
              ┌──────────┴──────────┐
              │                     │
    ┌─────────▼─────────┐  ┌────────▼────────┐
    │  External Checks   │  │  Internal Checks │
    │  (network, APIs)   │  │  (logs, state)   │
    └─────────┬─────────┘  └────────┬────────┘
              │                     │
              └──────────┬──────────┘
                         │
              ┌──────────▼──────────┐
              │   Decision Engine   │
              │  (threshold logic)  │
              └──────────┬──────────┘
                         │
              ┌──────────▼──────────┐
              │   Action Executor   │
              │ (alert, restart,    │
              │  stop, escalate)    │
              └─────────────────────┘
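
The decision engine in the diagram can be sketched in a few lines of Python. This is an illustrative model under assumptions, not the script used later in this section: the class name DecisionEngine, the action strings, and the threshold of three consecutive failed runs are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class DecisionEngine:
    escalate_after: int = 3  # Consecutive failed runs before escalating
    consecutive: int = 0     # Failed runs seen in a row so far

    def decide(self, results: dict) -> str:
        """results maps check name -> bool (True = passed).

        Returns 'ok', 'warn', or 'escalate'.
        """
        failed = [name for name, passed in results.items() if not passed]
        if not failed:
            self.consecutive = 0  # Any clean run resets the counter
            return "ok"
        self.consecutive += 1
        if self.consecutive >= self.escalate_after:
            self.consecutive = 0  # Reset after escalation
            return "escalate"
        return "warn"
```

A long-running monitor can keep this state in memory. A checker launched fresh by cron on every run has to persist the counter somewhere, for example in a small file.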

3.2 The Core Health Check Script

Create /opt/strategies/mean-reversion/health_check.sh:

#!/bin/bash
# Health check script for mean-reversion strategy
# Run via cron: * * * * * /opt/strategies/mean-reversion/health_check.sh >> /var/log/health-check.log 2>&1

set -euo pipefail

# Configuration
STRATEGY_NAME="mean-reversion"
LOG_DIR="/var/log/strategies/${STRATEGY_NAME}"
STATE_FILE="/var/run/strategies/${STRATEGY_NAME}/state.json"
GATEWAY_CHECK_URL="https://api.tickdb.ai/v1/health"
CRITICAL_ERROR_COUNT=3
ALERT_THRESHOLD=2

# Timestamps for log correlation
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
LOG_FILE="${LOG_DIR}/health-check.log"

# --- Helper Functions ---

log() {
    echo "[${TIMESTAMP}] $1" | tee -a "${LOG_FILE}"
}

check_process() {
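    # Is the process running? Match the strategy's entry point rather than the
    # bare name, which would also match this script's own path.
    if ! pgrep -f "/opt/strategies/${STRATEGY_NAME}/main.py" > /dev/null 2>&1; then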
        log "ERROR: No process found for ${STRATEGY_NAME}"
        return 1
    fi
    log "INFO: Process is running"
    return 0
}

check_data_freshness() {
    # Is the order book data fresh?
    local last_update
    last_update=$(stat -c %Y "${STATE_FILE}" 2>/dev/null || echo "0")
    local current_time
    current_time=$(date +%s)
    local age=$((current_time - last_update))
    
    # Data older than 60 seconds is stale during market hours
    if [ ${age} -gt 60 ]; then
        log "ERROR: Data stale by ${age} seconds (threshold: 60s)"
        return 1
    fi
    log "INFO: Data fresh (${age}s old)"
    return 0
}

check_api_availability() {
    # Can we reach TickDB?
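    local http_code
    # curl itself prints 000 when the request fails, so only fall back to
    # 000 if it printed nothing at all
    http_code=$(curl -s -o /dev/null -w "%{http_code}" \
        --max-time 5 \
        "${GATEWAY_CHECK_URL}" 2>/dev/null) || http_code="${http_code:-000}"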
    
    if [ "${http_code}" != "200" ]; then
        log "ERROR: TickDB API returned ${http_code} (expected 200)"
        return 1
    fi
    log "INFO: TickDB API reachable (HTTP 200)"
    return 0
}

check_memory_usage() {
    # Is memory consumption reasonable?
    local pid
    pid=$(pgrep -f "/opt/strategies/${STRATEGY_NAME}/main.py" | head -1)
    
    if [ -z "${pid}" ]; then
        log "WARN: Cannot find PID for memory check"
        return 0  # Don't fail the overall check on this
    fi
    
    local rss_kb
    rss_kb=$(ps -p "${pid}" -o rss= 2>/dev/null | tr -d ' ')
    local rss_mb=$((rss_kb / 1024))
    
    # Alert if memory exceeds 3 GB
    if [ ${rss_mb} -gt 3072 ]; then
        log "ERROR: Memory usage ${rss_mb}MB exceeds threshold (3072MB)"
        return 1
    fi
    log "INFO: Memory usage ${rss_mb}MB (threshold: 3072MB)"
    return 0
}

check_error_rate() {
    # Are there excessive errors in the recent logs?
    local error_count
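    # grep -c already prints 0 when nothing matches (and exits non-zero),
    # so swallow the exit status instead of echoing a second 0
    error_count=$(tail -n 1000 "${LOG_DIR}/strategy.log" 2>/dev/null | \
        grep -c "ERROR\|CRITICAL" || true)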
    
    if [ ${error_count} -gt ${ALERT_THRESHOLD} ]; then
        log "WARN: ${error_count} errors in recent logs (threshold: ${ALERT_THRESHOLD})"
        # Return 0 here — we alert but don't hard-stop on log errors alone
        return 0
    fi
    log "INFO: Error rate acceptable (${error_count} errors)"
    return 0
}

escalate() {
    local failure_type="$1"
    log "ALERT: Escalating ${failure_type}"
    
    # Send to webhook (Slack, PagerDuty, etc.) if WEBHOOK_URL is set in the
    # environment; the :- default keeps "set -u" from aborting when it is not
    if [ -n "${WEBHOOK_URL:-}" ]; then
        curl -s -X POST "${WEBHOOK_URL}" \
            -H 'Content-Type: application/json' \
            -d "{
                \"text\": \"[${STRATEGY_NAME}] Health check failed: ${failure_type} at ${TIMESTAMP}\",
                \"attachments\": [{
                    \"color\": \"danger\",
                    \"fields\": [
                        {\"title\": \"Failure Type\", \"value\": \"${failure_type}\", \"short\": true},
                        {\"title\": \"Timestamp\", \"value\": \"${TIMESTAMP}\", \"short\": true},
                        {\"title\": \"Action\", \"value\": \"Auto-restarting via supervisor\", \"short\": true}
                    ]
                }]
            }" 2>/dev/null || true
    fi
    
    # Attempt graceful restart via supervisor (sends the configured
    # stopsignal, then starts the program again)
    log "INFO: Restarting ${STRATEGY_NAME} via supervisor"
    supervisorctl restart "${STRATEGY_NAME}" 2>/dev/null || true
    
    return 0
}

# --- Main Execution ---

main() {
    log "INFO: === Health check started ==="
    
    local exit_code=0
    local failures=()
    
    check_process || { failures+=("process_check"); exit_code=1; }
    check_api_availability || { failures+=("api_check"); exit_code=1; }
    check_data_freshness || { failures+=("data_freshness"); exit_code=1; }
    check_memory_usage || { failures+=("memory_check"); exit_code=1; }
    check_error_rate
    
    if [ ${exit_code} -ne 0 ]; then
        log "ERROR: Health check failed: ${failures[*]}"
        
        # Count consecutive failures for escalation threshold
        local failure_count_file="/var/run/strategies/${STRATEGY_NAME}/consecutive_failures"
        local consecutive
        consecutive=$(cat "${failure_count_file}" 2>/dev/null || echo "0")
        consecutive=$((consecutive + 1))
        echo "${consecutive}" > "${failure_count_file}"
        
        if [ ${consecutive} -ge ${CRITICAL_ERROR_COUNT} ]; then
            escalate "CONSECUTIVE_FAILURES:${failures[*]}"
            echo "0" > "${failure_count_file}"  # Reset after escalation
        fi
    else
        # Reset failure counter on success
        echo "0" > "/var/run/strategies/${STRATEGY_NAME}/consecutive_failures" 2>/dev/null || true
    fi
    
    log "INFO: === Health check completed (exit: ${exit_code}) ==="
    return ${exit_code}
}

main "$@"
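
The script above treats data older than 60 seconds as stale at any time of day, although the comment in check_data_freshness only intends that during market hours. One way to add the session check is sketched below in Python. It is an illustration under assumptions: it uses the regular US equity session (09:30 to 16:00 New York time, Monday to Friday), ignores exchange holidays and half days, and the function names are hypothetical.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")


def in_regular_session(now: datetime) -> bool:
    """True on weekdays between 09:30 and 16:00 New York time (holidays ignored)."""
    local = now.astimezone(NY)
    return local.weekday() < 5 and time(9, 30) <= local.time() < time(16, 0)


def is_stale(last_update: datetime, now: datetime, max_age_s: float = 60.0) -> bool:
    """Flag stale data only while the regular session is open."""
    if not in_regular_session(now):
        return False  # Quiet feeds are expected outside the session
    return (now - last_update).total_seconds() > max_age_s
```

Both arguments should be timezone-aware datetimes. Outside the session the check simply passes, which avoids overnight restart loops triggered by an idle feed.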

3.3 Setting Up the Cron Scheduler

Make the script executable and add it to cron:

sudo chmod +x /opt/strategies/mean-reversion/health_check.sh

# Add to crontab — run every minute
sudo crontab -e
# Add this line:
# * * * * * /opt/strategies/mean-reversion/health_check.sh >> /var/log/health-check.log 2>&1

Cron cannot schedule a job more often than once per minute. For finer timing (every 30 seconds), use a systemd timer instead:

# /etc/systemd/system/mean-reversion-health.timer
[Unit]
Description=Mean-Reversion Health Check Timer

[Timer]
# Run every 30 seconds
OnBootSec=30
OnUnitActiveSec=30
AccuracySec=1ms

[Install]
WantedBy=timers.target

# /etc/systemd/system/mean-reversion-health.service
[Unit]
Description=Mean-Reversion Health Check
After=network-online.target

[Service]
Type=oneshot
ExecStart=/opt/strategies/mean-reversion/health_check.sh
User=quant-user

# Enable and start
sudo systemctl enable mean-reversion-health.timer
sudo systemctl start mean-reversion-health.timer

# Check next run time
systemctl list-timers mean-reversion-health.timer

4. The Complete Architecture: Connecting All Layers

With all three layers configured, here is the complete recovery flow:

Recovery Scenario 1: Application Crash

Event: Your strategy raises an unhandled exception and dies.
supervisor detects the exit within 1 second.
supervisor immediately restarts the process (configured with autorestart=true).
Health check runs on the next cron tick (within 60 seconds), confirms the process is healthy.
Result: Downtime measured in seconds, not minutes.

Recovery Scenario 2: Supervisor Itself Crashes

Event: A kernel bug or the OOM killer terminates supervisord.
systemd detects that supervisord's main process has exited.
systemd waits RestartSec=10, then restarts supervisord.
supervisord starts and reads its configuration.
supervisor restarts all programs in its configuration (your strategies).
Result: Downtime measured in minutes, not hours.

Recovery Scenario 3: Server Reboot

Event: Power failure or planned maintenance reboot.
Kernel boots, starts systemd.
systemd starts supervisor.service (configured with WantedBy=multi-user.target).
supervisord starts, reads its configuration.
supervisor starts all configured programs.
Health check timer triggers on schedule, confirms all systems healthy.
Result: Fully automatic recovery with zero manual intervention.

5. Monitoring Dashboard: Status at a Glance

Write a simple status script that aggregates information from all layers:

#!/bin/bash
# /opt/strategies/bin/status.sh — run this to see everything at once

echo "======================================"
echo "  QUANT SYSTEM STATUS — $(date)"
echo "======================================"
echo ""

echo "=== SYSTEMD SERVICES ==="
systemctl list-units 'quant-strategy@*' --state=running --no-pager
echo ""

echo "=== SUPERVISOR PROCESSES ==="
supervisorctl status
echo ""

echo "=== HEALTH CHECK TIMERS ==="
systemctl list-timers --no-pager | grep -E "health|strategy"
echo ""

echo "=== RECENT HEALTH CHECK LOGS ==="
tail -n 10 /var/log/health-check.log
echo ""

echo "=== RESOURCE USAGE (TOP STRATEGIES) ==="
ps aux --sort=-%mem | grep -E "python|mean-reversion|momentum" | head -5
echo ""

echo "=== DISK SPACE ==="
df -h /opt/strategies

Run this script to get a complete picture in under 2 seconds.

6. Testing Your Supervision Stack

Never deploy a supervision configuration without testing it. Here is the testing sequence:

Test 1: Kill the Process (Simulate Crash)

# Find the process (match the entry point, not just the strategy name)
pid=$(pgrep -f "/opt/strategies/mean-reversion/main.py" | head -1)

# Kill it violently (simulates crash)
sudo kill -9 "${pid}"

# Watch the restart
watch -n 1 "supervisorctl status mean-reversion"

# It should come back within 10 seconds

Test 2: Manual Stop (Verify It Stays Stopped)

# Manual stop
sudo supervisorctl stop mean-reversion

# Wait 30 seconds
sleep 30

# The process should still be STOPPED: supervisor never auto-restarts a
# program that was stopped through supervisorctl, whatever autorestart is set to
sudo supervisorctl status mean-reversion

# Bring it back
sudo supervisorctl start mean-reversion

Test 3: Simulate Server Reboot

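# Enable the unit that owns the strategy. With supervisor managing the
# strategy, that is supervisor; on a systemd-only host it is the template
# instance. Do not enable both for the same strategy.
sudo systemctl enable supervisor
# or: sudo systemctl enable quant-strategy@mean-reversion

# Check boot target
sudo systemctl get-default
# Ensure it is: multi-user.target

# Verify the unit will start at boot
sudo systemctl is-enabled supervisor

# Then reboot a staging machine and confirm the strategy came back
sudo systemctl reboot
# After logging back in:
sudo supervisorctl status mean-reversion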

Test 4: Health Check Alert Pipeline

# Trigger a health check failure manually
# Temporarily break the API check by editing the script or blocking the URL

# Run the health check manually
sudo -u quant-user /opt/strategies/mean-reversion/health_check.sh

# Verify the alert was sent to your webhook
# Check Slack/PagerDuty for the alert

7. Common Pitfalls and How to Avoid Them

Pitfall 1: Running as Root

Problem: If your supervisor or systemd service runs as root, a compromised strategy has full system access.

Fix:

[Service]
User=quant-user
Group=quant-group

Also add to your supervisor config:

user=quant-user

Pitfall 2: Circular Restarts (Crash Loop)

Problem: A misconfigured strategy crashes immediately on start, supervisor restarts it, and it crashes again — a crash loop that burns CPU and logs.

Fix: Use supervisor's startretries and systemd's StartLimitBurst:

# supervisor
startsecs=5
startretries=3

# systemd (RestartSec goes in [Service]; the StartLimit* settings go in [Unit])
RestartSec=30
StartLimitIntervalSec=120
StartLimitBurst=3

After 3 failed restarts in 2 minutes, the service stops, preventing the loop.
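
Whether the start limit ever trips depends on how RestartSec relates to the window. The rough Python check below assumes the service crashes immediately on every start, so attempts are spaced RestartSec apart; it approximates systemd's rate limiting rather than reimplementing it, and the function name start_limit_trips is hypothetical.

```python
def start_limit_trips(restart_sec: float, interval_sec: float, burst: int) -> bool:
    """Approximate whether an instantly crashing service exhausts StartLimitBurst.

    The first `burst` starts are allowed. The next attempt happens roughly
    burst * restart_sec after the first one and is refused only if that
    moment still falls inside the StartLimitIntervalSec window.
    """
    return burst * restart_sec < interval_sec
```

For the values above, start_limit_trips(30, 120, 3) is True: the fourth attempt would come at about 90 seconds, inside the 120-second window. By contrast, a 10-second window with RestartSec=10 and a burst of 5 would never trip, because only one start fits in each window and the crash loop would continue indefinitely.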

Pitfall 3: Environment Variables Not Loaded

Problem: Your strategy expects TICKDB_API_KEY but it is not set, causing silent failures.

Fix: Always use an environment file and verify the variable is set:

# In your strategy startup script
if [ -z "${TICKDB_API_KEY:-}" ]; then
    echo "ERROR: TICKDB_API_KEY is not set"
    exit 1
fi
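
The same guard can live inside the Python strategy, so it also applies when the process is launched without a wrapper script. A minimal sketch, assuming a hypothetical helper named require_env and using the variable names from the environment file in Section 1.3:

```python
import os


def require_env(required, environ=os.environ):
    """Exit with a clear message if any required variable is unset or empty."""
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise SystemExit(
            "ERROR: missing environment variables: " + ", ".join(missing)
        )
```

Calling require_env(["TICKDB_API_KEY", "ALPACA_API_KEY", "ALPACA_SECRET_KEY"]) as the first line of main() makes the process exit non-zero immediately, which the supervision layers then report as a failed start instead of a silent failure.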

Pitfall 4: Log Files Filling the Disk

Problem: Unbounded log growth fills up /var/log, crashing the system.

Fix: Cap supervisor's logs with stdout_logfile_maxbytes and stdout_logfile_backups (and the stderr equivalents), and cap the systemd journal in /etc/systemd/journald.conf:

# supervisor
stdout_logfile_maxbytes=100MB
stdout_logfile_backups=5

# systemd: /etc/systemd/journald.conf (1G is an example value)
[Journal]
SystemMaxUse=1G

Run this periodically to clean old logs:

sudo journalctl --vacuum-time=7d

8. Production Checklist

Before deploying to production, verify each item:

systemd unit file created and placed in /etc/systemd/system/
Unit enabled with systemctl enable
Environment file created with correct permissions (chmod 600)
API keys loaded from environment, never hardcoded
supervisor configuration created in /etc/supervisor/conf.d/
Health check script executable (chmod +x) and added to cron or systemd timer
Watchdog notifications implemented in strategy code
Alert webhook configured and tested
Crash loop protection configured (startretries, StartLimitBurst)
Log rotation configured (logfile_maxbytes, logfile_backups)
Resource limits set (memory, CPU affinity)
Process runs as non-root user (quant-user)
Health check tested by manually killing the process
Reboot recovery tested in staging environment
Status dashboard script created and tested

Closing: The Indestructible Foundation

A trading strategy is only as reliable as the infrastructure beneath it. The alpha may be exceptional, the code may be perfect, but a single crash without auto-recovery is all it takes to lose a position, miss a trade, or wake up to a catastrophe.

The supervision stack described in this article — systemd as the anchor, supervisor as the workhorse, and health check scripts as the intelligence — creates a three-layer defense that handles crashes at every timescale, recovers from server reboots automatically, and monitors your strategy's specific health criteria that no generic tool can provide.

The best part: all of this is built on tools that either ship with every major Linux distribution or are one package install away, require no commercial licenses, and have been battle-tested in production environments for over a decade.

Start with one strategy. Implement the full stack. Test it. Then roll it out to the rest of your portfolio. The 3 AM wake-up calls will stop. The recovery time will shrink from hours to seconds. And your strategies will run — reliably, automatically, and without needing you to be watching.

Next Steps

If you're running a single strategy on a VPS, start with the systemd unit file in Section 1.2 — it is the single highest-value configuration you can add.

If you're running multiple strategies on a dedicated server, implement the supervisor stack from Section 2 — program groups make multi-strategy management dramatically simpler.

If you need custom health monitoring for your specific strategy logic, build the health check script from Section 3 — the framework is there; customize the checks to your strategy's specific requirements.

If you want to learn more about TickDB's infrastructure and how it integrates with production quant systems, visit the TickDB documentation at tickdb.ai/docs — the API provides the data layer that your supervised strategies consume.


This article does not constitute investment advice. Trading strategies involve substantial risk of loss. The supervision techniques described here are infrastructure tools — they do not guarantee profitable trading outcomes. Always test in a staging environment before deploying to production.