The CFO finishes reading the prepared remarks. The call moderator opens the line to analysts. For the next 45 minutes, no script exists — only the unfiltered reactions of executives to questions they did not choose.

That unscripted section is where retail traders lose edge and quantitative researchers find it.

Earnings call transcripts have existed for decades in text form. What has not existed — until recently — is a reliable, cost-effective way to extract emotional intensity from the audio itself. The pauses. The deflection patterns. The difference between a rehearsed answer and a nervous one.

This article builds a production-ready automation pipeline that ingests earnings call audio files, transcribes them with Whisper, and scores sentiment using a structured LLM prompt. By the end, you will have a working system that processes a batch of quarterly call recordings and outputs a quantified sentiment score per company — ready for backtesting against subsequent price movements.

The architecture is deliberately modular. Whisper handles the transcription. The LLM handles the judgment. You control the prompt. No vendor lock-in on either layer.


Why Audio, Not Text

Earnings call transcripts are widely available. Companies publish them. Services like AlphaSense, FactSet, and Bloomberg Terminal distribute them. If the words are already accessible, why bother with audio?

The answer lies in what transcripts lose.

Consider the phrase "We are excited about our pipeline." Delivered with confident cadence and a slight upward inflection, it signals genuine optimism. The same words, spoken rapidly with a descending tone and a two-second pause before them, signal uncertainty masking as enthusiasm. The transcript shows identical text. The audio reveals the difference.

Whisper, OpenAI's open-source transcription model, produces segment-level timestamps and, when enabled, word-level timestamps. It does not identify speakers on its own; pairing it with a diarization library such as pyannote adds who-spoke-when labels. This temporal structure enables downstream analysis: you can weight sentiment by the speaker's role (the CEO carries more signal than the IR representative), identify segments of high uncertainty (question-answer exchanges versus prepared remarks), and detect deflection patterns (how long executives take to answer difficult questions).

The pipeline we build leverages all three capabilities.
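
To make the pause signal concrete, here is a minimal sketch that finds long silences between consecutive words. It assumes word dictionaries shaped like the output Whisper produces with `word_timestamps=True` (`word`, `start`, and `end` keys). The function name `find_pauses`, the sample words, and the 1.0-second threshold are illustrative choices, not part of the pipeline itself.

```python
def find_pauses(words: list[dict], min_gap: float = 1.0) -> list[dict]:
    """Return gaps of at least `min_gap` seconds between consecutive words."""
    pauses = []
    for prev, curr in zip(words, words[1:]):
        gap = curr["start"] - prev["end"]
        if gap >= min_gap:
            pauses.append({
                "after_word": prev["word"].strip(),
                "before_word": curr["word"].strip(),
                "gap_seconds": round(gap, 2),
            })
    return pauses


# Hypothetical word timings: a long silence follows the opening "Well,"
words = [
    {"word": " Well,", "start": 0.0, "end": 0.4},
    {"word": " we", "start": 2.5, "end": 2.6},
    {"word": " are", "start": 2.6, "end": 2.8},
    {"word": " excited", "start": 2.9, "end": 3.4},
]
pauses = find_pauses(words)
print(pauses)
```

Whisper's word boundaries around silence are approximate, so a threshold like this is best tuned on a few calls you have listened to.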


Pipeline Architecture Overview

The system consists of three stages, connected by two data transformation steps:

[Audio Files] → Stage 1: Whisper Transcription → [JSON Timestamps + Text]
                          ↓
                Stage 2: Segment Parsing → [Structured Transcript]
                          ↓
                Stage 3: LLM Scoring → [Sentiment Scores per Company]

Stage 1 runs Whisper locally or via API. For batch processing of quarterly earnings files, local inference with the large-v3 model is usually the better cost-to-accuracy tradeoff. The output is a JSON file with word-level timestamps, per-word probabilities, and the detected language; speaker labels are added when diarization is enabled.

Stage 2 parses the raw Whisper output into a structured format. It separates prepared remarks from the Q&A section, identifies speaker transitions, and flags segments where confidence scores fall below a threshold (typically 0.85). Low-confidence segments are re-transcribed or tagged for manual review.

Stage 3 feeds the structured transcript into an LLM via a carefully engineered prompt. The prompt instructs the model to score four dimensions: forward-looking optimism, caution level, deflection index, and executive confidence. The output is a JSON object with numerical scores and a short rationale per dimension.
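
One way to keep the stages swappable is to have the orchestration code depend on small interfaces rather than on Whisper or a specific LLM client. The sketch below uses `typing.Protocol` for this. The names `Transcriber`, `Scorer`, `run_call`, and both stub classes are hypothetical and exist only to illustrate the idea.

```python
from typing import Protocol


class Transcriber(Protocol):
    def transcribe(self, audio_path: str) -> list[dict]: ...


class Scorer(Protocol):
    def score(self, transcript: str) -> dict[str, float]: ...


def run_call(audio_path: str, transcriber: Transcriber, scorer: Scorer) -> dict[str, float]:
    """Run one call through any transcriber/scorer pair that fits the interfaces."""
    segments = transcriber.transcribe(audio_path)
    transcript = "\n".join(seg["text"] for seg in segments)
    return scorer.score(transcript)


# Stub implementations stand in for Whisper and the LLM
class FakeTranscriber:
    def transcribe(self, audio_path: str) -> list[dict]:
        return [{"start": 0.0, "end": 2.0, "text": "We are excited about our pipeline."}]


class KeywordScorer:
    def score(self, transcript: str) -> dict[str, float]:
        optimistic = "excited" in transcript.lower()
        return {"forward_looking_optimism": 1.0 if optimistic else 0.5}


scores = run_call("call.m4a", FakeTranscriber(), KeywordScorer())
print(scores)
```

Replacing either layer then means writing one adapter class, with no change to the orchestration function.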


Stage 1: Whisper Transcription at Production Scale

Whisper transcription is compute-bound, not API-bound. For a portfolio of 50 companies per earnings season, processing 3 hours of audio each, you are looking at approximately 150 hours of audio. On a single NVIDIA A100, this completes in under 4 hours. On CPU, budget 24–48 hours.
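
It helps to spell out the arithmetic behind these estimates, because each wall-clock figure only holds if the hardware sustains a particular real-time factor (hours of audio processed per hour of compute). The helper names below are illustrative. Actual throughput depends on the Whisper implementation, decoding settings, and audio quality, so measure it on a sample before committing to a schedule.

```python
def total_audio_hours(companies: int, hours_per_company: float) -> float:
    """Total audio to transcribe in one earnings season."""
    return companies * hours_per_company


def implied_realtime_factor(audio_hours: float, wall_clock_hours: float) -> float:
    """Throughput needed to finish `audio_hours` of audio in `wall_clock_hours`."""
    return audio_hours / wall_clock_hours


def wall_clock_estimate(audio_hours: float, realtime_factor: float) -> float:
    """Hours of compute needed at a measured real-time factor."""
    return audio_hours / realtime_factor


audio = total_audio_hours(50, 3.0)
gpu_factor = implied_realtime_factor(audio, 4.0)
cpu_fast = implied_realtime_factor(audio, 24.0)
cpu_slow = implied_realtime_factor(audio, 48.0)
print(f"{audio:.0f} h of audio; GPU target needs {gpu_factor}x real time")
print(f"CPU budget implies {cpu_slow}x to {cpu_fast}x real time")
```

If a benchmark on your own hardware shows a lower factor than the GPU target implies, `wall_clock_estimate` gives the revised budget.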

The code below implements a production-grade Whisper transcription pipeline with the following requirements:

  • Batching of multiple audio files
  • Automatic language detection (Whisper does this natively)
  • Confidence scoring per segment
  • Speaker diarization via pyannote (optional but recommended)
  • Exponential backoff on transient failures
  • Progress tracking for long-running batches

import os
import json
import time
from pathlib import Path
from dataclasses import dataclass
from typing import Optional
import whisper
import torch

@dataclass
class TranscriptionResult:
    """Container for Whisper transcription output."""
    file_path: str
    language: str
    language_confidence: float
    duration: float
    segments: list[dict]
    full_text: str
    avg_confidence: float


class WhisperTranscriber:
    """
    Production-grade Whisper transcription pipeline.
    Loads model once, processes files in batches, and handles
    transient errors with exponential backoff.
    """
    
    def __init__(
        self,
        model_name: str = "large-v3",
        device: str = "cuda",
        output_dir: str = "./transcriptions"
    ):
        self.model_name = model_name
        self.device = device if torch.cuda.is_available() else "cpu"
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        print(f"Loading Whisper {model_name} on {self.device}...")
        self.model = whisper.load_model(model_name, device=self.device)
        print("Model loaded successfully.")
    
    def transcribe_file(
        self,
        audio_path: str,
        temperature: float = 0.0,
        initial_prompt: Optional[str] = None,
        max_retries: int = 3
    ) -> TranscriptionResult:
        """
        Transcribe a single audio file with retry logic.
        Returns a TranscriptionResult with timestamps and confidence scores.
        """
        audio_path = Path(audio_path)
        if not audio_path.exists():
            raise FileNotFoundError(f"Audio file not found: {audio_path}")
        
        # Prepare transcription kwargs
        kwargs = {
            "task": "transcribe",
            "temperature": temperature,
            "verbose": False,
            "word_timestamps": True,
            "condition_on_previous_text": False,  # Prevents error propagation
        }
        
        if initial_prompt:
            kwargs["initial_prompt"] = initial_prompt
        
        # Retry with exponential backoff on runtime errors
        for attempt in range(max_retries):
            try:
                print(f"Transcribing {audio_path.name} (attempt {attempt + 1})...")
                result = self.model.transcribe(str(audio_path), **kwargs)
                break
            except RuntimeError as e:
                # Re-raise on the final attempt so `result` is never left undefined
                if attempt == max_retries - 1:
                    raise
                if "out of memory" in str(e).lower() and torch.cuda.is_available():
                    # Free cached CUDA memory before retrying
                    torch.cuda.empty_cache()
                    print("CUDA OOM: cleared cache before retrying")
                wait_time = (2 ** attempt) + 0.1 * (attempt ** 2)
                print(f"Error: {e}. Retrying in {wait_time:.1f}s...")
                time.sleep(wait_time)
        
        # Extract metadata
        segments = []
        for seg in result["segments"]:
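            # Average word probability for this segment (0-1 scale)
            word_confidences = [
                w.get("probability", 1.0)
                for w in seg.get("words", [])
            ]
            if word_confidences:
                avg_conf = sum(word_confidences) / len(word_confidences)
            else:
                # avg_logprob is a log-probability, so exponentiate it to keep
                # the fallback on the same 0-1 scale as the word probabilities
                avg_conf = float(torch.exp(torch.tensor(seg.get("avg_logprob", -0.5))))
            
            segments.append({
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"].strip(),
                # Whisper does not emit this key; it stays None unless a
                # diarization step adds it
                "speaker_probability": seg.get("speaker_probability"),
                "avg_confidence": avg_conf,
            })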
        
        # Calculate overall confidence
        all_confidences = [s["avg_confidence"] for s in segments]
        avg_confidence = sum(all_confidences) / len(all_confidences) if all_confidences else 0.0
        
        return TranscriptionResult(
            file_path=str(audio_path),
            language=result["language"],
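            # transcribe() returns the detected language but no confidence
            # value, so this falls back to 1.0 unless you add one yourself
            language_confidence=result.get("language_confidence", 1.0),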
            duration=result["segments"][-1]["end"] if result["segments"] else 0.0,
            segments=segments,
            full_text=result["text"],
            avg_confidence=avg_confidence,
        )
    
    def process_batch(
        self,
        audio_dir: str,
        output_suffix: str = "_transcription.json",
        confidence_threshold: float = 0.85
    ) -> list[TranscriptionResult]:
        """
        Process all audio files in a directory.
        Saves individual JSON outputs and returns the full result list.
        """
        audio_dir = Path(audio_dir)
        audio_files = list(audio_dir.glob("**/*.m4a")) + \
                      list(audio_dir.glob("**/*.mp3")) + \
                      list(audio_dir.glob("**/*.wav"))
        
        print(f"Found {len(audio_files)} audio files to process.")
        results = []
        
        for audio_file in audio_files:
            try:
                result = self.transcribe_file(str(audio_file))
                results.append(result)
                
                # Save individual result
                output_file = self.output_dir / f"{audio_file.stem}{output_suffix}"
                with open(output_file, "w", encoding="utf-8") as f:
                    json.dump(
                        {
                            "metadata": {
                                "file_path": result.file_path,
                                "language": result.language,
                                "language_confidence": result.language_confidence,
                                "duration": result.duration,
                                "avg_confidence": result.avg_confidence,
                            },
                            "segments": result.segments,
                            "full_text": result.full_text,
                        },
                        f,
                        indent=2,
                        ensure_ascii=False
                    )
                
                # Flag low-confidence segments
                low_conf_segments = [
                    s for s in result.segments 
                    if s["avg_confidence"] < confidence_threshold
                ]
                if low_conf_segments:
                    print(f"  ⚠ {len(low_conf_segments)} segment(s) below {confidence_threshold:.0%} confidence")
                
            except Exception as e:
                print(f"  ❌ Failed to transcribe {audio_file.name}: {e}")
                continue
        
        print(f"\nBatch complete: {len(results)}/{len(audio_files)} files processed.")
        return results


# ⚠️ Engineering note: For batches much larger than about 100 files,
# distribute across multiple GPUs or use a queue-based worker system.
# The single-model approach above is suitable for research use and
# for portfolios of modest size.

The code above uses whisper.load_model with large-v3 for maximum accuracy. If your compute budget is constrained, medium reduces VRAM requirements from 10 GB to 5 GB with approximately 2% accuracy degradation on financial English. For earnings calls with non-native English speakers, stick with large-v3.
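
As a small illustration of that tradeoff, the sketch below picks the largest model that fits a given amount of GPU memory. The VRAM figures are the approximate ones quoted above, and the function name `pick_whisper_model` is hypothetical. It takes the memory figure as an argument so it runs without a GPU.

```python
# Approximate VRAM needs quoted in the text; treat them as rough guides
MODEL_VRAM_GB = {"large-v3": 10.0, "medium": 5.0}


def pick_whisper_model(available_vram_gb: float) -> str:
    """Return the largest listed model that fits in the available VRAM."""
    by_size = sorted(MODEL_VRAM_GB.items(), key=lambda kv: kv[1], reverse=True)
    for name, needed_gb in by_size:
        if available_vram_gb >= needed_gb:
            return name
    raise ValueError(
        f"{available_vram_gb} GB is below the smallest listed requirement; "
        "consider a smaller model or CPU inference"
    )


# On a CUDA machine the figure could come from
# torch.cuda.get_device_properties(0).total_memory / 1024**3
print(pick_whisper_model(24.0))
print(pick_whisper_model(8.0))
```

For calls with non-native English speakers, the guidance above still applies: prefer large-v3 even where a smaller model would fit.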

Speaker diarization is handled by pyannote.audio, a separate library; Whisper does not perform diarization natively. To enable it:

# Install pyannote first: pip install pyannote.audio
from pyannote.audio import Pipeline

diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ.get("PYANNOTE_TOKEN")
)

def apply_diarization(audio_path: str, transcription_result: TranscriptionResult):
    """
    Overlay speaker labels onto Whisper segments.
    Merges speaker segments with transcription segments by timestamp overlap.
    """
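    diarization = diarization_pipeline(audio_path)
    # Materialize the speaker turns once so they are not re-iterated per segment
    turns = list(diarization.itertracks(yield_label=True))
    segments_with_speakers = []
    
    for whisper_seg in transcription_result.segments:
        start, end = whisper_seg["start"], whisper_seg["end"]
        speaker, best_overlap = "unknown", 0.0
        
        # Assign the speaker whose turn overlaps this segment the longest.
        # This also covers segments that fully contain a short turn.
        for turn, _, speaker_label in turns:
            overlap = min(end, turn.end) - max(start, turn.start)
            if overlap > best_overlap:
                speaker, best_overlap = speaker_label, overlap
        
        whisper_seg["speaker"] = speaker
        segments_with_speakers.append(whisper_seg)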
    
    return segments_with_speakers

Stage 2: Structured Transcript Parsing

Raw Whisper output is a flat list of segments. To enable meaningful LLM scoring, you need to structure this data into a format that preserves context. The parser does three things:

  1. Separates prepared remarks from Q&A. Prepared remarks are rehearsed; Q&A is where executives face unscripted pressure.
  2. Assigns speaker roles. The CEO, CFO, and Chief Strategy Officer carry more signal than the IR moderator or external analysts.
  3. Detects deflection patterns. Hedge-heavy responses and circular language are flagged as deflection signals; long pauses before answering can be measured separately from the segment timestamps.

import re
from enum import Enum
from typing import TypedDict


class SegmentType(Enum):
    PREPARED = "prepared_remarks"
    QA = "question_answer"
    MODERATOR = "moderator"
    ANALYST = "analyst_question"


class ParsedSegment(TypedDict):
    start: float
    end: float
    speaker: str
    speaker_role: str
    text: str
    segment_type: str
    deflection_signals: list[str]
    confidence: float


class TranscriptParser:
    """
    Parses raw Whisper JSON into a structured format suitable for LLM analysis.
    Identifies segment types, speaker roles, and deflection signals.
    """
    
    # Pattern-based heuristics for role identification
    # In production, use a lookup table from the earnings call roster
    EXECUTIVE_PATTERNS = [
        r"Chief Executive Officer",
        r"Chief Financial Officer",
        r"Chief Strategy Officer",
        r"President and CEO",
        r"CEO",
        r"CFO",
        r"Executive Vice President",
    ]
    
    MODERATOR_PATTERNS = [
        r"Operator",
        r"Moderator",
        r"Conference Host",
    ]
    
    # Deflection linguistic markers
    DEFLECTION_PATTERNS = [
        (r"\bwe're still evaluating\b", "evaluation_language"),
        (r"\bI wouldn't want to speculate\b", "speculation_refusal"),
        (r"\bthat's a good question\b", "deflection_acknowledgment"),
        (r"\bwe'll have to get back to you\b", "deferral"),
        (r"\bwe remain focused on\b", "focus_duct_tape"),  # Redirecting to safe ground
        (r"\bkind of\b", "hedging_language"),
        (r"\bsort of\b", "hedging_language"),
        (r"\bpotentially\b", "uncertainty_marker"),
        (r"\bmaybe\b", "uncertainty_marker"),
        (r"\bI think\b", "opinion_disclaimer"),
        (r"^\s*so,\s*", "sentence_start_hesitation"),
    ]
    
    def __init__(self, speaker_roster: dict[str, str] | None = None):
        """
        Args:
            speaker_roster: Optional dict mapping speaker labels to roles.
                           e.g., {"SPEAKER_00": "CEO", "SPEAKER_01": "CFO"}
        """
        self.speaker_roster = speaker_roster or {}
    
    def detect_deflection_signals(self, text: str) -> list[str]:
        """Identify linguistic markers associated with deflection or uncertainty."""
        signals = []
        
        # re.IGNORECASE already makes matching case-insensitive,
        # so the text does not need to be lowercased first
        for pattern, label in self.DEFLECTION_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                signals.append(label)
        
        return signals
    
    def identify_role(self, speaker_label: str) -> str:
        """Map a speaker label to an executive role."""
        if speaker_label in self.speaker_roster:
            return self.speaker_roster[speaker_label]
        
        # Without a roster entry the role cannot be inferred from a diarization
        # label alone, so report it as unknown rather than guessing
        return "unknown"
    
    def parse(
        self,
        transcription_result: TranscriptionResult,
        qa_start_time: float | None = None
    ) -> list[ParsedSegment]:
        """
        Convert Whisper output into structured segments.
        
        Args:
            transcription_result: Output from WhisperTranscriber
            qa_start_time: Timestamp (seconds) when Q&A begins.
                          If None, heuristics will attempt to detect it.
        """
        parsed = []
        
        # Fallback heuristic: when no Q&A start time is supplied, assume Q&A
        # begins at 70% of the call duration. This is a rough guess; pass the
        # actual timestamp whenever the call structure is known.
        if qa_start_time is None:
            qa_start_time = transcription_result.duration * 0.70
        
        for segment in transcription_result.segments:
            start = segment["start"]
            text = segment["text"]
            speaker = segment.get("speaker", "unknown")
            role = self.identify_role(speaker).lower()
            
            # Determine segment type. Diarization labels look like "SPEAKER_00",
            # so check the roster-derived role as well as the raw label.
            if start < qa_start_time:
                segment_type = SegmentType.PREPARED.value
            elif role in ("operator", "moderator") or "Operator" in speaker or "Moderator" in speaker:
                segment_type = SegmentType.MODERATOR.value
            elif (
                role.startswith("analyst")
                or speaker.startswith("analyst")
                or "question" in text.lower()[:20]
            ):
                segment_type = SegmentType.ANALYST.value
            else:
                segment_type = SegmentType.QA.value
            
            parsed.append(ParsedSegment(
                start=start,
                end=segment["end"],
                speaker=speaker,
                speaker_role=self.identify_role(speaker),
                text=text,
                segment_type=segment_type,
                deflection_signals=self.detect_deflection_signals(text),
                confidence=segment["avg_confidence"],
            ))
        
        return parsed
    
    def to_llm_format(self, parsed_segments: list[ParsedSegment]) -> str:
        """
        Serialize structured segments into a text format optimized for LLM ingestion.
        Includes timing, speaker roles, and deflection flags.
        """
        lines = []
        
        for seg in parsed_segments:
            # Include role-weighted prefix
            role_tag = f"[{seg['speaker_role'].upper()}]"
            deflection_note = (
                f" [DEFLECTION: {', '.join(seg['deflection_signals'])}]"
                if seg['deflection_signals']
                else ""
            )
            
            lines.append(
                f"{role_tag} [{seg['start']:.1f}s-{seg['end']:.1f}s] "
                f"({seg['segment_type']}){deflection_note}\n"
                f"{seg['text']}"
            )
        
        return "\n\n".join(lines)

The deflection detection is deliberately pattern-based. A more sophisticated approach would train a lightweight classifier on manually labeled earnings calls. For most use cases, the pattern approach catches many of the obvious deflection signals at zero additional API cost.
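
The patterns above cover wording only. The timing side of deflection, how long an executive takes to start answering, can be approximated from segment timestamps. The sketch below assumes segments shaped like `ParsedSegment` and measures the gap between the end of an analyst question and the start of the answer that follows it. The function name `response_latencies` and the sample segments are hypothetical, and Whisper's segment boundaries around silence are imprecise, so treat the numbers as rough.

```python
def response_latencies(segments: list[dict]) -> list[dict]:
    """Gap in seconds between each analyst question and the answer right after it."""
    latencies = []
    for prev, curr in zip(segments, segments[1:]):
        is_question = prev["segment_type"] == "analyst_question"
        is_answer = curr["segment_type"] == "question_answer"
        if is_question and is_answer:
            latency = max(0.0, curr["start"] - prev["end"])
            latencies.append({
                "question_end": prev["end"],
                "answer_start": curr["start"],
                "latency_seconds": round(latency, 2),
            })
    return latencies


# Made-up segments for illustration
segments = [
    {"start": 0.0, "end": 10.0, "segment_type": "analyst_question"},
    {"start": 12.5, "end": 40.0, "segment_type": "question_answer"},
    {"start": 40.5, "end": 48.0, "segment_type": "analyst_question"},
    {"start": 48.2, "end": 70.0, "segment_type": "question_answer"},
]
latencies = response_latencies(segments)
print(latencies)
```

A latency well above a speaker's own typical value is more informative than any fixed threshold, since speaking styles differ.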


Stage 3: LLM Sentiment Scoring

This is where the quantitative signal emerges from qualitative text. The LLM prompt is the critical design artifact — it must be specific enough to produce consistent numerical scores across hundreds of calls, but flexible enough to handle the natural variation in how executives communicate.

We score four dimensions:

| Dimension | Definition | Scale |
|---|---|---|
| Forward-Looking Optimism | Degree of positive, confident language about future performance | 0.0 (highly pessimistic) to 1.0 (highly optimistic) |
| Caution Level | Presence of risk language, uncertainty acknowledgments, or conservative guidance | 0.0 (no caution) to 1.0 (extremely cautious) |
| Deflection Index | Proportion of segments containing deflection signals | 0.0 (no deflection) to 1.0 (heavy deflection) |
| Executive Confidence | Composite of linguistic confidence markers (hesitation, hedging, qualification) | 0.0 (low confidence) to 1.0 (high confidence) |
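
The four dimensions are later combined into a single composite using the weights 0.4 for optimism and 0.2 each for inverted caution, inverted deflection, and confidence. A worked example makes the direction of each term explicit. The function name `composite_score` and the range check are additions for illustration, and the input values are made up.

```python
def composite_score(
    optimism: float, caution: float, deflection: float, confidence: float
) -> float:
    """Weighted composite; caution and deflection are inverted so higher is better."""
    inputs = {
        "optimism": optimism,
        "caution": caution,
        "deflection": deflection,
        "confidence": confidence,
    }
    for name, value in inputs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")
    composite = (
        optimism * 0.4
        + (1 - caution) * 0.2
        + (1 - deflection) * 0.2
        + confidence * 0.2
    )
    return round(composite, 3)


# 0.8*0.4 + 0.7*0.2 + 0.8*0.2 + 0.7*0.2 = 0.32 + 0.14 + 0.16 + 0.14
example = composite_score(optimism=0.8, caution=0.3, deflection=0.2, confidence=0.7)
print(example)  # 0.76
```

Computing the composite in code from the four scores keeps it consistent even when the model's own arithmetic is off.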

The prompt below is engineered for GPT-4 class models. For faster, cheaper inference, gpt-4o-mini may produce comparable results on structured scoring tasks; validate it on a sample of calls before switching.

import os
import json
import requests
from typing import TypedDict


class SentimentScore(TypedDict):
    company: str
    ticker: str
    call_date: str
    forward_looking_optimism: float
    caution_level: float
    deflection_index: float
    executive_confidence: float
    composite_score: float
    reasoning: dict[str, str]
    segments_analyzed: int
    model_used: str


class EarningsSentimentAnalyzer:
    """
    LLM-powered earnings call sentiment scoring.
    Uses a structured prompt to extract numerical scores across
    four dimensions from a parsed transcript.
    """
    
    SYSTEM_PROMPT = """You are a quantitative financial analyst specializing in 
earnings call microstructure. Your task is to analyze earnings call transcripts 
and score them on specific, measurable dimensions.

SCORING DIMENSIONS:
1. forward_looking_optimism (0.0-1.0): Degree of positive, confident language 
   about future performance. Score 0.0 for pessimistic/defensive tone. 
   Score 1.0 for highly optimistic projections with strong conviction.
   
2. caution_level (0.0-1.0): Presence of risk language, uncertainty disclaimers, 
   or conservative guidance. Score 0.0 for reckless confidence. 
   Score 1.0 for excessive hedging and uncertainty acknowledgment.
   
3. deflection_index (0.0-1.0): Proportion of segments where executives deflect 
   difficult questions, redirect to safe topics, or avoid direct answers.
   Score 0.0 for direct, unhedged responses. Score 1.0 for consistent deflection.
   
4. executive_confidence (0.0-1.0): Overall confidence level of executive speakers.
   Based on sentence structure, hedging frequency, and response directness.
   Score 0.0 for highly uncertain, qualifying language. 
   Score 1.0 for assertive, direct statements.

OUTPUT FORMAT: Return ONLY a valid JSON object with the exact keys specified.
No markdown, no explanation, no preamble. The JSON must contain:
- score_<dimension> (float, 0.0-1.0)
- reasoning_<dimension> (string, 1-3 sentences explaining the score)
- key_quote_<dimension> (string, the most representative verbatim quote)
- overall_composite (float, weighted average: optimism×0.4 + (1-caution)×0.2 + (1-deflection)×0.2 + confidence×0.2)"""

    USER_PROMPT_TEMPLATE = """Analyze the following earnings call transcript for {company} ({ticker}), 
recorded on {call_date}.

Speaker roles:
{role_context}

Transcript:
{transcript}

Return your analysis as a JSON object."""

    def __init__(
        self,
        api_key: str | None = None,
        model: str = "gpt-4o",
        max_tokens: int = 800
    ):
        self.api_key = api_key or os.environ.get("OPENAI_API_KEY")
        self.model = model
        self.max_tokens = max_tokens
        self.base_url = "https://api.openai.com/v1/chat/completions"
    
    def _build_role_context(self, parsed_segments: list[dict]) -> str:
        """Extract unique speaker-role mappings from parsed segments."""
        roles = {}
        for seg in parsed_segments:
            speaker = seg.get("speaker", "unknown")
            role = seg.get("speaker_role", "unknown")
            if speaker not in roles:
                roles[speaker] = role
        
        return "\n".join(f"- {speaker}: {role}" for speaker, role in roles.items())
    
    def score_transcript(
        self,
        transcript_text: str,
        parsed_segments: list[dict],
        company: str,
        ticker: str,
        call_date: str,
        temperature: float = 0.1
    ) -> SentimentScore:
        """
        Score a single earnings call transcript.
        
        Args:
            transcript_text: Full transcript from Whisper (to_llm_format output)
            parsed_segments: Structured segments from TranscriptParser
            company: Company name
            ticker: Stock ticker
            call_date: Date of the earnings call
            temperature: LLM sampling temperature (lower = more deterministic)
        
        Returns:
            SentimentScore dict with all four dimensions and composite score
        """
        role_context = self._build_role_context(parsed_segments)
        
        # Truncate very long transcripts to stay well inside the context window.
        # GPT-4o supports a 128k-token context. At roughly 4 characters per
        # token for English text, 150,000 characters is about 37k tokens.
        max_chars = 150_000
        if len(transcript_text) > max_chars:
            transcript_text = transcript_text[:max_chars] + "\n\n[TRANSCRIPT TRUNCATED]"
        
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }
        
        payload = {
            "model": self.model,
            "messages": [
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {
                    "role": "user",
                    "content": self.USER_PROMPT_TEMPLATE.format(
                        company=company,
                        ticker=ticker,
                        call_date=call_date,
                        role_context=role_context,
                        transcript=transcript_text
                    )
                }
            ],
            "temperature": temperature,
            "max_tokens": self.max_tokens,
            "response_format": {"type": "json_object"}
        }
        
        # API call with timeout
        response = requests.post(
            self.base_url,
            headers=headers,
            json=payload,
            timeout=(3.05, 120)  # (connect, read); long transcripts can take well over 30s
        )
        
        if response.status_code != 200:
            raise RuntimeError(
                f"OpenAI API error {response.status_code}: {response.text}"
            )
        
        result = response.json()
        content = result["choices"][0]["message"]["content"]
        
        # Parse LLM JSON response
        try:
            scores = json.loads(content)
        except json.JSONDecodeError as e:
            raise ValueError(f"LLM returned invalid JSON: {e}\nContent: {content[:500]}")
        
        # Extract individual scores
        optimism = float(scores.get("score_forward_looking_optimism", 0.5))
        caution = float(scores.get("score_caution_level", 0.5))
        deflection = float(scores.get("score_deflection_index", 0.5))
        confidence = float(scores.get("score_executive_confidence", 0.5))
        
        # Calculate composite: optimism weighted most heavily
        # Adjusted for direction: lower caution = better
        composite = (
            optimism * 0.4 +
            (1 - caution) * 0.2 +
            (1 - deflection) * 0.2 +
            confidence * 0.2
        )
        
        return SentimentScore(
            company=company,
            ticker=ticker,
            call_date=call_date,
            forward_looking_optimism=optimism,
            caution_level=caution,
            deflection_index=deflection,
            executive_confidence=confidence,
            composite_score=round(composite, 3),
            reasoning={
                "optimism": scores.get("reasoning_forward_looking_optimism", ""),
                "caution": scores.get("reasoning_caution_level", ""),
                "deflection": scores.get("reasoning_deflection_index", ""),
                "confidence": scores.get("reasoning_executive_confidence", ""),
            },
            segments_analyzed=len(parsed_segments),
            model_used=self.model
        )


# ⚠️ Engineering note: For batch scoring across 50+ companies,
# implement parallel API calls with rate limiting.
# Rate limits depend on the model and your account's usage tier,
# so check the limits shown for your account before sizing the worker pool.
# Score one transcript per request so each call returns its own JSON object.
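
A minimal sketch of that idea follows, using only the standard library: a thread pool for parallelism and a small limiter that spaces requests evenly. `RateLimiter` and `score_all` are hypothetical names, and the `max_per_minute` value is a placeholder to replace with your account's actual limit.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Space calls at least 60 / max_per_minute seconds apart across threads."""

    def __init__(self, max_per_minute: int):
        self.min_interval = 60.0 / max_per_minute
        self._lock = threading.Lock()
        self._next_allowed = time.monotonic()

    def wait(self) -> None:
        # Reserve the next slot under the lock, then sleep outside it
        with self._lock:
            now = time.monotonic()
            sleep_for = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if sleep_for > 0:
            time.sleep(sleep_for)


def score_all(jobs: list[dict], score_fn, max_per_minute: int = 60, workers: int = 4) -> list:
    """Call score_fn(**job) for each job in parallel; results keep input order."""
    limiter = RateLimiter(max_per_minute)

    def task(job: dict):
        limiter.wait()
        return score_fn(**job)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, jobs))


# Stand-in scoring function so the sketch runs without an API key
def fake_score(ticker: str) -> str:
    return ticker.lower()


results = score_all([{"ticker": "NVDA"}, {"ticker": "MSFT"}], fake_score, max_per_minute=6000)
print(results)
```

In the pipeline, `score_fn` would be `analyzer.score_transcript` and each job a dict of its keyword arguments. An exception in any job surfaces when the results are collected, so wrap the call if one failure should not stop the batch.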

Putting It Together: End-to-End Pipeline

The three stages connect through a simple orchestration function. For institutional-grade deployment, replace the sequential loop with a task queue (Celery, RQ, or AWS SQS).

import json
from pathlib import Path
import pandas as pd


def run_sentiment_pipeline(
    audio_dir: str,
    ticker_metadata: dict[str, dict],
    output_dir: str = "./sentiment_results"
) -> pd.DataFrame:
    """
    End-to-end earnings call sentiment analysis pipeline.
    
    Args:
        audio_dir: Directory containing audio files named as {ticker}_{date}.m4a
        ticker_metadata: Dict mapping ticker -> {"company": str, "call_date": str}
        output_dir: Directory for JSON output results
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Initialize components
    transcriber = WhisperTranscriber(model_name="large-v3")
    parser = TranscriptParser()
    analyzer = EarningsSentimentAnalyzer(model="gpt-4o-mini")  # Faster + cheaper
    
    results = []
    audio_files = list(Path(audio_dir).glob("*.m4a")) + \
                  list(Path(audio_dir).glob("*.mp3"))
    
    for audio_file in audio_files:
        # Parse filename for metadata
        # Expected format: NVDA_2024-02-21.m4a
        stem = audio_file.stem
        ticker = stem.split("_")[0]
        
        if ticker not in ticker_metadata:
            print(f"⚠ No metadata found for {ticker}, skipping.")
            continue
        
        meta = ticker_metadata[ticker]
        
        try:
            # Stage 1: Transcribe
            print(f"\n[1/3] Transcribing {ticker}...")
            transcription = transcriber.transcribe_file(str(audio_file))
            
            # Stage 2: Parse
            print(f"[2/3] Parsing transcript for {ticker}...")
            parsed = parser.parse(transcription, qa_start_time=None)
            transcript_text = parser.to_llm_format(parsed)
            
            # Stage 3: Score
            print(f"[3/3] Scoring sentiment for {ticker}...")
            scores = analyzer.score_transcript(
                transcript_text=transcript_text,
                parsed_segments=parsed,
                company=meta["company"],
                ticker=ticker,
                call_date=meta["call_date"]
            )
            
            # Save individual result
            result_file = output_dir / f"{ticker}_sentiment.json"
            with open(result_file, "w") as f:
                json.dump(scores, f, indent=2)
            
            results.append(scores)
            print(f"✓ {ticker} complete — Composite: {scores['composite_score']:.3f}")
            
        except Exception as e:
            print(f"❌ Failed to process {ticker}: {e}")
            continue
    
    # Generate summary DataFrame (guard against an empty run, where the
    # sort column would not exist)
    df = pd.DataFrame(results)
    if not df.empty:
        df = df.sort_values("composite_score", ascending=False)
    
    summary_file = output_dir / "sentiment_summary.csv"
    df.to_csv(summary_file, index=False)
    print(f"\n✓ Pipeline complete. Summary saved to {summary_file}")
    
    return df


# Example usage
if __name__ == "__main__":
    metadata = {
        "NVDA": {"company": "NVIDIA Corporation", "call_date": "2024-02-21"},
        "TSLA": {"company": "Tesla Inc.", "call_date": "2024-01-24"},
        "MSFT": {"company": "Microsoft Corporation", "call_date": "2024-01-30"},
        "META": {"company": "Meta Platforms Inc.", "call_date": "2024-02-01"},
        "AMZN": {"company": "Amazon.com Inc.", "call_date": "2024-02-01"},
    }
    
    df = run_sentiment_pipeline(
        audio_dir="./earnings_calls_q4_2023",
        ticker_metadata=metadata
    )
    
    if df.empty:
        print("\nNo sentiment scores to report.")
    else:
        print("\n=== Sentiment Summary ===")
        print(df[["ticker", "forward_looking_optimism", "caution_level", 
                  "deflection_index", "composite_score"]].to_string(index=False))

From Scores to Signals: Integrating with TickDB

The sentiment scores above are cross-sectional: they tell you how one call compares to another within the same earnings season. What turns them into trading signals is the temporal dimension, that is, how sentiment changed relative to the prior quarter for the same company.

This is where TickDB's historical data infrastructure becomes essential.

For each company in your earnings coverage universe, you can:

  1. Pull historical price data around earnings dates using TickDB's kline endpoint.
  2. Align sentiment scores to the same timestamps.
  3. Calculate the sentiment delta: this quarter's composite score minus last quarter's.
  4. Backtest the hypothesis: companies with high sentiment deltas (large positive surprises) outperform in the 5 trading days following the call.
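
Step 3 can be sketched in a few lines. The snippet below is a minimal illustration, assuming each scored call is kept as a dict with ticker, call_date (ISO format), and composite_score keys. The function name sentiment_deltas, the tickers, and the scores are hypothetical placeholders, not output from the pipeline.

```python
from collections import defaultdict


def sentiment_deltas(records):
    """Return one row per call holding the change from the same ticker's previous call."""
    by_ticker = defaultdict(list)
    for rec in records:
        by_ticker[rec["ticker"]].append(rec)

    rows = []
    for ticker, calls in by_ticker.items():
        # ISO YYYY-MM-DD strings sort chronologically as plain strings.
        calls.sort(key=lambda rec: rec["call_date"])
        for prev, curr in zip(calls, calls[1:]):
            delta = curr["composite_score"] - prev["composite_score"]
            rows.append({
                "ticker": ticker,
                "call_date": curr["call_date"],
                "sentiment_delta": round(delta, 3),
            })
    return rows


# Placeholder history for two made-up tickers
records = [
    {"ticker": "AAA", "call_date": "2023-11-01", "composite_score": 0.62},
    {"ticker": "AAA", "call_date": "2024-02-01", "composite_score": 0.75},
    {"ticker": "BBB", "call_date": "2023-10-15", "composite_score": 0.55},
    {"ticker": "BBB", "call_date": "2024-01-15", "composite_score": 0.40},
]
deltas = sentiment_deltas(records)
for row in deltas:
    print(row)
```

The first call for each ticker produces no row because there is nothing to compare it against. The resulting list can be loaded into a DataFrame if that suits the rest of the workflow.
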

import os
from datetime import datetime, timedelta

import pandas as pd
import requests


def fetch_earnings_window_prices(
    ticker: str,
    earnings_date: str,
    window_days: int = 10
) -> list[dict]:
    """
    Fetch OHLCV data around an earnings date using TickDB.
    """
    api_key = os.environ.get("TICKDB_API_KEY")
    if not api_key:
        raise RuntimeError("TICKDB_API_KEY environment variable is not set")
    
    # Parse earnings date
    base_date = datetime.strptime(earnings_date, "%Y-%m-%d")
    start_date = (base_date - timedelta(days=window_days)).strftime("%Y-%m-%d")
    end_date = (base_date + timedelta(days=window_days)).strftime("%Y-%m-%d")
    
    headers = {"X-API-Key": api_key}
    params = {
        "symbol": f"{ticker}.US",  # US equity format
        "interval": "1d",
        "start_time": start_date,
        "end_time": end_date,
        "limit": window_days * 2 + 1
    }
    
    response = requests.get(
        "https://api.tickdb.ai/v1/market/kline",
        headers=headers,
        params=params,
        timeout=(3.05, 10)
    )
    
    if response.status_code != 200:
        raise RuntimeError(f"TickDB API error: {response.status_code}")
    
    data = response.json()
    
    if data.get("code") != 0:
        raise RuntimeError(f"TickDB error {data.get('code')}: {data.get('message')}")
    
    return data.get("data", [])


# ⚠️ Verify symbol availability before querying
# Use: GET https://api.tickdb.ai/v1/symbols/available?market=US
# to retrieve the current list of supported US equity symbols.

def backtest_sentiment_signal(
    sentiment_df: pd.DataFrame,
    sentiment_history: dict[str, list[float]],
    price_data: dict[str, list[dict]],
    holding_period: int = 5
) -> pd.DataFrame:
    """
    Backtest the sentiment delta signal.
    
    Hypothesis: High positive sentiment delta → positive returns over holding_period.
    
    Args:
        sentiment_df: Current quarter sentiment scores
        sentiment_history: Dict of ticker -> [prev_q_score, prev_prev_q_score, ...]
            (most recent quarter first)
        price_data: Dict of ticker -> daily bars, ordered oldest first and
            trimmed so that index 0 is the earnings-day bar. The window from
            fetch_earnings_window_prices starts before the call, so drop the
            pre-earnings bars before passing it in.
        holding_period: Trading days to hold after earnings
    """
    results = []
    
    for _, row in sentiment_df.iterrows():
        ticker = row["ticker"]
        
        # Skip tickers with no prior-quarter score to compare against
        if not sentiment_history.get(ticker):
            continue
        
        # Index 0 is the previous quarter (history is most recent first)
        prev_score = sentiment_history[ticker][0]
        sentiment_delta = row["composite_score"] - prev_score
        
        # Extract post-earnings returns from price data
        prices = price_data.get(ticker, [])
        if len(prices) < holding_period + 2:
            continue
        
        # Close-to-close return: enter at the close of the first session after
        # the earnings-day bar (index 1), exit holding_period sessions later
        entry_price = prices[1]["close"]
        exit_price = prices[holding_period + 1]["close"]
        holding_return = (exit_price - entry_price) / entry_price
        
        results.append({
            "ticker": ticker,
            "sentiment_delta": round(sentiment_delta, 3),
            "prev_sentiment": prev_score,
            "current_sentiment": row["composite_score"],
            "holding_return": round(holding_return, 4),
            "signal": "long" if sentiment_delta > 0.1 else "neutral"
        })
    
    signal_df = pd.DataFrame(results)
    
    # Performance metrics
    if len(signal_df) > 0:
        long_signals = signal_df[signal_df["signal"] == "long"]
        
        print("\n=== Backtest Results ===")
        print(f"Total signals: {len(signal_df)}")
        print(f"Long signals: {len(long_signals)}")
        
        # Averages over an empty selection would print NaN, so report them
        # only when at least one long signal exists
        if len(long_signals) > 0:
            print(f"Average return (long signals): {long_signals['holding_return'].mean():.2%}")
            print(f"Average sentiment delta (long): {long_signals['sentiment_delta'].mean():.3f}")
            print(f"Win rate (long signals): {(long_signals['holding_return'] > 0).mean():.1%}")
    
    return signal_df
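
The backtest measures returns by list position, so the bars it receives need to begin at the earnings-day session. The sketch below shows one way to prepare them, assuming each bar is a dict carrying an ISO date string under a "date" key next to "close". That field name is a hypothetical stand-in, since the TickDB response schema is not shown here, and both helper names are illustrative.

```python
from datetime import date, timedelta


def bars_from_earnings_day(bars, earnings_date, date_key="date"):
    """Sort bars oldest first and drop everything before earnings_date.

    date_key is a hypothetical field name holding an ISO date string.
    """
    ordered = sorted(bars, key=lambda bar: bar[date_key])
    # ISO YYYY-MM-DD strings compare correctly as plain strings.
    return [bar for bar in ordered if bar[date_key] >= earnings_date]


def forward_return(bars, holding_period=5):
    """Close-to-close return starting from the session after earnings (index 1)."""
    if len(bars) < holding_period + 2:
        return None  # not enough post-earnings sessions
    entry = bars[1]["close"]
    exit_ = bars[holding_period + 1]["close"]
    return (exit_ - entry) / entry


# Synthetic bars: 12 consecutive dates with closes 100.0 through 111.0
synthetic = [
    {"date": (date(2024, 1, 1) + timedelta(days=i)).isoformat(), "close": 100.0 + i}
    for i in range(12)
]
trimmed = bars_from_earnings_day(synthetic, "2024-01-04")
ret = forward_return(trimmed, holding_period=5)
print(trimmed[0]["date"], ret)
```

A list trimmed this way can be stored under price_data[ticker] before calling backtest_sentiment_signal.
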

Limitations

This pipeline has three significant limitations worth acknowledging:

First, speaker diarization accuracy. Whisper does not natively diarize speakers. Pyannote adds this capability but introduces its own error rate. In calls with more than 8 participants, speaker assignment reliability drops below 80%. For those cases, falling back to a binary "executive vs. non-executive" classification is more robust than attributing each segment to an individual speaker.

Second, the LLM scoring is not calibrated. A composite score of 0.75 for NVDA and 0.75 for TSLA means the model rated both as relatively positive — but not that they are equally positive in an absolute sense. Calibration requires running the same prompt on a labeled dataset of earnings calls with known market outcomes, then adjusting thresholds.
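
A lighter step toward comparability, short of the outcome-based calibration described above, is to express each raw score as a percentile rank within a reference set of previously scored calls. The sketch below swaps in that simpler normalization; the function name and the reference values are placeholders chosen for illustration.

```python
from bisect import bisect_right


def percentile_rank(score, reference_scores):
    """Fraction of reference scores that are less than or equal to score."""
    ordered = sorted(reference_scores)
    if not ordered:
        raise ValueError("reference_scores must not be empty")
    return bisect_right(ordered, score) / len(ordered)


# Placeholder composite scores from earlier calls
reference = [0.30, 0.45, 0.50, 0.60, 0.75, 0.80, 0.85, 0.90]
rank = percentile_rank(0.75, reference)
print(rank)
```

A rank says where a call sits relative to past calls scored with the same prompt. It does not say anything about market outcomes, so it complements labeled calibration rather than replacing it.
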

Third, the sentiment-return hypothesis requires validation. The code above implements a backtest framework. Whether the signal has predictive power depends on the market regime, the sample period, and the slippage assumptions. A minimum viable backtest should cover 20 earnings seasons with Sharpe and max drawdown metrics reported.
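
The two metrics named above can be computed from a list of per-trade returns. This is a minimal sketch under stated assumptions: the risk-free rate is taken as zero, periods_per_year is whatever trade frequency applies (4 is used here for one trade per quarter), and the return values are placeholders.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(returns, periods_per_year):
    """Annualized Sharpe ratio of per-trade returns, risk-free rate assumed zero."""
    if len(returns) < 2:
        return float("nan")
    spread = stdev(returns)  # sample standard deviation
    if spread == 0:
        return float("nan")
    return mean(returns) / spread * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve (zero or negative)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


# Placeholder holding-period returns, one per trade
trade_returns = [0.021, -0.013, 0.034, -0.042, 0.018, 0.009]
print(sharpe_ratio(trade_returns, periods_per_year=4))
print(max_drawdown(trade_returns))
```

Both functions treat the trades as sequential and fully invested, which overstates drawdown for overlapping positions and ignores slippage. Those effects need to be modeled separately.
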

For production deployment, the next additions are:

  • Sentiment history tracking: Store quarterly scores in a time-series database (TimescaleDB, InfluxDB) for longitudinal analysis.
  • Consensus comparison: Pull Refinitiv or Bloomberg consensus estimates and compare actual sentiment against consensus. The delta is a cleaner signal than raw sentiment.
  • Multi-factor integration: Combine sentiment scores with technical signals (order book imbalance, short interest) into a multi-factor model. TickDB's depth channel is purpose-built for this integration.
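
As a rough illustration of the multi-factor idea, one common approach is to standardize each factor across the universe and take a weighted sum. The sketch below assumes equal-length factor lists aligned by ticker; the factor names, weights, and values are hypothetical and not derived from any TickDB channel.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def combine_factors(factors, weights):
    """Weighted sum of z-scored factors; each list is aligned by ticker position."""
    size = len(next(iter(factors.values())))
    combined = [0.0] * size
    for name, values in factors.items():
        for i, z in enumerate(zscores(values)):
            combined[i] += weights[name] * z
    return combined


# Placeholder inputs for three tickers
factors = {
    "sentiment_delta": [0.13, -0.15, 0.02],
    "book_imbalance": [0.20, -0.05, 0.10],
}
weights = {"sentiment_delta": 0.6, "book_imbalance": 0.4}
scores = combine_factors(factors, weights)
print(scores)
```

Standardizing first keeps a factor with a wide numeric range from dominating the sum. The weights themselves are a modeling choice that should come out of the backtest.
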

Next Steps

If you want to build the data infrastructure first, TickDB provides WebSocket access to real-time order book data for US equities via the depth channel, alongside 10+ years of historical OHLCV for backtesting your sentiment-strategy combinations. Sign up at tickdb.ai — no credit card required for the free tier.

If you are running this analysis at scale, the enterprise plan includes dedicated API throughput, SLA-backed latency guarantees, and access to extended historical data for cross-cycle backtesting.

If you use AI coding assistants, search for the tickdb-market-data SKILL in your AI tool's marketplace. It provides pre-built integration templates for the pipeline described in this article.


This article does not constitute investment advice. Earnings call sentiment is one input among many in a trading strategy. Backtest results and past performance do not guarantee future results. Markets involve risk.