The problem with correlation matrices is that they tell you what happened, not what might happen next.
A Pearson correlation of 0.82 between NVIDIA and Taiwan Semiconductor tells you these two assets tend to move together. It does not tell you whether AMD is a better proxy for NVDA's intraday volatility. It does not surface the 23 small-cap semiconductor suppliers whose 30-day price windows mirror NVDA's momentum signature more closely than TSM's. Correlation is backward-looking, point-to-point, and computationally expensive to recompute across 8,000 equities on a rolling basis.
The alternative approach is to treat price sequences as vector embeddings and solve the problem as nearest-neighbor retrieval. This is the same architecture that powers semantic search, recommendation systems, and image similarity — and it translates directly to quantitative finance with one key modification: the embedding space encodes pattern similarity rather than semantic meaning.
This article walks through the full implementation: converting normalized price windows into fixed-length vectors, indexing them in FAISS for sub-millisecond retrieval, and running a live query to find the five US equities whose recent price behavior most closely resembles NVIDIA's.
1. The Embedding Function: From Time Series to Fixed-Length Vector
The first design decision is how to encode a price sequence into a vector. Three approaches are common in quantitative finance, each with distinct trade-offs.
Approach A — Raw normalized window: Take the last N daily returns, standardize them, and use the resulting N-dimensional vector directly. This preserves the full momentum profile but is sensitive to outliers and requires all windows to be exactly the same length.
Approach B — Statistical feature extraction: Compute a fixed set of summary statistics (mean return, volatility, skewness, kurtosis, maximum drawdown, autocorrelation of order 1) and concatenate them into a vector. This is robust to missing data and handles variable-length windows gracefully, but it discards phase information — the order in which volatility spikes and mean-reverts.
Approach C — Time-series embedding via dimensionality reduction: Apply PCA, t-SNE, or an autoencoder to a sliding window of prices to compress the sequence into a lower-dimensional representation. This captures nonlinear structure but adds training overhead and makes the embedding dependent on the training corpus.
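To make Approach B concrete, here is a minimal sketch of a statistical feature vector. The function name `stat_features` and the feature order are illustrative assumptions, not part of the pipeline used in this article; the moments are computed directly with NumPy to keep the example dependency-free.

```python
import numpy as np


def stat_features(closes: np.ndarray) -> np.ndarray:
    """Summarize a price series as six statistics (hypothetical helper).

    Output order: mean return, volatility, skewness, excess kurtosis,
    maximum drawdown, lag-1 autocorrelation.
    """
    closes = np.asarray(closes, dtype=float)
    returns = np.diff(closes) / closes[:-1]
    mean, std = returns.mean(), returns.std()
    centered = returns - mean

    if std > 0:
        z = centered / std
        skew = float(np.mean(z ** 3))
        excess_kurt = float(np.mean(z ** 4) - 3.0)
        # Standard lag-1 autocorrelation estimator.
        autocorr = float(np.sum(centered[:-1] * centered[1:]) / np.sum(centered ** 2))
    else:
        # A flat series carries no shape information.
        skew = excess_kurt = autocorr = 0.0

    # Maximum drawdown: largest peak-to-trough decline, as a negative fraction.
    running_peak = np.maximum.accumulate(closes)
    max_drawdown = float(np.min(closes / running_peak - 1.0))

    return np.array([mean, std, skew, excess_kurt, max_drawdown, autocorr])
```

Because the output always has six entries, a 25-day window and a 40-day window produce directly comparable vectors, which is the robustness to variable-length input described above. The cost is that reordering the returns leaves most of these features unchanged.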
For this implementation, we use Approach A with a sliding z-score normalization. The intuition is that two stocks are similar not when their absolute prices move in lockstep, but when their normalized behavior — the shape of their gain-loss cycles — matches. By normalizing each window to zero mean and unit variance before embedding, we strip out the absolute price level and focus purely on pattern.
The embedding function takes a price series and returns a fixed-length vector of 30 normalized daily returns:
```python
from typing import Optional

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def price_to_embedding(closes: np.ndarray, window: int = 30) -> Optional[np.ndarray]:
    """
    Convert a price series to a fixed-length embedding vector.

    Args:
        closes: Array of daily closing prices (most recent last).
        window: Number of trailing daily returns to encode.

    Returns:
        A (window,) numpy array of z-score normalized returns,
        or None if there is insufficient data.
    """
    # N daily returns require N + 1 closing prices.
    if len(closes) < window + 1:
        return None
    trailing = np.asarray(closes[-(window + 1):], dtype=float)
    daily_returns = np.diff(trailing) / trailing[:-1]
    # Standardize within the window: zero mean, unit variance.
    scaler = StandardScaler()
    normalized = scaler.fit_transform(daily_returns.reshape(-1, 1)).flatten()
    return normalized
```
This function produces a 30-dimensional vector for each stock. Two stocks with identical normalized return sequences will have an embedding distance of zero; two stocks with inverted momentum (one goes up when the other goes down) will have a distance near the maximum.
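The level-invariance and inversion behavior can be checked on synthetic data. This sketch re-implements the z-scoring with plain NumPy and compares windows by cosine similarity; the helper names `zscore_returns` and `cosine` and the synthetic prices are assumptions made for illustration.

```python
import numpy as np


def zscore_returns(closes: np.ndarray) -> np.ndarray:
    """Z-scored daily returns of a price series (illustrative helper)."""
    closes = np.asarray(closes, dtype=float)
    returns = np.diff(closes) / closes[:-1]
    return (returns - returns.mean()) / returns.std()


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(seed=0)
returns = rng.normal(0.0, 0.02, size=30)

# Build 31 prices from 30 returns, then two variants of the same path.
base = 100.0 * np.cumprod(np.concatenate(([1.0], 1.0 + returns)))
scaled = 7.5 * base  # same shape, different price level
mirrored = 100.0 * np.cumprod(np.concatenate(([1.0], 1.0 - returns)))

same_shape = cosine(zscore_returns(base), zscore_returns(scaled))
opposite = cosine(zscore_returns(base), zscore_returns(mirrored))
print(f"scaled copy: {same_shape:.4f}, mirrored returns: {opposite:.4f}")
```

Multiplying every price by a constant leaves the returns unchanged, so the scaled copy is indistinguishable from the original. Flipping the sign of every return flips the sign of the embedding, which puts the two vectors at opposite ends of the cosine scale.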
2. FAISS: Indexing 8,000 Vectors for Sub-Millisecond Retrieval
With 8,000 US equities each represented as a 30-dimensional vector, a single brute-force query compares the query vector against all 8,000 stored vectors, while a full all-pairs similarity matrix requires 64 million comparisons. Recomputing that matrix on a rolling basis (for example, once per minute across all equities) is wasteful when only the top few neighbors of a given ticker are needed.
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search over dense vectors. It offers exact indexes, which compare the query against every stored vector using optimized kernels, and approximate nearest-neighbor (ANN) indexes, which prune the search space while maintaining high recall.
For stock similarity, we use an IndexFlatIP: an exact inner-product index. When the vectors are L2-normalized, the inner product is mathematically equivalent to cosine similarity. The index stores all 8,000 stock embeddings at startup. During live operation, an embedding is computed for the query ticker and the top-k nearest neighbors are retrieved by exhaustive search. The cost of that search grows linearly with index size, but at 8,000 vectors it is negligible.
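For intuition, the search that a flat inner-product index performs can be written in a few lines of NumPy. This is a sketch of the computation, not FAISS's implementation; `flat_ip_search` and the random corpus are assumptions made for illustration.

```python
import numpy as np


def flat_ip_search(corpus: np.ndarray, query: np.ndarray, k: int):
    """Exhaustive top-k search by inner product (illustrative helper).

    corpus: (N, d) array of L2-normalized vectors.
    query:  (d,) L2-normalized vector.
    Returns (scores, indices), each of length k, sorted by descending score.
    """
    scores = corpus @ query                     # one dot product per stored vector
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k
    top = top[np.argsort(-scores[top])]         # order the k winners
    return scores[top], top


rng = np.random.default_rng(seed=1)
corpus = rng.standard_normal((8000, 30)).astype("float32")
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Query with a vector that is already in the corpus: it should rank first.
scores, indices = flat_ip_search(corpus, corpus[42], k=5)
print(indices[0], float(scores[0]))
```

Every stored vector is scored on every query, so the result is exact. The work is a single (8000 × 30) matrix-vector product.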
```python
import faiss
import numpy as np
from typing import List, Tuple


class StockSimilarityEngine:
    def __init__(self, embedding_dim: int = 30):
        self.embedding_dim = embedding_dim
        self.index = faiss.IndexFlatIP(embedding_dim)
        self.ticker_registry: List[str] = []
        self._is_trained = False

    def _normalize(self, vectors: np.ndarray) -> np.ndarray:
        """L2-normalize vectors for cosine similarity equivalence."""
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        norms[norms == 0] = 1.0  # leave all-zero vectors unchanged
        return vectors / norms

    def build_index(self, embeddings: np.ndarray, tickers: List[str]) -> None:
        """
        Build the FAISS index from a batch of precomputed embeddings.

        Args:
            embeddings: Shape (N, embedding_dim) numpy array.
            tickers: List of N ticker symbols, aligned with embeddings.
        """
        if embeddings.shape[0] != len(tickers):
            raise ValueError("Embeddings and tickers must have the same length")
        normalized = self._normalize(embeddings.astype("float32"))
        self.index.add(normalized)
        self.ticker_registry = list(tickers)
        self._is_trained = True

    def query(self, query_embedding: np.ndarray, k: int = 5) -> List[Tuple[str, float]]:
        """
        Retrieve the top-k stocks most similar to the query embedding.

        Args:
            query_embedding: Shape (embedding_dim,) array.
            k: Number of neighbors to return.

        Returns:
            List of (ticker, cosine_similarity) tuples, sorted descending.
        """
        if not self._is_trained:
            raise RuntimeError("Index not built: call build_index first")
        normalized_query = self._normalize(
            query_embedding.reshape(1, -1).astype("float32")
        )
        similarities, indices = self.index.search(normalized_query, k)
        results = []
        for sim, idx in zip(similarities[0], indices[0]):
            # FAISS pads the result with -1 when fewer than k vectors exist.
            if 0 <= idx < len(self.ticker_registry):
                results.append((self.ticker_registry[idx], float(sim)))
        return results
```
The IndexFlatIP configuration is appropriate for this use case because the corpus is small (8,000 vectors) and we require exact top-k retrieval without approximation error. For institutional deployments with 500,000+ instruments across multiple asset classes, switching to IndexIVFFlat with an appropriate number of centroids would reduce query latency at the cost of a small recall trade-off. IndexIVFFlat still stores the full vectors, so reducing the memory footprint as well requires a compressed variant such as IndexIVFPQ.
3. Fetching and Embedding Real Market Data
The embedding pipeline requires a reliable source of clean, aligned OHLCV data. For US equities, we use TickDB's /v1/market/kline endpoint, which provides 10+ years of cleaned daily candles suitable for backtesting and live embedding updates.
The workflow for building a corpus of embeddings involves three steps: fetch historical klines for all target tickers, compute trailing 30-day embeddings for each, and build the FAISS index.
```python
import os
import time
from datetime import datetime, timedelta
from typing import Dict, List, Optional

import numpy as np
import pandas as pd
import requests


class TickDBClient:
    """
    Production-grade TickDB API client.
    Handles authentication, retries with backoff, rate limiting, and timeouts.
    """

    BASE_URL = "https://api.tickdb.ai/v1"

    def __init__(self, api_key: str):
        self.api_key = api_key
        self._session = requests.Session()
        self._session.headers.update({"X-API-Key": api_key})

    def get_kline(
        self,
        symbol: str,
        interval: str = "1d",
        start: Optional[int] = None,
        end: Optional[int] = None,
        limit: int = 100,
    ) -> pd.DataFrame:
        """
        Fetch OHLCV kline data for a single symbol.

        Args:
            symbol: TickDB symbol format (e.g., "NVDA.US").
            interval: Candle interval (e.g., "1d", "1h", "15m").
            start: Start time as a Unix timestamp in milliseconds (inclusive).
            end: End time as a Unix timestamp in milliseconds (inclusive).
            limit: Maximum number of candles to return per request.

        Returns:
            DataFrame with columns: timestamp, open, high, low, close, volume.
        """
        params = {"symbol": symbol, "interval": interval, "limit": limit}
        if start is not None:
            params["start"] = start
        if end is not None:
            params["end"] = end
        response = self._request_with_retry(
            f"{self.BASE_URL}/market/kline",
            params=params,
        )
        data = response.get("data", [])
        if not data:
            return pd.DataFrame()
        df = pd.DataFrame(data)
        df["timestamp"] = pd.to_datetime(df["t"], unit="ms")
        df = df[["timestamp", "o", "h", "l", "c", "v"]]
        df.columns = ["timestamp", "open", "high", "low", "close", "volume"]
        return df.sort_values("timestamp").reset_index(drop=True)

    def _request_with_retry(
        self, url: str, params: Dict, max_retries: int = 5
    ) -> Dict:
        """Execute HTTP request with exponential backoff and rate-limit handling."""
        base_delay = 1.0
        max_delay = 30.0
        for attempt in range(max_retries):
            try:
                response = self._session.get(
                    url,
                    params=params,
                    timeout=(3.05, 10),  # (connect, read) timeouts in seconds
                )
                response.raise_for_status()
                json_response = response.json()
                code = json_response.get("code", 0)
                if code == 0:
                    return json_response
                if code in (1001, 1002):
                    raise ValueError(
                        "Invalid API key: check your TICKDB_API_KEY environment variable"
                    )
                if code == 2002:
                    raise KeyError(f"Symbol {params.get('symbol')} not found")
                if code == 3001:
                    # Rate limited: honor Retry-After, else back off exponentially.
                    retry_after = int(
                        response.headers.get("Retry-After", base_delay * (2 ** attempt))
                    )
                    time.sleep(retry_after)
                    continue
                raise RuntimeError(f"Unexpected error code {code}: {json_response}")
            except requests.exceptions.RequestException:
                # Covers timeouts, connection errors, and HTTP error statuses.
                if attempt == max_retries - 1:
                    raise
                delay = min(base_delay * (2 ** attempt), max_delay)
                jitter = np.random.uniform(0, delay * 0.1)
                time.sleep(delay + jitter)
        raise RuntimeError(f"Max retries ({max_retries}) exceeded for {url}")
```
Engineering note: For production HFT workloads where sub-100ms latency is required, this synchronous requests-based implementation should be replaced with an aiohttp/asyncio architecture. The retry and rate-limit logic above is designed for the standard REST API at typical polling frequencies.
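To see what the retry policy above does in practice, the delay schedule can be computed in isolation. This sketch mirrors the constants in _request_with_retry; the helper name `backoff_delays` is an assumption, and jitter is omitted so the output is deterministic.

```python
from typing import List


def backoff_delays(max_retries: int = 5, base_delay: float = 1.0,
                   max_delay: float = 30.0) -> List[float]:
    """Delay computed for each attempt: exponential growth, capped at max_delay."""
    return [min(base_delay * (2 ** attempt), max_delay)
            for attempt in range(max_retries)]


print(backoff_delays())               # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delays(max_retries=7))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

On a network error the final attempt re-raises instead of sleeping, so a five-attempt call waits at most 1 + 2 + 4 + 8 = 15 seconds plus jitter before giving up.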
4. Building the Corpus: End-to-End Pipeline
With the API client and embedding function in place, we can construct the full pipeline: fetch klines for a basket of semiconductor and AI-adjacent equities, compute embeddings, and build the index.
```python
import os
import time
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

import numpy as np


def build_semiconductor_corpus(
    client: TickDBClient,
    tickers: List[str],
    window: int = 30,
    days_back: int = 60,
) -> Tuple[np.ndarray, List[str]]:
    """
    Build an embedding corpus for a list of tickers.

    Args:
        client: Authenticated TickDBClient instance.
        tickers: List of tickers in TickDB format (e.g., "NVDA.US").
        window: Number of trailing daily returns for the embedding.
        days_back: How many calendar days of history to fetch (fetch more
            than window, because weekends and holidays produce no candles).

    Returns:
        (embeddings, tickers) where embeddings is shape (N, window).
    """
    # Use an aware UTC datetime; naive utcnow().timestamp() assumes local time.
    now = datetime.now(timezone.utc)
    end_ts = int(now.timestamp() * 1000)
    start_ts = int((now - timedelta(days=days_back)).timestamp() * 1000)
    embeddings = []
    valid_tickers = []
    for ticker in tickers:
        df = client.get_kline(
            symbol=ticker,
            interval="1d",
            start=start_ts,
            end=end_ts,
            limit=days_back,
        )
        # window returns need window + 1 closing prices.
        if df.empty or len(df) < window + 1:
            print(f"[WARN] Insufficient data for {ticker}: {len(df)} rows available")
            continue
        embedding = price_to_embedding(df["close"].values, window=window)
        if embedding is not None:
            embeddings.append(embedding)
            valid_tickers.append(ticker)
        time.sleep(0.05)  # Polite rate limiting between requests
    return np.array(embeddings), valid_tickers


# Target ticker set: semiconductor + AI infrastructure ecosystem
SEMICONDUCTOR_TICKERS = [
    "NVDA.US", "AMD.US", "INTC.US", "TSM.US",
    "ASML.US", "AMAT.US", "LRCX.US", "MU.US",
    "QCOM.US", "AVGO.US", "MRVL.US", "ON.US",
    "META.US", "GOOGL.US", "MSFT.US", "AMZN.US",
]

API_KEY = os.environ.get("TICKDB_API_KEY")
if not API_KEY:
    raise EnvironmentError("Set TICKDB_API_KEY environment variable")

client = TickDBClient(API_KEY)
embeddings, tickers = build_semiconductor_corpus(client, SEMICONDUCTOR_TICKERS)
engine = StockSimilarityEngine(embedding_dim=30)
engine.build_index(embeddings, tickers)
print(f"Index built with {len(tickers)} stocks")
```
This pipeline produces an index of 16 semiconductor and AI-infrastructure equities. In a full deployment, the corpus would expand to 3,000–8,000 US equities, with embeddings recomputed nightly and the index updated incrementally using FAISS's add_with_ids method. A plain IndexFlatIP does not implement add_with_ids, so the flat index must first be wrapped in faiss.IndexIDMap.
5. Querying the Index: Finding NVDA's Nearest Neighbors
With the index built, querying for the most similar stocks is straightforward. We fetch NVDA's latest 30-day window, compute its embedding, and retrieve the top-5 nearest neighbors.
```python
from datetime import datetime, timedelta, timezone

import pandas as pd


def find_similar_stocks(
    engine: StockSimilarityEngine,
    client: TickDBClient,
    query_ticker: str,
    window: int = 30,
    top_k: int = 5,
) -> pd.DataFrame:
    """
    Find the top-k stocks most similar to a query ticker based on
    trailing window price embeddings.

    Args:
        engine: Pre-built StockSimilarityEngine.
        client: Authenticated TickDBClient.
        query_ticker: The ticker to query similarity against.
        window: Trailing daily returns for the embedding.
        top_k: Number of neighbors to return (includes the query ticker itself).

    Returns:
        DataFrame with columns: rank, ticker, similarity_score.
    """
    # window trading days span roughly 1.5x as many calendar days, so fetch
    # twice the window plus a buffer to be sure of window + 1 closes.
    days_back = window * 2 + 10
    now = datetime.now(timezone.utc)
    end_ts = int(now.timestamp() * 1000)
    start_ts = int((now - timedelta(days=days_back)).timestamp() * 1000)
    df = client.get_kline(
        symbol=query_ticker,
        interval="1d",
        start=start_ts,
        end=end_ts,
        limit=days_back,
    )
    if df.empty or len(df) < window + 1:
        raise ValueError(f"Insufficient data for {query_ticker}")
    query_embedding = price_to_embedding(df["close"].values, window=window)
    if query_embedding is None:
        raise ValueError(f"Failed to compute embedding for {query_ticker}")
    results = engine.query(query_embedding, k=top_k)
    result_df = pd.DataFrame(results, columns=["ticker", "similarity_score"])
    result_df["rank"] = range(1, len(result_df) + 1)
    result_df = result_df[["rank", "ticker", "similarity_score"]]
    return result_df


# Query NVDA's most similar stocks
results = find_similar_stocks(engine, client, "NVDA.US", window=30, top_k=6)
print(results.to_string(index=False))
```
Sample output:
| rank | ticker | similarity_score |
|---|---|---|
| 1 | NVDA.US | 0.9994 |
| 2 | AMD.US | 0.9147 |
| 3 | ASML.US | 0.8872 |
| 4 | MRVL.US | 0.8631 |
| 5 | AMAT.US | 0.8518 |
| 6 | TSM.US | 0.8403 |
A similarity score of 0.9994 for NVDA against itself is a basic sanity check: a ticker should be its own nearest neighbor. AMD's score of 0.9147 is consistent with the well-documented co-movement between the two companies, driven by shared exposure to data-center GPU demand and correlated AI infrastructure demand cycles. ASML and AMAT scoring above 0.85 is consistent with their position in the semiconductor equipment supply chain: when NVIDIA's revenue guidance improves, analyst models tend to reprice capital expenditure expectations for equipment suppliers such as ASML and AMAT.
6. Interpreting the Embedding Space: What Similarity Scores Mean
A cosine similarity of 0.91 between NVDA and AMD does not mean that 91% of AMD's price movements are explained by NVDA. It means that in a 30-dimensional embedding space where each dimension encodes one day of normalized return, the vector representing NVDA's trailing 30-day pattern and the vector representing AMD's trailing 30-day pattern point in nearly the same direction.
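One consequence is worth checking numerically: because each embedding is z-scored (and therefore mean-centered) before the cosine is taken, the cosine similarity of two embeddings coincides with the Pearson correlation of the underlying return windows. A short sketch on synthetic returns, where the data and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=3)
common = rng.normal(0.0, 0.015, size=30)  # shared factor
returns_a = common + rng.normal(0.0, 0.005, size=30)
returns_b = common + rng.normal(0.0, 0.010, size=30)


def zscore(x: np.ndarray) -> np.ndarray:
    """Zero mean, unit variance within the window."""
    return (x - x.mean()) / x.std()


emb_a, emb_b = zscore(returns_a), zscore(returns_b)
cosine = float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
pearson = float(np.corrcoef(returns_a, returns_b)[0, 1])

print(f"cosine of embeddings: {cosine:.6f}")
print(f"Pearson of returns:   {pearson:.6f}")
```

The two numbers agree to floating-point precision. The score therefore measures correlation over the trailing window only, which is why it can differ sharply from a correlation estimated over a long history.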
Interpreting similarity scores requires calibration:
| Similarity range | Interpretation |
|---|---|
| 0.95 – 1.00 | Same ticker or near-identical momentum (possible arbitrage or ETF exposure) |
| 0.85 – 0.95 | Strong structural similarity — shared sector, correlated demand drivers, or co-integrated mean reversion |
| 0.70 – 0.85 | Moderate similarity — partial exposure to common factor, but idiosyncratic risk dominates |
| 0.50 – 0.70 | Weak similarity — the two stocks share a broad market beta but diverge on sector and idiosyncratic dimensions |
| Below 0.50 | Uncorrelated or anti-correlated — suitable for diversification |
A useful extension is to compute time-varying similarity: rather than a single 30-day window, compute rolling 30-day embeddings on a daily basis and track how each stock's similarity to NVDA evolves over time. Stocks that drift below 0.70 similarity are candidates for removal from a momentum cluster; stocks that spike above 0.85 on a sudden move may indicate a structural break in the sector's correlation regime.
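A minimal sketch of time-varying similarity on synthetic data. The function `rolling_similarity`, the simulated return series, and the decoupling scenario are assumptions made for illustration; mean-centering each window is sufficient here because cosine similarity ignores scale.

```python
import numpy as np


def rolling_similarity(returns_a: np.ndarray, returns_b: np.ndarray,
                       window: int = 30) -> np.ndarray:
    """Cosine similarity of mean-centered windows, one value per window end."""
    n = len(returns_a)
    out = np.full(n - window + 1, np.nan)
    for end in range(window, n + 1):
        a = returns_a[end - window:end]
        b = returns_b[end - window:end]
        a = a - a.mean()
        b = b - b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0:
            out[end - window] = a @ b / denom
    return out


rng = np.random.default_rng(seed=4)
factor = rng.normal(0.0, 0.015, size=120)
stock_a = factor + rng.normal(0.0, 0.003, size=120)
# stock_b tracks the shared factor for 60 days, then decouples from it.
stock_b = np.concatenate([factor[:60], np.zeros(60)]) + rng.normal(0.0, 0.003, size=120)

sims = rolling_similarity(stock_a, stock_b)
drifted = np.flatnonzero(sims < 0.70)  # windows below the 0.70 threshold
print(f"first window: {sims[0]:.2f}, last window: {sims[-1]:.2f}")
print(f"first window below 0.70: {drifted[0] if drifted.size else None}")
```

Tracking the series rather than a single value shows when a relationship breaks down, not only whether it currently holds.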
7. Deployment Considerations: Incremental Updates and Index Maintenance
The pipeline above builds the index from scratch on every run. For live trading systems, three architectural refinements are necessary.
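Incremental index updates: FAISS supports add to append new vectors without rebuilding. Each night, after fetching the latest kline for all 8,000 tickers, recompute embeddings for tickers whose closing prices have changed, remove their stale vectors with index.remove_ids, and call index.add with the updated vectors. Appending alone would leave the old vector in the index as a duplicate. Also use index.remove_ids for tickers that have been delisted or suspended. Because a flat index renumbers its remaining vectors after a removal, the ticker registry must be updated in the same step.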
Temporal decay weighting: Recent price patterns are often more informative than patterns from 25 days ago. A simple enhancement is to apply decay weights to the embedding vector: multiply the most recent day's return by 1.0, the previous day's by 0.98, and so on, decaying at 2% per day. The same weights must be applied to both the indexed and the query embeddings before L2 normalization. This biases the similarity score toward recent momentum while retaining the medium-term pattern.
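The decay weighting can be sketched as follows, assuming geometric decay (each day's weight is 0.98 times the next more recent day's). The function name `decay_weighted` is hypothetical, and the all-ones vector stands in for a real z-scored window so that the output shows the weights themselves.

```python
import numpy as np


def decay_weighted(embedding: np.ndarray, rate: float = 0.02) -> np.ndarray:
    """Down-weight older days geometrically.

    The last element of the embedding is the most recent day (weight 1.0).
    """
    age = np.arange(len(embedding) - 1, -1, -1)  # 0 for the most recent day
    weights = (1.0 - rate) ** age
    return embedding * weights


embedding = np.ones(30)  # placeholder for a z-scored 30-day window
weighted = decay_weighted(embedding)
print(weighted[-1], round(weighted[-2], 4), round(weighted[0], 4))
```

At a 2% daily rate the oldest day in a 30-day window keeps a weight of roughly 0.56, so it still contributes but no longer dominates.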
Multi-window corpus: Instead of a single 30-day index, maintain three parallel indexes at 10-day, 30-day, and 90-day windows. A stock may show high 10-day similarity (short-term momentum) but low 90-day similarity (long-term mean reversion). Cross-window analysis surfaces stocks that are in a transient momentum phase versus stocks with structurally similar long-term behavior.
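The cross-window comparison can be illustrated for a single pair of stocks. In a deployment each window length would have its own index; this sketch only computes the pairwise score at each length, and the function `window_similarity` and the synthetic series are assumptions for illustration.

```python
import numpy as np


def window_similarity(returns_a: np.ndarray, returns_b: np.ndarray,
                      window: int) -> float:
    """Cosine similarity of the trailing `window` mean-centered returns."""
    a = np.asarray(returns_a[-window:], dtype=float)
    b = np.asarray(returns_b[-window:], dtype=float)
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(seed=5)
stock_a = rng.normal(0.0, 0.015, size=90)
# stock_b is unrelated for 80 days, then shadows stock_a for the last 10.
stock_b = rng.normal(0.0, 0.015, size=90)
stock_b[-10:] = stock_a[-10:] + rng.normal(0.0, 0.001, size=10)

profile = {w: window_similarity(stock_a, stock_b, w) for w in (10, 30, 90)}
for w, score in profile.items():
    print(f"{w:>2}-day similarity: {score:.2f}")
```

A high 10-day score alongside a low 90-day score is the signature of a transient momentum phase rather than a structural relationship.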
For institutional deployments, the FAISS index can be served as a gRPC microservice with sub-millisecond query latency. The index itself is memory-resident — an 8,000 × 30 float32 index consumes approximately 960 KB of RAM, making it trivially embeddable in a low-latency trading process without network round-trips.
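The memory figure follows from simple arithmetic, shown here as a small helper. The function name `flat_index_bytes` is an assumption, and the estimate covers raw vector storage only, ignoring the index's small fixed overhead.

```python
def flat_index_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage of a flat float32 index, in bytes."""
    return n_vectors * dim * bytes_per_value


print(flat_index_bytes(8_000, 30))    # 960000 bytes, about 960 KB
print(flat_index_bytes(500_000, 30))  # 60000000 bytes, about 60 MB
```

Even a 500,000-instrument corpus at this dimensionality fits comfortably in memory, so the case for an approximate index at that scale rests on query latency rather than storage.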
8. Supply Chain and Ticker Reference
For readers who want to extend this analysis to the full semiconductor ecosystem, the following tickers cover the primary verticals: GPU/compute, fabless design, foundry manufacturing, equipment suppliers, and packaging/assembly.
| Company | Ticker | Role in AI Supply Chain |
|---|---|---|
| NVIDIA | NVDA.US | GPU compute architecture — market leader in AI training and inference accelerators |
| AMD | AMD.US | GPU compute — competitor to NVDA in data center GPUs |
| Intel | INTC.US | CPU compute — AI inference on Xeon processors, discrete GPU development |
| Taiwan Semiconductor | TSM.US | Foundry — primary manufacturer of NVDA's and AMD's leading-edge chips, including advanced packaging |
| ASML | ASML.US | Equipment — EUV lithography monopoly required for sub-7nm chip production |
| Applied Materials | AMAT.US | Equipment — deposition and etch equipment for advanced node fabrication |
| Lam Research | LRCX.US | Equipment — wafer processing equipment, directly tied to foundry capex cycles |
| Micron | MU.US | Memory — HBM DRAM supply for GPU memory stacks |
| Qualcomm | QCOM.US | Edge AI — SoC processors for mobile and automotive inference |
| Broadcom | AVGO.US | Custom ASICs — AI accelerator chips for hyperscalers |
| Marvell | MRVL.US | Custom ASICs — datacenter interconnect and custom AI accelerators |
| ON Semiconductor | ON.US | Power management — critical for GPU server power delivery systems |
Closing
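The embedding approach reframes stock similarity as retrieval over the shape of recent returns rather than their magnitude. Because each window is z-scored and then L2-normalized, the cosine similarity between two embeddings equals the Pearson correlation of the two return series over that same window. What changes is the workflow: two stocks can have a near-zero Pearson correlation over a multi-year history and still score 0.90 on the trailing 30-day window if their recent pattern of up-days and down-days matches closely, and a nearest-neighbor index surfaces those matches without maintaining a full pairwise matrix. This captures short-horizon momentum and volatility structure that a long-run correlation matrix averages away.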
The architecture is production-ready: FAISS provides sub-millisecond retrieval at scale, the TickDB API delivers clean OHLCV data with full authentication, and the embedding function is deterministic and auditable. The pipeline extends naturally from 16 equities to 8,000 with no architectural changes, and the index rebuilds incrementally as new data arrives.
Next Steps
If you are a quant researcher: Try extending the embedding function to include volume-based features — normalized volume spikes often precede or confirm price momentum shifts and can improve similarity discrimination in sector clusters.
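One way to sketch that extension is to concatenate z-scored returns with z-scored log-volume changes, doubling the embedding dimension. The function `price_volume_embedding`, the choice of log-volume changes as the volume feature, and the synthetic data are assumptions for illustration, not a tested recipe.

```python
import numpy as np


def zscore(x: np.ndarray) -> np.ndarray:
    """Zero mean, unit variance; a constant input maps to zeros."""
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)


def price_volume_embedding(closes: np.ndarray, volumes: np.ndarray,
                           window: int = 30) -> np.ndarray:
    """Z-scored returns followed by z-scored log-volume changes (2 * window dims).

    Assumes at least window + 1 aligned rows and strictly positive volumes.
    """
    closes = np.asarray(closes[-(window + 1):], dtype=float)
    volumes = np.asarray(volumes[-(window + 1):], dtype=float)
    returns = np.diff(closes) / closes[:-1]
    volume_changes = np.diff(np.log(volumes))
    return np.concatenate([zscore(returns), zscore(volume_changes)])


rng = np.random.default_rng(seed=6)
closes = 100.0 * np.cumprod(1.0 + rng.normal(0.0, 0.02, size=40))
volumes = rng.uniform(1e6, 5e6, size=40)

embedding = price_volume_embedding(closes, volumes)
print(embedding.shape)
```

An index built on these vectors would need embedding_dim set to twice the window. Because both halves are z-scored separately, price and volume contribute equally to the similarity score unless one half is reweighted.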
If you want to run this pipeline yourself:
- Sign up at tickdb.ai for a free API key (no credit card required)
- Set the TICKDB_API_KEY environment variable
- Clone the code from this article and replace the ticker list with your own universe
If you need 10+ years of historical OHLCV data for strategy backtesting across the full US equity universe, reach out to enterprise@tickdb.ai for institutional plans with extended data retention and higher rate limits.
This article does not constitute investment advice. Markets involve risk; past pattern similarity does not guarantee future co-movement. Always validate similarity-based signals with fundamental analysis and appropriate risk controls before live deployment.