Two complementary scrapers feed Reddit sentiment data into reddit.duckdb. The poller runs every 5 minutes via Reddit's JSON API for near-realtime signal. The scraper runs 4x daily using Playwright + a ProtonVPN IP fleet for deep content extraction. Both write to the same DuckDB file consumed by the VIX Signal Worker.
```
Reddit JSON API ──► Poller (every 5min, httpx) ──► LLM Cull ──► reddit.duckdb
                                                                      │
Reddit HTML ──────► Scraper (4x/day, Playwright) ──► LLM Cull ────────┘
                        │
                        ├─ ProtonVPN (10 IPs, rotating)
                        └─ #reddit-pulse (8pm Slack digest)
```
| Component | Method | Frequency | Auth |
|---|---|---|---|
| Poller | Reddit JSON API (`/r/sub/.json`), httpx | Every 5min (market hours) | Reddit app credentials |
| Scraper | Playwright browser automation | 4x/day (weekdays) | ProtonVPN 10-IP pool |
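The poller's fetch step can be sketched roughly as below. The doc only specifies the `/r/sub/.json` endpoint and httpx; the `new` sort, `limit` parameter, User-Agent string, and function names here are illustrative assumptions.

```python
USER_AGENT = "reddit-pulse-poller/0.1"  # placeholder UA string


def listing_url(subreddit: str, sort: str = "new", limit: int = 100) -> str:
    """Build the public JSON listing URL for a subreddit."""
    return f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"


def fetch_posts(subreddit: str) -> list[dict]:
    """Fetch one page of posts and flatten the listing envelope."""
    import httpx  # imported lazily so the URL helper stays dependency-free

    resp = httpx.get(listing_url(subreddit), headers={"User-Agent": USER_AGENT})
    resp.raise_for_status()
    return [child["data"] for child in resp.json()["data"]["children"]]
```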
43 subreddits across 7 categories:

- wallstreetbets, stocks, investing, StockMarket, Superstonk, options, thetagang, SecurityAnalysis, ValueInvesting, Daytrading, pennystocks, RobinHood, personalfinance, financialindependence, dividends
- options, thetagang, Optionstrading, Futures_Trading
- technology, energy, realestate, healthcare
- MachineLearning, artificial, LocalLLaMA, OpenAI, Anthropic, nvidia, AICompanions, ChatGPT, StableDiffusion, singularity
- Economics, economy, worldnews, geopolitics
- Bitcoin, ethereum, CryptoCurrency
- news, politics, business, Finance, Economics
Posts below the threshold are discarded before LLM processing to reduce noise.
| Subreddit | Min Upvotes |
|---|---|
| wallstreetbets | 50 |
| stocks | 20 |
| investing | 20 |
| options | 15 |
| thetagang | 10 |
| SecurityAnalysis | 5 |
| quant | 3 |
| All others | 10 (default) |
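The threshold logic above amounts to a per-subreddit lookup with a default floor; a minimal sketch (the map mirrors the table, function and constant names are illustrative):

```python
# Per-subreddit minimum upvote counts, mirroring the table above
MIN_UPVOTES = {
    "wallstreetbets": 50,
    "stocks": 20,
    "investing": 20,
    "options": 15,
    "thetagang": 10,
    "SecurityAnalysis": 5,
    "quant": 3,
}
DEFAULT_MIN_UPVOTES = 10  # "All others"


def passes_threshold(post: dict) -> bool:
    """Return True if the post clears its subreddit's minimum upvote count."""
    floor = MIN_UPVOTES.get(post["subreddit"], DEFAULT_MIN_UPVOTES)
    return post["upvotes"] >= floor


posts = [
    {"subreddit": "wallstreetbets", "upvotes": 30},  # below the 50 floor
    {"subreddit": "quant", "upvotes": 4},            # clears the 3 floor
]
kept = [p for p in posts if passes_threshold(p)]
```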
After collection, both the poller and scraper pass posts through an LLM cull step served by the qwen3 model alias (which resolves to llama4:scout, 41 tok/s):
```python
# Cull prompt (simplified)
"""
Given this Reddit post, is it relevant to equity/volatility/macro trading signals?
Return JSON: {"relevant": true/false, "tickers": ["AAPL", ...], "sentiment": "bullish|bearish|neutral"}
"""
```
Irrelevant posts are dropped. Relevant posts are stored with ticker tags and sentiment label.
All output lands in reddit.duckdb:
```sql
CREATE TABLE posts (
    id TEXT PRIMARY KEY,
    subreddit TEXT,
    title TEXT,
    body TEXT,
    upvotes INTEGER,
    scraped_at TIMESTAMP,
    source TEXT,          -- 'poller' or 'scraper'
    relevant BOOL,
    tickers TEXT[],       -- extracted ticker symbols
    vader_score FLOAT,    -- VADER compound sentiment [-1, 1]
    llm_sentiment TEXT    -- 'bullish' | 'bearish' | 'neutral'
);
```
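Because both writers share this table and `id` is the primary key, a post seen by both the poller and the scraper must collapse to a single row. A sketch of that dedup, assuming scraper rows win (they carry the deeper content extraction; the precedence rule and function name are assumptions, not the actual write path):

```python
# Higher number wins when the same post id arrives from both writers
PRECEDENCE = {"scraper": 2, "poller": 1}


def merge_rows(rows: list[dict]) -> list[dict]:
    """Keep one row per post id, preferring the higher-precedence source."""
    best: dict[str, dict] = {}
    for row in rows:
        cur = best.get(row["id"])
        if cur is None or PRECEDENCE[row["source"]] > PRECEDENCE[cur["source"]]:
            best[row["id"]] = row
    return list(best.values())
```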
The scraper's 4th daily run (8pm) triggers the Reddit Pulse report, posted as a digest to the #reddit-pulse Slack thread.

The Playwright scraper rotates through 10 ProtonVPN IPs to avoid Reddit rate limits on deep scraping.
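The rotation amounts to a round-robin walk over the 10-IP pool. A sketch under stated assumptions (the addresses and helper names are placeholders, not the actual fleet):

```python
from itertools import cycle

# Hypothetical exit-IP pool; the real fleet's addresses are not documented here
VPN_POOL = [f"10.8.0.{i}" for i in range(2, 12)]  # 10 exit IPs
_pool = cycle(VPN_POOL)


def next_exit_ip() -> str:
    """Return the next exit IP in round-robin order, wrapping after 10."""
    return next(_pool)


# One scrape pass: each subreddit batch goes out through the next IP in turn
assignments = [next_exit_ip() for _ in range(12)]
```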
The reddit_flow module from stock_automation v0.2.0 is the interface between this pipeline and the VIX Signal Worker:
```python
from stock_automation.reddit_flow import get_sentiment_context

# Called at the start of each VIX signal run
context = get_sentiment_context(
    db_path="/mnt/nfs/reddit.duckdb",
    lookback_hours=2,
    tickers=["VIX", "UVXY", "SPY", "SPX"],
)
# Returns: ticker mentions, VADER scores, top posts, sentiment summary
```
The Sentiment Agent receives this context as part of its input payload. See VIX Signal Pipeline for the full agent architecture.
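One piece of that payload, the sentiment summary, can be derived from the per-post VADER compound scores. A hypothetical reduction (the ±0.05 cutoffs and function name are assumptions, not the actual `reddit_flow` logic):

```python
def summarize(vader_scores: list[float],
              bull_cut: float = 0.05,
              bear_cut: float = -0.05) -> str:
    """Collapse per-post VADER compound scores ([-1, 1]) to a single label."""
    if not vader_scores:
        return "neutral"  # no relevant posts in the lookback window
    mean = sum(vader_scores) / len(vader_scores)
    if mean >= bull_cut:
        return "bullish"
    if mean <= bear_cut:
        return "bearish"
    return "neutral"
```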