Training an NLP Model for Reddit Analysis
Reddit differs from other social platforms: it hosts long discussions, high-quality analytical posts, and Due Diligence (DD) project reviews. Where Twitter captures instant reactions, Reddit reflects more considered, longer-term community opinion.
Key Subreddits
- r/CryptoCurrency (6M+): general discussions, news, sentiment
- r/Bitcoin (5M+): BTC-oriented community
- r/ethfinance: high-quality ETH discussions
- r/defi: DeFi-oriented content
- r/CryptoMoonShots: speculative altcoin posts (high noise)
- r/Buttcoin: skeptics/critics (reverse sentiment indicator)
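One way to encode these roles is a small lookup table that the sentiment pipeline consults per subreddit. The sketch below is illustrative only: `SUBREDDIT_PROFILES`, `adjust_sentiment`, and every numeric weight are assumptions introduced here, not measured values. It flips the sign for r/Buttcoin (reverse indicator) and down-weights r/CryptoMoonShots (high noise).

```python
# Hypothetical per-subreddit settings; the weights are illustrative assumptions.
SUBREDDIT_PROFILES = {
    "CryptoCurrency":  {"polarity": 1,  "weight": 1.0},
    "Bitcoin":         {"polarity": 1,  "weight": 1.0},
    "ethfinance":      {"polarity": 1,  "weight": 1.0},
    "defi":            {"polarity": 1,  "weight": 1.0},
    "CryptoMoonShots": {"polarity": 1,  "weight": 0.3},  # high noise
    "Buttcoin":        {"polarity": -1, "weight": 1.0},  # reverse indicator
}

DEFAULT_PROFILE = {"polarity": 1, "weight": 1.0}


def adjust_sentiment(subreddit, score):
    """Return (adjusted_score, weight) for a raw sentiment score."""
    profile = SUBREDDIT_PROFILES.get(subreddit, DEFAULT_PROFILE)
    return profile["polarity"] * score, profile["weight"]
```

The returned weight can then be used when averaging scores across subreddits, so that noisy communities contribute less to the aggregate.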
Reddit API (PRAW)
import asyncpraw  # async version of PRAW
from datetime import datetime, timezone


class RedditCryptoCollector:
    def __init__(self, client_id, client_secret, user_agent):
        self.reddit = asyncpraw.Reddit(
            client_id=client_id,
            client_secret=client_secret,
            user_agent=user_agent
        )

    async def collect_subreddit_posts(self, subreddit_name, limit=100,
                                      sort='new', time_filter='day'):
        subreddit = await self.reddit.subreddit(subreddit_name)
        # Pick the listing that matches `sort`; only `top` takes a time filter
        if sort == 'top':
            listing = subreddit.top(time_filter=time_filter, limit=limit)
        elif sort == 'hot':
            listing = subreddit.hot(limit=limit)
        else:
            listing = subreddit.new(limit=limit)
        posts = []
        async for post in listing:
            posts.append({
                'id': post.id,
                'title': post.title,
                'text': post.selftext,
                'score': post.score,
                'upvote_ratio': post.upvote_ratio,
                'num_comments': post.num_comments,
                # created_utc is a Unix timestamp in UTC, so keep it timezone-aware
                'created_utc': datetime.fromtimestamp(post.created_utc,
                                                      tz=timezone.utc),
                'author': str(post.author),
                'subreddit': subreddit_name,
                'flair': post.link_flair_text  # may be None
            })
        return posts

    async def collect_comments(self, post_id, limit=50):
        """Collect the top comments on a post"""
        submission = await self.reddit.submission(id=post_id)
        # Expand at most three "load more comments" placeholders
        await submission.comments.replace_more(limit=3)
        comments = []
        for comment in submission.comments.list()[:limit]:
            # Skip unexpanded placeholders and very short comments
            if hasattr(comment, 'body') and len(comment.body) > 20:
                comments.append({
                    'body': comment.body,
                    'score': comment.score,
                    'created_utc': datetime.fromtimestamp(comment.created_utc,
                                                          tz=timezone.utc)
                })
        return comments
Reddit Content Specifics
Reddit posts are significantly longer than tweets; DD posts can exceed 2000 words. This calls for:
- Chunk-based processing: split the long text into chunks, classify each chunk, then aggregate the scores.
import numpy as np


def analyze_long_post(text, analyzer, chunk_size=512, overlap=50):
    # Chunks are measured in whitespace-separated words, not model tokens
    tokens = text.split()
    if not tokens:
        return 0.0  # assumption: treat an empty post as neutral
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk = ' '.join(tokens[i:i + chunk_size])
        chunks.append(chunk)
    chunk_scores = [analyzer.analyze(chunk)['score'] for chunk in chunks]
    # Weight: beginning and end more important
    weights = np.ones(len(chunk_scores))
    if len(weights) > 2:
        weights[0] = 1.5   # header/beginning
        weights[-1] = 1.3  # conclusion
    return np.average(chunk_scores, weights=weights)
- Title vs body weighting: the post title is often more informative than the body, so weight title to body 2:1.
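The 2:1 title-to-body weighting can be sketched as a weighted mean of two sentiment scores. This is a minimal illustration: the function name `combine_title_body` and the handling of posts without body text are assumptions, not part of the original pipeline.

```python
def combine_title_body(title_score, body_score,
                       title_weight=2.0, body_weight=1.0):
    """Weighted mean of title and body sentiment (2:1 by default)."""
    if body_score is None:
        # Assumption: link posts without body text fall back to the title alone
        return title_score
    total = title_weight + body_weight
    return (title_weight * title_score + body_weight * body_score) / total
```

Here `body_score` would typically be the aggregated chunk score of the post body, and `title_score` the sentiment of the title analyzed on its own.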
Reddit-specific Signals
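Upvote ratio: above 0.85 indicates broad agreement; below 0.50 indicates a controversial post.
Comment velocity: a sharp rise in the comment rate signals a post going viral.
Hot algorithm: in the ranking code Reddit formerly published as open source, hot score = sign(s) * log10(max(|s|, 1)) + (t - 1134028003) / 45000, where s = upvotes - downvotes and t is the submission time in epoch seconds. Newer posts therefore need fewer votes to rank. High score = trending content.
Awards: posts with Gold/Platinum awards have received significant engagement.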
import numpy as np


def calculate_reddit_engagement_score(post):
    score = max(post['score'], 0)  # guard: log1p is not finite for values <= -1
    ratio = post['upvote_ratio']
    comments = post['num_comments']
    # Log-damped engagement, so very large posts do not dominate linearly
    engagement = (
        np.log1p(score) * ratio +
        np.log1p(comments) * 0.5
    )
    return engagement
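The upvote-ratio thresholds and the comment-velocity signal described above can be sketched with the standard library alone. The function names, the "mixed" label for the middle band, and the defaults `baseline_hours`, `spike_factor`, and `min_comments` are assumptions chosen for illustration and would need tuning on real data.

```python
from datetime import datetime, timedelta, timezone


def classify_upvote_ratio(ratio):
    """Map an upvote ratio to the bands described in the text."""
    if ratio > 0.85:
        return "consensus"
    if ratio < 0.50:
        return "controversial"
    return "mixed"  # assumption: label for the middle band


def comment_velocity(comment_times, now, window_hours=1.0):
    """Comments per hour within the `window_hours` ending at `now`."""
    cutoff = now - timedelta(hours=window_hours)
    recent = sum(1 for t in comment_times if cutoff < t <= now)
    return recent / window_hours


def is_velocity_spike(comment_times, now, window_hours=1.0,
                      baseline_hours=6.0, spike_factor=3.0, min_comments=5):
    """True if the recent comment rate clearly exceeds the preceding baseline."""
    current = comment_velocity(comment_times, now, window_hours)
    if current * window_hours < min_comments:
        return False  # too few comments to call it a spike
    baseline_end = now - timedelta(hours=window_hours)
    baseline = comment_velocity(comment_times, baseline_end, baseline_hours)
    return current > spike_factor * baseline
```

The timestamps can be the `created_utc` values returned by `collect_comments`, provided they are timezone-aware and compared against a timezone-aware `now`.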
Due Diligence (DD) Analysis
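DD posts are the most valuable source on Reddit: they contain deep project analysis, often ahead of mainstream media coverage.
DD detection: flag posts that carry a DD-style flair, contain typical keywords ("tokenomics", "whitepaper", "team analysis", "red flag"), or are long. At least two of the three indicators must match: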
def is_dd_post(post):
    flair = (post.get('flair') or '').lower()  # flair is None when a post has none
    text = post['text'].lower()
    dd_indicators = [
        flair in ['dd', 'analysis', 'research'],
        any(kw in text for kw in
            ['tokenomics', 'whitepaper', 'team analysis', 'red flag',
             'due diligence', 'fundamentals', 'on-chain data']),
        len(text.split()) > 500  # long post
    ]
    # Require at least two of the three indicators
    return sum(dd_indicators) >= 2
Long-term Sentiment Model
Reddit sentiment reacts more slowly than Twitter sentiment (half-life of roughly 24-72 hours versus roughly 1-4 hours). For long-term signals, a 7-day rolling average works better.
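Both ideas, half-life decay and a 7-day rolling average, can be sketched in plain Python. The function names are hypothetical, and the 48-hour default is simply a value inside the 24-72 hour range quoted above, not a calibrated parameter.

```python
from datetime import datetime, timedelta, timezone


def decayed_sentiment(observations, now, half_life_hours=48.0):
    """Exponentially time-weighted mean of (timestamp, score) pairs."""
    num = 0.0
    den = 0.0
    for ts, score in observations:
        age_hours = (now - ts).total_seconds() / 3600.0
        if age_hours < 0:
            continue  # ignore observations dated after `now`
        # The weight halves every `half_life_hours`
        weight = 0.5 ** (age_hours / half_life_hours)
        num += weight * score
        den += weight
    return num / den if den else 0.0


def rolling_mean_7d(daily_scores):
    """Trailing 7-day mean over daily scores ordered oldest first."""
    out = []
    for i in range(len(daily_scores)):
        window = daily_scores[max(0, i - 6): i + 1]
        out.append(sum(window) / len(window))
    return out
```

During the first six days the rolling mean uses whatever shorter window is available; a stricter variant could return no value until seven days have accumulated.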
Monthly roundup analysis: r/CryptoCurrency publishes monthly roundup posts. Their top comments surface the most-discussed topics, which is a quality signal for macro positioning.
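A simple way to extract the most-discussed topics from roundup comments is to count ticker mentions weighted by comment score. This is a rough sketch under stated assumptions: `top_discussed_symbols` is a hypothetical helper, it only matches uppercase tickers (to avoid confusing the word "link" with LINK), and the comment dictionaries are assumed to have the `body` and `score` keys produced by `collect_comments`.

```python
import re
from collections import Counter


def top_discussed_symbols(comments, known_symbols, k=5):
    """Rank known ticker symbols by score-weighted mentions in comments."""
    known = {s.upper() for s in known_symbols}
    counts = Counter()
    for comment in comments:
        # Standalone uppercase words; "$ETH" also matches as "ETH"
        found = set(re.findall(r"\b[A-Z]{2,10}\b", comment["body"]))
        # Each symbol counts once per comment; downvoted comments count as 1
        weight = max(comment.get("score", 1), 1)
        for symbol in found & known:
            counts[symbol] += weight
    return counts.most_common(k)
```

Project names written in full ("Ethereum", "Solana") are not matched here; mapping names to symbols would require an additional lookup table.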
Token Mention Monitoring
import numpy as np


# get_token_name, search_subreddit and analyzer are assumed to be defined elsewhere
async def monitor_token_mentions(token_symbol, subreddits, lookback_hours=24):
    search_terms = [token_symbol, get_token_name(token_symbol)]
    mentions = []
    for subreddit in subreddits:
        posts = await search_subreddit(subreddit, ' OR '.join(search_terms),
                                       lookback_hours)
        for post in posts:
            # Score the title plus the first 200 characters of the body
            sentiment = analyzer.analyze(post['title'] + ' ' + post['text'][:200])
            mentions.append({
                'token': token_symbol,
                'sentiment': sentiment['score'],
                'engagement': calculate_reddit_engagement_score(post),
                'subreddit': subreddit
            })
    if not mentions:
        return 0, 0
    weights = [m['engagement'] for m in mentions]
    if sum(weights) == 0:
        weights = None  # all-zero engagement: fall back to a plain mean
    # Engagement-weighted average sentiment across all mentions
    avg_sentiment = np.average([m['sentiment'] for m in mentions],
                               weights=weights)
    return avg_sentiment, len(mentions)
Developing a Reddit analysis system involves PRAW-based collection, chunk-based NLP for long posts, DD detection, token mention monitoring, and long-term sentiment aggregation.







