Hallucination Detection in Language Model Responses
An LLM confidently writes "Drug X was approved by the FDA in 2021", yet the drug doesn't exist. Or a RAG system cites a document with an exact page number that doesn't exist in the source. Hallucinations are not a rare bug but a systemic property of autoregressive models: the next token is sampled from a probability distribution, not retrieved from a fact base. In business-critical applications, this is unacceptable.
Where Hallucinations Come From and Why They're Hard to Catch
The problem is not that the model "doesn't know": GPT-4, Claude, Llama, and similar models lack an internal verification mechanism. The model doesn't know that it doesn't know. Confidence in answers (a score derived from logprobs) correlates only weakly with actual accuracy: a hallucinated fact can be generated with logprobs close to 0, i.e., near-certain token probabilities.
Three main sources of hallucinations in production:
Mismatch between retrieval and generation. In a RAG pipeline, the retriever returns the top-5 chunks by cosine similarity, but they don't contain the answer. The model generates anyway, filling the gap with pretraining patterns. Typical culprits: chunk_size=512 without overlap, FAISS with the L2 metric instead of cosine, a weak embedding model (all-MiniLM-L6-v2 instead of text-embedding-3-large or E5-mistral-7b).
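The L2-vs-cosine mismatch has a cheap fix: on unit-normalized vectors, L2 distance is a monotonic function of cosine similarity (||a−b||² = 2 − 2·cos), so normalizing embeddings before indexing makes an L2 index rank exactly like cosine. A stdlib-only sketch with toy 2-d vectors (no real embedding model or FAISS index involved):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def rank_by_l2(query, docs):
    """Indices sorted by ascending L2 distance (how an L2 index ranks)."""
    return sorted(range(len(docs)), key=lambda i: math.dist(query, docs[i]))

def rank_by_cosine(query, docs):
    """Indices sorted by descending cosine similarity."""
    return sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))

def normalize(v):
    n = math.hypot(*v)
    return [x / n for x in v]

# Toy "embeddings": doc 0 points the same way as the query but is long;
# doc 1 points elsewhere but is short.
query = [1.0, 0.0]
docs = [[10.0, 1.0], [0.5, 0.5]]

raw_l2 = rank_by_l2(query, docs)       # [1, 0] — magnitude wins, wrong doc first
by_cos = rank_by_cosine(query, docs)   # [0, 1] — direction wins
norm_l2 = rank_by_l2(normalize(query), [normalize(d) for d in docs])  # [0, 1]
```

After normalization the L2 ranking matches the cosine ranking, so switching the metric doesn't require rebuilding the pipeline around a different index type.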
Temporal drift. The model is trained on data up to a cutoff date. Queries about current events, changed regulations, or new products inevitably produce hallucinations without up-to-date context.
Instruction-following vs. factuality trade-off. During RLHF fine-tuning, models learn to be "helpful" and to give an answer even when the data is insufficient. This directly encourages hallucination under uncertainty.
How to Build a Detection System
Hallucination detection cannot be solved with a single method. In practice, we apply a multi-level architecture:
Level 1 — Self-Consistency Check
Generate N answers to the same question with temperature > 0 (usually N=5–10, temperature=0.7). Compare the answers semantically using sentence-transformers (paraphrase-multilingual-mpnet-base-v2). High variability signals unreliable facts. Metric: an average pairwise cosine similarity < 0.75 indicates instability.
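The scoring half of this check is just an average over pairwise similarities. A minimal sketch of that logic; in production the embeddings would come from `model.encode(answers)` with the sentence-transformers model named above, replaced here by stub vectors so the code is self-contained:

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def self_consistency(embeddings, threshold=0.75):
    """Average pairwise cosine similarity across N sampled answers.
    Below the threshold, the answer set is flagged as unstable."""
    pairs = list(combinations(embeddings, 2))
    score = sum(cosine(a, b) for a, b in pairs) / len(pairs)
    return score, score >= threshold

# Stub embeddings standing in for 3 sampled answers each:
consistent = [[1.0, 0.0], [0.99, 0.14], [0.98, 0.2]]   # near-identical answers
divergent = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]       # contradictory answers

score_c, ok_c = self_consistency(consistent)  # ~0.99 -> stable
score_d, ok_d = self_consistency(divergent)   # ~0.47 -> flagged
```

The threshold of 0.75 is the instability cutoff from the text; it should be calibrated per domain, since paraphrase variance differs between terse factual answers and long explanations.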
Level 2 — Grounding Score
For RAG systems: verify whether each statement in the answer is supported by the retrieved chunks. Use an NLI model (cross-encoder/nli-deberta-v3-base) to evaluate entailment between the answer and the context. Statements with an entailment score < 0.6 are marked as unverified.
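The per-claim gating can be sketched as follows. `entailment_score` is a stand-in for a real cross-encoder call (e.g. a `CrossEncoder` loaded with the model above, scoring (context, claim) pairs); the stub here only illustrates the control flow:

```python
def ground_answer(claims, context, entailment_score, threshold=0.6):
    """Mark each claim as grounded or unverified based on an NLI
    entailment score against the retrieved context."""
    results = []
    for claim in claims:
        score = entailment_score(premise=context, hypothesis=claim)
        results.append({
            "claim": claim,
            "score": score,
            "status": "grounded" if score >= threshold else "unverified",
        })
    return results

# Stub scorer for illustration; a real system calls the NLI model here.
def fake_entailment_score(premise, hypothesis):
    return 0.9 if hypothesis.lower() in premise.lower() else 0.2

context = "Drug X was approved in 2019. It treats hypertension."
claims = ["drug x was approved in 2019", "Drug X was approved in 2021"]
report = ground_answer(claims, context, fake_entailment_score)
# report[0]["status"] == "grounded"; report[1]["status"] == "unverified"
```

Splitting the answer into claims is the hard part in practice; sentence splitting is a common baseline, with an LLM-based claim decomposition step for long compound sentences.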
Level 3 — Retrieval Faithfulness
RAGAS metrics: faithfulness, answer_relevancy, context_precision. Faithfulness < 0.7 combined with context_precision > 0.8 means the context was present but the model ignored it: a classic generative hallucination.
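RAGAS faithfulness is, in essence, the share of claims in the answer that are entailed by the retrieved context, so the "good context, unfaithful generation" pattern can be flagged with two numbers. A sketch of the classification rule (thresholds from the text; function names are ours, not the ragas API):

```python
def faithfulness(supported_claims, total_claims):
    """RAGAS-style faithfulness: share of answer claims entailed by context."""
    return supported_claims / total_claims if total_claims else 0.0

def classify(faith, ctx_precision, faith_thr=0.7, ctx_thr=0.8):
    """Separate generation failures from retrieval failures."""
    if faith < faith_thr and ctx_precision > ctx_thr:
        return "generative hallucination"  # context was there, model ignored it
    if faith < faith_thr:
        return "retrieval failure"         # context itself was inadequate
    return "ok"

print(classify(faithfulness(3, 6), 0.9))  # generative hallucination
print(classify(faithfulness(3, 6), 0.5))  # retrieval failure
print(classify(faithfulness(6, 6), 0.9))  # ok
```

The distinction matters operationally: the first case calls for prompt or decoding changes, the second for retriever and chunking work.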
Level 4 — External Fact-Checking
For critical domains (medicine, law, finance): verify through web search (Tavily, Bing Search API) or specialized knowledge bases (Wikidata SPARQL, PubMed API). Extract named entities (spaCy plus a custom model) and verify each one separately.
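The extract-then-verify loop looks roughly like this. Both pieces are stubs: the regex extractor stands in for spaCy NER, and `lookup` stands in for a real Wikidata/PubMed/search-API call, so only the structure of the check is shown:

```python
import re

def extract_entities(text):
    """Toy extractor: capitalized spans and years. In production this
    would be spaCy NER plus a custom domain model."""
    names = re.findall(r"\b[A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)*\b", text)
    years = re.findall(r"\b(?:19|20)\d{2}\b", text)
    return names + years

def verify_entities(entities, lookup):
    """Check each entity against an external source; `lookup` stands in
    for a Wikidata SPARQL / PubMed / search-API query."""
    return {e: lookup(e) for e in entities}

known = {"Aspirin", "2021"}  # pretend external knowledge base
text = "Aspirin was approved in 2021, unlike Wonderdrug."
report = verify_entities(extract_entities(text), lambda e: e in known)
# report: {"Aspirin": True, "Wonderdrug": False, "2021": True}
```

Entities that fail lookup ("Wonderdrug" here) are exactly the candidates to surface for human review rather than silently deliver.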
Practical Case
The client, a law firm, needed an internal case-law assistant. Model: GPT-4-turbo with RAG over 50k documents (pgvector + LangChain). Problem: 18% of answers contained references to non-existent cases or incorrect decision dates (identified by a manual audit of 200 queries).
Solution: a two-level check. At the retrieval level, the reranker cross-encoder/ms-marco-MiniLM-L-6-v2 raised context_precision from 0.61 to 0.84. At the generation level, NLI verification of each legal claim plus regex extraction of case numbers with subsequent lookup against an arbitration-decisions API. The hallucination rate dropped to 3.2% over two weeks of iteration.
Metrics for Assessing Detection Quality
| Metric | Tool | Target Value |
|---|---|---|
| Hallucination rate | Manual audit + NLI | < 5% for production |
| Faithfulness (RAGAS) | ragas library | > 0.80 |
| Grounding score | NLI (DeBERTa) | > 0.65 per claim |
| Self-consistency | sentence-transformers | cosine sim > 0.75 |
| Latency overhead | — | < 500ms for detection |
Implementation Process
Audit the current state: analyze the existing pipeline (retriever quality, chunking strategy, embedding model, prompts). Collect a dataset of 100–200 real queries with ground truth.
Baseline measurement: get the numbers (hallucination rate, faithfulness, latency). Without a baseline, it's unclear what to improve.
Multi-level detection: choose methods to match the domain. Medicine requires external verification; for internal company knowledge, a grounding score is enough.
Pipeline integration: the detector is embedded as middleware. Low-grounding answers are flagged with a warning or routed for human review.
Production monitoring: log all scores and build a Grafana dashboard. Metric drift is a signal to reindex or change the prompting strategy.
Timeline: from two weeks to add detection to an existing RAG pipeline, up to two months for a full verification system with external sources in a complex domain.
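The middleware integration step above reduces to a routing decision over the detector scores. A minimal sketch, with pass/warn thresholds taken from the metrics table, stricter review thresholds as illustrative assumptions, and all names hypothetical:

```python
from dataclasses import dataclass

@dataclass
class DetectionScores:
    grounding: float         # min per-claim NLI entailment
    self_consistency: float  # avg pairwise cosine across sampled answers
    faithfulness: float      # RAGAS-style claim-support ratio

def route(scores: DetectionScores) -> str:
    """Decide what happens to an answer before it reaches the user."""
    # Hard failures: likely hallucination, never auto-send.
    if scores.grounding < 0.4 or scores.faithfulness < 0.5:
        return "human_review"
    # Soft failures against the target values: deliver, but mark unverified.
    if (scores.grounding < 0.65 or scores.self_consistency < 0.75
            or scores.faithfulness < 0.8):
        return "warn"
    return "pass"

print(route(DetectionScores(0.9, 0.85, 0.9)))  # pass
print(route(DetectionScores(0.6, 0.85, 0.9)))  # warn
print(route(DetectionScores(0.3, 0.85, 0.9)))  # human_review
```

Keeping the routing as a single pure function makes it easy to log every decision and replay threshold changes against historical scores before deploying them.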