Hallucination Detection Implementation for AI System

Hallucination Detection in Language Model Responses

An LLM confidently writes "Drug X was approved by the FDA in 2021", yet the drug doesn't exist. Or a RAG system cites a document with an exact page number that isn't in the source. Hallucinations are not a rare bug but a systemic property of autoregressive models: the next token is sampled from a probability distribution, not looked up in a fact base. In business-critical applications, this is unacceptable.

Where Hallucinations Come From and Why They're Hard to Catch

The problem is not that the model "doesn't know": GPT-4, Claude, Llama and similar models lack an internal verification mechanism. The model doesn't know that it doesn't know. Confidence in answers (a confidence score derived from logprobs) correlates only weakly with actual accuracy: a hallucinated fact can come with a logprob close to 0, i.e. near-certain token probability.

Three main sources of hallucinations in production:

Mismatch between retrieval and generation. In a RAG pipeline, the retriever returns the top-5 chunks by cosine similarity, but they don't contain the answer. The model generates anyway, filling the gap with pretraining patterns. A typical misconfiguration: chunk_size=512 with no overlap, a FAISS index using the L2 metric on unnormalized vectors instead of cosine similarity, and a weak embedding model (all-MiniLM-L6-v2 where text-embedding-3-large or E5-mistral-7b would be appropriate).
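The L2-vs-cosine pitfall can be shown in a few lines. This is a toy illustration with hand-picked 2-D vectors (no FAISS, stdlib only): on unnormalized vectors, L2 distance and cosine similarity can rank the same candidates differently.

```python
import math

def l2(x, y):
    # Euclidean (L2) distance between two vectors.
    return math.dist(x, y)

def cos(x, y):
    # Cosine similarity: dot product over the product of norms.
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

query = (1.0, 1.0)
a = (2.0, 2.0)   # same direction as the query, larger magnitude
b = (1.0, 0.5)   # closer in raw L2, but pointing elsewhere

# a is the perfect match by cosine, yet b "wins" under raw L2.
```

Normalizing vectors before indexing (so that inner product equals cosine) removes the discrepancy, which is why retrieval setups typically normalize embeddings.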

Temporal drift. The model is trained on data up to a cutoff date. Queries about current events, changed regulations, or new products inevitably produce hallucinations unless current context is supplied.

Instruction-following vs. factuality trade-off. During RLHF fine-tuning, models learn to be "helpful" and give an answer even when the data is insufficient. This directly encourages hallucination under uncertainty.

How to Build a Detection System

Hallucination detection cannot be solved by one method. In practice, we apply a multi-level architecture:

Level 1 — Self-Consistency Check

Generate N answers to one question with temperature > 0 (usually N=5–10, temperature=0.7). Compare the answers semantically using sentence-transformers (paraphrase-multilingual-mpnet-base-v2). High variability signals unreliable facts. Metric: an average pairwise cosine similarity < 0.75 indicates instability.
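A minimal sketch of the consistency scoring step. In production the `embed` function would be `SentenceTransformer("paraphrase-multilingual-mpnet-base-v2").encode`; here it is a toy bag-of-words stand-in so the example runs standalone, and the sample answers are invented for illustration.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy stand-in for a sentence-embedding model: bag of words.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def self_consistency(answers: list[str], threshold: float = 0.75):
    """Mean pairwise cosine similarity over N sampled answers;
    below the threshold, the answer set is flagged as unstable."""
    vecs = [embed(a) for a in answers]
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    score = sum(cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)
    return score, score >= threshold

samples = [
    "Drug X was approved by the FDA in 2021.",
    "The FDA approved Drug X in 2019.",
    "There is no record of Drug X being approved.",
]
score, stable = self_consistency(samples)
```

The same N generations can be reused for answer selection (majority vote), so the sampling cost is not purely overhead.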

Level 2 — Grounding Score

For RAG systems: verify that each statement in the answer is supported by the retrieved chunks. Use an NLI model (cross-encoder/nli-deberta-v3-base) to evaluate entailment between the answer and the context. A statement with an entailment score < 0.6 is marked as unverified.
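A sketch of the claim-level grounding loop. The real entailment scorer would be `CrossEncoder("cross-encoder/nli-deberta-v3-base")`; here `nli_score` is a toy word-overlap stub (and the sample texts are invented) so the control flow is runnable offline.

```python
def nli_score(premise: str, hypothesis: str) -> float:
    # Toy stand-in for an NLI entailment score: the fraction of
    # hypothesis words that appear in the premise.
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / len(h) if h else 0.0

def ground_claims(answer: str, chunks: list[str], threshold: float = 0.6):
    """Split the answer into sentence-level claims and mark each one
    unverified unless some retrieved chunk entails it."""
    claims = [c.strip() for c in answer.split(".") if c.strip()]
    report = []
    for claim in claims:
        best = max(nli_score(chunk, claim) for chunk in chunks)
        report.append({"claim": claim, "score": best, "verified": best >= threshold})
    return report

chunks = ["The contract was signed in March 2020 by both parties."]
report = ground_claims(
    "The contract was signed in March 2020. It was later voided.", chunks
)
```

Splitting on periods is crude; in practice a claim-decomposition step (sentence splitter or an LLM-based claim extractor) feeds the NLI model.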

Level 3 — Retrieval Faithfulness

RAGAS metrics: faithfulness, answer_relevancy, context_precision. Faithfulness < 0.7 combined with context_precision > 0.8 means the context was present but the model ignored it: a classic generative hallucination.
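The triage rule above can be written as a small decision function over RAGAS-style scores in [0, 1]. The thresholds come from the text; the function name and labels are illustrative.

```python
def classify_failure(faithfulness: float, context_precision: float) -> str:
    """Route a low-faithfulness answer to the right fix."""
    if faithfulness >= 0.7:
        return "ok"
    if context_precision > 0.8:
        # Context was good but the model ignored it.
        return "generative_hallucination"
    # The retrieved context itself was weak: fix retrieval first.
    return "retrieval_failure"
```

Separating the two failure modes matters operationally: a retrieval failure is fixed by reindexing or reranking, a generative hallucination by prompting or decoding constraints.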

Level 4 — External Fact-Checking

For critical domains (medicine, law, finance): verify through web search (Tavily, Bing Search API) or specialized knowledge bases (Wikidata SPARQL, PubMed API). Extract named entities (spaCy plus a custom model) and verify each one separately.
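A sketch of the entity-level verification flow. A real pipeline would run spaCy NER and query Wikidata SPARQL or PubMed; here both the extractor and the knowledge base are toy stubs (including the invented drug name) so the flow runs offline.

```python
KNOWN_DRUGS = {"aspirin", "ibuprofen"}  # stand-in for an external KB

def extract_entities(text: str) -> list[str]:
    # Crude stand-in for spaCy NER: capitalized tokens past position 0.
    tokens = text.split()
    return [t.strip(".,") for t in tokens[1:] if t[0].isupper()]

def verify_entities(answer: str) -> dict[str, bool]:
    # Check each extracted entity against the knowledge base;
    # unknown entities are candidates for hallucinated facts.
    return {e: e.lower() in KNOWN_DRUGS for e in extract_entities(answer)}

result = verify_entities("Patients received Aspirin and Fakedrugol daily.")
```

An entity failing the lookup does not prove a hallucination (the KB may be incomplete), so unmatched entities are flagged for review rather than auto-rejected.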

Practical Case

The client is a law firm with an internal case-law assistant. Model: GPT-4-turbo with RAG over 50k documents (pgvector + LangChain). Problem: 18% of answers contained references to non-existent cases or incorrect decision dates (identified by a manual audit of 200 queries).

Solution: a two-level check was added. At the retrieval level, the reranker cross-encoder/ms-marco-MiniLM-L-6-v2 raised context_precision from 0.61 to 0.84. At the generation level: NLI verification of each legal claim, plus regex extraction of case numbers with subsequent verification through an arbitration-decisions API. The hallucination rate dropped to 3.2% over two weeks of iteration.
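The citation check from this case can be sketched as follows. The case-number pattern and the in-memory registry are illustrative assumptions; the real project verified each number through the arbitration-decisions API.

```python
import re

# Hypothetical case-number format, e.g. "A40-12345/2021".
CASE_PATTERN = re.compile(r"\bA\d{2}-\d{1,6}/\d{4}\b")

def verify_citations(answer: str, registry: set[str]) -> dict[str, bool]:
    # Extract every case number in the answer and look each one up;
    # a miss means the model cited a case that does not exist.
    return {c: c in registry for c in CASE_PATTERN.findall(answer)}

registry = {"A40-12345/2021"}
answer = "See case A40-12345/2021 and also A40-99999/2021."
result = verify_citations(answer, registry)
```

Regex extraction works here because legal case numbers follow a rigid format; free-form facts still need the NLI layer.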

Metrics for Assessing Detection Quality

| Metric | Tool | Target value |
| --- | --- | --- |
| Hallucination rate | Manual audit + NLI | < 5% for production |
| Faithfulness (RAGAS) | ragas library | > 0.80 |
| Grounding score | NLI (DeBERTa) | > 0.65 per claim |
| Self-consistency | sentence-transformers | cosine similarity > 0.75 |
| Latency overhead | | < 500 ms for detection |

Implementation Process

Audit of the current state: analyze the existing pipeline (retriever quality, chunking strategy, embedding model, prompts) and collect a dataset of 100–200 real queries with ground truth.

Baseline measurement: get the numbers (hallucination rate, faithfulness, latency). Without a baseline, it's unclear what to improve.

Multi-level detection: choose methods for the domain. Medicine requires external verification; for internal company knowledge, a grounding score is enough.

Pipeline integration: the detector is embedded as middleware. Answers with a low grounding score are marked with a warning or routed to human review.
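The routing step can be sketched as a thin wrapper. The `Answer` dataclass and `route` function are illustrative names, not a specific framework API; the 0.65 threshold matches the grounding target from the metrics table.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    grounding_score: float  # produced by the detection layers upstream

def route(answer: Answer, threshold: float = 0.65) -> str:
    """Deliver well-grounded answers; escalate the rest."""
    if answer.grounding_score >= threshold:
        return "deliver"
    # Below threshold: attach a warning or send to a human reviewer.
    return "human_review"
```

Keeping the detector as middleware means the generation code stays unchanged and the threshold can be tuned per deployment.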

Production monitoring: log all scores and build a Grafana dashboard. Metric drift is a signal to reindex or revise the prompt strategy.

Timeline: from two weeks to add detection to an existing RAG pipeline, up to two months for a full verification system with external sources in a complex domain.