We’ve been working on the hallucination problem from a systems perspective rather than a model perspective.
Instead of trying to improve generation quality, we focused on constraining when a model is allowed to produce an answer.
We ran a controlled benchmark:
• 200 questions (100 answerable, 100 unanswerable)
• Same base model across all conditions
• Compared plain LLM, standard RAG, and our architecture
• 3 independent AI judges from different model families
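For the judge setup, one simple way to combine three independent verdicts is majority voting with a conservative tie-break. This is just an illustrative sketch, not the paper's actual aggregation scheme; the labels and the tie-break rule are assumptions:

```python
from collections import Counter

def majority_verdict(verdicts):
    """Combine independent judge verdicts (e.g. "correct",
    "hallucination", "refusal") by simple majority.
    If no label wins a strict majority, fall back to the most
    pessimistic label -- an assumption, not the authors' rule."""
    counts = Counter(verdicts)
    label, n = counts.most_common(1)[0]
    if n > len(verdicts) // 2:
        return label
    return "hallucination"  # conservative tie-break (assumption)

# Two judges agree, so the majority label wins:
print(majority_verdict(["correct", "correct", "hallucination"]))  # correct
```

Using judges from different model families, as done here, helps keep their errors less correlated, which is what makes a majority vote meaningful.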
Results:
• Plain LLM: ~28% accuracy, ~16% hallucination
• RAG: ~31% accuracy, ~29% hallucination
• Our system: ~95% accuracy, ~1.5% hallucination
One surprising result was that RAG increased hallucination in our test setup.
The key difference is a gating layer that checks whether a candidate answer is sufficiently supported by the retrieved evidence before it is returned. If it isn't, the system refuses to answer rather than guessing.
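The gating idea can be sketched as a thin wrapper around a normal RAG pipeline. Everything here is a placeholder: the function names, the scalar support score, and the 0.8 threshold are illustrative assumptions, not the actual implementation:

```python
def gated_answer(question, retrieve, generate, support_score, threshold=0.8):
    """Sketch of answer gating: only return a generated answer when
    its support against the retrieved evidence clears a threshold;
    otherwise refuse (return None). All components are injected as
    callables so the gate itself stays model-agnostic."""
    evidence = retrieve(question)
    draft = generate(question, evidence)
    if support_score(draft, evidence) >= threshold:
        return draft
    return None  # refusal path: no answer is better than an unsupported one

# Toy usage with stub components (illustrative only):
answer = gated_answer(
    "q",
    retrieve=lambda q: ["doc"],
    generate=lambda q, e: "draft",
    support_score=lambda d, e: 0.9,  # well-supported -> answer is returned
)
```

The interesting design question is what `support_score` is in practice (an entailment model, a judge model, retrieval-overlap heuristics), since that single callable is what trades accuracy on answerable questions against hallucination on unanswerable ones.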
Full paper - https://www.apothyai.com/benchmark
Would appreciate:
• feedback on methodology
• ideas for stronger evaluation
• replication attempts
Happy to answer questions.