A Ranking & Evaluation Framework For LLM Hallucinations
The LLM landscape has changed a lot since we launched our first Hallucination Index in November 2023, with larger, more powerful open- and closed-source models being announced monthly. Since then, two things happened: the term "hallucinate" became Dictionary.com’s Word of the Year, and Retrieval-Augmented Generation (RAG) has become one of the leading methods for building AI solutions. And while the parameters and context lengths of these models continue to grow, the risk of hallucinations remains.
Our new Index evaluates how well 22 of the leading models adhere to given context, helping developers make informed decisions about balancing price and performance. We rigorously tested each model with inputs ranging from 1,000 to 100,000 tokens to see how it performs across short, medium, and long context lengths. So let's dive into the insights. Welcome to the new Hallucination Index - RAG Special!
Adding more context has emerged as a new way to improve RAG performance and reduce reliance on vector databases. So, we tested each LLM across three scenarios, each with a different context length (see the bucketing sketch after the list below).
Short Context: less than 5k tokens, equivalent to RAG on a few pages.
Medium Context: 5k to 25k tokens, equivalent to RAG on a book chapter.
Long Context: 40k to 100k tokens, equivalent to RAG on a book.
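To make the bucketing concrete, here is a minimal sketch of how inputs could be sorted into these three scenarios by token count. The tiktoken cl100k_base encoding is an assumption made for illustration; the Index does not specify which tokenizer was used.

```python
# Minimal sketch: bucketing prompts into the three context-length scenarios.
# Assumes the tiktoken cl100k_base encoding for counting tokens (an illustrative
# assumption); treat the resulting counts as approximate.
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

def context_bucket(text: str) -> str:
    """Classify an input as short, medium, or long context by token count."""
    n_tokens = len(encoder.encode(text))
    if n_tokens < 5_000:
        return "short"        # RAG on a few pages
    if n_tokens <= 25_000:
        return "medium"       # RAG on a book chapter
    if 40_000 <= n_tokens <= 100_000:
        return "long"         # RAG on a whole book
    return "out of range"     # 25k-40k and >100k fall outside the tested scenarios
```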
We took the following steps when testing each LLM:
1. We gathered diverse datasets reflecting real-world scenarios across three different context lengths.
2. We employed a high-performance evaluation metric, Context Adherence, to measure factual accuracy and closed-domain hallucinations: cases where the model asserted things that were not provided in the context data.
Learn more about the Context Adherence evaluation metric and the ChainPoll evaluation method.
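For intuition on how such a metric can be computed, here is a simplified ChainPoll-style sketch: an LLM judge is polled several times with a chain-of-thought prompt, and the "yes" verdicts are averaged into an adherence score. The judge model (gpt-4o), prompt wording, and poll count below are illustrative assumptions, not the exact implementation used in the Index.

```python
# Simplified ChainPoll-style sketch: poll an LLM judge several times with a
# chain-of-thought prompt and average the yes/no verdicts into an adherence score.
# The judge model, prompt wording, and poll count are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a RAG answer.
Context:
{context}

Answer:
{answer}

Think step by step, then on the last line write only "yes" if every claim in the
answer is supported by the context, or "no" if any claim is not."""

def context_adherence(context: str, answer: str, polls: int = 5) -> float:
    """Return the fraction of judge runs that found the answer fully supported."""
    votes = 0
    for _ in range(polls):
        reply = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
            temperature=1.0,  # sampling variation is what makes repeated polling informative
        )
        verdict = reply.choices[0].message.content.strip().splitlines()[-1].lower()
        votes += verdict.startswith("yes")
    return votes / polls
```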
While closed-source models still offer the best performance thanks to proprietary training data, open-source models like Gemma, Llama, and Qwen continue to close the gap on hallucination performance without the cost barriers of their closed-source counterparts.
We were surprised to find that models perform particularly well with extended context lengths without losing quality or accuracy, reflecting how far model training and architecture have come.
In certain cases, smaller models outperformed larger ones. Gemini-1.5-flash-001, for example, beat several larger models, which suggests that efficiency in model design can sometimes outweigh scale.
During testing, Anthropic's latest Claude 3.5 Sonnet scored close to perfect, beating out o1 and GPT-4o in shorter-context scenarios while remaining cost-effective.
Best performing model: Claude 3.5 Sonnet, due to great performance on all tasks and context support up to 200k tokens.
Best performance for the cost: GPT-4o-mini, due to near-flawless performance on all tasks at an affordable price.
Best performing open-source model: Qwen2-72B-Instruct, due to great performance on short and medium context RAG with context support up to 128k tokens.