LLM Hallucination Index

RAG SPECIAL

Brought to you by Galileo

A Ranking & Evaluation Framework For LLM Hallucinations

Get The Full Report

Welcome to the Hallucination Index!

The LLM landscape has changed a lot since we launched our first Hallucination Index in November 2023, with larger, more powerful open- and closed-source models being announced monthly. Since then, two things have happened: the term "hallucinate" became Dictionary.com’s Word of the Year, and Retrieval-Augmented Generation (RAG) has become one of the leading methods for building AI solutions. And while the parameters and context lengths of these models continue to grow, the risk of hallucinations remains.

Our new Index evaluates how well 22 leading models adhere to the context they are given, helping developers make informed decisions about balancing price and performance. We rigorously tested top LLMs with inputs ranging from 1,000 to 100,000 tokens to answer how well they perform across short, medium, and long context lengths. So let's dive into the insights. Welcome to the new Hallucination Index - RAG Special!

About the Index

What?

Providing additional context has emerged as a new way to improve RAG performance and reduce reliance on vector databases. So, we tested each LLM across three scenarios, each with a different context length.

Short Context

Less than 5k tokens
equivalent to RAG on a few pages

Medium Context

5k to 25k tokens
equivalent to RAG on a book chapter

Long Context

40k to 100k tokens
equivalent to RAG on a book

Learn more about task type selection
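To make the three scenarios concrete, the token thresholds above can be expressed as a small bucketing helper. This is a minimal Python sketch for illustration only; the function name and the handling of inputs that fall in the untested 25k-40k gap are our assumptions, not part of the Index methodology.

```python
from typing import Optional

def context_bucket(num_tokens: int) -> Optional[str]:
    """Map an input's token count to the Index's context-length scenario."""
    if num_tokens < 5_000:
        return "short"   # RAG on a few pages
    if 5_000 <= num_tokens <= 25_000:
        return "medium"  # RAG on a book chapter
    if 40_000 <= num_tokens <= 100_000:
        return "long"    # RAG on a whole book
    return None          # outside the ranges tested in the Index
```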

How?

We followed these steps when testing each LLM:

1. We gathered diverse datasets reflecting real-world scenarios across three different context lengths.

2. We scored each response with our Context Adherence evaluation metric, which measures factual accuracy and closed-domain hallucinations - cases where the model said things that were not provided in the context data.

Learn more about the Context Adherence evaluation metric and the ChainPoll evaluation method.
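As a rough illustration of how a ChainPoll-style check works, the sketch below polls an LLM judge several times with a chain-of-thought prompt asking whether the response is fully supported by the retrieved context, then reports the fraction of "supported" verdicts as the adherence score. The judge callable, the prompt wording, and the number of polls are illustrative assumptions, not Galileo's exact implementation.

```python
from typing import Callable

JUDGE_PROMPT = """You are grading a RAG response.

Context:
{context}

Response:
{response}

Think step by step, then answer on the last line with exactly
"SUPPORTED" if every claim in the response is backed by the context,
or "NOT SUPPORTED" otherwise."""

def context_adherence_score(
    context: str,
    response: str,
    judge: Callable[[str], str],  # hypothetical wrapper around any LLM completion API
    n_polls: int = 5,
) -> float:
    """Fraction of chain-of-thought judge polls that find the response context-supported."""
    prompt = JUDGE_PROMPT.format(context=context, response=response)
    votes = 0
    for _ in range(n_polls):
        verdict = judge(prompt)  # one independent chain-of-thought judgment
        last_line = verdict.strip().splitlines()[-1].upper()
        votes += int("SUPPORTED" in last_line and "NOT SUPPORTED" not in last_line)
    return votes / n_polls
```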

10 closed-source models
12 open-source models
3 RAG tasks

Trends

01

Open source is closing the gap

While closed-source models still offer the best performance thanks to proprietary training data, open-source models like Gemma, Llama, and Qwen continue to improve in hallucination performance without the cost barriers of their closed-source counterparts.

02

What context length?

We were surprised to find that models perform particularly well with extended context lengths, without losing quality or accuracy - a reflection of how far model training and architecture have come.

03

Larger is not always better

In certain cases, smaller models outperformed their larger counterparts. For example, Gemini-1.5-flash-001 beat several larger models, suggesting that efficiency in model design can sometimes outweigh scale.

04

Anthropic outperforms OpenAI

During testing, Anthropic's latest Claude 3.5 Sonnet scored close to perfect, beating out o1 and GPT-4o in shorter context scenarios while remaining cost-effective.