The Hallucination Index is an ongoing initiative to evaluate and rank the largest and most popular LLMs on their propensity to hallucinate across common task types. The models were evaluated using a diverse set of datasets, chosen for their popularity and for how well they test each model's ability to stay on task. Below is the methodology used to create the Hallucination Index. Our RAG methodology is designed to rigorously evaluate models in RAG settings across a variety of dimensions, covering both factual accuracy and contextual adherence.
The Hallucination Index evaluated the largest and most popular LLMs available today. These LLMs were chosen by surveying popular LLM repos, leaderboards, and industry surveys. The LLMs selected represent a combination of open-source and closed-source models of varying sizes. This domain is evolving, with new models being released weekly.
The Hallucination Index will be updated every two quarters. To see an LLM added to the Index, contact us here.
Next, LLMs were tested across three common task types to observe their performance. We selected tasks relevant to developers and end-users and tested each LLM’s ability to operate with context of different lengths.
Why short, medium & long context RAG tasks?
Context length affects the design of a RAG system by influencing retrieval strategies, computational resource needs, and the balance between precision and breadth. We conducted three experiments to gauge the state of LLMs’ performance at different context lengths.
For short context lengths (less than 5,000 tokens), the pros are faster responses, better precision, and simplicity. However, they can miss out on broader context and might overfit to narrow scenarios. There is also a higher reliance on vector database precision to ensure relevant information retrieval.
Medium context lengths (5,000 to 25,000 tokens) offer a balance between detail and scope, providing more nuanced answers. They rely less on the pinpoint accuracy of vector databases, as they have more room to include context. However, they come with increased complexity and higher resource usage.
Long context lengths (40,000 to 100,000 tokens) handle detailed queries well, offering rich information and comprehensive understanding. Since extensive context can be included, the reliance on vector database precision decreases even further. The downside is slower response times, high computational costs, and potential inclusion of irrelevant information.
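The three ranges above can be written down as a small lookup. This is only an illustrative sketch: the function name `context_bucket` is hypothetical, and the treatment of boundary values (5,000 counted as medium) is an assumption. The text does not assign lengths between 25,000 and 40,000 tokens to any category, so the sketch returns `None` for them rather than guessing.

```python
from typing import Optional


def context_bucket(num_tokens: int) -> Optional[str]:
    """Map a context size in tokens to the category named in the text.

    Returns None for sizes the text does not classify
    (between 25,000 and 40,000 tokens, or above 100,000).
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if num_tokens < 5_000:
        return "short"   # less than 5,000 tokens
    if num_tokens <= 25_000:
        return "medium"  # 5,000 to 25,000 tokens
    if 40_000 <= num_tokens <= 100_000:
        return "long"    # 40,000 to 100,000 tokens
    return None          # not covered by the stated ranges


if __name__ == "__main__":
    for size in (1_200, 10_000, 30_000, 80_000):
        print(size, context_bucket(size))
```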
Short Context RAG
The short context RAG (SCR) evaluation uses a variety of demanding datasets to test how robustly models handle short contexts.
We employ ChainPoll with GPT-4o, which leverages the strong reasoning power of GPT series models. By using a chain-of-thought technique to poll the model multiple times, we can better judge the correctness of the responses. This not only provides a metric to quantify potential hallucinations but also offers explanations based on the provided context, a crucial feature for RAG systems.
Medium and Long Context RAG
Our methodology focuses on models' ability to comprehensively understand extensive texts in medium and long contexts.
We extract text from a company's very recent 10-K filings, divide it into chunks, and designate one of these chunks as the needle chunk. Using these chunks, we construct the dataset by varying the location of the needle. We then create a retrieval question that can be answered using the needle. The LLM has to answer the question using the context containing the needle.
Medium context lengths - 5k, 10k, 15k, 20k, 25k
Long context lengths - 40k, 60k, 80k, 100k
We designed the task with these considerations:
Effect of prompting technique on performance
Additionally, we experimented with a prompting technique known as Chain-of-Note to improve performance, since it has worked for short contexts.
Evaluation
Adherence to context is evaluated using a custom LLM-based assessment, checking for the relevant answer within the response.
Short Context RAG
The Hallucination Index assesses LLM performance by leveraging 4 popular and 2 proprietary datasets. The datasets effectively challenge each LLM's capabilities relevant to the task at hand. For this task, we combine the query and document into an input prompt with context.
DROP: Reading comprehension benchmark which requires Discrete Reasoning Over the content of Paragraphs. Answering requires resolving references in a question, perhaps to multiple input positions, and performing discrete operations over them (such as addition, counting, or sorting).
Microsoft MS MARCO: A dataset containing queries and paragraphs with relevance labels.
HotpotQA: A dataset with Wikipedia-based question-answer pairs that require finding and reasoning over multiple supporting documents to answer.
ConvFinQA: A dataset for studying chains of numerical reasoning in conversational question answering. It poses a great challenge in modeling long-range, complex numerical reasoning paths in real-world conversations.
Medium and Long Context RAG
We extract text from a company's very recent 10-K filings, divide it into chunks, and designate one of these chunks as the needle chunk. Using these chunks, we construct the dataset by varying the location of the needle. For each context length, we place the needle at 20 different locations to test performance.
For the dataset with a context length of 10k, we will create 20 samples, keeping the “info” at different positions in the context: 0, 500, 1000, 1500, ..., 9000, 9500.
Similarly, for the dataset with a context length of 100k, we will create 20 samples, keeping the “info” at different positions in the context: 0, 5000, 10000, 15000, ..., 90000, 95000.
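The needle placement described above can be sketched in a few lines. This is a minimal illustration, not the pipeline actually used: the helper names `needle_positions` and `build_context` are hypothetical, and contexts are represented as plain lists of tokens (for example, whitespace-split words) as a stand-in for whatever tokenizer is really applied.

```python
from typing import List


def needle_positions(context_length: int, num_samples: int = 20) -> List[int]:
    """Evenly spaced needle offsets, e.g. 0, 500, ..., 9500 for a 10k context."""
    step = context_length // num_samples
    return [i * step for i in range(num_samples)]


def build_context(
    filler: List[str], needle: List[str], position: int, context_length: int
) -> List[str]:
    """Insert the needle chunk into filler text at the given token offset.

    The result is exactly context_length tokens long. If the requested
    position leaves no room for the needle, it is shifted left to fit.
    """
    room = context_length - len(needle)
    if room < 0 or len(filler) < room:
        raise ValueError("not enough room or filler for this context length")
    position = min(position, room)
    haystack = filler[:room]
    return haystack[:position] + needle + haystack[position:]


if __name__ == "__main__":
    print(needle_positions(10_000)[:4], needle_positions(10_000)[-1])
    filler = ["filler"] * 10_000
    needle = ["NEEDLE"] * 50
    samples = [
        build_context(filler, needle, pos, 10_000)
        for pos in needle_positions(10_000)
    ]
    print(len(samples), len(samples[0]))
```

Each of the 20 samples shares the same question and needle, so any difference in the model's answers can be attributed to where the needle sits in the context.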
Once the LLMs, task types, and datasets are selected, experimentation begins.
The experimentation process is outlined below.
We follow the model's prompt format, adding context in a simple bullet point format. For long-form text generation, we use the question as the prompt and apply the necessary formatting required by the model.
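As a rough illustration of adding context in bullet point format, the sketch below assembles a prompt from a question and a list of retrieved passages. The function name `build_rag_prompt` and the instruction wording are assumptions made for the example; the actual template differs per model and is not given in the text.

```python
from typing import List


def build_rag_prompt(question: str, documents: List[str]) -> str:
    """Format retrieved passages as bullet points ahead of the question.

    The instruction sentence and section labels are hypothetical;
    a real prompt would follow the target model's own format.
    """
    context = "\n".join(f"- {doc.strip()}" for doc in documents)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(
        build_rag_prompt(
            "What was the reported revenue?",
            ["Revenue for the year was $10M. ", "Net income was $2M."],
        )
    )
```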
Generation: Responses are generated using private APIs, Together, and models hosted on Hugging Face.
Scoring
After the prompts and generations were prepared for each model and dataset, the generations were evaluated using ChainPoll to obtain the task score. ChainPoll utilizes the strong reasoning abilities of GPT models and polls the model multiple times to assess the accuracy of the response. This approach not only quantifies the extent of potential errors but also provides an explanation based on the given context, particularly in the case of RAG-based systems.
ChainPoll: A High Efficacy Method for LLM Hallucination Detection
A high accuracy methodology for hallucination detection that provides an 85% correlation with human feedback - your first line of defense when evaluating model outputs.
ChainPoll is a novel approach to hallucination detection that is substantially more accurate than any metric we’ve encountered in the academic literature. Across a diverse range of benchmark tasks, ChainPoll outperforms all other methods, in most cases by a huge margin.
ChainPoll dramatically out-performs a range of published alternatives – including SelfCheckGPT, GPTScore, G-Eval, and TRUE – in a head-to-head comparison on RealHall.
ChainPoll is also faster and more cost-effective than most of the metrics listed above.
Unlike all other methods considered here, ChainPoll also provides human-readable verbal justifications for the judgments it makes, via the chain-of-thought text produced during inference.
Though much of the research literature concentrates on the easier case of closed-domain hallucination detection, we show that ChainPoll is equally strong when detecting either open-domain or closed-domain hallucinations. We develop versions of ChainPoll specialized to each of these cases: ChainPoll-Correctness for open-domain and ChainPoll-Adherence for closed-domain.
Metric | Aggregate AUROC |
---|---|
ChainPoll-GPT-4o | 0.86 |
SelfCheck-Bertscore | 0.74 |
SelfCheck-NGram | 0.70 |
G-Eval | 0.70 |
Max pseudo-entropy | 0.77 |
GPTScore | 0.65 |
Random Guessing | 0.60 |
How does this work?
ChainPoll piggybacks on the strong reasoning power of LLMs and further leverages a chain-of-thought technique to poll the model multiple times to judge the correctness of the response. This technique not only provides a metric to quantify the degree of potential hallucinations, but also provides an explanation based on the context provided, in the case of RAG-based systems.
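The polling idea can be sketched as follows. This is a simplified illustration under stated assumptions, not Galileo's implementation: the judge is a hypothetical callable that returns a yes/no verdict plus its chain-of-thought text, the score is assumed to be the fraction of polls that judge the response as supported, and `scripted_judge` is a deterministic stand-in for a real LLM call.

```python
from typing import Callable, Iterable, List, Tuple

# A judge takes a prompt and returns (verdict, chain-of-thought explanation).
# In practice this would wrap an LLM call; here it is a hypothetical stub.
Judge = Callable[[str], Tuple[bool, str]]


def chainpoll_score(
    judge: Judge, prompt: str, num_polls: int = 5
) -> Tuple[float, List[str]]:
    """Poll the judge several times and aggregate its verdicts.

    Assumption: the score is the share of polls in which the judge
    says the response is supported (1.0 = always, 0.0 = never).
    """
    if num_polls < 1:
        raise ValueError("num_polls must be at least 1")
    votes: List[bool] = []
    explanations: List[str] = []
    for _ in range(num_polls):
        supported, reasoning = judge(prompt)
        votes.append(supported)
        explanations.append(reasoning)
    return sum(votes) / num_polls, explanations


def scripted_judge(verdicts: Iterable[bool]) -> Judge:
    """Build a deterministic stand-in judge that replays fixed verdicts."""
    replay = iter(verdicts)

    def judge(prompt: str) -> Tuple[bool, str]:
        verdict = next(replay)
        return verdict, f"stub reasoning: supported={verdict}"

    return judge


if __name__ == "__main__":
    judge = scripted_judge([True, True, False, True, True])
    score, notes = chainpoll_score(judge, "Is the response supported by the context?")
    print(score)  # 0.8
```

Keeping the explanations alongside the score mirrors the point made above: the same polling pass yields both a number and the reasoning behind it.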
Evaluation
We selected an LLM-based evaluation to keep the approach scalable. ChainPoll powers the metrics used to evaluate output propensity for hallucination.
Task score
The final score shown is calculated as the mean of the score for each task dataset. The dataset score is the mean of the ChainPoll score for each sample.
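The two-level averaging can be made concrete with a short sketch. The function names and the sample numbers are illustrative assumptions, not values from the Index.

```python
from statistics import mean
from typing import Dict, List


def dataset_score(sample_scores: List[float]) -> float:
    """Mean ChainPoll score over the samples of one dataset."""
    return mean(sample_scores)


def task_score(scores_by_dataset: Dict[str, List[float]]) -> float:
    """Mean of the per-dataset scores, so each dataset counts equally."""
    return mean(dataset_score(s) for s in scores_by_dataset.values())


if __name__ == "__main__":
    # Hypothetical per-sample scores for two datasets of different sizes.
    scores = {
        "dataset_a": [1.0, 0.0],
        "dataset_b": [1.0, 1.0, 1.0, 1.0],
    }
    print(task_score(scores))  # 0.75
```

Averaging dataset means gives every dataset equal weight regardless of size. Pooling all six samples in the example would instead give about 0.83, because the larger dataset would dominate.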
Learn more
We have developed a comprehensive set of RAG metrics to cover various evaluation aspects of these models. Our documentation provides a detailed breakdown of each RAG metric and our methodologies.
About Context Adherence
Context Adherence evaluates the degree to which a model's response aligns strictly with the given context, serving as a metric to gauge closed-domain hallucinations, wherein the model generates content that deviates from the provided context.
The higher the Context Adherence score (i.e., a value of 1 or close to 1), the more likely the response is to contain only information from the context provided to the model.
The lower the Context Adherence score (i.e., a value of 0 or close to 0), the more likely the response is to contain information not included in the context provided to the model.
These metrics are powered by ChainPoll, a hallucination detection methodology developed by Galileo Labs. You can read more about ChainPoll here: https://arxiv.org/abs/2310.18344
Task Alignment
Begin by identifying which of our benchmarking task types aligns most closely with your specific application.
Top 3 Model Selection
Based on your criteria, carefully select the three top-performing models for your identified task. Weigh factors such as performance, cost, and privacy against your objectives.
Exploration of New Models
Extend your model pool by adding any additional models you believe could deliver strong performance in your application context. This proactive approach allows for a more comprehensive evaluation.
Data Preparation
Prepare a high-quality evaluation dataset using real-world data specific to your task. This dataset should be representative of the challenges and nuances you expect to face in production.
Performance Evaluation
Execute a thorough evaluation of the selected models using your prepared dataset. Assess their performance based on relevant metrics, ensuring a comprehensive understanding of each model's strengths and weaknesses.