The Hallucination Index is an ongoing initiative to evaluate and rank the largest and most popular LLMs by their propensity to hallucinate across common task types. The models were evaluated using a diverse set of datasets, chosen for their popularity and their ability to challenge the models' capacity to stay on task. Below is the methodology used to create the Hallucination Index.
The Hallucination Index evaluated the largest and most popular LLMs available today. LLMs were chosen by surveying popular LLM repos, leaderboards, and industry surveys. The LLMs selected represent a combination of open-source and closed-source models of varying sizes. This domain is evolving, with new models being released on a weekly basis.
The Hallucination Index will be updated quarterly. To see an LLM added to the Hallucination Index, reach out here.
Next, the LLMs were tested across three common task types to observe their performance. The task types were chosen for their relevance to developers and end-users, and to test each LLM's ability to operate both with and without context.
Task Types Selected:
Question & Answer without RAG
When presented with a question, the model relies on the internal knowledge and understanding it acquired during training. It generates answers based on patterns, facts, and relationships it has learned, without referencing external sources of information.
Question & Answer with RAG
When presented with a question, the model uses information retrieved from a given dataset, database, or set of documents to provide an accurate answer. This approach is akin to looking up information in a reference book or searching a database before responding.
Long-Form Text Generation
Using generative AI to create extensive and coherent pieces of text such as reports, articles, essays, or stories. For this use-case, AI models are trained on large datasets to understand context, maintain subject relevance, and mimic a natural writing style over longer passages.
The Hallucination Index assesses LLM performance by leveraging 7 popular datasets that effectively challenge each LLM's capabilities on the task at hand. For the Q&A with RAG task, the query and the retrieved document are combined to form an input prompt with context; for Q&A without RAG and Long-Form Text Generation, the question alone is used as the prompt, with whatever formatting the respective model requires (see the sketch after the table below).
| Task Type | Q&A without RAG | Q&A with RAG | Long-Form Text Generation |
|---|---|---|---|
| Definition | Generates a short answer to a question without context | Generates an answer to a question with context | Generates a long-form answer based on a prompt |
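As a rough illustration of the prompt construction described above, the sketch below shows how a query and a retrieved document might be combined into a context prompt for the RAG task, and how the question alone is used for the other two tasks. The templates and function names are illustrative assumptions, not the exact prompts used in the Index.

```python
# Illustrative prompt construction for the three task types.
# The templates below are assumptions, not the exact prompts used in the Index.

def build_rag_prompt(question: str, context_document: str) -> str:
    """Q&A with RAG: the retrieved document is injected as context."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_document}\n\n"
        f"Question: {question}\nAnswer:"
    )

def build_no_context_prompt(question: str) -> str:
    """Q&A without RAG / long-form generation: the question (or prompt) is used
    as-is, with any model-specific chat formatting applied separately."""
    return f"Question: {question}\nAnswer:"
```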
Once LLMs, Task Types, and Datasets were selected, experimentation began. The experimentation process is outlined below.
Scoring
After the prompts and generations were ready for each model and dataset, the outputs were scored with ChainPoll to produce the task score.
Evaluation
We selected an LLM-based evaluation to keep the approach scalable. The metrics used to evaluate each output's propensity for hallucination are powered by ChainPoll.
Existing benchmarks stick to traditional statistical metrics, but reliably detecting hallucinations depends on capturing qualitative nuances in the model's output that are specific to each task type.
While asking GPT-4 to detect hallucinations is a popular (albeit expensive) approach, ChainPoll has emerged as a superior method for detecting hallucinations in a model's outputs, correlating strongly with human benchmarks.
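For readers unfamiliar with ChainPoll, the sketch below captures its core idea as described in the ChainPoll paper: poll a judge LLM several times with a chain-of-thought prompt and aggregate the votes into a score. The `judge` callable and the prompt wording are placeholders, not Galileo's implementation.

```python
import re
from typing import Callable

def chainpoll_score(prompt: str, response: str, judge: Callable[[str], str],
                    n_polls: int = 5) -> float:
    """Minimal ChainPoll-style sketch: poll a judge LLM several times with a
    chain-of-thought prompt and return the fraction of 'yes' (hallucinated) votes.
    `judge` is any function mapping a prompt string to the judge model's reply."""
    judge_prompt = (
        "Does the following response contain hallucinations?\n"
        f"Prompt: {prompt}\nResponse: {response}\n"
        "Think step by step, then answer 'yes' or 'no' on the final line."
    )
    votes = 0
    for _ in range(n_polls):
        reply = judge(judge_prompt)
        final_line = reply.strip().splitlines()[-1].lower()
        if re.search(r"\byes\b", final_line):
            votes += 1
    return votes / n_polls  # closer to 1.0 means more likely hallucinated
```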
Annotation
We leveraged human annotation to confirm the reliability of the metric for each task type, both in our ChainPoll experiments and in the Index experiments. The ChainPoll paper uses the RealHall dataset, which consists of open- and closed-domain prompts.
Task score
The final score shown in the bar chart is the mean of the scores for each dataset in the task. Each dataset score is the mean of the ChainPoll scores across its samples.
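In code, this aggregation is simply a mean of means; the sketch below is a minimal illustration of that calculation, not the Index's actual pipeline.

```python
from statistics import mean

def dataset_score(sample_scores: list[float]) -> float:
    """Dataset score: mean ChainPoll score across the dataset's samples."""
    return mean(sample_scores)

def task_score(per_dataset_sample_scores: dict[str, list[float]]) -> float:
    """Task score: mean of the dataset scores for all datasets in the task."""
    return mean(dataset_score(scores)
                for scores in per_dataset_sample_scores.values())
```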
| Metric | Aggregate AUROC |
|---|---|
| ChainPoll-GPT-4o | 0.86 |
| SelfCheck-Bertscore | 0.74 |
| SelfCheck-NGram | 0.70 |
| G-Eval | 0.70 |
| Max pseudo-entropy | 0.77 |
| GPTScore | 0.65 |
| Random Guessing | 0.60 |
Hallucination detection performance on RealHall, averaged across datasets
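The AUROC figures above compare each detector's scores against human hallucination labels. The snippet below is a minimal illustration of how such a number can be computed with scikit-learn; the labels and scores are made-up placeholders, not RealHall data.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical example: human hallucination labels (1 = hallucinated) and the
# detector's scores for the same outputs. AUROC measures how well the scores
# rank hallucinated outputs above faithful ones; 0.5 is random guessing.
human_labels = [1, 0, 0, 1, 0, 1, 0, 0]
detector_scores = [0.9, 0.2, 0.4, 0.7, 0.1, 0.8, 0.3, 0.5]

print(roc_auc_score(human_labels, detector_scores))
```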
Evaluation Metrics
Correctness measures whether a given model response is factual or not. It uncovers open-domain hallucinations: factual errors that do not relate to any specific documents or context.
Context Adherence evaluates the degree to which a model's response aligns strictly with the given context, serving as a metric to gauge closed-domain hallucinations, wherein the model generates content that deviates from the provided context.
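One way to picture the difference between the two metrics is in terms of the judge prompts they imply: Correctness checks the response on its own, while Context Adherence checks it against the provided context. The prompt wording below is purely illustrative, not Galileo's actual prompts; either check could be scored with a ChainPoll-style polling loop like the one sketched earlier.

```python
def correctness_prompt(question: str, response: str) -> str:
    """Open-domain check: is the response factually accurate on its own?"""
    return (
        f"Question: {question}\nResponse: {response}\n"
        "Is the response factually accurate? "
        "Think step by step, then answer 'yes' or 'no'."
    )

def context_adherence_prompt(context: str, question: str, response: str) -> str:
    """Closed-domain check: is every claim supported by the given context?"""
    return (
        f"Context:\n{context}\n\nQuestion: {question}\nResponse: {response}\n"
        "Is every claim in the response supported by the context? "
        "Think step by step, then answer 'yes' or 'no'."
    )
```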
While our model ranking provides valuable insights for various tasks, we acknowledge that it does not cover all applications and domains comprehensively. To address this, we have plans to incorporate additional models and datasets in the future. To request a specific model, get in touch below. In the meantime, here's a suggested approach to refine your model selection process:
Task Alignment
Begin by identifying which of our benchmarking task types aligns most closely with your specific application.
Top 3 Model Selection
Based on your criteria, carefully select the three top-performing models for your identified task. Consider factors such as performance, cost, and privacy alongside your objectives.
Exploration of New Models
Extend your model pool by adding any additional models you believe could deliver strong performance in your application context. This proactive approach allows for a more comprehensive evaluation.
Data Preparation
Prepare a high-quality evaluation dataset using real-world data specific to your task. This dataset should be representative of the challenges and nuances you will face in production.
Performance Evaluation
Execute a thorough evaluation of the selected models using your prepared dataset. Assess their performance based on relevant metrics, ensuring a comprehensive understanding of each model's strengths and weaknesses.
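A minimal sketch of such an evaluation is shown below, assuming a hypothetical `generate` client for your shortlisted models and a `score_fn` metric such as a ChainPoll-style scorer; neither is an API described in this article.

```python
from statistics import mean
from typing import Callable

def compare_models(models: list[str],
                   eval_set: list[dict],
                   generate: Callable[[str, str], str],
                   score_fn: Callable[[str, str], float]) -> dict[str, float]:
    """Run each candidate model over the evaluation set and average the metric.
    `generate(model, prompt)` returns the model's response; `score_fn` scores it."""
    results = {}
    for model in models:
        scores = [score_fn(sample["prompt"], generate(model, sample["prompt"]))
                  for sample in eval_set]
        results[model] = mean(scores)
    return results
```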
By following these steps, you'll gain a nuanced perspective on model suitability for your application, enabling you to make informed decisions in selecting the most appropriate model. Stay tuned for updates as we expand our model offerings to further cater to diverse applications and domains.