Methodology

The Hallucination Index is an ongoing initiative to evaluate and rank the largest and most popular LLMs' propensity to hallucinate across common task types. The models were evaluated using a diverse set of datasets, chosen for their popularity and for their ability to challenge each model's capacity to stay on task. The methodology used to create the Hallucination Index is described below.

1

Model Selection


The Hallucination Index evaluated the largest and most popular LLMs available today. LLMs were chosen by reviewing popular LLM repositories, leaderboards, and industry surveys, and the selected models represent a combination of open-source and closed-source models of varying sizes. This domain is evolving rapidly, with new models released on a weekly basis.

The Hallucination Index will be updated quarterly. To see an LLM added to the Hallucination Index, reach out here.


2

Task Type Selection


Next, LLMs were tested across three common task types to observe their performance. We chose tasks that are relevant to developers and end-users and that test each LLM's ability to operate both with and without context.


Task Types Selected:


  • Question & Answer without RAG

    The model relies on the internal knowledge and understanding it acquired during training. When presented with a question, it generates answers based on the patterns, facts, and relationships it has learned, without referencing external sources of information.

  • Question & Answer with RAG

    A model that, when presented with a question, uses retrieved information from a given dataset, database, or set of documents to provide an accurate answer. This approach is akin to looking up information in a reference book or searching a database before responding.

  • Long-Form Text Generation

    Using generative AI to create extensive and coherent pieces of text such as reports, articles, essays, or stories. For this use-case, AI models are trained on large datasets to understand context, maintain subject relevance, and mimic a natural writing style over longer passages.

3

Dataset Selection


The Hallucination Index assesses LLM performance using seven popular datasets chosen to challenge each LLM's capabilities on the task at hand. For the Q&A with RAG task, the query and retrieved document are combined to form an input prompt with context. For Q&A without RAG and long-form text generation, the question itself is used as the prompt, with whatever formatting the respective model requires.
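
To make that conversion concrete, here is a minimal sketch of how dataset samples might be turned into prompts. It is an illustration, not the Index's actual code: the function names, the bullet-point context format, and the example sample are assumptions.

    # Illustrative sketch of turning dataset samples into prompts (not the Index's actual code).

    def prompt_without_context(question: str) -> str:
        """Q&A without RAG and long-form generation: the question alone is the prompt;
        model-specific chat/instruction formatting is applied afterwards."""
        return question

    def prompt_with_context(question: str, documents: list[str]) -> str:
        """Q&A with RAG: retrieved passages are prepended to the question as context."""
        context = "\n".join(f"- {doc}" for doc in documents)
        return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

    # Hypothetical HotpotQA-style sample
    print(prompt_with_context(
        "Which city hosted the 1992 Summer Olympics?",
        ["The 1992 Summer Olympics were held in Barcelona, Spain."],
    ))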


Task Type                    Definition
Q&A without RAG              Generates a short answer to a question without context
Q&A with RAG                 Generates an answer to a question with context
Long-Form Text Generation    Generates a long answer based on a prompt

The datasets used for each task type are listed below.

Q&A without RAG


  • TruthfulQA: A benchmark that measures the truthfulness of large language models on a question-answering task.
  • TriviaQA: A reading comprehension dataset containing question-answer-evidence triples. 

Q&A with RAG


  • NarrativeQA: A dataset of stories and corresponding questions designed to test reading comprehension, especially on long documents.
  • DROP: Reading comprehension benchmark which requires Discrete Reasoning Over the content of Paragraphs. Answering requires resolving references in a question, perhaps to multiple input positions, and performing discrete operations over them (such as addition, counting, or sorting). 
  • MS MARCO: A Microsoft dataset containing queries and paragraphs with relevance labels.
  • HotpotQA: A dataset with Wikipedia-based question-answer pairs that require finding and reasoning over multiple supporting documents to answer. 

Long-Form Text Generation


  • OpenAssistant: A human-generated, human-annotated assistant-style conversation corpus. It covers factual questions on varied domains.
4

Experimentation


Once LLMs, Task Types, and Datasets were selected, experimentation began. The experimentation process is outlined below.


  • Prompt formatting: Prompts were formatted according to the needs of each task.
    • Q&A without RAG: We construct zero-shot prompts without any chain-of-thought (CoT) instructions, using the prompt format required by each model.
    • Q&A with RAG: We use the same zero-shot, no-CoT setup and add the retrieved context to the prompt in a simple bullet-point format.
    • Long-form text generation: We use the question as the prompt, along with the prompt formatting required by the model.
  • Generation: All generations use the same text-generation configuration, served from a Text Generation Inference (TGI) server with bitsandbytes NF4 quantization (a client-side sketch follows this list).
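
As a rough illustration of that setup: the sketch below assumes a TGI server is already running locally, launched with NF4 quantization (for example via text-generation-launcher --model-id <model> --quantize bitsandbytes-nf4; exact flag names can vary by TGI version), and queries it with a fixed generation configuration through huggingface_hub's InferenceClient. The parameter values are placeholders, not the settings used for the Index.

    # Client-side sketch; assumes a TGI server is already running at localhost:8080,
    # launched with bitsandbytes NF4 quantization. Generation parameters are placeholders.
    from huggingface_hub import InferenceClient

    client = InferenceClient(model="http://localhost:8080")

    GEN_KWARGS = {"max_new_tokens": 256, "temperature": 0.1}  # reused for every model and dataset

    def generate(prompt: str) -> str:
        return client.text_generation(prompt, **GEN_KWARGS)

    print(generate("Question: Who wrote 'The Old Man and the Sea'?\nAnswer:"))
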
5

Evaluation


Scoring
After the prompts and generations were ready for each model and dataset, the outputs were scored with ChainPoll to produce a task score.


Evaluation
We selected an LLM-based evaluation to keep the approach scalable. The metrics used to evaluate each output's propensity for hallucination are powered by ChainPoll.

Existing benchmarks rely on traditional statistical metrics, but reliably detecting hallucinations depends on capturing qualitative nuances in a model's output that are specific to each task type.

While asking GPT-4 to detect hallucinations is a popular (albeit expensive) approach, ChainPoll has emerged as a superior method for detecting hallucinations in a model's outputs, with a high correlation against human benchmarking.
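
At a high level, ChainPoll polls an LLM judge several times with a chain-of-thought prompt and aggregates the verdicts. The sketch below captures that general recipe only; the judge prompt, the number of polls, and the verdict parsing are simplified assumptions, not the actual ChainPoll implementation (see the paper linked below).

    # Minimal sketch of the ChainPoll recipe: poll an LLM judge with a chain-of-thought
    # prompt several times and average the verdicts. Prompt wording, poll count, and
    # parsing are assumptions for illustration.
    from typing import Callable

    JUDGE_PROMPT = (
        "Does the answer below contain claims that are not factual? Explain your "
        "reasoning step by step, then end with 'VERDICT: yes' or 'VERDICT: no'.\n\n"
        "Question: {question}\nAnswer: {answer}\n"
    )

    def chainpoll_score(question: str, answer: str,
                        judge: Callable[[str], str], n_polls: int = 5) -> float:
        """Fraction of judge runs that find no hallucination (1.0 = looks clean).
        For RAG tasks, the retrieved context would also be included in the prompt."""
        clean_votes = 0
        for _ in range(n_polls):
            verdict = judge(JUDGE_PROMPT.format(question=question, answer=answer))
            if "verdict: no" in verdict.lower():
                clean_votes += 1
        return clean_votes / n_polls

Here, judge can wrap any chat-completion API that returns the judge model's text.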

Annotation
We leveraged human annotation to confirm the reliability of the metric for each task type, both in our ChainPoll experiments and in the Index experiments. The ChainPoll paper uses the RealHall benchmark, which consists of open- and closed-domain prompts.

Task score
The final score shown in the bar chart is the mean of the scores for each dataset in the task. Each dataset score is the mean of the ChainPoll scores across that dataset's samples.
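
In other words, samples are averaged within each dataset, and those dataset means are averaged to produce the task score. A trivial sketch (the dataset names and score values are purely illustrative):

    from statistics import mean

    # Per-sample ChainPoll scores for one task; names and values are illustrative only.
    task_datasets = {
        "narrativeqa": [0.8, 1.0, 0.6],
        "hotpotqa": [1.0, 0.4],
    }

    dataset_scores = {name: mean(samples) for name, samples in task_datasets.items()}
    task_score = mean(dataset_scores.values())  # the value shown in the bar chart
    print(dataset_scores, task_score)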


Metric                    Aggregate AUROC
ChainPoll                 0.78
SelfCheck-Bertscore       0.67
SelfCheck-NGram           0.64
G-Eval                    0.58
Max pseudo-entropy        0.55
GPTScore                  0.52
Random Guessing           0.50

Hallucination detection performance on RealHall, averaged across datasets



Evaluation Metrics


  • Correctness:

    Measures whether a given model response is factually accurate. Correctness uncovers open-domain hallucinations: factual errors that do not relate to any specific document or context.

    • The higher the Correctness score (i.e., a value of 1 or close to 1), the higher the probability that the response is accurate.
    • The lower the Correctness score (i.e., a value of 0 or close to 0), the higher the probability of hallucination and factual errors.
  • Context Adherence:

    Context Adherence evaluates the degree to which a model's response aligns strictly with the given context, serving as a metric to gauge closed-domain hallucinations, wherein the model generates content that deviates from the provided context.

    • The higher the Context Adherence score (i.e., a value of 1 or close to 1), the more likely the response contains only information from the context provided to the model.
    • The lower the Context Adherence score (i.e., a value of 0 or close to 0), the more likely the response contains information not included in the context provided to the model.


  • These metrics are powered by ChainPoll, a hallucination detection methodology developed by Galileo Labs. You can read more about ChainPoll here: https://arxiv.org/abs/2310.18344
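
As a small usage illustration of the two metrics above, a downstream filter might flag responses whose scores fall below a cutoff. The function and the 0.5 threshold below are hypothetical examples, not recommendations from the Index.

    from typing import Optional

    # Hypothetical downstream filter; the 0.5 threshold is an arbitrary example.
    def likely_hallucinated(correctness: float,
                            context_adherence: Optional[float] = None,
                            threshold: float = 0.5) -> bool:
        """Flag a response when either ChainPoll-powered metric falls below the threshold.
        context_adherence only applies to tasks that supply context (Q&A with RAG)."""
        if correctness < threshold:
            return True
        return context_adherence is not None and context_adherence < threshold

    print(likely_hallucinated(correctness=0.9, context_adherence=0.3))  # True
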
6

How to use the Hallucination Index for LLM selection?


While our model ranking provides valuable insights for various tasks, we acknowledge that it does not cover all applications and domains comprehensively. To address this, we have plans to incorporate additional models and datasets in the future. To request a specific model, get in touch below. In the meantime, here's a suggested approach to refine your model selection process:



  • Task Alignment

    Begin by identifying which of our benchmarking task types aligns most closely with your specific application.

  • Top 3 Model Selection

    Based on your criteria, carefully select the three top-performing models for your identified task. Weigh factors such as performance, cost, and privacy against your objectives.

  • Exploration of New Models

    Extend your model pool by adding any additional models you believe could deliver strong performance in your application context. This proactive approach allows for a more comprehensive evaluation.

  • Data Preparation

    Prepare a high-quality evaluation dataset using real-world data specific to your task. This dataset should be representative of the challenges and nuances your application will face in production.

  • Performance Evaluation

    Execute a thorough evaluation of the selected models using your prepared dataset. Assess their performance based on relevant metrics, ensuring a comprehensive understanding of each model's strengths and weaknesses.


By following these steps, you'll gain a nuanced perspective on model suitability for your application, enabling you to make informed decisions in selecting the most appropriate model. Stay tuned for updates as we expand our model offerings to further cater to diverse applications and domains.
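
Put together, the workflow above reduces to a simple comparison loop. The outline below is hypothetical: candidates, eval_set, and the scoring function stand in for your own model clients, production-representative samples, and evaluation metric (for example a ChainPoll-style score).

    from statistics import mean
    from typing import Callable

    # Hypothetical shortlist and evaluation set; substitute your own model clients,
    # production-representative samples, and scoring metric.
    candidates: dict[str, Callable[[str], str]] = {}  # model name -> generate(prompt)
    eval_set: list[dict] = []  # e.g. {"prompt": ..., "reference": ...} samples

    def evaluate(generate: Callable[[str], str],
                 score: Callable[[str, dict], float]) -> float:
        """Average a per-sample score over the evaluation set for one candidate model."""
        return mean(score(generate(sample["prompt"]), sample) for sample in eval_set)

    # rankings = {name: evaluate(fn, my_score_fn) for name, fn in candidates.items()}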

🔮 Read the full report

You're on your way to learning:

  • Hallucination rankings by task type
  • Correctness and Context Adherence for each model
  • Evaluation methodology for hallucinations