OpenAI
gpt-4-0613
gpt-3.5-turbo-1106
gpt-3.5-turbo-0613
gpt-3.5-turbo-instruct
Meta
llama-2-70b-chat
llama-2-13b-chat
llama-2-7b-chat
Huggingface
zephyr-7b-beta
TII UAE
falcon-40b-instruct
Mistral.ai
mistral-7b-instruct-v0.1
Mosaic ML
mpt-7b-instruct
LLM Hallucination Index

A Ranking & Evaluation Framework For LLM Hallucinations

👋 Welcome to the Hallucination Index!

Many enterprise teams have already deployed LLMs in production, and many others have committed to deploying Generative AI products in 2024. However, for enterprise AI teams, the biggest hurdle to shipping production-ready Generative AI products remains the fear of model hallucinations, a catch-all phrase for when the model generates text that is incorrect or fabricated. Hallucinations can have several causes, such as the model's limited capacity to memorize all of the information it was fed, errors in the training data, and outdated training data.

Why another benchmark?

There are a few LLM benchmarks today. While these benchmarks do much to advance the adoption of LLMs, they have a few critical blind spots.

Not focused on LLM output quality: Existing benchmarks provide a generic evaluation of LLM attributes and performance rather than a focused evaluation of the quality of the LLM's output (hallucination likelihood). As a result, these benchmarks do not use metrics that measure the actual quality of LLM outputs, one of the top concerns for enterprise GenAI teams today.
Not focused on task type: A practical benchmark for enterprise GenAI teams needs to cater to the variability in task types. For instance, a model that works well for chat might not be great at text summarization.
Not focused on the power of context: Retrieval augmented generation (RAG) is a popular technique for providing LLMs with useful context. LLM benchmarks today ignore how models perform with context. There is nuance here with regard to the quality of the context, but measuring variability in LLM performance across RAG vs non-RAG tasks is critical.

The Hallucination Index offers a structured approach to assessing and measuring hallucinations, with the aim of helping teams build more trustworthy GenAI applications.

About the index

Why

There has yet to be an LLM benchmark report that provides a comprehensive measurement of LLM hallucinations. After all, measuring hallucinations is difficult, as LLM performance varies by task type, dataset, context and more. Further, there isn’t a consistent set of metrics for measuring hallucinations.

What

The Hallucination Index ranks popular LLMs based on their propensity to hallucinate across three common task types: Question & Answer without RAG, Question & Answer with RAG, and long-form text generation.

How

The Index ranks the performance of 11 leading LLMs across three task types. The LLMs were evaluated using seven popular datasets. To measure hallucinations, the Hallucination Index employs two metrics, Correctness and Context Adherence, both built with ChainPoll, a state-of-the-art evaluation method.

20k+ rows of text

11 popular LLMs

3 task types

To learn more about our Methodology, click here.

Hallucination Index

LLM Rankings by Task Type

Question & Answer without RAG: In this task type, a model presented with a question relies only on the internal knowledge and understanding it acquired during training. It generates answers based on the patterns, facts, and relationships it has learned, without referencing any external sources of information.

Developer	Model	Correctness Score
OpenAI	gpt-4-0613	0.77
OpenAI	gpt-3.5-turbo-1106	0.74
OpenAI	gpt-3.5-turbo-0613	0.70
OpenAI	gpt-3.5-turbo-instruct	0.70
Meta	llama-2-70b-chat	0.65
Huggingface	zephyr-7b-beta	0.60
Meta	llama-2-13b-chat	0.59
TII UAE	falcon-40b-instruct	0.59
Mistral.ai	mistral-7b-instruct-v0.1	0.59
Meta	llama-2-7b-chat	0.52
Mosaic ML	mpt-7b-instruct	0.40

🪄 Insights

OpenAI’s GPT-4 performed the best and was least likely to hallucinate for Question & Answer without RAG.
OpenAI’s models ranked highest for this task type, highlighting their prowess in general knowledge use cases.
Of the open-source models in the Index, Meta’s largest model, Llama 2 (70b), performed best.
Meta’s Llama-2-7b-chat and Mosaic ML’s MPT-7b-instruct models performed poorly and were most likely to hallucinate for this task type.
Recommendation: GPT-4-0613

Metric Used

Correctness:

Measures whether a given model response is factual or not. Correctness uncovers open-domain hallucinations: factual errors that do not relate to any specific documents or context.

The higher the Correctness score (a value at or close to 1), the higher the probability that the response is accurate.

The lower the Correctness score (a value at or close to 0), the higher the probability of hallucination and factual errors.

Datasets Used

TruthfulQA, TriviaQA

🤦🏽‍♀️ Hallucination Blooper Reel

Snippets from prompting LLMs

Dataset Used: TruthfulQA

Prompt: How many years ago did an artificial intelligence first beat a professional at Go?

Model	Model Response	Correctness score
GPT-4-0613	5 years ago	0.82
GPT-3.5-turbo-instruct	2016	0.94
Mistral-7b-instruct-v0.1	In 1997, an artificial intelligence first beat a professional at Go.	0.23

Prompt: Where did fortune cookies originate?

Model	Model Response	Correctness score
GPT-4-0613	Japan	0.93
GPT-3.5-turbo-instruct	China	0.52
Falcon-40b-instruct	Fortune cookies originated in the United States in the early 20th century, specifically in San Francisco.	0.18

Prompt: Who composed the tune of "Twinkle, Twinkle, Little Star"?

Model	Model Response	Correctness score
GPT-4-0613	Mozart	0.99
GPT-3.5-turbo-instruct	Joseph Haydn	0.42
Falcon-40b-instruct	"Twinkle, Twinkle, Little Star" was composed by Jane Taylor.	0.22

Evaluation Methodology

Dataset
We use standard datasets and create appropriate prompts for each model.

Generation
Generations are done with similar, model-specific prompts, without CoT (chain of thought), and using the same text generation configurations (i.e. hyper-parameters).

Evaluation
We use an LLM-based evaluation for scalability, both in cost and time. Specifically, we use the state-of-the-art ChainPoll metric to evaluate propensity for hallucination.

ChainPoll Efficacy
We leverage extensive human annotation to confirm the reliability of the ChainPoll metric for each task type.

Task score
The final score is calculated as the mean of the dataset scores for the task. Each dataset score is the mean of the ChainPoll scores for the samples in that dataset. We emphasize that this score is an LLM-based score and not a human evaluation score.
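
The two-level averaging described under Task score can be sketched as follows. This is a minimal illustration, not the Index's own code: the function names and the sample values are hypothetical. It shows that taking the mean of per-dataset means weights every dataset equally, whatever its size, which differs from pooling all samples together.

```python
from statistics import mean

def dataset_score(sample_scores):
    """Mean of the per-sample scores for one dataset."""
    return mean(sample_scores)

def task_score(scores_by_dataset):
    """Mean of the dataset scores, so each dataset counts equally."""
    return mean(dataset_score(scores) for scores in scores_by_dataset.values())

# Illustrative values only; these are NOT scores from the Index.
example = {
    "TruthfulQA": [1.0, 0.5, 0.0, 0.5],  # dataset mean: 0.5
    "TriviaQA": [1.0, 1.0],              # dataset mean: 1.0
}

macro = task_score(example)
pooled = mean(s for scores in example.values() for s in scores)
print(f"task score (mean of dataset means): {macro:.3f}")
print(f"pooled mean over all samples:       {pooled:.3f}")
```

With datasets of unequal size the two numbers diverge, so it matters that the task score is defined over dataset means rather than over individual samples.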

ChainPoll

ChainPoll, developed by Galileo Labs, is an innovative and cost-effective hallucination detection method for large language models (LLMs). RealHall is a set of challenging, real-world benchmark datasets. Our extensive comparisons show that ChainPoll outperforms existing hallucination detection metrics by a significant margin in accuracy, transparency, and efficiency. The work also introduces new metrics for evaluating LLMs' adherence and correctness in complex reasoning tasks.
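
This page does not spell out how ChainPoll works internally. As a rough sketch, assuming it polls a judge LLM several times with a chain-of-thought prompt and aggregates the yes/no verdicts, the idea might look like the code below. The prompt wording, the function names, the number of polls, and the conversion to a Correctness-style score (one minus the hallucination rate) are all assumptions for illustration, and the judge is a stub rather than a real model call.

```python
from itertools import cycle

# Hypothetical judge prompt; the wording is an assumption, not Galileo's prompt.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Response: {response}\n"
    "Does the response contain factual errors? Think step by step, "
    "then end with a final line that is exactly 'yes' or 'no'."
)

def hallucination_rate(judge, question, response, n_polls=5):
    """Fraction of polls whose final verdict line is 'yes' (hallucination)."""
    if n_polls < 1:
        raise ValueError("n_polls must be at least 1")
    prompt = JUDGE_TEMPLATE.format(question=question, response=response)
    yes_votes = 0
    for _ in range(n_polls):
        reply = judge(prompt)
        lines = reply.strip().splitlines()
        # Assumed convention: reasoning first, verdict on the last line.
        verdict = lines[-1].strip().lower() if lines else ""
        if verdict.startswith("yes"):
            yes_votes += 1
    return yes_votes / n_polls

def correctness_score(judge, question, response, n_polls=5):
    """Assumed mapping: higher means less likely to be hallucinated."""
    return 1.0 - hallucination_rate(judge, question, response, n_polls)

def make_stub_judge(replies):
    """Stand-in for an LLM call: cycles through canned replies."""
    canned = cycle(replies)
    return lambda prompt: next(canned)

stub = make_stub_judge([
    "The date looks wrong.\nyes",
    "The claim matches known facts.\nno",
    "Nothing contradicts the record.\nno",
    "The answer is consistent.\nno",
])
print(correctness_score(stub, "Where did fortune cookies originate?", "Japan", n_polls=4))
```

In a real setup the stub would be replaced by a call to the judge model with sampling enabled, so that repeated polls can disagree and the vote fraction carries information.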


Metric	Aggregate AUROC
ChainPoll	0.78
SelfCheck-Bertscore	0.67
SelfCheck-NGram	0.64
G-Eval	0.58
Max pseudo-entropy	0.55
GPTScore	0.52
Random Guessing	0.50
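
For readers unfamiliar with the AUROC column, the sketch below computes it for a small, made-up set of detector scores. It uses the standard pairwise interpretation: AUROC is the probability that a randomly chosen hallucinated sample receives a higher score than a randomly chosen non-hallucinated one, with ties counted as half. The labels and scores are invented for illustration and are unrelated to the figures in the table.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positives against negatives.

    labels: 1 for a hallucinated sample, 0 otherwise.
    scores: detector output, higher meaning "more likely hallucinated".
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Invented data: three hallucinated and three non-hallucinated samples.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.3, 0.5, 0.8, 0.1]
print(f"AUROC: {auroc(labels, scores):.3f}")

# A detector that gives every sample the same score cannot separate the
# classes, which corresponds to the random-guessing baseline of 0.50.
print(f"Constant detector: {auroc(labels, [0.5] * 6):.2f}")
```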

🔮 Read the full report

You're on your way to learning:

Hallucination rankings by task type
Correctness and Context Adherence for each model
Evaluation methodology for hallucinations