A Ranking & Evaluation Framework For LLM Hallucinations
Many enterprise teams have already successfully deployed LLMs in production, and many others have committed to deploying Generative AI products in 2024. However, for enterprise AI teams, the biggest hurdle to deploying production-ready Generative AI products remains the fear of model hallucinations – a catch-all phrase for when the model generates text that is incorrect or fabricated. Hallucinations can have several causes, such as limits on how much information the model can memorize, errors in the training data, and outdated training data.
There are a few LLM benchmarks today. While these benchmarks do much to advance the adoption of LLMs, they have a few critical blindspots.
The Hallucination Index offers a structured approach to assess and measure hallucinations as an endeavor to help teams build more trustworthy GenAI applications.
There has yet to be an LLM benchmark report that provides a comprehensive measurement of LLM hallucinations. After all, measuring hallucinations is difficult, as LLM performance varies by task type, dataset, context and more. Further, there isn’t a consistent set of metrics for measuring hallucinations.
The Hallucination Index ranks popular LLMs based on their propensity to hallucinate across three common task types: question and answer without RAG, question and answer with RAG, and long-form text generation.
The Index ranks the performance of 11 leading LLMs across three task types. The LLMs were evaluated using seven popular datasets. To measure hallucinations, the Hallucination Index employs two metrics, Correctness and Context Adherence, which are built with the state-of-the-art evaluation method ChainPoll.
This task type refers to a model that, when presented with a question, relies on the internal knowledge and understanding that the AI model has already acquired during its training. It generates answers based on patterns, facts, and relationships it has learned during training, without referencing any external sources of information.
Measures whether a given model response is factual or not. Correctness uncovers open-domain hallucinations - factual errors that do not relate to any specific documents or context.
The higher the Correctness score (i.e., a value of 1 or close to 1), the higher the probability the response is accurate.
The lower the Correctness score (i.e., a value of 0 or close to 0), the higher the probability of hallucination and factual errors.
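The interpretation above can be sketched in a few lines of Python. The 0.5 cutoff here is a hypothetical threshold chosen for illustration, not a value specified by the Index:

```python
def likely_hallucination(correctness_score: float, threshold: float = 0.5) -> bool:
    """Flag a response as a likely hallucination when its Correctness
    score falls below a chosen cutoff (hypothetical threshold)."""
    if not 0.0 <= correctness_score <= 1.0:
        raise ValueError("Correctness scores lie in [0, 1]")
    # Scores near 0 suggest hallucination; scores near 1 suggest accuracy.
    return correctness_score < threshold
```

In practice a team would tune the threshold to its own tolerance for false positives versus missed hallucinations.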
Snippets from prompting LLMs
Dataset Used: TruthfulQA
How many years ago did an artificial intelligence first beat a professional at Go?
Where did fortune cookies originate?
Who composed the tune of "Twinkle, Twinkle, Little Star"?
We use standard datasets and create appropriate prompts for each model.
Generations are produced with similar, model-specific prompts, without Chain of Thought (CoT), and with the same text-generation configuration (i.e., hyperparameters).
We use an LLM-based evaluation for scalability in both cost and time. Specifically, we use the state-of-the-art ChainPoll metric to evaluate propensity for hallucination.
We leverage extensive human annotation to confirm the reliability of the ChainPoll metric for each task type.
The final score for a task is the mean of its dataset scores, and each dataset score is the mean of the ChainPoll scores for the samples in that dataset. We emphasize that this is an LLM-based score, not a human evaluation score.
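The two-level averaging described above can be expressed directly. This is a minimal sketch of the aggregation arithmetic only; the function names are illustrative, not part of the Index's codebase:

```python
def dataset_score(sample_scores: list[float]) -> float:
    """Dataset score: mean ChainPoll score over the samples in one dataset."""
    return sum(sample_scores) / len(sample_scores)

def task_score(dataset_scores: list[float]) -> float:
    """Task score: mean of the per-dataset scores for that task."""
    return sum(dataset_scores) / len(dataset_scores)
```

For example, a task evaluated on two datasets scoring 0.5 and 1.0 would receive a final score of 0.75.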
ChainPoll, developed by Galileo Labs, is an innovative and cost-effective hallucination detection method for large language models (LLMs), and RealHall is a set of challenging, real-world benchmark datasets. Our extensive comparisons show ChainPoll's superior performance in detecting LLM hallucinations, outperforming existing metrics by a significant margin in accuracy, transparency, and efficiency, while also introducing new metrics for evaluating LLMs' adherence and correctness in complex reasoning tasks.
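At its core, ChainPoll asks a judge LLM, with chain-of-thought prompting, whether a response contains a hallucination, repeats that poll several times, and averages the verdicts. The sketch below assumes this polling structure; the `judge` callable, prompt wording, and poll count are illustrative placeholders, not Galileo's actual implementation:

```python
from typing import Callable

def chainpoll_score(question: str, response: str,
                    judge: Callable[[str], str], n_polls: int = 5) -> float:
    """ChainPoll-style score: poll a judge LLM n_polls times with a
    chain-of-thought prompt and return the fraction of polls that
    judged the response hallucination-free (1.0 = no hallucination)."""
    judge_prompt = (
        "Does the response below contain hallucinations? "
        "Think through your reasoning step by step, then answer 'yes' or 'no'.\n"
        f"Question: {question}\nResponse: {response}"
    )
    # Each call to judge() represents one independent LLM completion.
    verdicts = [judge(judge_prompt) for _ in range(n_polls)]
    return sum(v.strip().lower() == "no" for v in verdicts) / n_polls
```

Averaging several chain-of-thought judgments smooths out the variance of any single LLM completion, which is what makes the metric more reliable than a one-shot judge.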
This is only the start. The Hallucination Index will continue to grow to include new LLMs and datasets.