
A Metrics-First Approach to LLM Evaluation

Learn about different types of LLM evaluation metrics
Pratik Bhavsar, Galileo Labs
7 min read · September 19, 2023

There has been tremendous progress in the world of Large Language Models (LLMs). We have seen a series of blockbuster models like GPT-3, GPT-3.5, GPT-4, Falcon, MPT and Llama pushing the state of the art. The industry has started adopting them for various applications, but there is a big problem: it is hard to figure out how well these models are performing. Companies are struggling to compare different LLMs for generative applications, and the tendency of LLMs to hallucinate means we must measure them carefully. In this blog, we discuss helpful metrics for evaluating LLMs and understanding their suitability for your use cases.

Need for new LLM evaluation metrics

It is easy to start building with Large Language Models (LLMs) such as OpenAI's ChatGPT, but they can be very difficult to evaluate. The main concerns around evaluations are:

  1. Human evaluation is costly and prone to errors: LLMs are used for varied generative tasks which cannot be judged by traditional metrics. In desperation, companies often rely on a human "vibe check". But human annotations are costly and full of biases, which makes evaluation slow and unreliable.
  2. Poor correlation with human judgment: Traditional metrics like BLEU / ROUGE have shown poor correlation with how humans evaluate model output. Without reliable metrics, it becomes hard to know whether the LLM app is release-worthy.
  3. Absence of reliable benchmarks: Researchers have come up with benchmarks like the Open LLM Leaderboard, but these do not evaluate generative capability since the tasks consist of multiple-choice questions. The datasets used are also limited and might not cover the domain of the target use case.

Given these challenges, companies should prioritize investments in the development of evaluation metrics. These metrics will enable them to make data-driven decisions without depending solely on human judgment. Let's explore some key metrics that can assist companies in designing an evaluation system to enhance their generative applications.

Types of LLM Evaluation Metrics

Over time, many metrics have been proposed to measure the quality of LLM outputs. How best to evaluate LLMs is still an active area of research, but we have found some metrics to be more useful than others. We categorize them below to make them easier to understand.

[Figure: Types of LLM Evaluation Metrics]

Category 1: Top Level Metrics for LLM Evaluation

These are well-known metrics which can be used for any application. They work for any input/output of an LLM.

1. Factuality

Factuality measures the factual correctness of the output. The metric was developed by Galileo and leverages GPT-3.5 with chain of thought (CoT) prompting and self-consistency. It surfaces errors of precision and not recall. It is very useful for detecting hallucinations in scenarios like summarisation or open domain QA.

Metric signal: Higher factuality is correlated with higher output quality.

Prompt: When did Aliens invade earth?

High factuality:

Response: Aliens have never invaded earth.

Low factuality:

Response: Aliens invaded earth on July 4th 2020.

2. Prompt Perplexity

Prompt perplexity is simply the perplexity of the prompt given as input to the LLM. A recent study showed that the lower the perplexity of the prompt, the better suited the prompt is for the given task. High perplexity means the model finds the text hard to predict, which suggests it has not understood the prompt. If the model is unable to understand its input (the prompt), it is more likely to generate poor outputs.

Metric signal: Lower prompt perplexity is correlated with higher output quality.

Low perplexity:

Translate the following English sentence into French: 'The quick brown fox jumps over the lazy dog.'

In this case, the prompt is clear and directly instructs the model on what task to perform. The model is likely to have a low perplexity because it can confidently understand and execute the translation task.

High perplexity:

Can you, like, if you don't mind, convert to French for me? The quick brown fox jumps over the lazy dog.

In this example, the instruction does not make it clear that this is a translation task, and it does not mark which text is to be translated.
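As a rough illustration of the quantity being discussed, perplexity can be computed from per-token log probabilities as the exponential of their negative mean. The sketch below assumes you already have natural-log probabilities for each prompt token from some model; the numbers are made up for illustration and are not taken from any real model or from the study mentioned above.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log probabilities (illustrative values only).
clear_prompt_logprobs = [-0.2, -0.1, -0.3, -0.2]
vague_prompt_logprobs = [-2.0, -1.5, -2.5, -2.0]

# The prompt whose tokens the model predicts more easily scores lower.
print(perplexity(clear_prompt_logprobs) < perplexity(vague_prompt_logprobs))
```

Because perplexity is averaged per token, it can be compared across prompts of different lengths.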

3. LLM Uncertainty

A recent study has shown that log probability can help us find low quality generations. Uncertainty follows the same philosophy as prompt perplexity but is applied to the generated text. It is calculated from the log probability given by the LLM for each generated token. For models like GPT-3.5 and GPT-4, which do not expose log probabilities, we use other models as a proxy.

Metric signal: Lower LLM uncertainty is correlated with higher output quality.

Low uncertainty:

Prompt: “Where did the inventors of GPT3 architecture work?”

Response: “OpenAI”

The response here is correct and contains low uncertainty.

High uncertainty:

Prompt: “Where did the inventors of GPT5 architecture work?”

Response: “Deepmind”

The response here is incorrect and contains high uncertainty.
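One simple way to turn per-token log probabilities into an uncertainty signal is sketched below. This is an assumption about a reasonable formulation (average negative log probability, plus a list of low-probability tokens), not a description of how Galileo computes the metric. The function names, the `min_prob` threshold, and the example values are all hypothetical.

```python
import math


def sequence_uncertainty(token_logprobs):
    """Average negative log probability of the generated tokens.

    Higher values mean the model was less certain overall.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return -sum(token_logprobs) / len(token_logprobs)


def uncertain_tokens(tokens, token_logprobs, min_prob=0.5):
    """Return (token, probability) pairs whose probability is below min_prob."""
    flagged = []
    for token, logprob in zip(tokens, token_logprobs):
        prob = math.exp(logprob)
        if prob < min_prob:
            flagged.append((token, prob))
    return flagged


# Hypothetical generation split into two tokens with made-up probabilities.
tokens = ["Deep", "mind"]
logprobs = [math.log(0.3), math.log(0.9)]

print(sequence_uncertainty(logprobs))
print(uncertain_tokens(tokens, logprobs))
```

Token-level flags like these can point a reviewer at the exact span where the model was guessing, which a single sequence-level number cannot do.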

Category 2: Metrics for Evaluating RAG Effectiveness

RAG refers to retrieval augmented generation, where we add domain specific knowledge (DSK) to the prompt with the help of a search engine. This is required to make LLMs work with DSK, which can be totally missing from their training on open web data. Let’s discuss metrics which help with improving the RAG system.

1. Groundedness

Groundedness measures whether the model’s response is supported by the documents given to the model in the context window. It helps catch hallucinations, where the model states facts that are not in the context. We calculate this using GPT-3.5 with chain of thought (CoT) prompting and self-consistency. A score of 1 means the response is grounded and there is a lower chance of hallucination.

Metric signal: Higher groundedness is correlated with higher output quality.


Query: "What is the population of Paris, France?"

Retrieved docs:

Doc1: Census 2023 reported population of Paris to be 2.2 million.

Doc2: Census 2022 reported population of Paris to be 2.1 million.

Doc3: The population of Paris is more than 2 million.

High groundedness:

Response: "The population of Paris, France, according to the most recent census report, is approximately 2.2 million people."

In this example, the model's response is directly supported by the information present in the retrieved documents. It provides a specific population figure based on a reliable source, demonstrating groundedness.

Low groundedness:

Response: "Paris, France has a population of 10 million people."

In this example, the model's response is not grounded in the provided documents, and it seems to have fabricated a population figure that is not supported by the context.
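The article does not spell out how self-consistency is applied, so the following is only a sketch of one common pattern: sample several independent chain-of-thought judgments for the same response and aggregate their final verdicts. Here each verdict is assumed to be a boolean ("supported by the documents" or not) produced by a judge model that is not shown; the function names and sample values are hypothetical.

```python
def self_consistent_score(verdicts):
    """Fraction of sampled judge verdicts that say the response is supported."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(1 for v in verdicts if v) / len(verdicts)


def majority_verdict(verdicts):
    """True when more than half of the sampled verdicts say 'supported'."""
    return self_consistent_score(verdicts) > 0.5


# Five hypothetical judgments for the same query, documents, and response.
sampled_verdicts = [True, True, False, True, True]

print(self_consistent_score(sampled_verdicts))
print(majority_verdict(sampled_verdicts))
```

Aggregating over several samples reduces the chance that a single unlucky reasoning chain decides the score.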

2. Context Similarity

Context Similarity measures how relevant the fetched context is to the user query. A low score could be a sign of a bad document chunking/retrieval strategy or of missing information. To address this, one would add more documents to the context DB or change the indexing/retrieval strategy. This can help debug the reasons for bad generations.

Metric signal: Higher context similarity is correlated with higher output quality.

Query: Please provide information about the impact of climate change on polar bears.

High context similarity:

Retrieved Documents:

Doc 1 title: "The Effects of Climate Change on Polar Bear Populations"

Doc 2 title: "Arctic Ecosystems and Climate Change"

Doc 3 title: "Polar Bear Conservation Efforts in a Warming World"

In this case, the context similarity is good because all three retrieved documents are directly related to the query about the impact of climate change on polar bears.

Low context similarity:

Retrieved Documents:

Doc 1 title: "Polar bears are fascinating creatures living in the Arctic."

Doc 2 title: "Climate change is a global issue affecting various ecosystems."

Doc 3 title: "The significance of bears in cultural folklore."

In this example, there is an overlap of words like "polar bears" and "climate change" between the query and documents. However, the context similarity is low because none of the retrieved documents directly address the impact of climate change on polar bears.
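The article does not say how context similarity is scored. A minimal sketch, assuming the query and documents have already been embedded as vectors and that the score is the mean cosine similarity between the query and each retrieved document, could look like this. The toy two-dimensional vectors and the choice of mean aggregation are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def context_similarity(query_vec, doc_vecs):
    """Mean cosine similarity between the query and each retrieved document."""
    if not doc_vecs:
        raise ValueError("need at least one document vector")
    sims = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one document matches the query, one is orthogonal to it.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.0, 1.0]]

print(context_similarity(query, docs))
```

Embedding-based similarity is what lets the low-similarity example above score poorly even though it shares words with the query.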

3. Context Sparsity

Context sparsity is the spread between the similarity of the best and the worst matching documents selected for the context by the search engine. It is calculated from the distances between the retrieved docs and the query, as provided by the vector DB.

Metric signal: Lower context sparsity is correlated with higher output quality.
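Read literally, the definition above is a range: best score minus worst score. The sketch below assumes the vector DB returns one similarity score per retrieved document; the function name and the example scores are hypothetical. If the DB returns cosine distances instead (distance = 1 - similarity), the spread comes out the same.

```python
def context_sparsity(similarities):
    """Spread between the best and worst matching retrieved documents."""
    if not similarities:
        raise ValueError("need at least one similarity score")
    return max(similarities) - min(similarities)


# Hypothetical similarity scores for two retrievals of three documents each.
tight_retrieval = [0.82, 0.79, 0.80]   # all documents match about equally well
spread_retrieval = [0.85, 0.40, 0.15]  # one good match, two weak ones

print(context_sparsity(tight_retrieval) < context_sparsity(spread_retrieval))
```

A large spread suggests that some of the retrieved documents are padding the context rather than supporting the answer.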

4. Answer Relevance

Answer Relevance measures how relevant the generated answer is to the user query. Low answer relevance can be caused by any of the ingredients used for the response. We calculate this using a bi-encoder based embedding similarity. To address low relevance, one might add more instructions to the prompt, augment the context, or change the model or its settings.

Metric signal: Higher answer relevance is correlated with higher output quality.


User Query: "Can you tell me about the history of the Eiffel Tower?"

Retrieved docs: Doc 1, Doc 2, Doc 3

High answer relevance:

Generated Answer: "The Eiffel Tower, constructed in 1887-1889, is an iconic landmark in Paris, France. It was designed by Gustave Eiffel and served as the entrance arch for the 1889 World's Fair. The tower stands 324 meters tall and has played a significant role in the history of architecture and tourism."

In this example, the generated answer is directly relevant to the user's query about the history of the Eiffel Tower. It provides a concise and informative response, demonstrating high answer relevance.

Low answer relevance:

Generated Answer: "The Eiffel Tower is a famous Parisian landmark that offers stunning views of the city from its observation decks. It has a restaurant, and many tourists visit it every year."

In this case, the generated answer, while containing some information about the Eiffel Tower, doesn't address the user's query about its history. It lacks historical context and details, resulting in low answer relevance.

Category 3: Metrics for Data Quality when Fine-tuning

Recent works like Textbooks Are All You Need II: phi-1.5, LIMA: Less Is More for Alignment, Deduplicating Training Data Makes Language Models Better & The Curse of Recursion: Training on Generated Data Makes Models Forget indicate that we can get a better LLM by fine-tuning on a smaller, high-quality dataset instead of a larger, noisy one.

Data Error Potential (DEP)

We have developed a unique metric called DEP which helps you find samples which are either hard or contain errors. We show this score, which ranges from 0 to 1, for each sample in the training data. A higher score means a higher chance of error in the sample.

Metric signal: Lower DEP is correlated with higher output quality.

Custom metrics

We understand that the metrics mentioned above won't be all-encompassing. For example, LLMs can be used for multi-tasking, in which the output can contain more than one prediction. In such cases, users need custom metrics to evaluate different parts of the output. Some use cases require an LLM based chatbot to follow a certain tone, and teams want to monitor it because any divergence from the expected tone can lead to user complaints. Developers might also want to detect the presence of certain words in the outputs which are associated with errors. We place such long-tail use cases in the category of custom metrics.
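As an example of the word-detection case, a custom metric can be as small as a function that reports which flagged terms appear in an output. This is a generic sketch, not Galileo's custom metric interface; the function name and the list of flagged terms are hypothetical.

```python
import re


def find_flagged_terms(output, flagged_terms):
    """Return the flagged terms found in the output.

    Matching is case-insensitive and on whole words or phrases.
    """
    found = []
    for term in flagged_terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, output, flags=re.IGNORECASE):
            found.append(term)
    return found


# Hypothetical terms a team might associate with failed generations.
terms = ["as an AI language model", "error", "cannot"]
response = "Sorry, as an AI language model I cannot help with that."

print(find_flagged_terms(response, terms))
```

The returned list can be logged per response, or reduced to a boolean when only presence matters.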

Metric comparison

Here is a table which can help you evaluate these metrics from a practical perspective.

[Table: LLM Metric Comparison]

At Galileo, we've dedicated three years to studying language model metrics. Over the last few months we have been working on our new product, LLM Studio. It has three features: Galileo Prompt, Galileo Finetune and Galileo Monitor, which help you with prompt engineering, fine-tuning and monitoring of LLMs.

Galileo Prompt makes it a breeze for the users to get these metrics as they are available out of the box. With the help of these metrics, you can easily run experiments to improve your prompts, models, and settings. This makes experimenting simple because you don't have to create these metrics from scratch. You also get a single page view of all the runs to find your best performing configuration. If you have a novel application, you can even create your own custom metrics for your specific needs.

[Figure: Galileo Prompt]

Galileo’s LLM Studio is a single platform to help teams with LLM evaluation, experimentation, and observability. Join 1000s of developers building apps powered by LLMs and get early access!
