There has been tremendous progress in the world of Large Language Models (LLMs). We have seen a series of blockbuster models like GPT-3, GPT-3.5, GPT-4, Falcon, MPT and Llama pushing the state of the art. The industry has started adopting them for various applications, but there's a big problem: it's hard to figure out how well these models are performing. Companies are struggling to compare different LLMs for generative applications, and the tendency of LLMs to hallucinate requires us to measure them carefully. In this post, we discuss helpful metrics for evaluating LLMs and understanding their suitability for your use cases.
It is easy to start building with Large Language Models (LLMs) such as OpenAI's ChatGPT, but they can be very difficult to evaluate. The main concerns around evaluations are:
Given these challenges, companies should prioritize investments in the development of evaluation metrics. These metrics will enable them to make data-driven decisions without depending solely on human judgment. Let's explore some key metrics that can assist companies in designing an evaluation system to enhance their generative applications.
Over time, many metrics have been proposed to measure the quality of LLM outputs. How best to evaluate LLMs is still an active area of research, but we have found some metrics to be more useful than others. We categorize them below to make them easier to understand.
These are well-known metrics that can be used for any application. They work with any LLM input and output.
Factuality measures the factual correctness of the output. Galileo developed this metric, which leverages GPT-3.5 with chain-of-thought (CoT) prompting and self-consistency. It surfaces errors of precision rather than recall: it flags incorrect statements in the output, not information that is missing from it. It is very useful for detecting hallucinations in scenarios such as summarization or open-domain QA.
Metric signal: Higher factuality is correlated with higher output quality.
Prompt: When did Aliens invade earth?
Response: Aliens have never invaded earth.
Response: Aliens invaded earth on July 4th 2020.
Prompt perplexity is simply the perplexity of the prompt given as input to the LLM. A recent study showed that the lower the perplexity of the prompt, the better suited the prompt is for the given task. High perplexity indicates that the model finds the prompt hard to predict, which suggests it has not understood the text. If the model is unable to understand its input (the prompt), it is more likely to generate poor outputs.
Metric signal: Lower prompt perplexity is correlated with higher output quality.
Translate the following English sentence into French: 'The quick brown fox jumps over the lazy dog.'
In this case, the prompt is clear and directly instructs the model on what task to perform. The model is likely to have a low perplexity because it can confidently understand and execute the translation task.
Can you, like, if you don't mind, convert to French for me? The quick brown fox jumps over the lazy dog.
In this example, the instruction does not make it clear that this is a translation task, and it does not indicate which text is to be translated.
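To make the definition concrete, here is a minimal sketch of how perplexity can be computed once you have per-token log probabilities for a prompt. The log probability values below are made up for illustration and do not come from any real model; how you obtain them depends on the model you use.

```python
import math


def perplexity(token_logprobs):
    """Perplexity is exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log probabilities (illustrative values only).
clear_prompt_logprobs = [-0.2, -0.1, -0.3, -0.2]
vague_prompt_logprobs = [-2.1, -1.7, -2.5, -1.9]

clear_ppl = perplexity(clear_prompt_logprobs)
vague_ppl = perplexity(vague_prompt_logprobs)
print(f"clear prompt: {clear_ppl:.2f}, vague prompt: {vague_ppl:.2f}")
```

A prompt whose tokens the model finds more predictable gets the lower score, which is the direction the metric signal above describes.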
A recent study has shown that log probability can help us find low-quality generations. Uncertainty follows the same philosophy as prompt perplexity, but it is applied to the generated text. It is calculated from the log probability the LLM assigns to each generated token. For models like GPT-3.5 and GPT-4, which do not expose log probabilities, we use other models as a proxy.
Metric signal: Lower LLM uncertainty is correlated with higher output quality.
Prompt: “Where did the inventors of GPT3 architecture work?”
The response here is correct and contains low uncertainty.
Prompt: “Where did the inventors of GPT5 architecture work?”
The response here is incorrect and contains high uncertainty.
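As a rough illustration of the idea, the sketch below scores a generation by the mean negative log probability of its tokens and lists the tokens the model was least sure about. This is one simple way to aggregate token log probabilities, assumed here for illustration; it is not necessarily the exact formula behind the metric, and the tokens and values are placeholders.

```python
def uncertainty(token_logprobs):
    """Mean negative log probability of the generated tokens (higher = less certain)."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return -sum(token_logprobs) / len(token_logprobs)


def least_certain_tokens(tokens, token_logprobs, k=3):
    """Return the k tokens to which the model assigned the lowest probability."""
    ranked = sorted(zip(tokens, token_logprobs), key=lambda pair: pair[1])
    return [token for token, _ in ranked[:k]]


# Placeholder tokens and log probabilities (illustrative values only).
tokens = ["tok_a", "tok_b", "tok_c", "tok_d"]
logprobs = [-0.1, -2.4, -0.3, -1.8]

score = uncertainty(logprobs)
suspects = least_certain_tokens(tokens, logprobs, k=2)
print(score, suspects)
```

Looking at the least certain tokens is often more actionable than the aggregate score, because it points at the part of the response worth checking.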
RAG refers to retrieval-augmented generation, where we add domain-specific knowledge (DSK) to the prompt with the help of a search engine. This is required to make LLMs work with DSK, which can be entirely missing from their training on open web data. Let’s discuss metrics that help improve a RAG system.
Groundedness measures whether the model’s response is supported by the documents given to the model in the context window. This guards against hallucinations, where the model states facts that are not in the context. We calculate this using GPT-3.5 with chain-of-thought (CoT) prompting and self-consistency. A score of 1 means the response is grounded and there is a lower chance of hallucination.
Metric signal: Higher groundedness is correlated with higher output quality.
Query: "What is the population of Paris, France?"
Doc1: Census 2023 reported population of Paris to be 2.2 million.
Doc2: Census 2022 reported population of Paris to be 2.1 million.
Doc3: The population of Paris is more than 2 million.
Response: "The population of Paris, France, according to the most recent census report, is approximately 2.2 million people."
In this example, the model's response is directly supported by the information present in the retrieved documents. It provides a specific population figure based on a reliable source, demonstrating groundedness.
Response: "Paris, France has a population of 10 million people."
In this example, the model's response is not grounded in the provided documents, and it seems to have fabricated a population figure that is not supported by the context.
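The self-consistency part of this calculation can be sketched as follows: ask a judge the same question several times and report the fraction of "supported" verdicts. In practice the judge would be an LLM call with a chain-of-thought prompt and sampling enabled; the `toy_judge` below is a hypothetical, deterministic stand-in that only checks whether the numbers in the response appear in the documents, so that the example runs without any API.

```python
import re
from typing import Callable, Sequence

NUMBER = r"\d+(?:\.\d+)?"


def toy_judge(response: str, documents: Sequence[str]) -> bool:
    """Hypothetical stand-in for an LLM judge: every number in the
    response must also appear in at least one document."""
    doc_numbers = set()
    for doc in documents:
        doc_numbers.update(re.findall(NUMBER, doc))
    return all(n in doc_numbers for n in re.findall(NUMBER, response))


def groundedness(
    response: str,
    documents: Sequence[str],
    judge: Callable[[str, Sequence[str]], bool] = toy_judge,
    n_samples: int = 5,
) -> float:
    """Self-consistency: query the judge n_samples times and average the votes."""
    votes = [judge(response, documents) for _ in range(n_samples)]
    return sum(votes) / n_samples


docs = [
    "Census 2023 reported population of Paris to be 2.2 million.",
    "Census 2022 reported population of Paris to be 2.1 million.",
    "The population of Paris is more than 2 million.",
]
grounded = groundedness(
    "The population of Paris, France, according to the most recent census "
    "report, is approximately 2.2 million people.",
    docs,
)
ungrounded = groundedness("Paris, France has a population of 10 million people.", docs)
print(grounded, ungrounded)
```

With a sampled LLM judge the votes can disagree, and the averaged score then reflects how consistently the judge finds the response supported.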
Context Similarity measures how relevant the fetched context is to the user query. A low score could be a sign of a poor document chunking or retrieval strategy, or of missing information. To address this, one would add more documents to the context DB or change the indexing or retrieval strategy. This can help debug the reasons for bad generations.
Metric signal: Higher context similarity is correlated with higher output quality.
Query: Please provide information about the impact of climate change on polar bears.
High context similarity:
Doc 1 title: "The Effects of Climate Change on Polar Bear Populations"
Doc 2 title: "Arctic Ecosystems and Climate Change"
Doc 3 title: "Polar Bear Conservation Efforts in a Warming World"
In this case, the context similarity is good because all three retrieved documents are directly related to the query about the impact of climate change on polar bears.
Low context similarity:
Doc 1 title: "Polar bears are fascinating creatures living in the Arctic."
Doc 2 title: "Climate change is a global issue affecting various ecosystems."
Doc 3 title: "The significance of bears in cultural folklore."
In this example, there is an overlap of words like "polar bears" and "climate change" between the query and documents. However, the context similarity is low because none of the retrieved documents directly address the impact of climate change on polar bears.
Context sparsity is the spread between the similarity of the best-matching and the worst-matching document selected for the context by the search engine. It uses the distances that the vector DB reports between the retrieved documents and the query to calculate the score.
Metric signal: Lower context sparsity is correlated with higher output quality.
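Both context metrics can be derived from the distances a vector DB returns with the retrieved documents. The sketch below assumes cosine distances (0 means identical) and assumes that context similarity is aggregated as the mean over documents; the function name and the sample distances are illustrative, not taken from any particular product.

```python
def context_metrics(distances):
    """Compute context similarity and sparsity from query-to-document
    cosine distances as returned by a vector DB (assumption: 0 = identical)."""
    if not distances:
        raise ValueError("need at least one retrieved document")
    similarities = [1.0 - d for d in distances]
    return {
        # Assumed aggregation: mean similarity across retrieved documents.
        "context_similarity": sum(similarities) / len(similarities),
        # Spread between the best and the worst matching document.
        "context_sparsity": max(similarities) - min(similarities),
    }


# Illustrative distances for three retrieved documents.
metrics = context_metrics([0.1, 0.2, 0.3])
print(metrics)
```

A large spread means at least one retrieved document matches the query much worse than the best one, which is a hint that the context contains filler.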
Answer Relevance measures how relevant the generated answer is to the user query. Low answer relevance can be caused by any of the ingredients used to produce the response. We calculate this using bi-encoder-based embedding similarity. To address low scores, one might add more instructions to the prompt, augment the context, or change the model or its settings.
Metric signal: Higher answer relevance is correlated with higher output quality.
User Query: "Can you tell me about the history of the Eiffel Tower?"
Retrieved docs: Doc 1, Doc 2, Doc 3
High answer relevance:
Generated Answer: "The Eiffel Tower, constructed in 1887-1889, is an iconic landmark in Paris, France. It was designed by Gustave Eiffel and served as the entrance arch for the 1889 World's Fair. The tower stands 324 meters tall and has played a significant role in the history of architecture and tourism."
In this example, the generated answer is directly relevant to the user's query about the history of the Eiffel Tower. It provides a concise and informative response, demonstrating high answer relevance.
Low answer relevance:
Generated Answer: "The Eiffel Tower is a famous Parisian landmark that offers stunning views of the city from its observation decks. It has a restaurant, and many tourists visit it every year."
In this case, the generated answer, while containing some information about the Eiffel Tower, doesn't address the user's query about its history. It lacks historical context and details, resulting in low answer relevance.
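The bi-encoder pattern can be sketched like this: encode the query and the answer independently, then compare the two vectors with cosine similarity. The `toy_encode` function is a hypothetical bag-of-words stand-in so the example runs on its own; a real setup would plug in a trained sentence-embedding model, and the scores from this toy encoder only reflect word overlap.

```python
import math
import re
from collections import Counter


def toy_encode(text):
    """Hypothetical stand-in encoder: word counts instead of a dense embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def answer_relevance(query, answer, encode=toy_encode):
    """Bi-encoder style: encode query and answer separately, then compare."""
    return cosine(encode(query), encode(answer))


score = answer_relevance(
    "Can you tell me about the history of the Eiffel Tower?",
    "The Eiffel Tower was constructed in 1887-1889 for the World's Fair.",
)
print(round(score, 3))
```

Because the two texts are encoded separately, answer embeddings can be computed once and reused across many comparisons.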
Recent works such as "Textbooks Are All You Need II: phi-1.5", "LIMA: Less Is More for Alignment", "Deduplicating Training Data Makes Language Models Better" and "The Curse of Recursion: Training on Generated Data Makes Models Forget" indicate that we can get a better LLM by fine-tuning on a smaller, high-quality dataset instead of a larger, noisy one.
We have developed a unique metric called DEP, which helps you find samples that are either hard or contain errors. We show this score, which ranges from 0 to 1, for each sample in the training data. A higher score means a higher chance of an error in that sample.
Metric signal: Lower DEP is correlated with higher output quality.
We understand that the metrics mentioned above won't be all-encompassing. For example, LLMs can be used for multi-tasking, in which the output can contain more than one prediction. In such cases, users need custom metrics to evaluate different parts of the output. Some use cases require an LLM-based chatbot to follow a particular tone, and teams want to monitor it because any divergence from the expected tone can lead to user complaints. Developers might also want to detect the presence of certain words in the outputs that are associated with errors. We group such long-tail use cases under the category of custom metrics.
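As an example of what such a custom metric might look like, here is a minimal sketch of the word-detection case. The function name and the list of flagged terms are hypothetical; you would choose terms that signal errors in your own application.

```python
import re


def flagged_terms(output, terms):
    """Return the flagged terms that occur in the output as whole
    words or phrases, ignoring case."""
    found = []
    for term in terms:
        if re.search(rf"\b{re.escape(term)}\b", output, flags=re.IGNORECASE):
            found.append(term)
    return found


# Hypothetical terms that might indicate a refusal or a failure.
terms = ["cannot", "error", "as an AI"]
hits = flagged_terms("Sorry, I cannot help with that.", terms)
print(hits)
```

The per-output result can be turned into a score, for example the share of responses with at least one flagged term, and tracked alongside the other metrics.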
Here is a table which can help you evaluate these metrics from a practical perspective.
At Galileo, we've dedicated three years to studying language model metrics. Over the last few months, we have been working on our new product, LLM Studio. It has three features, Galileo Prompt, Galileo Finetune and Galileo Monitor, to help you with prompt engineering, fine-tuning and monitoring of LLMs.
Galileo Prompt makes it a breeze to get these metrics, as they are available out of the box. With their help, you can easily run experiments to improve your prompts, models, and settings, without having to create the metrics from scratch. You also get a single-page view of all your runs to find your best-performing configuration. If you have a novel application, you can even create your own custom metrics for your specific needs.