LLM hallucinations are a critical deterrent to enterprise adoption. Mitigating hallucinations comes down to working with the right prompts and data, and providing the right context through a vector database. The Galileo LLM Studio is the LLM explainability platform to identify and mitigate hallucinations across prompt engineering, fine-tuning and production monitoring for your LLMs. Join 1000s of developers on the waitlist!
For now, let’s deep dive into LLM hallucinations and understand them a bit better.
The creation of human-like text with Natural Language Generation (NLG) has improved recently because of advancements in Transformer-based language models. This has made the text produced by NLG helpful for creating summaries, generating dialogue, or transforming data into text. However, there is a problem: these deep learning systems sometimes make up or "hallucinate" text that was not intended, which can lead to worse performance and disappoint users in real-world situations.
Hallucination in AI means the AI makes up things that sound real, but are either wrong or not related to the context. This often happens because the AI has built-in biases, doesn't fully understand the real world, or its training data isn't complete. In these instances, the AI comes up with information it wasn't specifically taught, leading to responses that can be incorrect or misleading.
There are also a few terms that often cause confusion.
Hallucinations can affect important decisions, such as court cases, or damage a company's reputation, as happened with Google’s Bard. They can also expose private information the model saw during training, cause mistakes in medical diagnoses due to incorrect patient summaries, or simply lead to a frustrating experience with a chatbot that can't answer a basic question.
This is why understanding and mitigating hallucinations is important.
Let's have a look at hallucinations in different generative tasks:
Abstractive summarization is a method used to pick out important details from documents and create short, clear, and easy-to-read summaries. Neural models have achieved impressive results on this task. However, studies have found that these models often produce misleading content that doesn't match the original document. One study found that 25% of the summaries from the best available models had this problem [Falke et al.]. Worse, summaries with a lot of misleading content can still get a high ROUGE score [Lin et al.], because ROUGE measures word overlap with a reference rather than factual agreement.
There are two types of hallucination in summarization: intrinsic, where the summary contradicts the original text, and extrinsic, where the summary includes something not mentioned in the original at all. For example, if an article says the FDA approved the first Ebola vaccine in 2019, an intrinsic hallucination would be to say the FDA rejected it. An extrinsic hallucination would be to claim that China has started testing a COVID-19 vaccine when the original article doesn't mention that at all.
In addition to this, Pagnoni and colleagues describe more detailed types of factual errors in summaries.
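To make the ROUGE point concrete, here is a minimal sketch that scores two candidate summaries against a reference using a simplified ROUGE-1 F1 (whitespace tokens, no stemming). It is not the official ROUGE implementation, and the function name `rouge1_f1` and the example sentences are assumptions made for illustration, loosely based on the Ebola vaccine example above.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercase whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative sentences (assumed, not taken from any dataset).
reference = "the fda approved the first ebola vaccine in 2019"
faithful = "the fda approved an ebola vaccine in 2019"
hallucinated = "the fda rejected the first ebola vaccine in 2019"

faithful_score = rouge1_f1(faithful, reference)
hallucinated_score = rouge1_f1(hallucinated, reference)

print(f"faithful:     {faithful_score:.3f}")
print(f"hallucinated: {hallucinated_score:.3f}")
```

In this toy case the summary that flips "approved" to "rejected" shares more words with the reference than the faithful paraphrase does, so it scores higher even though it states the opposite fact.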
In open domain dialogue generation, a chatbot either gives the user necessary details or keeps them interested with new replies without rehashing previous conversation. A little bit of hallucination might be acceptable in this context.
Intrinsic hallucination is when the chatbot's response contradicts the previous conversation or external knowledge. As shown in Table 1, the bot might incorrectly interpret a moderate price range as high, or mix up names like 'Roger Federer' and 'Rafael Nadal.'
Extrinsic hallucination happens when we can't cross-check the chatbot's response against previous dialogue or outside knowledge. For instance, as shown in Table 1, the bot might claim that the Pickwick hotel is in San Diego or that Djokovic is among the top ten singles players, when there isn't enough information to confirm or deny this.
Now, let's talk about open-domain dialogue creation. Inconsistency within the bot's responses is a type of intrinsic hallucination, while inconsistency with outside sources can be either intrinsic or extrinsic hallucination.
In open-domain conversation, a little hallucination may be fine as long as it doesn't include serious factual errors. But it's usually hard to verify the facts because the system doesn't usually have access to outside resources. In these systems, inconsistencies in the bot's replies are often seen as the main problem. You can see this when a bot gives different answers to similar questions, like “What's your name?” and “May I ask your name?” A focus here is the bot's persona consistency - its identity and interaction style - and making sure it doesn't contradict itself. Aside from this, a chatbot in the open domain should provide consistent and informative responses that align with the user's speech to keep the conversation engaging. External resources with specific persona information or general knowledge can help the chatbot create responses.
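The self-consistency problem described above can be probed with a very simple check: ask the bot several paraphrases of the same question and see whether the answers agree. The sketch below assumes the bot is any Python callable that maps a question string to an answer string; `is_consistent`, `steady_bot`, and `flaky_bot` are hypothetical names for illustration, and exact matching after light normalization is a crude stand-in for a real semantic comparison.

```python
from typing import Callable, Sequence


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    return " ".join(text.lower().split()).rstrip(".!?")


def is_consistent(bot: Callable[[str], str], paraphrases: Sequence[str]) -> bool:
    """Return True if the bot gives the same normalized answer to every paraphrase."""
    answers = {normalize(bot(question)) for question in paraphrases}
    return len(answers) <= 1


# Two toy bots (assumptions for illustration, not real systems).
def steady_bot(question: str) -> str:
    return "My name is Ava."


def flaky_bot(question: str) -> str:
    # Answers depend on surface wording, which is the failure we want to catch.
    return "I'm Ava." if question.startswith("What") else "You can call me Max."


questions = ["What's your name?", "May I ask your name?"]
print(is_consistent(steady_bot, questions))
print(is_consistent(flaky_bot, questions))
```

A check like this only catches contradictions across paraphrases; it says nothing about whether the consistent answer is factually correct.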
Generative question answering (GQA) is a system that creates in-depth responses to questions instead of just pulling out answers from given texts. This is useful because many questions people ask on search engines need detailed explanations. These answers are typically long and can't be directly taken from specific phrases.
Usually, a GQA system looks for relevant information from different sources to answer a question. Then, it uses this information to come up with an answer. In many instances, no single document has the complete answer, so multiple documents are used. These documents might have repeating, supporting, or conflicting information. Because of this, the generated answers often contain hallucinations.
Table 1 shows two examples of hallucination for GQA. The information for both questions comes from Wikipedia. The first question asks about the "Dow Jones Industrial Average." The response given, "index of 30 major U.S. stock indexes," doesn't match the Wikipedia explanation, "of 30 notable companies listed on U.S. stock exchanges." So we call this an "intrinsic hallucination." For the second question, the description of a Sadducee as someone who behaves dishonestly, particularly a politician seeking power through dishonesty, couldn't be confirmed from the original Wikipedia documents. As a result, we label it as an "extrinsic hallucination."
There are also other ways to classify these errors. One group suggests splitting them into errors seen with altered test sets and natural errors. The first type can be seen when changing the test set drastically alters the translation. The second type, natural errors, is connected to errors in the original dataset. These can be split again into detached and oscillatory errors [Raunak et al.]. Detached errors happen when the translation doesn't match the meaning of the original text. Oscillatory errors happen when the translation repeats phrases that weren't in the original text.
Other errors include suddenly skipping to the end of the text or when the translation is mostly in the original language. These are all considered types of hallucination errors in machine translation.
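Oscillatory errors of the kind described above lend themselves to a simple surface heuristic: flag a translation when some n-gram repeats noticeably more often in the output than any n-gram does in the source. The sketch below is one such heuristic, not the method used by Raunak et al.; the function names, the bigram default, and the `min_gap` threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def max_ngram_count(text: str, n: int = 2) -> int:
    """Return how often the most frequent n-gram occurs in the text."""
    tokens = text.lower().split()
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return max(ngrams.values(), default=0)


def looks_oscillatory(source: str, translation: str, n: int = 2, min_gap: int = 2) -> bool:
    """Flag translations whose repetition exceeds the source's by at least min_gap."""
    gap = max_ngram_count(translation, n) - max_ngram_count(source, n)
    return gap >= min_gap


# Toy German-to-English examples (assumed, not from any dataset).
source = "das ist ein kleines haus am see"
good_translation = "this is a small house by the lake"
bad_translation = "this is a house this is a house this is a house"

print(looks_oscillatory(source, good_translation))
print(looks_oscillatory(source, bad_translation))
```

Comparing repetition counts rather than the n-grams themselves keeps the check usable across languages, since source and translation do not share a vocabulary. It will miss detached errors entirely, because those can be fluent and non-repetitive.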
Data-to-Text Generation is the process of creating written descriptions based on data like tables, database records, or knowledge graphs.
In this case, the hallucinations are defined in a similar way but in terms of objects.
In visual question answering, the model can give believable responses based on just the text, as seen in the figure below (left and right). However, these responses might be incorrect when an image is provided. Additionally, the middle figure shows that if we ask a question that can't be answered, the model may make up an answer that seems to fit the image, even though it's not supported by the image.
The advancements in Transformer-based language models have greatly enhanced the capabilities of Natural Language Generation. However, the occurrence of 'hallucinations,' where the AI fabricates incorrect or out-of-context details, poses significant challenges. These inaccuracies can occur in various tasks such as abstractive summarization, dialogue generation, generative question answering, machine translation, data-to-text generation, and vision-language model generation.
Addressing the issue of hallucinations in AI is crucial to increase the reliability and usability of such systems in real-world applications. By continuing research and development in these areas, we can hope to mitigate these issues and leverage the full potential of AI in language understanding and generation tasks. We will look into ways to mitigate hallucinations in our next posts.
LLM hallucinations are a critical deterrent to enterprise adoption. Mitigating hallucinations comes down to working with the right prompts and data, and providing the right context through a vector database. The Galileo LLM Studio is a single platform to help teams with LLM evaluation, experimentation, and observability. Join 1000s of developers on the waitlist!
Survey of Hallucination in Natural Language Generation [Ji et al.]
The Curious Case of Hallucinations in Neural Machine Translation [Raunak et al.]
Detecting Hallucinated Content in Conditional Neural Sequence Generation [Zhou et al.]