
Practical Tips for GenAI System Evaluation

Osman Javed, VP of Marketing
April 25, 2024

It’s time to put the science back in data science! Craig Wiley, Senior Director of Product for AI at Databricks, has an immense amount of hands-on experience developing, deploying, and evaluating generative AI models and solutions. He joined us at GenAI Productionize 2024 to share practical tips and frameworks for evaluating and improving generative AI, referencing what worked for Databricks. Read some key takeaways from his session, and, for a real treat, watch his entire session below.

Safety, Accuracy, and Governance

By focusing on safety, accuracy, and governance, AI teams can ensure their GenAI solutions are reliable, ethical, and ready for production. Metrics like F1 scores have traditionally been used in NLP to measure accuracy, but they fall short for complex generative tasks. Instead, teams are adapting evaluations to specific questions or scenarios, including new research-backed metrics, to allow for more targeted assessments. Model-in-the-loop approaches are becoming increasingly popular to close this gap: using models themselves to provide preliminary evaluations of AI outputs proves faster and cheaper than human-only evaluation over large datasets. But when use cases are complex or high risk, nothing beats having a human in the loop. And when using a RAG system, continually update the data in your vector database to improve the accuracy of model outputs.
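To make the model-in-the-loop idea concrete, here is a minimal sketch of an LLM-as-judge pass over a batch of outputs. The `call_llm` helper is a hypothetical stand-in for whatever model client your stack uses, and the rubric prompt is purely illustrative, not the evaluation prompt Craig described.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client (OpenAI, Databricks, etc.)."""
    raise NotImplementedError("Wire this to your model provider.")

JUDGE_PROMPT = """You are grading a GenAI system's answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Return JSON: {{"correct": true|false, "reason": "<one sentence>"}}"""

def judge_output(question: str, context: str, answer: str) -> dict:
    """Ask a judge model for a preliminary correctness verdict."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"correct": False, "reason": "judge returned unparseable output"}

def preliminary_eval(samples: list[dict]) -> float:
    """Model-in-the-loop pass: a cheap first filter before human review of failures."""
    verdicts = [judge_output(s["question"], s["context"], s["answer"]) for s in samples]
    failures = [s for s, v in zip(samples, verdicts) if not v.get("correct", False)]
    # Route only the failures (or high-risk cases) to human-in-the-loop review.
    print(f"{len(failures)} of {len(samples)} outputs flagged for human review")
    return 1 - len(failures) / max(len(samples), 1)
```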

Because of the non-deterministic nature of GenAI models, teams need to implement robust governance to provide clear responses to queries about AI decisions, especially in regulated industries like education, healthcare, and finance. Databricks uses a four-pronged approach to governance:

  • Centralization: Centralizing critical components such as prompts, models, and evaluation metrics in one location improves transparency and control to help understand and explain model behaviors.
  • Auditability: Implementing regular audits using specific metrics helps verify and validate the AI systems' operations and outcomes continuously to proactively identify any issues before they become a problem.
  • Automated Evaluation: Automated LLM evaluation ensures consistent and efficient oversight. This becomes especially important as AI initiatives scale, making manual evaluation impractical.
  • Alerting: Automated alerts notify relevant stakeholders about anomalies or issues in AI operations, enabling timely interventions that maintain the integrity of systems (a minimal sketch of automated evaluation with alerting follows this list).
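As a rough illustration of the automated evaluation and alerting prongs, the sketch below scores a batch of responses and notifies stakeholders when an aggregate metric regresses. The groundedness metric, threshold, and `send_alert` hook are all assumptions; substitute your own evaluation metrics and paging or messaging integration.

```python
from statistics import mean

GROUNDEDNESS_THRESHOLD = 0.85  # assumed target; tune per use case

def groundedness_score(answer: str, context: str) -> float:
    """Placeholder metric: fraction of answer tokens that appear in the retrieved context."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    return len(answer_tokens & context_tokens) / max(len(answer_tokens), 1)

def send_alert(message: str) -> None:
    """Hypothetical alert hook; in practice this would post to Slack, PagerDuty, email, etc."""
    print(f"[ALERT] {message}")

def automated_evaluation(batch: list[dict]) -> float:
    """Score the latest batch of responses and alert if the aggregate metric regresses."""
    scores = [groundedness_score(r["answer"], r["context"]) for r in batch]
    avg = mean(scores) if scores else 0.0
    if avg < GROUNDEDNESS_THRESHOLD:
        send_alert(f"Groundedness dropped to {avg:.2f} (threshold {GROUNDEDNESS_THRESHOLD})")
    return avg
```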

Depending on your company size and culture, your organization may choose a decentralized approach to governance, with AI specialists embedded within each business unit, or a centralized, top-down approach that defines and enforces a uniform governance model. Regardless, governance is ultimately about trust and needs to be structured, dynamic, and ongoing, with a constant cycle of monitoring, evaluation, and adjustment. Building GenAI that is safe, accurate, and well-governed requires careful planning, resources, and commitment from across the organization.

Science in Data Science

It’s time to put ‘science’ back into data science. Just like scientists measure the outcomes of their experiments, AI teams must examine system outputs or responses, and ask whether they’re correct, fulfill the expected outcome, and are optimal for the intended use. But evaluation is not a straightforward engineering challenge; simple linear adjustments may not be possible or effective. Evaluations require detailed investigations into specific cases where the system did not perform as expected, digging into which component of the system negatively affected the output and isolating the problem. Was the right data available? Was it processed properly? Can we tweak the system to re-evaluate the outputs? Continuous iteration is the backbone of GenAI development.
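One way to support that kind of failure investigation is to log each component's input and output for every request, so a bad response can be traced back to the retrieval step, the prompt construction, or the generation step. The sketch below is a minimal, hand-rolled trace under assumed `retriever` and `generator` callables; dedicated observability tooling would normally capture this for you.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Trace:
    """Per-request record of each pipeline component, for post-hoc debugging."""
    question: str
    steps: list[dict] = field(default_factory=list)

    def log(self, component: str, output) -> None:
        self.steps.append({"component": component, "output": output, "ts": time.time()})

def answer_question(question: str, retriever, generator) -> tuple[str, Trace]:
    """Run the pipeline while recording what each component produced."""
    trace = Trace(question)
    docs = retriever(question)           # was the right data available?
    trace.log("retrieval", docs)
    prompt = f"Context:\n{docs}\n\nQuestion: {question}"
    trace.log("prompt", prompt)          # was it processed properly?
    answer = generator(prompt)
    trace.log("generation", answer)      # which component hurt the output?
    return answer, trace
```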

Databricks found their model initially lacked accuracy and iterated to incrementally improve performance, creating a robust dataset, fine-tuning prompts, and generating synthetic data. By enriching the model with detailed descriptions of data entities, such as tables in a data lake, the model's utility and accuracy in real-world applications were significantly improved. A rigorous, data-driven, and scientifically sound approach is essential for effective GenAI solutions.
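As a hedged illustration of that enrichment step, the helper below prepends detailed table descriptions to the user's question before it reaches the model. The catalog entries shown are invented for illustration; in practice they would come from your metastore or data catalog.

```python
# Hypothetical catalog entries; in practice these come from your metastore.
TABLE_DESCRIPTIONS = {
    "sales.orders": "One row per customer order; includes order_date, total_usd, region.",
    "sales.customers": "Customer master data; includes customer_id, segment, signup_date.",
}

def enrich_prompt(question: str) -> str:
    """Prepend detailed descriptions of relevant data entities to the user's question."""
    docs = "\n".join(f"- {name}: {desc}" for name, desc in TABLE_DESCRIPTIONS.items())
    return (
        "You answer questions about the data lake described below.\n"
        f"Available tables:\n{docs}\n\n"
        f"Question: {question}"
    )
```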

Moving from POC to Production

Since the advent of BERT, there’s been an explosion of new language models and generative AI applications. Enterprises are now beginning to move these solutions from proof-of-concept into production. But teams must realize initial deployment is just the beginning of the journey, and GenAI is more than just models. It is an integrated system spanning foundation models, context data, training data, embedding models, vector databases, observability, and much more, each working together in sophisticated multi-step processes that require thoughtful system design. From the right chunking strategy to the proper search methodology to the right infrastructure, many factors influence the performance and utility of GenAI systems in production. And once in production, teams must implement ongoing monitoring and reinforcement learning mechanisms to incorporate user feedback into training and evaluation. Ensure the model performs well with live data and not just under test conditions, identify any discrepancies, and make the necessary adjustments. Oftentimes, fixing a specific, frequent problem leads to unexpected benefits in other areas of the system. AI teams must be able to trace, log, and debug each component of the system.
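As one example of those system-level choices, the chunking strategy alone has several knobs worth evaluating, such as chunk size, overlap, and split boundaries. Below is a simple configurable chunker as a sketch; the default values are arbitrary and should be tuned against your own retrieval evaluations.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character windows for embedding and retrieval.

    chunk_size and overlap are illustrative defaults; the right values depend on
    your embedding model, document structure, and downstream evaluation results.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```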

But teams shouldn’t think a single model or system will solve all use cases. Develop models tailored to specific tasks, and individually evaluate performance for these tasks. Have clear, well-defined objectives for what the model is expected to achieve and provide a performance benchmark to measure against. Depending on the use case, teams must decide whether to prioritize performance improvement, cost reduction, or high accuracy. Defining these priorities early on will help guide resource allocation and efforts during the development process. Using an open-source model combined with high-quality, proprietary training data can significantly improve performance while keeping costs manageable.
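To make those well-defined objectives actionable, each task-specific model can be paired with its own benchmark set and target metric, as in the sketch below. The task names, thresholds, and scoring function are placeholders, not prescribed values.

```python
from typing import Callable

# Placeholder per-task targets; real thresholds come from your product requirements.
TASK_TARGETS = {"summarization": 0.80, "sql_generation": 0.90}

def evaluate_task(model: Callable[[str], str],
                  benchmark: list[dict],
                  score_fn: Callable[[str, str], float]) -> float:
    """Average score of one task-specific model over its own benchmark set."""
    scores = [score_fn(model(ex["input"]), ex["expected"]) for ex in benchmark]
    return sum(scores) / max(len(scores), 1)

def meets_objective(task: str, score: float) -> bool:
    """Compare a task's measured score against its predefined target."""
    return score >= TASK_TARGETS.get(task, 1.0)
```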

Craig offers a treasure trove of information and experience, well beyond what’s covered above. Do yourself a favor and watch the entire session now!

Working with Natural Language Processing?

Read about Galileo’s NLP Studio
