

How prioritizing data quality helped involve.ai improve customer intelligence ML model performance by 10%

Company: involve.ai
Industry: Customer Intelligence

Galileo helps us manage our training runs; it helps us keep everything organized. But more importantly, it helps us prioritize our data itself and organize it so that we can understand what's working and what's not.

Ella Lucas

NLP Data Scientist @ involve.ai

Project Objective

Type of data
Unstructured sequence tagging problem (NLP text data: long customer emails)

Source of data
Real-world dataset from clients: any text that captures interactions with customers

Task Type
NER (named entity recognition on customer emails)

Stage of model development:

Production
The model is in production; the goal is to find high-ROI errors that will ultimately improve the models' F1 score on the validation set.
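
To make that evaluation loop concrete, here is a minimal sketch of scoring entity-level precision, recall, and F1 on a held-out validation set with spaCy v3 (the team's main framework; see ML Stack below). The model and data paths are hypothetical placeholders, not involve.ai's actual setup:

```python
# Minimal sketch: entity-level validation metrics with spaCy v3.
# "models/ner-production" and "data/val.spacy" are hypothetical paths.
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

nlp = spacy.load("models/ner-production")        # fine-tuned pipeline (assumed)
val = DocBin().from_disk("data/val.spacy")       # gold-labeled validation docs

examples = [
    Example(nlp.make_doc(doc.text), doc)         # unannotated copy vs. gold doc
    for doc in val.get_docs(nlp.vocab)
]
scores = nlp.evaluate(examples)                  # runs the pipeline and scores
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])
```

The returned scores dict also includes per-label breakdowns under `ents_per_type`, which is handy for spotting which labels drag the overall F1 down.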

ML Stack:

Frameworks used
spaCy: fine-tuning the pretrained English spaCy NER model (sketched after this list)
Recently also working with Hugging Face

Services used
Amazon S3 for storing data
Label Studio for labeling
Amazon SageMaker for training
Apache Airflow for automation pipelines (run as a managed Amazon service)
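
As an illustration of the fine-tuning setup named above, the sketch below updates a pretrained English spaCy pipeline's NER component with a custom customer-intelligence label. The label name, example sentence, and paths are invented for illustration; involve.ai's actual labels are not public:

```python
# Sketch: fine-tuning a pretrained English spaCy pipeline's NER component on a
# custom customer-intelligence label. Label, sentence, and paths are invented.
import random
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")               # pretrained English pipeline
ner = nlp.get_pipe("ner")
ner.add_label("CHURN_SIGNAL")                    # hypothetical custom label

TRAIN_DATA = [
    ("We are considering cancelling our contract next quarter.",
     {"entities": [(19, 29, "CHURN_SIGNAL")]}),  # span covering "cancelling"
]

optimizer = nlp.resume_training()                # keep the pretrained weights
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.select_pipes(disable=other_pipes):      # update only the NER component
    for _ in range(20):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, drop=0.2)

nlp.to_disk("models/ner-custom")                 # hypothetical output path
```

In practice, mixing the new examples with samples of the original entity types helps avoid "catastrophic forgetting" when updating a pretrained NER model.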

Overview

involve.ai is an early warning system for customer revenue. Our entire mission is to help companies be truly data-driven and customer-centric. We do this using data science and machine learning solutions to unify and analyze data that lives in disparate sources such as Zendesk, Salesforce, and all of the various places companies maintain customer data these days; the list is ever-growing.

We provide a two-pronged service. First, we work with customers to understand where all of their data is living and bring it together into one comprehensive dashboard using a no-code data aggregation tool. Then, we provide a health score based on machine learning models trained to predict which of those customers are likely to churn versus which have a high likelihood for upsell or expansion.

The Machine Learning team's primary responsibility is maintaining the models. There's me, the team lead, and two engineers. We also have Data Science Consultants who work adjacent to the Machine Learning team and are more customer-focused. Together, these teams unify the data, build the dashboard, and meet with the customer to make sure that the ML models we provide work properly for their business.

We use Galileo for our NER model, which predicts customer behavior. The labels themselves are customer success and customer intelligence oriented. Since the NER model is completely in-house and we curate our own training data, the challenge is making sure that this training data is accurate and scalable across our entire customer base.
We don't require that our customers come from one particular domain, so we need to ensure that the data for the model is domain agnostic.

Galileo has helped us gain visibility into the data we were using to inform our production models. In turn, we were able to collect better data and improve our accuracy by 10%.
Galileo helps us manage our training runs; it helps us keep everything organized. But more importantly, it helps us prioritize our data itself and organize it so that we can understand what's working and what's not.

We use this data to make decisions around changes to our modeling workflow, like new pre-processing methods or whether models are good enough to go into production. Essentially, it empowers our ML workload by providing visibility into the data that we use for modeling.

Challenge

Gain visibility into the data we were using to inform our production model and improve our model accuracy at involve.ai.

Our original challenge, in a nutshell, was a lack of visibility into the data that we were using to inform production models. Previously, our process was labor-intensive, and we simply did not have the visibility we now have with Galileo.

Our process was really slow when it came to determining whether the data we were passing to the labeler was the best subset we could pass forward. Then, once the labeler finished their job and we moved ahead with training, we wanted to know whether the model was performing as well as we would expect. That kind of manual process, requiring engineering review and prioritization using Python scripts, wasn't scalable and had to be customized for each domain.

This process required us to collaborate heavily with other teams, taking their time and ours just to make sure that the model outputs matched the inputs that we were giving it based on training data labels.

Solution

Remove the manual process with Galileo to turn around more accurate models faster.

We have an in-house data labeler who's constantly working on projects. So when the data labeler finishes a project, we will typically take the data they've labeled and augment an existing model's training data with that new information to see if it improves accuracy or strengthens the model.

We test that around every two weeks, depending on capacity and the speed at which the labeler can add new information. But we only come up with a new model every one to two months because it takes more than one iteration over that two-week labeling cycle to augment the model itself with enough information to improve accuracy.
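
For a sense of what that augmentation step can look like mechanically, here is a sketch that folds a new Label Studio export into an existing spaCy training set. The file paths are hypothetical, and the JSON layout follows Label Studio's standard span-labeling export; treat both as assumptions rather than involve.ai's actual pipeline:

```python
# Sketch: merging a Label Studio export into existing spaCy training data.
# Paths are hypothetical; the JSON layout follows Label Studio's standard
# span-labeling export (annotations -> result -> value{start, end, labels}).
import json
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans

nlp = spacy.blank("en")                           # tokenizer only
train = DocBin().from_disk("data/train.spacy")    # existing training set

with open("exports/labelstudio.json") as f:
    tasks = json.load(f)

for task in tasks:
    doc = nlp.make_doc(task["data"]["text"])
    spans = []
    for result in task["annotations"][0]["result"]:
        v = result["value"]
        span = doc.char_span(v["start"], v["end"], label=v["labels"][0])
        if span is not None:                      # drop token-misaligned spans
            spans.append(span)
    doc.ents = filter_spans(spans)                # resolve any overlaps
    train.add(doc)

train.to_disk("data/train.spacy")                 # augmented set for retraining
```

Spans that don't align to token boundaries come back as None from char_span, which is itself a useful early signal of noisy labels.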

With Galileo, we removed the manual processes of meeting with our product team and other teams to review that data before making an update. We are now able to turn models around really fast.

Result

In the end, Galileo saved us a lot of manual labor and improved our model accuracy by 10%.

Galileo's insights gave us specific feedback that improved the labeler's workflow. On the flip side, these insights help us understand which information is the most pertinent to pass to our models and which to discard.

In the end, Galileo improved our accuracy by 10%. This is pretty big for us. We realized that there were specific preprocessing techniques we could apply to clean the information prior to training. What we were previously doing had added a lot of noise, which, as Galileo showed us, led to high Data Error Potential.
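
The exact preprocessing involve.ai landed on isn't spelled out here, but a minimal sketch of the kind of email-noise cleanup described (quoted reply chains, signatures, leftover HTML) might look like this; all of the heuristics below are illustrative assumptions:

```python
# Sketch: illustrative email-noise cleanup before training. The specific
# rules are assumptions, not involve.ai's actual preprocessing.
import re

def clean_email(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)                     # drop leftover HTML tags
    text = re.sub(r"(?m)^>.*$", "", text)                    # drop quoted reply lines
    text = re.split(r"(?mi)^--\s*$|^sent from my", text)[0]  # cut at signature marker
    return re.sub(r"\s+", " ", text).strip()                 # normalize whitespace

print(clean_email("Hi team,<br> we may cancel.\n> older quoted text\n-- \nElla"))
# -> "Hi team, we may cancel."
```

One design note: if cleaning happens after labeling, the character offsets in the annotations have to be remapped, so it is usually simpler to clean text before it goes to the labeler.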

Once we began that preprocessing, we made large gains in accuracy, and speeding up workloads allowed us to spend more time on models. In the past, it took around two weeks to get information to our product team and gather feedback. Now we don't really need that middleman. Instead, we can go through Galileo, look at how a specific model is performing over the data, and then make a recommendation to the product team about whether the model looks promising and scalable.

Galileo helped us improve our models which in turn helped us improve our customers’ experience, empowering them with clearer visibility, earlier detection of key customer behaviors, higher retention, increased upsells, and sustainable revenue.
