Gusto developed a method to improve the reliability of their LLM-based customer support system by using token log-probabilities as a confidence metric. The approach monitors sequence log-probability scores to identify and filter out potentially hallucinated or low-quality LLM responses. In their case study, they found a 69% relative difference in accuracy between high- and low-confidence responses, with the highest-confidence responses achieving 76% accuracy compared to 45% for the lowest-confidence responses.
This case study from Gusto, a company specializing in payroll, benefits, and HR solutions, explores an innovative approach to making LLM applications more reliable in production environments, specifically focusing on customer support applications. The study presents a practical solution to one of the most significant challenges in deploying LLMs in production: controlling and preventing hallucinations and low-quality outputs.
The core problem Gusto aimed to solve was how to reliably use LLMs for handling customer support questions while maintaining high confidence in the accuracy of responses. Their use case involved processing thousands of daily support questions, many of which could be answered using internal documentation (e.g., questions about adding contractors or changing employee status).
The technical approach centers on using token log-probabilities as a confidence metric for LLM outputs, drawing inspiration from the machine translation literature. The key insight is that when LLMs hallucinate or generate poor-quality responses, they tend to be less confident in their token predictions. This confidence can be measured with the Seq-Logprob metric, computed as the average of the token log-probabilities over the generated sequence.
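Concretely, for a response of $L$ generated tokens, this corresponds to the length-normalized sum of token log-probabilities (a standard formulation from the machine translation literature, written out here for reference rather than quoted from the case study):

$$
\text{Seq-Logprob} = \frac{1}{L} \sum_{i=1}^{L} \log p(y_i \mid y_{<i}, x)
$$

where $x$ is the prompt and $y_1, \dots, y_L$ are the generated tokens; values closer to zero indicate higher model confidence.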
The implementation details include several key components:
* Collection of Seq-Logprob scores through the OpenAI API for all LLM outputs to understand the expected confidence distribution (a minimal collection sketch follows this list)
* Monitoring of outputs at the lower end of the confidence distribution
* Implementation of decision boundaries for automated response handling
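As a rough illustration of the first component, here is one way to collect a Seq-Logprob score alongside a chat completion using the OpenAI Python SDK. The function name, model choice, and prompt wiring are illustrative; the case study does not describe Gusto's actual service code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def answer_with_confidence(question: str, model: str = "gpt-4o-mini") -> tuple[str, float]:
    """Return the LLM answer together with its Seq-Logprob confidence score."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        logprobs=True,  # ask the API to return per-token log-probabilities
    )
    choice = response.choices[0]
    token_logprobs = [t.logprob for t in choice.logprobs.content]
    # Seq-Logprob: average log-probability over the generated tokens
    seq_logprob = sum(token_logprobs) / len(token_logprobs)
    return choice.message.content, seq_logprob
```

Logging the score for every request is what builds up the confidence distribution that the later monitoring and thresholding steps rely on.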
The team conducted a comprehensive experiment with 1,000 support questions to validate their approach. The methodology involved:
* Running support questions through their LLM service while recording confidence scores
* Having customer support experts label the LLM-generated outputs as either "good quality" or "bad quality"
* Analyzing the correlation between confidence scores and output quality (a bucketed-accuracy sketch follows this list)
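A minimal sketch of that analysis step, assuming the recorded scores and the experts' binary labels are available as parallel lists (all names here are illustrative):

```python
import numpy as np


def accuracy_by_confidence_bucket(scores, labels, n_buckets: int = 5):
    """Split responses into equal-sized confidence buckets and report accuracy per bucket.

    scores: Seq-Logprob per response; labels: 1 = "good quality", 0 = "bad quality".
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    order = np.argsort(scores)               # from least to most confident
    buckets = np.array_split(order, n_buckets)
    return [
        {
            "mean_confidence": float(scores[idx].mean()),
            "accuracy": float(labels[idx].mean()),
            "n": int(len(idx)),
        }
        for idx in buckets
    ]
```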
The results were significant:
* A 69% relative difference in accuracy was observed between the most confident and least confident LLM responses
* The highest confidence responses achieved 76% accuracy
* The lowest confidence responses showed 45% accuracy
* The confidence scores followed an approximately normal distribution, which makes threshold-setting practical (see the sketch after this list)
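Because the distribution is roughly normal, a decision boundary can be expressed as a percentile of the observed scores rather than a hand-picked absolute value. A minimal sketch, assuming a sample of historical scores is available:

```python
import numpy as np


def confidence_threshold(historical_scores, reject_fraction: float = 0.1) -> float:
    """Return the Seq-Logprob value below which the lowest `reject_fraction` of responses fall."""
    return float(np.quantile(np.asarray(historical_scores), reject_fraction))
```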
What makes this approach particularly valuable for production environments is its practicality and efficiency:
* It adds no computational cost, since the log-probabilities are already available as a by-product of generation
* It enables automated quality control without needing reference-based evaluation
* It allows for flexible implementation of different handling strategies for different confidence levels
The study also revealed important patterns in LLM behavior:
* Low-confidence responses tend to be vague, overly broad, and more likely to deviate from prompt guidelines
* High-confidence responses typically demonstrate precise understanding of both the problem and solution
* The confidence distribution is sensitive to prompt changes, requiring calibration when prompts are modified
For production deployment, Gusto implemented several practical patterns (a routing sketch follows the list):
* Automatic rejection of poor-quality responses based on confidence thresholds
* Integration of expert-in-the-loop verification for low-confidence responses
* Additional information gathering for cases where confidence is low
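These patterns can be combined into a single routing step. The sketch below is a hypothetical composition; the thresholds and action names are illustrative rather than taken from the case study:

```python
def route_response(answer: str, seq_logprob: float,
                   reject_below: float, escalate_below: float) -> dict:
    """Decide how to handle an LLM answer based on its confidence score.

    reject_below < escalate_below; both thresholds come from the calibrated
    confidence distribution for the current prompt.
    """
    if seq_logprob < reject_below:
        # Very low confidence: discard the draft and ask the user for more details
        return {"action": "gather_more_information"}
    if seq_logprob < escalate_below:
        # Borderline confidence: send the draft to a support expert for review
        return {"action": "expert_review", "draft": answer}
    # High confidence: safe to use the answer directly
    return {"action": "send_to_customer", "answer": answer}
```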
The case study demonstrates how to transform LLM systems into more traditional ML systems with controllable precision-recall tradeoffs. This allows organizations to set specific quality thresholds based on their needs and risk tolerance.
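One way to make that tradeoff concrete is to sweep the confidence threshold over the labeled evaluation set and read off, for each threshold, the precision of the responses that are kept and the share of questions still answered automatically. The sketch below uses that precision/coverage framing as an interpretation of the tradeoff described in the study:

```python
import numpy as np


def precision_coverage_curve(scores, labels, thresholds=None):
    """For each confidence threshold, report the precision of kept responses
    and the fraction of questions still answered automatically (coverage)."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    if thresholds is None:
        thresholds = np.quantile(scores, np.linspace(0.0, 0.9, 10))
    curve = []
    for t in thresholds:
        kept = scores >= t
        if kept.sum() == 0:
            continue
        curve.append({
            "threshold": float(t),
            "precision": float(labels[kept].mean()),  # accuracy of kept responses
            "coverage": float(kept.mean()),           # share of questions answered
        })
    return curve
```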
The approach's limitations and considerations are also worth noting:
* Confidence thresholds need recalibration when prompts change
* The method works best with APIs that expose token probabilities (like OpenAI's)
* The approach doesn't eliminate hallucinations entirely but provides a practical way to manage their frequency
From an LLMOps perspective, this case study offers valuable insights into building reliable production systems with LLMs. It demonstrates how traditional ML concepts like precision-recall curves can be adapted for LLM applications, and how simple yet effective metrics can be used to control output quality. The approach is particularly valuable because it can be implemented without significant additional infrastructure or computational overhead, making it practical for organizations of various sizes and technical capabilities.
The study also highlights the importance of systematic evaluation and monitoring in LLM applications, showing how quantitative metrics can be used to make decisions about when to trust LLM outputs and when to fall back to human oversight. This balance between automation and human intervention is crucial for successful LLM deployment in production environments.