Gusto developed a method to improve the reliability of their LLM-based customer support system by using token log-probabilities as a confidence metric. The approach monitors sequence log-probability scores to identify and filter out potentially hallucinated or low-quality LLM responses. In their case study, they found a 69% relative difference in accuracy between high- and low-confidence responses, with the highest-confidence responses achieving 76% accuracy compared to 45% for the lowest-confidence responses.
This case study from Gusto, a company specializing in payroll, benefits, and HR solutions, explores an innovative approach to making LLM applications more reliable in production environments, specifically focusing on customer support applications. The study presents a practical solution to one of the most significant challenges in deploying LLMs in production: controlling and preventing hallucinations and low-quality outputs.
The core problem Gusto aimed to solve was how to reliably use LLMs for handling customer support questions while maintaining high confidence in the accuracy of responses. Their use case involved processing thousands of daily support questions, many of which could be answered using internal documentation (e.g., questions about adding contractors or changing employee status).
The technical approach centers on using token log-probabilities as a confidence metric for LLM outputs, drawing inspiration from the machine translation literature. The key insight is that when LLMs hallucinate or generate poor-quality responses, they tend to be less confident in their token predictions. This confidence can be measured with the Seq-Logprob metric, computed as the average of the token log-probabilities over the generated sequence.
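Concretely, for a response of $L$ generated tokens, this corresponds to the length-normalized sum of token log-probabilities (a standard formulation from the machine translation literature, written out here for reference rather than quoted from the case study):

$$
\text{Seq-Logprob} = \frac{1}{L} \sum_{i=1}^{L} \log p(y_i \mid y_{<i}, x)
$$

where $x$ is the prompt and $y_1, \dots, y_L$ are the generated tokens; values closer to zero indicate higher model confidence.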
The implementation details include several key components:
* Collection of Seq-Logprob scores through the OpenAI API for all LLM outputs to understand the expected confidence distribution (a minimal collection sketch follows this list)
* Monitoring of outputs at the lower end of the confidence distribution
* Implementation of decision boundaries for automated response handling
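As a rough illustration of the first component, here is one way to collect a Seq-Logprob score alongside a chat completion using the OpenAI Python SDK. The function name, model choice, and prompt wiring are illustrative; the case study does not describe Gusto's actual service code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def answer_with_confidence(question: str, model: str = "gpt-4o-mini") -> tuple[str, float]:
    """Return the LLM answer together with its Seq-Logprob confidence score."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        logprobs=True,  # ask the API to return per-token log-probabilities
    )
    choice = response.choices[0]
    token_logprobs = [t.logprob for t in choice.logprobs.content]
    # Seq-Logprob: average log-probability over the generated tokens
    seq_logprob = sum(token_logprobs) / len(token_logprobs)
    return choice.message.content, seq_logprob
```

Logging the score for every request is what builds up the confidence distribution that the later monitoring and thresholding steps rely on.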
The team conducted a comprehensive experiment with 1,000 support questions to validate their approach. The methodology involved:
* Running support questions through their LLM service while recording confidence scores
* Having customer support experts label the LLM-generated outputs as either "good quality" or "bad quality"
* Analyzing the correlation between confidence scores and output quality (a bucketed-accuracy sketch follows this list)
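A minimal sketch of that analysis step, assuming the recorded scores and the experts' binary labels are available as parallel lists (all names here are illustrative):

```python
import numpy as np


def accuracy_by_confidence_bucket(scores, labels, n_buckets: int = 5):
    """Split responses into equal-sized confidence buckets and report accuracy per bucket.

    scores: Seq-Logprob per response; labels: 1 = "good quality", 0 = "bad quality".
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    order = np.argsort(scores)               # from least to most confident
    buckets = np.array_split(order, n_buckets)
    return [
        {
            "mean_confidence": float(scores[idx].mean()),
            "accuracy": float(labels[idx].mean()),
            "n": int(len(idx)),
        }
        for idx in buckets
    ]
```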
The results were significant:
* A 69% relative difference in accuracy was observed between the most confident and least confident LLM responses
* The highest confidence responses achieved 76% accuracy
* The lowest confidence responses showed 45% accuracy
* The confidence scores followed an approximately normal distribution, which makes threshold-setting practical (see the sketch after this list)
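Because the distribution is roughly normal, a decision boundary can be expressed as a percentile of the observed scores rather than a hand-picked absolute value. A minimal sketch, assuming a sample of historical scores is available:

```python
import numpy as np


def confidence_threshold(historical_scores, reject_fraction: float = 0.1) -> float:
    """Return the Seq-Logprob value below which the lowest `reject_fraction` of responses fall."""
    return float(np.quantile(np.asarray(historical_scores), reject_fraction))
```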
What makes this approach particularly valuable for production environments is its practicality and efficiency:
* It adds no computational cost, since the log-probabilities are already available as a by-product of generation
* It enables automated quality control without needing reference-based evaluation
* It allows for flexible implementation of different handling strategies for different confidence levels
The study also revealed important patterns in LLM behavior:
* Low-confidence responses tend to be vague, overly broad, and more likely to deviate from prompt guidelines
* High-confidence responses typically demonstrate precise understanding of both the problem and solution
* The confidence distribution is sensitive to prompt changes, requiring calibration when prompts are modified
For production deployment, Gusto implemented several practical patterns (a routing sketch follows the list):
* Automatic rejection of poor-quality responses based on confidence thresholds
* Integration of expert-in-the-loop verification for low-confidence responses
* Additional information gathering for cases where confidence is low
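These patterns can be combined into a single routing step. The sketch below is a hypothetical composition; the thresholds and action names are illustrative rather than taken from the case study:

```python
def route_response(answer: str, seq_logprob: float,
                   reject_below: float, escalate_below: float) -> dict:
    """Decide how to handle an LLM answer based on its confidence score.

    reject_below < escalate_below; both thresholds come from the calibrated
    confidence distribution for the current prompt.
    """
    if seq_logprob < reject_below:
        # Very low confidence: discard the draft and ask the user for more details
        return {"action": "gather_more_information"}
    if seq_logprob < escalate_below:
        # Borderline confidence: send the draft to a support expert for review
        return {"action": "expert_review", "draft": answer}
    # High confidence: safe to use the answer directly
    return {"action": "send_to_customer", "answer": answer}
```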
The case study demonstrates how to transform LLM systems into more traditional ML systems with controllable precision-recall tradeoffs. This allows organizations to set specific quality thresholds based on their needs and risk tolerance.
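One way to make that tradeoff concrete is to sweep the confidence threshold over the labeled evaluation set and read off, for each threshold, the precision of the responses that are kept and the share of questions still answered automatically. The sketch below uses that precision/coverage framing as an interpretation of the tradeoff described in the study:

```python
import numpy as np


def precision_coverage_curve(scores, labels, thresholds=None):
    """For each confidence threshold, report the precision of kept responses
    and the fraction of questions still answered automatically (coverage)."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    if thresholds is None:
        thresholds = np.quantile(scores, np.linspace(0.0, 0.9, 10))
    curve = []
    for t in thresholds:
        kept = scores >= t
        if kept.sum() == 0:
            continue
        curve.append({
            "threshold": float(t),
            "precision": float(labels[kept].mean()),  # accuracy of kept responses
            "coverage": float(kept.mean()),           # share of questions answered
        })
    return curve
```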
The approach's limitations and considerations are also worth noting:
* Confidence thresholds need recalibration when prompts change
* The method works best with APIs that expose token probabilities (like OpenAI's)
* The approach doesn't eliminate hallucinations entirely but provides a practical way to manage their frequency
From an LLMOps perspective, this case study offers valuable insights into building reliable production systems with LLMs. It demonstrates how traditional ML concepts like precision-recall curves can be adapted for LLM applications, and how simple yet effective metrics can be used to control output quality. The approach is particularly valuable because it can be implemented without significant additional infrastructure or computational overhead, making it practical for organizations of various sizes and technical capabilities.
The study also highlights the importance of systematic evaluation and monitoring in LLM applications, showing how quantitative metrics can be used to make decisions about when to trust LLM outputs and when to fall back to human oversight. This balance between automation and human intervention is crucial for successful LLM deployment in production environments.