Company
Stripe
Title
Production LLM Implementation for Customer Support Response Generation
Industry
Finance
Year
2024
Summary (short)
Stripe implemented a large language model system to help support agents answer customer questions more efficiently. They developed a sequential framework that combined fine-tuned models for question filtering, topic classification, and response generation. While the system achieved good accuracy in offline testing, they discovered challenges with agent adoption and the importance of monitoring online metrics. Key learnings included breaking down complex problems into manageable ML steps, prioritizing online feedback mechanisms, and maintaining high-quality training data.
# Stripe's Journey in Deploying LLMs for Customer Support

## Background and Use Case

Stripe, a global payments company serving millions of businesses across 50 countries, handles tens of thousands of support cases weekly. Because support interactions are predominantly text-based, the team identified LLMs as an opportunity to improve their support operations. While there were multiple potential applications (such as ticket summarization, translation, and classification), they chose to focus on their most complex challenge: helping agents answer customer questions more efficiently using GPT.

## Technical Approach and Architecture

### Initial Attempts and Limitations

- Started with a direct GPT-3.5 implementation but found limitations with factual accuracy
- Upgraded to GPT-4, which improved results but still exhibited hallucinations
- Discovered that general-purpose models struggled with domain-specific knowledge

### Final Solution Architecture

- Implemented a sequential framework with multiple specialized components: question filtering (a trigger classifier), topic classification, and response generation

### Data and Training Strategy

- Used a few hundred labels per class for fine-tuning
- Relied on expert agent annotations for gold-standard answers
- The team hand-labeled data for the trigger and topic classifiers
- Implemented confidence thresholds to mitigate hallucinations

## Deployment and Monitoring

### Shadow Mode Deployment

- Shipped each stage of the framework separately in shadow mode
- Implemented daily sampling and review of shadow-mode predictions
- Used tags to track model predictions without affecting production

### Experimental Setup

- Designed a controlled experiment comparing cases with and without ML-generated responses
- Randomized by support case due to volume constraints
- Included thousands of agents in the experiment

### Monitoring Challenges

- Initially lacked online accuracy metrics
- Developed a heuristic-based match rate to measure similarity between suggested and sent responses
- Created dashboards for tracking performance metrics
- Prioritized monitoring as highly as model development

## Key Learnings and Best Practices

### Model Development

- Break down complex business problems into manageable ML steps
- Consider human behavior and UX implications early
- Implement comprehensive monitoring from the start
- Use shadow deployments for validation

### Data Strategy

- Data quality and management proved more important than model sophistication
- The 80/20 rule applied: data work took months while coding took weeks
- Moved away from generative fine-tuning to classification for scalability
- Implemented weak supervision techniques for data labeling
- Developed a strategy for keeping training data fresh and up to date

### Production Considerations

- Compute costs were manageable thanks to the use of smaller models
- Developed a strategy to transition from GPT to in-house models where appropriate
- Implemented comprehensive monitoring dashboards
- Created proxy metrics for tracking online performance

## Results and Impact

### Challenges

- Lower-than-expected agent adoption despite positive offline testing
- A gap between offline and online performance metrics
- The need for extensive UX work and agent training

### Solutions

- Developed proxy metrics for online monitoring
- Implemented comprehensive agent training programs
- Created a data maintenance strategy for long-term accuracy

## Infrastructure and Team Structure

- Started with a lightweight team
- Later identified the need for an expanded team

## Future Directions

- Moving towards classification-based approaches for better scalability
- Investing in subject matter expertise programs
- Developing a living "oracle" of training data
- Focusing on maintaining fresh and accurate responses as the product suite grows
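The sequential framework described in the case study — a trigger classifier that filters questions, a topic classifier, and a response-generation step, each gated by a confidence threshold to mitigate hallucinations — can be sketched roughly as below. All names, thresholds, and the stub classifier logic are hypothetical illustrations, not Stripe's actual implementation; in a real system each stub would call a fine-tuned model.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical confidence thresholds; in practice these would be tuned
# on held-out data to trade coverage against accuracy.
TRIGGER_THRESHOLD = 0.8
TOPIC_THRESHOLD = 0.7


@dataclass
class Suggestion:
    topic: str
    response: str


def classify_trigger(question: str) -> float:
    """Stub trigger classifier: probability the question is one the
    system should attempt to answer."""
    return 0.9 if "refund" in question.lower() else 0.2


def classify_topic(question: str) -> tuple[str, float]:
    """Stub topic classifier returning (topic, confidence)."""
    return ("refunds", 0.85) if "refund" in question.lower() else ("other", 0.3)


def generate_response(question: str, topic: str) -> str:
    """Stub response generator; a real system would call a fine-tuned LLM."""
    return f"[draft answer about {topic}]"


def suggest_response(question: str) -> Optional[Suggestion]:
    """Run the sequential framework; return None whenever any stage's
    confidence falls below its threshold (hallucination mitigation)."""
    if classify_trigger(question) < TRIGGER_THRESHOLD:
        return None  # filtered out: not a question we should answer
    topic, confidence = classify_topic(question)
    if confidence < TOPIC_THRESHOLD:
        return None  # topic too uncertain to draft a response
    return Suggestion(topic=topic, response=generate_response(question, topic))


print(suggest_response("How do I issue a refund?"))
print(suggest_response("Hello there"))  # → None (filtered by the trigger stage)
```

Structuring the system this way is what lets each stage ship separately in shadow mode: a `None` early return is a cheap, observable "no suggestion" outcome rather than a hallucinated answer.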
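A heuristic "match rate" of the kind mentioned under Monitoring Challenges can be approximated by comparing each suggested draft to the reply the agent actually sent, for example with a normalized token-overlap ratio. The similarity measure and the 0.6 threshold below are illustrative assumptions, not the actual heuristic Stripe used.

```python
import difflib


def match_ratio(suggested: str, sent: str) -> float:
    """Token-level similarity in [0, 1] between the ML draft and the
    agent's final reply (illustrative proxy metric)."""
    return difflib.SequenceMatcher(
        None, suggested.lower().split(), sent.lower().split()
    ).ratio()


def match_rate(pairs: list[tuple[str, str]], threshold: float = 0.6) -> float:
    """Fraction of cases where the agent's reply closely matched the draft."""
    if not pairs:
        return 0.0
    hits = sum(
        1 for suggested, sent in pairs if match_ratio(suggested, sent) >= threshold
    )
    return hits / len(pairs)


# Hypothetical (suggested, sent) pairs sampled from production logs.
pairs = [
    ("Refunds take 5-10 business days to appear.",
     "Refunds take 5-10 business days to appear on your statement."),
    ("Please rotate your API keys in the dashboard.",
     "You can dispute this charge from the Disputes tab."),
]
print(match_rate(pairs))  # → 0.5 (one close match out of two)
```

A metric like this is deliberately crude; its value is that it can be computed online from logs alone, giving a proxy for accuracy long before labeled online evaluations exist.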
