Stripe developed an LLM-based system to help support agents handle customer inquiries more efficiently by providing relevant response prompts. The solution evolved from a simple GPT implementation to a sophisticated multi-stage framework incorporating fine-tuned models for question validation, topic classification, and response generation. Despite strong offline performance, the team faced challenges with agent adoption and online monitoring, leading to valuable lessons about the importance of UX consideration, online feedback mechanisms, and proper data management in LLM production systems.
Stripe, a global payments company serving millions of customers with a wide suite of payment and data products, developed an LLM-powered system to assist their support operations. The presentation, given by Sophie Daley, a data scientist at Stripe, covers the lessons learned from building their first production application of large language models in the support space. The core goal was to help support agents solve customer cases more efficiently by prompting them with relevant, AI-generated responses to user questions. Importantly, customers would always interact directly with human agents—the LLM system was designed purely as an agent assistance tool, not a customer-facing chatbot.
The support operations team handles tens of thousands of text-based support cases weekly, making it a prime candidate for LLM applications. The complexity and breadth of Stripe’s product offerings mean that agents often spend significant time researching answers, which the team aimed to reduce through intelligent response suggestions.
One of the first and most significant lessons the team learned was that LLMs are not oracles. When testing out-of-the-box GPT (specifically DaVinci) with basic support questions like “How can I pause payouts?”, the model would produce plausible-sounding but factually incorrect answers. This was true for the majority of questions Stripe customers ask, because the model's pre-training data was outdated, incomplete, or conflated Stripe's products with generic instructions that might apply to other payments companies.
While prompt engineering could potentially fix specific answers, the scope and complexity of Stripe’s support space made this approach unviable at scale. This is an important lesson for organizations considering LLM deployment: domain-specific accuracy often cannot be achieved through prompting alone when dealing with proprietary, rapidly-changing, or highly specialized knowledge.
To address these limitations, the team developed a multi-stage pipeline that broke the problem down into more manageable ML steps: validating whether an incoming question was one the system should answer, classifying its topic, and generating a suggested response.
This approach provided several benefits. First, it gave the team much more control over the solution framework. Second, fine-tuning completely mitigated hallucinations in their case, which is a notable claim worth examining critically. The team found that fine-tuning on GPT required approximately 500 labels per class, allowing them to move quickly using expert agent annotations. The framework leveraged fine-tuned GPT models for both classification and generation tasks.
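The talk does not show the pipeline's code, but its staged structure can be sketched as a short orchestration function. The stage implementations below are hypothetical stubs standing in for Stripe's fine-tuned models; only the three-step flow (validate, classify, generate) comes from the case study.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SuggestedResponse:
    topic: str
    draft: str

def suggest_response(
    question: str,
    is_answerable: Callable[[str], bool],       # stage 1: question validation
    classify_topic: Callable[[str], str],       # stage 2: topic classification
    generate_draft: Callable[[str, str], str],  # stage 3: response generation
) -> Optional[SuggestedResponse]:
    """Run the three stages in order; bail out early when the question
    is not one the system should attempt to answer."""
    if not is_answerable(question):
        return None
    topic = classify_topic(question)
    return SuggestedResponse(topic=topic, draft=generate_draft(question, topic))

# Toy stubs standing in for the fine-tuned models:
result = suggest_response(
    "How can I pause payouts?",
    is_answerable=lambda q: "?" in q,
    classify_topic=lambda q: "payouts" if "payout" in q.lower() else "other",
    generate_draft=lambda q, t: f"[{t}] Suggested reply for: {q}",
)
print(result.topic)  # payouts
```

Breaking the problem apart this way also means each stage can be evaluated and retrained independently, which matches the control benefit the team describes.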
The team relied on standard backtest evaluations for classification models using labeled datasets. For generative models, expert agents manually reviewed and labeled responses for quantitative assessment. User testing and training data collection also involved agents who dictated what ML response prompts should look like for different input question types.
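A standard backtest of this kind reduces to scoring predictions against a labeled hold-out set. The sketch below (with invented example classes) shows the overall and per-class accuracy computation such an evaluation typically produces; the specific metrics Stripe tracked are not stated in the source.

```python
from collections import defaultdict

def backtest(labels, predictions):
    """Compute overall and per-class accuracy on a labeled hold-out set."""
    assert len(labels) == len(predictions)
    per_class = defaultdict(lambda: [0, 0])  # class -> [correct, total]
    for y, yhat in zip(labels, predictions):
        per_class[y][1] += 1
        if y == yhat:
            per_class[y][0] += 1
    overall = sum(c for c, _ in per_class.values()) / len(labels)
    return overall, {k: c / n for k, (c, n) in per_class.items()}

labels      = ["payouts", "payouts", "refunds", "refunds"]
predictions = ["payouts", "refunds", "refunds", "refunds"]
overall, by_class = backtest(labels, predictions)
# overall == 0.75; by_class == {"payouts": 0.5, "refunds": 1.0}
```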
After many ML iterations, offline feedback trended positively and the team felt confident in their model accuracy, leading them to ship to production. They designed a controlled experiment comparing cases where agents received ML-generated response prompts versus those that didn’t.
However, a significant gap emerged: online case labeling was not feasible at scale, leaving them without visibility into online accuracy trends. Once shipped, they discovered that agent adoption rates were much lower than expected—very few cases were actually using the ML-generated answers. Without online accuracy metrics, the team was essentially operating in the dark trying to understand whether there was a discrepancy between online and offline performance.
To address this, they developed a heuristic-based “match rate” metric representing how often ML-generated responses matched what agents actually sent to users. This provided a crude lower-bound measure of expected accuracy and helped them understand model trends in production. Even though offline testing and online accuracy trends looked good, agents were too accustomed to their existing workflows and were ignoring the prompts. This lack of engagement became a major bottleneck for realizing efficiency gains, requiring a much larger UX effort to increase adoption.
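The talk does not specify how "match" was defined, but a match-rate metric of this shape can be approximated by thresholding a text-similarity score between the ML draft and the reply the agent actually sent. The Jaccard token-overlap heuristic and the 0.6 threshold below are assumptions for illustration, not Stripe's implementation.

```python
import re

def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def is_match(ml_draft: str, agent_reply: str, threshold: float = 0.6) -> bool:
    """Heuristic match: Jaccard similarity of token sets above a threshold."""
    a, b = _tokens(ml_draft), _tokens(agent_reply)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold

def match_rate(pairs) -> float:
    """Fraction of cases where the agent's sent reply matched the ML draft."""
    return sum(is_match(draft, reply) for draft, reply in pairs) / len(pairs)

pairs = [
    ("You can pause payouts in the Dashboard settings.",
     "You can pause payouts in the Dashboard settings."),  # match
    ("You can pause payouts in the Dashboard settings.",
     "Please contact us about your refund request."),      # no match
]
print(match_rate(pairs))  # 0.5
```

Because agents routinely edit drafts before sending, any such heuristic undercounts genuine agreement, which is why the metric is only a crude lower bound on accuracy.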
Several practical lessons emerged from this experience.
Perhaps the most significant lesson was that data remains the most important factor when solving business problems with LLMs. The speaker pushed back against the notion that newer or more advanced LLM architectures will solve everything if you just find the right prompt. LLMs are not a silver bullet—production deployment still requires data collection, testing, experimentation infrastructure, and iteration just like any other ML model.
The classic 80/20 rule for data science held true: writing the code for the LLM framework took days or weeks, while iterating on the training dataset took months. Iterating on label data quality yielded higher performance gains compared to using more advanced GPT engines. The ML errors they encountered related to “gotchas” specific to Stripe’s support domain rather than general gaps in model understanding, meaning adding or improving data samples typically addressed performance gaps.
Interestingly, scaling proved to be more of a data management challenge than a model advancement challenge. Collecting labels for generative fine-tuning models added significant complexity. For their second iteration (noted as currently in development), the team made a notable architectural decision: they swapped out the generative ML component for more straightforward classification approaches. This allowed them to leverage weak supervision techniques, such as Snorkel-style programmatic labeling and embedding-based classification, to label data at scale without requiring explicit human labelers.
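One common form of embedding-based weak labeling is nearest-seed classification: an unlabeled question inherits the class of the most similar labeled example, with an abstain option when nothing is close enough. The sketch below illustrates the idea; the bag-of-words "embedding," the seed examples, and the 0.3 similarity floor are all toy stand-ins for a real embedding model and real seed data.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def weak_label(question, seeds, min_sim=0.3):
    """Assign the class of the most similar labeled seed example,
    abstaining (None) when nothing is similar enough."""
    q = embed(question)
    best_cls, best_sim = None, 0.0
    for text, cls in seeds:
        sim = cosine(q, embed(text))
        if sim > best_sim:
            best_cls, best_sim = cls, sim
    return best_cls if best_sim >= min_sim else None

seeds = [
    ("how do I pause payouts", "payouts"),
    ("how do I issue a refund", "refunds"),
]
print(weak_label("can I pause my payouts", seeds))  # payouts
```

The abstain path matters in practice: weakly labeled examples below the confidence floor can be routed to human review instead of polluting the training set.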
The team also invested heavily in a subject-matter-expert program to collect and maintain their dataset. Because Stripe’s support space changes over time as products evolve, labels need to stay fresh for the model to remain accurate. Their goal is for this dataset to become a “living oracle” guaranteeing ML responses stay fresh and accurate into the future.
This case study offers valuable honest insights into the challenges of productionalizing LLMs, though some claims warrant scrutiny. The assertion that fine-tuning “completely mitigated hallucinations” is a strong claim that would benefit from more rigorous verification—hallucination mitigation typically involves tradeoffs and isn’t usually absolute. Additionally, the low agent adoption rates despite positive offline metrics highlight a common but often underappreciated gap between ML performance and real-world utility.
The pivot from generative to classification-based approaches in their second iteration is particularly noteworthy, suggesting that simpler, more controllable ML approaches may sometimes outperform generative models in production settings where reliability and maintainability are paramount. This pragmatic evolution reflects mature ML engineering judgment rather than chasing the newest techniques.
Overall, this case study provides a candid look at the operational realities of deploying LLMs in enterprise support settings, with lessons applicable across industries deploying similar agent-assistance systems.
Nubank, one of Brazil's largest banks serving 120 million users, implemented large-scale LLM systems to create an AI private banker for their customers. They deployed two main applications: a customer service chatbot handling 8.5 million monthly contacts with 60% first-contact resolution through LLMs, and an agentic money transfer system that reduced transaction time from 70 seconds across nine screens to under 30 seconds with over 90% accuracy and less than 0.5% error rate. The implementation leveraged LangChain, LangGraph, and LangSmith for development and evaluation, with a comprehensive four-layer ecosystem including core engines, testing tools, and developer experience platforms. Their evaluation strategy combined offline and online testing with LLM-as-a-judge systems that achieved 79% F1 score compared to 80% human accuracy through iterative prompt engineering and fine-tuning.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.