Intercom developed Fin, an AI customer support chatbot that resolves up to 86% of conversations instantly. Scaling from proof-of-concept to production raised challenges, particularly around reliability and cost management. The team improved system reliability from 99% to 99.9%+ by implementing cross-region inference, using streaming strategically, and adding multiple model fallbacks across Amazon Bedrock and other LLM providers. The solution has processed over 13 million conversations for 4,000+ customers, most of whom achieve automated resolution rates above 50%.
This case study examines how Intercom built and scaled their AI customer support chatbot called Fin, focusing on the critical transition from proof-of-concept to production-grade LLM implementation. The study provides valuable insights into real-world challenges and solutions in deploying LLMs at scale.
## Background and Achievements
Intercom's Fin represents a significant success story in production LLM applications. The chatbot can handle a wide range of customer support tasks, from answering informational queries using their help center content to performing actual actions like pausing subscriptions. The system has demonstrated impressive results:
* Successfully resolved over 13 million conversations
* Deployed across 4,000+ customers
* Achieves 50%+ resolution rates for most customers
* Can resolve up to 86% of conversations instantly
* Became Intercom's most successful product
## Technical Architecture and Evolution
### Initial Development (V1)
With no usable historical data to draw on, the team took a distinctive approach:
* Built initially on product intuition rather than existing data
* Developed when LLM capacity was scarce, requiring selective use of models
* Implemented comprehensive LLM monitoring and transaction logging from the start
* Used Python despite Intercom being traditionally a Ruby on Rails shop
### Infrastructure and Monitoring
The team implemented sophisticated monitoring and logging systems:
* Tracked key latency metrics such as time-to-first-token and time-to-sample-token (see the instrumentation sketch after this list)
* Implemented full HTTP request/response logging
* Used distributed tracing for debugging complex retry scenarios
* Logged prompts as templates plus their inputs rather than as fully materialized strings
* Used the Amazon Kinesis agent and Snowflake for log aggregation
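To make the latency metrics above concrete, here is a minimal sketch of streaming instrumentation that records time-to-first-token and inter-token latencies; the `stream_completion` callable and `StreamMetrics` structure are illustrative assumptions, not Intercom's actual code.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class StreamMetrics:
    time_to_first_token: Optional[float] = None
    inter_token_latencies: List[float] = field(default_factory=list)
    total_tokens: int = 0


def instrumented_stream(
    stream_completion: Callable[[str], Iterable[str]], prompt: str
) -> Tuple[str, StreamMetrics]:
    """Wrap a streaming LLM call and record per-token timing metrics."""
    metrics = StreamMetrics()
    chunks = []
    start = time.monotonic()
    last = start
    for token in stream_completion(prompt):
        now = time.monotonic()
        if metrics.time_to_first_token is None:
            metrics.time_to_first_token = now - start
        else:
            metrics.inter_token_latencies.append(now - last)
        last = now
        metrics.total_tokens += 1
        chunks.append(token)
    return "".join(chunks), metrics


if __name__ == "__main__":
    # Stand-in stream that yields tokens with a small delay.
    def fake_stream(prompt: str):
        for tok in ["Hello", ", ", "world", "!"]:
            time.sleep(0.05)
            yield tok

    text, m = instrumented_stream(fake_stream, "greet the user")
    print(text, f"ttft={m.time_to_first_token:.3f}s tokens={m.total_tokens}")
```

Metrics like these would then flow through the log aggregation pipeline alongside the full request/response records.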
### Iterative Development Process
They developed a two-stage process for safely evolving the system, pairing offline evaluation with production validation (a sketch of the offline stage follows the list):
* Offline testing using saved prompt inputs with new templates
* Automated evaluation using LLMs to assess changes
* Production A/B testing for final validation
* Careful attention to business metrics, since Intercom only charges for successful resolutions
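A minimal sketch of the offline stage under these assumptions: saved prompt inputs are replayed through a candidate template and the outputs are scored by an LLM judge. The data layout and the `generate` / `judge` callables are hypothetical stand-ins for real model calls.

```python
from statistics import mean


def render(template: str, variables: dict) -> str:
    return template.format(**variables)


def offline_eval(saved_inputs, candidate_template, generate, judge) -> float:
    """Return the mean judge score for a candidate prompt template."""
    scores = []
    for case in saved_inputs:
        prompt = render(candidate_template, case["variables"])
        answer = generate(prompt)
        # The judge scores in [0, 1], e.g. "did the answer resolve the question?"
        scores.append(judge(question=case["variables"]["question"], answer=answer))
    return mean(scores)


if __name__ == "__main__":
    saved = [{"variables": {
        "question": "How do I pause my subscription?",
        "context": "Subscriptions can be paused from Billing > Plan.",
    }}]
    template = "Answer using only this context:\n{context}\n\nQuestion: {question}"
    generate = lambda p: "You can pause it from Billing > Plan."          # stand-in LLM
    judge = lambda question, answer: 1.0 if "Billing" in answer else 0.0  # stand-in judge
    print(f"mean score: {offline_eval(saved, template, generate, judge):.2f}")
```

Only templates that hold up in this offline pass would then graduate to a production A/B test.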
### Reliability Engineering
The team improved system reliability through several key strategies:
* Strategic use of streaming vs. non-streaming responses
* Implementation of cross-region inference
* Deployment of multiple model fallbacks (sketched after this list)
* Careful quota management
* Use of both on-demand and provisioned throughput depending on use case
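As referenced in the list, here is a minimal sketch of a model fallback chain, assuming each provider is wrapped in a common `invoke(prompt)` callable; the provider names and ordering are illustrative, not Intercom's actual configuration.

```python
import logging

logger = logging.getLogger("fallbacks")


class AllProvidersFailed(Exception):
    pass


def complete_with_fallbacks(prompt: str, providers: list) -> str:
    """Try each (name, invoke_fn) pair in order, falling through on failure."""
    for name, invoke in providers:
        try:
            return invoke(prompt)
        except Exception as exc:  # timeouts, throttling, 5xx responses, ...
            logger.warning("provider %s failed: %s; trying next", name, exc)
    raise AllProvidersFailed(f"all {len(providers)} providers failed")


if __name__ == "__main__":
    def flaky(prompt: str) -> str:
        raise TimeoutError("regional capacity exhausted")

    def backup(prompt: str) -> str:
        return "fallback answer"

    chain = [("primary-model", flaky), ("secondary-model", backup)]
    print(complete_with_fallbacks("How do I reset my password?", chain))
```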
## Key Technical Decisions and Implementations
### Amazon Bedrock Integration
The decision to use Amazon Bedrock was driven by several factors:
* Data privacy - keeping data within AWS infrastructure
* Access to multiple model providers including Anthropic
* Simplified contractual requirements thanks to the existing AWS relationship
* Cross-region inference capabilities (illustrated in the sketch after this list)
* Built-in monitoring and observability
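A minimal sketch of calling Bedrock through a cross-region inference profile with boto3's Converse API; the profile ID, region, and system prompt are illustrative assumptions rather than Intercom's configuration.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Cross-region inference profiles are addressed like model IDs with a
# geography prefix (here "us."), letting Bedrock route requests across regions.
INFERENCE_PROFILE_ID = "us.anthropic.claude-3-5-sonnet-20240620-v1:0"


def answer_question(question: str, context: str) -> str:
    """Send one grounded question to Bedrock and return the model's text reply."""
    response = client.converse(
        modelId=INFERENCE_PROFILE_ID,
        system=[{"text": "Answer using only the provided help center context."}],
        messages=[{
            "role": "user",
            "content": [{"text": f"Context:\n{context}\n\nQuestion: {question}"}],
        }],
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]
```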
### RAG Implementation
The system uses RAG (Retrieval Augmented Generation) with several production-focused optimizations:
* Multiple index strategies for different data types
* Various chunking strategies based on content type (see the sketch after this list)
* Selective content retrieval rather than bulk context inclusion
* Evaluation tools for testing different chunking and retrieval approaches
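A minimal sketch of per-content-type chunking along these lines; the content types, chunk sizes, and splitting rules are illustrative assumptions.

```python
from typing import List


def chunk_by_paragraph(text: str, max_chars: int = 1200) -> List[str]:
    """Greedily pack paragraphs into chunks of up to max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks


def chunk_qa_pairs(text: str) -> List[str]:
    """Keep each question/answer pair intact as its own chunk."""
    return [pair.strip() for pair in text.split("\n\n") if pair.strip()]


CHUNKERS = {
    "help_article": chunk_by_paragraph,  # long-form docs: pack paragraphs
    "faq": chunk_qa_pairs,               # FAQs: never split a Q/A pair
}


def chunk_document(doc_type: str, text: str) -> List[str]:
    return CHUNKERS.get(doc_type, chunk_by_paragraph)(text)
```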
### Reliability Improvements
The team achieved significant reliability improvements through:
* Strategic use of retries at multiple levels (see the retry sketch after this list)
* Cross-region inference to handle regional capacity issues
* Careful management of streaming vs. non-streaming responses
* Implementation of model fallbacks
* Use of provisioned throughput for critical paths
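A minimal sketch of request-level retries with timeouts and jittered exponential backoff, which can be layered under the fallback chain shown earlier; the `invoke(prompt, timeout=...)` interface and retry parameters are assumptions for illustration.

```python
import random
import time


def call_with_retries(invoke, prompt: str, *, max_attempts: int = 3,
                      base_delay: float = 0.5, timeout: float = 30.0) -> str:
    """Retry transient failures with backoff; re-raise once attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke(prompt, timeout=timeout)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter to avoid synchronized retry storms.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```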
## Cost and Performance Optimization
The team implemented several strategies to optimize costs while maintaining performance:
* Selective use of more expensive models only where needed
* Careful monitoring of token usage and costs
* Implementation of budgeting and tracking per use case (sketched after this list)
* Use of inference profiles for departmental cost allocation
* Strategic balance between on-demand and provisioned throughput
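A minimal sketch of per-use-case token and cost accounting; the price table, model names, and use-case labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical USD prices per 1K input/output tokens, keyed by model.
PRICES = {"cheap-model": (0.0005, 0.0015), "premium-model": (0.003, 0.015)}

usage = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0})


def record_usage(use_case: str, model: str, input_tokens: int, output_tokens: int) -> None:
    """Attribute token usage and estimated cost to a named use case."""
    in_price, out_price = PRICES[model]
    entry = usage[use_case]
    entry["input_tokens"] += input_tokens
    entry["output_tokens"] += output_tokens
    entry["cost_usd"] += (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price


if __name__ == "__main__":
    record_usage("resolution", "premium-model", input_tokens=2400, output_tokens=350)
    record_usage("triage", "cheap-model", input_tokens=800, output_tokens=60)
    for use_case, entry in usage.items():
        print(use_case, f"${entry['cost_usd']:.4f}")
```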
## Lessons Learned
The case study highlights several important lessons for production LLM deployments:
* The importance of comprehensive monitoring and observability
* Need for careful capacity planning and quota management
* Value of basic engineering practices like timeouts and retries
* Importance of having multiple fallback options
* Need to treat different use cases differently based on requirements
The team's experience demonstrates that while LLM serving is still a maturing industry, careful attention to engineering fundamentals, combined with sophisticated monitoring and fallback strategies, can create highly reliable production systems. Their success in improving from 99% to 99.9%+ reliability while maintaining cost effectiveness provides a valuable blueprint for other organizations looking to deploy LLMs in production.