Intercom developed Fin, an AI customer support chatbot that resolves up to 86% of conversations instantly. Scaling from proof-of-concept to production raised challenges, particularly around reliability and cost management. The team improved system reliability from 99% to 99.9%+ by implementing cross-region inference, using streaming strategically, and adding multiple model fallbacks across Amazon Bedrock and other LLM providers. The solution has processed over 13 million conversations for 4,000+ customers, most of whom achieve automated resolution rates above 50%.
This case study examines how Intercom built and scaled their AI customer support chatbot called Fin, focusing on the critical transition from proof-of-concept to production-grade LLM implementation. The study provides valuable insights into real-world challenges and solutions in deploying LLMs at scale.
## Background and Achievements
Intercom's Fin represents a significant success story in production LLM applications. The chatbot can handle a wide range of customer support tasks, from answering informational queries using their help center content to performing actual actions like pausing subscriptions. The system has demonstrated impressive results:
* Successfully resolved over 13 million conversations
* Deployed across 4,000+ customers
* Achieves 50%+ resolution rates for most customers
* Can resolve up to 86% of conversations instantly
* Became Intercom's most successful product
## Technical Architecture and Evolution
### Initial Development (V1)
With no usable historical data to draw on, the team took a distinctive approach:
* Built initially on product intuition rather than existing data
* Developed when LLM capacity was scarce, requiring selective use of models
* Implemented comprehensive LLM monitoring and transaction logging from the start
* Used Python despite Intercom being traditionally a Ruby on Rails shop
### Infrastructure and Monitoring
The team implemented sophisticated monitoring and logging systems:
* Tracked key latency metrics such as time-to-first-token and time-to-sample-token (see the instrumentation sketch after this list)
* Implemented full HTTP request/response logging
* Used distributed tracing for debugging complex retry scenarios
* Logged prompts as templates plus their inputs rather than as fully materialized strings
* Used the Amazon Kinesis agent and Snowflake for log aggregation
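To make the latency metrics above concrete, here is a minimal sketch of streaming instrumentation that records time-to-first-token and inter-token latencies; the `stream_completion` callable and `StreamMetrics` structure are illustrative assumptions, not Intercom's actual code.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class StreamMetrics:
    time_to_first_token: Optional[float] = None
    inter_token_latencies: List[float] = field(default_factory=list)
    total_tokens: int = 0


def instrumented_stream(
    stream_completion: Callable[[str], Iterable[str]], prompt: str
) -> Tuple[str, StreamMetrics]:
    """Wrap a streaming LLM call and record per-token timing metrics."""
    metrics = StreamMetrics()
    chunks = []
    start = time.monotonic()
    last = start
    for token in stream_completion(prompt):
        now = time.monotonic()
        if metrics.time_to_first_token is None:
            metrics.time_to_first_token = now - start
        else:
            metrics.inter_token_latencies.append(now - last)
        last = now
        metrics.total_tokens += 1
        chunks.append(token)
    return "".join(chunks), metrics


if __name__ == "__main__":
    # Stand-in stream that yields tokens with a small delay.
    def fake_stream(prompt: str):
        for tok in ["Hello", ", ", "world", "!"]:
            time.sleep(0.05)
            yield tok

    text, m = instrumented_stream(fake_stream, "greet the user")
    print(text, f"ttft={m.time_to_first_token:.3f}s tokens={m.total_tokens}")
```

Metrics like these would then flow through the log aggregation pipeline alongside the full request/response records.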
### Iterative Development Process
They developed a two-stage process for safely evolving the system, pairing offline evaluation with production validation (a sketch of the offline stage follows the list):
* Offline testing using saved prompt inputs with new templates
* Automated evaluation using LLMs to assess changes
* Production A/B testing for final validation
* Careful attention to business metrics, since Intercom only charges for successful resolutions
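A minimal sketch of the offline stage under these assumptions: saved prompt inputs are replayed through a candidate template and the outputs are scored by an LLM judge. The data layout and the `generate` / `judge` callables are hypothetical stand-ins for real model calls.

```python
from statistics import mean


def render(template: str, variables: dict) -> str:
    return template.format(**variables)


def offline_eval(saved_inputs, candidate_template, generate, judge) -> float:
    """Return the mean judge score for a candidate prompt template."""
    scores = []
    for case in saved_inputs:
        prompt = render(candidate_template, case["variables"])
        answer = generate(prompt)
        # The judge scores in [0, 1], e.g. "did the answer resolve the question?"
        scores.append(judge(question=case["variables"]["question"], answer=answer))
    return mean(scores)


if __name__ == "__main__":
    saved = [{"variables": {
        "question": "How do I pause my subscription?",
        "context": "Subscriptions can be paused from Billing > Plan.",
    }}]
    template = "Answer using only this context:\n{context}\n\nQuestion: {question}"
    generate = lambda p: "You can pause it from Billing > Plan."          # stand-in LLM
    judge = lambda question, answer: 1.0 if "Billing" in answer else 0.0  # stand-in judge
    print(f"mean score: {offline_eval(saved, template, generate, judge):.2f}")
```

Only templates that hold up in this offline pass would then graduate to a production A/B test.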
### Reliability Engineering
The team improved system reliability through several key strategies:
* Strategic use of streaming vs. non-streaming responses
* Implementation of cross-region inference
* Deployment of multiple model fallbacks (sketched after this list)
* Careful quota management
* Use of both on-demand and provisioned throughput depending on use case
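As referenced in the list, here is a minimal sketch of a model fallback chain, assuming each provider is wrapped in a common `invoke(prompt)` callable; the provider names and ordering are illustrative, not Intercom's actual configuration.

```python
import logging

logger = logging.getLogger("fallbacks")


class AllProvidersFailed(Exception):
    pass


def complete_with_fallbacks(prompt: str, providers: list) -> str:
    """Try each (name, invoke_fn) pair in order, falling through on failure."""
    for name, invoke in providers:
        try:
            return invoke(prompt)
        except Exception as exc:  # timeouts, throttling, 5xx responses, ...
            logger.warning("provider %s failed: %s; trying next", name, exc)
    raise AllProvidersFailed(f"all {len(providers)} providers failed")


if __name__ == "__main__":
    def flaky(prompt: str) -> str:
        raise TimeoutError("regional capacity exhausted")

    def backup(prompt: str) -> str:
        return "fallback answer"

    chain = [("primary-model", flaky), ("secondary-model", backup)]
    print(complete_with_fallbacks("How do I reset my password?", chain))
```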
## Key Technical Decisions and Implementations
### Amazon Bedrock Integration
The decision to use Amazon Bedrock was driven by several factors:
* Data privacy - keeping data within AWS infrastructure
* Access to multiple model providers including Anthropic
* Simplified contractual requirements thanks to the existing AWS relationship
* Cross-region inference capabilities (illustrated in the sketch after this list)
* Built-in monitoring and observability
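A minimal sketch of calling Bedrock through a cross-region inference profile with boto3's Converse API; the profile ID, region, and system prompt are illustrative assumptions rather than Intercom's configuration.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Cross-region inference profiles are addressed like model IDs with a
# geography prefix (here "us."), letting Bedrock route requests across regions.
INFERENCE_PROFILE_ID = "us.anthropic.claude-3-5-sonnet-20240620-v1:0"


def answer_question(question: str, context: str) -> str:
    """Send one grounded question to Bedrock and return the model's text reply."""
    response = client.converse(
        modelId=INFERENCE_PROFILE_ID,
        system=[{"text": "Answer using only the provided help center context."}],
        messages=[{
            "role": "user",
            "content": [{"text": f"Context:\n{context}\n\nQuestion: {question}"}],
        }],
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]
```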
### RAG Implementation
The system uses RAG (Retrieval Augmented Generation) with several production-focused optimizations:
* Multiple index strategies for different data types
* Various chunking strategies based on content type (see the sketch after this list)
* Selective content retrieval rather than bulk context inclusion
* Evaluation tools for testing different chunking and retrieval approaches
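A minimal sketch of per-content-type chunking along these lines; the content types, chunk sizes, and splitting rules are illustrative assumptions.

```python
from typing import List


def chunk_by_paragraph(text: str, max_chars: int = 1200) -> List[str]:
    """Greedily pack paragraphs into chunks of up to max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks


def chunk_qa_pairs(text: str) -> List[str]:
    """Keep each question/answer pair intact as its own chunk."""
    return [pair.strip() for pair in text.split("\n\n") if pair.strip()]


CHUNKERS = {
    "help_article": chunk_by_paragraph,  # long-form docs: pack paragraphs
    "faq": chunk_qa_pairs,               # FAQs: never split a Q/A pair
}


def chunk_document(doc_type: str, text: str) -> List[str]:
    return CHUNKERS.get(doc_type, chunk_by_paragraph)(text)
```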
### Reliability Improvements
The team achieved significant reliability improvements through:
* Strategic use of retries at multiple levels (see the retry sketch after this list)
* Cross-region inference to handle regional capacity issues
* Careful management of streaming vs. non-streaming responses
* Implementation of model fallbacks
* Use of provisioned throughput for critical paths
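A minimal sketch of request-level retries with timeouts and jittered exponential backoff, which can be layered under the fallback chain shown earlier; the `invoke(prompt, timeout=...)` interface and retry parameters are assumptions for illustration.

```python
import random
import time


def call_with_retries(invoke, prompt: str, *, max_attempts: int = 3,
                      base_delay: float = 0.5, timeout: float = 30.0) -> str:
    """Retry transient failures with backoff; re-raise once attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke(prompt, timeout=timeout)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter to avoid synchronized retry storms.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```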
## Cost and Performance Optimization
The team implemented several strategies to optimize costs while maintaining performance:
* Selective use of more expensive models only where needed
* Careful monitoring of token usage and costs
* Implementation of budgeting and tracking per use case (sketched after this list)
* Use of inference profiles for departmental cost allocation
* Strategic balance between on-demand and provisioned throughput
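A minimal sketch of per-use-case token and cost accounting; the price table, model names, and use-case labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical USD prices per 1K input/output tokens, keyed by model.
PRICES = {"cheap-model": (0.0005, 0.0015), "premium-model": (0.003, 0.015)}

usage = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0})


def record_usage(use_case: str, model: str, input_tokens: int, output_tokens: int) -> None:
    """Attribute token usage and estimated cost to a named use case."""
    in_price, out_price = PRICES[model]
    entry = usage[use_case]
    entry["input_tokens"] += input_tokens
    entry["output_tokens"] += output_tokens
    entry["cost_usd"] += (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price


if __name__ == "__main__":
    record_usage("resolution", "premium-model", input_tokens=2400, output_tokens=350)
    record_usage("triage", "cheap-model", input_tokens=800, output_tokens=60)
    for use_case, entry in usage.items():
        print(use_case, f"${entry['cost_usd']:.4f}")
```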
## Lessons Learned
The case study highlights several important lessons for production LLM deployments:
* The importance of comprehensive monitoring and observability
* Need for careful capacity planning and quota management
* Value of basic engineering practices like timeouts and retries
* Importance of having multiple fallback options
* Need to treat different use cases differently based on requirements
The team's experience demonstrates that while LLM serving is still a maturing industry, careful attention to engineering fundamentals, combined with sophisticated monitoring and fallback strategies, can create highly reliable production systems. Their success in improving from 99% to 99.9%+ reliability while maintaining cost effectiveness provides a valuable blueprint for other organizations looking to deploy LLMs in production.