Character.ai scaled its open-domain conversational AI platform from roughly 300 to over 30,000 generations per second within 18 months, becoming the third most-used generative AI application globally. Along the way the team tackled engineering challenges around data volume, cost optimization, and connection management while maintaining performance. Their solution combined custom model architectures, efficient GPU caching strategies, and purpose-built prompt management tooling, balancing performance, latency, and cost at scale.
Character.ai is one of the most instructive case studies in scaling LLMs for production use, demonstrating both the challenges and the solutions involved in operating large language models at massive scale. The company, founded by former Google Brain researchers including a co-author of the original Transformer paper ("Attention Is All You Need"), built a platform that lets users create and interact with AI characters through natural-language conversations.
## Platform Overview and Growth
Character.ai launched in October 2022, shortly before ChatGPT, and experienced exponential growth, scaling from 300 generations per second at launch to over 30,000 by summer 2023. The platform hosts over 200 million community-created characters, each of which essentially functions as a specialized prompt. Unlike many competitors, Character.ai built its own foundation models from the ground up, giving the team unusual control over model architecture and optimization.
## Technical Architecture and Scaling Challenges
The company faced three major challenges unique to scaling their LLM application:
### 1. Data Volume Management
Compared with a traditional social application, the system moves an unusually large volume of data per request. Each message generation pulls the entire chat history, producing 7-8 GB per second of network traffic on the main generation path alone. This required careful optimization of the networking layers and caching strategies, as sketched below.
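The case study doesn't detail this layer, but the standard pattern is to keep hot conversation transcripts in a cache so the full history isn't re-fetched from the primary store on every generation. The sketch below is a minimal illustration using Redis; the key scheme, TTL, and `fetch_from_db` hook are assumptions, not Character.ai's actual design.

```python
import json
import redis  # assumes a Redis cache fronting the primary message store

r = redis.Redis(host="localhost", port=6379)
HISTORY_TTL_S = 3600  # illustrative TTL, not a published value

def get_chat_history(chat_id: str, fetch_from_db) -> list[dict]:
    """Serve hot chat transcripts from cache instead of re-pulling the DB."""
    key = f"chat_history:{chat_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    history = fetch_from_db(chat_id)  # cache miss: fall back to primary store
    r.set(key, json.dumps(history), ex=HISTORY_TTL_S)
    return history

def append_message(chat_id: str, message: dict) -> None:
    """Keep the cached transcript in sync as each new message arrives."""
    key = f"chat_history:{chat_id}"
    cached = r.get(key)
    history = json.loads(cached) if cached else []
    history.append(message)
    r.set(key, json.dumps(history), ex=HISTORY_TTL_S)
```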
### 2. Cost and Latency Optimization
Serving 30,000 messages per second requires significant GPU resources. The team made several notable optimizations:
* Adoption of multi-query attention (MQA), which cut KV-cache memory usage roughly 5x (a minimal sketch follows this list)
* Development of a sophisticated GPU caching system achieving a 95% cache hit ratio
* Architectural decisions made during pre-training specifically to optimize production serving
* Careful balancing of model parameter count against performance
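MQA saves memory because all query heads share a single key/value head, so the KV cache stores one key and one value vector per token rather than one per head, roughly an n_heads-fold reduction. The PyTorch module below is a minimal sketch of the idea with illustrative dimensions, not Character.ai's production implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiQueryAttention(nn.Module):
    """Multi-query attention: many query heads, one shared K/V head.

    Only a single head's worth of keys/values exists per token, so a
    KV cache built on this layer is ~n_heads times smaller than with
    standard multi-head attention.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)      # per-head queries
        self.k_proj = nn.Linear(d_model, self.d_head)  # single shared K head
        self.v_proj = nn.Linear(d_model, self.d_head)  # single shared V head
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # expand() shares memory: the one K/V head serves all query heads.
        k = self.k_proj(x).view(b, 1, t, self.d_head).expand(-1, self.n_heads, -1, -1)
        v = self.v_proj(x).view(b, 1, t, self.d_head).expand(-1, self.n_heads, -1, -1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(attn.transpose(1, 2).reshape(b, t, -1))
```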
### 3. Connection Management
With average response times of 12.5-25 seconds, the system must keep hundreds of thousands of connections open simultaneously. This created its own set of challenges (a sketch of the async pattern follows this list):
* Scaling the number of pods efficiently
* Keeping individual connections lightweight
* Managing pressure on the Redis cache and networking layers
* Allocating resources carefully across services
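The write-up doesn't name the server stack; a common way to hold hundreds of thousands of long-lived connections cheaply is an async event loop, where each open connection is a coroutine rather than a thread. A minimal asyncio sketch, with a stand-in `generate_tokens` backend:

```python
import asyncio

async def generate_tokens(prompt: str):
    """Stand-in for the model backend; yields tokens as they are produced."""
    for token in ["Hello", ",", " world"]:
        await asyncio.sleep(0.5)  # simulate a long generation, token by token
        yield token

async def handle_connection(reader: asyncio.StreamReader,
                            writer: asyncio.StreamWriter) -> None:
    """One coroutine per connection: cheap enough to hold many thousands open."""
    prompt = (await reader.readline()).decode()
    async for token in generate_tokens(prompt):
        writer.write(token.encode())
        await writer.drain()  # stream tokens back as they arrive
    writer.close()
    await writer.wait_closed()

async def main() -> None:
    server = await asyncio.start_server(handle_connection, "0.0.0.0", 8080)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```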
## Production Infrastructure Innovations
Character.ai developed several key tools and approaches for managing their production environment:
### Prompt Management System
They created and later open-sourced "prompt-poet" (originally called "homies"), a sophisticated prompt management system that:
* Uses YAML and Jinja templating for structured prompt management (illustrated after this list)
* Implements smart truncation strategies for context window management
* Enables systematic prompt experimentation and testing
* Facilitates prompt version control and quality assurance
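prompt-poet's full API is documented in its repository; the snippet below is only a hedged illustration of the YAML-plus-Jinja pattern it embodies, rendering a Jinja template and parsing the result into structured prompt parts. The part names and the `truncation_priority` field shown here are invented for the example.

```python
import yaml
from jinja2 import Template

# Illustrative template in the YAML-plus-Jinja style; field names invented.
RAW_TEMPLATE = """
- name: system_instructions
  role: system
  content: |
    Your name is {{ character_name }}. {{ persona }}
- name: chat_history
  role: user
  truncation_priority: 1  # e.g. drop oldest history first when over budget
  content: |
    {% for message in history %}{{ message }}
    {% endfor %}
"""

rendered = Template(RAW_TEMPLATE).render(
    character_name="Ada",
    persona="You are a patient math tutor.",
    history=["user: hi", "assistant: hello!"],
)
parts = yaml.safe_load(rendered)  # -> list of structured prompt parts
for part in parts:
    print(part["role"], "::", part["content"].strip()[:60])
```

Separating structure (YAML) from interpolation (Jinja) is what makes the truncation, versioning, and experimentation workflows above tractable: each prompt part is addressable data rather than an opaque string.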
### Database Optimization
The team encountered and solved several critical database challenges:
* Dealt with a critical PostgreSQL transaction ID (XID) wraparound issue that nearly caused a week-long outage (a monitoring sketch follows this list)
* Implemented database sharding strategies
* Optimized for high-volume transaction processing
* Developed specialized backup and recovery procedures
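The XID failure mode is well documented: PostgreSQL transaction IDs are 32-bit counters, and if autovacuum cannot freeze old tuples fast enough, the database eventually stops accepting writes to prevent wraparound. A standard guard is monitoring XID age per database; the query below is stock PostgreSQL, wrapped in an illustrative psycopg2 script (connection details and the alert threshold are assumptions).

```python
import psycopg2

# Connection parameters are illustrative placeholders.
conn = psycopg2.connect("dbname=app user=monitor host=localhost")

# age(datfrozenxid) counts transactions since the last aggressive freeze;
# Postgres refuses new writes as this approaches the ~2 billion hard limit.
QUERY = """
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
"""

ALERT_THRESHOLD = 1_500_000_000  # alert well before the ~2B hard limit

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for datname, xid_age in cur.fetchall():
        status = "ALERT" if xid_age > ALERT_THRESHOLD else "ok"
        print(f"{status}\t{datname}\t{xid_age:,}")
```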
### Quality and Performance Monitoring
The platform implements comprehensive monitoring and testing:
* Systematic A/B testing for prompt and model changes (a minimal bucketing sketch follows this list)
* Engagement metrics tracking
* Quality regression testing
* Performance impact analysis for even minor changes
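The case study doesn't describe the experimentation framework itself; a common building block for systematic A/B testing is deterministic bucketing, so each user sees a stable experiment arm without any stored state. A minimal sketch:

```python
import hashlib

def assign_arm(user_id: str, experiment: str, arms: list[str]) -> str:
    """Deterministically assign a user to an experiment arm.

    Hashing (experiment, user_id) keeps assignment stable across requests
    without storing state, and independent across different experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]

# Example: route 50/50 between the current and a candidate prompt template.
arm = assign_arm("user-42", "prompt_v2_rollout", ["control", "treatment"])
print(arm)
```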
## Lessons Learned and Best Practices
Several key insights emerged from their scaling journey:
### Architecture Decisions
* Moving from a monolith to a services architecture
* Importance of efficient caching strategies at multiple levels
* Need for specialized tools for prompt management and testing
* Critical role of database architecture in scaling
### Resource Management
* Careful balancing of research vs. production resource allocation
* Trading off between latency, cost, and performance
* Importance of efficient GPU utilization and caching
* Need for sophisticated connection management strategies
### Testing and Quality Assurance
* Importance of comprehensive testing beyond just functional correctness
* Need for both quantitative metrics and qualitative "vibe checks"
* Value of systematic A/B testing for all changes
* Importance of monitoring user engagement metrics
The Character.ai case study demonstrates the complexity of running LLMs at scale and the need for innovative solutions across the entire stack. Their experience shows that successful LLMOps requires not just model optimization, but also sophisticated infrastructure, tooling, and monitoring systems. The team's willingness to build custom solutions when needed, while also leveraging existing tools where appropriate, provides valuable lessons for others building large-scale LLM applications.