Voiceflow, a chatbot and voice assistant platform, integrated large language models into their existing infrastructure while maintaining custom models for specific tasks. They used OpenAI's API for generative features but kept their custom NLU model for intent and entity detection, where it offered superior performance and cost-effectiveness. The company implemented extensive testing frameworks, prompt engineering, and error handling while dealing with challenges such as latency variations and JSON formatting issues.
# Voiceflow's Journey with LLMs in Production
## Company Overview
Voiceflow is a self-serve platform for building chat and voice assistants. After nearly five years in operation, the company recently added generative AI features powered by large language models. The platform serves multiple verticals, including automotive, retail, and banking.
## LLM Integration Strategy
### Platform Features
- AI playground for experimenting with different language models
- Prompt chaining capabilities
- Knowledge base feature for FAQ creation
- Mixed approach combining traditional chatbot features with LLM capabilities
### Infrastructure Decisions
- Opted to use OpenAI API instead of self-hosting LLMs
- Built an ML Gateway service to front both LLM and traditional model connections (a minimal sketch follows this list)
- Implemented prompt validation, rate limiting, and usage tracking
- Added support for multiple providers (OpenAI, Anthropic's Claude)
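The post does not describe the gateway's internals, but the shape is familiar. Below is a minimal, hypothetical Python sketch of a gateway that routes requests to pluggable providers with basic validation, fixed-window rate limiting, and usage tracking; all class names, limits, and the word-count token estimate are illustrative assumptions, not Voiceflow's actual design.

```python
# Hypothetical sketch of a minimal "ML Gateway" that routes requests to
# either a hosted LLM provider or an in-house model, with naive rate
# limiting and usage tracking. Names and limits are illustrative only.
import time
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class UsageTracker:
    tokens_by_tenant: dict = field(default_factory=lambda: defaultdict(int))

    def record(self, tenant: str, tokens: int) -> None:
        self.tokens_by_tenant[tenant] += tokens

class RateLimiter:
    """Fixed-window limiter: at most `limit` requests per tenant per minute."""
    def __init__(self, limit: int = 60):
        self.limit = limit
        self.windows = {}  # tenant -> (window, count)

    def allow(self, tenant: str) -> bool:
        window = int(time.time() // 60)
        w, count = self.windows.get(tenant, (window, 0))
        if w != window:
            w, count = window, 0
        if count >= self.limit:
            return False
        self.windows[tenant] = (w, count + 1)
        return True

class MLGateway:
    def __init__(self, providers: dict):
        self.providers = providers  # name -> callable(prompt) -> str
        self.limiter = RateLimiter()
        self.usage = UsageTracker()

    def complete(self, tenant: str, provider: str, prompt: str) -> str:
        if provider not in self.providers:
            raise ValueError(f"unknown provider: {provider}")
        if not prompt.strip():
            raise ValueError("empty prompt rejected by validation")
        if not self.limiter.allow(tenant):
            raise RuntimeError("rate limit exceeded")
        text = self.providers[provider](prompt)
        # Crude word-count stand-in for real token accounting.
        self.usage.record(tenant, tokens=len(prompt.split()) + len(text.split()))
        return text

# Real code would call the OpenAI/Anthropic SDKs inside these callables.
gateway = MLGateway(providers={
    "openai": lambda p: f"[openai completion for: {p[:20]}...]",
    "claude": lambda p: f"[claude completion for: {p[:20]}...]",
    "nlu":    lambda p: "[intent=greeting confidence=0.97]",
})
print(gateway.complete("tenant-a", "openai", "Hello there"))
```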
## Technical Challenges and Solutions
### JSON Generation and Error Handling
- Encountered issues getting LLMs to emit clean, parseable JSON
- Implemented prompt engineering solutions
- Added regex and handwritten rules to repair formatting (a sketch follows this list)
- Created a comprehensive error tracking system
- Developed a test suite for prompt verification
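The exact rules are not published, but "regex and handwritten rules" for JSON cleanup typically look something like the sketch below: strip code fences, isolate the outermost object, and remove trailing commas before parsing. The three rules here are assumed for illustration, not Voiceflow's actual rules.

```python
# Hedged sketch of regex-plus-handwritten-rules cleanup for when an LLM
# is asked for JSON but wraps it in prose or markdown code fences.
import json
import re

def extract_json(raw: str) -> dict:
    # Rule 1: strip markdown code fences the model sometimes adds.
    raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    # Rule 2: grab the outermost {...} block, ignoring prose around it.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError(f"no JSON object found in: {raw!r}")
    candidate = match.group(0)
    # Rule 3: remove trailing commas, a common LLM formatting mistake.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return json.loads(candidate)

llm_output = 'Sure! Here is the result:\n```json\n{"intent": "refund", "confidence": 0.92,}\n```'
print(extract_json(llm_output))  # {'intent': 'refund', 'confidence': 0.92}
```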
### Fine-tuning vs Few-shot Learning
- Experimented with fine-tuning smaller OpenAI models
- Found performance degraded compared to larger base models
- Generated training data for the fine-tuning experiments
- Discovered the cost implications of few-shot learning in production: every request pays again for the in-context examples (see the arithmetic below)
- Carefully balanced token usage against performance
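The few-shot trade-off comes down to simple arithmetic, since the in-context examples ride along on every request. The rates and token counts below are assumptions chosen for illustration, not actual OpenAI pricing or Voiceflow's numbers.

```python
# Back-of-the-envelope math for the few-shot cost trade-off: every
# request re-sends the example shots. All constants are assumed.
PRICE_PER_1K_INPUT_TOKENS = 0.0015   # assumed rate, USD
shot_tokens = 120                    # tokens per in-context example (assumed)
n_shots = 5
base_prompt_tokens = 200
requests_per_month = 1_000_000

overhead = n_shots * shot_tokens     # 600 extra tokens on every call
monthly_base = base_prompt_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS * requests_per_month
monthly_shots = overhead / 1000 * PRICE_PER_1K_INPUT_TOKENS * requests_per_month
print(f"base prompts:      ${monthly_base:,.0f}/month")   # $300/month
print(f"few-shot overhead: ${monthly_shots:,.0f}/month")  # $900/month
```

Under these assumed numbers, the few-shot examples triple the input-token bill, which is why the token-usage-versus-performance balance matters at production volume.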
### Model Selection and Performance
- Tested multiple model variants (GPT-3.5, GPT-4, Claude)
- Found GPT-4 too slow for their use cases
- Implemented model selection guidelines for users
- Developed an internal LLM testing framework (a minimal version is sketched below)
- Created documentation to help users choose appropriate models
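A prompt testing framework can start as little more than a table of cases asserted against model output. The sketch below is a hypothetical minimal version, not Voiceflow's framework; `run_prompt` is a stub standing in for a real gateway call, and the test cases are invented.

```python
# Sketch of a lightweight prompt regression test. The cases, model
# names, and the `run_prompt` stub are all hypothetical.
import json

def run_prompt(model: str, prompt: str) -> str:
    """Stand-in for a real API call through the gateway."""
    return '{"intent": "cancel_order"}'

TEST_CASES = [
    # (model, user message, expected intent)
    ("gpt-3.5-turbo", "I want to cancel my order", "cancel_order"),
    ("claude-instant", "please cancel order 123", "cancel_order"),
]

def test_intent_prompts():
    for model, message, expected in TEST_CASES:
        raw = run_prompt(model, f"Classify the intent of: {message}")
        assert json.loads(raw)["intent"] == expected, (model, message)

test_intent_prompts()
print("all prompt tests passed")
```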
### Latency and Reliability
- Observed significant latency fluctuations with the OpenAI API
- Benchmarked Azure OpenAI against the standard OpenAI endpoint (a minimal harness is sketched after this list)
- Found the Azure version roughly 3x faster, with a lower standard deviation in latency
- Dealt with downstream dependency challenges
- Implemented multi-model success criteria
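A benchmark of this kind needs little more than repeated timed calls and summary statistics. In the sketch below, `call_endpoint` simulates two latency profiles rather than hitting real APIs; the simulated numbers are illustrative, not Voiceflow's measurements.

```python
# Minimal latency benchmark harness: time the same call repeatedly and
# compare mean, standard deviation, and tail latency per endpoint.
import random
import statistics
import time

def call_endpoint(name: str) -> None:
    # Placeholder: simulate differing latency profiles instead of real calls.
    base = 0.03 if name == "azure" else 0.09
    time.sleep(base + random.expovariate(1 / 0.01))

def benchmark(name: str, n: int = 20) -> None:
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call_endpoint(name)
        samples.append((time.perf_counter() - start) * 1000)  # ms
    samples.sort()
    p99 = samples[min(n - 1, int(n * 0.99))]
    print(f"{name}: mean={statistics.mean(samples):.0f}ms "
          f"stdev={statistics.stdev(samples):.0f}ms p99={p99:.0f}ms")

for endpoint in ("azure", "openai"):
    benchmark(endpoint)
```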
## Custom Models vs LLMs
### NLU Model Implementation
- Maintained custom NLU model for intent/entity detection
- Achieved superior performance compared to GPT-4
- Custom model showed a roughly 1000x cost advantage over GPT-4
- Maintained consistent 16-18 ms latency
- Implemented a Redis queue for improved performance (sketched below)
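The post mentions a Redis queue without detailing its design. One plausible minimal shape is a Redis list used as a work queue between the API layer and NLU workers, as sketched below; the key names and the stand-in inference result are assumptions, and running it requires a local Redis server plus the `redis` package.

```python
# Hedged sketch of fronting an NLU model with a Redis list as a work queue.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
QUEUE = "nlu:requests"

def enqueue(utterance: str, request_id: str) -> None:
    r.lpush(QUEUE, json.dumps({"id": request_id, "text": utterance}))

def worker_loop() -> None:
    """Blocking consumer: pop a request, run inference, store the result."""
    while True:
        _, payload = r.brpop(QUEUE)
        job = json.loads(payload)
        result = {"intent": "greeting", "confidence": 0.97}  # stand-in for the model
        r.set(f"nlu:result:{job['id']}", json.dumps(result), ex=60)

enqueue("hello there", request_id="req-1")
# worker_loop()  # run in a separate process or thread in practice
```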
### Architecture Evolution
- Initially used Pub/Sub architecture
- Discovered limitations with high P99 latencies
- Re-architected to meet performance requirements
- Achieved target P50 and P99 metrics
- Outperformed industry standards
### Platform Design Considerations
- Built flexible architecture supporting multiple model types
- Implemented schema validation (illustrated after this list)
- Supported multi-region deployment
- Balanced managed services vs self-hosted solutions
- Created testing frameworks for both technical and non-technical users
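For schema validation, a library like pydantic makes the pattern compact. The model below and its fields are illustrative; the post does not say which validation library Voiceflow uses.

```python
# Illustrative schema validation for model outputs using pydantic.
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class IntentPrediction(BaseModel):
    intent: str
    confidence: float = Field(ge=0.0, le=1.0)

def validate_prediction(payload: dict) -> Optional[IntentPrediction]:
    try:
        return IntentPrediction(**payload)
    except ValidationError as err:
        print(f"rejected malformed model output: {err}")
        return None

print(validate_prediction({"intent": "refund", "confidence": 0.92}))
print(validate_prediction({"intent": "refund", "confidence": 1.7}))  # out of range
```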
## Cost and Performance Optimization
### API Integration Costs
- Carefully monitored token usage
- Considered cost implications of different models
- Balanced performance vs cost for different use cases
- Implemented usage tracking (a token-counting sketch follows this list)
- Provided cost guidance to platform users
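Usage tracking usually starts with counting tokens per request. The sketch below uses tiktoken for the counting; the blended price constant is an assumption for illustration, not a quoted rate.

```python
# Sketch of per-request usage tracking with tiktoken for token counts.
import tiktoken

PRICE_PER_1K_TOKENS = 0.002  # assumed blended rate, USD
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def track_usage(prompt: str, completion: str) -> float:
    tokens = len(enc.encode(prompt)) + len(enc.encode(completion))
    cost = tokens / 1000 * PRICE_PER_1K_TOKENS
    print(f"{tokens} tokens -> ${cost:.5f}")
    return cost

track_usage("Summarize this support ticket...", "The customer wants a refund.")
```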
### Deprecated vs New Features
- Deprecated the in-house utterance recommendation model in favor of an LLM-based solution
- Maintained custom NLU model due to superior performance
- Balanced build vs buy decisions
- Considered maintenance costs of different approaches
## Lessons Learned
- LLMs are not the best solution for every task
- Infrastructure decisions need regular review as technology evolves
- Important to maintain flexibility in architecture
- Cost considerations crucial for production deployment
- Testing and monitoring essential for production reliability
- Balance between custom solutions and managed services needs regular evaluation