Perplexity has built a conversational search engine that combines LLMs with various tools and knowledge sources. The team tackled key challenges in LLM orchestration, including latency optimization, hallucination prevention, and reliable tool integration. Through careful engineering and disciplined prompt management, they reduced query latency from 5-6 seconds to near-instant responses while maintaining high-quality results. The system uses multiple specialized LLMs working together with search indices, tools such as Wolfram Alpha, and custom embeddings to deliver personalized, accurate responses at scale.
# Building a Production LLM System at Perplexity
## Company Overview
Perplexity has developed a conversational search engine that aims to be the fastest way to get answers to any question on the internet. The company was founded by former engineers from Meta, Databricks, and other tech companies, with a mission to become the world's most knowledge-centered company.
## Technical Architecture
### Core Components
- Multiple specialized LLMs working in concert (detailed under LLM Selection below)
- Search infrastructure built on search indices and custom embeddings
- External tools such as Wolfram Alpha for computational queries
- Supporting infrastructure stack for serving and orchestration at scale
### LLM Selection and Usage
- GPT-4 and GPT-3.5 Turbo for main reasoning tasks
- Claude Instant as an alternative for certain workloads
- Custom fine-tuned smaller models like MPT for specific tasks
- Model selection based on rigorous evaluation benchmarks (a routing sketch follows this list)
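The exact routing logic is not public, but benchmark-driven selection across a model portfolio might look like the following minimal sketch. The model names are the ones listed above, while every cost and score figure is invented for illustration:

```python
from dataclasses import dataclass

# Hypothetical illustration of benchmark-driven model routing; the actual
# selection logic used at Perplexity is not public, and all numbers below
# are invented for illustration.
@dataclass
class ModelChoice:
    name: str
    cost_per_1k_tokens: float
    eval_score: float  # score from an internal evaluation benchmark


MODELS = [
    ModelChoice("gpt-4", cost_per_1k_tokens=0.03, eval_score=0.92),
    ModelChoice("gpt-3.5-turbo", cost_per_1k_tokens=0.002, eval_score=0.81),
    ModelChoice("claude-instant", cost_per_1k_tokens=0.0016, eval_score=0.79),
    ModelChoice("mpt-7b-finetuned", cost_per_1k_tokens=0.0005, eval_score=0.74),
]


def pick_model(min_quality: float) -> ModelChoice:
    """Pick the cheapest model that clears a per-task quality bar."""
    eligible = [m for m in MODELS if m.eval_score >= min_quality]
    if not eligible:
        # No model clears the bar: fall back to the strongest one.
        return max(MODELS, key=lambda m: m.eval_score)
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)


print(pick_model(min_quality=0.80).name)  # -> "gpt-3.5-turbo"
```

The design choice here is "cheapest model that clears the bar": the benchmark threshold protects quality while routing easy tasks away from the most expensive model.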
## Engineering Challenges and Solutions
### Latency Optimization
- Initial system had 5-6 seconds of query latency
- Achieved dramatic speedup to near-instant responses through careful engineering and prompt management (two common techniques are sketched below)
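The individual optimizations are not spelled out, so the sketch below illustrates two generic techniques consistent with that result: fanning out independent sub-queries concurrently, and streaming tokens to the user as they arrive. All function bodies are stand-ins, not Perplexity code:

```python
import asyncio

# Minimal latency sketch: run independent sub-queries concurrently and
# stream the model's tokens as soon as they arrive. Everything here is
# a stand-in for real search-index and LLM calls.

async def run_search(sub_query: str) -> list[str]:
    await asyncio.sleep(0.1)  # stand-in for a real search-index call
    return [f"result for {sub_query!r}"]


async def fake_llm_stream(context: list[str]):
    # Stand-in for token-by-token generation from an LLM API.
    for token in ["Answer", " based", " on", f" {len(context)} sources."]:
        await asyncio.sleep(0.02)
        yield token


async def answer(sub_queries: list[str]) -> None:
    # Fan out all sub-queries at once instead of awaiting them one by one,
    # so total search time is the slowest call, not the sum of all calls.
    results = await asyncio.gather(*(run_search(q) for q in sub_queries))
    context = [r for batch in results for r in batch]
    # Stream tokens instead of waiting for the full completion, which makes
    # perceived latency near-instant even for long answers.
    async for token in fake_llm_stream(context):
        print(token, end="", flush=True)
    print()


asyncio.run(answer(["query part 1", "query part 2"]))
```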
### Quality Control and Testing
- Comprehensive evaluation system gating changes to prompts and models (a minimal harness is sketched after this list)
- Disciplined prompt engineering and prompt management
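As a rough illustration of how an evaluation system can gate prompt or model changes, here is a minimal regression-style harness; the benchmark cases, grader, and pass threshold are all hypothetical:

```python
# Hypothetical regression-style eval harness. The benchmark questions,
# string-match grader, and 95% threshold are invented for illustration;
# a production harness would use a much larger set and stronger grading.
BENCHMARK = [
    {"question": "What is the boiling point of water at sea level in Celsius?",
     "must_contain": "100"},
    {"question": "Who wrote 'Pride and Prejudice'?",
     "must_contain": "Austen"},
]


def grade(answer: str, must_contain: str) -> bool:
    # Simple substring check; real systems often use LLM- or human-grading.
    return must_contain.lower() in answer.lower()


def evaluate(model_fn, threshold: float = 0.95) -> bool:
    """Run the benchmark and gate a prompt/model change on the pass rate."""
    passed = sum(grade(model_fn(case["question"]), case["must_contain"])
                 for case in BENCHMARK)
    return passed / len(BENCHMARK) >= threshold


# Usage: evaluate(lambda q: call_model(new_prompt, q)) before shipping a change.
```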
### Tool Integration and Orchestration
- RAG implementation combining search indices, custom embeddings, and tools such as Wolfram Alpha
- Challenges addressed: hallucination prevention, reliable tool integration, and end-to-end latency (a prompt-assembly sketch follows this list)
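A minimal sketch of the prompt assembly such a RAG setup implies: search snippets (and tool output such as a Wolfram Alpha result) are numbered and inlined so the model can cite sources, with an explicit grounding instruction to curb hallucination. The `build_prompt` helper, snippet format, and instruction wording are all assumptions:

```python
# Rough sketch of RAG prompt assembly; the format and wording are
# assumptions, not Perplexity's actual prompts.
def build_prompt(query: str, snippets: list[str], tool_output: str | None) -> str:
    # Number the snippets so the model can emit [n]-style citations,
    # capped at 16 results per query as described above.
    sources = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets[:16]))
    tool_section = f"\nTool result:\n{tool_output}\n" if tool_output else ""
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources inline as [n]. If the sources do not contain the "
        "answer, say so instead of guessing.\n\n"  # grounding instruction
        f"Sources:\n{sources}\n{tool_section}\nQuestion: {query}\nAnswer:"
    )
```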
### Personalization
- AI Profile system for user customization (sketched after this list)
- Prompt engineering for personalization
- Privacy-focused design
- Reduced need for repetitive user instructions
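One plausible reading of the AI Profile feature is a stored, user-editable preference blurb prepended to the system prompt, so users need not repeat instructions in every query. The `AIProfile` schema and field names below are assumptions, not Perplexity's data model:

```python
from dataclasses import dataclass

# Illustrative sketch of an "AI Profile": a persistent, user-editable
# preference blurb injected into the system prompt. Field names and
# prompt wording are assumptions.
@dataclass
class AIProfile:
    preferences: str               # e.g. "I'm a vegetarian engineer in Berlin"
    response_style: str = "concise"


def system_prompt(profile: AIProfile | None) -> str:
    base = "You are a helpful answer engine. Cite your sources."
    if profile is None:
        return base
    # The profile lives server-side with the user's account, so each query
    # is personalized without the user restating their preferences.
    return (f"{base}\nAbout the user: {profile.preferences}\n"
            f"Preferred style: {profile.response_style}")
```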
## Production System Features
### Search and Response Generation
- Breaking complex queries into sub-queries (see the pipeline sketch after this list)
- Processing up to 16 search results per query
- Context-aware follow-up suggestions
- Citation and source tracking
- Multi-language support
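Putting these bullets together, the pipeline might have roughly the following shape. Every helper here is a stub standing in for an LLM or search-index call; the 16-result cap is the one detail taken directly from the list above:

```python
# Hypothetical end-to-end shape of the response pipeline:
# decompose -> search (capped at 16 results) -> cited answer -> follow-ups.
MAX_RESULTS = 16


def decompose_query(question: str) -> list[str]:
    # Stand-in: a real system would ask an LLM to split compound questions.
    return [part.strip() for part in question.split(" and ")]


def search(sub_query: str) -> list[str]:
    return [f"snippet about {sub_query!r}"]  # stand-in for a search-index call


def answer_query(question: str) -> dict:
    sub_queries = decompose_query(question)
    results: list[str] = []
    for sq in sub_queries:
        results.extend(search(sq))
    results = results[:MAX_RESULTS]  # cap the context passed to the model
    # Stand-in for grounded generation with [n]-style citation markers.
    sources = "; ".join(f"[{i + 1}] {r}" for i, r in enumerate(results))
    answer = f"(model answer citing {sources})"
    # Context-aware follow-up suggestions, here derived trivially.
    follow_ups = [f"Tell me more about {sq}" for sq in sub_queries]
    return {"answer": answer, "sources": results, "follow_ups": follow_ups}


print(answer_query("compare GPT-4 and Claude Instant"))
```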
### Scaling and Reliability
- Handles tens to hundreds of thousands of users per minute
- Built for extensibility and maintainability
- Small, efficient engineering team
- Focus on reliability at scale
## Development Practices
### Team Structure and Philosophy
- Small, highly skilled engineering team
- Emphasis on doing more with less
- Direct ownership of entire stack
- Regular dogfooding of the product
### Quality Assurance
- Production testing approach
- User feedback incorporation
- Continuous evaluation
- Regular benchmarking
### Monitoring and Improvement
- Tracking of user complaints and issues
- Quick iteration on problems
- Regular updates to prompts and models
- Performance monitoring across the stack
## Future Directions
- Expanding tool integration capabilities
- Improving mobile experiences
- Adding new languages and features
- Continuing latency optimization
- Exploring new model integrations (e.g., Falcon)