Company
DoorDash
Title
Building an Enterprise LLMOps Stack: Lessons from DoorDash
Industry
E-commerce
Year
2023
Summary (short)
The ML Platform team at DoorDash shares its exploration and strategy for building an enterprise LLMOps stack, discussing the unique challenges of deploying LLM applications at scale. The presentation covers key components needed for production LLM systems, including gateway services, prompt management, RAG implementations, and fine-tuning capabilities, while drawing insights from the LLMOps architectures of industry leaders such as LinkedIn and Uber.
# Building an Enterprise LLMOps Stack at DoorDash

## Overview

This case study presents insights from the DoorDash ML Platform team's journey in developing its LLMOps stack. The presentation, led by the ML Platform team lead, explores the challenges and strategic considerations of implementing LLMs at enterprise scale, while also examining the approaches taken by other major tech companies.

## Key Challenges in Enterprise LLM Implementation

### Unique Technical Challenges

- Inference and serving optimization
- Cost management for high-QPS use cases
- Latency considerations
- Balancing proprietary vs. open-source LLMs
- Infrastructure requirements that vary by model type
- Backup and failover strategies between different model types

### Production Considerations

- Need for effective prompt management systems
- Version control for prompts
- Testing and evaluation frameworks
- Release management processes
- Cost efficiency in deployment

## Core Components of the LLMOps Stack

### Foundation Layer

- Gateway services for unified model access
- Support for both proprietary and open-source LLMs
- Caching mechanisms for high-QPS scenarios
- Cost optimization systems

### RAG Implementation Components

- Vector database management
- Embedding model infrastructure
- Automated pipelines for embedding updates
- Data synchronization systems

### Fine-tuning Infrastructure

- Automated training pipelines
- Template management systems
- Evaluation frameworks
- Cost-efficient training processes

### Operational Tools

- Playground environments for experimentation
- Monitoring and observability systems
- Performance analytics
- Cost tracking and optimization tools

## Industry Insights and Learnings

### LinkedIn's Approach

- Emphasis on gateway services
- Focus on playground environments for innovation
- Strong emphasis on trust and responsible AI
- Internal model hosting for privacy concerns
- Comprehensive trust and safety frameworks

### Uber's Implementation

- Integration with the existing ML platform
- A comprehensive gen AI platform

## Implementation Strategy

### Prioritization Framework

- Use case-driven component selection
- Phased implementation approach
- Focus on high-impact components first
- Scalability considerations from the start

### Key Success Metrics

- Developer velocity
- Reduction in friction points
- Cost management effectiveness
- System reliability and performance

## Best Practices and Recommendations

### Architecture Considerations

- Modular design for flexibility
- Support for multiple model types
- Robust caching strategies
- Comprehensive monitoring systems

### Development Approach

- Start with core components
- Build based on actual use cases
- Focus on automation
- Implement strong evaluation frameworks

### Operational Guidelines

- Maintain clear documentation
- Implement robust testing procedures
- Monitor costs actively
- Ensure scalability of solutions

## Technical Infrastructure Requirements

### Computing Resources

- GPU infrastructure management
- Efficient resource allocation
- Scaling mechanisms
- Cost optimization strategies

### Data Management

- Vector storage solutions
- Embedding management
- Data synchronization
- Version control systems

### Security and Privacy

- Access control mechanisms
- Data privacy safeguards
- Model security considerations
- Compliance frameworks

## Future Considerations

### Scaling Challenges

- Managing growing model sizes
- Handling increasing request volumes
- Cost optimization at scale
- Performance optimization

### Emerging Needs

- Support for new model types
- Integration with emerging tools
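To make the foundation-layer ideas concrete, here is a minimal sketch of a gateway service combining the three patterns the case study highlights: unified access to multiple backends, response caching for high-QPS workloads, and failover from a proprietary model to an open-source one. All class, method, and model names are hypothetical illustrations, not DoorDash's actual implementation.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class LLMGateway:
    """Illustrative gateway: unified model access, caching, and failover.

    `backends` maps a model name to a callable that takes a prompt and
    returns a completion; a real gateway would wrap provider SDKs here.
    """
    backends: Dict[str, Callable[[str], str]]
    fallback_order: List[str]
    cache: Dict[str, str] = field(default_factory=dict)

    def complete(self, prompt: str) -> str:
        # Cache on a hash of the prompt to cut cost for repeated,
        # high-QPS queries.
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.cache:
            return self.cache[key]
        last_error = None
        for model in self.fallback_order:
            try:
                result = self.backends[model](prompt)
            except Exception as exc:
                # Failover: record the error and try the next backend.
                last_error = exc
                continue
            self.cache[key] = result
            return result
        raise RuntimeError("all backends failed") from last_error
```

If the first backend in `fallback_order` raises (say, a proprietary API timeout), the gateway transparently retries against the next one, which is one way to implement the backup strategy between proprietary and open-source models described above.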

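The production considerations above call for version control over prompts so that a release can pin an exact prompt version while development iterates on the latest. A minimal sketch of such a registry is below; the `PromptRegistry` class and its methods are hypothetical, chosen only to illustrate the pattern.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class PromptRegistry:
    """Illustrative append-only prompt store with 1-based versions."""
    _store: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, name: str, template: str) -> int:
        # Append a new version; earlier versions stay retrievable.
        versions = self._store.setdefault(name, [])
        versions.append(template)
        return len(versions)

    def get(self, name: str, version: Optional[int] = None) -> str:
        # Default to the latest version when none is pinned.
        versions = self._store[name]
        return versions[version - 1] if version is not None else versions[-1]

    def render(self, name: str, version: Optional[int] = None, **vars) -> str:
        # Fill template placeholders via str.format.
        return self.get(name, version).format(**vars)
```

Keeping versions append-only makes rollbacks and A/B evaluation of prompt changes straightforward: a release pins `version=1` while an experiment renders the latest.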