Company
DoorDash
Title
Evolving ML Infrastructure for Production Systems: From Traditional ML to LLMs
Industry
Tech
Year
2025
Summary (short)
A comprehensive overview of ML infrastructure evolution and LLMOps practices at major tech companies, focusing on DoorDash's approach to integrating LLMs alongside traditional ML systems. The discussion covers how ML infrastructure must adapt for LLMs, the importance of maintaining guardrails, strategies for managing errors and hallucinations, and the trade-offs between traditional ML models and LLMs in production environments.
This case study provides insights into the evolution of ML infrastructure and LLMOps practices at major tech companies, with a particular focus on DoorDash's approach to integrating LLMs into production systems. The discussion features Faras Hammad, a machine learning leader at DoorDash with previous experience at Netflix, Meta, Uber, and Yahoo.

The case study begins by examining how different companies approach ML infrastructure. At DoorDash, ML infrastructure is critical because of the breadth and variety of problems requiring ML solutions, from store recommendations to merchant tools and search. The company emphasizes the role of infrastructure in driving innovation, particularly in early-stage areas where quick experimentation and iteration are essential.

A significant portion of the discussion focuses on integrating LLMs into production systems alongside traditional ML models. Key insights include:

* Error Management and Guardrails: LLMs have inherently high error rates and are non-deterministic. DoorDash's approach is to decide where LLMs can be used based on the cost of errors. For instance, they might use LLMs to tag restaurants in long-tail cases where errors have minimal impact, while implementing guardrails to prevent inappropriate content. The company often combines LLMs with traditional models, using LLM outputs as input features or having traditional models act as guardrails on LLM outputs (see the first sketch after this list).

* Infrastructure Adaptations: The infrastructure stack has needed significant adaptation to support LLMs. Traditional ML systems typically follow a standard pattern in which a model receives features from a feature store and returns predictions, whereas LLMs introduce new patterns such as RAG architectures and context windows that require different infrastructure (the second sketch after this list contrasts the two request paths). The company has had to rethink model serving, data storage, and compute resources.

* Integration Strategies: DoorDash's approach to LLM integration is pragmatic, recognizing that LLMs aren't suitable for every use case. The company maintains traditional ML models (like tree-based methods) for cases requiring high explainability or low latency, or where simple models perform adequately, and it weighs cost carefully, since LLMs are significantly more expensive to run and deploy.

* Infrastructure Evolution: ML infrastructure needs to support both traditional models and LLMs while remaining flexible enough to adapt to rapid change in the field. This includes:
  * The ability to easily incorporate open-source models and tools
  * Support for different deployment patterns and serving architectures
  * Flexibility to handle both traditional feature-based approaches and newer patterns like RAG
  * Infrastructure for managing prompts and context windows
  * Tools for evaluating and monitoring non-deterministic systems
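To ground the guardrail pattern from the first bullet, here is a minimal sketch: an LLM proposes restaurant tags, and a cheap traditional check gets veto power over each suggestion before anything reaches production. Every name here (call_llm, is_inappropriate, tag_restaurant) is an illustrative stand-in rather than a DoorDash API; the stubs mark where a real LLM client and a trained classifier would sit.

```python
from typing import List


def call_llm(prompt: str) -> List[str]:
    # Stand-in for a real LLM client call; assume it returns candidate tags.
    # In practice this would be a network call with retries and timeouts.
    return ["family-friendly", "late-night", "vegan options"]


BLOCKLIST = frozenset({"offensive"})


def is_inappropriate(tag: str) -> bool:
    # Stand-in guardrail. In production this could be a small traditional
    # classifier trained on labeled tags rather than a static blocklist.
    return tag.lower() in BLOCKLIST


def tag_restaurant(description: str) -> List[str]:
    candidates = call_llm(f"Suggest tags for this restaurant: {description}")
    # The guardrail model gets the final say on every LLM suggestion.
    return [tag for tag in candidates if not is_inappropriate(tag)]


if __name__ == "__main__":
    print(tag_restaurant("24-hour diner with a large plant-based menu"))
```

The same shape also works in the other direction described above: the surviving tags could just as easily feed a traditional ranking model as input features.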
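And to ground the serving-pattern point, here is a sketch contrasting the two request paths: a point lookup against a feature store feeding a fixed model, versus a RAG path that retrieves context and assembles a prompt. The dict-backed feature store, the in-memory corpus, and the keyword matching are stand-ins included only so the sketch runs; the shape of each path is the point.

```python
from dataclasses import dataclass
from typing import Dict, List

# --- Traditional pattern: fixed features in, score out --------------------
FEATURE_STORE: Dict[str, Dict[str, float]] = {
    "store_123": {"avg_rating": 4.6, "delivery_eta_min": 22.0},
}


def traditional_predict(entity_id: str) -> float:
    features = FEATURE_STORE[entity_id]  # point lookup by entity key
    # Stand-in for a tree-based model scoring a fixed feature vector.
    return 0.7 * features["avg_rating"] - 0.01 * features["delivery_eta_min"]


# --- RAG pattern: retrieve context, assemble a prompt, call the model -----
@dataclass
class Doc:
    text: str


CORPUS: List[Doc] = [
    Doc("Store 123 is a 24-hour diner with a large plant-based menu."),
    Doc("Store 456 is a taqueria that closes at 9pm."),
]


def retrieve(query: str, k: int = 1) -> List[Doc]:
    # Stand-in retrieval: real systems embed the query and search a vector index.
    words = query.lower().split()
    return [d for d in CORPUS if any(w in d.text.lower() for w in words)][:k]


def rag_prompt(query: str) -> str:
    context = "\n".join(d.text for d in retrieve(query))
    # A real system would send this assembled prompt to an LLM here.
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    print(traditional_predict("store_123"))
    print(rag_prompt("which store is open late"))
```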
The discussion also covers the organizational aspects of LLMOps, emphasizing cross-functional collaboration between infrastructure engineers, data scientists, and ML engineers. DoorDash's approach involves:

* Early involvement of platform teams in ML projects
* Infrastructure teams using their own products to better understand user needs
* Balancing centralized and embedded team structures
* Supporting both advanced users and non-specialists through appropriate tooling

Looking forward, the case study suggests that the limitations of current LLMs will become more apparent, leading to a more balanced approach in which LLMs and traditional ML models coexist in production systems. The key recommendation is to design infrastructure with change in mind, allowing rapid adaptation as the field evolves.

The case study also highlights the importance of open-source tools in modern ML infrastructure, with Metaflow specifically mentioned as providing the right level of abstraction for ML workflows: teams can focus on model development while infrastructure complexity is abstracted away (a minimal flow sketch appears after the conclusion below).

A particular emphasis is placed on testing and evaluation of LLM-based systems. Traditional software engineering testing paradigms, which expect 100% pass rates, don't apply; instead, teams need new approaches to monitoring and evaluating probabilistic systems in production (see the second sketch below).

Overall, the case study provides valuable insight into how a major tech company is pragmatically approaching the integration of LLMs into production systems, maintaining existing ML infrastructure while adapting to the new requirements and challenges posed by generative AI.
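Two short sketches close out the writeup. First, the Metaflow point: a flow is an ordinary Python class whose @step methods are chained with self.next(), which is roughly the level of abstraction the discussion praises. The flow below is a toy, hypothetical evaluation pipeline, not anything DoorDash is known to run.

```python
from metaflow import FlowSpec, step


class TaggingEvalFlow(FlowSpec):
    """Toy flow: load a labeled sample, score it, report a pass rate."""

    @step
    def start(self):
        # A real flow would pull a labeled evaluation set from storage.
        self.examples = [("24-hour diner", "late-night"),
                         ("salad bar", "vegan options")]
        self.next(self.score)

    @step
    def score(self):
        # Placeholder check; a real step would call the model under evaluation.
        self.passed = sum(1 for _, expected in self.examples if expected)
        self.next(self.end)

    @step
    def end(self):
        print(f"pass rate: {self.passed / len(self.examples):.0%}")


if __name__ == "__main__":
    TaggingEvalFlow()
```

Run it with `python tagging_eval_flow.py run`; Metaflow handles artifact passing between steps (the self.* assignments) and can move the same steps onto remote compute (e.g. via --with batch) without rewriting them.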
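Second, the evaluation point: instead of asserting that every test case passes, assert that the pass rate over a sample clears an agreed threshold. The 90% bar, the flaky_model stub, and the substring check below are assumptions chosen purely for illustration; real checks would be task-specific (exact match, rubrics, an LLM-as-judge, etc.).

```python
import random
from typing import Callable, List, Tuple


def evaluate(model: Callable[[str], str],
             cases: List[Tuple[str, str]],
             threshold: float = 0.90) -> bool:
    # A case passes if the expected string appears in the model output.
    passed = sum(1 for prompt, expected in cases if expected in model(prompt))
    rate = passed / len(cases)
    print(f"pass rate {rate:.0%} (threshold {threshold:.0%})")
    return rate >= threshold


def flaky_model(prompt: str) -> str:
    # Stand-in for a non-deterministic LLM: correct roughly 95% of the time.
    return prompt if random.random() < 0.95 else ""


if __name__ == "__main__":
    cases = [(f"case {i}", f"case {i}") for i in range(200)]
    print("ship" if evaluate(flaky_model, cases) else "block the release")
```

The same function doubles as a production monitor by sampling live traffic instead of a fixed test set and alerting when the rate drifts below the threshold.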
