This case study examines how Lyft evolved their machine learning platform to support generative AI infrastructure, providing valuable insights into the challenges and solutions of implementing LLMs in a production environment at scale.
Lyft has maintained a substantial ML presence, with over 50 model types spread across more than 100 GitHub repositories and more than 1,000 unique trained models served company-wide. Their approach differs from similarly sized companies in favoring breadth over depth: numerous teams each manage their own models. These models support functions including location suggestions, ETAs, pricing, routing, and fraud detection, with some handling over 10,000 requests per second.
The company's ML platform was initially designed around a comprehensive lifecycle approach, supporting models from ideation through deployment and ongoing maintenance. Their journey to supporting GenAI began with their existing ML infrastructure, which had already evolved to handle various model types and frameworks.
Lyft built a proxy system that routes all LLM traffic through their existing ML serving infrastructure. They wrapped vendor client libraries (such as OpenAI's) so that callers keep the familiar vendor interface while the platform controls the transport layer, an approach that preserved the developer experience while giving the platform team central control over how requests reach external vendors.
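The case study doesn't show code, but a minimal Python sketch of this wrapping pattern might look like the following; the proxy URL, environment variable names, and model are illustrative assumptions, not Lyft's actual values:

```python
import os

from openai import OpenAI  # vendor client library being wrapped


class ProxiedOpenAI(OpenAI):
    """OpenAI client that keeps the vendor's interface but sends all
    traffic through an internal LLM proxy (hypothetical endpoint)."""

    def __init__(self, **kwargs):
        super().__init__(
            # Point the transport at the internal proxy instead of
            # api.openai.com; the interface callers see is unchanged.
            base_url=os.environ.get(
                "LLM_PROXY_URL", "https://llm-proxy.internal.example.com/v1"
            ),
            # The proxy can inject real vendor credentials server-side,
            # so callers only carry an internal service token.
            api_key=os.environ.get("INTERNAL_SERVICE_TOKEN", "unset"),
            **kwargs,
        )


# Application code is identical to using the vendor SDK directly.
client = ProxiedOpenAI()
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Suggest a pickup spot near LAX."}],
)
print(reply.choices[0].message.content)
```

Because a wrapper like this subclasses the vendor client rather than reimplementing it, new SDK features keep working and only the transport-level behavior changes.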
To address the growing usage of LLMs, Lyft developed a comprehensive evaluation framework organized into three main categories of checks.
The framework allows for both automated and LLM-based evaluations of responses, including checks for harmful content and response completeness. This modular approach enables teams to implement specific guardrails and quality metrics for their use cases.
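The framework's internals aren't shown in the case study; a minimal sketch of such a modular harness, with the check names and the stubbed LLM judge as assumptions, could look like this:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    name: str
    passed: bool
    detail: str = ""


# An evaluation is any callable taking (prompt, response) to an EvalResult,
# whether it is a cheap automated check or an LLM-judged one.
Evaluation = Callable[[str, str], EvalResult]


def check_completeness(prompt: str, response: str) -> EvalResult:
    """Automated check: the response must not be empty (a stand-in
    for a real completeness heuristic)."""
    return EvalResult("completeness", bool(response.strip()))


def check_harmful_content(prompt: str, response: str) -> EvalResult:
    """Automated check: crude harmful-content screen via a blocklist."""
    blocklist = {"badword"}  # illustrative only
    hits = [term for term in blocklist if term in response.lower()]
    return EvalResult("harmful_content", not hits, detail=", ".join(hits))


def llm_judge(prompt: str, response: str) -> EvalResult:
    """LLM-based check: ask a judge model whether the response actually
    answers the prompt. Stubbed here; a real version would call the proxy."""
    verdict = "yes"  # placeholder for a judge-model call
    return EvalResult("llm_judge", verdict == "yes")


def run_evaluations(prompt: str, response: str,
                    evals: list[Evaluation]) -> list[EvalResult]:
    """Run whichever guardrails a team has registered for its use case."""
    return [evaluate(prompt, response) for evaluate in evals]


for result in run_evaluations(
    "Where is my driver?",
    "Your driver is two minutes away.",
    [check_completeness, check_harmful_content, llm_judge],
):
    print(result)
```

The modularity comes from treating every check, automated or LLM-based, as the same callable shape, so teams can compose their own lists per use case.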
Lyft is also developing a higher-level interface for AI applications that wraps core LLM functionality together with common supporting features.
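The case study doesn't specify what this interface looks like; one plausible shape, with every name here hypothetical, bundles a prompt template, an LLM client, and guardrail checks behind a single call:

```python
from typing import Callable, Sequence


class AIApplication:
    """Hypothetical high-level wrapper: a prompt template, an LLM client,
    and post-response guardrails bundled behind one call."""

    def __init__(
        self,
        client,                      # e.g. the proxied client sketched above
        template: str,               # prompt template with {placeholders}
        guardrails: Sequence[Callable[[str, str], bool]] = (),
        model: str = "gpt-4o-mini",
    ):
        self.client = client
        self.template = template
        self.guardrails = guardrails
        self.model = model

    def run(self, **variables) -> str:
        prompt = self.template.format(**variables)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.choices[0].message.content
        # Fail closed: reject the response if any guardrail returns False.
        if not all(check(prompt, text) for check in self.guardrails):
            raise ValueError("response rejected by a guardrail")
        return text


# Usage sketch:
# summarizer = AIApplication(ProxiedOpenAI(), "Summarize: {text}")
# print(summarizer.run(text="Rider reported a billing issue..."))
```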
Lyft has implemented several production use cases leveraging their GenAI infrastructure:
Their flagship implementation uses a RAG-based approach for initial customer support responses, combining LLMs with internal knowledge bases so that the first reply a customer receives is grounded in relevant support content.
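A bare-bones illustration of that RAG pattern (retrieve, build a grounded prompt, generate); the toy word-overlap retriever and the prompt wording are stand-ins, not Lyft's implementation:

```python
def retrieve(query: str, knowledge_base: dict[str, str], k: int = 2) -> list[str]:
    """Toy retriever: rank articles by word overlap with the query.
    A production system would use embeddings and a vector store."""
    query_words = set(query.lower().split())
    ranked = sorted(
        knowledge_base.values(),
        key=lambda article: len(query_words & set(article.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def draft_support_reply(client, ticket: str, knowledge_base: dict[str, str]) -> str:
    """RAG flow: retrieve relevant articles, then ask the LLM to answer
    using only that retrieved context."""
    context = "\n\n".join(retrieve(ticket, knowledge_base))
    prompt = (
        "Answer the customer's question using ONLY the support articles "
        f"below.\n\nArticles:\n{context}\n\nCustomer: {ticket}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

A production version would swap the toy retriever for embedding search over the knowledge base and route generation through the proxied client described earlier.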
The team faced several challenges in adapting their ML platform for GenAI:
LLM requests are significantly more complex than traditional ML model inputs: rather than a flat feature vector, a single request can carry multi-turn conversation structure, sampling parameters, and streaming behavior. Lyft adapted their request handling to accommodate these richer payloads, as the contrast sketched below illustrates.
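To make the contrast concrete, here is an illustrative comparison between a traditional inference payload and an LLM request schema; the field names are assumptions for the sketch:

```python
from dataclasses import dataclass, field

# A traditional ML inference request is often just a flat feature vector.
eta_request = {"features": [33.94, -118.40, 0.82, 17]}


@dataclass
class LLMRequest:
    """Illustrative schema showing why LLM requests need richer handling:
    structured conversations, sampling parameters, and streaming."""
    messages: list[dict]            # ordered, role-tagged conversation turns
    model: str = "gpt-4o-mini"
    temperature: float = 0.2
    max_tokens: int = 512
    stream: bool = False            # token-by-token streaming vs one response
    metadata: dict = field(default_factory=dict)  # caller, use case, trace id


chat_request = LLMRequest(
    messages=[
        {"role": "system", "content": "You are a support assistant."},
        {"role": "user", "content": "I was charged twice for my ride."},
    ],
    metadata={"use_case": "support_rag", "caller": "support-service"},
)
```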
They also implemented several security measures around access to and use of external LLMs.
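The specific measures aren't enumerated in the case study, but one control commonly applied at a proxy layer like this is redacting obvious PII before a request leaves the company's network; a toy version:

```python
import re

# Illustrative patterns only; production PII detection is far more thorough.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}


def redact_pii(text: str) -> str:
    """Replace likely PII with typed placeholders before the request
    is forwarded to an external LLM vendor."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


print(redact_pii("Rider jane@example.com called from 415-555-0123."))
# -> "Rider <email> called from <phone>."
```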
The platform maintains compatibility with existing ML infrastructure while layering GenAI-specific components on top.
The platform has enabled rapid adoption of GenAI across Lyft, with hundreds of internal users and multiple production applications. The standardized infrastructure has reduced implementation time for new use cases while maintaining security and quality standards.
Lyft continues to evolve the platform, with further expansion of its GenAI capabilities planned.
The case study demonstrates how a mature ML platform can be effectively adapted for GenAI while maintaining operational excellence and enabling rapid innovation across the organization.