Company
Vespper
Title
Implementing LLM Fallback Mechanisms for Production Incident Response System
Industry
Tech
Year
2024
Summary (short)
When Vespper's incident response system faced an unexpected OpenAI account deactivation, they needed to quickly implement a fallback mechanism to maintain service continuity. Using LiteLLM's fallback feature, they implemented a solution that could automatically switch between different LLM providers. During implementation, they discovered and fixed a bug in LiteLLM's fallback handling, ultimately contributing the fix back to the open-source project while ensuring their production system remained operational.
This case study provides an insightful look into the challenges and solutions of maintaining reliable LLM-powered systems in production, specifically focusing on Vespper's experience with implementing fallback mechanisms for their incident response platform.

Vespper operates an LLM-based incident response system that automatically investigates alerts from various monitoring tools (Datadog, New Relic, SigNoz, PagerDuty). Their system employs LLMs to analyze incidents by examining internal tools, observability systems, codebases, and Slack communications, delivering its findings to developers through Slack to expedite incident resolution.

The core challenge arose when their OpenAI account was unexpectedly deactivated due to a false positive, effectively bringing down their core product. This incident exposed a critical vulnerability in their architecture: a single point-of-failure dependency on OpenAI's services. The case study details their rapid response and implementation of a more resilient system.

Technical Implementation Details:

The company's initial architecture utilized LiteLLM as an abstraction layer for LLM interactions, allowing them to handle both customer-provided LLM keys and their own keys through a unified API. This abstraction proved valuable during the crisis, as it provided the foundation for implementing their fallback solution.

The technical solution involved leveraging LiteLLM's fallback feature, which allows specifying multiple models in an array; the system attempts each model in sequence until a successful response is received. However, during implementation they encountered a bug in LiteLLM's source code where the fallback mechanism was failing due to parameter handling issues. The bug manifested as a TypeError related to multiple values being passed for the 'model' argument.
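The fallback behavior described above (try each model in sequence until one succeeds) can be sketched in a few lines. This is a minimal illustration of the pattern, not LiteLLM's actual implementation; `call_model` is a hypothetical stand-in for a provider API call.

```python
# Minimal sketch of the fallback pattern: try each model in order and
# return the first successful response. `call_model` is a hypothetical
# stand-in for a provider API call, not LiteLLM's own code.
def complete_with_fallbacks(models, prompt, call_model):
    errors = {}
    for model in models:
        try:
            return call_model(model, prompt)
        except Exception as exc:  # e.g. provider outage or a deactivated account
            errors[model] = exc
    # Every model failed; surface the per-model errors for debugging.
    raise RuntimeError(f"All fallback models failed: {errors}")
```

LiteLLM exposes this behavior declaratively by accepting a list of models, so callers get the same retry-in-sequence semantics without writing the loop themselves.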
Upon investigation, they found that the issue stemmed from how LiteLLM's async_completion_with_fallbacks function handled model parameters during the fallback process. The root cause was that the fallback dictionary was being merged with completion_kwargs while the model was also passed separately to litellm.acompletion. Their solution involved modifying LiteLLM's source code to use the pop() method instead of get() when handling the model parameter, ensuring the model key was removed from the fallback dictionary before merging. This prevented the parameter conflict while maintaining the intended functionality.

Deployment Considerations:

The deployment process revealed additional complexity in their production environment. Their system uses Docker with Poetry for dependency management, with LiteLLM installed from PyPI during the build process. To quickly deploy their fix without waiting for an official patch, they:

* Forked the LiteLLM repository
* Implemented their fix in the fork
* Modified their Poetry configuration to install the forked version as a git dependency
* Tested the solution in production
* Contributed the fix back to the main project through a pull request

Lessons Learned and Best Practices:

This incident highlighted several important LLMOps considerations:

* The importance of avoiding single points of failure in LLM-powered systems
* The value of abstraction layers that facilitate provider switching
* The need for robust fallback mechanisms in production LLM systems
* The importance of contributing fixes back to the open-source community

The case study also demonstrates effective incident response practices in LLMOps:

* Quick identification of the root cause through logging
* Rapid implementation of a temporary solution
* Thorough testing before deployment
* Long-term planning for system resilience

Future Improvements:

Following this incident, Vespper initiated an audit of their systems to identify and address other potential single points of failure. While the fallback mechanism provided an immediate solution, they recognized the need to build even greater resilience into their system.

This case study offers valuable insights for organizations building production LLM systems, emphasizing the importance of:

* Architectural flexibility through abstraction layers
* Robust fallback mechanisms
* Comprehensive testing of third-party dependencies
* Contributing to the open-source ecosystem
* Regular system audits for potential vulnerabilities

The experience demonstrates that while LLMs can provide powerful capabilities for automation and incident response, careful attention must be paid to operational reliability and resilience in production environments.
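The pop()-versus-get() fix described under Technical Implementation Details can be reproduced in isolation. This is a hedged sketch: `acompletion` below is a stand-in with the same calling shape as litellm.acompletion (a model parameter plus keyword arguments), not LiteLLM's actual source.

```python
# Stand-in with the same calling shape as litellm.acompletion:
# an explicit `model` parameter plus arbitrary keyword arguments.
def acompletion(model, **kwargs):
    return {"model": model, **kwargs}

completion_kwargs = {"messages": [{"role": "user", "content": "investigate alert"}]}

def buggy_call(fallback):
    # get() leaves "model" inside the fallback dict, so after merging it
    # is passed both as the explicit argument and again via **kwargs,
    # raising TypeError: got multiple values for argument 'model'.
    model = fallback.get("model")
    return acompletion(model, **{**completion_kwargs, **fallback})

def fixed_call(fallback):
    # pop() removes "model" from the dict before merging, so the model
    # is passed exactly once.
    model = fallback.pop("model")
    return acompletion(model, **{**completion_kwargs, **fallback})
```

The one-word change from get() to pop() is what removes the duplicate "model" key before the merged keyword arguments reach the underlying completion call.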
