Company: GitHub
Title: Building a Low-Latency Global Code Completion Service
Industry: Tech
Year: 2024
Summary (short):
GitHub built Copilot, a global code completion service handling hundreds of millions of daily requests with sub-200ms latency. The system uses a proxy architecture to manage authentication, handle request cancellation, and route traffic to the nearest available LLM deployment. Key innovations include using HTTP/2 for efficient connection management, implementing a novel request cancellation system, and deploying models across multiple global regions for improved latency and reliability.
GitHub's Copilot represents one of the largest deployments of LLMs in production, serving over 400 million completion requests daily with peak traffic of 8,000 requests per second. This case study provides deep insights into the challenges and solutions of running LLMs at scale in a global production environment.

The core challenge was building a code completion service that could compete with local IDE completions while operating over the internet. This meant dealing with network latency, shared resources, and potential cloud outages while maintaining response times under 200 milliseconds. The solution architecture centered on a proxy service (copilot-proxy) that manages authentication, request routing, and efficient connection handling.

The evolution of the system is particularly interesting from an LLMOps perspective. The system started with direct OpenAI API access for alpha users, but the team quickly realized the need for a more scalable solution. The introduction of the proxy layer solved multiple operational challenges:

Authentication and Access Control: Rather than embedding API keys in clients, the system uses short-lived tokens generated through GitHub's OAuth system. The proxy validates these tokens and handles the secure communication with the underlying LLM services, providing a clean separation of concerns and better security.

Connection Management: A key innovation was the use of HTTP/2 for maintaining long-lived connections. This avoided the overhead of repeatedly establishing TCP and TLS connections, which could add 5-6 round trips' worth of latency. The proxy maintains persistent connections both with clients and with upstream LLM providers, creating efficient "warmed-up" TCP pipes.

Global Distribution: To minimize latency, the system deploys proxy instances alongside LLM models in multiple Azure regions worldwide. Traffic is routed to the nearest healthy instance using octoDNS, which provides sophisticated geographic routing capabilities. When a region experiences issues, it automatically removes itself from DNS routing, ensuring service continuity through the other regions.

Request Optimization: The system implements sophisticated request handling, including a novel cancellation mechanism. Given that about 45% of requests are cancelled (because users continue typing), the ability to quickly cancel requests and free up model resources was crucial for efficiency. The proxy layer provides this capability without requiring client updates.

Monitoring and Observability: The proxy layer provides crucial monitoring capabilities, allowing the team to track real user latency, request patterns, and system health. This proved invaluable for identifying issues like the "instant cancellation" pattern and model behavior anomalies. Illustrative sketches of several of these mechanisms follow below.
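The token flow described under Authentication and Access Control can be made concrete with a small sketch. This is not GitHub's implementation; it simply assumes an HMAC-signed token that carries an expiry timestamp, which the proxy checks before forwarding a request upstream.

```python
import hashlib
import hmac
import time

SIGNING_KEY = b"shared-secret"  # illustrative; real deployments would use managed key material


def mint_token(user_id: str, ttl_seconds: int = 600) -> str:
    """Issue a short-lived token after the user has passed OAuth (hypothetical helper)."""
    expires_at = int(time.time()) + ttl_seconds
    payload = f"{user_id}:{expires_at}"
    signature = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{signature}"


def validate_token(token: str) -> bool:
    """Proxy-side check: the signature must match and the expiry must be in the future."""
    try:
        user_id, expires_at, signature = token.rsplit(":", maxsplit=2)
        payload = f"{user_id}:{expires_at}"
        expected = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(signature, expected) and int(expires_at) > time.time()
    except ValueError:
        return False
```

A real proxy would also check entitlements such as subscription status, but the essential property is the same: clients never hold long-lived API keys to the upstream LLM provider.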
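The value of long-lived HTTP/2 connections is easiest to see from the client side. The sketch below uses the httpx library purely for illustration (the case study does not name the client stack), and the upstream URL and response shape are assumptions; the point is that one client object pays the TCP and TLS handshake cost once and multiplexes subsequent requests over the same connection.

```python
import httpx

# One client object per upstream endpoint: the TCP + TLS handshake happens once,
# and subsequent requests are multiplexed over the same HTTP/2 connection.
# Requires the httpx[http2] extra; the URL below is a placeholder.
client = httpx.Client(http2=True, base_url="https://copilot-upstream.example.com")


def request_completion(prompt: str, token: str) -> str:
    # Each call reuses the warmed-up connection instead of paying several
    # round trips of connection setup per completion request.
    response = client.post(
        "/v1/completions",
        headers={"Authorization": f"Bearer {token}"},
        json={"prompt": prompt, "max_tokens": 64},
    )
    response.raise_for_status()
    # Response shape assumed to be OpenAI-style for illustration.
    return response.json()["choices"][0]["text"]
```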
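The regional failover described under Global Distribution is configuration-driven in practice (the case study mentions octoDNS), so the loop below is only an illustration of the idea: a region that fails its own health probe withdraws itself from the geo-routed DNS record. The health endpoint and the DNS-update hook are hypothetical.

```python
import time

import httpx

REGION = "westeurope"  # the region this proxy instance runs in (illustrative name)
HEALTH_URL = "http://localhost:8080/healthz"  # local proxy health endpoint (assumed)


def update_dns_membership(region: str, healthy: bool) -> None:
    """Hypothetical hook: add or remove this region from the geo-routed record.

    In practice this is handled by the DNS tooling rather than hand-rolled code;
    the function exists only to show the flow.
    """
    action = "joining" if healthy else "withdrawing from"
    print(f"{region} is {action} the global DNS pool")


def health_loop(interval_seconds: int = 30) -> None:
    previously_healthy = True
    while True:
        try:
            healthy = httpx.get(HEALTH_URL, timeout=5).status_code == 200
        except httpx.HTTPError:
            healthy = False
        if healthy != previously_healthy:
            update_dns_membership(REGION, healthy)
            previously_healthy = healthy
        time.sleep(interval_seconds)
```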
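Request cancellation deserves its own sketch, since roughly 45% of requests are abandoned as users keep typing. The asyncio snippet below is a minimal illustration of the pattern rather than the actual proxy code: when the client disconnects, the in-flight upstream call is cancelled so the model stops generating and frees capacity.

```python
import asyncio


async def stream_from_model(prompt: str) -> str:
    """Stand-in for the upstream LLM call; generation is simulated with a sleep."""
    await asyncio.sleep(2.0)  # pretend the model is generating tokens
    return f"completion for: {prompt!r}"


async def handle_completion(prompt: str, client_disconnected: asyncio.Event) -> str | None:
    upstream = asyncio.create_task(stream_from_model(prompt))
    disconnect = asyncio.create_task(client_disconnected.wait())
    done, _ = await asyncio.wait({upstream, disconnect}, return_when=asyncio.FIRST_COMPLETED)

    if upstream in done:
        disconnect.cancel()
        return upstream.result()

    # The client went away (e.g. the user kept typing): cancel the upstream request
    # so the model stops spending compute on a completion nobody will see.
    upstream.cancel()
    return None
```

If the disconnect event fires first, handle_completion cancels the upstream task and returns None; otherwise the completion is returned as usual.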
The system also demonstrates sophisticated operational practices:

* Multi-region deployment with automatic failover
* Graceful handling of client versioning issues
* Ability to split traffic across multiple models (see the sketch at the end of this section)
* Support for A/B testing and experimentation
* Integration with enterprise compliance requirements

Several key lessons emerged from this implementation:

* The importance of bringing computation closer to users for latency-sensitive applications
* The value of having control points in the request path for operational flexibility
* The benefits of using modern protocols like HTTP/2 for performance optimization
* The importance of building observability into the system from the ground up

The project also showed the value of custom infrastructure where it matters most. While using standard cloud services would have been easier, building a custom proxy layer with specific optimizations for their use case provided significant competitive advantages.

From an LLMOps perspective, the case study demonstrates how to successfully operationalize LLMs in a global, high-scale environment. The architecture balances multiple competing concerns: latency, reliability, security, and operational efficiency. The proxy-based approach provides the flexibility needed to handle production issues, from model behavior anomalies to client versioning challenges, without requiring client updates.

The success of this architecture is evident in the numbers: handling hundreds of millions of daily requests with sub-200ms latency, while maintaining the ability to rapidly respond to issues and evolve the system. It's a testament to thoughtful system design that considers not just the technical requirements but also the operational realities of running LLMs in production.
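As noted in the operational practices above, the proxy's position in the request path is what makes traffic splitting and A/B testing straightforward. The weighted router below is an illustrative sketch (model names and weights are hypothetical): a control point in front of the models can shift a fraction of requests to a candidate model without any client change.

```python
import random

# Hypothetical weights: 90% of traffic to the production model, 10% to a candidate.
MODEL_WEIGHTS = {
    "code-model-prod": 0.9,
    "code-model-candidate": 0.1,
}


def pick_model(user_id: str, weights: dict[str, float] = MODEL_WEIGHTS) -> str:
    """Choose a model deployment for this request.

    Seeding the RNG with the user id keeps a given user pinned to one arm of the
    experiment, which makes A/B metrics easier to interpret than per-request randomness.
    """
    rng = random.Random(user_id)  # deterministic per user
    roll = rng.random()
    cumulative = 0.0
    for model, weight in weights.items():
        cumulative += weight
        if roll <= cumulative:
            return model
    return next(iter(weights))  # fallback if weights do not sum to 1.0
```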
