The case study details GitHub's journey in developing GitHub Copilot by working with OpenAI's large language models. Starting with GPT-3 experimentation in 2020, the team evolved from basic code generation testing to creating an interactive IDE integration. Through multiple iterations of model improvements, prompt engineering, and fine-tuning techniques, they enhanced the tool's capabilities, ultimately leading to features like multi-language support, context-aware suggestions, and the development of GitHub Copilot X.
This case study provides a rare insider look at how GitHub built and evolved GitHub Copilot, one of the most widely adopted AI coding assistants in production. The article, originally published in May 2023 and updated in February 2024, features interviews with key GitHub engineers and researchers who worked on the project from its inception. GitHub Copilot is a significant LLMOps case study because it shows the full lifecycle of taking LLMs from experimental API access to a production-grade developer tool used by millions.
The journey began in June 2020 when OpenAI released GPT-3, which represented a capability threshold that finally made code generation viable. Prior to this, GitHub engineers had periodically evaluated whether general-purpose code generation was feasible, but previous models were simply not capable enough. This underscores an important LLMOps consideration: timing model adoption to capability thresholds rather than simply adopting the newest technology.
When GitHub first received API access to GPT-3 from OpenAI, they took a structured approach to evaluation. The GitHub Next research and development team assessed the model by giving it coding-like tasks and evaluated its outputs in two different forms. For the first form, they crowdsourced self-contained coding problems to test the model’s capabilities systematically. The article notes that this methodology was eventually abandoned because “the models just got too good”: they initially solved about 50% of the problems but eventually exceeded 90%. This highlights a difficulty of evaluation when LLM capabilities evolve rapidly: test suites that once discriminated between good and bad models become obsolete as models improve.
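A pass-rate evaluation over self-contained problems can be sketched as follows. This is a minimal illustration, not GitHub's harness: the problem format, the `passes` and `pass_rate` helpers, and the `fake_model` stub standing in for a real model call are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical problem format: a prompt, the function name to test,
# and a list of (args, expected_result) cases.
PROBLEMS: List[Dict] = [
    {"name": "add", "prompt": "def add(a, b):",
     "cases": [((1, 2), 3), ((0, 0), 0)]},
    {"name": "is_even", "prompt": "def is_even(n):",
     "cases": [((2,), True), ((3,), False)]},
]


def passes(candidate_source: str, problem: Dict) -> bool:
    """Execute candidate code in a scratch namespace and check every case."""
    namespace: dict = {}
    try:
        # Generated code is untrusted; a real harness would sandbox this.
        exec(candidate_source, namespace)
        func = namespace[problem["name"]]
        return all(func(*args) == expected
                   for args, expected in problem["cases"])
    except Exception:
        return False


def pass_rate(generate: Callable[[str], str], problems: List[Dict]) -> float:
    """Fraction of problems whose generated solution passes all its cases."""
    solved = sum(passes(generate(p["prompt"]), p) for p in problems)
    return solved / len(problems)


def fake_model(prompt: str) -> str:
    """Stub standing in for a model call; one canned answer is wrong."""
    canned = {
        "def add(a, b):": "def add(a, b):\n    return a + b\n",
        "def is_even(n):": "def is_even(n):\n    return n % 2 == 1\n",
    }
    return canned[prompt]


print(pass_rate(fake_model, PROBLEMS))  # 0.5
```

Once nearly every problem passes, a score like this stops separating one model from the next, which is the saturation effect the article describes.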
The initial prototype was an AI-powered chatbot where developers could ask coding questions and receive runnable code snippets. However, the team quickly pivoted when they discovered that IDE integration provided a superior modality. As Albert Ziegler noted, placing the model directly in the IDE created an interactive experience that was “useful in almost every situation.” This architectural decision — embedding AI assistance directly into existing workflows rather than requiring developers to context-switch to a separate tool — proved foundational to Copilot’s success.
GitHub’s LLMOps journey involved working with progressively improving models from OpenAI. The first model was Python-only, followed by a JavaScript model and then a multilingual model. An interesting finding was that the JavaScript-specific model had problems that the multilingual model did not exhibit. The team was surprised that the multilingual model performed so well despite not being specialized — a counterintuitive result that suggests generalization can sometimes outperform specialization in LLM applications.
In 2021, OpenAI released the Codex model, built in partnership with GitHub. This was an offshoot of GPT-3 trained on billions of lines of public code, enabling it to produce code suggestions in addition to natural language. The model contained upwards of 170 billion parameters, making traditional training approaches challenging. This partnership model — where a company contributes domain expertise and data while a foundation model provider contributes base model capabilities — represents one successful pattern for enterprises building on LLMs.
As GitHub Copilot prepared for launch as a technical preview, the team created a dedicated Model Improvements team responsible for monitoring and improving the product’s quality by refining how it communicates with the underlying LLM. Their primary metric was “completion”: whether users accept GitHub Copilot’s suggestions and keep them in their code. This reflects a crucial production ML practice: defining a clear success metric that aligns with user value.
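One way to make an "accepted and kept" metric concrete is sketched below. The event schema, field names, and sample numbers are assumptions made for illustration; the article does not describe how GitHub records or computes these values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SuggestionEvent:
    """Hypothetical telemetry record for one suggestion that was shown."""
    shown_chars: int     # length of the suggestion that was displayed
    accepted: bool       # did the user accept it?
    retained_chars: int  # accepted characters still in the file later


def acceptance_rate(events: List[SuggestionEvent]) -> float:
    """Share of shown suggestions that the user accepted."""
    if not events:
        return 0.0
    return sum(e.accepted for e in events) / len(events)


def retention_rate(events: List[SuggestionEvent]) -> float:
    """Share of accepted characters that survived later edits."""
    accepted = [e for e in events if e.accepted]
    total = sum(e.shown_chars for e in accepted)
    if total == 0:
        return 0.0
    return sum(e.retained_chars for e in accepted) / total


# Invented sample data, for illustration only.
events = [
    SuggestionEvent(shown_chars=40, accepted=True, retained_chars=30),
    SuggestionEvent(shown_chars=20, accepted=False, retained_chars=0),
    SuggestionEvent(shown_chars=60, accepted=True, retained_chars=60),
    SuggestionEvent(shown_chars=10, accepted=False, retained_chars=0),
]
print(acceptance_rate(events))  # 0.5
print(retention_rate(events))   # 0.9
```

Tracking retention alongside acceptance separates suggestions that were merely accepted from those that actually stayed in the code.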
The article provides excellent detail on prompt engineering in production. John Berryman explains that since LLMs are fundamentally document completion models trained on partial documents, the key insight is that prompt crafting is “really all about creating a ‘pseudo-document’ that will lead the model to a completion that benefits the customer.”
Critically, the team discovered they didn’t need to limit context to just the current file. They could pull additional context from the IDE to improve completions. One major breakthrough was incorporating content from neighboring editor tabs. Berryman describes this as “one of my favorite tricks” that resulted in a “huge lift in our acceptance rate and characters retained.” This approach mirrors how developers actually work, referencing related files while coding, and embeds that pattern directly into the prompt.
The philosophy articulated here is worth noting: “we can make the user more productive by incorporating the way they think about code into the algorithm itself.” Rather than requiring users to manually provide context, the system proactively gathers relevant context in the same way a developer would, but automatically.
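The pseudo-document idea can be sketched as a small prompt builder that places commented-out snippets from neighboring tabs ahead of the text before the cursor. Everything here is an assumption made for illustration: the function names, the "Snippet from" label, the comment marker, and the character budget are hypothetical and are not Copilot's actual prompt format.

```python
from typing import List, Tuple


def comment_out(text: str, marker: str = "# ") -> str:
    """Turn a block of text into line comments in the target language."""
    return "\n".join(marker + line for line in text.splitlines())


def build_pseudo_document(prefix: str,
                          neighbor_snippets: List[Tuple[str, str]],
                          budget: int = 400) -> str:
    """Assemble a prompt: neighbor-tab snippets first, then the cursor prefix.

    The text nearest the cursor is kept first because it matters most;
    neighbor snippets are added only while they fit in the budget.
    """
    prefix = prefix[-budget:]
    remaining = budget - len(prefix)
    parts = []
    for path, snippet in neighbor_snippets:
        block = comment_out(f"Snippet from {path}:\n{snippet}") + "\n"
        if len(block) > remaining:
            break
        parts.append(block)
        remaining -= len(block)
    return "".join(parts) + prefix


prefix = "def total_price(cart):\n    "
neighbors = [("models/cart.py", "class Cart:\n    items: list")]
print(build_pseudo_document(prefix, neighbors))
```

A budget in characters keeps the sketch simple; a production system would more likely budget in model tokens.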
Fine-tuning was employed to adapt pre-trained models for specific tasks or domains. Alireza Goudarzi explained that fine-tuning involves “training the underlying Codex model on a user’s specific codebase to provide more focused, customized completions.” This acknowledges that general models, while powerful, can produce outputs that aren’t necessarily helpful for specific codebases with unique conventions.
A key challenge mentioned is understanding why users reject or accept suggestions. Goudarzi notes there’s “no way for us to really troubleshoot in the typical engineering way” — you can’t step through an LLM like traditional code. Instead, the approach is to “figure out how to ask the right questions to get the output we desire.” This represents a fundamental shift in debugging methodology for LLM-powered systems.
The article documents several concrete improvements that enhanced production quality:
Language Disambiguation: Early versions of Copilot would sometimes suggest code in the wrong programming language, such as suggesting Python code at the top of a C# file. The initial fix was adding a headline to the prompt specifying the language. However, a more elegant solution emerged: putting the file path at the top of the prompt. The file extension naturally indicates the language, and the filename itself often provides semantic hints (e.g., “connectiondatabase.py” suggests database operations in Python). This solved the language problem and improved suggestion quality by enabling better boilerplate code suggestions.
Cross-File Context Retrieval: The team eventually built a component that could lift code from other open files in the IDE. It scanned those files for text similar to the code around the current cursor position. According to the article, this idea had been discussed since GitHub Copilot’s genesis but took months of iteration to implement successfully. The result was a “huge boost in code acceptance because suddenly, GitHub Copilot knew about other files.”
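Both improvements can be illustrated with a short sketch: a path header prepended to the prompt, and a scan of open files for the window of lines most similar to the code near the cursor. The article does not say which similarity measure GitHub uses; Jaccard overlap of identifier sets is a stand-in chosen here for simplicity, and the function names, header wording, and window size are hypothetical.

```python
import re
from typing import Dict, Optional, Set, Tuple


def identifiers(text: str) -> Set[str]:
    """Set of identifier-like tokens in a piece of code."""
    return set(re.findall(r"[A-Za-z_]\w*", text))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two token sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_window(cursor_context: str, open_files: Dict[str, str],
                window_lines: int = 3) -> Tuple[float, Optional[str], str]:
    """Return (score, path, snippet) for the most similar window of lines."""
    target = identifiers(cursor_context)
    best: Tuple[float, Optional[str], str] = (0.0, None, "")
    for path, text in open_files.items():
        lines = text.splitlines()
        for start in range(max(1, len(lines) - window_lines + 1)):
            snippet = "\n".join(lines[start:start + window_lines])
            score = jaccard(target, identifiers(snippet))
            if score > best[0]:
                best = (score, path, snippet)
    return best


def with_path_header(file_path: str, body: str, marker: str = "#") -> str:
    """Prepend the file path as a comment so the extension hints the language."""
    return f"{marker} Path: {file_path}\n{body}"


open_files = {
    "db.py": "def connect(url):\n    return Connection(url)\n",
    "ui.py": "def render(widget):\n    widget.draw()\n",
}
score, path, snippet = best_window("conn = connect(url)", open_files)
print(path)  # db.py
print(with_path_header("src/report.py", "conn = connect(url)"))
```

The comment marker is a parameter because the header has to be a valid comment in the language of the file being edited, for example `//` in a C# file.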
The article provides insight into how model updates from OpenAI were incorporated. Johan Rosenkilde recounts that previous model improvements were good but often not perceptible to end users. However, when the third iteration of Codex dropped, users could genuinely “feel” the difference, especially for less common programming languages like F#. This highlights the challenge of managing user expectations around model updates and the importance of testing improvements across the full spectrum of use cases, not just popular languages.
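A per-language breakdown is one simple way to see changes that an aggregate number hides. The sketch below compares acceptance rates between two model versions; the event format, function names, and sample data are invented for illustration and do not come from the article.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[str, bool]  # (language, suggestion_was_accepted)


def acceptance_by_language(events: Iterable[Event]) -> Dict[str, float]:
    """Acceptance rate per language."""
    shown: Dict[str, int] = defaultdict(int)
    accepted: Dict[str, int] = defaultdict(int)
    for language, was_accepted in events:
        shown[language] += 1
        accepted[language] += int(was_accepted)
    return {lang: accepted[lang] / shown[lang] for lang in shown}


def acceptance_deltas(old_events: Iterable[Event],
                      new_events: Iterable[Event]) -> Dict[str, float]:
    """Change in acceptance rate per language between two model versions."""
    old = acceptance_by_language(old_events)
    new = acceptance_by_language(new_events)
    return {lang: new[lang] - old[lang] for lang in old.keys() & new.keys()}


# Invented sample data: the new model only helps the less common language.
old_model = [("python", True), ("python", False),
             ("fsharp", False), ("fsharp", False)]
new_model = [("python", True), ("python", False),
             ("fsharp", True), ("fsharp", False)]
print(acceptance_deltas(old_model, new_model))
```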
Several themes emerge from this case study that are broadly applicable:
Modality matters: The pivot from a chatbot interface to IDE-embedded suggestions dramatically improved utility. The context of where and how users interact with AI assistance is as important as the model quality itself.
Evaluation evolves: Test suites that worked early on became obsolete as models improved. Teams need flexible evaluation frameworks that can scale with model capabilities.
Context engineering is crucial: Much of the product improvement came not from model changes but from better prompt construction — gathering context from file paths, neighboring tabs, and related files.
User signals are ambiguous: Understanding why users accept or reject suggestions remains challenging. This requires a different debugging mindset than traditional software.
Model improvements aren’t always visible: Not all model updates produce user-perceptible improvements, making it important to have robust internal metrics while managing external expectations.
It’s worth noting that this article is published by GitHub itself and features interviews with their own engineers, which naturally presents a positive narrative. The specific metrics around improvement (acceptance rates, etc.) are described qualitatively rather than quantitatively in most cases. The article also doesn’t discuss challenges like handling proprietary code, latency considerations in production, cost management for API calls, or how they handle edge cases and failures. These would be valuable additions for a complete LLMOps picture.
Additionally, while the evolution to GitHub Copilot X with chat functionality and expanded platform integration is mentioned, the technical details of how these chat and multi-platform features operate in production are not covered. The article focuses primarily on the core code completion feature rather than the full system architecture.