GitLab's ModelOps team developed a sophisticated code completion system using multiple LLMs, implementing a continuous evaluation and improvement pipeline. The system combines both open-source and third-party LLMs, featuring a comprehensive architecture that includes continuous prompt engineering, evaluation benchmarks, and reinforcement learning to consistently improve code completion accuracy and usefulness for developers.
This case study comes from a presentation by an engineering manager at GitLab who leads the ModelOps group. The team embarked on a journey to build code completion tools called “GitLab Suggestions” over approximately 6-7 months. The presentation provides insights into their LLMOps architecture, evaluation frameworks, and continuous improvement processes for deploying LLMs in production for code generation use cases.
Code completion tools are fundamentally important to developers, using AI to assist decision-making and help developers write code faster and more effectively. The presenter frames the core challenge around three key metrics that any code completion LLM output should satisfy.
The presentation focuses primarily on the “helpful” dimension—ensuring that code completions actually aid developers in achieving their goals. A key insight shared is that for many LLMs (other than ChatGPT), whether they are third-party or open-source, organizations don’t necessarily have access to the training data to judge quality. This creates a fundamental challenge: how do you take these third-party or open-source LLMs and make them write code that is better than what an average coder could produce?
When choosing raw LLMs for code completion, the presenter outlines several factors to consider, including whether the training data is accessible and how output quality can be assessed.
The presenter makes an important observation that assessing quality is extremely difficult without evaluating at scale, which leads into the discussion of their evaluation architecture.
A key architectural decision is that in this world of LLMs, you’re not just choosing one model—you’re choosing many based on the factors discussed. The GitLab architecture includes both open-source pre-trained LLMs that can be further tuned with additional data, and third-party LLMs.
The full architecture flows from left to right, starting with data ingestion. Additional data is used to enhance pre-trained LLMs, downloaded from sources like Hugging Face. This involves downloading raw datasets, pre-processing, and tokenizing data, which then moves into an environment for training and tuning with checkpoint layers.
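The ingestion stage described above can be sketched as follows. This is a minimal illustration, not GitLab's actual pipeline: the whitespace tokenizer stands in for a real subword tokenizer, and the filtering rules are assumptions.

```python
# Sketch of the ingestion stage: pre-process raw code samples, then
# tokenize them for training/tuning. The tokenizer here is a naive
# whitespace splitter standing in for a real BPE/subword tokenizer.

def preprocess(sample: str) -> str:
    """Strip trailing whitespace and drop empty lines (illustrative rules)."""
    return "\n".join(
        line.rstrip() for line in sample.splitlines() if line.strip()
    )

def tokenize(sample: str) -> list[str]:
    """Naive whitespace tokenizer standing in for a real code tokenizer."""
    return sample.split()

def ingest(raw_samples: list[str]) -> list[list[str]]:
    """Pre-process and tokenize raw code samples into a training corpus."""
    return [tokenize(preprocess(s)) for s in raw_samples]

corpus = ingest(["def add(a, b):\n    return a + b\n\n"])
print(corpus[0])  # tokenized training example
```

In a real pipeline the raw datasets would be pulled from a source like Hugging Face and the tokenized output checkpointed into the training environment.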
Two engines run in parallel within this architecture: a prompt engine and an evaluation engine.
The Prompt Engine processes every piece of code a developer writes through a “prompt library” or “prompt DB.” This engine deconstructs code into tokens, understands what was finally committed, and develops a scoring mechanism for ranking outputs. The ranking considers whether an output reflects the understanding of a good, average, or less experienced developer.
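A minimal sketch of the ranking signal such a prompt engine might derive, assuming a simple token-overlap score against the finally committed code. The actual scoring mechanism is not specified in the talk; Jaccard overlap is used here only for illustration.

```python
def score_completion(completion: str, committed: str) -> float:
    """Jaccard overlap between a completion's tokens and the code that
    was finally committed -- a stand-in for the prompt engine's score."""
    a, b = set(completion.split()), set(committed.split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

def rank(completions: list[str], committed: str) -> list[str]:
    """Rank candidate completions best-first against the committed code."""
    return sorted(
        completions,
        key=lambda c: score_completion(c, committed),
        reverse=True,
    )
```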
A Gateway layer includes prompt engine post-processing and a validator that calls different models based on the user input—routing to either third-party models or their own pre-trained models as appropriate.
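The routing behavior can be sketched as a small gateway function. The validator predicate and the stand-in models below are hypothetical; the talk does not describe the actual routing criteria.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Route:
    """A named model endpoint the gateway can dispatch to."""
    name: str
    model: Callable[[str], str]

def build_gateway(validator: Callable[[str], bool],
                  third_party: Route,
                  in_house: Route) -> Callable[[str], str]:
    """Route each request to a third-party or in-house model
    depending on what the validator decides about the input."""
    def gateway(prompt: str) -> str:
        route = third_party if validator(prompt) else in_house
        return route.model(prompt)
    return gateway

# Usage with stand-in models: route long prompts to the third-party model.
gw = build_gateway(
    validator=lambda p: len(p) > 40,
    third_party=Route("third-party", lambda p: "tp:" + p),
    in_house=Route("in-house", lambda p: "own:" + p),
)
```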
The presenter emphasizes moving beyond manual spreadsheet-based prompt evaluation to automated, scaled continuous evaluation. The evaluation loop builds benchmarks from historical code, scores LLM outputs against them, and feeds the results back into prompt refinement and fine-tuning.
A concrete example provided involves analyzing a historic database of code to understand token completion patterns. The team observed that different developers (developer one, developer two, developer three) write the same code in different ways with different parameters and evaluation styles. They run algorithms based on code commits and similarity to agree on what constitutes an “actual agreeable developer output.”
This agreed-upon output becomes the benchmark against which LLM outputs (whether open-source, third-party, or tuned) are evaluated. The evaluation relies on objective review and similarity scoring (the presenter mentions Pearson correlation as one option), examining both the prompt and the code output.
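A toy version of this consensus step, using Pearson correlation over token-count vectors as the similarity measure (one of the options the presenter mentions). The exact similarity algorithm GitLab uses is not specified, so treat this as an illustrative sketch.

```python
from collections import Counter
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        # Constant vectors: treat identical ones as a perfect match.
        return 1.0 if xs == ys else 0.0
    return cov / (sx * sy)

def similarity(a: str, b: str) -> float:
    """Correlation of token counts over the two snippets' shared vocabulary."""
    ca, cb = Counter(a.split()), Counter(b.split())
    vocab = sorted(set(ca) | set(cb))
    return pearson([ca[t] for t in vocab], [cb[t] for t in vocab])

def consensus(candidates: list[str]) -> str:
    """Pick the snippet most similar on average to all the others --
    a stand-in for the 'agreeable developer output' benchmark."""
    idx = max(
        range(len(candidates)),
        key=lambda i: sum(
            similarity(candidates[i], candidates[j])
            for j in range(len(candidates)) if j != i
        ),
    )
    return candidates[idx]
```

Given several developers' versions of the same function, `consensus` selects the one the group most agrees on, which then serves as the evaluation benchmark.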
The beauty of this approach, as emphasized by the presenter, is having this evaluation process run as microservice engines that can operate in an infinite loop on an instantaneous basis. Rather than manually reviewing prompts in spreadsheets, this architecture enables continuous background evaluation that keeps dialing up code completion usefulness across all the key metrics.
The serving layer implements what the presenter calls reinforcement learning. From the evaluation engine, the system knows what the actual shortfalls are and what the LLMs are producing. The next step is identifying prompt templates or prompt tunings that can be applied instantaneously.
Key components of the serving layer include a prompt validator, prompt templates and tunings applied at serving time, and routing back into model fine-tuning when needed.
When the prompt validator cannot achieve the desired output quality, requests can be routed to fine-tuning of the models in a continuous loop. This entire system is version-controlled, enabling the team to track changes and improvements over time.
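This fallback can be sketched as a serving function that enqueues low-quality examples for the fine-tuning loop. The validator signature and the quality threshold are assumptions for illustration, not GitLab's implementation.

```python
from typing import Callable

def serve(prompt: str,
          model: Callable[[str], str],
          validator: Callable[[str, str], float],
          tune_queue: list,
          quality_threshold: float = 0.8) -> str:
    """Serve a completion; when the validator scores it below the
    threshold (an assumed cutoff), enqueue the example so the
    continuous fine-tuning loop can pick it up."""
    completion = model(prompt)
    if validator(prompt, completion) < quality_threshold:
        tune_queue.append((prompt, completion))
    return completion

# Usage with stand-in model and validator.
queue: list = []
serve("fizzbuzz", lambda p: "pass", lambda p, c: 0.2, queue)
```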
The presenter draws an analogy to Amazon’s recommendation engine—constantly dialing up accuracy and usability through continuous feedback loops. The journey might start with raw LLMs providing a 10% acceptance rate for coders; through continuous evaluation benchmarking and prompt refinement via reinforcement learning, the accuracy steadily improves.
The full loop integrates training data and completion data continuously. With code completion use cases, there will always be coders writing code, which means there’s always new data available. This data can be continuously added for benchmark evaluation, prompt refinement, and further model fine-tuning.
This creates a virtuous cycle where more usage leads to more data, which leads to better evaluation, which leads to better prompts and fine-tuning, which leads to better code completion accuracy.
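The feedback signal driving this cycle can be tracked with something as simple as an acceptance-rate counter; this is a sketch of the kind of metric involved, not GitLab's implementation.

```python
class AcceptanceTracker:
    """Track the fraction of suggestions developers accept -- the
    feedback signal that drives the continuous-improvement loop."""

    def __init__(self) -> None:
        self.shown = 0
        self.accepted = 0

    def record(self, accepted: bool) -> None:
        """Record one suggestion shown to a developer."""
        self.shown += 1
        self.accepted += int(accepted)

    @property
    def rate(self) -> float:
        """Acceptance rate so far (0.0 when nothing has been shown)."""
        return self.accepted / self.shown if self.shown else 0.0
```

A raw model might start near the 10% acceptance rate the presenter mentions; tracking this rate over each evaluation cycle is what shows whether the loop is actually dialing accuracy up.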
The presenter concludes with a philosophical note: “In this day and age of LLMs, data is still the oil. It is very difficult to imagine that with sufficient data there will remain things that only humans can do.”
It’s worth noting that while this presentation provides valuable architectural insights, it is somewhat light on specific quantitative results or metrics. The presenter mentions starting with a 10% acceptance rate as an example but doesn’t share concrete numbers about where GitLab’s code completion tools ended up in terms of accuracy or developer adoption.
The presentation also represents a relatively early-stage journey (6-7 months at the time of the talk), so some of the continuous improvement mechanisms described may have been aspirational or in development rather than fully proven at scale. However, the architectural patterns and evaluation frameworks described represent sound engineering practices for LLMOps in production code completion scenarios.
The emphasis on moving beyond manual evaluation to automated, scaled evaluation pipelines is particularly valuable advice for teams building similar systems. The multi-model approach with routing and the concept of continuous prompt engineering through microservices represents a mature architecture for production LLM deployments.
Predibase, a fine-tuning and model serving platform, announced its acquisition by Rubrik, a data security and governance company, with the goal of combining Predibase's generative AI capabilities with Rubrik's secure data infrastructure. The integration aims to address the critical challenge that over 50% of AI pilots never reach production due to issues with security, model quality, latency, and cost. By combining Predibase's post-training and inference capabilities with Rubrik's data security posture management, the merged platform seeks to provide an end-to-end solution that enables enterprises to deploy generative AI applications securely and efficiently at scale.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Two organizations operating in highly regulated industries—Sicoob, a Brazilian cooperative financial institution, and Holland Casino, a government-mandated Dutch gaming operator—share their approaches to deploying generative AI workloads while maintaining strict compliance requirements. Sicoob built a scalable infrastructure using Amazon EKS with GPU instances, leveraging open-source tools like Karpenter, KEDA, vLLM, and Open WebUI to run multiple open-source LLMs (Llama, Mistral, DeepSeek, Granite) for code generation, robotic process automation, investment advisory, and document interaction use cases, achieving cost efficiency through spot instances and auto-scaling. Holland Casino took a different path, using Anthropic's Claude models via Amazon Bedrock and developing lightweight AI agents using the Strands framework, later deploying them through Bedrock Agent Core to provide management stakeholders with self-service access to cost, security, and operational insights. Both organizations emphasized the importance of security, governance, compliance frameworks (including ISO 42001 for AI), and responsible AI practices while demonstrating that regulatory requirements need not inhibit AI adoption when proper architectural patterns and AWS services are employed.