ZenML

Building Production-Ready LLMs for Automated Code Repair: A Scalable IDE Integration Case Study

Replit 2024
Replit tackled automated code repair in its IDE by fine-tuning a specialized 7B-parameter LLM that works directly with Language Server Protocol (LSP) diagnostics. The system is designed to fix Python code errors using real-time IDE events, operational transformations, and project snapshots. Starting from DeepSeek-Coder-Instruct-v1.5 as the base model, the team built a data pipeline with serverless verification and a structured input/output format. The fine-tuned 7B model achieved competitive results against much larger models such as GPT-4 Turbo and Claude-3 Opus, matching or exceeding most of them on both an academic benchmark and a benchmark of real-world Replit errors. The design targets low-latency inference and direct application of fixes in the editor, a setting where both speed and accuracy are crucial.

Industry

Tech

Summary

Replit, an online IDE and development platform, built an LLM specialized for automated code repair, its first “Replit-native” AI model. The motivation stems from a vision of AI as a first-class citizen in the development environment, where models are trained to interact directly with IDE events rather than only on general code understanding. The use case chosen was code repair driven by Language Server Protocol (LSP) diagnostics. These generate hundreds of millions of events per day on the platform, yet automated fixes exist for only about 10% of Python diagnostic messages. The case study is an example of an end-to-end LLMOps workflow: from data sourcing and pipeline construction, through synthetic data generation and model training, to evaluation against both academic and production-realistic benchmarks.

Data Pipeline and Engineering

The data engineering represents a significant portion of the LLMOps work in this project. Replit’s sessions are represented as streams of Operational Transformations (OTs), which provide an edit-by-edit history of all code changes, so a project’s state can be “replayed” at any point in time. The team merges OT data with session events (LSP diagnostics, CodeMirror actions, package installations, code execution, shell commands) into a unified timeline.
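As a rough illustration of the unified timeline, the sketch below merges several time-sorted event streams into one ordered sequence. The `Event` class, its field names, and the sample payloads are hypothetical; the source does not describe Replit's actual event schema.

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, Iterable, Iterator


@dataclass(order=True)
class Event:
    """Hypothetical session event; only the timestamp takes part in ordering."""
    timestamp: float
    kind: str = field(compare=False)
    payload: Any = field(compare=False, default=None)


def merge_timeline(*streams: Iterable[Event]) -> Iterator[Event]:
    """Merge streams that are each already sorted by timestamp."""
    return heapq.merge(*streams)


# Made-up sample streams, one per event source.
ots = [Event(1.0, "ot", {"insert": "x = 1"}), Event(4.0, "ot", {"insert": "\n"})]
diagnostics = [Event(2.5, "lsp_diagnostic", {"message": "undefined name"})]
executions = [Event(3.0, "code_execution", {"exit_code": 1})]

timeline = list(merge_timeline(ots, diagnostics, executions))
kinds = [event.kind for event in timeline]
print(kinds)
```

`heapq.merge` is lazy, so the same idea works on streams too large to hold in memory, provided each input is already ordered.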

The data pipeline was designed to produce (code, diagnostic) pairs with the goal of creating 100K examples while being ready to scale by at least an order of magnitude. They implemented the pipeline using PySpark on Databricks to handle the scale. The process involves recreating the filesystem of a project at the time of each diagnostic, which requires replaying OTs to the correct timestamp. A sanity check verifies that the most recent Repl filesystem can be reconstructed to match a copy stored in GCS. They also run their pyright-extended meta-LSP (Ruff and Pyright) to verify that expected diagnostics are reproduced.
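Replaying OTs up to a diagnostic's timestamp might look like the sketch below. The component format (`skip`, `insert`, `delete`) and the `(timestamp, components)` history layout are assumptions made for illustration, not a description of Replit's storage format.

```python
from typing import Dict, List, Tuple, Union

# Assumed OT component: {"skip": n}, {"insert": text} or {"delete": n}.
Component = Dict[str, Union[int, str]]


def apply_ot(text: str, components: List[Component]) -> str:
    """Apply one operation (a list of components) to a document."""
    out = []
    cursor = 0
    for comp in components:
        if "skip" in comp:
            n = int(comp["skip"])
            out.append(text[cursor:cursor + n])  # keep n characters
            cursor += n
        elif "insert" in comp:
            out.append(str(comp["insert"]))      # add new text at the cursor
        elif "delete" in comp:
            cursor += int(comp["delete"])        # drop n characters
        else:
            raise ValueError(f"unknown OT component: {comp!r}")
    out.append(text[cursor:])                    # keep the untouched tail
    return "".join(out)


def replay_until(history: List[Tuple[float, List[Component]]], timestamp: float) -> str:
    """Rebuild a file's contents as of `timestamp` from a time-sorted history."""
    text = ""
    for ts, components in history:
        if ts > timestamp:
            break
        text = apply_ot(text, components)
    return text


history = [
    (1.0, [{"insert": "print('hi')\n"}]),
    (2.0, [{"skip": 7}, {"delete": 2}, {"insert": "hello"}]),
    (3.0, [{"skip": 15}, {"insert": "print('bye')\n"}]),
]

at_diagnostic = replay_until(history, 2.5)   # state when a diagnostic fired
latest = replay_until(history, 3.0)          # most recent state
```

A check analogous to the sanity check described above would compare `latest` against a stored snapshot of the file.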

Data filtering was important: they excluded diagnostics that already have associated CodeActions (deterministic LSP solutions), stylistic rules like line-length and import-sorting warnings, and private/non-Python projects. A notable infrastructure challenge was that LSP executables need to be pointed at a filesystem directory, and writing reconstructed file contents to disk on the fly is awkward in a Spark environment. They solved this using serverless lambdas that scale up in bursts.
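The filtering rules can be sketched as a simple predicate. The `Diagnostic` fields are hypothetical, and the two Ruff codes (`E501` for line length, `I001` for unsorted imports) are used only as examples of stylistic rules; the source does not list the exact rules that were excluded.

```python
from dataclasses import dataclass

# Example stylistic rule codes (Ruff): line too long, unsorted imports.
STYLISTIC_RULES = {"E501", "I001"}


@dataclass
class Diagnostic:
    """Hypothetical record for one diagnostic observed in a session."""
    code: str
    message: str
    has_code_action: bool    # the LSP already offers a deterministic fix
    project_is_public: bool
    language: str


def keep_for_training(d: Diagnostic) -> bool:
    """Return True if the diagnostic should stay in the dataset."""
    if d.has_code_action:
        return False
    if d.code in STYLISTIC_RULES:
        return False
    if not d.project_is_public or d.language != "python":
        return False
    return True


samples = [
    Diagnostic("reportUndefinedVariable", '"c" is not defined', False, True, "python"),
    Diagnostic("F401", "unused import", True, True, "python"),
    Diagnostic("E501", "line too long", False, True, "python"),
    Diagnostic("reportUndefinedVariable", '"x" is not defined', False, False, "python"),
]
kept = [d for d in samples if keep_for_training(d)]
```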

Synthetic Data Generation and Distillation

A key insight from the Replit team was that fixed errors taken directly from user data are noisier than synthesized diffs. They found that a well-defined synthetic pipeline resulted in more accurate diffs with less variance in the output space. Their approach was to use large pre-trained code LLMs with a few-shot prompt pipeline implemented in DSPy to synthesize diffs from real error states.

They chose numbered Line Diffs as their target format for two reasons: research from OctoPack shows that Line Diff formatting leads to higher zero-shot fix performance, and their latency requirement means generated sequences should be as short as possible. They compared this against the Unified Diff format and found that line numbers were hallucinated in Unified Diffs both with and without line numbers in the input, and that Unified Diffs would have a higher decoding cost.

An important observation was that starting from real error states and synthesizing only the diff (rather than synthesizing both error state and diff end-to-end) is less prone to mode collapse, since input feature and diff distributions are drawn from the real world. They verified this through audits of generated data.

Post-synthesis verification was rigorous: they use regular expressions to extract line diffs and filter out malformed/incomplete diffs, apply generated numbered line diffs to verify they can be correctly and unambiguously applied, and use an LLM to filter out incorrect diffs to increase the proportion of correct to incorrect samples.
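The first two verification steps (regex extraction, then checking that a diff applies unambiguously) could look like the sketch below. The diff syntax used here (`<line number> -/+ <content>`) is a made-up stand-in, since the source does not spell out the exact numbered line diff format.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical syntax: "<line number> - <old content>" or "<line number> + <new content>".
DIFF_LINE = re.compile(r"^(\d+) ([+-]) ?(.*)$")
Edit = Tuple[int, str, str]


def parse_line_diff(raw: str) -> Optional[List[Edit]]:
    """Extract edits; return None if any line is malformed or the diff is empty."""
    edits = []
    for line in raw.strip().splitlines():
        match = DIFF_LINE.match(line)
        if match is None:
            return None
        edits.append((int(match.group(1)), match.group(2), match.group(3)))
    return edits or None


def apply_line_diff(source: str, edits: List[Edit]) -> Optional[str]:
    """Apply edits; return None if a deletion does not match the source exactly."""
    lines = source.splitlines()
    deleted = set()
    inserted: Dict[int, List[str]] = {}
    for number, op, text in edits:
        if op == "-":
            if not 1 <= number <= len(lines) or lines[number - 1] != text or number in deleted:
                return None
            deleted.add(number)
        else:
            if not 1 <= number <= len(lines) + 1:
                return None
            inserted.setdefault(number, []).append(text)
    out = []
    for number in range(1, len(lines) + 2):
        out.extend(inserted.get(number, []))          # new lines land at this position
        if number <= len(lines) and number not in deleted:
            out.append(lines[number - 1])             # untouched original line
    return "\n".join(out) + "\n"


source = "def add(a, b):\n    return a + c\n"
good = parse_line_diff("2 -     return a + c\n2 +     return a + b")
fixed = apply_line_diff(source, good)
stale = apply_line_diff(source, parse_line_diff("2 -     return a - c\n2 +     return a + b"))
malformed = parse_line_diff("Replace line 2 with the correct variable.")
```

Requiring each deleted line to match the source text is one way to reject diffs whose line numbers were hallucinated.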

Model Architecture and Training

The team chose a 7B parameter model to balance capabilities with inference latency and cost constraints for production deployment. They experimented with base and instruction-tuned models from the Starcoder2 and DeepSeek-Coder families, ultimately settling on DeepSeek-Coder-Instruct-v1.5 based on performance. The weights were downloaded from HuggingFace and patched to use the Flash Attention v2 Triton kernel.

Training infrastructure used a fork of MosaicML’s LLM Foundry (v0.5.0 tag) with Composer, running on the MosaicML platform with a single node of 8 H100 GPUs per experiment. They used FSDP with Full Shard strategy and activation checkpointing.

The tuned hyperparameters were as follows. The optimizer was Decoupled AdamW with beta_1=0.9, beta_2=0.99, epsilon=1e-8, and no weight decay, at a batch size of 16. The scheduler was Cosine Annealing with Warmup, with an initial LR of 1e-5 decaying to 0.01x that value and a warmup of 100 batches. Training for 4 epochs gave the best performance, consistent with prior work on the optimal number of epochs for smaller, high-quality datasets. They used norm-based gradient clipping with a threshold of 1.0, mixed precision with BF16, and a packing ratio of 6.0 for Bin Packing of sequences.
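To make the schedule concrete, here is a minimal sketch of a cosine schedule with linear warmup using the reported values. It assumes that 1e-5 is the peak rate reached at the end of warmup and that the rate then decays to 0.01x of that peak; `total_steps` is a made-up number, since the source does not give the step count.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float = 1e-5,
          warmup_steps: int = 100, final_scale: float = 0.01) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to final_scale * base_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 down to 0
    return base_lr * (final_scale + (1.0 - final_scale) * cosine)


total_steps = 1_000  # hypothetical; depends on dataset size, batch size and epochs
for step in (0, 50, 100, 550, 1_000):
    print(step, f"{lr_at(step, total_steps):.3e}")
```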

Input/Output Schema Design

Rather than using natural language instructions (common in instruction finetuning), the team designed a structured schema with angle-bracketed sentinel tokens, inspired by function calling and tool usage approaches. This decision yielded more consistently generated and formatted responses that are easier to parse. The format is also designed to be extensible for future work modeling Replit sessions as sequences of events and outputs (e.g., adding tokens like <run_command> and <exec_output>).

Key design decisions included: adding line numbers to input code, LSP error line, and output line diffs (guaranteeing non-ambiguous diff application and empirically boosting response quality); following the base LLM’s data format to stay close to training distribution; and not modifying the vocabulary/architecture for dedicated special tokens since performance was good with each sentinel token mapped to 3-5 tokens from the base tokenizer.
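A prompt builder following these design decisions might look like the sketch below. The sentinel names (`<file_path>`, `<code>`, `<error_line>`, `<diagnostic>`, `<fix>`) and the layout are invented for illustration; the source only states that angle-bracketed sentinels and line numbers were used.

```python
def number_lines(code: str) -> str:
    """Prefix every line with its 1-based line number."""
    return "\n".join(f"{i} {line}" for i, line in enumerate(code.splitlines(), start=1))


def build_prompt(path: str, code: str, diag_line: int, diag_message: str) -> str:
    """Assemble a structured prompt; the model would continue after the final sentinel."""
    lines = code.splitlines()
    return "\n".join([
        "<file_path>", path,
        "<code>", number_lines(code),
        "<error_line>", f"{diag_line} {lines[diag_line - 1]}",
        "<diagnostic>", diag_message,
        "<fix>",
    ])


code = "def add(a, b):\n    return a + c\n"
prompt = build_prompt("main.py", code, 2, '"c" is not defined')
print(prompt)
```

Because the sentinels are plain strings, they pass through the base tokenizer unchanged, in line with the decision not to add dedicated special tokens to the vocabulary.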

Evaluation Strategy

The evaluation approach addresses a critical LLMOps concern: existing automated program repair benchmarks have been shown to leak into the pre-training corpora of large code LLMs, and they are often curated from professional repositories that poorly represent the skill diversity of real users.

They created a two-part evaluation. The LeetCode repair eval uses DebugBench (selected for recency, error subtyping, and open-source pipeline) with a subset of syntactic and reference errors that can be assisted by LSP diagnostics. They also used the LiveCodeBench approach of selecting recent LeetCode problems after the base model’s data cutoff date and applying the DebugBench synthetic bug injection pipeline, resulting in 360 samples.

The Replit repair eval is a completely new benchmark designed to test the model in the actual inference setting—fixing LSP diagnostics for users writing code on Replit. They sampled held-out (code, diagnostic) pairs from each diagnostic type, removed low-quality code, deduplicated following StarCoder recommendations to ensure no train-test leakage, and had human annotators verify or correct SOTA LLM-generated fixes. This resulted in 389 samples.

Metrics included functional correctness (for the LeetCode eval, where solutions can be submitted for evaluation), AST exact match, and AST match with string fallback (for cases where source code cannot be parsed into a valid AST but the fix is still valid). They acknowledge that exact match is a lower bound on functional correctness but is necessary when test generation isn’t feasible.
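The two AST-based metrics can be approximated with Python's standard `ast` module, as in the sketch below. This is one plausible reading of the metrics, not the team's evaluation code; in particular, the whitespace-stripped string comparison used as the fallback is an assumption.

```python
import ast


def ast_exact_match(predicted: str, reference: str) -> bool:
    """True if both sources parse and their ASTs are identical."""
    try:
        return ast.dump(ast.parse(predicted)) == ast.dump(ast.parse(reference))
    except (SyntaxError, ValueError):
        return False


def ast_match_with_string_fallback(predicted: str, reference: str) -> bool:
    """Compare ASTs; if either side does not parse, compare the stripped text."""
    try:
        return ast.dump(ast.parse(predicted)) == ast.dump(ast.parse(reference))
    except (SyntaxError, ValueError):
        return predicted.strip() == reference.strip()


# Formatting and comments do not affect the AST comparison.
same_ast = ast_exact_match("x = 1  # set x\n", "x=1\n")
different = ast_exact_match("x = 1\n", "x = 2\n")

# Neither snippet parses on its own, so only the fallback can match them.
strict = ast_exact_match("else:\n    pass\n", "else:\n    pass")
lenient = ast_match_with_string_fallback("else:\n    pass\n", "else:\n    pass")
```

By default `ast.dump` omits line and column attributes, which is why differences in layout and comments are ignored.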

Baselines included GPT-4-Turbo, GPT-3.5-Turbo, Claude-3-Opus, Claude-3-Haiku, and the base DeepSeek-Coder-Instruct-v1.5 model.

Results and Production Considerations

The Replit Code Repair 7B model achieved competitive performance against much larger models on both benchmarks. Notably, there is a significant performance gap between the Replit model and other models (except GPT-4 Turbo) on the real-world Replit eval, demonstrating the value of specialized training on platform-native data.

A key finding was that overall performance on the real-world eval remains lower than on the LeetCode eval, highlighting the importance of evaluating on both academic and production-realistic benchmarks. This is a valuable lesson for LLMOps practitioners: academic benchmarks may overestimate production performance.

Scaling experiments showed that performance improves with both training dataset size (testing 10K, 25K, 50K, 75K samples) and model parameters, providing guidance for future scaling decisions.

Future Work and Production Deployment

The team plans several extensions relevant to production deployment: handling more complex cases like cross-file edits, improving multi-line edit performance, supporting the long tail of errors seen on Replit, and extending to more programming languages (with interest in cross-language transfer learning). They are also investing in improved evaluations to capture wider distributions of LSP errors across languages.

Once the model is in production, they plan to experiment with post-training methods like DPO using user data collected by the platform (which fixes are accepted vs. rejected), creating a feedback loop for continuous improvement. This highlights an advantage of building platform-native models: direct access to user acceptance signals that can serve as preference data for post-training.
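As a sketch of how acceptance signals could become DPO training data, the snippet below pairs accepted and rejected fixes that were proposed for the same prompt. The event fields and the `prompt`/`chosen`/`rejected` record layout are assumptions; the source does not describe how such data would be logged or paired.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List


def build_preference_pairs(events: List[Dict]) -> List[Dict[str, str]]:
    """Pair every accepted fix with every rejected fix for the same prompt."""
    accepted = defaultdict(list)
    rejected = defaultdict(list)
    for event in events:
        bucket = accepted if event["accepted"] else rejected
        bucket[event["prompt"]].append(event["fix"])

    pairs = []
    for prompt, good_fixes in accepted.items():
        for chosen, bad in product(good_fixes, rejected.get(prompt, [])):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": bad})
    return pairs


# Made-up events: two proposals for prompt A, one for prompt B.
events = [
    {"prompt": "A", "fix": "2 +     return a + b", "accepted": True},
    {"prompt": "A", "fix": "2 +     return a", "accepted": False},
    {"prompt": "B", "fix": "1 + import os", "accepted": True},
]
pairs = build_preference_pairs(events)
```

Prompts with only accepted or only rejected fixes produce no pairs, so in practice such a scheme depends on showing more than one candidate fix for the same error state.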

The overall approach represents a mature LLMOps workflow: domain-specific data engineering, careful synthetic data generation with verification, infrastructure choices balancing capability and latency, rigorous evaluation on both academic and production-realistic benchmarks, and planning for post-deployment optimization using production signals.
