## Summary
Replit, an online IDE and development platform, built a specialized LLM for automated code repair, their first "Replit-native" AI model. The motivation stems from their vision of AI as a first-class citizen in the development environment, where models are trained to interact directly with IDE events rather than relying only on general code understanding. The specific use case chosen was code repair driven by Language Server Protocol (LSP) diagnostics: these fire hundreds of millions of times per day on the platform, yet the LSP offers automated fixes (CodeActions) for only about 10% of Python diagnostic messages. This case study provides an excellent example of an end-to-end LLMOps workflow: from data sourcing and pipeline construction, through synthetic data generation and model training, to evaluation against both academic and production-realistic benchmarks.
## Data Pipeline and Engineering
The data engineering aspect of this project is particularly sophisticated and represents a significant portion of the LLMOps work. Replit's sessions are represented as streams of Operational Transformations (OTs), which provide edit-by-edit history of all code changes. This allows them to "replay" a project's state at any point in time. They merge OT data with session events (LSP diagnostics, CodeMirror actions, package installations, code execution, shell commands) into a unified timeline.
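As a rough illustration of what replaying OTs involves, the sketch below rebuilds a file's contents up to a target timestamp. The retain/insert/delete op encoding and the `Op` structure are assumptions for illustration; Replit's actual OT schema is not described in the case study.

```python
# Minimal OT replay sketch. The op encoding (positive int = retain, string =
# insert, negative int = delete) is an illustrative assumption, not Replit's
# actual schema.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Op:
    timestamp: float
    components: List[Union[int, str]]

def apply_op(text: str, components: List[Union[int, str]]) -> str:
    out, cursor = [], 0
    for c in components:
        if isinstance(c, str):                  # insert literal text
            out.append(c)
        elif c > 0:                             # retain c characters
            out.append(text[cursor:cursor + c])
            cursor += c
        else:                                   # delete -c characters
            cursor += -c
    out.append(text[cursor:])                   # keep any trailing text
    return "".join(out)

def replay(ops: List[Op], until: float) -> str:
    """Replay all ops with timestamp <= `until` to rebuild the file state."""
    text = ""
    for op in sorted(ops, key=lambda o: o.timestamp):
        if op.timestamp > until:
            break
        text = apply_op(text, op.components)
    return text
```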
The data pipeline was designed to produce (code, diagnostic) pairs with the goal of creating 100K examples while being ready to scale by at least an order of magnitude. They implemented the pipeline using PySpark on Databricks to handle the scale. The process involves recreating the filesystem of a project at the time of each diagnostic, which requires replaying OTs to the correct timestamp. A sanity check verifies that the most recent Repl filesystem can be reconstructed to match a copy stored in GCS. They also run their pyright-extended meta-LSP (Ruff and Pyright) to verify that expected diagnostics are reproduced.
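A rough PySpark sketch of how such (code, diagnostic) pairs might be assembled is shown below; the table paths, column names, and join keys are hypothetical, and the replay/verification step would run in a UDF or on the serverless workers described next.

```python
# Hypothetical PySpark sketch of pairing LSP diagnostics with the edit history
# needed to reconstruct the file at diagnostic time. Paths and columns are
# illustrative, not Replit's actual schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("code-repair-pairs").getOrCreate()

ots = spark.read.parquet("gs://bucket/ot_events")          # edit-by-edit history
diags = spark.read.parquet("gs://bucket/lsp_diagnostics")  # diagnostic events

# For each diagnostic, collect every edit in the same session that happened
# before the diagnostic fired, ordered by time, so the file can be replayed.
pairs = (
    diags.join(ots, "session_id")
         .where(F.col("ot_ts") <= F.col("diag_ts"))
         .groupBy("session_id", "diag_id", "path", "message", "line")
         .agg(F.sort_array(F.collect_list(F.struct("ot_ts", "op"))).alias("ops"))
)

# A downstream step would replay `ops` to rebuild the file, re-run the
# pyright-extended meta-LSP, and keep the pair only if the expected
# diagnostic is reproduced.
```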
Data filtering was important: they excluded diagnostics that already have associated CodeActions (deterministic LSP fixes), stylistic rules such as line-length and import-sorting warnings, and private or non-Python projects. A notable infrastructure challenge was that LSP executables must be pointed at a filesystem directory, and dynamically persisting code strings to a real filesystem inside a Spark environment is awkward; they solved this with serverless lambdas that scale up in bursts.
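A minimal sketch of that filtering logic follows; the record fields are hypothetical, and E501 (line-too-long) and I001 (unsorted-imports) are real Ruff codes used only as examples of stylistic rules, since the exact exclusion list is not published.

```python
# Illustrative diagnostic filter; field names and the exclusion list are
# assumptions, not Replit's actual implementation.
STYLISTIC_RULES = {"E501", "I001"}   # line-length and import-sorting examples

def keep_diagnostic(diag: dict) -> bool:
    if diag.get("has_code_action"):           # deterministic LSP fix already exists
        return False
    if diag.get("rule") in STYLISTIC_RULES:   # stylistic-only warnings
        return False
    if diag.get("language") != "python" or diag.get("is_private_project", True):
        return False                          # private or non-Python projects excluded
    return True
```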
## Synthetic Data Generation and Distillation
A key insight from the Replit team was that fixes mined directly from user edit data are noisier than synthesized diffs. They found that a well-defined synthetic pipeline produced more accurate diffs with less variance in the output space. Their approach was to use large pre-trained code LLMs with a few-shot prompt pipeline implemented in DSPy to synthesize diffs from real error states.
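The following is a minimal sketch of what such a DSPy module could look like; the signature fields and prompt wording are assumptions rather than Replit's published pipeline.

```python
# Sketch of a DSPy few-shot diff-synthesis module. Field names and docstrings
# are illustrative assumptions.
import dspy

class SynthesizeLineDiff(dspy.Signature):
    """Given numbered source code and an LSP diagnostic, produce a numbered
    line diff that resolves the diagnostic."""
    code = dspy.InputField(desc="source file with line numbers")
    diagnostic = dspy.InputField(desc="LSP diagnostic message and location")
    line_diff = dspy.OutputField(desc="numbered line diff fixing the error")

# A large pre-trained code LLM would be configured as the backing model (the
# configuration API varies by DSPy version), and few-shot demonstrations
# attached via a DSPy optimizer before calling the module.
synthesize = dspy.Predict(SynthesizeLineDiff)
# result = synthesize(code=numbered_code, diagnostic=diag_text)
```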
They chose numbered Line Diffs as their target format, based on research from OctoPack showing that Line Diff formatting leads to higher zero-shot fix performance and on their latency requirement that generated sequences be as short as possible. They compared this against the Unified Diff format and found that line numbers were hallucinated in Unified Diffs both with and without line numbers in the input, and that Unified Diffs would incur higher decoding cost.
An important observation was that starting from real error states and synthesizing only the diff (rather than synthesizing both error state and diff end-to-end) is less prone to mode collapse, since input feature and diff distributions are drawn from the real world. They verified this through audits of generated data.
Post-synthesis verification was rigorous: they use regular expressions to extract line diffs and filter out malformed/incomplete diffs, apply generated numbered line diffs to verify they can be correctly and unambiguously applied, and use an LLM to filter out incorrect diffs to increase the proportion of correct to incorrect samples.
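A sketch of the extraction-and-application check is below; the "N - old / N + new" numbered-diff format and the regex are illustrative assumptions, since the exact format is not published, and the sketch only handles line replacements and deletions.

```python
# Post-synthesis verification sketch: extract numbered diff lines with a regex
# and confirm they apply unambiguously. Format and regex are assumptions.
import re
from typing import Optional

DIFF_LINE = re.compile(r"^(\d+)\s*([+-])\s?(.*)$")   # e.g. "12 - old" / "12 + new"

def apply_numbered_diff(source: str, diff: str) -> Optional[str]:
    """Return patched source, or None if the diff is malformed or ambiguous."""
    lines = source.splitlines()
    edits = {}                                # line number -> replacement (None = delete)
    for raw in diff.strip().splitlines():
        m = DIFF_LINE.match(raw)
        if not m:
            return None                       # malformed line: discard the sample
        num, sign, text = int(m.group(1)), m.group(2), m.group(3)
        if not (1 <= num <= len(lines)):
            return None                       # refers to a nonexistent line
        if sign == "-":
            edits.setdefault(num, None)       # deletion unless an addition follows
        else:
            if edits.get(num) is not None:
                return None                   # two additions for one line: ambiguous
            edits[num] = text                 # replacement for this line
    patched = [edits[i] if i in edits else line
               for i, line in enumerate(lines, start=1)]
    return "\n".join(l for l in patched if l is not None)
```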
## Model Architecture and Training
The team chose a 7B parameter model to balance capability against the inference latency and cost constraints of production deployment. They experimented with base and instruction-tuned models from the Starcoder2 and DeepSeek-Coder families, ultimately settling on DeepSeek-Coder-Instruct-v1.5 based on performance. The weights were downloaded from HuggingFace and patched to use the Flash Attention v2 Triton kernel.
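A rough sketch of loading that checkpoint is shown below; Replit's actual patching goes through their LLM Foundry fork to swap in the Flash Attention v2 Triton kernel, so the simpler `transformers` flag here is only an approximation.

```python
# Approximate load of the base model from Hugging Face; Replit's Triton-kernel
# patch is applied inside their LLM Foundry fork, not via this flag.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-7b-instruct-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",   # FA2 via transformers, not the Triton kernel
)
```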
Training infrastructure used a fork of MosaicML's LLM Foundry (v0.5.0 tag) with Composer, running on the MosaicML platform with a single node of 8 H100 GPUs per experiment. They used FSDP with Full Shard strategy and activation checkpointing.
Hyperparameters were carefully tuned: Decoupled AdamW optimizer, Cosine Annealing with Warmup scheduler (initial LR of 1e-5, decaying to 0.01x with a 100-batch warmup), beta_1=0.9, beta_2=0.99, epsilon=1e-8, no weight decay, and a batch size of 16. Training for 4 epochs gave the best performance, consistent with prior work on the optimal number of epochs for smaller, high-quality datasets. They used norm-based gradient clipping with a threshold of 1.0, BF16 mixed precision, and a packing ratio of 6.0 for bin packing of sequences.
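Collected into one place, the reported settings look roughly like the config sketch below; the key names follow LLM Foundry/Composer conventions but are not the exact YAML schema of the v0.5.0 fork.

```python
# Reported hyperparameters expressed as an illustrative LLM-Foundry-style
# config dict; key names are approximations of the YAML schema.
train_config = {
    "optimizer": {
        "name": "decoupled_adamw",
        "lr": 1e-5,
        "betas": (0.9, 0.99),
        "eps": 1e-8,
        "weight_decay": 0.0,
    },
    "scheduler": {
        "name": "cosine_with_warmup",
        "t_warmup": "100ba",        # 100-batch warmup
        "alpha_f": 0.01,            # decay to 0.01x the initial LR
    },
    "global_train_batch_size": 16,
    "max_duration": "4ep",          # 4 epochs performed best
    "precision": "amp_bf16",
    "fsdp_config": {
        "sharding_strategy": "FULL_SHARD",
        "activation_checkpointing": True,
    },
    "gradient_clipping": {"clipping_type": "norm", "clipping_threshold": 1.0},
    "packing_ratio": 6.0,           # bin packing of sequences
}
```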
## Input/Output Schema Design
Rather than using natural language instructions (common in instruction finetuning), the team designed a structured schema with angle-bracketed sentinel tokens, inspired by function calling and tool usage approaches. This decision yielded more consistently generated and formatted responses that are easier to parse. The format is also designed to be extensible for future work modeling Replit sessions as sequences of events and outputs (e.g., by adding dedicated sentinel tokens for new event and output types).
Key design decisions included: adding line numbers to input code, LSP error line, and output line diffs (guaranteeing non-ambiguous diff application and empirically boosting response quality); following the base LLM's data format to stay close to training distribution; and not modifying the vocabulary/architecture for dedicated special tokens since performance was good with each sentinel token mapped to 3-5 tokens from the base tokenizer.
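The example below illustrates the flavor of such a schema: line-numbered code and diagnostics wrapped in sentinel tokens, with a numbered line diff as the target. The token names and exact layout are assumptions; the precise schema is not published.

```python
# Illustrative prompt/completion pair; sentinel token names are hypothetical.
prompt = (
    "<code>\n"
    "1 def add(a, b):\n"
    "2     return a + c\n"
    "</code>\n"
    "<diagnostic>\n"
    '2 "c" is not defined (reportUndefinedVariable)\n'
    "</diagnostic>\n"
)
expected_completion = (
    "<diff>\n"
    "2 -     return a + c\n"
    "2 +     return a + b\n"
    "</diff>"
)
```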
## Evaluation Strategy
The evaluation approach was comprehensive and addresses a critical LLMOps concern: existing automated program repair benchmarks have been shown to have leaked into the pre-training corpora of large code LLMs, and are often curated from professional repositories that poorly represent the skill diversity of real users.
They created a two-part evaluation. The LeetCode repair eval uses DebugBench (selected for recency, error subtyping, and open-source pipeline) with a subset of syntactic and reference errors that can be assisted by LSP diagnostics. They also used the LiveCodeBench approach of selecting recent LeetCode problems after the base model's data cutoff date and applying the DebugBench synthetic bug injection pipeline, resulting in 360 samples.
The Replit repair eval is a completely new benchmark designed to test the model in the actual inference setting—fixing LSP diagnostics for users writing code on Replit. They sampled held-out (code, diagnostic) pairs from each diagnostic type, removed low-quality code, deduplicated following StarCoder recommendations to ensure no train-test leakage, and had human annotators verify or correct SOTA LLM-generated fixes. This resulted in 389 samples.
Metrics included functional correctness (for the LeetCode eval, where solutions can be submitted for evaluation), AST exact match, and AST match with string fallback (for cases where the source code cannot be parsed into a valid AST but the fix is still valid). They acknowledge that exact match is a lower bound on functional correctness but is necessary when test generation isn't feasible.
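A minimal sketch of the AST-match-with-string-fallback metric, assuming a simple whitespace normalization for the fallback (the exact normalization is not specified):

```python
# Compare parse trees when both sides parse; otherwise fall back to comparing
# lightly normalized strings. Normalization details are assumptions.
import ast

def fixes_match(predicted: str, reference: str) -> bool:
    try:
        # AST comparison ignores formatting and comment differences.
        return ast.dump(ast.parse(predicted)) == ast.dump(ast.parse(reference))
    except SyntaxError:
        norm = lambda s: "\n".join(line.rstrip() for line in s.strip().splitlines())
        return norm(predicted) == norm(reference)
```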
Baselines included GPT-4-Turbo, GPT-3.5-Turbo, Claude-3-Opus, Claude-3-Haiku, and the base DeepSeek-Coder-Instruct-v1.5 model.
## Results and Production Considerations
The Replit Code Repair 7B model achieved competitive performance against much larger models on both benchmarks. Notably, on the real-world Replit eval it opens a significant performance gap over all the other baselines except GPT-4 Turbo, demonstrating the value of specialized training on platform-native data.
A key finding was that overall performance on the real-world eval remains lower than on the LeetCode eval, highlighting the importance of evaluating on both academic and production-realistic benchmarks. This is a valuable lesson for LLMOps practitioners: academic benchmarks may overestimate production performance.
Scaling experiments showed that performance improves with both training dataset size (testing 10K, 25K, 50K, 75K samples) and model parameters, providing guidance for future scaling decisions.
## Future Work and Production Deployment
The team plans several extensions relevant to production deployment: handling more complex cases like cross-file edits, improving multi-line edit performance, supporting the long tail of errors seen on Replit, and extending to more programming languages (with interest in cross-language transfer learning). They are also investing in improved evaluations to capture wider distributions of LSP errors across languages.
Once the model is in production, they plan to experiment with post-training methods such as DPO using user data collected by the platform (which fixes are accepted vs. rejected), creating a valuable feedback loop for continuous improvement. This highlights a key advantage of building platform-native models: direct access to user acceptance signals for preference-based post-training.
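As a sketch of how such acceptance signals could be turned into DPO training data, the snippet below pairs accepted and rejected suggestions for the same prompt; the record fields are hypothetical, and the prompt/chosen/rejected layout follows the common preference-dataset convention rather than any pipeline Replit has described.

```python
# Hypothetical construction of DPO preference pairs from logged repair events.
def to_preference_pairs(events):
    """Yield prompt/chosen/rejected dicts from events that contain both an
    accepted and a rejected suggestion for the same diagnostic prompt."""
    by_prompt = {}
    for e in events:  # e.g. {"prompt": ..., "suggestion": ..., "accepted": bool}
        bucket = by_prompt.setdefault(e["prompt"], {"accepted": [], "rejected": []})
        bucket["accepted" if e["accepted"] else "rejected"].append(e["suggestion"])
    for prompt, groups in by_prompt.items():
        for good in groups["accepted"]:
            for bad in groups["rejected"]:
                yield {"prompt": prompt, "chosen": good, "rejected": bad}
```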
The overall approach represents a mature LLMOps workflow: domain-specific data engineering, careful synthetic data generation with verification, infrastructure choices balancing capability and latency, rigorous evaluation on both academic and production-realistic benchmarks, and planning for post-deployment optimization using production signals.