Replit: Building Production-Ready LLMs for Automated Code Repair: A Scalable IDE Integration Case Study - ZenML LLMOps Database

Company

Replit

Title

Building Production-Ready LLMs for Automated Code Repair: A Scalable IDE Integration Case Study

Industry

Tech

Link

https://blog.replit.com/code-repair

Year

2024

Summary (short)

Replit tackled the challenge of automating code repair in their IDE by developing a specialized 7B-parameter LLM that integrates directly with their Language Server Protocol (LSP) diagnostics. They created a production-ready system that automatically fixes Python code errors by processing real-time IDE events, operational transformations, and project snapshots. Using DeepSeek-Coder-Instruct-v1.5 as their base model, they implemented a comprehensive data pipeline with serverless verification, structured input/output formats, and GPU-accelerated inference. Their fine-tuned 7B model matched or exceeded much larger models such as GPT-4 and Claude-3 on both academic benchmarks and real-world error fixes. The production system features low-latency inference, load balancing, and real-time code application, demonstrating successful deployment of an LLM system in a development environment where speed and accuracy are crucial.

Tags

code_generation

code_interpretation

latency_optimization

microsoft_azure

model_optimization

# LLMOps Case Study Notes: Replit Code Repair System

## Overview

- Replit is developing AI-native development tools by integrating LLMs directly into their IDE
- Primary use case: Automated code repair using LSP (Language Server Protocol) diagnostics
- Goal: Create an LLM that can understand IDE events and provide contextually appropriate fixes
- Target: Fix Python code errors identified by LSP that don't have deterministic solutions

## System Architecture & Data Pipeline

### Data Sources

- User events from IDE sessions
- Operational Transformations (OTs) representing code changes
- LSP diagnostic messages
- Project snapshots for verification

### Pipeline Components

- Serverless verification of reconstructed states
- Training data store for processed examples
- GPU cluster for model training
- Deployment infrastructure with load balancing

### Data Processing

- Reconstruction of filesystem state at time of errors
- Filtering of deterministic cases (where LSP provides fixes)
- Removal of stylistic rules
- Exclusion of private/non-Python projects
- Verification against GCS stored copies
- Integration with Ruff and Pyright for diagnostic validation

## Model Development

### Base Model Selection

- Chose DeepSeek-Coder-Instruct-v1.5 (7B parameters)
- Selected based on balance of:
- Used Flash Attention v2 Triton kernel optimization

### Training Infrastructure

- Platform: MosaicML
- Hardware: Single node with 8 H100s
- Framework: LLM Foundry (v0.5.0) with Composer
- Distribution: FSDP with Full Shard strategy
- Activation checkpointing enabled

### Training Configuration

- Optimizer: Decoupled AdamW
- Learning rate: 1e-5 with Cosine Annealing
- Warmup: 100 batches
- Training duration: 4 epochs
- Batch size: 16
- Mixed precision: BF16
- Gradient clipping threshold: 1.0
- Packing ratio: 6.0 for sequence binning

## Production Considerations

### Input/Output Format

- Schema using angle-bracketed sentinel tokens
- Structured format for IDE integration
- Consistent template for parsing/generation
- Support for future IDE event types

### Performance Optimizations

- Line numbering for unambiguous fixes
- Maintained code formatting close to training distribution
- Efficient tokenizer mapping (3-5 tokens per sentinel)
- Flexible output space for various edit types

### Production Pipeline

- Integration with Replit workspace
- Load balancer for GPU inference
- Real-time code application
- Model serving infrastructure

## Evaluation Framework

### Metrics

- Functional correctness (for Leetcode problems)
- AST matching
- String representation matching
- Pass@1 performance

### Evaluation Sets

- Leetcode repair benchmark (360 samples)
- Replit repair benchmark (389 samples)
- Zero-shot and few-shot testing
- Cross-language transfer evaluation

## Scaling & Future Work

### Current Scaling Results

- Data scaling shows consistent improvement
- Parameter scaling demonstrates benefits up to 33B
- Competitive with larger models (GPT-4, Claude-3)

### Future Development Plans

- Expand to cross-file edits
- Improve multi-line edit performance
- Support additional programming languages
- Implement DPO based on user feedback
- Scale training dataset
- Enhance evaluation coverage

## Production Integration Notes

### Deployment Strategy

- Client-side integration with Replit IDE
- Workspace LSP diagnostic handling
- Real-time diff application
- Model versioning and updates

### Monitoring & Feedback

- User acceptance tracking
- Error rate monitoring
- Performance metrics collection
- Feedback loop for model improvements

### System Requirements

- Low latency requirements
- High throughput capability
- Reliable error handling
- Scalable infrastructure

## Lessons Learned
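
## Illustrative Sketch: Line-Numbered Edits and AST Matching

The notes mention line numbering for unambiguous fixes, real-time diff application, and AST matching alongside string matching as evaluation metrics. The sketch below shows how those three ideas might fit together. It is not Replit's implementation: the helper names (`number_lines`, `apply_line_edit`, `exact_match`, `ast_match`), the numbering layout, and the sample snippet are all hypothetical, and the sentinel-token schema is deliberately left out because the notes do not specify it.

```python
import ast
from typing import Sequence


def number_lines(source: str) -> str:
    """Prefix each line with its 1-based number (hypothetical prompt layout)."""
    return "\n".join(
        f"{i} {line}" for i, line in enumerate(source.splitlines(), start=1)
    )


def apply_line_edit(source: str, start: int, end: int,
                    replacement: Sequence[str]) -> str:
    """Replace the 1-based, inclusive line range start..end with new lines."""
    lines = source.splitlines()
    lines[start - 1:end] = list(replacement)
    return "\n".join(lines) + "\n"


def exact_match(candidate: str, reference: str) -> bool:
    """String matching: equal after trimming surrounding whitespace."""
    return candidate.strip() == reference.strip()


def ast_match(candidate: str, reference: str) -> bool:
    """AST matching: equal syntax trees, ignoring formatting differences."""
    try:
        return ast.dump(ast.parse(candidate)) == ast.dump(ast.parse(reference))
    except SyntaxError:
        # Code that does not parse cannot match a valid reference fix.
        return False


# Hypothetical example: a missing closing parenthesis on line 2.
broken = "def total(xs):\n    return sum(x for x in xs\n"
reference = "def total(xs):\n    return sum(x for x in xs)\n"

# What a line-numbered view of the broken file could look like.
prompt_view = number_lines(broken)

# A made-up model fix that targets line 2 but uses different spacing.
fixed = apply_line_edit(broken, 2, 2, ["    return sum( x for x in xs )"])

print(prompt_view)
print("exact match:", exact_match(fixed, reference))
print("AST match:", ast_match(fixed, reference))
```

In this example the proposed fix differs from the reference only in spacing, so string matching rejects it while AST matching accepts it. That gap is one plausible reason to report both metrics.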
