Ellipsis built an AI-powered code review system that uses multiple specialized LLM agents to analyze pull requests and provide feedback. The system employs parallel comment generators, a multi-stage filtering pipeline, and code search capabilities backed by vector stores. The approach prioritizes accuracy over latency, relies on extensive evaluation frameworks including LLM-as-judge, and implements robust error handling. End to end, the system ingests GitHub webhooks and delivers automated reviews with high accuracy and a low false-positive rate.
Ellipsis has developed a sophisticated LLM-powered code review system that demonstrates several key principles and best practices in deploying LLMs in production. This case study provides valuable insights into building reliable, scalable LLM systems that can handle real-world software engineering tasks.
The core system revolves around automated code review, but the architecture and approaches described have broader implications for production LLM systems. Rather than using a monolithic approach, Ellipsis opted for a modular system with multiple specialized agents, each handling specific aspects of code review. This architectural decision showcases how breaking down complex LLM tasks into smaller, more manageable components can lead to better performance and maintainability.
## System Architecture and Components
The system begins with a GitHub App installation that processes webhook events through Hookdeck for reliability. These events are then routed to a FastAPI web application and placed into a workflow queue managed by Hatchet. This setup demonstrates a production-grade approach to handling real-time events with LLMs, where asynchronous processing allows for prioritizing accuracy over latency.
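For illustration, the ingestion path might look like the minimal FastAPI sketch below. The `enqueue_review_workflow` helper is a hypothetical stand-in for the Hatchet client, and in production Hookdeck sits in front of this endpoint to guarantee delivery:

```python
import hashlib
import hmac
import os

from fastapi import FastAPI, Header, HTTPException, Request

app = FastAPI()
WEBHOOK_SECRET = os.environ.get("GITHUB_WEBHOOK_SECRET", "dev-secret")


async def enqueue_review_workflow(name: str, payload: dict) -> None:
    # Stand-in for the Hatchet client; the real system pushes the event
    # onto a durable workflow queue and processes it asynchronously.
    print(f"queued workflow {name!r} for PR #{payload.get('number')}")


def verify_signature(payload: bytes, signature: str) -> bool:
    # GitHub signs webhook bodies with HMAC-SHA256, sent in the
    # "X-Hub-Signature-256" header.
    expected = "sha256=" + hmac.new(
        WEBHOOK_SECRET.encode(), payload, hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, signature)


@app.post("/webhooks/github")
async def github_webhook(
    request: Request, x_hub_signature_256: str = Header(...)
):
    body = await request.body()
    if not verify_signature(body, x_hub_signature_256):
        raise HTTPException(status_code=401, detail="invalid signature")
    # Acknowledge quickly; the review itself runs in the background,
    # which is what lets the system trade latency for accuracy.
    await enqueue_review_workflow("pr-review", await request.json())
    return {"status": "queued"}
```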
The review system employs multiple parallel Comment Generators, each specialized for a different class of issue, such as custom rule violations or duplicate code detection. This parallel architecture (sketched in code after the list) allows for:
* Mixing different LLM models (e.g., GPT-4 and Claude) for optimal performance
* Independent benchmarking and optimization of each generator
* Attachment of evidence (code snippets) to support findings
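A minimal sketch of this fan-out, assuming a hypothetical `call_llm` client; the generator prompts, model choices, and confidence values are illustrative, not taken from the case study:

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Comment:
    generator: str          # which generator produced the finding
    body: str               # review comment text
    confidence: float       # generator's self-reported confidence
    evidence: list[str] = field(default_factory=list)  # supporting snippets


async def call_llm(model: str, prompt: str) -> str:
    # Hypothetical client; a real system would call the provider's SDK here.
    return f"[{model} response to {len(prompt)}-char prompt]"


async def custom_rules_generator(diff: str) -> list[Comment]:
    # Each generator owns its own prompt and model, so it can be
    # benchmarked and tuned independently of the others.
    raw = await call_llm("claude-3-5-sonnet",
                         f"Check this diff against team rules:\n{diff}")
    return [Comment("custom_rules", raw, confidence=0.8)]


async def duplicate_code_generator(diff: str) -> list[Comment]:
    raw = await call_llm("gpt-4",
                         f"Flag code duplicated elsewhere in the repo:\n{diff}")
    return [Comment("duplicate_code", raw, confidence=0.7)]


async def generate_comments(diff: str) -> list[Comment]:
    # Fan out to all generators concurrently; latency is bounded by the
    # slowest generator rather than the sum of all of them.
    results = await asyncio.gather(
        custom_rules_generator(diff),
        duplicate_code_generator(diff),
    )
    return [c for batch in results for c in batch]
```

Because each generator is an independent coroutine with its own model and prompt, one can be re-benchmarked or swapped to a different provider without touching the others.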
A notable aspect of the production system is the multi-stage filtering pipeline that reduces false positives, a common failure mode of LLM-based code review tools. The pipeline (see the sketch after this list) includes:
* Confidence thresholds customizable per customer
* Deduplication of similar comments
* Logical correctness checks using evidence
* Incorporation of user feedback through embedding-based similarity search
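A simplified version of the first two stages, building on the `Comment` dataclass above. The 0.9 similarity threshold and the `embed` stub are assumptions, and the real pipeline adds logical-correctness checks and user-feedback similarity search as further stages:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.9  # assumed value; tunable per deployment


def embed(text: str) -> np.ndarray:
    # Hypothetical embedding call; a real system would hit an embedding API.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)


def filter_comments(comments: list, min_confidence: float) -> list:
    """Apply a per-customer confidence cutoff, then drop near-duplicates."""
    kept, kept_vecs = [], []
    # Visit high-confidence comments first so duplicates resolve in
    # favor of the stronger finding.
    for c in sorted(comments, key=lambda c: c.confidence, reverse=True):
        if c.confidence < min_confidence:
            continue  # per-customer confidence threshold
        v = embed(c.body)
        # Deduplicate: skip comments too similar to one already kept.
        if any(float(v @ u) > SIMILARITY_THRESHOLD for u in kept_vecs):
            continue
        kept.append(c)
        kept_vecs.append(v)
    return kept
```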
## Code Search and RAG Implementation
The system implements an advanced Code Search capability using both keyword and vector search approaches. Their RAG implementation includes several notable features (an AST-chunking sketch follows the list):
* Multiple chunking methods using tree-sitter for AST parsing
* Dual-purpose embeddings for both specific code functionality and high-level understanding
* Binary classification for result relevance using LLMs
* Efficient HEAD indexing to avoid full repository re-embedding
* Integration with language servers for IDE-like functionality
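As one example, function- and class-level chunking with tree-sitter might look like the following. This is a generic illustration of AST-based chunking, not Ellipsis's code; note that py-tree-sitter >= 0.22 accepts the language in the `Parser` constructor, while older releases use `parser.set_language(...)`:

```python
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

# Parser setup for py-tree-sitter >= 0.22.
PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)


def chunk_by_definition(source: bytes) -> list[str]:
    """Split a module into top-level function/class chunks via the AST,
    so each embedded chunk covers one coherent unit of code rather than
    an arbitrary window of lines."""
    tree = parser.parse(source)
    chunks = []
    for node in tree.root_node.children:
        # "decorated_definition" wraps functions/classes with decorators.
        if node.type in ("function_definition", "class_definition",
                         "decorated_definition"):
            chunks.append(source[node.start_byte:node.end_byte].decode())
    return chunks


sample = b"def add(a, b):\n    return a + b\n\nclass Greeter:\n    pass\n"
print(chunk_by_definition(sample))  # two chunks: the function and the class
```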
## Development Workflow and Quality Assurance
Ellipsis has invested heavily in evaluation and testing frameworks, moving beyond simple snapshot testing to more rigorous approaches (an LLM-as-judge sketch follows the list):
* Automated benchmarking with ~30 examples per feature
* LLM-as-judge for evaluating agent outputs
* Custom UIs for data annotation
* Extensive error handling at multiple levels
* Automated generation of test cases
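A minimal LLM-as-judge harness, reusing the hypothetical `call_llm` client and `generate_comments` agent from the earlier sketches; the judge prompt and JSON schema are illustrative:

```python
import json

JUDGE_PROMPT = """You are grading an automated code review comment.
Expected finding: {expected}
Agent's comment: {actual}
Does the comment identify the same underlying issue?
Answer with JSON: {{"match": true, "reasoning": "..."}}"""


async def judge(expected: str, actual: str) -> bool:
    raw = await call_llm(
        "gpt-4", JUDGE_PROMPT.format(expected=expected, actual=actual)
    )
    try:
        return bool(json.loads(raw)["match"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return False  # unparseable judge output counts as a failed match


async def run_benchmark(cases: list[dict]) -> float:
    """cases: ~30 hand-labeled {"diff": ..., "expected": ...} examples."""
    hits = 0
    for case in cases:
        comments = await generate_comments(case["diff"])  # agent under test
        verdicts = [await judge(case["expected"], c.body) for c in comments]
        hits += any(verdicts)
    return hits / len(cases)
```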
Their development workflow emphasizes rapid iteration and continuous improvement (a production-sampling sketch follows the list):
* Initial prompt engineering with sanity checks
* Systematic measurement of accuracy
* Diagnosis of failure classes using LLM auditors
* Data generation and augmentation
* Regular sampling of production data for edge cases
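The case study doesn't describe the sampling mechanism, but uniform reservoir sampling is one simple way to keep a fixed-size pool of production traces for offline edge-case review:

```python
import random


class ReservoirSampler:
    """Keep a uniform random sample of k production traces (Algorithm R)."""

    def __init__(self, k: int, seed: int = 0):
        self.k = k
        self.seen = 0
        self.sample: list[dict] = []
        self.rng = random.Random(seed)

    def offer(self, trace: dict) -> None:
        self.seen += 1
        if len(self.sample) < self.k:
            self.sample.append(trace)
        else:
            # Each new trace replaces a kept one with probability k/seen,
            # so every trace seen so far is retained with equal probability.
            j = self.rng.randrange(self.seen)
            if j < self.k:
                self.sample[j] = trace
```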
## Error Handling and Reliability
The system implements comprehensive error handling at multiple levels (a retry-and-fallback sketch follows the list):
* Basic retries and timeouts for LLM calls
* Model fallbacks (e.g., Claude to GPT-4)
* Tool validation and error feedback loops
* Graceful degradation when components fail
* Descriptive user messaging
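The retry-timeout-fallback pattern, again using the hypothetical `call_llm` client; the model order, retry counts, and backoff schedule here are assumptions:

```python
import asyncio


async def call_with_fallback(
    prompt: str,
    models: tuple = ("claude-3-5-sonnet", "gpt-4"),
    retries: int = 2,
    timeout: float = 60.0,
) -> str:
    """Retry each model with a timeout and backoff, falling back down the list."""
    last_error = None
    for model in models:  # fallback order: primary model first
        for attempt in range(retries + 1):
            try:
                return await asyncio.wait_for(call_llm(model, prompt), timeout)
            except Exception as e:  # timeouts, rate limits, transient errors
                last_error = e
                await asyncio.sleep(2 ** attempt)  # exponential backoff
    # Every model and retry exhausted: raise with a descriptive message
    # that the caller can translate into user-facing text.
    raise RuntimeError(f"all models failed for this request: {last_error}")
```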
## Context Management
To keep long-running agent interactions within model context limits, they employ several strategies (sketched after the list):
* Token budget management with hard and soft cutoffs
* Priority-based message handling
* Self-summarization capabilities
* Tool-specific summary generation
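A sketch of soft and hard token cutoffs with priority-based dropping. The message schema, the limits, and the character-based token estimate are all assumptions; a real system would use the model's tokenizer and an LLM call for the summary:

```python
def count_tokens(text: str) -> int:
    # Crude 4-chars-per-token estimate; a real system would use a tokenizer.
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], soft_limit: int,
                 hard_limit: int) -> list[dict]:
    """Assumed schema: {"role": str, "content": str, "priority": "high"|"low"}."""
    def total(msgs: list[dict]) -> int:
        return sum(count_tokens(m["content"]) for m in msgs)

    msgs = list(messages)
    # Soft cutoff: drop low-priority messages (e.g. verbose tool output),
    # oldest first, until the conversation fits the budget.
    i = 0
    while total(msgs) > soft_limit and i < len(msgs):
        if msgs[i]["priority"] == "low":
            msgs.pop(i)
        else:
            i += 1
    # Hard cutoff: collapse the oldest half into one summary message; in
    # the real system the text would come from a self-summarization call.
    if total(msgs) > hard_limit:
        summary = {"role": "system", "priority": "high",
                   "content": "Summary of earlier conversation: ..."}
        msgs = [summary] + msgs[len(msgs) // 2:]
    return msgs
```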
## Model Selection and Integration
The system primarily uses Claude (Sonnet-3.6) and GPT-4, with Claude showing slightly better performance. They've observed a convergence in how these models respond to prompts, making it easier to swap between them. The integration of GPT-4 Turbo (1106) required different prompting strategies but showed improvements for complex tasks.
## Future Developments
Ellipsis is working on expanding their system with:
* Graph-based code search capabilities for deeper context understanding
* Configurable sandboxes for code generation with build/lint/test capabilities
* Enhanced PR review capabilities using graph traversal
The case study demonstrates a mature approach to deploying LLMs in production, with careful attention to system architecture, reliability, and performance. Their experience shows that successful LLM systems require not just good prompts, but comprehensive infrastructure for testing, monitoring, and continuous improvement.