Greptile faced a challenge with their AI code review bot generating too many low-value "nit" comments, leading to user frustration and ignored feedback. After unsuccessful attempts with prompt engineering and LLM-based severity rating, they implemented a successful solution using vector embeddings to cluster and filter comments based on user feedback. This approach raised the percentage of addressed comments from 19% to over 55% within two weeks of deployment.
This case study explores how Greptile, a company providing AI-powered code review solutions, tackled a critical challenge in deploying LLMs in production: managing the signal-to-noise ratio in AI-generated feedback. The study provides valuable insights into the practical challenges of implementing LLMs in production systems and the importance of continuous improvement based on user feedback.
The core problem Greptile faced is common in LLM applications: the tendency of models to be overly verbose and generate too many low-value outputs. Their AI code review bot was flooding pull requests with comments, sometimes leaving up to 10 comments on a PR containing just 20 changes. This created a significant user experience issue in which developers began ignoring the bot's feedback entirely, defeating its purpose.
Initial analysis of their production system revealed a stark reality: only about 19% of generated comments were considered valuable, 2% were incorrect, and a whopping 79% were "nits" - technically accurate but ultimately unimportant comments that developers didn't want to address. This kind of detailed analysis of production metrics is crucial in LLMOps, as it provides concrete data for improvement efforts.
The case study details three distinct approaches they attempted to solve this problem, each providing valuable lessons for LLMOps practitioners:
First Approach - Prompt Engineering:
Their initial attempt focused on prompt engineering, a common first resort in LLM applications. However, they discovered a fundamental limitation: even sophisticated prompting couldn't effectively separate high-value from low-value comments without also reducing the number of critical comments. This highlights an important lesson in LLMOps: prompt engineering, while powerful, isn't always the answer to complex quality control problems.
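The case study does not disclose the actual prompts Greptile used, but the general shape of this first attempt is easy to picture. The sketch below is a hypothetical illustration of prompt-level nit suppression; the guideline text and helper function are assumptions, not Greptile's implementation.

```python
# Hypothetical system-prompt addition for suppressing "nit" comments.
# The actual prompts Greptile used are not disclosed in the case study.
REVIEW_GUIDELINES = """
You are reviewing a pull request. Only leave a comment if it identifies a
bug, a security issue, or a meaningful maintainability problem. Do NOT
comment on style preferences, naming, or other minor "nit" issues.
"""

def build_review_prompt(diff: str) -> str:
    """Combine the nit-suppression guidelines with the PR diff."""
    return f"{REVIEW_GUIDELINES}\n\nPull request diff:\n{diff}"
```

As the case study notes, instructions like these tend to be blunt: tightening them enough to remove nits also suppresses some of the critical comments the bot exists to make.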
Second Approach - LLM as a Judge:
They then tried implementing a two-pass system where a second LLM would rate the severity of comments on a 1-10 scale, filtering out low-severity items. This approach failed for two reasons:
* The LLM's judgment of its own output was essentially random
* The additional inference step significantly impacted performance
This attempt highlighted important considerations about system architecture and the limitations of using LLMs for self-evaluation.
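For concreteness, here is a minimal sketch of the two-pass severity-rating idea, written against a generic `ask_llm` callable rather than any specific provider SDK; the prompt wording and the threshold value are assumptions for illustration only.

```python
from typing import Callable

def severity_filter(
    comments: list[str],
    ask_llm: Callable[[str], str],   # any chat-completion wrapper returning text
    min_severity: int = 6,           # illustrative cutoff, not from the case study
) -> list[str]:
    """Second pass: ask an LLM to rate each comment 1-10 and drop low-severity ones.

    Greptile found the ratings were essentially random, and the extra
    inference call per comment added noticeable latency.
    """
    kept = []
    for comment in comments:
        prompt = (
            "Rate the severity of this code review comment on a scale of 1-10, "
            "where 10 is a critical bug and 1 is a trivial nit. "
            f"Reply with only the number.\n\nComment: {comment}"
        )
        try:
            severity = int(ask_llm(prompt).strip())
        except ValueError:
            severity = 0  # unparseable rating -> treat as low severity
        if severity >= min_severity:
            kept.append(comment)
    return kept
```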
Final Successful Approach - Vector Embeddings and Clustering:
The successful solution came from combining several modern LLMOps techniques:
* Generation of embedding vectors for comments
* Storage in a vector database
* Implementation of a feedback-based filtering system
* Use of cosine similarity measurements
* Team-specific learning through feedback collection
The system worked by:
* Generating embeddings for new comments
* Comparing them against previously stored feedback
* Filtering out comments that were similar to previously downvoted comments
* Using a threshold-based system (requiring at least 3 similar downvoted comments)
* Maintaining team-specific databases to account for different standards
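A minimal sketch of this feedback-based filter is shown below, assuming an external `embed` function and an in-memory list of downvoted-comment embeddings per team. The similarity threshold is an assumption; only the "at least 3 similar downvoted comments" rule comes from the case study, and Greptile's actual vector database and embedding model are not specified.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.85   # illustrative; the case study does not give the value
MIN_SIMILAR_DOWNVOTES = 3     # matches the "at least 3 similar downvoted comments" rule

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def should_suppress(
    comment_embedding: np.ndarray,
    downvoted_embeddings: list[np.ndarray],  # team-specific store of downvoted comments
) -> bool:
    """Suppress a new comment if it closely resembles enough previously
    downvoted comments from the same team."""
    similar = sum(
        1
        for past in downvoted_embeddings
        if cosine_similarity(comment_embedding, past) >= SIMILARITY_THRESHOLD
    )
    return similar >= MIN_SIMILAR_DOWNVOTES

def filter_comments(
    comments: list[str],
    embed,                                   # embedding client: str -> vector
    downvoted_embeddings: list[np.ndarray],
) -> list[str]:
    """Keep only comments that do not match the team's downvoted history."""
    return [
        c for c in comments
        if not should_suppress(np.asarray(embed(c)), downvoted_embeddings)
    ]
```

In production this store would live in a vector database queried by nearest-neighbor search rather than a Python list, but the filtering logic is the same: new comments are compared against a team's accumulated downvotes before they are posted.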
This approach proved remarkably effective, increasing the "address rate" (percentage of comments that developers actually addressed) from 19% to over 55% within two weeks of deployment.
Key LLMOps Lessons:
* The importance of concrete metrics for measuring LLM system performance
* The value of collecting and utilizing user feedback in production systems
* The limitations of prompt engineering and simple filtering approaches
* The effectiveness of combining traditional ML techniques (clustering, embeddings) with LLMs
* The need for team-specific customization in production LLM systems
* The importance of performance considerations in production systems
The case study also touches on important architectural decisions, such as choosing not to pursue fine-tuning due to costs, speed implications, and the desire to maintain model agnosticism. This demonstrates the practical tradeoffs that must be considered when deploying LLMs in production.
The solution's success also highlights the value of hybrid approaches in LLMOps - combining the generative capabilities of LLMs with traditional machine learning techniques like embedding-based clustering. This kind of hybrid approach can often provide more robust and maintainable solutions than pure LLM-based approaches.
From a deployment perspective, the case study shows the importance of continuous monitoring and improvement. The team's ability to measure the impact of their changes (through the address rate metric) allowed them to quantify the success of their solution and identify areas for further improvement.
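The address-rate metric itself is simple to compute; a small sketch follows, assuming each posted comment carries an `addressed` flag collected from PR activity (the field name is hypothetical).

```python
def address_rate(comments: list[dict]) -> float:
    """Share of posted comments that developers actually addressed.

    Each comment dict is assumed to carry an 'addressed' boolean flag.
    """
    if not comments:
        return 0.0
    return sum(c["addressed"] for c in comments) / len(comments)

# Tracked over time, this is the metric that moved from ~0.19 to over 0.55
# in the two weeks after the embedding-based filter was deployed.
```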
The company's transparent acknowledgment that the solution isn't perfect (55% is "far from perfect") but represents significant progress is also noteworthy. This realistic approach to LLMOps - acknowledging that perfect solutions are rare and that continuous improvement is necessary - is valuable for practitioners in the field.