Company
GitHub
Title
Building and Scaling AI-Powered Password Detection in Production
Industry
Tech
Year
2025
Summary (short)
GitHub developed and deployed Copilot secret scanning to detect generic passwords in codebases using LLMs, addressing the limitations of traditional regex-based approaches. The team iteratively improved the system through extensive testing, prompt engineering, and novel resource management techniques, ultimately achieving a 94% reduction in false positives while maintaining high detection accuracy. The solution scaled to enterprise workloads through sophisticated capacity management and workload-aware request handling.
This case study from GitHub details their journey in implementing and scaling an AI-powered password detection system called Copilot secret scanning. The project represents a significant real-world application of LLMs in production, specifically addressing the challenge of detecting generic passwords in code repositories - a task that traditional regex-based approaches struggle with due to the varied and nuanced nature of password patterns.

The development journey provides valuable insights into the practical challenges and solutions in deploying LLMs at scale. The team started with a baseline implementation using GPT-3.5-Turbo and few-shot prompting, but quickly discovered limitations when dealing with unconventional file types and structures in customer repositories. This early experience highlighted the importance of comprehensive testing and evaluation frameworks when deploying LLMs in production.

The technical evolution of the system involved several key LLMOps practices and innovations:

* Prompt Engineering and Model Selection:
  * The team experimented with various prompting strategies, including Fill-in-the-Middle, Zero-Shot, and Chain-of-Thought approaches
  * They ultimately adopted a hybrid approach using Microsoft's MetaReflection technique, combining Chain-of-Thought with few-shot prompting
  * Multiple models were tested, including both GPT-3.5-Turbo and GPT-4, with the team eventually settling on a two-model approach in which GPT-4 validates candidates found by GPT-3.5-Turbo (sketched in the first example after these lists)
* Testing and Evaluation:
  * Developed an offline evaluation framework incorporating diverse test cases (see the evaluation sketch below)
  * Implemented visual analysis tools to track the impact of model and prompt changes
  * Created a data collection pipeline leveraging the GitHub Code Security team's processes
  * Used GPT-4 to generate additional test cases based on patterns from existing secret scanning alerts
  * Employed mirror testing against real repositories to validate improvements without impacting users
* Scaling and Resource Management:
  * Implemented content filtering to reduce unnecessary processing
  * Experimented with different context windows and tokenization strategies
  * Created a workload-aware request management system for efficient resource allocation
  * Developed an algorithm allowing dynamic resource sharing between different workloads (illustrated in the capacity-sharing sketch below)
  * The resource management solution was successful enough to be adopted by other GitHub services such as Copilot Autofix
* Production Monitoring and Optimization:
  * Implemented comprehensive monitoring of detection quality metrics
  * Tracked both precision (the share of reported alerts that are true positives) and recall (the share of real secrets that are detected)
  * Used voting mechanisms to handle the non-determinism of LLM responses
  * Maintained ongoing monitoring and refinement based on production data

The results demonstrate the effectiveness of their approach:

* Achieved a 94% reduction in false positives across organizations
* Successfully scaled to scanning nearly 35% of all GitHub Secret Protection repositories
* Maintained high detection accuracy while significantly reducing noise
* Created a reusable framework for resource management that benefited other AI services

The case study also highlights important lessons for LLMOps:

* The critical importance of balancing precision with recall in security applications
* The value of diverse test cases and comprehensive evaluation frameworks
* The need for sophisticated resource management when scaling LLM applications
* The benefits of collaborative innovation across teams
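The case study describes the two-stage detection and voting design but does not publish GitHub's implementation. A minimal sketch of the idea, assuming the OpenAI chat completions API, with invented prompts, per-stage model choices, and a hypothetical `scan_snippet` helper, might look like this:

```python
# Hypothetical sketch of the two-stage flow described above: a cheaper model
# proposes password candidates, a stronger model validates each one, and the
# validation is repeated and majority-voted to smooth out non-determinism.
# Prompts, model names for each stage, and the output parsing are illustrative only.
from openai import OpenAI

client = OpenAI()

DETECT_PROMPT = "List, one per line, any strings in this code that look like hardcoded passwords."
VALIDATE_PROMPT = "Is {candidate!r} used as a real hardcoded password in this code? Answer YES or NO."


def ask(model: str, prompt: str, snippet: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{prompt}\n\n{snippet}"}],
    )
    return response.choices[0].message.content.strip()


def scan_snippet(snippet: str, votes: int = 3) -> list[str]:
    # Stage 1: cheap, high-recall candidate generation.
    candidates = [
        line.strip()
        for line in ask("gpt-3.5-turbo", DETECT_PROMPT, snippet).splitlines()
        if line.strip()
    ]

    confirmed = []
    for candidate in candidates:
        # Stage 2: the stronger model validates each candidate several times;
        # a majority vote decides whether the finding becomes an alert.
        yes_votes = sum(
            ask("gpt-4", VALIDATE_PROMPT.format(candidate=candidate), snippet)
            .upper()
            .startswith("YES")
            for _ in range(votes)
        )
        if yes_votes > votes // 2:
            confirmed.append(candidate)
    return confirmed
```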
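The offline evaluation framework itself is also not shown in the write-up. The sketch below is an assumed, simplified harness that scores a candidate detector against labelled test cases on the precision and recall metrics the team tracked; the `TestCase` shape and `detect` signature are illustrative.

```python
# Toy offline evaluation harness: run a candidate prompt/model configuration
# over labelled test files and report precision and recall, so a change can be
# compared against the current baseline before it ships.
from dataclasses import dataclass
from typing import Callable


@dataclass
class TestCase:
    content: str        # file contents to scan
    expected: set[str]  # passwords known to be present (may be empty)


def evaluate(detect: Callable[[str], set[str]], cases: list[TestCase]) -> dict[str, float]:
    tp = fp = fn = 0
    for case in cases:
        found = detect(case.content)
        tp += len(found & case.expected)   # real passwords correctly reported
        fp += len(found - case.expected)   # noise that would annoy users
        fn += len(case.expected - found)   # real passwords that were missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return {"precision": precision, "recall": recall}
```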
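Likewise, the workload-aware resource sharing algorithm is only described, not published. The following is a minimal sketch of the general idea under assumed semantics (per-interval token budgets, invented workload names and numbers): each workload has a guaranteed share of capacity, and a busy workload may borrow capacity that another workload is not currently using.

```python
# Minimal illustration of workload-aware capacity sharing: each workload gets a
# guaranteed share of the LLM token budget per scheduling interval, and idle
# capacity from one workload can be borrowed by another that is running hot.
from dataclasses import dataclass, field


@dataclass
class CapacityManager:
    guaranteed: dict[str, int]                      # tokens reserved per workload per interval
    used: dict[str, int] = field(default_factory=dict)

    def try_acquire(self, workload: str, tokens: int) -> bool:
        """Grant from the workload's own share first, then from idle capacity."""
        spent = self.used.get(workload, 0)
        own_remaining = self.guaranteed[workload] - spent
        total_idle = sum(self.guaranteed.values()) - sum(self.used.values())
        if tokens <= own_remaining or tokens <= total_idle:
            self.used[workload] = spent + tokens
            return True
        return False                                # caller should queue or retry later

    def reset_interval(self) -> None:
        """Called at the start of each scheduling interval."""
        self.used.clear()


# Example with invented workloads: incremental scans of new pushes vs. full-history backfills.
manager = CapacityManager(guaranteed={"incremental": 60_000, "backfill": 40_000})
assert manager.try_acquire("incremental", 50_000)
assert manager.try_acquire("backfill", 45_000)       # borrows idle incremental capacity
assert not manager.try_acquire("backfill", 10_000)   # over the combined budget, must wait
```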
The system continues to evolve, with ongoing monitoring and refinement based on customer feedback and detection trends. The team maintains a strong focus on precision while ensuring scalability and performance, demonstrating how LLMs can be effectively deployed and managed in production for critical security applications. This implementation showcases the full lifecycle of an LLM application in production, from initial development through testing, scaling, and ongoing maintenance, and particularly highlights the importance of careful evaluation, resource management, and continuous monitoring when deploying LLMs at scale in enterprise environments.
