
BM25 vs Vector Search for Large-Scale Code Repository Search

GitHub 2024

GitHub faces the challenge of providing efficient search across 100+ billion documents while maintaining low latency and supporting diverse search use cases. They chose BM25 over vector search due to its computational efficiency, zero-shot capabilities, and ability to handle diverse query types. The solution involves careful optimization of search infrastructure, including strategic data routing and field-specific indexing approaches, resulting in a system that effectively serves GitHub's massive scale while keeping costs manageable.

Industry: Tech

Overview and Important Caveat

This case study entry is based on a podcast episode from “How AI Is Built” that unfortunately returned a 404 error when attempting to access the transcript. The URL suggests the episode title was “BM25 is the workhorse of search, vectors are its visionary cousin” (Season 2, Episode 14). Given the complete lack of actual content, this summary will discuss the general LLMOps principles that such a topic typically covers, while clearly acknowledging that specific claims, implementations, and results from the original source cannot be verified or summarized.

The connection to GitHub as a company cannot be established from the available (non-existent) content. It is possible that GitHub engineers or their search infrastructure were discussed in the original episode, but this cannot be confirmed.

General Context: BM25 and Vector Search in LLMOps

The title of the episode suggests a discussion about the complementary nature of traditional keyword-based search algorithms and modern neural embedding-based search approaches. This is a highly relevant topic in the LLMOps space, particularly for organizations building Retrieval-Augmented Generation (RAG) systems or semantic search applications.

BM25 (Best Matching 25) is a ranking function used by search engines to estimate the relevance of documents to a given search query. It has been the backbone of information retrieval systems for decades and remains remarkably effective for many use cases. The algorithm works by considering term frequency, inverse document frequency, and document length normalization to score documents against queries.
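To make the scoring mechanics concrete, here is a minimal, self-contained sketch of Okapi BM25 over a toy tokenized corpus. The corpus, query, and parameter values (`k1=1.5`, `b=0.75` are common defaults) are illustrative assumptions, not drawn from the original episode:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with Okapi BM25."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n  # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        # Document frequency: how many documents contain the term.
        df = sum(1 for d in corpus if term in d)
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        freq = tf[term]
        # Length normalization: long documents are penalized via b.
        denom = freq + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * freq * (k1 + 1) / denom
    return score

corpus = [["car", "repair", "guide"],
          ["cat", "care", "tips"],
          ["car", "engine", "repair", "manual"]]
query = ["car", "repair"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
```

Documents sharing query terms score above zero, while the unrelated document scores exactly zero, illustrating BM25's strict reliance on lexical overlap.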

Vector search, on the other hand, leverages neural network embeddings to represent both queries and documents as dense vectors in a high-dimensional space. This enables semantic matching where conceptually similar content can be retrieved even when there is no exact keyword overlap between the query and the documents.
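The semantic-matching behavior described above reduces to nearest-neighbor lookup under a similarity metric, most commonly cosine similarity. The tiny hand-made 3-dimensional "embeddings" below are purely illustrative stand-ins for real learned vectors (which typically have hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: related concepts are placed near each other by construction.
emb = {
    "car repair":          [0.85, 0.90, 0.15],
    "banana bread recipe": [0.05, 0.10, 0.95],
}
query_vec = [0.9, 0.8, 0.1]  # stand-in embedding of "automobile maintenance"
sims = {doc: cosine_similarity(query_vec, v) for doc, v in emb.items()}
```

Even with zero keyword overlap, "automobile maintenance" lands closest to "car repair", which is exactly the retrieval behavior BM25 alone cannot provide.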

Typical LLMOps Considerations for Search Architectures

In production LLM systems, particularly those employing RAG patterns, the choice of retrieval mechanism is critical. There are several key considerations that teams typically face when deploying search systems:

Latency and Performance: BM25 is generally faster and more computationally efficient than vector search. Inverted indices can be searched very quickly, while vector similarity calculations require either brute-force comparisons or approximate nearest neighbor (ANN) algorithms. For high-throughput production systems, this performance difference can be significant.

Accuracy and Semantic Understanding: Vector embeddings excel at capturing semantic relationships that keyword-based approaches miss. Queries like “automobile maintenance” might fail to retrieve documents about “car repair” with BM25, but a good embedding model would place these concepts close together in vector space.

Infrastructure Requirements: Vector search typically requires specialized infrastructure such as vector databases (Pinecone, Weaviate, Qdrant, Milvus, etc.) and GPU resources for embedding generation. BM25 can run on traditional search infrastructure like Elasticsearch or Solr with lower resource requirements.

Hybrid Approaches: Many production systems combine both approaches to leverage the strengths of each. Common patterns include running BM25 and vector retrieval in parallel and merging the ranked lists with reciprocal rank fusion (RRF), using BM25 as a fast first-stage recall pass followed by a neural re-ranker, and blending normalized lexical and semantic scores with a tunable weight.
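One widely used fusion technique is reciprocal rank fusion, which combines ranked lists using only rank positions, so no score normalization across the two systems is needed. A minimal sketch, with hypothetical document IDs and the conventional constant `k=60`:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc_a", "doc_b", "doc_c"]   # lexical ranking
vector_hits = ["doc_b", "doc_d", "doc_a"]   # semantic ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that appear high in both lists (here `doc_b`) rise to the top of the fused ranking, which is why RRF is a popular low-complexity default for hybrid retrieval.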

Production Deployment Considerations

When deploying search systems that power LLM applications, teams must consider several operational aspects:

Index Management: Both BM25 indices and vector stores require careful management. Updates to content must be reflected in search indices, which can involve reindexing or incremental updates. For vector stores, this also means regenerating embeddings when the embedding model changes.

Embedding Model Selection and Versioning: The choice of embedding model significantly impacts vector search quality. Teams must track which model version was used to generate embeddings and ensure consistency between indexing and query time. Model updates may require complete reindexing of the document corpus.

Evaluation and Monitoring: Production search systems require robust evaluation frameworks. Common metrics include precision@k, recall@k, mean reciprocal rank (MRR), and normalized discounted cumulative gain (NDCG). For RAG systems, end-to-end evaluation also considers how retrieved documents affect the final LLM output quality.
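Two of the metrics named above, precision@k and MRR, are simple enough to sketch directly. The retrieved lists and relevance sets below are invented examples for illustration:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def mean_reciprocal_rank(all_retrieved, all_relevant):
    """Average of 1/rank of the first relevant hit, over all queries."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

retrieved = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant  = [{"d2"}, {"d4", "d6"}]
p3  = precision_at_k(retrieved[0], relevant[0], 3)   # 1 relevant in top 3
mrr = mean_reciprocal_rank(retrieved, relevant)      # (1/2 + 1/1) / 2
```

In practice these retrieval-level metrics are tracked alongside end-to-end RAG quality measures, since a retriever can score well on MRR while still feeding the LLM context that produces poor answers.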

Query Understanding: Both search approaches can benefit from query preprocessing, including query expansion, spell correction, and intent classification. For vector search, the query must be embedded using the same model used for document embeddings.

Caching and Optimization: Production systems often implement caching layers for both embeddings (to avoid recomputing embeddings for common queries) and search results (for frequently executed queries).
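An embedding cache can be as simple as memoizing the query-embedding call. In this sketch, `embed_query` is a hypothetical stand-in for a real embedding model or API call; the point is only that repeated queries skip the expensive computation:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_query(query: str) -> tuple:
    # Hypothetical embedder used purely for illustration: in production this
    # body would call an embedding model. Returning a tuple keeps the
    # result hashable and safe to cache.
    return tuple((hash((query, i)) % 1000) / 1000 for i in range(8))

v1 = embed_query("bm25 vs vector search")
v2 = embed_query("bm25 vs vector search")  # served from the cache
hits = embed_query.cache_info().hits
```

Result caching for frequent queries follows the same pattern one layer up, with the added wrinkle that cached results must be invalidated when the underlying index changes.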

Limitations of This Summary

It must be emphasized that without access to the actual content of the podcast episode, this summary is based entirely on general knowledge of the topic suggested by the URL title. The specific insights, implementation details, benchmarks, and recommendations from the original discussion are unknown. The connection to GitHub specifically cannot be established, and any claims about their infrastructure or approaches would be purely speculative.

The original episode may have covered specific case studies, performance comparisons, architectural decisions, or practical lessons learned from production deployments that would be valuable for practitioners. Readers interested in this topic should attempt to access the original content through alternative means or explore other resources on hybrid search architectures.

Conclusion

The interplay between traditional lexical search (BM25) and modern vector-based approaches represents one of the key architectural decisions in production LLM systems. While the specific content of this podcast episode is unavailable, the topic is highly relevant to LLMOps practitioners building search-augmented AI applications. The “workhorse” and “visionary cousin” framing in the title suggests a nuanced view that values both approaches rather than treating them as competitors, which aligns with current best practices in the field that favor hybrid architectures for production deployments.
