Glean builds enterprise search and RAG systems around custom embedding models developed for each customer. They tackle the challenge of heterogeneous enterprise data with a unified data model, and adapt embedding models through continued pre-training and synthetic data generation. Their approach combines traditional search techniques with semantic search; continuous learning from user feedback and company-specific language adaptation yielded roughly a 20% improvement in search quality over six months.
This case study from Glean provides a comprehensive look at how they implement and operate large language models in production for enterprise search and RAG applications. The company specializes in creating unified enterprise search solutions that integrate data from various sources like Google Drive, GitHub, Jira, and Confluence, making it accessible through their Glean Assistant and search platform.
The core technical approach centers on building custom embedding models for each customer, recognizing that every enterprise has its own language and terminology. Here's how they implement this in production:
**Data Architecture and Preprocessing**
They emphasize the importance of a unified data model that can handle heterogeneous enterprise data. This isn't just about document content - it includes handling different types of content like Slack messages, meetings, pull requests, and various document formats. The unified model helps standardize how they process and index content across different applications while maintaining security and access controls.
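To make the unified data model concrete, here is a minimal sketch of what such a shared schema might look like. The class name `UnifiedDocument`, the field names, and the `normalize_slack_message` helper are all hypothetical illustrations, not Glean's actual schema; the point is that every source-specific payload is mapped into one record type that carries its access-control information with it.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class UnifiedDocument:
    """Hypothetical unified record for heterogeneous enterprise content."""
    doc_id: str
    source: str              # e.g. "gdrive", "slack", "jira", "github_pr"
    title: str
    body: str
    updated_at: datetime
    allowed_principals: list = field(default_factory=list)  # ACLs travel with the document
    references: list = field(default_factory=list)          # links to other docs (anchor data)

def normalize_slack_message(msg: dict) -> UnifiedDocument:
    """Map one source-specific payload into the shared schema."""
    return UnifiedDocument(
        doc_id=f"slack:{msg['ts']}",
        source="slack",
        title=msg.get("channel", ""),
        body=msg["text"],
        updated_at=datetime.fromtimestamp(float(msg["ts"])),
        allowed_principals=list(msg.get("members", [])),
    )
```

One normalizer per connector keeps indexing, ranking, and permission filtering source-agnostic downstream.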
**Embedding Model Development Process**
Their approach to building custom embedding models follows several stages:
* Base Model Selection: They start with BERT-based architectures as their foundation
* Continued Pre-training: They use masked language modeling (MLM) to adapt the base model to company-specific language
* Custom Training Data Generation: They create training pairs through multiple methods:
- Title-body pairs from documents
- Anchor data from document references
- Co-access patterns from user behavior
- Synthetic data generation using LLMs
- Public datasets like MS MARCO
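The continued pre-training step above follows the standard masked language modeling recipe. The sketch below shows MLM example creation in simplified form (it masks tokens uniformly and skips BERT's 80/10/10 replacement split); `mask_tokens` is an illustrative helper, not Glean's code — in practice a library collator would do this over the customer's corpus.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Simplified MLM example creation: randomly mask tokens and keep the
    originals as prediction targets. Training the model to recover masked
    company-specific terms adapts it to the customer's vocabulary."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)     # model must predict this token
        else:
            masked.append(tok)
            labels.append(None)    # not predicted at this position
    return masked, labels
```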
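Two of the training-pair sources listed above can be sketched in a few lines. Both functions are hypothetical illustrations under stated assumptions: title–body pairs assume each document exposes `title` and `body` fields, and co-access pairs assume a log of per-session document-open events, with a minimum co-occurrence count to filter out noise in the weak signal.

```python
import itertools
from collections import Counter

def title_body_pairs(docs):
    """Positive pairs: a document's title should retrieve its body."""
    return [(d["title"], d["body"]) for d in docs if d.get("title") and d.get("body")]

def co_access_pairs(sessions, min_together=2):
    """Positive pairs from user behavior: documents opened in the same
    session are weak evidence of relatedness, so require the pair to
    co-occur in at least `min_together` sessions."""
    counts = Counter()
    for session in sessions:
        for a, b in itertools.combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return [pair for pair, n in counts.items() if n >= min_together]
```

Pairs like these feed a contrastive objective, with public datasets such as MS MARCO used to supplement sparse in-domain data.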
For production deployment, they handle access control through their OpenSearch-based search engine, which maintains ACL information at the user level. This lets them filter search results by permission at query time while still training models on broader datasets.
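Permission-aware retrieval of this kind is typically expressed as a filtered query in OpenSearch's DSL: relevance scoring runs over the content, while a `terms` filter restricts hits to documents whose ACL list intersects the user's principals. The field names below (`body`, `allowed_principals`) are assumptions for illustration, not Glean's actual index mapping.

```python
def acl_filtered_query(query_text: str, user_principals: list) -> dict:
    """Build an OpenSearch query body that scores on content relevance
    but hard-filters results to documents the user may see."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"body": query_text}},            # scored relevance clause
                "filter": {"terms": {"allowed_principals": user_principals}},  # unscored ACL gate
            }
        }
    }
```

Because the filter runs inside the engine, unauthorized documents never reach ranking or the LLM context window.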
**Evaluation and Monitoring**
Glean implements multiple evaluation approaches:
* Business metrics like session satisfaction and click-through rates
* Automated quality evaluations for different query patterns
* Unit tests for specific model behaviors (like paraphrase understanding)
* LLM-based judges for quality assessment (while noting their limitations)
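A behavioral unit test like the paraphrase check above can be framed as a margin assertion over embedding similarities: a paraphrase pair must score meaningfully higher than an unrelated distractor. The sketch below is a generic illustration (the `embed` callable, phrases, and margin are assumed), not Glean's test harness.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def paraphrase_test(embed, text_a, text_b, distractor, margin=0.1):
    """Unit test for one model behavior: the paraphrase pair (a, b) must
    beat the (a, distractor) similarity by at least `margin`."""
    pos = cosine(embed(text_a), embed(text_b))
    neg = cosine(embed(text_a), embed(distractor))
    return pos - neg >= margin
```

Suites of such assertions catch regressions that aggregate business metrics would only surface weeks later.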
They run model updates on a monthly basis, finding that this frequency balances freshness with stability. The system shows approximately a 20% improvement in search quality over six months of continuous learning.
**Production Architecture Considerations**
The system combines traditional search techniques with semantic search, recognizing that about 60-70% of enterprise queries can be handled effectively with lexical search and basic signals. Their ranking system incorporates multiple signals:
* Semantic relevance through embeddings
* Recency and freshness signals
* Document authority scores based on usage patterns
* Traditional learning-to-rank features
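The signals above have to be combined into a single ranking score. A minimal sketch is a weighted blend with exponential freshness decay; the weights and the 30-day half-life here are hypothetical hand-set values, whereas a production learning-to-rank system would learn this combination from click data.

```python
import math

def rank_score(semantic_sim, age_days, authority,
               weights=(0.6, 0.2, 0.2), half_life_days=30.0):
    """Blend semantic relevance, freshness, and authority into one score.
    Freshness halves every `half_life_days`, so stale documents fade
    rather than disappear."""
    freshness = math.exp(-age_days * math.log(2) / half_life_days)
    w_sem, w_fresh, w_auth = weights
    return w_sem * semantic_sim + w_fresh * freshness + w_auth * authority
```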
**Scalability and Performance**
For processing large volumes of documents, they use distributed data processing frameworks like Apache Beam and Spark. They've found that smaller, fine-tuned embedding models can often outperform larger general-purpose models on specific enterprise corpora, yielding better quality at lower latency and cost.
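Inside a Beam or Spark job, the per-worker unit for embedding is typically a fixed-size batch of documents rather than single records, so the model is called once per batch. The batching helper below sketches that unit in plain Python; the batch size and the idea of pairing it with a hypothetical `embed_batch` model call are assumptions for illustration.

```python
def batch_documents(docs, batch_size=32):
    """Yield fixed-size batches of documents, the per-worker unit a
    distributed embedding job would fan out across machines."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch
```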
**Security and Privacy**
Security is handled through multiple layers:
* Customer-specific models ensure data isolation
* ACL-aware search indexing
* Privacy-conscious training processes that avoid exposure to sensitive documents
* Unified data model that maintains security constraints
**Continuous Improvement Process**
The system learns and improves through:
* Monthly model updates
* User feedback incorporation
* Click data analysis
* Query pattern analysis
* Authority scoring based on document usage
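The authority signal listed above can be sketched as usage-derived scoring: documents that are opened more often earn higher authority, with smoothing so rarely-seen documents aren't zeroed out. The function and its additive-smoothing scheme are an assumed illustration, not Glean's actual formula.

```python
from collections import Counter

def authority_scores(click_log, smoothing=1.0):
    """Hypothetical authority signal: click counts per document,
    normalized to (0, 1] with additive smoothing."""
    counts = Counter(click_log)
    max_count = max(counts.values())
    return {doc: (n + smoothing) / (max_count + smoothing)
            for doc, n in counts.items()}
```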
**Challenges and Solutions**
Some key challenges they've addressed include:
* Handling heterogeneous data types (solved through unified data model)
* Managing company-specific language (addressed through continued pre-training)
* Dealing with sparse user feedback (combined multiple weak signals)
* Balancing freshness with stability in model updates
* Scaling to hundreds of customers (solved through standardized but customizable processes)
The case study demonstrates a sophisticated approach to implementing LLMs in production, showing how combining traditional IR techniques with modern LLM capabilities can create effective enterprise search solutions. Their success comes from careful attention to the specific challenges of enterprise data, strong evaluation practices, and a pragmatic approach to combining different search technologies.