Microsoft developed a real-time question-answering system for their MSX Sales Copilot to help sellers quickly find and share relevant sales content from their Seismic repository. The solution uses a two-stage architecture combining bi-encoder retrieval with cross-encoder re-ranking, operating on document metadata since direct content access wasn't available. The system was successfully deployed in production under strict latency requirements (a response time of a few seconds) and received positive feedback from sellers, with relevancy ratings of 3.7/5.
This case study describes Microsoft’s internal deployment of a content recommendation system integrated into their MSX Sales Copilot, which is a customized version of Dynamics 365 CRM used daily by Microsoft sellers. The system was launched in July 2023 as one of the first machine learning-based “skills” in the MSX Copilot platform. The fundamental problem being solved is enabling sellers to quickly find relevant technical documentation from the Seismic content repository during live customer calls or meeting preparation, replacing a previously sub-optimal filter-based search system hosted externally.
The case study is notable for its pragmatic approach to a common real-world constraint: the team did not have programmatic access to the actual content of documents, only their metadata. This forced an innovative solution using metadata prompt engineering to create embeddings that could be matched against seller queries.
The system employs a two-stage architecture that balances retrieval speed with ranking accuracy:
The first stage uses a DistilBERT model pre-trained on the MS MARCO dataset to generate embeddings. This architecture processes the query and document prompts independently through the same encoder, then computes cosine similarity between them. The key advantage is that document embeddings can be pre-computed offline and cached, making real-time inference fast. The bi-encoder retrieves the top-100 candidate documents based on cosine similarity with the query embedding.
The limitation of this approach is that it neglects cross-interaction between query and document, which can miss nuanced relevance signals. However, the speed benefit is significant for the initial filtering stage.
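A minimal sketch of this first stage using the Sentence Transformers library, assuming the public msmarco-distilbert-base-v4 checkpoint and toy document prompts (the case study does not disclose the exact model variant or prompt text):

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative MS MARCO DistilBERT bi-encoder checkpoint.
bi_encoder = SentenceTransformer("msmarco-distilbert-base-v4")

# Offline: embed every metadata-derived document prompt once and cache it.
doc_prompts = [
    "This document's content type is solution play deck, format is PDF, number of views is high.",
    "This document's content type is datasheet, format is PPTX, number of views is low.",
]
doc_embeddings = bi_encoder.encode(doc_prompts, convert_to_tensor=True)

# Online: embed the seller's query and retrieve the top-K candidates by
# cosine similarity (top_k=100 in production, per the case study).
query_embedding = bi_encoder.encode(
    "security overview deck in PDF format", convert_to_tensor=True
)
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=100)[0]
```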
The second stage uses an MS MARCO MiniLM cross-encoder model that processes query-document pairs together, enabling attention mechanisms between them. This produces more accurate relevance scores at the cost of higher latency. The cross-encoder cannot operate offline since the query is not known in advance.
The team experimented with different values for K (the number of candidates passed to the re-ranker) and found K=100 to be optimal, balancing latency against recall. Too few candidates (K≲20) provided insufficient diversity, while too many (K≳200) created unacceptable latency.
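The second stage could then re-rank those candidates along these lines, again assuming the public cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint, since the exact MiniLM variant is not named:

```python
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "security overview deck in PDF format"
# In production these would be the K=100 document prompts returned by
# the retriever; two toy candidates keep the sketch self-contained.
candidates = [
    "This document's content type is solution play deck, format is PDF, number of views is high.",
    "This document's content type is datasheet, format is PPTX, number of views is low.",
]

# Score each (query, document) pair jointly; batch_size is the
# inference-time latency knob discussed below.
scores = cross_encoder.predict([(query, c) for c in candidates], batch_size=2)

# Return the five highest-scoring documents to the seller.
ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
top5 = [doc for _, doc in ranked[:5]]
```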
Since document content was unavailable, the team developed an elaborate prompt engineering approach to transform document metadata into natural language descriptions suitable for embedding. Each document is characterized by approximately 20 metadata features (both categorical and numerical).
For categorical features, the feature name and value are concatenated directly. For numerical features, the team introduced a mapping function that converts raw numbers into categorical labels (high, medium, low, zero) based on percentile thresholds (85th and 65th percentiles). This transformation allows numerical engagement metrics like “number of views” to be incorporated into the textual prompts.
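A sketch of that mapping function, with hypothetical names, using the percentile thresholds described above:

```python
import numpy as np

def bucketize(values: list[float]) -> list[str]:
    """Map raw engagement counts to categorical labels using the
    85th and 65th percentile thresholds from the case study."""
    p85, p65 = np.percentile(values, [85, 65])
    labels = []
    for v in values:
        if v == 0:
            labels.append("zero")
        elif v >= p85:
            labels.append("high")
        elif v >= p65:
            labels.append("medium")
        else:
            labels.append("low")
    return labels

views = [0, 12, 340, 55, 9800, 77]
print(bucketize(views))  # ['zero', 'low', 'medium', 'low', 'high', 'low']
```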
The final prompt for each document is created by concatenating all feature-value pairs into a single natural-language sentence. This approach effectively recasts the problem as asymmetric semantic search, where short queries are matched against longer document descriptions.
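The exact template is not published; a hypothetical illustration of the concatenation might be:

```python
def build_prompt(features: dict[str, str]) -> str:
    """Concatenate feature name/value pairs into one sentence for embedding.
    Template and feature names are illustrative, not the production prompt."""
    parts = [f"{name} is {value}" for name, value in features.items()]
    return "This document's " + ", ".join(parts) + "."

doc_features = {
    "content type": "solution play deck",  # categorical value, used verbatim
    "format": "PDF",
    "number of views": "high",             # numerical value, after bucketing
}
print(build_prompt(doc_features))
# This document's content type is solution play deck, format is PDF,
# number of views is high.
```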
The model is deployed as an Azure Machine Learning (AML) endpoint for real-time inference. Integration with the MSX Copilot uses Microsoft’s Semantic Kernel SDK, which provides a planner to route queries to appropriate skills based on context. When the planner determines that a content recommendation is needed, it invokes the AML endpoint, which processes the query through both stages and returns the top-5 most relevant documents.
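From the client side, invoking an AML online endpoint is a simple authenticated HTTP call; the URI, key, and payload schema below are placeholders rather than the production contract:

```python
import requests

scoring_uri = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"
api_key = "<endpoint-key>"

response = requests.post(
    scoring_uri,
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    },
    json={"query": "decks about Azure security for financial services"},
    timeout=10,
)
top5_docs = response.json()  # expected: the five highest-ranked documents
```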
The system operates on a weekly refresh cycle to accommodate changes in document metadata from the Seismic catalog.
The team conducted extensive latency testing across three Azure VM configurations, spanning CPU-only and GPU-enabled machines.
The cross-encoder batch size parameter proved critical for latency optimization. Testing showed that batch sizes of 2 or 4 provided the best latency characteristics. With a batch size of 2, the GPU-enabled VM achieved a median latency of 0.54 seconds, while the DS4 v2 CPU machine achieved 1.49 seconds. As expected, GPU inference was significantly faster than inference on the CPU-only machines.
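The team's benchmarking harness is not public, but a batch-size sweep of this kind can be reproduced in a few lines:

```python
import statistics
import time

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [("sample query", f"document prompt {i}") for i in range(100)]  # K=100

for b in (1, 2, 4, 8, 16):
    times = []
    for _ in range(10):
        start = time.perf_counter()
        model.predict(pairs, batch_size=b)
        times.append(time.perf_counter() - start)
    print(f"batch_size={b}: median {statistics.median(times):.2f}s")
```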
The entire end-to-end process (including Semantic Kernel planner overhead) takes “a few seconds,” with the majority of time spent by the planner rather than the model inference itself.
The team faced a common challenge in recommendation systems: lack of labeled ground truth data for metrics like NDCG. They addressed this through human expert evaluation and ablation studies.
Four human domain experts rated the top-5 results for 31 evaluation queries on a 0-5 scale (where 5 means all results are relevant). Results showed an average relevancy rating of 3.7 out of 5.
Annotators noted that “not relevant” queries tended to be too verbose or generic. Individual annotator averages ranged from 2.74 to 3.26, indicating consistent evaluation without significant bias.
The team validated the two-stage architecture against a single-stage retriever-only baseline. Human experts confirmed that 90% of queries received more relevant recommendations using the full architecture. A specific example showed the value of re-ranking: when a query requested PDF-format documents, the retriever returned only 2 PDFs in its top-5, while the cross-encoder successfully re-ranked candidates to return all 5 documents in the correct format.
The ablation study also validated the inclusion of numerical features through the categorical mapping function, with annotators confirming that queries asking for documents with “high” engagement metrics were better served by the full system.
Initial production feedback after two months was encouraging: sellers reported the new system as a “tremendous improvement” over the previous filter-based external search, particularly valuing the integration directly into the MSX interface rather than requiring navigation to the Seismic website.
The case study acknowledges several limitations, most notably the lack of programmatic access to document content (forcing reliance on metadata alone) and the absence of labeled ground-truth relevance data, and it outlines plans for future improvements to the system.
This case study demonstrates several practical LLMOps patterns:
The use of pre-trained models (DistilBERT, MiniLM) from the Sentence Transformers ecosystem shows how publicly available models can be deployed effectively without fine-tuning. The team leveraged models specifically trained on MS MARCO for asymmetric question-answering scenarios.
The offline pre-computation of document embeddings is a critical optimization pattern for production systems, reducing inference-time compute to only query embedding and similarity computation for the first stage.
The batch size tuning for the cross-encoder highlights the importance of inference-time hyperparameter optimization, which differs from traditional ML training-focused tuning.
The weekly refresh cycle for document metadata shows a pragmatic approach to keeping the system current without requiring real-time document updates.
Finally, the integration via Semantic Kernel demonstrates how ML services can be composed into larger AI applications through skill-based architectures, where a planner routes queries to appropriate specialized systems.