Company
Microsoft
Title
Real-time Question-Answering System with Two-Stage LLM Architecture for Sales Content Recommendations
Industry
Tech
Year
2024
Summary (short)
Microsoft developed a real-time question-answering system for their MSX Sales Copilot to help sellers quickly find and share relevant sales content from their Seismic repository. The solution uses a two-stage architecture combining bi-encoder retrieval with cross-encoder re-ranking, operating on document metadata since direct content access wasn't available. The system was successfully deployed in production under strict latency requirements (a few seconds' response time) and received positive feedback from sellers, with a relevancy rating of 3.7/5.
# Microsoft Sales Content Recommendation System Case Study

## Overview

Microsoft developed and deployed a real-time question-answering system for their MSX Sales Copilot platform to help sellers efficiently find relevant sales content. The system was integrated as the first machine learning-based skill in the production MSX Copilot, focusing on content recommendations from the Seismic repository.

## Technical Architecture

### Data and Preprocessing

- System operates on document metadata rather than actual content due to access limitations
- Approximately 20 features used per document, spanning both categorical and numerical metadata
- Prompt-engineering approach converts metadata into text form
- Custom function maps numerical features into categorical buckets based on percentile distributions

### Model Architecture

- Two-stage architecture combining retrieval and re-ranking
- First stage: bi-encoder retrieval
- Second stage: cross-encoder re-ranking
- Optimal K value of 100 retrieved documents found through experimentation

### Production Infrastructure

- Deployed on Azure Machine Learning endpoints
- Integrated with MSX Copilot via the Semantic Kernel framework
- Weekly model refreshes to accommodate document metadata updates
- Planner component in Semantic Kernel handles skill routing

### Performance Optimization

- Extensive latency testing across different VM configurations
- Batch size optimization for cross-encoder inference: batch sizes of 2-4 found optimal for balancing latency and performance
- System achieves few-second response times in production

## Evaluation and Testing

### Automated Evaluation

- 31 evaluation queries created with domain experts
- Comprehensive latency testing across different machine configurations
- Ablation studies on architecture components and feature engineering

### Human Evaluation

- Four independent human annotators
- Relevance scoring on a 0-5 scale for recommendations
- Result: 90% of queries showed improvement with the two-stage architecture versus retriever-only

### Production Metrics

- User satisfaction surveys conducted after launch
- Scored 4/5 for relevance to sellers' daily tasks
- 3.7/5 rating for recommendation relevancy
- Positive feedback on integration with the MSX interface

## LLMOps Best Practices

### Monitoring and Maintenance

- Weekly model updates to handle content changes
- Regular feedback collection from sellers
- Continuous improvement based on user feedback

### Quality Controls

- Business rules integrated into query processing
- Strict latency requirements enforced
- Regular human evaluation of model outputs

### Infrastructure Design

- Modular architecture allowing independent optimization
- Efficient caching of document embeddings
- Scalable deployment using Azure ML endpoints

## Challenges and Solutions

### Technical Challenges

- Limited access to actual document content
- Real-time latency requirements
- Multi-format document handling

### Production Challenges

- Integration with existing systems
- Latency optimization
- Regular content updates

## Future Improvements

### Planned Enhancements

- Integration of actual document content when it becomes available
- Customization based on sales opportunities
- Incorporation of seller-specific features
- Collection of user feedback for supervised learning
- Implementation of nDCG metrics

### Scalability Considerations

- Context length limitations with current models
- Feature quantity vs. quality trade-offs
- Processing time vs. accuracy balance

## Deployment Strategy

### Integration Process

- Semantic Kernel integration
- Azure ML endpoint deployment
- Weekly refresh cycles
- Planner-based skill routing

### Monitoring and Feedback

- Regular user satisfaction surveys
- Performance metric tracking
- Continuous feedback collection from sellers
- Iterative improvement based on usage patterns
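The metadata-to-text preprocessing step can be sketched as follows. This is a minimal illustration rather than Microsoft's implementation: the field names (`title`, `format`, `product`, `view_count`), the quartile bucket labels, and the sentence template are all assumptions. The case study only states that roughly 20 categorical and numerical metadata features are rendered as text, with numerical features mapped into categorical buckets by percentile.

```python
import statistics

def percentile_bucket(value, all_values, labels=("low", "medium", "high", "very high")):
    """Map a numerical feature into a categorical bucket based on where it
    falls in the feature's distribution (quartiles here, as an example)."""
    cutoffs = statistics.quantiles(all_values, n=len(labels))  # 3 cut points for 4 labels
    for cutoff, label in zip(cutoffs, labels):
        if value <= cutoff:
            return label
    return labels[-1]

def metadata_to_text(doc, view_counts):
    """Render categorical and bucketed numerical metadata as a short
    natural-language passage that a text encoder can embed.
    `doc` fields and phrasing are hypothetical."""
    popularity = percentile_bucket(doc["view_count"], view_counts)
    return (
        f"This sales asset titled '{doc['title']}' is a {doc['format']} "
        f"for the {doc['product']} product line. "
        f"Its popularity among sellers is {popularity}."
    )
```

The resulting passage, not the raw metadata record, is what would be embedded by the bi-encoder, which is why bucketing matters: raw numbers like `view_count=1843` carry little signal for a text encoder, while "very high popularity" does.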

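The two-stage retrieve-then-rerank flow described above can be sketched as below. This is a hedged sketch, not the production code: `embed` and `cross_score` are toy stand-ins (bag-of-words cosine similarity and token overlap) for the actual bi-encoder and cross-encoder models, which the case study does not name. What the sketch does mirror from the case study is the structure: cheap top-K retrieval over precomputable document embeddings (K=100 in production), followed by batched re-ranking of only those candidates (batch sizes of 2-4).

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding standing in for the bi-encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cross_score(query, doc):
    """Toy joint scorer standing in for a cross-encoder that attends
    over the concatenated (query, document) pair."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)

def retrieve_then_rerank(query, docs, k=100, batch_size=4, top_n=5):
    """Stage 1: bi-encoder retrieval of the top-k candidates.
    Stage 2: cross-encoder re-ranking of only those k candidates,
    scored in small batches to control per-request latency."""
    q_emb = embed(query)
    doc_embs = [embed(d) for d in docs]  # cached/refreshed weekly in production
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(q_emb, doc_embs[i]), reverse=True)
    candidates = ranked[:k]
    reranked = []
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        reranked.extend((cross_score(query, docs[i]), i) for i in batch)
    reranked.sort(reverse=True)
    return [docs[i] for _, i in reranked[:top_n]]
```

The split is the standard latency/quality trade-off: the bi-encoder scores every document against precomputed embeddings in effectively constant time per document, while the more accurate but more expensive cross-encoder runs only on the K survivors.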