Company
Microsoft
Title
Multimodal RAG Architecture Optimization for Production
Industry
Tech
Year
2024
Summary (short)
Microsoft explored optimizing a production Retrieval-Augmented Generation (RAG) system that incorporates both text and image content to answer domain-specific queries. The team conducted extensive experiments on various aspects of the system, including prompt engineering, metadata inclusion, chunk structure, image enrichment strategies, and model selection. Key improvements came from using separate image chunks, implementing a classifier for image relevance, and pairing GPT-4V for enrichment with GPT-4o for inference. The resulting system achieved better search precision and more relevant LLM-generated responses while maintaining cost efficiency.
This case study from Microsoft details their work on optimizing a production Retrieval-Augmented Generation (RAG) system that handles both text and image content to answer domain-specific queries. The study provides deep technical insight into the experimentation and implementation process for multimodal RAG systems in production environments.

The team focused on improving the handling of multimodal content through a pattern that uses multimodal LLMs (like GPT-4V) to transform image content into detailed textual descriptions during ingestion. This allows both text content and image descriptions to be stored in the same vector database space, enabling standard RAG pipeline retrieval and inference.

The architecture consists of three main components:

* An ingestion flow that processes source documents and extracts both text and image content
* An enrichment flow that generates detailed descriptions of relevant images using multimodal LLMs
* An inference flow that retrieves relevant content and generates responses to user queries

A key aspect of the implementation was the systematic experimentation approach used to optimize various components. The team developed a comprehensive evaluation framework including both retrieval metrics (like source recall and image precision) and generative metrics (like correctness scores and citation accuracy).

Several important technical findings emerged from their experiments.

For the ingestion pipeline:

* Including document-level metadata alongside unstructured content improved source recall performance
* Storing image descriptions as separate chunks rather than inline with text improved both source document and image retrieval metrics without impacting search latency
* Implementing a classifier layer using Azure AI's Vision tag endpoint helped filter out irrelevant images (like logos), reducing processing costs while maintaining retrieval accuracy
* Including surrounding text context during image description generation showed some improvements in description quality but had limited impact on retrieval metrics

For model selection and inference:

* GPT-4V performed best for image description generation during enrichment
* GPT-4o proved optimal for inference, offering improvements in quality, speed, and cost compared to GPT-4-32k
* Using different models for different stages (GPT-4V for enrichment, GPT-4o for inference) provided the best balance of performance and cost

The implementation made extensive use of Azure services, including:

* Azure AI Search for vector storage and retrieval
* Azure AI Services for image analysis and classification
* Azure OpenAI Service for both enrichment and inference stages

Important LLMOps considerations addressed in the study include:

* Careful prompt engineering for both enrichment and inference stages
* Implementation of caching mechanisms to manage costs and latency
* Development of comprehensive evaluation metrics and testing frameworks
* Systematic experimentation methodology to validate improvements
* Consideration of production requirements like latency and cost constraints

The team emphasized the importance of thorough experimentation and validation, developing a robust ground truth dataset and evaluation framework before beginning optimization work. They also highlighted the need for ongoing monitoring and adjustment based on production feedback and evolving technology. Illustrative sketches of the enrichment, classification, ingestion, inference, and evaluation steps follow below.
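To make the enrichment flow concrete, here is a minimal sketch of turning an image into a retrievable textual description with a vision-capable Azure OpenAI deployment. The deployment name, environment variables, and prompt wording are assumptions for illustration; the case study does not publish its exact prompts.

```python
# Sketch: enrichment step that turns an image into a detailed textual
# description using a vision-capable Azure OpenAI deployment.
# Deployment name, env vars, and the prompt are illustrative assumptions.
import base64
import os

from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-15-preview",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

def describe_image(image_path: str, surrounding_text: str = "") -> str:
    """Generate a detailed description of an image, optionally using
    the surrounding document text as extra context."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4v",  # hypothetical vision-enabled deployment name
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Describe this image in enough detail that the "
                            "description can stand in for the image during "
                            "retrieval. Surrounding document text: "
                            f"{surrounding_text}"
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
        max_tokens=500,
    )
    return response.choices[0].message.content
```

Passing the surrounding document text as extra context mirrors one of the experiments the team ran: it improved description quality somewhat but had limited impact on retrieval metrics.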
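The classifier layer in front of enrichment can be sketched as a call to the Azure AI Vision tag endpoint, rejecting images whose tags suggest decorative content. The specific tags and confidence threshold below are assumptions; the study only states that obviously irrelevant images such as logos were filtered out.

```python
# Sketch: a lightweight relevance filter in front of the enrichment step,
# using the Azure AI Vision (Computer Vision v3.2) tag endpoint.
# The tag names used to reject images (e.g. "logo") are assumed; the case
# study does not publish its exact filtering rules.
import os
import requests

VISION_ENDPOINT = os.environ["AZURE_VISION_ENDPOINT"]  # e.g. https://<resource>.cognitiveservices.azure.com
VISION_KEY = os.environ["AZURE_VISION_KEY"]

SKIP_TAGS = {"logo", "font", "screenshot"}  # assumed "irrelevant" categories

def is_image_relevant(image_bytes: bytes, threshold: float = 0.8) -> bool:
    """Return False for images whose top tags suggest decorative content."""
    response = requests.post(
        f"{VISION_ENDPOINT}/vision/v3.2/tag",
        headers={
            "Ocp-Apim-Subscription-Key": VISION_KEY,
            "Content-Type": "application/octet-stream",
        },
        data=image_bytes,
        timeout=30,
    )
    response.raise_for_status()
    for tag in response.json().get("tags", []):
        if tag["name"].lower() in SKIP_TAGS and tag["confidence"] >= threshold:
            return False
    return True
```

Filtering before enrichment avoids spending multimodal LLM calls on images that would never be retrieved, which is where the reported cost reduction comes from.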
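Storing image descriptions as separate chunks, alongside document-level metadata, might look roughly like the following Azure AI Search ingestion sketch. The index name and field schema are hypothetical; only the chunk-per-image pattern and the inclusion of document-level metadata are taken from the study.

```python
# Sketch: ingesting text chunks and image-description chunks as separate
# documents in an Azure AI Search index. Index and field names are assumed.
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient  # pip install azure-search-documents

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="multimodal-rag-chunks",  # hypothetical index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

def build_chunk(chunk_id, content, embedding, source_doc, chunk_type, metadata):
    """One retrievable unit: either a text chunk or an image-description chunk."""
    return {
        "chunk_id": chunk_id,
        "content": content,             # raw text, or the image description
        "content_vector": embedding,    # embedding of `content`
        "source_document": source_doc,  # document-level metadata improves recall
        "chunk_type": chunk_type,       # "text" | "image"
        "title": metadata.get("title", ""),
        "image_path": metadata.get("image_path", ""),
    }

# Image descriptions get their own chunks rather than being folded into the
# neighbouring text chunk. The zero vectors are placeholders; a real pipeline
# would embed `content` with the same model used at query time.
docs = [
    build_chunk("doc1-t0", "Install the device as shown...", [0.0] * 1536,
                "manual.pdf", "text", {"title": "Install guide"}),
    build_chunk("doc1-i0", "Photo of the mounting bracket with two screws...",
                [0.0] * 1536, "manual.pdf", "image",
                {"title": "Install guide", "image_path": "figures/bracket.png"}),
]
search_client.upload_documents(documents=docs)
```

Keeping image descriptions as their own documents lets image content be ranked independently of the surrounding text, which is what improved both source-document and image retrieval metrics without adding search latency.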
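The inference flow can then be sketched as hybrid retrieval over both chunk types followed by generation with a GPT-4o deployment, reusing the Azure OpenAI and Azure AI Search clients from the sketches above. Deployment names, field names, and prompt wording are again assumptions.

```python
# Sketch: inference flow -- retrieve the most relevant text and image-description
# chunks, then have a GPT-4o deployment answer with citations.
# Reuses `client` (AzureOpenAI) and `search_client` from the sketches above.
from azure.search.documents.models import VectorizedQuery

def answer(query: str) -> str:
    # Embed the query with the same model used at ingestion time
    # ("text-embedding-ada-002" is a hypothetical deployment name).
    query_vector = client.embeddings.create(
        model="text-embedding-ada-002", input=query
    ).data[0].embedding

    # Hybrid retrieval over both text chunks and image-description chunks.
    results = search_client.search(
        search_text=query,
        vector_queries=[VectorizedQuery(
            vector=query_vector, k_nearest_neighbors=5, fields="content_vector"
        )],
        top=5,
    )
    context = "\n\n".join(
        f"[{r['chunk_id']}] ({r['chunk_type']}) {r['content']}" for r in results
    )

    # GPT-4o for generation: the study found it beat GPT-4-32k on quality,
    # speed, and cost at this stage.
    completion = client.chat.completions.create(
        model="gpt-4o",  # hypothetical deployment name
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context and cite chunk ids."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content
```

Using GPT-4o only at inference time reflects the study's split of responsibilities: GPT-4V stays in the offline enrichment path where its vision capability is needed, while the cheaper, faster model serves user queries.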
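The retrieval side of the evaluation framework can be approximated with simple set-based metrics such as the ones below. The exact definitions Microsoft used are not published, so these are assumed interpretations of "source recall" and "image precision" computed per query against a ground-truth dataset.

```python
# Sketch: assumed definitions of the retrieval metrics named in the study.
def source_recall(retrieved_sources: set[str], expected_sources: set[str]) -> float:
    """Fraction of ground-truth source documents that appear in the results."""
    if not expected_sources:
        return 1.0
    return len(retrieved_sources & expected_sources) / len(expected_sources)

def image_precision(retrieved_images: set[str], relevant_images: set[str]) -> float:
    """Fraction of retrieved images that are actually relevant to the query."""
    if not retrieved_images:
        return 0.0
    return len(retrieved_images & relevant_images) / len(retrieved_images)

# Per-query scores like these would be averaged over the ground-truth set.
print(source_recall({"manual.pdf"}, {"manual.pdf", "datasheet.pdf"}))  # 0.5
print(image_precision({"bracket.png", "logo.png"}, {"bracket.png"}))   # 0.5
```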
This case study provides valuable insights for organizations implementing multimodal RAG systems in production, particularly around:

* The importance of systematic experimentation and evaluation
* The benefits of separating enrichment and inference processes
* The value of using different models optimized for different tasks
* The need to balance performance improvements against cost and latency impacts
* The benefits of thorough prompt engineering and careful system configuration

The study concludes by noting that while the team found an optimal configuration for their specific use case, continued monitoring and adjustment would be necessary as user needs evolve and new technologies emerge. This highlights the dynamic nature of LLMOps and the importance of building systems that can adapt to changing requirements and capabilities.
