ZenML

Building Production-Grade Heterogeneous RAG Systems

AWS GenAIIC 2024

AWS GenAIIC shares practical insights from implementing RAG systems with heterogeneous data formats in production. The case study explores using routers for managing diverse data sources, leveraging LLMs' code generation capabilities for structured data analysis, and implementing multimodal RAG solutions that combine text and image data. The solutions include modular components for intent detection, data processing, and retrieval across different data types with examples from multiple industries.

Industry

Tech

Technologies

Overview

This case study from AWS GenAIIC (Generative AI Innovation Center) provides an in-depth technical guide on building Retrieval Augmented Generation (RAG) systems that can handle heterogeneous data formats in production environments. As Part 2 of a series, this article moves beyond text-only RAG to address the operational challenges of working with mixed data sources, including structured tables, unstructured text, and images. The content is primarily educational and prescriptive, drawing from hands-on experience with customer implementations across multiple industries.

The article describes several real-world use cases that the GenAIIC team has worked on, including technical assistance systems for field engineers, oil and gas data analysis platforms, financial data analysis systems, industrial maintenance solutions, and ecommerce product search capabilities. While these are presented as customer implementations, specific company names and quantitative results are not disclosed, which limits the ability to verify claims independently.

Production Architecture Patterns

Query Routing for Heterogeneous Data Sources

One of the key LLMOps patterns introduced is the use of a router component to direct incoming queries to appropriate processing pipelines based on the query’s nature and required data type. This is crucial in production systems where different data types require distinct retrieval and processing strategies.

The router accomplishes intent detection through an initial LLM call. The article recommends using a smaller, faster model like Anthropic’s Claude Haiku on Amazon Bedrock to minimize latency in the routing step. This is a practical consideration for production deployments where response time matters.

The router prompt template provided demonstrates several best practices for production LLM systems. In particular, the article provides actual code for parsing the router’s response using regex patterns to extract the selected data source from XML tags. This use of structured output formats is essential for reliable production systems where LLM responses need to be programmatically processed.

An important production consideration mentioned is handling ambiguous queries. The article suggests adding a “Clarifications” category as a pseudo-data source that allows the system to ask users for more information when needed, improving user experience and reducing incorrect routing decisions.

The article also mentions that an alternative to prompt-based routing is using the native tool use capability (function calling) available in the Bedrock Converse API, where each data source is defined as a tool. This provides another production-ready pattern for implementing routing logic.

LLM Code Generation for Structured Data Analysis

A significant portion of the article addresses the challenge of analyzing tabular data with LLMs. The key insight is that LLMs do not perform well at analyzing raw tabular data passed directly in prompts, but they excel at code generation. The recommended approach is to leverage LLM code generation capabilities to write Python or SQL code that performs the required analysis.

The article notes that LLMs like Anthropic’s Claude 3.5 Sonnet achieve 92% accuracy on the HumanEval code benchmark, making code generation a reliable approach for production systems. The workflow involves prompting the LLM to generate analysis code, executing that code against the data, and optionally making a follow-up call to turn the result into a natural language answer.

The article mentions that while libraries like LlamaIndex and LangChain offer out-of-the-box text-to-SQL and text-to-Pandas pipelines for quick prototyping, writing custom pipelines provides better control over prompts, code execution, and outputs in production systems.

Several production considerations are highlighted.

The complete code example demonstrates the full pipeline using boto3 to call Amazon Bedrock, including parsing code from XML tags, executing with Python’s exec function, and optionally making a follow-up LLM call for natural language responses.
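The parse-and-execute pattern can be sketched as below; the `<code>` tag convention and the canned model reply are assumptions, and the bare `exec` stands in for whatever sandboxed execution a production system should use instead:

```python
import io
import re
from contextlib import redirect_stdout

# Hypothetical model reply: the prompt asks for analysis code wrapped
# in <code> tags, mirroring the parse-then-execute pattern above.
LLM_REPLY = """Here is the analysis:
<code>
total = sum(row["revenue"] for row in rows)
print(f"Total revenue: {total}")
</code>"""

def extract_code(reply: str) -> str:
    """Pull the generated Python out of the XML-tagged model reply."""
    match = re.search(r"<code>(.*?)</code>", reply, re.DOTALL)
    if match is None:
        raise ValueError("no <code> block in model reply")
    return match.group(1).strip()

def run_generated_code(code: str, context: dict) -> str:
    """Execute generated code against the data and capture its stdout.
    NOTE: bare exec() is unsafe for untrusted code; production systems
    should sandbox this step."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        exec(code, {}, dict(context))
    return buffer.getvalue().strip()

rows = [{"revenue": 120}, {"revenue": 80}]
result = run_generated_code(extract_code(LLM_REPLY), {"rows": rows})
print(result)  # Total revenue: 200
```

The captured string can then be fed back to the LLM in a follow-up call to produce the natural language response the article describes.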

Multimodal RAG Implementation

The article provides detailed guidance on implementing multimodal RAG systems that handle both text and image data. Three categories of multimodal queries are identified.

Two main approaches are presented for building multimodal retrieval systems:

Approach 1: Multimodal Embedding Models

Using models like Amazon Titan Multimodal Embeddings, both images and text can be embedded into a shared vector space for direct comparison. This approach is simpler and works well for finding images that match high-level descriptions or finding visually similar items. The article provides complete code examples for both ingestion (embedding images and storing in OpenSearch with k-NN vector fields) and retrieval (embedding queries and performing k-NN search).
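The retrieval side of this approach can be sketched as below; the index field name and the stubbed embedding call are assumptions, with the real system obtaining the vector from Titan Multimodal Embeddings via Bedrock:

```python
# k-NN retrieval sketch, assuming image vectors were ingested into an
# OpenSearch index with a knn_vector field named "image_embedding".
def embed_query(text: str) -> list[float]:
    # Placeholder: in the article's stack this would call bedrock-runtime
    # with a Titan Multimodal Embeddings model and the query text.
    return [0.1, 0.2, 0.3]

def build_knn_query(query_vector: list[float], k: int = 5) -> dict:
    """Build an OpenSearch k-NN query body over the image embedding field."""
    return {
        "size": k,
        "query": {"knn": {"image_embedding": {"vector": query_vector, "k": k}}},
    }

body = build_knn_query(embed_query("red floral summer dress"), k=3)
# hits = opensearch_client.search(index="products", body=body)
```

Because images and text share one vector space here, the same query body works whether the query vector came from a text description or from an example image.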

Approach 2: Multimodal LLM for Captioning

This approach uses multimodal foundation models like Anthropic’s Claude to generate detailed captions for images, which are then embedded using text embedding models like Amazon Titan Text Embeddings. The captions and embeddings are stored alongside the original images. This approach provides more detailed and customizable results, as captions can be guided to focus on specific aspects like color, fabric, pattern, or shape.
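A sketch of the captioning ingestion path, with both model calls stubbed out; the prompt wording, attribute list, and document schema are illustrative rather than the article’s exact code:

```python
# Captioning-based ingestion: a multimodal model writes a guided caption,
# which is then embedded with a text embedding model and stored alongside
# the original image.
CAPTION_PROMPT = (
    "Describe this product image in detail, focusing on "
    "color, fabric, pattern, and shape. Reply with the caption only."
)

def caption_image(image_bytes: bytes, prompt: str = CAPTION_PROMPT) -> str:
    # Placeholder for a multimodal Claude call via Bedrock.
    return "A red cotton dress with a white floral pattern and A-line shape."

def embed_text(text: str) -> list[float]:
    # Placeholder for Amazon Titan Text Embeddings; returns a dummy vector.
    return [float(len(word)) for word in text.split()]

def ingest(image_bytes: bytes, image_id: str) -> dict:
    """Build the document stored alongside the original image."""
    caption = caption_image(image_bytes)
    return {"image_id": image_id,
            "caption": caption,
            "embedding": embed_text(caption)}

doc = ingest(b"...", "sku-123")
```

Steering the caption prompt toward specific attributes is what makes this approach more customizable than shared-space embeddings: retrieval quality can be tuned by changing the prompt rather than the embedding model.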

The article provides a comparison table evaluating both approaches across key production factors.

Technology Stack

The article describes a production stack built on AWS services, including Amazon Bedrock for model access and inference, Amazon Titan models for text and multimodal embeddings, and Amazon OpenSearch Service for vector search.

Critical Assessment

While this article provides valuable technical guidance and code examples, it is important to note several limitations from an LLMOps perspective, most notably that customer names and quantitative results are not disclosed, which makes it difficult to verify the claimed outcomes independently.

Despite these limitations, the article provides practical, production-ready code examples and architectural patterns that would be valuable for teams building RAG systems with heterogeneous data sources. The modular approach of breaking down the problem into routing, retrieval, and generation components reflects good software engineering practices for production LLM systems.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.


Scaling AI Product Development with Rigorous Evaluation and Observability

Notion 2025

Notion AI, serving over 100 million users with multiple AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Braintrust as their evaluation platform to manage the complexity of supporting multilingual workspaces, rapid model switching, and maintaining product polish while building at the speed of AI industry innovation. Their approach emphasizes that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.


Domain-Specific AI Platform for Manufacturing and Supply Chain Optimization

Articul8 2025

Articul8 developed a generative AI platform to address enterprise challenges in manufacturing and supply chain management, particularly for a European automotive manufacturer. The platform combines public AI models with domain-specific intelligence and proprietary data to create a comprehensive knowledge graph from vast amounts of unstructured data. The solution reduced incident response time from 90 seconds to 30 seconds (3x improvement) and enabled automated root cause analysis for manufacturing defects, helping experts disseminate daily incidents and optimize production processes that previously required manual analysis by experienced engineers.
