
Practical Challenges in Building Production RAG Systems

Prolego

A detailed technical discussion between Prolego engineers about the practical challenges of implementing Retrieval Augmented Generation (RAG) systems in production. The conversation covers key challenges including document processing, chunking strategies, embedding techniques, and evaluation methods. The team shares real-world experiences about how RAG implementations differ from tutorial examples, particularly in handling complex document structures and different data formats.

Industry: Tech

Overview

This case study is derived from a panel discussion at Prolego, a consulting firm specializing in generative AI solutions, featuring engineers Kevin DeWalt, Justin, Ben, and Cam who share their practical experiences building RAG (Retrieval Augmented Generation) applications for enterprise clients. The discussion serves as a valuable reality check against the proliferation of simplified RAG tutorials, highlighting the significant gap between demo projects and production-ready implementations.

The conversation opens with a practical demonstration of RAG using ChatGPT, where the host shows how asking a simple question like “Can I bring my small dog Charlie to work?” yields generic responses without context, but becomes useful when augmented with actual company policy documents. This illustrates the core value proposition of RAG: customizing general-purpose LLMs like GPT-4 with organization-specific data without requiring model fine-tuning.
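
The core move in that demo can be sketched in a few lines: retrieved organization-specific text is simply spliced into the prompt ahead of the user's question. The policy snippet and prompt template below are invented for illustration:

```python
def build_rag_prompt(question: str, context_chunks: list[str]) -> str:
    """Splice retrieved context into the prompt sent to the generative LLM."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical chunk retrieved from a company policy document
policy = "Pets policy: small dogs under 25 lbs are allowed in the office on Fridays."
prompt = build_rag_prompt("Can I bring my small dog Charlie to work?", [policy])
```

Without the context block the model can only answer generically; with it, the same general-purpose LLM answers from the organization's own policy.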

The Gap Between Tutorials and Production Reality

One of the central themes of the discussion is how dramatically real-world RAG implementations differ from the simplified examples found in online tutorials. The engineers emphasize that most tutorials work with clean, well-structured text documents and carefully crafted queries that conveniently demonstrate the happy path. In contrast, production environments present messy, heterogeneous data sources including PDFs, Word documents, SharePoint sites, PowerPoint presentations, wikis, and emails.

The team notes that this is “yet another example of why we won’t be replacing data scientists and programmers with AI anytime soon,” acknowledging that the fundamental data engineering challenges remain core to successful RAG deployments.

Document Parsing and Chunking Challenges

Perhaps the most surprising insight from the discussion is that the most challenging aspect of RAG is often the step that appears simplest in architecture diagrams: converting source documents into meaningful chunks. Justin explains that documents like policies, procedures, and regulations have inherent hierarchical structure with sections and subsections, and naive chunking approaches can lose critical context.

The team provides a concrete example: a policy might state “this rule always applies except in the following conditions” followed by a list of exceptions. If you chunk sentence by sentence or even paragraph by paragraph, you might capture only the exception without the context that it’s an exception to a particular rule. Another example from life insurance applications involves statements like “patient does not have cancer, diabetes, heart disease” where missing the “does not have” qualifier would be catastrophic.
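
This failure mode is easy to reproduce. Split a rule and its exceptions into separate chunks and the exception chunk no longer says what it is an exception to. One common mitigation, sketched below with invented policy text, is to carry the parent rule (or section heading) into each child chunk:

```python
rule = "Rule 4.2: Employees must badge in at the main entrance."
exceptions = [
    "This rule does not apply during fire drills.",
    "This rule does not apply to emergency responders.",
]

# Naive chunking: each exception becomes an orphaned chunk with no
# indication of which rule it modifies.
naive_chunks = [rule] + exceptions

# Context-preserving chunking: prepend the parent rule to each exception
# so every chunk is self-contained when retrieved in isolation.
contextual_chunks = [rule] + [f"{rule} Exception: {e}" for e in exceptions]
```

The same idea guards against losing qualifiers like “does not have”: the chunk boundary must never separate a negation or condition from the statement it modifies.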

In terms of practical workflow, Ben describes spending “a couple of days” getting document parsing libraries operational, writing heuristics to capture different document elements, sorting elements by position on the page, and using font or color information to identify headings and structure. The goal is to create a flat representation suitable for machine learning operations while preserving hierarchical relationships through keys or metadata.

The engineers recommend using a flat dictionary or list structure where each element’s key encodes the document hierarchy (e.g., filename + chapter + subsection + paragraph), allowing both efficient batch processing and the ability to map back to original document structure when needed.
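
A minimal sketch of that flat representation (the key scheme here is invented): each chunk's key encodes its path through the document, so the flat structure can be batched through an embedding model while still mapping back to the original hierarchy:

```python
def flatten(doc: dict, path: str = "") -> dict[str, str]:
    """Flatten a nested document tree into {hierarchical_key: text}."""
    flat = {}
    for name, node in doc.items():
        key = f"{path}/{name}" if path else name
        if isinstance(node, dict):
            flat.update(flatten(node, key))
        else:
            flat[key] = node
    return flat

# Toy parsed document: filename -> chapter -> subsection -> text
doc = {
    "policy.pdf": {
        "ch1": {"s1": "All employees must badge in.", "s2": "Visitors sign the log."},
        "ch2": {"s1": "Small dogs are allowed on Fridays."},
    }
}
flat = flatten(doc)
# keys like "policy.pdf/ch1/s1" preserve the document hierarchy
```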

Embedding and Retrieval Complexity

The discussion reveals that the “most similar” retrieval step involves considerable nuance beyond simply computing cosine similarity between query and document embeddings. Context size presents challenges at multiple levels: both the embedding model’s context window for generating meaningful embeddings, and the generative LLM’s context window which limits how much retrieved information can be passed through.
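
Cosine similarity itself is the easy part of the retrieval step; a self-contained sketch over plain Python lists:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0, 1.0]
doc_vec = [0.5, 0.0, 0.5]  # same direction, different magnitude
score = cosine_similarity(query_vec, doc_vec)  # direction matches -> 1.0
```

The nuance the engineers describe lives everywhere around this function: what gets embedded, at what granularity, and how much of the ranked result fits in the generator's context window.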

Justin explains the chunking tradeoff clearly: too small (individual sentences) loses context and produces embeddings that don’t capture meaning well; too large (entire documents) tries to squeeze too much information into a single embedding representation, degrading retrieval quality. Finding the “sweet spot” requires experimentation with chunk sizes, overlap windows, and techniques that combine subsection text with parent section context.
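
The sliding-window side of that experimentation can be sketched as a simple chunker; the window and overlap sizes below are arbitrary placeholders, and real systems tune them empirically:

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into fixed-size windows that share `overlap` tokens
    with their neighbors, so context at chunk boundaries is not lost."""
    step = size - overlap  # must be positive, i.e. overlap < size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_with_overlap(tokens, size=4, overlap=2)
```

Each chunk repeats the last two tokens of its predecessor, which is the simplest of the overlap techniques mentioned; combining a subsection's text with its parent section heading is the hierarchical variant.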

Among the retrieval enhancement techniques discussed, Ben mentions a Salesforce research paper on iterative summarization that progressively condenses text while preserving named entities and key information, improving the signal-to-noise ratio of retrieved content.

Evaluation Challenges

The team emphasizes that RAG evaluation is significantly more complex than traditional ML evaluation metrics. Justin uses an analogy of grading a student essay: you can check grammar (the easy part), but evaluating thoroughness, appropriate length, and factual correctness is much harder.

A key concept discussed is faithfulness as distinct from correctness. A system might produce a correct answer but not based on the retrieved context—particularly when using powerful models like GPT-4 that have extensive world knowledge. Faithfulness measures whether the generated response is actually derived from the provided context rather than the model’s parametric knowledge. This distinction is crucial for enterprise deployments where answers must be traceable to authoritative source documents.

The engineers acknowledge that many LLM evaluations rely on multiple-choice formats that are easy to score but don’t reflect real-world usage patterns. Ben explains that faithfulness evaluation requires LLMs to evaluate other LLMs, generating and assessing statements about the relationship between questions, answers, and context—there’s no simple numeric metric that captures everything.
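
The LLM-as-judge pipeline Ben describes is hard to show without model calls, but the underlying question, whether each claim in the answer is supported by the retrieved context, can be illustrated with a crude lexical proxy. Everything below is an invented stand-in: real faithfulness evaluation uses LLM judges, not word overlap:

```python
def lexical_faithfulness(answer: str, context: str) -> float:
    """Crude proxy: fraction of answer sentences whose content words
    all appear in the context. A real pipeline would ask an LLM judge
    whether each extracted statement is entailed by the context."""
    stopwords = {"the", "a", "an", "is", "are", "to", "of", "in", "on"}
    ctx_words = set(context.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    supported = 0
    for sentence in sentences:
        content = {w for w in sentence.lower().split() if w not in stopwords}
        if content <= ctx_words:
            supported += 1
    return supported / len(sentences)

context = "small dogs under 25 lbs are allowed on fridays"
faithful = lexical_faithfulness("small dogs are allowed on fridays", context)
unfaithful = lexical_faithfulness("cats are allowed every day", context)
```

Even this toy version separates the two concerns: the “unfaithful” answer might still be correct in the real world, but it is not grounded in the provided context, which is what enterprise traceability requires.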

Another evaluation challenge is the distribution of query difficulty. An evaluation set with “easy” questions that reliably retrieve correct documents will show good metrics, but users in production may ask questions in unexpected ways, use different vocabulary, or pose genuinely harder questions that the system wasn’t designed for.

Practical Recommendations

The team offers several pragmatic insights for production RAG deployments.

Technical Architecture Insights

The canonical RAG architecture discussed involves splitting source documents into chunks, generating embeddings via an embedding model, storing embeddings in a searchable format, embedding user queries through the same model, finding similar document embeddings, converting those back to text, and including that text in the prompt sent to a generative LLM.
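
That whole loop can be miniaturized end to end. In the sketch below a toy bag-of-words “embedding” stands in for the real embedding model; every document, function, and prompt here is illustrative, not a production recipe:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# 1) Chunk source documents and 2) embed them into 3) a searchable index
chunks = [
    "Small dogs under 25 lbs are allowed in the office on Fridays.",
    "Expense reports are due by the fifth business day of each month.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 4) Embed the query with the same model, 5) find the most similar chunk,
# 6) map it back to text, and 7) include it in the generator's prompt
query = "Can I bring my small dog to work?"
q_vec = embed(query)
best_chunk, _ = max(index, key=lambda pair: cosine(q_vec, pair[1]))
prompt = f"Context: {best_chunk}\n\nQuestion: {query}"  # sent to the LLM
```

Each numbered comment corresponds to one box in the canonical diagram; as the team notes next, it is the engineering inside each box, not the diagram, that is hard.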

The team notes that while this architecture looks simple, each box in the diagram conceals significant complexity. The embedding models may be pre-trained and commoditized, but everything around them—parsing, chunking, retrieval logic, context assembly, and evaluation—requires substantial engineering judgment and domain expertise.

This discussion provides a valuable counterweight to the “RAG is easy” narrative, demonstrating that production RAG applications are fundamentally data engineering and systems design challenges that happen to involve LLMs, rather than LLM projects that require a bit of data handling.
