## Overview
Amazon Finance Automation developed a generative AI Q&A chat assistant to address a significant operational challenge within their Accounts Payable (AP) and Accounts Receivable (AR) teams. Analysts were spending excessive time—often hours to days—responding to customer queries because they needed to consult subject matter experts (SMEs) and review multiple policy documents containing standard operating procedures (SOPs). This was particularly challenging for new hires who lacked immediate access to institutional knowledge. The solution leverages a Retrieval Augmented Generation (RAG) pipeline built on Amazon Bedrock to provide rapid, accurate responses to analyst queries.
This case study is notable for its transparency about the iterative improvement process, documenting how the team improved accuracy from an initial 49% to a final 86% through systematic evaluation and targeted optimizations. It provides valuable insights into production LLM system development within enterprise finance operations.
## Technical Architecture
The solution is built on a RAG pipeline running on Amazon Bedrock with several key components working together:
**Knowledge Base and Vector Store**: The team used Amazon OpenSearch Service as the vector store for embedding and storing policy documents. They processed and indexed multiple Amazon finance policy documents into this knowledge base. The team noted plans to migrate to Amazon Bedrock Knowledge Bases in the future to eliminate cluster management overhead and add extensibility to their pipeline. This represents a common production consideration—balancing control over infrastructure versus managed service convenience.
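The case study does not show the team's index configuration, but a minimal sketch of a k-NN index for policy-document chunks in OpenSearch Service might look like the following (the domain endpoint, index name, field names, and the 1,024-dimension setting are illustrative assumptions):

```python
# Minimal sketch of a k-NN index for policy-document chunks; authentication is
# omitted, and the endpoint, index name, and dimension are illustrative
# assumptions rather than the team's actual configuration.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-finance-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "chunk_text": {"type": "text"},
            "source_document": {"type": "keyword"},
            "chunk_embedding": {
                "type": "knn_vector",
                "dimension": 1024,  # must match the embedding model's output size
                "method": {"name": "hnsw", "engine": "nmslib", "space_type": "cosinesimil"},
            },
        }
    },
}

client.indices.create(index="finance-policy-chunks", body=index_body)
```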
**Embedding Model**: The solution uses the Amazon Titan Multimodal Embeddings G1 model on Amazon Bedrock. The team conducted a comparative analysis and found that this model delivered accuracy higher than or comparable to other embedding models on the market. The choice of embedding model proved crucial to overall system performance, as discussed in the accuracy improvement section.
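For reference, a call to this embedding model through the Bedrock runtime API might look like the sketch below; the model ID shown is the public identifier for Titan Multimodal Embeddings G1, and error handling and batching are omitted:

```python
# Sketch of generating a text embedding with Titan Multimodal Embeddings G1 via
# the Bedrock runtime API.
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_text(text: str) -> list[float]:
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

vector = embed_text("What is the approval threshold for vendor credit memos?")
```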
**Generator Model**: A foundation model from Amazon Bedrock serves as the generator, selected for its balanced ability to deliver accurate answers quickly. The specific model is not named in the case study, which is typical for internal Amazon implementations.
**Ranking Components**: Two specialized rankers enhance retrieval quality:
- A **diversity ranker** rearranges vector index results to avoid skewness or bias towards specific documents or sections
- A **lost in the middle ranker** distributes the most relevant results towards the top and bottom of the prompt, addressing the known issue where LLMs may underweight information in the middle of long contexts
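The case study does not include the rankers' code, but the lost-in-the-middle idea can be illustrated with a small, self-contained sketch that alternates chunk placement between the top and bottom of the context. This mirrors the general technique, not Amazon's implementation:

```python
# Reorder chunks so the most relevant ones sit at the top and bottom of the
# prompt and the least relevant end up in the middle. Input is sorted from most
# to least relevant.
def lost_in_the_middle_reorder(chunks_by_relevance: list[str]) -> list[str]:
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)    # 1st, 3rd, 5th most relevant go toward the top
        else:
            back.insert(0, chunk)  # 2nd, 4th, 6th most relevant go toward the bottom
    return front + back

print(lost_in_the_middle_reorder(["c1", "c2", "c3", "c4", "c5"]))
# ['c1', 'c3', 'c5', 'c4', 'c2']  -- the weakest chunk ends up in the middle
```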
**Safety and Validation**: Amazon Bedrock Guardrails detects personally identifiable information (PII) and protects against prompt injection attacks. A separate validation engine removes PII from responses and checks whether generated answers align with retrieved context. If alignment fails, the system returns a hardcoded "I don't know" response to prevent hallucinations—a practical approach to ensuring production reliability.
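A hedged sketch of what such an output-side check can look like is shown below, combining the Bedrock ApplyGuardrail API with a naive grounding heuristic; the guardrail identifier and the overlap-based `answer_is_grounded` check are placeholders, not details from the case study:

```python
# Output-side validation sketch: run the draft answer through Bedrock Guardrails,
# then apply a naive grounding check before returning it. The guardrail ID/version
# and the overlap heuristic are placeholders.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
FALLBACK = "I don't know"

def answer_is_grounded(answer: str, context: str, threshold: float = 0.6) -> bool:
    """Crude stand-in for the validation engine: share of answer tokens that
    also appear in the retrieved context."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return False
    return len(answer_tokens & context_tokens) / len(answer_tokens) >= threshold

def validate_answer(answer: str, retrieved_context: str) -> str:
    result = bedrock.apply_guardrail(
        guardrailIdentifier="my-guardrail-id",  # placeholder
        guardrailVersion="1",
        source="OUTPUT",
        content=[{"text": {"text": answer}}],
    )
    if result["action"] == "GUARDRAIL_INTERVENED":
        return FALLBACK
    if not answer_is_grounded(answer, retrieved_context):
        return FALLBACK
    return answer
```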
**User Interface**: The team built the chat assistant UI using Streamlit, an open source Python library commonly used for ML application prototyping and deployment.
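A minimal Streamlit chat loop in the spirit of that UI might look like this, with `answer_question` standing in for the RAG pipeline call (a placeholder, not the team's code):

```python
# Minimal Streamlit chat loop; answer_question() is a placeholder for the
# retrieval + Bedrock generation pipeline.
import streamlit as st

def answer_question(query: str) -> str:
    """Placeholder for the RAG pipeline call."""
    return f"(answer for: {query})"

st.title("Finance policy Q&A assistant")

if "history" not in st.session_state:
    st.session_state.history = []

for role, text in st.session_state.history:
    st.chat_message(role).markdown(text)

if query := st.chat_input("Ask a question about AP/AR policies"):
    st.session_state.history.append(("user", query))
    st.session_state.history.append(("assistant", answer_question(query)))
    st.rerun()
```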
## Evaluation Strategy
The accuracy measurement and evaluation approach is one of the most valuable aspects of this case study. After the initial deployment, SMEs manually evaluated responses and found only 49% were correct—far below expectations. However, manual evaluation was not sustainable due to the time required from finance operations and engineering teams.
The team developed an automated evaluation approach with several components:
**Test Dataset Construction**: They created a test dataset of 100 questions with three fields: the question, the expected answer (manually labeled by SMEs), and the generated answer from the bot. Questions covered various source types including policy documents and engineering SOPs, as well as complex formats like embedded tables and images.
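In practice the dataset is just three aligned fields per question; a sketch of the layout (with invented example content) could look like:

```python
# Three-field layout of the evaluation set; the example content is invented.
eval_dataset = [
    {
        "question": "What is the cutoff for month-end invoice processing?",
        "expected_answer": "Invoices must be submitted by the last business day ...",  # SME-labeled
        "generated_answer": "",  # filled in by running the question through the assistant
    },
    # ... 100 questions in total
]
```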
**NLP-based Scores**: Traditional metrics like ROUGE and METEOR scores were calculated but found inadequate. These word-matching algorithms ignore semantic meaning and showed approximately 30% variance compared to human evaluations—too high for reliable automated evaluation.
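For reference, these baselines are straightforward to compute; the sketch below uses the `rouge-score` and `nltk` packages, which are an assumption since the case study does not name its tooling:

```python
# Word-overlap baselines for comparing generated and expected answers.
import nltk
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching

expected = "Invoices over 10,000 USD require director approval."
generated = "Director approval is required for invoices above 10,000 USD."

rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print(rouge.score(expected, generated)["rougeL"].fmeasure)
print(meteor_score([expected.split()], generated.split()))
```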
**LLM-based Scoring**: The team used a foundation model from Amazon Bedrock to score RAG performance, designing specialized prompts that grade each response by comparing the generated answer with the expected answer. The judge produced metrics including accuracy, acceptability, and factualness, along with citations that explain its reasoning. This "LLM as judge" approach showed only approximately 5% variance compared to human analysis, making it the preferred evaluation method. The case study notes that Amazon Bedrock Knowledge Bases now offers a built-in RAG evaluation tool with metrics for context relevance, context coverage, correctness, completeness, and helpfulness, plus responsible AI metrics.
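A hedged sketch of an LLM-as-judge call using the Bedrock Converse API is shown below; the judge model ID and the scoring rubric are illustrative assumptions rather than the team's actual prompts:

```python
# LLM-as-judge sketch: ask a Bedrock model to grade a generated answer against
# the SME-labeled expected answer and return structured scores.
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

JUDGE_PROMPT = """You are grading a finance Q&A assistant.
Question: {question}
Expected answer (SME-labeled): {expected}
Generated answer: {generated}

Compare the generated answer with the expected answer and return only JSON with
keys "accuracy", "acceptability", "factualness" (each 0-1) and "citation"
(a short justification for the scores)."""

def judge(question: str, expected: str, generated: str) -> dict:
    response = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_PROMPT.format(
                question=question, expected=expected, generated=generated)}],
        }],
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])
```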
## Accuracy Improvement Journey
The systematic improvement from 49% to 86% accuracy involved three major initiatives:
### Semantic Document Chunking (49% → 64%)
The team diagnosed that 14% of inaccuracies stemmed from incomplete contexts sent to the LLM. The original fixed-size chunking approach (512 tokens, roughly 384 words) didn't respect document boundaries such as sections and paragraphs. Their new semantic chunking approach used:
- QUILL Editor to convert unstructured text to structured HTML, preserving document formatting
- HTML tag analysis to insert dividers for document segmentation based on logical structure
- Embedding generation for semantic vector representation of chunks
- Tag assignment based on important keywords to identify logical section boundaries
- Insertion of embedding vectors into OpenSearch Service
The processing rules included extracting section boundaries precisely, pairing section titles with content accurately, assigning keyword-based tags, preserving markdown information, and initially excluding images and tables from processing.
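The exact pipeline isn't published, but the core idea of splitting structured HTML on heading tags so each chunk stays within one logical section, with its title attached, can be sketched as follows (BeautifulSoup is an assumption; images and tables are skipped, mirroring the initial exclusion):

```python
# Heading-based semantic chunking sketch: keep each section title together with
# its body text so chunks follow logical document boundaries.
from bs4 import BeautifulSoup

def semantic_chunks(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    chunks, current = [], None
    for element in soup.find_all(["h1", "h2", "h3", "p", "li"]):
        if element.name in ("h1", "h2", "h3"):
            if current and current["text"].strip():
                chunks.append(current)
            current = {"title": element.get_text(strip=True), "text": ""}
        elif current is not None:
            current["text"] += element.get_text(" ", strip=True) + "\n"
    if current and current["text"].strip():
        chunks.append(current)
    return chunks
```

Each resulting chunk (title plus body) is then embedded and written to the OpenSearch Service index, with keyword-based tags attached as metadata.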
### Prompt Engineering (64% → 76%)
The team adopted task-specific prompt engineering rather than a one-size-fits-all approach:
**Hallucination Prevention**: In approximately 14% of cases, the LLM generated responses when no relevant context was retrieved. Prompts were engineered to instruct the LLM not to generate responses when lacking relevant context.
**Response Completeness**: User feedback indicated approximately 13% of responses were too brief. Prompts were modified to encourage more comprehensive responses, with the ability to generate both concise summaries and detailed answers.
**Citation Generation**: LLM prompts were designed to generate citations properly attributing sources used in answers. The UI displays citations as hyperlinks, enabling users to validate LLM performance.
**Chain-of-Thought Reasoning**: The team improved prompts to introduce better chain-of-thought (CoT) reasoning. The revised prompts:
- Improved performance and aligned responses with humanlike coherence
- Made the model less prone to hallucination by keeping it grounded in the conversation context
- Reduced the chance of inaccurate answers by building on established context
- Included examples of previously answered questions to establish response patterns
The case study includes detailed prompt examples showing structured thinking processes, format specifications for "I don't know" responses, and citation attribution prompts with scratchpad reasoning sections.
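While the original prompts are not reproduced in full here, an illustrative skeleton in the same spirit (scratchpad reasoning, a fixed "I don't know" format, and citation instructions) might look like:

```
You are a finance operations assistant. Answer the question between <question>
tags using only the context between <context> tags.

<scratchpad>
Think step by step: identify which context passages are relevant, check whether
they fully answer the question, and note the source of each fact you will use.
</scratchpad>

If the context does not contain the answer, reply exactly: "I don't know".
Otherwise, provide a concise summary followed by a detailed answer, and end with
a "Citations:" list naming the source documents you relied on.
```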
### Embedding Model Optimization (76% → 86%)
Despite semantic chunking improvements, retrieved context relevance scores remained at 55–65%, with incorrect contexts appearing in top ranks for over 50% of cases. The team experimented with multiple embedding models, including first-party and third-party options. They found that contextual embedding models like bge-base-en-v1.5 performed better for context retrieval than models like all-mpnet-base-v2. Ultimately, adopting the Amazon Titan Multimodal Embeddings G1 model increased retrieved context relevance from approximately 55–65% to 75–80%, with 80% of retrieved contexts achieving higher ranks than before.
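A simple way to run this kind of comparison offline is to score each candidate model on an SME-labeled retrieval set; the sketch below uses the `sentence-transformers` package and the open models named above, while the harness itself is an assumption:

```python
# Offline comparison of candidate embedding models on a labeled retrieval example.
from sentence_transformers import SentenceTransformer, util

candidates = ["BAAI/bge-base-en-v1.5", "sentence-transformers/all-mpnet-base-v2"]
query = "What documentation is required to reissue a rejected payment?"
chunks = [
    "Section 4.2: Rejected payments are reissued after the vendor submits ...",
    "Section 7.1: Vendor onboarding requires a completed tax form ...",
]
relevant_index = 0  # ground-truth label for this query

for name in candidates:
    model = SentenceTransformer(name)
    scores = util.cos_sim(model.encode(query), model.encode(chunks))[0]
    top = int(scores.argmax())
    print(f"{name}: top-ranked chunk = {top}, hit = {top == relevant_index}")
```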
## Production Considerations
Several aspects of this case study reflect mature LLMOps practices:
**Iterative Improvement**: The team didn't expect perfection from the initial deployment. They built systematic evaluation capabilities first, then used data-driven insights to prioritize improvement areas.
**Safety Layers**: Multiple safety mechanisms were implemented—Bedrock Guardrails for PII detection and prompt injection prevention, plus a validation engine for hallucination detection with graceful fallback to "I don't know" responses.
**User Experience**: The UI includes citation hyperlinks allowing users to verify sources, acknowledging that LLM outputs require human validation in production contexts.
**Scalability Planning**: The mention of migrating from self-managed OpenSearch to Amazon Bedrock Knowledge Bases indicates forward-thinking about operational overhead reduction.
**Realistic Metrics**: The 86% accuracy target, while significantly improved, is presented transparently rather than claiming perfect performance—a realistic expectation for enterprise RAG systems.
This case study provides a valuable template for organizations building RAG-based assistants in regulated or high-stakes enterprise environments, demonstrating that systematic evaluation and targeted improvements can dramatically enhance LLM system reliability.