Amazon Finance Automation developed a RAG-based Q&A chat assistant using Amazon Bedrock to help analysts quickly retrieve answers to customer queries. Through systematic improvements in document chunking, prompt engineering, and embedding model selection, they increased response accuracy from 49% to 86% and cut query response times from days to minutes.
This case study presents a detailed examination of how Amazon Finance Automation developed and iteratively improved a production LLM system to handle customer queries in their Accounts Payable (AP) and Accounts Receivable (AR) operations. The journey demonstrates several key aspects of implementing LLMs in production, with particular emphasis on systematic evaluation and improvement of RAG pipelines.
The initial problem faced by Amazon Finance was that analysts were spending excessive time searching through policy documents and consulting subject matter experts (SMEs) to answer customer queries, with response times stretching from hours to days. To address this, they implemented a RAG-based solution using Amazon Bedrock, incorporating several sophisticated components (a simplified orchestration sketch follows the list):
* A vector store using Amazon OpenSearch Service for the knowledge base
* Amazon Titan Multimodal Embeddings G1 model for document embedding
* Foundation models from Amazon Bedrock for response generation
* Custom components for diversity ranking and "lost in the middle" ranking
* Guardrails implementation through Amazon Bedrock for PII detection and prompt injection protection
* A validation engine to prevent hallucinations
* A Streamlit-based user interface
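The source does not include implementation code, but the component boundaries above map naturally onto a retrieve, rank, generate, validate pipeline. The following Python sketch shows one way such an orchestration could look with boto3; the model IDs, index name, and the `retrieve`/`rerank`/`validate` helpers are illustrative assumptions, not details published by Amazon Finance Automation.

```python
import json
import boto3

# Assumed model IDs and index name -- placeholders, not from the case study.
EMBED_MODEL_ID = "amazon.titan-embed-text-v1"
GEN_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"
OPENSEARCH_INDEX = "finance-policy-chunks"

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> list[float]:
    """Embed text with an Amazon Titan embedding model on Bedrock."""
    resp = bedrock.invoke_model(
        modelId=EMBED_MODEL_ID,
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def answer(query: str, retrieve, rerank, validate) -> dict:
    """Retrieve -> rank -> generate -> validate, mirroring the described components."""
    chunks = retrieve(embed(query), OPENSEARCH_INDEX, k=20)   # k-NN search in OpenSearch
    context = rerank(chunks, query)                           # diversity + "lost in the middle" ranking
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    resp = bedrock.invoke_model(
        modelId=GEN_MODEL_ID,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    draft = json.loads(resp["body"].read())["content"][0]["text"]
    return validate(draft, chunks)  # validation engine guards against hallucinated answers
```

Keeping each stage behind its own function boundary is what later allows the team to swap chunkers, prompts, and embedding models independently.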
What makes this case study particularly valuable from an LLMOps perspective is the systematic approach to evaluation and improvement. The team started with a baseline system that only achieved 49% accuracy and methodically improved it to 86% through several iterations.
The evaluation methodology they developed is especially noteworthy (a hedged LLM-as-judge sketch follows the list):
* They created a test dataset of 100 questions with manually labeled answers from SMEs
* They compared traditional NLP metrics (ROUGE and METEOR) with LLM-based evaluation
* They found that traditional metrics had a 30% variance from human evaluation, while LLM-based evaluation achieved much better alignment with only 5% variance
* They implemented specialized LLM prompts for evaluating accuracy, acceptability, and factualness
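The judge prompts themselves are not published. As a rough illustration of the LLM-as-judge pattern described above, the sketch below scores a candidate answer against an SME-labeled reference on the three reported dimensions; the rubric wording and the judge model ID are assumptions.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
JUDGE_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"  # assumed judge model

JUDGE_PROMPT = """You are grading a finance Q&A assistant.
Question: {question}
Reference answer (from a subject matter expert): {reference}
Candidate answer: {candidate}

Rate the candidate on accuracy, acceptability, and factualness,
each as an integer from 1 to 5, and return JSON like:
{{"accuracy": 0, "acceptability": 0, "factualness": 0}}"""

def judge(question: str, reference: str, candidate: str) -> dict:
    """Ask an LLM judge to score one test case; returns the three rubric scores."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    resp = bedrock.invoke_model(
        modelId=JUDGE_MODEL_ID,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 200,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    text = json.loads(resp["body"].read())["content"][0]["text"]
    return json.loads(text)  # assumes the judge returns well-formed JSON

# Averaging these scores over the 100-question test set gives a number
# that can be tracked across each improvement iteration.
```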
The improvement process followed three main phases:
First, they addressed document chunking issues. The initial fixed-size chunking (512 tokens, or roughly 384 words) was responsible for 14% of inaccuracies because chunks cut off context mid-section. They developed a semantic chunking approach (illustrated after the list) using:
* QUILL Editor for converting unstructured text to HTML
* Logical structure identification based on HTML tags
* Semantic vector representation of chunks
* Tag assignment based on keywords
This improvement alone increased accuracy from 49% to 64%.
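The chunker itself is not shown in the source. The sketch below illustrates the general idea under the assumption that, once a document has been converted to HTML (the QUILL editor step), chunks can be cut at heading boundaries so each one carries a complete logical section; the tag choices are simplified and the keyword-tagging step is omitted.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def semantic_chunks(html: str, heading_tags=("h1", "h2", "h3")) -> list[dict]:
    """Split an HTML document into section-level chunks keyed by their headings."""
    soup = BeautifulSoup(html, "html.parser")
    chunks, current = [], {"heading": "document", "text": []}
    # Simplified: walk headings and paragraphs only; real documents need more tags.
    for el in soup.find_all(["h1", "h2", "h3", "p"]):
        if el.name in heading_tags:
            if current["text"]:
                chunks.append({"heading": current["heading"],
                               "text": " ".join(current["text"])})
            current = {"heading": el.get_text(strip=True), "text": []}
        else:
            current["text"].append(el.get_text(strip=True))
    if current["text"]:
        chunks.append({"heading": current["heading"],
                       "text": " ".join(current["text"])})
    return chunks

# Each chunk can then be embedded and tagged with keywords drawn from its heading.
```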
Second, they focused on prompt engineering, addressing several specific goals (an illustrative prompt template follows the list):
* Hallucination prevention when no relevant context was found
* Comprehensive response generation
* Support for both concise and detailed answers
* Citation generation
* Chain-of-thought reasoning implementation
These changes further improved accuracy from 64% to 76%.
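The production prompt is not published; the template below is only a hedged illustration of how the listed goals (grounding, refusal when no relevant context is retrieved, completeness, concise versus detailed modes, citations, and chain-of-thought) might be expressed in a single prompt.

```python
# Illustrative template only -- not Amazon Finance Automation's actual prompt.
RAG_PROMPT_TEMPLATE = """You are a finance operations assistant for Accounts Payable and
Accounts Receivable analysts.

Context passages (each has an id):
{context}

Question: {question}
Answer style requested: {style} (either "concise" or "detailed")

Instructions:
1. Think through the question step by step before answering.
2. Use only the context passages above. If they do not contain the answer,
   reply exactly: "I could not find this in the policy documents."
3. Cover every part of the question; do not answer only partially.
4. Match the requested answer style.
5. End with a "Citations:" line listing the ids of the passages you used.
"""
```

The explicit refusal instruction targets the hallucination failure mode that appears when retrieval returns nothing relevant.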
Finally, they optimized their embedding model selection (a sketch of one way to run such a comparison follows the list):
* Experimented with multiple first-party and third-party models
* Found that contextual embedding models like bge-base-en-v1.5 performed better than alternatives
* Ultimately selected Amazon Titan Embeddings G1 model
* This improved context retrieval relevance from 55-65% to 75-80%
* Overall accuracy increased from 76% to 86%
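The comparison procedure is not detailed in the case study. One common way to run it, sketched below, is to measure hit rate at k (whether the labeled relevant chunk for each test question lands in the top-k retrieved results) for each candidate embedding model; the function and parameter names here are placeholders.

```python
import numpy as np

def hit_rate_at_k(embed_fn, questions, relevant_chunk_ids,
                  chunk_texts, chunk_ids, k=5) -> float:
    """Fraction of questions whose labeled relevant chunk appears in the
    top-k nearest chunks under a given embedding model (cosine similarity)."""
    chunk_vecs = np.array([embed_fn(t) for t in chunk_texts], dtype=float)
    chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    hits = 0
    for question, rel_id in zip(questions, relevant_chunk_ids):
        q_vec = np.array(embed_fn(question), dtype=float)
        q_vec /= np.linalg.norm(q_vec)
        top_k = np.argsort(chunk_vecs @ q_vec)[::-1][:k]
        if rel_id in {chunk_ids[i] for i in top_k}:
            hits += 1
    return hits / len(questions)

# Run once per candidate model (for example, a Titan embedding model versus
# bge-base-en-v1.5) and compare the resulting retrieval relevance numbers.
```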
The project demonstrates several LLMOps best practices:
* Systematic evaluation with both automated and human-verified metrics
* Iterative improvement with clear measurement of impact
* Multiple layers of safety (guardrails, validation engine, citation generation)
* Careful attention to document preprocessing and chunking
* Thoughtful prompt engineering with specific goals
* Empirical comparison of different embedding models
The case study also highlights the importance of having a clear evaluation framework before starting improvements. The team's ability to measure the impact of each change allowed them to make data-driven decisions about which improvements to pursue.
From an architectural perspective, the solution shows how to combine multiple AWS services effectively while maintaining modularity. The separation of concerns between different components (embedding, retrieval, ranking, generation, validation) allows for individual optimization of each piece.
The implementation of guardrails and validation checks demonstrates a production-ready approach to LLM deployment, with appropriate attention to safety and accuracy. The citation generation feature adds transparency and verifiability to the system's outputs, which is crucial for financial operations.
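The internals of the guardrails configuration and the validation engine are not public. The sketch below shows, under assumptions, how an input check with Amazon Bedrock Guardrails and a simple citation-consistency check could slot into such a pipeline; the guardrail identifier, version, and the validation logic are placeholders rather than the team's actual implementation.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def check_input(user_query: str, guardrail_id: str, guardrail_version: str) -> bool:
    """Screen a user query with Amazon Bedrock Guardrails (PII, prompt injection, etc.).
    Returns True if the query passed; the guardrail id/version are account-specific."""
    resp = bedrock.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source="INPUT",
        content=[{"text": {"text": user_query}}],
    )
    return resp["action"] != "GUARDRAIL_INTERVENED"

def validate_answer(answer: str, retrieved_chunks: list[dict]) -> bool:
    """Toy stand-in for the validation engine: accept the answer only if every
    cited chunk id actually exists among the retrieved chunks."""
    cited = {c.strip() for c in answer.rsplit("Citations:", 1)[-1].split(",") if c.strip()}
    known = {chunk["id"] for chunk in retrieved_chunks}
    return bool(cited) and cited <= known
```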
This case study provides valuable insights for organizations looking to implement RAG systems in production, particularly in domains where accuracy is crucial. The systematic approach to improvement and the detailed documentation of what worked (and by how much) makes it an excellent template for similar projects.