Shortwave built an AI email assistant that helps users interact with their email history as a knowledge base. They implemented a sophisticated Retrieval Augmented Generation (RAG) system with a four-step process: tool selection, data retrieval, question answering, and post-processing. The system combines multiple AI technologies including LLMs, embeddings, vector search, and cross-encoder models to provide context-aware responses within 3-5 seconds, while handling complex infrastructure challenges around prompt engineering, context windows, and data retrieval.
Shortwave's case study presents a comprehensive look at building and deploying a production-grade AI assistant for email management, highlighting the complexities and engineering decisions involved in creating a reliable LLM-powered system. The company's journey began with a simple email summarization feature but evolved into an ambitious project to create an AI executive assistant that could leverage users' entire email history as a knowledge base.
The architecture they developed is particularly noteworthy for its emphasis on simplicity and reliability. Rather than following the trend of using complex chains of LLM calls, Shortwave opted for a single-LLM approach for final response generation, betting on the superior reasoning capabilities and large context windows of models like GPT-4. This architectural decision was made after observing that longer chains often introduced compounding errors and data loss at each stage.
Their system architecture consists of four main stages:
**Tool Selection Stage:**
The system uses GPT-4 to analyze user queries and determine which data sources (tools) are needed to answer the question. This meta-reasoning step is crucial for efficient resource utilization and helps maintain a modular system architecture. The LLM receives extensive context about the current application state and available tools to make informed decisions.
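The case study doesn't publish Shortwave's prompts, but a minimal sketch of this kind of tool-selection step using OpenAI tool calling could look like the following. The tool names (`ai_search`, `current_thread`) and their schemas are illustrative assumptions, not Shortwave's actual API.

```python
# Sketch of a tool-selection call: GPT-4 decides which data sources are needed.
# Tool names and schemas below are illustrative, not Shortwave's actual tools.
from openai import OpenAI

client = OpenAI()

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "ai_search",
            "description": "Semantic + full-text search over the user's email history.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "current_thread",
            "description": "Fetch the email thread the user is currently viewing.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
]

def select_tools(question: str, app_state: str) -> list:
    """Ask the model which tools are needed; return the requested tool calls."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Current app state:\n{app_state}"},
            {"role": "user", "content": question},
        ],
        tools=TOOLS,
        tool_choice="auto",
    )
    return response.choices[0].message.tool_calls or []
```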
**Data Retrieval Stage:**
Once tools are selected, data retrieval occurs in parallel across different systems. The most sophisticated tool is their AI Search system, which combines multiple search technologies:
* Traditional full-text search via Elasticsearch
* Vector similarity search using Pinecone
* Feature extraction using parallel LLM calls
* Heuristic-based re-ranking
* Cross-encoder re-ranking using an MS MARCO MiniLM model (see the re-ranking sketch below)
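The exact ranking pipeline isn't open-sourced, but the cross-encoder stage can be approximated with the sentence-transformers `CrossEncoder` wrapper around a public MS MARCO MiniLM checkpoint. The model name and the `snippet` field are assumptions for illustration.

```python
# Sketch of cross-encoder re-ranking over candidates returned by the earlier
# full-text and vector stages. The checkpoint is a common public MS MARCO
# MiniLM model, assumed here rather than confirmed by the case study.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_k: int = 20) -> list[dict]:
    """Score (query, snippet) pairs and keep the highest-scoring candidates."""
    pairs = [(query, c["snippet"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```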
**Question Answering Stage:**
The system assembles all retrieved information into a carefully constructed prompt for GPT-4, including specialized instructions for output formatting and source citation. A key challenge here is managing token limits and making intelligent tradeoffs about what context to include.
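A minimal sketch of that assembly step, assuming a greedy fill against a fixed token budget and a simple citation convention; the budget, instructions, and snippet format are assumptions, not Shortwave's actual prompt.

```python
# Sketch of prompt assembly under a token budget. The budget, instructions,
# and snippet format are illustrative assumptions.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

SYSTEM_INSTRUCTIONS = (
    "Answer using only the emails provided. Cite sources as [n], "
    "where n is the email's index."
)

def build_prompt(question: str, ranked_snippets: list[str], budget: int = 6000) -> str:
    """Greedily include the highest-ranked snippets that fit the token budget."""
    parts, used = [], len(enc.encode(SYSTEM_INSTRUCTIONS + question))
    for i, snippet in enumerate(ranked_snippets):
        cost = len(enc.encode(snippet))
        if used + cost > budget:
            break  # the tradeoff made explicit: drop lower-ranked context
        parts.append(f"[{i + 1}] {snippet}")
        used += cost
    return f"{SYSTEM_INSTRUCTIONS}\n\n" + "\n\n".join(parts) + f"\n\nQuestion: {question}"
```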
**Post-Processing Stage:**
The final stage handles output formatting, adding citations, and suggesting actionable next steps to users.
From an infrastructure perspective, several key decisions and implementations stand out:
**Embedding Infrastructure:**
* They run their own GPU infrastructure for embedding generation using the Instructor model
* Embeddings are stored in Pinecone with per-user namespacing
* They perform real-time embedding similarity calculations for ranking (see the sketch after this list)
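A minimal sketch of that embedding path, assuming the open-source `InstructorEmbedding` package and the current Pinecone client; the index name, instruction text, and metadata fields are placeholders rather than Shortwave's actual configuration.

```python
# Sketch of the embedding path: encode email chunks with an Instructor model
# and upsert them into Pinecone under a per-user namespace. Index name,
# instruction text, and metadata fields are assumptions for illustration.
from InstructorEmbedding import INSTRUCTOR
from pinecone import Pinecone

model = INSTRUCTOR("hkunlp/instructor-large")   # self-hosted GPUs in production
pc = Pinecone(api_key="...")
index = pc.Index("email-embeddings")

def index_chunks(user_id: str, chunks: list[dict]) -> None:
    """Embed email chunks and store them, isolated per user via namespaces."""
    instruction = "Represent the email for retrieval:"
    vectors = model.encode([[instruction, c["text"]] for c in chunks])
    index.upsert(
        vectors=[
            (c["id"], v.tolist(), {"thread_id": c["thread_id"]})
            for c, v in zip(chunks, vectors)
        ],
        namespace=user_id,  # per-user data isolation
    )
```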
**Performance Optimization:**
* Heavy use of parallel processing and streaming (see the retrieval sketch after this list)
* Sophisticated batching and chunking strategies for email content
* Integration of multiple search technologies for optimal results
* Optimization to achieve 3-5 second response times despite complex processing
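A minimal sketch of the parallel-retrieval idea, assuming each selected tool is an async function wrapping the real Elasticsearch, Pinecone, or LLM call; total latency then tracks the slowest tool rather than the sum of all of them.

```python
# Sketch of parallel data retrieval: run every selected tool concurrently.
# Tool coroutines are placeholders for the real search and LLM calls.
import asyncio
from typing import Awaitable, Callable

ToolFn = Callable[[str], Awaitable[object]]

async def run_tools(selected: dict[str, ToolFn], query: str) -> dict[str, object]:
    """Execute all selected tools at once and collect their results by name."""
    names = list(selected)
    results = await asyncio.gather(*(selected[name](query) for name in names))
    return dict(zip(names, results))

# Usage (assuming async tool functions such as ai_search exist):
# results = asyncio.run(run_tools({"ai_search": ai_search}, "travel plans?"))
```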
**Production Considerations:**
* Built-in error handling and fallback mechanisms (sketched after this list)
* Careful management of API costs by using different models for different tasks
* Extensive use of caching and pre-computation where possible
* Scalable architecture supporting per-user data isolation
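A minimal sketch of two of these patterns, caching repeated calls and falling back to a cheaper model on failure; the model names and cache policy are assumptions rather than Shortwave's documented setup.

```python
# Sketch of two production patterns: caching repeated feature-extraction calls
# and falling back to a cheaper model when the primary call fails.
from functools import lru_cache
from openai import OpenAI

client = OpenAI()

@lru_cache(maxsize=1024)          # cache identical feature-extraction prompts
def extract_features(prompt: str) -> str:
    return _complete(prompt, model="gpt-3.5-turbo")   # cheaper model for sub-tasks

def answer(prompt: str) -> str:
    try:
        return _complete(prompt, model="gpt-4")
    except Exception:
        return _complete(prompt, model="gpt-3.5-turbo")   # degraded fallback

def _complete(prompt: str, model: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```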
The system demonstrates several innovative approaches to common LLMOps challenges:
**Context Window Management:**
Instead of trying to fit everything into one prompt, they developed sophisticated ranking and selection mechanisms to choose the most relevant content. This includes multiple stages of re-ranking using both heuristics and ML models.
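One way such a heuristic pre-ranking stage could look, blending embedding similarity with a recency signal before the cross-encoder runs; the weights, decay constant, and candidate fields are illustrative assumptions.

```python
# Sketch of heuristic ranking ahead of the cross-encoder: blend embedding
# similarity with simple signals such as recency. Weights and fields are
# illustrative assumptions, not tuned production values.
import math
import time

def heuristic_score(candidate: dict, similarity: float) -> float:
    """Combine vector similarity with a decay over the email's age in days."""
    age_days = (time.time() - candidate["timestamp"]) / 86400
    recency = math.exp(-age_days / 90)          # assumed ~90-day decay constant
    return 0.7 * similarity + 0.3 * recency

def select_top(candidates: list[dict], sims: list[float], k: int = 50) -> list[dict]:
    """Keep the k best candidates to pass to the cross-encoder stage."""
    scored = sorted(
        zip(candidates, sims),
        key=lambda cs: heuristic_score(cs[0], cs[1]),
        reverse=True,
    )
    return [c for c, _ in scored[:k]]
```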
**Query Understanding:**
Their query reformulation system shows sophisticated handling of conversational context, enabling natural follow-up questions while maintaining accurate understanding of user intent.
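A minimal sketch of that reformulation step, rewriting a conversational follow-up into a standalone search query; the prompt wording is an assumption, not Shortwave's actual instruction.

```python
# Sketch of query reformulation: turn a follow-up like "what about next week?"
# into a standalone query by resolving references against the conversation.
from openai import OpenAI

client = OpenAI()

def reformulate(history: list[dict], follow_up: str) -> str:
    """Rewrite the latest user message as a self-contained search query."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "Rewrite the user's latest message as a standalone "
                           "search query, resolving pronouns and references "
                           "from the conversation so far.",
            },
            {"role": "user", "content": f"{transcript}\nuser: {follow_up}"},
        ],
    )
    return resp.choices[0].message.content.strip()
```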
**Hybrid Search Architecture:**
The combination of traditional search, vector search, and cross-encoder ranking shows a practical approach to achieving both recall and precision in production systems.
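Reciprocal rank fusion (RRF) is one common way to merge lexical and vector result lists before re-ranking. The case study doesn't state that Shortwave uses RRF specifically (it describes heuristic and cross-encoder re-ranking), so this is purely an illustration of the fusion idea.

```python
# Reciprocal rank fusion: merge ranked id lists from full-text and vector
# search so that documents ranked highly by either source float to the top.
def rrf_merge(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids; earlier ranks contribute more score."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: rrf_merge([elasticsearch_ids, pinecone_ids]) -> fused candidate ids
```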
For developers and architects working on similar systems, several lessons emerge:
* The value of keeping the core architecture simple while allowing for complexity in specific components
* The importance of parallel processing and streaming for acceptable latency
* The benefits of running your own infrastructure for certain AI components while using cloud services for others
* The need for sophisticated ranking and selection mechanisms when dealing with large amounts of context
The case study also demonstrates the importance of careful product thinking in AI systems. Shortwave's decision to position their AI as an "executive assistant" helped users understand its capabilities and limitations, while their focus on email as a knowledge base gave them a clear scope for feature development.
While the system shows impressive capabilities, it's worth noting that such an architecture requires significant computational resources and careful optimization to maintain reasonable response times. The reliance on self-hosted GPU infrastructure and a multi-stage retrieval pipeline suggests that similar systems might be expensive to operate at scale.