ZenML

Building a Modern Search Engine for Parliamentary Records with RAG Capabilities

Hansard 2024

The Singapore government developed Pair Search, a modern search engine for Parliamentary records (Hansard) that addresses the limitations of traditional keyword-based search. The system combines semantic search using e5 embeddings with ColBERT v2 reranking, and is designed both to serve human users directly and to act as a retrieval backend for RAG applications. Early deployment shows strong user satisfaction, with around 150 daily users and 200 daily searches, and markedly better result quality than the previous system.

Industry

Government

Overview

Pair Search is a prototype search engine developed by Singapore’s Open Government Products (OGP) team during the Hack for Public Good 2024 hackathon. The project addresses a significant pain point in government information retrieval: searching through decades of parliamentary records (Hansard) that were previously only accessible via poor-quality keyword search. The system is explicitly designed with a dual purpose in mind—serving human users directly while also functioning as the retrieval component for Retrieval Augmented Generation (RAG) systems used by other government LLM products.

The Hansard database contains the official record of every word spoken in Singapore’s Parliament, dating back to 1955 when Parliament was known as the Legislative Assembly. This represents over 30,000 reports spanning nearly 70 years of evolving data formats. Policy makers, legal professionals, and the public all rely on this information, but the existing search infrastructure was woefully inadequate for modern needs.

The original Hansard search engine relied entirely on keyword-based matching, which produced poor results for complex queries. The case study provides a concrete example: searching for “covid 19 rapid testing” in the legacy system returns results flooded with documents that merely mention “Covid” frequently, rather than documents actually discussing rapid testing protocols. The system lacked semantic understanding and couldn’t interpret user intent beyond literal word matching.

Additionally, the legacy interface only displayed document titles without contextual snippets, forcing users to click through multiple links to determine relevance. This created significant friction for policy officers who needed to research topics quickly and thoroughly.

Technical Architecture

Document Processing Pipeline

The team faced substantial data engineering challenges in preparing the corpus for indexing. The Hansard database spans decades during which data formats evolved significantly. Standardizing this heterogeneous information into a uniform format suitable for modern search indexing required careful parsing and transformation. While the case study doesn’t detail the specific ETL processes used, this kind of historical document processing is a common but often underestimated component of production search systems.

Search Engine Infrastructure

Pair Search is built on Vespa.ai, an open-source big data serving engine. This choice reflects several strategic considerations: Vespa provides both keyword and vector search capabilities in a single platform, it’s designed for production-scale workloads, and it has active integration of state-of-the-art models and techniques. The open-source nature also aligns with government preferences for avoiding vendor lock-in and maintaining transparency.

Hybrid Retrieval Strategy

The retrieval mechanism employs a dual-pronged approach combining keyword and semantic search:

Keyword Search Component: Uses Vespa’s weakAnd operator with nativeRank and BM25 text matching algorithms. BM25 is a well-established probabilistic ranking function that considers term frequency and document length normalization. The weakAnd operator allows efficient approximate matching without exhaustively scoring every document.

Semantic Search Component: Incorporates e5 embeddings for vector-based similarity search. The team explicitly chose e5 over alternatives like OpenAI's ada embeddings, citing better speed, cost-effectiveness, and performance. This reflects a pragmatic production decision: while OpenAI embeddings are popular, e5 models (the case study also mentions a multilingual M3 variant) can be self-hosted, reducing API dependencies and costs for high-volume government applications.

The hybrid approach captures both the literal textual content users specify and the semantic intent behind their queries. This is particularly valuable for parliamentary records where the same concept may be expressed using different terminology across decades.
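The case study does not show the actual queries, but a hybrid Vespa request of the kind described, combining weakAnd keyword retrieval with approximate nearest-neighbor vector search, might look like the following sketch. The field names (`embedding`), rank profile name (`hybrid`), and embedding dimension are illustrative assumptions, not details from Pair Search.

```python
def build_hybrid_query(user_query: str, query_vector: list[float],
                       hits: int = 10) -> dict:
    """Sketch of a Vespa Query API request body combining weakAnd keyword
    matching (via userQuery) with nearestNeighbor vector retrieval."""
    yql = (
        "select * from sources hansard where "
        "({targetHits: 100}nearestNeighbor(embedding, q_vec)) "
        "or userQuery()"
    )
    return {
        "yql": yql,
        "query": user_query,                 # consumed by userQuery() -> weakAnd
        "input.query(q_vec)": query_vector,  # e5 embedding of the query text
        "ranking.profile": "hybrid",         # hypothetical profile blending BM25 + similarity
        "hits": hits,
    }

body = build_hybrid_query("covid 19 rapid testing", [0.0] * 384)
```

In Vespa, the two retrieval operators are joined with `or`, so a document can surface through either the lexical or the semantic path and then be scored by the shared rank profile.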

Three-Phase Reranking Pipeline

To maintain low latency despite complex ranking algorithms, Pair Search implements a tiered reranking approach:

Phase 1 (Content Node Level): Each content node applies cost-effective initial filtering algorithms to reduce the candidate set. This distributes the computational load and eliminates clearly irrelevant results early.

Phase 2 (ColBERT v2 Reranking): A more resource-intensive pass using ColBERT v2, which performs late interaction between query and document token embeddings. ColBERT is known for providing high-quality relevance scores while being more efficient than cross-encoder approaches, making it suitable for production reranking.

Phase 3 (Global Aggregation): The final phase combines top results from all content nodes, computing a hybrid score that integrates semantic similarity, keyword matching, and ColBERT scores. The team notes that this multi-signal approach significantly outperforms single-metric ranking, which tends to be “overly biased towards one dimension of result quality.”
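The late-interaction scoring used in Phase 2 can be summarized with ColBERT's MaxSim operation: for each query token embedding, take the maximum similarity against all document token embeddings, then sum. Below is a minimal pure-Python illustration with toy two-dimensional vectors; production systems operate on learned token embeddings, not hand-written ones.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_tokens: list[list[float]],
           doc_tokens: list[list[float]]) -> float:
    """ColBERT-style late-interaction score: sum over query tokens of the
    max similarity against any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

q = [[1.0, 0.0], [0.0, 1.0]]          # toy query token embeddings
d = [[0.9, 0.1], [0.2, 0.8]]          # toy document token embeddings
score = maxsim(q, d)                  # 0.9 + 0.8 = 1.7
```

Because each query token matches its best document token independently, MaxSim is cheaper than a full cross-encoder pass while retaining token-level interaction.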

This architecture represents a classic production ML pattern: use cheap, fast models to filter at scale, then apply expensive, accurate models to a smaller candidate set.
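The cheap-filter-then-expensive-rerank pattern can be sketched as follows. The scoring callables are stand-ins for nativeRank/BM25, ColBERT v2, and embedding similarity, and the blend weights are invented for illustration; the real pipeline runs Phase 1 distributed across content nodes.

```python
def tiered_rank(query, docs, cheap_score, colbert_score, semantic_score,
                k1=1000, k2=50, weights=(0.3, 0.4, 0.3)):
    """Rank docs in two passes: a cheap filter over everything, then an
    expensive hybrid score over the survivors."""
    # Phase 1: inexpensive scoring over the full candidate set
    survivors = sorted(docs, key=lambda d: cheap_score(query, d),
                       reverse=True)[:k1]

    # Phases 2-3: costly ColBERT scoring on survivors, blended with the
    # keyword and semantic signals into a single hybrid score
    def hybrid(d):
        w_kw, w_cb, w_sem = weights
        return (w_kw * cheap_score(query, d)
                + w_cb * colbert_score(query, d)
                + w_sem * semantic_score(query, d))

    return sorted(survivors, key=hybrid, reverse=True)[:k2]

ranked = tiered_rank(
    "q",
    ["a", "bb", "ccc", "dddd"],
    cheap_score=lambda q, d: len(d),    # stand-in for BM25/nativeRank
    colbert_score=lambda q, d: 0.0,     # stand-in for ColBERT MaxSim
    semantic_score=lambda q, d: 0.0,    # stand-in for embedding similarity
    k1=3, k2=2,
)
```

The expensive model never sees more than `k1` candidates, which is what keeps tail latency bounded as the corpus grows.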

RAG Integration and Broader LLMOps Context

A key strategic aspect of Pair Search is its explicit design as RAG infrastructure. The team frames the project within a larger context: advances in LLMs have created widespread demand for data-augmented generation, and the quality of RAG systems fundamentally depends on retrieval quality. By building a high-quality search engine, OGP creates reusable infrastructure for multiple LLM applications across government.

The case study mentions that Pair Search is designed to work “out of the box as the retrieval stack for a Retrieval Augmented Generation system” and that they’ve been trialing this integration with an “Assistants feature in Pair Chat” (presumably another OGP product). By exposing APIs for both base search and RAG-specific retrieval, the team enables multiple applications to benefit from the same underlying engine.

This architectural approach—building specialized retrieval infrastructure that serves multiple LLM applications—reflects emerging best practices in LLMOps. Rather than embedding search logic into individual applications, centralizing retrieval creates opportunities for optimization, monitoring, and improvement that benefit all downstream systems.
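As a concrete picture of the retrieval-as-infrastructure idea, a downstream LLM product might consume the search API's top passages and assemble a grounded prompt. The prompt template below is a hypothetical sketch, not the actual Pair Chat integration.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Format retrieved Hansard excerpts as numbered context for an LLM,
    so answers can cite specific excerpts."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return (
        "Answer using only the parliamentary excerpts below; "
        "cite excerpts by number.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "When was rapid antigen testing debated?",
    ["Excerpt one...", "Excerpt two..."],
)
```

Keeping this assembly thin means retrieval quality improvements in the shared engine flow directly to every application built on top of it.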

Production Deployment and Metrics

The system was soft-launched with specific user groups including the Attorney-General’s Chambers (AGC), Ministry of Law legal policy officers, Communications Operations officers at MCI and PMO, and Committee of Supply coordinators. Early metrics showed approximately 150 daily users and 200 daily searches.

The team describes using engagement metrics for ongoing optimization: average rank of clicked results and number of pages users traverse before finding relevant content. These metrics inform tuning of the hybrid algorithm weights to improve both accuracy and relevance. This represents a reasonable production evaluation approach, though it’s worth noting that click-based metrics can have biases (users may click on first results regardless of true relevance).
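The two engagement metrics described can be computed from click logs as in this toy sketch; the log record shape and the 10-results-per-page assumption are invented for illustration.

```python
def engagement_metrics(click_logs: list[dict], page_size: int = 10):
    """Average rank of clicked results, and average pages traversed
    before the click (assuming `page_size` results per page)."""
    ranks = [rec["clicked_rank"] for rec in click_logs]
    avg_rank = sum(ranks) / len(ranks)
    # rank 1-10 -> page 1, rank 11-20 -> page 2, ...
    avg_pages = sum((r - 1) // page_size + 1 for r in ranks) / len(ranks)
    return avg_rank, avg_pages

avg_rank, avg_pages = engagement_metrics(
    [{"clicked_rank": 1}, {"clicked_rank": 4}, {"clicked_rank": 13}]
)
# avg_rank = 6.0, avg_pages = 4/3
```

Lower values on both metrics suggest users are finding relevant results sooner, which is the signal used to tune the hybrid weights.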

User feedback quoted in the case study is uniformly positive, with policy officers describing significant productivity improvements. The project also received political recognition when Prime Minister Lee Hsien Loong referenced it in Parliament, noting that “soon, we will be able to do a generative AI search on it.”

Future Directions

The team outlines several planned enhancements that further integrate LLM capabilities:

LLM-Augmented Indexing: Using language models to enrich the search index through automated tagging and potential-question generation. This preprocessing approach can improve retrieval without changing query-time complexity.

Query Expansion: Leveraging LLMs to enhance queries by appending related terms and phrases, increasing the probability of matching relevant documents. This is a well-established information retrieval technique that LLMs can automate effectively.
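Mechanically, query expansion of this kind appends model-suggested related terms to the user's query before retrieval. In the sketch below, `suggest_terms` stands in for an LLM call; the lambda is a hand-written placeholder, not real model output.

```python
def expand_query(query: str, suggest_terms) -> str:
    """Append LLM-suggested related terms to widen keyword recall.
    `suggest_terms` is any callable mapping a query to a list of terms
    (in production, an LLM call)."""
    extra = [t for t in suggest_terms(query)
             if t.lower() not in query.lower()]  # skip terms already present
    return f"{query} {' '.join(extra)}" if extra else query

expanded = expand_query(
    "covid 19 rapid testing",
    lambda q: ["antigen", "swab", "test kits"],  # stand-in for an LLM
)
```

Because expansion happens before the index is queried, it improves recall without any change to the ranking pipeline itself.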

Magic Summary Feature: The case study mentions a feature that “automatically generates a summary of the best results in a chronological timeline” that was deprioritized for initial launch. This suggests plans for generative summarization as a post-retrieval enhancement.

Expansion to Other Corpora: The team plans to extend indexing to other government data sources including High Court and Court of Appeal case judgments, addressing similar search quality issues in legal research.

Assessment

Pair Search represents a well-architected production search system that thoughtfully combines established information retrieval techniques with modern embedding-based approaches. The choice of Vespa.ai, hybrid retrieval, and tiered reranking reflects pragmatic engineering decisions appropriate for government production systems where reliability, cost-control, and maintainability matter.

The explicit framing as RAG infrastructure is notable—the team recognizes that search quality underlies LLM application quality and has designed accordingly. The three-phase reranking pipeline demonstrates understanding of production ML tradeoffs between accuracy and latency.

The case study is relatively light on operational details such as monitoring, testing strategies, or handling of edge cases. The evaluation approach based on click metrics, while practical, could be supplemented with more rigorous relevance assessment. The user testimonials, while positive, come from a small group of early adopters during soft launch.

Overall, Pair Search illustrates how government technology teams are building foundational infrastructure for LLM applications, with retrieval quality recognized as a critical enabler for downstream generative AI use cases.
