ZenML

Building a Modern Search Engine for Parliamentary Records with RAG Capabilities

Hansard 2024

The Singapore government developed Pair Search, a modern search engine for Parliamentary records (Hansard) that addresses the limitations of traditional keyword-based search. The system combines semantic search using e5 embeddings with ColBERT v2 reranking, and is designed both to serve human users directly and to act as a retrieval backend for RAG applications. Early deployment shows strong user satisfaction, with around 150 daily users and 200 daily searches, and markedly better result quality than the previous system.

Industry

Government

Overview

Pair Search is a prototype search engine developed by Singapore’s Open Government Products (OGP) team during the Hack for Public Good 2024 hackathon. The project addresses a significant pain point in government information retrieval: searching through decades of parliamentary records (Hansard) that were previously only accessible via poor-quality keyword search. The system is explicitly designed with a dual purpose in mind—serving human users directly while also functioning as the retrieval component for Retrieval Augmented Generation (RAG) systems used by other government LLM products.

The Hansard database contains the official record of every word spoken in Singapore’s Parliament, dating back to 1955 when Parliament was known as the Legislative Assembly. This represents over 30,000 reports spanning nearly 70 years of evolving data formats. Policy makers, legal professionals, and the public all rely on this information, but the existing search infrastructure was woefully inadequate for modern needs.

The original Hansard search engine relied entirely on keyword-based matching, which produced poor results for complex queries. The case study provides a concrete example: searching for “covid 19 rapid testing” in the legacy system returns results flooded with documents that merely mention “Covid” frequently, rather than documents actually discussing rapid testing protocols. The system lacked semantic understanding and couldn’t interpret user intent beyond literal word matching.

Additionally, the legacy interface only displayed document titles without contextual snippets, forcing users to click through multiple links to determine relevance. This created significant friction for policy officers who needed to research topics quickly and thoroughly.

Technical Architecture

Document Processing Pipeline

The team faced substantial data engineering challenges in preparing the corpus for indexing. The Hansard database spans decades during which data formats evolved significantly. Standardizing this heterogeneous information into a uniform format suitable for modern search indexing required careful parsing and transformation. While the case study doesn’t detail the specific ETL processes used, this kind of historical document processing is a common but often underestimated component of production search systems.

Search Engine Infrastructure

Pair Search is built on Vespa.ai, an open-source big data serving engine. This choice reflects several strategic considerations: Vespa provides both keyword and vector search capabilities in a single platform, it’s designed for production-scale workloads, and it has active integration of state-of-the-art models and techniques. The open-source nature also aligns with government preferences for avoiding vendor lock-in and maintaining transparency.

Hybrid Retrieval Strategy

The retrieval mechanism employs a dual-pronged approach combining keyword and semantic search:

Keyword Search Component: Uses Vespa’s weakAnd operator with nativeRank and BM25 text matching algorithms. BM25 is a well-established probabilistic ranking function that considers term frequency and document length normalization. The weakAnd operator allows efficient approximate matching without exhaustively scoring every document.

Semantic Search Component: Incorporates e5 embeddings for vector-based similarity search. The team explicitly chose e5 over alternatives like OpenAI's ada embeddings, citing better speed, cost-effectiveness, and performance. This reflects a pragmatic production decision: while OpenAI embeddings are popular, e5 models (the case study also mentions a multilingual M3 variant) can be self-hosted, reducing API dependencies and costs for high-volume government applications.

The hybrid approach captures both the literal textual content users specify and the semantic intent behind their queries. This is particularly valuable for parliamentary records where the same concept may be expressed using different terminology across decades.
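The case study does not show the actual queries, but a hybrid Vespa request of the kind described, combining weakAnd keyword retrieval with approximate nearest-neighbor vector search, might look like the following sketch. The field names (`embedding`), rank profile name (`hybrid`), and embedding dimension are illustrative assumptions, not details from Pair Search.

```python
def build_hybrid_query(user_query: str, query_vector: list[float],
                       hits: int = 10) -> dict:
    """Sketch of a Vespa Query API request body combining weakAnd keyword
    matching (via userQuery) with nearestNeighbor vector retrieval."""
    yql = (
        "select * from sources hansard where "
        "({targetHits: 100}nearestNeighbor(embedding, q_vec)) "
        "or userQuery()"
    )
    return {
        "yql": yql,
        "query": user_query,                 # consumed by userQuery() -> weakAnd
        "input.query(q_vec)": query_vector,  # e5 embedding of the query text
        "ranking.profile": "hybrid",         # hypothetical profile blending BM25 + similarity
        "hits": hits,
    }

body = build_hybrid_query("covid 19 rapid testing", [0.0] * 384)
```

In Vespa, the two retrieval operators are joined with `or`, so a document can surface through either the lexical or the semantic path and then be scored by the shared rank profile.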

Three-Phase Reranking Pipeline

To maintain low latency despite complex ranking algorithms, Pair Search implements a tiered reranking approach:

Phase 1 (Content Node Level): Each content node applies cost-effective initial filtering algorithms to reduce the candidate set. This distributes the computational load and eliminates clearly irrelevant results early.

Phase 2 (ColBERT v2 Reranking): A more resource-intensive pass using ColBERT v2, which performs late interaction between query and document token embeddings. ColBERT is known for providing high-quality relevance scores while being more efficient than cross-encoder approaches, making it suitable for production reranking.

Phase 3 (Global Aggregation): The final phase combines top results from all content nodes, computing a hybrid score that integrates semantic similarity, keyword matching, and ColBERT scores. The team notes that this multi-signal approach significantly outperforms single-metric ranking, which tends to be “overly biased towards one dimension of result quality.”
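The late-interaction scoring used in Phase 2 can be summarized with ColBERT's MaxSim operation: for each query token embedding, take the maximum similarity against all document token embeddings, then sum. Below is a minimal pure-Python illustration with toy two-dimensional vectors; production systems operate on learned token embeddings, not hand-written ones.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_tokens: list[list[float]],
           doc_tokens: list[list[float]]) -> float:
    """ColBERT-style late-interaction score: sum over query tokens of the
    max similarity against any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

q = [[1.0, 0.0], [0.0, 1.0]]          # toy query token embeddings
d = [[0.9, 0.1], [0.2, 0.8]]          # toy document token embeddings
score = maxsim(q, d)                  # 0.9 + 0.8 = 1.7
```

Because each query token matches its best document token independently, MaxSim is cheaper than a full cross-encoder pass while retaining token-level interaction.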

This architecture represents a classic production ML pattern: use cheap, fast models to filter at scale, then apply expensive, accurate models to a smaller candidate set.
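The cheap-filter-then-expensive-rerank pattern can be sketched as follows. The scoring callables are stand-ins for nativeRank/BM25, ColBERT v2, and embedding similarity, and the blend weights are invented for illustration; the real pipeline runs Phase 1 distributed across content nodes.

```python
def tiered_rank(query, docs, cheap_score, colbert_score, semantic_score,
                k1=1000, k2=50, weights=(0.3, 0.4, 0.3)):
    """Rank docs in two passes: a cheap filter over everything, then an
    expensive hybrid score over the survivors."""
    # Phase 1: inexpensive scoring over the full candidate set
    survivors = sorted(docs, key=lambda d: cheap_score(query, d),
                       reverse=True)[:k1]

    # Phases 2-3: costly ColBERT scoring on survivors, blended with the
    # keyword and semantic signals into a single hybrid score
    def hybrid(d):
        w_kw, w_cb, w_sem = weights
        return (w_kw * cheap_score(query, d)
                + w_cb * colbert_score(query, d)
                + w_sem * semantic_score(query, d))

    return sorted(survivors, key=hybrid, reverse=True)[:k2]

ranked = tiered_rank(
    "q",
    ["a", "bb", "ccc", "dddd"],
    cheap_score=lambda q, d: len(d),    # stand-in for BM25/nativeRank
    colbert_score=lambda q, d: 0.0,     # stand-in for ColBERT MaxSim
    semantic_score=lambda q, d: 0.0,    # stand-in for embedding similarity
    k1=3, k2=2,
)
```

The expensive model never sees more than `k1` candidates, which is what keeps tail latency bounded as the corpus grows.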

RAG Integration and Broader LLMOps Context

A key strategic aspect of Pair Search is its explicit design as RAG infrastructure. The team frames the project within a larger context: advances in LLMs have created widespread demand for data-augmented generation, and the quality of RAG systems fundamentally depends on retrieval quality. By building a high-quality search engine, OGP creates reusable infrastructure for multiple LLM applications across government.

The case study mentions that Pair Search is designed to work “out of the box as the retrieval stack for a Retrieval Augmented Generation system” and that they’ve been trialing this integration with an “Assistants feature in Pair Chat” (presumably another OGP product). By exposing APIs for both base search and RAG-specific retrieval, the team enables multiple applications to benefit from the same underlying engine.

This architectural approach—building specialized retrieval infrastructure that serves multiple LLM applications—reflects emerging best practices in LLMOps. Rather than embedding search logic into individual applications, centralizing retrieval creates opportunities for optimization, monitoring, and improvement that benefit all downstream systems.
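As a concrete picture of the retrieval-as-infrastructure idea, a downstream LLM product might consume the search API's top passages and assemble a grounded prompt. The prompt template below is a hypothetical sketch, not the actual Pair Chat integration.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Format retrieved Hansard excerpts as numbered context for an LLM,
    so answers can cite specific excerpts."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return (
        "Answer using only the parliamentary excerpts below; "
        "cite excerpts by number.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "When was rapid antigen testing debated?",
    ["Excerpt one...", "Excerpt two..."],
)
```

Keeping this assembly thin means retrieval quality improvements in the shared engine flow directly to every application built on top of it.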

Production Deployment and Metrics

The system was soft-launched with specific user groups including the Attorney-General’s Chambers (AGC), Ministry of Law legal policy officers, Communications Operations officers at MCI and PMO, and Committee of Supply coordinators. Early metrics showed approximately 150 daily users and 200 daily searches.

The team describes using engagement metrics for ongoing optimization: average rank of clicked results and number of pages users traverse before finding relevant content. These metrics inform tuning of the hybrid algorithm weights to improve both accuracy and relevance. This represents a reasonable production evaluation approach, though it’s worth noting that click-based metrics can have biases (users may click on first results regardless of true relevance).
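The two engagement metrics described can be computed from click logs as in this toy sketch; the log record shape and the 10-results-per-page assumption are invented for illustration.

```python
def engagement_metrics(click_logs: list[dict], page_size: int = 10):
    """Average rank of clicked results, and average pages traversed
    before the click (assuming `page_size` results per page)."""
    ranks = [rec["clicked_rank"] for rec in click_logs]
    avg_rank = sum(ranks) / len(ranks)
    # rank 1-10 -> page 1, rank 11-20 -> page 2, ...
    avg_pages = sum((r - 1) // page_size + 1 for r in ranks) / len(ranks)
    return avg_rank, avg_pages

avg_rank, avg_pages = engagement_metrics(
    [{"clicked_rank": 1}, {"clicked_rank": 4}, {"clicked_rank": 13}]
)
# avg_rank = 6.0, avg_pages = 4/3
```

Lower values on both metrics suggest users are finding relevant results sooner, which is the signal used to tune the hybrid weights.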

User feedback quoted in the case study is uniformly positive, with policy officers describing significant productivity improvements. The project also received political recognition when Prime Minister Lee Hsien Loong referenced it in Parliament, noting that “soon, we will be able to do a generative AI search on it.”

Future Directions

The team outlines several planned enhancements that further integrate LLM capabilities:

LLM-Augmented Indexing: Using language models to enrich the search index through automated tagging and potential-question generation. This preprocessing approach can improve retrieval without changing query-time complexity.

Query Expansion: Leveraging LLMs to enhance queries by appending related terms and phrases, increasing the probability of matching relevant documents. This is a well-established information retrieval technique that LLMs can automate effectively.
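Mechanically, query expansion of this kind appends model-suggested related terms to the user's query before retrieval. In the sketch below, `suggest_terms` stands in for an LLM call; the lambda is a hand-written placeholder, not real model output.

```python
def expand_query(query: str, suggest_terms) -> str:
    """Append LLM-suggested related terms to widen keyword recall.
    `suggest_terms` is any callable mapping a query to a list of terms
    (in production, an LLM call)."""
    extra = [t for t in suggest_terms(query)
             if t.lower() not in query.lower()]  # skip terms already present
    return f"{query} {' '.join(extra)}" if extra else query

expanded = expand_query(
    "covid 19 rapid testing",
    lambda q: ["antigen", "swab", "test kits"],  # stand-in for an LLM
)
```

Because expansion happens before the index is queried, it improves recall without any change to the ranking pipeline itself.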

Magic Summary Feature: The case study mentions a feature that “automatically generates a summary of the best results in a chronological timeline” that was deprioritized for initial launch. This suggests plans for generative summarization as a post-retrieval enhancement.

Expansion to Other Corpora: The team plans to extend indexing to other government data sources including High Court and Court of Appeal case judgments, addressing similar search quality issues in legal research.

Assessment

Pair Search represents a well-architected production search system that thoughtfully combines established information retrieval techniques with modern embedding-based approaches. The choice of Vespa.ai, hybrid retrieval, and tiered reranking reflects pragmatic engineering decisions appropriate for government production systems where reliability, cost-control, and maintainability matter.

The explicit framing as RAG infrastructure is notable—the team recognizes that search quality underlies LLM application quality and has designed accordingly. The three-phase reranking pipeline demonstrates understanding of production ML tradeoffs between accuracy and latency.

The case study is relatively light on operational details such as monitoring, testing strategies, or handling of edge cases. The evaluation approach based on click metrics, while practical, could be supplemented with more rigorous relevance assessment. The user testimonials, while positive, come from a small group of early adopters during soft launch.

Overall, Pair Search illustrates how government technology teams are building foundational infrastructure for LLM applications, with retrieval quality recognized as a critical enabler for downstream generative AI use cases.
