Great Ormond Street Hospital NHS Trust developed a solution to extract information from 15,000 unstructured cardiac MRI reports spanning 10 years. They implemented a hybrid approach using small LLMs for entity extraction and few-shot learning for table structure classification. The system successfully extracted patient identifiers and clinical measurements from heterogeneous reports, enabling linkage with structured data and improving clinical research capabilities. The solution demonstrated significant improvements in extraction accuracy when using contextual prompting with models like Flan-T5 and RoBERTa, while operating within NHS security constraints.
This case study comes from Great Ormond Street Hospital (GOSH), a pediatric NHS trust in the UK that specializes in treating children and young people with rare and complex conditions. The presentation was delivered by Bavi Rajendran (Senior Data Scientist leading NLP and Computer Vision) and Sabin Sabu (NLP and ML Engineer), both part of the Digital Research Environment (DRE) team within GOSH Drive, the hospital's innovation hub. The work described is part of a five-year partnership with a partner the presenters referred to as Raj (likely a research or technology partner).
The fundamental problem the team addressed was the massive amount of valuable clinical information locked within unstructured data in their electronic patient record systems. Clinicians and researchers frequently need to perform retrospective analysis on patient cohorts for clinical decision-making or secondary research purposes, but this data—including text reports, discharge summaries, and images—is not available in structured formats. For example, a clinician wanting to identify all patients diagnosed with Tetralogy of Fallot would need to manually review unstructured reports, a process that is time-consuming, requires domain expertise, and can only be done on small subsets of data.
The specific use case focused on approximately 15,000 cardiac MRI (CMR) reports produced over a 10-year period by different consultants. These reports existed as PDF documents containing clinical measurements in tabular format along with free-text information. Due to being created over such a long timeframe by various physicians, the reports were heterogeneous in nature—lacking consistent structure or format. The clinical measurements contained within these reports are highly valuable for precision medicine research, drug discovery, and cohort studies in the cardiac domain.
The goal was to enable automated extraction of information from these reports to answer clinical questions in a timely manner, support clinical decision-making, and enrich the hospital's structured data assets. Successfully extracting patient identifiers from these reports would allow linking this newly structured data with GOSH's existing structured data holdings, which are described as one of the largest such data assets available in the UK for research and analysis.
A significant operational challenge was the lack of strong infrastructure to process large volumes of data, combined with strict NHS information governance requirements. The team developed a solution called GRID, consisting of two Linux servers with distinct roles:
This architecture enabled the team to develop and iterate on their NLP pipelines while maintaining compliance with NHS data governance requirements. The presenters emphasized that GRID was crucial for the project, reducing processing time from days on a work laptop to just a few hours.
Given the heterogeneous nature of the CMR reports spanning a decade, traditional rule-based NLP techniques were not viable because they could not accommodate all the variations and changes in report formats over time. The team turned to large language models as question-answering tools, leveraging their ability to “absorb the heterogeneity of the data without much development overhead.”
A key constraint was the lack of GPU resources, forcing the team to work with smaller LLMs that could run on CPU. They framed the extraction task as question-answering, providing the model with a context (an unstructured chunk of text from a report) and asking questions like “What is the patient name?”
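The question-answering framing can be sketched as follows. This is a minimal illustration rather than GOSH's actual pipeline: `chunk_report` and `build_qa_inputs` are hypothetical helpers, and in a real system each (question, context) pair would be fed to a CPU-hosted extractive QA model.

```python
# Sketch: framing entity extraction as question answering over report chunks.
# Hypothetical helpers; a real pipeline would pass each (question, context)
# pair to a small extractive QA model running on CPU.

ENTITY_QUESTIONS = {
    "patient_name": "What is the patient name?",
    "mrn": "What is the medical record number (MRN)?",
    "nhs_number": "What is the NHS number?",
}

def chunk_report(text, size=400, overlap=50):
    """Split a report into overlapping character chunks so an answer that
    straddles a chunk boundary is not lost."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_qa_inputs(report_text):
    """Pair every chunk (the QA context) with every entity question."""
    return [
        {"entity": entity, "question": question, "context": chunk}
        for chunk in chunk_report(report_text)
        for entity, question in ENTITY_QUESTIONS.items()
    ]
```

Each chunk is queried once per target entity, so answers can later be de-duplicated and reconciled across chunks of the same report.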
The team evaluated several models for entity extraction, including Flan-T5 (including its Small variant), RoBERTa, and BERT.
A particularly interesting finding emerged from their experiments with prompt engineering. By adding contextual prompts that explained the nature of the document (e.g., “This is a part from an unstructured text in a cardiac MRI report in the GOSH hospital”), the team observed significant improvements in entity extraction performance for names and reference names in Flan-T5 and RoBERTa models.
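The mechanics of this are simple: the document-description prompt is prepended to the context before it is handed to the model. A minimal sketch, noting that the prompt wording below paraphrases the example quoted in the talk rather than the exact production prompt:

```python
# Sketch: prepend a document-description prompt to the QA context.
# The prompt text paraphrases the example given in the presentation;
# the exact production wording is an assumption.

CONTEXT_PROMPT = (
    "This is a part from an unstructured text in a cardiac MRI report "
    "in the GOSH hospital. "
)

def with_context_prompt(context, use_prompt=True):
    """Return the context as presented to the model, optionally prefixed
    with a description of the document's nature."""
    return (CONTEXT_PROMPT + context) if use_prompt else context
```

Toggling `use_prompt` makes it easy to run the prompted and unprompted conditions side by side, which is how the performance differences described above would be measured.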
However, they discovered an unexpected result with BERT: performance on extracting MRN (Medical Record Numbers) and NHS numbers actually dropped when prompts were added. The team speculated that BERT may struggle with extracting very large numbers (which NHS numbers and MRNs typically are), based on literature suggesting this limitation. Why adding prompts specifically worsened this performance remains an open research question.
The team used integrated gradients to analyze attention and attribution scores within the models. Their visualization showed that when context prompts were provided, the model’s ability to correctly identify patient names increased substantially. Without prompts, models were confused when multiple names appeared in the text (patient name, clinician name, hospital name), as they lacked context to prioritize which name was being requested. This interpretability analysis reinforced the importance of proper prompt engineering for domain-specific extraction tasks.
The second major component of the pipeline addressed extracting structured measurements from tables within the PDF reports. Initial attempts with rule-based table extraction failed to scale across the 10-year dataset due to data heterogeneity and the problem of incorrectly captured information being classified as tabular data.
The team observed that while there was variation in the data, tables followed consistent patterns: a header row followed by value rows containing measurements. They needed to classify each row as either a header row, a value row, or incorrectly captured information. However, they had limited annotated examples to work with.
The team adopted SetFit (Sentence Transformer Fine-tuning), a few-shot learning approach that first fine-tunes a sentence-transformer embedding model with contrastive learning on pairs built from a small number of labeled examples, then trains a lightweight classification head on the resulting embeddings.
For implementation details:
The team visualized embedding spaces before and after fine-tuning, showing clear separation between non-relevant/incorrect information, header rows, and value rows after the SetFit process. This visualization demonstrated how the contrastive learning approach successfully organized the representation space for effective classification.
Results showed a significant performance boost with the SetFit-based classifier compared to rule-based approaches, achieving strong F1 scores on data randomly sampled across the full 10-year dataset to ensure the approach handled heterogeneity.
The team outlined several areas for future work:
This case study illustrates several practical LLMOps challenges and solutions in a healthcare context:
Infrastructure Constraints: The need to balance development agility (internet access for packages and models) with data governance (patient data isolation) led to the GRID architecture—a practical pattern for regulated industries.
Compute Limitations: Without GPU access, the team was forced to work with smaller models that could run on CPU, demonstrating that production LLM deployments must work within available compute budgets.
Model Selection Through Evaluation: The team conducted a systematic evaluation of multiple models, discovering that seemingly adequate smaller variants (like Flan-T5 Small) can have serious issues such as hallucination that make them unsuitable for production use.
Prompt Engineering as a Production Concern: The significant performance improvements from proper prompt design, along with unexpected degradation in some cases (BERT on large numbers), highlights that prompt engineering requires careful evaluation and cannot be assumed to universally improve performance.
Hybrid Approaches: The combination of LLM-based question-answering for entity extraction with few-shot learning (SetFit) for table classification shows that production NLP systems often benefit from selecting the right technique for each sub-problem rather than using a single approach throughout.
Interpretability: The use of integrated gradients to understand model behavior reflects good practice for production ML systems, especially in healthcare where understanding model decisions has clinical and regulatory implications.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Martin Der, a data scientist at Xomnia, presents practical approaches to GenAI governance addressing the challenge that only 5% of GenAI projects deliver immediate ROI. The talk focuses on three key pillars: access and control (enabling self-service prototyping through tools like Open WebUI while avoiding shadow AI), unstructured data quality (detecting contradictions and redundancies in knowledge bases through similarity search and LLM-based validation), and LLMOps monitoring (implementing tracing platforms like LangFuse and creating dynamic golden datasets for continuous testing). The solutions include deploying Chrome extensions for workflow integration, API gateways for centralized policy enforcement, and developing a knowledge agent called "Genie" for internal use cases across telecom, healthcare, logistics, and maritime industries.
Reducto has built a production document parsing system that processes over 1 billion documents by combining specialized vision-language models, traditional OCR, and layout detection models in a hybrid pipeline. The system addresses critical challenges in document parsing including hallucinations from frontier models, dense tables, handwritten forms, and complex charts. Their approach uses a divide-and-conquer strategy where different models are routed to different document regions based on complexity, achieving higher accuracy than AWS Textract, Microsoft Azure Document Intelligence, and Google Cloud OCR on their internal benchmarks. The company has expanded beyond parsing to offer extraction with pixel-level citations and an edit endpoint for automated form filling.