Dandelion Health developed a sophisticated de-identification pipeline for processing sensitive patient healthcare data while maintaining HIPAA compliance. The solution combines John Snow Labs' Healthcare NLP with custom pre- and post-processing steps to identify and transform protected health information (PHI) in free-text patient notes. Their approach includes risk categorization by medical specialty, context-aware processing, and innovative "hiding in plain sight" techniques to achieve high-quality de-identification while preserving data utility for medical research.
Dandelion Health is tackling one of the most challenging aspects of healthcare data processing: de-identifying free-text patient notes while maintaining their utility for medical research and AI training. This case study demonstrates a sophisticated approach to implementing NLP in a highly regulated production environment where privacy concerns are paramount.
The company's mission centers on providing safe, ethical access to curated clinical data, building what they describe as the world's largest AI-ready training and validation dataset. Their solution to the de-identification challenge showcases several important aspects of production ML/NLP systems, explored in the sections that follow.
## System Architecture and Security
The system is implemented within a zero-trust AWS environment, with each hospital partner having a dedicated AWS account. Security is paramount:
* All processing occurs within AWS private networks
* No internet access is allowed for systems handling raw data
* Services are either AWS-native or air-gapped containerized solutions running on ECS
* Processing is distributed across SQS queues and Lambda functions for scalability (a minimal handler sketch follows this list)
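To make the queue-and-worker pattern concrete, here is a minimal sketch of an SQS-triggered Lambda handler. It is illustrative only: the message schema, the `process_note` function, and the idea that notes are staged in S3 are assumptions, not details confirmed in the presentation.

```python
import json

import boto3

# In the real deployment all traffic would stay on private VPC endpoints.
s3 = boto3.client("s3")


def process_note(note_text: str, modality: str) -> None:
    """Placeholder for the downstream de-identification step (assumed interface)."""
    ...


def handler(event, context):
    """Entry point for an SQS-triggered de-identification worker.

    Each SQS message is assumed to carry the S3 location of one raw note
    plus a modality label; the actual message format is not public.
    """
    for record in event.get("Records", []):
        message = json.loads(record["body"])
        obj = s3.get_object(Bucket=message["bucket"], Key=message["key"])
        note_text = obj["Body"].read().decode("utf-8")
        process_note(note_text, modality=message.get("modality", "unknown"))
```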
## The De-identification Pipeline
The core NLP pipeline is built around John Snow Labs' Healthcare NLP solution, but what makes this case particularly interesting from an LLMOps perspective is the sophisticated wrapper they've built around it to handle real-world complexities. The pipeline includes:
### Pre-processing Stage
* Modality categorization (e.g., radiology, pathology, progress notes)
* Risk level assessment for each category
* Custom preprocessing for known document structures
* Special handling for embedded tables and metadata
* Extraction and separate processing of headers/footers
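A minimal sketch of what such a pre-processing step might look like is shown below. The modality labels, risk tiers, and the dash-line header/footer convention are all assumptions made for illustration; the actual categorization logic is Dandelion's own.

```python
import re
from dataclasses import dataclass

# Illustrative mapping only; the real specialty and risk taxonomy is Dandelion's own.
RISK_BY_MODALITY = {
    "radiology": "low",
    "pathology": "medium",
    "progress_note": "high",
}


@dataclass
class PreprocessedNote:
    modality: str
    risk_level: str
    header: str
    body: str
    footer: str


def preprocess(raw_text: str, modality: str) -> PreprocessedNote:
    """Attach a risk tier and split header/footer from the note body.

    Assumes headers and footers are delimited by lines of dashes, which is
    just one plausible convention for structured hospital documents.
    """
    parts = re.split(r"^-{5,}\s*$", raw_text, flags=re.MULTILINE)
    if len(parts) >= 3:
        header, body, footer = parts[0], "\n".join(parts[1:-1]), parts[-1]
    else:
        header, body, footer = "", raw_text, ""
    return PreprocessedNote(
        modality=modality,
        risk_level=RISK_BY_MODALITY.get(modality, "high"),  # unknown types default to high risk
        header=header.strip(),
        body=body.strip(),
        footer=footer.strip(),
    )
```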
### Main Processing Stage
* Application of Healthcare NLP for PHI detection
* Context-aware processing based on document type
* Targeted rules for edge cases, such as obstetrics notes where a phrase like "30 weeks" refers to gestational age and must not be treated as an identifying date (see the sketch after this list)
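One way to picture the context-aware handling is as a rule-based filter applied to the entities returned by the PHI detector. The sketch below assumes a simple (text, label, start, end) representation of detected entities and uses a single gestational-age rule as a stand-in for the specialty-specific logic described above.

```python
import re
from typing import NamedTuple


class Entity(NamedTuple):
    text: str
    label: str   # e.g. "DATE", "AGE", "NAME"
    start: int
    end: int


GESTATIONAL_AGE = re.compile(r"^\d{1,2}\s*weeks?$", re.IGNORECASE)


def apply_specialty_rules(entities: list[Entity], modality: str) -> list[Entity]:
    """Drop detections that are clinically meaningful rather than identifying.

    In obstetrics notes a phrase like "30 weeks" is gestational age, not a
    date or age identifier, so it stays in the text instead of being masked.
    The rule shown here is illustrative, not Dandelion's actual rule set.
    """
    kept = []
    for ent in entities:
        if (
            modality == "obstetrics"
            and ent.label in {"DATE", "AGE"}
            and GESTATIONAL_AGE.match(ent.text.strip())
        ):
            continue  # keep gestational age untouched in the note
        kept.append(ent)
    return kept
```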
### Post-processing Stage
* Implementation of "hiding in plain sight" (HIPS) technique
* Systematic replacement of detected PHI with similar but false information
* Careful management of date shifting to maintain temporal relationships
* Quality control and validation processes
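The date-shifting idea can be sketched as follows: each patient gets one fixed, secret offset so that every date in their record moves by the same amount and the intervals between events survive. The jitter window, salt handling, and hashing scheme here are assumptions for illustration, not Dandelion's actual policy.

```python
import hashlib
import random
from datetime import datetime, timedelta


def date_offset_days(patient_id: str, salt: str, max_jitter: int = 180) -> int:
    """Derive one fixed offset per patient so temporal relationships survive shifting."""
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode()).hexdigest()
    rng = random.Random(int(digest[:16], 16))
    return rng.randint(-max_jitter, max_jitter)


def shift_date(date_str: str, offset_days: int, fmt: str = "%Y-%m-%d") -> str:
    """Shift a single date string by the patient's offset."""
    shifted = datetime.strptime(date_str, fmt) + timedelta(days=offset_days)
    return shifted.strftime(fmt)


# Both dates move by the same offset, so the 14-day gap between them is preserved.
offset = date_offset_days("patient-123", salt="per-partner-secret")
print(shift_date("2021-03-01", offset), shift_date("2021-03-15", offset))
```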
## Quality Assurance and Evaluation
The system includes robust evaluation mechanisms:
* Manual review by clinically trained analysts
* Double-blind review process with third-party arbitration for disagreements
* Detailed recall metrics for different types of PHI
* Cross-checking of HIPS dictionaries and date jitters
* Validation of formatting consistency across years and subcategories
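In its simplest form, the recall reporting could be computed by comparing analyst-annotated PHI spans against the pipeline's detections, broken down by PHI type. The data shapes and the lenient overlap-based matching rule below are assumptions for illustration.

```python
from collections import defaultdict

# Each span is (start, end, phi_type), e.g. (120, 131, "NAME").
Span = tuple[int, int, str]


def recall_by_phi_type(gold: list[Span], predicted: list[Span]) -> dict[str, float]:
    """Fraction of analyst-annotated PHI spans that the pipeline found, per PHI type.

    A gold span counts as found if any predicted span of the same type overlaps
    it, which is a deliberately lenient matching rule for this sketch.
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for g_start, g_end, g_type in gold:
        total[g_type] += 1
        hit = any(
            p_type == g_type and p_start < g_end and g_start < p_end
            for p_start, p_end, p_type in predicted
        )
        if hit:
            found[g_type] += 1
    return {phi_type: found[phi_type] / total[phi_type] for phi_type in total}


# Toy example: one NAME span is missed, both DATE spans are found.
gold = [(10, 20, "NAME"), (42, 52, "DATE"), (80, 90, "DATE"), (120, 131, "NAME")]
pred = [(10, 20, "NAME"), (40, 52, "DATE"), (81, 90, "DATE")]
print(recall_by_phi_type(gold, pred))  # {'NAME': 0.5, 'DATE': 1.0}
```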
## Handling Edge Cases and Challenges
The case study reveals several interesting challenges in deploying NLP systems in healthcare:
* Dealing with PDF conversions and formatting artifacts
* Managing ASCII tables and non-standard data formats
* Handling ambiguous identifiers (e.g., CPT codes vs. ZIP codes; see the sketch after this list)
* Processing specialty-specific language patterns
* Maintaining temporal relationships while anonymizing dates
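The CPT-versus-ZIP ambiguity is a good example of why context matters: both are five-digit strings, so a bare pattern match cannot separate a billing code from an address component. The sketch below uses nearby cue words to decide; the cue lists and window size are illustrative assumptions, not the production logic.

```python
import re

FIVE_DIGITS = re.compile(r"\b\d{5}\b")
# Cue words are assumptions; real disambiguation would use richer context features.
ZIP_CUES = ("zip", "address", "street", "city", "state")
CPT_CUES = ("cpt", "procedure", "billing", "code")


def classify_five_digit_tokens(text: str, window: int = 15) -> list[tuple[str, str]]:
    """Label each five-digit token as 'zip', 'cpt', or 'unknown' from nearby words.

    ZIP cues are checked first so that ambiguity errs toward masking.
    """
    lowered = text.lower()
    results = []
    for match in FIVE_DIGITS.finditer(text):
        context = lowered[max(0, match.start() - window): match.end() + window]
        if any(cue in context for cue in ZIP_CUES):
            results.append((match.group(), "zip"))      # treated as PHI, must be masked
        elif any(cue in context for cue in CPT_CUES):
            results.append((match.group(), "cpt"))      # clinical code, safe to keep
        else:
            results.append((match.group(), "unknown"))  # escalate, or mask by default
    return results


text = "Patient lives at ZIP 10025; CPT code 99213 billed."
print(classify_five_digit_tokens(text))  # [('10025', 'zip'), ('99213', 'cpt')]
```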
## Risk Management and Trade-offs
The solution reflects deliberate risk management:
* Modality-specific risk assessment and handling
* Balanced approach to recall vs. precision
* Use of probability-based replacement strategies
* Careful handling of false positives so that redactions cannot be reverse engineered (see the sketch after this list)
* Multiple layers of validation and quality control
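The replacement and false-positive points go hand in hand: under hiding in plain sight, every flagged span is replaced with a realistic surrogate, including spans flagged in error, so a reader cannot tell which substitutions hide genuine PHI. A minimal sketch of that idea follows; the surrogate pools are invented, and a production system would seed the choice per patient (as in the date-shifting sketch above) to keep replacements consistent within a record.

```python
import random

# Invented surrogate pools; real HIPS dictionaries are large and curated.
SURROGATES = {
    "NAME": ["Alex Morgan", "Jamie Lee", "Sam Rivera"],
    "CITY": ["Springfield", "Riverton", "Lakeside"],
}


def replace_all_detections(text: str, entities: list[tuple[int, int, str]]) -> str:
    """Replace every flagged span, whether it is true PHI or a false positive.

    Because false positives receive the same realistic surrogates as real PHI,
    the output gives no clue about which substitutions mask genuine identifiers.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])
        pieces.append(random.choice(SURROGATES.get(label, ["[REDACTED]"])))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "Seen by Dr. Smith in Boston; follow-up with Smith advised."
ents = [(12, 17, "NAME"), (21, 27, "CITY"), (44, 49, "NAME")]
print(replace_all_detections(note, ents))
```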
## Production Deployment Considerations
The system shows careful attention to production requirements:
* Scalable processing using AWS services
* Automated pipeline with human-in-the-loop validation
* Comprehensive quality reporting for hospital partners
* Careful handling of data transformations to maintain utility
* Regular refresher training and updates for the review analysts
## Results and Impact
While the presentation shares few hard numbers, the system reportedly achieves:
* High recall for PHI detection (95%+ from the NLP pipeline, rising to an effective 99%+ once HIPS masking is applied)
* Successful processing of complex medical documents
* Maintenance of data utility for research purposes
* HIPAA compliance through expert determination method
* Successful deployment across multiple hospital systems
This case study is particularly valuable because it demonstrates how to wrap and enhance existing NLP tools with sophisticated pre- and post-processing to handle real-world complexities. It shows how to build production-grade systems that must balance multiple competing requirements: privacy, data utility, processing efficiency, and regulatory compliance. The attention to edge cases and the layered approach to risk management provide valuable lessons for anyone deploying NLP systems in highly regulated environments.