Company
Appen
Title
Human-AI Co-Annotation System for Efficient Data Labeling
Industry
Tech
Year
2024
Summary (short)
Appen developed a hybrid approach combining LLMs with human annotators to address the growing challenges in data annotation for AI models. They implemented a co-annotation engine that uses model uncertainty metrics to efficiently route annotation tasks between LLMs and human annotators. Using GPT-3.5 Turbo for initial annotations and entropy-based confidence scoring, they achieved 87% accuracy while reducing costs by 62% and annotation time by 63% compared to purely human annotation, demonstrating an effective balance between automation and human expertise.
This case study from Appen demonstrates a sophisticated approach to integrating LLMs into production data annotation workflows, highlighting both the opportunities and challenges of combining artificial and human intelligence in practical applications.

The context of this implementation is particularly important: Appen's research showed a 177% increase in generative AI adoption in 2024, yet simultaneously observed an 8% drop in projects making it to deployment since 2021. Their analysis identified data management as a key bottleneck, with a 10 percentage point increase in data-related challenges from 2023 to 2024. This speaks to the broader industry challenge of scaling high-quality data annotation for AI systems.

The technical architecture of their co-annotation system consists of several key components:

* A co-annotation engine that intelligently routes work between LLMs and human annotators
* An uncertainty calculation system using GPT-3.5 Turbo with multiple prompt variations to assess confidence
* A flexible threshold system for balancing accuracy vs. cost
* Integration with their AI data platform (ADAP) through a feature called Model Mate

The system's workflow is particularly noteworthy from an LLMOps perspective:

1. Initial data processing through LLMs (primarily GPT-3.5 Turbo)
2. Uncertainty calculation using multiple prompt variations
3. Automated routing based on entropy/uncertainty thresholds
4. Human review for high-uncertainty cases
5. Quality sampling of high-confidence cases
6. Feedback loop for continuous improvement

One of the most interesting aspects of their implementation is the uncertainty measurement approach. Rather than relying on the model's self-reported confidence scores (which they found to be unreliable), they generate multiple annotations using different prompt variations and measure the consistency of the outputs. This provides a more robust measure of model uncertainty and helps determine which items need human review.
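The case study does not share Appen's actual code, but the consistency-based uncertainty idea is straightforward to sketch. The snippet below is a minimal illustration rather than Appen's implementation: the sentiment task, the prompt variations, the `annotate_with_llm` helper, and the 0.5 entropy threshold are all assumptions made for the example, and the call goes through the standard OpenAI Python client as a stand-in for whatever integration the production system uses.

```python
import math
from collections import Counter

from openai import OpenAI  # assumes the openai>=1.0 client and an OPENAI_API_KEY in the environment

client = OpenAI()

# Hypothetical prompt variations for a sentiment task -- the prompts Appen uses are not public.
PROMPT_VARIATIONS = [
    "Classify the sentiment of this product review as positive, negative, or neutral:\n\n{text}",
    "What is the sentiment (positive, negative, or neutral) of the review below? Answer with one word.\n\n{text}",
    "Read this review and reply with exactly one word -- positive, negative, or neutral:\n\n{text}",
]

def annotate_with_llm(prompt: str) -> str:
    """Get a single label from GPT-3.5 Turbo for one prompt variation."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

def label_entropy(labels: list[str]) -> float:
    """Shannon entropy of the label distribution across prompt variations.
    0.0 means every variation agreed; higher values mean more disagreement."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def co_annotate(item_text: str, entropy_threshold: float = 0.5) -> dict:
    """Label one item with several prompt variations, then route it:
    keep the majority label when the variations agree, otherwise
    queue the item for human review."""
    labels = [annotate_with_llm(p.format(text=item_text)) for p in PROMPT_VARIATIONS]
    uncertainty = label_entropy(labels)
    if uncertainty <= entropy_threshold:
        majority_label, _ = Counter(labels).most_common(1)[0]
        return {"label": majority_label, "uncertainty": uncertainty, "route": "auto_accept"}
    return {"label": None, "uncertainty": uncertainty, "route": "human_review"}
```

Items routed to `human_review` go to annotators, a small random sample of `auto_accept` items can still be re-checked by humans, and lowering the entropy threshold shifts the balance toward accuracy at the expense of cost.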
The system demonstrated impressive results in production:

* 87% accuracy achieved with the hybrid approach (compared to 95% with pure human annotation)
* 62% cost reduction ($450 to $169 per thousand items)
* 63% reduction in labor time (150 hours to 55 hours)
* 8 seconds per item for LLM processing vs. 180 seconds for human annotation

Their Model Mate feature implementation shows careful consideration of production requirements, including:

* Support for multiple LLMs in the same workflow
* Real-time integration within existing task designs
* Flexibility to handle various data types (text, image, audio, video, geospatial)
* Built-in monitoring and validation capabilities
* Support for custom routing rules and multi-stage reviews

A particularly interesting production case study involved a leading US electronics company seeking to enhance search relevance data accuracy. The implementation used GPT-4 for multimodal analysis of search queries, product titles, and images. Key findings included:

* A 3-4 percentage point accuracy increase across different components
* 94% accuracy when combining LLM assistance with human annotation (up from 90%)
* Importantly, incorrect LLM suggestions did not negatively impact human accuracy

From an LLMOps perspective, several best practices emerge from this implementation:

* Use of multiple prompt variations for robust uncertainty estimation
* Implementation of flexible thresholds that can be adjusted based on accuracy/cost requirements
* Integration of human expertise at strategic points in the workflow
* Regular sampling of high-confidence predictions to ensure quality
* Support for multimodal inputs and various data types
* Built-in monitoring and evaluation capabilities

The system also addresses common LLM challenges in production:

* Hallucination mitigation through human verification
* Bias protection through diverse human annotator pools
* Quality control through strategic sampling
* Cost management through intelligent routing

The implementation demonstrates a sophisticated understanding of both the capabilities and limitations of LLMs in production. Rather than attempting to fully automate annotation, Appen has created a system that leverages the strengths of both AI and human annotators while mitigating their respective weaknesses. This approach appears particularly well-suited to scaling annotation operations while maintaining quality standards.

Looking forward, the system appears designed to accommodate future improvements in LLM technology while maintaining its core benefits. The flexible architecture allows for easy integration of new models and adjustment of routing thresholds as model capabilities evolve.
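To make that threshold lever concrete, here is a rough sketch of how routing volume and audit sampling drive the blended cost per thousand items. The LLM per-item rate and the 5% audit rate are illustrative assumptions rather than Appen's figures; the $0.45 human rate simply mirrors the $450 per thousand items quoted above for fully human annotation.

```python
import random

def sample_for_audit(auto_accepted: list[dict], audit_rate: float = 0.05) -> list[dict]:
    """Pull a small random sample of high-confidence LLM labels for human QA,
    so quality drift in auto-accepted items is caught early."""
    if not auto_accepted:
        return []
    k = max(1, int(len(auto_accepted) * audit_rate))
    return random.sample(auto_accepted, k)

def blended_cost_per_thousand(human_fraction: float,
                              llm_cost_per_item: float = 0.01,
                              human_cost_per_item: float = 0.45,
                              audit_rate: float = 0.05) -> float:
    """Every item gets an LLM pass; a fraction is routed to humans, and a small
    audit sample of the remainder is also human-reviewed. Rates are placeholders."""
    human_reviewed = human_fraction + (1 - human_fraction) * audit_rate
    return 1000 * (llm_cost_per_item + human_reviewed * human_cost_per_item)

# Example: routing ~30% of items to humans with a 5% audit of the rest costs roughly
# 1000 * (0.01 + (0.3 + 0.7 * 0.05) * 0.45) ≈ $161 per thousand items, versus $450
# for fully human annotation at the same assumed per-item rate.
```

Tightening the entropy threshold raises `human_fraction` (and cost) in exchange for accuracy; loosening it does the reverse, which is exactly the trade-off the flexible threshold system is meant to expose.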
