This case study from Thumbtack, a home services marketplace platform, presents a comprehensive example of taking LLMs from experimentation to production for content moderation. The company faced the challenge of reviewing messages between customers and service professionals to identify policy violations, ranging from obvious issues like abusive language to more subtle violations like job seeking or partnership requests.
## Initial System and Challenge
Before implementing LLMs, Thumbtack used a two-part system consisting of:
* A rule-based engine for detecting obvious violations
* A CNN-based ML model for more complex cases
While this system worked for straightforward cases, it struggled with nuanced language, sarcasm, and implied threats. This limitation led them to explore LLM solutions.
## Experimental Phase
The team's approach to implementing LLMs was methodical and data-driven. They first experimented with prompt engineering on off-the-shelf models, testing against a dataset of 1,000 sample messages (90% legitimate, 10% suspicious). Despite careful crafting of prompts that included detailed service criteria and guidelines, this approach achieved an AUC of only 0.56, barely better than the 0.5 of a random classifier, and was deemed insufficient for production use.
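To make the experiment concrete, here is a minimal sketch of how such a prompt-engineering baseline might be evaluated, assuming an OpenAI-style chat API and scikit-learn for scoring. The policy prompt, model name, and toy evaluation data are illustrative placeholders, not Thumbtack's actual prompt or setup.

```python
# Hedged sketch of a prompt-engineering baseline with AUC evaluation.
# Assumes the `openai` (>=1.0) and `scikit-learn` packages and an
# OPENAI_API_KEY in the environment; everything below is illustrative.
from openai import OpenAI
from sklearn.metrics import roc_auc_score

client = OpenAI()

POLICY_PROMPT = """You are a content moderator for a home services marketplace.
Review the message below for policy violations such as abusive language,
job seeking, or partnership requests. Reply with a single number between
0 and 1: the probability that the message violates policy. Output only the number.

Message: {message}
"""

def score_message(message: str) -> float:
    """Ask the LLM for a violation probability for one message."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the case study's model is not named
        messages=[{"role": "user", "content": POLICY_PROMPT.format(message=message)}],
        temperature=0,
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # treat unparseable output as "not suspicious"

# Toy stand-in for the 1,000-message eval set (1 = suspicious, 0 = legitimate).
messages = ["Can you paint my fence next week?", "Quit this app and work for me directly."]
labels = [0, 1]
scores = [score_message(m) for m in messages]
print("AUC:", roc_auc_score(labels, scores))
```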
This led to their second experiment with fine-tuning LLMs. The results here were much more promising. Even with just a few thousand training samples, they saw significant improvements. Upon expanding the dataset to tens of thousands of samples, they achieved an impressive AUC of 0.93, making the model suitable for production deployment.
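The write-up does not say which provider or data format the fine-tuning used. As one plausible shape, the sketch below converts labeled messages into the chat-style JSONL format expected by hosted fine-tuning APIs such as OpenAI's; the system prompt, labels, and sample data are assumptions.

```python
# Hedged sketch: turn labeled moderation data into fine-tuning JSONL.
# Field names follow OpenAI's chat fine-tuning convention; the records
# themselves are invented examples.
import json

def to_finetune_record(message: str, is_violation: bool) -> dict:
    return {
        "messages": [
            {"role": "system", "content": "Classify marketplace messages for policy violations."},
            {"role": "user", "content": message},
            {"role": "assistant", "content": "violation" if is_violation else "ok"},
        ]
    }

# Toy stand-in for the tens of thousands of real labeled samples.
labeled_training_data = [
    ("Can you repaint my kitchen this month?", False),
    ("Ditch this platform and partner with my agency instead.", True),
]

with open("moderation_train.jsonl", "w") as f:
    for message, label in labeled_training_data:
        f.write(json.dumps(to_finetune_record(message, label)) + "\n")
```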
## Production Implementation
The production implementation addressed two critical challenges:
### 1. Infrastructure and Integration
They built a centralized LLM service using LangChain, which proved to be a crucial architectural decision; a minimal sketch of such a service follows the list below. This approach:
* Enabled consistent LLM usage across teams
* Provided a scalable infrastructure for future LLM deployments
* Streamlined the integration process
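The case study names LangChain but does not describe the service's internals. As a rough illustration of what a centralized wrapper could look like, the sketch below exposes one shared entry point that any team can call with its own prompt template; the `LLMService` class, model choice, and example prompt are assumptions, not Thumbtack's code.

```python
# Hedged sketch of a centralized LLM service built on LangChain.
# Assumes the `langchain-openai` and `langchain-core` packages and an
# OPENAI_API_KEY in the environment.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

class LLMService:
    """One shared entry point so every team invokes LLMs the same way."""

    def __init__(self, model: str = "gpt-4o-mini"):  # placeholder model name
        self.llm = ChatOpenAI(model=model, temperature=0)

    def run(self, template: str, **variables) -> str:
        prompt = ChatPromptTemplate.from_template(template)
        chain = prompt | self.llm  # LangChain's runnable composition
        return chain.invoke(variables).content

# Any team reuses the same service with its own prompt template.
service = LLMService()
verdict = service.run(
    "Does this message violate marketplace policy? Answer yes or no.\n\n{message}",
    message="Hey, want to come work for my company instead?",
)
print(verdict)
```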
### 2. Cost Optimization
The team developed a clever two-tier approach to manage costs (a rough cost model follows the list):
* Repurposed their existing CNN model as a pre-filter
* Only about 20% of messages (those flagged as potentially suspicious) are processed by the more expensive LLM
* This significantly reduced computational costs while maintaining high accuracy
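A quick back-of-the-envelope calculation shows why routing only ~20% of traffic to the LLM matters. The per-message costs below are invented for illustration; only the 20% fraction comes from the case study.

```python
# Illustrative cost model for the two-tier design; all prices are assumptions.
CNN_COST_PER_MSG = 0.00001  # assumed: cheap, already-amortized CNN inference
LLM_COST_PER_MSG = 0.002    # assumed: hosted LLM call
LLM_FRACTION = 0.20         # share of messages the CNN flags as suspicious

llm_only = LLM_COST_PER_MSG
two_tier = CNN_COST_PER_MSG + LLM_FRACTION * LLM_COST_PER_MSG

print(f"LLM-only: ${llm_only:.5f}/message")
print(f"Two-tier: ${two_tier:.5f}/message")
print(f"Savings:  {1 - two_tier / llm_only:.0%}")  # ~80% under these assumptions
```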
## Technical Architecture
The production system's workflow is particularly noteworthy for its efficiency (sketched in code after this list):
* Messages first pass through the CNN-based pre-filter
* Only suspicious messages are routed to the LLM for detailed analysis
* Results are aggregated and suspicious messages are sent for manual review
* Clean messages are immediately delivered to professionals
* The system maintains high accuracy while optimizing resource usage
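Putting the pieces together, the routing logic might look like the skeleton below. The function names and threshold are hypothetical; only the flow itself (CNN pre-filter, LLM analysis for flagged messages, manual review for suspicious results, immediate delivery otherwise) is taken from the case study.

```python
# Hedged skeleton of the production message flow; `cnn_prefilter`,
# `llm_classify`, `enqueue_for_manual_review`, `deliver_to_professional`,
# and the threshold are all hypothetical stand-ins.
SUSPICION_THRESHOLD = 0.5

def moderate(message: str) -> None:
    score = cnn_prefilter(message)           # cheap first pass over every message
    if score < SUSPICION_THRESHOLD:
        deliver_to_professional(message)     # clean messages flow through immediately
        return
    verdict = llm_classify(message)          # detailed analysis on the ~20% flagged
    if verdict == "suspicious":
        enqueue_for_manual_review(message)   # humans make the final call
    else:
        deliver_to_professional(message)
```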
## Results and Performance
The production results have been impressive:
* The system has successfully processed tens of millions of messages
* Precision improved by 3.7x compared to the previous system
* Recall improved by 1.5x (an illustration of these lifts follows the list)
* These improvements significantly enhanced the platform's trust and safety measures
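Because the case study reports relative lifts rather than absolute numbers, the illustration below uses made-up confusion-matrix counts that happen to reproduce the stated multipliers, just to show what a 3.7x precision and 1.5x recall improvement means in practice.

```python
# Made-up counts chosen only to reproduce the reported lifts; the real
# baseline and production numbers are not public.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

old_p, old_r = precision(tp=100, fp=400), recall(tp=100, fn=200)  # 0.20, 0.33
new_p, new_r = precision(tp=150, fp=53), recall(tp=150, fn=150)   # 0.74, 0.50

print(f"precision lift: {new_p / old_p:.1f}x")  # ~3.7x
print(f"recall lift:    {new_r / old_r:.1f}x")  # 1.5x
```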
## Technical Lessons and Best Practices
Several key learnings emerge from this case study:
* Prompt engineering alone may not be sufficient for specialized tasks
* Fine-tuning can dramatically improve performance for domain-specific applications
* A hybrid approach combining traditional ML models with LLMs can be both cost-effective and high-performing
* Using frameworks like LangChain can significantly simplify production deployment
* Cost considerations should be built into the architecture from the start
## Infrastructure Considerations
The case study highlights important infrastructure decisions:
* The importance of building centralized services for LLM integration
* The value of frameworks like LangChain for managing LLM deployments
* The need to balance model performance with computational costs
* The benefits of maintaining and repurposing existing ML infrastructure
This implementation demonstrates a mature approach to LLMOps, showing how to effectively move from experimentation to production while considering practical constraints like costs and scalability. The team's approach to testing, deployment, and optimization provides valuable insights for other organizations looking to implement LLMs in production systems.
The success of this project also highlights the importance of cross-functional collaboration, with contributions from various teams including Risk Analytics, Product Management, Machine Learning Infrastructure, and others. This collaborative approach was crucial in successfully deploying and scaling the LLM solution.