This case study from Thumbtack, a home services marketplace platform, presents a comprehensive example of taking LLMs from experimentation to production for content moderation. The company faced the challenge of reviewing messages between customers and service professionals to identify policy violations, ranging from obvious issues like abusive language to more subtle violations like job seeking or partnership requests.
Before implementing LLMs, Thumbtack relied on a two-part moderation system.
While this system worked for straightforward cases, it struggled with nuanced language, sarcasm, and implied threats. This limitation led them to explore LLM solutions.
The team's approach to implementing LLMs was methodical and data-driven. They first experimented with prompt engineering on off-the-shelf models, testing against a dataset of 1,000 sample messages (90% legitimate, 10% suspicious). Despite careful crafting of prompts that included detailed service criteria and guidelines, this approach achieved an AUC of only 0.56, which was deemed insufficient for production use.
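An offline evaluation like the one described can be sketched as follows. This is a minimal illustration, not Thumbtack's harness: the labels and scores are toy values standing in for the model's per-message suspicion probabilities, and the AUC is computed with a simple rank-based formula.

```python
def auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    (suspicious) message is scored higher than a randomly chosen
    negative (legitimate) one. Ties count as half a win."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("evaluation set needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labeled sample: 1 = suspicious, 0 = legitimate.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical model suspicion scores
print(auc(labels, scores))       # → 0.75
```

An AUC of 0.5 corresponds to random ranking, which is why the reported 0.56 from prompt engineering was barely better than chance and the fine-tuned model's 0.93 marked a qualitative jump.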
This led to their second experiment with fine-tuning LLMs. The results here were much more promising. Even with just a few thousand training samples, they saw significant improvements. Upon expanding the dataset to tens of thousands of samples, they achieved an impressive AUC of 0.93, making the model suitable for production deployment.
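Supervised fine-tuning of this kind starts from labeled (message, label) pairs serialized into a training file. The sketch below shows one common shape for such data, a prompt/completion JSONL format; the field names and prompt wording are illustrative assumptions, not Thumbtack's actual schema.

```python
import json

def to_finetune_jsonl(examples):
    """Serialize (message, is_violation) pairs into JSONL records for
    supervised fine-tuning. Field names here are hypothetical."""
    lines = []
    for message, is_violation in examples:
        record = {
            "prompt": f"Classify this marketplace message: {message}",
            "completion": "suspicious" if is_violation else "legitimate",
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

# Toy examples mirroring the two classes in the evaluation set.
sample = [
    ("Can you repaint my fence next week?", 0),
    ("I'm looking for a full-time job at your company", 1),
]
print(to_finetune_jsonl(sample))
```

The case study's trajectory, from a few thousand samples to tens of thousands, suggests the main lever was simply accumulating more labeled pairs in a file like this.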
The production implementation had to address two critical challenges: cost and scalability.
They built a centralized LLM service using LangChain, which proved to be a crucial architectural decision.
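The value of centralizing LLM access can be pictured as a single gateway that hides model and provider details behind one interface, so product teams call moderation without knowing which backend serves it. The class and method names below are hypothetical; Thumbtack's actual service was built on LangChain rather than this hand-rolled registry.

```python
from typing import Callable, Dict, Optional

class LLMGateway:
    """Hypothetical centralized LLM service: one entry point, pluggable
    backends, so callers never depend on a specific model or provider."""

    def __init__(self) -> None:
        self._backends: Dict[str, Callable[[str], str]] = {}
        self._default: Optional[str] = None

    def register(self, name: str, backend: Callable[[str], str]) -> None:
        # First registered backend becomes the default.
        self._backends[name] = backend
        if self._default is None:
            self._default = name

    def moderate(self, message: str, backend: Optional[str] = None) -> str:
        fn = self._backends[backend or self._default]
        return fn(message)

# Stub backend standing in for the fine-tuned moderation model.
gateway = LLMGateway()
gateway.register(
    "finetuned-moderator",
    lambda msg: "suspicious" if "job" in msg.lower() else "legitimate",
)
print(gateway.moderate("Any full-time job openings?"))  # → suspicious
```

The design choice this illustrates is swappability: upgrading the fine-tuned model, or adding a second provider, touches only the registry, not every calling team.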
The team developed a clever two-tier approach to manage costs.
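The case study does not spell out the two tiers, but a common pattern for this kind of cost control (assumed here purely for illustration) is a cheap first-pass scorer that settles clear-cut messages itself and forwards only ambiguous ones to the expensive LLM. The thresholds, scorer, and class below are all hypothetical.

```python
class TwoTierModerator:
    """Hedged sketch of a cost-managing two-tier pipeline (the exact tiers
    are not detailed in the case study): a cheap heuristic handles
    clear-cut messages, and only ambiguous ones incur an LLM call."""

    def __init__(self, cheap_score, llm_classify, low=0.1, high=0.9):
        self.cheap_score = cheap_score    # fast, inexpensive scorer in [0, 1]
        self.llm_classify = llm_classify  # expensive LLM-backed classifier
        self.low, self.high = low, high
        self.llm_calls = 0                # track how often we pay for the LLM

    def moderate(self, message):
        score = self.cheap_score(message)
        if score <= self.low:
            return "legitimate"           # confidently clean, no LLM cost
        if score >= self.high:
            return "suspicious"           # confidently bad, no LLM cost
        self.llm_calls += 1               # only ambiguous messages cost money
        return self.llm_classify(message)

# Toy first-pass scorer and a stub LLM classifier.
cheap = lambda m: 1.0 if "abuse" in m else (0.5 if "job" in m else 0.0)
moderator = TwoTierModerator(cheap, lambda m: "suspicious")
```

Under this pattern, the LLM bill scales with the ambiguous fraction of traffic rather than with total message volume.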
The production system's workflow is particularly noteworthy for its efficiency.
The production results have been strong. Beyond the headline metrics, the case study surfaces key learnings about both the modeling approach and the infrastructure decisions that supported it.
This implementation demonstrates a mature approach to LLMOps, showing how to effectively move from experimentation to production while considering practical constraints like costs and scalability. The team's approach to testing, deployment, and optimization provides valuable insights for other organizations looking to implement LLMs in production systems.
The success of this project also highlights the importance of cross-functional collaboration, with contributions from various teams including Risk Analytics, Product Management, Machine Learning Infrastructure, and others. This collaborative approach was crucial in successfully deploying and scaling the LLM solution.