This case study examines Discord's journey in deploying Clyde AI, a large-scale chatbot implementation that reached over 200 million users. The presentation, given by a former team lead who worked on both the developer platform and LLM products teams, provides valuable insights into the challenges and solutions in deploying LLMs at scale with a particular focus on safety and evaluation practices.
The primary challenge wasn't in the model development or fine-tuning, but rather in ensuring safety and preventing harmful outputs. The team faced significant hurdles in preventing the system from generating dangerous content (like bomb-making instructions) or engaging in harmful behaviors. This was particularly challenging given Discord's young user base and the tendency of some users to actively try to break or exploit the system.
The team identified that the major launch blockers were typically related to security, legal, safety, and policy concerns rather than technical issues. This led to the development of a comprehensive evaluation framework that could quantify risks ahead of time and satisfy stakeholders' concerns.
Discord's approach to evaluations (evals) was notably practical and developer-focused. They treated evals as unit tests, emphasizing:
The team developed PromptFu, an open-source CLI tool for evals, which features declarative configs and supports developer-first evaluation practices. Every pull request required an accompanying eval, creating a culture of continuous testing and evaluation.
A particularly interesting example of their practical approach was their solution for maintaining a casual chat personality. Instead of complex LLM graders or sophisticated metrics, they simply checked if responses began with lowercase letters - a simple heuristic that achieved 80% of their goals with minimal effort.
The team implemented several innovative technical solutions:
Discord's approach to safety was particularly comprehensive, given their unique challenges with a young, technically savvy user base prone to testing system limits. They developed a two-pronged approach:
Pre-deployment safeguards:
Live filtering and monitoring
The team created sophisticated red teaming approaches, including using unaligned models to generate toxic inputs and developing application-specific jailbreak testing. They documented various attack vectors, including the "grandma jailbreak" incident, which helped improve their safety measures.
The observability strategy focused on practical integration with existing tools, particularly DataDog. While they weren't able to implement a complete feedback loop for privacy reasons, they did:
The case study honestly addresses several limitations and challenges:
The case study emphasizes several important lessons for LLMOps at scale:
This case study is particularly valuable as it provides real-world insights into deploying LLMs at scale while maintaining safety and quality standards. Discord's approach demonstrates that successful LLMOps isn't just about sophisticated technical solutions, but about building practical, maintainable systems that can be effectively monitored and improved over time.