Nearpod, an edtech company, implemented a sophisticated agent-based architecture to help teachers generate educational content. They developed a framework for building, testing, and deploying AI agents with robust evaluation capabilities, achieving 98-100% accuracy on evaluations while keeping costs under control. The system includes specialized agents for different tasks, an agent registry for reuse across teams, and extensive testing infrastructure to support reliable production deployment of non-deterministic systems.
This case study explores how Nearpod, an educational technology company serving K-12 schools globally, brought that agent-based architecture into production. The implementation is notable above all for its comprehensive approach to testing, evaluating, and deploying non-deterministic systems.
The journey began with establishing a robust data foundation. Nearpod first addressed their data infrastructure challenges by building a data platform on dbt Core, Redshift, and Snowflake. This foundation proved crucial for the AI work that followed, as it supplied the high-quality data their agents needed to function effectively.
The core AI implementation focused on building agents to assist teachers with question generation, a critical but time-consuming task for educators. However, what makes this case study particularly interesting from an LLMOps perspective is their approach to building and managing these agents at scale:
**Agent Architecture and Design Philosophy**
The team approached agent development with a "narrow scope" philosophy, creating specialized agents for specific tasks rather than general-purpose agents. This approach resulted in several benefits:
* Reduced token usage and costs
* More reliable and predictable behavior
* Easier testing and validation
* Better reusability across the organization
They implemented what they call "three-year-old consultants" - specialized agents that handle specific aspects of the overall task. This modular approach allows for both deterministic and non-deterministic orchestration of agents, depending on the use case.
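To make the pattern concrete, here is a minimal Python sketch of narrow-scope agents composed by a deterministic orchestrator. The agent names, the prompts, and the `call_llm` helper are illustrative assumptions, not Nearpod's actual code:

```python
from dataclasses import dataclass


def call_llm(system_prompt: str, user_input: str) -> str:
    """Hypothetical stand-in for a real LLM provider call."""
    raise NotImplementedError("wire up your LLM provider here")


@dataclass
class NarrowAgent:
    """A single-purpose agent: one tightly scoped prompt, one job."""
    name: str
    system_prompt: str

    def run(self, user_input: str) -> str:
        return call_llm(self.system_prompt, user_input)


# Each narrow agent handles one slice of the overall task.
drafter = NarrowAgent(
    name="question_drafter",
    system_prompt="Write one multiple-choice question about the given topic.",
)
reviewer = NarrowAgent(
    name="question_reviewer",
    system_prompt="Flag factual or age-appropriateness issues in the question.",
)


def generate_question(topic: str) -> str:
    """Deterministic orchestration: a fixed agent sequence, no LLM routing."""
    draft = drafter.run(topic)
    return reviewer.run(draft)
```

Because each agent's prompt covers a single narrow task, each one can be evaluated, priced, and swapped independently, which is what makes the cross-team reuse described later possible.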
**Testing and Evaluation Infrastructure**
One of the most significant aspects of their LLMOps implementation is the custom evaluation framework they built; a minimal sketch of such a harness follows the list. Key features include:
* Custom evals framework modeled on OpenAI's Evals but adapted to their specific needs
* Support for both Python and TypeScript implementations
* Integration with CI/CD pipelines
* Comprehensive testing with thousands of evals per agent
* Achievement of 98-100% accuracy in evaluations
* Cost tracking and optimization capabilities
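The talk does not show the framework's internals, but an eval harness in this spirit might look like the following Python sketch. The `EvalCase` type, the grading callback, and the CI gate in the trailing comment are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input: str
    expected: str  # or a grading rubric for open-ended outputs


@dataclass
class EvalResult:
    accuracy: float
    total_cost_usd: float


def run_evals(
    agent: Callable[[str], tuple[str, float]],  # returns (output, cost in USD)
    cases: list[EvalCase],
    grade: Callable[[str, str], bool],          # exact match, regex, or LLM grader
) -> EvalResult:
    """Run every case through the agent and aggregate accuracy plus cost."""
    passed, cost = 0, 0.0
    for case in cases:
        output, case_cost = agent(case.input)
        cost += case_cost
        if grade(output, case.expected):
            passed += 1
    return EvalResult(accuracy=passed / len(cases), total_cost_usd=cost)


# A CI gate could then enforce the accuracy bar the team describes:
# result = run_evals(my_agent, cases, grade=lambda out, exp: exp in out)
# assert result.accuracy >= 0.98
```

Folding cost into the same result object is what lets accuracy and spend be traded off in one place rather than in separate tools.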
**Production Deployment and Monitoring**
The team developed a sophisticated approach to managing non-deterministic systems in production:
* Environment variable-based deployment system (illustrated in the sketch after this list)
* Comprehensive monitoring of agent performance
* Cost tracking per agent and interaction
* Real-time performance metrics
* Feedback loops for continuous improvement
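As a rough illustration of the environment-variable deployment idea combined with per-interaction metrics, consider this Python sketch. The variable names, the `run_with_metrics` wrapper, and the logging format are hypothetical:

```python
import os
import time
from typing import Callable

# Assumed convention: model and prompt version are chosen via environment
# variables, so promoting a new agent version is a config change, not a deploy.
AGENT_MODEL = os.environ.get("QUESTION_AGENT_MODEL", "default-model")
PROMPT_VERSION = os.environ.get("QUESTION_AGENT_PROMPT_VERSION", "v1")


def run_with_metrics(agent_fn: Callable[[str], tuple[str, float]],
                     user_input: str) -> str:
    """Wrap an agent call with latency and cost logging."""
    start = time.monotonic()
    output, cost_usd = agent_fn(user_input)
    latency = time.monotonic() - start
    # In production these would be shipped to a metrics backend, not stdout.
    print(f"agent=question model={AGENT_MODEL} prompt={PROMPT_VERSION} "
          f"latency_s={latency:.2f} cost_usd={cost_usd:.4f}")
    return output
```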
**Cost Management and Optimization**
They also implemented several cost management strategies:
* Token usage optimization at the prompt level
* Cost prediction based on usage patterns (see the sketch after this list)
* Integration of cost metrics into the evaluation framework
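A simple version of token-based cost prediction could look like this; the per-token prices and usage numbers below are placeholders, not Nearpod's actual rates or traffic:

```python
# Illustrative per-token prices (USD per 1K tokens); real prices vary by
# provider and model, so treat these numbers strictly as placeholders.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}


def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single agent interaction from token counts."""
    return (input_tokens / 1000) * PRICE_PER_1K["input"] + \
           (output_tokens / 1000) * PRICE_PER_1K["output"]


def monthly_forecast(avg_input: int, avg_output: int,
                     interactions_per_day: int) -> float:
    """Project monthly spend from average usage patterns."""
    return interaction_cost(avg_input, avg_output) * interactions_per_day * 30


# e.g. 800 input / 300 output tokens, 50,000 interactions a day:
# monthly_forecast(800, 300, 50_000) -> $1,275/month at these placeholder rates
```

Trimming tokens at the prompt level feeds directly into this arithmetic, which is why prompt-level optimization appears first in the list above.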
**Organizational Impact and Cross-functional Collaboration**
The implementation had significant organizational implications:
* Reduced development time from months to hours for initial prototypes
* Enabled closer collaboration between technical and non-technical teams
* Created an agent registry for reuse across departments
* Facilitated rapid prototyping and iteration
* Changed the traditional product development cycle to be more collaborative and efficient
**Handling Production Challenges**
The team acknowledged and addressed several critical challenges in running non-deterministic systems in production:
* Cultural sensitivity and legislative compliance across different regions
* Input validation and safety checks (see the sketch after this list)
* Handling of sensitive topics in educational contexts
* Managing the inherent risks of non-deterministic systems
* Balancing quality with cost optimization
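A minimal input-validation gate of the kind the list mentions might look like this in Python. The patterns and limits are illustrative placeholders; a real system would combine region-specific policy lists with a moderation model rather than plain keyword matching:

```python
import re

# Placeholder policy, shown only for illustration.
SENSITIVE_PATTERNS = [
    re.compile(r"\bhome address\b", re.IGNORECASE),
    re.compile(r"\bphone number\b", re.IGNORECASE),
]


def validate_input(text: str, max_chars: int = 2000) -> tuple[bool, str]:
    """Reject inputs that are too long or trip a sensitivity pattern
    before they ever reach the LLM."""
    if len(text) > max_chars:
        return False, "input too long"
    for pattern in SENSITIVE_PATTERNS:
        if pattern.search(text):
            return False, f"matched sensitive pattern: {pattern.pattern}"
    return True, "ok"


ok, reason = validate_input("Generate a quiz about photosynthesis")
assert ok, reason
```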
**Infrastructure and Scaling**
The implementation includes several key infrastructure components:
* Agent registry for discovering and reusing agents across teams (sketched after this list)
* Integration with existing data infrastructure
* Scalable evaluation system
* Cost prediction and monitoring systems
* Integration with CI/CD pipelines
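The talk describes the registry's purpose rather than its implementation, but a bare-bones agent registry might look like this Python sketch; the `AgentRecord` fields and the search API are assumptions:

```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    """Registry metadata that lets other teams discover and reuse an agent."""
    name: str
    owner_team: str
    description: str
    eval_accuracy: float   # last measured eval score
    avg_cost_usd: float    # average cost per interaction
    tags: list[str] = field(default_factory=list)


class AgentRegistry:
    def __init__(self) -> None:
        self._agents: dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        self._agents[record.name] = record

    def search(self, tag: str) -> list[AgentRecord]:
        return [a for a in self._agents.values() if tag in a.tags]


registry = AgentRegistry()
registry.register(AgentRecord(
    name="question_drafter", owner_team="content-ai",
    description="Drafts multiple-choice questions from a topic.",
    eval_accuracy=0.99, avg_cost_usd=0.002, tags=["questions", "k12"],
))
print([a.name for a in registry.search("questions")])
```

Publishing eval accuracy and average cost alongside each agent is what turns the registry from a code index into a procurement tool: a team can judge whether an existing agent meets its quality and budget bar before building a new one.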
What makes this case study particularly valuable is its comprehensive approach to managing AI agents in production. Rather than focusing solely on the technical implementation, they created a complete ecosystem for building, testing, deploying, and monitoring AI agents. The emphasis on testing and evaluation, cost management, and organizational collaboration provides a blueprint for other organizations looking to implement similar systems.
The team's approach to handling non-deterministic systems in production is especially noteworthy. They acknowledged the inherent risks and uncertainty while implementing robust systems to minimize and manage them. Their evaluation framework, which achieves 98-100% accuracy while optimizing for cost, demonstrates that non-deterministic systems can be deployed reliably in production environments.
The case study also highlights the importance of organizational change management in implementing AI systems. By bringing different departments closer together and enabling rapid prototyping and iteration, they've created a more efficient and collaborative development process that better serves their users' needs.