CircleCI shares their experience building AI-enabled applications like their error summarizer tool, focusing on the challenges of testing and evaluating LLM-powered applications in production. They discuss implementing model-graded evals, handling non-deterministic outputs, managing costs, and building robust testing strategies that balance thoroughness with practicality. The case study provides insights into applying traditional software development practices to AI applications while addressing unique challenges around evaluation, cost management, and scaling.
# Building and Testing Production AI Applications at CircleCI
CircleCI, a leading CI/CD platform provider, shares their experiences and insights in building, testing, and deploying AI-enabled applications. This case study is based on a discussion between Rob Zuber (CTO) and Michael Webster (Principal Engineer) about their journey in implementing LLM-powered features and the associated operational challenges.
## Core AI Initiatives
CircleCI's AI initiatives fall into two main categories:
- AI for Software Development
- AI-Powered Product Development
## Key Challenges in AI Application Testing
### Non-deterministic Outputs
- Traditional software testing assumes deterministic outputs
- LLM outputs are probabilistic and non-deterministic
- Language responses can have multiple valid forms
- Need for new testing approaches beyond simple string matching
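One way to test beyond string matching is to assert properties that every acceptable output should satisfy, rather than one exact wording. The sketch below is illustrative only: `check_summary`, the word limit, and the exception-name heuristic are assumptions, not CircleCI's actual checks.

```python
import re


def check_summary(summary: str, error_log: str, max_words: int = 60) -> list[str]:
    """Return the list of failed property checks (an empty list means pass)."""
    failures = []
    if not summary.strip():
        failures.append("empty summary")
    if len(summary.split()) > max_words:
        failures.append("summary too long")
    # Assumed heuristic: a useful summary names at least one exception from the log.
    exceptions = set(re.findall(r"\b\w+(?:Error|Exception)\b", error_log))
    if exceptions and not any(name in summary for name in exceptions):
        failures.append("no exception name from the log is mentioned")
    return failures


log = "Traceback ...\nModuleNotFoundError: No module named 'requests'"
# Two differently worded outputs that should both be accepted.
summary_a = "The build failed with ModuleNotFoundError because 'requests' is not installed."
summary_b = "ModuleNotFoundError: the requests package is missing from the environment."
```

Both `summary_a` and `summary_b` pass the same checks even though an exact string comparison would reject at least one of them.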
### Subjectivity in Evaluation
- Similar to UI/UX testing challenges
- Requires assessment of subjective qualities
- Need for both automated and human evaluation
## Testing and Evaluation Strategies
### Model-Graded Evaluations
- Using LLMs to evaluate LLM outputs
- Implementation of teacher-student evaluation frameworks
- Awareness of potential biases in model-graded evaluation
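A model-graded eval can be sketched as a function that builds a grading prompt, sends it to a grader model, and parses a constrained verdict. Everything named here is hypothetical: the prompt wording, the `grade` function, and the `call_grader` callable (which stands in for whatever LLM client is in use) are assumptions for illustration.

```python
from typing import Callable

# Assumed prompt wording; a real grader prompt would be tuned to the task.
GRADER_PROMPT = """You are grading a summary of a CI build error.
Error log:
{log}

Summary:
{summary}

Answer with exactly one word: PASS or FAIL."""


def grade(log: str, summary: str, call_grader: Callable[[str], str]) -> bool:
    """Ask a grader model for a verdict and parse it strictly."""
    prompt = GRADER_PROMPT.format(log=log, summary=summary)
    verdict = call_grader(prompt).strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        # The grader is itself an LLM, so its output needs validation too.
        raise ValueError(f"unparseable grader verdict: {verdict!r}")
    return verdict == "PASS"


def fake_grader(prompt: str) -> str:
    """Deterministic stand-in for a real LLM call, used only for this example."""
    graded_part = prompt.split("Summary:")[1]
    return "PASS" if "ModuleNotFoundError" in graded_part else "FAIL"
```

Passing the grader in as a callable keeps the evaluation logic testable without network access, and makes it straightforward to swap grader models when checking for grader bias.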
### Testing Best Practices
- Table-driven testing with multiple scenarios
- Randomization of inputs and scenarios
- Mutation testing approaches
- Use of multiple models for cross-validation
- Focus on critical user journeys and core functionality
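Table-driven testing and input randomization combine naturally: each table row names a scenario and an expected property, and each row is run against several randomized variants of its input. In this sketch, `summarize` is a hypothetical stand-in for the LLM-backed feature, and the cases and noise lines are invented for illustration.

```python
import random


def summarize(log: str) -> str:
    """Hypothetical stand-in for the LLM-backed feature under test."""
    for line in log.splitlines():
        if "Error" in line:
            return f"Build failed: {line.strip()}"
    return "No error found."


# Each row: (scenario name, input log, substring the summary must contain).
CASES = [
    ("missing module", "ModuleNotFoundError: No module named 'requests'", "ModuleNotFoundError"),
    ("type error", "TypeError: unsupported operand type(s)", "TypeError"),
    ("clean build", "All 42 tests passed", "No error"),
]

NOISE = ["Downloading cache...", "Step 3/7 complete", "WARN deprecated flag"]


def with_noise(log: str, rng: random.Random) -> str:
    """Mix irrelevant lines into the log and shuffle the line order."""
    lines = log.splitlines() + rng.sample(NOISE, k=rng.randint(0, len(NOISE)))
    rng.shuffle(lines)
    return "\n".join(lines)


def run_table(seed: int = 0, variants: int = 5) -> list[str]:
    """Run every case against several randomized variants; return failures."""
    rng = random.Random(seed)  # seeded so that failures are reproducible
    failures = []
    for name, log, expected in CASES:
        for i in range(variants):
            out = summarize(with_noise(log, rng))
            if expected not in out:
                failures.append(f"{name} (variant {i}): {out!r}")
    return failures
```

Seeding the random generator keeps the randomized variants reproducible, so a failing variant can be rerun exactly.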
### Error Handling and Fallbacks
- Implementation of robust error handling
- Function calling validation and error recovery
- Fallback mechanisms for unreliable responses
- Retry strategies for failed operations
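The validation, retry, and fallback ideas above can be sketched together. This is a minimal illustration under stated assumptions: the function-call shape (`name` plus `arguments`), the `model` callable, and the retry count are all hypothetical rather than any specific provider's API.

```python
import json
from typing import Callable, Optional

# Assumed shape of a function call returned by the model.
REQUIRED_KEYS = {"name", "arguments"}


def parse_function_call(raw: str) -> dict:
    """Validate that the model returned a well-formed function call."""
    call = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(call, dict) or not REQUIRED_KEYS.issubset(call):
        raise ValueError(f"missing keys in function call: {raw!r}")
    if not isinstance(call["arguments"], dict):
        raise ValueError("arguments must be a JSON object")
    return call


def call_with_retry(
    model: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
    fallback: Optional[dict] = None,
) -> dict:
    """Retry on invalid output, then use the fallback if one was given."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return parse_function_call(model(prompt))
        except ValueError as exc:
            last_error = exc
    if fallback is not None:
        return fallback
    raise RuntimeError(
        f"no valid function call after {max_attempts} attempts"
    ) from last_error
```

A production version would likely add backoff between attempts and distinguish transient network errors from malformed output; both are omitted here for brevity.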
## Feedback and Improvement Systems
### User Feedback Collection
- Implementation of feedback mechanisms
- Data flywheel approach for continuous improvement
### Signal Analysis
- Monitoring user engagement patterns
- Analysis of reprompt behaviors
- Understanding user satisfaction indicators
- Using feedback to improve testing and evaluation
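As one possible signal, reprompt behavior can be approximated as the share of sessions in which a user prompted more than once. The event schema below (`session_id`, `event_type`) and the interpretation of repeated prompts as dissatisfaction are assumptions for illustration.

```python
from collections import defaultdict


def reprompt_rate(events: list[tuple[str, str]]) -> float:
    """Fraction of sessions with more than one 'prompt' event.

    events: (session_id, event_type) pairs; this schema is hypothetical.
    """
    prompts_per_session: dict[str, int] = defaultdict(int)
    for session_id, event_type in events:
        if event_type == "prompt":
            prompts_per_session[session_id] += 1
    if not prompts_per_session:
        return 0.0
    reprompted = sum(1 for count in prompts_per_session.values() if count > 1)
    return reprompted / len(prompts_per_session)


events = [
    ("s1", "prompt"), ("s1", "prompt"),     # user asked again: possible dissatisfaction
    ("s2", "prompt"), ("s2", "thumbs_up"),
    ("s3", "prompt"),
    ("s4", "prompt"),
]
```

A rising rate is only a hint; repeated prompts can also mean a user is exploring, so this kind of metric works best alongside explicit feedback.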
## Cost Management and Scaling
### Cost Optimization Strategies
- Token usage monitoring and optimization
- Model routing based on query complexity
- Fine-tuning considerations for token efficiency
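Routing by query complexity can be sketched with a cheap complexity proxy and a threshold. The model names, prices, threshold, and the four-characters-per-token heuristic below are all placeholder assumptions, not real pricing or a real tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_1k_tokens: float  # illustrative placeholder, not real pricing


SMALL = Model("small-model", 1.0)
LARGE = Model("large-model", 10.0)


def estimate_tokens(text: str) -> int:
    """Rough proxy (about 4 characters per token); a real tokenizer is more accurate."""
    return max(1, len(text) // 4)


def route(query: str, token_threshold: int = 200) -> Model:
    """Send long queries to the larger model and everything else to the smaller one."""
    return LARGE if estimate_tokens(query) > token_threshold else SMALL


def estimated_cost(query: str) -> float:
    """Estimated input cost in USD for the model the query would be routed to."""
    model = route(query)
    return estimate_tokens(query) / 1000 * model.usd_per_1k_tokens
```

Input length is only one possible proxy for complexity; a classifier or a first pass through the small model could replace `route` without changing the callers.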
### Scaling Considerations
- Infrastructure choices
- Throughput management
- Cost-benefit analysis of different approaches
## Best Practices and Recommendations
### Development Approach
- Start with critical features that provide clear value
- Build robust testing frameworks early
- Focus on core user journeys
- Implement comprehensive error handling
### Operational Considerations
- Monitor costs and usage patterns
- Build feedback loops into the system
- Plan for scale and cost optimization
- Maintain balance between quality and cost
### Future Considerations
- Anticipate cost optimization needs
- Plan for scaling challenges
- Consider alternative models and providers
- Build flexible architecture for future adaptability
## Results and Impact
CircleCI's implementation of these practices has allowed them to:
- Successfully deploy AI-powered features like the error summarizer
- Build reliable testing and evaluation frameworks
- Maintain quality while managing costs
- Create scalable AI-enabled applications
## Lessons Learned
- Traditional software practices can be adapted for AI applications
- Testing requires a multi-faceted approach
- Cost management needs early consideration
- User feedback is crucial for improvement
- Balance between automation and human oversight is essential