Canva developed a systematic framework for evaluating LLM outputs in their design transformation feature called Magic Switch. The framework focuses on establishing clear success criteria, codifying these into measurable metrics, and using both rule-based and LLM-based evaluators to assess content quality. They implemented a comprehensive evaluation system that measures information preservation, intent alignment, semantic order, tone appropriateness, and format consistency, while also incorporating regression testing principles to ensure prompt improvements don't negatively impact other metrics.
# LLM Evaluation Framework at Canva
## Company Overview
Canva is an online design platform that aims to make design accessible to everyone. They utilize LLMs in their product ecosystem to transform user designs and generate creative content. One of their key features, Magic Switch, helps transform user designs into different formats such as documents, summaries, emails, and even creative content like songs.
## Challenge
- Working with diverse and messy input from various design formats
- Need to ensure consistent quality in LLM-generated content
- Difficulty in evaluating subjective and creative outputs
- Handling non-deterministic nature of LLM outputs
## Solution Architecture
### Evaluation Framework Design
- Reversed the typical build-first, evaluate-later LLM development process
- Started with defining success criteria before implementation
- Established measurable metrics for evaluation
- Created a systematic approach to prompt engineering
### Quality Criteria Focus Areas
- Content Quality
- Format Quality
### Evaluation System Components
- Two types of evaluators (sketched below):
  - Evaluation criteria modules for each output type
  - Metric scoring modules for different quality aspects
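A minimal sketch of how these two module types could be structured in Python; the class and field names (`EvaluationCriteria`, `MetricScorer`, `score_fn`) are illustrative assumptions, not Canva's actual API:

```python
# Illustrative sketch of the two evaluator module types; all names are assumptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvaluationCriteria:
    """Criteria module: declares what a given output type must satisfy."""
    output_type: str      # e.g. "email", "summary", "song"
    criteria: List[str]   # human-readable expectations for this output type


@dataclass
class MetricScorer:
    """Scoring module: turns one quality aspect into a normalized 0-1 score."""
    name: str                                  # e.g. "information", "format"
    score_fn: Callable[[str, str], float]      # (input_content, output_content) -> [0, 1]

    def score(self, input_content: str, output_content: str) -> float:
        return self.score_fn(input_content, output_content)


# Example: a criteria module for the email output type (criteria text is illustrative).
email_criteria = EvaluationCriteria(
    output_type="email",
    criteria=[
        "preserves the key details from the source design",
        "reads like an email rather than a document",
    ],
)
```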
### Metrics Implementation
- Information Score: Measures preservation of input details
- Intent Score: Assesses alignment with content type
- Semantic Order Score: Evaluates logical flow
- Tone Score: Checks appropriateness of content tone
- Format Score: Validates structural consistency
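As one example, a hypothetical LLM-based information scorer might ask an evaluator model for a reason plus a score and normalize the result to [0, 1]. The prompt wording, JSON contract, and `llm` callable below are assumptions rather than Canva's published implementation:

```python
# Sketch of an LLM-based metric scorer returning a reason plus a normalized score.
import json
from typing import Callable, Tuple

INFORMATION_PROMPT = """You are grading generated content.

Source design content:
{source}

Generated output:
{output}

Rate how well the output preserves the information in the source on a scale
from 0.0 (nothing preserved) to 1.0 (everything preserved).
Respond as JSON: {{"reason": "<one sentence>", "score": <float>}}"""


def information_score(source: str, output: str,
                      llm: Callable[[str], str]) -> Tuple[float, str]:
    """Ask an evaluator LLM for a reason-backed score and clamp it to [0, 1]."""
    raw = llm(INFORMATION_PROMPT.format(source=source, output=output))
    parsed = json.loads(raw)
    score = min(max(float(parsed["score"]), 0.0), 1.0)
    return score, parsed["reason"]
```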
## Quality Assurance Process
### Evaluator Validation
- Performed ablation studies to verify LLM evaluator reliability
- Deliberately degraded outputs to test scorer sensitivity
- Plans for future human-in-the-loop validation
- Correlation analysis between LLM and human evaluations
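A sketch of how such a degradation check could look, building on the hypothetical `information_score` above; dropping sentences to simulate information loss and the 0.2 score-gap threshold are illustrative assumptions:

```python
# Sketch of scorer sensitivity testing: corrupt a good output, expect the score to drop.
import random


def degrade(output: str, drop_fraction: float = 0.5, seed: int = 0) -> str:
    """Remove a fraction of sentences to simulate lost information."""
    sentences = [s for s in output.split(". ") if s]
    if not sentences:
        return output
    keep = max(1, int(len(sentences) * (1 - drop_fraction)))
    kept = random.Random(seed).sample(sentences, keep)
    return ". ".join(kept)


def check_scorer_sensitivity(source: str, good_output: str, llm) -> bool:
    """A reliable evaluator should penalize the degraded output noticeably."""
    good, _ = information_score(source, good_output, llm)
    bad, _ = information_score(source, degrade(good_output), llm)
    return good - bad > 0.2
```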
### Testing Methodology
- Regular evaluation runs against test input-output pairs
- Aggregation of scores across multiple test cases
- Comparison of metrics before and after prompt changes
- Implementation of regression testing principles
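The regression principle amounts to: aggregate each metric over a fixed test set for the old and the new prompt, then flag any metric whose aggregate drops beyond a tolerance. A minimal sketch, reusing the illustrative `MetricScorer` interface and an assumed 0.05 tolerance:

```python
# Sketch of evaluation-as-regression-testing for prompt changes.
from statistics import mean
from typing import Dict, List, Tuple


def aggregate(scorers, test_pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Mean score per metric over (input, output) pairs produced by one prompt version."""
    return {
        scorer.name: mean(scorer.score(inp, out) for inp, out in test_pairs)
        for scorer in scorers
    }


def regressions(before: Dict[str, float], after: Dict[str, float],
                tolerance: float = 0.05) -> Dict[str, Tuple[float, float]]:
    """Metrics whose aggregate score dropped by more than `tolerance` after a prompt change."""
    return {
        metric: (before[metric], after[metric])
        for metric in before
        if before[metric] - after[metric] > tolerance
    }
```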
## Implementation Details
### Format Evaluation Example
- Specific evaluation criteria defined for the email output format (sketched below)
- Scoring system between 0 and 1
- Continuous rather than binary scoring
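A rule-based sketch of continuous format scoring for the email type; the individual checks (subject line, greeting, sign-off) are illustrative stand-ins for Canva's actual criteria, and the score is simply the fraction of checks that pass:

```python
# Sketch of a rule-based format scorer producing a continuous score in [0, 1].
import re

EMAIL_FORMAT_CHECKS = {
    "has_subject": lambda text: bool(
        re.search(r"^subject:", text, re.IGNORECASE | re.MULTILINE)
    ),
    "has_greeting": lambda text: bool(
        re.search(r"\b(hi|hello|dear)\b", text, re.IGNORECASE)
    ),
    "has_signoff": lambda text: bool(
        re.search(r"\b(regards|thanks|sincerely)\b", text, re.IGNORECASE)
    ),
}


def email_format_score(text: str) -> float:
    """Fraction of format checks satisfied, rather than a binary pass/fail."""
    passed = sum(1 for check in EMAIL_FORMAT_CHECKS.values() if check(text))
    return passed / len(EMAIL_FORMAT_CHECKS)
```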
### Prompt Engineering Integration
- Iterative prompt improvements based on evaluation results
- Monitoring of multiple metrics to prevent regression
- Balance between specific improvements and overall quality
## Key Learnings and Best Practices
### Development Approach
- Define and codify output criteria before implementation
- Convert expectations into measurable metrics
- Use evaluation as regression testing for prompt changes
- Apply engineering principles to creative processes
### Future Improvements
- Scale up evaluation with larger datasets
- Implement automated periodic evaluation jobs
- Develop automated prompt engineering capabilities
- Enhance system reliability through more rigorous validation
## Technical Implementation Details
### Evaluation Pipeline
- Input processing for diverse design formats
- Parallel evaluation across multiple metrics
- Aggregation of scores for decision making
- Integration with prompt engineering workflow
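A sketch of what such a pipeline step could look like, running every (metric, test case) combination in a thread pool and aggregating per metric; the thread-pool approach and the scorer interface are assumptions carried over from the earlier sketches:

```python
# Sketch of parallel evaluation across metrics with per-metric aggregation.
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Dict, List, Tuple


def evaluate(scorers, test_pairs: List[Tuple[str, str]],
             max_workers: int = 8) -> Dict[str, float]:
    """Score every test case against every metric concurrently, then average per metric."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            (scorer.name, idx): pool.submit(scorer.score, inp, out)
            for scorer in scorers
            for idx, (inp, out) in enumerate(test_pairs)
        }
        scores = {key: future.result() for key, future in futures.items()}
    return {
        scorer.name: mean(scores[(scorer.name, idx)] for idx in range(len(test_pairs)))
        for scorer in scorers
    }
```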
### Measurement System
- Normalized scoring between 0 and 1
- Reason-based scoring from LLM evaluators
- Aggregation methods for multiple test cases
- Statistical analysis of evaluation results
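Because both generation and LLM evaluation are non-deterministic, aggregates are more trustworthy when summarized over repeated runs. A small sketch of that statistical step, assuming at least two runs and using mean and standard deviation; the choice of statistics is an assumption, not a published Canva procedure:

```python
# Sketch of statistical analysis over repeated evaluation runs.
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize_runs(run_scores: List[Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """run_scores: per-run metric aggregates, e.g. several outputs of evaluate() above."""
    metrics = run_scores[0].keys()
    return {
        metric: (
            mean(run[metric] for run in run_scores),
            stdev(run[metric] for run in run_scores),  # requires >= 2 runs
        )
        for metric in metrics
    }
```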
## Future Roadmap
### Planned Enhancements
- More systematic evaluator validation
- Scaling to handle larger volumes of user input
- Automated prompt engineering opportunities
- Regular evaluation scheduling
- Integration with retrieval-augmented generation systems
### Quality Improvements
- Enhanced human-in-the-loop validation
- More comprehensive regression testing
- Automated anomaly detection
- Consistency checks across multiple evaluations