Fuzzy Labs helped a major tech company improve its developer documentation and tooling experience by implementing a self-hosted LLM system built on Mistral-7B. They tackled performance challenges through systematic load testing with Locust, reduced inference latency using vLLM's paged attention, and achieved horizontal scaling with Ray Serve. The solution cut response times from 11 seconds to 3 seconds and enabled handling of concurrent users while efficiently managing GPU resources.
# Scaling Self-Hosted LLMs at Fuzzy Labs
## Project Overview
Fuzzy Labs, an MLOps consultancy based in Manchester, UK, undertook a project to build a self-hosted LLM system for a major tech company that sells products to developers. The primary challenge was to create a system that could help developers better navigate technical documentation and understand product intricacies.
## System Architecture
- Model Server: Mistral-7B served through vLLM and exposed behind Ray Serve
- Key Components: Mistral-7B as the base model, vLLM for optimized inference, Ray Serve for horizontal scaling, and Locust for load testing
## Performance Challenges and Solutions
### Initial Performance Baseline
- Single user performance: roughly 11-second response times for a single request
- Scaling issues: latency degraded further once multiple users queried the system concurrently
### Testing Methodology
- Used Locust to simulate load against the model endpoint
- Defined multiple test scenarios, ranging from a single user up to many concurrent users (a minimal Locust sketch follows this list)
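As a rough illustration of this methodology, a Locust user class along the following lines can simulate developers querying the documentation assistant at different concurrency levels. The endpoint path, model id, and prompt are assumptions for the sketch, not details from the project:

```python
# locustfile.py - minimal load test against an OpenAI-compatible LLM endpoint.
from locust import HttpUser, task, between


class DocsAssistantUser(HttpUser):
    # Simulate a developer pausing between questions.
    wait_time = between(1, 5)

    @task
    def ask_question(self):
        # Hypothetical prompt; a real test would sample from a question set.
        self.client.post(
            "/v1/completions",
            json={
                "model": "mistralai/Mistral-7B-Instruct-v0.2",
                "prompt": "How do I authenticate requests to the API?",
                "max_tokens": 256,
            },
            name="completion",
        )
```

Running `locust -f locustfile.py --host http://<model-server> --users 50 --spawn-rate 5` then ramps up concurrent users while Locust records per-request latency and failure rates.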
### Key Metrics Tracked
- Latency (response time)
- Throughput (tokens per second)
- Request rate (successful requests per minute)
- Additional tracking: GPU utilization and infrastructure metrics (a rough measurement sketch follows this list)
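A simple way to capture latency and token throughput for a single request, assuming the model sits behind an OpenAI-compatible completions API such as vLLM's (the URL and model id below are placeholders), is to time the call and read the token counts from the response:

```python
import time

import requests


def measure(prompt: str, url: str = "http://localhost:8000/v1/completions") -> dict:
    """Time one completion request and derive latency and throughput."""
    start = time.perf_counter()
    resp = requests.post(
        url,
        json={
            "model": "mistralai/Mistral-7B-Instruct-v0.2",
            "prompt": prompt,
            "max_tokens": 256,
        },
        timeout=120,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - start

    # The OpenAI-compatible response reports how many tokens were generated.
    generated = resp.json()["usage"]["completion_tokens"]
    return {
        "latency_s": elapsed,                 # end-to-end response time
        "tokens_per_s": generated / elapsed,  # generation throughput
    }
```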
### Performance Optimizations
#### Phase 1: Inference Speed Optimization
- Implemented vLLM, whose paged attention mechanism manages the KV cache in non-contiguous GPU memory blocks
- Results: single-request latency dropped sharply from the ~11-second baseline
- Benefits: better GPU memory utilization and continuous batching of concurrent requests, so more traffic can be served on the same hardware (a minimal serving sketch follows this list)
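As a minimal sketch (the model id and sampling settings are assumptions), serving Mistral-7B through vLLM looks like this; paged attention and continuous batching are enabled by default:

```python
from vllm import LLM, SamplingParams

# Load Mistral-7B; vLLM manages the KV cache with paged attention.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.2, max_tokens=256)

# Multiple prompts are batched continuously, keeping the GPU busy.
outputs = llm.generate(
    [
        "How do I paginate results from the search endpoint?",
        "What rate limits apply to the public API?",
    ],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```

For serving over HTTP, the same engine can be exposed through vLLM's OpenAI-compatible server (e.g. `python -m vllm.entrypoints.openai.api_server --model <model-id>`).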
#### Phase 2: Horizontal Scaling
- Implemented Ray Serve for: distributing requests across multiple model replicas, autoscaling with demand, and sharing GPU resources between models
- Integration benefits: wraps the vLLM-backed model server behind a single endpoint and handles routing, scaling, and GPU allocation (a deployment sketch follows this list)
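A hedged sketch of what such a deployment can look like, assuming the vLLM engine is wrapped directly in a Ray Serve deployment (replica limits, GPU counts, and the model id are illustrative):

```python
from ray import serve
from starlette.requests import Request
from vllm import LLM, SamplingParams


@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # one GPU per replica
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
)
class MistralDeployment:
    def __init__(self) -> None:
        self.llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
        self.params = SamplingParams(max_tokens=256)

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        output = self.llm.generate([body["prompt"]], self.params)[0]
        return {"completion": output.outputs[0].text}


app = MistralDeployment.bind()
# serve.run(app)  # deploy locally or on a Ray cluster
```

A production setup would typically use vLLM's async engine so generation does not block the request loop; the point here is that Ray Serve adds replicas and routing without changing the client-facing API.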
## Key Learnings and Best Practices
### Performance Optimization Approach
- Always benchmark before optimization
- Separate latency and scaling concerns
- Focus on hardware efficiency
- Consider cost implications of GPU usage
### Infrastructure Management
- Importance of autoscaling
- GPU sharing strategies
- Balance between performance and cost
- Infrastructure monitoring and metrics
### Implementation Guidelines
- Start with performance measurement
- Address basic latency issues first
- Scale horizontally when needed
- Consider GPU sharing for auxiliary models
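For auxiliary models (for example an embedding model used alongside the main LLM), Ray can schedule multiple deployments onto one GPU via fractional resource requests. The split below is purely illustrative, and Ray does not enforce memory isolation, so both models must fit in GPU memory together:

```python
from ray import serve


@serve.deployment(ray_actor_options={"num_gpus": 0.8})
class MainLLM:
    def __call__(self, prompt: str) -> str:
        # Main Mistral-7B / vLLM inference would live here.
        return "..."


@serve.deployment(ray_actor_options={"num_gpus": 0.2})
class AuxiliaryEmbedder:
    def __call__(self, text: str) -> list[float]:
        # e.g. a small embedding model sharing the same GPU.
        return []
```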
## Production Considerations
- Only implement complex scaling solutions when needed for production
- Focus on prototyping first, then scale
- Consider self-hosting trade-offs: control over data and models versus the cost and operational overhead of running your own GPUs
## Results and Impact
The final implementation successfully achieved:
- Sub-second inference times
- Efficient handling of concurrent users
- Optimized GPU resource usage
- Production-ready system capable of handling public traffic
- Balanced cost-performance ratio
This case study demonstrates the importance of systematic performance optimization and scaling strategies when deploying self-hosted LLMs in production environments. The combination of vLLM for inference optimization and Ray Serve for horizontal scaling proved to be an effective approach for meeting production requirements while managing infrastructure costs.