Company
StoryGraph
Title
Scaling LLM and ML Models to 300M Monthly Requests with Self-Hosting
Industry
Media & Entertainment
Year
2024
Summary (short)
StoryGraph, a book recommendation platform, successfully scaled its AI/ML infrastructure to handle 300M monthly requests by transitioning from cloud services to self-hosted solutions. The company implemented multiple custom models, including book recommendation and similar-user models as well as a large language model, while maintaining data privacy and significantly reducing costs compared to using cloud APIs. Through innovative self-hosting approaches and careful infrastructure optimization, they managed to scale their operations despite being a small team, though not without facing significant challenges during high-traffic periods.
StoryGraph is a book recommendation platform that helps users discover books based on their preferences and reading history. This case study details their journey in implementing and scaling AI/ML infrastructure from 2019 to early 2024, providing valuable insights into real-world LLMOps challenges and solutions for small teams.

The company's AI journey began with basic machine learning models for book categorization and evolved into a comprehensive suite of AI features, including:

* Book mood classification
* Recommendation engine
* Personalized previews using LLMs
* Similar user detection
* Book cover image validation
* Multiple specialized custom models

A key aspect of their LLMOps strategy was the decision to self-host all AI/ML infrastructure rather than relying on cloud APIs. This decision was driven by several factors:

* Data privacy concerns and user trust
* Cost optimization
* Need for customization and fine-tuning
* Control over infrastructure

The scale of their LLM operations is significant: approximately 1 million requests per day, or roughly 10 LLM requests per second. A notable achievement was handling this volume at a fraction of the cost of the ChatGPT APIs; by their estimate, running the same workload through GPT-4 would have cost roughly $770,000 per month.

Their infrastructure evolution showcases several important LLMOps lessons:

**Queue Management and Scaling**

The team initially struggled with queue management, trying approaches that included file-based queuing and load-balancer-based distribution. They eventually settled on Redis, which proved crucial for handling their ML workload across multiple servers and allowed work to be distributed efficiently across the processing cluster (a minimal sketch of this pattern appears after the Challenges section below).

**Cost Optimization Through Self-Hosting**

The team achieved significant cost savings by moving away from managed cloud services:

* Transitioned from Heroku to self-hosted solutions
* Implemented custom ML infrastructure instead of using cloud APIs
* Used dedicated servers from Hetzner for a better price-performance ratio
* Self-hosted monitoring and analytics tools such as PostHog instead of Mixpanel

**Database Architecture Evolution**

A critical part of their LLMOps infrastructure was the database layer, which went through several iterations:

* Started with PostgreSQL
* Faced scaling challenges as load increased
* Migrated to YugabyteDB for horizontal scaling
* Implemented PgBouncer for connection pooling
* Used HAProxy for load balancing

**Infrastructure Optimization Techniques**

The team implemented several optimization strategies:

* Distributed database queries using CTEs
* Strategic index creation for distributed database performance
* Denormalized tables for faster access patterns
* Local HAProxy instances on web servers
* Session persistence management

**Challenges and Learning**

The case study highlights several significant challenges:

* Scaling issues during high-traffic periods (New Year's Day)
* Database connection management
* Load balancer configuration complexities
* Session persistence issues
* Resource utilization optimization

Their most recent incident, in January 2024, showed how even seemingly simple configurations (such as HAProxy cookie-based session persistence) can cause significant issues at scale. This underscores the importance of understanding every component in an LLMOps stack, from the ML models themselves to the underlying infrastructure.
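To make the Redis-based queue management described above more concrete, here is a minimal sketch of the enqueue/dequeue pattern. It is not StoryGraph's actual code: the queue name `llm_jobs`, the host name, and the job fields are all assumptions made purely for illustration, using the standard redis-py client.

```python
import json
import redis

# Hypothetical queue name and connection details, chosen only to illustrate
# the Redis-backed work queue pattern described in the case study.
QUEUE_KEY = "llm_jobs"
r = redis.Redis(host="queue.internal", port=6379, db=0)

def enqueue_preview_job(user_id: int, book_id: int) -> None:
    """Web tier: push a personalized-preview job onto the shared queue."""
    job = {"type": "personalized_preview", "user_id": user_id, "book_id": book_id}
    r.rpush(QUEUE_KEY, json.dumps(job))

def next_job(timeout_seconds: int = 5):
    """Worker tier: block until a job arrives, then claim it.

    BLPOP pops atomically, so many workers on many servers can share one
    queue without two machines receiving the same job.
    """
    item = r.blpop(QUEUE_KEY, timeout=timeout_seconds)
    if item is None:
        return None
    _key, payload = item
    return json.loads(payload)
```

Because the pop is atomic, adding capacity amounts to starting more worker processes pointed at the same Redis instance, which matches the horizontal scaling approach the team describes.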
**Resource Utilization**

The team made efficient use of available resources:

* Utilized personal machine GPUs for processing
* Implemented queue-based job distribution (a companion worker-side sketch appears at the end of this case study)
* Scaled horizontally with multiple servers
* Optimized database queries for distributed systems

The case study emphasizes the importance of perseverance in solving LLMOps challenges and the value of understanding the entire stack. It demonstrates that small teams can successfully implement and scale sophisticated AI/ML infrastructure through careful planning, continuous learning, and strategic use of open-source tools.

A particularly interesting aspect is their approach to failure and recovery. Rather than trying to prevent every possible failure, they focused on building systems that recover quickly and on staying transparent with their user base about technical challenges.

The story serves as an excellent example of how modern tools and technologies enable small teams to implement sophisticated LLMOps solutions without extensive specialized knowledge or massive budgets, while also highlighting the ongoing challenges of maintaining such systems at scale.
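As a closing illustration of the queue-based job distribution and GPU usage listed under Resource Utilization, the sketch below shows how a worker on a GPU-equipped machine could drain the same hypothetical `llm_jobs` queue from the earlier sketch. The `run_llm` hook is a placeholder for whatever self-hosted model runtime is actually in use; none of this reflects StoryGraph's real implementation.

```python
import json
import time
import redis

# Same hypothetical queue and host as in the earlier enqueue sketch.
QUEUE_KEY = "llm_jobs"
r = redis.Redis(host="queue.internal", port=6379, db=0)

def run_llm(job: dict) -> str:
    """Placeholder for local inference on this machine's GPU
    (e.g. a locally hosted model server reached over localhost)."""
    raise NotImplementedError

def worker_loop() -> None:
    """Pull jobs one at a time and process them; crash-and-restart friendly."""
    while True:
        item = r.blpop(QUEUE_KEY, timeout=5)
        if item is None:
            continue  # queue is empty, poll again
        job = json.loads(item[1])
        try:
            result = run_llm(job)
            # Store the generated preview where the web tier can pick it up;
            # the key layout and 24h expiry are illustrative assumptions.
            r.set(f"result:{job['user_id']}:{job['book_id']}", result, ex=86400)
        except Exception:
            # Put the job back on the queue so another worker can retry it.
            r.rpush(QUEUE_KEY, json.dumps(job))
            time.sleep(1)

if __name__ == "__main__":
    worker_loop()
```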
