Block (formerly Square) implemented a comprehensive LLMOps strategy across multiple business units using a combination of retrieval augmentation, fine-tuning, and pre-training approaches. They built a scalable architecture on Databricks' platform that allowed them to manage hundreds of AI endpoints while maintaining operational efficiency, cost control, and quality assurance. The solution enabled them to handle sensitive data securely, optimize model performance, and iterate quickly while maintaining version control and monitoring capabilities.
This case study describes how Block implemented production-grade generative AI applications across its business units, including Square, Cash App, TIDAL, and TBD. The presentation, delivered by Bradley Axen of Block alongside Ina Koleva of Databricks, provides detailed insight into their LLMOps journey and technical implementation.
The company faced several key challenges in bringing generative AI to production:
* Managing multiple business units with different AI solution needs
* Balancing quick iteration with robust production systems
* Handling sensitive data securely
* Optimizing cost and performance at scale
* Maintaining operational visibility across hundreds of endpoints
Block's approach to LLMOps emphasizes practical implementation while maintaining flexibility for rapid iteration. They built their solution on several key architectural principles:
**Data and Retrieval Architecture**
The foundation of their system uses vector search for efficient retrieval augmentation. While they initially implemented in-memory vector search, they moved to a decoupled vector search endpoint to allow for easier context updates without model redeployment. This separation enabled independent iteration on prompts and model selection while maintaining the ability to continuously update context data.
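As a rough illustration of this decoupling, the sketch below queries a standalone vector search service at request time and assembles the prompt before calling the model endpoint. The URLs, payloads, and response shapes are hypothetical stand-ins, not Block's actual services:

```python
import requests

# Hypothetical internal endpoints; stand-ins for illustration only.
VECTOR_SEARCH_URL = "https://vector-search.internal/query"
MODEL_ENDPOINT_URL = "https://model-serving.internal/invocations"


def retrieve_context(question: str, k: int = 5) -> list[str]:
    """Fetch the top-k documents from the decoupled vector search endpoint.

    Because retrieval is a separate service, the index can be refreshed
    continuously without redeploying the model endpoint.
    """
    resp = requests.post(VECTOR_SEARCH_URL, json={"query": question, "k": k}, timeout=10)
    resp.raise_for_status()
    return [hit["text"] for hit in resp.json()["results"]]


def answer(question: str) -> str:
    """Assemble the prompt from retrieved context, then call the model endpoint."""
    context = "\n".join(retrieve_context(question))
    prompt = (
        f"Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = requests.post(MODEL_ENDPOINT_URL, json={"prompt": prompt}, timeout=30)
    resp.raise_for_status()
    return resp.json()["completion"]
```

Because the prompt template and the model endpoint are referenced independently here, either can be iterated on without touching the retrieval index.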
**Model Serving and Gateway Architecture**
Block implemented a sophisticated model serving architecture using AI Gateway as a proxy to route requests to different models. This approach, illustrated by the sketch after the list, provided several benefits:
* Centralized management of API keys and rate limits
* Unified monitoring and cost attribution
* Flexibility to switch between hosted and third-party models
* Ability to enforce company-wide quotas for specific models like GPT-4
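A minimal sketch of this gateway pattern, assuming a hypothetical in-process route table and quota tracker (Block used AI Gateway itself; this only illustrates the idea of centralized keys, routing, and quotas):

```python
import os
import time
from collections import defaultdict

import requests

# Hypothetical route table: one place to manage providers, targets, and quotas.
ROUTES = {
    "chat-gpt4": {"provider": "openai", "quota_per_min": 60},
    "chat-hosted": {"provider": "self-hosted", "url": "https://llm.internal/invocations", "quota_per_min": 600},
}

_request_times: dict[str, list[float]] = defaultdict(list)


def check_quota(route_name: str) -> None:
    """Enforce a company-wide per-route request quota (e.g. capping GPT-4 usage)."""
    now = time.time()
    recent = [t for t in _request_times[route_name] if now - t < 60]
    if len(recent) >= ROUTES[route_name]["quota_per_min"]:
        raise RuntimeError(f"Quota exceeded for route {route_name!r}")
    recent.append(now)
    _request_times[route_name] = recent


def route_request(route_name: str, prompt: str) -> str:
    """Send a request through the gateway.

    Applications never hold provider API keys, and swapping a route between
    a third-party and a self-hosted model is a config change made here.
    """
    check_quota(route_name)
    route = ROUTES[route_name]
    if route["provider"] == "self-hosted":
        resp = requests.post(route["url"], json={"prompt": prompt}, timeout=30)
        resp.raise_for_status()
        return resp.json()["completion"]  # assumed response shape of the internal endpoint
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": "gpt-4", "messages": [{"role": "user", "content": prompt}]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

Centralizing the route table is also what makes unified monitoring and cost attribution possible: every request passes through a single choke point.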
**Quality Assurance and Safety**
They implemented a comprehensive quality assurance system with both pre- and post-processing steps, sketched in code after the list:
* Input filtering to detect and prevent prompt injection
* Output validation for toxic content
* Secondary model validation to detect hallucinations
* Version tracking and A/B testing capabilities
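One way such a guardrail pipeline can be structured is sketched below. The injection markers and the judge prompt are naive illustrative placeholders, not Block's actual checks; `generate` and `validate` stand in for the primary and secondary models:

```python
# Naive heuristic markers; a production system would use a trained classifier.
INJECTION_MARKERS = ("ignore previous instructions", "reveal your system prompt")


def guarded_generate(prompt: str, generate, validate) -> str:
    """Wrap a model call with pre- and post-processing checks.

    `generate` is the primary model and `validate` is a secondary model used
    to flag toxic or hallucinated output; both are callables on strings.
    """
    # Pre-processing: reject likely prompt-injection attempts up front.
    lowered = prompt.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        raise ValueError("Potential prompt injection detected")

    output = generate(prompt)

    # Post-processing: ask the secondary model to judge the output.
    verdict = validate(
        "Is the following response toxic or unsupported by the prompt? "
        f"Answer SAFE or UNSAFE.\n\nPrompt: {prompt}\n\nResponse: {output}"
    )
    if "UNSAFE" in verdict.upper():
        return "Sorry, I can't provide that response."
    return output
```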
**Monitoring and Feedback**
Block implemented a robust monitoring system that includes the components below, with a query sketch after the list:
* Inference logging to capture input-output pairs
* Integration with customer feedback through Kafka
* Delta tables for analyzing performance and outcomes
* Ability to join feedback data with inference logs for detailed analysis
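With inference logs and Kafka-sourced feedback both landing in Delta tables, joining them reduces to a standard Spark query. The table and column names below are assumptions for illustration:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("llm-feedback-analysis").getOrCreate()

# Hypothetical Delta tables: inference logs captured at the serving endpoint,
# and customer feedback landed from Kafka.
inference_logs = spark.read.table("llm.inference_logs")  # request_id, prompt, response, model_version
feedback = spark.read.table("llm.customer_feedback")     # request_id, rating

# Join feedback onto logged input-output pairs to score each model version.
quality_by_version = (
    inference_logs.join(feedback, on="request_id")
    .groupBy("model_version")
    .agg(F.avg("rating").alias("avg_rating"), F.count("*").alias("n_rated"))
)
quality_by_version.show()
```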
**Fine-tuning and Optimization**
Their approach to model optimization focused on practical benefits (a break-even sketch follows the list):
* Using fine-tuning to reduce model size while maintaining performance
* Optimizing GPU serving endpoints for improved latency
* Balancing the cost of hosted models against usage requirements
* Leveraging tools like Mosaic to simplify the fine-tuning process
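The hosted-versus-self-hosted decision often reduces to a break-even calculation like the sketch below; every price and volume here is a made-up placeholder, not a figure from Block:

```python
# Illustrative numbers only; substitute real prices and traffic.
TOKENS_PER_REQUEST = 1_500
HOSTED_PRICE_PER_1K_TOKENS = 0.03  # assumed third-party API price (USD)
GPU_ENDPOINT_PER_HOUR = 5.00       # assumed dedicated GPU serving cost (USD)


def monthly_cost(requests_per_day: int) -> tuple[float, float]:
    """Compare a pay-per-token hosted API against a flat-rate GPU endpoint."""
    hosted = requests_per_day * 30 * TOKENS_PER_REQUEST / 1_000 * HOSTED_PRICE_PER_1K_TOKENS
    self_hosted = GPU_ENDPOINT_PER_HOUR * 24 * 30  # flat cost, independent of volume
    return hosted, self_hosted


for volume in (100, 1_000, 10_000):
    hosted, self_hosted = monthly_cost(volume)
    print(f"{volume:>6} req/day: hosted ${hosted:,.0f}/mo vs self-hosted ${self_hosted:,.0f}/mo")
```

Under these made-up numbers the dedicated endpoint wins above roughly 2,700 requests per day, and fine-tuning a smaller model pushes the GPU side of the ledger down further.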
**Iterative Development Process**
Block emphasizes an iterative approach to LLMOps, illustrated by the versioning sketch after the list:
* Starting with simple implementations and gradually adding complexity
* Treating all implementations as versioned models
* Maintaining separate endpoints for applications to reduce update frequency
* Continuous evaluation of investments against performance and cost metrics
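Treating even a simple prompt template as a versioned model can be as lightweight as wrapping it in an MLflow pyfunc and registering it, so every prompt change yields a new, trackable version. The model name, prompt, and placeholder predict logic below are hypothetical:

```python
import mlflow
import pandas as pd


class PromptedModel(mlflow.pyfunc.PythonModel):
    """A prompt template packaged as a model, so changes can be versioned,
    A/B tested, and rolled back like any other model artifact."""

    PROMPT = "Summarize the following support ticket:\n\n{ticket}"

    def predict(self, context, model_input: pd.DataFrame):
        prompts = [self.PROMPT.format(ticket=t) for t in model_input["ticket"]]
        # Placeholder: real code would send `prompts` to a serving endpoint
        # and return its completions.
        return prompts


with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="support_summarizer",
        python_model=PromptedModel(),
        # Each log creates a new registry version (requires a registry-backed
        # tracking server).
        registered_model_name="support_summarizer",
    )
```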
**Security and Data Privacy**
The system was designed with security in mind (a routing sketch follows the list):
* Self-hosted models for sensitive data processing
* GPU serving endpoints optimized so self-hosted models stay performance-competitive with third-party alternatives
* Careful control over data flow and model access
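A simplified form of that data-flow control is a routing check in front of the gateway, keeping anything sensitive on self-hosted endpoints. The regex detector and the route names (reusing the hypothetical routes from the gateway sketch above) are illustrative only:

```python
import re

# Naive PII check; a real system would use a proper classifier or data tagging.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def choose_route(prompt: str) -> str:
    """Keep sensitive data on self-hosted models; everything else may use a
    third-party provider via the gateway. Route names are hypothetical."""
    if SSN_PATTERN.search(prompt):
        return "chat-hosted"  # self-hosted: data never leaves the platform
    return "chat-gpt4"        # third-party route through the gateway
```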
One of the most significant lessons from Block's implementation is the importance of building a minimal but effective platform that enables quick iteration while maintaining production reliability. They found that even simple applications benefit from being treated as versioned models, allowing for systematic improvement and evaluation.
Their architecture demonstrates how to scale from basic implementations to hundreds of production endpoints while maintaining operational control. The use of central services like AI Gateway and vector search helped manage complexity while enabling flexibility in model selection and data updates.
Block's implementation shows a practical balance between using third-party services and building custom solutions. They leverage existing tools where possible (like Databricks' platform and AI Gateway) while maintaining control over critical components that affect security, performance, and cost.
The case study emphasizes the importance of starting simple and iterating based on concrete needs rather than over-engineering solutions from the start. This approach allowed them to scale effectively while maintaining the ability to adapt to new requirements and technologies in the fast-moving field of generative AI.