Company
Cleric
Title
AI SRE Agents for Production System Diagnostics
Industry
Tech
Year
2023
Summary (short)
Cleric is developing an AI Site Reliability Engineering (SRE) agent system that helps diagnose and troubleshoot production system issues. The system uses knowledge graphs to map relationships between system components, background scanning to maintain system awareness, and confidence scoring to minimize alert fatigue. The solution aims to reduce the burden on human engineers by efficiently narrowing down problem spaces and providing actionable insights, while maintaining strict security controls and read-only access to production systems.
Cleric is developing an innovative approach to Site Reliability Engineering (SRE) by creating AI agents that assist with production system diagnostics and troubleshooting. This case study explores their implementation of LLMs in production environments, highlighting both the challenges and the solutions involved in building reliable AI-powered system operations tools.

Core System Architecture and Components:

The system is built around several key components that work together to provide reliable production diagnostics:

* Knowledge Graph System: At the heart of the system is a multi-layered knowledge graph that maps relationships between system components. The graph includes both deterministic relationships (such as Kubernetes cluster components) and fuzzy relationships discovered through LLM analysis. Rather than maintaining one monolithic graph, Cleric uses multiple graph layers with different confidence levels and update frequencies.
* Background Scanning: The system employs continuous background scanning to maintain and update its understanding of the environment. To manage costs, this scanning uses more efficient, specialized models rather than expensive general-purpose LLMs. The scanning process builds context about system state and recent changes.
* Memory Management: The system implements three types of memory:
  * Knowledge Graph Memory: captures system state and relationships
  * Procedural Memory: stores general procedures and runbooks
  * Episodic Memory: records specific instances of problems and their solutions

Production Safety and Security:

A critical aspect of the implementation is maintaining strict security controls. The system:

* Operates in read-only mode for production systems
* Is deployed within the customer's production environment
* Has carefully controlled access to monitoring and observability tools

The team emphasizes building trust progressively, starting with lower-risk systems before potentially expanding to more critical components.
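The read-only safety controls above can be sketched as an allowlist-based tool gateway. This is a minimal illustration, not Cleric's implementation: the tool names, registry shape, and allowlist contents are all hypothetical; the source only states that the agent has carefully controlled, read-only access to production and observability tools.

```python
# Hypothetical allowlist of non-mutating tools the agent may invoke.
READ_ONLY_TOOLS = {
    "get_pod_logs",
    "list_deployments",
    "query_metrics",
    "read_config",
}

class ReadOnlyViolation(Exception):
    """Raised when the agent requests a mutating operation."""

def dispatch_tool_call(tool_name: str, handler_registry: dict, **kwargs):
    """Route an agent tool call, rejecting anything not on the read-only allowlist."""
    if tool_name not in READ_ONLY_TOOLS:
        raise ReadOnlyViolation(f"Tool '{tool_name}' is not read-only; call blocked")
    return handler_registry[tool_name](**kwargs)
```

Enforcing the allowlist at the dispatch layer, rather than trusting the LLM's own restraint, is what makes "read-only mode" a hard guarantee instead of a prompt-level suggestion.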
Evaluation and Quality Control:

Cleric has implemented several mechanisms to ensure reliable operation:

* Confidence Scoring: Each finding or recommendation carries a confidence score. Low-confidence results are filtered out to prevent alert fatigue.
* Custom Evaluation Framework: The team maintains an evaluation environment that simulates production conditions to test and improve agent performance.
* Chaos Engineering: The evaluation system includes chaos testing to verify agent behavior under various failure conditions.

Tool Integration and Observability:

The system integrates with existing observability tools but recognizes their limitations for LLM-based analysis:

* Strong capabilities with semantic data (logs, configuration, code)
* Moderate capabilities with traces
* Limited capabilities with metrics and time-series data

These limitations point to a possible shift in observability practice, toward high-cardinality, trace-based observability that may be better suited to LLM analysis.

Cost Management and Optimization:

The system implements several strategies to manage costs and maintain efficiency:

* Budget limits per investigation
* Use of cheaper, specialized models for background tasks
* A tiered usage model with committed compute hours
* Early termination of investigations that are unlikely to yield results

Usage Patterns and Interaction Model:

The system supports both asynchronous and synchronous interaction modes:

* Asynchronous: initial alert processing and background scanning
* Synchronous: interactive troubleshooting sessions with engineers

This flexibility accommodates different usage patterns while maintaining cost control and effectiveness.
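The confidence filtering and per-investigation budget cap described above can be combined in a short sketch. The threshold and budget values here are illustrative assumptions; the source states only that low-confidence results are filtered to curb alert fatigue and that each investigation has a cost limit with early termination.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    summary: str
    confidence: float  # model-assigned score in [0.0, 1.0]

CONFIDENCE_THRESHOLD = 0.7       # hypothetical cutoff to curb alert fatigue
BUDGET_PER_INVESTIGATION = 2.00  # hypothetical USD cap on LLM spend per investigation

def surface_findings(findings, spend_usd):
    """Return findings worth alerting on; stop early once the budget is exhausted."""
    if spend_usd >= BUDGET_PER_INVESTIGATION:
        return []  # early termination: further spend is unlikely to pay off
    return [f for f in findings if f.confidence >= CONFIDENCE_THRESHOLD]
```

Gating on spend before filtering on confidence mirrors the case study's ordering: cost control decides whether the investigation continues at all, and confidence scoring decides which of its results ever reach a human.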
Lessons and Insights:

Several key insights have emerged from this implementation:

* The importance of building trust gradually with engineering teams
* The need for extensive evaluation frameworks
* The challenge of maintaining up-to-date system knowledge
* The value of combining deterministic and LLM-based analysis
* The importance of proper tool integration and abstraction layers

Future Directions:

The team is working toward several future capabilities:

* Expanded automation capabilities
* More comprehensive resolution capabilities
* Better integration with existing workflows
* Improved efficiency in knowledge graph updates
* Enhanced evaluation frameworks

Challenges:

The implementation faces several ongoing challenges:

* Maintaining accurate system state information in rapidly changing environments
* Managing the complexity of tool integration
* Balancing cost with effectiveness
* Building trust with engineering teams
* Handling the unsupervised nature of production problems

The case study demonstrates both the potential and the challenges of implementing LLM-based agents in production operations. While the system shows promising results in reducing engineer workload and improving incident response times, it also highlights the complexity of building reliable AI systems for production operations.