Company
Formula 1
Title
AI-Powered Root Cause Analysis Assistant for Race Day Operations
Industry
Automotive
Year
2025
Summary (short)
Formula 1 developed an AI-driven root cause analysis assistant using Amazon Bedrock to streamline issue resolution during race events. The solution reduced troubleshooting time from weeks to minutes by letting engineers query system issues in natural language while the assistant automatically checks system health and recommends remediations. The implementation combines ETL pipelines, RAG, and agentic capabilities to process logs and interact with internal systems, resulting in an 86% reduction in end-to-end resolution time.
Formula 1's implementation of an AI-powered root cause analysis system represents a significant advancement in using LLMs for production operations support. This case study demonstrates how generative AI can be effectively deployed to solve complex operational challenges in high-stakes, time-critical environments.

The problem Formula 1 faced was significant: during live race events, IT engineers needed to quickly triage critical issues across various services, including network degradation affecting their APIs and downstream services like F1 TV. The traditional approach to resolving these issues could take up to three weeks, involving multiple teams and extensive manual investigation. A specific example highlighted in the case study showed that a recurring web API issue required around 15 full engineer days to resolve through iterative analysis across multiple events.

The solution architecture implemented by Formula 1 demonstrates several key aspects of modern LLMOps.

Data Processing and ETL Pipeline:

* Raw logs are centralized in S3 buckets, with automated hourly checks via EventBridge
* AWS Glue and Apache Spark handle log transformation through a three-step process:
  * Data standardization to unify formats and schemas
  * Data filtering to remove unnecessary information
  * Data aggregation to reduce data size while retaining useful insights
* The transformed data feeds into Amazon Bedrock Knowledge Bases for efficient querying

RAG Implementation:

* Amazon Bedrock Knowledge Bases provides the RAG workflow capability
* The system maintains accurate context by efficiently querying transformed logs and other business data sources
* The Claude 3 Sonnet model was chosen for its comprehensive answer generation and its ability to handle diverse input formats

Agent-based Architecture:

* Amazon Bedrock Agents enables interaction with internal and external systems
* The system can perform live checks, including:
  * Database queries for health monitoring
  * Integration with monitoring tools like Datadog
  * Creation of Jira tickets for future investigation
* Security is maintained through controlled SQL queries and API checks, following the principle of least privilege

Frontend Implementation:

* Built with the Streamlit framework for a user-friendly interface
* Provides conversation history and clear response formatting
* Includes detailed execution traces for verification and debugging

Security Considerations:

* Data encryption in transit and at rest
* Identity-based policies for access control
* Protection against hallucinations and prompt injections through controlled queries
* Input/output schema validation using Powertools

The sketches below illustrate how these components might look in practice; the identifiers, schemas, and queries they use are illustrative assumptions rather than details from the case study.
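As a rough illustration of the three-step Glue/Spark transformation (the real job would run on AWS Glue on the hourly EventBridge trigger), the following PySpark sketch standardizes, filters, and aggregates raw logs. The bucket paths, column names, and aggregation window are all assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rca-log-etl").getOrCreate()

# Assumed raw log location; in practice this would be the centralized S3 bucket
raw = spark.read.json("s3://example-raw-logs/web-api/")

# 1. Standardize: unify timestamp formats and schemas across log sources
standardized = raw.select(
    F.to_timestamp("timestamp").alias("ts"),
    F.col("service"),
    F.upper(F.col("level")).alias("level"),
    F.col("message"),
)

# 2. Filter: drop noise (e.g. DEBUG lines) that adds no diagnostic value
filtered = standardized.where(F.col("level").isin("WARN", "ERROR"))

# 3. Aggregate: collapse to per-service, per-5-minute error counts,
#    shrinking the data while keeping the insight needed for triage
aggregated = (
    filtered.groupBy(F.window("ts", "5 minutes"), "service", "level")
    .agg(
        F.count("*").alias("occurrences"),
        F.first("message").alias("sample_message"),
    )
)

aggregated.write.mode("overwrite").parquet("s3://example-curated-logs/")
```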
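The RAG workflow over the curated logs can be exercised through Bedrock's `RetrieveAndGenerate` API. A minimal sketch, with a placeholder knowledge base ID:

```python
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve_and_generate(
    input={"text": "Why are we seeing elevated 5xx rates on the web API?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # placeholder
            "modelArn": (
                "arn:aws:bedrock:us-east-1::foundation-model/"
                "anthropic.claude-3-sonnet-20240229-v1:0"
            ),
        },
    },
)

print(response["output"]["text"])  # generated answer, grounded in the logs
for citation in response.get("citations", []):
    print(citation)  # retrieved source chunks backing the answer
```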
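Bedrock Agents action groups are typically backed by Lambda functions. The sketch below shows how controlled, least-privilege live checks might be exposed as agent tools; the function names, query, and helper stubs are hypothetical, not Formula 1's actual implementation:

```python
import json

# Only pre-approved, read-only SQL is exposed; the agent can never compose
# free-form queries, which limits the blast radius of prompt injection
ALLOWED_QUERIES = {
    "db_health": "SELECT status, replica_lag FROM service_health_view",
}

def run_readonly_query(sql):
    """Placeholder: execute via a least-privilege, read-only connection."""
    return {"status": "ok", "replica_lag": "120ms"}

def create_jira_ticket(summary):
    """Placeholder: call the Jira REST API with a scoped service account."""
    return {"ticket": "OPS-1234", "summary": summary}

def handler(event, context):
    function = event["function"]
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}

    if function == "check_database_health":
        result = run_readonly_query(ALLOWED_QUERIES["db_health"])
    elif function == "create_jira_ticket":
        result = create_jira_ticket(params["summary"])
    else:
        result = {"error": f"unknown function: {function}"}

    # Response shape expected by Bedrock Agents for function-based action groups
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": function,
            "functionResponse": {
                "responseBody": {"TEXT": {"body": json.dumps(result)}}
            },
        },
    }
```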
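A Streamlit chat frontend with conversation history and an execution-trace expander might look like the following minimal sketch, where `ask_assistant` is a stand-in for the actual agent invocation:

```python
import streamlit as st

st.title("Race Day RCA Assistant")

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay conversation history on each rerun
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

def ask_assistant(question):
    """Placeholder for the Bedrock agent call; returns (answer, trace)."""
    return "Suspected network degradation on the CDN edge.", {"steps": []}

if prompt := st.chat_input("Describe the issue you are seeing"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    answer, trace = ask_assistant(prompt)
    st.session_state.messages.append({"role": "assistant", "content": answer})
    with st.chat_message("assistant"):
        st.markdown(answer)
        with st.expander("Execution trace"):  # lets engineers verify each step
            st.json(trace)
```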
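The input/output schema validation mentioned above can be implemented with the validation utility in Powertools for AWS Lambda (Python); the schemas here are assumptions sketching the idea:

```python
from aws_lambda_powertools.utilities.validation import validator

# Illustrative JSON Schemas; real ones would mirror the Bedrock Agents
# event and response shapes used by the action group
INBOUND_SCHEMA = {
    "type": "object",
    "required": ["actionGroup", "function"],
    "properties": {
        "actionGroup": {"type": "string"},
        "function": {"type": "string"},
        "parameters": {"type": "array"},
    },
}

OUTBOUND_SCHEMA = {
    "type": "object",
    "required": ["messageVersion", "response"],
}

@validator(inbound_schema=INBOUND_SCHEMA, outbound_schema=OUTBOUND_SCHEMA)
def handler(event, context):
    # Malformed events are rejected before any tool executes, and a
    # malformed response never reaches the agent
    return {"messageVersion": "1.0", "response": {}}
```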
The results of this implementation were impressive:

* Initial triage time reduced from over a day to less than 20 minutes
* End-to-end resolution time reduced by up to 86%
* Response time for specific queries down to 5-10 seconds
* A specific issue that previously took 15 engineer days was resolved in 3 days

The system's success lies not just in its technical implementation but in its practical approach to real-world constraints. The solution addresses several critical LLMOps challenges:

* Model selection: using Claude 3 Sonnet for its specific capabilities in understanding diverse inputs
* Data quality: implementing robust ETL pipelines to ensure high-quality input data
* Security: building in protections against common LLM vulnerabilities
* Integration: connecting with existing tools and workflows
* Scalability: using AWS Fargate for elastic scaling
* Monitoring: implementing comprehensive logging and metrics

This case study also highlights important considerations for LLMOps implementations:

* The importance of data preparation and transformation in ensuring reliable LLM performance
* The value of combining multiple AWS services into a comprehensive solution
* The need for careful security review when deploying LLMs in production
* The benefits of using agents to orchestrate complex workflows
* The importance of maintaining human oversight while automating processes

The success of this implementation has enabled Formula 1's engineering teams to focus on innovation and service improvements rather than troubleshooting, ultimately enhancing the experience for fans and partners. The solution demonstrates how carefully implemented LLMOps can transform operational efficiency in high-pressure environments while maintaining security and reliability.
