New Relic, a major observability platform processing 7 petabytes of data daily, implemented GenAI both internally for developer productivity and externally in their product offerings. They achieved a 15% increase in developer productivity through targeted GenAI implementations, while also developing sophisticated AI monitoring capabilities and natural language interfaces for their customers. Their approach balanced cost, accuracy, and performance through a mix of RAG, multi-model routing, and classical ML techniques.
New Relic, as a major player in the observability platform space, provides a comprehensive case study of implementing GenAI both for internal operations and customer-facing products. Their experience offers valuable insights into the practical challenges and solutions of deploying LLMs in production at scale.
The company's GenAI journey began about two years ago, with a focus on proof of value rather than just proof of concept. This pragmatic approach helped ensure that their GenAI implementations delivered measurable business impact. Their internal implementation focused primarily on developer productivity, where they achieved a 15% increase in productivity across different engineering levels (P1, P2, and P3).
Their LLMOps architecture demonstrates several sophisticated approaches to production AI:
For internal operations, they implemented:
* A multi-agent architecture in which specialized agents handle specific domains while presenting a unified interface through their "Nova" agent
* Both synchronous and asynchronous agent interactions, allowing for complex workflows where agents can create pull requests and handle longer-running tasks
* Automated incident management with correlation of alerts and automatic RCA (Root Cause Analysis) generation
* Cloud cost optimization using AI to monitor and optimize AWS resource usage, including creative use of convertible Reserved Instances
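The unified-interface pattern behind an agent like "Nova" can be sketched as a facade that delegates to registered domain specialists. This is a minimal illustration, not New Relic's actual implementation; the class and handler names are invented for demonstration.

```python
# Hypothetical "Nova"-style facade: one entry point that routes requests
# to specialized domain agents (names are illustrative assumptions).
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class AgentResponse:
    agent: str   # which specialist (or the facade itself) answered
    answer: str


class NovaFacade:
    """Single entry point delegating to specialized domain agents."""

    def __init__(self) -> None:
        self._agents: Dict[str, Callable[[str], str]] = {}

    def register(self, domain: str, handler: Callable[[str], str]) -> None:
        self._agents[domain] = handler

    def ask(self, domain: str, query: str) -> AgentResponse:
        handler = self._agents.get(domain)
        if handler is None:
            # Fall back to the facade itself rather than failing outright
            return AgentResponse("nova", f"No specialist for '{domain}': {query}")
        return AgentResponse(domain, handler(query))


nova = NovaFacade()
nova.register("incidents", lambda q: f"RCA draft for: {q}")
nova.register("cost", lambda q: f"Reserved Instance review for: {q}")

print(nova.ask("incidents", "checkout latency spike").answer)
```

In practice each handler would itself be an LLM-backed agent (possibly running asynchronously for long tasks such as opening pull requests), but the facade keeps the caller's interface stable regardless.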
For their product offerings, they developed a three-layer AI architecture:
1. Data Collection Layer: Handles the ingestion of 7 petabytes of data daily
2. Platform Intelligence Layer: Combines both probabilistic and deterministic engines
3. Action Platform Layer: Delivers anomaly alerts and recommendations
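The flow through these three layers can be illustrated with a toy pipeline. The layer names follow the case study, but the logic inside each function is an invented example (a hard latency threshold as the deterministic engine, a simple anomaly score as the probabilistic one), not New Relic's actual system.

```python
# Toy three-layer pipeline mirroring the architecture described above.
# All thresholds and data shapes are illustrative assumptions.

def data_collection(events):
    # Layer 1: ingest and normalize raw telemetry (here, just filter
    # out records missing the field we score on)
    return [e for e in events if "latency_ms" in e]


def platform_intelligence(events, threshold_ms=500):
    # Layer 2: combine a deterministic rule (fixed threshold) with a
    # probabilistic signal (fraction of events that are slow)
    slow = [e for e in events if e["latency_ms"] > threshold_ms]
    anomaly_score = len(slow) / max(len(events), 1)
    return {"slow_events": slow, "anomaly_score": anomaly_score}


def action_platform(findings, alert_at=0.3):
    # Layer 3: turn findings into alerts or recommendations
    if findings["anomaly_score"] >= alert_at:
        return (f"ALERT: {len(findings['slow_events'])} slow events "
                f"(score={findings['anomaly_score']:.2f})")
    return "OK"


events = [{"latency_ms": 120}, {"latency_ms": 900}, {"latency_ms": 650}]
print(action_platform(platform_intelligence(data_collection(events))))
```

At New Relic's scale the first layer ingests petabytes per day, so the real versions of these functions are distributed systems; the sketch only shows how responsibilities divide across the layers.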
Their approach to GenAI implementation shows several best practices for LLMOps:
Cost Management:
* They use different models for different use cases to optimize costs
* They implement careful monitoring of model performance and costs
* They maintain a pragmatic balance between classical ML and GenAI approaches
Technical Architecture:
* Implementation of both co-pilot mode (explicit invocation) and workflow mode (automatic integration)
* Use of RAG (Retrieval Augmented Generation) instead of immediate fine-tuning to reduce costs and complexity
* Development of sophisticated monitoring tools for AI operations
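The core idea of the RAG approach (grounding the model in retrieved context instead of fine-tuning it) can be shown in a few lines. This sketch uses naive keyword-overlap retrieval and stubs out the LLM call; a production system would use vector search over embeddings, but the prompt-assembly shape is the same.

```python
# Minimal RAG sketch: pick the most relevant document by keyword overlap
# and prepend it to the prompt. Retrieval method and documents are
# illustrative; real systems use embedding-based vector search.

def retrieve(query: str, docs: list) -> str:
    q_terms = set(query.lower().split())
    # Score each doc by how many query terms it shares
    return max(docs, key=lambda d: len(q_terms & set(d.lower().split())))


def build_prompt(query: str, docs: list) -> str:
    context = retrieve(query, docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "alert policies define thresholds for notification",
    "dashboards visualize telemetry data across services",
]
print(build_prompt("how do alert thresholds work", docs))
```

Because the knowledge lives in the retrieved documents rather than in model weights, updating the system means updating the document store, which is far cheaper than retraining or fine-tuning.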
Important lessons from their implementation include:
1. Focus on Measurable Value: They emphasize the importance of measuring actual value delivered rather than just technical capabilities.
2. Balanced Approach to Models:
* Starting with prompt engineering and RAG before considering fine-tuning
* Using multiple models through a router to optimize for different use cases
* Maintaining awareness of cost-performance tradeoffs
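A multi-model router of the kind described above can be reduced to a simple dispatch rule: send cheap, simple requests to a small model and reserve the expensive model for complex ones. The model tiers, prices, and the word-count heuristic below are invented assumptions for illustration, not New Relic's routing logic.

```python
# Hedged sketch of cost-aware multi-model routing. Tier names, prices,
# and the length heuristic are illustrative assumptions.
MODELS = {
    "small": {"cost_per_1k_tokens": 0.0002},
    "large": {"cost_per_1k_tokens": 0.0100},
}


def route(prompt: str, complexity_hint: str = "auto") -> str:
    """Return the model tier to use for this prompt."""
    if complexity_hint == "auto":
        # Crude proxy: short prompts are treated as simple
        complexity_hint = "simple" if len(prompt.split()) < 50 else "complex"
    return "small" if complexity_hint == "simple" else "large"


print(route("summarize this alert"))          # short prompt -> cheap tier
print(route("explain this stack trace " * 20))  # long prompt -> strong tier
```

Real routers typically classify requests with a lightweight model or heuristics learned from evaluation data, but the cost-performance tradeoff they encode is the same one shown here.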
3. Continuous Improvement:
* Models aren't static and require continuous refinement
* Regular monitoring of model drift and performance
* Iterative approach to feature development
4. Integration Strategy:
* Careful consideration of synchronous vs asynchronous operations
* Balance between automation and human oversight
* Integration into existing workflows rather than standalone tools
Their AI monitoring capabilities are particularly noteworthy, offering:
* Performance monitoring of AI models in production
* Detection of model drift
* Latency monitoring
* Comparison capabilities between different models
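Model-drift detection like the kind listed above can be approximated with a baseline-versus-live comparison on any tracked metric. The sketch below uses a simple mean-shift check with a standard-deviation threshold; production drift detection (e.g. PSI or KS tests) is more involved, and the metric values are invented.

```python
# Illustrative drift check: flag drift when the live window's mean moves
# more than k baseline standard deviations from the baseline mean.
# Thresholds and data are illustrative assumptions.
import statistics


def drift_detected(baseline, live, k: float = 3.0) -> bool:
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline) or 1e-9  # guard against zero stdev
    return abs(statistics.mean(live) - mu) > k * sigma


# e.g. daily answer-quality scores for a deployed model
baseline = [0.71, 0.72, 0.70, 0.73, 0.71]
print(drift_detected(baseline, [0.52, 0.50, 0.49]))  # quality dropped
print(drift_detected(baseline, [0.71, 0.72, 0.70]))  # stable
```

The same window-comparison pattern extends to latency monitoring and side-by-side model comparison: track the distribution per model, and alert when a window departs from its reference.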
The case study emphasizes the importance of proper scoping and prioritization of use cases. They maintain approximately 40-50 experimental use cases, with only about 15 expected to reach production status. This selective approach helps ensure resources are focused on the most impactful implementations.
For skills development, they implement a hybrid approach:
* Internal training to ensure basic AI literacy across the organization
* Strategic hiring for specialized expertise
* Utilization of AWS's GenAI competency center
The case study concludes with important cautions about over-engineering and the importance of starting with simpler approaches before moving to more complex solutions. Their experience suggests that while GenAI offers significant opportunities, success requires careful attention to practical considerations of cost, performance, and measurable business value.