Netflix has developed a sophisticated knowledge graph system for entertainment content that captures relationships between movies, actors, and other entities. While the system initially relied on traditional entity matching techniques, Netflix is now incorporating LLMs to enhance the graph by inferring new relationships and entity types from unstructured data. The system uses Metaflow for orchestration and supports both traditional and LLM-based approaches, allowing flexible model deployment while maintaining production stability.
This case study comes from a Metaflow Meetup presentation where a Netflix engineer discussed the infrastructure behind building and maintaining a large-scale Knowledge Graph for entertainment data. While the primary focus of the talk was on the traditional ML infrastructure for entity matching at scale, the presentation also touched on emerging work using LLMs for entity type inference and relationship extraction. This represents an interesting hybrid case where a mature ML infrastructure system is being augmented with generative AI capabilities.
Netflix’s Knowledge Graph is a critical piece of infrastructure used across multiple teams and platforms within the company. It connects various entities in the entertainment domain—actors, movies, countries, books, semantic concepts—and captures the complex relationships between them. The graph serves multiple use cases including content similarity analysis, search enhancement, recommendations, and predictive analytics.
Before diving into the LLM aspects, it’s important to understand the foundational challenge. When building a Knowledge Graph from multiple data sources, entity matching becomes a critical step. The team needs to determine when two records from different sources refer to the same entity (like a movie) versus different entities that happen to be similar.
The presentation highlighted just how tricky this can be in the entertainment domain. For example, “Knives Out” and “Knives Out 2” share the same director, similar actors, and similar themes, but are fundamentally different movies. Similarly, movies with common names like “Mother” exist in multiple countries and languages every year, requiring precise matching to avoid errors.
The scale of the problem is substantial. With millions of entities, even with good candidate pair generation, the team ends up with billions of pairs that need to be evaluated. The approach they took was to model this as a pairwise classification problem, extracting metadata from each entity pair and running it through classification models.
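The pairwise formulation above can be sketched in a few lines. Everything here is illustrative: the `Record` fields, the comparison features, and the hand-weighted linear score standing in for a trained classifier are assumptions, not Netflix's actual feature set or model.

```python
from dataclasses import dataclass

@dataclass
class Record:
    title: str
    director: str
    year: int

def pair_features(a: Record, b: Record) -> dict:
    """Extract comparison features for one candidate pair (illustrative features only)."""
    return {
        "same_director": float(a.director == b.director),
        "title_exact": float(a.title.lower() == b.title.lower()),
        "year_gap": abs(a.year - b.year),
    }

def is_match(features: dict, threshold: float = 0.5) -> bool:
    # Stand-in for a trained classifier: a hand-weighted linear score.
    score = (0.4 * features["same_director"]
             + 0.5 * features["title_exact"]
             - 0.1 * min(features["year_gap"], 5))
    return score > threshold

# Sequels share a director and themes but must not be merged into one entity.
knives1 = Record("Knives Out", "Rian Johnson", 2019)
knives2 = Record("Glass Onion: A Knives Out Mystery", "Rian Johnson", 2022)
print(is_match(pair_features(knives1, knives2)))  # False: shared director alone is not enough
```

The point of the sketch is that matching decisions come from features of the *pair*, not of either record alone, which is what makes the problem scale as the number of candidate pairs rather than the number of entities.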
The team extensively leverages Metaflow, an ML infrastructure framework originally developed at Netflix, to build and operate these systems. Key architectural decisions include:
Parallel Processing Architecture: The system is designed for massive parallelization. Data is partitioned across multiple nodes, each handling matching independently. Results are written out in parallel with no shuffle or gather phase, eliminating bottlenecks. The team can scale horizontally by simply adding more machines.
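The shuffle-free property described above follows from making partition assignment a pure function of each pair's own key: every node can compute its share independently and write results straight out. In Metaflow terms each shard would become one fan-out task; the sketch below (with hypothetical pair IDs) shows only the deterministic assignment, not Netflix's actual pipeline.

```python
import hashlib

def partition_of(pair_id: str, num_partitions: int) -> int:
    """Deterministically assign a candidate pair to exactly one partition.

    Because the assignment depends only on the pair's own key, no
    shuffle or gather phase is needed, and adding partitions (machines)
    scales the work horizontally.
    """
    digest = hashlib.sha1(pair_id.encode()).hexdigest()
    return int(digest, 16) % num_partitions

pairs = [f"pair-{i}" for i in range(1000)]
shards = {p: [x for x in pairs if partition_of(x, 8) == p] for p in range(8)}

# Every pair lands in exactly one shard; no coordination between nodes required.
assert sum(len(s) for s in shards.values()) == len(pairs)
```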
Fast Data Layer: A surprising early win came from using Netflix’s fast data layer with Apache Arrow for parsing large JSON metadata blobs. This alone provided a 10x speedup over traditional approaches, which is significant when processing billions of pairs.
Within-Node Parallelization: Beyond distributing across nodes, the system also parallelizes feature computation within each node using Metaflow dataframes, splitting work across multiple cores and sharing memory for additional speedups.
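A generic version of that within-node split can be sketched with the standard library. Threads are used here only to keep the example self-contained; CPU-bound feature extraction of the kind described would typically use a process pool or shared-memory Arrow buffers, and the `compute_features` body is a placeholder, not Netflix's actual feature code.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_features(chunk):
    # Placeholder per-chunk work; the real job would be the pairwise
    # metadata comparisons described above.
    return [len(title) for title in chunk]

titles = ["Knives Out", "Mother", "Glass Onion"] * 100
chunks = [titles[i::4] for i in range(4)]  # split the node's work across 4 workers

with ThreadPoolExecutor(max_workers=4) as pool:
    results = [r for part in pool.map(compute_features, chunks) for r in part]

assert len(results) == len(titles)  # all work accounted for, computed in parallel
```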
Operational Tooling: The Metaflow UI provides visibility into pipeline execution, helping identify skewed tasks (e.g., a movie with an unusually large filmography causing one task to run slowly). Failed instances can be restarted locally with additional debugging logs, which the presenter noted was a significant improvement over their previous PySpark-based system.
The more LLMOps-relevant portion of the talk focused on emerging work using LLMs for inferring entity types and relationships. This represents a newer, work-in-progress initiative that augments the traditional ML approaches.
Use Case: The team is exploring using LLMs to derive information that may not be explicitly available in the data but can be inferred from context and the model’s world knowledge. Examples include:
This information is valuable for building richer embeddings and establishing new semantic relationships in the graph.
Technical Approach: The system uses a combination of:
The presenter noted that LLMs are “getting much better” at converting unstructured documents to structured data and extracting graph structures, suggesting this is an area of active improvement.
Pipeline Architecture: The LLM-based extraction pipeline follows this structure:
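The talk did not publish pipeline code, but the general shape of such an extraction step can be sketched as prompt construction, a model call, and early validation of the structured output. Everything below is hypothetical: the prompt wording, the field names, and the stubbed `call_llm` (which in production would hit an actual LLM endpoint).

```python
import json

PROMPT_TEMPLATE = (
    "Given the following description of an entity from our catalog, return a JSON "
    "object with fields 'entity_type' and 'inferred_relations' (a list of "
    "subject/predicate/object triples). Description:\n{description}"
)

def call_llm(prompt: str) -> str:
    # Stubbed model response; a real deployment would call an LLM API here.
    return json.dumps({
        "entity_type": "film",
        "inferred_relations": [
            {"subject": "Knives Out", "predicate": "belongs_to_genre", "object": "whodunit"}
        ],
    })

def extract_graph_facts(description: str) -> dict:
    raw = call_llm(PROMPT_TEMPLATE.format(description=description))
    parsed = json.loads(raw)  # reject malformed model output early
    assert "entity_type" in parsed and isinstance(parsed["inferred_relations"], list)
    return parsed

facts = extract_graph_facts("A 2019 mystery film written and directed by Rian Johnson.")
print(facts["entity_type"])  # film
```

Validating the model's output against an expected schema before writing into the graph is the step that keeps an exploratory LLM stage from polluting a production knowledge graph.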
Why Metaflow for LLM Jobs: The same benefits that made Metaflow valuable for traditional ML apply to the LLM workloads:
A significant portion of the discussion focused on how Metaflow enables safe production operations, which is highly relevant to LLMOps practices:
Branch and Namespace Management: The team uses Metaflow’s project versioning and branching features extensively. This ensures:
When asked about this, the presenter explained that a developer can take the same training pipeline, deploy it in their personal namespace, and their deployment won’t accidentally overwrite production. The production branch is only updated through controlled merges.
Resource Management: The presenter acknowledged that while steps have fixed resource allocations by default, developers have the freedom to change resources for their jobs. This operates on a “freedom and responsibility” model where developers are trusted to use good judgment. When asked if resource limits could be set to prevent expensive accidental runs, the answer was that it’s possible but relies on developer judgment.
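Both mechanisms discussed above map to standard Metaflow decorators. The flow name, step names, and resource numbers below are illustrative, not Netflix's actual configuration; this is a config sketch rather than a runnable pipeline.

```python
from metaflow import FlowSpec, project, resources, step

@project(name="knowledge_graph")  # namespaced deployments: personal branches cannot overwrite production
class EntityMatchFlow(FlowSpec):

    @resources(cpu=8, memory=32000)  # per-step defaults; developers may raise these at their discretion
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    EntityMatchFlow()
```

With `@project`, a developer's deployment lives under their own branch (e.g. run with `--branch`), while the production branch is only updated through controlled merges, matching the "freedom and responsibility" model the presenter described.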
Debugging Distributed Systems: The ability to restart failed instances locally with additional debugging logs was highlighted as a major operational improvement. This is particularly valuable when dealing with data quality issues from various licensed content sources.
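Metaflow's `resume` command is the mechanism behind this workflow: it re-executes only the failed step, reusing artifacts from the successful steps of the original run. The flow file, step name, and run ID below are hypothetical.

```shell
# Reproduce a failed production task locally with extra logging,
# without re-running the steps that already succeeded.
python entity_match_flow.py resume --origin-run-id 1742
```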
The presentation illustrates a practical approach to integrating LLM capabilities into an existing ML infrastructure:
It’s worth noting that the LLM work was described as “work in progress” and “ongoing,” suggesting this is relatively early-stage compared to the mature entity matching infrastructure. The presenter was careful to frame this as exploratory work with “some success” rather than a fully proven production system.
The case study demonstrates how teams with strong ML infrastructure foundations can extend those patterns to LLM use cases, rather than treating LLMs as a completely separate operational domain. The emphasis on operability, debugging, resource management, and safe production deployment practices applies equally to both traditional ML and emerging LLM workloads.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
LinkedIn developed Hiring Assistant, an AI agent designed to transform the recruiting workflow by automating repetitive tasks like candidate sourcing, evaluation, and engagement across 1.2+ billion profiles. The system addresses the challenge of recruiters spending excessive time on pattern-recognition tasks rather than high-value decision-making and relationship building. Using a plan-and-execute agent architecture with specialized sub-agents for intake, sourcing, evaluation, outreach, screening, and learning, Hiring Assistant combines real-time conversational interfaces with large-scale asynchronous execution. The solution leverages LinkedIn's Economic Graph for talent insights, custom fine-tuned LLMs for candidate evaluation, and cognitive memory systems that learn from recruiter behavior over time. The result is a globally available agentic product that enables recruiters to work with greater speed, scale, and intelligence while maintaining human-in-the-loop control for critical decisions.
This comprehensive case study examines how multiple enterprises (Autodesk, KPMG, Canva, and Lightspeed) are deploying AI agents in production to transform their go-to-market operations. The companies faced challenges around scaling AI from proof-of-concept to production, managing agent quality and accuracy, and driving adoption across diverse teams. Using the Relevance AI platform, these organizations built multi-agent systems for use cases including personalized marketing automation, customer outreach, account research, data enrichment, and sales enablement. Results include significant time savings (tasks taking hours reduced to minutes), improved pipeline generation, increased engagement rates, faster customer onboarding, and the successful scaling of AI agents across multiple departments while maintaining data security and compliance standards.