ZenML

LLM Observability for Enhanced Audience Segmentation Systems

Acxiom 2025

Acxiom developed an AI-driven audience segmentation system using LLMs but faced challenges in scaling and debugging their solution. By implementing LangSmith, they achieved robust observability for their LangChain-based application, enabling efficient debugging of complex workflows involving multiple LLM calls, improved audience segment creation, and better token usage optimization. The solution successfully handled conversational memory, dynamic updates, and data consistency requirements while scaling to meet growing user demands.

Industry

Media & Entertainment

Overview

Acxiom is a global leader in customer intelligence and AI-enabled, data-driven marketing, operating as part of Interpublic Group (IPG). With over 55 years of experience and operations spanning the US, UK, Germany, China, Poland, and Mexico, the company specializes in high-performance solutions for customer acquisition and retention. This case study, published in January 2025, details how their Data and Identity Data Science team built and scaled a generative AI solution for dynamic audience segmentation, ultimately adopting LangSmith for production observability.

The core use case involves transforming natural language user inputs into detailed audience segments derived from Acxiom’s extensive transactional and predictive data catalog. For example, a marketer might request: “Identify an audience of men over thirty who rock climb or hike but aren’t married.” The system then needs to interpret this request and return a structured JSON object containing curated IDs and values from Acxiom’s data products.
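The case study does not show the actual output schema, but a request like the one above might plausibly map to a structure along these lines (all attribute IDs and field names here are hypothetical illustrations, not Acxiom's real data-product identifiers):

```python
import json

# Hypothetical structured segment for the example request; the schema and
# attribute IDs are illustrative, not Acxiom's actual catalog.
segment = {
    "segment_name": "male_30plus_outdoor_unmarried",
    "conditions": [
        {"attribute_id": "DEMO_GENDER", "operator": "eq", "value": "M"},
        {"attribute_id": "DEMO_AGE", "operator": "gt", "value": 30},
        {"attribute_id": "INT_OUTDOOR", "operator": "in",
         "value": ["rock_climbing", "hiking"]},
        {"attribute_id": "DEMO_MARITAL", "operator": "neq", "value": "married"},
    ],
}

print(json.dumps(segment, indent=2))
```

The key point is that free-form language must be resolved into concrete, catalog-backed IDs and operators, which is exactly where hallucination risk enters.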

Technical Challenges and Requirements

The Acxiom team faced several significant LLMOps challenges as they scaled their audience segmentation application. These challenges are representative of common issues encountered when moving LLM-based applications from prototype to production.

Conversational Memory and Context Management: The application required long-term memory capabilities to maintain context across potentially unrelated user conversations while building audience segments. This is a common challenge in production LLM applications where user sessions may span multiple interactions, and the system must track state effectively.

Dynamic Updates: The system needed the ability to refine or update audience segments during active sessions. This requirement introduces complexity in terms of state management and ensuring that modifications don’t introduce inconsistencies or hallucinations.

Data Consistency: Performing accurate attribute-specific searches without forgetting or hallucinating previously processed information was critical. In a marketing context, incorrect audience segmentation could lead to wasted ad spend or poorly targeted campaigns.
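The three requirements above interact: session state must persist across turns, accept refinements, and never silently lose or overwrite previously resolved attributes. A minimal sketch of that bookkeeping (not Acxiom's implementation) might look like:

```python
# Illustrative session-state tracker: keeps segment conditions across
# conversational turns and makes attribute updates explicit rather than
# allowing a silent, hallucination-prone merge.
class SegmentSession:
    def __init__(self):
        self.conditions = {}   # attribute_id -> condition dict
        self.history = []      # full turn log for long-term context

    def apply(self, turn_text, new_conditions):
        self.history.append(turn_text)
        for cond in new_conditions:
            attr = cond["attribute_id"]
            if attr in self.conditions and self.conditions[attr] != cond:
                # Surface the change so downstream steps can't "forget" it.
                print(f"updating {attr}: {self.conditions[attr]} -> {cond}")
            self.conditions[attr] = cond

session = SegmentSession()
session.apply("men over thirty",
              [{"attribute_id": "DEMO_AGE", "operator": "gt", "value": 30}])
session.apply("actually, make that over forty",
              [{"attribute_id": "DEMO_AGE", "operator": "gt", "value": 40}])
print(session.conditions["DEMO_AGE"]["value"])  # 40
```

Keeping the canonical segment state outside the LLM's context window, and feeding it back in per turn, is a common way to meet the data-consistency requirement.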

Initial Architecture and Pain Points

The team initially designed their workflow using LangChain’s Retrieval-Augmented Generation (RAG) tools combined with custom agentic code. The RAG workflow utilized metadata and data dictionary information from Acxiom’s core data products, including detailed descriptions. This is a sensible architectural choice for grounding LLM responses in specific, authoritative data sources.
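The grounding step can be sketched in miniature: retrieve the data-dictionary entries most relevant to the user's request before any generation happens. A real system would use embeddings and a vector store; word overlap keeps this toy version self-contained, and the catalog entries are invented for illustration:

```python
# Toy RAG retrieval over data-dictionary descriptions (hypothetical catalog).
catalog = {
    "INT_OUTDOOR": "interest in outdoor activities such as hiking and rock climbing",
    "DEMO_MARITAL": "whether the individual is married single or divorced",
    "FIN_INCOME": "estimated household income band",
}

def retrieve(query, k=2):
    # Score each attribute description by word overlap with the query.
    q = set(query.lower().split())
    scored = [(len(q & set(desc.lower().split())), attr)
              for attr, desc in catalog.items()]
    return [attr for score, attr in sorted(scored, reverse=True)[:k] if score > 0]

print(retrieve("men who rock climb or hike but aren't married"))
```

Only the retrieved attribute descriptions are then passed to the LLM, which constrains its answers to IDs that actually exist in the catalog.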

However, as the solution scaled, several production-related pain points emerged:

Complex Debugging: Failures or omissions in LLM reasoning cascaded into incorrect or hallucinated results. This is a particularly insidious problem in agentic systems where multiple LLM calls are chained together—an error in an early step can propagate and amplify through subsequent reasoning steps.

Scaling Issues: The team had initially developed a lightweight prompt input/output logging system, but this proved insufficient as the user base expanded. Simple logging approaches often lack the structured visibility needed to understand complex multi-step LLM workflows in production.

Evolving Requirements: Continuous feature growth demanded iterative development, introducing increasing complexity into the agent-based architecture. The team found themselves needing to add new agents, such as “overseer” and “researcher” agents, for more nuanced decision-making.

LangSmith Adoption and Integration

To address these pain points, Acxiom adopted LangSmith, LangChain’s LLM testing and observability platform. The integration reportedly required minimal additional effort due to their existing use of LangChain primitives.

It’s worth noting that this case study was published by LangChain themselves, so the claims should be evaluated with appropriate skepticism. However, the technical details provided do align with common LLMOps challenges and reasonable solutions.

Seamless Integration: LangSmith’s simple decorator-based approach allowed the team to gain visibility into LLM calls, function executions, and utility workflows without significant code refactoring. This low-friction integration is important for teams that need to add observability to existing systems.

Multi-Model Support: The platform supported Acxiom’s hybrid ecosystem of models, including open-source vLLM deployments, Claude via AWS Bedrock, and Databricks’ model endpoints. This flexibility was crucial for a team using multiple model providers in their production stack.

Tree-Structured Trace Visualization: LangSmith’s hierarchical trace visualization proved particularly valuable for understanding complex workflows. The case study mentions that some user interactions involved more than 60 LLM calls and consumed 200,000 tokens—a scale that would be extremely difficult to debug with traditional logging approaches.

Metadata Tracking: The platform’s metadata tracking capabilities helped the team identify bottlenecks in these complex request chains. Understanding where time and tokens are being spent is essential for cost optimization and performance tuning.

Annotation and Testing: LangSmith’s ability to log and annotate arbitrary code supported the team’s goal of streamlining unit test creation. The platform allowed them to adapt as new agents were added to the architecture.
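The decorator-based tracing idea can be illustrated conceptually (this is a from-scratch sketch, not LangSmith's actual API): each decorated function logs itself with its nesting depth, yielding the tree-structured trace described above.

```python
import functools

# Conceptual decorator-based tracer: records (depth, function name) so nested
# LLM calls and utility functions form a hierarchical trace.
_depth = 0
trace_log = []

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        global _depth
        trace_log.append((_depth, fn.__name__))
        _depth += 1
        try:
            return fn(*args, **kwargs)
        finally:
            _depth -= 1
    return wrapper

@traced
def call_llm(prompt):
    return f"response to: {prompt}"

@traced
def build_segment(request):
    call_llm("extract attributes from " + request)
    call_llm("map attributes to catalog IDs")
    return "segment"

build_segment("men over thirty who hike")
for depth, name in trace_log:
    print("  " * depth + name)
```

With 60+ calls per interaction, this tree view is what makes it possible to pinpoint which step in a chain introduced an error, rather than scanning a flat log.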

Production Scale Considerations

The case study reveals several interesting aspects of production-scale LLM deployment:

Token Economics: With interactions potentially consuming 200,000 tokens and involving 60+ LLM calls, token usage visibility became critical for cost management. The hybrid model approach (using different models for different tasks) suggests the team was actively optimizing for cost-performance tradeoffs.

Agent Architecture Evolution: The mention of adding “overseer” and “researcher” agents indicates an evolving multi-agent architecture. Production systems often grow more complex over time as edge cases are discovered and new requirements emerge.

User Base Growth: The emphasis on scalability suggests real production traffic concerns, though specific metrics on user counts or request volumes are not provided.
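The token-economics point is easy to make concrete. Using illustrative placeholder prices (not actual provider rates), routing most tokens to a cheaper model changes per-interaction cost substantially at the reported 200,000-token scale:

```python
# Back-of-envelope cost math for the scale reported in the case study.
# Prices are illustrative placeholders, not real provider rates.
tokens_per_interaction = 200_000
price_per_1k_tokens = {"frontier_model": 0.0150, "small_open_model": 0.0005}

# All tokens through one frontier model:
all_frontier = tokens_per_interaction / 1000 * price_per_1k_tokens["frontier_model"]

# Hybrid routing: 20% of tokens to the frontier model, 80% to a cheap model:
hybrid = (0.2 * tokens_per_interaction / 1000 * price_per_1k_tokens["frontier_model"]
          + 0.8 * tokens_per_interaction / 1000 * price_per_1k_tokens["small_open_model"])

print(f"all-frontier: ${all_frontier:.2f}, hybrid: ${hybrid:.2f} per interaction")
```

Without per-call token visibility, this kind of routing decision cannot be made empirically, which is presumably why the team emphasized usage tracking.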

Reported Outcomes

The case study claims several improvements, though specific metrics are notably absent:

Streamlined Debugging: Deep visibility into nested LLM calls and RAG agents simplified troubleshooting and accelerated development of more refined audience segments.

Improved Audience Reach: The hierarchical agent architecture reportedly led to more accurate and dynamic audience segment creation, though no quantitative improvement is specified.

Scalable Growth: The observability layer could handle increasing user demands and complexity without re-engineering, which is an important consideration for growing production systems.

Optimized Token Usage: Visibility into token and call usage informed cost management strategies, though again no specific savings are mentioned.

Critical Assessment

While this case study provides useful insights into production LLMOps challenges, several limitations should be noted:

The source is a vendor case study published by LangChain promoting their LangSmith product, so the presentation is inherently favorable. No comparative analysis with alternative observability solutions is provided, and specific quantitative metrics are largely absent.

That said, the challenges described—debugging complex agent chains, scaling observability, managing multi-model deployments, and controlling token costs—are all legitimate concerns faced by teams running LLM applications in production. The architectural decisions (RAG-based grounding, multi-agent systems, hybrid model deployment) represent reasonable approaches to building sophisticated LLM applications.

The case study provides a useful illustration of how LLMOps tools can address real production challenges, even if the specific claims about LangSmith’s benefits should be validated through independent evaluation.

Key Takeaways for LLMOps Practitioners

This case study highlights several important LLMOps considerations:

Logging Breaks Down at Scale: Simple prompt input/output logging is insufficient once workflows involve dozens of chained LLM calls; structured, hierarchical tracing becomes necessary.

Support Hybrid Model Stacks: Observability tooling should accommodate multiple providers—here, open-source vLLM deployments, Claude via AWS Bedrock, and Databricks endpoints—rather than assuming a single vendor.

Visibility Precedes Cost Control: Token and call tracking is a prerequisite for cost management when single interactions can consume 200,000 tokens across 60+ calls.

Validate Vendor Claims: Vendor-published case studies are useful signals of common patterns, but tool selection should rest on independent evaluation against alternatives.
