ZenML

LangSmith Implementation for Full Product Lifecycle Development and Monitoring

Wordsmith 2024
Wordsmith, an AI legal assistant platform, implemented LangSmith to enhance their LLM operations across the entire product lifecycle. They tackled challenges in prototyping, debugging, and evaluating complex LLM pipelines by utilizing LangSmith's hierarchical tracing, evaluation datasets, monitoring capabilities, and experimentation features. This implementation enabled faster development cycles, confident model deployment, efficient debugging, and data-driven experimentation while managing multiple LLM providers including OpenAI, Anthropic, Google, and Mistral.

Industry

Legal

Overview

Wordsmith is an AI assistant designed specifically for in-house legal teams, providing capabilities such as legal document review, email drafting, and contract generation using LLMs powered by customer knowledge bases. The company differentiates itself by claiming deep domain knowledge from leading law firms and seamless integration into communication tools like email and messaging systems. Their core value proposition is automating legal workflows in a way that mimics having an additional team member.

This case study, published by LangChain (the creators of LangSmith), documents how Wordsmith adopted LangSmith as their central LLMOps platform across the entire product development lifecycle. It’s worth noting that this is essentially a vendor case study, so the claims should be viewed with appropriate context—though the technical details provided do offer genuine insights into production LLM deployment patterns.

Technical Architecture and Data Sources

Wordsmith’s initial feature was a configurable RAG (Retrieval-Augmented Generation) pipeline for Slack, which has since evolved into a more complex system supporting multi-stage inferences across diverse data sources, including customer knowledge bases and communication tools such as email and messaging systems.

This heterogeneous data environment presents significant challenges for maintaining consistency and accuracy across different domains and NLP tasks. The company uses a multi-model strategy, leveraging LLMs from OpenAI, Anthropic, Google, and Mistral to optimize for different objectives including cost, latency, and accuracy.
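The multi-model strategy can be pictured as a routing table keyed by optimization objective. The sketch below uses the four providers named in the case study, but the objective-to-model mapping is hypothetical; the case study does not describe Wordsmith's actual routing logic.

```python
# Illustrative routing table for a multi-provider LLM strategy.
# The providers are those named in the case study; the specific
# objective-to-model assignments here are invented for illustration.
ROUTING_TABLE = {
    "accuracy": "anthropic/claude-3-5-sonnet",  # strongest reasoning
    "latency": "google/gemini-1.5-flash",       # fastest responses
    "cost": "mistral/mistral-small",            # cheapest per token
}

def pick_model(objective: str, default: str = "openai/gpt-4o") -> str:
    """Return a model identifier for a task objective, falling back
    to a general-purpose default for unknown objectives."""
    return ROUTING_TABLE.get(objective, default)
```

In practice such a router would also consult per-task evaluation results rather than a fixed table, which is exactly what the evaluation infrastructure described below enables.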

Prototyping and Development: Hierarchical Tracing

One of the key LLMOps challenges Wordsmith faced was managing the complexity of their multi-stage inference chains. Their agentic workflows can contain up to 100 nested inferences, making traditional logging approaches (they mention previously relying on CloudWatch logs) inadequate for debugging and development iteration.

LangSmith’s hierarchical tracing capability provides visibility into what each step of the inference chain receives as input and produces as output. This structured approach to trace organization allows engineers to quickly identify issues at specific points in complex workflows. The case study provides a specific example of debugging a scenario where GPT-4 generated an invalid DynamoDB query within an agentic workflow—something that would be extremely time-consuming to diagnose through flat log files.

The hierarchical organization of traces reportedly enables much faster iteration during development compared to their previous approach. This represents a common pattern in LLMOps where the move from traditional logging to LLM-specific observability tools provides significant developer productivity improvements, though the exact magnitude of improvement is not quantified in this case study.
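The debugging advantage of hierarchical traces can be illustrated with a minimal span tree. This is plain Python standing in for the LangSmith SDK (which builds such trees automatically when code is instrumented, e.g. via its `@traceable` decorator); the workflow names mirror the DynamoDB example above but are otherwise invented.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Minimal stand-in for hierarchical tracing: each step of an inference
# chain records its inputs/outputs as a span, and nested calls become
# child spans. This sketch only models the resulting tree structure,
# not the LangSmith SDK itself.
@dataclass
class Span:
    name: str
    inputs: dict
    outputs: dict | None = None
    children: list["Span"] = field(default_factory=list)

    def child(self, name: str, inputs: dict) -> "Span":
        span = Span(name, inputs)
        self.children.append(span)
        return span

    def find(self, name: str) -> "Span | None":
        """Locate a step anywhere in the tree by name - the operation
        that is painful with flat log files."""
        if self.name == name:
            return self
        for c in self.children:
            hit = c.find(name)
            if hit is not None:
                return hit
        return None

# A nested chain: agent -> query tool -> DynamoDB query generation.
root = Span("agent_workflow", {"question": "Find contract X"})
tool = root.child("query_tool", {"table": "contracts"})
query = tool.child("generate_dynamodb_query", {"prompt": "..."})
query.outputs = {"query": "INVALID SYNTAX"}  # the bug surfaces here

bad_step = root.find("generate_dynamodb_query")
```

With up to 100 nested inferences, jumping straight to the failing span's inputs and outputs is what replaces scanning interleaved CloudWatch log lines.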

Evaluation Infrastructure

A significant portion of Wordsmith’s LLMOps maturity appears to come from their investment in systematic evaluation. They have built static evaluation datasets covering a range of task types.

The case study articulates three key benefits of maintaining these evaluation sets. First, the process of creating evaluation sets forces the team to crystallize requirements by writing explicit question-answer pairs, which establishes clear expectations for LLM behavior. Second, well-defined evaluation sets enable rapid iteration with confidence—the case study claims that when Claude 3.5 was released, the team was able to compare its performance to GPT-4o within an hour and deploy to production the same day. Third, evaluation sets enable cost and latency optimization while maintaining accuracy as a constraint, with the case study claiming up to 10x cost reduction on particular tasks by using faster and cheaper models where appropriate.
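The model-swap workflow described above (score a candidate model against the incumbent on a fixed dataset, then promote only if it holds up) can be sketched with stubbed model functions. The Q/A pairs and scoring are invented placeholders; real runs would call the providers' APIs and score through evaluation tooling such as LangSmith's.

```python
# Sketch of comparing two models on a static evaluation set.
# The dataset entries and "models" are stand-ins; in production the
# targets would be real LLM calls with richer scoring than exact match.
EVAL_SET = [
    {"question": "Is a verbal contract binding?", "expected": "yes"},
    {"question": "Does an NDA need a term clause?", "expected": "no"},
]

def exact_match_score(model, eval_set) -> float:
    """Fraction of examples where the model's answer matches expected."""
    hits = sum(model(ex["question"]) == ex["expected"] for ex in eval_set)
    return hits / len(eval_set)

# Stub models: the incumbent answers both correctly, the candidate one.
def incumbent(question: str) -> str:
    answers = {"Is a verbal contract binding?": "yes",
               "Does an NDA need a term clause?": "no"}
    return answers[question]

def candidate(question: str) -> str:
    return "yes"  # always answers "yes"

promote = exact_match_score(candidate, EVAL_SET) >= exact_match_score(incumbent, EVAL_SET)
```

The same loop, run against a real dataset, is what let the team benchmark Claude 3.5 against GPT-4o within an hour of release: only the target function changes, not the dataset or the scoring.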

The emphasis on reproducible measurement as the differentiator between “a promising GenAI demo” and “a production-ready product” reflects broader industry recognition that evaluation infrastructure is essential for production LLM systems. However, the case study does not provide details on evaluation methodologies, metrics used, or how they handle subjective quality assessments common in legal document generation.

Production Monitoring and Debugging

Wordsmith uses LangSmith’s filtering and querying capabilities as part of their production monitoring infrastructure. A key operational improvement is the ability to immediately link production errors to their corresponding LangSmith traces. The case study claims this reduced debugging time from minutes to seconds—engineers can follow a LangSmith URL directly to the relevant trace rather than searching through logs.

LangSmith’s indexed queries allow the team to isolate production errors specifically related to inference issues, enabling more targeted investigation of LLM-specific problems versus other system issues. This separation of concerns is important in production systems where LLM failures may manifest differently from traditional software errors.
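Isolating inference failures from ordinary system errors amounts to a filtered query over tagged trace records. A minimal in-memory version of that filter is below; the record shape and tag names are hypothetical, and LangSmith provides the real equivalent through its indexed trace queries.

```python
# Sketch: isolate traces whose failure is inference-related, as opposed
# to ordinary infrastructure errors. Field and tag names are invented.
traces = [
    {"id": "t1", "error": "RateLimitError", "tags": ["inference"]},
    {"id": "t2", "error": "TimeoutError", "tags": ["network"]},
    {"id": "t3", "error": "InvalidToolCall", "tags": ["inference", "agent"]},
    {"id": "t4", "error": None, "tags": ["inference"]},  # succeeded
]

def inference_errors(traces):
    """Traces that both failed and are tagged as inference-related."""
    return [t for t in traces if t["error"] and "inference" in t["tags"]]

failing = inference_errors(traces)
```

Splitting the error stream this way lets engineers route LLM-specific failures (bad tool calls, malformed generations) to a different investigation path than network or infrastructure faults.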

Online Experimentation

Wordsmith integrates LangSmith with Statsig, their feature flag and experiment exposure library, to enable A/B testing of LLM-related features. The integration is achieved through LangSmith’s tagging system—each experiment exposure from Statsig is associated with a corresponding tag in LangSmith.

The case study includes a code snippet showing how they map experiment exposures to LangSmith tags through the RunnableConfig mechanism, which lets them filter traces by experiment arm and compare LLM behavior across variants.

This pattern of integrating LLM observability with general experimentation infrastructure is a sophisticated approach that enables data-driven decision making about LLM configurations, prompts, and model choices. The ability to correlate experiment exposures with actual LLM behavior and outputs provides a feedback loop that is difficult to achieve with traditional A/B testing alone.
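The exposure-to-tag mapping can be sketched as follows. The experiment and variant names are invented, and the commented-out invocation assumes LangChain's convention of passing tags through the config of a chain call; the case study confirms the RunnableConfig mechanism but not these specific names.

```python
# Sketch: turn a user's experiment exposures (e.g. from Statsig) into
# tags so every trace records which experiment arms produced it.
# Experiment and variant names here are hypothetical examples.
def exposure_tags(exposures: dict) -> list:
    """Map {experiment: variant} exposures to 'experiment:variant' tags."""
    return [f"{exp}:{variant}" for exp, variant in sorted(exposures.items())]

exposures = {"reranker_v2": "treatment", "prompt_style": "control"}
config = {"tags": exposure_tags(exposures)}
# The tags would then travel with the chain invocation, e.g.:
# chain.invoke(user_input, config=config)
```

Because every trace then carries its experiment arms, comparing output quality, latency, or cost between variants becomes a tag-filtered query rather than a bespoke analysis job.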

Future Directions: Customer-Specific Optimization

Wordsmith’s roadmap includes using LangSmith for customer-specific hyperparameter optimization, since their RAG pipelines expose numerous configurable parameters.

By mapping these hyperparameters to LangSmith tags (using the same technique as their experimentation system), they plan to create online datasets that can inform optimization of these parameters on a per-customer and per-use-case basis. The vision is for each customer’s RAG experience to be automatically optimized based on their specific datasets and query patterns.
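Selecting per-customer settings from tagged online runs could look like the sketch below. The parameter encodings, scores, and run records are all hypothetical, since the case study describes only the intent, not an implementation.

```python
from collections import defaultdict

# Sketch: aggregate quality scores per (customer, config-tag) pair and
# pick the best-scoring RAG configuration for each customer. The runs,
# config encodings, and scores are invented examples of the approach.
runs = [
    {"customer": "acme", "config": "chunk=512,k=4", "score": 0.72},
    {"customer": "acme", "config": "chunk=1024,k=8", "score": 0.81},
    {"customer": "acme", "config": "chunk=1024,k=8", "score": 0.79},
    {"customer": "globex", "config": "chunk=512,k=4", "score": 0.88},
]

def best_config_per_customer(runs):
    """Average score by (customer, config), then take the argmax
    configuration for each customer."""
    totals = defaultdict(lambda: [0.0, 0])
    for r in runs:
        key = (r["customer"], r["config"])
        totals[key][0] += r["score"]
        totals[key][1] += 1
    best = {}
    for (customer, config), (total, count) in totals.items():
        avg = total / count
        if customer not in best or avg > best[customer][1]:
            best[customer] = (config, avg)
    return {c: cfg for c, (cfg, _) in best.items()}

choices = best_config_per_customer(runs)
```

A production version would need guardrails absent from this sketch, such as minimum sample sizes and accuracy floors, which matters in a legal context where an aggressive cost/latency setting could degrade answer quality.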

This represents an ambitious goal of moving from static configuration to adaptive, customer-specific optimization—though the case study does not provide details on how they would implement such automation or handle the potential risks of automatic tuning in a legal context where accuracy is critical.

Critical Assessment

While this case study provides useful insights into LLMOps practices, several limitations should be noted. As a vendor-published case study from LangChain, it naturally emphasizes the benefits of LangSmith without discussing alternatives or limitations. The quantitative claims (10x cost reduction, same-day model deployment, debugging in seconds vs. minutes) are presented without detailed methodology or baseline comparisons.

The case study also doesn’t address several important LLMOps concerns for legal applications, including data privacy considerations when sending legal documents through tracing systems, compliance requirements, or how they handle potential LLM hallucinations in legal advice contexts.

Despite these caveats, the case study illustrates several best practices in production LLM deployment: investing in structured observability, maintaining reproducible evaluation sets, integrating experimentation infrastructure, and thinking systematically about the full product lifecycle rather than treating LLM deployment as a one-time event.
