Company: Wordsmith
Title: LangSmith Implementation for Full Product Lifecycle Development and Monitoring
Industry: Legal
Year: 2024

Summary: Wordsmith, an AI legal assistant platform, implemented LangSmith to enhance their LLM operations across the entire product lifecycle. They tackled challenges in prototyping, debugging, and evaluating complex LLM pipelines by using LangSmith's hierarchical tracing, evaluation datasets, monitoring capabilities, and experimentation features. This implementation enabled faster development cycles, confident model deployment, efficient debugging, and data-driven experimentation while managing multiple LLM providers including OpenAI, Anthropic, Google, and Mistral.
## Overview

Wordsmith is an AI assistant designed specifically for in-house legal teams, providing capabilities such as legal document review, email drafting, and contract generation using LLMs powered by customer knowledge bases. The company differentiates itself by claiming deep domain knowledge from leading law firms and seamless integration into communication tools like email and messaging systems. Their core value proposition is automating legal workflows in a way that mimics having an additional team member.

This case study, published by LangChain (the creators of LangSmith), documents how Wordsmith adopted LangSmith as their central LLMOps platform across the entire product development lifecycle. It is worth noting that this is essentially a vendor case study, so the claims should be viewed with appropriate context, though the technical details provided do offer genuine insights into production LLM deployment patterns.

## Technical Architecture and Data Sources

Wordsmith's initial feature was a configurable RAG (Retrieval-Augmented Generation) pipeline for Slack, which has since evolved into a more complex system supporting multi-stage inferences across diverse data sources. The system ingests data from:

- Slack messages
- Zendesk tickets
- Pull requests
- Legal documents

This heterogeneous data environment presents significant challenges for maintaining consistency and accuracy across different domains and NLP tasks. The company uses a multi-model strategy, leveraging LLMs from OpenAI, Anthropic, Google, and Mistral to optimize for different objectives including cost, latency, and accuracy.

## Prototyping and Development: Hierarchical Tracing

One of the key LLMOps challenges Wordsmith faced was managing the complexity of their multi-stage inference chains. Their agentic workflows can contain up to 100 nested inferences, making traditional logging approaches (they mention previously relying on CloudWatch logs) inadequate for debugging and development iteration.

LangSmith's hierarchical tracing capability provides visibility into what each step of the inference chain receives as input and produces as output. This structured approach to trace organization allows engineers to quickly identify issues at specific points in complex workflows. The case study provides a specific example of debugging a scenario where GPT-4 generated an invalid DynamoDB query within an agentic workflow, something that would be extremely time-consuming to diagnose through flat log files.

The hierarchical organization of traces reportedly enables much faster iteration during development compared to their previous approach. This represents a common pattern in LLMOps where the move from traditional logging to LLM-specific observability tools provides significant developer productivity improvements, though the exact magnitude of improvement is not quantified in this case study.
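The case study does not include Wordsmith's tracing code, but a minimal sketch of hierarchical tracing with the LangSmith Python SDK gives a sense of the pattern. It assumes tracing is enabled via environment variables (for example `LANGSMITH_TRACING=true` and `LANGSMITH_API_KEY`) and uses hypothetical function names in place of Wordsmith's actual pipeline steps.

```python
from langsmith import traceable

# Each @traceable function becomes a node in the trace tree; nested calls
# appear as child runs under their parent, mirroring the inference chain.
@traceable(name="retrieve_context")
def retrieve_context(question: str) -> list[str]:
    # Hypothetical retrieval step standing in for a RAG lookup.
    return ["Clause 4.2: termination requires 30 days' written notice."]

@traceable(name="draft_answer")
def draft_answer(question: str, context: list[str]) -> str:
    # Hypothetical generation step; in production this would call an LLM.
    return f"Based on {len(context)} passage(s): the notice period is 30 days."

@traceable(name="answer_legal_question")
def answer_legal_question(question: str) -> str:
    # Parent run: the two child runs above are grouped under this trace,
    # so the inputs and outputs of each stage are visible in LangSmith.
    context = retrieve_context(question)
    return draft_answer(question, context)

if __name__ == "__main__":
    print(answer_legal_question("What is the termination notice period?"))
```

With tracing enabled, each nested call shows up as a child run under the top-level trace, which is the property Wordsmith relies on when debugging deeply nested agentic chains instead of scanning flat log files.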
## Evaluation Infrastructure

A significant portion of Wordsmith's LLMOps maturity appears to come from their investment in systematic evaluation. They have created static evaluation datasets for various task types including:

- RAG pipelines
- Agentic workloads
- Attribute extractions
- XML-based changeset targeting

The case study articulates three key benefits of maintaining these evaluation sets. First, the process of creating evaluation sets forces the team to crystallize requirements by writing explicit question-answer pairs, which establishes clear expectations for LLM behavior. Second, well-defined evaluation sets enable rapid iteration with confidence: the case study claims that when Claude 3.5 was released, the team was able to compare its performance to GPT-4o within an hour and deploy to production the same day. Third, evaluation sets enable cost and latency optimization while maintaining accuracy as a constraint, with the case study claiming up to a 10x cost reduction on particular tasks by using faster and cheaper models where appropriate.

The emphasis on reproducible measurement as the differentiator between "a promising GenAI demo" and "a production-ready product" reflects broader industry recognition that evaluation infrastructure is essential for production LLM systems. However, the case study does not provide details on evaluation methodologies, metrics used, or how they handle the subjective quality assessments common in legal document generation.

## Production Monitoring and Debugging

Wordsmith uses LangSmith's filtering and querying capabilities as part of their production monitoring infrastructure. A key operational improvement is the ability to immediately link production errors to their corresponding LangSmith traces. The case study claims this reduced debugging time from minutes to seconds: engineers can follow a LangSmith URL directly to the relevant trace rather than searching through logs.

LangSmith's indexed queries allow the team to isolate production errors specifically related to inference issues, enabling more targeted investigation of LLM-specific problems versus other system issues. This separation of concerns is important in production systems where LLM failures may manifest differently from traditional software errors.

## Online Experimentation

Wordsmith integrates LangSmith with Statsig, their feature flag and experiment exposure library, to enable A/B testing of LLM-related features. The integration is achieved through LangSmith's tagging system: each experiment exposure from Statsig is associated with a corresponding tag in LangSmith. The case study includes a code snippet showing how they map experiment exposures to LangSmith tags through the RunnableConfig mechanism (a sketch of this pattern appears below). This allows them to:

- Query traces by experiment group
- Save filtered traces to new datasets
- Export datasets for downstream analysis

This pattern of integrating LLM observability with general experimentation infrastructure is a sophisticated approach that enables data-driven decision making about LLM configurations, prompts, and model choices. The ability to correlate experiment exposures with actual LLM behavior and outputs provides a feedback loop that is difficult to achieve with traditional A/B testing alone.
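Wordsmith's actual snippet is not reproduced in this writeup. The sketch below shows the general shape of the pattern, assuming LangChain's `RunnableConfig` tags, the `langchain-openai` chat model wrapper, and a hypothetical Statsig experiment name and LangSmith project; none of these names come from the case study.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langsmith import Client

# Hypothetical experiment name and group; in Wordsmith's setup the group
# would come from a Statsig exposure check rather than a hard-coded value.
EXPERIMENT = "rag_reranker_v2"
group = "treatment"

chain = (
    ChatPromptTemplate.from_template("Summarize this clause: {clause}")
    | ChatOpenAI(model="gpt-4o-mini")
)

# Tags passed via RunnableConfig are attached to the resulting LangSmith
# trace, so runs can later be filtered by experiment and group.
result = chain.invoke(
    {"clause": "Either party may terminate with 30 days' written notice."},
    config={"tags": [f"exp:{EXPERIMENT}", f"group:{group}"]},
)

# Later, pull back the tagged runs using LangSmith's run-filter syntax.
client = Client()
runs = client.list_runs(
    project_name="wordsmith-prod",  # hypothetical project name
    filter=f'has(tags, "exp:{EXPERIMENT}")',
)
```

Tag-filtered runs like these can then be saved to a dataset and exported, which matches the query, save, and export workflow described above.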
## Future Directions: Customer-Specific Optimization

Wordsmith's roadmap includes using LangSmith for customer-specific hyperparameter optimization. Their RAG pipelines have numerous configurable parameters including:

- Embedding models
- Chunk sizes
- Ranking and re-ranking configurations

By mapping these hyperparameters to LangSmith tags (using the same technique as their experimentation system), they plan to create online datasets that can inform optimization of these parameters on a per-customer and per-use-case basis. The vision is for each customer's RAG experience to be automatically optimized based on their specific datasets and query patterns.

This represents an ambitious goal of moving from static configuration to adaptive, customer-specific optimization, though the case study does not provide details on how they would implement such automation or handle the potential risks of automatic tuning in a legal context where accuracy is critical.

## Critical Assessment

While this case study provides useful insights into LLMOps practices, several limitations should be noted. As a vendor-published case study from LangChain, it naturally emphasizes the benefits of LangSmith without discussing alternatives or limitations. The quantitative claims (10x cost reduction, same-day model deployment, debugging in seconds versus minutes) are presented without detailed methodology or baseline comparisons.

The case study also does not address several important LLMOps concerns for legal applications, including data privacy considerations when sending legal documents through tracing systems, compliance requirements, or how they handle potential LLM hallucinations in legal advice contexts.

Despite these caveats, the case study illustrates several best practices in production LLM deployment: investing in structured observability, maintaining reproducible evaluation sets, integrating experimentation infrastructure, and thinking systematically about the full product lifecycle rather than treating LLM deployment as a one-time event.
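As a concrete illustration of the "reproducible evaluation sets" practice highlighted above, the sketch below builds and runs a small LangSmith dataset with recent versions of the Python SDK. The dataset name, example pair, target function, and exact-match evaluator are illustrative assumptions, not Wordsmith's actual evaluation suite.

```python
from langsmith import Client, evaluate

client = Client()

# Create a small dataset of explicit question/answer pairs.
dataset = client.create_dataset(dataset_name="contract-review-smoke-test")
client.create_examples(
    inputs=[{"question": "What is the governing law of this template MSA?"}],
    outputs=[{"answer": "Delaware"}],
    dataset_id=dataset.id,
)

def target(inputs: dict) -> dict:
    # Stand-in for the real pipeline under test (RAG chain, agent, etc.).
    return {"answer": "Delaware"}

def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    # Toy evaluator; real evaluations would likely use richer scoring.
    return outputs["answer"] == reference_outputs["answer"]

# Run the target against the dataset; results appear as an experiment
# in LangSmith, which is what enables side-by-side model comparisons.
results = evaluate(
    target,
    data="contract-review-smoke-test",
    evaluators=[exact_match],
    experiment_prefix="gpt-4o-vs-claude",
)
```

Swapping the model behind `target` and re-running the same dataset is the kind of like-for-like comparison the case study credits for same-day model upgrades.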
