Production Agents: Real-world Implementations of LLM-powered Autonomous Systems

Various 2023

A panel discussion featuring three practitioners implementing LLM-powered agents in production: Sam's personal assistant with real-time feedback and router agents, Div's browser automation system MultiOn with reliability and monitoring features, and Devin's GitHub repository assistant that helps with code understanding and feature requests. Each presenter shared their architecture choices, testing strategies, and approaches to handling challenges like latency, reliability, and model selection in production environments.

Overview

This case study captures insights from a LangChain-hosted webinar featuring three practitioners who have deployed LLM-powered agents in production settings. The discussion provides practical, battle-tested techniques for building reliable autonomous agents, addressing common challenges like context management, user experience, reliability, and cost optimization. The three presenters—Sam (personal assistant), Div (MultiOn browser automation), and Devin (GitHub code navigation bot)—each bring different perspectives on productionizing agents.

Sam’s Personal Assistant: Real-Time Feedback and Router Agents

Sam has been building a conversational personal assistant for approximately six months and has previously contributed to LangChain in areas like memory and vector stores. His agent uses a ReAct-style architecture (thought → action → action input → observation loops) and primarily interacts with APIs rather than generating code.

Real-Time User Feedback Integration

One of Sam’s key innovations addresses the common “rabbit hole” problem where agents pursue unproductive paths while users watch helplessly. Rather than implementing a confirmation step at each action (as in AutoGPT), Sam created a real-time feedback mechanism that allows users to guide the agent mid-execution without disrupting flow.

The implementation uses a parallel WebSocket connection that allows users to send messages while the agent is running. These messages are written to a Redis store, and before each planning stage in the executor loop, the agent reads from this store to check for new user input. Any feedback is appended to the intermediate steps before the next planning phase. The prompt is modified to introduce the concept of “user feedback” as a special tool, instructing the agent to prioritize recent user input over past observations.
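
A minimal sketch of this loop, assuming a Redis list per session; all names and the key layout are illustrative rather than Sam's actual code:

```python
import redis

# Illustrative sketch of the mid-execution feedback pattern described
# above; function names and key layout are assumptions.
r = redis.Redis(decode_responses=True)

def push_feedback(session_id: str, message: str) -> None:
    # Called by the WebSocket handler whenever the user types mid-run.
    r.rpush(f"agent:feedback:{session_id}", message)

def drain_feedback(session_id: str) -> list[str]:
    # Called by the executor loop immediately before each planning step.
    key = f"agent:feedback:{session_id}"
    pipe = r.pipeline()
    pipe.lrange(key, 0, -1)
    pipe.delete(key)
    messages, _ = pipe.execute()
    return messages

def before_planning(session_id: str, intermediate_steps: list) -> list:
    # New input is appended as the output of a special "user feedback"
    # tool, which the prompt tells the agent to prioritize over older
    # observations.
    for message in drain_feedback(session_id):
        intermediate_steps.append(("user_feedback", message))
    return intermediate_steps
```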

This approach feels more organic than forced confirmations—users can optionally intervene rather than being required to approve every step. Sam demonstrated an example where he redirected the agent from checking his calendar to checking text messages mid-execution, simply by typing feedback while the agent was working.

Router Agent Architecture

Sam’s second major technique involves heavy use of router agents to address the tool explosion problem. When agents have access to CRUD operations across many APIs, they quickly run out of context space and make poor tool selection decisions.

The solution involves defining “product flows”—specific user workflows that appear frequently—and creating template agents constrained to tools relevant for each flow. For example, a scheduling agent gets access to calendar retrieval, calendar creation, and user scheduling preferences. A “context on person” agent gets access to recent email search, contact search, and web search for researching individuals.

The implementation uses a conversational agent at the top level, with template names and descriptions exposed as tools. A fallback “dynamic agent” handles queries that don’t match any template by using an LLM to predict relevant tools. When a specialized agent is triggered, it runs as a background task while the conversational agent returns an acknowledgment to the user.
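
A condensed sketch of that routing structure, with hypothetical template definitions and stubbed LLM calls (the webinar describes the pattern, not this code):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class TemplateAgent:
    name: str
    description: str   # exposed to the top-level agent as a tool
    tools: list[str]   # only the tools relevant to this product flow
    instructions: str  # flow-specific guidance, not just tool access

    async def run(self, query: str) -> str:
        # A constrained executor loop over self.tools would go here.
        return f"[{self.name}] handled: {query}"

TEMPLATES = {
    "scheduling": TemplateAgent(
        "scheduling", "Schedule meetings for the user",
        ["calendar_get", "calendar_create", "scheduling_prefs"],
        "Check availability before proposing times."),
    "context_on_person": TemplateAgent(
        "context_on_person", "Research a person the user mentions",
        ["recent_email_search", "contact_search", "web_search"],
        "Summarize who they are and recent interactions."),
}

def pick_template(query: str) -> TemplateAgent | None:
    # Stub: in the real system an LLM chooses from the template
    # names and descriptions above.
    for name, template in TEMPLATES.items():
        if name.split("_")[0] in query.lower():
            return template
    return None

def build_dynamic_agent(query: str) -> TemplateAgent:
    # Fallback "dynamic agent": an LLM predicts the relevant tools.
    return TemplateAgent("dynamic", "fallback", ["web_search"], "")

async def handle(query: str) -> str:
    agent = pick_template(query) or build_dynamic_agent(query)
    asyncio.create_task(agent.run(query))  # runs in the background
    return "Working on it; I'll follow up shortly."
```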

Sam found that providing template-specific instructions (not just tool selection) significantly improved performance—the agent doesn’t have to “reinvent the wheel” for common scenarios like scheduling meetings.

Div’s MultiOn: Browser Automation at Scale

Div (from MultiOn) presented a browser automation agent that can control any website through natural language commands. The system can order food on DoorDash, post on Twitter, navigate GitHub, and even modify AWS configurations—all through conversational instructions.

Technical Architecture

MultiOn uses a custom DOM parser that compresses HTML into a highly efficient representation, achieving 90% website coverage in under 2K tokens. The parser is general-purpose and works across websites without customization, though approximately 5% of sites with unusual DOM structures may require adjustments.
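
MultiOn's parser is proprietary, so the following is only a toy illustration of the idea, emitting one compact line per actionable element, with BeautifulSoup as an assumed stand-in:

```python
from bs4 import BeautifulSoup  # assumption: any HTML parser works here

INTERACTIVE = ["a", "button", "input", "select", "textarea"]

def compress_dom(html: str, max_label: int = 40) -> str:
    # Emit one short line per actionable element ("<id> <tag> <label>"),
    # discarding layout markup so a page fits in a small token budget.
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for i, el in enumerate(soup.find_all(INTERACTIVE)):
        label = (
            el.get("aria-label")
            or el.get("placeholder")
            or " ".join(el.get_text(" ", strip=True).split())
        )
        lines.append(f"{i} {el.name} {label[:max_label]}")
    return "\n".join(lines)
```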

The system incorporates multimodal elements through OCR for icons and images, which is particularly useful for food ordering apps where visual appeal influences decisions. Currently, the system is primarily text-based with OCR augmentation rather than fully vision-based.

Reliability Challenges and Mitigations

Div explicitly contrasted MultiOn with AutoGPT, noting that AutoGPT “sucks” because it fails over 90% of the time, lacks practical use cases, and has zero personalization. MultiOn aims to address these production requirements.

Several reliability mechanisms are in place or planned, including critic agents and verification steps that check the executor's work. Div acknowledged the tension between accuracy (which benefits from these checks) and latency (which users expect to be "snappy"): different use cases may warrant different architectural tradeoffs.

Platform Integration

MultiOn demonstrated integration with ChatGPT’s plugin ecosystem, allowing users to automate web tasks directly from ChatGPT. The system also works on mobile apps—a hackathon demo showed ordering food by analyzing a photo of ingredients using multimodal models, then triggering the browser agent running on a laptop or cloud server.

Comparison to Adept

When asked how MultiOn achieves comparable results to Adept (which raised $350M and is training custom models), Div argued that Adept started too early and overinvested in data collection and custom architectures. GPT-4, trained on essentially the entire internet, provides sufficient capability for most tasks; the real challenge is building reliable systems around existing models, not training the best model from scratch.

Devin’s GitHub Code Navigation Bot

Devin focuses on using LLMs’ code understanding capabilities for non-coding engineering tasks: customer support, documentation, triage, and communicating technical changes to non-technical stakeholders. The immediate application is a GitHub bot that helps with feature requests by navigating codebases and suggesting implementation approaches.

The Problem Space

Large open-source projects face overwhelming maintenance overhead. LangChain itself had over 1,000 issues, 200 open pull requests, and nearly 50 new issues in 48 hours at the time of the discussion. Maintainers spend significant time giving contributors context and guidance rather than reviewing code. The goal is to accelerate contributions while reducing maintainer burden.

Workflow-Focused Architecture

Following the same routing philosophy as Sam, Devin started with a single, well-defined workflow: “closed-end feature requests” that adapt existing repository structure without requiring extensive external information. The bot indexes repositories and responds to issues by finding relevant files and suggesting implementation approaches.

A key architectural concept Devin introduced is “checkpointing”—designing agents so that intermediate results are useful even if the full task can’t be completed. If the agent can’t figure out how to change code but has found relevant files, it posts those files. If it identifies relevant existing issues or appropriate code owners, it shares those. This ensures the agent provides value even when it can’t complete end-to-end automation.
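
A minimal sketch of checkpointing, with the search and generation steps stubbed out; the helper names are hypothetical:

```python
class AgentGaveUp(Exception):
    """Raised when the agent cannot produce a full implementation plan."""

def find_relevant_files(issue: str) -> list[str]:
    return []  # stub: search over the indexed repository

def find_related_issues(issue: str) -> list[str]:
    return []  # stub: similarity search over existing issues

def suggest_implementation(issue: str, files: list[str]) -> str:
    raise AgentGaveUp  # stub: LLM call that may fail to produce a plan

def respond_to_issue(issue: str) -> str:
    # Each intermediate result is a checkpoint that can be posted on
    # its own if the end-to-end task fails.
    files = find_relevant_files(issue)
    related = find_related_issues(issue)
    try:
        return suggest_implementation(issue, files)
    except AgentGaveUp:
        parts = []
        if files:
            parts.append("Files that look relevant:\n" + "\n".join(files))
        if related:
            parts.append("Possibly related issues:\n" + "\n".join(related))
        # If even the checkpoints are empty, say nothing rather than
        # posting a wrong answer.
        return "\n\n".join(parts)
```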

Technical Implementation Details

The bot uses a planning agent that creates a task list, with each task potentially triggering a specialized execution agent. Currently, plans are fixed once created (avoiding the AutoGPT loop problem), though revisiting plans is a future goal, especially with human-in-the-loop capabilities.
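
In sketch form, assuming stub planner and executor functions:

```python
def make_plan(issue: str) -> list[str]:
    # Stub: a planning-LLM call that returns an ordered task list.
    return ["locate relevant files", "draft the change"]

def pick_executor(task: str):
    # Stub: each task type would map to a specialized execution agent.
    return lambda task, prior: f"done: {task}"

def run_feature_request(issue: str) -> list[str]:
    # The plan is created once and then fixed; there is no mid-run
    # re-planning, which avoids the AutoGPT loop problem.
    results: list[str] = []
    for task in make_plan(issue):
        results.append(pick_executor(task)(task, results))
    return results
```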

A critical insight is that user requests are not search queries. The system must translate natural language feature requests into queries that can effectively search a codebase, which Devin addresses with several query-transformation techniques. He emphasized that search for agents differs from search for humans: agents are more patient and can process more results, so iterating on queries is acceptable as long as the agent can effectively navigate the information.
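
The talk names the problem rather than the exact techniques, but a generic query-transformation step might look like this, with `llm` an injected completion function:

```python
def to_search_queries(feature_request: str, llm) -> list[str]:
    # Agents are patient: generate several candidate code-search
    # queries and iterate, instead of betting on one human-style query.
    prompt = (
        "Rewrite this feature request as three short code-search "
        "queries (likely identifiers, file names, symbols), one per "
        "line:\n\n" + feature_request
    )
    return [q.strip() for q in llm(prompt).splitlines() if q.strip()]
```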

Situating the Agent

An interesting prompt engineering insight: the “where” context (current directory, file list, repository summary) dramatically improves agent performance. Situating the agent in its environment helps both decision-making and synthetic data generation.
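
As a sketch, the "where" context is just a prefix assembled ahead of the task prompt (the names here are illustrative):

```python
def situate(task: str, repo_summary: str, cwd: str, files: list[str]) -> str:
    # Prefix the "where" context before the task itself; per the talk,
    # this grounding alone noticeably improves the agent's decisions.
    return (
        f"Repository summary: {repo_summary}\n"
        f"Current directory: {cwd}\n"
        f"Files here: {', '.join(files)}\n\n"
        f"Task: {task}"
    )
```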

Cross-Cutting Production Themes

Model Selection

All presenters use a mix of GPT-3.5 and GPT-4, with clear patterns in when each model is used. Sam noted that 3.5 can sometimes substitute for 4 if you provide very specific, detailed instructions rather than relying on the model to reason.

Speed and Latency

The presenters described multiple strategies for addressing the speed challenge.

Testing Non-Deterministic Outputs

Testing non-deterministic outputs remains challenging across all implementations.

Div drew parallels to autonomous driving testing, suggesting that agent testing may eventually require simulation environments and automated evaluation infrastructure.

The “Do Nothing” Principle

Multiple presenters emphasized that doing nothing is better than doing the wrong thing. If an agent can’t handle a request, failing gracefully (or explicitly declining) maintains user trust and avoids creating additional confusion or work.
