Airtop developed a web automation platform that enables AI agents to interact with websites through natural language commands. They leveraged the LangChain ecosystem (LangChain, LangSmith, and LangGraph) to build flexible agent architectures, integrate multiple LLM providers, and implement robust debugging and testing processes. The platform successfully enables structured information extraction and real-time website interactions while maintaining reliability and scalability.
Airtop is a technology platform that provides browser automation capabilities specifically designed for AI agents. The company’s core value proposition is enabling developers to create web automations that allow AI agents to perform complex tasks like logging in, extracting information, filling forms, and interacting with web interfaces—all through natural language commands rather than traditional scripting approaches. This case study, published in November 2024, details how Airtop built its production infrastructure using the LangChain ecosystem of tools.
The fundamental problem Airtop addresses is that AI agents are only as useful as the data they can access, and navigating websites at scale introduces significant technical challenges including authentication flows and CAPTCHA handling. Traditional approaches often require complex CSS selector manipulation or Puppeteer scripts, which are brittle and difficult to maintain. Airtop’s solution provides a more reliable abstraction layer through natural language APIs.
Airtop has developed two primary API offerings that leverage LLM capabilities:
Extract API: This enables structured information extraction from web pages, supporting use cases like extracting speaker lists, LinkedIn URLs, or monitoring flight prices. Notably, it also works with authenticated sites, enabling applications in social listening and e-commerce monitoring.
Act API: This adds the capability to take actions on websites, such as entering search queries or interacting with UI elements in real-time.
Both of these capabilities require sophisticated LLM integration to interpret natural language commands and translate them into appropriate web interactions, making robust LLMOps practices essential for production reliability.
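To make the two API shapes concrete, here is a minimal sketch of what a natural-language extraction call might look like. The client class, method names, and response structure below are hypothetical stand-ins, not Airtop's actual SDK; the canned response replaces the real LLM-backed pipeline.

```python
from dataclasses import dataclass

@dataclass
class ExtractResult:
    data: dict
    source_url: str

class MockAirtopClient:
    """Hypothetical stand-in for an Extract-style API client: a natural-language
    prompt goes in, validated structured data comes out."""

    def extract(self, url: str, prompt: str) -> ExtractResult:
        # A real implementation would load the page, pass its content and the
        # prompt to an LLM, and validate the structured output. Mocked here.
        return ExtractResult(
            data={"speakers": ["Ada Lovelace", "Alan Turing"]},
            source_url=url,
        )

client = MockAirtopClient()
result = client.extract(
    "https://example.com/conference",
    "Extract the list of speakers as a JSON array under the key 'speakers'.",
)
print(result.data["speakers"])
```

The key design point is that the caller supplies intent in natural language rather than CSS selectors, which is what makes the abstraction resilient to page-layout changes.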
A critical architectural decision for Airtop was choosing how to integrate multiple LLM providers into their platform. The team selected LangChain primarily for its “batteries-included” approach to model integration. LangChain provides built-in integrations for major providers, including OpenAI (the GPT-4 series), Anthropic (Claude), Fireworks, and Google (Gemini).
According to Kyle, Airtop’s AI Engineer, the standardized interface that LangChain provides has been transformative for their development workflow. The ability to switch between models effortlessly has proven critical as the team optimizes for different use cases. This flexibility is particularly important in production environments where different tasks may benefit from different model characteristics—some tasks might require GPT-4’s reasoning capabilities while others might benefit from Claude’s longer context windows or Gemini’s multimodal features.
From an LLMOps perspective, this abstraction layer is significant because it allows the team to respond to changes in the model landscape without major architectural rewrites. As new models become available or existing models are deprecated, the standardized interface minimizes the migration effort required.
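The abstraction-layer pattern can be sketched in plain Python. The registry and mock backends below are illustrative (this is not LangChain's actual API): every backend exposes the same `invoke(prompt)` interface, so swapping providers is a configuration change rather than an architectural rewrite.

```python
from typing import Callable, Dict

def openai_backend(prompt: str) -> str:
    # Mock standing in for a call to an OpenAI model.
    return f"[gpt-4 mock] {prompt}"

def anthropic_backend(prompt: str) -> str:
    # Mock standing in for a call to an Anthropic model.
    return f"[claude mock] {prompt}"

# One registry, one interface: adding or deprecating a provider touches
# only this mapping, never the call sites.
MODEL_REGISTRY: Dict[str, Callable[[str], str]] = {
    "gpt-4": openai_backend,
    "claude": anthropic_backend,
}

def invoke(model: str, prompt: str) -> str:
    return MODEL_REGISTRY[model](prompt)

print(invoke("gpt-4", "Summarize this page."))
print(invoke("claude", "Summarize this page."))
```

Because the call site is identical for every provider, migrating off a deprecated model is a one-line registry edit, which is the operational flexibility the case study describes.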
As Airtop expanded their browser automation capabilities, the engineering team adopted LangGraph to build their agent system. LangGraph’s flexible architecture enabled Airtop to construct individual browser automations as subgraphs, which represents a modular approach to agent design.
This subgraph architecture provides several LLMOps benefits:
Future-proofing: New automations can be added as additional subgraphs without redesigning the overall control flow. This is crucial for a rapidly evolving product where new capabilities are frequently being developed.
Dynamic control: The team gains more granular control over agent behavior without monolithic code changes.
Validation and reliability: LangGraph helped Airtop validate the accuracy of agent steps as the agent took actions on websites. This is a critical quality assurance feature for production deployments where incorrect actions could have real consequences.
The team’s development philosophy is noteworthy from an LLMOps maturity perspective. Rather than attempting to build sophisticated agents from the start, they began with micro-capabilities—small, focused agent functions—and then progressively built more sophisticated agents capable of clicking on elements and performing keystrokes. This incremental approach reduces risk and allows for thorough validation at each stage of capability expansion.
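The subgraph-plus-validation idea can be illustrated in plain Python. The sketch below is not LangGraph's actual API; it simply shows the pattern: each automation is a small pipeline of micro-capability steps, each step returns updated state, and a validator checks the state before the next step runs.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Step = Callable[[State], State]

def run_subgraph(steps: List[Step], state: State,
                 validate: Callable[[State], bool]) -> State:
    """Run each step and validate the resulting state before continuing."""
    for step in steps:
        state = step(state)
        if not validate(state):
            raise RuntimeError(f"validation failed after {step.__name__}")
    return state

# Micro-capabilities: small, focused agent functions (hypothetical names).
def click_search_box(state: State) -> State:
    return {**state, "focused": "search_box"}

def type_query(state: State) -> State:
    return {**state, "typed": state["query"]}

# The "search" automation is one subgraph; new automations are added as new
# step lists without redesigning this control flow.
search_subgraph = [click_search_box, type_query]

result = run_subgraph(
    search_subgraph,
    {"query": "flight prices"},
    validate=lambda s: "error" not in s,
)
print(result)
```

Validating after every step, rather than only at the end, is what lets an agent halt before an incorrect action compounds, which matters when actions have real-world effects on live websites.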
LangSmith plays a central role in Airtop’s development and operations workflow. The team’s adoption of LangSmith evolved organically—they initially began using it to debug issues surfaced through customer support tickets, but quickly discovered broader applications across the development lifecycle.
One of the most valuable LangSmith features for Airtop is its multimodal debugging functionality. When working with AI models from OpenAI or Anthropic, error messages can often be nebulous or uninformative. LangSmith’s debugging tools provide clarity in these situations, allowing the team to identify whether issues stem from formatting problems or misplaced prompt components. This diagnostic capability is essential for production troubleshooting where rapid issue resolution directly impacts customer satisfaction.
The team leverages LangSmith’s playground feature extensively for prompt iteration and testing. The playground allows them to run parallel model requests, simulating real-world use cases in a controlled environment. This capability speeds up internal development workflows significantly—rather than deploying changes to production to test prompt modifications, the team can iterate rapidly in the playground.
The ability to compare responses across different models and prompt variations is particularly valuable for Airtop’s use case, where they need to ensure consistent behavior across the multiple model providers they support. This parallel testing capability helps the team identify which prompts work well across different models versus which might need model-specific tuning.
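The parallel-comparison workflow can be sketched as follows. The mock model functions below are hypothetical stand-ins for real provider calls; the point is the fan-out pattern, with the same prompt sent to several backends so the responses can be compared side by side, as in LangSmith's playground.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

def mock_gpt4(prompt: str) -> str:
    # Stand-in for an OpenAI call; uppercases to make outputs distinguishable.
    return f"gpt-4: {prompt.upper()}"

def mock_claude(prompt: str) -> str:
    # Stand-in for an Anthropic call.
    return f"claude: {prompt.lower()}"

def compare(prompt: str, models: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Send the same prompt to every model concurrently and collect responses."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in models.items()}
        return {name: f.result() for name, f in futures.items()}

responses = compare(
    "Extract the Speaker List",
    {"gpt-4": mock_gpt4, "claude": mock_claude},
)
for name, text in responses.items():
    print(name, "->", text)
```

Running the comparison concurrently rather than sequentially is what keeps the iteration loop fast when testing a prompt change against several providers at once.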
For Airtop, empowering users with reliable web automation capabilities is a core requirement. The combination of LangSmith’s testing features and LangGraph’s validation capabilities creates a development workflow that prioritizes reliability. The team can iterate on prompts, validate agent behavior, and identify issues before they reach production users.
While the case study is primarily promotional in nature (being published on the LangChain blog), it does highlight some genuine production challenges that Airtop addresses:
Scale: Web automation at scale introduces unique challenges compared to single-user scripting approaches.
Authentication handling: Real-world web automation must contend with login flows, session management, and authentication challenges.
CAPTCHA handling: This is explicitly called out as a challenge that Airtop’s platform addresses, though the specific technical approach is not detailed.
Reliability validation: The emphasis on LangGraph’s validation capabilities suggests that ensuring consistent, correct behavior is an ongoing operational concern.
It’s worth noting that the case study does not provide specific metrics on error rates, latency, cost optimization, or other quantitative LLMOps measures. The benefits described are largely qualitative—“accelerated time-to-market,” “faster development,” and “enhanced ability to deliver accurate responses.” While these are reasonable claims given the tools described, readers should understand that this is a vendor-published case study and may not present a complete picture of the challenges and trade-offs involved.
Airtop’s roadmap indicates continued investment in their LLM-powered agent capabilities, including enhanced benchmarking of agent performance.
The mention of enhanced benchmarking is particularly relevant from an LLMOps perspective, as systematic evaluation becomes increasingly important as agent capabilities grow more complex. The ability to measure and compare performance across model configurations suggests a maturing approach to LLM operations.
This case study illustrates several patterns relevant to LLMOps practitioners building agent-based systems:
The value of abstraction layers for model integration cannot be overstated—being able to switch between providers without architectural changes provides operational flexibility and reduces vendor lock-in. The modular subgraph approach to agent design facilitates incremental capability expansion and simplifies testing and validation. Starting with micro-capabilities and progressively building complexity is a pragmatic approach that reduces risk in production environments.
Debugging tools that support multimodal content and parallel testing accelerate development cycles and improve production troubleshooting. Validation mechanisms at the agent step level are essential for ensuring reliable behavior in automated systems that take real-world actions.
The integration of development-time tools (prompt engineering, testing) with production debugging capabilities (customer support issue investigation) creates a cohesive workflow that supports the full LLM application lifecycle.