Rechat developed an AI agent to assist real estate agents with tasks like contact management, email marketing, and website creation. Initially struggling with reliability and performance issues using GPT-3.5, they implemented a comprehensive evaluation framework that enabled systematic improvement through unit testing, logging, human review, and fine-tuning. This methodical approach helped them achieve production-ready reliability and handle complex multi-step commands that combine natural language with UI elements.
Rechat is a real estate technology company that provides software for real estate agents and brokers, offering features like contact management, email marketing, and social marketing. The company’s CTO, Emil, along with partner Hamel, presented their journey of building an AI agent called “Lucy” for real estate professionals. This case study is particularly valuable because it candidly discusses the challenges of moving from a prototype to a production-ready LLM application, and the systematic evaluation framework that ultimately made this transition possible.
The company recognized they had substantial internal APIs and customer data, leading them to build an AI agent that could perform tasks like creating contacts, sending emails, finding listings, and even creating websites for real estate agents. While this started as an exciting prototype, the journey to production readiness proved far more challenging than anticipated.
Rechat initially built their prototype using GPT-3.5 and the ReAct (Reasoning and Acting) framework. The prototype was described as “very very slow” and “making mistakes all the time,” yet when it worked, it provided what they called a “majestic experience.” This is a common pattern in LLM development: the potential is clearly visible, but the reliability is not yet good enough for production.
The fundamental problem they faced when trying to improve the system was a lack of visibility into actual performance. When making changes to prompts or system configurations, the team would invoke the system a few times and get a “feeling” about whether it worked, but they had no quantitative understanding of success or failure rates. They couldn’t determine whether changes would work 50% or 80% of the time, making it essentially impossible to confidently launch a production application.
An additional complication was the regression problem: improving one use case through prompt changes would often break other use cases. Without systematic evaluation, they were “essentially in the dark” about the overall health of the system.
Hamel, the partner brought in to help make the application production-ready, emphasized that while “vibe checks” and rapid iteration work well for building MVPs, this approach “doesn’t work for that long at all” and “leads to stagnation.” The core insight was simple: “if you don’t have a way of measuring progress you can’t really build.”
The systematic approach they developed can be broken down into several key components:
One of the most important lessons from this case study is to start with simple, deterministic tests before reaching for more complex evaluation methods. The team stressed that many developers skip this step and jump straight to “LLM as a judge” or generic evaluations, which they consider a mistake.
The assertions Rechat developed were based on observed failure modes in the data. Examples included testing whether agents (as in tool-calling agents) were working properly, checking for emails not being sent correctly, validating that invalid placeholders weren’t appearing in outputs, and ensuring certain details weren’t being repeated when they shouldn’t be. These simple tests provided immediate feedback and were essentially free to run.
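Two of the failure modes named above, unresolved placeholders and repeated details, can be sketched as plain assertions. This is a minimal illustration, not Rechat's actual test suite: the placeholder syntax (`{{name}}` or `[Name]`) and the function names are assumptions made for the example.

```python
import re

# Assumed placeholder styles for illustration: {{first_name}} or [Client Name].
# The real template syntax used by Rechat is not given in the source.
PLACEHOLDER_RE = re.compile(r"\{\{\s*\w+\s*\}\}|\[[A-Z][A-Za-z ]*\]")


def has_unresolved_placeholder(text: str) -> bool:
    """True if the output still contains a template placeholder."""
    return PLACEHOLDER_RE.search(text) is not None


def has_repeated_sentence(text: str) -> bool:
    """True if any sentence appears more than once (case-insensitive)."""
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(sentences) != len(set(sentences))


def check_output(text: str) -> dict[str, bool]:
    """Run every deterministic check and report pass/fail per check."""
    return {
        "no_unresolved_placeholder": not has_unresolved_placeholder(text),
        "no_repeated_sentence": not has_repeated_sentence(text),
    }
```

Checks like these run in milliseconds and need no model call, which is why they can be run on every change.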
For running these assertions, they used CI (Continuous Integration), acknowledging that while teams might outgrow CI as they mature, the philosophy should be to “use what you have when you begin” rather than immediately jumping to specialized tools.
Results from assertions and tests were logged to a database. Rechat already used Metabase for analytics, so they simply logged results there to visualize and track progress over time. This aligns with the repeated guidance to “keep it simple and stupid” and use existing tools rather than buying new ones when starting out.
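A rough sketch of that logging step, assuming a SQL table that a dashboard tool such as Metabase could query. The table name, columns, and helper functions here are hypothetical, and SQLite stands in for whatever database Rechat actually used.

```python
import sqlite3
from datetime import datetime, timezone


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) results table if it does not exist."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS assertion_results (
               run_id TEXT, test_name TEXT, passed INTEGER, logged_at TEXT)"""
    )


def log_result(conn: sqlite3.Connection, run_id: str, test_name: str, passed: bool) -> None:
    """Record one assertion outcome with a UTC timestamp."""
    conn.execute(
        "INSERT INTO assertion_results VALUES (?, ?, ?, ?)",
        (run_id, test_name, int(passed), datetime.now(timezone.utc).isoformat()),
    )


def pass_rate_by_run(conn: sqlite3.Connection) -> dict[str, float]:
    """Fraction of passing assertions per run, the number a dashboard would chart."""
    rows = conn.execute(
        "SELECT run_id, AVG(passed) FROM assertion_results "
        "GROUP BY run_id ORDER BY run_id"
    ).fetchall()
    return {run_id: rate for run_id, rate in rows}
```

Tracking the pass rate per run is what turns "it feels better" into a number that can be compared before and after a prompt change.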
For logging traces, the team did recommend using tools from the start, as this is one area where the tooling provides significant value. They mentioned various commercial and open-source tools, with Rechat ultimately choosing LangSmith. However, they emphasized that logging is meaningless if you don’t actually review the data.
A critical insight from this case study is the importance of reducing friction in data review. The team found that off-the-shelf tools often had too much friction for their specific use case, so they built their own data viewing and annotation application. This could be done in frameworks like Gradio, Streamlit, or Shiny for Python.
The custom application they built included domain-specific features: the ability to filter data in ways specific to real estate use cases, display associated metadata for each trace, and facilitate human review and labeling workflows. The key message was emphatic: “if you have any friction in looking at data people are not going to do it and it will destroy the whole process.”
To bootstrap test cases, especially when starting out without enough real user data, the team used LLMs to synthetically generate inputs. In Rechat’s case, they had an LLM “roleplay as a real estate agent” and generate questions covering different features, scenarios, and tools to achieve comprehensive test coverage.
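One way to get that coverage is to cross every feature with every scenario and ask the model for one input per combination. The sketch below assumes a generic `llm` callable that takes a prompt and returns text. The prompt wording, the scenario labels, and the function names are illustrative, not the prompts Rechat used.

```python
from itertools import product
from typing import Callable, Sequence


def build_prompts(features: Sequence[str], scenarios: Sequence[str]) -> list[str]:
    """One role-play prompt per (feature, scenario) pair."""
    return [
        "Role-play as a real estate agent. Write one message you would send "
        f"to an AI assistant about {feature}. Scenario: {scenario}."
        for feature, scenario in product(features, scenarios)
    ]


def generate_inputs(
    llm: Callable[[str], str],
    features: Sequence[str],
    scenarios: Sequence[str],
) -> list[str]:
    """Call the supplied model once per prompt and collect the synthetic inputs."""
    return [llm(prompt) for prompt in build_prompts(features, scenarios)]
```

Passing the model in as a callable keeps the sketch independent of any particular provider's client library.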
With a minimal evaluation system in place, the recommended approach was to iterate through prompt engineering cycles as many times as possible. This served dual purposes: making actual progress on the AI while simultaneously stress-testing the evaluation system itself by checking whether test coverage was adequate, traces were logging correctly, and friction had been minimized.
An important “superpower” that emerged from having a comprehensive evaluation framework was the ability to curate data for fine-tuning. The eval framework could select good cases and route them to human review, enabling efficient data curation. Failed cases could be worked through and used to continuously update fine-tuning datasets. The team observed that as the eval framework became more comprehensive, the cost of human review decreased because more of the process was automated.
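The routing idea can be sketched as a simple partition of logged traces by whether they pass every automated check. The trace shape (a dict with an `"output"` key) and the function name are assumptions for illustration. The source does not describe Rechat's actual data model.

```python
from typing import Callable, Iterable


def split_for_curation(
    traces: Iterable[dict],
    checks: Iterable[Callable[[str], bool]],
) -> tuple[list[dict], list[dict]]:
    """Partition traces into (passed all checks, failed at least one check)."""
    checks = list(checks)
    passed, failed = [], []
    for trace in traces:
        ok = all(check(trace["output"]) for check in checks)
        (passed if ok else failed).append(trace)
    return passed, failed
```

In this sketch the passing traces would go to reviewers as fine-tuning candidates, while the failing ones form the queue of cases to fix.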
Only after establishing the foundation of simpler evaluations did the team recommend moving to LLM-as-a-judge for cases where assertions couldn’t capture the evaluation criteria. A critical point emphasized was the need to align the LLM judge with human judgment. They recommended using a simple spreadsheet where domain experts label critique data, iterating until the LLM judge demonstrates high alignment with human evaluators.
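Alignment can be tracked with something as small as an agreement rate between the expert labels in the spreadsheet and the judge's labels on the same examples. This is a minimal sketch using boolean pass/fail labels. The function names are hypothetical, and the source does not say which alignment metric the team used.

```python
from typing import Sequence


def agreement_rate(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Fraction of examples where the LLM judge matches the human label."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("need at least one labeled example")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def disagreements(human: Sequence[bool], judge: Sequence[bool]) -> list[int]:
    """Indices of examples to re-read when revising the judge prompt."""
    return [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
```

Reading the disagreeing examples, rather than only watching the rate, is what guides the next revision of the judge prompt.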
The presentation highlighted several anti-patterns:
Despite improvements through prompt engineering, Rechat ultimately found that fine-tuning was essential for certain capabilities. The team explicitly pushed back against the notion that few-shot prompting can replace fine-tuning in all cases. They noted they “wish we could just be a ChatGPT wrapper” but their use case demanded more.
Three specific challenges required fine-tuning:
While specific metrics weren’t provided, the team reported they “managed to rapidly increase the success rate of the LLM application” once they achieved the “virtuous cycle” of the evaluation framework. The project was described as “completely impossible” without this framework. The end result was an agent that could compress hours of work for a real estate agent into approximately one minute, demonstrating genuine productivity gains for non-technical users.
This case study demonstrates that production-readiness for LLM applications requires more than just good prompts or model selection. The emphasis on evaluation infrastructure, friction reduction in human review workflows, building custom tools when needed, and the discipline to start simple before adding complexity provides a practical roadmap for teams building LLM-powered products. The candid acknowledgment that fine-tuning was ultimately necessary despite advances in prompting techniques is also valuable for teams who may be receiving contradictory advice about when fine-tuning is appropriate.