This case study details the journey and lessons learned at Rosco, a company that completely rebuilt its product around AI agents for enterprise data analysis. The speaker, Patrick (Rosco's former CTO), offers practical insight into the challenges and solutions of deploying AI agents in production environments.
At its core, the case study focuses on building AI agents that can effectively query enterprise data warehouses. The team developed a specific working definition of an AI agent, requiring three key elements (a minimal loop satisfying them is sketched after the list):
* The ability to take directions (from humans or other AIs)
* Access to call at least one tool and receive responses
* Autonomous reasoning capability for tool usage
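To make the definition concrete, here is a minimal sketch of such a loop, assuming the OpenAI Python SDK's chat-completions interface; the `run_agent` function, the `run_tool` dispatcher, and the model name are illustrative rather than taken from Rosco's implementation.

```python
import json

def run_agent(client, direction: str, tools: list, run_tool, max_steps: int = 10):
    """Minimal agent loop matching the three-part definition:
    it takes a direction, can call at least one tool, and decides
    autonomously when (and whether) to call it."""
    messages = [{"role": "user", "content": direction}]
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools
        )
        msg = response.choices[0].message
        if not msg.tool_calls:          # the model chose to answer directly
            return msg.content
        messages.append(msg)            # keep the assistant turn in context
        for call in msg.tool_calls:     # the model chose one or more tools
            result = run_tool(call.function.name, json.loads(call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return None  # gave up after max_steps
```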
One of the most significant technical insights was their approach to agent design. Rather than following the common pattern of using RAG (Retrieval Augmented Generation) with content inserted into system prompts, they focused on enabling agents to think and reason through problems using discrete tool calls. This approach proved particularly valuable when dealing with SQL query generation.
The team discovered that overwhelming the agent with too much schema information in the prompt led to poor performance. Instead, they broke the functionality down into smaller, more focused tool calls (declared in the sketch after this list), such as:
* Search tables
* Get table detail
* Profile columns
This modular approach allowed the agent to iteratively build understanding and generate more accurate queries.
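As an illustration of how such narrow tools might be declared (not Rosco's actual definitions), here is a sketch in OpenAI function-calling format, with hypothetical parameter names:

```python
# Hypothetical declarations of the three focused tools described above.
# Each one returns a small slice of schema information instead of dumping
# the entire warehouse schema into the system prompt.
SCHEMA_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_tables",
            "description": "Search warehouse tables by keyword and return matching table names.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_table_detail",
            "description": "Return columns, types, and a short description for one table.",
            "parameters": {
                "type": "object",
                "properties": {"table_name": {"type": "string"}},
                "required": ["table_name"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "profile_columns",
            "description": "Return sample values and basic statistics for selected columns.",
            "parameters": {
                "type": "object",
                "properties": {
                    "table_name": {"type": "string"},
                    "columns": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["table_name", "columns"],
            },
        },
    },
]
```

With tools scoped this narrowly, the agent can chain calls naturally: find candidate tables, inspect one in detail, profile a few columns, and only then write SQL.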
The team also ran a particularly interesting comparison between GPT-4 and Claude, finding that response formatting had a crucial impact on agent performance (see the formatting helper sketched after the list):
* GPT-4 performed better with JSON-formatted responses
* Claude showed better results with XML-formatted responses
* Initial markdown formatting proved problematic, especially with large result sets (30,000+ tokens)
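A hypothetical helper along these lines shows the idea: the same result rows are rendered as JSON for GPT-4-class models and as XML for Claude-class models. The function and field names are assumptions, not the team's code.

```python
import json
from xml.sax.saxutils import escape, quoteattr

def format_tool_result(rows: list[dict], model_family: str) -> str:
    """Render the same tool result as JSON (GPT-4-style) or XML (Claude-style)."""
    if model_family == "gpt":
        return json.dumps({"rows": rows})
    # Claude-class models tended to handle XML-structured results better.
    parts = ["<result>"]
    for row in rows:
        cells = "".join(
            f"<cell name={quoteattr(str(k))}>{escape(str(v))}</cell>"
            for k, v in row.items()
        )
        parts.append(f"<row>{cells}</row>")
    parts.append("</result>")
    return "".join(parts)
```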
The case study provides valuable insights into production deployment considerations. They intentionally avoided using third-party frameworks like LangGraph or Crew AI, despite their popularity. This decision was driven by specific production requirements, particularly around security and authentication. They needed to cascade end-user security credentials down to the agent level, allowing it to query Snowflake with appropriate user-specific permissions through OAuth integration.
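A minimal sketch of that credential cascade, assuming the snowflake-connector-python package; the account and warehouse names are placeholders, and the surrounding OAuth flow is omitted:

```python
import snowflake.connector

def run_query_as_user(sql: str, user_oauth_token: str):
    """Execute a query with the end user's own OAuth token so Snowflake
    enforces that user's permissions, rather than a shared service account."""
    conn = snowflake.connector.connect(
        account="my_account",          # placeholder
        authenticator="oauth",         # authenticate as the end user
        token=user_oauth_token,
        warehouse="ANALYTICS_WH",      # placeholder
    )
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()
```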
The team's experience with model selection and usage was particularly instructive. They found that:
* Fine-tuning models actually decreased reasoning capabilities
* Claude 3.5 provided an optimal balance of speed, cost, and decision-making quality
* The main reasoning model needed to be highly capable, while subsidiary tasks could use cheaper models (see the routing sketch below)
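A simple routing sketch along these lines, with illustrative model identifiers rather than the team's exact choices, might look like this:

```python
# Reserve the strongest model for the agent's reasoning loop and hand
# subsidiary tasks (summarizing results, formatting output) to a cheaper model.
# Model identifiers are examples only.
REASONING_MODEL = "claude-3-5-sonnet-20240620"   # drives planning and tool selection
UTILITY_MODEL = "claude-3-haiku-20240307"        # summarization, formatting, etc.

def pick_model(task_type: str) -> str:
    return REASONING_MODEL if task_type == "agent_reasoning" else UTILITY_MODEL
```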
A significant portion of their learning came from implementing multi-agent systems. Their key architectural decisions included:
* Implementing a manager agent within a hierarchy (sketched after this list)
* Limiting multi-agent teams to 5-8 agents (similar to Amazon's "two-pizza rule")
* Focusing on incentivization rather than strict process control
* Carefully delegating subtasks to specialized worker agents
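A schematic of that hierarchy, assuming a hypothetical `Agent` wrapper rather than any specific framework, might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Hypothetical wrapper around a single agent loop."""
    name: str
    system_prompt: str

    def run(self, task: str) -> str:
        # Placeholder: in practice this would drive an LLM tool-calling loop.
        return f"[{self.name}] completed: {task}"

@dataclass
class ManagerAgent:
    """Manager agent that delegates subtasks to a small team of workers."""
    workers: dict[str, Agent] = field(default_factory=dict)

    def add_worker(self, agent: Agent) -> None:
        if len(self.workers) >= 8:   # keep the team in the 5-8 range
            raise ValueError("team too large; split into another hierarchy")
        self.workers[agent.name] = agent

    def delegate(self, subtasks: dict[str, str]) -> list[str]:
        # subtasks maps worker name -> task description
        return [self.workers[name].run(task) for name, task in subtasks.items()]
```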
The team's approach to production deployment emphasized pragmatic solutions over theoretical elegance. They found that the real value wasn't in the system prompts (which many teams treated as proprietary IP) but in:
* The ecosystem around the agent
* User experience design
* Security and authentication implementation
* Integration with enterprise systems
Security implementation was a crucial aspect of their production deployment. They developed systems to:
* Handle OAuth integration for enterprise data access
* Manage user-specific permissions at the data warehouse level
* Ensure secure credential management and proper access controls
The case study also reveals interesting insights about model behavior in production. For instance, the team observed that model hallucinations often indicated preferred input formats: when an agent consistently ignored the specified JSON schema for tool calls, it was often signalling a more natural format aligned with its training data.
A crucial learning concerned what they termed the "Agent Computer Interface" (ACI). Small changes in tool call syntax and response formatting had outsized impacts on agent performance, which led to continuous iteration and refinement of the following (a response-envelope sketch follows the list):
* Tool call formats
* Response structures
* Error handling patterns
* Context management
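One way to picture such an interface is a structured response envelope that reports success, errors, and truncation explicitly; the field names and limits below are assumptions for illustration, not Rosco's format.

```python
import json

MAX_ROWS_IN_CONTEXT = 50  # illustrative cap to avoid flooding the context window

def tool_response(rows: list | None = None, error: Exception | str | None = None) -> str:
    """Return the same envelope for success and failure, with errors described
    in plain language the model can act on and oversized results truncated."""
    if error is not None:
        return json.dumps({
            "status": "error",
            "error": str(error),
            "hint": "Check the table and column names, then retry the call.",
        })
    rows = rows or []
    return json.dumps({
        "status": "ok",
        "row_count": len(rows),
        "rows": rows[:MAX_ROWS_IN_CONTEXT],
        "truncated": len(rows) > MAX_ROWS_IN_CONTEXT,
    })
```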
The team's experience highlighted the importance of focusing on reasoning capabilities over knowledge embedding. This approach proved more robust and maintainable in production, allowing agents to handle novel situations and edge cases more effectively.
This case study represents a valuable contribution to the field of practical LLMOps, especially in enterprise settings. It demonstrates how theoretical concepts around AI agents need to be adapted and refined for production use, with particular attention to security, scalability, and real-world performance considerations.