Prosus developed a SQL-generating agent called “Token Data Analyst” to democratize data access across its portfolio companies. The agent acts as first-line support for data queries, letting non-technical users pull insights from databases by asking natural-language questions in Slack. The system achieved a 74% reduction in query response time and significantly increased the total number of data insights generated, while maintaining high accuracy through careful prompt engineering and context management.
This case study explores Prosus’s development and deployment of the “Token Data Analyst” agent, a system designed to democratize data access across its portfolio companies, including iFood, OLX, and other technology businesses.
The core problem was a bottleneck: data analysts spent much of their time fielding routine data queries, which kept them from more complex analytical work. With around 30,000 employees across various tech companies, quick data access was critical for decision-making, customer service, and operations.
System Architecture and Implementation
The Token Data Analyst agent was built with several key architectural decisions:
Core Architecture: The system uses a standard agent framework with an LLM as the brain, specialized tools for execution, and history management for maintaining context.
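A minimal sketch of that pattern is shown below; all names are illustrative, since Prosus has not published its implementation.

```python
# Illustrative sketch of the agent loop: an LLM "brain", a set of callable
# tools, and a running message history. All names here are hypothetical.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DataAnalystAgent:
    llm: Callable[[list[dict]], dict]       # chat-completion-style callable
    tools: dict[str, Callable[..., str]]    # e.g. {"execute_sql": ...}
    history: list[dict] = field(default_factory=list)

    def run(self, user_message: str) -> str:
        self.history.append({"role": "user", "content": user_message})
        while True:
            response = self.llm(self.history)
            tool_call = response.get("tool_call")
            if tool_call:
                # Execute the requested tool and feed the result back in.
                result = self.tools[tool_call["name"]](**tool_call["args"])
                self.history.append({"role": "tool", "content": result})
            else:
                # No tool needed: the model's answer goes back to Slack.
                self.history.append({"role": "assistant",
                                     "content": response["content"]})
                return response["content"]
```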
Database Integration: Instead of direct database connections, the agent uses a generalized SQL execution tool that interfaces with different database types (Snowflake, Databricks, BigQuery) through specific connectors. Each instance is configured for a particular database and SQL dialect.
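The shape of such a tool might look like the following sketch, with the vendor clients (snowflake-connector-python, databricks-sql-connector, google-cloud-bigquery) stubbed out; the class and function names here are hypothetical, not Prosus's actual code.

```python
# Hypothetical connector abstraction: one generalized SQL tool, with
# dialect-specific connectors chosen when an agent instance is deployed.
from abc import ABC, abstractmethod

class SQLConnector(ABC):
    dialect: str

    @abstractmethod
    def execute(self, query: str) -> list[dict]:
        """Run a query and return rows as a list of dicts."""

class SnowflakeConnector(SQLConnector):
    dialect = "snowflake"
    def execute(self, query: str) -> list[dict]:
        ...  # would delegate to snowflake-connector-python

class BigQueryConnector(SQLConnector):
    dialect = "bigquery"
    def execute(self, query: str) -> list[dict]:
        ...  # would delegate to google-cloud-bigquery

def make_sql_tool(connector: SQLConnector):
    """Bind an agent instance to exactly one database and SQL dialect."""
    def execute_sql(query: str) -> list[dict]:
        return connector.execute(query)
    execute_sql.__doc__ = f"Execute a {connector.dialect} query."
    return execute_sql
```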
Context Management: The system uses dedicated Slack channels for different data domains (marketing, logistics, etc.), which helps manage context and access control. This channel-based approach simplifies the agent’s context management by limiting it to specific database contexts.
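A hypothetical version of that channel-to-context binding:

```python
# Illustrative channel-to-context binding: each Slack channel maps to one
# database context, so the agent never disambiguates across domains.
CHANNEL_CONTEXTS = {
    "#data-marketing": {"database": "marketing_dwh", "dialect": "snowflake"},
    "#data-logistics": {"database": "logistics_dwh", "dialect": "bigquery"},
}

def context_for_channel(channel: str) -> dict:
    # Unknown channels get an empty context, so the agent declines to answer
    # rather than guessing against the wrong schema.
    return CHANNEL_CONTEXTS.get(channel, {})
```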
Query Validation: A crucial innovation was a pre-check step that validates whether the agent has sufficient information to answer a query before it attempts to generate SQL. This helps prevent hallucinations and inappropriate responses.
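One plausible form of this pre-check, with an illustrative prompt and hypothetical names (the production prompt is not public):

```python
# Illustrative clarity pre-check: before generating any SQL, ask the model
# whether the question is answerable from the known schema. The prompt,
# names, and verdict format are assumptions, not the production design.
CLARITY_PROMPT = """You assist analysts for the {domain} database.

Schema:
{schema}

Question: {question}

Can this question be answered using only the schema above?
Reply with exactly ANSWERABLE or UNCLEAR on the first line."""

def is_answerable(llm, question: str, schema: str, domain: str) -> bool:
    verdict = llm(CLARITY_PROMPT.format(
        domain=domain, schema=schema, question=question))
    return verdict.strip().splitlines()[0].upper() == "ANSWERABLE"
```

When the check fails, the agent can ask for clarification instead of producing SQL.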
Technical Challenges and Solutions
The team encountered several significant challenges during development:
Model Confidence Control: Early versions of the agent would attempt to answer questions even when they lacked sufficient context. This was addressed with the separate “Clarity Check” step described above, which runs before query generation.
Context Management: Business context and terminology required careful curation. The team found that standard data documentation, while human-readable, wasn’t suitable for LLMs and required specific formatting to remove ambiguity and implicit knowledge.
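As an illustration of the kind of rewriting involved (the actual documentation is not public), a column note might change like this:

```python
# Illustrative only: a human-oriented column note versus the same note
# rewritten for an LLM, with implicit business rules made explicit.
HUMAN_DOC = "gmv: gross merchandise value (see the finance wiki)"

LLM_DOC = (
    "gmv (NUMERIC): gross merchandise value per order, in local currency, "
    "including delivery fees and excluding cancelled orders. "
    "Not revenue: for revenue questions use net_revenue instead."
)
```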
Query Complexity Ceiling: The agent tends to default to simpler solutions, creating a natural ceiling for query complexity. This limitation was embraced by positioning the agent as a complement to, rather than replacement for, human analysts.
Model Updates: The team discovered that prompts could be “overfitted” to specific model versions, causing significant behavior changes when updating to newer models. This highlighted the need for robust testing and evaluation procedures.
Performance and Safety: The implementation included guards against expensive queries (such as unbounded SELECT * scans) and access controls to prevent database overload and security issues.
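A minimal sketch of such a guard, assuming a read-only agent and a default row cap (both assumptions, not confirmed details of the system):

```python
# Minimal sketch of a query guard. Production guards would also lean on the
# warehouse's own cost estimates and role-based access controls.
import re

MAX_ROWS = 10_000

def guard_query(sql: str) -> str:
    q = sql.strip().rstrip(";")
    if not re.match(r"(?is)^\s*(select|with)\b", q):
        raise ValueError("Only read-only SELECT queries are allowed.")
    if re.search(r"(?i)\bselect\s+\*", q):
        raise ValueError("SELECT * is blocked; name the columns you need.")
    if not re.search(r"(?i)\blimit\s+\d+", q):
        q += f" LIMIT {MAX_ROWS}"  # naive cap; adequate for a sketch
    return q
```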
Evaluation and Testing
The evaluation process was multifaceted:
Accuracy Testing: Due to SQL’s flexibility (multiple valid ways to write the same query) and time-dependent data, traditional accuracy metrics were challenging to implement.
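A common workaround for this is to execute both the generated query and a reference query and compare the result sets order-insensitively, rather than comparing SQL strings. The helper below is an illustrative sketch, not Prosus's evaluation code:

```python
# Illustrative evaluation helper: two different SQL strings can both be
# correct, so compare what they return instead of how they are written.
def results_match(rows_a: list[dict], rows_b: list[dict]) -> bool:
    def canonical(rows: list[dict]):
        return sorted(
            tuple(sorted((k, str(v)) for k, v in row.items()))
            for row in rows
        )
    return canonical(rows_a) == canonical(rows_b)

def evaluate_case(agent_sql: str, reference_sql: str, execute) -> bool:
    # Pin both queries to the same snapshot or date filter beforehand, so
    # time-dependent data does not produce false mismatches.
    return results_match(execute(agent_sql), execute(reference_sql))
```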
Use Case Classification: The team developed a system of categorizing use cases by complexity and readiness for production, with separate development and production instances for testing.
Analyst Integration: Close collaboration with data analysts was crucial for validation and improvement of the system.
Impact and Results
The implementation showed significant positive outcomes: a 74% reduction in query response time and a marked increase in the total number of data insights generated, achieved without sacrificing accuracy.
Lessons Learned and Best Practices
Several key insights emerged from the project: documentation must be rewritten explicitly for LLM consumption rather than reused as-is; an explicit clarity check before generation is more reliable than trusting model confidence; prompts can overfit to a specific model version, so upgrades need regression evaluation; and the agent works best as a complement to human analysts rather than a replacement.
Future Directions
The team continues to explore several further improvements to the agent.
This case study represents a successful implementation of LLMs in production, demonstrating how careful architectural decisions, close user collaboration, and pragmatic engineering choices can lead to significant business value while maintaining system reliability and safety.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging its expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variation across frontier models (from single-digit to roughly 80% accuracy), with notable error modes including tool-use failures (in 36% of conversations) and hallucinations drawn from pretrained domain knowledge, particularly from OpenAI models, which hallucinated non-existent insurance products 15-45% of the time.
Yahoo! Finance built a production-scale financial question answering system using a multi-agent architecture to address the information asymmetry between retail and institutional investors. The system leverages Amazon Bedrock AgentCore and employs a supervisor-subagent pattern in which specialized agents handle structured data (stock prices, financials), unstructured data (SEC filings, news), and various APIs. The solution processes heterogeneous financial data from multiple sources, handles the temporal complexities of fiscal years, and maintains context across sessions. Through a hybrid evaluation approach combining human and AI judges, the system achieves strong accuracy and coverage metrics while processing queries in 5-50 seconds at a cost of 2-5 cents per query, demonstrating production viability at scale with support for 100+ concurrent users.
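As an illustration of the supervisor-subagent pattern described here (not Yahoo's actual code, and with keyword matching standing in for what would in practice be an LLM-driven supervisor):

```python
# Illustrative supervisor-subagent routing: the supervisor inspects the
# question and delegates to a specialized subagent. All names hypothetical.
from typing import Callable

Subagent = Callable[[str], str]

def make_supervisor(structured: Subagent, unstructured: Subagent,
                    fallback: Subagent) -> Subagent:
    def route(question: str) -> str:
        q = question.lower()
        if any(k in q for k in ("filing", "10-k", "news")):
            return unstructured(question)  # SEC filings, news
        if any(k in q for k in ("price", "revenue", "eps")):
            return structured(question)    # stock prices, financials
        return fallback(question)
    return route
```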
This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.