A detailed case study on automating data analytics using ChatGPT, where the challenge of LLMs' limitations in quantitative reasoning is addressed through a novel multi-agent system. The solution implements two specialized ChatGPT agents - a data engineer and a data scientist - working together to analyze structured business data. The system uses the ReAct framework for reasoning, SQL for data retrieval, and Streamlit for deployment, demonstrating how to effectively operationalize LLMs for complex business analytics tasks.
This case study from Microsoft’s Data Science team, authored by James Nguyen, presents a reference implementation and methodology for using ChatGPT as an automated business analytics assistant. The core innovation addresses a fundamental limitation of LLMs: while they excel at natural language tasks, they struggle with quantitative reasoning and structured data analysis. Rather than relying on the LLM to perform calculations directly, the solution treats ChatGPT as the “brain” that orchestrates analytical workflows using external tools—much like a human analyst would use software to perform the actual number crunching.
The approach is notable for being presented as an open-source reference implementation (available on GitHub) rather than a proprietary production system, which means the claims are more about technical feasibility than proven business results. This is an important distinction for evaluating the case study fairly.
The fundamental challenge addressed is the gap between LLM capabilities and business needs for structured data analytics. While ChatGPT and similar models have proven valuable for unstructured text tasks (summarization, information extraction, augmented data generation), they have historically been unreliable for quantitative analysis. The author explicitly acknowledges that LLMs “are less reliable for quantitative reasoning tasks, including documented instances in which ChatGPT and other LLMs have made errors in dealing with numbers.”
However, structured data remains critical for business decision-making, creating a need for systems that can bridge natural language interaction with accurate data analysis. The goal is to enable users without advanced technical skills to ask complex analytical questions of business data and receive accurate answers with rich visualizations.
The solution employs a two-agent architecture that separates concerns between data acquisition and data analysis:
Data Engineer Agent: Responsible for data acquisition from source systems (SQL databases). This agent receives instructions from the Data Scientist agent, identifies relevant tables, retrieves schemas, and formulates SQL queries to extract the necessary data. The separation ensures that the complexity of database interaction is isolated from analytical reasoning.
Data Scientist Agent: The primary agent that interfaces with users, plans analytical approaches, requests data from the Data Engineer agent, performs analysis using Python and data science libraries, and produces visualizations. This agent orchestrates the overall workflow and is responsible for the final output.
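The handoff between the two agents can be sketched in a few lines. This is a minimal illustration of the interface shape, not the actual implementation; the function names and canned data are hypothetical, and in the real system each step would be an LLM call.

```python
# Hypothetical sketch of the two-agent handoff. In the real system each
# agent is a ChatGPT session; here plain functions show the division of labor.

def data_engineer(request: str) -> list[dict]:
    """Data Engineer role: turn a data request into rows.
    A real agent would identify tables, fetch schemas, and run SQL;
    canned rows stand in to show the interface shape."""
    return [{"month": "2023-01", "revenue": 120_000},
            {"month": "2023-02", "revenue": 135_000}]

def data_scientist(question: str) -> dict:
    """Data Scientist role: plan, request data, analyze, report."""
    # Step 1: decide what data is needed (an LLM call in the real system).
    data_request = f"data needed to answer: {question}"
    # Step 2: delegate acquisition to the Data Engineer agent.
    rows = data_engineer(data_request)
    # Step 3: compute with ordinary Python rather than asking the LLM to
    # do arithmetic -- the core idea of the architecture.
    total = sum(r["revenue"] for r in rows)
    return {"answer": f"Total revenue: {total}", "rows_used": len(rows)}

result = data_scientist("What was total revenue in early 2023?")
```

The point of the sketch is that the number crunching happens in deterministic code; the LLM only decides what to request and what to compute.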
This separation follows good software engineering principles of modularity and makes each agent’s task more tractable for the LLM. The author notes that “dividing a potentially complex task into multiple sub-tasks that are easier for ChatGPT to work on” is a key design consideration.
Both agents implement the ReAct (Reasoning and Acting) framework, which combines chain-of-thought prompting with action execution. This represents a significant advancement over simple prompt engineering because it allows agents to interleave reasoning steps with tool calls, observe intermediate results, and adjust the plan before committing to an answer.
The iterative nature of ReAct is crucial for handling non-trivial problems where intermediate analysis steps may lead to “unanticipated yet advanced outcomes” and where observations may change the original plan. This adaptive capability is essential for real-world analytics scenarios where the optimal approach isn’t always clear from the initial question.
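The thought-action-observation cycle can be sketched as a simple loop. This assumes a hypothetical `llm` callable that returns either a tool action or a final answer; the real implementation parses ChatGPT completions into these steps.

```python
# Minimal ReAct-style loop, assuming a hypothetical `llm` callable that
# returns a dict with "thought", "action", and "input" keys.

def react_loop(question, llm, tools, max_steps=5):
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(scratchpad)              # model produces Thought + Action
        if step["action"] == "finish":
            return step["input"]            # final answer
        tool = tools[step["action"]]        # e.g. "execute_sql", "python"
        observation = tool(step["input"])   # act, then observe the result
        # Feed the observation back so the next step can adapt the plan.
        scratchpad += (f"Thought: {step['thought']}\n"
                       f"Action: {step['action']}[{step['input']}]\n"
                       f"Observation: {observation}\n")
    return "stopped after max_steps"
```

Because each observation is appended to the scratchpad, the model can revise its approach mid-task, which is what lets it handle the "unanticipated yet advanced outcomes" the author describes.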
The prompts are carefully structured around several key components, notably few-shot examples and explicit inter-agent communication protocols.
The author emphasizes that few-shot examples are "needed to help ChatGPT understand details that are difficult to convey with only instructions." The examples include specific Python code patterns, with "unnecessary specifics omitted for brevity and generalizability."
The prompts also establish inter-agent communication patterns, making each agent aware of the other’s existence and how to collaborate. For example, the Data Scientist agent knows to issue requests like request_to_data_engineer to acquire necessary data.
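A condensed, hypothetical version of such a prompt shows how the few-shot example and the `request_to_data_engineer` pattern are encoded together; the wording here is illustrative, not the article's actual template.

```python
# Hypothetical, heavily condensed Data Scientist system prompt showing
# the inter-agent request convention plus one few-shot example.

DATA_SCIENTIST_PROMPT = """\
You are a data scientist agent. You collaborate with a data engineer
agent. Whenever you need data, emit:
    request_to_data_engineer("<plain-language description of the data>")
Then analyze the returned data with Python and produce visualizations.

Example:
User: Plot monthly sales for last year.
Thought: I need monthly sales totals before I can plot anything.
Action: request_to_data_engineer("monthly sales totals for last year")
Observation: <dataframe with columns month, sales>
Thought: Now I can plot the trend with matplotlib.
"""
```

Making the collaboration convention part of the prompt, rather than code, is what lets each agent "know" the other exists.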
A key LLMOps principle demonstrated is the abstraction of complex operations behind simple tool interfaces: the agents are provided with utility functions that hide implementation complexity.
This tool abstraction is essential for reliable LLM operation. The author explicitly recommends “wrapping multiple complex APIs into a simple API before exposing to ChatGPT” because “complex APIs and interaction flow may confuse ChatGPT.”
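A sketch of this wrapping pattern, using an in-memory SQLite database as a stand-in for a real warehouse (the function name and schema are illustrative): the agent sees one string-in, rows-out call, while connection handling and cursors stay hidden.

```python
import sqlite3

def execute_sql(query: str) -> list[tuple]:
    """Simple tool surface exposed to the agent: one SQL string in,
    plain rows out. Connection management, cursors, and driver details
    are hidden behind this single call."""
    # In-memory database with toy data stands in for the real source.
    with sqlite3.connect(":memory:") as conn:
        conn.execute("CREATE TABLE sales(month TEXT, revenue REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)",
                         [("2023-01", 120000.0), ("2023-02", 135000.0)])
        return conn.execute(query).fetchall()

rows = execute_sql("SELECT SUM(revenue) FROM sales")
```

The narrower the surface the model has to reason about, the less room there is for malformed tool calls.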
The case study provides substantial guidance on production deployment challenges, which demonstrates mature thinking about LLMOps.
A significant practical challenge is that production data sources often have “numerous data objects and tables with complex schema” that would exceed ChatGPT’s token limits if passed entirely as context. The solution implements dynamic context building in three stages: first identifying relevant tables, then retrieving only those schemas, and finally building queries based on the retrieved schema. This pattern is essential for scaling to real-world enterprise data environments.
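The three-stage pattern can be sketched as follows. The catalog, helper names, and keyword-matching heuristic are assumptions for illustration; in the real system stage 1 is itself an LLM call over table names only.

```python
# Sketch of three-stage dynamic context building. Only schemas relevant
# to the question ever enter the prompt, keeping token usage bounded.

CATALOG = {
    "sales": "CREATE TABLE sales(month TEXT, revenue REAL)",
    "hr_employees": "CREATE TABLE hr_employees(id INT, name TEXT)",
    # ...a real warehouse would list hundreds more tables here
}

def identify_relevant_tables(question: str) -> list[str]:
    """Stage 1: select candidate tables from names alone (cheap tokens).
    A keyword match stands in for the LLM call used in practice."""
    words = question.lower().split()
    return [t for t in CATALOG if any(w in t for w in words)]

def retrieve_schemas(tables: list[str]) -> str:
    """Stage 2: load full DDL only for the selected tables."""
    return "\n".join(CATALOG[t] for t in tables)

def build_query_prompt(question: str) -> str:
    """Stage 3: the final prompt carries only the schemas that matter."""
    schemas = retrieve_schemas(identify_relevant_tables(question))
    return f"Schemas:\n{schemas}\n\nWrite SQL to answer: {question}"

prompt = build_query_prompt("total sales revenue by month")
```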
The author acknowledges that each business domain has “proprietary names, rules, and concepts that are not part of the public knowledge that ChatGPT was trained on.” These must be incorporated as additional context, potentially using the same dynamic loading mechanisms as schema information. This is a crucial consideration for enterprise deployments where domain-specific terminology is common.
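Domain knowledge can be injected with the same selective-loading idea. The glossary entries and matching rule below are hypothetical; the pattern is what matters.

```python
# Hypothetical domain glossary injected on demand, mirroring the
# dynamic schema-loading pattern: only entries the question mentions
# are added to the context.

GLOSSARY = {
    "ARR": "Annual Recurring Revenue, computed as MRR * 12",
    "logo churn": "count of customers lost in a period",
}

def inject_domain_context(question: str, glossary: dict) -> str:
    """Attach only the glossary entries referenced by the question."""
    relevant = {k: v for k, v in glossary.items()
                if k.lower() in question.lower()}
    notes = "\n".join(f"- {k}: {v}" for k, v in relevant.items())
    return f"Domain notes:\n{notes}\n\nQuestion: {question}"

context = inject_domain_context("What is our ARR trend?", GLOSSARY)
```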
For scenarios requiring complex analytical logic (revenue forecasting, causal analysis), the recommendation is to build “specialized prompt template[s] just for your scenario.” While this limits generality, specialized templates can provide the detailed instructions needed for reliable execution of domain-specific logic.
A critical production concern is that “ChatGPT has a certain level of randomness in its output format, even with clear instruction.” The recommended mitigation is implementing “validation and retry flow” mechanisms. This acknowledges that LLM outputs require programmatic verification before downstream processing.
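A generic validate-and-retry wrapper might look like the following sketch, where `generate` stands for the LLM call and the previous error is fed back so the model can repair its output; the helper names are assumptions.

```python
import json

def call_with_validation(generate, validate, max_retries=3):
    """Re-invoke the generator until its output passes validation,
    feeding the last error back so the model can correct itself."""
    last_error = None
    for _ in range(max_retries):
        raw = generate(last_error)      # e.g. an LLM call with error hint
        try:
            return validate(raw)
        except ValueError as e:
            last_error = str(e)
    raise RuntimeError("output failed validation after retries")

def validate_json(raw: str) -> dict:
    """Example validator: the downstream code requires parseable JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"invalid JSON: {e}")
```

Programmatic validation at this boundary is what keeps occasional format drift from propagating into the analysis pipeline.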
The author explicitly states that "randomness and hallucination may impact the ability of ChatGPT — or any LLM — to deliver accuracy and reliability." Two mitigations are suggested: exposing the agents' intermediate steps, generated code, and data so users can verify the work, and applying validation-and-retry flows to catch malformed outputs.
This transparency approach—showing the work rather than just the final answer—is essential for building trust and enabling human oversight in production analytics systems.
For production deployment, the author recommends separating concerns: “deploy the agents as restful back-end APIs using a framework such as Flask and deploy the UI layer using Streamlit as a front-end application for scalability and maintainability.” This separation allows independent scaling of the agent backend and user interface layers.
Streamlit is chosen as the application platform for its simplicity, its pure-Python development model, and its built-in session state for stateful memory.
While alternatives like Dash and Bokeh are mentioned, Streamlit’s simplicity and stateful memory support make it well-suited for this prototype. The session state mechanism is particularly important as it enables data persistence and exchange between the Data Engineer and Data Scientist agents during a user session.
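The session-state handoff can be sketched as below. A plain dict stands in for `st.session_state` (which behaves like a per-session dict) so the pattern runs outside `streamlit run`; the key name is hypothetical.

```python
# Sketch of the session-state handoff between agents. A plain dict
# stands in for streamlit's st.session_state, which is a per-user-session
# key/value store that survives script reruns.

session_state = {}  # stand-in for st.session_state

def data_engineer_step(state, request):
    """Engineer stores the acquired data under a session key."""
    state["acquired_data"] = [{"month": "2023-01", "revenue": 120_000}]

def data_scientist_step(state):
    """Scientist reads the same key on a later rerun and analyzes it."""
    rows = state.get("acquired_data", [])
    return sum(r["revenue"] for r in rows)

data_engineer_step(session_state, "monthly revenue")
total = data_scientist_step(session_state)
```

Because Streamlit reruns the whole script on each interaction, this shared store is what lets data acquired by one agent persist for the other within a user session.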
While this case study presents a thoughtful architecture for LLM-powered analytics, several limitations should be noted, chief among them that it is a reference implementation demonstrating technical feasibility rather than a production system validated by accuracy metrics or measured business results.
Nevertheless, the architectural patterns and best practices documented provide valuable guidance for teams considering similar LLM-powered analytics applications.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
This case study presents the deployment of Dust.tt's AI platform across multiple companies including Payfit and Alan, focusing on enterprise-wide productivity improvements through LLM-powered assistants. The companies implemented a comprehensive AI strategy involving both top-down leadership support and bottom-up adoption, creating custom assistants for various workflows including sales processes, customer support, performance reviews, and content generation. The implementation achieved significant productivity gains of approximately 20% across teams, with some specific use cases reaching 50% improvements, while addressing challenges around security, model selection, and user adoption through structured rollout processes and continuous iteration.
Yahoo! Finance built a production-scale financial question answering system using multi-agent architecture to address the information asymmetry between retail and institutional investors. The system leverages Amazon Bedrock Agent Core and employs a supervisor-subagent pattern where specialized agents handle structured data (stock prices, financials), unstructured data (SEC filings, news), and various APIs. The solution processes heterogeneous financial data from multiple sources, handles temporal complexities of fiscal years, and maintains context across sessions. Through a hybrid evaluation approach combining human and AI judges, the system achieves strong accuracy and coverage metrics while processing queries in 5-50 seconds at costs of 2-5 cents per query, demonstrating production viability at scale with support for 100+ concurrent users.