ZenML

Production Intent Recognition System for Enterprise Chatbots

FeedYou 2023

FeedYou developed a sophisticated intent recognition system for their enterprise chatbot platform, addressing challenges in handling complex conversational flows and out-of-domain queries. They experimented with different NLP approaches before settling on a modular architecture using NLP.js, implementing hierarchical intent recognition with local and global intents, and integrating generative models for handling edge cases. The system achieved a 72% success rate for local intent matching and effectively handled complex conversational scenarios across multiple customer deployments.

Industry: Tech

Overview

This case study is drawn from a machine learning meetup (MLMU Prague) featuring two complementary presentations on chatbot development in production environments. The first presentation by Tomáš from FeedYou discusses their practical experiences with intent matching using lightweight NLP models, while the second by Honza from Prometeus AI covers their Flowstorm conversational AI platform built on lessons learned from winning the Alexa Prize competition. Together, they provide valuable insights into the operational challenges and solutions for deploying conversational AI systems at scale.

FeedYou’s Chatbot Platform and NLP Approach

FeedYou develops a visual chatbot designer tool called Feedbot Designer that allows users to build chatbots through drag-and-drop interfaces and deploy them to channels like Facebook Messenger and web chat. The majority of their chatbots operate on tree structures—essentially branching scenarios where the bot follows predefined routes. These structures can become quite complex, with hundreds of dialogues and conditions, particularly for production use cases in domains like HR and market research.

Production Results and Use Case

FeedYou shared compelling production statistics: approximately 72% of customer queries handled by their chatbots don’t require human intervention, allowing support staff to focus on more complex problems. They also noted that about half of users interact with chatbots outside of standard working hours, demonstrating the value of automated conversational systems for 24/7 availability.

One highlighted case was their work with Ipsos for market research, where chatbots collect survey responses from users. The HR domain was also mentioned as particularly successful for their platform.

User Input Analysis and Model Performance

A critical insight from FeedYou’s production data analysis was the distribution of user input lengths. Looking at one month of data from a QnA bot with over 4,000 queries, they found that most inputs are very short—typically one or two words. Approximately 25% of inputs are essentially noise (greetings like “hello” or random text), which tends to skew toward one or two words.

More importantly, they discovered an inverse relationship between input length and match success rate: shorter inputs are matched successfully more often, while longer inputs (50+ words) see dramatically lower match rates. This finding had significant implications for their model selection and training strategy.
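This kind of log analysis is straightforward to reproduce. The sketch below buckets queries by token count and computes per-bucket match rates; the data and bucket boundaries are invented for illustration (the talk shared only aggregate findings, not code):

```python
from collections import defaultdict

def match_rate_by_length(logs, buckets=((1, 2), (3, 9), (10, 49), (50, 10_000))):
    """Group (query, was_matched) pairs by token count and compute match rates."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [matched, total]
    for query, was_matched in logs:
        n = len(query.split())
        for lo, hi in buckets:
            if lo <= n <= hi:
                totals[(lo, hi)][0] += int(was_matched)
                totals[(lo, hi)][1] += 1
                break
    return {b: matched / total for b, (matched, total) in totals.items() if total}

# Hypothetical one-month log excerpt: (user input, did the NLU match an intent?)
logs = [("reset password", True), ("hello", False),
        ("i need to change the password for my account", True),
        ("hi", True)]
rates = match_rate_by_length(logs)
```

Run over a month of production logs, this is enough to surface the short-input skew and the drop-off for long inputs that FeedYou observed.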

Model Selection: NLP.js vs. FastText

FeedYou uses NLP.js, a lightweight JavaScript library specifically designed for chatbots. When they evaluated whether more sophisticated models like Facebook’s FastText would improve results, they found no significant benefit for their use case: with most production inputs only one or two words long, heavier embedding-based models have little extra context to exploit.

The NLP.js pipeline involves normalization (removing accents and punctuation), tokenization (splitting by spaces), optional stop word removal (though they found keeping stop words helped with very short inputs), stemming (removing word endings), and finally classification via a neural network. The output includes confidence scores for each intent plus a list of recognized entities.
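The pipeline stages can be sketched in Python. This is an illustrative approximation, not NLP.js itself, which ships per-language stemmers and a neural classifier on top of these steps:

```python
import re
import unicodedata

def normalize(text):
    # Strip accents and punctuation and lowercase, mirroring NLP.js normalization.
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    return re.sub(r"[^\w\s]", "", text).lower()

def tokenize(text):
    return text.split()  # NLP.js tokenizes by splitting on whitespace

def stem(token, suffixes=("ing", "ed", "s")):
    # Toy suffix-stripping stemmer standing in for a real per-language stemmer.
    for suf in suffixes:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

def preprocess(text):
    # Stop words are deliberately kept: FeedYou found they help on 1-2 word inputs.
    return [stem(t) for t in tokenize(normalize(text))]
```

The resulting token list is what the classifier would consume; in NLP.js the output is confidence scores per intent plus recognized entities.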

Multi-Model Approach Challenges

FeedYou experimented with splitting intents across multiple models to improve matching accuracy. While this did improve match rates for targeted intents, it introduced a worse problem: incorrect matches with high confidence. In one example, a small talk model containing a “marry me” intent with the training phrase “meet me at the altar” would incorrectly match when users asked to “schedule a meeting”, because “meet” occurred nowhere else in that model’s vocabulary and so acted as a deceptively strong signal for the wrong intent.

This taught them that false positives create worse user experiences than acknowledging “I don’t understand”—when users receive an incorrect answer, it’s more frustrating than being asked to rephrase their question.
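One pragmatic guard against such false positives is to accept a classification only when the top intent is both confident and clearly ahead of the runner-up, and otherwise fall back to asking the user to rephrase. The thresholds below are assumptions for illustration; the talk did not specify FeedYou's exact cutoffs:

```python
FALLBACK = "fallback.didnt_understand"

def pick_intent(scores, min_confidence=0.75, min_margin=0.15):
    """Return an intent only when the top score is high AND well separated
    from the runner-up; otherwise admit we didn't understand."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_intent, top_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_score >= min_confidence and top_score - runner_up >= min_margin:
        return top_intent
    return FALLBACK
```

The margin check matters as much as the absolute threshold: two plausible intents scoring 0.62 and 0.55 should trigger a clarifying question, not a coin flip.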

Intent Consolidation with Named Entity Recognition

Their most successful approach involved consolidating similar intents and using named entity recognition to determine user intent. For example, instead of separate “forgotten password” and “new password” intents that the model struggles to distinguish, they created a single “password” intent with a custom named entity for “action” (forgotten vs. new).

The dialog flow then handles disambiguation: if the entity is recognized, it routes to the appropriate branch; if not, the bot asks a clarifying question. This approach improved both model accuracy and user experience, as the bot appears to engage in natural conversation rather than failing silently.
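A minimal sketch of this pattern follows, with a keyword lookup standing in for the trained “action” entity recognizer; the intent name, keywords, and return conventions are all illustrative, not FeedYou's actual implementation:

```python
# Keyword -> canonical "action" entity value (stand-in for a trained NER model).
ACTIONS = {"forgot": "forgotten", "forgotten": "forgotten", "lost": "forgotten",
           "new": "new", "create": "new", "change": "new"}

def handle_password_intent(utterance):
    """Single consolidated 'password' intent: route on the extracted entity,
    or ask a clarifying question when it is missing or ambiguous."""
    found = {ACTIONS[w] for w in utterance.lower().split() if w in ACTIONS}
    if len(found) == 1:
        action = found.pop()
        return f"route:password/{action}"
    # Entity missing or ambiguous -> ask instead of guessing.
    return "ask:Did you forget your password, or do you want to set a new one?"
```

Because the clarifying question only fires when the entity is absent, the bot reads as conversational rather than failing silently.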

Real-Time Model Validation

FeedYou implemented a validation system that runs during chatbot development: every time a user modifies the NLP model, the system re-validates it and flags overlapping intents, giving chatbot designers immediate feedback about which intents need refinement.
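One simple form such an overlap check can take is sketched below, using normalized token sets as a stand-in for whatever similarity test FeedYou actually runs:

```python
def find_overlaps(training_data):
    """training_data: {intent: [training phrases]}. Flags phrases whose
    normalized token sets also appear under a different intent."""
    seen = {}       # token set -> first intent that claimed it
    overlaps = []   # (phrase, earlier intent, conflicting intent)
    for intent, phrases in training_data.items():
        for phrase in phrases:
            key = frozenset(phrase.lower().split())
            if key in seen and seen[key] != intent:
                overlaps.append((phrase, seen[key], intent))
            else:
                seen.setdefault(key, intent)
    return overlaps
```

Running this on every model edit is cheap enough to give designers feedback in real time, before an overlapping phrase ever reaches production.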

Prometeus AI’s Flowstorm Platform

The second presentation by Honza from Prometeus AI covered their Flowstorm conversational AI platform, which evolved from four years of competing in the Amazon Alexa Prize (winning the final year with their “Alquist” bot). This background is notable because the Alexa Prize requires building coherent, engaging social chatbots capable of open-domain conversation—a significantly more challenging task than task-oriented bots.

Lessons from Alexa Prize Competition

The Alexa Prize required handling long-tail content with generic conversation handlers, manually managing engaging content, quickly creating and testing conversational models, implementing dialog management logic, and incorporating external knowledge (sports scores, movie information, etc.).

Their early approach used a visual editor to design flows, export to code, train NLU models, implement custom logic, and redeploy the entire application. This process was slow and created versioning problems since the visual state was separate from the code state. The complexity of managing high-level intent recognition combined with hierarchical intent recognition across topics led to edge cases and development speed issues.

Sub-Dialog Hierarchy Architecture

Flowstorm’s architecture is built around sub-dialogs—modular conversation components that can be nested hierarchically. The main dialog serves as an entry point, with sub-dialog nodes that reference other sub-dialogs. When the flow enters a sub-dialog, it follows that sub-dialog’s flow until reaching an exit node, then returns to the parent dialog.
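The enter/exit mechanics can be modeled with a call stack. This toy interpreter (node types and dialog content invented for illustration, not Flowstorm's actual format) resumes the parent dialog at the node after the sub-dialog reference:

```python
# Dialogs are lists of nodes: ("say", text), ("sub", name), or ("exit",).
DIALOGS = {
    "main": [("say", "Hi!"), ("sub", "weather"), ("say", "Bye!"), ("exit",)],
    "weather": [("say", "It is sunny."), ("exit",)],
}

def run(dialogs, entry="main"):
    transcript = []
    stack = [(entry, 0)]  # (dialog name, position); parent frames wait below
    while stack:
        name, pos = stack.pop()
        kind, *args = dialogs[name][pos]
        if kind == "say":
            transcript.append(args[0])
            stack.append((name, pos + 1))
        elif kind == "sub":
            stack.append((name, pos + 1))   # resume parent after the child exits
            stack.append((args[0], 0))      # enter the sub-dialog
        elif kind == "exit":
            pass  # drop this frame; the parent frame is next on the stack
    return transcript
```

Because sub-dialogs only interact with their parent through entry and exit, each one can be authored and tested in isolation.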

Benefits of this approach include modularity and reuse: sub-dialogs can be developed and tested in isolation, then composed and reused across conversations, keeping the top-level flow readable even as bots grow to hundreds of dialogues.

Hierarchical Intent Recognition

The sub-dialog architecture necessitated a hierarchical approach to intent recognition, with intents grouped into categories including local intents (tied to the current decision point), global intents (reachable from anywhere in a dialog), and out-of-domain utterances.

Their production data showed that local intents account for approximately 73% of matches, global intents 11%, and out-of-domain 16%—though these numbers vary based on dialog structure.

Model Architecture and Classification Approach

Flowstorm uses one model per decision point (for local intents) plus one model per dialog (for global intents). This results in multiple models that must be combined at runtime, but enables modular development where sub-dialogs can be developed and tested independently.
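A runtime combination of these models might look like the following sketch, with lambda stubs standing in for the trained per-decision-point and per-dialog classifiers and an assumed confidence threshold:

```python
def resolve(utterance, local_model, global_model, threshold=0.7):
    """Try the decision-point (local) model first, then the dialog-level
    (global) model, and fall back to out-of-domain handling."""
    intent, score = local_model(utterance)
    if score >= threshold:
        return ("local", intent)
    intent, score = global_model(utterance)
    if score >= threshold:
        return ("global", intent)
    return ("out_of_domain", None)

# Stub classifiers returning (intent, confidence); real ones are trained models.
local = lambda u: (("yes", 0.9) if u == "yes" else ("yes", 0.1))
glob = lambda u: (("stop", 0.8) if "stop" in u else ("stop", 0.2))
```

Checking the local model first reflects the production distribution, where roughly three quarters of matches are local to the current decision point.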

Of the two classification approaches they evaluated, the top-down approach was preferred because it avoids the intractable task of enumerating the effectively infinite set of out-of-domain examples as explicit training data.

Generative Models for Out-of-Domain Handling

A key challenge is that out-of-domain utterances represent an infinite set of possibilities that cannot be enumerated or pre-designed. Flowstorm addresses this by incorporating GPT-2 generative models to handle out-of-domain responses.

When the system detects an out-of-domain utterance, the generative model receives the conversation context (several previous turns) and generates multiple candidate responses. A ranking system then scores and selects the most appropriate response.
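The generate-and-rank loop reduces to a few lines. The generator and ranker below are stubs standing in for the real components (a deployment would sample from GPT-2 given the conversation context and score candidates with a trained ranking model):

```python
import itertools

def respond_out_of_domain(context, generate, rank, n_candidates=3):
    """Sample several candidate replies from a generative model and
    return the one the ranker scores highest."""
    candidates = [generate(context) for _ in range(n_candidates)]
    return max(candidates, key=rank)

# Stubs: a canned candidate stream and a toy ranker that prefers longer replies.
_canned = itertools.cycle(["Interesting!", "Tell me more about that.", "Hmm."])
generate = lambda context: next(_canned)
rank = lambda reply: len(reply)
```

Sampling multiple candidates and ranking them is what turns an unconstrained language model into something safe enough to put in front of users.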

They chose GPT-2 over GPT-3 because it is less resource-intensive while still providing adequate quality for these purposes. The models are also used to incorporate external knowledge, for example generating contextual responses grounded in news articles or other text the conversation references.

Research Directions

Prometeus AI is actively researching ways to add more control over the outputs of these generative models.

Key Takeaways for LLMOps

Both presentations highlight important production considerations for conversational AI: confident false positives damage user trust more than an honest “I don’t understand”; consolidating similar intents and disambiguating with entities beats training many fine-grained classifiers; validating models continuously during bot design surfaces overlapping intents early; modular sub-dialogs with hierarchical intent recognition keep large bots maintainable; and lightweight generative models offer a pragmatic fallback for the unbounded space of out-of-domain input.
