Structured LLM Conversations for Language Learning Video Calls

Duolingo 2025

Duolingo implemented an AI-powered video call feature called "Video Call with Lily" that enables language learners to practice speaking with an AI character. The system uses carefully structured prompts, conversational blueprints, and dynamic evaluations to ensure appropriate difficulty levels and natural interactions. The implementation includes memory management to maintain conversation context across sessions and separate processing steps to prevent LLM overload, resulting in a personalized and effective language learning experience.

Industry

Education

Overview

Duolingo’s “Video Call with Lily” feature represents a compelling case study in deploying Large Language Models for educational purposes. The feature enables language learners to engage in real-time spoken conversations with an AI character named Lily, providing speaking practice in a low-pressure environment. This article, published in April 2025, provides a behind-the-scenes look at how Duolingo’s engineering and learning design teams structure their LLM interactions to create a consistent, pedagogically sound, and engaging user experience.

The fundamental challenge addressed in this case study is that while modern LLMs like ChatGPT, Claude, and Gemini are trained on vast amounts of language data and can produce natural-sounding exchanges, they cannot be simply instructed to “teach a language learner” without significant additional structure and guidance. The Duolingo team has developed a sophisticated prompting architecture that transforms general-purpose language models into specialized educational tools.

Technical Architecture and Prompt Engineering

At the core of Duolingo’s approach is a carefully designed prompt structure that employs a three-character paradigm for organizing LLM interactions: the System, which carries the hidden instructions defining Lily’s persona, pedagogy, and constraints; the Assistant, which is Lily’s side of the conversation as generated by the model; and the User, which is the language learner’s side of the exchange.

This structure allows the team to maintain separation between instructional content (System) and conversational behavior (Assistant), enabling precise control over the AI’s output while still allowing natural conversation flow. The System instructions include detailed information about Lily’s personality (described as a “sarcastic emo teenage girl”), her backstory, strategies for helping stuck learners, and guidelines for speaking at the appropriate CEFR (Common European Framework of Reference for Languages) level.
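In chat-completion APIs, this three-role separation maps directly onto a message list. A minimal sketch of how such a prompt might be assembled (the persona wording, CEFR handling, and helper name are illustrative assumptions, not Duolingo's actual prompts):

```python
def build_messages(persona, cefr_level, assistant_turns, user_turns):
    """Assemble a chat-API message list: the System role carries
    instructions, the Assistant role carries Lily's lines, and the
    User role carries the learner's lines."""
    system_prompt = (
        f"You are {persona}, a sarcastic emo teenage girl.\n"
        f"Speak at CEFR level {cefr_level}.\n"
        "If the learner is stuck, offer a hint or rephrase the question."
    )
    messages = [{"role": "system", "content": system_prompt}]
    # Interleave the conversation so far; Lily (Assistant) speaks first.
    for assistant_line, user_line in zip(assistant_turns, user_turns):
        messages.append({"role": "assistant", "content": assistant_line})
        messages.append({"role": "user", "content": user_line})
    return messages

msgs = build_messages(
    "Lily", "A2",
    assistant_turns=["¿Qué tal? ¿Cómo te llamas?"],
    user_turns=["Hola, me llamo Ana."],
)
```

Keeping the instructional content in the System message and the conversational behavior in Assistant/User turns is what lets the instructions be swapped or augmented without rewriting the conversation itself.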

Conversation Structure and Flow Control

The team has implemented a predictable four-part structure for every conversation:

This structure demonstrates a key LLMOps principle: constraining AI behavior through predictable patterns while still allowing sufficient flexibility for natural interaction.

Multi-Stage Prompting to Prevent LLM Overload

One of the most interesting technical insights from this case study is Duolingo’s discovery that combining too many instructions into a single prompt can degrade output quality. The team found that when first-question generation instructions were combined with general conversation instructions, the LLM would produce undesirable results such as overly complex sentences or missing target vocabulary.

Their solution was to implement a multi-stage prompting architecture. The first question is generated in a separate “Conversation Prep” phase with focused instructions about CEFR level, required vocabulary, and other criteria. This question is then fed into the “Main Conversation” phase as a pre-determined input. The article notes that “when your Video Call is ringing, that’s when the System is formulating the first question”—indicating this preparation happens in real-time just before the conversation begins.

This approach reflects a broader LLMOps best practice: decomposing complex tasks into simpler, more focused subtasks can significantly improve output quality. The analogy provided—that a human given fifty tasks at once will either forget some or complete all in a “half-baked way”—illustrates the intuition behind this design decision.
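The two-phase decomposition described above can be sketched as two narrowly scoped calls, where the Conversation Prep output is passed into the Main Conversation as a fixed input. Here `call_llm` is a stand-in for any chat-completion client, and all prompt wording and function names are illustrative assumptions:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion API call.
    Returns a canned question so the sketch runs offline."""
    return "¿Tienes alguna mascota?"

def conversation_prep(cefr_level: str, target_vocab: list) -> str:
    """Phase 1: a focused prompt whose only job is the first question,
    so CEFR and vocabulary criteria get the model's full attention."""
    prompt = (
        f"Write one opening question at CEFR level {cefr_level} "
        f"that uses at least one of: {', '.join(target_vocab)}."
    )
    return call_llm(prompt)

def main_conversation(first_question: str, persona_instructions: str) -> dict:
    """Phase 2: the pre-generated question arrives as a pre-determined
    input; the conversation prompt never has to re-derive it."""
    return {
        "system": persona_instructions,
        "opening_line": first_question,  # fixed, not regenerated
    }

q = conversation_prep("A2", ["mascota", "perro"])
call_setup = main_conversation(q, "You are Lily, a sarcastic emo teen.")
```

The point of the split is that each prompt carries only the instructions relevant to its subtask, which is exactly the "fifty tasks at once" problem the article's analogy describes.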

Memory and Personalization System

The feature includes a sophisticated memory system that enables Lily to remember information about users across sessions. After each call ends, the transcript is processed by the LLM with a specific prompt asking “What important information have we learned about the User?” The extracted facts are stored in a “List of Facts” that becomes part of the System instructions for subsequent calls.

This approach allows Lily to make personalized callbacks to previous conversations, such as “How are your dogs doing?” or “Have you tried any good tacos lately?” The article describes this as making calls feel “personalized and magical”—a key user experience goal.

From an LLMOps perspective, this demonstrates a pattern for implementing persistent memory in conversational AI systems without fine-tuning or complex vector database retrieval. By using the LLM itself to extract salient facts and then injecting those facts into future System prompts, the team achieves personalization through prompt augmentation rather than model modification.
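This extract-then-inject loop can be sketched in two steps: a post-call extraction pass over the transcript, and a prompt-assembly step before the next call. The helper names and prompt wording are assumptions; the extraction question itself is quoted from the article:

```python
def extract_facts(transcript: str, call_llm) -> list:
    """Post-call step: ask the LLM what was learned about the User,
    one fact per line, and parse the result into a fact list."""
    prompt = (
        "What important information have we learned about the User?\n"
        f"Transcript:\n{transcript}\n"
        "Return one fact per line."
    )
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

def build_system_prompt(base_instructions: str, fact_store: list) -> str:
    """Next-call step: inject the accumulated List of Facts into the
    System instructions -- personalization via prompt augmentation,
    with no fine-tuning or vector retrieval."""
    facts = "\n".join(f"- {fact}" for fact in fact_store)
    return f"{base_instructions}\n\nKnown facts about the User:\n{facts}"

# A canned extractor stands in for the real model here.
fake_llm = lambda prompt: "The User has two dogs.\nThe User likes tacos."
store = extract_facts("...call transcript...", fake_llm)
system_prompt = build_system_prompt("You are Lily.", store)
```

Because the fact store is plain text appended to the System prompt, it survives across sessions as long as it is persisted per user, and it is trivially inspectable and editable.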

Real-Time Evaluation and Dynamic Adaptation

Perhaps the most operationally sophisticated element described is the mid-call evaluation system. The team recognized that rigid adherence to pre-planned topics could create poor user experiences—for example, when a learner excitedly shares news about completing a course and Lily responds with an unrelated question about Swiss folk music.

To address this, they implemented real-time evaluations during the conversation itself. Before Lily responds, the System evaluates the user’s latest utterance and asks Lily to consider questions such as whether the learner has just shared something exciting, whether they seem confused by the last question, and whether they are steering the conversation toward a different topic.

Based on these evaluations, Lily can dynamically adjust her responses—expressing excitement when appropriate, rephrasing when confusion is detected, or abandoning her original topic to follow the learner’s lead.

This introduces significant operational complexity: the LLM is performing evaluative reasoning in real time during active conversations, not just generating responses. The article notes that “the LLM is always working—even during the Video Call itself” to ensure quality, suggesting continuous evaluation running alongside generation.
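The evaluate-then-respond loop might look like the following sketch. A production system would ask an LLM to do the per-turn evaluation; simple keyword heuristics are used here only to keep the example self-contained, and all names and rules are assumptions:

```python
def evaluate_turn(user_utterance: str) -> dict:
    """Mid-call evaluation: signals the System could compute on each
    user turn before Lily responds (heuristics stand in for an LLM)."""
    text = user_utterance.lower()
    return {
        "excited": "!" in user_utterance or "finally" in text,
        "confused": "?" in user_utterance and ("what" in text or "qué" in text),
        "new_topic": "by the way" in text,
    }

def plan_response(evaluation: dict, planned_topic: str) -> str:
    """Dynamic adaptation: follow the learner's lead rather than
    rigidly returning to the pre-planned topic."""
    if evaluation["confused"]:
        return "rephrase the last question more simply"
    if evaluation["excited"]:
        return "react with enthusiasm before continuing"
    if evaluation["new_topic"]:
        return "drop the planned topic and follow the learner"
    return f"continue with planned topic: {planned_topic}"

action = plan_response(
    evaluate_turn("I finally finished my course!"),
    planned_topic="Swiss folk music",
)
```

In this sketch the learner’s excited announcement overrides the pre-planned Swiss folk music topic, which is precisely the failure mode the article’s example illustrates.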

Balancing Competing Priorities

The case study highlights the challenge of balancing multiple, sometimes competing, requirements in production AI systems: staying in character as Lily, speaking at the learner’s CEFR level, covering target vocabulary, remembering prior conversations, and still responding naturally to whatever the learner actually says.

The solution involves layering multiple systems (character instructions, level-appropriate vocabulary, memory injection, real-time evaluation) rather than attempting to solve everything with a single prompt.
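One way to express this layering is to compose the System prompt from independent sections rather than hand-writing a single monolithic instruction block. The section names and helper below are illustrative assumptions:

```python
def compose_system_prompt(layers: dict) -> str:
    """Layer independent subsystems (character, level/vocabulary,
    memory, evaluation rules) into one System prompt, in a fixed
    order, skipping any layer that is absent."""
    order = ["character", "cefr_vocabulary", "memory", "evaluation_rules"]
    return "\n\n".join(
        f"## {name}\n{layers[name]}" for name in order if name in layers
    )

system_prompt = compose_system_prompt({
    "character": "You are Lily, a sarcastic emo teen.",
    "cefr_vocabulary": "Speak at CEFR A2; work in: perro, taco.",
    "memory": "Known facts: the User has two dogs.",
})
```

Each layer can then be developed and tested on its own, which is the operational payoff of layering over a single all-purpose prompt.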

Critical Assessment

While this case study provides valuable insights into Duolingo’s approach, it’s worth noting some limitations:

The article is promotional in nature and doesn’t discuss failure modes, error rates, or quantitative measures of success. There’s no discussion of how often the mid-call evaluations fail to catch problematic responses, or how frequently the memory system extracts incorrect or irrelevant facts.

The article also doesn’t address latency considerations, which would be significant for real-time voice conversations with multiple LLM calls (question generation, response generation, mid-call evaluation) happening within a single interaction.

Additionally, there’s no mention of which specific LLM provider powers the feature, model versioning strategies, or how they handle model updates that might affect Lily’s behavior. These would be important operational considerations for a feature of this scale.

Despite these limitations, the case study provides a useful template for teams building character-based conversational AI with specific behavioral requirements. The multi-stage prompting approach, memory system design, and real-time evaluation patterns are all applicable techniques for similar production deployments.

Conclusion

Duolingo’s Video Call with Lily demonstrates that deploying LLMs for specialized educational applications requires significant engineering beyond basic API integration. The combination of structured prompting, multi-stage processing, persistent memory, and real-time evaluation creates a system that transforms general-purpose language models into targeted educational tools with consistent personality and appropriate pedagogical behavior. This case study offers practical insights for teams facing similar challenges in constraining and directing LLM behavior for domain-specific applications.
