LLMOps in Production: 457 Case Studies of What Actually Works

Throughout 2024, we collated a huge database of real-world LLMOps and GenAI case studies, examining how companies actually implement and deploy large language models in production. Starting with around 300 entries and growing to 457 case studies by January 2025, this collection represents over 600,000 words of technical implementation details, architectural decisions, and practical problem-solving approaches. To help practitioners navigate this wealth of information, we created a comprehensive series of thematic deep dives, each accompanied by an in-depth podcast episode powered by Google's NotebookLM.

Our journey began with "Demystifying LLMOps", providing an essential overview of our database's methodology and structure, offering a practical starting point for those venturing into production LLM deployments. This foundation was expanded in "LLMOps Lessons Learned", which synthesized insights from our first 380 case studies to paint a broad picture of the current LLMOps landscape.

The technical series started with "Building LLM Applications That Know What They're Talking About", exploring the crucial role of Retrieval-Augmented Generation (RAG) in creating knowledge-aware applications. This was followed by "Building Advanced Search, Retrieval, and Recommendation Systems with LLMs", which explored the practical challenges and solutions in implementing embedding-based search systems.

For those interested in automation and orchestration, "LLM Agents in Production" offered a pragmatic examination of real-world agent deployments, while "Prompt Engineering and Management in Production" tackled the often-overlooked challenges of maintaining and scaling prompt infrastructure. Our "Evaluation Playbook" provided a comprehensive look at how organizations measure and improve their LLM applications' performance.

The series concluded with two crucial operational aspects: "Optimizing LLM Performance and Cost", which examined the practical trade-offs in scaling LLM infrastructure, and "Production LLM Security", offering essential insights into protecting LLM applications against emerging threats.

Each blog post distilled key patterns and anti-patterns from our database of case studies, providing actionable insights for practitioners at every stage of their LLMOps journey. Given the extensive scope of the database, we've created these high-level summaries to offer another way to explore the wealth of implementation details and lessons learned.

What follows are summaries of the summaries, which is to say high-level summaries of the key parts of each case study, from the problems that companies faced to the solutions they tried out. This is a long blog post but it's rewarding just to scroll through and see what jumps out for you.

Going forward, we'll continue to expand the database with new technical write-ups and panel discussions as they emerge. We're also planning regular thematic deep-dives that explore specific aspects of AI Engineering and production GenAI deployments, helping practitioners stay current with evolving best practices and emerging patterns.

Here are the case study summaries:

A bank - A bank’s attempt to build a customer support chatbot using GPT-4 and RAG revealed the complexities of deploying LLMs in production, highlighting challenges in domain knowledge management, retrieval optimization, conversation flow design, and state management, with production issues including latency and regulatory compliance. The project, initially planned as a three-month proof of concept, underscores the need for robust infrastructure, comprehensive planning, and ongoing maintenance in LLMOps projects.
A major gaming company - A major gaming company collaborated with AWS Professional Services to build an automated toxic speech detection system by fine-tuning LLMs, starting with a small dataset and moving from a two-stage to a single-stage model, achieving 88% precision and 83% recall while reducing complexity and costs.
A technology company - A technology company improved developer documentation accessibility by deploying a self-hosted LLM solution using RAG, with guard rails for content safety and topic validation. They optimized performance using vLLM for faster inference and Ray Serve for horizontal scaling, achieving significant improvements in latency and throughput while maintaining cost efficiency, and ensuring proprietary information remained secure.
Aachen Uniklinik - A UK-based NLQ company collaborated with Aachen Uniklinik to develop a natural language interface for healthcare data analytics, allowing medical professionals to query complex patient data using natural language. The system employs a hybrid architecture, combining vector databases for semantic search, a fine-tuned LLM for intent detection and query transformation, and traditional SQL for structured data access, addressing challenges like handling “dirty data” and medical terminology complexity.
Accenture - Accenture’s Knowledge Assist utilizes a multi-model GenAI architecture on AWS, combining Anthropic’s Claude-2, Amazon Titan, Pinecone, and Kendra, to provide a scalable enterprise knowledge solution, resulting in a 50% reduction in new hire training time and a 40% drop in query escalations. The system demonstrates mature LLMOps practices with real-time processing, robust monitoring, and multi-language support.
Accenture - Accenture’s Industry X division implemented generative AI across manufacturing, validating nine use cases including operations twins and technical documentation automation, achieving 40-50% effort reduction in some areas. Their approach emphasized multi-agent architectures, human-in-the-loop workflows, and the use of domain-specific data for successful deployments.
Accenture / Databricks - Accenture and Databricks partnered to deploy specialized language models (SLMs) for a client’s contact center, moving beyond basic prompt engineering by using Databricks’ MLOps platform and GPU infrastructure to create fine-tuned models that understand industry-specific context, cultural nuances, and brand styles, resulting in improved customer experience and operational efficiency. The solution includes real-time monitoring, multimodal capabilities, and advanced security, demonstrating a sophisticated approach to AI-driven customer service.
Accolade - Accolade, a healthcare provider, addressed fragmented data by implementing a RAG system using Databricks’ DBRX model, improving information retrieval and customer service through a unified data lakehouse and real-time data ingestion, while maintaining HIPAA compliance. This setup, deployed via Databricks Model Serving, demonstrates a practical approach to LLM implementation in a regulated industry, emphasizing data governance, security, and compliance.
Activeloop - Activeloop leveraged its Deep Lake vector database to build an enterprise-grade memory agent system for patent processing, handling 600,000 new patents annually and managing 80 million total. The system uses specialized AI agents for tasks like claim search and abstract generation, reducing patent generation time and improving information retrieval accuracy by 5-10% using their Deep Memory technology.
Acxiom - Acxiom leveraged LLMs and LangChain to create an audience segmentation system, but faced challenges in debugging complex workflows. By implementing LangSmith for observability, they gained visibility into multi-agent interactions, optimized token usage, and scaled their hybrid model deployment effectively.
Addverb - Addverb has developed a multi-lingual voice control system for managing AGV fleets, utilizing edge-deployed Llama 3 for low-latency processing and cloud-based ChatGPT for complex tasks; this allows warehouse workers to use natural language commands in 98 languages to control AGVs, improving operational efficiency and reducing reliance on specialized engineers.
ADP - ADP is developing “ADP Assist,” a generative AI platform to enhance user interaction across their HR, payroll, and workforce management tools, leveraging a “One AI” and “One Data” platform with Databricks for MLOps, vector search, and data governance. Their approach emphasizes quality assurance through robust testing and RAG implementation, while also addressing critical concerns around data security, cost optimization, and scalability for their global operations serving over 41 million workers.
Adyen - Adyen, a global fintech platform, enhanced its support operations by deploying a smart ticket routing system and a support agent copilot, both powered by LLMs and built using LangChain on Kubernetes. This resulted in improved ticket routing accuracy and faster response times through automated document retrieval and answer suggestions, while maintaining the flexibility to switch between different LLM models.
Agmatix - Agmatix developed Leafy, a generative AI assistant powered by Amazon Bedrock and the Anthropic Claude model, to streamline agricultural field trial analysis, enabling agronomists to query data using natural language and automatically generate visualizations, resulting in a 20% efficiency improvement, 25% better data integrity, and a tripling of analysis throughput. The system leverages AWS services for data pipeline management and provides a natural language interface for querying complex agricultural research data.
AI Hero / Outer Bounds - A panel of industry experts explored the use of Kubernetes for LLM operations, discussing its strengths in workload orchestration and vendor agnosticism, while also addressing challenges like GPU management and container size limitations. The discussion emphasized the need for tailored abstractions and optimizations when deploying LLMs on Kubernetes, covering key areas such as cost optimization and architectural patterns.
Aiera - Aiera, a financial intelligence platform, developed an automated earnings call summarization system using Anthropic’s Claude models, focusing on extracting key financial insights. Their rigorous evaluation process compared ROUGE and BERTScore metrics, revealing the trade-offs in scoring methodologies and the challenges of assessing generative AI outputs in production, ultimately selecting Claude 3.5 Sonnet as the best performer.
Aimpoint Digital - Aimpoint Digital built an AI agent system for automated travel itinerary generation, utilizing a multi-RAG architecture with parallel processing for places, restaurants, and events, and Databricks Vector Search for scalable data retrieval. The system employs LLM-as-judge for automated evaluation, alongside retrieval metrics and DSPy for prompt optimization, ensuring personalized itineraries are generated quickly and accurately.
AirBnB - AirBnB upgraded their conversational AI platform to a hybrid system, integrating LLMs for enhanced natural language understanding while retaining traditional workflows for sensitive operations. Their new platform features Chain of Thought reasoning, robust context management, and a comprehensive guardrails framework, demonstrating a pragmatic approach to production LLM deployment.
Airbnb - Airbnb enhanced its customer support using LLMs for content recommendation, real-time agent assistance, and chatbot paraphrasing, moving from classification to prompt-based generation with encoder-decoder architectures. They used DeepSpeed for efficient training, implemented data cleaning pipelines, and focused on prompt engineering to improve content relevance, agent efficiency, and user engagement.
Airtop - Airtop utilized the LangChain ecosystem to develop a web automation platform, enabling AI agents to interact with websites using natural language, featuring modular architecture for easy LLM switching and LangGraph for building scalable agents with built-in validation. The platform includes an Extract API for data retrieval and an Act API for real-time interactions, with LangSmith providing debugging and testing capabilities to ensure production reliability.
Airtrain / Healthcare company / E-commerce unicorn - Airtrain’s case studies demonstrate the cost-effectiveness of fine-tuning smaller LLMs for production, with a healthcare company achieving performance parity with GPT-3.5 for a patient intake chatbot by fine-tuning Mistral-7B, and an e-commerce company improving product classification accuracy from 47% to 94% while cutting costs by 94% compared to GPT-4. These examples highlight that fine-tuning can deliver significant cost savings without sacrificing performance, provided there is high-quality training data and clear evaluation metrics.
Alaska Airlines - Alaska Airlines utilized Google Cloud’s Gemini models to develop a natural language flight search, enabling customers to describe their travel needs conversationally, rather than using traditional search parameters. This system integrates Gemini with real-time flight data, customer profiles, and pricing systems, supporting complex queries and providing accurate, personalized recommendations, all while prioritizing factuality and user experience.
Alaska Airlines / Bitra - Alaska Airlines, in partnership with Bitra, developed QARL, a novel testing framework that uses LLMs to evaluate other LLMs in production by simulating various user interactions and evaluating responses for safety and business alignment. This approach allows for automated adversarial testing of customer-facing chatbots, identifying potential risks and unwanted behaviors before deployment.
Allianz Benelux - Allianz Benelux implemented an AI-powered chatbot using Landbot to streamline their insurance claims process, analyzing over 92,000 unique search terms and integrating a real-time feedback loop with Slack and Trello for rapid iteration, achieving a 90% positive feedback rating across 18,000+ customer interactions. This resulted in a simplified claims process and improved operational efficiency.
Allianz Direct - Allianz Direct utilized Databricks’ Mosaic AI to create a RAG-based agent assist tool, improving answer accuracy by 10-15% by providing agents with quick access to policy information, while also adhering to strict financial industry compliance through Unity Catalog for data governance. This allowed agents to focus on customer relationships rather than searching through documentation.
Altana - Altana, a supply chain intelligence company, utilizes a “compound AI systems” approach, integrating custom deep learning models, fine-tuned LLMs, and RAG workflows, all managed through Databricks Mosaic AI. This sophisticated LLMOps infrastructure enables them to automate complex tasks like tax classification and legal write-ups, while ensuring data privacy and achieving a 20x speedup in model deployment and a 20-50% performance boost.
Amazon - Amazon Finance Automation implemented a RAG-based Q&A system using Amazon Bedrock, achieving a significant accuracy increase from 49% to 86% by iteratively improving document chunking, prompt engineering, and embedding model selection, demonstrating a systematic approach to LLM optimization. This resulted in a substantial reduction in customer query response times, showcasing the practical application of LLMOps best practices in a finance setting.
Amazon - Amazon’s COSMO framework employs LLMs to construct a commonsense knowledge graph, enhancing product recommendations by extracting relationships from customer data, refining them with human and ML filters, and integrating the graph into recommendation models, achieving a 60% performance improvement. This demonstrates a robust LLMOps pipeline for production use.
Amazon Alexa - Amazon Alexa’s NLP team focused on maintaining consistent performance during model updates and handling input variations in their production voice assistant. They used positive congruent training to preserve correct predictions from previous models and T5 models to generate synthetic data, improving the model’s robustness to slight changes in phrasing.
Amazon Pharmacy - Amazon Pharmacy deployed a HIPAA-compliant, LLM-powered chatbot using a Retrieval Augmented Generation (RAG) pattern with SageMaker JumpStart foundation models to provide customer service agents with quick access to accurate pharmacy information. The system incorporates a feedback loop for continuous improvement, while maintaining strict security and compliance through network isolation and role-based access control.
AngelList - AngelList replaced their initial AWS Comprehend-based document processing with OpenAI models, leading to the development of Relay, a system that automates the extraction of key investment data with 99% accuracy using LangChain for prompt orchestration and a multi-API approach for redundancy. This transition significantly reduced manual processing overhead and improved accuracy.
Anthem Blue Cross - An on-premise LLM system was built to generate health insurance appeals, using fine-tuned models trained on medical review board data and synthetic data, deployed on personal hardware with Kubernetes. The system includes GPU inference, a Django frontend, and a redundant network setup, addressing challenges like GPU optimization and hardware reliability.
Anthropic - Anthropic developed Clio, a privacy-preserving system that uses their LLM Claude to analyze and cluster user conversations, extracting high-level insights without exposing raw data. This allows them to improve safety, understand usage patterns, and detect misuse while maintaining strong privacy through techniques like minimum cluster sizes and privacy auditing.
Anthropic - Anthropic developed Clio, an automated system powered by Claude, to analyze production usage of their Claude language models while preserving user privacy. Clio extracts metadata, performs semantic clustering, and generates cluster descriptions to identify usage patterns and potential misuse, improving safety classifiers by identifying both false positives and negatives.
Anthropic / Amazon - This case study outlines a production-ready multilingual document processing pipeline that uses Anthropic’s Claude models via Amazon Bedrock, combined with human-in-the-loop validation using Amazon A2I, and a custom ReactJS UI. The system leverages a multi-stage AWS Step Functions pipeline, the Rhubarb framework for document understanding, and emphasizes structured output using JSON schemas, comprehensive state tracking with DynamoDB, and serverless architecture for scalability.
Anthropic / OpenAI - A study evaluated Claude, GPT-4, LLaMA, and Pi 3.1 for medical transcript summarization, finding GPT-4 achieved the highest accuracy while Pi 3.1 balanced accuracy and conciseness, with results suggesting a potential to reduce care coordinator preparation time by over 50%. The research used over 5,000 medical transcripts and compared model performance using ROUGE scores and cosine similarity.
Anzen - Anzen utilizes a multi-model approach, combining specialized models for document structure with LLMs for content understanding, to build a robust legal document processing system, achieving 99.9% accuracy in document classification. Their solution incorporates comprehensive monitoring, feedback, and fine-tuned classification models, demonstrating practical techniques for managing LLM hallucinations and building production-grade systems in high-stakes environments.
Anzen - Anzen, a small insurance company, implemented LLMs to automate their underwriting process using BERT for document classification and AWS Textract for information extraction, achieving 95% accuracy, and also built a compliance document review system using sentence embeddings and question-answering models to provide rapid feedback on legal documents. This allowed them to compete with larger insurers by streamlining key operations.
AppFolio - AppFolio developed Realm-X Assistant, a property management AI copilot, using LangGraph for complex workflow management and LangSmith for monitoring and debugging, achieving a significant performance boost in text-to-data functionality from 40% to 80% through dynamic few-shot prompting and saving users over 10 hours per week. The system incorporates robust testing, evaluation, and CI/CD pipelines, demonstrating a mature approach to LLMOps.
Applaud - Applaud, an HR tech company, successfully deployed an HR-focused AI assistant, addressing challenges like content management with selective integration and building a context-aware engine, while also innovating with a novel testing approach and implementing temperature controls for accuracy; the deployment emphasized integration with existing HR systems and clear communication about the AI’s capabilities, resulting in improved HR service delivery and a framework for continuous optimization.
Arcade AI - Arcade AI has developed a tool calling platform to address the challenges of deploying LLM agents in production, providing a dedicated runtime for tools, separate from orchestration, and a robust authentication system with secure token management. The platform includes a Tool SDK for standardized development, an engine for serving APIs, and an actor system for scalable tool execution, along with built-in monitoring and evaluation capabilities.
Ask-a-Metric - Ask-a-Metric, a WhatsApp-based AI data analyst, refined its natural language to SQL query system by experimenting with an agent-based approach using CrewAI, ultimately implementing an optimized hybrid pipeline that achieved high accuracy with significantly reduced query response times and costs. The case study highlights the value of using agent experiments to inform the design of a production system, demonstrating how a hybrid approach can combine the benefits of different architectures.
Assembled - Assembled, a customer support solutions company, automated their test generation process using LLMs, reducing test writing time from hours to minutes and saving hundreds of engineering hours. By integrating high-quality models and structured prompts into their workflow, they achieved increased test coverage and improved code quality, while maintaining a focus on manual verification and iterative refinement.
Athena Intelligence - Athena Intelligence developed Olympus, an AI-powered platform for generating enterprise research reports, leveraging LangChain for model abstraction and tool integration, LangGraph for orchestrating complex multi-agent workflows, and LangSmith for development and production monitoring. This stack enabled them to handle complex data tasks, generate high-quality reports with accurate source citations, and achieve significant improvements in development speed and system reliability.
Austrian Post Group IT - Austrian Post Group IT developed an Autonomous LLM-based Agent System (ALAS) using GPT-3.5-turbo-16k and GPT-4 to enhance user story quality in agile development, employing specialized agents for Product Owner and Requirements Engineer roles. The system improved story clarity and completeness, addressing challenges like token limits through prompt optimization and manual validation, and was validated by 11 professionals across six agile teams.
AWS / Metaflow - AWS Trainium and Metaflow are democratizing large-scale ML training by integrating purpose-built hardware with modern MLOps frameworks, enabling teams to achieve enterprise-grade infrastructure without deep distributed systems expertise. This combination results in significant cost reductions, simplified deployment, and the ability to scale from small experiments to massive distributed training with minimal code changes.
AWS GenAIIC - AWS GenAIIC shares practical lessons from building production RAG systems, detailing their use of OpenSearch Serverless for vector search and Amazon Bedrock for custom pipelines, emphasizing retrieval optimization through hybrid search, metadata enhancement, and query rewriting, alongside chunking strategies and advanced features like custom embedding training and response quality control. The case study highlights performance optimization, scalability, and reliability mechanisms, demonstrating improved retrieval accuracy, response quality, and user trust.
AWS GenAIIC - AWS GenAIIC demonstrates building production RAG systems that handle heterogeneous data, using routers to direct queries, LLMs for code generation on structured data, and multimodal approaches for images; the system uses OpenSearch for vector storage and k-NN search, with modular design and robust error handling. The case study highlights practical implementations across multiple industries, focusing on managing latency, data quality, and scalability.
Babbel - Babbel leveraged Python, LangChain, and OpenAI GPT models to create an AI-assisted content creation platform, deployed on AWS, that significantly reduces the time required to produce language learning materials. The platform, featuring a Gradio-based UI, manages prompts, generates diverse content formats, and integrates human feedback loops, achieving over 85% acceptance rate from editors.
Bank of America / NVIDIA / Microsoft / IBM - A panel of experts from Bank of America, NVIDIA, Microsoft, and IBM discussed the implementation of LLM systems in enterprise environments, emphasizing the need for clear business metrics, robust data governance, and continuous monitoring. The discussion highlighted the differences between traditional MLOps and LLMOps, focusing on testing, evaluation, and the increasing importance of retrieval accuracy and agent-based workflows.
Barclays - Barclays is adapting its MLOps infrastructure to integrate LLMs, using a hybrid approach that combines traditional ML with GenAI, emphasizing open-source tools and interoperability. Their strategy includes vector databases for RAG, new metrics for LLM monitoring, and a strong focus on regulatory compliance, all while maintaining a clear focus on business value and ROI.
BenchSci - BenchSci utilizes domain-specific LLMs and a RAG architecture, integrating their biomedical knowledge base with Google’s Med-PaLM, to accelerate drug discovery, specifically in biomarker identification, achieving a 40% increase in scientist productivity and reducing processing times from months to days. The platform emphasizes scientific accuracy through continuous validation and addresses enterprise-level security and compliance requirements.
Bito - Bito, an AI coding assistant, built a multi-LLM orchestration system to handle API rate limits and ensure high availability, intelligently routing requests across providers like OpenAI, Anthropic, and Azure, while selecting models based on context size, cost, and performance. They also use sophisticated prompt engineering for security and accuracy, prioritizing local code processing to maintain user privacy.
Block / Databricks - Block (Square) deployed a comprehensive LLMOps strategy across its business units, utilizing a decoupled vector search, an AI Gateway for model management, and robust quality assurance. Built on Databricks, their architecture enabled them to scale to hundreds of production endpoints while maintaining operational control, cost-effectiveness, and security.
Blueprint AI - Blueprint AI leverages GPT-4 to create a platform that bridges communication gaps between business and technical teams by automatically analyzing data from development tools like GitHub and Jira to generate intelligent reports, track progress, and identify potential blockers, while also focusing on performance optimization through streaming responses and caching. The platform provides 24/7 monitoring and context-aware updates, aiming to keep teams informed without manual reporting overhead.
BNY Mellon - BNY Mellon deployed an enterprise-wide virtual assistant powered by Vertex AI, enabling 50,000 employees to access internal knowledge and policies, overcoming challenges in document processing and context-aware information delivery. The solution, which started with pilot programs, now handles diverse information requests, improving information accessibility and streamlining knowledge management practices across the organization.
Bosch - Bosch, a global enterprise, deployed “Gen Playground,” a centralized generative AI platform, to streamline marketing content creation and translation across its vast digital ecosystem, enabling 430,000+ employees to generate text and images, and perform translations, significantly reducing content creation time and costs while maintaining brand consistency. This implementation, leveraging Google Cloud services and custom models, focused on practical business use cases and user empowerment, demonstrating a pragmatic approach to enterprise AI deployment.
Box / Glean / Tyace / Security AI / Citibank - A panel of leaders from Box, Glean, Tyace, Security AI, and Citibank discussed their experiences implementing LLMs in production, covering challenges like data integration, security, and cost. The discussion highlighted different use cases, including content management, enterprise search, personalized content generation, and enterprise-wide AI deployment, while emphasizing the importance of data governance and a systematic approach to scaling LLMs.
BQA - BQA in Bahrain implemented a production LLM system using Amazon Bedrock and other AWS services to automate the analysis of education quality assessment reports, employing a dual-model approach with Meta Llama for summarization and Amazon Titan Express for evaluation, achieving 70% accuracy in generating standards-compliant reports and reducing evidence analysis time by 30%. The system uses an event-driven architecture with SQS queues and Lambda functions for scalable document processing.
Bud Financial / Scotts Miracle-Gro - Bud Financial and Scotts Miracle-Gro are using Google Cloud’s AI to create personalized experiences, with Bud Financial building a conversational AI for banking using Vertex AI and GKE, and Scotts Miracle-Gro developing “MyScotty,” an AI assistant for gardening advice leveraging Vertex AI Search and multimodal inputs. Both companies prioritize rigorous testing, continuous monitoring, and seamless integration with Google Cloud services to deliver accurate and relevant responses.
Build Great AI - Build Great AI has developed a system that uses multiple LLMs, including LLaMA, GPT-4, and Claude, to generate 3D printable models from text descriptions, outputting OpenSCAD code that is converted to STL files; this approach achieves a 60x speed improvement in prototyping compared to manual CAD work, while addressing LLM spatial reasoning limitations through multiple simultaneous generations and iterative refinement.
BuzzFeed - BuzzFeed Tech successfully integrated LLMs into their content platform by moving from basic ChatGPT implementations to a custom retrieval-augmented generation system, addressing limitations in dataset recency and context window constraints by developing a “native ReAct” implementation and enhancing their vector search architecture with Pinecone, resulting in a more controlled, cost-efficient, and production-ready LLM system.
Cambrium - Cambrium is using LLMs and AI to design novel proteins for sustainable materials, starting with vegan human collagen for cosmetics. They’ve developed a protein programming language and leveraged LLMs to transform protein design into a mathematical optimization problem, enabling them to efficiently search through massive protein sequence spaces.
Canva - Canva developed a systematic LLM evaluation framework for its Magic Switch feature, focusing on defining success criteria and measurable metrics before implementation. This framework uses both rule-based and LLM-based evaluators to assess content quality across dimensions like information preservation, intent alignment, and format, incorporating regression testing to ensure prompt improvements don’t degrade overall quality.
Canva - Canva utilized LLMs for feature extraction to categorize user search queries and group content pages semantically, replacing traditional ML classifiers. This resulted in improved accuracy, reduced development time, and lower operational costs, while also simplifying the feature extraction process for content categorization.
Canva - Canva automated their Post Incident Review (PIR) summarization process using GPT-4, extracting data from Confluence, preprocessing it, and then using a structured prompt to generate concise summaries, which are then integrated with their data warehouse and Jira, improving data consistency and reducing engineering workload. The solution proved effective, with most AI-generated summaries requiring no human modification, while maintaining high quality and consistency at a cost of $0.6 per summary.
Cedars Sinai - Cedars Sinai has implemented a range of AI-powered tools in production for neurosurgery, including a CNN-based brain tumor classification system achieving 95%+ accuracy, a graph neural network for hematoma management with 80%+ accuracy, and AI-driven surgical planning and intraoperative guidance systems, demonstrating real-time processing and integration with existing medical infrastructure. These implementations showcase the use of multiple model types and address challenges like data limitations, regulatory compliance, and clinical workflow integration.
Chaos Labs - Chaos Labs’ Edge AI Oracle uses LangChain and LangGraph to create a multi-agent system for resolving prediction market queries, employing a decentralized network of AI agents powered by multiple LLMs to ensure objective and accurate resolutions. The system features a sophisticated workflow with specialized agents, providing transparent, traceable results with configurable consensus requirements.
Character.ai - Character.ai scaled their conversational AI platform to 30,000 messages per second by building custom foundation models and implementing multi-query attention for GPU cache reduction, while also developing a sophisticated GPU caching system and a prompt management system called “prompt-poet”. This case study highlights the need for innovative solutions across the entire stack, including database optimization, and comprehensive monitoring and testing when running LLMs at scale.
Checkr - Checkr leveraged a fine-tuned Llama-3-8b-instruct model on Predibase to classify complex background check records, achieving 90% accuracy on challenging cases, a 5x cost reduction, and a 30x speedup compared to their initial GPT-4 implementation. Their journey highlights the effectiveness of fine-tuned smaller models and the importance of monitoring, metrics, and efficient serving techniques like LoRA for production LLM deployments.
Chevron Phillips Chemical - Chevron Phillips Chemical is strategically deploying LLMs for virtual agents and document processing, emphasizing a measured approach with a cross-functional team to address challenges like testing and bias. They are focusing on specific use cases, such as extracting data from unstructured documents and creating topic-specific virtual agents, while implementing a robust governance framework.
CircleCI - CircleCI shares their experience building and deploying AI-powered features, like an error summarizer, detailing the challenges of testing LLM-based applications with non-deterministic outputs and subjective evaluations, and how they addressed these with model-graded evaluations, robust error handling, and user feedback loops. They emphasize the need for new testing strategies beyond simple string matching, while also focusing on cost optimization and scaling considerations.
CircleCI - CircleCI formed a tiger team to explore AI integration, resulting in an AI error summarizer feature, using existing foundation models, LangChain, and OpenAI APIs. The team prioritized rapid prototyping and API-first integration, demonstrating that valuable AI features can be achieved through focused exploration and iterative development without the need for complex custom models.
Circuitry.ai - Circuitry.ai developed a RAG-powered decision intelligence platform for manufacturers, using Delta Lake and Unity Catalog on Databricks for data management and governance, and Llama and DBRX models for response generation, achieving a 60-70% reduction in information search time. Their implementation highlights the importance of robust data management, flexible model architecture, and continuous feedback loops in enterprise LLM deployments.
Cisco - Cisco developed a comprehensive LLMOps framework to manage the complexities of deploying LLMs at scale, adapting traditional DevOps practices to address the unique challenges of AI-powered applications, with a focus on continuous delivery, robust monitoring, stringent security, and specialized operational support. This framework highlights the need for enterprise-specific considerations like scalability, integration with existing systems, and governance to ensure the successful and secure implementation of LLMs.
Cleric AI - Cleric AI’s AI-powered SRE system automates production issue investigation by integrating with existing observability tools, using a concurrent architecture to test multiple strategies and LangSmith for monitoring. The system implements continuous learning, capturing feedback and generalizing successful patterns across deployments while maintaining strict privacy controls.
Cleric Ai - Cleric Ai has developed an AI-powered SRE agent that autonomously monitors infrastructure, investigates issues, and provides diagnoses using a reasoning engine, tool integrations, and memory systems, aiming to reduce engineer workload by automating investigation workflows. The system emphasizes transparent decision-making, configurable model selection, and human oversight for remediation actions.
Clipping - Clipping, an educational technology startup, developed ClippingGPT, an AI tutor that leverages a specialized knowledge base and embeddings to significantly improve LLM accuracy, achieving a 26% performance increase over GPT-4 on the Brazilian Diplomatic Career Examination by prioritizing factual recall before response generation. This demonstrates how domain-specific knowledge integration can enhance LLM accuracy for educational applications.
Co-op - Co-op, a major UK retailer, implemented a RAG-powered virtual assistant using the Databricks Data Intelligence Platform to improve store employee access to operational information, processing over 1,000 policy documents with vector embeddings and semantic recall, and selecting GPT-3.5 after evaluating multiple models. The system, designed to handle 50,000-60,000 weekly queries, is currently in proof-of-concept, showing improved information retrieval and reduced support center load, with plans for a phased rollout.
CoActive AI - CoActive AI developed a system for processing unstructured data at scale, using logical data models to bridge the gap between traditional storage and AI processing needs, and optimizing embedding computations to reduce costs. Their approach includes hybrid data and AI teams, cached embeddings, and task-specific output layers to enable efficient and scalable AI operations across diverse data modalities.
Codeium - Codeium developed a novel “M-query” system that uses custom infrastructure to enable parallel processing of thousands of LLM calls, allowing for independent reasoning over potential context items, moving beyond the limitations of traditional embedding-based retrieval for code generation. This approach, combined with custom model training and a focus on real-world usage metrics, allows them to handle complex contextual queries across large codebases, delivering fast and accurate code generation for their IDE plugins used by Fortune 500 companies.
Codeium - Codeium’s development of their AI-powered IDE demonstrates the importance of early investment in robust infrastructure for enterprise-grade AI tools. By prioritizing containerization, security, and flexible deployment options from the outset, they were able to scale from individual developers to large enterprises.
Cognition AI - Cognition AI’s Devin is an autonomous software engineering agent that integrates with standard development tools like GitHub, Slack, and VS Code to perform complex tasks. Devin can operate within complete development environments, manage machine states, handle pull requests, and even debug code, showcasing its capacity for parallel task processing and integration with existing workflows.
Consulting Firm / Car manufacturer / International bank - This case study examines several LLM implementations, including a consulting firm’s financial database search using text-to-SQL and keyword generation, an automotive showroom assistant employing multi-layer processing for non-canonical data, and a banking code copilot project emphasizing the need for clear requirements and technical expertise. The studies highlight the importance of robust testing, systematic measurement, and careful data handling, while also noting that vector databases aren’t always necessary and that engineering work can be more challenging than the LLM integration itself.
Convirza - Convirza utilizes Llama 3B with LoRA adapters, deployed via Predibase, to analyze millions of call center interactions monthly, achieving sub-0.1 second inference times and a 10x cost reduction compared to OpenAI. Their multi-adapter serving architecture on single GPUs enables efficient analysis of numerous custom metrics for agent performance and caller behavior, demonstrating that smaller, well-tuned models can outperform larger ones in specific domains.
Convirza / Predibase - Convirza, an AI-powered software platform, enhanced its agent performance evaluation by switching from Longformer models to a fine-tuned Llama-3-8b model using Predibase’s multi-LoRA infrastructure, achieving a 10x cost reduction compared to OpenAI, an 8% increase in F1 scores, and an 80% increase in throughput while efficiently processing millions of customer service calls. This implementation demonstrates the effectiveness of multi-LoRA serving for high-volume, real-time analysis, maintaining sub-second inference times across 60 performance indicators.
Coval - Coval is revolutionizing AI agent testing by applying autonomous vehicle simulation principles, moving from manual testing to a probabilistic approach with dynamic scenarios and multi-layered testing architectures. This approach emphasizes robust error handling and reliability, using LLMs to benchmark agent performance against human capabilities, and provides tools for dynamic scenario generation and performance monitoring.
Cox 2M - Cox 2M leveraged Gemini LLMs with Thoughtspot and Google Cloud to overcome slow analytics and resource constraints, enabling non-technical users to query complex IoT and fleet management data using natural language, reducing time to insights by 88% and cutting response times from a week to under an hour. The solution also features automated insight generation, change analysis, and a feedback loop for continuous improvement, all while processing real-time IoT sensor data.
Credal - This case study analyzes the journey of enterprises adopting LLMs, detailing a four-stage progression from basic experimentation to core operations integration, emphasizing the importance of a multi-LLM approach, robust security, and advanced RAG for enterprise search. It also highlights the need for careful build vs. buy decisions, platform architecture, and comprehensive monitoring and governance frameworks to address challenges like security, debugging, and performance optimization.
Credal - Credal, specializing in enterprise GenAI, processed 250,000 LLM calls across 100,000 corporate documents, finding that effective production LLM systems require careful data formatting and prompt engineering, especially for complex documents with footnotes and tables, and that focusing prompts on specific, challenging aspects of tasks led to better results.
Crisis Text Line / Databricks - Crisis Text Line utilized Databricks to create a robust LLMOps platform, deploying a fine-tuned Llama 2 conversation simulator for training crisis counselors with synthetic data, and a conversation phase classifier to maintain support quality, enhancing training and supporting over 1.3 million crisis conversations. This implementation demonstrates a responsible approach to using LLMs in a sensitive healthcare context.
Cursor - Cursor built a next-generation AI-enhanced code editor by forking VS Code and integrating advanced LLM capabilities, focusing on a responsive and predictive coding experience beyond basic autocompletion. They employ techniques like Mixture of Experts (MoE) models, speculative decoding, and sophisticated caching to handle large context windows efficiently and maintain low latency.
Da.tes - A software team developed a semi-automated test case generation system using GPT-3.5 Turbo and LangChain, employing structured prompts and standardized templates. The AI-generated tests, applied to the Da.tes platform, achieved a 4.31 quality score, slightly outperforming human-generated tests at 4.18, demonstrating the viability of LLMs for test automation.
Dandelion Health - Dandelion Health implemented a HIPAA-compliant NLP pipeline for de-identifying patient data, combining John Snow Labs’ Healthcare NLP with custom pre- and post-processing within a secure AWS environment. Their system employs context-aware processing, “hiding in plain sight” techniques, and rigorous quality control to achieve high recall rates while preserving data utility for medical research.
Danswer - Danswer, an enterprise search solution, migrated to Vespa to overcome limitations in their initial vector search setup, enabling hybrid search for team-specific terminology, custom decay functions for document versioning, and multi-vector embeddings for improved context, all while maintaining performance at scale. This migration improved search accuracy and resource efficiency for their RAG-based enterprise search product, highlighting the complexities of scaling LLM applications in production.
Databricks - Databricks built a Field AI Assistant using their Mosaic AI agent framework to streamline sales operations, integrating data from their Lakehouse, CRM, and other sources. The system, powered by Azure OpenAI’s GPT-4, provides conversational data access, automates tasks, and manages CRM updates, while emphasizing data quality, governance, and continuous monitoring.
Databricks - Databricks built a custom, fine-tuned 7B parameter LLM to generate documentation for their Unity Catalog, overcoming challenges with quality, cost, and performance experienced with SaaS LLMs. This bespoke model now powers 80% of table metadata updates, achieving better quality, 10x cost reduction, and higher throughput.
Databricks / Last Mile AI / Honeycomb - A panel of experts from Databricks, Last Mile AI, and Honeycomb, among others, discussed the complexities of deploying LLM applications to production, highlighting challenges such as unpredictable user interactions and the need for robust feedback mechanisms, while also emphasizing the importance of domain-specific evaluation and strong knowledge management. The discussion covered best practices like gradual rollouts, automated tooling, and continuous improvement strategies, drawing parallels with traditional MLOps and offering recommendations for teams to ensure successful LLM deployments.
Datastax - Datastax created UnReel, a real-time multiplayer movie trivia game, using Langflow for AI pipelines and RAG to generate both real and fake movie quotes, with MistralAI’s model selected after testing. The game uses Astra DB for data storage, Cloudflare Durable Objects for state management, and PartyKit for real-time multiplayer, highlighting key LLMOps lessons such as prompt engineering, batch processing, and robust state management.
Dataworkz - Dataworkz is using a RAG-based platform to improve insurance call center efficiency by converting call recordings into searchable vectors using Amazon Transcribe, Cohere, and MongoDB Atlas Vector Search, enabling agents to quickly access relevant information. This system includes a sophisticated ETL pipeline, robust monitoring, and A/B testing capabilities, demonstrating a practical approach to implementing LLMs in production.
DDI - DDI, a leadership development company, automated their behavioral simulation assessments using LLMs and a robust MLOps platform built on Databricks, reducing report generation time from 48 hours to just 10 seconds. By leveraging prompt engineering techniques and fine-tuning Llama3-8b, they achieved significant improvements in both speed and accuracy of complex behavioral analysis, with a recall score of 0.98 and an F1 score of 0.86.
Deepgram - Deepgram, a leader in transcription services, details the challenges and solutions in building production-ready conversational AI voice agents, highlighting their new text-to-speech product, Aura. The case study emphasizes the need to manage latency, aiming for a 300ms benchmark, and addresses complexities in end-pointing, voice quality optimization through prosody, and natural conversation implementation using filler words and pauses.
Deepgram - Deepgram, a Speech-to-Text company, developed domain-specific small language models for call center applications, fine-tuning a 500M parameter model on call center transcripts to achieve superior performance in tasks like conversation continuation and summarization compared to larger models, while also being more cost-effective and faster. This demonstrates the value of specialized models over general-purpose ones for practical, real-world applications.
Defense Innovation Unit / Global Fishing Watch / Coast Guard / NOAA - The Defense Innovation Unit developed a machine learning system using satellite-based SAR imagery to detect illegal fishing, deploying it across 100+ countries via the SeaVision platform, and successfully identifying “dark vessels” that do not broadcast AIS signals. This system uses a Faster R-CNN model and addresses challenges with large image sizes and edge deployment, enabling targeted enforcement in marine protected areas.
Delivery Hero - Delivery Hero built a production-ready product matching system using LLMs, moving from basic lexical matching to a retrieval-rerank architecture with SBERT for semantic encoding and transformer-based cross-encoders, efficiently handling large catalogs while balancing accuracy and computational cost through hard negative sampling and fine-tuning.
Department of Energy / U.S. Navy - U.S. federal agencies are deploying AI and LLMs into production, addressing challenges like budget constraints and security. The Department of Energy’s Energy GPT project uses open models in a controlled environment, while the U.S. Navy’s Project AMMO showcases MLOps success, reducing model retraining time for undersea vehicles from six months to one week.
Deutsche Bank - Deutsche Bank and other major banks are implementing generative AI for document processing, research, and risk modeling, using a service-oriented architecture and Google Cloud’s Doc AI to handle large volumes of unstructured data, automate research, and improve risk assessments, while prioritizing regulatory compliance and responsible AI practices. They are focusing on internal efficiency gains, augmenting human capabilities, and managing risks like bias and hallucinations, with a dedicated Center of Excellence and employee upskilling programs.
Digits - Digits, a fintech company, implemented a production system using fine-tuned T5 models on Google Cloud’s Vertex AI to generate accounting-related questions for client interactions, leveraging TensorFlow Extended for a robust MLOps pipeline and addressing challenges like hallucinations and training-serving skew with a multi-layered validation system and in-house fine-tuning. This system streamlines communication, maintains professional standards, and scales using Google Cloud infrastructure.
Discord - Discord’s deployment of Clyde AI, a chatbot reaching over 200 million users, prioritized safety and evaluation, treating evals as unit tests integrated into their development workflow. They developed PromptFu, an open-source CLI tool for simple, fast, and deterministic evaluations, and used a practical approach to maintaining a casual chat personality.
Discord - Discord’s case study outlines their systematic approach to developing and deploying LLM-powered features, emphasizing rapid prototyping with commercial LLMs, followed by rigorous prompt engineering and AI-assisted evaluation, and scaling through hosted or self-hosted solutions. Their framework focuses on balancing rapid development with robust production deployment, while maintaining focus on user experience, safety, and cost efficiency.
Discover Financial Services - Multiple major banks, including Discover Financial Services, have implemented generative AI solutions, achieving a 70% reduction in agent search time by using RAG and summarization techniques on procedure documentation, leveraging Google Cloud’s Vertex AI and Gemini models. These implementations emphasized robust data governance, security, and compliance, with human-in-the-loop validation involving technical writers and expert agents to ensure accuracy and regulatory adherence.
Doctolib - Doctolib built Alfred, an agentic AI system using LangGraph, to handle customer support requests, employing multiple specialized LLM-powered agents in a directed graph architecture and integrating their existing RAG engine, with a focus on security using JWT authentication and human-in-the-loop confirmation for critical actions. The system manages approximately 17,000 messages daily, with an initial use case focused on calendar access management, demonstrating a robust approach to LLM limitations, security, and production scaling.
Docugami / Reet - Docugami built a document processing system using custom XML knowledge graphs, structural chunking, and Apache Spark, deploying models on Kubernetes with Nvidia Triton and Redis for vector storage, while Reet developed Lucy, a real estate agent co-pilot, transitioning to OpenAI function calling for better performance and focusing on robust testing and CI/CD. Both companies emphasized comprehensive testing, monitoring, and continuous improvement, highlighting the need for adaptable systems in the rapidly evolving LLMOps space, balancing automation with human oversight.
Doordash - Doordash integrated LLMs into their search system, using a hybrid approach with knowledge graphs and RAG to improve understanding of complex food delivery queries, resulting in a 30% increase in popular dish carousel trigger rates and a 2% improvement in whole page relevance. This implementation demonstrates a practical approach to leveraging LLMs in production while maintaining accuracy and performance.
Doordash - Doordash uses LLMs to build a product knowledge graph and enhance search across their expanding e-commerce platform, employing techniques like LLM-assisted annotation, RAG for training data generation, and model optimization for production deployment. Their system also uses LLMs for multi-intent understanding in search queries, while implementing guardrails to prevent over-personalization, and includes distributed computing and low-latency pipelines.
Doordash - Doordash built a RAG-based support system for delivery contractors, emphasizing quality control with a two-tiered LLM Guardrail that reduced hallucinations by 90% and compliance issues by 99%, alongside an LLM Judge for continuous monitoring and improvement. Their system handles thousands of daily requests, using a regression prevention strategy and strategically defaulting to human agents when latency becomes an issue.
DoorDash - DoorDash is strategically implementing Generative AI across its platform, focusing on areas like customer assistance with automated cart building, interactive discovery through advanced recommendation engines, and personalized content generation, while also improving internal operations with automated SQL generation and data extraction. The framework emphasizes data privacy and security, alongside model training and continuous monitoring, aiming for enhanced customer experience and operational efficiency.
DoorDash - DoorDash utilizes LLMs to automate and enhance their retail catalog management by extracting product attributes from unstructured SKU data, implementing a brand extraction pipeline with a hierarchical knowledge graph, organic product labeling using a waterfall approach, and generalized attribute extraction using RAG for entity resolution, leading to improved scalability, product discovery, and personalization.
DoorDash - DoorDash implemented an LLM-based support system for their delivery contractors, using RAG to access a knowledge base and employing a multi-layered quality control approach with an LLM Guardrail and LLM Judge. This system handles thousands of daily requests, achieving a 90% reduction in hallucinations and a 99% reduction in compliance issues.
Doordash - Doordash’s ML Platform team details their approach to building an enterprise LLMOps stack, tackling challenges like inference optimization and cost management, and implementing key components such as gateway services, RAG, and fine-tuning infrastructure. They also share insights from LinkedIn and Uber’s LLMOps strategies, focusing on modular design, automation, and robust evaluation frameworks.
DoorDash - DoorDash implemented a generative AI contact center solution using Amazon Bedrock and Anthropic’s Claude models, leveraging RAG with Knowledge Bases for accurate responses and achieving a 2.5 second response latency, resulting in a 50% reduction in development time and a significant decrease in live agent escalations. The system, integrated with Amazon Connect and Amazon Lex, handles hundreds of thousands of daily support calls, demonstrating the effectiveness of LLMs in high-volume production environments.
Dropbox - Dropbox’s security team discovered a novel prompt injection technique using control characters like backspace and carriage return to bypass system instructions in OpenAI’s GPT models, effectively making them forget context and instructions. This research highlights the need for robust input sanitization and validation when deploying LLMs in production environments.
Dropbox - Dropbox built an AI-powered file understanding system for web previews, leveraging their Riviera framework to handle 2.5 billion daily requests, enabling summarization and Q&A across various file types with significant cost and latency improvements using k-means clustering and similarity-based chunk selection.
Dropbox - Dropbox is transforming into an AI-powered universal search and organization platform, using LLMs to enhance its Dash product with features like semantic search across enterprise content. Their approach combines open-source LLMs, custom inference stacks, and hybrid architectures to deliver AI-driven search and organization features while maintaining strict data privacy and security for over 700 million users.
Dropbox / OpenAI - Dropbox’s security research revealed that repeated token sequences, both single and multi-token, could bypass security guardrails in OpenAI’s GPT-3.5 and GPT-4 models, leading to model divergence and the extraction of training data; this prompted OpenAI to implement improved filtering and timeouts to mitigate these vulnerabilities.
Duolingo - Duolingo integrated GitHub Copilot, along with Codespaces and custom API integrations, to improve developer efficiency and code consistency across their growing codebase, resulting in a 25% speed increase for developers new to a repository and a 10% increase for experienced developers. This implementation also streamlined workflows, reduced context switching, and helped maintain consistent standards across their projects.
Duolingo - Duolingo leverages a custom-trained LLM and a structured prompting system to accelerate language lesson creation, enabling rapid content generation while maintaining educational quality through human oversight. This system allows for automated parameter handling and multi-stage review pipelines, resulting in faster course development and expansion into new features.
Duolingo / Brainly / SoloLearn - Duolingo, Brainly, and SoloLearn have integrated LLMs into their platforms for language learning, homework help, and coding education, respectively, focusing on challenges like fact accuracy, cost management, and content personalization; they’ve found success using LLMs for synthesis, augmenting prompts, pre-generating content, and prioritizing teaching effectiveness. These companies emphasize the importance of controlled rollouts, quality control, and monitoring model outputs to achieve real learning outcomes.
Dust.tt - Dust.tt transitioned from a developer tool to a horizontal enterprise platform for AI agent deployment, achieving high daily active user rates by prioritizing a robust infrastructure with custom integrations and function calling capabilities. Their approach emphasizes real-world usage metrics and a tech stack including Next.js, Rust, and Temporal, while abstracting technical complexities to make agent creation accessible to non-technical users.
DXC Technology - DXC Technology developed an LLM-powered AI assistant for oil and gas data exploration, significantly reducing analysis time by routing queries to specialized tools optimized for different data types. Leveraging Anthropic’s Claude models on Amazon Bedrock, the solution incorporates conversational capabilities and semantic search, enabling users to efficiently analyze complex datasets and accelerate the time to first oil.
Dynamo - Dynamo, focused on secure AI solutions, developed an 8B parameter multilingual LLM using Databricks, achieving a 20% training speed improvement and completing training in 10 days. The model includes built-in security, compliance, and multilingual support for enterprise applications like customer support and fraud detection.
eBay - eBay enhanced developer productivity using a three-pronged approach: integrating GitHub Copilot, developing a custom LLM (eBayCoder) based on Code Llama 13B, and deploying an internal knowledge base GPT using RAG, resulting in improved code acceptance, software maintenance, and access to internal documentation.
eBay - eBay built a hybrid system using transformer models to provide sellers with price recommendations and similar item suggestions, particularly for sports trading cards. The system combines semantic similarity and direct price prediction, generating embeddings that balance price accuracy with item relevance using a multi-task learning framework and hard negative examples.
Echo AI / Log10 - Echo AI partnered with Log10 to implement automated LLM evaluation for their customer support analytics platform, achieving a 20-point F1 score improvement by using techniques like few-shot learning and fine-tuning to ensure high accuracy and reliability across various customer use cases. This system provides real-time monitoring, human override capabilities, and detailed visibility into model performance, while also supporting multiple LLM providers and open-source models.
Echo.ai / Log10 - Echo.ai, a SaaS platform for customer conversation analysis, partnered with Log10 to improve LLM accuracy and evaluation in production. By leveraging Log10’s automated feedback and tuning infrastructure, Echo.ai achieved a 20-point F1 score increase and a 44% reduction in feedback prediction error, enabling successful deployment of large enterprise contracts.
Edmunds - Edmunds automated their dealer review moderation using a GenAI solution powered by GPT-4 and Databricks Model Serving, reducing processing time from 72 hours to minutes. This implementation, which included custom prompt engineering and Databricks Unity Catalog for data governance, significantly decreased moderation team size and improved decision consistency for over 300 daily reviews.
Elastic - Elastic leveraged LangChain and LangGraph to develop three production-ready security features: Automatic Import, Attack Discovery, and Elastic AI Assistant, streamlining security operations with RAG and controllable agents for ES|QL query generation and data integration automation. The system, which includes LangSmith for debugging and performance monitoring, is currently serving over 350 users in production.
ElevenLabs - ElevenLabs leverages Google Kubernetes Engine (GKE) with NVIDIA GPUs, including H100s, to power its voice AI platform, achieving a 600:1 ratio of generated to real-time audio across 29 languages. They employ NVIDIA’s AI Enterprise software stack, including NeMo for model customization and NIM for inference optimization, alongside GKE Autopilot for managed deployment.
Ellipsis - This case study explores 15 months of building and operating LLM agents in production, detailing the implementation of custom caching, CI evaluation pipelines, and observability stacks, while also addressing challenges in prompt engineering and cost optimization. The study emphasizes the need for custom solutions and manual inspection, highlighting the limitations of current tools and frameworks when building reliable LLM-based systems.
Emergent Methods - Emergent Methods has deployed a production-scale RAG system that processes over 1 million news articles daily, utilizing a microservices architecture for real-time analysis and context engineering, combining tools like Quadrant for vector search, VLM for GPU optimization, and their own Flow.app for orchestration to ensure low latency and high availability while addressing challenges like news freshness and multilingual support.
Enlightened Airlines - Enlightened Airlines implemented a real-time data streaming architecture using Kafka and Flink to power their AI customer support, replacing a batch-oriented system and enabling their AI agents to access up-to-date customer information across all channels. This resulted in improved response accuracy, reduced hallucination incidents, and faster query resolution, leading to enhanced customer satisfaction and decreased operational overhead.
ESGpedia - ESGpedia, an ESG data platform in Asia-Pacific, consolidated 300 data pipelines into a Databricks lakehouse and implemented a custom RAG solution using Mosaic AI, achieving a 4x cost reduction in data pipeline management and enabling context-aware ESG data analysis. This allowed them to deliver granular, tailored sustainability insights to clients across the region.
Faber Labs - Faber Labs’ Gora system employs Goal-Oriented Retrieval Agents to optimize subjective relevance ranking, achieving over 200% improvements in key metrics like conversion rates and average order value, while maintaining sub-second latency using a high-performance Rust backend and real-time user feedback processing. This system demonstrates effectiveness across e-commerce and healthcare sectors, showcasing the power of unified goal optimization and privacy-preserving learning.
Facebook AI Research / Unusual Ventures / Digits / Bountiful - A panel of experts from Facebook AI Research, Unusual Ventures, Digits, and Bountiful discussed the practical challenges of deploying LLMs in production, focusing on managing latency, optimizing costs, and building trust. Digits shared their experience processing 100 million daily financial transactions using LLMs, highlighting model optimization and safety measures, while the panel also explored API vs self-hosted trade-offs and strategies for mitigating hallucinations.
Factory - Factory, an enterprise AI company, implemented a self-hosted LangSmith instance to improve observability and feedback within their SDLC automation platform, specifically for their Code Droid system. By integrating LangSmith with AWS CloudWatch and using its Feedback API, they achieved end-to-end LLM pipeline monitoring, automated feedback collection, and streamlined prompt optimization, resulting in a 2x improvement in iteration speed, a 20% reduction in open-to-merge time, and a 3x reduction in code churn.
Factory.ai - Factory.ai’s Code Droid system uses a multi-model LLM approach, combining models from Anthropic and OpenAI, to automate software development tasks, incorporating HyperCode for codebase understanding and ByteRank for information retrieval, achieving 19.27% on SWE-bench Full and 31.67% on SWE-bench Lite while prioritizing safety and compliance.
Factory.ai - Factory.ai is building an AI platform for software engineering automation, focusing on reliable agentic systems. They use techniques like context propagation, consensus mechanisms, and careful tool design to address planning, decision-making, and environmental grounding challenges, emphasizing modularity and human oversight.
FactSet - FactSet, a financial data and analytics provider, implemented a standardized LLMOps framework using Databricks Mosaic AI and MLflow to address challenges with fragmented GenAI development, resulting in a 70% reduction in latency for code generation and a 60% reduction in end-to-end latency for text-to-formula generation, while also enabling cost-effective use of fine-tuned open-source models. This framework enabled unified governance, streamlined model development, and improved deployment capabilities, fostering a culture of collaboration and innovation.
Faire - Faire, a global wholesale marketplace, improved its search relevance evaluation by implementing a fine-tuned Llama model, achieving a 28% improvement in prediction accuracy compared to their previous GPT model, and scaling to 70 million predictions per day using 16 GPUs with a self-hosted solution. This transition from manual labeling to an automated LLM-based system enabled near real-time feedback on search algorithm performance and supports various applications like offline retrieval analysis and ranker optimization.
Farfetch - Farfetch implemented iFetch, a multimodal conversational AI system, to enhance product discovery on their fashion marketplace, using semantic search and vector databases to handle nuanced language. They extended CLIP with fashion-specific taxonomic information and relaxed contrastive loss for improved image-based search, focusing on practical improvements like handling brand-specific queries and maintaining conversational context.
Farfetch - Farfetch utilizes Vespa as a vector database to power a scalable recommender system, delivering personalized recommendations across multiple retailers with sub-100ms latency by employing matrix decomposition on user-product interactions and features, generating user and item embeddings. The system cleverly handles sparse data through a custom storage schema and optimized dot-product operations, along with storing the dense user embeddings matrix in a single document.
FeedYou - FeedYou’s FeedBot Designer employs a hierarchical intent recognition system using NLP.js, with dedicated models per intent for improved performance and maintainability, achieving a 72% local intent matching rate and handling 72% of queries without human intervention by focusing on simple, well-tuned models and robust error handling. The platform prioritizes rapid model training, automated conflict detection, and real-time validation, demonstrating a practical approach to production-grade chatbot deployment.
Fiddler AI - Fiddler AI developed a documentation chatbot using GPT-3.5 and RAG, leveraging LangChain for its LLM pipeline. The project highlights practical LLMOps, addressing challenges like query processing, document chunking, and hallucination reduction through continuous monitoring, user feedback, and iterative improvements to the knowledge base.
First Orion - First Orion, a telecommunications software company, implemented Amazon Q to unify access to their siloed cloud operations data, enabling engineers to use natural language queries across sources like S3, Confluence, and ServiceNow. This solution not only provides a unified access point but also automates ticket creation and management, streamlining their cloud operations workflow.
FiscalNote - FiscalNote, a legal and regulatory intelligence company, streamlined their ML model deployment process by implementing Databricks’ MLflow and Model Serving, increasing deployment frequency by 3x. This new MLOps pipeline automated infrastructure management, enabled seamless model updates, and improved data asset discoverability, ultimately enhancing their ability to deliver timely legislative insights.
Fuzzy Labs - Fuzzy Labs developed a self-hosted LLM system using Mistral-7B to improve developer documentation for a tech company, employing vLLM for inference optimization and Ray Serve for horizontal scaling to achieve sub-second response times and efficient GPU usage. Through systematic load testing with Locust, they reduced response times from 11 seconds to 3 seconds, enabling the system to handle concurrent users effectively.
Gantry / Structured.ie / NVIDIA - A panel of experts from Gantry, Structured.ie, and NVIDIA discussed the shift in LLM deployment, highlighting the need for robust evaluation frameworks that combine automated metrics with human feedback, and emphasizing the importance of continuous monitoring and domain-specific benchmarks. They also addressed the need for better tooling and safety measures in production environments, focusing on user outcomes over model-centric metrics.
Gerdau - Gerdau, a steel manufacturer, implemented an LLM-powered upskilling assistant for employees after migrating to the Databricks Data Intelligence Platform to address data infrastructure challenges, resulting in a 40% cost reduction in data processing and the onboarding of 300 new data users. This strategic move showcases a measured approach to LLM adoption, prioritizing a robust data foundation and platform integration for future AI initiatives.
GitHub - GitHub’s development of Copilot began with experimenting with GPT-3 for code generation, evolving from a basic chatbot concept to an interactive IDE integration. Through iterative model improvements, prompt engineering, and fine-tuning, they progressed from a Python-only model to a multilingual one (Codex), implementing context-aware completions and targeted training.
GitHub - GitHub Copilot, leveraging OpenAI’s Codex model, enhanced its contextual understanding through advanced prompt engineering, including a Fill-in-the-Middle (FIM) paradigm and a neighboring tabs feature, resulting in a 10% and 5% increase in completion and suggestion acceptance rates respectively, and leading to a 55% increase in coding speed for developers. These improvements, combined with sophisticated caching and retrieval systems, maintain low latency while providing more relevant code suggestions.
GitHub - GitHub’s journey developing Copilot showcases the complexities of building and deploying an enterprise-grade LLM application, emphasizing rapid iteration, robust infrastructure, and a focus on user feedback to achieve a 55% increase in coding speed and a 74% reduction in developer frustration. Their approach involved transitioning from direct API usage to a scalable Azure infrastructure, implementing caching and quality pipelines, and prioritizing community engagement for a successful launch.
GitHub - Based on insights from GitHub’s ML experts, this case study provides a detailed architectural guide for deploying LLMs in production, covering key areas such as problem scoping, model selection, and customization, alongside essential architectural components like user input processing and enrichment pipelines. The guide also emphasizes implementation best practices for data management, security, performance optimization, and quality assurance, including vector database considerations, data filtering, caching strategies, and evaluation frameworks.
GitHub - GitHub’s development of Copilot exemplifies a structured approach to integrating LLMs into developer workflows, starting with early access to GPT-4 and resulting in features like Copilot for Pull Requests, Docs, and CLI. Through iterative development and user feedback, they emphasized key principles like predictability, tolerability, steerability, and verifiability, prioritizing user experience and workflow integration over perfect accuracy.
GitLab - GitLab integrated its AI-powered GitLab Duo suite into its own development workflows, using features like AI-assisted code suggestions, merge request summarization, and documentation generation. This internal implementation led to efficiency gains, improved code quality, and streamlined incident management, while also focusing on LLMOps best practices and measuring ROI.
GitLab - GitLab developed a Centralized Evaluation Framework (CEF) to rigorously test and validate LLMs powering their GitLab Duo AI features, using a library of thousands of prompts to evaluate model performance across numerous use cases. This framework employs a systematic approach, including establishing performance baselines, iterative development, and continuous validation using metrics like Cosine Similarity and LLM Judge, ensuring consistent quality and improvement.
Gitlab - Gitlab’s ModelOps team implemented a production-scale code completion tool using a combination of open-source and third-party LLMs, featuring a continuous evaluation pipeline that incorporates token-based analysis, historical code patterns, and developer feedback. Their system includes a dual-engine architecture for prompt management and a gateway, along with reinforcement learning to iteratively improve code completion accuracy and developer productivity.
Glean - Glean, an enterprise search company, employs a hybrid approach, combining traditional information retrieval with modern LLMs and embeddings, to deliver a comprehensive search solution. Their platform prioritizes rigorous ranking algorithm tuning, personalization, and cross-application integrations, rather than relying solely on AI, enabling them to serve major enterprises with features like feed recommendations and real-time updates.
GoDaddy - GoDaddy’s Digital Care team leverages LLMs to automate customer support across messaging channels, transitioning from monolithic prompts to task-specific ones using a Controller-Delegate pattern, and implementing RAG with Sparse Priming Representations, achieving a 1% failure rate in chat completions while navigating challenges like latency and complex memory management. They emphasize the importance of structured output validation, human oversight, and model switching capabilities, while also highlighting the need for robust guardrails and testing methodologies.
Golden State Warriors - The Golden State Warriors utilized Vertex AI on Google Cloud to create a personalized content recommendation system, delivering tailored content across their digital platforms. This system integrates diverse data sources to provide relevant recommendations for both sports and entertainment events at the Chase Center, supporting over 18,000 seats with a lean technical team.
Gong - Gong developed “Deal Me,” an LLM-powered question-answering feature that allows users to query extensive sales interaction data, providing rapid insights. A hybrid approach was implemented to optimize costs and improve quality, routing queries to either direct LLM-based QA or pre-computed insights based on the nature of the query.
Google - Google’s NotebookLM uses source grounding, allowing users to upload their own documents to create a personalized AI assistant powered by Gemini 1.5 Pro, complete with a citation system, transient context windows for privacy, and safety filters; it also features a sophisticated audio overview capability that generates human-like podcast-style conversations with natural speech patterns and dual AI personas. The platform prioritizes safety and responsibility through content monitoring, clear labeling of AI-generated content, and a privacy-preserving architecture, demonstrating best practices in LLMOps.
Google - Google’s Vertex AI team shares insights from deploying numerous LLM-powered agents, highlighting the need for comprehensive system design beyond just the models themselves, focusing on meta-prompting, multi-layered safety, and robust evaluation frameworks. They emphasize treating agent components as code, regular evaluation cycles, and addressing challenges like prompt injection and maintaining consistent quality.
Google - Google has implemented an LLM-powered system to automate security incident response, specifically focusing on generating incident summaries and executive communications. This resulted in a 51% reduction in time spent on incident summaries and a 53% reduction in time spent on executive communications, while maintaining or improving quality compared to human-written content, all while adhering to strict data protection measures.
Google / Scale Venture Partners - Barak Turovsky, a veteran of Google’s AI initiatives, proposes a framework for evaluating LLM production use cases based on accuracy, fluency, and risk, recommending creative and productivity tasks for initial deployment while cautioning against high-stakes applications. The framework emphasizes data management, system architecture, and risk mitigation, advocating for a phased approach and hybrid systems that combine LLMs with traditional methods.
Google Cloud / Symbol AI / Chain ML / Deloitte - A panel of AI leaders from Google Cloud, Symbol AI, Chain ML, and Deloitte discussed the practical challenges of scaling generative AI in the enterprise, covering model selection, infrastructure, and implementation strategies. The discussion emphasized a value-driven approach, the importance of production readiness assessments, and highlighted emerging trends like agent-based systems and domain specialization.
Grab - Grab’s Integrity Analytics team built an LLM-powered system using their internal Spellvault LLM and a custom Data-Arks middleware to automate data analysis and fraud investigations. This RAG-based solution, chosen for its cost-effectiveness and scalability, reduced report generation time by 3-4 hours per report and streamlined fraud investigations to minutes.
Grab - Grab addressed data discovery issues across their extensive data infrastructure by creating HubbleIQ, an LLM-powered platform that improved search with Elasticsearch optimizations and automated documentation generation using GPT-4, increasing coverage from 20% to 90% for frequently accessed tables and reducing data discovery time. They also integrated a chatbot using Glean, resulting in a 17 percentage point increase in user satisfaction.
Grab - Grab utilized LangChain and LangSmith to enhance their Metasense V2 data governance system, which employs LLMs for automated data classification and metadata generation, resulting in improved accuracy and reduced manual review. By optimizing prompts, splitting complex tasks, and implementing robust monitoring, they streamlined team collaboration and now process thousands of data entries daily, demonstrating successful LLMOps best practices in production.
Grab - Grab implemented a hybrid search approach combining vector similarity search with LLM-based reranking, using FAISS and OpenAI embeddings for initial retrieval and GPT-4 for reranking, which improved performance on complex queries with constraints and negations compared to vector search alone. This two-stage process demonstrated the benefits of combining traditional and advanced techniques for enhanced search relevance.
Grab - Grab developed an LLM-powered data classification system, using GPT-3.5 via an orchestration service called Gemini, to automate metadata generation across their PetaByte-scale data infrastructure, replacing manual tagging of sensitive data. The system classifies database columns and generates metadata tags, processing over 20,000 data entities within a month of deployment, achieving 80% user satisfaction and significantly reducing manual effort in data governance.
Gradient Labs - Gradient Labs developed a production-ready AI customer support agent, tackling challenges beyond simple LLM prototypes by using a state machine architecture with a durable execution engine to manage complex state, race conditions, and knowledge integration. The system successfully manages hundreds of daily conversations, demonstrating the need for robust engineering practices when deploying AI agents in production.
Grainger - Grainger, a major MRO distributor, implemented an enterprise-scale RAG system using Databricks to enhance product discovery across their 2.5 million item catalog, leveraging Databricks Vector Search to manage product embeddings and handle 400,000 daily updates, ensuring low-latency, real-time search and improved customer service. The system uses a flexible model-serving strategy, allowing for experimentation with different LLMs, and integrates contextual information to improve search accuracy for diverse user queries.
Grammarly - Grammarly’s CoEdIT showcases the effectiveness of specialized LLMs for text editing, achieving state-of-the-art results with models up to 60x smaller than GPT-3-Edit, through targeted instruction tuning on a curated dataset of non-meaning-changing edits, and offering an open-source implementation for community adoption.
Grammarly - Grammarly developed a novel system for detecting delicate text, going beyond standard toxicity detection to identify emotionally charged or potentially triggering content. They created DeTexD, a benchmark dataset and a RoBERTa-based classification model achieving a 79.3% F1 score, outperforming existing toxic text detection methods in this domain.
Great Ormond Street Hospital NHS Trust - Great Ormond Street Hospital utilized a hybrid approach, combining smaller LLMs for entity extraction with few-shot learning for tabular data, to process 15,000 unstructured pediatric cardiac MRI reports, successfully extracting patient identifiers and clinical measurements while adhering to NHS security constraints. The project demonstrated the effectiveness of prompt engineering with models like FLAN-T5 and RoBERTa, and the viability of using smaller LLMs in production healthcare settings.
Greptile - Greptile improved their AI code review bot by using vector embeddings to filter out low-value comments, increasing the rate of addressed feedback from 19% to over 55% after initial attempts with prompt engineering and LLM-based severity ratings failed. This highlights the effectiveness of combining LLMs with traditional ML techniques and the importance of user feedback in production LLM systems.
Guaros - A discussion between Patrick Barker, CTO of Guaros, and Farud, an ML engineer, explores the nature of LLMOps, with Patrick arguing it’s a distinct field due to unique tooling and user needs, while Farud views it as an extension of MLOps, highlighting the need for practitioners to balance traditional MLOps skills with LLM-specific knowledge. The debate covers data pipeline similarities, tool development approaches, environmental concerns, and future trends, emphasizing the importance of practical implementation over hype.
Hapag-Lloyd - Hapag-Lloyd streamlined their corporate audits by implementing a GenAI solution using Databricks’ DBRX model, fine-tuned on 12T tokens of audit data. This resulted in a 66% reduction in time spent creating new findings and a 77% reduction in executive summary review time, showcasing the impact of LLMs on real-world business processes.
Harvard Business School - Harvard Business School built ChatLTV, an AI teaching assistant for their Launching Tech Ventures course, using Azure OpenAI, Pinecone, and Langchain. This RAG-based system, trained on a 15 million-word course corpus, served 250 MBA students via Slack, handling over 3000 queries and improving class preparation.
Hearst / Gannett / The Globe and Mail / E24 - A collaborative project between several news organizations developed Real Estate Alerter, an AI-powered system that uses anomaly detection and LLMs to identify newsworthy real estate transactions, incorporating a human feedback loop to improve accuracy. The system, which includes a celebrity detection feature and a Slack bot for alerts, demonstrates the practical application of GenAI in automating news discovery, while highlighting the importance of human oversight.
Heidelberg University - Heidelberg University’s Department of Radiology and Nuclear Medicine automated radiology report generation using Vision Transformers and a fine-tuned Llama 3 model, achieving a training loss of 0.72 and a validation loss of 1.36 while optimizing for a single GPU using techniques like 4-bit quantization and LoRA. This demonstrates the practical application of LLMs in healthcare, emphasizing efficient resource utilization and the importance of human oversight.
HeyRevia - HeyRevia’s AI-powered call center solution leverages a multi-layered architecture with real-time audio processing, context-aware decision-making, and goal-oriented planning to handle complex healthcare tasks like insurance verification and claims processing, achieving improved efficiency and success rates while maintaining strict compliance. The system prioritizes performance with sub-500ms latency, manages multiple concurrent calls, and ensures compliance through self-hosted LLMs, SOC 2, and HIPAA adherence.
Holiday Extras - Holiday Extras, a European travel extras provider, successfully implemented ChatGPT Enterprise across their organization, achieving significant productivity gains and cultural transformation by leveraging AI for multilingual content creation, data analysis, engineering support, and customer service, resulting in 500+ hours saved weekly, $500k annual savings, and a 95% weekly adoption rate. This enterprise-wide rollout improved their NPS from 60% to 70% and fostered a more data-driven culture, empowering both technical and non-technical staff.
Honeycomb - Honeycomb built a natural language query interface for their observability platform using GPT-3.5, enabling users to translate natural language into structured queries with a 94% success rate. This feature significantly improved user engagement, with teams using the query assistant being 2-3x more likely to create complex queries and save them to boards.
Honeycomb - Honeycomb, an observability company, implemented a natural language querying interface and used comprehensive observability, including distributed tracing with OpenTelemetry, to address post-launch challenges. This enabled them to monitor the entire user experience, isolate issues, and establish a continuous improvement cycle, resulting in improved product retention and conversion rates.
Honeycomb - Honeycomb’s development of Query Assistant, a natural language to query interface, revealed the complexities of production LLM features, including managing large schemas, optimizing for latency, and navigating prompt engineering. Their approach focused on treating LLMs as feature engines, emphasizing non-destructive design, robust validation, and security, while also addressing legal and compliance requirements.
Honeycomb - Honeycomb built an LLM-powered Query Assistant using GPT-3.5-turbo and text embeddings to simplify querying on their observability platform, resulting in high adoption rates among enterprise and self-serve users and a significant increase in manual query retention and complex query creation. The cost-effective implementation, averaging $30/month in OpenAI costs, also demonstrated the assistant’s ability to handle unexpected inputs like DSL expressions and trace IDs, validating their “ship to learn” approach to AI integration.
Hotelplan Suisse - Hotelplan Suisse collaborated with Datatonic to create a generative AI-powered knowledge-sharing system for their travel experts, integrating over 10 data sources and using semantic search to provide rapid travel recommendations, reducing response times from hours to minutes, and includes features like chat history management, automated testing, and CI/CD pipelines. The system also prioritizes safety with guardrails and extensive logging in Google Cloud.
HP - HP’s data engineering team, burdened by support requests, deployed a RAG-based chatbot using Databricks Mosaic AI, which reduced operational costs by 20-30% and significantly decreased manual support requests by providing a self-service knowledge base for data models, platform features and access requests. Built in just three weeks, the system uses a vector database and web crawler to ingest internal documentation, showcasing the efficiency of LLMs for internal knowledge management.
Human Loop / Find.xyz - Human Loop, a developer platform, shares insights from deploying LLMs in production, emphasizing objective evaluation, prompt management, and strategic optimization, including fine-tuning, while cautioning against premature optimization with complex agents. The case study of Find.xyz demonstrates the effectiveness of fine-tuning open-source models for niche applications, highlighting the need for specialized tooling and practices tailored to the unique demands of LLM applications.
Humanloop / Duolingo / Gusto - Humanloop has built a comprehensive LLMOps platform that provides engineers with tools for prompt engineering, version control, and evaluation, addressing the challenges of managing prompts as code. The platform also includes feedback collection and production monitoring capabilities, enabling continuous improvement of LLM performance, and is used by companies like Duolingo and Gusto to manage their LLM applications at scale.
HumanLoop / Filevine / GitHub / Duolingo / Ironclad - This study examines successful LLMOps implementations across multiple companies, emphasizing the importance of domain experts in prompt engineering and the need for robust evaluation frameworks, including iterative prototyping and user feedback integration. It highlights the necessity of tooling that enables collaboration, comprehensive logging, and debugging, showcasing examples like Ironclad’s use of Rivet to achieve a 50% contract auto-negotiation rate.
HumanLoop / Jingo - HumanLoop, a developer tools platform, shares best practices for deploying LLMs in production, emphasizing systematic evaluation, prompt management with versioning, and fine-tuning for performance and cost optimization; they use GitHub Copilot as a case study of successful large-scale LLM deployment.
IBM - This case study analyzes the progression of MLOps maturity in enterprises, from manual to fully automated systems, detailing the challenges faced by data scientists, ML engineers, and DevOps teams. It highlights the unique considerations for LLM deployments, including infrastructure, security, and evaluation, while providing best practices for data management, model deployment, and team collaboration.
ICE / NYSE - ICE/NYSE built a production text-to-SQL system using structured RAG on Databricks’ Mosaic AI stack, enabling business users to query data with natural language. The system features a robust evaluation framework with syntactic and execution matching, achieving 77% and 96% accuracy respectively, and incorporates a continuous improvement pipeline using feedback loops and few-shot learning.
incident.io - incident.io implemented an AI-powered incident summary generator using OpenAI models, focusing on prompt engineering, testing, and phased rollouts. The system integrates with Slack to enrich incident data, processes updates and metadata, and generates structured summaries with a 63% direct acceptance rate and a further 26% edited before use.
IncludedHealth - IncludedHealth developed Wordsmith, a comprehensive LLM platform for healthcare applications, featuring a proxy service for multi-provider access, model serving with MLServer and HuggingFace, and robust infrastructure for training and evaluation. This platform enabled production applications like automated documentation, coverage checking, and clinical scribing, all while adhering to strict security and compliance requirements in a regulated healthcare environment.
Instacart - Instacart implemented LLMs to enhance search and product discovery, moving beyond exact matches by generating complementary and substitute product recommendations. They used a two-pronged approach, combining carefully crafted prompts with domain-specific knowledge, and built a sophisticated pipeline with offline generation, post-processing, and a novel “LLM as Judge” evaluation system.
Instacart - Instacart integrated LLMs into their search architecture to improve query understanding, product attribute extraction, and complex intent handling across their grocery e-commerce platform, addressing challenges with tail queries and enabling personalized merchandising. Their implementation combines offline and online LLM processing, focusing on cost optimization and robust evaluation to enhance search relevance and enable new product discovery capabilities.
Instacart - Instacart’s Ava, an internal AI assistant built on GPT-4 and GPT-3.5, has become a key productivity tool, boasting over 50% monthly employee adoption and 900+ weekly active users, with features like a web interface, Slack integration, and a prompt exchange system, enhancing workflows across engineering and other departments.
Instacart - Instacart employs a range of advanced prompt engineering techniques, including Chain of Thought, ReAct, and novel methods like “Room for Thought” and Monte Carlo sampling, to optimize their production LLM applications such as internal tools and search features, focusing on improving output reliability and managing token usage. These techniques, primarily implemented with GPT-4, demonstrate practical strategies for enhancing LLM performance in real-world environments.
InsuranceDekho - InsuranceDekho implemented a RAG-based chat assistant using Amazon Bedrock and Anthropic’s Claude Haiku to streamline insurance agent support, leveraging OpenSearch for vector storage and Redis for caching, resulting in an 80% reduction in query response times. This system significantly reduced reliance on subject matter experts and improved customer service efficiency.
IntellectAI - IntellectAI’s Purple Fabric platform uses MongoDB Atlas and Vector Search to automate ESG compliance analysis, scaling from 150 to over 8,000 companies. This RAG-based system processes 10 million documents across 30+ formats, achieving over 90% accuracy in compliance assessments, demonstrating a significant speed improvement over manual analysis.
Interact.ai / Amberflow / Google / Databricks / Inflection AI - A panel discussion featuring AI leaders explored the complexities of deploying LLMs in production, covering model selection, cost optimization, and ethical considerations. The discussion highlighted practical experiences from companies like Interact.ai’s healthcare deployment, Inflection AI’s emotionally intelligent models, and insights from Google and Databricks on responsible AI deployment and tooling.
Interweb Alchemy - Interweb Alchemy built an interactive chess tutoring system that combines LLMs like GPT-4-mini for move generation with Stockfish for position evaluation, using chess.js for legal move validation, providing real-time feedback and analysis to enhance the learning experience. The project showcases practical LLMOps techniques such as iterative model selection, prompt engineering, and the integration of multiple AI components.
Jabil - Jabil, a global manufacturing giant, implemented Amazon Q to transform its manufacturing and supply chain operations, deploying GenAI solutions for shop floor assistance, procurement intelligence, and supply chain management, resulting in reduced downtime and improved efficiency. The company established robust governance through AI and GenAI councils, focusing on practical use cases and clear ROI.
JOBifAI - JOBifAI, a game using LLMs for interactive gameplay, struggled with inconsistent safety filter behavior, requiring a three-retry mechanism to achieve a 99% success rate, but highlighted the need for more granular error reporting and transparent safety filter implementations to improve reliability and cost-effectiveness. The case study underscores the challenges of deploying LLMs in production, particularly regarding safety filters, and calls for better error handling and cost management strategies.
John Snow Labs - John Snow Labs has developed an enterprise-scale healthcare LLM system, deployed via Kubernetes within customer infrastructure, that processes multi-modal patient data using specialized medical LLMs for information extraction and unified knowledge graphs, enabling natural language querying and outperforming general-purpose LLMs in medical tasks. The system emphasizes data privacy, explainability, and integration with existing healthcare IT infrastructure, demonstrating best practices in LLMOps for large-scale deployments.
John Snow Labs - John Snow Labs developed a healthcare analytics platform using specialized medical LLMs to process diverse patient data, including unstructured text and images, enabling natural language queries for patient history analysis and cohort building. Deployed within the customer’s infrastructure using Kubernetes, the system prioritizes security and scalability, and it significantly outperforms general-purpose LLMs like GPT-4 in medical tasks, while maintaining consistency and explainability.
John Snow Labs - John Snow Labs utilizes multiple specialized LLMs to integrate diverse healthcare data, including structured EHR data, unstructured text, and semi-structured FHIR resources, addressing the challenge of fragmented medical information by using LLMs for information extraction, semantic modeling, data deduplication, and natural language query processing, all while maintaining security, scalability, and compliance. This system improves patient data analysis, clinical decision support, and reduces manual data integration efforts.
John Snow Labs - John Snow Labs has developed a medical literature review system using domain-specific LLMs to automate the traditionally time-consuming process of analyzing medical research, combining proprietary LLMs with a comprehensive knowledge base to enable rapid analysis of hundreds of papers, with features like custom knowledge base integration, intelligent data extraction, and automated filtering. The system offers both SaaS and on-premise deployment options, with enterprise-grade features like security, scalability, and API integration.
Johns Hopkins Applied Physics Laboratory - Johns Hopkins APL is developing CPG-AI, a battlefield medical assistant that uses LLMs to guide untrained soldiers through medical procedures by translating clinical guidelines into conversational guidance. Built using APL’s RALF framework, the system demonstrates capabilities in condition inference, natural language Q&A, and step-by-step care guidance, focusing on common battlefield injuries.
Kantar Worldpanel - Kantar Worldpanel modernized their product description matching system by using LLMs to generate training data, achieving 94% accuracy with GPT-4 and then fine-tuning smaller models for production using Databricks Mosaic AI and MLflow, automating a previously manual process. This approach allowed them to balance cost and performance while freeing up resources for more complex tasks.
Kapa.ai / Docker / CircleCI / Reddit / Monday.com - Kapa.ai’s case study, drawing from over 100 RAG implementations at companies like Docker and Reddit, outlines best practices for production RAG systems, covering data management, refresh pipelines, and evaluation frameworks, while also addressing security and performance optimization. The study emphasizes the challenges of moving beyond proof-of-concept and provides concrete guidance for successful production deployments.
Kentauros AI - Kentauros AI is building production-grade AI agents, addressing challenges in reasoning, tool use, and real-world deployment through an iterative agent architecture evolution from G2 to G5, focusing on memory systems, skill management, and multiple model integration. Their approach emphasizes practical solutions, iterative testing, and resource optimization, with future directions including enhanced reinforcement learning and more sophisticated memory systems.
Klarity - Klarity, a document processing automation company, successfully transitioned to generative AI, processing over half a million documents for B2B SaaS customers in finance and accounting, and developed a robust evaluation framework to address the challenges of non-deterministic performance, rapid development cycles, and the limitations of standard benchmarks, using techniques like staged evaluation, customer-specific metrics, and synthetic data generation.
Klarna - Klarna deployed an OpenAI-powered AI assistant for customer service, handling 2.3 million conversations across 23 markets and 35+ languages, reducing resolution times from 11 minutes to under 2 minutes and decreasing repeat inquiries by 25%. This system, integrated into the Klarna app, achieved customer satisfaction scores comparable to human agents and is projected to deliver a $40 million profit improvement in 2024.
Komodo Health - Komodo Health’s MapAI uses multiple LLMs and a LangChain/LangGraph framework to provide an NLP interface for their MapLab platform, enabling non-technical users to perform complex healthcare data analysis, reducing weeks-long processes to instant insights. This system, built with an API-first architecture, integrates with their Healthcare Map data source and maintains HIPAA compliance, while supporting various skill levels through different interfaces.
LangGraph / Waii - This case study showcases the creation of production-ready conversational analytics applications by combining LangGraph’s multi-agent framework with Waii’s text-to-SQL capabilities, enabling natural language querying of complex databases through sophisticated join handling and agentic workflows. The solution demonstrates how to achieve accurate and scalable interactions with intricate data structures.
leboncoin - leboncoin, France’s leading second-hand marketplace, implemented an LLM-powered search re-ranking system using a bi-encoder architecture with pre-computed ad embeddings to improve search relevance across their 60 million listings, achieving up to 5% improvement in click and contact rates and 10% improvement in user experience KPIs while maintaining strict latency requirements. The system leverages a distilled BERT model and a two-phase deployment strategy to handle high throughput and low latency demands.
Lemonade - Lemonade, a technology-driven insurance company, utilizes RAG pipelines for its chat-based customer interactions, addressing challenges like missing content, retrieval issues, and response generation through data cleaning, prompt engineering, and advanced retrieval strategies. Their experience highlights the importance of systematic troubleshooting, data quality, and continuous optimization for RAG pipelines in production.
Lime - Lime, a global micromobility company, deployed Forethought’s AI platform to automate customer support, achieving a 27% case automation rate and reducing first response times by 77% through intelligent routing and automated responses, while also processing 1.7 million tickets annually and supporting multiple languages. This implementation addressed challenges like manual ticket handling and language barriers, demonstrating significant improvements in efficiency and customer satisfaction.
Lindy.ai - Lindy.ai transitioned from an open-ended LLM agent platform to a structured, visual workflow-based system, improving reliability and usability by constraining LLM behavior through guided workflows and rails. This shift included a dedicated memory module, prompt caching, and structured output calls, demonstrating that guided workflows lead to more robust AI agents capable of handling complex automation tasks.
LinkedIn - LinkedIn’s SQL Bot utilizes a multi-agent architecture built on LangChain and LangGraph to provide a text-to-SQL interface within their DARWIN data platform, addressing the complexities of enterprise data warehouses through embedding-based retrieval, LLM-based re-ranking, and self-correction agents, resulting in a 95% user satisfaction rate for query accuracy. This system demonstrates the value of a methodical approach to LLMOps, with particular attention to integration with existing enterprise systems and workflows.
LinkedIn - LinkedIn’s production GenAI journey involved a strategic shift from Java to Python, leveraging LangChain for building sophisticated conversational agents. This included developing a robust prompt management system, a skill-based task automation framework, and a memory system for conversational context, all while ensuring production stability and enabling both commercial and fine-tuned LLM deployments.
LinkedIn - LinkedIn’s Hiring Assistant uses LLMs to automate recruiting workflows, incorporating an experiential memory system for personalization and an agent orchestration layer for complex task management. This system handles tasks from job description creation to interview coordination, while emphasizing responsible AI practices and integrating with existing LinkedIn technologies.
LinkedIn - LinkedIn’s Security Posture Platform (SPP) uses an AI-driven interface, SPP AI, to manage security vulnerabilities, leveraging a security knowledge graph and LLMs for natural language queries. This system improved vulnerability response speed by 150% and increased digital infrastructure coverage by 155% through a multi-stage query processing pipeline and sophisticated context generation.
LinkedIn - LinkedIn implemented a production-grade generative AI system using a Retrieval Augmented Generation (RAG) architecture to improve job searches and content browsing, employing specialized AI agents and a custom “skills” wrapper for API integration. The team focused on overcoming challenges like LLM schema compliance, quality assurance, and optimizing for latency and resource management, using a multi-tiered evaluation framework and streaming architecture to achieve significant improvements.
LinkedIn - LinkedIn’s Pan Cha outlines a product-led approach to LLM integration, emphasizing solving user problems over forced AI adoption, starting with simple implementations using public APIs, and iteratively improving through robust prompt engineering and evaluation frameworks. This pragmatic strategy prioritizes user trust, cost management, and clear problem definition, avoiding AI for its own sake.
LinkedIn - LinkedIn has implemented a multi-stage LLM-based system for extracting and mapping skills from platform content, powering their Skills Graph, using BERT models optimized with knowledge distillation for production. This system handles 200 profile edits per second with sub-100ms latency, leveraging a hybrid serving approach and Spark for offline scoring, resulting in significant improvements in job recommendations and skills matching.
LinkedIn - LinkedIn implemented generative AI features across their platform, prioritizing user-centric design and systematic prompt engineering, finding that optimized prompts with GPT-3.5 Turbo could match GPT-4 performance. Their approach emphasized building trust through transparent communication and content policies, while also addressing challenges like GPU resource constraints and prompt reliability at scale.
London Stock Exchange Group - The London Stock Exchange Group (LSEG) deployed an AI-powered client services assistant using Amazon Q Business, enhancing post-trade support with a RAG architecture that integrates internal knowledge and public data; a rigorous validation process using Claude v2 ensures response accuracy, improving customer experience and staff productivity.
Malt - Malt enhanced its freelancer matching system using a two-step retriever-ranker architecture powered by a vector database (Qdrant), significantly reducing response times from over 60 seconds to under 3 seconds while maintaining recommendation quality. This approach leverages custom-trained models and a robust monitoring system to achieve scalability and performance improvements.
Mark43 - Mark43, a public safety technology company, integrated Amazon Q Business into their platform, providing law enforcement agencies with secure generative AI capabilities, enabling natural language queries and automated case report summaries, reducing administrative time significantly. The implementation prioritizes security, using built-in connectors and embedded web experiences to create a seamless AI assistant within existing workflows, while maintaining strict data access controls.
Marsh McLennan - Marsh McLennan rolled out an enterprise-wide LLM assistant, achieving 87% adoption across 90,000 employees and processing 25 million requests annually, initially using cloud-based APIs and RAG for secure data access, then evolving to fine-tuned models for specific tasks, achieving accuracy exceeding GPT-4 with low training costs and saving over a million hours annually.
Mastercard - Mastercard implemented a RAG architecture for fraud detection using LLMs, achieving a 300% improvement in some cases, while prioritizing responsible AI principles, security, and access controls. This case study highlights the complexities of enterprise-scale LLM deployment, including the need for robust data pipelines and scalable infrastructure.
Mastercard - Mastercard’s data science team is using a linguistic-first approach to LLM deployment, focusing on syntax, morphology, semantics, and pragmatics to address challenges like evolving language and tokenization issues. This methodology, demonstrated with a biology question-answering system, improved accuracy from 35% to 85% using pragmatic instruction with Falcon 7B and the guidance framework, while also reducing inference time compared to vanilla ChatGPT.
MediaRadar / Vivvix - MediaRadar | Vivvix, an advertising intelligence company, automated their video ad classification process using Databricks Mosaic AI and Apache Spark Structured Streaming, combining GenAI models with their existing classification systems to increase throughput from 800 to 2,000 ads per hour and reduce model experimentation time from 2 days to 4 hours. They optimized costs by choosing GPT-3.5 and improved accuracy by combining multiple classification approaches.
Mendable.ai - Mendable.ai utilized LangSmith to debug and optimize their production LLM-powered enterprise AI assistant, which uses tools and actions to interact with APIs and data sources, resulting in improved system performance and $1.3 million in savings for a major tech company client within five months. By implementing LangSmith, they gained visibility into agent decision-making, optimized prompts, and validated tool schemas, addressing initial challenges with observability and reliability.
Mendix - Mendix, a low-code platform, integrated Amazon Bedrock to provide secure and scalable access to generative AI models within their development environment, enabling features like text generation and image creation. Their implementation includes custom model training, robust security measures using AWS services, and cost-effective model selection, showcasing a mature approach to LLMOps.
Mercado Libre - Mercado Libre, Latin America’s largest e-commerce platform, transitioned from a word-matching search system to one using vector embeddings and Google’s Vector Search, significantly improving results for complex, natural language queries which make up half of their search traffic, leading to increased conversion rates. This new system generates vector embeddings for both products and user queries, enabling a more semantic understanding of search intent.
Mercado Libre - Mercado Libre, a leading e-commerce platform in Latin America, deployed GitHub Copilot across its 9,000+ developer team, integrating it with their existing GitHub Enterprise infrastructure, resulting in a 50% reduction in code writing time and improved developer satisfaction. The implementation also included security workflows and automated testing, demonstrating a focus on both productivity and code quality at scale.
Mercado Libre - Mercado Libre implemented a centralized LLM gateway, “Fury,” to manage large-scale generative AI deployments across their organization, integrating with multiple providers and offering a custom playground and SDK; a key use case is a real-time product recommendation system that leverages LLMs for personalized suggestions, supporting multiple languages and dynamic prompt versioning.
Mercado Libre - Mercado Libre implemented LLMs for RAG-based documentation search using Llama Index, automated documentation generation, and natural language processing for product information and booking, emphasizing data pre-processing, quality control, and iterative prompt engineering. Their experience highlighted the importance of comprehensive documentation, structured outputs, and careful model selection, showcasing practical LLMOps including function calling and continuous refinement.
Mercado Libre / ATB Financial / LBLA / Collibra - Mercado Libre, ATB Financial, LBLA, and Collibra are implementing data and AI governance for GenAI, utilizing Google Cloud tools like Dataplex. They are exploring GenAI for automated metadata, natural language search, and lineage tracking, while addressing challenges like data quality and multi-cloud integration, emphasizing the need for strong data governance for successful AI deployment.
Mercari - Mercari implemented an AI Assist feature using a hybrid LLM approach, leveraging GPT-4 for offline attribute extraction and GPT-3.5-turbo for real-time title suggestions, focusing on practical challenges like prompt engineering, error handling, and managing output inconsistencies in a production environment. They employed both offline and online evaluations to ensure quality and optimize for cost and performance.
Mercari - Mercari fine-tuned a 2B parameter LLM using QLoRA to extract dynamic attributes from user-generated listings, achieving a 95% reduction in model size and a 14x cost reduction compared to GPT-3.5-turbo, while also improving BLEU score and controlling hallucinations. The implementation included careful dataset preparation, parameter-efficient fine-tuning, and post-training quantization with llama.cpp.
Meta - Meta’s deployment of their AI image animation feature demonstrates a comprehensive approach to scaling generative AI for billions of users, using techniques like floating-point precision reduction and temporal-attention optimization to improve model performance. The system leverages PyTorch optimizations and a sophisticated traffic management system with regional routing and load balancing to minimize latency and ensure global reliability.
Meta - Meta scaled its AI infrastructure to train LLaMA 3 by building two 24K GPU clusters, achieving 95% training efficiency through full-stack optimizations across hardware, networking, and software, while addressing challenges in hardware reliability, thermal management, and network topology. This involved a transition from smaller recommendation models to massive LLM training jobs requiring thousands of GPUs running for months.
Meta - Meta’s TestGen-LLM leverages large language models to automatically enhance unit test coverage for Android applications, including platforms like Instagram and Facebook. Using an Assured Offline LLM-Based Software Engineering approach, it generates additional test cases with strict quality controls, resulting in a 10% improvement in targeted classes and high acceptance from engineers.
Microsoft - Microsoft’s team developed a production-grade RAG system for analyzing complex financial documents, tackling challenges like metadata extraction, chart analysis, and nuanced evaluation. Their solution incorporated multi-modal models, specialized prompt engineering, and a robust evaluation framework, highlighting the complexities of building real-world RAG systems beyond basic implementations.
Microsoft - A financial services firm developed a RAG-based chatbot using Azure OpenAI and Azure AI Search to provide access to financial documents, overcoming initial challenges with context loss and search accuracy by implementing a “search-first” architecture that leverages GPT-4 generated summaries for improved relevance. This approach, combined with hybrid search and custom scoring, significantly improved response accuracy and reduced manual research time for financial analysts.
Microsoft - Microsoft engineers detail their experiences deploying LLMs for enterprise clients in Australia, emphasizing the need for cross-functional teams and robust LLMOps practices including continuous experimentation and evaluation pipelines. The case study highlights careful RAG implementation and critical security measures like guard rails to mitigate vulnerabilities such as prompt injection.
Microsoft - Microsoft’s Raj Ricky outlines best practices for AI agent development, emphasizing starting with a minimal viable product and constrained environments, while avoiding premature adoption of complex frameworks. He highlights the importance of clear success criteria, human oversight during initial development, and performance optimization techniques like quantization, while balancing autonomy and control.
Microsoft - Microsoft built a real-time question-answering system for their MSX Sales Copilot, enabling sellers to quickly find relevant sales content using a two-stage LLM architecture with bi-encoder retrieval and cross-encoder re-ranking, achieving few-second response times and a 3.7/5 relevancy rating from users. The system operates on document metadata due to content access limitations, and is deployed on Azure Machine Learning endpoints with weekly model refreshes.
Microsoft - Microsoft Research leveraged LLMs, including GPT-3 and GPT-3.5, to automate incident management for Microsoft 365, analyzing 40,000 incidents across 1000+ services to generate root cause analysis and mitigation recommendations, with fine-tuned GPT-3.5 models achieving a 70% usefulness rating from on-call engineers in production.
Microsoft / GitHub - A study of 26 software engineers building AI-powered product copilots highlights challenges in prompt engineering, orchestration, and testing, leading to the development of solutions like prompt linters and intent detection systems, while emphasizing safety, privacy, and cost management. The research underscores the need for more mature tooling and standardized practices to support AI-first development.
Microsoft Research / Deepgram / Prem AI / ISO AI - A panel of experts from Microsoft Research, Deepgram, Prem AI, and ISO AI discussed the challenges of deploying multi-modal AI agents, covering topics like latency, model architecture, and scaling. They explored how combining voice, vision, and text can improve agent performance, and how hierarchical architectures using both large and smaller specialized models can optimize for different use cases.
MLflow - MLflow introduces a production-ready agent framework with comprehensive tracing, evaluation, and experiment tracking, addressing the challenges of deploying LLM agents. This system provides deep visibility into agent operations, detailed logging of multi-turn conversations, and evaluation tools for assessing retrieval relevance and prompt engineering effectiveness, enabling teams to efficiently monitor, debug, and optimize their LLM-powered applications.
MNP - MNP, a Canadian professional services firm, modernized its data analytics platform by implementing a lakehouse architecture on Databricks, leveraging Mixtral 8x7B with a RAG approach to deliver contextual insights to clients. This solution, deployed in under six weeks, enabled secure and efficient processing of complex data queries while maintaining data isolation through Private AI standards.
MongoDB / Dataworkz - MongoDB and Dataworkz partnered to implement an agentic RAG solution for retail, leveraging MongoDB Atlas for vector search and Dataworkz’s RAG builder, enabling personalized customer experiences through intelligent chatbots, dynamic product recommendations, and enhanced search by integrating real-time operational data with unstructured information. The system uses an agentic approach to intelligently query multiple data sources, combining lexical and semantic search with knowledge graphs, demonstrating a significant advancement in LLMOps for complex, context-aware applications.
Morgan Stanley - Morgan Stanley’s wealth management division implemented a GPT-4 powered internal chatbot, enabling their financial advisors to quickly access a vast library of investment strategies, market research, and analyst insights. This system, processing hundreds of thousands of pages of PDF-based content, has over 200 daily active users, showcasing the effectiveness of LLMs for enterprise knowledge management.
MosaicML - MosaicML developed the open-source MPT family of large language models, including 7B and 30B parameter versions, demonstrating that high-quality LLMs can be trained at significantly lower costs, with the 7B model costing under $250,000. They built a complete training platform that handles data processing, distributed training across 512+ GPUs, and model deployment at scale, while establishing key best practices around planning, experimentation, data quality, and operational excellence for production LLM development, including robust checkpointing and failure recovery.
MSD - MSD partnered with the AWS Generative Innovation Center to build a text-to-SQL system using Amazon Bedrock and Anthropic’s Claude 3.5 Sonnet, enabling analysts to query complex healthcare databases with natural language, using custom lookup tools and prompt engineering to handle coded columns and complex queries. This resulted in significantly reduced query times and improved data accessibility for non-technical users.
MultiCare - The MultiCare project created a large-scale, multimodal medical case report dataset with over 75,000 articles and 135,000 images, using a sophisticated data processing pipeline with tools like BioPython, OpenCV, and spaCy to extract and structure text, images, and metadata, enabling training of language, computer vision, and multimodal AI systems. The dataset features automated extraction of demographic data, edge detection for image splitting, and contextual parsing of image captions, and is hosted on Zenodo and Hugging Face with a flexible filtering system.
N8N - A company developed and deployed autonomous AI agents over 18 months, focusing on lead generation and inbox management, using vector databases, automated data collection, and structured prompt engineering with n8n for custom tool integration. This resulted in a scalable multi-agent system that highlights the importance of data quality, agent architecture, and robust tool integration for complex business workflows.
National Healthcare Group - National Healthcare Group integrated LLMs into existing healthcare apps and messaging platforms to provide 24/7 multilingual patient education, focusing on conditions like eczema and medical test preparation. The implementation emphasizes practical integration, careful monitoring, and manual review of LLM responses to ensure accuracy and privacy.
Neeva - Neeva, a search engine company, successfully navigated the complexities of deploying LLMs in production by addressing infrastructural challenges like speed and API reliability, as well as output-related issues such as format variability and safety, emphasizing a phased approach starting with non-critical workflows and focusing on robust evaluation frameworks. Their strategy included optimizing for speed and cost, ensuring output consistency, and planning for scale, all while maintaining a user-centric approach.
Neva / Intercom / Prompt Layer / OctoML - A panel of experts from Neva, Intercom, Prompt Layer, and OctoML discussed strategies for optimizing LLM deployments in production, covering cost and performance, including transitioning from API services to in-house deployments, hardware optimization, and technical optimizations like structured printing and knowledge distillation. They also covered latency optimization through libraries and model compression, and emphasized the importance of monitoring tail latencies, costs, and quality, while balancing user experience.
New Computer - New Computer utilized LangSmith to refine the memory retrieval system of their AI assistant, Dot, achieving a 50% increase in recall and a 40% improvement in precision through synthetic data testing, comparison views, and prompt optimization, which led to a successful launch and a 45% conversion rate to their paid tier.
New Relic - New Relic, a major observability platform, integrated GenAI for both internal operations and customer-facing products, achieving a 15% increase in developer productivity. Their implementation features a multi-agent architecture for internal tasks, automated incident management, and a three-layer AI architecture for their product offerings, emphasizing cost management through model selection, RAG, and careful monitoring.
Nextdoor - Nextdoor implemented an LLM-based system to optimize email subject lines, using prompt engineering with the OpenAI API and a fine-tuned reward model for engagement prediction, resulting in a 1% lift in sessions and a 0.4% increase in Weekly Active Users. The system employs rejection sampling, caching, and robust monitoring to ensure cost-effective and reliable performance.
NICE - NICE built a production-grade natural language to SQL system for querying contact center data, achieving 86% accuracy with robust safeguards like tenant isolation, query parameter management, and result visualization. The system also includes context management for follow-up questions, query caching, and validation against business rules to ensure reliable operation.
NICE Actimize - NICE Actimize, a leader in financial fraud prevention, uses vector embeddings to enhance real-time fraud detection by transforming tabular transaction data into text and then into vector embeddings using a RoBERTa variant, enabling them to identify similar fraud patterns with sub-millisecond processing times. This approach maintains the high performance required for large-scale, real-time transaction analysis while preserving semantic meaning.
NICE Actimize - NICE Actimize integrated generative AI into their “Excite” financial crime detection platform, enabling analysts to create analytical artifacts from natural language requests. This system automates the generation of aggregations, features, and models, with built-in validation pipelines and MLOps capabilities to ensure safe and efficient deployment.
No company name - This case study examines the critical need for robust error handling and response validation when deploying LLMs in production, emphasizing a multi-layered approach including input validation, response processing with retry and fallback mechanisms, and comprehensive monitoring and logging. It also covers production considerations like scalability, error recovery, security, and documentation, highlighting the importance of continuous improvement and testing to ensure system reliability and a positive user experience.
No company name is explicitly mentioned in the case study. - A trivia quiz application was transformed from a static database to a dynamic, LLM-powered system using Google’s Vertex AI, leveraging Gemini models for quiz generation and validation, achieving a significant accuracy improvement from 70% to 91%. The team implemented robust prompt engineering, testing, and validation frameworks, showcasing the practical application of LLMs in a production environment.
No company name is mentioned in this case study. - This case study details the use of LLMs to process complex legacy PDF documents, addressing challenges like binary data, stream compression, and intricate object relationships. The solution employs LLMs for intelligent text extraction, document understanding, and semantic analysis, creating a pipeline that includes pre-processing, decoding, content analysis, and post-processing.
No company name mentioned - This case study examines the challenges of deploying LLMs at scale, contrasting traditional MLOps with the emerging LLMOps, and emphasizes the need for new approaches to data handling, evaluation, and infrastructure, while recommending leveraging existing MLOps knowledge and adapting team structures for successful implementation. It also highlights the importance of cost management, change management, and clear ROI justification for large-scale LLM deployments.
North Dakota University System - The North Dakota University System (NDUS) deployed a “Policy Assistant” using Llama 2 on Databricks’ Data Intelligence Platform to streamline policy document search across its 11 institutions, achieving a 10-20x speedup in search operations and reducing development time from one year to six months. This system, built on Azure, processes thousands of PDFs, creates vector embeddings for efficient search, and provides natural language query capabilities with citations, all while maintaining robust governance and security.
Notion - Notion, grappling with massive data growth, built a scalable data lake using S3, Spark, Kafka, and Debezium CDC with Apache Hudi for efficient change data capture, reducing data ingestion times from days to minutes/hours and saving over $1 million. This infrastructure enabled the rollout of Notion AI features, including their Search and AI Embedding RAG infrastructure.
NTT Data - NTT Data partnered with a large infrastructure company to implement a GenAI-powered work order management system, using a privately hosted LLM and a custom knowledge base to automate classification, urgency assessment, and special handling identification for over 500,000 annual maintenance requests, improving accuracy and consistency compared to the previous manual approach. The system also provides reasoning explanations and audit trails, while prioritizing security and data privacy.
Nubank / Harvey AI / Galileo / Convirza - A panel of leaders from Nubank, Harvey AI, Galileo, and Convirza discussed their experiences deploying LLMs in production, highlighting the shift from large proprietary models to optimized, specialized ones for cost and latency, emphasizing modular architectures and robust evaluation frameworks incorporating human feedback. They also covered sophisticated model selection based on quality, latency, cost, and technical debt, alongside cost management strategies like fine-tuning and optimized inference.
Numbers Station - Numbers Station developed a production-ready self-service analytics platform using LLMs, RAG, and a unified knowledge layer to address the bottleneck of data team requests, enabling users to generate SQL queries and charts through a multi-agent architecture with a focus on accuracy, scalability, and enterprise integration. Their system demonstrated significant improvements in real-world benchmarks, reducing setup time from weeks to hours while maintaining high accuracy through contextual knowledge integration.
Numbers Station - Numbers Station is integrating foundation models into the modern data stack to accelerate data insights, focusing on practical applications like natural language to SQL translation, data cleaning, and data linkage. They address challenges such as model scale and performance through techniques like model distillation and hybrid approaches, while tackling prompt brittleness with prompt ensembling and decomposition.
NVIDIA - NVIDIA’s product security and AI red team share their experiences securing LLM deployments, highlighting real-world challenges with RAG systems and plugin architectures. The study identifies vulnerabilities such as data poisoning in RAG, prompt injection leading to SQL injection, and remote code execution through plugins, emphasizing the need for strict access controls, input validation, sandboxing, and careful logging strategies.
NVIDIA - NVIDIA’s Agent Morpheus uses four specialized Llama3 LLMs and AI agents in an event-driven architecture to automate CVE analysis and remediation, triggered by container uploads and integrating with multiple data sources to generate remediation plans and security documentation, achieving a 9.3x speedup through parallel processing. This system, deployed using NVIDIA NIM, reduces analysis time from hours/days to seconds and includes a human-in-the-loop feedback mechanism for continuous improvement.
OLX - OLX developed “OLX Magic,” an AI-powered shopping assistant for their secondhand marketplace, combining traditional keyword search with LLM-driven agents to handle natural language, multi-modal (text, image, voice), and modified visual searches. The system tackles e-commerce personalization and search refinement challenges, balancing user experience with technical constraints like latency and cost.
OLX - OLX implemented a production system using the Prosus AI Assistant to automate job role extraction from listings, processing 2,000 daily updates with 4,000 API calls, and using LangChain for prompt engineering; initial A/B tests showed positive results, but the $15K monthly cost is driving a potential shift to self-hosted models.
ONE - ONE’s chatbot initiative, deployed across Facebook Messenger and WhatsApp, reached over 38,000 users in six African countries, facilitating campaign engagement and generating over 17,000 user actions using RapidPro, ActionKit CRM, and Google BigQuery; key learnings included the importance of iterative development and localized language support, while challenges included platform restrictions and varying user acquisition costs.
OpenAI - This case study demonstrates how LLMs can streamline e-commerce review analysis by replacing traditional machine learning workflows with a single model capable of multi-task analysis, including sentiment analysis, aspect extraction, and topic clustering, using OpenAI’s API and carefully engineered prompts. This approach enhances efficiency, reduces development time, and provides improved explainability compared to traditional black-box models.
OpenAI / AWS - A private equity firm successfully deployed an LLM-based recommendation system, leveraging OpenAI APIs for data cleaning and text embeddings, and AWS for deployment, focusing on practical implementation and addressing challenges in data quality and resource management. The system prioritizes relevant recommendations within the first five suggestions for a boomer-generation user base.
OpenGPA / Microsoft Research - OpenGPA’s study reveals the limitations of standard RAG when processing context-rich documents like movie scripts, showing how basic chunking and vector search fail to capture temporal and relational context, leading to inaccurate answers; while Graph RAG offers improvements, the study emphasizes the need for advanced context management techniques and proposes a movie script benchmark for evaluating RAG systems.
OpenRecovery - OpenRecovery developed a multi-agent system using LangGraph to provide AI-powered addiction recovery support, featuring specialized agents, shared state, and dynamic context switching. Deployed via LangGraph Platform, the system integrates with mobile apps and uses LangSmith for observability and testing, while incorporating human-in-the-loop verification to ensure accuracy and empathy.
Orizon - Orizon, a healthcare platform, automated 63% of their medical rule documentation tasks by implementing a GenAI solution using Databricks, fine-tuning Llama2-code and DBRX models, and deploying them through Mosaic AI Model Serving, while maintaining strict security and governance. This reduced documentation time to under 5 minutes and freed up developer resources, demonstrating the potential of LLMs in regulated industries.
PagerDuty - PagerDuty rapidly deployed multiple GenAI features, including AI-powered runbook generation and incident summarization, within two months by adopting a centralized LLM API service built on Kubernetes. This architecture enabled them to quickly iterate on new features, manage multiple LLM providers, and maintain robust security and monitoring.
Paradigm - Paradigm, a YC24 company, built an AI-powered spreadsheet platform using LangChain to develop specialized agents for tasks like schema generation and task planning, while leveraging LangSmith for monitoring and usage-based pricing, enabling them to manage thousands of parallel agents efficiently. This combination of tools allowed them to optimize costs and maintain high performance in their production environment.
Parameta Solutions - Parameta Solutions implemented an automated email triage system using Amazon Bedrock Flows, processing client requests by classifying emails, extracting entities, and generating responses, integrating with data sources like Snowflake and OpenSearch. This system reduced resolution times from weeks to days, showcasing a practical application of LLMs in a regulated financial environment.
Paramount+ - Paramount+ partnered with Google Cloud to implement Gen AI for video summarization and metadata enrichment, processing over 50,000 videos using techniques like prompt chaining and model fine-tuning to improve content discoverability and reduce reliance on manual processes and third-party services. The system includes a three-component architecture for transcription, content generation, and personalization integration, optimizing for token limits and model selection.
Parcha - Parcha is developing AI agents for enterprise operations and compliance, particularly in fintech, moving from LangChain to a custom framework for better control and debugging, using Claude as their primary LLM. Their approach focuses on breaking down complex workflows into smaller, manageable components with a hierarchical agent structure, emphasizing controlled execution of pre-defined procedures, achieving 90% accuracy before deployment, and providing transparency into agent decision-making.
Parcha - Parcha built a production-grade AI agent system for compliance and operations, transitioning from a basic Langchain setup to a distributed architecture with asynchronous processing and a coordinator-worker pattern, addressing issues like context pollution and unreliable connections. Their solution incorporates robust error handling with queue-based execution, self-correction, and automated reporting, alongside a modular tool framework for easy integration and scalability.
Parlance Labs / GitHub - Parlance Labs, leveraging experience from GitHub Copilot, advocates for a practical LLM deployment approach centered on rigorous evaluation, data-centric strategies, and iterative development, including multi-level evaluation frameworks and instruction tuning. Their work, demonstrated in a real estate CRM integration, emphasizes data quality, robust evaluation systems, and human oversight, while addressing infrastructure and scalability challenges.
Patronus AI - Patronus AI developed Lynx, a specialized hallucination detection model, by fine-tuning Llama-3-70B-Instruct on their HaluBench dataset using Databricks’ Mosaic AI infrastructure, achieving a 1% accuracy improvement over GPT-4 in hallucination detection and open-sourcing the model and dataset. Leveraging LLM Foundry and Composer for training optimizations like FSDP and Flash Attention, they demonstrated significant gains in domain-specific areas like medical question answering.
Perplexity - Perplexity built a production-grade conversational search engine using multiple LLMs, including GPT-4 and custom models, optimized for low latency and high-quality results. Their system combines search indices, tools, and custom embeddings to deliver personalized, accurate responses at scale, while also focusing on reliability and maintainability with a small, efficient engineering team.
Perplexity - Perplexity’s Pro Search is an advanced AI answer engine that tackles complex queries using multi-step reasoning, separating planning and execution phases, and integrating tools like code interpreters and Wolfram Alpha. This approach, combined with a user-friendly interface, has led to a 50% increase in query search volume.
Perplexity AI - Perplexity AI transitioned from internal SQL tools to a production-ready AI search and research assistant, iteratively developing from Slack and Discord bots to a web interface. They tackled challenges in search relevance, model selection, latency, and cost by implementing a hybrid approach using fine-tuned GPT models and custom LLaMA-based models, achieving superior performance metrics in citation accuracy and perceived utility compared to competitors.
PeterCat.ai - PeterCat.ai developed a system that creates customized AI assistants for GitHub repositories, using LLMs and RAG to improve code review and issue management, and deploying it as a GitHub App. The system, adopted by 178 open source projects, leverages LangChain for orchestration, a vector database for knowledge storage, and AWS Lambda for asynchronous processing.
Philadelphia Union - The Philadelphia Union utilized a RAG architecture on Databricks, incorporating Vector Search and the DBRX Instruct model, to create a chatbot that simplifies complex MLS roster rules, enabling faster decision-making and ensuring compliance. Deployed via Databricks Apps, the system showcases robust LLMOps practices, including thorough testing, monitoring, and governance.
Picnic - Picnic, a grocery delivery platform, implemented an LLM-powered search system using GPT-3.5-turbo and OpenAI’s text-embedding-3-small model to improve product discovery across multiple languages, leveraging OpenSearch for efficient retrieval and precomputed embeddings with caching to maintain low latency. This system effectively handles multilingual queries, typos, and varied user intents while maintaining the speed and reliability required for e-commerce applications.
Pinterest - Pinterest successfully rolled out GitHub Copilot for AI-assisted development, emphasizing a secure and compliant approach through a large-scale trial and cross-functional collaboration, achieving 35% adoption within six months. Their methodical implementation included robust security measures, integration with existing workflows, and a focus on developer experience, resulting in positive feedback, particularly for cross-language development.
Pinterest - Pinterest built a Text-to-SQL system within their Querybook tool, initially using an LLM for SQL generation and later enhancing it with RAG for table selection, improving task completion speed by 35% and first-shot acceptance rates from 20% to over 40%. The system leverages technologies like Langchain for JSON parsing, OpenSearch for vector storage, and WebSocket streaming to handle long response times, demonstrating a practical approach to deploying LLMs in a production environment.
Podium - Podium, a communication platform for small businesses, utilized LangSmith to optimize their AI Employee agent, achieving a 98.6% F1 score in response quality through comprehensive testing and dataset management, while also reducing engineering intervention by 90%. This LLMOps approach empowered their Technical Product Specialists to independently troubleshoot issues, improving overall customer satisfaction.
Podzy / The Learning Agency Lab / Vanderbilt’s LEER Lab - Educational organizations are integrating LLMs and LangChain to create enhanced learning experiences, with Podzy building a spaced repetition system with LLM-powered question generation, The Learning Agency Lab focusing on datasets and competitions for LLM educational applications, and Vanderbilt’s LEER Lab developing intelligent textbooks using LLMs for content summarization and question generation. These implementations highlight the integration of LLMs with existing educational tools while addressing challenges of accuracy, personalization, and fairness.
PredictionGuard - PredictionGuard’s framework provides a comprehensive approach to securing enterprise LLM deployments, addressing challenges like hallucinations with factual consistency models and RAG, supply chain risks with trusted registries, and prompt injection attacks using custom filtering. The framework also emphasizes data privacy through PII detection and confidential computing, while maintaining performance and integrating with existing enterprise systems.
Prem AI - Prem AI optimized their production vision pipelines to generate millions of realistic ethereal planet images using Stable Diffusion XL, fine-tuning the model with a curated dataset and implementing a custom multi-stage upscaling pipeline. They further optimized performance through techniques like LoRA fusion, model quantization, and efficient serving with Ray Serve, resulting in consistent high-quality image generation with specific aspect ratios and controllable attributes.
Principal Financial Group - Principal Financial Group implemented an enterprise-wide RAG solution using Amazon Q Business, enabling their customer service team to efficiently query over 9,000 pages of work instructions, achieving 84% accuracy in document retrieval and a 50% reduction in some workloads. The project highlights the importance of metadata quality, user training, and strong governance frameworks in successful LLM deployments.
PromptOps / Insight Partners / Meta / Stripe - A panel of experts from PromptOps, Insight Partners, Meta, and Stripe shared practical insights on deploying LLMs in production, covering topics like managing hallucinations, prompt engineering techniques, latency optimization, and evaluation strategies, while also emphasizing cost considerations and the importance of continuous feedback loops. The discussion highlighted the need for robust infrastructure, risk management, and a pragmatic approach to model selection, including the use of open-source alternatives.
Prosus - Prosus developed Plus One, an internal LLM platform accessible via Slack, to democratize AI adoption across their diverse portfolio of companies. The platform supports multiple LLMs and RAG, serving thousands of users with over half a million queries spanning software development, product management, and general business tasks. Through iterative optimization, they’ve reduced hallucination rates to below 2% and implemented cost-saving measures like token economy and strategic model selection, enabling both technical and non-technical users to leverage AI effectively.
Prosus / OLX - Prosus has deployed two key AI agent applications: Toan, an enterprise assistant used by over 15,000 employees across 24 companies, and OLX Magic, an e-commerce assistant integrated into their marketplace, with Toan achieving a reduction in hallucinations from 10% to 1% and saving users an average of 48 minutes per day, while OLX Magic leverages similar agent technology to enhance product discovery through generative AI features.
Q4 Inc. - Q4 Inc. developed a financial data Q&A chatbot for Investor Relations Officers, using Amazon Bedrock and a novel RAG approach that leverages LLMs to generate SQL queries against financial datasets, achieving high accuracy and single-digit second response times. The system uses multiple foundation models through Amazon Bedrock for different tasks (SQL generation, validation, summarization) optimized for performance and cost.
Qatar Computing Research Institute - Qatar Computing Research Institute developed T-RAG, an on-premise question-answering system for confidential organizational documents, combining RAG with a fine-tuned Llama-2 7B model and a custom tree-based entity structure, achieving 73% accuracy and robust entity tracking. This approach demonstrates the benefits of combining multiple techniques for production LLM applications, while also addressing the constraints of limited resources and on-premise requirements.
QualIT - QualIT has developed an LLM-enhanced topic modeling system that combines LLMs for key phrase extraction with a two-stage hierarchical clustering approach, achieving a 70% topic coherence and 95.5% topic diversity, outperforming LDA and BERTopic benchmarks, while also implementing a robust hallucination detection framework. This system has been validated through human evaluation, demonstrating its practical utility for analyzing large volumes of text data from sources like surveys and customer feedback.
QuantumBlack - QuantumBlack developed two LLM-powered systems: a molecular discovery platform using chemical language models and RAG for pharmaceutical research, and a call center analytics solution that processes audio with diarization, transcription, and LLM analysis, achieving a 60x speedup through optimizations. The call center system uses a pipeline including Whisper for transcription and a quantized Mistral 7B model for analysis.
QuantumBlack - QuantumBlack’s data engineers share their practical experiences implementing LLMs in production, addressing challenges around unstructured data, data quality, and privacy, while also exploring how LLMs can assist with tasks like pipeline development and synthetic data creation. They emphasize the importance of human oversight, risk mitigation, and careful evaluation when integrating LLMs into enterprise data workflows.
Rakuten - Rakuten Group implemented LangChain and LangSmith to create a suite of AI applications, including AI Analyst, AI Agent, and AI Librarian, for both clients and employees; they also built an internal chatbot platform using OpenGPTs, enabling rapid development and deployment while maintaining enterprise-grade security and scalability.
Ramp - Ramp built an AI-powered Tour Guide agent that uses a visible cursor and step-by-step explanations to guide users through their financial platform, employing an iterative action generation system and optimized prompts. The system prioritizes human-agent collaboration, ensuring user trust through transparent actions and clear feedback, while also focusing on performance optimization and safety through guardrails and controlled interaction spaces.
Rasgo - Rasgo’s platform for AI-powered data analysis agents emphasizes database interactions and custom agent creation, highlighting the importance of a well-designed agent-computer interface and the impact of base model selection, with GPT-4 outperforming GPT-3.5 for complex tasks. Their experience underscores the necessity of robust production infrastructure, focusing on reasoning capabilities, and avoiding unnecessary abstractions for successful LLM deployment, alongside careful attention to security, error handling, and performance optimization.
RealChar - RealChar is developing an AI phone call assistant for customer service, utilizing a multi-modal processing system inspired by self-driving car architectures, with real-time audio processing and millisecond-level tracing. The system employs an event bus for parallel processing, fallback mechanisms for managing variable LLM response times, and a tiered model system for speed and accuracy tradeoffs, all while prioritizing reliability and real-time performance monitoring.
Realtime - Realtime built an automated data journalism platform using LLMs to generate news stories from real-time data, employing a multi-stage pipeline with GPT-4 Turbo and focusing on quality control, cost optimization, and transparency. The platform processes diverse data sources, constructs dynamic prompts, and implements safeguards against common LLM errors, demonstrating a practical approach to deploying LLMs in a production news environment.
Reddit - A network security block page, while seemingly simple, can be enhanced with LLMs for user authentication, ticket management, and security monitoring, requiring robust backend infrastructure and careful attention to data privacy and model security. Future improvements could include enhanced security features, user experience enhancements, and broader integration capabilities.
Renovai - Renovai’s R&D team presented their approach to building production-ready LLM agents, detailing state management, workflow engineering, and multi-agent systems. Their work includes practical implementation patterns like Router + Code and state machines, along with advanced techniques such as multimodal agents using GPT-4V for web navigation, emphasizing the importance of robust state management and clear workflow design.
Replit - Replit’s multi-agent system enhances application development by using specialized agents for workflow management, coding, and user interaction, emphasizing reliability and user engagement. They leverage advanced prompt engineering, a custom DSL for tool invocation, and robust observability via LangSmith, enabling a user-friendly experience with flexible engagement levels and a focus on real-time monitoring and trace analysis.
Replit - Replit, a browser-based IDE platform, utilized Databricks’ Mosaic AI Training to rapidly develop and deploy a custom code completion LLM, scaling from smaller models to a multi-billion parameter model in just three weeks. This allowed them to deploy a production-ready code generation system to their 25 million users, demonstrating the feasibility of rapid LLM deployment with a small team by abstracting away infrastructure complexity.
Replit - Replit’s development of their AI-powered code agent involved navigating challenges such as defining diverse user needs, implementing robust failure detection using rollback tracking and sentiment analysis, and creating custom evaluation harnesses beyond generic benchmarks. The team scaled from 3 to 20 engineers, integrating traditional engineering practices with LLM-specific expertise, and deployed features like “rapid build mode” that significantly reduced application setup time.
Replit - Replit, a platform for over 30 million developers, integrated LangSmith to improve the observability of their complex AI agent system, Replit Agent, built on LangGraph, addressing challenges in handling large-scale trace data, enabling efficient debugging through within-trace search, and introducing thread views for monitoring human-in-the-loop interactions. This resulted in faster debugging, better system understanding, and enhanced human-AI collaboration.
Replit - Replit optimized their LLM infrastructure by using preemptable GPUs, achieving a 66% cost reduction by reducing server startup time from 18 minutes to under 2 minutes through container optimization, GKE image streaming, and improved model loading.
Replit - Replit developed Replit Agent, a multi-agent AI system, to streamline application development, using specialized agents for managing, editing, and verifying code. The system incorporates a custom DSL, advanced prompt engineering, and Claude 3.5 Sonnet, alongside comprehensive monitoring and version control, ensuring reliability and user control.
Replit - Replit developed a multi-agent coding assistant that allows users to create software applications without writing code, using a system of specialized agents for managing, editing, and verifying code, and leveraging GPT-3.5 Turbo for code generation. The system has seen hundreds of thousands of production runs, prioritizing user engagement and feedback, and achieving a 90% success rate in tool invocations through techniques like code-based tool calls and state replay for debugging.
Replit - Replit built an automated code repair system directly into their IDE, using a fine-tuned 7B parameter LLM to fix Python errors identified by LSP diagnostics, achieving performance comparable to much larger models like GPT-4 and Claude-3. The production system features low-latency inference, load balancing, and real-time code application, demonstrating successful deployment of an LLM in a demanding development environment.
Resides - A panel discussion highlighted real-world LLM production use cases, including Resides achieving 95-99% question answering rates in property management by processing unstructured documents, and sales optimization with a 30% improvement through argument analysis, alongside structured output validation for executive coaching. The discussion emphasized avoiding over-engineering, focusing on quick iterations, and prioritizing user value, while also covering data management with vector databases and human-in-the-loop workflows.
Revolut / Seen.it - Revolut’s Sherlock fraud detection system uses vector search to identify anomalous transactions in under 50ms, achieving a 96% fraud detection rate and saving customers over $3 million annually, while Seen.it leverages vector embeddings for natural language video search across half a million clips, enhancing content discovery and marketing workflows. These case studies demonstrate the practical application of vector search and RAG in production, emphasizing performance optimization and user-centric design.
Rexera - Rexera, a real estate transaction company, improved its quality control by moving from single-prompt LLMs to a LangGraph-based system, achieving significant accuracy gains by implementing structured decision paths and reducing false positives from 8% to 2% and false negatives from 5% to 2%. This evolution highlights the importance of choosing the right architecture for complex LLM workflows.
Roche Diagnostics / John Snow Labs - Roche Diagnostics, in collaboration with John Snow Labs, built a production system using healthcare-specific LLMs to extract and structure oncology patient timelines from unstructured clinical notes, leveraging a pipeline that includes OCR, NLP, and LLMs to process diverse medical documents and extract key entities. The system uses zero-shot learning with structured prompts to address challenges like data complexity and ethical considerations, demonstrating the potential of LLMs to automate data extraction and improve the accuracy of medical timeline creation.
Rolls-Royce / Databricks - Rolls-Royce employed conditional GANs on the Databricks platform to optimize engineering design, enabling the generation of new design concepts from simulation data, bypassing traditional modeling. This implementation, focusing on data modeling, cGAN architecture, and MLOps, resulted in faster design iterations and reduced costs while maintaining strict compliance.
Rolls-Royce / Databricks - Rolls-Royce, in collaboration with Databricks and the University of Southampton, developed a cloud-based generative AI system using GANs to accelerate preliminary engineering design, encoding design parameters into images and validating them through physics-based simulations, achieving significant training time reductions and implementing robust data governance. The team also discovered that CPU training sometimes outperformed GPU training for specific validation tasks, highlighting the importance of workload-specific optimization.
Runway - Runway, a pioneer in generative AI for creative tools, implemented a “multimodal feature store” to manage diverse data types like video, images, and text, along with pre-computed features and embeddings, enabling efficient distributed training and improved collaboration between research and engineering teams. This system facilitates semantic queries, efficient batch access, and integrates with existing ML infrastructure, leading to faster iteration cycles for model development.
Salesforce - Salesforce’s Agent Force platform facilitates the creation and deployment of AI agents, addressing the challenges of transitioning LLMs from proof-of-concept to production by emphasizing robust testing, evaluation, and monitoring, including automated pipelines, synthetic data, and iterative testing. The case study highlights the importance of strategic fine-tuning, RAG pipeline optimization, and cost management, while also considering brand voice consistency and data privacy in enterprise environments.
Salesforce - Salesforce AI Research developed AI Summarist, a production-grade LLM system integrated with Slack, to tackle information overload by providing on-demand and scheduled summaries of conversations, channels, and threads using state-of-the-art conversational AI, with a zero-storage architecture for privacy and features like conversation disentanglement and context-aware summarization. The system also includes user feedback loops for continuous improvement and is designed with safeguards like rate limiting and error handling.
Salesforce - Salesforce’s Agentforce Service Agent leverages LLMs and CRM data to create an autonomous AI agent that enhances traditional chatbot interactions by providing intelligent, context-aware responses and actions grounded in company data, with quick deployment, privacy guardrails, and seamless escalation to human agents.
Salesforce - Salesforce’s internal deployment of Einstein Copilot demonstrates a large-scale enterprise LLM implementation, emphasizing a phased rollout starting with standard actions before introducing custom capabilities. The deployment included a comprehensive testing framework, custom prompt templates, and specialized business-specific actions, with a focus on data privacy and continuous model evaluation.
Salesforce - Salesforce’s Einstein GPT, a generative AI system for CRM, integrates LLMs across sales, service, marketing, and development, providing features like automated email composition, content and code generation, and data analysis, all while maintaining strict data privacy and security controls. The system includes human-in-the-loop validation and per-tenant model deployment, addressing challenges in data management, integration, and production deployment to improve efficiency and response times.
Sam / Div / Devin - Three engineers detail their production LLM agent deployments: Sam’s personal assistant uses real-time feedback and template-based routing, Div’s Milton automates browsers with multimodal capabilities and performance optimizations, and Devin’s agent assists engineers with code understanding via background indexing and knowledge graphs. These case studies highlight practical strategies for model selection, testing, performance, and routing in production environments.
Sam / Div / Devin - A panel of engineers shared their experiences deploying LLM-powered agents, detailing use cases like a personal assistant with real-time feedback, a browser automation system, and a GitHub repository assistant, emphasizing routing layers, performance optimization, and real-time feedback mechanisms, while also addressing testing challenges and production considerations. The discussion highlighted practical approaches to model selection, error handling, and monitoring, offering insights into building reliable and efficient LLM-based systems.
Santalucía Seguros - Santalucía Seguros utilizes a RAG-based Virtual Assistant, integrated with Microsoft Teams, to provide insurance agents with quick access to product information. The system, built on Databricks and Azure, features an LLM-as-judge evaluation system within its CI/CD pipeline to ensure consistent quality and prevent regressions.
Schneider Electric - Schneider Electric implemented a RAG system using LangChain and the Flan-T5-XXL model on SageMaker to automate CRM account linking, integrating real-time data from Google Search and SEC filings, and improving accuracy from 55% to 71% with domain-specific prompts. This solution significantly reduced manual processing time for account teams by identifying parent-subsidiary relationships.
SEGA Europe - SEGA Europe implemented a production LLM-based sentiment analysis system on Databricks, processing over 10,000 daily user reviews to improve player retention by up to 40% in some titles. The system leverages Delta Lake, Lakehouse Federation, and Unity Catalog to unify data and provide real-time feedback loops, while also democratizing access to AI insights through natural language interfaces.
Shortwave - Shortwave’s Ghostwriter uses vector embeddings and a fine-tuned LLM to generate email drafts that match a user’s writing style, incorporating relevant information from past emails via semantic search. The system addresses challenges like style matching through fine-tuning and information accuracy by integrating with AI search and using carefully crafted training examples.
Sinch - Sinch’s experience building over 50 global chatbots reveals that these projects should be treated as AI initiatives, not traditional IT projects, emphasizing the need for a “99-intents” approach to handle out-of-scope queries, hierarchical intent organization, and robust error handling, aiming for 90-95% recognition rates through continuous optimization.
Singapore government - The Singapore government developed Pair Search, a modern search engine for Parliamentary records, using a hybrid approach combining keyword, BM25, and semantic search with e5 embeddings, followed by ColbertV2 reranking. Designed for both human users and as a RAG backend, the system has seen positive feedback from government users, with around 150 daily users and 200 daily searches, demonstrating improved search result quality and performance.
Slack - Slack’s engineering team developed a multi-tiered evaluation framework for their LLM-powered features like message summarization and natural language search, using golden sets, validation sets, and A/B testing, alongside automated quality metrics to assess hallucination, accuracy and system integration, enabling rapid iteration and continuous improvement. This systematic approach ensures quality standards are maintained throughout the development lifecycle of their generative AI products.
Slack - Slack developed a generic recommendation API for internal use, prioritizing privacy and ease of integration by abstracting away ML complexities. They achieved a 10% improvement over hand-tuned models by focusing on interaction patterns rather than message content, demonstrating the effectiveness of a privacy-first, vertically integrated ML team.
Slack - Slack implemented AI features using a secure architecture that ensures customer data privacy and compliance by hosting LLMs in their VPC with AWS SageMaker, using RAG instead of fine-tuning, and maintaining strict data access controls, resulting in 90% of AI-adopting users reporting increased productivity while maintaining enterprise-grade security and compliance.
Smart Business Analyst - A procurement team built “Smart Business Analyst,” an LLM-powered system that automates competitor analysis for medical devices, using a multi-agent architecture to extract data from diverse sources, perform precise numerical comparisons, and generate structured reports with visualizations, significantly reducing analysis time. The system incorporates real-time data updates, multilingual support, conversational memory, and source attribution, addressing the limitations of general-purpose LLMs in specialized industry contexts.
Smith.ai - Smith.ai enhanced their customer engagement platform by integrating LLMs into their chat system, creating a hybrid approach that combines AI automation with human oversight. This system uses a custom-tuned LLM and RAG architecture to provide context-aware responses, with seamless transitions between AI and human agents, resulting in improved customer experience and operational efficiency.
Spotify - Spotify utilized Meta’s Llama models to enhance music recommendations with contextual explanations and power their AI DJ feature, achieving a 4x increase in user engagement and a 14% improvement in Spotify-specific tasks through domain adaptation, multi-task fine-tuning, and a human-in-the-loop process, scaling to millions of users using vLLM for efficient serving.
Stack Overflow - Stack Overflow is leveraging its massive collection of 60 million Q&A posts and 40 billion tokens of technical data to build a Knowledge as a Service platform, offering real-time API access to curated content and improving LLM accuracy by 20% through fine-tuning, while also integrating semantic search and conversational AI. This approach enables them to enhance developer workflows and maintain community engagement through strategic partnerships with major AI companies.
Stackblitz / Qodo - Stackblitz’s Bolt.new achieved rapid growth with its browser-based AI code generation, leveraging a custom WebContainer OS for efficient in-browser execution and detailed error context for the LLM, while Qodo tackles enterprise code testing and review, supporting diverse deployment options and employing specialized models and sophisticated flow engineering techniques. Both companies demonstrate different approaches to productionizing LLMs, highlighting the importance of context management, error handling, and task decomposition.
Stanford - Stanford researchers developed aCLAr, a system leveraging multimodal foundation models to automate enterprise workflows, with a focus on healthcare. The system uses a “Demonstrate, Execute, Validate” framework, passively learning from user demonstrations, interacting with UIs visually, and incorporating self-monitoring.
Stripe - Stripe built an LLM-powered system to provide support agents with AI-generated response suggestions, moving from initial GPT models to a multi-stage pipeline with fine-tuned models for question validation, topic classification, and answer generation. Challenges in production highlighted the importance of UX, online monitoring, and data quality, demonstrating that a data-centric approach and iterative deployment are crucial for successful LLM implementation in complex domains.
Stripe - Stripe deployed a multi-stage LLM system to improve customer support response times, using fine-tuned models for question filtering, topic classification, and response generation, and found that data quality and online monitoring were critical for success, even more so than model sophistication. They also learned that agent adoption and UX considerations are key to successful production deployments.
Summer Health - Summer Health leveraged GPT-4 to automate pediatric visit note generation, reducing note-writing time by 80% while improving clarity for parents by translating medical terminology. The system prioritizes HIPAA compliance and clinical accuracy, demonstrating the effective use of LLMs to improve both provider efficiency and patient experience in healthcare.
SumUp - SumUp developed an LLM-driven system to automate financial crime report generation, using a novel LLM-based evaluation framework with custom benchmarks and scoring to ensure quality. This approach outperformed traditional NLP metrics and correlated well with human assessments, while also mitigating potential biases through techniques like position swapping and few-shot prompting.
Superhuman - Superhuman’s Ask AI leverages a sophisticated cognitive architecture with parallel processing to enable natural language email and calendar searches, moving beyond traditional keyword matching. This evolution from a basic RAG system to a task-specific tool integration resulted in a 14% reduction in user search time and sub-2-second response times, achieved through techniques like double-dipping prompts and advanced reranking algorithms.
Swiggy - Swiggy, a major food delivery platform in India, improved its hyperlocal food search by implementing a two-stage fine-tuning approach for language models, using unsupervised learning on historical data followed by supervised learning with curated query-item pairs. Leveraging TSDAE and Multiple Negatives Ranking Loss, Swiggy achieved superior search relevance while maintaining a 100ms latency requirement.
Swiggy - Swiggy strategically deployed LLMs across their platform, focusing on catalog enrichment, review summarization, and restaurant partner support, building a middle layer for GenAI integration and carefully selecting models based on use case requirements, including custom models for real-time applications and RAG-based systems for vendor support. Their implementation included A/B testing and robust evaluation frameworks to mitigate issues like hallucination and latency, demonstrating a comprehensive approach to LLMOps with a focus on sustained ROI.
Swiggy - Swiggy has implemented a neural search system using fine-tuned LLMs to enable conversational food and grocery discovery, handling open-ended queries across a 50 million item catalog. They’ve also developed LLM-powered chatbots for customer service, partner support, and a Dineout virtual concierge, showcasing a broad application of generative AI.
The Institute for Ethical AI / Linux Foundation - This case study examines the current landscape of production machine learning and LLMOps, detailing the shift towards data-centric approaches and the evolution of MLOps ecosystems, while also emphasizing the importance of robust monitoring, security, and flexible development lifecycles. It further highlights the growing need for comprehensive data management, metadata tracking, and risk assessment, alongside future considerations like responsible AI and the integration of LLMs.
Therapy Bot - This presentation details best practices for deploying LLMs in high-stakes environments like healthcare, emphasizing the need for robust risk assessment, human oversight, and continuous evaluation, while also providing practical solutions such as structured conversation flows, task decomposition, and prompt engineering. The discussion advocates for a balanced approach that combines LLMs with traditional methods, prioritizing safety and reliability.
There is no company name mentioned in the case study. - A production-ready business analytics assistant was built using a multi-agent system with specialized ChatGPT agents for data engineering and data science, leveraging the ReAct framework, SQL, and Streamlit to address challenges like token limits and complex schema handling. This system demonstrates a practical approach to operationalizing LLMs for structured data analysis, emphasizing modular design and robust error handling.
There is no company name mentioned in this case study. - This case study details the implementation of LLMOps platforms in enterprises, applying DevOps principles to manage the complexities of AI in production. It covers platform infrastructure, team structures, and addresses challenges in testing, developer experience, and governance, emphasizing the need for robust testing, balanced automation, and comprehensive monitoring.
Thomas - Thomas, a workplace assessment company, modernized its legacy system by implementing a generative AI solution using Databricks, leveraging RAG and Vector Search to provide personalized insights from their extensive content database, while maintaining robust security and integrating with platforms like Microsoft Teams. This transformation enabled them to develop a new product, improve user experience, and scale their assessment processing capabilities.
Thomson Reuters - Thomson Reuters developed Open Arena, an internal LLM experimentation platform, in under six weeks using AWS serverless architecture, SageMaker, and Hugging Face containers. This platform allows non-technical users to securely test both open-source and in-house LLMs, combined with company data, using a tile-based interface with chat and document upload capabilities.
Thoughtworks - Thoughtworks built Boba, an AI co-pilot for product strategy, showcasing practical LLMOps patterns like templated prompts, structured JSON responses, and real-time streaming. The application integrates external tools, manages context with vector stores, and emphasizes user experience with feedback mechanisms, providing a detailed look at building production-ready LLM applications.
Thoughtworks - Thoughtworks created Boba, an AI co-pilot for product strategy, demonstrating advanced LLM integration patterns such as templated prompts, structured JSON responses, real-time streaming, and RAG with vector stores, using OpenAI, Langchain, and Vercel AI SDK, highlighting practical implementation details for production-ready LLM applications.
Thumbtack - Thumbtack enhanced its message content moderation by fine-tuning an LLM, achieving a significant AUC improvement to 0.93 after initial prompt engineering attempts failed. Their production system uses a cost-effective two-tier approach, pre-filtering messages with a CNN model before routing suspicious ones to the LLM, resulting in a 3.7x precision and 1.5x recall improvement.
Thumbtack - Thumbtack implemented a company-wide GenAI strategy, focusing on enhancing their consumer product, transforming operations, and boosting employee productivity. They built new infrastructure to support both open-source and external LLMs, with a strong emphasis on PII protection and secure access, and focused on governance, security, and measurable business outcomes.
Tinder - Tinder utilizes fine-tuned open-source LLMs with LoRA for efficient trust and safety operations, enabling them to serve multiple models on a single GPU using the Lorax framework, achieving high recall and precision in detecting various violations, including hate speech and scams. This approach demonstrates superior generalization and resilience against adversarial behavior compared to traditional ML methods, while maintaining cost-effectiveness and scalability.
Titan ML / YLabs / Outer Bounds - A panel of experts from Titan ML, YLabs, and Outer Bounds discussed best practices for deploying LLMs in production, covering topics such as prototyping with API providers, system architecture, and addressing hardware constraints. The discussion emphasized the need for robust evaluation pipelines, user feedback mechanisms, and comprehensive observability to ensure performance, cost-effectiveness, and a positive user experience.
TomTom - TomTom adopted a hub-and-spoke model to implement generative AI across their organization, deploying applications like a ChatGPT location plugin and an in-car AI assistant, achieving 30-60% task performance improvements with a focus on responsible AI and workforce upskilling. They also built internal tools for mapmaking and development, all while maintaining centralized oversight and quality control.
Trace3 - Trace3’s Innovation Team developed Innovation-GPT, a custom LLM-powered solution using a RAG architecture to streamline their technology research and knowledge management, automating data collection and analysis via web scraping, structured data processing, and natural language querying, while maintaining human oversight for quality control. This approach highlights the importance of balancing automation with accuracy in production LLM implementations.
Tradestack - Tradestack built an AI-powered WhatsApp assistant using LangGraph to automate quote generation for the trades industry, reducing quote creation time from hours to under 15 minutes. Their system leverages a multi-agent architecture, handles multimodal inputs, and uses LangSmith for rigorous testing, achieving an 85% end-to-end performance improvement and deploying to a large user base in just 6 weeks.
TrainGRC - TrainGRC implemented a RAG system for cybersecurity research, tackling fragmented knowledge and LLM censorship using Amazon Textract for OCR, custom web scraping, and optimized vector search with various embedding models, focusing on data quality, search optimization, and context chunking. The system addresses complex data processing challenges while considering long-term storage and model migration.
Trigent Software - Trigent Software’s IRGPT project aimed to build a multilingual LLM for Ayurvedic medical consultations, using a fine-tuned GPT-2 model and a diverse dataset; however, challenges in data quality, translation, and cultural nuances led to a pivot towards an English-only prototype, highlighting the complexities of multilingual LLM development in specialized domains. The project underscores the importance of iterative development and high-quality data when applying LLMs to specialized fields like traditional medicine.
Twelve Labs - Jockey, an open-source conversational video agent, leverages LangGraph and Twelve Labs APIs to intelligently process and analyze video content, evolving from a basic LangChain implementation to a more robust LangGraph architecture. This multi-agent system, featuring a Supervisor, Planner, and specialized Workers, enables improved scalability and granular control over video workflows.
Twelve Labs / Databricks - Twelve Labs partnered with Databricks Mosaic AI to create a production-grade system for advanced video understanding, using multimodal embeddings to capture visual, textual, and contextual information; this allows for nuanced search and analysis, leveraging Databricks’ Delta Lake for reliable storage and Mosaic AI for scalable vector search, with a focus on MLOps best practices.
Twilio - Twilio’s Emerging Tech and Innovation team built an AI platform to enhance customer engagement by bridging unstructured communications data with structured customer profiles, using a flexible architecture for rapid model switching. They launched “Twilio Alpha” to manage expectations around early-stage AI products, balancing rapid innovation with enterprise quality through iterative development and a cross-functional team.
Twilio Segment - Twilio Segment implemented an LLM-as-Judge framework to evaluate and improve their CustomerAI feature, which uses LLMs to generate audience queries from natural language, achieving over 90% alignment with human evaluations and a 3x improvement in audience creation time. This robust framework uses a multi-agent architecture and a discrete scoring system to provide a scalable and reliable evaluation pipeline for production LLM systems.
Uber - Uber’s Developer Platform team explored using LLMs for a custom IDE assistant, automated test generation (Auto Cover), and Java-to-Kotlin code migration. They found that hybrid approaches, combining deterministic steps with LLM reasoning, were more effective than pure LLM solutions, leading to significant developer productivity gains while maintaining code quality.
Uber - Uber has developed an enterprise-scale prompt engineering toolkit that manages the full lifecycle of LLM deployment, featuring centralized prompt template management, version control, and robust evaluation frameworks. The toolkit supports rapid experimentation, automated prompt generation, and includes production-grade deployment with safety features, both offline batch processing and online serving capabilities, and comprehensive monitoring of metrics.
Uber - Uber’s DragonCrawl utilizes a small language model to automate mobile app testing, achieving 99%+ stability across 85 cities and blocking 10 high-priority bugs. This AI-powered system, built on an MPNet architecture, significantly reduced maintenance costs and demonstrated human-like problem-solving capabilities, showcasing the effectiveness of focused, small models in complex production environments.
Ubisoft / AI21 Labs - Ubisoft partnered with AI21 Labs to integrate LLMs into their game development pipeline, focusing on automating NPC dialogue generation and creating training data. They implemented a writer-in-the-loop system, using AI21’s models for data augmentation and leveraging a cost-effective token pricing model, which enabled them to scale content production, improve efficiency, and maintain creative control.
Unify - Unify developed an AI agent system for automating sales account qualification, using LangGraph for orchestration and LangSmith for experimentation. Through iterative development, they refined the agent architecture to improve planning, reflection, and execution, while optimizing for speed and user experience, ultimately deploying a system with real-time progress visualization and parallel tool execution.
V7 - V7, a training data platform, explores the challenges of implementing human-in-the-loop LLM systems in production, noting that many implementations remain simplistic and often rely on basic feedback. They highlight the limitations of automation, the difficulties in learning from human feedback, and the gap between LLM capabilities and real-world industry requirements, emphasizing the need for careful system design and a balance between automation and human oversight.
Val Town - Val Town’s experience integrating LLM-powered code assistance into their cloud development platform demonstrates the iterative process of productionizing these features. Starting with basic autocomplete using ChatGPT, they progressed to more sophisticated solutions, including purpose-built models for code completion, Claude integration for improved generation, and innovative error detection systems.
Vannevar Labs - Vannevar Labs moved from using GPT-4 to a fine-tuned Mistral 7B model on Databricks Mosaic AI for defense intelligence sentiment analysis, improving accuracy to 76% and reducing latency by 75%, while also cutting costs and improving multilingual support. The entire LLMOps pipeline, including infrastructure, training, and deployment, was implemented in just two weeks, showcasing the efficiency of a custom model approach.
Vendr / Extend - Vendr partnered with Extend to build a production-scale document processing system using LLMs, combining OCR with LLMs for entity recognition and data extraction, and incorporating human review for quality control; the system uses document embeddings for similarity analysis, enabling efficient review processes and has successfully processed over 100,000 documents.
Verizon / Anthropic / Infosys - Verizon, in collaboration with Anthropic and Infosys, is deploying LLMs for content generation, software development lifecycle enhancements, and co-pilot applications, using a model-agnostic architecture and refined RAG techniques. They emphasize rigorous evaluation, aiming for 95% accuracy in production, while addressing challenges like cost, user adoption, and data management through a center of excellence and change management strategies.
Vespa - Vespa built a production-grade Slackbot using RAG and their search infrastructure to manage a surge in support queries, incorporating semantic search, BM25, and user feedback for ranking. Deployed on GCP, the bot features user consent management, anonymization, and continuous learning to improve response quality.
Vimeo - Vimeo implemented an AI-powered help desk using a RAG architecture, leveraging vector embeddings of their Zendesk content for retrieval and integrating multiple LLMs via Langchain, with a focus on model evaluation and cost optimization. The system showcases a production-ready architecture with considerations for scalability, security, and ongoing maintenance, demonstrating practical LLMOps.
Vinted - Vinted, a major e-commerce platform, migrated its search infrastructure from Elasticsearch to Vespa, consolidating multiple clusters into a single deployment and achieving a 2.5x improvement in search latency and a 3x improvement in indexing latency, while also halving their server count. This migration, completed over a year, involved a phased approach, real-time data processing with Apache Flink, and the development of a custom Vespa Kafka connector, demonstrating significant gains in performance and operational efficiency.
Vira Health - Vira Health built a RAG-based menopause information chatbot using GPT-4, prioritizing safety and accuracy by grounding responses in peer-reviewed medical guidelines, and rigorously evaluated it using both AI and human clinicians. The system, built with vanilla Python for control, demonstrated high faithfulness, relevance, and clinical correctness, showcasing a responsible approach to LLM deployment in healthcare.
Vodafone - Vodafone partnered with Google Cloud to modernize its network operations, migrating to a unified data platform managing over 2 petabytes of data and consolidating over 100 legacy systems. By integrating device-level analytics, AIOps, and GenAI for network investment planning, they’ve improved incident response, network performance monitoring, and aim to reduce OSS tools by 50%.
Voiceflow - Voiceflow, a platform for building chat and voice assistants, implemented a hybrid approach, integrating LLMs via the OpenAI API for generative features while retaining their custom NLU model for intent and entity detection due to its superior performance and cost-effectiveness; they also built an ML Gateway to manage connections to both LLMs and traditional models, and implemented prompt engineering and error handling to address challenges like JSON formatting.
VSL Labs - VSL Labs has developed an automated sign language translation platform that uses generative AI to convert English text and audio into American Sign Language (ASL), addressing the challenges of accessibility for the deaf community by using a two-stage process that includes linguistic processing with in-house and GPT-4 models, and then converting this into 3D animation instructions for realistic avatar-based sign language interpretation. The API-first platform is designed for real-world applications, with a focus on quality assurance, performance, and cultural sensitivity.
W&B - Weights & Biases significantly improved their LLM-powered documentation assistant, Wandbot, by adopting an evaluation-driven refactoring approach, resulting in an 84% latency reduction and a 9% increase in accuracy through systematic testing, a switch to ChromaDB, and RAG pipeline optimizations. This case study highlights the importance of continuous evaluation in LLM system development, demonstrating the benefits of modular design and efficient vector stores.
Walmart - Walmart implemented a semantic caching system using vector embeddings and generative AI to improve e-commerce search, achieving a 50% cache hit rate for tail queries by understanding search intent rather than relying on exact matches, while also addressing challenges like latency and cost in a high-scale production environment. This hybrid approach combines traditional and semantic caching to deliver more relevant results and reduce zero-result searches.
Walmart - Walmart’s Ghotok is a hybrid AI system that combines predictive and generative models to categorize 400 million SKUs, using domain-specific features and chain-of-thought prompting. The system employs a two-stage filtering process and caching to ensure millisecond-level response times, while also incorporating robust exception handling and monitoring for production stability.
Wayfair - Wayfair’s Agent Co-pilot uses LLMs to provide real-time, context-aware chat suggestions to digital sales agents, incorporating product information, policies, and conversation history. This system has achieved a 10% reduction in handle time while maintaining high-quality customer interactions, demonstrating the effectiveness of LLMs in enhancing agent productivity.
Weights & Biases - A Weights & Biases founder developed a voice assistant, similar to Alexa, using open-source LLMs, demonstrating the practical steps from demo to production, including speech recognition with Whisper, local LLM deployment with llama.cpp, and iterative improvements through prompt engineering, model switching, and fine-tuning to achieve 98% accuracy. The project highlights the importance of systematic evaluation, comprehensive experiment tracking, and a balanced approach to model selection and optimization for successful LLM deployments.
Weights & Biases - Weights & Biases built a production-ready, open-source voice assistant using Llama 2 and Mistral models, running on affordable hardware and incorporating Whisper for speech recognition, achieving 98% accuracy through iterative improvements like prompt engineering and fine-tuning with QLoRA. The project underscores the challenges of moving from demo to production with LLMs, highlighting the need for a robust evaluation framework and systematic experimentation.
Weights & Biases - Weights & Biases performed a manual evaluation of their production LLM-powered technical support bot, Wandbot, which uses a RAG architecture across multiple platforms, achieving a baseline accuracy of 66.67%. The study highlights the importance of systematic evaluation, clear metrics, and expert annotation in LLMOps, while also showcasing practical solutions using Argilla.io for annotation management.
Weights & Biases - Weights & Biases utilized an evaluation-driven methodology to refine Wandbot 1.1, creating an automated evaluation framework aligned with human annotations and leveraging GPT-4 for multi-faceted assessments, resulting in improvements to data ingestion, query enhancement, and a hybrid retrieval system. This approach led to significant performance gains, with the latest model demonstrating superior answer correctness, relevancy, and context recall.
Weights & Biases - Weights & Biases evolved their documentation chatbot, Wandbot, from a monolithic architecture to a microservices system with four core modules, enhancing scalability and maintainability. The new architecture includes features like multilingual support, model fallback, and caching, achieving a 66.67% response accuracy and 88.636% query relevancy, while also adding new platform integrations.
WellSky - WellSky, a healthcare technology company processing over 100 million forms annually, partnered with Google Cloud to implement an AI-powered form automation solution. This initiative focused on a responsible AI framework, including mandatory evidence citation and a robust governance structure, to reduce clinician burnout and documentation errors while ensuring patient safety.
Whatnot - Whatnot, a live shopping marketplace, enhanced its trust and safety operations by integrating LLMs with its existing rule-based system, achieving a 95% detection rate for scam attempts with 96% precision. This new system uses a three-phase architecture to detect scams, moderate content, and enforce platform policies by analyzing conversational context and user behavior patterns, while maintaining a human-in-the-loop approach for final decisions and incorporating multimodal processing for analyzing text in images.
Whatnot - Whatnot implemented a GPT-based query expansion system to improve e-commerce search by addressing misspellings and abbreviations, using an offline pipeline for processing and a production cache for low-latency serving, reducing irrelevant content by over 50% for problem queries. This hybrid approach demonstrates a practical method for integrating LLMs into production search systems.
Wordsmith - Wordsmith, an AI legal assistant platform, implemented LangSmith to streamline their LLM operations across the entire product lifecycle, using features like hierarchical tracing and evaluation datasets to tackle challenges in prototyping, debugging, and evaluation. This enabled faster development cycles, efficient debugging, and data-driven experimentation while managing multiple LLM providers.
Wroclaw Medical University / Institute of Mother and Child - Wroclaw Medical University, in collaboration with the Institute of Mother and Child, is developing an AI-powered system using NLP and machine learning to detect sepsis in neonatal care. The system processes real-time data, including unstructured medical records, to identify early symptoms, reducing diagnosis time from 24 hours to 2 hours, while maintaining high sensitivity and specificity.
WSC Sport - WSC Sport utilizes LLMs to automate real-time sports commentary and recaps, processing game data into coherent narratives with synthesized voiceovers, employing techniques like dynamic prompt generation and Chain-of-Thought for fact verification to reduce production time from hours to minutes, while maintaining accuracy and enabling multilingual content.
WVU Medicine / John Snow Labs - WVU Medicine deployed an automated HCC code extraction system using John Snow Labs’ Healthcare NLP, processing radiology notes to identify diagnoses, convert them to CPT codes, and map them to HCC codes, achieving an 18.4% provider acceptance rate on over 27,000 codes processed. The system highlights the importance of model customization, confidence scoring, and production system integration in healthcare LLMOps.
Xcel Energy - Xcel Energy deployed a RAG-based chatbot using Databricks’ Data Intelligence Platform to streamline operations like rate case reviews and legal contract analysis, reducing review times from 6 months to 2 weeks. The production-grade GenAI system leverages Vector Search, MLflow, and Foundation Model APIs, while maintaining strict security and governance for sensitive utility data.
Yahoo Mail - Yahoo Mail implemented a new email content extraction system using Google Cloud’s Vertex AI and LLMs, overcoming limitations of their previous ML-based system. This resulted in improved coverage, reaching 94% for standard domains and 99% for long-tail domains, alongside a 51% increase in extraction richness and a 16% reduction in tracking API errors, while processing billions of daily messages.
YouTube - YouTube employs sophisticated localization and content management systems, likely using LLMs for interface translation, content moderation, and ensuring cultural relevance, with a focus on maintaining high-quality translations and optimizing user experience across diverse regions. The platform’s LLMOps implementation also addresses scalability and compliance with regional requirements.
YouTube - YouTube utilizes a sophisticated multilingual content navigation and localization system, employing LLMs for neural machine translation, content analysis, and automated quality checks, ensuring a high-quality user experience across its global audience. This system includes automated language identification, smart routing based on user preferences, and a robust content management system with version control for different language variants.
zeb - zeb developed SuperInsight, a self-service data analytics platform using generative AI and RAG, built on Databricks, that reduced manual data analyst workload by 80-90%. The system leverages DBRX models, fine-tuning, and vector search to process natural language requests, generating reports, forecasts, and ML models, demonstrating a practical application of LLMs in production.
Zillow - Zillow implemented a multi-faceted system to ensure their real estate LLMs adhere to Fair Housing regulations, combining prompt engineering, stop lists, and a BERT-based classifier to prevent discriminatory responses. This approach validates both user inputs and model outputs, using a curated dataset to achieve high recall in identifying non-compliant content.

Check out the full LLMOps Database and please do let us know if you have an entry you’d like to be included!

LLMOps in Production: 457 Case Studies of What Actually Works

Looking to Get Ahead in MLOps & LLMOps?

Thank you!