ZenML

Cost Optimization and Performance Panel Discussion: Strategies for Running LLMs in Production

Various 2023

A panel discussion featuring experts from Neeva, Intercom, PromptLayer, and OctoML on strategies for optimizing cost and performance when running LLMs in production. The panel explores approaches ranging from API services to running models in-house, covering model compression, hardware selection, latency optimization, and monitoring techniques. Key insights include the trade-offs between API usage and in-house deployment, strategies for cost reduction, and methods for performance optimization.

Industry: Tech

Overview

This panel discussion, moderated by Lina (an ML engineer), brings together four practitioners working with LLMs in production across different domains. The panelists include Daniel, a research scientist at Neeva (an ad-free private search solution using LLMs for summarization, semantic retrieval, and query generation); Mario, a staff engineer at Intercom who helped ship one of the early GPT-4 powered customer service chatbots; Jared, co-founder of PromptLayer (a platform for managing production LLM applications with prompt versioning, monitoring, and performance tracking); and Luis, co-founder of OctoML (offering automated model deployment and optimization across different hardware). The discussion centers on the practical challenges of cost optimization, latency reduction, and monitoring for LLM applications in production.

Cost Considerations and Optimization Strategies

The panel opened with a frank discussion about the economic realities of running LLMs at scale. Daniel from Neeva explained that while foundation model APIs (like OpenAI's) are excellent for rapid prototyping and product validation, they become prohibitively expensive at scale. He gave a concrete example: running summarization over billions of documents in a web index is simply impossible at API prices, necessitating a move to smaller, self-hosted models.

Daniel outlined the hardware cost thresholds that guide Neeva's decisions. Their golden threshold is whether a model can run on a single A10 GPU rather than requiring an A100. On AWS, an eight-A100 node costs around $40/hour, while an A10 spot instance can be obtained for just $0.30/hour—more than a 100x difference in hourly cost. Moving to CPU inference changes the economics even further, enabling processing of billions of items cost-effectively. For their semantic search use case, Neeva runs small query encoders (30-100 million parameters) directly on CPU, co-located with the retrieval system, which dramatically simplifies the production architecture.
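
To see why that threshold matters, the quoted prices can be turned into a back-of-envelope calculation. The hourly rates are the panel's 2023-era figures; the throughput number is an invented placeholder, not a figure from the talk:

```python
# Back-of-envelope GPU cost comparison using the panel's example figures.
A100_NODE_HOURLY = 40.00             # 8x A100 instance on AWS (panel figure)
A10_SPOT_HOURLY = 0.30               # single A10 spot instance (panel figure)

ratio = A100_NODE_HOURLY / A10_SPOT_HOURLY
print(f"8x A100 node vs. single A10 spot: {ratio:.0f}x")  # 133x

def batch_cost(items, items_per_hour, hourly_rate):
    """Cost of processing `items` at a given throughput on one instance."""
    return items / items_per_hour * hourly_rate

# If a small summarizer clears 50k docs/hour on an A10 spot instance
# (a hypothetical throughput), a billion documents cost on the order of:
print(f"${batch_cost(1e9, 50_000, A10_SPOT_HOURLY):,.0f}")  # $6,000
```

The same billion-document job priced per-token through an API would run orders of magnitude higher, which is the gap Daniel describes.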

The panelists' top cost optimization techniques echoed these themes: move from large API-served models to smaller self-hosted ones once scale justifies it, target cheaper hardware tiers (A10 or CPU rather than A100), and optimize the serving stack for the specific workload.

An important counterpoint was raised regarding team costs. Daniel shared an anecdote about a fintech company that needed business classification. Building an in-house ML system would require hiring ML engineers and building infrastructure, costing $500-600K/year at minimum. For their use case (only 10,000 examples per day), using an API without an ML team was actually more cost-efficient, even if the model was only 70% accurate rather than 99%. The cost calculus shifts dramatically with scale.
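
Daniel's break-even logic can be sketched numerically. The team cost below is the midpoint of the range he quoted; the per-call API price is an assumed placeholder, not a figure from the panel:

```python
# Rough break-even sketch: API bill vs. an in-house ML team.
TEAM_COST_PER_YEAR = 550_000   # midpoint of the $500-600K/yr quoted on the panel
API_COST_PER_CALL = 0.002      # assumed price for a short classification prompt

def annual_api_cost(calls_per_day):
    """Yearly API spend at a steady daily call volume."""
    return calls_per_day * 365 * API_COST_PER_CALL

# At the anecdote's 10,000 examples/day, the API is far cheaper than a team:
print(f"${annual_api_cost(10_000):,.0f}/yr")  # $7,300/yr

# Volume at which the API bill alone matches the team cost:
break_even = TEAM_COST_PER_YEAR / (365 * API_COST_PER_CALL)
print(f"{break_even:,.0f} calls/day")  # roughly 750K calls/day
```

Under these assumptions the in-house option only starts to pay off at volumes orders of magnitude above the anecdote's 10,000 calls/day, which is exactly the "cost calculus shifts with scale" point.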

Reliability and Control

A major factor driving the move to self-hosted models was API reliability. Daniel mentioned that while OpenAI advertises 99% uptime, their actual experience was significantly worse, with frequent outages of 15+ minutes that left them unable to serve users. Mario confirmed similar experiences at Intercom. This unreliability was “the impetus to bring this in-house” for Neeva.

Beyond reliability, self-hosting provides control over latency optimization. When using external APIs, improvements require waiting for the provider. When self-hosted, the team can optimize every aspect of the serving stack. Daniel noted they achieved 400ms response times for 20B parameter models compared to 3+ seconds from external APIs—not because of any algorithmic improvement, but simply by eliminating network round-trips and external orchestration overhead.

Latency Optimization

The panel provided detailed technical strategies for reducing latency in production systems.

Daniel walked through Neeva's summarization latency optimization journey. They started with an external LLM taking about 3 seconds per item (30 seconds for a batch of 10). Moving to a self-hosted T5-large model on an A10 still took 8 seconds when served naively; a series of serving-stack optimizations closed the remaining gap.

Luis from OctoML emphasized the importance of using optimized kernels and binaries specific to your hardware target. The best serving libraries (TensorRT, TVM, etc.) vary by hardware, and automating this selection can provide significant gains without any model modifications. OctoML’s platform helps automate this hardware-specific optimization.

For those still using external APIs, Mario pointed out that the constraints are much tighter, since the serving stack itself cannot be modified and teams are limited to what they can change on their side of the API call.

The panel discussed an important insight about batch sizing: when doing greedy token-by-token decoding, all sequences in a batch are limited by the longest sequence due to padding. Research from the University of Washington found that in machine translation, up to 70% of tokens in batches were effectively useless padding. This means high variability in output length severely impacts batching efficiency, and smaller batch sizes can paradoxically improve overall throughput for variable-length generation tasks.
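
The padding effect is easy to quantify. A minimal sketch, with batch lengths invented for illustration:

```python
# Fraction of wasted decode slots when batching variable-length generations.
# With greedy decoding, every sequence in a batch runs until the longest one
# finishes, so short outputs pay for the long tail.
def padding_fraction(output_lengths):
    """Share of decoded token slots that are padding in one batch."""
    longest = max(output_lengths)
    total_slots = longest * len(output_lengths)
    useful = sum(output_lengths)
    return 1 - useful / total_slots

# A batch where one outlier generates far more tokens than the rest:
lengths = [20, 25, 30, 200]
print(f"{padding_fraction(lengths):.0%} of slots are padding")  # 66%

# Splitting the outlier into its own batch recovers most of the waste:
print(f"{padding_fraction([20, 25, 30]):.0%}")  # 17%
```

This is why bucketing requests by expected output length, or shrinking the batch, can raise effective throughput even though the nominal batch size goes down.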

Monitoring and Evaluation

The panel identified three main approaches to monitoring LLM quality in production.

Jared from PromptLayer noted their platform helps users understand which prompts are expensive and identify failure cases in production. Their thesis is that rather than trying to understand which prompts are "good," it's more effective to focus on identifying what's failing—which users are getting bad results, where the chatbot is being rude, and so on.

Daniel shared a practical monitoring insight: tail latencies can reveal important edge cases. They discovered massive latency spikes caused by unexpectedly long documents hitting their sliding window—the fix was simply truncating after 10,000 tokens. Monitoring outliers often reveals simple fixes for significant problems.
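
Both practices, tail-latency monitoring and input truncation, are simple to sketch. The whitespace "tokenizer" below is a stand-in for a real one, and the percentile function is an approximate nearest-rank version; the 10,000-token cap matches the fix Daniel describes:

```python
MAX_TOKENS = 10_000  # Neeva's truncation point for pathological documents

def truncate(text, max_tokens=MAX_TOKENS):
    # Whitespace split as a stand-in for the model's real tokenizer.
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

def p99(latencies_ms):
    # Approximate nearest-rank 99th-percentile latency.
    ordered = sorted(latencies_ms)
    idx = min(int(0.99 * len(ordered)), len(ordered) - 1)
    return ordered[idx]

# One oversized document shows up as a tail spike the mean barely registers:
latencies = [400] * 99 + [9_000]        # ms; a single outlier request
print(p99(latencies))                   # 9000
print(sum(latencies) / len(latencies))  # 486.0 -- looks healthy
```

The point of the sketch: a dashboard tracking only average latency would show ~486ms and hide the 9-second request entirely, while the p99 surfaces exactly the kind of outlier whose root cause turned out to be a one-line truncation fix.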

Trade-offs and Practical Guidance

The panel offered balanced advice for teams at different stages.

The overall message was pragmatic: understand your use case, start simple with powerful but expensive models, measure what matters, and optimize deliberately as you scale. The field is nascent, tooling is still being built, and best practices continue to evolve.
