MosaicML developed and open-sourced MPT, a family of large language models including 7B and 30B parameter versions, demonstrating that high-quality LLMs could be trained for significantly lower costs than commonly believed (under $250,000 for the 7B model). They built a complete training platform handling data processing, distributed training, and model deployment at scale, while documenting key lessons around planning, experimentation, data quality, and operational best practices for production LLM development.
This case study comes from a conference presentation by a member of MosaicML (now acquired by Databricks), sharing lessons learned from training their open-source MPT (MosaicML Pretrained Transformer) models. The presenter leads the ML Runtime team responsible for developing the core training infrastructure. The talk provides a practitioner’s perspective on the challenges and best practices of training large language models at scale, with specific focus on operational aspects that are critical for production LLM development.
MosaicML’s motivation for training MPT was threefold: to provide the community with high-quality open-source models with commercially permissive licenses, to demonstrate different LLM capabilities (chat, long context, code generation), and to battle-test their training platform. The case study reveals that they successfully trained MPT-7B for approximately $250,000 on pre-training alone (plus a few thousand dollars for fine-tuning), significantly lower than the $500K-$1M+ that most practitioners expected based on an informal social media poll they conducted.
The presentation opens by framing why organizations would want to train their own LLMs rather than using existing APIs or open-source models. Four key drivers are identified:
Privacy: Organizations with sensitive business data may not want to expose even prompts to third-party APIs. The presenter notes well-documented breach cases that reinforce this concern.
Quality: Sometimes the required capability simply doesn’t exist in available models. The example given is specialized technical documentation—like a user manual for the Large Hadron Collider—where general-purpose models like ChatGPT would be inadequate.
Cost: Training a smaller, specialized model can be more economical than paying inference costs for large third-party models. A 3B parameter model hosted at a fraction of the cost may serve the use case adequately.
Latency: Production applications often have strict latency requirements and need to handle high volumes of users, which may be incompatible with third-party API limitations.
The presentation references four customer case studies to illustrate these points: Metadialogue (Arabic chatbot), Replit (code generation, starting from a 3B MPT model), Patronus (PII detection for data leak prevention), and Hyperwrite (writing assistant with auto-translation capabilities).
The MPT models are decoder-only transformer models following the GPT architecture. The presentation explains the autoregressive generation process where the model predicts the next token, which is then appended to the input for the next prediction cycle, continuing until a stop condition is reached.
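That generation loop can be sketched in a few lines. This is a minimal illustration, not MosaicML's code; the `next_token` callable is a toy stand-in for a real model's forward pass plus token selection:

```python
def generate(next_token, prompt, stop_token, max_new_tokens=32):
    """Greedy autoregressive decoding: repeatedly predict the next
    token and append it to the context until a stop condition is hit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)   # model predicts from the full context
        if tok == stop_token:      # stop condition reached
            break
        tokens.append(tok)         # feed the prediction back as input
    return tokens

# Toy "model": predicts last token + 1, so generation stops at 5.
out = generate(lambda ts: ts[-1] + 1, prompt=[1], stop_token=5)
# out == [1, 2, 3, 4]
```

Real decoders sample from a probability distribution over the vocabulary rather than using a deterministic rule, but the append-and-repeat structure is the same.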
The training pipeline offers several paths: pre-training from scratch with random weights, fine-tuning an existing model, and alignment via RLHF.
For MPT, MosaicML chose to start from scratch with their own optimized architecture, using randomized weights. They pre-trained and fine-tuned the model but did not apply RLHF. The result was two primary variants: MPT-7B and MPT-30B, with additional specialized versions including 8K and 64K context length models, and a chat-tuned version using the Alpaca dataset (though this version is not commercially licensable due to the ChatGPT-generated training data).
The presentation emphasizes that rigorous planning is essential before committing millions of dollars to training runs. Key planning activities include:
Defining the problem clearly: The presenter stresses questioning whether an LLM is even necessary. If a smaller model or alternative approach can solve the problem, that’s preferable given the complexity of LLM training.
Establishing evaluation criteria: Using the example of building a coding assistant, relevant evaluations would include HumanEval (a code generation benchmark), in-context learning evaluations, and custom evaluations for specific capabilities. The presenter notes that evaluation is a topic worthy of an entire conference on its own.
Vibe checks: Beyond formal metrics, MosaicML uses conversational testing with fun prompts like “If you were an AI model, what would you name yourself?” and the “banana bread test” where they actually cook recipes generated by the model and evaluate the results.
Performance KPIs: Understanding cost and latency requirements in the context of the specific problem being solved.
A significant portion of the presentation covers compute budgeting using Chinchilla scaling laws. These laws provide equations to estimate the amount of data needed to reach a target quality level at minimum training cost. For example, a 30B parameter model would need approximately 600 billion tokens using the Chinchilla-optimal approach.
However, the presenter offers an important caveat: Chinchilla scaling laws only optimize for training compute, not inference. A 175B parameter model might achieve similar loss to a 30B model with fewer tokens, but hosting the larger model is significantly more expensive. MosaicML learned this lesson the hard way initially, training expensive-to-host models before considering inference costs.
The practical recommendation is often to train smaller models on considerably more data than Chinchilla-optimal, achieving better accuracy at lower total cost of ownership when considering both training and inference.
For concrete estimates, training a 30B parameter model on 600 billion tokens requires approximately 3,000 H100-days, or around $500,000 at $7 per H100-hour. The presenter notes this estimate is accurate to within 10-15% and doesn't include the smaller experimental runs that precede the main training run.
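These figures can be roughly reproduced with the standard ~6·N·D FLOPs approximation and the ~20 tokens-per-parameter Chinchilla rule. The effective per-GPU throughput below (4e14 FLOP/s, i.e. already discounted for utilization) is an illustrative assumption, not a number from the talk:

```python
def chinchilla_tokens(params):
    """Chinchilla-optimal data budget: roughly 20 tokens per parameter."""
    return 20 * params

def training_cost(params, tokens, flops_per_gpu_s=4.0e14, usd_per_gpu_hr=7.0):
    """Estimate GPU-days and dollar cost via FLOPs ~ 6 * N * D.
    flops_per_gpu_s is an assumed effective (utilization-adjusted)
    H100 throughput, not a spec-sheet peak."""
    flops = 6 * params * tokens
    gpu_days = flops / (flops_per_gpu_s * 86_400)
    cost_usd = gpu_days * 24 * usd_per_gpu_hr
    return gpu_days, cost_usd

tokens = chinchilla_tokens(30e9)            # 600B tokens for a 30B model
gpu_days, cost = training_cost(30e9, tokens)
# ~3,100 H100-days and ~$525K, in line with the talk's
# "~3,000 H100-days, ~$500K" estimate.
```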
The presentation emphasizes that data preparation often takes more time than actual training. Key data processing activities include:
Data cleaning: Removing duplicates, filtering toxicity, stripping PII, fixing encoding issues. Training on poorly encoded data wastes compute and can corrupt the tokenizer.
Deduplication: Internet crawl data contains massive redundancy. Removing duplicates improves model quality and reduces token count, directly reducing training cost.
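As a simple illustration of the idea (not MosaicML's pipeline), exact deduplication can be done by hashing normalized text; production systems typically layer fuzzy matching such as MinHash on top to catch near-duplicates:

```python
import hashlib

def dedup(docs):
    """Exact deduplication: drop documents whose normalized text
    hashes to a value already seen."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello world", "hello world  ", "Something else"]
# dedup(docs) keeps 2 of the 3 documents: the second is a
# whitespace/case variant of the first.
```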
Pre-tokenization and concatenation: Rather than tokenizing on-the-fly during training (which wastes compute), MosaicML pre-tokenizes all datasets. Additionally, they concatenate short documents to fill the context window rather than padding, ensuring every compute cycle contributes to learning.
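A toy sketch of the concatenation ("packing") step, assuming documents are already tokenized to integer IDs and joined with a separator token (the separator convention here is an assumption for illustration):

```python
def pack(token_docs, context_len, sep_token=0):
    """Concatenate short tokenized documents into full-length training
    sequences instead of padding each one, so no compute is spent on
    padding tokens."""
    stream = []
    for doc in token_docs:
        stream.extend(doc + [sep_token])   # separator marks doc boundary
    # Slice the stream into fixed-length sequences; the tail remainder
    # that cannot fill a full context window is dropped.
    return [stream[i:i + context_len]
            for i in range(0, len(stream) - context_len + 1, context_len)]

seqs = pack([[1, 2], [3, 4, 5], [6]], context_len=4)
# seqs == [[1, 2, 0, 3], [4, 5, 0, 6]]
```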
Shuffling: The presenter cannot emphasize enough the importance of shuffle quality at scale. Each training batch should be a representative sample of the full dataset distribution. Poor shuffling leads to what they call “wavy boys”—noisy, spiky loss curves that indicate suboptimal training.
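The simplest form of a high-quality shuffle is a seeded global permutation of sample indices before batching. At MPT scale this has to work over sharded, pre-tokenized data on disk, but the principle is the same; this sketch is illustrative, not MosaicML's streaming implementation:

```python
import random

def shuffled_batches(num_samples, batch_size, seed=17):
    """Global shuffle: permute indices over the entire dataset before
    batching, so each batch approximates the full data distribution
    rather than reflecting one source's on-disk ordering."""
    idx = list(range(num_samples))
    random.Random(seed).shuffle(idx)   # seeded for reproducibility
    return [idx[i:i + batch_size] for i in range(0, num_samples, batch_size)]

batches = shuffled_batches(1_000, batch_size=10)
# Each batch mixes indices from across the whole dataset, and the
# union of all batches is exactly the dataset.
```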
An important operational lesson shared is that MosaicML initially underinvested in data infrastructure. They started with two beefy machines running custom distributed Python code, taking days or weeks for basic tasks. After the Databricks acquisition, leveraging Apache Spark reduced these operations to hours. The recommendation is to invest in scalable data exploration and ETL architecture from the start.
Scaling from one GPU to thousands requires careful infrastructure design. The math is straightforward: 3,000 GPU-days becomes 6 days on 512 GPUs or 3 days on 1,024 GPUs. However, achieving this requires strong linear scaling.
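The scaling arithmetic, with an optional efficiency factor to model sublinear scaling (the factor is an illustrative knob, not a number from the talk):

```python
def wallclock_days(gpu_days, num_gpus, scaling_eff=1.0):
    """Ideal wall-clock time under linear scaling; scaling_eff < 1
    models the real-world penalty when scaling is sublinear."""
    return gpu_days / (num_gpus * scaling_eff)

wallclock_days(3000, 512)                    # ~5.9 days
wallclock_days(3000, 1024)                   # ~2.9 days
wallclock_days(3000, 1024, scaling_eff=0.9)  # ~3.3 days if scaling degrades
```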
MosaicML uses FSDP (Fully Sharded Data Parallel), a relatively simple parallelism strategy built into PyTorch, based on the sharded-training approach pioneered by Microsoft's DeepSpeed (ZeRO). This scales well with fast interconnects and avoids the complexity of more exotic parallelism approaches.
They productized this training infrastructure as the MosaicML platform, covering orchestration, checkpointing, and logging at scale.
Hardware failures are inevitable at scale. The rule of thumb shared is approximately one failure per 1,000 GPU-days. These failures are typically disruptive—jobs die completely. Common causes are CUDA errors and hardware issues (bad GPUs, network interconnect problems).
The presenter contrasts their MPT-7B training log (clean, with simple entries for hardware failure, checkpoint, resume, continue) with the publicly available OPT training log (which documents significant operational challenges). Their system handles failures with automatic checkpointing and resumption, typically losing only 20-30 minutes of progress.
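Combining the two rules of thumb from the talk (roughly one failure per 1,000 GPU-days, and 20-30 minutes lost per checkpoint/resume cycle) gives a quick overhead estimate:

```python
def failure_overhead(gpu_days, mtbf_gpu_days=1000, resume_cost_min=25):
    """Expected disruptive failures (~1 per 1,000 GPU-days) and the
    wall-clock minutes lost to checkpoint/resume (~20-30 min each;
    25 is taken as a midpoint)."""
    failures = gpu_days / mtbf_gpu_days
    return failures, failures * resume_cost_min

fails, lost_min = failure_overhead(3000)
# ~3 failures and ~75 minutes of lost progress over a 3,000 GPU-day
# run, which is why automatic resumption matters far more than any
# single failure.
```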
The recommendation is to either choose a platform that handles these challenges or assemble best-of-breed components yourself. The key capabilities needed are: compatible software stacks (correct PyTorch/CUDA versions), scheduling and orchestration, model checkpointing, and comprehensive logging.
A recurring theme is the importance of experimentation and skepticism. The presenter explicitly says “don’t trust anyone”—including conference speakers, published literature, and your own intuition. Things that work in one context may not work in another due to different data distributions or model architectures.
The practical advice is to verify everything empirically: run small-scale experiments to validate techniques on your own data and architecture before committing them to an expensive full-scale run.
The final lesson draws on the presenter’s background in chip design, where tape-outs cost tens of millions of dollars and require rigorous risk management. Similarly, expensive LLM training runs require:
Pre-flight checklists: Define everything that must be verified before launching a $500K-$5M training run. The presenter mentions they may publish a template blog post.
Designated on-call: Technology fails, and someone needs to be available to pull in the right people when it does.
Training logs: Maintain detailed logs and learn from them. If you’re going to repeat this process, build the institutional knowledge to do it efficiently.
The MPT-7B model held its own against Falcon and Llama v1 at the time of release. MPT-30B performed comparably or better on certain dimensions than 40B parameter models like Falcon. The presenter acknowledges that better models have since been released (Mistral, Llama v2), but this speaks to the rapid pace of the field rather than any shortcoming in their approach.
The broader impact is demonstrating that high-quality LLM training is accessible to organizations beyond the largest tech companies, provided they invest appropriately in infrastructure, data preparation, and operational processes.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
The Institute of Science Tokyo successfully developed Llama 3.3 Swallow, a 70-billion-parameter large language model with enhanced Japanese capabilities, using Amazon SageMaker HyperPod infrastructure. The project involved continual pre-training from Meta's Llama 3.3 70B model using 314 billion tokens of primarily Japanese training data over 16 days across 256 H100 GPUs. The resulting model demonstrates superior performance compared to GPT-4o-mini and other leading models on Japanese language benchmarks, showcasing effective distributed training techniques including 4D parallelism, asynchronous checkpointing, and comprehensive monitoring systems that enabled efficient large-scale model training in production.
Predibase, a fine-tuning and model serving platform, announced its acquisition by Rubrik, a data security and governance company, with the goal of combining Predibase's generative AI capabilities with Rubrik's secure data infrastructure. The integration aims to address the critical challenge that over 50% of AI pilots never reach production due to issues with security, model quality, latency, and cost. By combining Predibase's post-training and inference capabilities with Rubrik's data security posture management, the merged platform seeks to provide an end-to-end solution that enables enterprises to deploy generative AI applications securely and efficiently at scale.