Outerbounds / AWS: AWS Trainium & Metaflow: Democratizing Large-Scale ML Training Through Infrastructure Evolution

Overview

This case study documents a collaboration between Outerbounds (maintainers of the open-source Metaflow ML orchestration framework) and AWS’s Annapurna machine learning accelerator team to integrate Metaflow with AWS’s custom ML chips, Trainium and Inferentia. The integration addresses a key challenge in production LLM operations: enabling cost-efficient training and inference of large language models while maintaining the ease of use that data scientists expect from modern MLOps tooling.

The material was presented at an MLOps Community meetup by Eddie from Outerbounds and Scott Perry, a Solutions Architect on AWS’s custom ML accelerator team. Their joint session illustrates how infrastructure-level innovations (custom silicon) can be made accessible through high-level orchestration frameworks.

The Problem Space

Organizations looking to train and deploy large language models face several interconnected challenges. First, there’s the sheer cost of compute—training state-of-the-art models requires significant GPU resources, and GPU availability has been constrained in recent years. Second, there’s the complexity of setting up distributed training infrastructure that can scale from experimentation to production. Third, organizations want their MLOps investments to be robust against the rapid pace of change in the AI landscape, where new model architectures and training techniques emerge constantly.

Eddie framed this using the concept of “pace layering” from Stewart Brand—the idea that complex systems have layers that evolve at different speeds. The infrastructure layer should be stable and robust, while the modeling layer (where trends like new architectures emerge) changes rapidly. The goal is to build an MLOps stack that enables access to commercial AI opportunities while remaining stable as upper layers change.

AWS Custom Silicon: Trainium and Inferentia

Scott Perry provided deep technical context on AWS’s custom ML chips. These are not simply “AWS’s version of a GPU”—they are purpose-built accelerators with architecture specifically designed for deep learning and generative AI workloads.

The Inferentia 2 and Trainium chips share a similar architecture with several key components. Each chip contains two neuron cores, which are the basic addressable units of compute. Within each neuron core, there are specialized engines: a tensor engine powered by a systolic array for matrix operations (the core of deep learning compute), vector engines for operations like batch normalization, scalar engines for activation functions, and general-purpose SIMD processors that can run custom C code for operators that didn’t exist when the chips were designed.
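
To make the role of a systolic array more concrete, the following toy simulation computes a matrix product cycle by cycle in plain Python. It is a conceptual sketch only: the name `systolic_matmul` and the output-stationary dataflow are illustrative assumptions, and nothing here describes how the Neuron tensor engine is actually built.

```python
def systolic_matmul(a, b):
    """Toy output-stationary systolic array: one accumulator per grid cell.

    Operands are assumed to enter the grid skewed, so that cell (i, j)
    receives the pair (a[i][k], b[k][j]) at cycle t = i + j + k.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0] * cols for _ in range(rows)]
    # The last operand pair reaches the far corner at cycle
    # (rows - 1) + (cols - 1) + (inner - 1).
    total_cycles = rows + cols + inner - 2
    for t in range(total_cycles):
        for i in range(rows):
            for j in range(cols):
                k = t - i - j
                if 0 <= k < inner:
                    # Each cell does one multiply-accumulate per cycle.
                    acc[i][j] += a[i][k] * b[k][j]
    return acc
```

The point of the layout is that every cell performs a multiply-accumulate on data handed to it by its neighbours, so operands are reused across the grid rather than re-fetched from memory for every product.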

The chips include 32GB of HBM memory per chip, a Collective Communications Engine that enables overlapping compute and collective operations (critical for distributed training efficiency), and NeuronLink for high-bandwidth, low-latency communication between chips within an instance.

AWS launched Inferentia 1 in 2019, targeting smaller deep learning models with up to 70% lower cost per inference. Inferentia 2 followed in 2023, targeting transformer and diffusion models with up to 40% better price-performance. Trainium launched in 2022 for large-scale distributed training workloads, claiming up to 50% savings on training costs compared to comparable EC2 instances.

The Metaflow Integration

Eddie detailed how Outerbounds integrated Metaflow with AWS Trainium, making these specialized accelerators accessible through familiar MLOps patterns. The integration leverages AWS Batch as the compute backend, with Metaflow handling orchestration, versioning, and workflow management.

The deployment process involves two CloudFormation stacks: one for Metaflow itself (which can be deployed on any cloud or on-premises) and one for the Trainium compute environment. Once deployed, these link together to provide a compute environment ready for distributed training on Trainium devices.

Metaflow’s decorator-based approach allows users to annotate Python functions with resource requirements. A user can specify how many CPUs, how many Trainium/Inferentia devices, and which Docker image to use for dependencies. This declarative paradigm means the same workflow code can dispatch jobs to Kubernetes, AWS Batch, or other compute providers simply by changing configuration.
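
The declarative idea can be sketched in a few lines of standard-library Python. This is not Metaflow's API: the `resources` decorator, the `dispatch` helper, the backend names, and the image name are all hypothetical stand-ins, chosen only to show how resource requirements can travel with a function while the target backend stays a separate configuration choice.

```python
def resources(**requirements):
    """Hypothetical decorator: attach resource needs to the function itself."""
    def decorator(func):
        func.requirements = dict(requirements)
        return func
    return decorator


# The image name below is a placeholder, not a real registry path.
@resources(cpu=8, trainium=1, image="example.com/neuron-training:latest")
def train():
    return "training step ran"


def dispatch(func, backend):
    """Build a job description for the chosen backend (nothing is submitted)."""
    return {"backend": backend, "step": func.__name__, **func.requirements}
```

Calling `dispatch(train, "aws-batch")` and `dispatch(train, "kubernetes")` yields the same requirements with a different target, which is the property the paragraph above describes: the function body and its annotations do not change when the backend does.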

The integration includes monitoring capabilities that wrap the neuron-monitor CLI tool. Users can add a decorator to their functions that runs neuron-monitor at specified intervals, with results displayed as plots in the Metaflow UI showing neuron core utilization over the function’s lifecycle. This addresses a key operational concern: ensuring that expensive accelerator hardware is actually being utilized efficiently.
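
A periodic-sampling decorator of this kind can be sketched with a background thread. The code below is an assumption about the general pattern, not the integration's implementation: `monitor` and `fake_utilization` are hypothetical names, and the sampler is a stand-in for whatever would actually read neuron-monitor output.

```python
import functools
import threading
import time


def monitor(sample_fn, interval_s=1.0):
    """Hypothetical decorator: call sample_fn every interval_s seconds
    while the wrapped function runs, and keep the readings."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            samples = []
            stop = threading.Event()

            def poll():
                # Sample once immediately, then once per interval until stopped.
                while True:
                    samples.append((time.monotonic(), sample_fn()))
                    if stop.wait(interval_s):
                        break

            thread = threading.Thread(target=poll, daemon=True)
            thread.start()
            try:
                return func(*args, **kwargs)
            finally:
                stop.set()
                thread.join()
                wrapper.samples = samples
        wrapper.samples = []
        return wrapper
    return decorator


def fake_utilization():
    """Stand-in sampler; a real one would parse accelerator metrics."""
    return 50.0


@monitor(fake_utilization, interval_s=0.01)
def train_step():
    time.sleep(0.05)
    return "done"
```

After `train_step()` returns, `train_step.samples` holds `(timestamp, value)` pairs that a UI could plot over the function's lifetime.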

The MLOps Stack Architecture

The presentation outlined a conceptual stack for ML infrastructure that remains consistent whether training scikit-learn models or state-of-the-art LLMs:

Data Layer: Foundation for storage including data lakes on S3, warehouse providers like Snowflake, storage for Metaflow metadata, training/evaluation datasets, and model checkpoints. When dealing with Trainium devices, models are stored in slightly different formats (compiled for the Neuron SDK), making robust storage infrastructure important.

Compute Layer: Metaflow connects to different runtimes where data is accessed, with dynamic configuration of resources. The integration allows users to start with a small Trainium instance (trn1.2xlarge) for low-cost testing, then scale to full-size trn1.32xlarge instances for production training, a cost difference of roughly 20x between test and production configurations.
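
A quick worked example shows why the small instance matters for debugging. The rates below are normalized units based only on the rough 20x ratio quoted above, and the hour counts are hypothetical; none of these figures are actual AWS prices.

```python
SMALL_RATE = 1.0   # test configuration, normalized to 1 cost unit per hour
LARGE_RATE = 20.0  # production configuration, using the rough 20x ratio


def plan_cost(debug_hours, train_hours, debug_rate):
    """Cost of debugging at debug_rate, then training at the large rate."""
    return debug_hours * debug_rate + train_hours * LARGE_RATE


# Hypothetical plan: 3 hours of debugging, then a 10-hour training run.
debug_on_small = plan_cost(3, 10, SMALL_RATE)  # 3 + 200 = 203 units
debug_on_large = plan_cost(3, 10, LARGE_RATE)  # 60 + 200 = 260 units
```

Under these assumptions, shaking out bugs on the small instance removes 57 of 260 cost units, and the saving grows with every extra hour spent on failed or exploratory runs.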

Orchestration Layer: Metaflow provides workflow composition, scheduling, and triggering capabilities. Workflows can be deployed to Argo Workflows, AWS Step Functions, or Airflow. The GitHub repository includes examples with configuration files for HuggingFace datasets, hyperparameters, and Neuron SDK parameters for caching and optimization.

Versioning Layer: Every workflow run is versioned by default, which is particularly valuable for expensive Trainium jobs where understanding historical run behavior is critical before relaunching. The integration with GitHub provides code versioning alongside execution versioning.

Deployment Layer: While the repository focuses on training workflows, AWS provides documentation for deploying models trained with Neuron SDK to Inferentia inf2 instances for inference. The same neuron cores (with some differences) power both training and inference chips.

Modeling Layer: This is where the Neuron SDK becomes central. The SDK includes an extensive library of examples covering different model types: encoders, decoders, vision models, multimodal models. The AWS team actively adds examples as new architectures emerge.

Neuron SDK Details

The Neuron SDK is the complete software stack for driving Trainium and Inferentia chips. It includes several components:

Neuron Runtime: A driver and runtime library for loading and executing models on the hardware.

Framework Integration: PyTorch and JAX integration that allows users to keep their existing model code, moving models and tensors onto XLA devices. AWS is a founding member of the OpenXLA initiative, and the Neuron stack uses XLA under the hood.

Compiler: When models run through the framework integration, a compilation process extracts XLA graphs representing the model’s computations and optimizes them for the Trainium/Inferentia hardware. This can be just-in-time or ahead-of-time compilation.
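
The difference between just-in-time and ahead-of-time compilation can be pictured as a question of when a compile cache gets filled. The toy model below is a sketch under that assumption; `CompileCache` and its methods are hypothetical names and do not correspond to any Neuron SDK interface. It also assumes, as is generally the case for XLA-style compilers, that a new input shape needs its own compiled artifact.

```python
class CompileCache:
    """Toy model of a compile cache keyed by (graph name, input shape)."""

    def __init__(self):
        self.artifacts = {}
        self.compile_count = 0

    def _compile(self, name, shape):
        # Stand-in for an expensive compiler invocation.
        self.compile_count += 1
        return f"compiled:{name}:{shape}"

    def get(self, name, shape):
        """JIT path: compile on first use, reuse the artifact afterwards."""
        key = (name, shape)
        if key not in self.artifacts:
            self.artifacts[key] = self._compile(name, shape)
        return self.artifacts[key]

    def precompile(self, name, shapes):
        """AOT path: fill the cache before the training job starts."""
        for shape in shapes:
            self.get(name, shape)
```

With this model, precompiling a shape means the first training step finds the artifact already cached, while an unseen shape still pays the compile cost at run time. That is one reason keeping input shapes stable, and persisting the cache between runs, tends to matter on compiled accelerators.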

Userland Tools: neuron-top provides a graphical view of core utilization and memory usage; neuron-ls shows available cores and which processes are using them; a profiler helps diagnose performance bottlenecks.

Neuron Kernel Interface (NKI): A recently launched capability, similar to OpenAI Triton, that allows users to write lower-level kernels in Python that execute directly on the hardware. This addresses a key concern for users migrating from GPUs who have custom CUDA kernels: NKI provides a path to implement equivalent functionality. Flash attention has been implemented using NKI as an example.

Ecosystem Integration

The solution emphasizes ecosystem compatibility, recognizing that customers have varied stacks and don’t want to change their tooling to adopt new technology. Key integrations include:

Optimum Neuron: A collaborative project with HuggingFace that adapts Transformers, the Trainer API, and SFTTrainer to work with Trainium/Inferentia. This is described as “probably the easiest way to get started” since users can continue using familiar HuggingFace patterns.

Model Hosting: Support for popular model servers including vLLM, HuggingFace TGI, DJL, TorchServe, and Ray Serve.

AWS Services: Trainium/Inferentia instances are available through EC2, ECS, EKS, SageMaker, ParallelCluster, and AWS Batch.

Customer Results

Two customer testimonials were highlighted:

NinjaTech AI: Released AI personal assistants with models trained and deployed on Inferentia/Trainium. They reported up to 80% total cost savings and 50% better energy efficiency compared to previous GPU usage.

Leonardo AI: A visual asset design platform that moved models to Inferentia 2 and saw 80% cost reduction compared to previous GPU usage without sacrificing performance.

These are significant claims, though as with any vendor-provided testimonials, specific workload characteristics and baseline comparisons matter. The cost savings likely depend heavily on the specific model architectures and whether they’re well-optimized for the Neuron SDK.

Operational Considerations

The discussion touched on several practical operational aspects:

Availability: During the integration work, Eddie found that on-demand Trainium instances were consistently available within 10-15 minutes when targeting appropriate regions—a qualitatively different experience from the GPU availability challenges of recent years.

Instance Types: Trainium offers trn1.2xlarge (single chip) for fine-tuning and experimentation, and trn1.32xlarge/trn1n.32xlarge (16 chips, 32 neuron cores) for distributed training. The trn1n variant has twice the networking capability for distributed cases. Inferentia 2 offers four instance sizes from inf2.xlarge to inf2.48xlarge.
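
The per-instance totals follow directly from the per-chip figures given earlier (two neuron cores and 32GB of HBM per chip). A small sketch of that arithmetic, with `instance_totals` as an illustrative helper name:

```python
CORES_PER_CHIP = 2    # neuron cores per chip, as stated above
HBM_GB_PER_CHIP = 32  # HBM per chip, as stated above

# Chip counts per Trainium instance type, from the text.
CHIPS = {
    "trn1.2xlarge": 1,
    "trn1.32xlarge": 16,
    "trn1n.32xlarge": 16,
}


def instance_totals(instance_type):
    """Derive neuron core and HBM totals from the chip count."""
    chips = CHIPS[instance_type]
    return {
        "chips": chips,
        "neuron_cores": chips * CORES_PER_CHIP,
        "hbm_gb": chips * HBM_GB_PER_CHIP,
    }
```

For example, `instance_totals("trn1.32xlarge")` gives 16 chips, 32 neuron cores, and 512 GB of HBM, which matches the core count quoted for the large instances.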

Regional Availability: 23+ regions with more planned.

Kubernetes vs. Serverless: The discussion noted increasing EKS adoption for ML workloads, with the Metaflow integration currently using AWS Batch but with plans for EKS support. The batch-like experience provides an almost serverless feel where compute appears when needed without manual cluster management.

Future Directions

The session concluded with discussion of continued integration work, including potential EKS support for Metaflow with Trainium, and enthusiasm for the NKI capability enabling community contributions of custom kernels. The ability to write custom kernels was identified as addressing one of the historical friction points in migrating GPU-based workloads.

The overall narrative is one of making specialized AI infrastructure accessible through familiar MLOps abstractions, enabling smaller teams to leverage large-scale training capabilities without deep infrastructure expertise.
