Company
LinkedIn
Title
Optimizing GPU Memory Usage in LLM Training with Liger-Kernel
Industry
Tech
Year
2025
Summary (short)
LinkedIn developed Liger-Kernel, a library to optimize GPU performance during LLM training by addressing memory access and per-operation bottlenecks. Using techniques like FlashAttention and operator fusion implemented in Triton, the library achieved a 60% reduction in memory usage, 20% improvement in multi-GPU training throughput, and a 3x reduction in end-to-end training time.
## Summary

LinkedIn, like many large tech companies, relies on large language models (LLMs) to power key platform features such as job matching and personalized content recommendations. Given LinkedIn's massive scale of data, training these models presented significant computational challenges. The company developed an open-source library called Liger-Kernel to optimize GPU utilization during model training, achieving notable improvements in memory usage, throughput, and overall training time. This case study focuses specifically on the training phase of LLM operations rather than inference or deployment, making it a valuable reference for organizations dealing with resource-intensive model training at scale.

## The Problem: Training Bottlenecks at Scale

Training LLMs involves several resource-intensive steps, with pre-training being the most demanding. During pre-training, models ingest large volumes of unstructured data (books, articles, websites), which must be tokenized, converted to embeddings, and then processed through multiple epochs in which the model learns patterns via self-supervised learning and backpropagation.

LinkedIn encountered two primary performance bottlenecks during training:

- **Heavy GPU memory access**: The constant need to transfer data between slower High Bandwidth Memory (HBM), which stores datasets and weights, and faster on-chip shared memory (SRAM), which holds frequently accessed data such as attention scores, created significant latency.
- **Per-operation overhead**: Each individual operation was launched as its own GPU kernel, and every launch incurred extra time and resource overhead that compounded across the many GPUs in a training job.

Even with parallel GPU processing, training on gigabytes of data over numerous epochs could take days or weeks, making optimization critical for LinkedIn's operational efficiency.

## The Solution: Liger-Kernel

To address these bottlenecks, LinkedIn built Liger-Kernel, an open-source Python library designed to optimize GPU performance during LLM training. The library combines several established techniques into a cohesive package tailored to LinkedIn's use case.

### FlashAttention for Memory Optimization

Liger-Kernel builds on FlashAttention, a technique that improves GPU performance by calculating attention scores and partial sums directly in the faster SRAM instead of the slower HBM. This reduces the frequency of data transfers between memory types, addressing the memory access latency issue.

Attention scores are critical for helping models understand contextual relationships in text (for example, determining which entity in a sentence an action refers to). By computing these scores more efficiently in SRAM, FlashAttention reduces the bottleneck caused by constant memory transfers.
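The snippet below is not from the case study; it is a minimal illustration of the same idea using PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`, which dispatches to a FlashAttention-style fused kernel on supported GPUs:

```python
import torch
import torch.nn.functional as F

# Toy shapes: (batch, heads, sequence length, head dimension).
# Half precision on a CUDA GPU lets PyTorch select its FlashAttention-style backend.
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Fused attention: scores and softmax are computed tile by tile in on-chip SRAM,
# so the full (seq_len x seq_len) attention matrix is never materialized in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```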
### Operator Fusion for Reduced Overhead

Beyond FlashAttention, Liger-Kernel uses operator fusion to merge operations that would otherwise be launched as separate GPU kernels into a single kernel. The case study uses RMSNorm (Root Mean Square Normalization) and scaling as examples of operations that can be fused together rather than executed independently.

Traditionally, frameworks like PyTorch use eager execution, where operations run immediately and synchronously, one at a time. While this approach is straightforward, it rules out graph-level optimizations such as fusion and adds per-operation overhead. To address this, LinkedIn leveraged PyTorch's torch.compile feature, which enables Just-In-Time (JIT) compilation. JIT compilation lowers the model's computational graph to machine code, enabling operator fusion and, in some cases, performance improvements of up to 22x over eager execution.

### Triton: Enabling Python on GPUs

Since raw Python code cannot run directly on GPUs, LinkedIn wrote Liger-Kernel using Triton, an open-source domain-specific language and compiler developed at OpenAI. Triton lets developers write custom GPU kernels in Python-based syntax, which the compiler then lowers to optimized low-level GPU code. LinkedIn implemented its fused operators, including operations like RMSNorm, in Triton to take full advantage of GPU capabilities while keeping the code accessible to its Python-oriented engineering team.
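To make the fusion idea concrete, here is a minimal sketch of a hand-written Triton kernel that fuses a scale and an add into one kernel, so the intermediate result never round-trips through HBM. It is illustrative only and is not taken from Liger-Kernel; the function names are invented for the example.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_scale_add_kernel(x_ptr, y_ptr, out_ptr, scale, n_elements,
                           BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one block of elements; the intermediate
    # (x * scale) stays in registers instead of being written back to HBM
    # between what would otherwise be two separate kernel launches.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale + y, mask=mask)


def fused_scale_add(x: torch.Tensor, y: torch.Tensor, scale: float) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = x.numel()
    grid = (triton.cdiv(n_elements, 1024),)
    fused_scale_add_kernel[grid](x, y, out, scale, n_elements, BLOCK_SIZE=1024)
    return out


# Usage (requires a CUDA GPU, since Triton compiles to GPU code):
# x = torch.randn(1_000_000, device="cuda")
# y = torch.randn_like(x)
# z = fused_scale_add(x, y, 0.5)
```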
## Deployment Architecture

The case study provides insight into how Liger-Kernel is deployed in a production training environment. LinkedIn uses a distributed training setup in which multiple GPUs train different parts of the model simultaneously. The described architecture uses Torch Distributed Elastic on AWS and includes:

- A **parameter server** that stores and updates model parameters such as weights
- A **controller** that manages training jobs based on available resources
- **Container images** that include Liger-Kernel, running on pods

When a pod starts, the Liger-Kernel code compiles and optimizes the GPU kernels before training begins. This allows the optimizations to be applied seamlessly within LinkedIn's existing distributed training infrastructure.

## Reported Results

According to the case study, LinkedIn's Liger-Kernel library achieved several notable performance improvements:

- **60% reduction in GPU memory usage**: the headline metric, representing a significant decrease in the resources required for training
- **20% improvement in multi-GPU training throughput**: more efficient utilization of distributed GPU resources
- **3x reduction in end-to-end training time**: faster iteration cycles for model development

The library has been released as open source, allowing other organizations to benefit from these optimizations.

## Critical Assessment

While the reported results are impressive, several considerations are worth noting.

The source article is written by a third party (Better Stack) summarizing LinkedIn's work, which means some technical details may be simplified or abstracted. The specific conditions under which these performance improvements were measured (model size, data volume, GPU types, and so on) are not detailed, making it difficult to assess how the results might translate to other use cases.

The techniques leveraged by Liger-Kernel—FlashAttention, operator fusion, and JIT compilation—are established optimization approaches in the deep learning community rather than novel innovations. LinkedIn's contribution appears to be the curation and integration of these techniques into a cohesive, accessible library tailored for LLM training.

Additionally, the focus on training optimization rather than inference is notable. Many LLMOps case studies focus on inference efficiency for production serving, whereas this case study addresses the upstream challenge of model development efficiency. That makes it particularly relevant for organizations that frequently train or fine-tune their own models rather than relying solely on pre-trained models.

## Technical Stack and Tools

The case study references several key technologies in LinkedIn's training infrastructure:

- **PyTorch**: the deep learning framework used for model development
- **torch.compile**: PyTorch's JIT compilation feature enabling operator fusion
- **Triton**: OpenAI's domain-specific language for writing custom GPU kernels
- **FlashAttention**: memory-efficient attention computation
- **AWS**: cloud infrastructure for distributed training
- **Torch Distributed Elastic**: fault-tolerant distributed training across multiple GPUs

## Relevance to LLMOps

This case study is primarily relevant to the training and development phase of LLMOps rather than inference or deployment. For organizations operating at scale with significant LLM training workloads, the approaches described here offer practical guidance on reducing computational costs and accelerating development cycles.

Because Liger-Kernel is open source, organizations can evaluate and potentially adopt these optimizations without building similar capabilities from scratch. However, the benefits will be most pronounced for teams with substantial training workloads and the engineering expertise to integrate such tooling into their existing infrastructure.
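For teams evaluating the library, its public README describes one-line patching helpers for popular Hugging Face model families. The sketch below assumes that interface; the exact function names, the checkpoint, and availability may vary by version, and none of it is taken from the case study itself.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed API from the Liger-Kernel README: a patching helper that swaps a
# model family's default ops (e.g., RMSNorm) for the library's fused Triton
# kernels. Treat the exact name as an assumption; check the installed version.
from liger_kernel.transformers import apply_liger_kernel_to_llama

apply_liger_kernel_to_llama()  # patch Hugging Face Llama modules in place

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",   # hypothetical checkpoint, for illustration only
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Training then proceeds with the usual PyTorch / Trainer loop; the fused
# kernels are used transparently wherever the patched modules run.
```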
