LinkedIn developed Liger-Kernel, a library to optimize GPU performance during LLM training by addressing memory access and per-operation bottlenecks. Using techniques like FlashAttention and operator fusion implemented in Triton, the library achieved a 60% reduction in memory usage, 20% improvement in multi-GPU training throughput, and a 3x reduction in end-to-end training time.
LinkedIn, like many large tech companies, relies on large language models (LLMs) to power key platform features such as job matching and personalized content recommendations. Given LinkedIn’s massive scale of data, training these models presented significant computational challenges. The company developed an open-source library called Liger-Kernel to optimize GPU utilization during model training, achieving notable improvements in memory usage, throughput, and overall training time.
This case study focuses specifically on the training phase of LLM operations rather than inference or deployment, making it a valuable reference for organizations dealing with resource-intensive model training at scale.
Training LLMs involves several resource-intensive steps, with pre-training being the most demanding. During pre-training, models ingest large volumes of unstructured data (books, articles, websites) which must be tokenized, converted to embeddings, and then processed through multiple epochs where the model learns patterns through self-supervised learning and backpropagation.
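To make the self-supervised step concrete, here is a minimal sketch of a single next-token-prediction training step in PyTorch. The model, optimizer, and batch shapes are illustrative assumptions, not anything specific to LinkedIn's pipeline.

```python
# Minimal sketch of one self-supervised pre-training step (next-token prediction).
# `model` is assumed to map token IDs to per-position vocabulary logits.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, token_ids):
    """token_ids: LongTensor of shape (batch, seq_len) produced by a tokenizer."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]      # predict the next token
    logits = model(inputs)                                      # (batch, seq_len - 1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    loss.backward()                                             # backpropagation
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```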
LinkedIn encountered two primary performance bottlenecks during training: memory access latency, caused by repeatedly moving data between the GPU's fast on-chip memory and its larger but slower main memory, and per-operation overhead, from executing many small operations one at a time. Even with parallel GPU processing, training on gigabytes of data through numerous epochs could take days or weeks, making optimization critical for LinkedIn’s operational efficiency.
To address these bottlenecks, LinkedIn built Liger-Kernel, an open-source Python library designed to optimize GPU performance during LLM training. The library combines several established techniques into a cohesive package specifically designed for their use case.
Liger-Kernel is built upon FlashAttention, a technique that improves GPU performance by calculating attention scores and partial sums directly in the GPU’s fast on-chip SRAM instead of the slower high-bandwidth memory (HBM). This reduces the frequency of data transfers between the two memory tiers, addressing the memory access latency issue.
Attention scores are critical for helping models understand contextual relationships in text—for example, determining which entity in a sentence an action refers to. By computing these scores more efficiently in SRAM, FlashAttention reduces the bottleneck caused by constant memory transfers.
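As an illustration of the idea rather than Liger-Kernel's internal code, PyTorch's fused scaled-dot-product attention can be restricted to a FlashAttention backend, which keeps attention tiles in on-chip SRAM instead of materializing the full score matrix in HBM. The shapes and the `sdpa_kernel` context manager (available in PyTorch 2.3+) are assumptions for this sketch.

```python
# Hedged sketch: PyTorch's fused attention dispatching to a FlashAttention-style kernel.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # PyTorch >= 2.3

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict dispatch to the FlashAttention backend: attention scores and partial softmax
# sums stay in on-chip SRAM tiles rather than being written out to HBM.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```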
Beyond FlashAttention, Liger-Kernel implements operator fusion, merging multiple operations that would normally launch as separate GPU kernels into a single kernel. The case study uses RMSNorm (Root Mean Square Normalization) and scaling as examples of operations that can be fused together rather than executed independently.
Traditionally, frameworks like PyTorch use eager execution, where operations run immediately and synchronously, one at a time. While this approach is straightforward, it rules out cross-operation optimization and adds per-operation launch overhead. To address this, LinkedIn leveraged PyTorch’s torch.compile feature, which enables Just-In-Time (JIT) compilation: the model’s computational graph is compiled to machine code, enabling operator fusion and, in some cases, speedups of up to 22x over eager execution.
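A minimal sketch of how this looks in practice, assuming a toy module that performs the RMSNorm-then-scale pattern mentioned above: `torch.compile` traces the forward pass and can fuse the adjacent element-wise operations into fewer GPU kernels.

```python
# Hedged sketch of JIT compilation with torch.compile. The module below is a toy
# stand-in, not Liger-Kernel's implementation of RMSNorm.
import torch

class NormAndScale(torch.nn.Module):
    """RMS normalization followed by a learned scale: two ops under eager execution."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return (x * rms) * self.weight   # normalization and scaling, fusable by the compiler

module = NormAndScale(4096).cuda()
compiled = torch.compile(module)         # JIT-compiles the graph on first call
y = compiled(torch.randn(8, 4096, device="cuda"))
```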
Since raw Python code cannot run directly on GPUs, LinkedIn wrote Liger-Kernel using Triton, an open-source domain-specific language and compiler developed and maintained at OpenAI. Triton lets developers write custom GPU kernels in Python-based syntax, which the compiler lowers to optimized low-level GPU code.
LinkedIn wrote their operator fusion implementations in Triton, along with operations like RMSNorm, to take full advantage of GPU capabilities while maintaining Python’s accessibility for their engineering team.
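For a flavor of what such a kernel looks like, below is a simplified, forward-only RMSNorm written in Triton. It is a sketch in the style of Liger-Kernel's kernels, not the library's actual implementation, and assumes a contiguous, float32, 2-D input.

```python
# Hedged sketch: fused RMSNorm forward pass in Triton (Python-like syntax compiled to GPU code).
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_fwd_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    # One program instance handles one row of the (n_rows, n_cols) input.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    # Normalization and scaling are fused in one kernel: no intermediate trip back to HBM.
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    y = (x / rms) * w
    tl.store(out_ptr + row * n_cols + cols, y, mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Assumes x is a contiguous, float32 tensor of shape (n_rows, n_cols).
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    rmsnorm_fwd_kernel[(n_rows,)](x, weight, out, n_cols, eps, BLOCK_SIZE=BLOCK_SIZE)
    return out
```

Launching one program instance per row keeps the reduction for each row entirely on-chip, which is the point of fusing the normalization and scaling steps.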
The case study provides insight into how Liger-Kernel is deployed in a production training environment. LinkedIn uses a distributed training setup with multiple GPUs training different parts of the model simultaneously.
The described architecture runs on AWS and uses Torch Distributed Elastic to coordinate training pods across multiple GPUs.
When a pod starts, the Liger-Kernel code compiles its optimized GPU kernels before training begins. This allows the optimizations to be applied seamlessly within LinkedIn’s existing distributed training infrastructure.
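A hedged sketch of how this might look inside a single worker process follows. The `torch.distributed` calls are standard for jobs launched with torchrun / Torch Distributed Elastic; the commented-out Liger-Kernel patching call is an assumption based on the library's public repository and may differ by version.

```python
# Sketch of a worker entry point in an elastic distributed training job.
import os
import torch
import torch.distributed as dist

def main():
    # torchrun / Torch Distributed Elastic injects RANK, LOCAL_RANK, and WORLD_SIZE per worker.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Assumption: apply Liger-Kernel's model patches before building the model so every
    # worker runs the Triton kernels. Exact function names may differ by library version.
    # from liger_kernel.transformers import apply_liger_kernel_to_llama
    # apply_liger_kernel_to_llama()

    model = torch.nn.Linear(4096, 4096).cuda()      # stand-in for the actual LLM
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # ... training loop runs here, with the optimized kernels already compiled ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Each pod would be launched with something like `torchrun --nnodes=<N> --nproc_per_node=<GPUs per node> train.py`, letting Torch Distributed Elastic handle rendezvous and worker restarts.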
According to the case study, LinkedIn’s Liger-Kernel library achieved several notable performance improvements: roughly a 60% reduction in memory usage, a 20% improvement in multi-GPU training throughput, and a 3x reduction in end-to-end training time.
The library has been released as open-source, allowing other organizations to benefit from these optimizations.
While the reported results are impressive, it’s worth noting several considerations:
The source article is written by a third party (Better Stack) summarizing LinkedIn’s work, which means some technical details may be simplified or abstracted. The specific conditions under which these performance improvements were measured (model size, data volume, GPU types, etc.) are not detailed, making it difficult to assess how these results might translate to other use cases.
The techniques leveraged by Liger-Kernel—FlashAttention, operator fusion, and JIT compilation—are established optimization approaches in the deep learning community rather than novel innovations. LinkedIn’s contribution appears to be the curation and integration of these techniques into a cohesive, accessible library tailored for LLM training.
Additionally, the focus on training optimization rather than inference is notable. Many LLMOps case studies focus on inference efficiency for production serving, whereas this case study addresses the upstream challenge of model development efficiency. This makes it particularly relevant for organizations that frequently train or fine-tune their own models rather than relying solely on pre-trained models.
The key technologies referenced in LinkedIn’s training infrastructure include PyTorch (with torch.compile for JIT compilation), Triton for custom GPU kernels, FlashAttention, and Torch Distributed Elastic running on AWS.
This case study is primarily relevant to the training and development phase of LLMOps rather than inference or deployment. For organizations operating at scale with significant LLM training workloads, the approaches described here offer practical guidance on reducing computational costs and accelerating development cycles.
The open-source nature of Liger-Kernel means organizations can evaluate and potentially adopt these optimizations without building similar capabilities from scratch. However, the benefits will be most pronounced for teams with substantial training workloads and the engineering expertise to integrate such tooling into their existing infrastructure.