LinkedIn developed and open-sourced Liger (LinkedIn GPU Efficient Runtime) kernels to address the fundamental challenge of memory consumption in LLM training. By optimizing core operations like layer normalization, rotary position encoding, and activation functions, they achieved up to a 3-4x reduction in memory allocation and 20% throughput improvements for large models. The solution, implemented in Python and Triton, focuses on minimizing data movement between GPU memory and compute units, making LLM training faster and more cost-effective.
LinkedIn, the professional networking platform with over 1 billion members, 67 million companies, and 41,000 skills in their knowledge graph, has invested significantly in LLM technology to power a range of AI-driven features across their platform. This case study, presented at what appears to be a technical conference (likely Ray Summit based on references), focuses on LinkedIn’s internal efforts to optimize LLM training efficiency through the development of custom GPU kernels called “Liger Kernels.”
The presentation was given by a LinkedIn engineer named Dre, who detailed both the business context for LinkedIn’s LLM investments and the deep technical work they’ve done to reduce training costs and improve efficiency. As part of Microsoft, LinkedIn has a relationship with OpenAI, but the focus here is on their in-house infrastructure and optimization work.
LinkedIn uses LLMs across multiple product lines to power a variety of member-facing features, such as profile summary generation.
The breadth and scale of LinkedIn’s operations—mapping jobs to skills, skills to products, and products to companies through their knowledge graph—creates substantial demand for efficient LLM training and inference.
LinkedIn’s training infrastructure is built on well-established components, with PyTorch at its core and Triton for custom GPU kernels.
The diversity of their systems and the scale of their operations made training efficiency a critical concern, leading to the investment in custom kernel development.
The presentation provided an excellent technical explanation of why LLM training is often bottlenecked by memory rather than raw compute. Using a “roofline graph” model, Dre explained that as you move from compute-bound to memory-bound operations, GPUs increasingly sit idle waiting for data.
Modern GPUs have two key subsystems: High Bandwidth Memory (HBM), fast DRAM stacked alongside the GPU die, and the compute units that perform the actual operations. The challenge is that data must flow through a relatively narrow pipe from HBM to the compute units. For models with 70-80 billion parameters (or even 405 billion for the largest Llama variant), this data transfer becomes the limiting factor.
The naive implementation of Transformers fetches the same weights multiple times from GPU memory, causing significant inefficiency. The fundamental insight driving Liger Kernels is that reducing memory consumption and memory transfers is what unlocks higher speeds and lower costs—not necessarily having more compute power.
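This compute-bound versus memory-bound distinction can be made concrete with a back-of-envelope roofline calculation. The sketch below uses illustrative, assumed hardware numbers (roughly the order of magnitude of a modern datacenter GPU), not figures from the presentation:

```python
# Illustrative roofline arithmetic: a kernel is memory-bound when its
# FLOPs-per-byte ("arithmetic intensity") falls below the hardware's
# compute-to-bandwidth ratio. All hardware numbers are rough assumptions.

def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul in fp16."""
    flops = 2 * m * k * n                                  # multiply-accumulates
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def elementwise_intensity(flops_per_elem=1, bytes_per_elem=2):
    """FLOPs per byte for an elementwise op (read input, write output)."""
    return flops_per_elem / (2 * bytes_per_elem)

# Ridge point ~ peak FLOPs / memory bandwidth (assumed: ~1e15 FLOP/s
# of fp16 compute over ~3e12 B/s of HBM bandwidth).
ridge = 1e15 / 3e12   # ~333 FLOPs/byte

print(matmul_intensity(4096, 8192, 8192))   # 2048.0: compute-bound
print(elementwise_intensity())              # 0.25: heavily memory-bound
```

Large matmuls sit comfortably above the ridge point, but elementwise operations like layer norm, RoPE, and activations sit far below it, which is why they leave the compute units idle waiting on HBM.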
After profiling their training workloads, LinkedIn identified the most impactful kernels to optimize: core operations such as layer normalization, rotary position encoding, activation functions, and the cross-entropy loss.
The kernels are written in Triton, OpenAI’s domain-specific language for GPU programming. This choice offers several advantages: kernels are expressed in a Python-like syntax rather than CUDA C++, they integrate directly with PyTorch, and the same code can be retargeted to other GPU vendors.
The key concept is maximizing FLOPs (floating-point operations) per byte brought into the GPU. Ideally, each byte would only be brought into GPU memory once and used for all necessary computations before being discarded.
One particularly impactful optimization was the fused linear cross-entropy kernel. Cross-entropy is typically computed at the very end of the model, immediately after the final linear projection. In the naive implementation, the full logits tensor must be written back to memory after the linear layer and then read back in for the cross-entropy computation.
This is especially problematic for models with large vocabularies, because the logits tensor scales with three factors: batch size, sequence length, and vocabulary size.
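To make that scaling concrete, here is a back-of-envelope estimate using assumed, illustrative shape values (not LinkedIn's actual configuration):

```python
# Rough logits-memory estimate for the naive linear + cross-entropy path.
# Shape values are illustrative assumptions, not LinkedIn's configuration.
batch, seq_len, vocab = 4, 4096, 128256    # vocab ~ a Llama-3-sized tokenizer
bytes_per_elem = 4                         # fp32 logits

logits_bytes = batch * seq_len * vocab * bytes_per_elem
print(f"{logits_bytes / 1e9:.1f} GB")      # ~8.4 GB for the logits alone
```

Even at a modest batch size, the materialized logits alone run to several gigabytes, before gradients are accounted for.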
By fusing these two operations, LinkedIn reduced the memory profile significantly. The presentation showed graphs demonstrating that as vocabulary size increases, the fused kernel maintains a much lower and flatter memory consumption curve compared to naive implementations.
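The core idea can be sketched in plain Python: instead of materializing the full logits matrix, process the tokens in chunks so only a small logits slice exists at any time. This is a simplified illustration of the chunking idea, not the actual Liger Triton kernel (which also fuses the backward pass):

```python
import math
import random

random.seed(0)
n_tokens, hidden, vocab = 32, 16, 200
x = [[random.gauss(0, 1) for _ in range(hidden)] for _ in range(n_tokens)]
w = [[random.gauss(0, 1) for _ in range(vocab)] for _ in range(hidden)]
targets = [random.randrange(vocab) for _ in range(n_tokens)]

def matmul(a, b):
    return [[sum(ai[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for ai in a]

def cross_entropy(logits, targets):
    """Mean negative log-likelihood with a numerically stable log-softmax."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[t]
    return total / len(logits)

# Naive path: materialize the full (n_tokens x vocab) logits matrix.
naive_loss = cross_entropy(matmul(x, w), targets)

# Chunked path: only a (chunk x vocab) logits slice exists at any time.
def chunked_linear_ce(x, w, targets, chunk=4):
    total = 0.0
    for i in range(0, len(x), chunk):
        logits = matmul(x[i:i + chunk], w)            # small logits slice
        total += cross_entropy(logits, targets[i:i + chunk]) * len(logits)
    return total / len(x)

assert abs(naive_loss - chunked_linear_ce(x, w, targets)) < 1e-9
```

Peak memory for the chunked path scales with the chunk size rather than the full token count, which is why the fused kernel's memory curve stays flat as vocabulary size grows.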
The optimizations delivered substantial improvements: up to a 3-4x reduction in peak memory allocation and roughly 20% higher training throughput for large models.
These gains come from operating more efficiently and not having to bring each byte of data into GPU compute units multiple times.
LinkedIn designed Liger Kernels with multiple integration levels to accommodate different developer needs: a one-line automatic model patching API, model-specific patching functions, and individually importable kernel modules for custom architectures.
The ease of use has been a key factor in adoption. The fact that developers only need PyTorch and Triton, and can work in a familiar Python-like environment, has allowed LinkedIn’s engineers to explore and build systems naturally.
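The patch-style integration can be illustrated with a toy example. The names below are hypothetical stand-ins, not the actual Liger-Kernel API; the point is the pattern of swapping in a numerically equivalent, optimized implementation at setup time so the rest of the training code stays unchanged:

```python
# Toy illustration of patch-style integration (hypothetical names, not the
# actual Liger-Kernel API): an optimized, numerically equivalent forward
# function is swapped onto an existing model class at setup time.

class MLP:
    def forward(self, xs):
        # reference implementation
        return [x * 2 for x in xs]

def fused_forward(self, xs):
    # stand-in for a fused GPU kernel with identical semantics
    return [x + x for x in xs]

def apply_optimized_kernels(model_cls):
    """Monkey-patch the optimized forward onto the model class."""
    model_cls.forward = fused_forward

apply_optimized_kernels(MLP)
print(MLP().forward([1, 2, 3]))   # [2, 4, 6]: same outputs, faster kernel
```

Because the patch preserves the original semantics, existing training scripts pick up the optimized kernels without code changes.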
Liger Kernels are available on GitHub and have gained significant popularity. As of the presentation (which appears to be recent, with references to features “launched just a few days ago”), the kernels have been ported to AMD GPUs in addition to the original NVIDIA implementation. This cross-platform support extends the reach of the optimizations to a broader set of hardware.
LinkedIn is also building integrations with existing training frameworks, making it easy for external developers to adopt these optimizations in their own workflows.
In the Q&A session, Dre addressed how LinkedIn prioritizes generative AI opportunities. Given that LinkedIn is a large organization with multiple business lines, there’s no single decision-maker determining all AI initiatives. Instead, different business units identify and pursue use cases based on their needs. Some use cases are easier to enter (like the profile summary generation), and as the technology matures and understanding deepens, more complex use cases are being explored from a growing backlog.
The ease of experimentation—enabled in part by tools like Liger Kernels that reduce the cost and complexity of training—is a key factor in determining which projects get prioritized.
While the presentation makes compelling claims about performance improvements, a few caveats are worth noting: the benchmarks are self-reported and measured on LinkedIn's own workloads, and details such as baseline configurations and the exact venue and date of the presentation are not fully specified.
That said, the technical explanations are sound and align with well-understood principles of GPU optimization. The open-source nature of the project allows independent verification of claims, and the port to AMD GPUs suggests genuine community interest and adoption.
This case study demonstrates how a large-scale tech company with substantial LLM training needs approached infrastructure optimization from first principles. Rather than accepting the efficiency of standard implementations, LinkedIn’s team profiled their workloads, identified bottlenecks, and developed custom solutions that are now benefiting the broader ML community. The work represents a mature approach to LLMOps where optimization happens at multiple levels of the stack, from high-level product integration down to low-level GPU kernel implementation.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
LinkedIn introduced Liger-Kernel, an open-source library addressing GPU efficiency challenges in LLM training. The solution combines efficient Triton kernels with a flexible API design, integrated into a comprehensive training infrastructure stack. The implementation achieved significant improvements, including 20% better training throughput and 60% reduced memory usage for popular models like Llama, Gemma, and Qwen, while maintaining compatibility with mainstream training frameworks and distributed training systems.
Baseten has built a production-grade LLM inference platform focusing on three key pillars: model-level performance optimization, horizontal scaling across regions and clouds, and enabling complex multi-model workflows. The platform supports various frameworks including SGLang and TensorRT-LLM, and has been successfully deployed by foundation model companies and enterprises requiring strict latency, compliance, and reliability requirements. A key differentiator is their ability to handle mission-critical inference workloads with sub-400ms latency for complex use cases like AI phone calls.