ZenML

Multi-Track Approach to Developer Productivity Using LLMs

ebay 2024
View original source

eBay implemented a three-track approach to enhance developer productivity using LLMs: utilizing GitHub Copilot as a commercial offering, developing eBayCoder (a fine-tuned version of Code Llama 13B), and creating an internal GPT-powered knowledge base using RAG. The implementation showed significant improvements, including a 27% code acceptance rate with Copilot, enhanced software upkeep capabilities with eBayCoder, and increased efficiency in accessing internal documentation through their RAG system.

Industry

E-commerce

Technologies

Overview

eBay, one of the world’s largest e-commerce marketplaces, published this case study describing their systematic approach to deploying generative AI for developer productivity. The article, authored by Senthil Padmanabhan and published in February 2024, outlines three distinct but complementary tracks for leveraging LLMs within their engineering organization. Rather than positioning one approach as superior to others, eBay advocates for a multi-pronged strategy where each track addresses different productivity challenges and can be used in combination.

The case study is notable for its measured, practical approach to AI adoption. eBay acknowledges upfront that developer productivity is complex and cannot be captured by a single metric, employing both quantitative measures (Git metrics, DORA, Flow, code quality via Sonar) and qualitative measures (developer surveys). This methodological rigor adds credibility to their findings, though it’s worth noting that many of the specific results come from internal assessments rather than independent verification.

Track 1: Commercial Offerings (GitHub Copilot)

eBay’s first track involved expanding their use of GitHub Copilot to all developers, but they conducted a controlled experiment first. The pilot program involved 300 developers split into two groups: one using Copilot and a control group without it, matched for similar assignments and abilities. This A/B test was conducted in summer 2023.

The quantitative results reported were:

The use cases where Copilot excelled included converting comments to code, suggesting next lines, generating tests, and auto-filling repetitive code. However, eBay encountered significant limitations with context size. Copilot has inherent prompt size limits, which becomes problematic for a codebase of eBay’s scale with millions of lines of code. Certain tasks require knowledge of the entire codebase, which commercial tools simply cannot accommodate.

It’s important to note that acceptance rate metrics from commercial tools can be somewhat misleading, as they measure what developers accept rather than what ultimately proves useful. The decrease in merge time is a more meaningful operational metric, though the causal relationship to Copilot specifically would benefit from longer-term study.

Track 2: Fine-Tuned LLMs (eBayCoder)

Recognizing the limitations of generic commercial offerings, eBay pursued fine-tuning an existing open-source LLM on their proprietary codebase. They chose Code Llama 13B as their base model, noting that the architecture allows for easy substitution if needed. The resulting system, called “eBayCoder,” was trained on eBay’s codebase and associated documentation.

The primary use case highlighted for eBayCoder is software upkeep and migration tasks. eBay, like many large technology organizations, maintains foundational libraries and frameworks built on top of open-source software for servers, message queues, batch jobs, iOS, and Android. These systems require periodic upgrades for security patches and developer ergonomics improvements (such as upgrading to newer versions of Spring Boot). The effort required for such migrations varies dramatically depending on the application stack’s current version, and eBay reports spending significant engineering resources on these tasks despite having existing migration tools.

The fine-tuned approach also addresses the code duplication problem. With a massive, varied codebase, commercial LLMs typically only have access to immediately relevant context—surrounding files, the current repository, and dependent libraries. They may not be aware of existing internal services or non-dependent libraries maintained by other teams that offer similar functionality to what’s being written. This frequently leads to code duplication. A fine-tuned LLM with access to broader context can potentially identify these overlaps and reduce redundancy.

From an LLMOps perspective, this track represents a more resource-intensive investment. Fine-tuning requires computational resources, data preprocessing pipelines, and ongoing maintenance as the codebase evolves. The article doesn’t detail the infrastructure requirements, training time, or operational costs, which would be valuable for organizations considering similar approaches. The choice of Code Llama 13B suggests a balance between capability and resource requirements—larger models might offer better performance but at significantly higher operational costs.

Track 3: Internal Knowledge Base (RAG-Based GPT)

The third track addresses a different productivity bottleneck: the time developers spend investigating internal knowledge. eBay describes common questions like “Which API should I call to add an item to the cart?” or “How do I create a pipeline to deploy my application to production?” that can require multiple meetings and significant time to answer due to fragmented internal documentation.

eBay built an internal GPT that ingests data from multiple primary sources including enterprise GitHub Markdowns, Google Docs, Jira, Slack, and Wikis. The metaphor they use is instructive: every organization has that long-tenured employee who knows where everything is and whom to ask—this system aims to codify that institutional knowledge.

The technical implementation uses Retrieval Augmented Generation (RAG). The system creates embedding vectors for each piece of content from the various sources, which are then stored in a vector database. This embedding and indexing process is automated and recurring, suggesting they’ve built pipelines to keep the knowledge base current as documentation changes.

When a user enters a query, the system generates an embedding vector for that query and uses similarity mechanisms (such as cosine similarity) to compare against the stored content embeddings. This retrieval step identifies the most relevant content and links. The retrieved context is then passed to their private instance of LLMs (both commercial and open-source options) with a carefully constructed prompt that instructs the model to answer truthfully based on the provided context and to explicitly state “I don’t know” when the answer isn’t contained in the retrieved material.

This prompt engineering approach—instructing the model to acknowledge uncertainty—is a common and effective technique for reducing hallucinations in RAG systems. It shifts the model’s behavior toward conservative responses when context is insufficient.

eBay reports rising daily usage and positive qualitative feedback from developers regarding efficiency and relevance. They’re tracking task completion time and support/meeting requests as metrics, though they note the sample size isn’t yet large enough to share definitive quantitative results.

The acknowledged limitation is that, like any automated chat system, the GPT occasionally delivers nonsensical answers. To address this, eBay implements Reinforcement Learning from Human Feedback (RLHF), allowing employees to provide feedback through the user interface that gets incorporated back into the system. This creates a continuous improvement loop, though the operational complexity of maintaining such a feedback pipeline shouldn’t be underestimated.

LLMOps Considerations and Critical Assessment

From an LLMOps perspective, this case study demonstrates several best practices:

Multi-model strategy: Rather than betting on a single approach, eBay’s three-track strategy provides resilience and flexibility. If commercial offerings change pricing or terms, they have internal alternatives. If fine-tuning proves too resource-intensive for certain use cases, commercial options remain available.

Controlled experimentation: The A/B test methodology for evaluating Copilot represents a more rigorous approach than many organizations take when adopting AI tools. The inclusion of a control group with matched characteristics strengthens the validity of their findings.

Hybrid infrastructure: The mention of “private instances of commercial and open source LLMs” suggests eBay maintains flexibility in their infrastructure, potentially allowing them to balance cost, performance, and data privacy considerations across different use cases.

Continuous improvement loops: The RLHF implementation for the internal GPT demonstrates attention to ongoing system improvement rather than one-time deployment.

However, there are areas where the case study could be more comprehensive:

The article doesn’t discuss infrastructure costs, training times, or the team size required to maintain these systems. For organizations considering similar approaches, operational costs are critical decision factors. The fine-tuning track in particular requires significant ongoing investment as codebases evolve.

The metrics provided, while encouraging, come from internal assessments. Independent verification or longer-term studies would strengthen confidence in the claimed productivity gains. The “perceived productivity” increase from surveys is subjective and could be influenced by novelty effects.

The article also doesn’t address potential downsides like developer skill atrophy, over-reliance on generated code, or security implications of training models on proprietary code. These are legitimate concerns that mature LLMOps programs should consider.

Overall, eBay’s case study provides a useful framework for thinking about generative AI deployment for developer productivity, with the three-track approach offering a pragmatic template that acknowledges the strengths and limitations of different approaches.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

healthcare fraud_detection customer_support +90

Forward Deployed Engineering: Bringing Enterprise LLM Applications to Production

OpenAI 2025

OpenAI's Forward Deployed Engineering (FDE) team, led by Colin Jarvis, embeds with enterprise customers to solve high-value problems using LLMs and deliver production-grade AI applications. The team focuses on problems worth tens of millions to billions in value, working with companies across industries including finance (Morgan Stanley), manufacturing (semiconductors, automotive), telecommunications (T-Mobile, Klarna), and others. By deeply understanding customer domains, building evaluation frameworks, implementing guardrails, and iterating with users over months, the FDE team achieves 20-50% efficiency improvements and high adoption rates (98% at Morgan Stanley). The approach emphasizes solving hard, novel problems from zero-to-one, extracting learnings into reusable products and frameworks (like Swarm and Agent Kit), then scaling solutions across the market while maintaining strategic focus on product development over services revenue.

customer_support healthcare code_generation +42

Deploying Generative AI at Scale Across 5,000 Developers

Liberty IT 2026

Liberty IT, the technology division of Fortune 100 insurance company Liberty Mutual, embarked on a large-scale deployment of generative AI tools across their global workforce of over 5,000 developers and 50,000+ employees. The initiative involved rolling out custom GenAI platforms including Liberty GPT (an internal ChatGPT variant) to 70% of employees and GitHub Copilot to over 90% of IT staff within the first year. The company faced challenges including rapid technology evolution, model availability constraints, cost management, RAG implementation complexity, and achieving true adoption beyond basic usage. Through building a centralized AI platform with governance controls, implementing comprehensive learning programs across six streams, supporting 28 different models optimized for various use cases, and developing custom dashboards for cost tracking and observability, Liberty IT successfully navigated these challenges while maintaining enterprise security and compliance requirements.

fraud_detection customer_support code_generation +41