ZenML

Multi-Track Approach to Developer Productivity Using LLMs

eBay 2024

eBay implemented a three-track approach to enhance developer productivity using AI: deploying GitHub Copilot enterprise-wide, creating a custom-trained LLM called eBayCoder based on Code Llama, and developing an internal RAG-based knowledge base system. The Copilot implementation showed a 17% decrease in PR creation to merge time and 12% decrease in Lead Time for Change, while maintaining code quality. Their custom LLM helped with codebase-specific tasks and their internal knowledge base system leveraged RAG to make institutional knowledge more accessible.

Industry

E-commerce


eBay’s journey into implementing LLMs for developer productivity represents a comprehensive and pragmatic approach to adopting AI technologies in a large-scale enterprise environment. The company explored three distinct but complementary tracks for improving developer productivity through AI, offering valuable insights into the real-world challenges and benefits of deploying LLMs in production.

The case study is particularly noteworthy for its measured approach to evaluation and deployment, using both quantitative and qualitative metrics to assess the impact of these technologies. Instead of relying on a single solution, eBay recognized that different aspects of developer productivity could be better served by different approaches to LLM deployment.

Track 1: GitHub Copilot Implementation

The first track involved the enterprise-wide deployment of GitHub Copilot, preceded by a carefully designed A/B test with 300 developers. The evaluation combined quantitative and qualitative metrics.

The results showed a 17% decrease in PR creation-to-merge time and a 12% decrease in Lead Time for Change, while code quality was maintained.

However, eBay was also transparent about the limitations, particularly noting Copilot’s context window constraints when dealing with their massive codebase. This highlights an important consideration for large enterprises implementing similar solutions.
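The headline A/B figures are relative changes between a control group and a Copilot group. A minimal sketch of that arithmetic follows, assuming per-PR durations measured in hours. The function name `relative_change` and all sample values are hypothetical and are not eBay's data.

```python
from statistics import mean


def relative_change(control, treatment):
    """Return the relative change of the treatment mean versus the control mean.

    A negative result means the treatment group was faster.
    """
    control_mean = mean(control)
    if control_mean == 0:
        raise ValueError("control mean must be non-zero")
    return (mean(treatment) - control_mean) / control_mean


# Hypothetical PR creation-to-merge times in hours (illustrative only).
control_hours = [40.0, 50.0, 60.0]
copilot_hours = [34.0, 40.0, 50.5]

change = relative_change(control_hours, copilot_hours)
print(f"PR creation-to-merge time changed by {change:.1%}")
```

A real analysis would also check whether the difference is statistically significant before attributing it to the tool.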

Track 2: Custom LLM Development (eBayCoder)

The second track demonstrates a more specialized approach to handling company-specific code requirements. eBay created eBayCoder by fine-tuning Code Llama 13B on their internal codebase and documentation. This approach addressed limitations of commercial solutions by supporting tasks specific to eBay's codebase.

The implementation shows careful consideration of model selection (Code Llama 13B) and training strategy (post-training and fine-tuning on internal data). This represents a significant investment in MLOps infrastructure to support model training and deployment.
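Fine-tuning on an internal codebase starts with turning source files into training records. The sketch below shows one simple way to do that, using line-based chunking and JSONL output. The chunk size, record fields, and file names are assumptions for illustration; the source does not describe eBay's actual data pipeline.

```python
import json


def chunk_lines(text, max_lines=50):
    """Split a source file into consecutive chunks of at most max_lines lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


def build_records(files, max_lines=50):
    """Turn {path: source_text} into a list of training records (hypothetical schema)."""
    records = []
    for path, text in files.items():
        for index, chunk in enumerate(chunk_lines(text, max_lines)):
            if chunk.strip():  # skip chunks that contain only whitespace
                records.append({"path": path, "chunk": index, "text": chunk})
    return records


# Hypothetical in-memory stand-in for an internal repository.
repo = {
    "billing/tax.py": "def vat(amount):\n    return amount * 0.2\n",
    "README.md": "",
}

records = build_records(repo, max_lines=1)
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

Keeping the file path in each record is a design choice that makes it possible to trace a training sample back to its origin, which helps when filtering or auditing the dataset.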

Track 3: Internal Knowledge Base System

The third track focused on creating an intelligent knowledge retrieval system using RAG (Retrieval-Augmented Generation). The system retrieves relevant internal documentation and supplies it to an LLM as context, making institutional knowledge more accessible to developers, and it is described as including production-ready features.
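The retrieve-then-generate flow behind RAG can be sketched in a few lines. This toy version ranks documents by bag-of-words cosine similarity and builds a prompt from the best match. It is an illustration under stated assumptions, not eBay's implementation: a production system would typically use an embedding model and a vector store, and the documents and function names here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vec, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Assemble an LLM prompt from the retrieved context and the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical internal documentation snippets.
docs = [
    "Deploy services with the internal release tool after the build passes.",
    "Request database access through the data platform portal.",
]
prompt = build_prompt("How do I request database access?", docs)
print(prompt)
```

The prompt would then be sent to an LLM; that call is omitted because the source does not say which model serves the knowledge base.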

MLOps and Production Considerations

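The case study surfaces several MLOps considerations, including Copilot's context window limits on a very large codebase and the infrastructure needed to train and deploy a fine-tuned model.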

Monitoring and Evaluation

eBay's evaluation combined quantitative metrics, such as PR creation-to-merge time and Lead Time for Change, with qualitative assessment.

Future Considerations

eBay acknowledges that it is at the beginning of an exponential curve in terms of productivity gains. The company maintains a pragmatic view of the technology while recognizing its transformative potential. The implementation of RLHF and continuous improvement mechanisms suggests a long-term commitment to evolving these systems.

This case study provides valuable insights into how large enterprises can systematically approach LLM deployment, balancing commercial solutions with custom development while maintaining a focus on practical productivity improvements. The multi-track approach demonstrates a sophisticated understanding of how different LLM implementations can complement each other in a production environment.
