Company
Airbnb
Title
Large-Scale Test Framework Migration Using LLMs
Industry
Tech
Year
2024
Summary (short)
Airbnb successfully migrated 3,500 React component test files from Enzyme to React Testing Library (RTL) using LLMs, reducing what was estimated to be an 18-month manual engineering effort to just 6 weeks. Through a combination of systematic automation, retry loops, and context-rich prompts, the team achieved a 97% automated migration success rate, with the remaining 3% completed manually using the LLM-generated code as a baseline.
This case study from Airbnb demonstrates a sophisticated application of LLMs in production for large-scale code migration: transitioning React component tests from Enzyme to React Testing Library (RTL). The project shows how LLMs can be deployed for complex code-transformation tasks that would traditionally require significant manual engineering effort.

The challenge Airbnb faced was significant: nearly 3,500 React component test files had to be migrated while preserving both the original test intent and code coverage, and the manual estimate for this work was approximately 1.5 years of engineering time. The company had adopted RTL in 2020 for new test development, but maintaining two testing frameworks wasn't sustainable, and simply deleting the old Enzyme tests would have created significant coverage gaps.

The LLMOps implementation was particularly noteworthy for its systematic, production-oriented approach. The team developed a robust pipeline that broke the migration down into discrete, parallelizable steps, treating it like a production system rather than a one-off transformation. Here's how they structured it.

First, they implemented a state machine model in which each file moved through a series of validation and refactor stages: Enzyme refactoring, Jest fixes, and lint/TypeScript compliance checks. This allowed precise tracking of progress and enabled hundreds of files to be processed in parallel (see the state-machine sketch below).

A key innovation was the use of configurable retry loops. Rather than trying to perfect the prompt engineering up front, the team found that allowing multiple attempts with dynamically updated prompts was more effective: on each retry, the system fed the validation errors and the most recent version of the file back to the LLM, allowing up to 10 attempts for most files (see the retry-loop sketch below).

Their approach to context management was particularly sophisticated. Prompts were expanded to between 40,000 and 100,000 tokens of context (see the context-assembly sketch below), incorporating:

* Source code of the component under test
* Related tests from the same directory
* Team-specific patterns
* General migration guidelines
* Common solutions
* Up to 50 related files
* Manually written few-shot examples
* Examples of existing, well-written, passing test files

The initial bulk run achieved a 75% success rate in just four hours, migrating approximately 2,625 files automatically. To handle the remaining files, the team developed a systematic improvement process they called "sample, tune, sweep" (sketched after this list):

* Run the remaining files to identify common failure patterns
* Select representative sample files
* Update prompts and scripts
* Validate the fixes against the sample files
* Repeat with all remaining files

This iterative process pushed the success rate to 97% over four days of refinement. For the final 3% of particularly complex files, the LLM-generated code served as a starting point for manual completion.
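To make the mechanics concrete, here is a minimal Python sketch of such a per-file state machine with parallel execution. The stage names, stub behavior, and file layout are assumptions for illustration; Airbnb has not published its pipeline in this form.

```python
import glob
from concurrent.futures import ThreadPoolExecutor
from enum import Enum, auto

class Stage(Enum):
    ENZYME_REFACTOR = auto()  # LLM rewrites Enzyme idioms to RTL
    JEST_FIXES = auto()       # migrated test must pass under Jest
    LINT_AND_TSC = auto()     # lint and TypeScript compliance checks
    DONE = auto()             # file fully migrated

def run_stage(path: str, stage: Stage) -> bool:
    """Placeholder for one refactor-and-validate step; the real step calls
    the LLM, then shells out to Jest/eslint/tsc to verify the result."""
    return True  # stub: always succeeds in this sketch

def migrate(path: str) -> Stage:
    """Walk one file through every stage, parking it where validation fails
    so a later retry pass (or a human) can pick it up."""
    for stage in (Stage.ENZYME_REFACTOR, Stage.JEST_FIXES, Stage.LINT_AND_TSC):
        if not run_stage(path, stage):
            return stage
    return Stage.DONE

if __name__ == "__main__":
    # Per-file state is independent, so hundreds of files can run concurrently.
    test_files = glob.glob("src/**/*.test.js", recursive=True)  # assumed layout
    with ThreadPoolExecutor(max_workers=100) as pool:
        status = dict(zip(test_files, pool.map(migrate, test_files)))
```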
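The retry loop can be sketched as follows. `call_llm` and `validate` are assumed interfaces, standing in for an LLM API call and a Jest/lint/TypeScript run that returns a pass flag plus an error log.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 10  # most files were allowed up to ten tries

def build_prompt(code: str, errors: str) -> str:
    """Dynamic prompt: always the latest file version, plus the last failure."""
    prompt = "Migrate this Enzyme test file to React Testing Library.\n"
    if errors:
        prompt += f"The previous attempt failed validation with:\n{errors}\n"
    return prompt + f"Current file contents:\n{code}"

def migrate_with_retries(
    source: str,
    call_llm: Callable[[str], str],
    validate: Callable[[str], tuple[bool, str]],
) -> Optional[str]:
    """Retry until validation passes; each attempt sees the newest version
    of the file and the errors it produced, not the original input."""
    current, errors = source, ""
    for _ in range(MAX_ATTEMPTS):
        candidate = call_llm(build_prompt(current, errors))
        ok, errors = validate(candidate)
        if ok:
            return candidate
        current = candidate  # carry the latest attempt forward
    return None  # exhausted retries; falls through to manual completion
```

Carrying the newest candidate forward, rather than restarting from the original file, lets each retry build on partial progress, which is why imperfect prompts plus retries outperformed attempts at prompt perfection here.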
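Context assembly might look roughly like this; the file names and the naive component-resolution rule are illustrative assumptions.

```python
from pathlib import Path

MAX_RELATED_FILES = 50  # the pipeline included up to 50 related files per prompt

STATIC_SECTIONS = [  # assumed locations for hand-maintained prompt material
    Path("prompts/migration_guidelines.md"),
    Path("prompts/team_patterns.md"),
    Path("prompts/common_solutions.md"),
    Path("prompts/few_shot_examples.md"),  # manually written few-shot examples
]

def component_for(test_path: Path) -> Path:
    # Assumes Foo.test.js lives next to Foo.js; the real resolution was
    # presumably smarter than a naive rename.
    return test_path.with_name(test_path.name.replace(".test", ""))

def related_files(test_path: Path) -> list[Path]:
    """Stub: the real system selected related files intelligently, e.g.
    well-written RTL tests already passing elsewhere in the repo."""
    return []

def build_context(test_path: Path) -> str:
    """Concatenate the 40k-100k token prompt context described above."""
    sections = [p.read_text() for p in STATIC_SECTIONS]
    sections.append(component_for(test_path).read_text())  # component under test
    siblings = [p for p in test_path.parent.glob("*.test.*") if p != test_path]
    for path in (siblings + related_files(test_path))[:MAX_RELATED_FILES]:
        sections.append(f"// {path}\n{path.read_text()}")
    return "\n\n".join(sections)
```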
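Finally, the "sample, tune, sweep" process reduces to a loop like the one below, where `migrate` is the automated pipeline and `tune` stands in for the human step of studying failures and updating prompts and scripts.

```python
def sample_tune_sweep(remaining: set[str], migrate, tune, sample_size: int = 10):
    """Iterate until everything passes or automation stops paying off."""
    while remaining:
        # Sweep: rerun the whole backlog, keeping only files that still fail.
        remaining = {path for path in remaining if migrate(path) is None}
        if not remaining:
            break
        # Sample: pick representative failures for a human to study.
        sample = sorted(remaining)[:sample_size]
        # Tune: update prompts/scripts and validate against the sample before
        # sweeping again. Returns False when it is time to stop and finish
        # the stragglers by hand (the final ~3% in Airbnb's case).
        if not tune(sample):
            break
    return remaining  # whatever is left goes to manual completion
```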
From an LLMOps perspective, several aspects of the implementation stand out as best practices:

* Automated validation and verification steps
* Retry mechanisms with dynamic prompt updating
* Extensive context management and intelligent context selection
* Systematic tracking and improvement processes
* Parallel processing capabilities
* Fallback to human intervention for edge cases

The project also demonstrated impressive cost efficiency. Despite high retry counts for complex files, the total cost, including LLM API usage and six weeks of engineering time, was significantly lower than the cost of the estimated 1.5-year manual migration.

Some limitations and considerations should be noted. The system required significant upfront investment in automation infrastructure and prompt engineering, and it relied heavily on automated validation, which had to be reliable to ensure the quality of the migrated code. Additionally, the high token counts (up to 100,000 tokens per prompt) imply frontier models with long context windows, which may have cost implications for smaller organizations.

This case study represents a mature example of LLMs in production, showing how they can be effectively integrated into software engineering workflows when combined with robust automation and validation systems. Its success has led Airbnb to consider expanding the approach to other code transformation tasks and exploring new applications of LLM-powered automation for developer productivity.
