Wix developed a customized LLM for their enterprise needs by applying multi-task supervised fine-tuning (SFT) and domain-adaptive pre-training (DAPT) with full-weights fine-tuning. Despite a limited budget of training data and tokens, their smaller customized model outperformed GPT-3.5 on various Wix-specific tasks. The project focused on three key components: comprehensive evaluation benchmarks, extensive data collection methods, and advanced modeling processes to achieve full domain adaptation.
This case study explores Wix's journey in developing and deploying a custom domain-adapted LLM for their enterprise needs. The project represents a significant advancement in practical LLMOps implementation, showcasing how enterprises can move beyond simple prompt engineering and RAG solutions to create truly domain-specialized AI systems.
At the core of this implementation is the recognition that while common LLM customization techniques like prompt engineering, RAG, and task-specific fine-tuning are easier to implement, they come with fundamental limitations in production: high costs, high latency, susceptibility to hallucination, and an inability to handle multiple domain tasks simultaneously. Wix's approach aimed to address these challenges through a more comprehensive solution.
The implementation strategy focused on three key components:
**Evaluation and Benchmarking**
Wix developed a sophisticated evaluation framework before beginning model development. This included:
* Custom Q&A datasets derived from customer service chats and FAQs
* Implementation of the "LLM-as-a-judge" technique with domain-specific prompts (sketched below)
* Task-specific evaluations for intent classification, customer segmentation, domain summarization, and sentiment analysis
* Multiple-choice question format combining knowledge and task-based evaluation
This evaluation framework was crucial for maintaining quality control and ensuring the model met production requirements. The team emphasized the importance of having fixed, simple prompts for task evaluation rather than prompts optimized for specific model families.
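To make the judging loop concrete, here is a minimal sketch of the LLM-as-a-judge pattern. The judge model (`gpt-4o`), prompt wording, and 1–5 scoring scale are illustrative assumptions rather than Wix's actual setup; the case study only specifies that evaluation used fixed, domain-specific prompts.

```python
# Minimal LLM-as-a-judge sketch. The judge model, prompt wording, and
# scoring scale are illustrative assumptions, not Wix's actual setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading answers to Wix customer-support questions.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Score the candidate from 1 (wrong) to 5 (fully correct and grounded).
Reply with the score only."""

def judge_answer(question: str, reference: str, candidate: str) -> int:
    """Ask a judge model to grade one candidate answer against a reference."""
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical judge; any strong model works
        temperature=0,   # deterministic grading
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return int(response.choices[0].message.content.strip())

def benchmark(dataset, generate) -> float:
    """Average judge score over a fixed Q&A set derived from chats/FAQs."""
    scores = [judge_answer(ex["question"], ex["reference"],
                           generate(ex["question"])) for ex in dataset]
    return sum(scores) / len(scores)
```

Because the judge prompt stays fixed across model candidates, scores remain comparable between base models, fine-tuned variants, and vendor baselines.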
**Training Data Development**
The training data strategy was particularly noteworthy from an LLMOps perspective:
* Recognition that pre-trained models already contained some domain knowledge, allowing for more efficient fine-tuning
* Careful sampling between different data sources
* Use of both public and domain-specific data, with the domain-specific share upsampled to roughly 2% of the training mix (see the sampling sketch after this list)
* Synthetic data generation for Q&As using organizational data
* Integration of existing labeled data from various NLP projects
* Careful curation to avoid inclusion of mistakes and confidential information
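The 2% figure suggests an upsampled mixture rather than raw concatenation of sources. Below is a minimal sketch of how such a mixture might be assembled under a simple token-budget scheme; the source structure, `count_tokens` helper, and sampling-with-replacement policy are assumptions for illustration.

```python
# Sketch of building a DAPT mixture where domain-specific text is upsampled
# to ~2% of total tokens. Sources and the token counter are assumptions.
import random

DOMAIN_RATIO = 0.02  # target share of domain tokens in the mix

def build_mixture(public_docs, domain_docs, total_tokens, count_tokens):
    """Sample documents so domain data makes up ~DOMAIN_RATIO of tokens."""
    budget_domain = int(total_tokens * DOMAIN_RATIO)
    budget_public = total_tokens - budget_domain
    mixture, used_domain, used_public = [], 0, 0
    while used_domain < budget_domain:   # domain docs may repeat (upsampling)
        doc = random.choice(domain_docs)
        mixture.append(doc)
        used_domain += count_tokens(doc)
    while used_public < budget_public:
        doc = random.choice(public_docs)
        mixture.append(doc)
        used_public += count_tokens(doc)
    random.shuffle(mixture)              # interleave sources for training
    return mixture
```

Keeping the domain share small relative to public data is the usual way to inject domain knowledge during continued pre-training without degrading the model's general capabilities.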
**Model Development and Infrastructure**
The technical implementation included several key considerations:
* Selection of appropriate base models that already showed good domain performance
* Use of AWS P5 instances for GPU computing
* Implementation of both DAPT and SFT techniques
* Careful hyperparameter tuning, especially of the LoRA rank in adapter-based training (see the adapter sketch after this list)
* Consideration of completion-only vs. full-prompt training objectives
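For the adapter-based variant, the LoRA rank is the main capacity knob to balance against training data size. Here is a minimal sketch using Hugging Face `peft`; the base model, rank, and target modules are illustrative assumptions, since the case study reports only that the rank required careful tuning.

```python
# Minimal adapter-based fine-tuning sketch with Hugging Face peft. Base
# model, rank, and target modules are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "mistralai/Mistral-7B-v0.1"  # hypothetical base with decent domain recall
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

lora_config = LoraConfig(
    r=16,                                 # adapter rank: capacity vs. data trade-off
    lora_alpha=32,                        # scaling factor; often set to 2*r
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity-check how few weights actually train
```

A rank that is too high relative to the available training data invites overfitting, while one that is too low limits how much domain behavior the adapter can absorb; this is the balance the team had to tune.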
From an operations perspective, several key decisions and trade-offs were made:
* Choice to limit training to a single high-powered GPU on an AWS P5 instance
* Focus on smaller language models when possible for easier training and serving
* Careful balance between adapter complexity and training data size
* Implementation of both completion-style and task-based training (see the loss-masking sketch below)
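The completion vs. full-prompt distinction comes down to which tokens contribute to the training loss. A minimal sketch of completion-only masking under Hugging Face conventions (the `-100` ignore index) follows; the helper name and truncation policy are assumptions.

```python
# Sketch of completion-only loss masking: the model sees the full
# prompt+answer sequence, but only answer tokens contribute to the loss.
# The -100 ignore index follows the Hugging Face labels convention.
import torch

def make_example(tokenizer, prompt: str, completion: str, max_len: int = 1024):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(completion + tokenizer.eos_token,
                               add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + completion_ids)[:max_len]
    labels = ([-100] * len(prompt_ids) + completion_ids)[:max_len]  # mask prompt
    return {"input_ids": torch.tensor(input_ids),
            "labels": torch.tensor(labels)}
```

Masking the prompt focuses the gradient on producing good answers, whereas full-prompt training also reinforces the input distribution; which works better is typically an empirical question per task.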
The results demonstrated several key successes:
* The custom model outperformed GPT-3.5 on Wix-specific tasks
* Achieved better performance on knowledge-based tasks like Q&A
* Successfully handled multiple domain tasks simultaneously
* Reduced operational costs and latency compared to using larger models
Important lessons learned from the implementation include:
* The value of starting with solid evaluation metrics before model development
* The importance of high-quality, curated training data
* The need to balance model size with performance requirements
* The benefits of combining multiple fine-tuning approaches (SFT and DAPT)
The case study also highlights some challenges and limitations:
* The need for significant computational resources
* The complexity of creating comprehensive evaluation benchmarks
* The challenge of collecting sufficient high-quality training data
* The careful balance required in hyperparameter tuning
From an architectural perspective, the implementation demonstrated the importance of considering the entire ML lifecycle, from data preparation through to deployment and monitoring. The team's focus on evaluation metrics and benchmarks before model development represents a mature approach to ML system design.
The project also shows how enterprises can move beyond the limitations of vendor-provided solutions to create custom models that better serve their specific needs. This is particularly relevant for organizations with unique domain knowledge requirements or those looking to optimize for both performance and cost.
Looking forward, Wix's approach provides a template for other organizations considering similar domain adaptation projects. The emphasis on comprehensive evaluation, careful data curation, and thoughtful model selection offers valuable insights for teams working on enterprise LLM deployments.