This case study explores how Rolls-Royce applied generative AI to its engineering design process, focusing on the preliminary design phases where full 3D representations are not always necessary.
The project represents a collaboration between Rolls-Royce, Databricks, and the University of Southampton, utilizing conditional Generative Adversarial Networks (GANs) to generate realistic design solutions based on given requirements while also predicting performance characteristics.
## Technical Implementation
The system's architecture is built around several key components:
* Data Fusion: The approach combines design parameters with pictorial representations of products, creating a comprehensive training dataset that captures both geometric and performance characteristics.
* Image Encoding: A crucial innovation involves encoding design parameters directly into images using bars and glyphs, carefully placed to avoid interfering with critical areas of interest. This encoding allows for parameter extraction and validation later in the process.
* Training Process: The system uses categorized training data, organizing designs based on performance parameters (like loss coefficient in airfoils). This categorization enables targeted generation of designs within specific performance bands.
* Validation Pipeline: Unlike typical GAN applications that focus on qualitative image assessment, this system includes a rigorous validation process where generated designs are decoded back into engineering parameters and verified through traditional physics-based simulations.
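The parameter-encoding idea can be illustrated with a toy round-trip sketch: normalized parameters are drawn as vertical bars in a margin strip appended to the image, then recovered by measuring bar heights. The bar layout, sizes, and function names here are illustrative assumptions; the case study does not publish the actual encoding scheme.

```python
import numpy as np

def encode_params(image: np.ndarray, params: list, bar_width: int = 4) -> np.ndarray:
    """Encode normalized design parameters (0..1) as vertical bars appended
    to the right edge of a grayscale image, away from the geometry of interest.
    Illustrative only -- the real encoding uses bars and glyphs placed to
    avoid critical regions."""
    h, _ = image.shape
    strip = np.zeros((h, bar_width * len(params)), dtype=image.dtype)
    for i, p in enumerate(params):
        bar_h = int(round(p * (h - 1)))
        # Fill the bottom bar_h+1 pixels of this parameter's bar.
        strip[h - 1 - bar_h:, i * bar_width:(i + 1) * bar_width] = 255
    return np.hstack([image, strip])

def decode_params(encoded: np.ndarray, n_params: int, bar_width: int = 4) -> list:
    """Recover the parameters by measuring bar heights in the margin strip."""
    h = encoded.shape[0]
    strip = encoded[:, -bar_width * n_params:]
    params = []
    for i in range(n_params):
        filled = int((strip[:, i * bar_width] > 127).sum())
        params.append((filled - 1) / (h - 1) if filled > 0 else 0.0)
    return params
```

The round trip is lossy only up to pixel quantization, which is what makes downstream extraction and validation of generated designs feasible.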
## Cloud Implementation and Security
The team implemented several innovative approaches to handle security and scaling challenges:
* Data Classification: They carefully managed data sensitivity by using PL 909c export classification (dual-use civil data) for cloud training, keeping more sensitive data on-premises.
* Transfer Learning Strategy: Models are initially trained on non-sensitive data in the cloud, then transfer learning is applied locally for sensitive applications, maintaining security while leveraging cloud capabilities.
* Infrastructure Optimization: Using Databricks' infrastructure, they achieved significant performance improvements:
- Reduced training time from 7 days to 4-6 hours
- Implemented MLflow for experiment tracking and model management
- Utilized Unity Catalog for governed data storage
- Leveraged GPU acceleration with various options (V100, A100, etc.)
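The transfer-learning strategy can be sketched with a toy numpy model: a feature-extracting "backbone" stands in for the cloud-pretrained network and stays frozen, while only a small "head" is retrained locally on sensitive data. All shapes, names, and the linear head are illustrative assumptions, not the project's GAN code.

```python
import numpy as np

def features(X: np.ndarray, W_backbone: np.ndarray) -> np.ndarray:
    # Frozen feature extractor: stands in for layers pretrained in the
    # cloud on non-sensitive (dual-use civil) data.
    return np.tanh(X @ W_backbone)

def finetune_head(X, y, W_backbone, W_head, lr=0.05, steps=300):
    """Gradient descent on the head weights only; the backbone is never
    updated, mirroring on-premises transfer learning on sensitive data."""
    for _ in range(steps):
        h = features(X, W_backbone)
        grad = h.T @ (h @ W_head - y) / len(X)  # MSE gradient w.r.t. the head
        W_head = W_head - lr * grad
    return W_head
```

The security property falls out of the split: the sensitive data and the weights it touches never leave the local environment.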
## Technical Challenges and Learnings
The team encountered and documented several interesting technical findings:
* Data Distribution: Initial attempts with 98,000 images showed poor performance due to uneven distribution of training data. After normalizing and curating the dataset to 63,000 images, they achieved much better correlation between predicted and actual values.
* GPU vs CPU Performance: Unexpectedly, CPU training sometimes outperformed GPU training for their specific use case, particularly when validating against simulation results. This suggests that model architectures may need to be optimized specifically for GPUs before the expected acceleration materializes.
* Scaling Guidelines: They developed empirical rules for scaling:
- Single GPU sufficient for datasets under 0.5 million images
- Single node multi-GPU for larger datasets
- Multi-node clusters primarily for hyperparameter optimization
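The curation step described above amounts to rebalancing the training set across performance bands. A minimal stdlib sketch, assuming samples are tagged with a band label (the band function and sample format are hypothetical; the project's actual pipeline is not detailed):

```python
import random
from collections import defaultdict

def balance_by_band(samples, band_of, seed=0):
    """Down-sample every performance band to the size of the smallest band,
    giving the GAN a roughly uniform training distribution instead of an
    uneven one."""
    rng = random.Random(seed)
    bands = defaultdict(list)
    for s in samples:
        bands[band_of(s)].append(s)
    target = min(len(group) for group in bands.values())
    curated = []
    for group in bands.values():
        curated.extend(rng.sample(group, target))
    return curated
```

Down-sampling trades raw dataset size for balance, consistent with the team's finding that 63,000 well-distributed images outperformed 98,000 skewed ones.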
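The empirical scaling rules reduce to a small decision function; this is a paraphrase of the guidance above, not code from the project:

```python
def recommend_compute(n_images: int, hyperparameter_search: bool = False) -> str:
    """Rule of thumb from the case study for sizing GAN training compute."""
    if hyperparameter_search:
        # Multi-node clusters pay off mainly for running trials in parallel.
        return "multi-node cluster"
    if n_images < 500_000:
        return "single GPU"
    return "single-node multi-GPU"
```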
## Current Challenges and Future Work
The team is actively working on several improvements:
* Architecture Optimization: Investigating different loss functions (Kullback-Leibler divergence, Wasserstein with gradient penalty, Jensen-Shannon) and their impact on model performance.
* 3D Model Training: Moving beyond 2D representations to handle full 3D models, which better represents their actual products.
* Multi-Objective Optimization: Working to handle multiple, often conflicting design requirements simultaneously (e.g., weight reduction vs. efficiency improvement).
* Failed Design Integration: Incorporating knowledge about unsuccessful or unconverged solutions to help the model avoid unproductive areas of the design space.
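For discrete distributions (e.g., histograms of a performance parameter), two of the candidate divergences above can be compared in a few lines of numpy. This is a generic sketch, not the project's training code; in practice the Wasserstein-with-gradient-penalty variant is estimated by a critic network rather than computed in closed form.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q); asymmetric in p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

The choice matters because these divergences behave differently when the generated and real distributions barely overlap, which is precisely the regime early in GAN training.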
## Key Learnings for LLMOps
The case study provides several valuable insights for LLMOps practitioners:
* Data Governance: The importance of building systems that can handle varying levels of data sensitivity while still leveraging cloud resources.
* Validation Requirements: The need for domain-specific validation beyond typical ML metrics, especially in engineering applications.
* Infrastructure Flexibility: The value of having flexible infrastructure that can scale based on workload requirements while maintaining security and governance.
* Performance Optimization: The importance of careful data curation and the need to challenge common assumptions (like GPU superiority) in specific use cases.
The project demonstrates how traditional engineering disciplines can be enhanced with modern AI techniques while maintaining rigorous validation requirements and security considerations. It also highlights the importance of building flexible, secure infrastructure that can handle sensitive data while still leveraging cloud capabilities for scale and performance.