Cambrium uses LLMs and other AI methods to design and generate novel proteins for sustainable materials, starting with vegan human collagen for cosmetics. The company developed a protein programming language and uses LLMs to recast protein design as a mathematical optimization problem, which lets it search very large protein sequence spaces efficiently. The approach combines traditional protein engineering with modern LLM techniques and helped bring a biotech product to market in under two years.
# LLMs for Protein Engineering at Cambrium
## Company Overview
Cambrium is a biotech startup that combines AI and synthetic biology to create sustainable materials. Founded in 2020, it brought its first biotech product to market in less than two years, reportedly a record in the EU biotech space. That first product is Novacoll, a vegan human collagen for cosmetics, created through genetic engineering and AI-driven protein design.
## Technical Infrastructure
### Protein Programming Language
- Developed a novel protein programming language for designing proteins
- Language transforms protein design into a mathematical optimization problem
- Enables systematic exploration of protein sequence space (approximately 10^18 possibilities)
- Allows designers to specify constraints that candidate proteins must satisfy
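
To make the idea of "design as optimization" concrete, here is a minimal sketch in Python. It is not Cambrium's language or algorithm: the constraint helpers (`forbid_residue`, `min_fraction`), the penalty formulation, and the greedy point-mutation search are all assumptions chosen for illustration. Each constraint returns a penalty of zero when satisfied, and the search keeps a mutation only if it does not make the total penalty worse.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


# Hypothetical constraint builders. Each returns a function that maps a
# sequence to a penalty >= 0, where 0 means the constraint is satisfied.
def forbid_residue(residue):
    return lambda seq: seq.count(residue)


def min_fraction(residue, fraction):
    def penalty(seq):
        shortfall = fraction * len(seq) - seq.count(residue)
        return max(0.0, shortfall)
    return penalty


def total_penalty(seq, constraints):
    return sum(constraint(seq) for constraint in constraints)


def optimize(seq, constraints, steps=2000, seed=0):
    """Greedy point-mutation search (illustrative only).

    A random single-residue mutation is accepted when it does not
    increase the total penalty, so the score never gets worse.
    """
    rng = random.Random(seed)
    best, best_score = seq, total_penalty(seq, constraints)
    for _ in range(steps):
        if best_score == 0:
            break  # every constraint is satisfied
        pos = rng.randrange(len(best))
        candidate = best[:pos] + rng.choice(AMINO_ACIDS) + best[pos + 1:]
        score = total_penalty(candidate, constraints)
        if score <= best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    constraints = [forbid_residue("C"), min_fraction("G", 0.25)]
    print(optimize("C" * 12, constraints))
```

A real system would replace the random search with a stronger optimizer and the toy penalties with learned or physics-based scores. The point of the sketch is only that constraints expressed as functions turn design into a search problem.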
### LLM Integration
- Built a LangChain-based research assistant for the company's scientists
### Data Infrastructure Challenges
- High cost of experiments makes data extremely valuable
- Critical focus on preventing data loss in pipelines
- Two main data source types: machine-generated and human-generated data
- Key challenge: digitizing scientific processes
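
One way to reduce the risk of silent data loss is an append-only log in which every record is validated on write and checksummed so corruption is detected on read. The sketch below assumes a JSON Lines file and a hypothetical `Measurement` schema; the field names and the `"machine"` / `"human"` source labels are illustrative, not taken from Cambrium's pipeline.

```python
import hashlib
import json
import os
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical record schema for one experimental datapoint."""
    experiment_id: str
    source: str      # assumed labels: "machine" or "human"
    quantity: str
    value: float
    unit: str


def validate(m):
    # Reject incomplete records before they reach storage.
    if m.source not in ("machine", "human"):
        raise ValueError(f"unknown source: {m.source!r}")
    if not m.experiment_id or not m.unit:
        raise ValueError("experiment_id and unit are required")


def append_record(path, m):
    """Append one validated record with a SHA-256 checksum of its payload."""
    validate(m)
    payload = json.dumps(asdict(m), sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"sha256": digest, "payload": payload}) + "\n")
        f.flush()
        os.fsync(f.fileno())  # push the write to disk before returning
    return digest


def read_verified(path):
    """Read all records, failing loudly if any checksum does not match."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            entry = json.loads(line)
            digest = hashlib.sha256(entry["payload"].encode("utf-8")).hexdigest()
            if digest != entry["sha256"]:
                raise ValueError(f"checksum mismatch on line {line_no}")
            records.append(Measurement(**json.loads(entry["payload"])))
    return records
```

The design choice here is to fail loudly: a rejected or corrupted record raises an error instead of being dropped, which matters when each datapoint is expensive to reproduce.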
### Production System Architecture
- Focus on reliability over latency
- Main users are internal scientists rather than external customers
- Less emphasis on high QPS or 24/7 availability
- Strong emphasis on data integrity and experiment tracking
- Integration with DNA synthesis APIs for automated ordering
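
When reliability matters more than latency, an ordering step can afford to retry slowly and deliberately. The sketch below wraps a caller-supplied `submit` function with exponential backoff and attaches a deterministic idempotency key so a retried order can be recognized as a duplicate. Everything here is hypothetical: `build_order`, `submit_with_retry`, and the order fields do not correspond to any real DNA synthesis vendor's API.

```python
import hashlib
import time


def build_order(sequence, construct_name):
    """Build a hypothetical order payload with a deterministic idempotency key."""
    key = hashlib.sha256(f"{construct_name}:{sequence}".encode("utf-8")).hexdigest()[:16]
    return {"key": key, "name": construct_name, "sequence": sequence}


def submit_with_retry(submit, order, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call submit(order), retrying on ConnectionError with exponential backoff.

    `submit` stands in for whatever client function talks to the vendor.
    The delay doubles after each failed attempt: base_delay, 2x, 4x, ...
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return submit(order)
        except ConnectionError as err:
            last_error = err
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"order {order['key']} failed after {max_attempts} attempts"
    ) from last_error
```

Because the key is derived from the construct name and sequence, resubmitting the same design yields the same key, which lets the receiving side (or a local ledger) avoid ordering the same construct twice.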
## MLOps Practices
### Data Management
- Strict protocols for experimental data collection
- Integration of robotics to ensure data accuracy
- Focus on complete data capture due to high cost per datapoint
- Structured approach to managing both machine and human-generated data
### Model Development
- Evolution from traditional protein engineering to LLM-based approaches
- Current use of diffusion models for protein generation
- Integration of protein folding predictions
- Optimization for multiple constraints simultaneously
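
When several constraints or objectives compete, one common technique is to keep only the Pareto-optimal candidates, that is, those not beaten on every objective by some other candidate. The sketch below is a generic illustration of that technique, not a description of Cambrium's method; the two scores per candidate (for example, a predicted stability and a predicted expression level) are assumed to be "higher is better".

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of non-dominated candidates, sorted alphabetically.

    candidates: dict mapping a candidate name to a tuple of scores,
    where higher is better on every objective.
    """
    return sorted(
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other_name, other_scores in candidates.items()
            if other_name != name
        )
    )


if __name__ == "__main__":
    # Hypothetical (stability, expression) predictions for four designs.
    designs = {"A": (0.9, 0.2), "B": (0.5, 0.5), "C": (0.4, 0.4), "D": (0.2, 0.9)}
    print(pareto_front(designs))
```

In this example design C is dropped because B scores higher on both objectives, while A, B, and D each represent a different trade-off and remain candidates.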
### Testing and Validation
- High stakes for model predictions due to expensive validation
- Focus on pre-experimental validation to reduce costs
- Multiple checkpoints for verification
- Comprehensive tracking of experimental outcomes
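
Pre-experimental validation with multiple checkpoints can be sketched as a chain of cheap computational checks that a candidate must pass before any lab time is spent on it. The specific checks and thresholds below (`check_alphabet`, `check_length`, `check_no_long_repeat`, and their limits) are assumptions for illustration; a real pipeline would likely include model-based checks such as structure prediction.

```python
from typing import NamedTuple


class CheckResult(NamedTuple):
    name: str
    passed: bool
    detail: str


VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def check_alphabet(seq):
    bad = sorted(set(seq) - VALID_RESIDUES)
    return CheckResult("alphabet", not bad, f"invalid residues: {bad}" if bad else "ok")


def check_length(seq, lo=10, hi=2000):
    # The length bounds are arbitrary placeholder thresholds.
    return CheckResult("length", lo <= len(seq) <= hi, f"length {len(seq)}")


def check_no_long_repeat(seq, max_run=6):
    # Track the longest run of one repeated residue.
    run = longest = 1 if seq else 0
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return CheckResult("homopolymer", longest <= max_run, f"longest run {longest}")


def run_checkpoints(seq, checks):
    """Run every check and keep all results so failures can be tracked."""
    results = [check(seq) for check in checks]
    return all(r.passed for r in results), results


CHECKS = [check_alphabet, check_length, check_no_long_repeat]
```

Running every check, instead of stopping at the first failure, keeps a complete record of why a candidate was rejected, which fits the emphasis on comprehensive tracking.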
## Unique LLMOps Considerations
### Cost Structure
- Emphasis on prediction accuracy over speed
- High cost of experimental validation
- Need to maximize value from each experiment
- Balance between computational and wet-lab costs
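
The trade-off between cheap computation and expensive wet-lab work can be illustrated with a simple budgeted selection: rank candidates by a model's predicted success probability and send only as many to the lab as the budget allows. The function name, the cost figures, and the probabilities below are invented for the example, and summing probabilities as "expected hits" assumes the predictions are reasonably calibrated.

```python
def select_for_lab(candidates, cost_per_experiment, budget):
    """Pick the highest-probability candidates that fit within the budget.

    candidates: list of (name, predicted_success_probability) pairs.
    Returns the chosen names and the expected number of successes.
    """
    affordable = int(budget // cost_per_experiment)
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    chosen = ranked[:affordable]
    expected_hits = sum(p for _, p in chosen)
    return [name for name, _ in chosen], expected_hits


if __name__ == "__main__":
    # Hypothetical predictions and costs.
    pool = [("a", 0.1), ("b", 0.6), ("c", 0.3), ("d", 0.5)]
    print(select_for_lab(pool, cost_per_experiment=1000, budget=2500))
```

With a budget that covers two experiments, the sketch selects candidates "b" and "d". Better predictions raise the expected hits per unit of lab spend, which is why accuracy is weighted above speed in this setting.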
### Regulatory Environment
- Consideration of GMO regulations in different markets
- Impact on product development strategy
- Influence on technology stack choices
### Scientific Integration
- Bridging the gap between traditional lab work and AI
- Training scientists in data-driven methodologies
- Building systems that support scientific workflow
## Future Directions
### Technology Development
- Expanding use of generative AI for protein design
- Development of more sophisticated protein language models
- Integration of new simulation capabilities
### Product Evolution
- Starting with high-margin cosmetics market
- Planning expansion into industrial materials
- Long-term vision for sustainable alternatives to plastics
- Exploration of specialized applications like space-grade adhesives
## Impact and Results
- Successfully launched Novacoll in under two years
- Demonstrated viability of AI-driven protein design
- Created scalable platform for sustainable materials
- Established new paradigm for biotech product development
## Challenges and Lessons Learned
- Importance of data quality in scientific experiments
- Need for balance between automation and human expertise
- Value of domain-specific languages for complex problems
- Critical role of proper data infrastructure
- Importance of choosing appropriate market entry points