Company
Cambrium
Title
LLMs and Protein Engineering: Building a Sustainable Materials Platform
Industry
Tech
Year
2023
Summary (short)
Cambrium is using LLMs and AI to design and generate novel proteins for sustainable materials, starting with vegan human collagen for cosmetics. They've developed a protein programming language and leveraged LLMs to transform protein design into a mathematical optimization problem, enabling them to efficiently search through massive protein sequence spaces. Their approach combines traditional protein engineering with modern LLM techniques, and it brought a biotech product to market in under two years.
# LLMs for Protein Engineering at Cambrium

## Company Overview

Cambrium is a biotech startup that combines AI and synthetic biology to create sustainable materials. Founded in 2020, they brought a biotech product to market in less than two years, a record in the EU biotech space. Their first product is Novacoll, a vegan human collagen for cosmetics, created through genetic engineering and AI-driven protein design.

## Technical Infrastructure

### Protein Programming Language

- Developed a novel protein programming language for designing proteins
- Language transforms protein design into a mathematical optimization problem
- Enables systematic exploration of protein sequence space (approximately 10^18 possibilities)
- Allows specification of design constraints

### LLM Integration

- Built a LangChain-based research assistant for scientists

### Data Infrastructure Challenges

- High cost of experiments makes data extremely valuable
- Critical focus on preventing data loss in pipelines
- Two main data source types: machine-generated and human-generated data
- Key challenge: digitizing scientific processes

### Production System Architecture

- Focus on reliability over latency
- Main users are internal scientists rather than external customers
- Less emphasis on high QPS or 24/7 availability
- Strong emphasis on data integrity and experiment tracking
- Integration with DNA synthesis APIs for automated ordering

## MLOps Practices

### Data Management

- Strict protocols for experimental data collection
- Integration of robotics to ensure data accuracy
- Focus on complete data capture due to high cost per datapoint
- Structured approach to managing both machine and human-generated data

### Model Development

- Evolution from traditional protein engineering to LLM-based approaches
- Current use of diffusion models for protein generation
- Integration of protein folding predictions
- Optimization for multiple constraints simultaneously

### Testing and Validation

- High stakes for model predictions due to expensive validation
- Focus on pre-experimental validation to reduce costs
- Multiple checkpoints for verification
- Comprehensive tracking of experimental outcomes

## Unique LLMOps Considerations

### Cost Structure

- Emphasis on prediction accuracy over speed
- High cost of experimental validation
- Need to maximize value from each experiment
- Balance between computational and wet-lab costs

### Regulatory Environment

- Consideration of GMO regulations in different markets
- Impact on product development strategy
- Influence on technology stack choices

### Scientific Integration

- Bridging the gap between traditional lab work and AI
- Training scientists in data-driven methodologies
- Building systems that support scientific workflow

## Future Directions

### Technology Development

- Expanding use of generative AI for protein design
- Development of more sophisticated protein language models
- Integration of new simulation capabilities

### Product Evolution

- Starting with the high-margin cosmetics market
- Planning expansion into industrial materials
- Long-term vision for sustainable alternatives to plastics
- Exploration of specialized applications like space-grade adhesives

## Impact and Results

- Successfully launched Novacoll in under two years
- Demonstrated viability of AI-driven protein design
- Created a scalable platform for sustainable materials
- Established a new paradigm for biotech product development

## Challenges and Lessons Learned

- Importance of data quality in scientific experiments
- Need for balance between automation and human expertise
- Value of domain-specific languages for complex problems
- Critical role of proper data infrastructure
- Importance of choosing appropriate market entry points
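The idea of turning protein design into a mathematical optimization problem, described under Protein Programming Language, can be illustrated with a minimal sketch. This is not Cambrium's language or method. The two constraints (a forbidden residue and a target hydrophobic fraction), the residue grouping, and the function names `penalty` and `optimize` are all assumptions made for illustration, and the search is plain random-mutation hill climbing rather than an LLM or diffusion model.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
HYDROPHOBIC = set("AVILMFWY")  # illustrative grouping (an assumption)


def penalty(seq, forbidden="C", target_hydrophobic=0.3):
    """Score a sequence against two hypothetical constraints.

    Lower is better; 0.0 means both constraints are satisfied.
    """
    # Constraint 1 (hypothetical): no forbidden residues.
    forbidden_count = sum(seq.count(aa) for aa in forbidden)
    # Constraint 2 (hypothetical): hydrophobic fraction close to a target.
    hydrophobic_fraction = sum(aa in HYDROPHOBIC for aa in seq) / len(seq)
    return forbidden_count + abs(hydrophobic_fraction - target_hydrophobic)


def optimize(length=20, steps=2000, seed=0):
    """Hill climbing: mutate one position, keep the change if not worse."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    best = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    best_score = penalty(best)
    for _ in range(steps):
        pos = rng.randrange(length)
        candidate = best[:pos] + rng.choice(AMINO_ACIDS) + best[pos + 1:]
        score = penalty(candidate)
        if score <= best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    seq, score = optimize()
    print(seq, round(score, 3))
```

A sequence of length L has 20^L candidates, so exhaustive search becomes infeasible very quickly. Expressing constraints as a single score is what lets a search procedure, whether this toy hill climber or a learned generative model, compare candidates before any costly wet-lab experiment.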
