ZenML

LLM-Powered 3D Model Generation for 3D Printing

Build Great AI 2024

Build Great AI developed a prototype application that uses multiple LLMs to generate 3D-printable models from text descriptions. The system queries several models, including Llama 3.1, GPT-4o, and Claude 3.5 Sonnet, to generate OpenSCAD code, which is then rendered to STL files for 3D printing. The solution demonstrates rapid prototyping, reducing design time from hours to minutes, while working around LLMs' spatial reasoning limitations through multiple simultaneous generations and iterative refinement.

Industry

Tech

Overview

This case study documents an early-stage prototype application called DesignBench, developed by Dan Becker through Build Great AI, which demonstrates a novel approach to bridging the gap between natural language and physical object creation. The application emerged from observations made while teaching LLM fine-tuning courses to thousands of students, where Dan noticed that despite high enthusiasm for AI, few participants had concrete ideas for useful products. This inspired the focus on creating tangible, physical outputs—what Dan describes as moving “from bits to atoms.”

The core premise is democratizing 3D design: an estimated 90% of people who own 3D printers don't know what to do with them, largely because CAD software has a steep learning curve. DesignBench lets users describe objects in natural language and receive 3D-printable designs within minutes.

Technical Architecture and Multi-Model Strategy

One of the most interesting LLMOps aspects of this case study is the deliberate use of multiple LLMs in parallel rather than reliance on a single model. The system simultaneously queries several models, including Llama 3.1 70B (served via Groq), GPT-4o, GPT-4o Mini, and Claude 3.5 Sonnet.

This multi-model approach is a pragmatic response to the current limitations of LLMs in spatial reasoning. As Dan explicitly acknowledges, spatial awareness is “really bad” for LLMs as of August 2024. Many generated objects have detached parts, incorrect proportions, or other fundamental issues. By running multiple models simultaneously, the application provides users with a variety of outputs—some will inevitably be poor, but others may be closer to what the user wants.

The system also experiments with different prompting strategies for each model, including Chain of Thought versus direct prompting. This creates a matrix of outputs: multiple models × multiple prompting strategies × multiple CAD languages. The philosophy is that in the face of uncertainty about what will work best, breadth of experimentation compensates for individual model limitations.
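A minimal sketch of this fan-out in Python, assuming a hypothetical `call_model` provider wrapper; the model list and prompt templates below are illustrative stand-ins, not the app's actual configuration:

```python
import asyncio
from itertools import product

MODELS = ["llama-3.1-70b", "gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet"]

PROMPT_TEMPLATES = {
    "direct": "Write OpenSCAD code for: {request}",
    "chain_of_thought": (
        "Reason step by step about the geometry and proportions, "
        "then write OpenSCAD code for: {request}"
    ),
}

async def call_model(model: str, prompt: str) -> str:
    """Placeholder for a provider-specific API call (OpenAI, Anthropic, Groq, ...)."""
    raise NotImplementedError

async def generate_candidates(request: str) -> list[dict]:
    """Fan out over the full matrix of models x prompting strategies."""
    combos = list(product(MODELS, PROMPT_TEMPLATES.items()))
    tasks = [
        call_model(model, template.format(request=request))
        for model, (_, template) in combos
    ]
    # gather preserves order, so each output can be matched to its combo;
    # failed generations are dropped rather than failing the whole batch.
    outputs = await asyncio.gather(*tasks, return_exceptions=True)
    return [
        {"model": model, "strategy": name, "openscad_code": out}
        for (model, (name, _)), out in zip(combos, outputs)
        if not isinstance(out, Exception)
    ]
```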

Code Generation and CAD Languages

The LLMs don’t generate 3D models directly—they generate code in CAD languages, primarily OpenSCAD. This code is then rendered to produce the visual 3D model and can be exported as STL files, the standard format for 3D printing software. This approach leverages the strength of LLMs in code generation while outsourcing the actual rendering to deterministic CAD software.

The choice of OpenSCAD as a target language is notable because it’s a programmatic CAD language where objects are defined through code rather than visual manipulation. This makes it more suitable for LLM generation than GUI-based CAD tools. The system experimented with multiple CAD languages, though OpenSCAD appears to be a primary target.

One advantage of generating code rather than direct 3D representations is that the code can be inspected, debugged, and manually modified if needed. Users can view the generated code through a “get code” option in the interface.
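A rough sketch of that render step, assuming the standard `openscad` command-line binary is installed; `render_stl` and the sample cup code are illustrative, not DesignBench's implementation:

```python
import subprocess
import tempfile
from pathlib import Path

def render_stl(openscad_code: str, stl_path: str) -> None:
    """Compile LLM-generated OpenSCAD source to an STL file.

    Assumes the `openscad` binary is on PATH; the invocation matches the
    standard OpenSCAD CLI (`openscad -o out.stl in.scad`).
    """
    with tempfile.NamedTemporaryFile(mode="w", suffix=".scad", delete=False) as f:
        f.write(openscad_code)
        scad_path = f.name
    try:
        # A non-zero exit code doubles as a cheap validity check on the
        # generated code before the design is shown to the user.
        subprocess.run(
            ["openscad", "-o", stl_path, scad_path],
            check=True, capture_output=True,
        )
    finally:
        Path(scad_path).unlink()

# Illustrative input: a hollowed cylinder, the kind of code a model might emit.
cup = """
difference() {
    cylinder(h = 80, r = 35);        // outer body
    translate([0, 0, 3])
        cylinder(h = 80, r = 32);    // interior, leaving a 3 mm base
}
"""
# render_stl(cup, "cup.stl")
```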

Inference Infrastructure and Latency Considerations

An interesting operational detail is the use of Groq for serving Llama 3.1 models. Dan specifically mentions that when watching the application populate results, “you always have one or two that pop up way before the others”—those are the Groq-served Llama models. This highlights an important LLMOps consideration: when running multiple models in parallel for user-facing applications, inference latency varies significantly across providers.

The choice of Groq was partly practical—at the time of recording, it was free, though Dan expressed hope for a paid account to allow more aggressive usage. This reflects the reality of early-stage projects navigating the evolving pricing and availability landscape of LLM inference providers.

Regarding quality-versus-speed trade-offs, Dan noted that the Llama 3.1 70B model (the largest he was using via Groq) and GPT-4o Mini are "really not very good" compared to GPT-4o and Claude 3.5 Sonnet. However, the speed advantage of Groq-served Llama makes it valuable in a multi-model setup where users benefit from fast initial results while waiting for higher-quality models to complete.
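One way to exploit that spread is to surface results in completion order rather than request order. A sketch of the pattern, reusing the same assumed `call_model` wrapper as in the earlier snippet:

```python
import asyncio

async def call_model(model: str, prompt: str) -> str:
    """Same assumed provider wrapper as in the fan-out sketch above."""
    raise NotImplementedError

async def stream_results(prompt: str, models: list[str]):
    """Yield (model, code) pairs in completion order, not request order.

    Fast backends (e.g. Groq-served Llama) pop up first while slower,
    higher-quality models are still generating.
    """
    async def tagged(model: str):
        return model, await call_model(model, prompt)

    for future in asyncio.as_completed([tagged(m) for m in models]):
        model, code = await future
        yield model, code  # hand this candidate to the UI as soon as it lands
```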

Dan mentioned that the 405B-parameter Llama model (supposedly competitive with GPT-4o) hadn't been tested yet, suggesting potential for quality improvements with larger open models.

Iterative Design Through Conversation

A key UX and LLMOps pattern demonstrated is iterative refinement through conversation. Users don't expect perfect results from the first prompt—instead, they select a promising design from the initial batch and refine it through follow-up prompts, as in the demo, where a cup design was progressively personalized with conversational tweaks.

This mirrors the natural workflow in traditional CAD software where designers iterate, but with natural language as the interface. The key insight is that imperfect initial results are acceptable when iteration is fast and intuitive.
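A minimal sketch of such a refinement loop, assuming a generic chat-completions wrapper; `chat` and the message format are assumptions, not details from the talk:

```python
from typing import Dict, List

def chat(model: str, messages: List[Dict[str, str]]) -> str:
    """Placeholder for a provider chat-completions call; assumed, not the app's API."""
    raise NotImplementedError

def refine(messages: List[Dict[str, str]], feedback: str,
           model: str = "claude-3-5-sonnet") -> str:
    # Append the user's plain-language tweak, then ask for revised OpenSCAD code.
    messages.append({"role": "user", "content": feedback})
    revised = chat(model, messages)
    messages.append({"role": "assistant", "content": revised})
    return revised

# Usage: seed the history with the candidate the user picked, then iterate.
# history = [{"role": "user", "content": "Design a cup with a handle."},
#            {"role": "assistant", "content": chosen_openscad_code}]
# refine(history, "Make the walls 3 mm thick and add my initials to the side.")
```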

Multimodal Capabilities and Future Directions

The application includes an “upload image” feature, leveraging the multimodal capabilities of modern LLMs. Dan describes a use case from a neighbor who is a hobbyist inventor: the ideal workflow would be to sketch a design on paper and show that image to the model rather than describing it in text. This represents an interesting extension of the text-to-3D paradigm.
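As a rough sketch, the image path could look like the following with an OpenAI-style vision API; the model choice and prompt wording are assumptions, not details from the demo:

```python
import base64
from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()

def sketch_to_openscad(image_path: str, note: str = "") -> str:
    """Send a hand-drawn sketch to a multimodal model, asking for OpenSCAD code."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Generate OpenSCAD code for the object in this sketch. {note}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```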

Several future directions mentioned relate to LLMOps practices, including evaluating the larger 405B Llama model, building out the sketch-to-model image workflow, and moving to paid inference tiers to support heavier usage.

Honest Assessment of Limitations

One refreshing aspect of this case study is the candid acknowledgment of limitations. Dan repeatedly emphasizes that spatial reasoning is challenging for current LLMs, many generated objects are unusable, and the complexity ceiling is lower than professional CAD software. The application is explicitly positioned as useful for “the home inventor who’s going to make something small” rather than for professional architects or engineers designing buildings.

This honesty about scope is valuable from an LLMOps perspective—setting appropriate user expectations is crucial for adoption and satisfaction.

Practical Results

The demonstration showed tangible results: a personalized cup design was refined from initial prompt to final STL file in a few minutes of conversation, and Dan subsequently 3D printed and sent photos of the physical cup to the podcast host. The estimated time savings compared to traditional CAD software was dramatic—what might take hours even for someone experienced with CAD software was accomplished in minutes.

Early Stage Considerations

This is explicitly described as a "pre-Alpha" hobby side project started less than a month before the recording. There's no monetization at this stage, and Dan is actively seeking beta testers with 3D printers. The application is hosted at DesignBench.ai.

From an LLMOps maturity perspective, this represents the experimental/prototype phase where the focus is on demonstrating feasibility and gathering user feedback rather than production-scale concerns like reliability, monitoring, or cost optimization at scale. However, the architectural decisions—multi-model orchestration, code generation as an intermediate representation, and iterative refinement—represent patterns that would carry forward into a production system.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

Multi-Industry AI Deployment Strategies with Diverse Hardware and Sovereign AI Considerations

AMD / Somite AI / Upstage / Rambler AI 2025

This panel discussion at AWS re:Invent features three companies deploying AI models in production across different industries: Somite AI using machine learning for computational biology and cellular control, Upstage developing sovereign AI with proprietary LLMs and OCR for document extraction in enterprises, and Rambler AI building vision language models for industrial task verification. All three leverage AMD GPU infrastructure (MI300 series) for training and inference, emphasizing the importance of hardware choice, open ecosystems, seamless deployment, and cost-effective scaling. The discussion highlights how smaller, domain-specific models can achieve enterprise ROI where massive frontier models failed, and explores emerging areas like physical AI, world models, and data collection for robotics.

Reinforcement Learning for Code Generation and Agent-Based Development Tools

Cursor 2025

This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.
