Trigent Software attempted to develop IRGPT, a fine-tuned LLM for multilingual Ayurvedic medical consultations. The project aimed to combine traditional Ayurvedic medicine with modern AI capabilities, targeting multiple South Indian languages. Despite assembling a substantial dataset and implementing a fine-tuning pipeline using GPT-2 Medium, the team faced significant challenges with multilingual data quality and cultural context. While the English-only version showed promise, the full multilingual implementation remains a work in progress.
This case study presents a refreshingly honest account from Trigent Software, a consulting company with approximately 30 years of experience as a Microsoft and Oracle partner, about their attempt to build “IRGPT” (also referred to as “Ayur GPT”) — a fine-tuned large language model designed for multilingual Ayurveda consultations. The presenter, Andy (Anand Pia), explicitly frames this as a learning experience rather than a success story, making it a valuable case study in the challenges of deploying LLMs in specialized medical domains with multilingual requirements.
Ayurveda is an ancient Indian holistic healing system approximately 5,000 years old, based on balancing mind, body, and spirit through concepts like doshas (body imbalances), prakriti/vikriti (constitutional balance), and herbal treatments. The ambitious goal was to modernize access to this traditional medicine through AI-powered consultations.
The team set out with highly ambitious objectives, combining fine-tuning, multilingual support, cultural adaptation, and medical domain expertise simultaneously, which ultimately proved too complex to achieve in a single iteration.
The presenter candidly admits that attempting all these objectives simultaneously led to project failure, requiring a significant scope reduction.
The data preparation phase consumed significant effort and became a major source of challenges. The team aggregated data from multiple sources.
The data preprocessing pipeline included extensive deduplication, filtering of non-essential or uncommon conversation pieces, and restructuring into conversation pairs with medical queries and treatments. They labeled datasets and organized them to support multilingual setups, applying translation for regional language variants.
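The restructuring step can be illustrated with a minimal sketch. This is not the team's actual pipeline; the field names (`question`, `answer`) and the whitespace-normalizing hash key are assumptions for illustration.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical rows hash alike."""
    return " ".join(text.lower().split())

def dedupe_and_pair(rows):
    """Deduplicate raw Q/A rows and restructure them into conversation pairs
    of medical query and treatment. `rows` is assumed to be a list of dicts
    with 'question' and 'answer' keys."""
    seen = set()
    pairs = []
    for row in rows:
        key = hashlib.sha256(normalize(row["question"]).encode()).hexdigest()
        if key in seen:
            continue  # drop exact and whitespace/case-only duplicates
        seen.add(key)
        pairs.append({"query": row["question"].strip(),
                      "treatment": row["answer"].strip()})
    return pairs

# Toy rows: the second entry differs only in casing and whitespace.
rows = [
    {"question": "What balances pitta dosha?", "answer": "Cooling herbs such as amla."},
    {"question": "What balances  Pitta dosha? ", "answer": "Cooling herbs such as amla."},
    {"question": "What is prakriti?", "answer": "One's innate constitution."},
]
pairs = dedupe_and_pair(rows)
print(len(pairs))  # 2
```

A production pipeline would add near-duplicate detection beyond exact normalization (e.g., fuzzy matching), but the shape of the output — flat query/treatment pairs — is what downstream fine-tuning consumes.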
However, the team encountered what they described as “The Good, Bad, and Ugly” of data challenges. While they had volume (approximately 1 million rows across multilingual variants), validation proved extremely difficult. Translations were frequently inaccurate, contextual meanings were lost, and the team lacked resources to manually validate such large datasets. Some translations were described as “bizarre” and many unknowns remained unaddressed.
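One cheap way to triage translation quality at this scale is a round-trip (back-translation) spot-check. The sketch below assumes this approach; the two translate callables are hypothetical stand-ins for whatever MT service is used, and the similarity threshold is illustrative.

```python
from difflib import SequenceMatcher

def round_trip_score(text, to_regional, to_english):
    """Translate English -> regional language -> English and score how much
    surface meaning survives. Both callables are hypothetical stand-ins
    for a real machine-translation service."""
    back = to_english(to_regional(text))
    return SequenceMatcher(None, text.lower(), back.lower()).ratio()

def flag_for_review(samples, to_regional, to_english, threshold=0.6):
    """Return samples whose round trip falls below the similarity threshold,
    so only the suspicious minority needs manual validation."""
    return [s for s in samples
            if round_trip_score(s, to_regional, to_english) < threshold]

# Toy translators: `good` preserves the text, `bad` mimics the kind of
# "bizarre" lossy translation the team reported.
good = lambda s: s
bad = lambda s: "bizarre output"
print(flag_for_review(["Triphala aids digestion."], good, good))  # []
```

This only catches surface-level drift, not lost cultural context, which is consistent with the team's finding that some noise had to be accepted.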
After extensive iteration on data quality — including additional deduplication, question rephrasing, further classification, contextualization, and summarization — they reduced the corpus to approximately 113,000 validated samples as a working dataset.
Facing the reality of multilingual complexity, the team made a pragmatic decision to pivot back to basics by focusing solely on English. This simplification allowed them to make progress and gather learnings that could inform future iterations. The presenter emphasized this as an “agile” approach — accepting current limitations to enable forward movement.
For the fine-tuning approach, they selected GPT-2 Medium as their base model and applied standard supervised fine-tuning on A100 GPUs.
The training ran for approximately 590 steps/batches until the loss stabilized sufficiently. They used TensorBoard for monitoring, which the presenter praised as working well for their needs.
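A configuration sketch of such a setup, assuming the Hugging Face `Trainer` API (the talk did not specify the framework), might look like the following. All hyperparameters other than the ~590-step budget and TensorBoard logging are illustrative assumptions.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
model = AutoModelForCausalLM.from_pretrained("gpt2-medium")

args = TrainingArguments(
    output_dir="irgpt-ft",
    max_steps=590,                  # the talk cites ~590 steps until loss stabilized
    per_device_train_batch_size=8,  # illustrative; sized for a single A100
    learning_rate=5e-5,             # illustrative default for GPT-2 fine-tuning
    logging_steps=10,
    report_to=["tensorboard"],      # the team monitored loss curves in TensorBoard
)

# `train_dataset` stands in for the tokenized query/treatment pairs;
# dataset construction details were not covered in the talk.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```

With `report_to=["tensorboard"]`, loss curves land under `output_dir` and can be inspected with `tensorboard --logdir irgpt-ft`, matching the monitoring workflow the presenter described.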
The case study is notable for its transparency about failures and challenges:
Language and Cultural Nuances: The original vision of multilingual support for Kannada, Tamil, and Malayalam proved far more difficult than anticipated. Standard machine translation approaches failed to capture the nuanced meaning of Ayurvedic terminology, which often has no direct equivalent in modern medical vocabulary. Cultural context embedded in how symptoms and treatments are described varied significantly across regions.
Domain-Specific Data Scarcity: Limited availability of Ayurveda-specific data in regional languages was identified as the single biggest challenge. This specialized domain combined with low-resource languages created a compounding difficulty that could not be solved with existing translation models.
Contextual Accuracy: Even when translations were technically correct, the contextual and cultural relevance of responses was often inappropriate. Medical advice that works culturally in one region may not translate appropriately to another, even within the same country.
Validation at Scale: With datasets approaching 1 million rows in multilingual form, manual validation was impractical. The team had to accept that some amount of noise would remain in training data.
The presentation mentions plans to use approximately 2GB of Ayurveda books through RAG logic to “inject terminologies and have that sequenced in the easiest manner possible.” While the full RAG implementation details weren’t extensively covered, this represents a hybrid approach combining fine-tuning with retrieval to ground responses in authoritative Ayurvedic texts.
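Since the RAG implementation details weren't covered, the retrieval idea can only be sketched. The version below uses a toy word-overlap scorer in place of a real embedding index, and the corpus, chunk size, and prompt template are all assumptions.

```python
def chunk(text, size=10):
    """Split a source text (e.g., a digitized Ayurveda book) into
    fixed-size word windows for retrieval."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query, chunks, k=2):
    """Rank chunks by word overlap with the query. A real system would use
    an embedding index; this toy scorer just shows the control flow."""
    q = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    """Ground the model's answer in retrieved passages from the texts."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Use only this context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Toy corpus standing in for the ~2GB of digitized Ayurveda books.
books = ("Triphala is a classic formulation said to support digestion. "
         "Vata dosha governs movement and the nervous system.")
chunks = chunk(books)
top = retrieve("which formulation said to support digestion", chunks, k=1)
print(top[0])  # the Triphala passage
```

Grounding generation in retrieved passages this way is what would let the system "inject terminologies" from authoritative texts rather than rely solely on fine-tuned weights.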
The team developed a working demo (mentioned as available via private beta access), though they acknowledge it has noticeable limitations in cultural nuances and contextual accuracy. The presenter describes IRGPT as “promising” as a step toward combining AI with traditional medicine.
Future plans include completing the RAG integration over authoritative Ayurvedic texts and revisiting multilingual support for Kannada, Tamil, and Malayalam in later iterations.
This case study offers several valuable lessons for LLMOps practitioners:
Scope Management: The team’s experience demonstrates the importance of iterative development. Their initial scope was too ambitious, combining fine-tuning, multilingual support, cultural adaptation, and medical domain expertise simultaneously. The pivot to English-only represents a classic “minimum viable product” approach that should have been the starting point.
Data Quality Over Quantity: Despite having access to approximately 1 million rows of data, the effective usable dataset was only about 113,000 rows after quality filtering. This highlights the LLMOps reality that raw data volume is often misleading.
Domain Expertise Requirements: Medical AI applications, especially in traditional medicine systems with specialized terminology, require deep domain expertise for data validation and output evaluation. The gap between technical ML capabilities and domain-specific requirements is often underestimated.
Infrastructure Choices: The use of A100 GPUs and GPT-2 Medium represents a practical middle ground — powerful enough for meaningful fine-tuning but not requiring the most expensive infrastructure. This reflects real-world budget constraints mentioned by the presenter.
Honest Assessment Culture: Perhaps the most valuable aspect of this case study is the culture of honest self-assessment. Rather than presenting inflated claims, the team acknowledges failures openly and frames them as learning opportunities. This approach is essential for productive LLMOps, where understanding what doesn’t work is often as valuable as knowing what does.
The presentation was delivered at an NLP Summit, suggesting the company values contributing to the broader community’s knowledge even when results are mixed. This openness to sharing failures and inviting collaboration represents best practices in the evolving field of LLMOps.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
This lecture transcript from Yangqing Jia, VP at NVIDIA and founder of Lepton AI (acquired by NVIDIA), explores the evolution of AI system design from an engineer's perspective. The talk covers the progression from research frameworks (Caffe, TensorFlow, PyTorch) to production AI infrastructure, examining how LLM applications are built and deployed at scale. Jia discusses the emergence of "neocloud" infrastructure designed specifically for AI workloads, the challenges of GPU cluster management, and practical considerations for building consumer and enterprise LLM applications. Key insights include the trade-offs between open-source and closed-source models, the importance of RAG and agentic AI patterns, infrastructure design differences between conventional cloud and AI-specific platforms, and the practical challenges of operating LLMs in production, including supply chain management for GPUs and cost optimization strategies.
Smartling operates an enterprise-scale AI-first agentic translation delivery platform serving major corporations like Disney and IBM. The company addresses challenges around automation, centralization, compliance, brand consistency, and handling diverse content types across global markets. Their solution employs multi-step agentic workflows where different model functions validate each other's outputs, combining neural machine translation with large language models, RAG for accessing validated linguistic assets, sophisticated prompting, and automated post-editing for hyper-localization. The platform demonstrates measurable improvements in throughput (from 2,000 to 6,000-7,000 words per day), cost reduction (4-10x cheaper than human translation), and quality approaching 70% human parity for certain language pairs and content types, while maintaining enterprise requirements for repeatability, compliance, and brand voice consistency.