Grammarly developed CoEdIT, a specialized text editing LLM that outperforms larger models while being up to 60 times smaller. Through targeted instruction tuning on a carefully curated dataset of text editing tasks, they created models ranging from 770M to 11B parameters that achieved state-of-the-art performance on multiple editing benchmarks, outperforming models like GPT-3-Edit (175B parameters) and ChatGPT in both automated and human evaluations.
Grammarly, a widely used AI writing assistant platform, developed CoEdIT (Collaborative Editing with Instruction Tuning), an open-source instruction-tuned large language model specifically designed for text editing tasks. This case study presents an interesting approach to LLMOps where the focus shifts from building ever-larger general-purpose models to creating smaller, task-specific models that can outperform their larger counterparts on targeted use cases. The work was published and accepted as a Findings paper at EMNLP 2023, one of the premier conferences in natural language processing.
The core insight driving this work is that general-purpose LLMs, while capable across a broad range of tasks, may not be optimal for specific use cases like text editing. By narrowing the focus and creating a “specialist” model through instruction tuning on a carefully curated dataset, Grammarly demonstrated that significant performance gains and efficiency improvements can be achieved simultaneously.
The Grammarly team identified several critical gaps in existing approaches to developing text editing models using LLMs:
The team hypothesized that fine-tuning on a “dense task distribution” — tasks that are closely related to each other within the text editing domain — would enable better performance and generalization to adjacent tasks. This is analogous to training a human specialist who becomes expert in a specific domain rather than a generalist who knows a little about everything.
A critical aspect of successful instruction tuning is the quality and design of the training dataset. The Grammarly team built upon their previous work with the IteraTeR+ dataset, which contains various text editing tasks focused on non-meaning-changing edits. The process involved several key steps:
The team translated edit categories (Fluency, Coherence, Clarity, Style) into natural language instructions like “Make this more coherent.” This translation from categorical labels to natural language is essential for instruction tuning as it teaches the model to respond to human-like commands.
For subjective categories like Style, the team introduced specific sub-intentions including Paraphrasing, Formality Style Transfer, and Neutralization. This granularity helps the model understand nuanced differences between editing intents.
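A minimal sketch of what this label-to-instruction mapping might look like is shown below. The template strings, the intent names, and the `to_instruction_example` helper are illustrative assumptions for exposition, not Grammarly's actual dataset-construction code.

```python
# Illustrative mapping from edit-intent labels to natural-language
# instruction templates (the exact CoEdIT templates may differ).
INSTRUCTION_TEMPLATES = {
    "fluency":   "Fix the grammar in this sentence:",
    "coherence": "Make this text more coherent:",
    "clarity":   "Make this sentence clearer and more concise:",
    # Subjective "Style" edits are split into finer-grained sub-intentions.
    "paraphrase":     "Paraphrase this sentence:",
    "formality":      "Rewrite this in a more formal tone:",
    "neutralization": "Make this sentence more neutral:",
}

def to_instruction_example(intent: str, source: str, target: str) -> dict:
    """Turn an (intent, source, target) edit pair into an instruction-tuning example."""
    return {"input": f"{INSTRUCTION_TEMPLATES[intent]} {source}", "output": target}

# A Fluency edit becomes a natural-language instruction pair.
print(to_instruction_example(
    "fluency",
    "She go to school every days.",
    "She goes to school every day.",
))
```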
To improve robustness to different phrasings, the team created paraphrases of instruction templates and added them to the dataset, ensuring, for example, that the model responds appropriately whether a user says “write” or “rewrite,” since users treat these as essentially equivalent instructions. This is an important consideration for production systems where users may phrase their requests in varied ways.
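The sketch below shows one way such augmentation could be implemented, using a hand-written list of paraphrased templates; how CoEdIT's paraphrases were actually produced is not detailed here, so treat the list and the `augment` helper as assumptions.

```python
# Hand-written paraphrases of a single instruction template (illustrative).
PARAPHRASED_TEMPLATES = {
    "formality": [
        "Rewrite this in a more formal tone:",
        "Write this more formally:",
        "Make this sound more formal:",
        "Improve the formality of this sentence:",
    ],
}

def augment(intent: str, source: str, target: str) -> list[dict]:
    """Emit one training example per paraphrased template for a single edit pair."""
    return [
        {"input": f"{template} {source}", "output": target}
        for template in PARAPHRASED_TEMPLATES[intent]
    ]

examples = augment(
    "formality",
    "gotta finish this asap",
    "I need to finish this as soon as possible.",
)
print(len(examples))  # 4 variants of the same underlying edit
```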
The team fine-tuned pre-trained FLAN-T5 models at three different scales: 770M (Large), 3B (XL), and 11B (XXL) parameters.
The choice of FLAN-T5 as the base model is notable because FLAN-T5 is itself an instruction-tuned model, meaning the team performed additional specialized instruction tuning on top of an already instruction-tuned foundation. This approach leverages the general instruction-following capabilities while adding domain-specific expertise.
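A minimal fine-tuning sketch with the Hugging Face `transformers` library is given below; the base checkpoint, hyperparameters, and two-example dataset are placeholders, not the published CoEdIT training configuration.

```python
# Supervised instruction tuning of FLAN-T5 on editing pairs (illustrative).
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

base = "google/flan-t5-large"  # ~770M; flan-t5-xl (3B) and flan-t5-xxl (11B) scale this up
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

raw = Dataset.from_list([
    {"input": "Fix the grammar in this sentence: She go to school every days.",
     "output": "She goes to school every day."},
    {"input": "Make this sound more formal: gotta finish this asap",
     "output": "I need to finish this as soon as possible."},
    # ... the full instruction-formatted editing dataset goes here
])

def tokenize(batch):
    enc = tokenizer(batch["input"], truncation=True, max_length=256)
    enc["labels"] = tokenizer(text_target=batch["output"],
                              truncation=True, max_length=256)["input_ids"]
    return enc

train = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="coedit-ft",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3,
                                  learning_rate=1e-4),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```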
The evaluation strategy employed by Grammarly is worth examining closely as it represents a thoughtful approach to assessing LLM quality in production contexts where subjective judgment plays a significant role.
Comparison Groups: The team established four comparison groups to contextualize CoEdIT’s performance:
Quantitative Analysis: The models were evaluated against standard test sets from multiple text editing benchmarks, covering syntactic, semantic, and stylistic edit requirements. This multi-dimensional evaluation is important for understanding model capabilities across different editing scenarios.
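For example, SARI is a standard automatic metric for text editing and simplification; assuming it is among the metrics used, a scoring pass with the Hugging Face `evaluate` library might look like the following sketch.

```python
# Automatic edit-quality scoring with SARI (illustrative data).
import evaluate

sari = evaluate.load("sari")

sources     = ["She go to school every days."]
predictions = ["She goes to school every day."]           # model output
references  = [["She goes to school every day.",           # one or more gold edits
                "She goes to school daily."]]

print(sari.compute(sources=sources, predictions=predictions, references=references))
```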
Qualitative Analysis (Human Evaluation): Recognizing the inherent subjectivity in judging writing quality, the team conducted human evaluations where expert evaluators compared outputs from CoEdIT-XL (3B parameters) and GPT-3-Edit (175B parameters) across fluency, accuracy, and meaning preservation dimensions.
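As a toy illustration of how such pairwise judgments might be aggregated per dimension (the judgment data below is invented):

```python
# Tally which system expert evaluators preferred on each dimension (toy data).
judgments = [
    {"fluency": "coedit-xl", "accuracy": "coedit-xl", "meaning": "tie"},
    {"fluency": "gpt3-edit", "accuracy": "coedit-xl", "meaning": "coedit-xl"},
    {"fluency": "coedit-xl", "accuracy": "tie",       "meaning": "coedit-xl"},
]

for dim in ("fluency", "accuracy", "meaning"):
    wins = sum(j[dim] == "coedit-xl" for j in judgments)
    print(f"{dim}: CoEdIT-XL preferred in {wins}/{len(judgments)} comparisons")
```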
Adjacent Task Evaluation: To test generalization capabilities, the team evaluated CoEdIT on tasks it wasn’t explicitly trained on, including sentence compression and politeness transfer. This evaluation is particularly important for production systems where users may request variations of trained tasks.
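One rough way to probe this kind of generalization is to prompt a released checkpoint with an instruction for a task outside its training distribution, as in the sketch below; the `grammarly/coedit-large` checkpoint name and the prompt wording are assumptions about the public release.

```python
# Zero-shot probe of an adjacent task (sentence compression) on a released checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-large")
model = AutoModelForSeq2SeqLM.from_pretrained("grammarly/coedit-large")

prompt = ("Compress this sentence: The meeting, which had originally been "
          "scheduled for Monday morning, was eventually moved to Friday afternoon.")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```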
Composite Task Evaluation: Real-world editing often involves multi-step instructions like “make the text simpler, paraphrase it, and make it formal.” The team developed CoEdIT-Composite by enriching the training set with multi-part tasks and evaluated it separately against the base CoEdIT-XL and GPT-3-Edit.
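A simple sketch of how single-task templates could be chained into composite instructions for such an enriched training set follows; the joining logic and wording are assumptions rather than the actual CoEdIT-Composite recipe.

```python
# Compose multi-part editing instructions from single-task phrasings (illustrative).
SINGLE_TASK_PHRASES = {
    "simplify":   "make the text simpler",
    "paraphrase": "paraphrase it",
    "formality":  "make it formal",
}

def compose_instruction(intents: list[str]) -> str:
    parts = [SINGLE_TASK_PHRASES[i] for i in intents]
    joined = parts[0] if len(parts) == 1 else ", ".join(parts[:-1]) + ", and " + parts[-1]
    return joined.capitalize() + ":"

print(compose_instruction(["simplify", "paraphrase", "formality"]))
# -> Make the text simpler, paraphrase it, and make it formal:
```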
The results demonstrated that task-specific instruction tuning can yield dramatic efficiency gains while maintaining, and even improving, performance:
This case study offers several valuable lessons for LLMOps practitioners:
Model Sizing and Efficiency: The dramatic parameter reduction (up to 60x) while maintaining or improving performance has significant implications for deployment costs, latency, and infrastructure requirements. Smaller models are cheaper to host, faster to run inference on, and can potentially be deployed on edge devices or in resource-constrained environments.
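A back-of-the-envelope comparison of weight memory alone (assuming fp16 weights at 2 bytes per parameter and ignoring activations, KV caches, and serving overhead) illustrates the scale of the difference:

```python
# Approximate fp16 weight footprint for each model size.
for name, params in [("CoEdIT-L", 0.77e9), ("CoEdIT-XL", 3e9),
                     ("CoEdIT-XXL", 11e9), ("GPT-3-Edit", 175e9)]:
    print(f"{name}: ~{params * 2 / 1e9:.0f} GB of weights")
```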
Task-Specific vs. General-Purpose Models: The “specialist vs. generalist” framing provides a useful mental model for deciding when to use general-purpose LLMs versus fine-tuned models. For well-defined application domains, task-specific instruction tuning can yield substantial benefits.
Dataset Quality and Design: The careful attention to dataset construction — including natural language instruction templates, sub-intention categorization, and paraphrase augmentation — highlights the importance of high-quality training data for instruction tuning success.
Multi-Dimensional Evaluation: The combination of quantitative benchmarks, human evaluation, adjacent task testing, and composite task assessment provides a comprehensive evaluation framework that accounts for the subjective nature of text quality while still producing actionable metrics.
Open Source Strategy: By releasing the models and data publicly, Grammarly enables reproducibility and community contribution while positioning itself as a thought leader in the space. This is a strategic choice that balances competitive advantage with the benefits of open research.
The authors acknowledge several areas for future improvement:
While the results are impressive, it’s worth noting some caveats:
Despite these caveats, the work represents a valuable contribution to the LLMOps landscape by demonstrating that thoughtful specialization can achieve better results than brute-force scaling, with significant implications for cost, efficiency, and practical deployment of LLMs in production writing assistance applications.