Linguistic-Informed Approach to LLM Systems at Mastercard
Overview
Chris Brousseau, a lead data scientist at Mastercard, presents a framework for putting LLMs into production that foregrounds linguistic principles rather than benchmark metrics alone. The approach argues that explicitly handling syntax, morphology, semantics, and pragmatics produces more effective and maintainable LLM systems.
Key Linguistic Components in LLM Systems
Syntax
- LLMs have largely solved syntax, the layer of language described by transformational-generative grammar
- Models can generate infinite combinations of grammatically correct structures
- Current implementations effectively handle basic grammar rules
Morphology
- Implemented through tokenization and embeddings
- Current solutions are approximately 75-80% effective
- Challenges with statistical tokenization methods affecting model performance
- Numerical operations remain weak in large models such as GPT-4, while the much smaller Goat-7B handles arithmetic better, largely because its tokenizer splits numbers into consistent digit-level pieces (see the sketch below)
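A quick way to see the tokenization issue behind this arithmetic gap is to inspect how a production BPE vocabulary segments digit strings. A minimal sketch, assuming the tiktoken package is installed; the sample strings are illustrative:

```python
# Inspect how a byte-pair-encoding vocabulary segments numbers.
# Inconsistent chunking of digit strings is one reason arithmetic is hard
# for models trained on such tokenizations.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
for text in ["7", "287", "1287", "128700", "3 + 4 = 7"]:
    pieces = [enc.decode([tid]) for tid in enc.encode(text)]
    print(f"{text!r:14} -> {pieces}")
```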
Semantics and Pragmatics
- Current focus areas for improvement in LLM systems
- Dictionary Problem: Need to handle evolving language definitions
- Importance of regular vocabulary updates in production systems
- Consideration of domain-specific language stability (e.g., financial terminology at Mastercard)
Technical Implementation Challenges
The Dictionary Problem
- Dictionaries represent snapshots of language usage
- Need for dynamic vocabulary updates in production systems
- Weekly/monthly soft updates and yearly hard updates recommended
- Special considerations for domain-specific applications
- Balance between currency and stability in financial applications
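One way to realize a "soft update" is to extend the tokenizer vocabulary with newly observed terms, grow the embedding matrix, and then fine-tune on recent text. A minimal sketch assuming a Hugging Face model; the domain terms are hypothetical examples, not Mastercard's actual vocabulary pipeline:

```python
# Soft vocabulary update: register new terms and expand the embedding matrix.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical terms surfaced by vocabulary monitoring.
new_terms = ["contactless", "chargeback", "yeet"]
num_added = tokenizer.add_tokens(new_terms)

if num_added > 0:
    # New embedding rows are randomly initialized; a fine-tuning pass on
    # recent text is still needed before the model uses them sensibly.
    model.resize_token_embeddings(len(tokenizer))
```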
The Tokenization Challenge ("Yeet Problem")
- Issues with popular tokenization methods like BPE and SentencePiece
- Impact on arithmetic operations in large models
- Importance of understanding which sound and letter combinations a language allows (its phonotactics)
- Need for predictable tokenization patterns
- Benefits of multilingual approaches in improving tokenization
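To see the problem concretely, one can compare how a byte-level BPE vocabulary and a SentencePiece-based vocabulary segment a recent coinage. A sketch assuming Hugging Face tokenizers; the specific checkpoints are illustrative choices:

```python
# Compare segmentation of a new slang word under two tokenization schemes.
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")              # byte-level BPE
spm = AutoTokenizer.from_pretrained("xlm-roberta-base")  # SentencePiece

for word in ["yeet", "yeeted", "unyeetable"]:
    print(f"{word:11} BPE: {bpe.tokenize(word)}  SentencePiece: {spm.tokenize(word)}")
```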
The Morphological Challenge ("Kimono Problem")
- Issues with splitting borrowed words
- Importance of understanding basic units of meaning
- Benefits of multilingual models in handling diverse vocabularies
- Need for context-aware tokenization
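A rough way to quantify the borrowed-word issue is "fertility", the average number of subword pieces per word: loanwords typically fragment less under a multilingual vocabulary. A sketch with an illustrative word list and checkpoints:

```python
# Measure subword fertility for borrowed words under an English-centric
# tokenizer versus a multilingual one.
from transformers import AutoTokenizer

tokenizers = {
    "bert-base-uncased (English)": AutoTokenizer.from_pretrained("bert-base-uncased"),
    "bert-base-multilingual-cased": AutoTokenizer.from_pretrained("bert-base-multilingual-cased"),
}
borrowed = ["kimono", "croissant", "karaoke", "schadenfreude"]

for name, tok in tokenizers.items():
    pieces = {w: tok.tokenize(w) for w in borrowed}
    fertility = sum(len(p) for p in pieces.values()) / len(borrowed)
    print(f"{name}: avg {fertility:.2f} pieces/word, e.g. {pieces}")
```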
Practical Implementation Example
Biology Question-Answering System
- Baseline: Vanilla ChatGPT implementation
- Optimized Implementation: the techniques discussed in the sections below, including local deployment, the guidance framework, Chain of Thought reasoning, and pragmatic instruction (see the sketch after this list)
- Results:
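The exact prompts and scores from the talk are not reproduced here. As a hedged illustration of the baseline versus a call that layers Chain of Thought reasoning and an explicit pragmatic instruction on top of it, assuming the OpenAI Python client; prompts and model choice are illustrative, not the talk's exact setup:

```python
# Baseline ChatGPT call vs. the same question with Chain of Thought and a
# pragmatic instruction (audience, register, answer shape) layered on top.
from openai import OpenAI

client = OpenAI()
question = "Why do C4 plants outperform C3 plants in hot, dry climates?"

baseline = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": question}],
)

optimized = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": ("You are answering an undergraduate biology exam question. "
                     "Reason step by step, then end with a two-sentence answer "
                     "a student could write under time pressure.")},
        {"role": "user", "content": question},
    ],
)

print(baseline.choices[0].message.content)
print(optimized.choices[0].message.content)
```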
Production Considerations
Model Maintenance
- Regular updates to handle evolving language
- Balance between model currency and stability
- Domain-specific considerations for update frequency
- Monitoring of vocabulary shifts and usage patterns
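One simple signal for monitoring vocabulary shifts is how heavily the current tokenizer fragments incoming text compared with a baseline window. A sketch with hypothetical thresholds and in-memory sample corpora standing in for logged traffic:

```python
# Flag vocabulary drift by tracking how often incoming words fragment into
# many subword pieces; a rising rate suggests the vocabulary is going stale.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def fragmentation_rate(texts, max_pieces=3):
    """Fraction of whitespace-separated words split into > max_pieces subwords."""
    words = [w for t in texts for w in t.split()]
    fragmented = sum(1 for w in words if len(tokenizer.tokenize(w)) > max_pieces)
    return fragmented / max(len(words), 1)

# Illustrative corpora; in production these would come from logged traffic.
baseline_texts = ["card not present transaction declined", "chargeback reason code"]
current_texts = ["BNPL settlement rail latency", "unyeetable interchange downgrade"]

if fragmentation_rate(current_texts) > fragmentation_rate(baseline_texts) * 1.2:
    print("Vocabulary drift detected: schedule a soft tokenizer/embedding update")
```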
Performance Optimization
- Local deployment options for faster inference
- Use of the guidance framework for constrained, structured generation and improved accuracy
- Integration with other tools like LangChain
- Vector databases for document retrieval
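For the document-retrieval piece, the core pattern is: embed the documents, embed the query, and pass the nearest documents to the LLM as context. A minimal in-memory sketch assuming the sentence-transformers package, standing in for a real vector database:

```python
# Embed documents and a query, then retrieve the closest document to use as
# grounding context for the LLM. A vector database would replace the arrays.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "Chargeback: a reversal of a card transaction initiated by the issuer.",
    "Interchange fee: a fee paid between banks for card-based transactions.",
]
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

query_vecs = encoder.encode(["What is a chargeback?"], normalize_embeddings=True)
scores = doc_vecs @ query_vecs[0]            # cosine similarity (unit vectors)
context = docs[int(np.argmax(scores))]
print(context)  # prepended to the prompt before calling the model
```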
Multimodal Considerations
- Text-to-speech challenges
- Speech-to-speech implementations
- Phonetic information preservation
- Integration of International Phonetic Alphabet
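Preserving phonetic information typically means carrying an IPA (or similar) transcription alongside the text. A sketch assuming the phonemizer package with an espeak backend installed on the system; in a text-to-speech or speech-to-speech pipeline this transcription would travel with the generated text:

```python
# Convert text to an IPA transcription so downstream speech components keep
# pronunciation information instead of re-deriving it from raw text.
from phonemizer import phonemize

text = "Contactless payments settled overnight."
ipa = phonemize(text, language="en-us", backend="espeak", strip=True)
print(ipa)
```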
Best Practices and Recommendations
Development Approach
- Focus on linguistic features rather than just metrics
- Consider domain-specific language requirements
- Implement regular update cycles
- Use multilingual approaches when possible
Tool Selection
- Local deployment options for performance
- Integration of linguistic frameworks
- Use of Chain of Thought reasoning
- Implementation of pragmatic instruction
Monitoring and Maintenance
- Regular vocabulary updates
- Performance tracking
- Accuracy measurements
- Response time optimization
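A lightweight way to combine the tracking concerns above is to run a small labeled evaluation set through the system and record both correctness and latency. The `ask_model` function here is a placeholder for whatever inference call the deployment actually uses:

```python
# Offline check: exact-match-style accuracy on a labeled set plus latency.
import time

def ask_model(question: str) -> str:
    return "placeholder answer"  # replace with the real inference call

eval_set = [("What enzyme fixes CO2 in C3 plants?", "rubisco")]

latencies, correct = [], 0
for question, expected in eval_set:
    start = time.perf_counter()
    answer = ask_model(question)
    latencies.append(time.perf_counter() - start)
    correct += int(expected.lower() in answer.lower())

print(f"accuracy={correct / len(eval_set):.2f}, "
      f"mean latency={sum(latencies) / len(latencies):.3f}s")
```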
Future Directions
- Expansion of pragmatic instruction capabilities
- Integration with document retrieval systems
- Improvement of phonetic handling
- Enhanced multilingual support
- Development of more sophisticated update mechanisms