A case study detailing Microsoft's experience implementing LLMOps in a restricted network environment using Azure Machine Learning. The team faced challenges with long-running evaluations (6+ hours) and tight network restrictions, and developed solutions including an opt-out mechanism for lengthy evaluations, a Git Flow branching strategy for controlled releases, and a comprehensive CI/CE/CD pipeline. Their approach balanced the needs of data scientists, engineers, and platform teams while maintaining security and evaluation quality.
This case study from Microsoft details their experience implementing LLMOps in a highly restricted network environment, highlighting the challenges and solutions for running large language models in production while maintaining security and evaluation quality.
The project focused on developing an LLM application using Prompt Flow within Azure Machine Learning (AML), with a particular emphasis on establishing automated evaluation pipelines in a restricted network environment. The key challenge was balancing the need for thorough evaluation with practical time constraints, as their end-to-end evaluation process took 6 hours to complete.
### Network and Infrastructure Setup
The implementation took place in a completely private network environment within Azure, structured with:
* A Virtual Network with private IPs only
* Subnet division based on resource roles
* Network Security Groups with strict whitelisting
* Private endpoints and VPN gateways for external access
* No public IP exposure
This restricted network environment created several technical challenges:
* Issues with resolving compute/runtime hosts requiring FQDN configuration
* Problems with AML managed endpoints in private networks
* Complex permission management for Service Principals
* Need for explicit no-proxy configurations for private IP resolution (see the sketch after this list)
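The case study does not include configuration snippets, but a minimal sketch of the kind of no-proxy and FQDN wiring involved might look like the following; the host names and private IPs are placeholders, not values from the Microsoft deployment.

```python
import os

# Hypothetical private endpoints for the AML workspace and compute runtime;
# the actual FQDNs and IPs depend on the workspace and VNet configuration.
PRIVATE_HOSTS = {
    "myworkspace.api.azureml.ms": "10.0.1.4",
    "myworkspace.cert.api.azureml.ms": "10.0.1.4",
    "compute-runtime.internal.contoso.com": "10.0.2.8",
}

def configure_no_proxy() -> None:
    """Ensure traffic to private AML endpoints bypasses the corporate proxy."""
    existing = os.environ.get("NO_PROXY", "")
    entries = [h for h in existing.split(",") if h] + list(PRIVATE_HOSTS)
    os.environ["NO_PROXY"] = ",".join(sorted(set(entries)))
    os.environ["no_proxy"] = os.environ["NO_PROXY"]  # some tools read the lowercase variant

def hosts_file_lines() -> list[str]:
    """Lines to append to /etc/hosts when private DNS zones are not available."""
    return [f"{ip} {fqdn}" for fqdn, ip in PRIVATE_HOSTS.items()]

if __name__ == "__main__":
    configure_no_proxy()
    print(os.environ["NO_PROXY"])
    print("\n".join(hosts_file_lines()))
```

The same host-to-IP mapping can feed both the `NO_PROXY` environment variable used by client tooling and hosts-file entries for compute-host resolution.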
### Development and Deployment Process
The team implemented a Git Flow branching strategy with three main branches:
* develop: The gatekeeping branch for feature development
* feature: Individual development branches following naming conventions (a naming check is sketched after this list)
* main: Production deployment branch
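As a small illustration of how the naming conventions might be enforced in CI, the check below validates branch names against an assumed `feature/<ticket>-<description>` pattern; the actual convention used by the team is not specified in the case study.

```python
import re
import sys

# Assumed convention: feature/<TICKET-ID>-<short-description>,
# e.g. "feature/LLM-123-add-eval-metrics". The real pattern may differ.
FEATURE_PATTERN = re.compile(r"^feature/[A-Z]+-\d+-[a-z0-9-]+$")
PROTECTED_BRANCHES = {"main", "develop"}

def is_valid_branch(name: str) -> bool:
    """Accept the long-lived branches plus correctly named feature branches."""
    return name in PROTECTED_BRANCHES or bool(FEATURE_PATTERN.match(name))

if __name__ == "__main__":
    branch = sys.argv[1] if len(sys.argv) > 1 else ""
    if not is_valid_branch(branch):
        print(f"Branch '{branch}' does not follow the naming convention.")
        sys.exit(1)
    print(f"Branch '{branch}' is valid.")
```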
The pipeline was structured into three components:
**Continuous Integration (CI)**
* Code quality checks
* Unit testing
* Flow validation with sample data
**Continuous Evaluation (CE)**
* Environment setup and synthetic data preparation
* Node-level evaluation and metrics
* End-to-end flow simulation
* Azure ML registration
**Continuous Deployment (CD)**
* PromptFlow versioning
* Container packaging
* Azure Container Registry deployment
* Web App deployment
* Comprehensive testing and monitoring
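The original pipeline ran on a hosted CI system, so the following is only a condensed sketch of how the three stages above could be sequenced in a single driver script; each placeholder step stands in for the real linting, testing, Prompt Flow validation, AML evaluation, and deployment tooling.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Stage:
    """One pipeline stage (CI, CE, or CD) made of ordered steps."""
    name: str
    steps: list[Callable[[], None]] = field(default_factory=list)

    def run(self) -> None:
        print(f"=== {self.name} ===")
        for step in self.steps:
            step()

# Placeholder step implementations; in the real pipeline each of these would
# call out to the actual quality, evaluation, and deployment tooling.
def code_quality(): print("running code quality checks")
def unit_tests(): print("running unit tests")
def flow_validation(): print("validating flow with sample data")
def node_eval(): print("node-level evaluation and metrics")
def e2e_simulation(): print("end-to-end flow simulation")
def register_in_aml(): print("registering artifacts in Azure ML")
def package_and_deploy(): print("packaging container and deploying web app")

PIPELINE = [
    Stage("CI", [code_quality, unit_tests, flow_validation]),
    Stage("CE", [node_eval, e2e_simulation, register_in_aml]),
    Stage("CD", [package_and_deploy]),
]

if __name__ == "__main__":
    for stage in PIPELINE:
        stage.run()
```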
### Innovative Solutions for Long-Running Evaluations
A key innovation was the introduction of an opt-out mechanism for the 6-hour evaluation process. This was implemented through PR labels, allowing teams to skip the full evaluation when appropriate (e.g., for minor bug fixes). The solution included:
* A "skip-e2e-evaluation" label option for PRs
* Automated detection of labels in the CI/CD pipeline
* Two-tier evaluation approach:
  * Fast evaluation (30 minutes) for every deployment
  * Full evaluation (6 hours) running in parallel, with opt-out option
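A minimal sketch of the label check described above is shown below. It assumes a GitHub-Actions-style event payload exposed through `GITHUB_EVENT_PATH`; the case study does not name the CI system, so only the `skip-e2e-evaluation` label itself is taken from the source.

```python
import json
import os

SKIP_LABEL = "skip-e2e-evaluation"

def load_pr_labels(event_path: str) -> set[str]:
    """Read PR labels from a GitHub-Actions-style event payload (an assumption)."""
    with open(event_path, encoding="utf-8") as f:
        event = json.load(f)
    return {label["name"] for label in event.get("pull_request", {}).get("labels", [])}

def plan_evaluations(labels: set[str]) -> list[str]:
    """The fast evaluation always runs; the 6-hour run is skipped when opted out."""
    plan = ["fast-evaluation (~30 min)"]
    if SKIP_LABEL not in labels:
        plan.append("full-evaluation (~6 h, runs in parallel)")
    return plan

if __name__ == "__main__":
    labels = load_pr_labels(os.environ.get("GITHUB_EVENT_PATH", "event.json"))
    for job in plan_evaluations(labels):
        print("scheduling:", job)
```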
### Technical Implementation Details
The team developed several technical solutions to address the challenges:
* Implementation of custom FQDN configurations for compute host resolution
* Command-line based AML flow validations to keep the pipeline lightweight (see the sketch after this list)
* Integration of evaluation metrics and job links in pipeline summaries
* Configuration-driven evaluation processes
* PR template with checklists to ensure proper evaluation decisions
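The configuration-driven, command-line-based validation could be sketched roughly as follows; the `evaluation.config.json` layout is hypothetical, and the `pf flow test` invocation assumes the promptflow CLI is installed (flags may vary by version).

```python
import json
import subprocess
from pathlib import Path

def load_eval_config(path: str = "evaluation.config.json") -> dict:
    """Hypothetical config listing which flows and sample datasets to validate."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

def validate_flow(flow_dir: str, sample_data: str) -> None:
    """Run a lightweight flow test from the command line instead of a full AML job.
    Assumes the promptflow CLI ('pf') is available on the agent."""
    subprocess.run(
        ["pf", "flow", "test", "--flow", flow_dir, "--inputs", f"data={sample_data}"],
        check=True,
    )

if __name__ == "__main__":
    config = load_eval_config()
    for flow in config.get("flows", []):
        validate_flow(flow["path"], flow["sample_data"])
```

Keeping the flow list and sample data in a config file lets the same validation step serve every flow without editing the pipeline definition itself.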
### Continuous Evaluation Considerations
The case study provides detailed insights into determining optimal evaluation frequency:
* Evaluation types:
  * Continuous runs for high-frequency updates
  * On-demand execution for specific needs
  * Timer-based scheduling for periodic assessment
* Key factors in determining frequency:
  * Repository activity levels
  * Available infrastructure resources
  * Testing complexity and requirements
  * Deployment schedule alignment
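As a rough illustration of how these factors might be weighed, the helper below maps a few signals to an evaluation mode; the signals and thresholds are invented for illustration and are not taken from the case study.

```python
from dataclasses import dataclass

@dataclass
class RepoSignals:
    merges_per_day: float       # repository activity level
    gpu_hours_available: float  # infrastructure budget for evaluation
    release_is_near: bool       # deployment schedule alignment

def choose_evaluation_mode(signals: RepoSignals) -> str:
    """Pick between continuous, timer-based, and on-demand evaluation.
    Thresholds are illustrative only."""
    if signals.release_is_near:
        return "continuous"  # evaluate every merge close to a release
    if signals.merges_per_day >= 5 and signals.gpu_hours_available >= 24:
        return "continuous"
    if signals.merges_per_day >= 1:
        return "timer-based (nightly)"
    return "on-demand"

if __name__ == "__main__":
    print(choose_evaluation_mode(RepoSignals(3.0, 12.0, False)))
```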
### Results and Outcomes
The implementation successfully achieved:
* Secure LLM operations in a restricted network
* A workable balance between evaluation thoroughness and practical time constraints
* Effective collaboration between data science, engineering, and platform teams
* Controlled and reliable deployment processes
* Consistently high standards for code quality and model performance
### Lessons Learned and Best Practices
The project revealed several important considerations for similar implementations:
* Clear communication and training requirements for label-based systems
* Need for consistent application of evaluation protocols
* Importance of feedback loops in the development process
* Balance between automation and human oversight
* Resource optimization strategies for long-running evaluations
The case study demonstrates that successful LLMOps implementation in restricted environments requires careful consideration of security, efficiency, and practical constraints. The team's innovative solutions, particularly around evaluation optimization and process automation, provide valuable insights for similar implementations in highly regulated or secure environments.