Plan for network isolation
In this article, you learn how to plan network isolation for Azure Machine Learning and review our recommendations. This article is for IT administrators who want to design the network architecture.
Recommended architecture (Managed Network Isolation pattern)
Using a managed virtual network provides an easier configuration for network isolation. It automatically secures your workspace and managed compute resources in a managed virtual network. You can add private endpoint connections for other Azure services that the workspace relies on, such as Azure Storage accounts. Depending on your needs, you can allow all outbound traffic to the public network or allow only the outbound traffic you approve. Outbound traffic required by the Azure Machine Learning service is automatically enabled for the managed virtual network. We recommend workspace managed network isolation as a built-in, frictionless network isolation method. There are two patterns: allow internet outbound mode and allow only approved outbound mode.
Allow internet outbound mode
Use this option if you want to allow your machine learning engineers to access the internet freely. You can create other private endpoint outbound rules to let them access your private resources on Azure.
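As a minimal sketch, the following example uses the azure-ai-ml Python SDK v2 to create a workspace whose managed network allows internet outbound, plus an optional private endpoint outbound rule. The workspace name, region, subscription, resource group, and storage account ID are placeholders; adjust them to your environment.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.constants import IsolationMode
from azure.ai.ml.entities import ManagedNetwork, PrivateEndpointDestination, Workspace
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(), "<subscription-id>", "<resource-group>")

# Managed network that allows all internet outbound traffic.
network = ManagedNetwork(isolation_mode=IsolationMode.ALLOW_INTERNET_OUTBOUND)

# Optional: a private endpoint outbound rule to a private Azure resource
# (a blob storage account in this example).
network.outbound_rules = [
    PrivateEndpointDestination(
        name="my-storage-blob",
        service_resource_id=(
            "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
            "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
        ),
        subresource_target="blob",
        spark_enabled=False,
    )
]

ws = Workspace(name="ws-internet-outbound", location="eastus", managed_network=network)
ml_client.workspaces.begin_create(ws).result()
```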
Allow only approved outbound mode
Use this option if you want to minimize the risk of data exfiltration and control what your machine learning engineers can access. You can control outbound rules using private endpoints, service tags, and FQDNs.
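The following sketch, again using the azure-ai-ml Python SDK v2 with placeholder names, shows the allow only approved outbound mode with example service tag and FQDN outbound rules; swap in the destinations your team actually needs.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.constants import IsolationMode
from azure.ai.ml.entities import FqdnDestination, ManagedNetwork, ServiceTagDestination, Workspace
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(), "<subscription-id>", "<resource-group>")

network = ManagedNetwork(
    isolation_mode=IsolationMode.ALLOW_ONLY_APPROVED_OUTBOUND,
    outbound_rules=[
        # Service tag rule: allow outbound to one Azure service on a specific port.
        ServiceTagDestination(
            name="allow-data-factory",
            service_tag="DataFactory",
            protocol="TCP",
            port_ranges="443",
        ),
        # FQDN rule: allow outbound to a specific public host (for example, a package feed).
        FqdnDestination(name="allow-pypi", destination="pypi.org"),
    ],
)

ws = Workspace(name="ws-approved-outbound", location="eastus", managed_network=network)
ml_client.workspaces.begin_create(ws).result()
```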
Recommended architecture (use your Azure VNet)
If you have a specific requirement or company policy that prevents you from using a managed virtual network, you can use an Azure virtual network for network isolation.
The following diagram is our recommended architecture to make all resources private but allow outbound internet access from your VNet. This diagram describes the following architecture:
- Put all resources in the same region.
- A hub VNet, which contains your firewall.
- A spoke VNet, which contains the following resources:
  - A training subnet contains compute instances and clusters used for training ML models. These resources are configured for no public IP.
  - A scoring subnet contains an AKS cluster.
  - A 'pe' subnet contains private endpoints that connect to the workspace and private resources used by the workspace (storage, key vault, container registry, etc.)
- Managed online endpoints use the private endpoint of the workspace to process incoming requests. A private endpoint is also used to allow managed online endpoint deployments to access private storage.
This architecture balances your network security and your ML engineers' productivity.
You can automate the creation of this environment by using a Bicep template or Terraform template, without managed online endpoints or AKS. Managed online endpoints are the solution if you don't have an existing AKS cluster for your AI model scoring. For more information, see the how to secure online endpoint documentation. AKS with the Azure Machine Learning extension is the solution if you have an existing AKS cluster for your AI model scoring. For more information, see the how to attach Kubernetes documentation.
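As a sketch of the compute side of this architecture, the following azure-ai-ml example creates a training cluster in the spoke VNet's training subnet with no public IP. The VNet, subnet, workspace, and VM size values are placeholders.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute, NetworkSettings
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>"
)

# Training cluster injected into the spoke VNet's training subnet, with no public IP.
cluster = AmlCompute(
    name="cpu-cluster",
    size="Standard_DS3_v2",
    min_instances=0,
    max_instances=4,
    network_settings=NetworkSettings(vnet_name="spoke-vnet", subnet="training-subnet"),
    enable_node_public_ip=False,
)
ml_client.compute.begin_create_or_update(cluster).result()
```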
Removing firewall requirement
If you want to remove the firewall requirement, you can use network security groups and Azure virtual network NAT to allow internet outbound from your private computing resources.
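As a minimal sketch of that alternative, the following example uses the azure-mgmt-network Python library to create a NAT gateway and associate it with the training subnet so that no-public-IP compute still has internet outbound. Names and the region are placeholders, and you still need NSG rules that allow the required outbound destinations.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import SubResource

net = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Standard public IP used by the NAT gateway for outbound connectivity.
pip = net.public_ip_addresses.begin_create_or_update(
    "<resource-group>", "natgw-pip",
    {"location": "eastus", "sku": {"name": "Standard"}, "public_ip_allocation_method": "Static"},
).result()

# NAT gateway that provides internet outbound for the subnet.
natgw = net.nat_gateways.begin_create_or_update(
    "<resource-group>", "aml-natgw",
    {"location": "eastus", "sku": {"name": "Standard"}, "public_ip_addresses": [{"id": pip.id}]},
).result()

# Associate the NAT gateway with the training subnet.
subnet = net.subnets.get("<resource-group>", "spoke-vnet", "training-subnet")
subnet.nat_gateway = SubResource(id=natgw.id)
net.subnets.begin_create_or_update(
    "<resource-group>", "spoke-vnet", "training-subnet", subnet
).result()
```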
Using public workspace
You can use a public workspace if you're OK with Microsoft Entra authentication and authorization with conditional access. However, a public workspace has some features that show data in your private storage account, so we still recommend using a private workspace.
Recommended architecture with data exfiltration prevention
This diagram shows the recommended architecture to make all resources private and control outbound destinations to prevent data exfiltration. We recommend this architecture when using Azure Machine Learning with your sensitive data in production. This diagram describes the following architecture:
- Put all resources in the same region.
- A hub VNet, which contains your firewall.
- In addition to service tags, the firewall uses FQDNs to prevent data exfiltration.
- A spoke VNet, which contains the following resources:
  - A training subnet contains compute instances and clusters used for training ML models. These resources are configured for no public IP. Additionally, a service endpoint and service endpoint policy are in place to prevent data exfiltration.
  - A scoring subnet contains an AKS cluster.
  - A 'pe' subnet contains private endpoints that connect to the workspace and private resources used by the workspace (storage, key vault, container registry, etc.)
- Managed online endpoints use the private endpoint of the workspace to process incoming requests. A private endpoint is also used to allow managed online endpoint deployments to access private storage.
The following tables list the required outbound Azure Service Tags and fully qualified domain names (FQDN) with data exfiltration protection setting:
| Outbound service tag | Protocol | Port |
| --- | --- | --- |
| AzureActiveDirectory | TCP | 80, 443 |
| AzureResourceManager | TCP | 443 |
| AzureMachineLearning | UDP | 5831 |
| BatchNodeManagement | TCP | 443 |

| Outbound FQDN | Protocol | Port |
| --- | --- | --- |
| mcr.microsoft.com | TCP | 443 |
| *.data.mcr.microsoft.com | TCP | 443 |
| studio.ml.azure.cn | TCP | 443 |
| automlresources-prod.azureedge.net | TCP | 443 |
Using public workspace
You can use a public workspace if you're OK with Microsoft Entra authentication and authorization with conditional access. However, a public workspace has some features that show data in your private storage account, so we still recommend using a private workspace.
Key considerations
Azure Machine Learning has both IaaS and PaaS resources
Azure Machine Learning's network isolation involves both Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) components. PaaS services, such as the Azure Machine Learning workspace, Azure Storage, Azure Key Vault, Azure Container Registry, and Azure Monitor, can be isolated using Private Link. IaaS computing services, such as compute instances/clusters for AI model training, and Azure Kubernetes Service (AKS) or managed online endpoints for AI model scoring, can be injected into your virtual network and communicate with PaaS services using Private Link. The following diagram is an example of this architecture.
In this diagram, the compute instances, compute clusters, and AKS clusters are located within your virtual network. They can access the Azure Machine Learning workspace or storage using a private endpoint. Instead of a private endpoint, you can use a service endpoint for Azure Storage and Azure Key Vault. The other services don't support service endpoints.
Required inbound and outbound configurations
Azure Machine Learning has several required inbound and outbound configurations for your virtual network. If you have a standalone virtual network, the configuration is straightforward using a network security group. However, you might have a hub-spoke or mesh network architecture, a firewall, a network virtual appliance, a proxy, or user-defined routing. In either case, make sure to allow the required inbound and outbound traffic through your network security components.
In this diagram, you have a hub and spoke network architecture. The spoke VNet has resources for Azure Machine Learning. The hub VNet has a firewall that controls internet outbound traffic from your virtual networks. In this case, your firewall must allow outbound access to the required resources, and your compute resources in the spoke VNet must be able to reach the firewall.
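For example, a user-defined route on the spoke subnets can send all internet-bound traffic to the firewall's private IP. The following azure-mgmt-network sketch uses placeholder names and a placeholder firewall IP address.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

net = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Route table that forces all internet-bound traffic through the hub firewall.
net.route_tables.begin_create_or_update(
    "<resource-group>", "spoke-rt",
    {
        "location": "eastus",
        "routes": [
            {
                "name": "default-to-firewall",
                "address_prefix": "0.0.0.0/0",
                "next_hop_type": "VirtualAppliance",
                "next_hop_ip_address": "10.0.0.4",  # placeholder: the firewall's private IP
            }
        ],
    },
).result()
# Associate this route table with the training and scoring subnets afterwards.
```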
Tip
In the diagram, the compute instance and compute cluster are configured for no public IP. If you instead use a compute instance or cluster with a public IP, you need to allow inbound traffic from the Azure Machine Learning service tag by using a network security group (NSG) and user-defined routing that bypasses your firewall. This inbound traffic comes from a Microsoft service (Azure Machine Learning). However, we recommend using the no public IP option to remove this inbound requirement.
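If you do use compute with a public IP, an NSG rule like the following azure-mgmt-network sketch allows that inbound traffic. The resource group, NSG name, priority, and destination port are placeholders or assumptions; verify the port against the current Azure Machine Learning inbound requirements for your compute type.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

net = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# NSG rule allowing inbound traffic from the Azure Machine Learning service tag
# to compute resources that have a public IP.
net.security_rules.begin_create_or_update(
    "<resource-group>", "training-nsg", "AllowAzureMLInbound",
    {
        "priority": 200,
        "direction": "Inbound",
        "access": "Allow",
        "protocol": "Tcp",
        "source_address_prefix": "AzureMachineLearning",  # service tag as the source
        "source_port_range": "*",
        "destination_address_prefix": "*",
        "destination_port_range": "44224",  # verify against current AzureML requirements
    },
).result()
```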
DNS resolution of private link resources and application on compute instance
If you have your own DNS server hosted in Azure or on-premises, you need to create a conditional forwarder in your DNS server. The conditional forwarder sends DNS requests to the Azure DNS for all private link enabled PaaS services. For more information, see the DNS configuration scenarios and Azure Machine Learning specific DNS configuration articles.
Data exfiltration protection
There are two types of outbound traffic: read-only and read/write. Read-only outbound can't be exploited by malicious actors, but read/write outbound can be. Azure Storage and Azure Front Door (the `frontdoor.frontend` service tag) are read/write outbound in our case.
You can mitigate this data exfiltration risk using our data exfiltration prevention solution. We use a service endpoint policy with an Azure Machine Learning alias to allow outbound to only Azure Machine Learning managed storage accounts. You don't need to open outbound to Storage on your firewall.
In this diagram, the compute instance and cluster need to access Azure Machine Learning managed storage accounts to get setup scripts. Instead of opening outbound access to all storage, you can use a service endpoint policy with the Azure Machine Learning alias to allow storage access only to Azure Machine Learning storage accounts.
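The following azure-mgmt-network sketch creates such a service endpoint policy. The `/services/Azure/MachineLearning` alias covers the Azure Machine Learning managed storage accounts; the explicit storage account ID for your workspace's default storage, resource group, and region are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

net = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Service endpoint policy: restrict Microsoft.Storage egress to the workspace's own
# storage account plus Azure Machine Learning managed storage (via the alias).
net.service_endpoint_policies.begin_create_or_update(
    "<resource-group>", "aml-data-exfiltration-policy",
    {
        "location": "eastus",
        "service_endpoint_policy_definitions": [
            {
                "name": "aml-and-default-storage",
                "service": "Microsoft.Storage",
                "service_resources": [
                    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
                    "/providers/Microsoft.Storage/storageAccounts/<default-storage-account>",
                    "/services/Azure/MachineLearning",
                ],
            }
        ],
    },
).result()
# Attach this policy to the Microsoft.Storage service endpoint on the training subnet.
```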
The following tables list the required outbound Azure Service Tags and fully qualified domain names (FQDN) with data exfiltration protection setting:
| Outbound service tag | Protocol | Port |
| --- | --- | --- |
| AzureActiveDirectory | TCP | 80, 443 |
| AzureResourceManager | TCP | 443 |
| AzureMachineLearning | UDP | 5831 |
| BatchNodeManagement | TCP | 443 |

| Outbound FQDN | Protocol | Port |
| --- | --- | --- |
| mcr.microsoft.com | TCP | 443 |
| *.data.mcr.microsoft.com | TCP | 443 |
| studio.ml.azure.cn | TCP | 443 |
| automlresources-prod.azureedge.net | TCP | 443 |
Managed online endpoint
Security for inbound and outbound communication is configured separately for managed online endpoints.
Inbound communication
Azure Machine Learning uses a private endpoint to secure inbound communication to a managed online endpoint. Set the endpoint's `public_network_access` flag to `disabled` to prevent public access to it. When this flag is disabled, your endpoint can be accessed only via the private endpoint of your Azure Machine Learning workspace, and it can't be reached from public networks.
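As a sketch with the azure-ai-ml Python SDK v2 (endpoint, subscription, resource group, and workspace names are placeholders), disabling public access on the endpoint looks like this:

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>"
)

# Endpoint reachable only through the workspace's private endpoint.
endpoint = ManagedOnlineEndpoint(
    name="my-secure-endpoint",
    auth_mode="key",
    public_network_access="disabled",
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```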
Outbound communication
To secure outbound communication from a deployment to resources, Azure Machine Learning uses a workspace managed virtual network. The deployment needs to be created in the workspace managed VNet so that it can use the private endpoints of the workspace managed virtual network for outbound communication.
The following architecture diagram shows how communications flow through private endpoints to the managed online endpoint. Incoming scoring requests from a client's virtual network flow through the workspace's private endpoint to the managed online endpoint. Outbound communication from deployments to services is handled through private endpoints from the workspace's managed virtual network to those service instances.
For more information, see Network isolation with managed online endpoints.
Private IP address shortage in your main network
Azure Machine Learning requires private IPs; one IP per compute instance, compute cluster node, and private endpoint. You also need many IPs if you use AKS. Your hub-spoke network connected with your on-premises network might not have a large enough private IP address space. In this scenario, you can use isolated, not-peered VNets for your Azure Machine Learning resources.
In this diagram, your main VNet requires the IPs for private endpoints. You can have hub-spoke VNets for multiple Azure Machine Learning workspaces with large address spaces. A downside of this architecture is that it doubles the number of private endpoints.
Network policy enforcement
You can use built-in policies to control network isolation parameters while still allowing self-service creation of workspaces and compute resources.
Other minor considerations
Image build compute setting for ACR behind VNet
If you put your Azure Container Registry (ACR) behind a private endpoint, ACR can't build your Docker images. You need to use a compute instance or compute cluster to build images. For more information, see the how to set image build compute article.
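For example, with the azure-ai-ml Python SDK v2 you can point environment image builds at a compute cluster that's inside your network (the cluster, workspace, subscription, and resource group names are placeholders):

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(), "<subscription-id>", "<resource-group>")

# Use a compute cluster in the VNet for environment image builds,
# because ACR behind a private endpoint can't run build tasks itself.
ws = Workspace(name="<workspace-name>", image_build_compute="cpu-cluster")
ml_client.workspaces.begin_update(ws).result()
```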
Enablement of studio UI with private link enabled workspace
If you plan to use the Azure Machine Learning studio, extra configuration steps are needed. These steps help prevent data exfiltration scenarios. For more information, see the how to use Azure Machine Learning studio in an Azure virtual network article.
Next steps
For more information on using a managed virtual network, see the following articles:
For more information on using an Azure Virtual Network, see the following articles: