Create and manage instance types for efficient utilization of compute resources

Article
02/23/2024

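Instance types are an Azure Machine Learning concept that lets you target certain types of compute nodes for training and inference workloads. For example, in an Azure virtual machine, an instance type is STANDARD_D2_V3. This article describes how to create and manage instance types for your computation requirements.

In Kubernetes clusters, instance types are represented in a custom resource definition (CRD) that's installed with the Azure Machine Learning extension. Two elements in the Azure Machine Learning extension represent the instance types:

Use nodeSelector to specify which node a pod should run on. The node must have a corresponding label.
In the resources section, you can set the compute resources (CPU, memory, and NVIDIA GPU) for the pod.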

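If you specify a nodeSelector field when you deploy the Azure Machine Learning extension, that nodeSelector field is applied to all instance types. This means that:

For each instance type that you create, the specified nodeSelector field should be a subset of the extension-specified nodeSelector field.
If you use an instance type with nodeSelector, the workload runs on any node that matches both the extension-specified nodeSelector field and the instance-type-specified nodeSelector field.
If you use an instance type without a nodeSelector field, the workload runs on any node that matches the extension-specified nodeSelector field.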
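
The matching rules above follow standard Kubernetes nodeSelector semantics, in which a node qualifies only if it carries every listed label with the exact value. The following minimal Python sketch illustrates how the two selectors combine. The function names and the sample labels are hypothetical and are not part of the Azure Machine Learning extension.

```python
def selector_matches(selector, node_labels):
    """Return True if the node carries every key/value pair in the selector."""
    # An empty or missing selector places no constraint on the node.
    return all(node_labels.get(key) == value for key, value in (selector or {}).items())


def node_is_eligible(node_labels, extension_selector=None, instance_type_selector=None):
    """A node is eligible only if it satisfies both selectors."""
    return selector_matches(extension_selector, node_labels) and selector_matches(
        instance_type_selector, node_labels
    )


# Hypothetical labels and selectors, for illustration only.
gpu_node = {"pool": "ml", "accelerator": "gpu"}
cpu_node = {"pool": "ml"}
extension_selector = {"pool": "ml"}
instance_type_selector = {"accelerator": "gpu"}

print(node_is_eligible(gpu_node, extension_selector, instance_type_selector))  # True
print(node_is_eligible(cpu_node, extension_selector, instance_type_selector))  # False
print(node_is_eligible(cpu_node, extension_selector))  # True
```

In the last call, the instance type has no selector, so only the extension-level selector restricts where the workload can run.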

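Create default instance types

By default, an instance type called defaultinstancetype is created when you attach a Kubernetes cluster to an Azure Machine Learning workspace. Here's the definition: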

resources:
  requests:
    cpu: "100m"
    memory: "2Gi"
  limits:
    cpu: "2"
    memory: "2Gi"
    nvidia.com/gpu: null

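If you don't apply a nodeSelector field, the pod can be scheduled on any node. For the request, the workload's pods are assigned default resources of 0.1 CPU cores, 2 GB of memory, and 0 GPUs. The resources that the workload's pods use are limited to 2 CPU cores and 2 GB of memory, matching the limits in the definition above.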
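
The CPU and memory values in the definition use Kubernetes quantity notation, where the m suffix means thousandths of a core and Gi means 2^30 bytes. As a rough illustration of the arithmetic, the following Python sketch converts such values to cores and bytes. The helper names are hypothetical, and only a few common suffixes are handled.

```python
from decimal import Decimal

# Multipliers for a few common Kubernetes memory suffixes (binary and decimal).
_MEMORY_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def cpu_cores(quantity: str) -> Decimal:
    """Convert a CPU quantity such as "100m" or "2" to a number of cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000  # millicores to cores
    return Decimal(quantity)


def memory_bytes(quantity: str) -> Decimal:
    """Convert a memory quantity such as "2Gi" to a number of bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * factor
    return Decimal(quantity)  # no suffix: a plain number of bytes


print(cpu_cores("100m"))                      # 0.1
print(memory_bytes("2Gi") == 2 * 2**30)       # True
```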

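The default instance type purposefully uses few resources. To ensure that all machine learning workloads run with appropriate resources (for example, GPU resources), we highly recommend that you create custom instance types.

Keep in mind the following points about the default instance type:

When you run the command kubectl get instancetype, defaultinstancetype doesn't appear as an InstanceType custom resource in the cluster, but it does appear in all clients (UI, Azure CLI, SDK).
defaultinstancetype can be overridden by the definition of a custom instance type that has the same name.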

Create a custom instance type

To create a new instance type, create a new custom resource for the instance type CRD. For example:

kubectl apply -f my_instance_type.yaml

Here are the contents of my_instance_type.yaml:

apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: myinstancetypename
spec:
  nodeSelector:
    mylabel: mylabelvalue
  resources:
    limits:
      cpu: "1"
      nvidia.com/gpu: 1
      memory: "2Gi"
    requests:
      cpu: "700m"
      memory: "1500Mi"

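The preceding code creates an instance type with the following behavior:

Pods are scheduled only on nodes that have the label mylabel: mylabelvalue.
Pods are assigned resource requests of 700m for CPU and 1500Mi for memory.
Pods are assigned resource limits of 1 for CPU, 2Gi for memory, and 1 for NVIDIA GPU.

A custom instance type must meet the following parameter and definition rules, or its creation fails: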

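Parameter	Required or optional	Description
`name`	Required	String value, which must be unique in the cluster.
`CPU request`	Required	String value, which can't be zero or empty. You can specify the CPU in millicores; for example, `100m`. You can also specify it as a whole number. For example, `"1"` is equivalent to `1000m`.
`Memory request`	Required	String value, which can't be zero or empty. You can specify the memory as a whole number plus a suffix; for example, `1024Mi` for 1,024 MiB.
`CPU limit`	Required	String value, which can't be zero or empty. You can specify the CPU in millicores; for example, `100m`. You can also specify it as a whole number. For example, `"1"` is equivalent to `1000m`.
`Memory limit`	Required	String value, which can't be zero or empty. You can specify the memory as a whole number plus a suffix; for example, `1024Mi` for 1,024 MiB.
`GPU`	Optional	Integer value, which can be specified only in the `limits` section. For more information, see the Kubernetes documentation.
`nodeSelector`	Optional	Map of string keys and values.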
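
It can help to check a definition against these rules locally before applying it with kubectl. The following Python sketch is one hypothetical way to do that for a manifest that is already loaded as a dictionary. The function name is an assumption, the check for zero only catches the literal string "0", and uniqueness of the name in the cluster is not verified. It does not reproduce the validation that the extension itself performs.

```python
def validate_instance_type(manifest):
    """Return a list of rule violations; an empty list means the checks passed."""
    errors = []

    name = manifest.get("metadata", {}).get("name")
    if not isinstance(name, str) or not name:
        errors.append("metadata.name must be a non-empty string")

    spec = manifest.get("spec", {})
    resources = spec.get("resources", {})
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})

    # CPU and memory are required in both sections, as non-empty strings other than "0".
    for section_name, section in (("requests", requests), ("limits", limits)):
        for key in ("cpu", "memory"):
            value = section.get(key)
            if not isinstance(value, str) or value.strip() in ("", "0"):
                errors.append(f"{section_name}.{key} must be a non-empty, non-zero string")

    # GPU is optional, must be an integer, and is allowed only under limits.
    if "nvidia.com/gpu" in requests:
        errors.append("nvidia.com/gpu can be specified only under limits")
    gpu = limits.get("nvidia.com/gpu")
    if gpu is not None and (isinstance(gpu, bool) or not isinstance(gpu, int)):
        errors.append("limits.nvidia.com/gpu must be an integer")

    # nodeSelector is optional, but must map string keys to string values.
    selector = spec.get("nodeSelector")
    if selector is not None:
        if not isinstance(selector, dict) or not all(
            isinstance(k, str) and isinstance(v, str) for k, v in selector.items()
        ):
            errors.append("nodeSelector must be a map of string keys to string values")

    return errors


# The custom instance type from this article, written as a Python dictionary.
example = {
    "apiVersion": "amlarc.azureml.com/v1alpha1",
    "kind": "InstanceType",
    "metadata": {"name": "myinstancetypename"},
    "spec": {
        "nodeSelector": {"mylabel": "mylabelvalue"},
        "resources": {
            "limits": {"cpu": "1", "nvidia.com/gpu": 1, "memory": "2Gi"},
            "requests": {"cpu": "700m", "memory": "1500Mi"},
        },
    },
}
print(validate_instance_type(example))  # []
```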

You can also create multiple instance types at once:

kubectl apply -f my_instance_type_list.yaml

Here are the contents of my_instance_type_list.yaml:

apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceTypeList
items:
  - metadata:
      name: cpusmall
    spec:
      resources:
        requests:
          cpu: "100m"
          memory: "100Mi"
        limits:
          cpu: "1"
          nvidia.com/gpu: 0
          memory: "1Gi"

  - metadata:
      name: defaultinstancetype
    spec:
      resources:
        requests:
          cpu: "1"
          memory: "1Gi"
        limits:
          cpu: "1"
          nvidia.com/gpu: 0
          memory: "1Gi"

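The preceding example creates two instance types: cpusmall and defaultinstancetype. This defaultinstancetype definition overrides the defaultinstancetype definition that was created when you attached the Kubernetes cluster to the Azure Machine Learning workspace.

If you submit a training or inference workload without an instance type, it uses defaultinstancetype. To specify a default instance type for a Kubernetes cluster, create an instance type with the name defaultinstancetype. It's automatically recognized as the default.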
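
The fallback and override behavior can be summarized as a simple lookup. The following Python sketch is only a mental model of that behavior under the rules stated in this article. The function and constant names are hypothetical, and the actual resolution happens inside the Azure Machine Learning service, not in user code.

```python
DEFAULT_NAME = "defaultinstancetype"

# Built-in definition created when the cluster is attached (values from this article).
BUILT_IN_DEFAULT = {
    "requests": {"cpu": "100m", "memory": "2Gi"},
    "limits": {"cpu": "2", "memory": "2Gi", "nvidia.com/gpu": None},
}


def resolve_instance_type(requested_name, custom_types):
    """Pick the resource definition for a workload.

    custom_types maps instance type names to their resource definitions.
    """
    # A workload that names no instance type falls back to the default name.
    name = requested_name or DEFAULT_NAME
    # A custom definition wins, including one named defaultinstancetype.
    if name in custom_types:
        return custom_types[name]
    if name == DEFAULT_NAME:
        return BUILT_IN_DEFAULT
    raise KeyError(f"Instance type {name!r} is not defined")


custom = {
    "cpusmall": {
        "requests": {"cpu": "100m", "memory": "100Mi"},
        "limits": {"cpu": "1", "memory": "1Gi", "nvidia.com/gpu": 0},
    },
}
print(resolve_instance_type(None, custom) is BUILT_IN_DEFAULT)  # True
print(resolve_instance_type("cpusmall", custom)["limits"]["cpu"])  # 1
```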

To select an instance type for a training job by using the Azure CLI (v2), specify its name in the resources properties section of the job YAML. For example:

command: python -c "print('Hello world!')"
environment:
  image: library/python:latest
compute: azureml:<Kubernetes-compute_target_name>
resources:
  instance_type: <instance type name>

To select an instance type for a training job by using the Python SDK (v2), pass its name as the instance_type argument of the command function. For example:

from azure.ai.ml import command

# define the command
command_job = command(
    command="python -c \"print('Hello world!')\"",
    environment="AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu@latest",
    compute="<Kubernetes-compute_target_name>",
    instance_type="<instance type name>"
)

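In the preceding example, replace <Kubernetes-compute_target_name> with the name of your Kubernetes compute target. Replace <instance type name> with the name of the instance type that you want to select. If you don't specify an instance_type property, the system uses defaultinstancetype to submit the job.

Select an instance type to deploy a model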

To select an instance type for a model deployment by using the Azure CLI (v2), specify its name for the instance_type property in the deployment YAML. For example:

name: blue
app_insights_enabled: true
endpoint_name: <endpoint name>
model: 
  path: ./model/sklearn_mnist_model.pkl
code_configuration:
  code: ./script/
  scoring_script: score.py
instance_type: <instance type name>
environment: 
  conda_file: file:./model/conda.yml
  image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest

To select an instance type for a model deployment by using the Python SDK (v2), specify its name for the instance_type property in the KubernetesOnlineDeployment class. For example:

from azure.ai.ml.entities import KubernetesOnlineDeployment, Model, Environment, CodeConfiguration

model = Model(path="./model/sklearn_mnist_model.pkl")
env = Environment(
    conda_file="./model/conda.yml",
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
)

# define the deployment
blue_deployment = KubernetesOnlineDeployment(
    name="blue",
    endpoint_name="<endpoint name>",
    model=model,
    environment=env,
    code_configuration=CodeConfiguration(
        code="./script/", scoring_script="score.py"
    ),
    instance_count=1,
    instance_type="<instance type name>",
)

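In the preceding example, replace <instance type name> with the name of the instance type that you want to select. If you don't specify an instance_type property, the system uses defaultinstancetype to deploy the model.

Important

For MLflow model deployment, the resource request requires at least 2 CPU cores and 4 GB of memory. Otherwise, the deployment fails.

Resource section validation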

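You can use the resources section to define the resource request and limit of your model deployments. The following examples show the Azure CLI deployment YAML first, followed by the equivalent Python SDK code: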

name: blue
app_insights_enabled: true
endpoint_name: <endpoint name>
model: 
  path: ./model/sklearn_mnist_model.pkl
code_configuration:
  code: ./script/
  scoring_script: score.py
environment: 
  conda_file: file:./model/conda.yml
  image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest
resources:
  requests:
    cpu: "0.1"
    memory: "0.2Gi"
  limits:
    cpu: "0.2"
    #nvidia.com/gpu: 0
    memory: "0.5Gi"
instance_type: <instance type name>

from azure.ai.ml.entities import (
    KubernetesOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
    ResourceSettings,
    ResourceRequirementsSettings
)

model = Model(path="./model/sklearn_mnist_model.pkl")
env = Environment(
    conda_file="./model/conda.yml",
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
)

requests = ResourceSettings(cpu="0.1", memory="0.2G")
limits = ResourceSettings(cpu="0.2", memory="0.5G", nvidia_gpu="1")
resources = ResourceRequirementsSettings(requests=requests, limits=limits)

# define the deployment
blue_deployment = KubernetesOnlineDeployment(
    name="blue",
    endpoint_name="<endpoint name>",
    model=model,
    environment=env,
    code_configuration=CodeConfiguration(
        code="./script/", scoring_script="score.py"
    ),
    resources=resources,
    instance_count=1,
    instance_type="<instance type name>",
)

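If you use the resources section, a valid resource definition needs to meet the following rules. An invalid resource definition causes the model deployment to fail.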

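Parameter	Required or optional	Description
`requests:` `cpu:`	Required	String value, which can't be zero or empty. You can specify the CPU in millicores; for example, `100m`. You can also specify it as a whole number. For example, `"1"` is equivalent to `1000m`.
`requests:` `memory:`	Required	String value, which can't be zero or empty. You can specify the memory as a whole number plus a suffix; for example, `1024Mi` for 1,024 MiB. Memory can't be less than 1 MB.
`limits:` `cpu:`	Optional (required only when you need GPU)	String value, which can't be zero or empty. You can specify the CPU in millicores; for example, `100m`. You can also specify it as a whole number. For example, `"1"` is equivalent to `1000m`.
`limits:` `memory:`	Optional (required only when you need GPU)	String value, which can't be zero or empty. You can specify the memory as a whole number plus a suffix; for example, `1024Mi` for 1,024 MiB.
`limits:` `nvidia.com/gpu:`	Optional (required only when you need GPU)	Integer value, which can't be empty and can be specified only in the `limits` section. For more information, see the Kubernetes documentation. If you need only CPU, you can omit the entire `limits` section.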

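An instance type is required for model deployment. If you also define a resources section, it's validated against the instance type according to the following rules:

With a valid resources section definition, the resource limits must be less than the instance type limits. Otherwise, the deployment fails.
If you don't define an instance type, the system uses defaultinstancetype to validate the resources section.
If you don't define a resources section, the system uses the instance type to create the deployment.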
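
To make the first rule concrete, the following Python sketch compares the limits of a resources section with the limits of an instance type. It's a hypothetical illustration, not the service's validation logic. The function names are assumptions, only a few quantity suffixes are parsed, a limit that the instance type omits or sets to null is treated as 0, and a value exactly equal to the instance type limit is accepted because the boundary case isn't specified here.

```python
from decimal import Decimal

# Multipliers for a few common Kubernetes quantity suffixes.
_SUFFIXES = {
    "m": Decimal("0.001"),
    "Ki": Decimal(2**10), "Mi": Decimal(2**20), "Gi": Decimal(2**30),
    "k": Decimal(10**3), "M": Decimal(10**6), "G": Decimal(10**9),
}


def to_number(quantity):
    """Convert a quantity such as "200m", "0.5Gi", or 1 to a Decimal."""
    text = str(quantity)
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def exceeded_limits(resource_limits, instance_type_limits):
    """Return the names of limits that are larger than the instance type allows."""
    exceeded = []
    for key, requested in resource_limits.items():
        if requested is None:
            continue  # nothing requested for this resource
        # Assumption: a limit that the instance type omits or sets to null counts as 0.
        allowed = instance_type_limits.get(key) or 0
        if to_number(requested) > to_number(allowed):
            exceeded.append(key)
    return exceeded


# Limits of the custom instance type and of the resources section shown in this article.
instance_type_limits = {"cpu": "1", "nvidia.com/gpu": 1, "memory": "2Gi"}
print(exceeded_limits({"cpu": "0.2", "memory": "0.5Gi"}, instance_type_limits))  # []
print(exceeded_limits({"cpu": "2", "memory": "4Gi"}, instance_type_limits))  # ['cpu', 'memory']
```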

通过