Working with tables in Azure Machine Learning

Article
2024-08-14

Applies to: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

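Azure Machine Learning supports a Table type (mltable). With it, you can create a blueprint that defines how to load data files into memory as a Pandas or Spark data frame. In this article, you learn: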

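When to use Azure Machine Learning Tables instead of Files or Folders
How to install the mltable SDK
How to define a data loading blueprint with an mltable file
Examples that show how to use mltable in Azure Machine Learning
How to use mltable during interactive development (for example, in a notebook)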

Prerequisites

An Azure subscription. If you don't have one, create a trial subscription before you begin.
The Azure Machine Learning SDK for Python
An Azure Machine Learning workspace

Important

Make sure that the latest mltable package is installed in your Python environment:

pip install -U mltable azureml-dataprep[pandas]
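
After the install, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch that uses only the Python standard library; the helper name installed_version is hypothetical and isn't part of mltable.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages that the pip command above is expected to install.
for name in ("mltable", "azureml-dataprep"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either package is reported as not installed, rerun the pip command in the same Python environment that the notebook or script uses.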

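Clone the examples repository

The code snippets in this article are based on examples in the Azure Machine Learning examples GitHub repository. To clone the repository to your development environment, use this command: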

git clone --depth 1 https://github.com/Azure/azureml-examples

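Tip

Use --depth 1 to clone only the latest commit to the repository. This reduces the time needed to complete the operation.

You can find the examples relevant to Azure Machine Learning Tables in this folder of the cloned repository: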

cd azureml-examples/sdk/python/using-mltable

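Introduction

With Azure Machine Learning Tables (mltable), you can define how to load your data files into memory as a Pandas and/or Spark data frame. Tables have two key features:

An MLTable file. A YAML-based file that defines the data loading blueprint. In the MLTable file, you can specify:
- The storage location or locations of the data: local, in the cloud, or on a public http(s) server.
- Globbing patterns over cloud storage. These locations can specify sets of file names with wildcard characters (*).
- Read transformations: for example, the file format type (delimited text, Parquet, Delta, JSON), delimiters, headers, and so on.
- Column type conversions (to enforce a schema).
- New columns created from folder structure information: for example, year and month columns created from the {year}/{month} folder structure in the path.
- Subsets of data to load: for example, filter rows, keep or drop columns, take random samples.
A fast and efficient engine that loads the data into a Pandas or Spark data frame according to the blueprint defined in the MLTable file. The engine relies on Rust for speed and memory efficiency.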
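
To make the "new columns from folder structure" idea concrete, here is a small standard-library sketch of how named placeholders such as {year} and {month} could be matched against a path. It only illustrates the concept; it isn't how the mltable engine is implemented, and the function name extract_partition_values is hypothetical.

```python
import re


def extract_partition_values(path: str, partition_format: str) -> dict:
    """Pull named values such as {year} out of a path (illustration only)."""
    # Splitting on {name} placeholders leaves the placeholder names at odd indexes.
    pieces = re.split(r"\{(\w+)\}", partition_format)
    pattern = "".join(
        f"(?P<{piece}>[^/]+)" if i % 2 else re.escape(piece)
        for i, piece in enumerate(pieces)
    )
    match = re.search(pattern, path)
    return match.groupdict() if match else {}


values = extract_partition_values(
    "green/puYear=2015/puMonth=3/part-XXX.snappy.parquet",
    "/puYear={year}/puMonth={month}",
)
print(values)
```

In this sketch the extracted values are plain strings; any typing of the resulting columns is left to the real engine.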

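Azure Machine Learning Tables are useful in these scenarios:

You need to glob over storage locations.
You need to create a table from data in different storage locations (for example, different blob containers).
The path contains relevant information that you want to capture in your data (for example, date and time).
The data schema changes frequently.
You want your data loading steps to be easy to reproduce.
You need only a subset of large data.
Your data contains storage locations that you want to stream into your Python session. For example, you want to stream path in this JSON Lines structure: [{"path": "abfss://fs@account.dfs.core.chinacloudapi.cn/my-images/cats/001.jpg", "label":"cat"}].
You want to train ML models with Azure Machine Learning AutoML.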
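
For the JSON Lines scenario above, each line of the file is a standalone JSON object that carries a storage path and its label. The sketch below parses such records with the standard library to show their shape. The first record comes from the example above; the second (dogs) record is a hypothetical addition for illustration.

```python
import json

# Two JSON Lines records; the second one is hypothetical.
jsonl_text = "\n".join(
    [
        '{"path": "abfss://fs@account.dfs.core.chinacloudapi.cn/my-images/cats/001.jpg", "label": "cat"}',
        '{"path": "abfss://fs@account.dfs.core.chinacloudapi.cn/my-images/dogs/001.jpg", "label": "dog"}',
    ]
)

# One JSON object per non-empty line.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

for record in records:
    print(record["label"], record["path"])
```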

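Tip

For tabular data, Azure Machine Learning doesn't require the use of Azure Machine Learning Tables (mltable). You can use the Azure Machine Learning File (uri_file) and Folder (uri_folder) types instead, and your own parsing logic loads the data into a Pandas or Spark data frame.

For a simple CSV file or a Parquet folder, Azure Machine Learning Files/Folders are easier to use than Tables.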

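Azure Machine Learning Tables quickstart

In this quickstart, you create a Table (mltable) of the NYC Green Taxi data from Azure Open Datasets. The data is in parquet format and covers the years 2008-2021. On a publicly accessible blob storage account, the data files have this folder structure: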

/
└── green
    ├── puYear=2008
    │   ├── puMonth=1
    │   │   ├── _committed_2983805876188002631
    │   │   └── part-XXX.snappy.parquet
    │   ├── ...
    │   └── puMonth=12
    │       ├── _committed_2983805876188002631
    │       └── part-XXX.snappy.parquet
    ├── ...
    └── puYear=2021
        ├── puMonth=1
        │   ├── _committed_2983805876188002631
        │   └── part-XXX.snappy.parquet
        ├── ...
        └── puMonth=12
            ├── _committed_2983805876188002631
            └── part-XXX.snappy.parquet

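With this data, you need to load into a Pandas data frame:

Only the parquet files for years 2015-19
A random sample of the data
Only rows with a trip distance greater than 0
Relevant columns for machine learning
New columns, year and month, built from the path information (puYear=X/puMonth=Y)

Pandas code can handle this. However, reproducibility becomes difficult because you must either:

Share code, which means that if the schema changes (for example, a column name changes), all users must update their code
Write an ETL pipeline, which has heavy overhead

Azure Machine Learning Tables provide a lightweight mechanism to serialize (save) the data loading steps in an MLTable file. You and your team members can then reproduce the Pandas data frame. If the schema changes, you update only the MLTable file instead of the many places that contain Python data loading code.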

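Clone the quickstart notebook or create a new notebook/script

If you use an IDE, create a new Python script.

The quickstart notebook is also available in the Azure Machine Learning examples GitHub repository. Use this code to clone the repository and access the notebook: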

git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples/sdk/python/using-mltable/quickstart

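Install the `mltable` Python SDK

To load the NYC Green Taxi data into an Azure Machine Learning Table, you must have the mltable Python SDK and pandas installed in your Python environment. Use this command: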

pip install -U mltable azureml-dataprep[pandas]

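Author an MLTable file

Use the mltable Python SDK to create an MLTable file that documents the data loading blueprint. Copy and paste the following code into your notebook or script, and then run it: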

import mltable

# glob the parquet file paths for years 2015-19, all months.
paths = [
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.chinacloudapi.cn/green/puYear=2015/puMonth=*/*.parquet"
    },
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.chinacloudapi.cn/green/puYear=2016/puMonth=*/*.parquet"
    },
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.chinacloudapi.cn/green/puYear=2017/puMonth=*/*.parquet"
    },
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.chinacloudapi.cn/green/puYear=2018/puMonth=*/*.parquet"
    },
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.chinacloudapi.cn/green/puYear=2019/puMonth=*/*.parquet"
    },
]

# create a table from the parquet paths
tbl = mltable.from_parquet_files(paths)

# take a random sample
tbl = tbl.take_random_sample(probability=0.001, seed=735)

# filter trips with a distance > 0
tbl = tbl.filter("col('tripDistance') > 0")

# Drop columns
tbl = tbl.drop_columns(["puLocationId", "doLocationId", "storeAndFwdFlag"])

# Create two new columns - year and month - where the values are taken from the path
tbl = tbl.extract_columns_from_partition_format("/puYear={year}/puMonth={month}")

# print the first 5 records of the table as a check
tbl.show(5)
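
The five path entries in the snippet above differ only in the year, so they could also be generated rather than written out by hand. This is an optional sketch; the variable name base is introduced here for illustration, and the resulting list is meant to be equivalent to the hand-written one.

```python
# Storage root of the NYC Green Taxi data, as used in the snippet above.
base = "wasbs://nyctlc@azureopendatastorage.blob.core.chinacloudapi.cn/green"

# One glob pattern per year from 2015 through 2019, covering all months.
paths = [
    {"pattern": f"{base}/puYear={year}/puMonth=*/*.parquet"}
    for year in range(2015, 2020)
]

for entry in paths:
    print(entry["pattern"])
```

The generated list can then be passed to mltable.from_parquet_files in place of the literal list.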

Optionally, you can load the MLTable object into Pandas with:

# You can load the table into a pandas dataframe
# NOTE: The data is in China East 2 region and the data is large, so this will take several minutes (~7mins)
# to load if you are in a different region.

# df = tbl.to_pandas_dataframe()

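Save the data loading steps

Next, save all your data loading steps into an MLTable file. With the steps saved in an MLTable file, you can reproduce your Pandas data frame at a later point in time without redefining the code each time.

You can save the MLTable YAML file to a cloud storage resource or to a local path.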

# save the data loading steps in an MLTable file to a cloud storage resource
# NOTE: the tbl object was defined in the previous snippet.
tbl.save(path="azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/nyc_taxi", colocated=True, show_progress=True, overwrite=True)

# save the data loading steps in an MLTable file to a local resource
# NOTE: the tbl object was defined in the previous snippet.
tbl.save("./nyc_taxi")

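Important

If colocated == True, the data is copied into the same folder as the MLTable YAML file (if the two aren't already colocated), and the MLTable YAML uses relative paths.
If colocated == False, the data isn't moved; the MLTable YAML uses absolute paths for cloud data and relative paths for local data.
This parameter combination isn't supported: data stored in a local resource, colocated == False, and path targeting a cloud directory. Upload your local data to the cloud, and use the cloud data paths for MLTable instead.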
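
The rules in this note can be restated as a small decision function. The sketch below only restates the note in plain Python; the function and parameter names are hypothetical and aren't part of the mltable API.

```python
def saved_path_style(colocated: bool, data_is_local: bool, target_is_cloud: bool) -> str:
    """Return the kind of path the saved MLTable file uses, per the note above."""
    if colocated:
        # The data is copied next to the MLTable file, so relative paths are used.
        return "relative"
    if data_is_local and target_is_cloud:
        # The combination that the note describes as unsupported.
        raise ValueError("local data, colocated=False, and a cloud target are not supported together")
    # Data stays where it is: relative paths for local data, absolute for cloud data.
    return "relative" if data_is_local else "absolute"


print(saved_path_style(colocated=False, data_is_local=False, target_is_cloud=True))
```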

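Reproduce the data loading steps

Now that the data loading steps are serialized into a file, you can reproduce them at any point in time with the load() method. You don't need to redefine the data loading steps in code, and the file is easier to share.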

import mltable

# load the previously saved MLTable file
tbl = mltable.load("./nyc_taxi/")
tbl.show(5)

# You can load the table into a pandas dataframe
# NOTE: The data is in China East 2 region and the data is large, so this will take several minutes (~7mins)
# to load if you are in a different region.

# load the table into pandas
# df = tbl.to_pandas_dataframe()

# print the head of the data frame
# df.head()
# print the shape and column types of the data frame
# print(f"Shape: {df.shape}")
# print(f"Columns:\n{df.dtypes}")

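Create a data asset to aid sharing and reproducibility

Your MLTable file is currently saved on disk, which makes it hard to share with team members. When you create a data asset in Azure Machine Learning, the MLTable is uploaded to cloud storage and "bookmarked". Your team members can then access the MLTable with a friendly name. The data asset is also versioned.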

Using the Azure CLI:

az ml data create --name green-quickstart --version 1 --path ./nyc_taxi --type mltable

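Note

The path points to the folder that contains the MLTable file.

Using the Python SDK, first set your subscription, resource group, and workspace: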

subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

You can create a data asset in Azure Machine Learning with this Python code:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# set VERSION variable
VERSION="1"

# connect to the AzureML workspace
# NOTE: the subscription_id, resource_group, workspace variables are set
# in the previous code snippet.
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

my_data = Data(
    path="./nyc_taxi",
    type=AssetTypes.MLTABLE,
    description="A random sample of NYC Green Taxi Data between 2015-19.",
    name="green-quickstart",
    version=VERSION,
)

ml_client.data.create_or_update(my_data)

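Note

The path points to the folder that contains the MLTable file.

Read the data asset in an interactive session

Now that your MLTable is stored in the cloud, you and your team members can access it with a friendly name in an interactive session (for example, a notebook):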

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# connect to the AzureML workspace
# NOTE: the subscription_id, resource_group, workspace variables are set
# in a previous code snippet.
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# get the latest version of the data asset
# Note: The version was set in the previous snippet. If you changed the version
# number, update the VERSION variable below.
VERSION="1"
data_asset = ml_client.data.get(name="green-quickstart", version=VERSION)

# create a table
tbl = mltable.load(f"azureml:/{data_asset.id}")
tbl.show(5)

# load into pandas
# NOTE: The data is in East US region and the data is large, so this will take several minutes (~7mins) to load if you are in a different region.
df = tbl.to_pandas_dataframe()
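
The load call above writes the prefix as azureml:/ with a single slash. The sketch below shows why that reads as a regular azureml:// URI, assuming that the asset ID is an ARM-style resource ID that begins with a slash. The example ID is a placeholder whose shape is an assumption, and the helper name asset_uri is hypothetical.

```python
def asset_uri(asset_id: str) -> str:
    """Prefix an asset ID the same way as the mltable.load call above."""
    return f"azureml:/{asset_id}"


# Placeholder ID; the exact shape of a real data_asset.id is assumed here.
example_id = (
    "/subscriptions/<subid>/resourceGroups/<rgname>/providers/"
    "Microsoft.MachineLearningServices/workspaces/<wsname>/data/green-quickstart/versions/1"
)

print(asset_uri(example_id))
```

Because the ID already starts with a slash, the single slash in the prefix combines with it to form the double slash of the URI.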

Read the data asset in a job

If you or a team member want to access the Table in a job, your Python training script would contain:

# ./src/train.py
import argparse
import mltable

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument('--input', help='mltable to read')
args = parser.parse_args()

# load mltable
tbl = mltable.load(args.input)

# load into pandas
df = tbl.to_pandas_dataframe()
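
The argument handling in this script can be checked without submitting a job. The sketch below rebuilds the same parser and feeds it an explicit argument list. The ./nyc_taxi value stands in for whatever location the job supplies for the input at run time, which is an assumption of this example; build_parser is a hypothetical helper name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the same command-line parser as the training script above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", help="mltable to read")
    return parser


# Simulate: python train.py --input ./nyc_taxi
args = build_parser().parse_args(["--input", "./nyc_taxi"])
print(args.input)
```

The value of args.input is what the script then hands to mltable.load.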

Your job needs a conda file that lists the Python package dependencies:

# ./conda_dependencies.yml
dependencies:
  - python=3.10
  - pip=21.2.4
  - pip:
      - mltable
      - azureml-dataprep[pandas]

You can submit the job with either the Azure CLI or the Python SDK.

Using the Azure CLI, create the following job YAML file:

# mltable-job.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

code: ./src

command: python train.py --input ${{inputs.green}}
inputs:
    green:
      type: mltable
      path: azureml:green-quickstart:1

compute: cpu-cluster

environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
  conda_file: conda_dependencies.yml

In the CLI, create the job:

az ml job create -f mltable-job.yml

Using the Python SDK, submit the job with this code:

from azure.ai.ml import MLClient, command, Input
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# get the latest version of the data asset
# Note: the VERSION was set in a previous cell.
data_asset = ml_client.data.get(name="green-quickstart", version=VERSION)

job = command(
    command="python train.py --input ${{inputs.green}}",
    inputs={"green": Input(type="mltable", path=data_asset.id)},
    compute="cpu-cluster",
    environment=Environment(
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
        conda_file="./conda_dependencies.yml",
    ),
    code="./src",
)

ml_client.jobs.create_or_update(job)

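Authoring MLTable files

To create an MLTable file, we recommend that you author it with the mltable Python SDK, as shown in the Azure Machine Learning Tables quickstart, instead of a text editor. This section outlines the capabilities of the mltable Python SDK.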

Supported file types

You can create an MLTable from a range of file types:

File type	`MLTable` Python SDK
Delimited text (for example, CSV files)	`from_delimited_files(paths=[path])`
Parquet	`from_parquet_files(paths=[path])`
Delta Lake	`from_delta_lake(delta_table_uri=<uri_pointing_to_delta_table_directory>,timestamp_as_of='2022-08-26T00:00:00Z')`
JSON Lines	`from_json_lines_files(paths=[path])`
Paths (create a table with a column of paths to stream)	`from_paths(paths=[path])`

For more information, see the MLTable reference resource.

Defining paths

For delimited text, parquet, JSON Lines, and paths, define a list of Python dictionaries that specifies the path or paths to read from:

import mltable

# A List of paths to read into the table. The paths are a python dict that define if the path is
# a file, folder, or (glob) pattern.
paths = [
    {
        "file": "<supported_path>"
    }
]

tbl = mltable.from_delimited_files(paths=paths)

# alternatively
# tbl = mltable.from_parquet_files(paths=paths)
# tbl = mltable.from_json_lines_files(paths=paths)
# tbl = mltable.from_paths(paths=paths)

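MLTable supports these path types:

Location	Examples
A path on your local computer	`./home/username/data/my_data`
A path on a public http(s) server	`https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv`
A path on Azure Storage	`wasbs://<container_name>@<account_name>.blob.core.chinacloudapi.cn/<path>` `abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>`
A long-form Azure Machine Learning datastore	`azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<path>`

Note

mltable handles user credential passthrough for paths on Azure Storage and Azure Machine Learning datastores. If you don't have permission to the data on the underlying storage, you can't access the data.

A note on defining paths for Delta Lake tables

Defining paths to read Delta Lake tables differs from the other file types. For a Delta Lake table, the path points to a single folder (typically on ADLS gen2) that contains the "_delta_log" folder and the data files. Time travel is supported. This code shows how to define a path for a Delta Lake table: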
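
The four path types in this table can be told apart by their URI scheme. The sketch below classifies a path string with the standard library. It is a reading aid for the table, not part of mltable; the function name classify_path and the returned labels are hypothetical.

```python
from urllib.parse import urlparse


def classify_path(path: str) -> str:
    """Map a path string to one of the location types in the table above."""
    scheme = urlparse(path).scheme.lower()
    if scheme in ("http", "https"):
        return "public http(s) server"
    if scheme in ("wasbs", "abfss"):
        return "Azure Storage"
    if scheme == "azureml":
        return "Azure Machine Learning datastore"
    # Anything without a recognized scheme is treated as a local path here.
    return "local"


for example in (
    "./home/username/data/my_data",
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv",
    "wasbs://data@azuremlexampledata.blob.core.chinacloudapi.cn/titanic.csv",
):
    print(classify_path(example), "<-", example)
```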

import mltable

# define the cloud path containing the delta table (where the _delta_log file is stored)
delta_table = "abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path_to_delta_table>"

# create an MLTable. Note the timestamp_as_of parameter for time travel.
tbl = mltable.from_delta_lake(
    delta_table_uri=delta_table,
    timestamp_as_of='2022-08-26T00:00:00Z'
)

To get the latest version of the Delta Lake data, you can pass the current timestamp into timestamp_as_of.

import time
import mltable

# define the relative path containing the delta table (where the _delta_log file is stored)
delta_table_path = "./working-directory/delta-sample-data"

# get the current timestamp in the required format
current_timestamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
print(current_timestamp)
tbl = mltable.from_delta_lake(delta_table_path, timestamp_as_of=current_timestamp)
df = tbl.to_pandas_dataframe()
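
The timestamp string in these snippets follows the pattern YYYY-MM-DDTHH:MM:SSZ in UTC. As a small aid, the sketch below produces and validates strings in that pattern with the datetime module. The helper names utc_timestamp and is_valid_timestamp are hypothetical and aren't part of mltable.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def utc_timestamp(moment=None) -> str:
    """Format a datetime (default: now) as a UTC string such as 2022-08-26T00:00:00Z."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)


def is_valid_timestamp(text: str) -> bool:
    """Check that a string matches the timestamp pattern used above."""
    try:
        datetime.strptime(text, TIMESTAMP_FORMAT)
        return True
    except ValueError:
        return False


print(utc_timestamp())
```

Validating a hand-typed timestamp before it is passed to timestamp_as_of can catch formatting slips early.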

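Important

Limitation: mltable doesn't support partition key extraction when it reads data from Delta Lake. The mltable transformation extract_columns_from_partition_format doesn't work when you read Delta Lake data through mltable.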

Files, folders, and globs

Azure Machine Learning Tables support reading from:

File(s), for example: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/my-csv.csv
Folder(s), for example: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/my-folder/
Glob pattern(s), for example: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/my-folder/*.csv
A combination of files, folders, and glob patterns
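
To illustrate what a wildcard pattern such as my-folder/*.csv selects, the sketch below applies the standard library's fnmatchcase to a few hypothetical file names. It only demonstrates wildcard matching in general; the storage-side globbing that mltable performs may differ in details (for example, fnmatchcase lets * match across "/" as well).

```python
from fnmatch import fnmatchcase

# Hypothetical blob names inside one container.
names = [
    "my-folder/a.csv",
    "my-folder/b.csv",
    "my-folder/notes.txt",
    "other-folder/c.csv",
]

# Keep only the names that match the wildcard pattern.
matched = [name for name in names if fnmatchcase(name, "my-folder/*.csv")]
print(matched)
```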

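Supported data loading transformations

Find full, up-to-date details of the supported data loading transformations in the MLTable reference documentation.

Examples

The code snippets in this article are based on examples in the Azure Machine Learning examples GitHub repository. Use this command to clone the repository to your development environment: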

git clone --depth 1 https://github.com/Azure/azureml-examples

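Tip

Use --depth 1 to clone only the latest commit to the repository. This reduces the time needed to complete the operation.

This folder of the cloned repository hosts the examples relevant to Azure Machine Learning Tables: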

cd azureml-examples/sdk/python/using-mltable

Delimited files

First, create an MLTable from a CSV file with this code:

import mltable
from mltable import MLTableHeaders, MLTableFileEncoding, DataType

# create paths to the data files
paths = [{"file": "wasbs://data@azuremlexampledata.blob.core.chinacloudapi.cn/titanic.csv"}]

# create an MLTable from the data files
tbl = mltable.from_delimited_files(
    paths=paths,
    delimiter=",",
    header=MLTableHeaders.all_files_same_headers,
    infer_column_types=True,
    include_path_column=False,
    encoding=MLTableFileEncoding.utf8,
)

# filter out rows with undefined ages
tbl = tbl.filter("col('Age') > 0")

# drop PassengerId
tbl = tbl.drop_columns(["PassengerId"])

# ensure survived column is treated as boolean
data_types = {
    "Survived": DataType.to_bool(
        true_values=["True", "true", "1"], false_values=["False", "false", "0"]
    )
}
tbl = tbl.convert_column_types(data_types)

# show the first 5 records
tbl.show(5)

# You can also load into pandas...
# df = tbl.to_pandas_dataframe()
# df.head(5)
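
The true_values and false_values lists in the snippet above decide which raw strings count as True and which as False. The plain-Python sketch below shows that mapping idea on single values. It isn't the mltable implementation, and returning None for a string in neither list is a choice made for this sketch only.

```python
TRUE_VALUES = ["True", "true", "1"]
FALSE_VALUES = ["False", "false", "0"]


def to_bool(value: str, true_values=TRUE_VALUES, false_values=FALSE_VALUES):
    """Map a raw string to True or False using explicit value lists (illustration only)."""
    if value in true_values:
        return True
    if value in false_values:
        return False
    # Not covered by either list; this sketch reports that as None.
    return None


for raw in ("1", "0", "true", "maybe"):
    print(repr(raw), "->", to_bool(raw))
```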

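Save the data loading steps

Next, save all your data loading steps into an MLTable file. With the steps saved in an MLTable file, you can reproduce your Pandas data frame at a later point in time without redefining the code each time.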

# save the data loading steps in an MLTable file
# NOTE: the tbl object was defined in the previous snippet.
tbl.save("./titanic")

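Reproduce the data loading steps

Now that the file holds the serialized data loading steps, you can reproduce them at any point in time with the load() method. You don't need to redefine the data loading steps in code, and the file is easier to share.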

import mltable

# load the previously saved MLTable file
tbl = mltable.load("./titanic/")

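Create a data asset to aid sharing and reproducibility

Your MLTable file is currently saved on disk, which makes it hard to share with team members. When you create a data asset in Azure Machine Learning, the MLTable is uploaded to cloud storage and "bookmarked". Your team members can then access the MLTable with a friendly name. The data asset is also versioned.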

import time
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Update with your details...
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# set the version number of the data asset to the current UTC time
VERSION = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

my_data = Data(
    path="./titanic",
    type=AssetTypes.MLTABLE,
    description="The titanic dataset.",
    name="titanic-cloud-example",
    version=VERSION,
)

ml_client.data.create_or_update(my_data)

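Now that your MLTable is stored in the cloud, you and your team members can access it with a friendly name in an interactive session (for example, a notebook):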

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# connect to the AzureML workspace
# NOTE:  subscription_id, resource_group, workspace were set in a previous snippet.
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# get the latest version of the data asset
# Note: The version was set in the previous code cell.
data_asset = ml_client.data.get(name="titanic-cloud-example", version=VERSION)

# create a table
tbl = mltable.load(f"azureml:/{data_asset.id}")

# load into pandas
df = tbl.to_pandas_dataframe()
df.head(5)

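You can also easily access your data asset in a job.

Parquet files

The Azure Machine Learning Tables quickstart explains how to read parquet files.

Paths: Create a table of image files

You can create a table that contains the paths on cloud storage. This example has several dog and cat images in cloud storage, in this folder structure: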

/pet-images
  /cat
    0.jpeg
    1.jpeg
    ...
  /dog
    0.jpeg
    1.jpeg

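mltable can construct a table that contains the storage paths of these images and their folder names (labels), which can be used to stream the images. This code creates the MLTable: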

import mltable

# create paths to the data files
paths = [{"pattern": "wasbs://data@azuremlexampledata.blob.core.chinacloudapi.cn/pet-images/**/*.jpg"}]

# create the mltable
tbl = mltable.from_paths(paths)

# extract useful information from the path
tbl = tbl.extract_columns_from_partition_format("{account}/{container}/{folder}/{label}")

tbl = tbl.drop_columns(["account", "container", "folder"])

df = tbl.to_pandas_dataframe()
print(df.head())

# save the data loading steps in an MLTable file
tbl.save("./pets")
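
The label in this example is simply the name of the folder that holds each image. As a plain-Python illustration of that idea (not the mltable mechanism itself), the sketch below derives the label from a path with pathlib; the function name label_from_path is hypothetical.

```python
from pathlib import PurePosixPath


def label_from_path(path: str) -> str:
    """Return the name of the folder that directly contains the file."""
    return PurePosixPath(path).parent.name


for image_path in ("/pet-images/cat/0.jpeg", "/pet-images/dog/1.jpeg"):
    print(label_from_path(image_path), "<-", image_path)
```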

This code shows how to open the storage locations in the Pandas data frame and plot the images:

# plot images on a grid. Note this takes ~1min to execute.
import matplotlib.pyplot as plt
from PIL import Image

fig = plt.figure(figsize=(20, 20))
columns = 4
rows = 5
for i in range(1, columns*rows +1):
    with df.Path[i].open() as f:
        img = Image.open(f)
        fig.add_subplot(rows, columns, i)
        plt.imshow(img)
        plt.title(df.label[i])

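Create a data asset to aid sharing and reproducibility

Your mltable file is currently saved on disk, which makes it hard to share with team members. When you create a data asset in Azure Machine Learning, the mltable is uploaded to cloud storage and "bookmarked". Your team members can then access the mltable with a friendly name. The data asset is also versioned.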

import time
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# set the version number of the data asset to the current UTC time
VERSION = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())

# connect to the AzureML workspace
# NOTE: subscription_id, resource_group, workspace were set in a previous snippet.
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

my_data = Data(
    path="./pets",
    type=AssetTypes.MLTABLE,
    description="A sample of cat and dog images",
    name="pets-mltable-example",
    version=VERSION,
)

ml_client.data.create_or_update(my_data)

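Now that the mltable is stored in the cloud, you and your team members can access it with a friendly name in an interactive session (for example, a notebook):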

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# connect to the AzureML workspace
# NOTE: subscription_id, resource_group, workspace were set in a previous snippet.
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# get the latest version of the data asset
# Note: the variable VERSION is set in the previous code
data_asset = ml_client.data.get(name="pets-mltable-example", version=VERSION)

# the table from the data asset id
tbl = mltable.load(f"azureml:/{data_asset.id}")

# load into pandas
df = tbl.to_pandas_dataframe()
df.head()

You can also load the data into your job.
