Set up AutoML training for tabular data with the Azure Machine Learning CLI and Python SDK
APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)
In this article, learn how to set up an automated machine learning (AutoML) training job with the Azure Machine Learning Python SDK v2. Automated ML picks an algorithm and hyperparameters for you and generates a model ready for deployment. This article provides details of the various options that you can use to configure automated machine learning experiments.
If you prefer a no-code experience, you can also Set up no-code Automated ML training for tabular data with the studio UI.
Prerequisites
- An Azure subscription. If you don't have one, create a trial subscription before you begin.
- An Azure Machine Learning workspace. If you don't have one, see Create resources to get started.
To use the SDK information, install the Azure Machine Learning SDK v2 for Python.
To install the SDK, you can either:
- Create a compute instance, which already has the latest Azure Machine Learning Python SDK and is configured for ML workflows. For more information, see Create an Azure Machine Learning compute instance.
- Install the SDK on your local machine.
Set up your workspace
To connect to a workspace, you need to provide a subscription, resource group, and workspace.
The workspace details are used in the MLClient class from azure.ai.ml to get a handle to the required Azure Machine Learning workspace.
The following example uses the default Azure authentication with the default workspace configuration or the configuration from a config.json file in the folder structure. If it finds no config.json file, you need to manually provide the subscription ID, resource group, and workspace name when you create the MLClient.
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

credential = DefaultAzureCredential()
ml_client = None
try:
    # Look for a config.json file in the current folder structure
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your Azure Machine Learning workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AZUREML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)
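As a minimal sketch of the config.json file mentioned above, the following snippet writes one with values that you supply. The key names (subscription_id, resource_group, workspace_name) are assumed to match the file that you can download from the workspace page in the Azure portal, so verify them against your own copy. The write_workspace_config helper is hypothetical and isn't part of the SDK.

```python
import json
from pathlib import Path


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write a config.json file (hypothetical helper, not part of the SDK)."""
    # Assumed key names; compare with a config.json downloaded from the portal.
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(folder) / "config.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
    return path
```

For example, write_workspace_config(".", "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<AZUREML_WORKSPACE_NAME>") places the file in the current folder, where MLClient.from_config can look for it.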
Specify data source and format
In order to provide training data in SDK v2, you need to upload it into the cloud through an MLTable.
Requirements for loading data into an MLTable:
- Data must be in tabular form.
- The value to predict, target column, must be in the data.
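Before you create the MLTable, it can help to confirm that the data file meets the second requirement. The following sketch uses only the Python standard library to check that a CSV header contains the target column. The has_target_column helper is hypothetical, and it assumes that the first row of the file is a header.

```python
import csv


def has_target_column(csv_path, target_column_name):
    """Return True if the CSV header row contains the target column."""
    with open(csv_path, newline="") as f:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(f), [])
    return target_column_name in header
```

For the bank marketing example in this article, you would call has_target_column("./train_data/bank_marketing_train_data.csv", "y").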
Training data must be accessible from the remote compute. Automated ML v2 (Python SDK and CLI/YAML) accepts MLTable data assets (v2). For backward compatibility, it also supports registered v1 Tabular Datasets through the same input dataset properties. We recommend that you use MLTable, available in v2. In this example, the data is stored at the local path ./train_data/bank_marketing_train_data.csv.
You can create an MLTable using the mltable Python SDK as in the following example:
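import mltable

# Point to the local CSV file that holds the training data
paths = [
    {'file': './train_data/bank_marketing_train_data.csv'}
]

# Build the table definition and save it next to the data file
train_table = mltable.from_delimited_files(paths)
train_table.save('./train_data')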
This code creates a new file, ./train_data/MLTable, which contains the file format and loading instructions.
Now the ./train_data folder has the MLTable definition file plus the data file, bank_marketing_train_data.csv.
For more information on MLTable, see Working with tables in Azure Machine Learning.
Training, validation, and test data
You can specify separate training data and validation data sets. Training data must be provided to the training_data parameter in the factory function of your automated machine learning job.
If you don't explicitly specify a validation_data or n_cross_validations parameter, Automated ML applies default techniques to determine how validation is performed. This determination depends on the number of rows in the dataset assigned to your training_data parameter.
Training data size | Validation technique |
---|---|
Larger than 20,000 rows | Training and validation data split is applied. The default is to take 10% of the initial training data set as the validation set. In turn, that validation set is used for metrics calculation. |
Smaller than or equal to 20,000 rows | Cross-validation approach is applied. The default number of folds depends on the number of rows. If the dataset has fewer than 1,000 rows, ten folds are used. If the dataset has between 1,000 and 20,000 rows, inclusive, three folds are used. |
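The defaults in the preceding table can be summarized as a small function. This is only an illustration of the documented thresholds, not the logic that Automated ML runs internally. The default_validation name and the shape of the returned dictionary are hypothetical.

```python
def default_validation(n_rows):
    """Describe the default validation technique for a training set size."""
    if n_rows > 20_000:
        # Larger than 20,000 rows: hold out 10% of the training data.
        return {"technique": "train_validation_split", "validation_fraction": 0.10}
    # 20,000 rows or fewer: cross-validation, with more folds for small data.
    folds = 10 if n_rows < 1_000 else 3
    return {"technique": "cross_validation", "folds": folds}
```

For example, a dataset with 500 rows gets ten-fold cross-validation, while a dataset with 50,000 rows gets a 10% validation split.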
Compute to run experiment
Automated machine learning jobs with the Python SDK v2 (or CLI v2) are currently supported only on an Azure Machine Learning remote compute cluster or compute instance. For more information about creating compute with the Python SDK v2 or CLI v2, see Train models with Azure Machine Learning CLI, SDK, and REST API.
Configure your experiment settings
There are several options that you can use to configure your automated machine learning experiment. These configuration parameters are set in your task method. You can also set job training settings and exit criteria with the training and limits settings.
The following example shows the required parameters for a classification task that specifies accuracy as the primary metric and five cross-validation folds.
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import automl, Input

# note that this is a code snippet -- you might have to modify the variable values to run it successfully
# placeholders: replace with your compute target and experiment names
my_compute_name = "<COMPUTE_NAME>"
my_exp_name = "<EXPERIMENT_NAME>"

# make an Input object for the training data
my_training_data_input = Input(
    type=AssetTypes.MLTABLE, path="./data/training-mltable-folder"
)

# configure the classification job
classification_job = automl.classification(
    compute=my_compute_name,
    experiment_name=my_exp_name,
    training_data=my_training_data_input,
    target_column_name="y",
    primary_metric="accuracy",
    n_cross_validations=5,
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"}
)

# Limits are all optional
classification_job.set_limits(
    timeout_minutes=600,
    trial_timeout_minutes=20,
    max_trials=5,
    enable_early_termination=True,
)

# Training properties are optional
classification_job.set_training(
    blocked_training_algorithms=["logistic_regression"],
    enable_onnx_compatible_models=True
)
Select your machine learning task type
Before you can submit your Automated ML job, determine the kind of machine learning problem that you want to solve. This problem determines which function your job uses and what model algorithms it applies.
Automated ML supports different task types:
Tabular data based tasks
- classification
- regression
- forecasting
Computer vision tasks, including
- Image Classification
- Object Detection
Natural language processing tasks, including
- Text classification
- Entity Recognition
For more information, see task types. For more information on setting up forecasting jobs, see Set up AutoML to train a time-series forecasting model.
Supported algorithms
Automated machine learning tries different models and algorithms during the automation and tuning process. As a user, you don't need to specify the algorithm.
The task method determines the list of algorithms or models to apply. To further modify iterations with the available models to include or exclude, use the allowed_training_algorithms or blocked_training_algorithms parameters in the training configuration of the job.
In the following table, explore the supported algorithms per machine learning task.
With other algorithms:
- Image Classification Multi-class Algorithms
- Image Classification Multi-label Algorithms
- Image Object Detection Algorithms
- NLP Text Classification Multi-label Algorithms
- NLP Text Named Entity Recognition (NER) Algorithms
For example notebooks of each task type, see automl-standalone-jobs.
Primary metric
The primary_metric parameter determines the metric to be used during model training for optimization. The task type that you choose determines the metrics that you can select.
Choosing a primary metric for automated machine learning to optimize depends on many factors. We recommend your primary consideration be to choose a metric that best represents your business needs. Then consider if the metric is suitable for your dataset profile, including data size, range, and class distribution. The following sections summarize the recommended primary metrics based on task type and business scenario.
To learn about the specific definitions of these metrics, see Evaluate automated machine learning experiment results.
Metrics for classification multi-class scenarios
These metrics apply for all classification scenarios, including tabular data, images or computer-vision, and natural language processing text (NLP-Text).
Threshold-dependent metrics, like accuracy, recall_score_weighted, norm_macro_recall, and precision_score_weighted might not optimize as well for datasets that are small, have large class skew (class imbalance), or when the expected metric value is very close to 0.0 or 1.0. In those cases, AUC_weighted can be a better choice for the primary metric. After automated machine learning completes, you can choose the winning model based on the metric best suited to your business needs.
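To see why a threshold-dependent metric can struggle on skewed data, consider the following sketch. It compares accuracy at a fixed 0.5 threshold with a plain binary AUC on a made-up dataset that has 95 negative rows and 5 positive rows. The functions are simplified illustrations written for this example. They aren't the implementations that Automated ML uses, and the binary AUC shown here isn't the same calculation as AUC_weighted for multi-class problems.

```python
def accuracy(y_true, scores, threshold=0.5):
    """Fraction of rows classified correctly at a fixed score threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(t == p for t, p in zip(y_true, preds)) / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Requires at least one row of each class.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up data: 95 negatives followed by 5 positives
y_true = [0] * 95 + [1] * 5
model_a = [0.1] * 95 + [0.4] * 5  # ranks every positive above every negative
model_b = [0.1] * 100             # gives every row the same score
```

Both models score an accuracy of 0.95 because neither crosses the threshold for any row, so accuracy can't tell them apart. The AUC is 1.0 for model_a and 0.5 for model_b, which reflects that only model_a separates the classes.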
Metric | Example use cases |
---|---|
accuracy | Image classification, Sentiment analysis, Churn prediction |
AUC_weighted | Fraud detection, Image classification, Anomaly detection/spam detection |
average_precision_score_weighted | Sentiment analysis |
norm_macro_recall | Churn prediction |
precision_score_weighted | |
Metrics for classification multi-label scenarios
For Text classification multi-label, currently 'Accuracy' is the only primary metric supported.
For Image classification multi-label, the primary metrics supported are defined in the ClassificationMultilabelPrimaryMetrics enum.
Metrics for NLP Text Named Entity Recognition scenarios
For NLP Text Named Entity Recognition (NER), currently 'Accuracy' is the only primary metric supported.
Metrics for regression scenarios
r2_score, normalized_mean_absolute_error, and normalized_root_mean_squared_error are all trying to minimize prediction errors. r2_score and normalized_root_mean_squared_error both minimize average squared errors, while normalized_mean_absolute_error minimizes the average absolute value of errors. Absolute value treats errors at all magnitudes alike, whereas squared errors carry a much larger penalty for errors with larger absolute values. Depending on whether larger errors should be punished more or not, you can choose to optimize squared error or absolute error.
The main difference between r2_score and normalized_root_mean_squared_error is the way they're normalized and their meanings. normalized_root_mean_squared_error is root mean squared error normalized by range and can be interpreted as the average error magnitude for prediction. r2_score is mean squared error normalized by an estimate of variance of data. It's the proportion of variation that the model can capture.
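The following sketch computes the three metrics from their descriptions above so that you can compare how they respond to the same errors. It assumes the textbook definitions: errors normalized by the range of the observed values, and r2_score computed as one minus the ratio of mean squared error to the population variance. The exact normalizers that Automated ML uses might differ, and the regression_metrics helper is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute illustrative versions of three regression metrics."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n   # average squared error
    mae = sum(abs(e) for e in errors) / n  # average absolute error
    value_range = max(y_true) - min(y_true)
    mean = sum(y_true) / n
    variance = sum((t - mean) ** 2 for t in y_true) / n
    return {
        "normalized_root_mean_squared_error": math.sqrt(mse) / value_range,
        "normalized_mean_absolute_error": mae / value_range,
        "r2_score": 1 - mse / variance,
    }
```

For observations [1, 2, 3, 4, 5] and predictions [1, 2, 3, 4, 7], the single error of 2 units gives a normalized mean absolute error of 0.1 and a normalized root mean squared error of about 0.22. The squared-error metric penalizes the one large miss more heavily.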
Note
r2_score and normalized_root_mean_squared_error also behave similarly as primary metrics. If a fixed validation set is applied, these two metrics are optimizing the same target, mean squared error, and are optimized by the same model. When only a training set is available and cross-validation is applied, they would be slightly different as the normalizer for normalized_root_mean_squared_error is fixed as the range of the training set, but the normalizer for r2_score would vary for every fold as it's the variance for each fold.
If the rank, instead of the exact value, is of interest, spearman_correlation can be a better choice. It measures the rank correlation between real values and predictions.
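As a rough illustration of rank correlation, the following sketch uses the classic Spearman formula for data without tied values. It assumes at least two rows and no ties. It isn't the implementation that Automated ML uses, which would also need to handle ties.

```python
def spearman_correlation(y_true, y_pred):
    """Spearman rank correlation for inputs without tied values."""

    def ranks(values):
        # Rank 1 goes to the smallest value, rank n to the largest.
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    n = len(y_true)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(y_true), ranks(y_pred)))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

Predictions that are far from the real values but keep the same ordering, such as [1, 5, 7, 1000] for real values [10, 20, 30, 40], still get a correlation of 1.0.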
Automated ML doesn't currently support any primary metrics that measure relative difference between predictions and observations. The metrics r2_score, normalized_mean_absolute_error, and normalized_root_mean_squared_error are all measures of absolute difference. For example, if a prediction differs from an observation by 10 units, these metrics compute the same value if the observation is 20 units or 20,000 units. In contrast, a percentage difference, which is a relative measure, gives errors of 50% and 0.05%, respectively. To optimize for relative difference, you can run Automated ML with a supported primary metric and then select the model with the best mean_absolute_percentage_error or root_mean_squared_log_error. These metrics are undefined when any observation values are zero, so they might not always be good choices.
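The arithmetic in the preceding example can be reproduced with two small helper functions. Both names are hypothetical and exist only for this illustration.

```python
def absolute_error(observed, predicted):
    """Absolute difference between an observation and a prediction."""
    return abs(observed - predicted)


def percentage_error(observed, predicted):
    """Relative difference as a percentage of the observed value.

    Raises ZeroDivisionError when the observation is zero, which mirrors
    why relative metrics are undefined for zero-valued observations.
    """
    return abs(observed - predicted) / abs(observed) * 100
```

A prediction that misses by 10 units has the same absolute error for an observation of 20 and an observation of 20,000, but the percentage error is 50% in the first case and 0.05% in the second.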
Metric | Example use cases |
---|---|
spearman_correlation | |
normalized_root_mean_squared_error | Price prediction (house/product/tip), Review score prediction |
r2_score | Airline delay, Salary estimation, Bug resolution time |
normalized_mean_absolute_error | |
Metrics for Time Series Forecasting scenarios
The recommendations are similar to the recommendations for regression scenarios.
Metric | Example use cases |
---|---|
normalized_root_mean_squared_error | Price prediction (forecasting), Inventory optimization, Demand forecasting |
r2_score | Price prediction (forecasting), Inventory optimization, Demand forecasting |
normalized_mean_absolute_error | |
Metrics for Image Object Detection scenarios
For Image Object Detection, the primary metrics supported are defined in the ObjectDetectionPrimaryMetrics enum.
Metrics for Image Instance Segmentation scenarios
For Image Instance Segmentation scenarios, the primary metrics supported are defined in the InstanceSegmentationPrimaryMetrics enum.
Data featurization
In every automated machine learning experiment, your data is automatically transformed to numbers and vectors of numbers. The data is also scaled and normalized to help algorithms that are sensitive to features that are on different scales. These data transformations are called featurization.
Note
Automated machine learning featurization steps, such as feature normalization, handling missing data, and converting text to numeric, become part of the underlying model. When you use the model for predictions, the same featurization steps applied during training are applied to your input data automatically.
When you configure automated machine learning jobs, you can enable or disable the featurization settings.
The following table shows the accepted settings for featurization.
Featurization configuration | Description |
---|---|
"mode": 'auto' | Indicates that, as part of preprocessing, data guardrails and featurization steps are performed automatically. This value is the default setting. |
"mode": 'off' | Indicates that the featurization step shouldn't be done automatically. |
"mode": 'custom' | Indicates that a customized featurization step should be used. |
The following code shows how custom featurization can be provided in this case for a regression job.
from azure.ai.ml.automl import ColumnTransformer

# Impute missing values in two columns with each column's most frequent value
transformer_params = {
    "imputer": [
        ColumnTransformer(fields=["CACH"], parameters={"strategy": "most_frequent"}),
        ColumnTransformer(fields=["PRP"], parameters={"strategy": "most_frequent"}),
    ],
}

# regression_job is an AutoML regression job that you created earlier
regression_job.set_featurization(
    mode="custom",
    transformer_params=transformer_params,
    blocked_transformers=["LabelEncoding"],
    column_name_and_types={"CHMIN": "Categorical"},
)
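To make the most_frequent strategy in the preceding configuration concrete, the following sketch shows the general idea of that kind of imputation on a single column: missing entries are replaced with the value that occurs most often. This is a conceptual illustration only, with missing values represented as None. It doesn't reproduce how Automated ML implements its imputer.

```python
from collections import Counter


def impute_most_frequent(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to learn a fill value from; return the column unchanged.
        return list(values)
    fill_value = Counter(observed).most_common(1)[0][0]
    return [fill_value if v is None else v for v in values]
```

For a column such as [8, None, 8, 16, None], the most frequent observed value is 8, so both missing entries become 8.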
Exit criteria
There are a few options you can define in the set_limits() function to end your experiment before the job completes.
Criteria | Description |
---|---|
No criteria | If you don't define any exit parameters, the experiment continues until no further progress is made on your primary metric. |
timeout_minutes | Defines how long, in minutes, your experiment should continue to run. If not specified, the default job's total timeout is six days (8,640 minutes). To specify a timeout less than or equal to 1 hour (60 minutes), make sure your dataset's size isn't greater than 10,000,000 (rows times columns) or an error results. This timeout includes setup, featurization, and training runs but doesn't include the ensembling and model explainability runs at the end of the process since those actions need to happen after all the trials (child jobs) are done. |
trial_timeout_minutes | Maximum time in minutes that each trial (child job) can run for before it terminates. If not specified, a value of 1 month or 43,200 minutes is used. |
enable_early_termination | Whether to end the job if the score isn't improving in the short term. |
max_trials | The maximum number of trials/runs, each with a different combination of algorithm and hyper-parameters, to try during a job. If not specified, the default is 1,000 trials. If you use enable_early_termination, the number of trials used can be smaller. |
max_concurrent_trials | Represents the maximum number of trials (child jobs) that would be executed in parallel. It's a good practice to match this number with the number of nodes in your cluster. |
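The dataset size rule for short timeouts in the preceding table can be checked before you submit a job. The following helper is a hypothetical sketch of that documented rule. It isn't part of the SDK, and the service performs its own validation.

```python
def short_timeout_is_allowed(n_rows, n_columns, timeout_minutes):
    """Check the documented size rule for timeouts of 60 minutes or less."""
    if timeout_minutes > 60:
        # The size rule applies only to timeouts of one hour or less.
        return True
    # Dataset size is measured as rows times columns.
    return n_rows * n_columns <= 10_000_000
```

For example, a dataset with 2,000,000 rows and 10 columns is too large for a 60-minute timeout, so short_timeout_is_allowed(2_000_000, 10, 60) returns False.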
Run experiment
Submit the experiment to run and generate a model.
Note
If you run an experiment with the same configuration settings and primary metric multiple times, you might see variation in each experiment's final metrics score and generated models. The algorithms that automated machine learning employs have inherent randomness that can cause slight variation in the models output by the experiment and the recommended model's final metrics score, like accuracy. You also might see results with the same model name, but different hyper-parameters used.
Warning
If you have set rules in firewall or Network Security Group over your workspace, verify that required permissions are given to inbound and outbound network traffic as defined in Configure inbound and outbound network traffic.
With the MLClient created in the prerequisites, you can run the following command in the workspace.
# Submit the AutoML job to the backend
returned_job = ml_client.jobs.create_or_update(classification_job)
print(f"Created job: {returned_job}")

# Get a URL for the status of the job
returned_job.services["Studio"].endpoint
Multiple child runs on clusters
Automated ML experiment child runs can be performed on a cluster that is already running another experiment. However, the timing depends on how many nodes the cluster has, and if those nodes are available to run a different experiment.
Each node in the cluster acts as an individual virtual machine (VM) that can accomplish a single training run. For Automated ML, this fact means a child run. If all the nodes are busy, a new experiment is queued. If there are free nodes, the new experiment runs child runs in parallel in the available nodes or virtual machines.
To help manage child runs and when they can be performed, we recommend that you create a dedicated cluster per experiment, and match the max_concurrent_trials value of your experiment to the number of nodes in the cluster. This way, you use all the nodes of the cluster at the same time with the number of concurrent child runs and trials that you want.
Configure max_concurrent_trials in the limits configuration. If it isn't configured, then by default only one concurrent child run/trial is allowed per experiment. For a compute instance, max_concurrent_trials can be set to be the same as the number of cores on the compute instance virtual machine.
Explore models and metrics
Automated ML offers options for you to monitor and evaluate your training results.
For definitions and examples of the performance charts and metrics provided for each run, see Evaluate automated machine learning experiment results.
To get a featurization summary and understand what features were added to a particular model, see Featurization transparency.
From the model's page in the Azure Machine Learning UI, you can also view the hyper-parameters used to train a particular model, and view and customize the training code for the internal model.
Register and deploy models
After you test a model and confirm you want to use it in production, you can register it for later use.
Tip
For registered models, you can use one-click deployment by using the Azure Machine Learning studio. See Deploy your model.
Use AutoML in pipelines
To use Automated ML in your machine learning operations workflows, you can add AutoML Job steps to your Azure Machine Learning Pipelines. This approach allows you to automate your entire workflow by hooking up your data preparation scripts to Automated ML. Then register and validate the resulting best model.
This code is a sample pipeline with an Automated ML classification component and a command component that shows the resulting output. The code references the inputs (training and validation data) and the outputs (best model) in different steps.
from azure.ai.ml import Input, Output, command
from azure.ai.ml.automl import classification
from azure.ai.ml.dsl import pipeline

# compute_name, experiment_name, and ml_client are assumed to be defined earlier


# Define pipeline
@pipeline(
    description="AutoML Classification Pipeline",
)
def automl_classification(
    classification_train_data,
    classification_validation_data
):
    # define the automl classification task with automl function
    classification_node = classification(
        training_data=classification_train_data,
        validation_data=classification_validation_data,
        target_column_name="y",
        primary_metric="accuracy",
        # currently need to specify outputs "mlflow_model" explicitly to reference it in following nodes
        outputs={"best_model": Output(type="mlflow_model")},
    )
    # set limits and training
    classification_node.set_limits(max_trials=1)
    classification_node.set_training(
        enable_stack_ensemble=False,
        enable_vote_ensemble=False
    )

    # command component that lists the files of the best model
    command_func = command(
        inputs=dict(
            automl_output=Input(type="mlflow_model")
        ),
        command="ls ${{inputs.automl_output}}",
        environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:latest"
    )
    show_output = command_func(automl_output=classification_node.outputs.best_model)


pipeline_job = automl_classification(
    classification_train_data=Input(path="./training-mltable-folder/", type="mltable"),
    classification_validation_data=Input(path="./validation-mltable-folder/", type="mltable"),
)

# set pipeline level compute
pipeline_job.settings.default_compute = compute_name

# submit the pipeline job
returned_pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job,
    experiment_name=experiment_name
)
returned_pipeline_job

# ...
# Note that this is a snippet from the bankmarketing example you can find in our examples repo -> https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1h_automl_in_pipeline/automl-classification-bankmarketing-in-pipeline
For more examples on how to include Automated ML in your pipelines, see the examples repository.
Use AutoML at scale: distributed training
For large data scenarios, Automated ML supports distributed training for a limited set of models:
Distributed algorithm | Supported tasks | Data size limit (approximate) |
---|---|---|
LightGBM | Classification, regression | 1 TB |
TCNForecaster | Forecasting | 200 GB |
Distributed training algorithms automatically partition and distribute your data across multiple compute nodes for model training.
Note
Cross-validation, ensemble models, ONNX support, and code generation are not currently supported in the distributed training mode. Also, Automated ML can make choices such as restricting available featurizers and sub-sampling data used for validation, explainability, and model evaluation.
Distributed training for classification and regression
To use distributed training for classification or regression, set the training_mode and max_nodes properties of the job object.
Property | Description |
---|---|
training_mode | Indicates training mode: distributed or non_distributed. Defaults to non_distributed. |
max_nodes | The number of nodes to use for training by each trial. This setting must be greater than or equal to 4. |
The following code sample shows an example of these settings for a classification job:
from azure.ai.ml.constants import TabularTrainingMode

# Set the training mode to distributed
classification_job.set_training(
    allowed_training_algorithms=["LightGBM"],
    training_mode=TabularTrainingMode.DISTRIBUTED
)

# Distribute training across 4 nodes for each trial
classification_job.set_limits(
    max_nodes=4,
    # other limit settings
)
Note
Distributed training for classification and regression tasks does not currently support multiple concurrent trials. Model trials execute sequentially with each trial using max_nodes nodes. The max_concurrent_trials limit setting is currently ignored.
Distributed training for forecasting
To learn how distributed training works for forecasting tasks, see forecasting at scale. To use distributed training for forecasting, you need to set the training_mode, enable_dnn_training, max_nodes, and optionally the max_concurrent_trials properties of the job object.
Property | Description |
---|---|
training_mode | Indicates training mode: distributed or non_distributed. Defaults to non_distributed. |
enable_dnn_training | Flag to enable deep neural network models. |
max_concurrent_trials | This value is the maximum number of trial models to train in parallel. Defaults to 1. |
max_nodes | The total number of nodes to use for training. This setting must be greater than or equal to 2. For forecasting tasks, each trial model is trained using $\max\left(2, \left\lfloor \text{max\_nodes} / \text{max\_concurrent\_trials} \right\rfloor\right)$ nodes. |
The following code sample shows an example of these settings for a forecasting job:
from azure.ai.ml.constants import TabularTrainingMode

# Set the training mode to distributed
forecasting_job.set_training(
    enable_dnn_training=True,
    allowed_training_algorithms=["TCNForecaster"],
    training_mode=TabularTrainingMode.DISTRIBUTED
)

# Distribute training across 4 nodes
# Train 2 trial models in parallel => 2 nodes per trial
forecasting_job.set_limits(
    max_concurrent_trials=2,
    max_nodes=4,
    # other limit settings
)
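The node allocation rule for forecasting trials, the maximum of 2 and the floor of max_nodes divided by max_concurrent_trials, can be worked through with a short function. The nodes_per_trial helper is hypothetical and only restates the documented formula.

```python
def nodes_per_trial(max_nodes, max_concurrent_trials=1):
    """Number of nodes each forecasting trial uses in distributed mode."""
    if max_nodes < 2:
        raise ValueError("max_nodes must be greater than or equal to 2")
    # Floor division splits the nodes evenly; each trial gets at least 2.
    return max(2, max_nodes // max_concurrent_trials)
```

With the settings in the preceding sample, nodes_per_trial(4, 2) returns 2, which matches the comment in the code. With 8 nodes and 2 concurrent trials, each trial would use 4 nodes.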
For samples of full configuration code, see previous sections on configuration and job submission.
Related content
- Learn more about how and where to deploy a model.
- Learn more about how to set up AutoML to train a time-series forecasting model.