Configure training, validation, cross-validation, and test data in automated machine learning
APPLIES TO: Python SDK azureml v1
This article describes options for configuring training data and validation data splits along with cross-validation settings for your automated machine learning (automated ML) experiments. In Azure Machine Learning, when you use automated ML to build multiple machine learning models, each child run needs to validate the related model by calculating the quality metrics for that model, such as accuracy or area under the curve (AUC) weighted. These metrics are calculated by comparing the predictions from each model with real labels from past observations in the validation data. Automated ML experiments perform model validation automatically.
The following sections describe how you can customize validation settings with the Azure Machine Learning Python SDK. To learn more about how metrics are calculated based on validation type, see the Set metric calculation for cross validation section. If you're interested in a low-code or no-code experience, see Create your automated ML experiments in Azure Machine Learning studio.
Prerequisites
An Azure Machine Learning workspace. To create the workspace, see Create workspace resources.
Familiarity with setting up an automated ML experiment with the Azure Machine Learning SDK. To see the fundamental automated machine learning experiment design patterns, complete the Train an object detection model tutorial or the Set up AutoML training with Python guide.
An understanding of train/validation data splits and cross-validation as machine learning concepts. For a high-level explanation, see the following articles:
Important
The Python commands in this article require the latest `azureml-train-automl` package version.

- Install the latest `azureml-train-automl` package to your local environment.
- For details on the latest `azureml-train-automl` package, see the release notes.
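For example, a typical way to install or upgrade the package with pip (adjust for your own environment):

```
pip install --upgrade azureml-train-automl
```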
Set default data splits and cross-validation in machine learning
To set default data splits and cross-validation in machine learning, use the AutoMLConfig class object to define your experiment and training settings. In the following example, only the required parameters are defined. The `n_cross_validations` and `validation_data` parameters aren't included.
Note
In forecasting scenarios, default data splits and cross-validation aren't supported.
data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"
dataset = Dataset.Tabular.from_delimited_files(data)
automl_config = AutoMLConfig(compute_target = aml_remote_compute,
task = 'classification',
primary_metric = 'AUC_weighted',
training_data = dataset,
label_column_name = 'Class'
)
If you don't explicitly specify a `validation_data` or `n_cross_validations` parameter, automated ML applies default techniques depending on the number of rows provided in the single dataset `training_data`.
| Training data size | Validation technique |
|---|---|
| **Larger than 20,000 rows** | Train/validation data split is applied. The default is to take 10% of the initial training data set as the validation set. In turn, that validation set is used for metrics calculation. |
| **Smaller than 20,000 rows** | Cross-validation approach is applied. The default number of folds depends on the number of rows. If the dataset has fewer than 1,000 rows, 10 folds are used. If the rows are between 1,000 and 20,000, three folds are used. |
Provide validation dataset
You have two options for providing validation data. You can start with a single data file and split it into training and validation sets, or you can provide a separate data file for the validation set. Either way, the `validation_data` parameter in your `AutoMLConfig` object assigns which data to use as your validation set. This parameter accepts only data sets in the form of an Azure Machine Learning dataset or pandas dataframe.
Here are some other considerations for working with validation parameters:
- You can set only one validation parameter: either the `validation_data` parameter or the `n_cross_validations` parameter, but not both.
- When you use the `validation_data` parameter, you must also specify the `training_data` and `label_column_name` parameters.
The following example explicitly defines which portion of the `dataset` to use for training (`training_data`) and for validation (`validation_data`):
data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"
dataset = Dataset.Tabular.from_delimited_files(data)
training_data, validation_data = dataset.random_split(percentage=0.8, seed=1)
automl_config = AutoMLConfig(compute_target = aml_remote_compute,
task = 'classification',
primary_metric = 'AUC_weighted',
training_data = training_data,
validation_data = validation_data,
label_column_name = 'Class'
)
Provide validation dataset size
When you provide the validation set size, you supply only a single dataset for the experiment. The `validation_data` parameter isn't specified, and the provided dataset is assigned to the `training_data` parameter.
In your `AutoMLConfig` object, you can set the `validation_size` parameter to hold out a portion of the training data for validation. For this strategy, the automated ML job splits the validation set from the initial `training_data` that you supply. The value should be between 0.0 and 1.0 noninclusive (for example, 0.2 means 20% of the data is held out for validation).
Note
In forecasting scenarios, the `validation_size` parameter isn't supported.
The following example supplies a single `dataset` for the experiment. The `training_data` parameter accesses the full dataset, and 20% of the dataset is allocated for validation (`validation_size = 0.2`):
data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"
dataset = Dataset.Tabular.from_delimited_files(data)
automl_config = AutoMLConfig(compute_target = aml_remote_compute,
task = 'classification',
primary_metric = 'AUC_weighted',
training_data = dataset,
validation_size = 0.2,
label_column_name = 'Class'
)
Perform k-fold cross-validation
To perform k-fold cross-validation, you include the `n_cross_validations` parameter and define the number of folds. This parameter sets how many cross-validations to perform, based on the same number of folds.
Note
In classification scenarios that use deep neural networks (DNN), the `n_cross_validations` parameter isn't supported.
For forecasting scenarios, see how cross validation is applied in Set up AutoML to train a time-series forecasting model.
The following example defines five folds for cross-validation. The process runs five different trainings, where each training uses 4/5 of the data. Each validation uses 1/5 of the data with a different holdout fold each time. As a result, metrics are calculated as the average of the five validation metrics.
data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"
dataset = Dataset.Tabular.from_delimited_files(data)
automl_config = AutoMLConfig(compute_target = aml_remote_compute,
task = 'classification',
primary_metric = 'AUC_weighted',
training_data = dataset,
n_cross_validations = 5
label_column_name = 'Class'
)
Perform Monte Carlo cross-validation
To perform Monte Carlo cross-validation, you include both the `validation_size` and `n_cross_validations` parameters in your `AutoMLConfig` object.
For Monte Carlo cross-validation, automated ML sets aside the portion of the training data specified by the `validation_size` parameter for validation, and then assigns the rest of the data for training. This process is then repeated based on the value specified in the `n_cross_validations` parameter, which generates new training and validation splits, at random, each time.
Note
In forecasting scenarios, Monte Carlo cross-validation isn't supported.
The following example defines seven folds for cross-validation and 20% of the training data for validation. The process runs seven different trainings, where each training uses 80% of the data. Each validation uses 20% of the data with a different holdout fold each time.
data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"
dataset = Dataset.Tabular.from_delimited_files(data)
automl_config = AutoMLConfig(compute_target = aml_remote_compute,
task = 'classification',
primary_metric = 'AUC_weighted',
training_data = dataset,
n_cross_validations = 7
validation_size = 0.2,
label_column_name = 'Class'
)
Specify custom cross-validation data folds
You can also provide your own cross-validation (CV) data folds. This approach is considered a more advanced scenario because you specify which columns to split and use for validation. Include custom CV split columns in your training data, and specify which columns to use by populating the column names in the `cv_split_column_names` parameter. Each column represents one cross-validation split and has an integer value of 1 or 0. A value of 1 indicates the row should be used for training. A value of 0 indicates the row should be used for validation.
Note
In forecasting scenarios, the `cv_split_column_names` parameter isn't supported.
The following example contains bank marketing data with two CV split columns, `cv1` and `cv2`:
data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/bankmarketing_with_cv.csv"
dataset = Dataset.Tabular.from_delimited_files(data)
automl_config = AutoMLConfig(compute_target = aml_remote_compute,
task = 'classification',
primary_metric = 'AUC_weighted',
training_data = dataset,
label_column_name = 'y',
cv_split_column_names = ['cv1', 'cv2']
)
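If your data doesn't already include split columns, you can construct them before creating the dataset. The following minimal sketch shows one way to do so; the local file name, the use of pandas, and the 80/20 split per fold are all illustrative assumptions, not part of the sample data:

```python
import numpy as np
import pandas as pd

# Hypothetical illustration: build two custom CV split columns locally.
# A value of 1 marks a row for training; 0 marks it for validation.
df = pd.read_csv("bankmarketing.csv")  # assumed local copy of the training data
rng = np.random.default_rng(seed=1)

for col in ["cv1", "cv2"]:
    # Randomly assign about 80% of rows to training and 20% to validation
    df[col] = (rng.random(len(df)) < 0.8).astype(int)

df.to_csv("bankmarketing_with_cv.csv", index=False)
```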
Note
To use `cv_split_column_names` with `training_data` and `label_column_name`, upgrade your Azure Machine Learning Python SDK to version 1.6.0 or later. For previous SDK versions, refer to `cv_splits_indices`, but note that it's used with `X` and `y` dataset input only (see the sketch after this note).
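For reference, `cv_splits_indices` expects one entry per fold, where each entry holds two numpy arrays: the row indices to train on, and the row indices to validate on. A minimal sketch of that shape, assuming `X` (features) and `y` (labels) are numpy arrays you already hold; the 80/20 split values are illustrative:

```python
import numpy as np

n_rows = len(y)
indices = np.arange(n_rows)

# Each fold is a pair of index arrays: [training indices, validation indices].
cv_splits_indices = [
    [indices[:int(n_rows * 0.8)], indices[int(n_rows * 0.8):]],  # fold 1
    [indices[int(n_rows * 0.2):], indices[:int(n_rows * 0.2)]],  # fold 2
]
```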
Set metric calculation for cross validation
When either k-fold or Monte Carlo cross validation is used, metrics are computed on each validation fold and then aggregated. The aggregation operation is an average for scalar metrics and a sum for charts. Metrics computed during cross validation are based on all folds and therefore all samples from the training set. For more information, see Evaluate automated ML experiment results.
When either a custom validation set or an automatically selected validation set is used, model evaluation metrics are computed from only that validation set, not the training data.
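One way to inspect these metrics after an experiment completes is to read them from the best child run. A minimal sketch, assuming an existing `Experiment` object named `experiment` and the `automl_config` from the earlier examples:

```python
from azureml.core import Experiment

# Submit the experiment and wait for it to finish.
remote_run = experiment.submit(automl_config, show_output=True)
remote_run.wait_for_completion()

# Retrieve the best child run and its fitted model.
best_run, fitted_model = remote_run.get_output()

# For cross-validation, scalar metrics here are the average across all folds.
metrics = best_run.get_metrics()
print(metrics.get('AUC_weighted'))
```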
Provide test dataset (preview)
Important
This feature is currently in public preview. This preview version is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Azure Previews.
You can also provide test data to evaluate the recommended model that automated ML generates for you upon completion of the experiment. When you provide test data, the data is considered to be separate from training and validation to prevent any bias effect on the results of the test run of the recommended model. For more information, see Training, validation, and test data.
Warning
The test dataset feature isn't available for the following automated ML scenarios:
Test datasets must be in the form of an Azure Machine Learning TabularDataset. You can specify a test dataset with the `test_data` and `test_size` parameters in your `AutoMLConfig` object. These parameters are mutually exclusive and can't be specified at the same time or with the `cv_split_column_names` or `cv_splits_indices` parameters.
In your `AutoMLConfig` object, use the `test_data` parameter to specify an existing dataset:
```python
automl_config = AutoMLConfig(task='forecasting',
                             ...
                             # Provide an existing test dataset
                             test_data=test_dataset,
                             ...
                             forecasting_parameters=forecasting_parameters)
```
To use a train/test split instead of providing test data directly, use the `test_size` parameter when you create the `AutoMLConfig` object. This parameter must be a floating point value between 0.0 and 1.0 exclusive, and it specifies the percentage of the training dataset to use for the test dataset.
```python
automl_config = AutoMLConfig(task = 'regression',
                             ...
                             # Specify train/test split
                             training_data=training_data,
                             test_size=0.2)
```
Here are some other considerations for working with a test dataset:
- For regression tasks, random sampling is used.
- For classification tasks, stratified sampling is used, but random sampling is used as a fallback when stratified sampling isn't feasible.
Note
In forecasting scenarios, you can't currently specify a test dataset by using a train/test split with the `test_size` parameter.
Passing the `test_data` or `test_size` parameter into the `AutoMLConfig` object automatically triggers a remote test run upon completion of your experiment. This test run uses the provided test data to evaluate the best model that automated ML recommends. For more information, see Get test job results.
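As a sketch of how you might retrieve those results afterward, assuming `remote_run` is the completed parent AutoML run (the `automl.model_test` child-run type string reflects how the preview surfaces the test run; treat it as an assumption if your SDK version differs):

```python
# Retrieve the best child run from the completed experiment.
best_run, fitted_model = remote_run.get_output()

# The test run is created as a child of the best run.
test_run = next(run for run in best_run.get_children()
                if run.type == 'automl.model_test')
test_run.wait_for_completion(show_output=False, wait_post_processing=True)

# Metrics calculated on the test dataset
test_run_metrics = test_run.get_metrics()
for name, value in test_run_metrics.items():
    print(f"{name}: {value}")
```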