Set up AutoML to train a natural language processing model

2024-09-13

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article, you learn how to train natural language processing (NLP) models with automated ML in Azure Machine Learning. You can create NLP models with automated ML via the Azure Machine Learning Python SDK v2 or the Azure Machine Learning CLI v2.

Automated ML supports NLP which allows ML professionals and data scientists to bring their own text data and build custom models for NLP tasks. NLP tasks include multi-class text classification, multi-label text classification, and named entity recognition (NER).

You can seamlessly integrate with the Azure Machine Learning data labeling capability to label your text data or bring your existing labeled data. Automated ML provides the option to use distributed training on multi-GPU compute clusters for faster model training. The resulting model can be operationalized at scale using Azure Machine Learning's MLOps capabilities.

Prerequisites

APPLIES TO: Azure CLI ml extension v2 (current)

Azure subscription. If you don't have an Azure subscription, sign up to try the trial subscription today.
An Azure Machine Learning workspace with a GPU training compute. To create the workspace, see Create workspace resources. For details about the GPU instances that Azure provides, see GPU optimized virtual machine sizes.

Warning

Support for multilingual models and for models with a longer max sequence length is necessary for several NLP use cases, such as non-English datasets and longer range documents. As a result, these scenarios may require higher GPU memory for model training to succeed, such as the NC_v3 series or the ND series.
The Azure Machine Learning CLI v2 installed. For guidance on installing and updating to the latest version, see Install and set up the CLI (v2).
This article assumes some familiarity with setting up an automated machine learning experiment. Follow the how-to to see the main automated machine learning experiment design patterns.

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Azure subscription. If you don't have an Azure subscription, sign up to try the trial subscription today.
An Azure Machine Learning workspace with a GPU training compute. To create the workspace, see Create workspace resources. For details about the GPU instances that Azure provides, see GPU optimized virtual machine sizes.

Warning

Support for multilingual models and for models with a longer max sequence length is necessary for several NLP use cases, such as non-English datasets and longer range documents. As a result, these scenarios may require higher GPU memory for model training to succeed, such as the NC_v3 series or the ND series.
The Azure Machine Learning Python SDK v2 installed.

To install the SDK, you can either:
- Create a compute instance, which automatically installs the SDK and is pre-configured for ML workflows. See Create an Azure Machine Learning compute instance for more information.
- Install the automl package yourself, which includes the default installation of the SDK.
Important

The Python commands in this article require the latest azureml-train-automl package version.
- Install the latest azureml-train-automl package to your local environment.
- For details on the latest azureml-train-automl package, see the release notes.
This article assumes some familiarity with setting up an automated machine learning experiment. Follow the how-to to see the main automated machine learning experiment design patterns.

Select your NLP task

Determine what NLP task you want to accomplish. Currently, automated ML supports the following deep neural network NLP tasks.

Task	AutoML job syntax	Description
Multi-class text classification	CLI v2: `text_classification` SDK v2: `text_classification()`	There are multiple possible classes and each sample can be classified as exactly one class. The task is to predict the correct class for each sample. For example, classifying a movie script as "Comedy" or "Romantic".
Multi-label text classification	CLI v2: `text_classification_multilabel` SDK v2: `text_classification_multilabel()`	There are multiple possible classes and each sample can be assigned any number of classes. The task is to predict all the classes for each sample. For example, classifying a movie script as "Comedy", "Romantic", or "Comedy and Romantic".
Named Entity Recognition (NER)	CLI v2: `text_ner` SDK v2: `text_ner()`	There are multiple possible tags for tokens in sequences. The task is to predict the tags for all the tokens for each sequence. For example, extracting domain-specific entities from unstructured text, such as contracts or financial documents.

Thresholding

Thresholding is the multi-label feature that allows users to pick the threshold above which the predicted probabilities lead to a positive label. Lower values allow for more labels, which is better when users care more about recall, but this option could lead to more false positives. Higher values allow fewer labels and are hence better for users who care about precision, but this option could lead to more false negatives.
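The effect of the threshold can be illustrated with a minimal sketch. The function name, the sample probabilities, and the choice of `>=` for the comparison are assumptions made for illustration; the article doesn't specify how automated ML applies the comparison internally.

```python
def labels_above_threshold(probabilities, threshold=0.5):
    """Return the labels whose predicted probability meets the threshold."""
    return sorted(label for label, p in probabilities.items() if p >= threshold)


# Hypothetical predicted probabilities for one movie script.
probs = {"Comedy": 0.81, "Romantic": 0.42, "Horror": 0.07}

# A lower threshold admits more labels (favors recall).
print(labels_above_threshold(probs, threshold=0.3))
# A higher threshold admits fewer labels (favors precision).
print(labels_above_threshold(probs, threshold=0.6))
```

With these sample values, the lower threshold keeps both "Comedy" and "Romantic", while the higher threshold keeps only "Comedy".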

Preparing data

For NLP experiments in automated ML, you can bring your data in .csv format for multi-class and multi-label classification tasks. For NER tasks, two-column .txt files that use a space as the separator and adhere to the CoNLL format are supported. The following sections provide details for the data format accepted for each task.

Multi-class

For multi-class classification, the dataset can contain several text columns and exactly one label column. The following example has only one text column.

text,labels
"I love watching Chicago Bulls games.","NBA"
"Tom Brady is a great player.","NFL"
"There is a game between Yankees and Orioles tonight","MLB"
"Stephen Curry made the most number of 3-Pointers","NBA"

Multi-label

For multi-label classification, the dataset columns are the same as for multi-class; however, there are special format requirements for data in the label column. The two accepted formats and examples are in the following table.

Label column format options	Multiple labels	One label	No labels
Plain text	`"label1, label2, label3"`	`"label1"`	`""`
Python list with quotes	`"['label1','label2','label3']"`	`"['label1']"`	`"[]"`

Important

Different parsers are used to read labels for these formats. If you are using the plain text format, use only alphabetical characters, numerical characters, and '_' in your labels. All other characters are recognized as label separators.

For example, if your label is "cs.AI", it's read as "cs" and "AI". With the Python list format, the label would be "['cs.AI']", which is read as "cs.AI".
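The difference between the two label formats can be sketched as follows. These two functions are hypothetical stand-ins, not the parsers that automated ML actually uses, and the regular expression assumes ASCII letters and digits only.

```python
import ast
import re


def parse_plain_text_labels(cell):
    """Split on any run of characters other than letters, digits, and '_'."""
    return [part for part in re.split(r"[^A-Za-z0-9_]+", cell) if part]


def parse_python_list_labels(cell):
    """Read the cell as a Python list literal, keeping each label intact."""
    return list(ast.literal_eval(cell))


print(parse_plain_text_labels("cs.AI"))       # the '.' acts as a separator
print(parse_python_list_labels("['cs.AI']"))  # the label stays whole
print(parse_plain_text_labels(""))            # no labels
print(parse_python_list_labels("[]"))         # no labels
```

Under these assumptions, a label that contains punctuation survives only in the Python list format, which is why that format is the safer choice for labels such as "cs.AI".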

Example data for multi-label in plain text format.

text,labels
"I love watching Chicago Bulls games.","basketball"
"The four most popular leagues are NFL, MLB, NBA and NHL","football,baseball,basketball,hockey"
"I like drinking beer.",""

Example data for multi-label in Python list with quotes format.

text,labels
"I love watching Chicago Bulls games.","['basketball']"
"The four most popular leagues are NFL, MLB, NBA and NHL","['football','baseball','basketball','hockey']"
"I like drinking beer.","[]"

Named entity recognition (NER)

Unlike multi-class or multi-label, which take .csv format datasets, named entity recognition requires CoNLL format. The file must contain exactly two columns, and in each row, the token and the label are separated by a single space.

For example,

Hudson B-loc
Square I-loc
is O
a O
famous O
place O
in O
New B-loc
York I-loc
City I-loc

Stephen B-per
Curry I-per
got O
three O
championship O
rings O
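To make the layout concrete, here is a minimal sketch of reading this two-column format into samples. The `parse_conll` helper is hypothetical and isn't part of the Azure Machine Learning SDK; it assumes samples are separated by empty lines, as in the example above.

```python
def parse_conll(text):
    """Group 'token label' lines into samples separated by empty lines."""
    samples, current = [], []
    for line in text.split("\n"):
        if line == "":
            # An empty line closes the current sample, if there is one.
            if current:
                samples.append(current)
                current = []
            continue
        # Raises ValueError unless the line has exactly one space separator.
        token, label = line.split(" ")
        current.append((token, label))
    if current:
        samples.append(current)
    return samples


example = "Stephen B-per\nCurry I-per\ngot O\n\nHudson B-loc\nSquare I-loc\n"
for sample in parse_conll(example):
    print(sample)
```

Each sample comes back as a list of (token, label) pairs, which is a convenient shape for inspecting the data before you submit it.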

Data validation

Before a model trains, automated ML applies data validation checks on the input data to ensure that the data can be preprocessed correctly. If any of these checks fail, the run fails with the relevant error message. The following are the requirements to pass data validation checks for each task.

Note

Some data validation checks are applicable to both the training and the validation set, whereas others are applicable only to the training set. If the test dataset could not pass the data validation, that means that automated ML couldn't capture it and there is a possibility of model inference failure, or a decline in model performance.

Task	Data validation check
All tasks	At least 50 training samples are required
Multi-class and Multi-label	The training data and validation data must have - The same set of columns - The same order of columns from left to right - The same data type for columns with the same name - At least two unique labels - Unique column names within each dataset (For example, the training set can't have multiple columns named Age)
Multi-class only	None
Multi-label only	- The label column format must be in accepted format - At least one sample should have 0 or 2+ labels, otherwise it should be a `multiclass` task - All labels should be in `str` or `int` format, with no overlapping. You shouldn't have both label `1` and label `'1'`
NER only	- The file shouldn't start with an empty line - Each line must be an empty line, or follow format `{token} {label}`, where there's exactly one space between the token and the label and no white space after the label - All labels must start with `I-`, `B-`, or be exactly `O`. Case sensitive - Exactly one empty line between two samples - Exactly one empty line at the end of the file
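Some of the classification checks in the table can be approximated locally before you submit a job. The following is a rough sketch only: the function name and messages are hypothetical, it covers a subset of the checks, it counts unique labels by whole cell value (so it suits multi-class data), and it doesn't reproduce the validation that automated ML performs.

```python
import csv
import io

MIN_TRAINING_SAMPLES = 50  # from the "All tasks" row of the table


def check_classification_data(train_csv, valid_csv, label_column):
    """Return a list of problems found; an empty list means the checks passed."""
    train = list(csv.reader(io.StringIO(train_csv)))
    valid = list(csv.reader(io.StringIO(valid_csv)))
    train_header, train_rows = train[0], train[1:]
    valid_header = valid[0]
    problems = []
    if len(train_rows) < MIN_TRAINING_SAMPLES:
        problems.append("fewer than 50 training samples")
    if len(set(train_header)) != len(train_header):
        problems.append("duplicate column names in training data")
    if train_header != valid_header:
        problems.append("training and validation columns differ in name or order")
    if label_column in train_header:
        idx = train_header.index(label_column)
        if len({row[idx] for row in train_rows}) < 2:
            problems.append("fewer than two unique labels")
    else:
        problems.append("label column not found")
    return problems


header = "text,labels\n"
rows = "".join(f"sample {i},{'NBA' if i % 2 else 'NFL'}\n" for i in range(50))
print(check_classification_data(header + rows, header + rows, "labels"))
```

A local check like this only catches obvious problems early; the job can still fail validation for reasons the sketch doesn't cover, such as mismatched column data types.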

Configure experiment

Automated ML's NLP capability is triggered through task-specific automl type jobs, which is the same workflow used for submitting automated ML experiments for classification, regression, and forecasting tasks. You set parameters as you would for those experiments, such as experiment_name, compute_name, and data inputs.

However, there are key differences:

- You can ignore primary_metric, as it's only for reporting purposes. Currently, automated ML trains only one model per run for NLP, and there is no model selection.
- The label_column_name parameter is only required for multi-class and multi-label text classification tasks.
- If more than 10% of the samples in your dataset contain more than 128 tokens, the dataset is considered long range.
  - To use the long range text feature, use an NC6 or higher SKU for GPU, such as the NCv3 series.
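As a rough way to estimate whether a dataset falls under the long range rule, you could count tokens per sample. This sketch assumes whitespace tokenization, which is only an approximation: the pre-trained models use their own tokenizers, which typically produce more tokens than a whitespace split. The function name is hypothetical.

```python
def is_long_range(texts, token_limit=128, sample_fraction=0.10):
    """True if more than `sample_fraction` of texts exceed `token_limit` tokens."""
    if not texts:
        return False
    long_count = sum(1 for text in texts if len(text.split()) > token_limit)
    return long_count / len(texts) > sample_fraction


# Hypothetical dataset: 2 long documents out of 10 samples (20%).
texts = ["word " * 200] * 2 + ["a short sentence"] * 8
print(is_long_range(texts))
```

Because whitespace counts tend to underestimate model token counts, treat a result near the boundary as a reason to plan for the larger GPU SKU.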

APPLIES TO: Azure CLI ml extension v2 (current)

For CLI v2 automated ML jobs, you configure your experiment in a YAML file like the following.

APPLIES TO: Python SDK azure-ai-ml v2 (current)

For Automated ML jobs via the SDK, you configure the job with the specific NLP task function. The following example demonstrates the configuration for text_classification.

from azure.ai.ml import automl

# general job parameters
compute_name = "gpu-cluster"
exp_name = "dpv2-nlp-text-classification-experiment"

# Create the AutoML job with the related factory-function.
# my_training_data_input and my_validation_data_input are data inputs defined elsewhere.
text_classification_job = automl.text_classification(
    compute=compute_name,
    # name="dpv2-nlp-text-classification-multiclass-job-01",
    experiment_name=exp_name,
    training_data=my_training_data_input,
    validation_data=my_validation_data_input,
    target_column_name="Sentiment",
    primary_metric="accuracy",
    tags={"my_custom_tag": "My custom value"},
)

# Set the experiment timeout in minutes.
text_classification_job.set_limits(timeout_minutes=120)

Language settings

As part of the NLP functionality, automated ML supports 104 languages by using language-specific and multilingual pre-trained text DNN models, such as the BERT family of models. Currently, language selection defaults to English.

The following table summarizes what model is applied based on task type and language. See the full list of supported languages and their codes.

Task type	Syntax for `dataset_language`	Text model algorithm
Multi-label text classification	`"eng"`, `"deu"`, `"mul"`	English BERT uncased; German BERT; Multilingual BERT. For all other languages, automated ML applies multilingual BERT.
Multi-class text classification	`"eng"`, `"deu"`, `"mul"`	English BERT cased; Multilingual BERT. For all other languages, automated ML applies multilingual BERT.
Named entity recognition (NER)	`"eng"`, `"deu"`, `"mul"`	English BERT cased; German BERT; Multilingual BERT. For all other languages, automated ML applies multilingual BERT.

APPLIES TO: Azure CLI ml extension v2 (current)

You can specify your dataset language in the featurization section of your configuration YAML file. BERT is also used in the featurization process of automated ML experiment training. To learn more, see BERT integration and featurization in automated ML (SDK v1).

featurization:
   dataset_language: "eng"

APPLIES TO: Python SDK azure-ai-ml v2 (current)

You can specify your dataset language with the set_featurization() method. BERT is also used in the featurization process of automated ML experiment training. To learn more, see BERT integration and featurization in automated ML (SDK v1).

text_classification_job.set_featurization(dataset_language='eng')

Distributed training

You can also run your NLP experiments with distributed training on an Azure Machine Learning compute cluster.

APPLIES TO: Azure CLI ml extension v2 (current)

APPLIES TO: Python SDK azure-ai-ml v2 (current)

This is handled automatically by automated ML when the parameters max_concurrent_iterations = number_of_vms and enable_distributed_dnn_training = True are provided in your AutoMLConfig during experiment setup. Doing so schedules distributed training of the NLP models and automatically scales to every GPU on your virtual machine or cluster of virtual machines. The maximum number of virtual machines allowed is 32. The training is scheduled with a number of virtual machines that is a power of two.

max_concurrent_iterations = number_of_vms
enable_distributed_dnn_training = True

In AutoML NLP, only hold-out validation is supported, and it requires a validation dataset.

Submit the AutoML job

APPLIES TO: Azure CLI ml extension v2 (current)

To submit your AutoML job, you can run the following CLI v2 command with the path to your .yml file, workspace name, resource group and subscription ID.


az ml job create --file ./hello-automl-job-basic.yml --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]

APPLIES TO: Python SDK azure-ai-ml v2 (current)

With the MLClient created earlier, you can submit this AutoML job to the workspace.

returned_job = ml_client.jobs.create_or_update(
    text_classification_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")
ml_client.jobs.stream(returned_job.name)

Code examples

APPLIES TO: Azure CLI ml extension v2 (current)

See the following sample YAML files for each NLP task.

Model sweeping and hyperparameter tuning (preview)

Important

This feature is currently in public preview. This preview version is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Azure Previews.

AutoML NLP allows you to provide a list of models and combinations of hyperparameters via the hyperparameter search space in the config. HyperDrive generates several child runs, each of which is a fine-tuning run for a given NLP model and set of hyperparameter values that were chosen and swept over based on the provided search space.

Supported model algorithms

All the pre-trained text DNN models currently available in AutoML NLP for fine-tuning are listed below:

bert-base-cased
bert-large-uncased
bert-base-multilingual-cased
bert-base-german-cased
bert-large-cased
distilbert-base-cased
distilbert-base-uncased
roberta-base
roberta-large
distilroberta-base
xlm-roberta-base
xlm-roberta-large
xlnet-base-cased
xlnet-large-cased

Note that the large models are larger than their base counterparts. They are typically more performant, but they take up more GPU memory and time for training. As such, their SKU requirements are more stringent: we recommend running on ND-series VMs for the best results.

Supported hyperparameters

The following table describes the hyperparameters that AutoML NLP supports.

Parameter name	Description	Syntax
gradient_accumulation_steps	The number of backward operations whose gradients are to be summed up before performing one step of gradient descent by calling the optimizer's step function. This is to use an effective batch size, which is gradient_accumulation_steps times larger than the maximum size that fits the GPU.	Must be a positive integer.
learning_rate	Initial learning rate.	Must be a float in the range (0, 1).
learning_rate_scheduler	Type of learning rate scheduler.	Must choose from `linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup`.
model_name	Name of one of the supported models.	Must choose from `bert_base_cased, bert_base_uncased, bert_base_multilingual_cased, bert_base_german_cased, bert_large_cased, bert_large_uncased, distilbert_base_cased, distilbert_base_uncased, roberta_base, roberta_large, distilroberta_base, xlm_roberta_base, xlm_roberta_large, xlnet_base_cased, xlnet_large_cased`.
number_of_epochs	Number of training epochs.	Must be a positive integer.
training_batch_size	Training batch size.	Must be a positive integer.
validation_batch_size	Validation batch size.	Must be a positive integer.
warmup_ratio	Ratio of total training steps used for a linear warmup from 0 to learning_rate.	Must be a float in the range [0, 1].
weight_decay	Value of weight decay when optimizer is sgd, adam, or adamw.	Must be a float in the range [0, 1].

All discrete hyperparameters only allow choice distributions, such as the integer-typed training_batch_size and the string-typed model_name hyperparameters. All continuous hyperparameters like learning_rate support all distributions.
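The constraints in the table can be expressed as a small pre-submission check. This is a minimal sketch with a hypothetical function name; it mirrors only the ranges listed in the table and isn't the validation that the service itself runs.

```python
POSITIVE_INTEGERS = (
    "gradient_accumulation_steps",
    "number_of_epochs",
    "training_batch_size",
    "validation_batch_size",
)
SCHEDULERS = {
    "linear", "cosine", "cosine_with_restarts",
    "polynomial", "constant", "constant_with_warmup",
}


def validate_hyperparameters(params):
    """Return error messages for values outside the ranges in the table."""
    errors = []
    for name in POSITIVE_INTEGERS:
        if name in params:
            value = params[name]
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                errors.append(f"{name} must be a positive integer")
    if "learning_rate" in params and not 0 < params["learning_rate"] < 1:
        errors.append("learning_rate must be in the open range (0, 1)")
    for name in ("warmup_ratio", "weight_decay"):
        if name in params and not 0 <= params[name] <= 1:
            errors.append(f"{name} must be in the closed range [0, 1]")
    scheduler = params.get("learning_rate_scheduler")
    if scheduler is not None and scheduler not in SCHEDULERS:
        errors.append("learning_rate_scheduler is not a supported value")
    return errors


print(validate_hyperparameters({"learning_rate": 5e-5, "number_of_epochs": 3}))
print(validate_hyperparameters({"learning_rate": 1.0, "training_batch_size": 0}))
```

Note the difference between the open range for learning_rate, which excludes 0 and 1, and the closed ranges for warmup_ratio and weight_decay, which include both endpoints.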

Configure your sweep settings

You can configure all the sweep-related parameters. Multiple model subspaces can be constructed with hyperparameters conditional to the respective model, as seen in each hyperparameter tuning example.

The same discrete and continuous distribution options that are available for general HyperDrive jobs are supported here. See all nine options in Hyperparameter tuning a model.

APPLIES TO: Azure CLI ml extension v2 (current)

limits: 
  timeout_minutes: 120  
  max_trials: 4 
  max_concurrent_trials: 2 

sweep: 
  sampling_algorithm: grid 
  early_termination: 
    type: bandit 
    evaluation_interval: 10 
    slack_factor: 0.2 

search_space: 
  - model_name: 
      type: choice 
      values: [bert_base_cased, roberta_base] 
    number_of_epochs: 
      type: choice 
      values: [3, 4] 
  - model_name: 
      type: choice 
      values: [distilbert_base_cased] 
    learning_rate: 
      type: uniform 
      min_value: 0.000005 
      max_value: 0.00005

APPLIES TO: Python SDK azure-ai-ml v2 (current)

You can set the limits for your model sweeping job:

text_ner_job.set_limits(
    timeout_minutes=120,
    trial_timeout_minutes=20,
    max_trials=4,
    max_concurrent_trials=2,
    max_nodes=4,
)

You can define a search space with customized settings:

text_ner_job.extend_search_space( 
    [ 
        SearchSpace( 
            model_name=Choice([NlpModels.BERT_BASE_CASED, NlpModels.ROBERTA_BASE]) 
        ), 
        SearchSpace( 
            model_name=Choice([NlpModels.DISTILROBERTA_BASE]), 
            learning_rate_scheduler=Choice([NlpLearningRateScheduler.LINEAR,  
                                            NlpLearningRateScheduler.COSINE]), 
            learning_rate=Uniform(5e-6, 5e-5) 
        ) 
    ] 
)

You can configure the sweep procedure via the sampling algorithm and early termination policy:

text_ner_job.set_sweep( 
    sampling_algorithm="Random", 
    early_termination=BanditPolicy( 
        evaluation_interval=2, slack_factor=0.05, delay_evaluation=6 
    ) 
)

Sampling methods for the sweep

When sweeping hyperparameters, you need to specify the sampling method to use for sweeping over the defined parameter space. Currently, the following sampling methods are supported with the sampling_algorithm parameter:

Sampling type	AutoML Job syntax
Random Sampling	`random`
Grid Sampling	`grid`
Bayesian Sampling	`bayesian`
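Conceptually, grid sampling enumerates every combination of discrete choices, while random sampling draws a value for each hyperparameter independently on every trial. The following sketch illustrates that idea on a small discrete space; it isn't how the service implements sampling, the helper names are hypothetical, and Bayesian sampling is omitted because it depends on the results of earlier trials.

```python
import itertools
import random

# A small discrete search space, using values from the examples in this article.
search_space = {
    "model_name": ["bert_base_cased", "roberta_base"],
    "number_of_epochs": [3, 4],
}


def grid_sample(space):
    """Enumerate every combination of the discrete choices."""
    names = list(space)
    return [dict(zip(names, combo)) for combo in itertools.product(*space.values())]


def random_sample(space, max_trials, seed=0):
    """Draw each hyperparameter independently for every trial."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(max_trials)
    ]


print(len(grid_sample(search_space)))  # 2 models x 2 epoch settings
for trial in random_sample(search_space, max_trials=3):
    print(trial)
```

Grid sampling grows multiplicatively with every added hyperparameter, which is one reason a trial limit matters when the search space is large.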

Experiment budget

You can optionally specify the experiment budget for your AutoML NLP training job by using the timeout_minutes parameter in limits, which is the amount of time in minutes before the experiment terminates. If none is specified, the default experiment timeout is seven days (maximum 60 days).

AutoML NLP also supports trial_timeout_minutes, the maximum amount of time in minutes an individual trial can run before being terminated, and max_nodes, the maximum number of nodes from the backing compute cluster to use for the job. These parameters also belong to the limits section.

APPLIES TO: Azure CLI ml extension v2 (current)

limits: 
  timeout_minutes: 60 
  trial_timeout_minutes: 20 
  max_nodes: 2

Early termination policies

You can automatically end poorly performing runs with an early termination policy. Early termination improves computational efficiency, saving compute resources that would have been otherwise spent on less promising configurations. AutoML NLP supports early termination policies using the early_termination parameter. If no termination policy is specified, all configurations are run to completion.

Learn more about how to configure the early termination policy for your hyperparameter sweep.
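The bandit policy used in this article's examples can be sketched as follows, based on the general Azure Machine Learning description of that policy: for a metric that is maximized, a run is ended when its metric falls below the best run's metric divided by (1 + slack_factor), and the policy is applied at multiples of evaluation_interval that are at least delay_evaluation. The function names are hypothetical, and the service's exact behavior may differ in details.

```python
def policy_applies(interval, evaluation_interval=1, delay_evaluation=0):
    """True if the policy is evaluated at this reporting interval."""
    return interval >= delay_evaluation and interval % evaluation_interval == 0


def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """For a maximized metric, end runs below best_metric / (1 + slack_factor)."""
    return run_metric < best_metric / (1 + slack_factor)


# With slack_factor=0.2 and a best accuracy of 0.80, the cutoff is about 0.667.
print(bandit_should_terminate(0.60, 0.80, slack_factor=0.2))  # below the cutoff
print(bandit_should_terminate(0.70, 0.80, slack_factor=0.2))  # within the slack

# With evaluation_interval=2 and delay_evaluation=6, evaluation starts at interval 6.
print([i for i in range(1, 11) if policy_applies(i, 2, 6)])
```

A smaller slack_factor ends runs more aggressively, which saves compute but risks stopping a configuration that would have improved later.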

Resources for the sweep

You can control the resources spent on your hyperparameter sweep by specifying the max_trials and the max_concurrent_trials for the sweep.

Parameter	Detail
`max_trials`	Parameter for maximum number of configurations to sweep. Must be an integer between 1 and 1000. When exploring just the default hyperparameters for a given model algorithm, set this parameter to 1. The default value is 1.
`max_concurrent_trials`	Maximum number of runs that can run concurrently. If specified, must be an integer between 1 and 100. The default value is 1. NOTE: The number of concurrent runs is gated on the resources available in the specified compute target. Ensure that the compute target has the available resources for the desired concurrency. `max_concurrent_trials` is capped at `max_trials` internally. For example, if user sets `max_concurrent_trials=4`, `max_trials=2`, values would be internally updated as `max_concurrent_trials=2`, `max_trials=2`.
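The capping rule described in the table can be written out as a short sketch. The function is hypothetical and only mirrors the ranges and defaults stated above; the service applies the adjustment internally.

```python
def effective_trial_limits(max_trials=1, max_concurrent_trials=1):
    """Apply the documented ranges and cap concurrency at max_trials."""
    if not 1 <= max_trials <= 1000:
        raise ValueError("max_trials must be between 1 and 1000")
    if not 1 <= max_concurrent_trials <= 100:
        raise ValueError("max_concurrent_trials must be between 1 and 100")
    return {
        "max_trials": max_trials,
        "max_concurrent_trials": min(max_concurrent_trials, max_trials),
    }


# The example from the table: concurrency of 4 is reduced to 2.
print(effective_trial_limits(max_trials=2, max_concurrent_trials=4))
```

The cap exists because there can never be more trials running at once than there are trials in total.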

You can configure all the sweep related parameters as shown in this example.

APPLIES TO: Azure CLI ml extension v2 (current)

limits:
  max_trials: 10
  max_concurrent_trials: 2

sweep:
  sampling_algorithm: random
  early_termination:
    type: bandit
    evaluation_interval: 2
    slack_factor: 0.2
    delay_evaluation: 6

Known Issues

Dealing with low scores, or higher loss values:

For certain datasets, regardless of the NLP task, the scores produced may be very low, sometimes even zero. This score is accompanied by higher loss values implying that the neural network failed to converge. These scores can happen more frequently on certain GPU SKUs.

While such cases are uncommon, they're possible, and the best way to handle them is to use hyperparameter tuning and provide a wider range of values, especially for hyperparameters like learning rates. Until the hyperparameter tuning capability is available in production, we recommend that users experiencing these issues use the NC6 or ND6 compute clusters. These clusters typically have fairly stable training outcomes.
