Manage inputs and outputs for components and pipelines

Article
11/08/2024

Azure Machine Learning pipelines support inputs and outputs at both the component and pipeline levels. This article describes pipeline and component inputs and outputs and how to manage them.

At the component level, the inputs and outputs define the component interface. You can use the output from one component as an input for another component in the same parent pipeline, allowing for data or models to be passed between components. This interconnectivity represents the data flow within the pipeline.

At the pipeline level, you can use inputs and outputs to submit pipeline jobs with varying data inputs or parameters, such as learning_rate. Inputs and outputs are especially useful when you invoke a pipeline via a REST endpoint. You can assign different values to the pipeline input or access the output of different pipeline jobs. For more information, see Create jobs and input data for batch endpoints.

Input and output types

The following types are supported as both inputs and outputs of components or pipelines:

Data types. For more information, see Data types.
- uri_file
- uri_folder
- mltable
Model types.
- mlflow_model
- custom_model

The following primitive types are also supported for inputs only:

Primitive types
- string
- number
- integer
- boolean

Primitive type output isn't supported.

Example inputs and outputs

These examples are from the NYC Taxi Data Regression pipeline in the Azure Machine Learning examples GitHub repository.

The train component has a number input named test_split_ratio.
The prep component has a uri_folder type output. The component source code reads the CSV files from the input folder, processes the files, and writes the processed CSV files to the output folder.
The train component has a mlflow_model type output. The component source code saves the trained model using the mlflow.sklearn.save_model method.

Output serialization

Using data or model outputs serializes the outputs and saves them as files in a storage location. Later steps can access the files during job execution by mounting this storage location or by downloading or uploading the files to the compute file system.

The component source code must serialize the output object, which is usually stored in memory, into files. For example, you could serialize a pandas dataframe into a CSV file. Azure Machine Learning doesn't define any standardized methods for object serialization. You have the flexibility to choose your preferred methods to serialize objects into files. In the downstream component, you can choose how to deserialize and read these files.

Data type input and output paths

For data asset inputs and outputs, you must specify a path parameter that points to the data location. The following table shows the supported data locations for Azure Machine Learning pipeline inputs and outputs, with path parameter examples.

Location	Input	Output	Example
A path on your local computer	✓		`./home/<username>/data/my_data`
A path on a public http/s server	✓		`https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv`
A path on Azure Storage	*		`wasbs://<container_name>@<account_name>.blob.core.chinacloudapi.cn/<path>` or `abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>`
A path on an Azure Machine Learning datastore	✓	✓	`azureml://datastores/<data_store_name>/paths/<path>`
A path to a data asset	✓	✓	`azureml:my_data:<version>`

* Using Azure Storage directly isn't recommended for input, because it might need extra identity configuration to read the data. It's better to use Azure Machine Learning datastore paths, which are supported across various pipeline job types.

Data type input and output modes

For data type inputs and outputs, you can choose from several download, upload, and mount modes to define how the compute target accesses data. The following table shows the supported modes for different types of inputs and outputs.

Type	`upload`	`download`	`ro_mount`	`rw_mount`	`direct`	`eval_download`	`eval_mount`
`uri_folder` input		✓	✓		✓
`uri_file` input		✓	✓		✓
`mltable` input		✓	✓		✓	✓	✓
`uri_folder` output	✓			✓
`uri_file` output	✓			✓
`mltable` output	✓			✓	✓

The ro_mount or rw_mount modes are recommended for most cases. For more information, see Modes.

Inputs and outputs in pipeline graphs

On the pipeline job page in Azure Machine Learning studio, component inputs and outputs appear as small circles called input/output ports. These ports represent the data flow in the pipeline. Pipeline level output is displayed in purple boxes for easy identification.

The following screenshot from the NYC Taxi Data Regression pipeline graph shows many component and pipeline inputs and outputs.

When you hover over an input/output port, the type is displayed.

Screenshot that highlights the port type when hovering over the port.

The pipeline graph doesn't display primitive type inputs. These inputs appear on the Settings tab of the pipeline Job overview panel for pipeline level inputs, or the component panel for component level inputs. To open the component panel, double-click the component in the graph.

When you edit a pipeline in the studio Designer, pipeline inputs and outputs are in the Pipeline interface panel, and component inputs and outputs are in the component panel.

Screenshot highlighting the pipeline interface in Designer.

Promote component inputs/outputs to pipeline level

Promoting a component's input/output to the pipeline level lets you overwrite the component's input/output when you submit a pipeline job. This ability is especially useful for triggering pipelines by using REST endpoints.

The following examples show how to promote component level inputs/outputs to pipeline level inputs/outputs.

The following pipeline promotes three inputs and three outputs to the pipeline level. For example, pipeline_job_training_max_epocs is pipeline level input because it's declared under the inputs section on the root level.

Under train_job in the jobs section, the input named max_epocs is referenced as ${{parent.inputs.pipeline_job_training_max_epocs}}, meaning that the train_job's max_epocs input references the pipeline level pipeline_job_training_max_epocs input. Pipeline output is promoted by using the same schema.

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: 1b_e2e_registered_components
description: E2E dummy train-score-eval pipeline with registered components

inputs:
  pipeline_job_training_max_epocs: 20
  pipeline_job_training_learning_rate: 1.8
  pipeline_job_learning_rate_schedule: 'time-based'

outputs: 
  pipeline_job_trained_model:
    mode: upload
  pipeline_job_scored_data:
    mode: upload
  pipeline_job_evaluation_report:
    mode: upload

settings:
 default_compute: azureml:cpu-cluster

jobs:
  train_job:
    type: command
    component: azureml:my_train@latest
    inputs:
      training_data: 
        type: uri_folder 
        path: ./data      
      max_epocs: ${{parent.inputs.pipeline_job_training_max_epocs}}
      learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate}}
      learning_rate_schedule: ${{parent.inputs.pipeline_job_learning_rate_schedule}}
    outputs:
      model_output: ${{parent.outputs.pipeline_job_trained_model}}
    services:
      my_vscode:
        type: vs_code
      my_jupyter_lab:
        type: jupyter_lab
      my_tensorboard:
        type: tensor_board
        log_dir: "outputs/tblogs"
    #  my_ssh:
    #    type: tensor_board
    #    ssh_public_keys: <paste the entire pub key content>
    #    nodes: all # Use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` are not selected, by default, interactive applications are only enabled on the head node.

  score_job:
    type: command
    component: azureml:my_score@latest
    inputs:
      model_input: ${{parent.jobs.train_job.outputs.model_output}}
      test_data: 
        type: uri_folder 
        path: ./data
    outputs:
      score_output: ${{parent.outputs.pipeline_job_scored_data}}

  evaluate_job:
    type: command
    component: azureml:my_eval@latest
    inputs:
      scoring_result: ${{parent.jobs.score_job.outputs.score_output}}
    outputs:
      eval_output: ${{parent.outputs.pipeline_job_evaluation_report}}

You can find the full example at train-score-eval pipeline with registered components in the Azure Machine Learning examples repository.

The following code example defines the nyc_taxi_data_regression pipeline. The pipeline takes one input, pipeline_job_input, and generates six outputs as defined in the return statement. The pipeline outputs are promoted from the child component using the schema <step_name.outputs.output_name>, for example prepare_sample_data.outputs.prep_data.

You can find the end-to-end notebook at NYC taxi data regression in the Azure Machine Learning examples repository.

# import required libraries
from azure.identity import DefaultAzureCredential

from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component

# set subscription, resource group, and workspace name:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# define the directory that stores the input data 
parent_dir = ""

# load components
prepare_data = load_component(source=parent_dir + "./prep.yml")
transform_data = load_component(source=parent_dir + "./transform.yml")
train_model = load_component(source=parent_dir + "./train.yml")
predict_result = load_component(source=parent_dir + "./predict.yml")
score_data = load_component(source=parent_dir + "./score.yml")

# construct pipeline
@pipeline()
def nyc_taxi_data_regression(pipeline_job_input):
    """NYC taxi data regression example."""
    prepare_sample_data = prepare_data(raw_data=pipeline_job_input)
    transform_sample_data = transform_data(
        clean_data=prepare_sample_data.outputs.prep_data
    )
    train_with_sample_data = train_model(
        training_data=transform_sample_data.outputs.transformed_data
    )
    predict_with_sample_data = predict_result(
        model_input=train_with_sample_data.outputs.model_output,
        test_data=train_with_sample_data.outputs.test_data,
    )
    score_with_sample_data = score_data(
        predictions=predict_with_sample_data.outputs.predictions,
        model=train_with_sample_data.outputs.model_output,
    )
    return {
        "pipeline_job_prepped_data": prepare_sample_data.outputs.prep_data,
        "pipeline_job_transformed_data": transform_sample_data.outputs.transformed_data,
        "pipeline_job_trained_model": train_with_sample_data.outputs.model_output,
        "pipeline_job_test_data": train_with_sample_data.outputs.test_data,
        "pipeline_job_predictions": predict_with_sample_data.outputs.predictions,
        "pipeline_job_score_report": score_with_sample_data.outputs.score_report,
    }
# define pipeline job
pipeline_job = nyc_taxi_data_regression(
    Input(type="uri_folder", path=parent_dir + "./data/")
)
# demo how to change pipeline output settings
pipeline_job.outputs.pipeline_job_prepped_data.mode = "rw_mount"

# set pipeline level compute
pipeline_job.settings.default_compute = "cpu-cluster"
# set pipeline level datastore
pipeline_job.settings.default_datastore = "workspaceblobstore"

Define optional inputs

By default, all inputs are required and must either have a default value or be assigned a value each time you submit a pipeline job. However, you can define an optional input.

Note

Optional outputs aren't supported.

Setting optional inputs can be useful in two scenarios:

If you define an optional data/model type input and don't assign a value to it when you submit the pipeline job, the pipeline component lacks that data dependency. If the component's input port isn't linked to any component or data/model node, the pipeline invokes the component directly instead of waiting for a preceding dependency.
If you set continue_on_step_failure = True for the pipeline but node2 uses required input from node1, node2 doesn't execute if node1 fails. If node1 input is optional, node2 executes even if node1 fails. The following graph demonstrates this scenario.

Azure CLI / Python SDK
Studio UI

The following code example shows how to define optional input. When the input is set as optional = true, you must use $[[]] to embrace the command line inputs, as in the highlighted lines of the example.

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_data_component_cli
display_name: train_data
description: A example train component
tags:
  author: azureml-sdk-team
version: 9
type: command
inputs:
  training_data: 
    type: uri_folder
  max_epocs:
    type: integer
    optional: true
  learning_rate: 
    type: number
    default: 0.01
    optional: true
  learning_rate_schedule: 
    type: string
    default: time-based
    optional: true
outputs:
  model_output:
    type: uri_folder
code: ./train_src
environment: azureml://registries/azureml/environments/sklearn-1.0/labels/latest
command: >-
  python train.py 
  --training_data ${{inputs.training_data}} 
  $[[--max_epocs ${{inputs.max_epocs}}]]
  $[[--learning_rate ${{inputs.learning_rate}}]]
  $[[--learning_rate_schedule ${{inputs.learning_rate_schedule}}]]
  --model_output ${{outputs.model_output}}

Customize output paths

By default, component output is stored in the {default_datastore} you set for the pipeline, azureml://datastores/${{default_datastore}}/paths/${{name}}/${{output_name}}. If not set, the default is the workspace blob storage.

Job {name} is resolved at job execution time, and {output_name} is the name you defined in the component YAML. You can customize where to store the output by defining an output path.

The pipeline.yml file at train-score-eval pipeline with registered components example defines a pipeline that has three pipeline level outputs. You can use the following command to set custom output paths for the pipeline_job_trained_model output.

# define the custom output path using datastore uri
# add relative path to your blob container after "azureml://datastores/<datastore_name>/paths"
output_path="azureml://datastores/{datastore_name}/paths/{relative_path_of_container}"  

# create job and define path using --outputs.<outputname>
az ml job create -f ./pipeline.yml --set outputs.pipeline_job_trained_model.path=$output_path

The following code that demonstrates how to customize output paths is from the Build pipeline with command_component decorated python function notebook.

cluster_name = "cpu-cluster"
custom_path = "azureml://datastores/workspaceblobstore/paths/custom_path/${{name}}/"

# define a pipeline with component
@pipeline(default_compute=cluster_name)
def pipeline_with_python_function_components(input_data, test_data, learning_rate):
    """E2E dummy train-score-eval pipeline with components defined via python function components"""

    # Call component obj as function: apply given inputs & parameters to create a node in pipeline
    train_with_sample_data = train_model(
        training_data=input_data, max_epochs=5, learning_rate=learning_rate
    )
    score_with_sample_data = score_data(
        model_input=train_with_sample_data.outputs.model_output, test_data=test_data
    )
    # example how to change path of output on step level,
    # please note if the output is promoted to pipeline level you need to change path in pipeline job level
    score_with_sample_data.outputs.score_output = Output(
        type="uri_folder", mode="rw_mount", path=custom_path
    )
    eval_with_sample_data = eval_model(
        scoring_result=score_with_sample_data.outputs.score_output
    )

    # Return: pipeline outputs
    return {
        "eval_output": eval_with_sample_data.outputs.eval_output,
        "model_output": train_with_sample_data.outputs.model_output,
    }


pipeline_job = pipeline_with_python_function_components(
    input_data=Input(
        path="wasbs://demo@dprepdata.blob.core.windows.net/Titanic.csv", type="uri_file"
    ),
    test_data=Input(
        path="wasbs://demo@dprepdata.blob.core.windows.net/Titanic.csv", type="uri_file"
    ),
    learning_rate=0.1,
)
# example how to change path of output on pipeline level
pipeline_job.outputs.model_output = Output(
    type="uri_folder", mode="rw_mount", path=custom_path
)

Download outputs

You can download outputs at the pipeline or component level.

Download pipeline level outputs

You can download all the outputs of a job or download a specific output.

# Download all the outputs of the job
az ml job download --all -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>

# Download a specific output
az ml job download --output-name <OUTPUT_PORT_NAME> -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>

First, create and initialize ml_client as a handle to reference your workspace. For more information, see Create a handle to the workspace.

# Set your subscription, resource group and workspace name:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

Download all the outputs of a job or download a specific output.

# Download all the outputs of the job
output = client.jobs.download(name=job.name, download_path=tmp_path, all=True)

# Download specific output
output = client.jobs.download(name=job.name, download_path=tmp_path, output_name=output_port_name)

Download component outputs

To download the outputs of a child component, first list all child jobs of a pipeline job and then use similar code to download the outputs.

# List all child jobs in the job and print job details in table format
az ml job list --parent-job-name <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID> -o table

# Select the desired child job name to download output
az ml job download --all -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>