Run an update on a Delta Live Tables pipeline
This article explains what a Delta Live Tables pipeline update is and how to run one.
After you create a pipeline and are ready to run it, you start an update. A pipeline update does the following:
- Starts a cluster with the correct configuration.
- Discovers all the tables and views defined, and checks for any analysis errors such as invalid column names, missing dependencies, and syntax errors.
- Creates or updates tables and views with the most recent data available.
To check a pipeline's source code for problems without waiting for tables to be created or updated, use a Validate update. The Validate feature is useful when developing or testing pipelines because it lets you quickly find and fix errors, such as incorrect table or column names.
To learn how to create a pipeline, see Tutorial: Run your first Delta Live Tables pipeline.
Start a pipeline update
Azure Databricks provides several options to start pipeline updates, including the following:
- In the Delta Live Tables UI, you have the following options:
  - Click the Start button on the pipeline details page.
  - From the pipelines list, click in the Actions column.
- To start an update in a notebook, click Delta Live Tables > Start in the notebook toolbar. See Open or run a Delta Live Tables pipeline from a notebook.
- You can trigger pipelines programmatically using the API or CLI. See Delta Live Tables API guide.
- You can schedule the pipeline as a job using the Delta Live Tables UI or the jobs UI. See Schedule a pipeline.
Note
The default behavior for manually triggered pipeline updates using any of these methods is to refresh all.
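As a rough illustration of the programmatic option, the following sketch builds (but does not send) an HTTP request that starts an update. It assumes the Pipelines REST endpoint `POST /api/2.0/pipelines/{pipeline_id}/updates` and bearer-token authentication; the helper name `build_start_update_request` and the placeholder host, token, and pipeline ID are hypothetical. Check the Delta Live Tables API guide for the authoritative request format.

```python
import json
import urllib.request


def build_start_update_request(host: str, token: str, pipeline_id: str) -> urllib.request.Request:
    """Build (but do not send) a request that starts a pipeline update."""
    # Assumed endpoint shape; verify against the Delta Live Tables API guide.
    url = f"{host.rstrip('/')}/api/2.0/pipelines/{pipeline_id}/updates"
    # An empty JSON body asks for the default update type (refresh all).
    body = json.dumps({}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder values; replace with your workspace URL, token, and pipeline ID.
    req = build_start_update_request(
        "https://adb-0000000000000000.0.azuredatabricks.net",
        "PERSONAL_ACCESS_TOKEN",
        "PIPELINE_ID",
    )
    print(req.get_method(), req.full_url)
    # To actually start the update, send it: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything runs against a real workspace.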
How Delta Live Tables updates tables and views
The tables and views updated, and how those tables and views are updated, depend on the update type:
- Refresh all: All tables are updated to reflect the current state of their input data sources. For streaming tables, new rows are appended to the table.
- Full refresh all: All tables are updated to reflect the current state of their input data sources. For streaming tables, Delta Live Tables attempts to clear all data from each table and then load all data from the streaming source.
- Refresh selection: The behavior of refresh selection is identical to refresh all, but allows you to refresh only selected tables. Selected tables are updated to reflect the current state of their input data sources. For streaming tables, new rows are appended to the table.
- Full refresh selection: The behavior of full refresh selection is identical to full refresh all, but allows you to perform a full refresh of only selected tables. Selected tables are updated to reflect the current state of their input data sources. For streaming tables, Delta Live Tables attempts to clear all data from each table and then load all data from the streaming source.
For existing materialized views, an update has the same behavior as a SQL REFRESH on a materialized view. For new materialized views, the behavior is the same as a SQL CREATE operation.
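When starting updates through the API, the four update types above can be expressed as different request bodies. The sketch below assumes the update request accepts the fields `full_refresh`, `refresh_selection`, and `full_refresh_selection`; the function `update_payload` and its `update_type` labels are hypothetical names chosen for this example.

```python
from typing import Dict, List, Optional


def update_payload(update_type: str, tables: Optional[List[str]] = None) -> Dict[str, object]:
    """Map an update type to an (assumed) request body for starting an update."""
    selected = list(tables or [])
    if update_type == "refresh_all":
        # Streaming tables: new rows are appended.
        return {"full_refresh": False}
    if update_type == "full_refresh_all":
        # Streaming tables: data is cleared, then reloaded from the source.
        return {"full_refresh": True}
    if update_type in ("refresh_selection", "full_refresh_selection"):
        if not selected:
            raise ValueError(f"{update_type} requires at least one table")
        # The label doubles as the assumed field name in the request body.
        return {update_type: selected}
    raise ValueError(f"unknown update type: {update_type}")


if __name__ == "__main__":
    print(update_payload("refresh_all"))
    print(update_payload("full_refresh_selection", ["sales_orders"]))
```

The table name `sales_orders` is only a placeholder.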
Start a pipeline update for selected tables
You may want to reprocess data for only selected tables in your pipeline. For example, during development you might change only a single table and want to reduce testing time, or a pipeline update might fail and you want to refresh only the failed tables.
Note
You can use selective refresh only with triggered pipelines.
To start an update that refreshes selected tables only, on the Pipeline details page:
Click Select tables for refresh. The Select tables for refresh dialog appears.
If you do not see the Select tables for refresh button, make sure the Pipeline details page displays the latest update, and the update is complete. If a DAG is not displayed for the latest update, for example, because the update failed, the Select tables for refresh button is not displayed.
To select the tables to refresh, click on each table. The selected tables are highlighted and labeled. To remove a table from the update, click on the table again.
Click Refresh selection.
Note
The Refresh selection button displays the number of selected tables in parentheses.
To reprocess data that has already been ingested for the selected tables, click next to the Refresh selection button and click Full Refresh selection.
Start a pipeline update for failed tables
If a pipeline update fails because of errors in one or more tables in the pipeline graph, you can start an update of only failed tables and any downstream dependencies.
Note
Excluded tables are not refreshed, even if they depend on a failed table.
To update failed tables, on the Pipeline details page, click Refresh failed tables.
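To make "failed tables and any downstream dependencies" concrete, the following sketch computes that set over a small dependency graph. It is an illustration only, not how Delta Live Tables is implemented: the function `tables_to_refresh`, the graph representation, and the table names are hypothetical, and the sketch assumes excluded tables are simply dropped from the final selection.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def tables_to_refresh(
    downstream: Dict[str, List[str]],
    failed: Iterable[str],
    excluded: Iterable[str] = (),
) -> List[str]:
    """Return failed tables plus everything downstream of them, minus exclusions."""
    excluded_set: Set[str] = set(excluded)
    seen: Set[str] = set()
    queue = deque(failed)
    while queue:
        table = queue.popleft()
        if table in seen:
            continue
        seen.add(table)
        # Follow edges to the tables that read from this one.
        queue.extend(downstream.get(table, []))
    # Assumption: excluded tables are removed even if they depend on a failed table.
    return sorted(seen - excluded_set)


if __name__ == "__main__":
    # Each key maps a table to the tables that depend on it.
    dag = {"raw": ["clean"], "clean": ["report", "audit"], "other": ["summary"]}
    print(tables_to_refresh(dag, failed=["clean"], excluded=["audit"]))
    # prints ['clean', 'report']
```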
To update only selected failed tables:
Click next to the Refresh failed tables button and click Select tables for refresh. The Select tables for refresh dialog appears.
To select the tables to refresh, click on each table. The selected tables are highlighted and labeled. To remove a table from the update, click on the table again.
Click Refresh selection.
Note
The Refresh selection button displays the number of selected tables in parentheses.
To reprocess data that has already been ingested for the selected tables, click next to the Refresh selection button and click Full Refresh selection.
Check a pipeline for errors without waiting for tables to update
Important
The Delta Live Tables Validate update feature is in Public Preview.
To check whether a pipeline's source code is valid without running a full update, use Validate. A Validate update resolves the definitions of datasets and flows defined in the pipeline but does not materialize or publish any datasets. Errors found during validation, such as incorrect table or column names, are reported in the UI.
To run a Validate update, on the pipeline details page click next to Start and click Validate.
After the Validate update completes, the event log shows events related only to the Validate update, and no metrics are displayed in the DAG. If errors are found, details are available in the event log.
You can see results for only the most recent Validate update. If the Validate update was the most recently run update, you can see the results by selecting it in the update history. If another update is run after the Validate update, the results are no longer available in the UI.
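A Validate update can likely also be requested programmatically. The sketch below only builds the JSON body and assumes that the update request (`POST /api/2.0/pipelines/{pipeline_id}/updates`) accepts a `validate_only` field; the helper name `validate_update_body` is hypothetical, and because the feature is in Public Preview the request format may change.

```python
import json


def validate_update_body() -> str:
    """JSON body for a Validate update: resolve definitions, materialize nothing."""
    # Assumed field name; confirm it in the Delta Live Tables API guide.
    return json.dumps({"validate_only": True})


if __name__ == "__main__":
    print(validate_update_body())
```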
Continuous vs. triggered pipeline execution
If the pipeline uses the triggered execution mode, the system stops processing after successfully refreshing all tables or selected tables in the pipeline once, ensuring each table that is part of the update is updated based on the data available when the update started.
If the pipeline uses continuous execution, Delta Live Tables processes new data as it arrives in data sources to keep tables throughout the pipeline fresh.
The execution mode is independent of the type of table being computed. Both materialized views and streaming tables can be updated in either execution mode. To avoid unnecessary processing in continuous execution mode, pipelines automatically monitor dependent Delta tables and perform an update only when the contents of those dependent tables have changed.
Table comparing data pipeline execution modes
The following table highlights the differences between these execution modes:
Key questions | Triggered | Continuous |
---|---|---|
When does the update stop? | Automatically once complete. | Runs continuously until manually stopped. |
What data is processed? | Data available when the update is started. | All data as it arrives at configured sources. |
What data freshness requirements is this best for? | Data updates run every 10 minutes, hourly, or daily. | Data updates desired between every 10 seconds and a few minutes. |
Triggered pipelines can reduce resource consumption and expense since the cluster runs only long enough to execute the pipeline. However, new data won't be processed until the pipeline is triggered. Continuous pipelines require an always-running cluster, which is more expensive but reduces processing latency.
You can configure execution mode with the Pipeline mode option in the settings.
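In the pipeline's JSON settings, the execution mode is, to the best of our understanding, represented by a boolean `continuous` field. The following sketch toggles that field on a copy of a settings dictionary; the helper `with_execution_mode` and the sample pipeline name are hypothetical.

```python
import copy
from typing import Any, Dict


def with_execution_mode(settings: Dict[str, Any], continuous: bool) -> Dict[str, Any]:
    """Return a copy of the pipeline settings with the execution mode set."""
    updated = copy.deepcopy(settings)
    # Assumed field: True means continuous, False means triggered.
    updated["continuous"] = continuous
    return updated


if __name__ == "__main__":
    base = {"name": "example-pipeline", "continuous": False}
    print(with_execution_mode(base, continuous=True))
```

Copying the settings rather than mutating them in place keeps the original available for comparison before you apply the change.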
How to choose pipeline boundaries
A Delta Live Tables pipeline can process updates to a single table, many tables with dependent relationships, many tables without relationships, or multiple independent flows of tables with dependent relationships. This section contains considerations to help determine how to break up your pipelines.
Larger Delta Live Tables pipelines have a number of benefits. These include the following:
- More efficiently use cluster resources.
- Reduce the number of pipelines in your workspace.
- Reduce the complexity of workflow orchestration.
Some common recommendations on how processing pipelines should be split include the following:
- Split functionality at team boundaries. For example, your data team may maintain pipelines to transform data while your data analysts maintain pipelines that analyze the transformed data.
- Split functionality at application-specific boundaries to reduce coupling and facilitate the re-use of common functionality.
Development and production modes
You can optimize pipeline execution by switching between development and production modes. Use the buttons in the Pipelines UI to switch between these two modes. By default, pipelines run in development mode.
When you run your pipeline in development mode, the Delta Live Tables system does the following:
- Reuses a cluster to avoid the overhead of restarts. By default, clusters run for two hours when development mode is enabled. You can change this with the pipelines.clusterShutdown.delay setting. See Configure your compute settings.
- Disables pipeline retries so you can immediately detect and fix errors.
In production mode, the Delta Live Tables system does the following:
- Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
- Retries execution in the event of specific errors, for example, a failure to start a cluster.
Note
Switching between development and production modes only controls cluster and pipeline execution behavior. Storage locations and target schemas in the catalog for publishing tables must be configured as part of pipeline settings and are not affected when switching between modes.
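As a sketch of how these options might look in a pipeline's JSON settings, the example below sets a `development` flag and overrides `pipelines.clusterShutdown.delay` under `configuration`. The `development` and `configuration` field names and the `"60s"` duration format are assumptions to verify against the pipeline settings reference; the helper `development_settings` and the pipeline name are hypothetical.

```python
import json
from typing import Any, Dict


def development_settings(name: str, shutdown_delay: str = "60s") -> Dict[str, Any]:
    """Build a settings fragment for a pipeline running in development mode."""
    return {
        "name": name,
        # Assumed field: True selects development mode, False production mode.
        "development": True,
        "configuration": {
            # Overrides the default two-hour cluster lifetime in development mode.
            "pipelines.clusterShutdown.delay": shutdown_delay,
        },
    }


if __name__ == "__main__":
    print(json.dumps(development_settings("example-pipeline"), indent=2))
```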
Schedule a pipeline
You can start a triggered pipeline manually or run the pipeline on a schedule with an Azure Databricks job. You can create and schedule a job with a single pipeline task directly in the Delta Live Tables UI or add a pipeline task to a multi-task workflow in the jobs UI. See Delta Live Tables pipeline task for jobs.
To create a single-task job and a schedule for the job in the Delta Live Tables UI:
- Click Schedule > Add a schedule. The Schedule button is updated to show the number of existing schedules if the pipeline is included in one or more scheduled jobs, for example, Schedule (5).
- Enter a name for the job in the Job name field.
- Set the Schedule to Scheduled.
- Specify the period, starting time, and time zone.
- Configure one or more email addresses to receive alerts on pipeline start, success, or failure.
- Click Create.
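The same single-task job can probably be created through the Jobs API instead of the UI. The sketch below builds a request body for what we assume is `POST /api/2.1/jobs/create`, using a `pipeline_task`, a Quartz cron `schedule`, and `email_notifications`; verify the field names against the Jobs API reference. The function name, task key, and sample values are hypothetical.

```python
import json
from typing import Any, Dict, List


def scheduled_pipeline_job(
    job_name: str,
    pipeline_id: str,
    quartz_cron: str,
    timezone_id: str,
    emails: List[str],
) -> Dict[str, Any]:
    """Build an (assumed) Jobs API body for a scheduled single-task pipeline job."""
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "run_pipeline",  # hypothetical task key
                "pipeline_task": {"pipeline_id": pipeline_id},
            }
        ],
        "schedule": {
            "quartz_cron_expression": quartz_cron,
            "timezone_id": timezone_id,
            "pause_status": "UNPAUSED",
        },
        # Alerts on pipeline start, success, or failure.
        "email_notifications": {
            "on_start": list(emails),
            "on_success": list(emails),
            "on_failure": list(emails),
        },
    }


if __name__ == "__main__":
    body = scheduled_pipeline_job(
        "nightly-pipeline",
        "PIPELINE_ID",
        "0 0 6 * * ?",  # Quartz syntax: every day at 06:00
        "UTC",
        ["data-team@example.com"],
    )
    print(json.dumps(body, indent=2))
```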