Tips, recommendations, and features for developing and testing Delta Live Tables pipelines
This article describes patterns you can use to develop and test Delta Live Tables pipelines. Through the pipeline settings, Delta Live Tables allows you to specify configurations to isolate pipelines in developing, testing, and production environments. The recommendations in this article are applicable for both SQL and Python code development.
Use development mode to run pipeline updates
Delta Live Tables provides a UI toggle to control whether your pipeline updates run in development or production mode. This mode controls how pipeline updates are processed, including:
- Development mode does not immediately terminate compute resources after an update succeeds or fails. You can reuse the same compute resources to run multiple updates of the pipeline without waiting for a cluster to start.
- Development mode does not automatically retry on task failure, allowing you to immediately detect and fix logical or syntactic errors in your pipeline.
Databricks recommends using development mode during development and testing and always switching to production mode when deploying to a production environment.
See Development and production modes.
Test pipeline source code without waiting for tables to update
To check for problems with your pipeline source code, such as syntax and analysis errors, during development and testing, you can run a Validate update. Because a Validate
update only verifies the correctness of pipeline source code without running an actual update on any tables, you can more quickly identify and fix issues before running an actual pipeline update.
Specify a target schema during all development lifecycle phases
All datasets in a Delta Live Tables pipeline reference the LIVE
virtual schema, which is not accessible outside the pipeline. If a target schema is specified, the LIVE
virtual schema points to the target schema. To review the results written out to each table during an update, you must specify a target schema.
You must specify a target schema that is unique to your environment. Each table in a given schema can only be updated by a single pipeline.
By creating separate pipelines for development, testing, and production with different targets, you can keep these environments isolated. Using the target schema parameter allows you to remove logic that uses string interpolation or other widgets or parameters to control data sources and targets.
See Publish data from Delta Live Tables to the Hive metastore.
Use Databricks Git folders to manage Delta Live Tables pipelines
Databricks recommends using Git folders during Delta Live Tables pipeline development, testing, and deployment to production. Git folders enables the following:
- Keeping track of how code is changing over time.
- Merging changes that are being made by multiple developers.
- Software development practices such as code reviews.
Databricks recommends configuring a single Git repository for all code related to a pipeline.
Each developer should have their own Databricks Git folder configured for development. During development, the user configures their own pipeline from their Databricks Git folder and tests new logic using development datasets and isolated schema and locations. As development work is completed, the user commits and pushes changes back to their branch in the central Git repository and opens a pull request against the testing or QA branch.
The resulting branch should be checked out in a Databricks Git folder and a pipeline configured using test datasets and a development schema. Assuming logic runs as expected, a pull request or release branch should be prepared to push the changes to production.
While Git folders can be used to synchronize code across environments, pipeline settings need to be kept up to date either manually or using tools like Terraform.
This workflow is similar to using Git folders for CI/CD in all Databricks jobs. See CI/CD techniques with Git and Databricks Git folders (Repos).
Segment source code for ingestion and transformation steps
Databricks recommends isolating queries that ingest data from transformation logic that enriches and validates data. If you organize source code used for ingesting data from development or testing data sources in a directory separate from production data ingestion logic, you can configure pipelines for various environments with datasets specific to those environments. For example, you can use smaller datasets for testing, accelerating development. See Create sample datasets for development and testing.
You can also use parameters to control data sources for development, testing, and production. See Control data sources with parameters.
Because Delta Live Tables pipelines use the LIVE
virtual schema for managing all dataset relationships, by configuring development and testing pipelines with ingestion source code that load sample data, you can substitute sample datasets using production table names to test code. The same transformation logic can be used in all environments.
Create sample datasets for development and testing
Databricks recommends creating development and test datasets to test pipeline logic with both expected data and potential malformed or corrupt records. There are multiple ways to create datasets that can be useful for development and testing, including the following:
- Select a subset of data from a production dataset.
- Use anonymized or artificially generated data for sources containing PII.
- Create test data with well-defined outcomes based on downstream transformation logic.
- Anticipate potential data corruption, malformed records, and upstream data changes by creating records that break data schema expectations.
For example, if you have a notebook that defines a dataset using the following code:
CREATE OR REFRESH STREAMING TABLE input_data AS SELECT * FROM read_files("/production/data", "json")
You could create a sample dataset containing specific records using a query like the following:
CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT "2021/09/04" AS date, 22.4 as sensor_reading UNION ALL
SELECT "2021/09/05" AS date, 21.5 as sensor_reading
The following example demonstrates filtering published data to create a subset of the production data for development or testing:
CREATE OR REFRESH MATERIALIZED VIEW input_data AS SELECT * FROM prod.input_data WHERE date > current_date() - INTERVAL 1 DAY
To use these different datasets, create multiple pipelines with the notebooks implementing the transformation logic. Each pipeline can read data from the LIVE.input_data
dataset but is configured to include the notebook that creates the dataset specific to the environment.
Control data sources with parameters
You can reference parameters set during pipeline configuration from within your source code. These parameters are set as key-value pairs in the Compute > Advanced > Configurations portion of the pipeline settings UI. This pattern allows you to specify different data sources in different configurations of the same pipeline.
For example, you can specify different paths in development, testing, and production configurations for a pipeline using the variable data_source_path
and then reference it using the following code:
SQL
CREATE STREAMING TABLE bronze
AS (
SELECT
*,
_metadata.file_path AS source_file_path
FROM read_files( '${data_source_path}', 'csv',
map("header", "true"))
)
Python
import dlt
from pyspark.sql.functions import col
data_source_path = spark.conf.get("data_source_path")
@dlt.table
def bronze():
return (spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("header", True)
.load(data_source_path )
.select("*", col("_metadata.file_path").alias("source_file_name"))
)
This pattern is especially useful if you need to test how ingestion logic might handle changes to schema or malformed data during initial ingestion. You can use the identical code throughout your entire pipeline in all environments while switching out datasets.