Develop a Python wheel file using Databricks Asset Bundles
This article describes how to build, deploy, and run a Python wheel file as part of a Databricks Asset Bundle project. See What are Databricks Asset Bundles?
Requirements
- Databricks CLI version 0.218.0 or above. To check your installed version of the Databricks CLI, run the command databricks -v. To install the Databricks CLI, see Install or update the Databricks CLI.
- The remote workspace must have workspace files enabled. See What are workspace files?
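As a minimal sketch of the version requirement above, the following Python helper compares a version string against the 0.218.0 minimum. It assumes that the version text looks something like Databricks CLI v0.218.0; the helper names parse_version and meets_minimum are hypothetical and are not part of the Databricks CLI.

```python
import re

MINIMUM_CLI_VERSION = (0, 218, 0)  # taken from the requirement above


def parse_version(text):
    """Extract the first MAJOR.MINOR.PATCH triple found in text."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text, minimum=MINIMUM_CLI_VERSION):
    """Return True if the version in text is at least the minimum."""
    return parse_version(text) >= minimum


if __name__ == "__main__":
    # Assumed sample text; paste the real output of `databricks -v` here.
    sample = "Databricks CLI v0.218.0"
    print(meets_minimum(sample))
```

Comparing tuples of integers avoids the pitfall of comparing version strings alphabetically, where "0.99.0" would sort after "0.218.0".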
Decision: Create the bundle manually or by using a template
Decide whether you want to create a starter bundle by using a template or to create the bundle manually. Creating the bundle by using a template is faster and easier, but the template might generate content that you do not need, and the bundle's default settings must be further customized for real applications. Creating the bundle manually gives you full control over the bundle's settings, but you must be familiar with how bundles work, because you do all of the work from the beginning. Choose one of the following sets of steps:
Create the bundle by using a template
In these steps, you create the bundle by using the Azure Databricks default bundle template for Python. These steps guide you to create a bundle that consists of files to build into a Python wheel file and the definition of an Azure Databricks job to build this Python wheel file. You then validate, deploy, and build the deployed files into a Python wheel file from the Python wheel job within your Azure Databricks workspace.
The Azure Databricks default bundle template for Python uses setuptools to build the Python wheel file. If you want to use Poetry to build the Python wheel file instead, follow the instructions later in this section to swap out the setuptools implementation for a Poetry implementation.
Step 1: Set up authentication
In this step, you set up authentication between the Databricks CLI on your development machine and your Azure Databricks workspace. This article assumes that you want to use OAuth user-to-machine (U2M) authentication and a corresponding Azure Databricks configuration profile named DEFAULT for authentication.
Note
U2M authentication is appropriate for trying out these steps in real time. For fully automated workflows, Databricks recommends that you use OAuth machine-to-machine (M2M) authentication instead. See the M2M authentication setup instructions in Authentication.
Use the Databricks CLI to initiate OAuth token management locally by running the following command for each target workspace.
In the following command, replace <workspace-url> with your Azure Databricks per-workspace URL, for example https://adb-1234567890123456.7.databricks.azure.cn.
databricks auth login --host <workspace-url>
The Databricks CLI prompts you to save the information that you entered as an Azure Databricks configuration profile. Press Enter to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces.
To get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command databricks auth profiles. To view a specific profile's existing settings, run the command databricks auth env --profile <profile-name>.
In your web browser, complete the on-screen instructions to log in to your Azure Databricks workspace.
To view a profile's current OAuth token value and the token's upcoming expiration timestamp, run one of the following commands:
databricks auth token --host <workspace-url>
databricks auth token -p <profile-name>
databricks auth token --host <workspace-url> -p <profile-name>
If you have multiple profiles with the same --host value, you might need to specify the --host and -p options together to help the Databricks CLI find the correct matching OAuth token information.
Step 2: Create the bundle
A bundle contains the artifacts you want to deploy and the settings for the workflows you want to run.
Use your terminal or command prompt to switch to a directory on your local development machine that will contain the template's generated bundle.
Use the Databricks CLI to run the bundle init command:
databricks bundle init
For Template to use, leave the default value of default-python by pressing Enter.
For Unique name for this project, leave the default value of my_project, or type a different value, and then press Enter. This determines the name of the root directory for this bundle. This root directory is created within your current working directory.
For Include a stub (sample) notebook, select no and press Enter. This instructs the Databricks CLI to not add a sample notebook to your bundle.
For Include a stub (sample) DLT pipeline, select no and press Enter. This instructs the Databricks CLI to not define a sample Delta Live Tables pipeline in your bundle.
For Include a stub (sample) Python package, leave the default value of yes by pressing Enter. This instructs the Databricks CLI to add sample Python wheel package files and related build instructions to your bundle.
Step 3: Explore the bundle
To view the files that the template generated, switch to the root directory of your newly created bundle and open this directory with your preferred IDE, for example Visual Studio Code. Files of particular interest include the following:
- databricks.yml: This file specifies the bundle's programmatic name, includes a reference to the Python wheel job definition, and specifies settings about the target workspace.
- resources/<project-name>_job.yml: This file specifies the Python wheel job's settings.
- src/<project-name>: This directory includes the files that the Python wheel job uses to build the Python wheel file.
Note
If you want to install the Python wheel file on a target cluster that has Databricks Runtime 12.2 LTS or below installed, you must add the following top-level mapping to the databricks.yml file:
# Applies to all tasks of type python_wheel_task.
experimental:
  python_wheel_wrapper: true
This mapping instructs the Databricks CLI to do the following:
- Deploy a copy of the Python wheel file in the background. This deployment path is typically ${workspace.artifact_path}/.internal/<random-id>/<wheel-filename>.whl.
- Create a notebook in the background that contains instructions to install the preceding deployed Python wheel file on the target cluster. This notebook's path is typically ${workspace.file_path}/.databricks/bundle/<target-name>/.internal/notebook_<job-name>_<task-key>.
- When you run a job that contains a Python wheel task, and that task references the preceding Python wheel file, a job is created in the background that runs the preceding notebook.
You do not need to specify this mapping for target clusters with Databricks Runtime 13.1 or above installed, as Python wheel installations from the Azure Databricks workspace file system will install automatically on these target clusters.
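The runtime rule in this note can be expressed as a small check. The following Python sketch is illustrative only: the function name needs_wheel_wrapper is hypothetical, it assumes runtime strings shaped like the spark_version values used in bundle configurations (for example, 13.3.x-scala2.12), and it assumes that Databricks Runtime 13.0 is treated like 12.2 LTS and below, which the note does not state explicitly.

```python
import re


def needs_wheel_wrapper(spark_version):
    """Return True when the runtime is older than Databricks Runtime 13.1.

    Assumption: spark_version starts with MAJOR.MINOR, as in
    "13.3.x-scala2.12". Runtime 13.0 is assumed to need the wrapper.
    """
    match = re.match(r"(\d+)\.(\d+)", spark_version)
    if match is None:
        raise ValueError(f"unrecognized runtime version: {spark_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (13, 1)


if __name__ == "__main__":
    for version in ("12.2.x-scala2.12", "13.3.x-scala2.12"):
        print(version, needs_wheel_wrapper(version))
```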
Step 4: Update the project's bundle to use Poetry
By default, the bundle template specifies building the Python wheel file using setuptools along with the files setup.py and requirements-dev.txt. If you want to keep these defaults, then skip ahead to Step 5: Validate the project's bundle configuration file.
To update the project's bundle to use Poetry instead of setuptools, make sure that your local development machine meets the following requirements:
- Poetry version 1.6 or above. To check your installed version of Poetry, run the command poetry -V or poetry --version. To install or upgrade Poetry, see Installation.
- Python version 3.10 or above. To check your version of Python, run the command python -V or python --version.
- Databricks CLI version 0.209.0 or above. To check your version of the Databricks CLI, run the command databricks -v or databricks --version. See Install or update the Databricks CLI.
Make the following changes to the project's bundle:
From the bundle's root directory, instruct poetry to initialize the Python wheel builds for Poetry, by running the following command:
poetry init
Poetry displays several prompts for you to complete. For the Python wheel builds, answer these prompts as follows to match the related default settings in the project's bundle:
- For Package name, type the name of the child folder under /src, and then press Enter. This should also be the package's name value that is defined in the bundle's setup.py file.
- For Version, type 0.0.1 and press Enter. This matches the version number that is defined in the bundle's src/<project-name>/__init__.py file.
- For Description, type wheel file based on <project-name>/src (replacing <project-name> with the project's name), and press Enter. This matches the description value that is defined in the template's setup.py file.
- For Author, press Enter. This default value matches the author that is defined in the template's setup.py file.
- For License, press Enter. There is no license defined in the template.
- For Compatible Python versions, enter the Python version that matches the one on your target Azure Databricks clusters (for example, ^3.10), and press Enter.
- For Would you like to define your main dependencies interactively? type no and press Enter. You will define your dependencies later.
- For Would you like to define your development dependencies interactively? type no and press Enter. You will define your dependencies later.
- For Do you confirm generation? press Enter.
After you complete the prompts, Poetry adds a pyproject.toml file to the bundle's project. For information about the pyproject.toml file, see The pyproject.toml file.
From the bundle's root directory, instruct poetry to read the pyproject.toml file, resolve the dependencies and install them, create a poetry.lock file to lock the dependencies, and finally to create a virtual environment. To do this, run the following command:
poetry install
Add the following section at the end of the pyproject.toml file, replacing <project-name> with the name of the directory that contains the src/<project-name>/main.py file (for example, my_project):
[tool.poetry.scripts]
main = "<project-name>.main:main"
The section specifies the Python wheel's entry point for the Python wheel job.
Add the following mapping at the top level of the bundle's databricks.yml file:
artifacts:
  default:
    type: whl
    build: poetry build
    path: .
This mapping instructs the Databricks CLI to use Poetry to build a Python wheel file.
Delete the setup.py and requirements-dev.txt files from the bundle, as Poetry does not need them.
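To illustrate how an entry point string such as "<project-name>.main:main" is read, the following Python sketch splits the string at the colon, imports the module on the left, and looks up the attribute on the right. This is a simplified approximation and not the code that Poetry or Azure Databricks actually runs; the function name load_entry_point is hypothetical, and the standard library function os.path:basename stands in for the project's own main function so that the sketch runs anywhere.

```python
import importlib


def load_entry_point(spec):
    """Resolve a 'module.path:attribute' string to the object it names."""
    module_name, _, attribute = spec.partition(":")
    if not module_name or not attribute:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


if __name__ == "__main__":
    # "os.path:basename" is a stand-in for "<project-name>.main:main".
    basename = load_entry_point("os.path:basename")
    print(basename("dist/example.whl"))
```

The name on the left of the equals sign in the scripts section (main) is the entry point name that the Python wheel job refers to; the string on the right tells Python where to find the function.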
Step 5: Validate the project's bundle configuration file
In this step, you check whether the bundle configuration is valid.
From the root directory, use the Databricks CLI to run the bundle validate command, as follows:
databricks bundle validate
If a summary of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.
If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid.
Step 6: Build the Python wheel file and deploy the local project to the remote workspace
In this step, you build the Python wheel file, deploy the built Python wheel file to your remote Azure Databricks workspace, and create the Azure Databricks job within your workspace.
If you use setuptools, install the wheel and setuptools packages if you have not done so already, by running the following command:
pip3 install --upgrade wheel setuptools
In the Visual Studio Code terminal, use the Databricks CLI to run the bundle deploy command as follows:
databricks bundle deploy -t dev
If you want to check whether the locally built Python wheel file was deployed:
- In your Azure Databricks workspace's sidebar, click Workspace.
- Click into the following folder: Workspace > Users > <your-username> > .bundle > <project-name> > dev > artifacts > .internal > <random-guid>.
The Python wheel file should be in this folder.
If you want to check whether the job was created:
- In your Azure Databricks workspace's sidebar, click Workflows.
- On the Jobs tab, click [dev <your-username>] <project-name>_job.
- Click the Tasks tab.
There should be one task: main_task.
If you make any changes to your bundle after this step, you should repeat steps 5-6 to check whether your bundle configuration is still valid and then redeploy the project.
Step 7: Run the deployed project
In this step, you run the Azure Databricks job in your workspace.
From the root directory, use the Databricks CLI to run the bundle run command, as follows, replacing <project-name> with the name of your project from Step 2:
databricks bundle run -t dev <project-name>_job
Copy the value of Run URL that appears in your terminal and paste this value into your web browser to open your Azure Databricks workspace.
In your Azure Databricks workspace, after the task completes successfully and shows a green title bar, click the main_task task to see the results.
If you make any changes to your bundle after this step, you should repeat steps 5-7 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project.
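Because steps 5-7 are repeated after every change, you might want to script them. The following Python sketch only assembles the three commands shown in those steps; the helper names bundle_command, redeploy_commands, and run_all are hypothetical, and run_all is defined but not called, so nothing is deployed unless you call it yourself.

```python
import subprocess


def bundle_command(action, target=None, resource=None):
    """Build the argument list for `databricks bundle <action>`."""
    command = ["databricks", "bundle", action]
    if target is not None:
        command += ["-t", target]
    if resource is not None:
        command.append(resource)
    return command


def redeploy_commands(project_name, target="dev"):
    """Return the validate, deploy, and run commands, in that order."""
    return [
        bundle_command("validate"),
        bundle_command("deploy", target=target),
        bundle_command("run", target=target, resource=f"{project_name}_job"),
    ]


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands for review; call run_all(...) to execute them.
    for command in redeploy_commands("my_project"):
        print(" ".join(command))
```

Passing check=True makes a failed validation stop the sequence before anything is deployed, which mirrors the advice to fix validation errors before continuing.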
You have reached the end of the steps for creating a bundle by using a template.
Create the bundle manually
In these steps, you create the bundle from the beginning by hand. These steps guide you to create a bundle that consists of files to build into a Python wheel file and the definition of a Databricks job to build this Python wheel file. You then validate, deploy, and build the deployed files into a Python wheel file from the Python wheel job within your Databricks workspace.
These steps include adding content to a YAML file. Optionally, you might want to use an integrated development environment (IDE) that provides automatic schema suggestions and actions when working with YAML files. The following steps use Visual Studio Code with the YAML extension installed from the Visual Studio Code Marketplace.
These steps assume that you already know:
- How to create, build, and work with Python wheel files with Poetry or setuptools. For Poetry, see Basic usage. For setuptools, see the Python Packaging User Guide.
- How to use Python wheel files as part of an Azure Databricks job. See Use a Python wheel file in an Azure Databricks job.
Follow these instructions to create a sample bundle that builds a Python wheel file with Poetry or setuptools, deploys the Python wheel file, and then runs the deployed Python wheel file.
If you have already built a Python wheel file and just want to deploy and run it, skip ahead to specifying the Python wheel settings in the bundle configuration file in Step 3: Create the bundle's configuration file.
Step 1: Set up authentication
In this step, you set up authentication between the Databricks CLI on your development machine and your Azure Databricks workspace. This article assumes that you want to use OAuth user-to-machine (U2M) authentication and a corresponding Azure Databricks configuration profile named DEFAULT for authentication.
Note
U2M authentication is appropriate for trying out these steps in real time. For fully automated workflows, Databricks recommends that you use OAuth machine-to-machine (M2M) authentication instead. See the M2M authentication setup instructions in Authentication.
Use the Databricks CLI to initiate OAuth token management locally by running the following command for each target workspace.
In the following command, replace <workspace-url> with your Azure Databricks per-workspace URL, for example https://adb-1234567890123456.7.databricks.azure.cn.
databricks auth login --host <workspace-url>
The Databricks CLI prompts you to save the information that you entered as an Azure Databricks configuration profile. Press Enter to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces.
To get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command databricks auth profiles. To view a specific profile's existing settings, run the command databricks auth env --profile <profile-name>.
In your web browser, complete the on-screen instructions to log in to your Azure Databricks workspace.
To view a profile's current OAuth token value and the token's upcoming expiration timestamp, run one of the following commands:
databricks auth token --host <workspace-url>
databricks auth token -p <profile-name>
databricks auth token --host <workspace-url> -p <profile-name>
If you have multiple profiles with the same --host value, you might need to specify the --host and -p options together to help the Databricks CLI find the correct matching OAuth token information.
Step 2: Create the bundle
A bundle contains the artifacts you want to deploy and the settings for the workflows you want to run.
In your bundle's root, create the following folders and files, depending on whether you use Poetry or setuptools for building Python wheel files:
Poetry
├── src
│   └── my_package
│       ├── __init__.py
│       ├── main.py
│       └── my_module.py
└── pyproject.toml
Setuptools
├── src
│   └── my_package
│       ├── __init__.py
│       ├── main.py
│       └── my_module.py
└── setup.py
Leave the __init__.py file empty.
Add the following code to the main.py file and then save the file:
# Import the helper functions explicitly rather than with a wildcard.
from my_package.my_module import (
    add_two_numbers,
    subtract_two_numbers,
    multiply_two_numbers,
    divide_two_numbers,
)


def main():
    first = 200
    second = 400

    print(f"{first} + {second} = {add_two_numbers(first, second)}")
    print(f"{second} - {first} = {subtract_two_numbers(second, first)}")
    print(f"{first} * {second} = {multiply_two_numbers(first, second)}")
    print(f"{second} / {first} = {divide_two_numbers(second, first)}")


if __name__ == "__main__":
    main()
Add the following code to the my_module.py file and then save the file:
def add_two_numbers(a, b):
    return a + b


def subtract_two_numbers(a, b):
    return a - b


def multiply_two_numbers(a, b):
    return a * b


def divide_two_numbers(a, b):
    return a / b
Add the following code to the pyproject.toml or setup.py file and then save the file:
Pyproject.toml
[tool.poetry]
name = "my_package"
version = "0.0.1"
description = "<my-package-description>"
authors = ["my-author-name <my-author-name>@<my-organization>"]

[tool.poetry.dependencies]
python = "^3.10"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[tool.poetry.scripts]
main = "my_package.main:main"
- Replace my-author-name with your organization's primary contact name.
- Replace my-author-name>@<my-organization with your organization's primary email contact address.
- Replace <my-package-description> with a display description for your Python wheel file.
Setup.py
from setuptools import setup, find_packages

import src

setup(
    name = "my_package",
    version = "0.0.1",
    author = "<my-author-name>",
    url = "https://<my-url>",
    author_email = "<my-author-name>@<my-organization>",
    description = "<my-package-description>",
    packages=find_packages(where='./src'),
    package_dir={'': 'src'},
    entry_points={
        "packages": [
            "main=my_package.main:main"
        ]
    },
    install_requires=[
        "setuptools"
    ]
)
- Replace https://<my-url> with your organization's URL.
- Replace <my-author-name> with your organization's primary contact name.
- Replace <my-author-name>@<my-organization> with your organization's primary email contact address.
- Replace <my-package-description> with a display description for your Python wheel file.
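Before building the wheel, it can help to confirm locally what the sample code will print. The following sketch repeats the four functions from my_module.py inline so that it runs on its own; in the real project you would import them from my_package.my_module instead. The format_results helper is hypothetical and simply collects the lines that main() prints.

```python
def add_two_numbers(a, b):
    return a + b


def subtract_two_numbers(a, b):
    return a - b


def multiply_two_numbers(a, b):
    return a * b


def divide_two_numbers(a, b):
    return a / b


def format_results(first, second):
    """Return the four lines that main() prints, as a list of strings."""
    return [
        f"{first} + {second} = {add_two_numbers(first, second)}",
        f"{second} - {first} = {subtract_two_numbers(second, first)}",
        f"{first} * {second} = {multiply_two_numbers(first, second)}",
        f"{second} / {first} = {divide_two_numbers(second, first)}",
    ]


if __name__ == "__main__":
    for line in format_results(200, 400):
        print(line)
```

Note that the division result is a float (2.0), because the / operator in Python 3 always performs true division.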
Step 3: Create the bundle's configuration file
A bundle configuration file describes the artifacts you want to deploy and the workflows you want to run.
In your bundle's root, add a bundle configuration file named databricks.yml. Add the following code to this file:
Poetry
Note
If you have already built a Python wheel file and just want to deploy it, then modify the following bundle configuration file by omitting the artifacts mapping. The Databricks CLI will then assume that the Python wheel file is already built and will automatically deploy the files that are specified in the libraries array's whl entries.
bundle:
  name: my-wheel-bundle

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

resources:
  jobs:
    wheel-job:
      name: wheel-job
      tasks:
        - task_key: wheel-task
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            data_security_mode: USER_ISOLATION
            num_workers: 1
          python_wheel_task:
            entry_point: main
            package_name: my_package
          libraries:
            - whl: ./dist/*.whl

targets:
  dev:
    workspace:
      host: <workspace-url>
Setuptools
bundle:
  name: my-wheel-bundle

resources:
  jobs:
    wheel-job:
      name: wheel-job
      tasks:
        - task_key: wheel-task
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            data_security_mode: USER_ISOLATION
            num_workers: 1
          python_wheel_task:
            entry_point: main
            package_name: my_package
          libraries:
            - whl: ./dist/*.whl

targets:
  dev:
    workspace:
      host: <workspace-url>
Replace <workspace-url> with your per-workspace URL, for example https://adb-1234567890123456.7.databricks.azure.cn.
The artifacts mapping is required to build Python wheel files with Poetry and is optional to build Python wheel files with setuptools. The artifacts mapping contains one or more artifact definitions with the following mappings:
- The type mapping must be present and set to whl to specify that a Python wheel file is to be built. For setuptools, whl is the default if no artifact definitions are specified.
- The path mapping indicates the path to the pyproject.toml file for Poetry or to the setup.py file for setuptools. This path is relative to the databricks.yml file. For setuptools, this path is . (the same directory as the databricks.yml file) by default.
- The build mapping indicates any custom build commands to run to build the Python wheel file. For setuptools, this command is python3 setup.py bdist_wheel by default.
- The files mapping consists of one or more source mappings that specify any additional files to include in the Python wheel build. There is no default.
Note
If you want to install the Python wheel file on a target cluster that has Databricks Runtime 12.2 LTS or below installed, you must add the following top-level mapping to the databricks.yml file:
# Applies to jobs with python_wheel_task and that use
# clusters with Databricks Runtime 13.0 or below installed.
experimental:
  python_wheel_wrapper: true
This mapping instructs the Databricks CLI to do the following:
- Deploy a copy of the Python wheel file in the background. This deployment path is typically ${workspace.artifact_path}/.internal/<random-id>/<wheel-filename>.whl.
- Create a notebook in the background that contains instructions to install the preceding deployed Python wheel file on the target cluster. This notebook's path is typically ${workspace.file_path}/.databricks/bundle/<target-name>/.internal/notebook_<job-name>_<task-key>.
- When you run a job that contains a Python wheel task, and that task references the preceding Python wheel file, a job is created in the background that runs the preceding notebook.
You do not need to specify this mapping for target clusters with Databricks Runtime 13.1 or above installed, as Python wheel installations from the Azure Databricks workspace file system will install automatically on these target clusters.
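The setuptools defaults listed earlier in this step for the type, path, and build mappings can be summarized as a small sketch. This is only an illustration of the documented defaults and is not how the Databricks CLI is implemented; the names SETUPTOOLS_DEFAULTS and effective_setuptools_artifact are hypothetical.

```python
# Defaults described in this step for setuptools-based artifacts.
SETUPTOOLS_DEFAULTS = {
    "type": "whl",
    "path": ".",
    "build": "python3 setup.py bdist_wheel",
}


def effective_setuptools_artifact(artifact=None):
    """Overlay explicit settings on the documented setuptools defaults."""
    merged = dict(SETUPTOOLS_DEFAULTS)
    merged.update(artifact or {})
    return merged


if __name__ == "__main__":
    # With no artifact definition, every documented default applies.
    print(effective_setuptools_artifact())
    # An explicit path replaces only that one setting.
    print(effective_setuptools_artifact({"path": "./packaging"}))
```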
If you use Poetry, do the following:
- Install Poetry, version 1.6 or above, if it is not already installed. To check your installed version of Poetry, run the command poetry -V or poetry --version.
- Make sure you have Python version 3.10 or above installed. To check your version of Python, run the command python -V or python --version.
- Make sure you have Databricks CLI version 0.209.0 or above. To check your version of the Databricks CLI, run the command databricks -v or databricks --version. See Install or update the Databricks CLI.
If you use setuptools, install the wheel and setuptools packages if they are not already installed, by running the following command:
pip3 install --upgrade wheel setuptools
If you intend to store this bundle with a Git provider, add a .gitignore file in the project's root, and add the following entries to this file:
Poetry
.databricks
dist
Setuptools
.databricks
build
dist
src/my_package/my_package.egg-info
Step 4: Validate the project's bundle configuration file
In this step, you check whether the bundle configuration is valid.
From the root directory, validate the bundle configuration file:
databricks bundle validate
If a summary of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.
If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid.
Step 5: Build the Python wheel file and deploy the local project to the remote workspace
Build the Python wheel file locally, deploy the built Python wheel file to your workspace, deploy the notebook to your workspace, and create the job in your workspace:
databricks bundle deploy -t dev
Step 6: Run the deployed project
Run the deployed job, which uses the deployed notebook to call the deployed Python wheel file:
databricks bundle run -t dev wheel-job
In the output, copy the Run URL and paste it into your web browser's address bar.
In the job run's Output page, the following results appear:
200 + 400 = 600
400 - 200 = 200
200 * 400 = 80000
400 / 200 = 2.0
If you make any changes to your bundle after this step, you should repeat steps 4-6 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project.
Build and install a Python wheel file for a job
To build a Python wheel file with Poetry or setuptools, and then use that Python wheel file in a job, you must add one or two mappings to your databricks.yml file.
If you use Poetry, you must include the following artifacts mapping in the databricks.yml file. This mapping runs the poetry build command and uses the pyproject.toml file that is in the same directory as the databricks.yml file:
artifacts:
  default:
    type: whl
    build: poetry build
    path: .
Note
The artifacts mapping is optional for setuptools. By default, for setuptools the Databricks CLI runs the command python3 setup.py bdist_wheel and uses the setup.py file that is in the same directory as the databricks.yml file. The Databricks CLI assumes that you have already run a command such as pip3 install --upgrade wheel setuptools to install the wheel and setuptools packages if they are not already installed.
Also, the job task's libraries mapping must contain a whl value that specifies the path to the built Python wheel file relative to the configuration file in which it is declared. The following example shows this in a notebook task (the ellipsis indicates omitted content for brevity):
resources:
  jobs:
    my-notebook-job:
      name: my-notebook-job
      tasks:
        - task_key: my-notebook-job-notebook-task
          notebook_task:
            notebook_path: ./my_notebook.py
          libraries:
            - whl: ./dist/*.whl
          new_cluster:
            # ...
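To preview which files a whl value such as ./dist/*.whl would match, you can expand the pattern locally relative to the directory that holds the configuration file. The following Python sketch approximates this with the standard glob module; the Databricks CLI's own matching may differ, and the function name find_wheels and the file names created in the demonstration are hypothetical.

```python
import glob
import os
import tempfile


def find_wheels(config_dir, pattern="./dist/*.whl"):
    """Return sorted paths matching pattern, relative to config_dir."""
    full_pattern = os.path.normpath(os.path.join(config_dir, pattern))
    return sorted(glob.glob(full_pattern))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as config_dir:
        dist = os.path.join(config_dir, "dist")
        os.makedirs(dist)
        # Hypothetical build outputs, created only for this demonstration.
        for name in ("my_package-0.0.1-py3-none-any.whl",
                     "my_package-0.0.1.tar.gz"):
            open(os.path.join(dist, name), "w").close()
        # Only the .whl file matches the pattern.
        for wheel in find_wheels(config_dir):
            print(os.path.basename(wheel))
```

If the list comes back empty, the wheel has probably not been built yet, or the pattern is not relative to the directory you expect.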
Build and install a Python wheel file for a pipeline
To build a Python wheel file with Poetry or setuptools and then reference that Python wheel file in a Delta Live Tables pipeline, you must add a mapping to your databricks.yml file if you use Poetry, and you must add a %pip install command to your pipeline notebook, as follows.
If you use Poetry, you must include the following artifacts mapping in the databricks.yml file. This mapping runs the poetry build command and uses the pyproject.toml file that is in the same directory as the databricks.yml file:
artifacts:
  default:
    type: whl
    build: poetry build
    path: .
Note
The artifacts mapping is optional for setuptools. By default, for setuptools the Databricks CLI runs the command python3 setup.py bdist_wheel and uses the setup.py file that is in the same directory as the databricks.yml file. The Databricks CLI assumes that you have already run a command such as pip3 install --upgrade wheel setuptools to install the wheel and setuptools packages if they are not already installed.
Also, the related pipeline notebook must include a %pip install command to install the Python wheel file that is built. See Python libraries.