Connect to cloud object storage and services using Unity Catalog
This article gives an overview of the cloud storage connections that are required to work with data using Unity Catalog, along with information about how Unity Catalog governs access to cloud storage and external cloud services.
Note
If your workspace was created before November 9, 2023, it might not be enabled for Unity Catalog. An account admin must enable Unity Catalog for your workspace. See Enable a workspace for Unity Catalog.
How does Unity Catalog use cloud storage?
Databricks recommends using Unity Catalog to manage access to all data stored in cloud object storage. Unity Catalog provides a suite of tools to configure secure connections to cloud object storage. These connections let you do the following:
- Ingest raw data into a lakehouse.
- Create and read managed tables and managed volumes of unstructured data in Unity Catalog-managed cloud storage.
- Register or create external tables containing tabular data and external volumes containing unstructured data in cloud storage that is managed using your cloud provider.
- Read and write unstructured data (as Unity Catalog volumes).
To be more specific, Unity Catalog uses cloud storage in two primary ways:
- Default (or "managed") storage locations for managed tables and managed volumes (unstructured, non-tabular data) that you create in Databricks. These managed storage locations can be defined at the metastore, catalog, or schema level. You create managed storage locations in your cloud provider, but their lifecycle is fully managed by Unity Catalog.
- Storage locations where external tables and volumes are stored. These are tables and volumes whose access from Azure Databricks is managed by Unity Catalog, but whose data lifecycle and file layout are managed using your cloud provider and other data platforms. Typically you use external tables to register large amounts of your existing data in Azure Databricks, or if you also require write access to the data using tools outside of Azure Databricks.
For more information about managed vs external tables and volumes, see What are tables and views? and What are Unity Catalog volumes?.
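As a rough illustration of how managed storage locations defined at different levels relate to one another, the following sketch assumes that the most specific level that has a location wins (schema, then catalog, then metastore). The function name and the storage paths are hypothetical and exist only for this example; check the managed storage documentation for the authoritative resolution rules.

```python
from typing import Optional


def resolve_managed_location(
    metastore_location: Optional[str],
    catalog_location: Optional[str] = None,
    schema_location: Optional[str] = None,
) -> str:
    """Return the most specific managed storage location that is defined.

    Assumption for this sketch: schema overrides catalog, and catalog
    overrides metastore.
    """
    # Check from most specific (schema) to least specific (metastore).
    for location in (schema_location, catalog_location, metastore_location):
        if location:
            return location
    raise ValueError("No managed storage location is defined at any level")


# Hypothetical Azure Data Lake Storage Gen2 paths, for illustration only.
METASTORE = "abfss://meta@examplestorage.dfs.core.windows.net/root"
CATALOG = "abfss://sales@examplestorage.dfs.core.windows.net/catalog"

# The schema has no location of its own, so the catalog location applies.
print(resolve_managed_location(METASTORE, catalog_location=CATALOG))
```

A schema with no location of its own falls back to its catalog, which in turn falls back to the metastore.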
Warning
Do not give end users storage-level access to Unity Catalog managed tables or volumes. This compromises data security and governance.
Granting users direct storage-level access to external location storage in Azure Data Lake Storage Gen2 does not honor any permissions granted or audits maintained by Unity Catalog. Direct access will bypass auditing, lineage, and other security and monitoring features of Unity Catalog, including access control and permissions. You are responsible for managing direct storage access through Azure Data Lake Storage Gen2 and ensuring that users have the appropriate permissions granted via Fabric.
Avoid all scenarios that grant direct storage-level write access for buckets that store Databricks managed tables. Modifying, deleting, or evolving any objects directly through storage that were originally managed by Unity Catalog can result in data corruption.
Which cloud storage providers are supported?
Azure Databricks supports both Azure Data Lake Storage Gen2 containers and Cloudflare R2 buckets as cloud storage locations for data and AI assets registered in Unity Catalog. R2 is intended primarily for use cases in which you want to avoid data egress fees, such as Delta Sharing across clouds and regions. For more information, see Use Cloudflare R2 replicas or migrate storage to R2.
How does Unity Catalog govern access to cloud storage?
To manage access to the underlying cloud storage that holds tables and volumes, Unity Catalog uses a securable object called an external location, which defines a path to a cloud storage location and the credentials required to access that location. Those credentials are, in turn, defined in a Unity Catalog securable object called a storage credential. By granting and revoking access to external location securables in Unity Catalog, you control access to the data in the cloud storage location. By granting and revoking access to storage credential securables in Unity Catalog, you control the ability to create external location objects.
For details, see Manage access to cloud storage using Unity Catalog.
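To make the relationship between the two securables concrete, here is a minimal sketch that builds the kind of SQL statements an admin might run: one that binds a storage path to a storage credential, one that grants access to the resulting external location, and one that grants the ability to create external locations from a credential. The helper functions, object names, principal names, and storage URL are all hypothetical, and the statement shapes should be verified against the Databricks SQL reference before use.

```python
def quote_identifier(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def create_external_location_sql(name: str, url: str, credential: str) -> str:
    """Build a statement that pairs a cloud storage path with a storage credential."""
    # Keep the sketch simple: reject characters that would need escaping.
    if "'" in url or "\\" in url:
        raise ValueError("URL must not contain quotes or backslashes")
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {quote_identifier(name)} "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL {quote_identifier(credential)})"
    )


def grant_sql(privilege: str, securable_type: str, securable: str, principal: str) -> str:
    """Build a GRANT statement on a Unity Catalog securable."""
    return (
        f"GRANT {privilege} ON {securable_type} "
        f"{quote_identifier(securable)} TO {quote_identifier(principal)}"
    )


# Hypothetical names and path, for illustration only.
statements = [
    create_external_location_sql(
        "finance_landing",
        "abfss://landing@examplestorage.dfs.core.windows.net/finance",
        "finance_cred",
    ),
    # Controls access to data under the external location's path.
    grant_sql("READ FILES", "EXTERNAL LOCATION", "finance_landing", "analysts"),
    # Controls who can define new external locations with this credential.
    grant_sql("CREATE EXTERNAL LOCATION", "STORAGE CREDENTIAL", "finance_cred", "data-admins"),
]

for statement in statements:
    print(statement)
```

The two grants target different securables: the first governs data access through the external location, and the second governs who may create further external locations from the credential.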
Path-based access to cloud storage
Although Unity Catalog supports path-based access to external tables and external volumes using cloud storage URIs, Databricks recommends that users read and write all Unity Catalog tables using table names and access data in volumes using /Volumes paths. Volumes are the securable object that most Azure Databricks users should use to interact directly with non-tabular data in cloud object storage. See What are Unity Catalog volumes?.
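As a small sketch of what a /Volumes path looks like, the helper below assembles one from a catalog, schema, and volume name followed by a relative file path. The helper and the example names are hypothetical; the layout it assumes is /Volumes/<catalog>/<schema>/<volume>/<path>.

```python
def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes path from a three-level volume name and a relative path."""
    for name in (catalog, schema, volume):
        # Each level of the volume name must be a single path segment.
        if not name or "/" in name:
            raise ValueError(f"Invalid name segment: {name!r}")
    # Strip stray slashes so segments join cleanly, and drop empty segments.
    cleaned = [part.strip("/") for part in parts]
    segments = ["/Volumes", catalog, schema, volume] + [p for p in cleaned if p]
    return "/".join(segments)


# Hypothetical catalog, schema, volume, and file names.
print(volume_path("main", "raw", "landing", "2024", "events.json"))
```

Code that reads through a path like this stays tied to the volume as a securable, rather than to a cloud storage URI.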
Best practices for cloud storage with Unity Catalog
Azure Databricks requires using Azure Data Lake Storage Gen2 as the Azure storage service for data that is processed in Azure Databricks using Unity Catalog governance. Azure Data Lake Storage Gen2 enables you to separate storage and compute costs and take advantage of the fine-grained access control provided by Unity Catalog. If data is stored in OneLake (the Microsoft Fabric data lake) and processed by Databricks (bypassing Unity Catalog), you will incur bundled storage and compute costs. This can lead to costs that are approximately 3x higher for reads and 1.6x higher for writes compared to Azure Data Lake Storage Gen2 for storing, reading, and writing data. Azure Blob Storage is also incompatible with Unity Catalog.
Feature | Azure Blob Storage | Azure Data Lake Storage Gen2 | OneLake |
---|---|---|---|
Supported by Unity Catalog | X | ✓ | X |
Requires additional Fabric capacity purchase | X | X | ✓ |
Supported operations from external engines | Read, Write | Read, Write | Read only (reads incur 3x the cost compared to reading data from Azure Data Lake Storage Gen2); writes are not supported |
Deployment | Regional | Regional | Global |
Authentication | Entra ID, Shared Access Signature | Entra ID, Shared Access Signature | Entra ID |
Storage events | ✓ | ✓ | X |
Soft delete | ✓ | ✓ | ✓ |
Access control | RBAC | RBAC, ABAC, ACL | RBAC (Table/folder only, shortcut ACLs not supported) |
Encryption keys | ✓ | ✓ | X |
Access tiers | Online archive | Hot, cool, cold, archive | Hot only |
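To get a feel for the cost difference described in this section, the following back-of-the-envelope sketch applies the approximate multipliers quoted above (3x for reads, 1.6x for writes) to a baseline Azure Data Lake Storage Gen2 spend. The function name and the baseline figures are made up for illustration, and the multipliers are approximations, so treat the result as a rough estimate rather than a pricing calculation.

```python
# Approximate multipliers quoted in this article; not official pricing.
READ_MULTIPLIER = 3.0
WRITE_MULTIPLIER = 1.6


def estimate_onelake_cost(adls_read_cost: float, adls_write_cost: float) -> float:
    """Scale baseline ADLS Gen2 read and write costs by the approximate multipliers."""
    if adls_read_cost < 0 or adls_write_cost < 0:
        raise ValueError("Costs must be non-negative")
    return adls_read_cost * READ_MULTIPLIER + adls_write_cost * WRITE_MULTIPLIER


# Hypothetical monthly baseline, in arbitrary currency units.
baseline_reads = 100.0
baseline_writes = 50.0

estimate = estimate_onelake_cost(baseline_reads, baseline_writes)
print(f"Baseline: {baseline_reads + baseline_writes:.2f}, estimated OneLake: {estimate:.2f}")
```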
How does Unity Catalog govern access to other cloud services?
Unity Catalog governs access to non-storage services using a securable object called a service credential. A service credential encapsulates a long-term cloud credential that provides access to an external service that users need to connect to from Azure Databricks.
Service credentials are not intended for governing access to cloud storage that is used as a Unity Catalog managed storage location or external storage location. For those use cases, use a storage credential, as described in How does Unity Catalog govern access to cloud storage?.
For details, see:
- Manage access to external cloud services using service credentials
- Manage service credentials
- Use Unity Catalog service credentials to connect to external cloud services
Next steps
If you're just getting started with Unity Catalog as an admin, see:
If you're a new user and your workspace is already enabled for Unity Catalog, see:
To learn more about how to manage access to cloud storage, see:
To learn more about how to manage access to cloud services, see: