Asset normalization

When assets are ingested into the Microsoft Purview Data Map, different sources that update the same data asset can send similar but slightly different qualified names. Although these qualified names represent the same asset, a small difference such as an extra character can make them look like different assets and create duplicate entries in Microsoft Purview. To avoid storing duplicates and causing confusion for people who use the Unified Catalog, Microsoft Purview automatically applies normalization during ingestion so that all fully qualified names of the same entity type are in the same format.

For example, suppose you scan an Azure Blob with the qualified name https://myaccount.file.core.chinacloudapi.cn/myshare/folderA/folderB/my-file.parquet. An Azure Data Factory (ADF) pipeline also consumes this blob and then adds lineage information to the asset. The ADF pipeline might be configured to read the file as https://myAccount.file.core.chinacloudapi.cn//myshare/folderA/folderB/my-file.parquet (note the capital A in the account name and the doubled slash before myshare). Although the qualified name is different, the ADF pipeline is consuming the same piece of data. Normalization ensures that all the metadata from both Azure Blob Storage and Azure Data Factory is visible on a single asset, https://myaccount.file.core.chinacloudapi.cn/myshare/folderA/folderB/my-file.parquet.

Important

The rules listed below are the only kinds of potential duplication that Microsoft Purview currently recognizes. If you experience accidental asset duplication, compare the assets' fully qualified names to check for capitalization differences or extra characters. Update any ingestion points, for example your ADF pipelines, so that the qualified names match.

Normalization rules

These are the normalization rules that Microsoft Purview automatically applies.

Encode curly brackets

Applies to: All Assets

Before: https://myaccount.file.core.chinacloudapi.cn/myshare/{folderA}/folder{B/

After: https://myaccount.file.core.chinacloudapi.cn/myshare/%7BfolderA%7D/folder%7BB/
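The curly-bracket rule can be approximated as a plain percent-encoding of the two characters. This is a minimal sketch for checking names on the client side, not Microsoft Purview's implementation; the function name `encode_curly_brackets` is hypothetical.

```python
def encode_curly_brackets(qualified_name: str) -> str:
    """Percent-encode literal curly brackets (hypothetical helper, not a Purview API)."""
    return qualified_name.replace("{", "%7B").replace("}", "%7D")


before = "https://myaccount.file.core.chinacloudapi.cn/myshare/{folderA}/folder{B/"
print(encode_curly_brackets(before))
```

Unbalanced brackets, such as the lone `{` in `folder{B`, are encoded the same way as balanced ones.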

Trim section spaces

Applies to: Azure Blob, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure Data Factory, Azure SQL Database, Azure SQL Managed Instance, Azure SQL pool, Azure Cosmos DB, Azure Cognitive Search, Azure Data Explorer, Azure Data Share

Before: https://myaccount.file.core.chinacloudapi.cn/myshare/ folder A/folderB /

After: https://myaccount.file.core.chinacloudapi.cn/myshare/folder A/folderB/
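As the Before and After values show, only the spaces at the start and end of each slash-separated section are removed, while spaces inside a section are kept. A rough sketch of that behavior, assuming sections are delimited by `/` and using a hypothetical helper name, could look like this:

```python
def trim_section_spaces(qualified_name: str) -> str:
    """Strip leading/trailing spaces from each '/'-separated section.

    Hypothetical helper for illustration; not Purview's actual code.
    """
    scheme, sep, rest = qualified_name.partition("://")
    if not sep:
        # No scheme present: treat the whole string as the path.
        scheme, rest = "", qualified_name
    sections = [section.strip(" ") for section in rest.split("/")]
    return scheme + sep + "/".join(sections)


before = "https://myaccount.file.core.chinacloudapi.cn/myshare/ folder A/folderB /"
print(trim_section_spaces(before))
```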

Remove hostname spaces

Applies to: Azure Blob, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, Azure SQL Managed Instance, Azure SQL pool, Azure Cosmos DB, Azure Cognitive Search, Azure Data Explorer, Azure Data Share, Amazon S3

Before: https://myaccount .file. core.china cloudapi. cn/myshare/folderA/folderB/

After: https://myaccount.file.core.chinacloudapi.cn/myshare/folderA/folderB/
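One way to picture this rule is to isolate the hostname (the text between `://` and the next `/`) and delete every space in it, leaving the path untouched. The sketch below is an assumption about the behavior, with a hypothetical function name and an illustrative input.

```python
def remove_hostname_spaces(qualified_name: str) -> str:
    """Delete spaces from the hostname only (hypothetical helper, not a Purview API)."""
    scheme, sep, rest = qualified_name.partition("://")
    if not sep:
        return qualified_name  # no scheme, so no hostname to clean
    host, slash, path = rest.partition("/")
    return scheme + sep + host.replace(" ", "") + slash + path


before = "https://myaccount .file. core.chinacloudapi. cn/myshare/folderA/folderB/"
print(remove_hostname_spaces(before))
```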

Remove square brackets

Applies to: Azure SQL Database, Azure SQL Managed Instance, Azure SQL pool

Before: mssql://foo.database.chinacloudapi.cn/[bar]/dbo/[foo bar]

After: mssql://foo.database.chinacloudapi.cn/bar/dbo/foo%20bar

Note

Spaces inside a pair of square brackets are encoded as %20.
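The bracket rule and its note can be sketched with a regular expression that unwraps each bracketed segment and encodes the spaces inside it. This is an illustrative approximation under the assumption that brackets are not nested; `remove_square_brackets` is a hypothetical name.

```python
import re


def remove_square_brackets(qualified_name: str) -> str:
    """Drop [ ] pairs and encode spaces found inside them as %20.

    Hypothetical helper for illustration; assumes brackets are not nested.
    """
    def unwrap(match):
        return match.group(1).replace(" ", "%20")

    return re.sub(r"\[([^\[\]]*)\]", unwrap, qualified_name)


before = "mssql://foo.database.chinacloudapi.cn/[bar]/dbo/[foo bar]"
print(remove_square_brackets(before))
```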

Lowercase scheme

Applies to: Azure Blob, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, Azure SQL Managed Instance, Azure SQL pool, Azure Cosmos DB, Azure Cognitive Search, Azure Data Explorer, Amazon S3

Before: HTTPS://myaccount.file.core.chinacloudapi.cn/myshare/folderA/folderB/

After: https://myaccount.file.core.chinacloudapi.cn/myshare/folderA/folderB/

Lowercase hostname

Applies to: Azure Blob, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, Azure SQL Managed Instance, Azure SQL pool, Azure Cosmos DB, Azure Cognitive Search, Azure Data Explorer, Amazon S3

Before: https://myAccount.file.Core.ChinaCloudApi.cn/myshare/folderA/folderB/

After: https://myaccount.file.core.chinacloudapi.cn/myshare/folderA/folderB/
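The two lowercase rules for the scheme and the hostname can be sketched together, because both act on the part of the name before the first path slash. The helper below is a hypothetical illustration, not Purview's code; it deliberately leaves the path's capitalization alone.

```python
def lowercase_scheme_and_host(qualified_name: str) -> str:
    """Lowercase the scheme and hostname, keeping the path as-is.

    Hypothetical helper for illustration only.
    """
    scheme, sep, rest = qualified_name.partition("://")
    if not sep:
        return qualified_name  # nothing that looks like scheme://host
    host, slash, path = rest.partition("/")
    return scheme.lower() + sep + host.lower() + slash + path


before = "HTTPS://myAccount.file.Core.ChinaCloudApi.cn/myshare/folderA/folderB/"
print(lowercase_scheme_and_host(before))
```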

Lowercase file extension

Applies to: Azure Blob, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Amazon S3

Before: https://myAccount.file.core.chinacloudapi.cn/myshare/folderA/data.TXT

After: https://myaccount.file.core.chinacloudapi.cn/myshare/folderA/data.txt
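A possible sketch of the extension rule on its own: split the extension off the last path segment and lowercase only that part. The function name is hypothetical, and the input below already has a lowercase hostname so that only this one rule is exercised.

```python
import posixpath


def lowercase_file_extension(qualified_name: str) -> str:
    """Lowercase the file extension of the last path segment.

    Hypothetical helper; folder and file base names keep their case.
    """
    scheme, sep, rest = qualified_name.partition("://")
    if not sep:
        return qualified_name
    host, slash, path = rest.partition("/")
    root, ext = posixpath.splitext(path)  # ext is "" when there is no extension
    return scheme + sep + host + slash + root + ext.lower()


before = "https://myaccount.file.core.chinacloudapi.cn/myshare/folderA/data.TXT"
print(lowercase_file_extension(before))
```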

Remove duplicate slash

Applies to: Azure Blob, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure Data Factory, Azure SQL Database, Azure SQL Managed Instance, Azure SQL pool, Azure Cosmos DB, Azure Cognitive Search, Azure Data Explorer, Azure Data Share, Amazon S3

Before: https://myAccount.file.core.chinacloudapi.cn//myshare/folderA////folderB/

After: https://myaccount.file.core.chinacloudapi.cn/myshare/folderA/folderB/
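Collapsing repeated slashes has to leave the `://` after the scheme intact. A minimal sketch of that idea, with a hypothetical function name and a lowercase hostname in the input so that only this rule is shown:

```python
import re


def remove_duplicate_slashes(qualified_name: str) -> str:
    """Collapse runs of '/' into one, except the '//' after the scheme.

    Hypothetical helper for illustration; not Purview's actual code.
    """
    scheme, sep, rest = qualified_name.partition("://")
    if not sep:
        scheme, rest = "", qualified_name
    return scheme + sep + re.sub(r"/{2,}", "/", rest)


before = "https://myaccount.file.core.chinacloudapi.cn//myshare/folderA////folderB/"
print(remove_duplicate_slashes(before))
```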

Convert to ADL scheme

Applies to: Azure Data Lake Storage Gen1

Before: https://mystore.azuredatalakestore.net/folderA/folderB/abc.csv

After: adl://mystore.azuredatalakestore.net/folderA/folderB/abc.csv
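The scheme conversion can be pictured as a prefix swap that applies only to Azure Data Lake Storage Gen1 hosts. The sketch below assumes such hosts are recognized by the `.azuredatalakestore.net` suffix seen in the example and that only `https` names are converted; both conditions and the function name are assumptions.

```python
ADLS_GEN1_SUFFIX = ".azuredatalakestore.net"  # assumption based on the example host


def convert_to_adl_scheme(qualified_name: str) -> str:
    """Rewrite https:// to adl:// for ADLS Gen1 hosts (hypothetical helper)."""
    scheme, sep, rest = qualified_name.partition("://")
    host = rest.partition("/")[0]
    if sep and scheme.lower() == "https" and host.lower().endswith(ADLS_GEN1_SUFFIX):
        return "adl://" + rest
    return qualified_name


before = "https://mystore.azuredatalakestore.net/folderA/folderB/abc.csv"
print(convert_to_adl_scheme(before))
```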

Remove trailing slash

Remove the trailing slash from higher-level assets for Azure Blob, ADLS Gen1, and ADLS Gen2.

Applies to: Azure Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2

Asset types: "azure_blob_container", "azure_blob_service", "azure_storage_account", "azure_datalake_gen2_service", "azure_datalake_gen2_filesystem", "azure_datalake_gen1_account".

Before: https://myaccount.core.chinacloudapi.cn/

After: https://myaccount.core.chinacloudapi.cn
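Unlike the other rules, this one depends on the asset type as well as on the name. A sketch of that check, using the asset types listed above and a hypothetical function name, might be:

```python
# Asset types taken from the list above.
HIGHER_LEVEL_TYPES = frozenset({
    "azure_blob_container",
    "azure_blob_service",
    "azure_storage_account",
    "azure_datalake_gen2_service",
    "azure_datalake_gen2_filesystem",
    "azure_datalake_gen1_account",
})


def remove_trailing_slash(qualified_name: str, type_name: str) -> str:
    """Drop one trailing '/' for higher-level asset types only.

    Hypothetical helper for illustration; not Purview's actual code.
    """
    if type_name in HIGHER_LEVEL_TYPES and qualified_name.endswith("/"):
        return qualified_name[:-1]
    return qualified_name


print(remove_trailing_slash("https://myaccount.core.chinacloudapi.cn/", "azure_storage_account"))
```

Names of any other asset type are returned unchanged, so folder-style paths keep their trailing slash.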

Troubleshooting

If your data isn't being normalized and you're experiencing accidental asset duplication, compare the assets' fully qualified names to check for capitalization differences or additional characters.
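When two duplicate assets have long qualified names, a small script can make the comparison easier. The sketch below uses Python's standard `difflib` module to report whether the names differ only in capitalization and where the character-level differences are; the function name and message wording are hypothetical.

```python
import difflib
from typing import List


def describe_differences(name_a: str, name_b: str) -> List[str]:
    """List human-readable differences between two qualified names.

    Hypothetical troubleshooting helper; it does not call any Purview API.
    """
    if name_a == name_b:
        return []
    findings = []
    if name_a.lower() == name_b.lower():
        findings.append("names differ only in capitalization")
    matcher = difflib.SequenceMatcher(None, name_a, name_b, autojunk=False)
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            findings.append(
                f"{tag} at position {a_start}: "
                f"{name_a[a_start:a_end]!r} -> {name_b[b_start:b_end]!r}"
            )
    return findings


for line in describe_differences(
    "https://myaccount.file.core.chinacloudapi.cn/myshare/folderA/data.txt",
    "https://myAccount.file.core.chinacloudapi.cn//myshare/folderA/data.txt",
):
    print(line)
```

An empty result means the two names are identical, in which case the duplication has a different cause.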

The rules listed above are the only types of duplication Microsoft Purview currently recognizes. If your data is falling outside of these rules, update any ingestion points, for example your ADF pipelines, so that the qualified names match.

If your assets meet the rules but aren't being normalized, contact support.

Next steps

Scan an Azure Blob Storage account into the Microsoft Purview Data Map.