Query data in Azure Data Lake using Azure Data Explorer

Azure Data Lake Storage is a highly scalable and cost-effective data lake solution for big data analytics. It combines the power of a high-performance file system with massive scale and economy to help you reduce your time to insight. Data Lake Storage Gen2 extends Azure Blob Storage capabilities and is optimized for analytics workloads.

Azure Data Explorer integrates with Azure Blob Storage and Azure Data Lake Storage (Gen1 and Gen2), providing fast, cached, and indexed access to data stored in external storage. You can analyze and query data without prior ingestion into Azure Data Explorer. You can also query across ingested and uningested external data simultaneously. For more information, see how to create an external table using the Azure Data Explorer web UI wizard. For a brief overview, see external tables.

Tip

The best query performance necessitates data ingestion into Azure Data Explorer. The capability to query external data without prior ingestion should only be used for historical data or data that are rarely queried. Optimize your external data query performance for best results.

Create an external table

Let's say you have lots of CSV files containing historical info on products stored in a warehouse, and you want to do a quick analysis to find the five most popular products from last year. In this example, the CSV files look like:

Timestamp ProductId ProductDescription
2019-01-01 11:21:00 TO6050 3.5in DS/HD Floppy Disk
2019-01-01 11:30:55 YDX1 Yamaha DX1 Synthesizer
... ... ...

The files are stored in Azure Blob storage mycompanystorage under a container named archivedproducts, partitioned by date:

https://mycompanystorage.blob.core.chinacloudapi.cn/archivedproducts/2019/01/01/part-00000-7e967c99-cf2b-4dbb-8c53-ce388389470d.csv.gz
https://mycompanystorage.blob.core.chinacloudapi.cn/archivedproducts/2019/01/01/part-00001-ba356fa4-f85f-430a-8b5a-afd64f128ca4.csv.gz
https://mycompanystorage.blob.core.chinacloudapi.cn/archivedproducts/2019/01/01/part-00002-acb644dc-2fc6-467c-ab80-d1590b23fc31.csv.gz
https://mycompanystorage.blob.core.chinacloudapi.cn/archivedproducts/2019/01/01/part-00003-cd5fad16-a45e-4f8c-a2d0-5ea5de2f4e02.csv.gz
https://mycompanystorage.blob.core.chinacloudapi.cn/archivedproducts/2019/01/02/part-00000-ffc72d50-ff98-423c-913b-75482ba9ec86.csv.gz
...

To run a KQL query on these CSV files directly, use the .create external table command to define an external table in Azure Data Explorer. For more information on external table create command options, see external table commands.

.create external table ArchivedProducts(Timestamp:datetime, ProductId:string, ProductDescription:string)
kind=blob
partition by (Date:datetime = bin(Timestamp, 1d))
dataformat=csv
(
  h@'https://mycompanystorage.blob.core.chinacloudapi.cn/archivedproducts;StorageSecretKey'
)

The external table is now visible in the left pane of the Azure Data Explorer web UI:

Screenshot that shows external table in Azure Data Explorer web UI.

External table permissions

  • The database user can create an external table. The table creator automatically becomes the table administrator.
  • The cluster, database, or table administrator can edit an existing table.
  • Any database user or reader can query an external table.

Querying an external table

Once an external table is defined, the external_table() function can be used to refer to it. The rest of the query is standard Kusto Query Language.

external_table("ArchivedProducts")
| where Timestamp > ago(365d)
| summarize Count=count() by ProductId,
| top 5 by Count

Querying external and ingested data together

You can query both external tables and ingested data tables within the same query. You can join or union the external table with other data from Azure Data Explorer, SQL servers, or other sources. Use a let( ) statement to assign a shorthand name to an external table reference.

In the example below, Products is an ingested data table and ArchivedProducts is an external table that we've defined previously:

let T1 = external_table("ArchivedProducts") |  where TimeStamp > ago(100d);
let T = Products; //T is an internal table
T1 | join T on ProductId | take 10

Querying hierarchical data formats

Azure Data Explorer allows querying hierarchical formats, such as JSON, Parquet, Avro, and ORC. To map hierarchical data schema to an external table schema (if it's different), use external table mappings commands. For instance, if you want to query JSON log files with the following format:

{
  "timestamp": "2019-01-01 10:00:00.238521",
  "data": {
    "tenant": "e1ef54a6-c6f2-4389-836e-d289b37bcfe0",
    "method": "RefreshTableMetadata"
  }
}
{
  "timestamp": "2019-01-01 10:00:01.845423",
  "data": {
    "tenant": "9b49d0d7-b3e6-4467-bb35-fa420a25d324",
    "method": "GetFileList"
  }
}
...

The external table definition looks like this:

.create external table ApiCalls(Timestamp: datetime, TenantId: guid, MethodName: string)
kind=blob
dataformat=multijson
(
   h@'https://storageaccount.blob.core.chinacloudapi.cn/container1;StorageSecretKey'
)

Define a JSON mapping that maps data fields to external table definition fields:

.create external table ApiCalls json mapping 'MyMapping' '[{"Column":"Timestamp","Properties":{"Path":"$.timestamp"}},{"Column":"TenantId","Properties":{"Path":"$.data.tenant"}},{"Column":"MethodName","Properties":{"Path":"$.data.method"}}]'

When you query the external table, the mapping will be invoked, and relevant data will be mapped to the external table columns:

external_table('ApiCalls') | take 10

For more info on mapping syntax, see data mappings.

Query TaxiRides external table in the help cluster

Use the test cluster called help to try out different Azure Data Explorer capabilities. The help cluster contains an external table definition for a New York City taxi dataset containing billions of taxi rides.

Create external table TaxiRides

This section shows the query used to create the TaxiRides external table in the help cluster. Since this table has already been created, you can skip this section and go directly to query TaxiRides external table data.

.create external table TaxiRides
(
  trip_id: long,
  vendor_id: string,
  pickup_datetime: datetime,
  dropoff_datetime: datetime,
  store_and_fwd_flag: string,
  rate_code_id: int,
  pickup_longitude: real,
  pickup_latitude: real,
  dropoff_longitude: real,
  dropoff_latitude: real,
  passenger_count: int,
  trip_distance: real,
  fare_amount: real,
  extra: real,
  mta_tax: real,
  tip_amount: real,
  tolls_amount: real,
  ehail_fee: real,
  improvement_surcharge: real,
  total_amount: real,
  payment_type: string,
  trip_type: int,
  pickup: string,
  dropoff: string,
  cab_type: string,
  precipitation: int,
  snow_depth: int,
  snowfall: int,
  max_temperature: int,
  min_temperature: int,
  average_wind_speed: int,
  pickup_nyct2010_gid: int,
  pickup_ctlabel: string,
  pickup_borocode: int,
  pickup_boroname: string,
  pickup_ct2010: string,
  pickup_boroct2010: string,
  pickup_cdeligibil: string,
  pickup_ntacode: string,
  pickup_ntaname: string,
  pickup_puma: string,
  dropoff_nyct2010_gid: int,
  dropoff_ctlabel: string,
  dropoff_borocode: int,
  dropoff_boroname: string,
  dropoff_ct2010: string,
  dropoff_boroct2010: string,
  dropoff_cdeligibil: string,
  dropoff_ntacode: string,
  dropoff_ntaname: string,
  dropoff_puma: string
)
kind=blob
partition by (Date:datetime = bin(pickup_datetime, 1d))
dataformat=csv
(
    h@'https://storageaccount.blob.core.chinacloudapi.cn/container1;secretKey'
)

You can find the created TaxiRides table by looking at the left pane of the Azure Data Explorer web UI:

 Screenshot showing the Taxi rides external table.

Query TaxiRides external table data

Sign in to https://dataexplorer.azure.cn/clusters/help.chinaeast2/databases/Samples.

Query TaxiRides external table without partitioning

Run this query on the external table TaxiRides to show rides for each day of the week, across the entire dataset.

external_table("TaxiRides")
| summarize count() by dayofweek(pickup_datetime)
| render columnchart

This query shows the busiest day of the week. Since the data isn't partitioned, the query may take up to several minutes to return results.

Graph representation to render non-partitioned query.

Query TaxiRides external table with partitioning

Run this query on the external table TaxiRides to show taxi cab types (yellow or green) used in January of 2017.

external_table("TaxiRides")
| where pickup_datetime between (datetime(2017-01-01) .. datetime(2017-02-01))
| summarize count() by cab_type
| render piechart

This query uses partitioning, which optimizes query time and performance. The query filters on a partitioned column (pickup_datetime) and returns results in a few seconds.

Diagram to render partitioned query.

You can write other queries to run on the external table TaxiRides and learn more about the data.

Optimize your query performance

Optimize your query performance in the lake by using the following best practices for querying external data.

Data format

  • Use a columnar format for analytical queries, for the following reasons:
    • Only the columns relevant to a query can be read.
    • Column encoding techniques can reduce data size significantly.
  • Azure Data Explorer supports Parquet and ORC columnar formats. Parquet format is suggested because of optimized implementation.

Azure region

Check that external data is in the same Azure region as your Azure Data Explorer cluster. This setup reduces cost and data fetch time.

File size

The optimal file size is hundreds of Mb (up to 1 GB) per file. Avoid many small files that require unneeded overhead, such as slower file enumeration process and limited use of columnar format. The number of files should be greater than the number of CPU cores in your Azure Data Explorer cluster.

Compression

Use compression to reduce the amount of data being fetched from the remote storage. For Parquet format, use the internal Parquet compression mechanism that compresses column groups separately, allowing you to read them separately. To validate use of compression mechanism, check that the files are named as follows: <filename>.gz.parquet or <filename>.snappy.parquet and not <filename>.parquet.gz.

Partitioning

Organize your data using "folder" partitions that enable the query to skip irrelevant paths. When planning partitioning, consider file size and common filters in your queries such as timestamp or tenant ID.

VM size

Select VM SKUs with more cores and higher network throughput (memory is less important). For more information, see Select the correct VM SKU for your Azure Data Explorer cluster.