Query variant data
Important
This feature is in Public Preview.
This article describes the Databricks SQL operators you can use to query and transform semi-structured data stored as VARIANT
. The VARIANT
data type is available in Databricks Runtime 15.3 and above.
Databricks recommends using VARIANT
over JSON strings. For users currently using JSON strings looking to migrate, see How is variant different than JSON strings?.
If you want to see examples for querying semi-structured data stored with JSON strings, see Query JSON strings.
Note
VARIANT
columns cannot be used for clustering keys, partitions, or Z-order keys. The VARIANT
data type cannot be used for comparisons, grouping, ordering, and set operations. For a full list of limitations, see Limitations.
Create a table with a variant column
Run the following query to create a table with highly nested data stored as VARIANT
. The examples in this article all reference this table.
CREATE TABLE store_data AS
SELECT parse_json(
'{
"store":{
"fruit": [
{"weight":8,"type":"apple"},
{"weight":9,"type":"pear"}
],
"basket":[
[1,2,{"b":"y","a":"x"}],
[3,4],
[5,6]
],
"book":[
{
"author":"Nigel Rees",
"title":"Sayings of the Century",
"category":"reference",
"price":8.95
},
{
"author":"Herman Melville",
"title":"Moby Dick",
"category":"fiction",
"price":8.99,
"isbn":"0-553-21311-3"
},
{
"author":"J. R. R. Tolkien",
"title":"The Lord of the Rings",
"category":"fiction",
"reader":[
{"age":25,"name":"bob"},
{"age":26,"name":"jack"}
],
"price":22.99,
"isbn":"0-395-19395-8"
}
],
"bicycle":{
"price":19.95,
"color":"red"
}
},
"owner":"amy",
"zip code":"94025",
"fb:testid":"1234"
}'
) as raw
Query fields in a variant column
The syntax for querying JSON strings and other complex data types on Azure Databricks applies to VARIANT
data, including the following:
- Use
:
to select top level fields. - Use
.
or[<key>]
to select nested fields with named keys. - Use
[<index>]
to select values from arrays.
Note
If a field name contains a period (.
), you must escape it with square brackets ([ ]
). For example, the following query selects a field named zip.code
:
SELECT raw:['zip.code'] FROM store_data
Extract a top-level variant field
To extract a field, specify the name of the JSON field in your extraction path. Field names are always case sensitive.
SELECT raw:owner FROM store_data
+-------+
| owner |
+-------+
| "amy" |
+-------+
-- Use backticks to escape special characters.
SELECT raw:`zip code`, raw:`fb:testid` FROM store_data
+----------+-----------+
| zip code | fb:testid |
+----------+-----------+
| "94025" | "1234" |
+----------+-----------+
If a path cannot be found, the result is NULL
of type VARIANT
.
Extract variant nested fields
You specify nested fields through dot notation or using brackets. Field names are always case sensitive.
-- Use dot notation
SELECT raw:store.bicycle FROM store_data
+------------------+
| bicycle |
+------------------+
| { |
| "color":"red", |
| "price":19.95 |
| } |
+------------------+
-- Use brackets
SELECT raw:store['bicycle'] FROM store_data
+------------------+
| bicycle |
+------------------+
| { |
| "color":"red", |
| "price":19.95 |
| } |
+------------------+
If a path cannot be found, the result is NULL
of type VARIANT
.
Extract values from variant arrays
You index elements in arrays with brackets. Indices are 0-based.
-- Index elements
SELECT raw:store.fruit[0], raw:store.fruit[1] FROM store_data
+-------------------+------------------+
| fruit | fruit |
+-------------------+------------------+
| { | { |
| "type":"apple", | "type":"pear", |
| "weight":8 | "weight":9 |
| } | } |
+-------------------+------------------+
If the path cannot be found, or if the array-index is out of bounds, the result is NULL
.
Flatten variant objects and arrays
The variant_explode
table-valued generator function can be used to flatten VARIANT
arrays and objects.
Because variant_explode
is a generator function, you use it as part of the FROM
clause rather than in the SELECT
list, as in the following examples:
SELECT key, value
FROM store_data,
LATERAL variant_explode(store_data.raw:store);
+-------+--------------------+
| key| value|
+-------+--------------------+
| basket|[[1,2,{"a":"x","b...|
|bicycle|{"color":"red","p...|
| book|[{"author":"Nigel...|
| fruit|[{"type":"apple",...|
+-------+--------------------+
SELECT pos, value
FROM store_data,
LATERAL variant_explode(store_data.raw:store.basket[0]);
+---+-----------------+
|pos| value|
+---+-----------------+
| 0| 1|
| 1| 2|
| 2|{"a":"x","b":"y"}|
+---+-----------------+
Variant type casting rules
You can store arrays and scalars using VARIANT
type. When trying to cast variant types to other types, normal casting rules apply for individual values and fields, with the following additional rules.
Note
variant_get
and try_variant_get
take type arguments and follow these casting rules.
Source type | Behavior |
---|---|
VOID |
The result is a NULL of type VARIANT . |
ARRAY<elementType> |
The elementType must be a type that can be cast to VARIANT . |
When inferring type with schema_of_variant
or schema_of_variant_agg
, functions fall back to VARIANT
type rather than STRING
type when conflicting types are present that can't be resolved.
You can use ::
or cast
to cast values to supported data types.
-- price is returned as a double, not a string
SELECT raw:store.bicycle.price::double FROM store_data
+------------------+
| price |
+------------------+
| 19.95 |
+------------------+
-- cast into more complex types
SELECT cast(raw:store.bicycle AS STRUCT<price DOUBLE, color STRING>) bicycle FROM store_data;
-- `::` also supported
SELECT raw:store.bicycle::STRUCT<price DOUBLE, color STRING> bicycle FROM store_data;
+------------------+
| bicycle |
+------------------+
| { |
| "price":19.95, |
| "color":"red" |
| } |
+------------------+
Variant null rules
Variants can contain two kinds of nulls:
- SQL
NULL
: SQLNULL
s indicate that the value is missing. These are the sameNULL
s as when dealing with structured data. - Variant
NULL
: VariantNULL
s indicate that the variant explicitly contains aNULL
value. These are not the same as SQLNULL
s, because theNULL
value is stored in the data.
Use the is_variant_null
function to determine if the variant value is a variant NULL
.
SELECT
is_variant_null(parse_json(NULL)) AS sql_null,
is_variant_null(parse_json('null')) AS variant_null,
is_variant_null(parse_json('{ "field_a": null }'):field_a) AS variant_null_value,
is_variant_null(parse_json('{ "field_a": null }'):missing) AS missing_sql_value_null
+--------+------------+------------------+----------------------+
|sql_null|variant_null|variant_null_value|missing_sql_value_null|
+--------+------------+------------------+----------------------+
| false| true| true| false|
+--------+------------+------------------+----------------------+