histogram_numeric 聚合函数

适用于:勾选“是” Databricks SQL 勾选“是” Databricks Runtime 10.2 及更高版本

使用 numBins 箱计算 expr 上的直方图。

语法

histogram_numeric ( [ALL | DISTINCT ] expr, numBins ) [ FILTER ( WHERE cond ) ]

参数

  • expr:函数在其上使用和计算直方图的数字、TIMESTAMPDATEINTERVAL 表达式。
  • numBins:必须大于 1 的 INTEGER 文本,用于指定直方图计算的箱数。
  • cond:可选的 BOOLEAN 表达式,用于筛选聚合的行。

返回

返回值是字段为 xySTRUCTS 中的 ARRAY,表示直方图箱的中心。x 的类型与 expr 的类型相同,y 的类型则为 DOUBLE。 增加 numBins 值会优化直方图近似值,从而使其更加精细。 但它可以在离群值周围引入伪影。 通常,20-40 个箱对于直方图是有效的,尽管倾斜或较小的数据集可能需要更多的箱。请注意,此函数会创建具有非统一箱宽度的直方图。 它在直方图的均方误差方面不提供任何保证,但实际上与其他计算包生成的直方图相当。

指定 DISTINCT 会导致函数仅对一组唯一的 expr 值进行操作。

示例

> SELECT histogram_numeric(col, 5)
    FROM VALUES (0), (1), (2), (10) AS tab(col);
 [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]

> SELECT histogram_numeric(col, 5)
    FROM VALUES (0L), (1L), (2L), (10L) AS tab(col);
 [{"x":0,"y":1.0},{"x":1,"y":1.0},{"x":2,"y":1.0},{"x":10,"y":1.0}]

> SELECT histogram_numeric(col, 5)
    FROM VALUES (0F), (1F), (2F), (10F) AS tab(col);
 [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]

> SELECT histogram_numeric(col, 5)
    FROM VALUES (0D), (1D), (2D), (10D) AS tab(col);
 [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]

> SELECT histogram_numeric(col, 5)
    FROM VALUES (INTERVAL 0 YEAR), (INTERVAL 1 YEAR), (INTERVAL 2 YEAR),
    (INTERVAL 3 YEAR) AS tab(col);
 [{"x":0-0,"y":1.0},{"x":1-0,"y":1.0},{"x":2-0,"y":1.0},{"x":3-0,"y":1.0}]

> SELECT histogram_numeric(col, 5)
    FROM VALUES (INTERVAL 0 DAY), (INTERVAL 1 DAY), (INTERVAL 2 DAY),
    (INTERVAL 3 DAY) AS tab(col);
  [{"x":0 00:00:00.000000000,"y":1.0},{"x":1 00:00:00.000000000,"y":1.0},{"x":2 00:00:00.000000000,"y":1.0},{"x":3 00:00:00.000000000,"y":1.0}]

> SELECT histogram_numeric(col, 5)
    FROM VALUES (TIMESTAMP '2020-01-01'), (TIMESTAMP'2020-02-01'),
    (TIMESTAMP'2020-03-01'), (TIMESTAMP'2020-10-01') AS tab(col)
  [{"x":2020-01-01 00:00:00,"y":1.0},{"x":2020-02-01 00:00:00,"y":1.0},{"x":2020-03-01 00:00:00,"y":1.0},{"x":2020-10-01 00:00:00,"y":1.0}]