diffpatterns plugin
Applies to: ✅ Azure Data Explorer
Compares two datasets of the same structure and finds patterns of discrete attributes (dimensions) that characterize differences between the two datasets. The plugin is invoked with the evaluate
operator.
diffpatterns
was developed to help analyze failures (for example, by comparing failures to non-failures in a given time frame), but can potentially find differences between any two datasets of the same structure.
Note
diffpatterns
aims to find significant patterns (that capture portions of the data difference between the sets) and isn't meant for row-by-row differences.
Syntax
T | evaluate diffpatterns(
SplitColumn,
SplitValueA,
SplitValueB [,
WeightColumn,
Threshold,
MaxDimensions,
CustomWildcard,
...])
Learn more about syntax conventions.
Parameters
Name | Type | Required | Description |
---|---|---|---|
SplitColumn | string |
✔️ | The column name that tells the algorithm how to split the query into datasets. According to the specified values for the SplitValueA and SplitValueB arguments, the algorithm splits the query into two datasets, “A” and “B”, and analyzes the differences between them. As such, the split column must have at least two distinct values. |
SplitValueA | string |
✔️ | A string representation of one of the values in the SplitColumn that was specified. All the rows that have this value in their SplitColumn considered as dataset “A”. |
SplitValueB | string |
✔️ | A string representation of one of the values in the SplitColumn that was specified. All the rows that have this value in their SplitColumn considered as dataset “B”. |
WeightColumn | string |
The column used to consider each row in the input according to the specified weight. Must be a name of a numeric column, such as int , long , real . By default each row has a weight of '1'. To use the default value, input the tilde: ~ . A common usage of a weight column is to take into account sampling or bucketing/aggregation of the data that is already embedded into each row.Example: T | extend splitColumn= iff(request_responseCode == 200, "Success" , "Failure") | evaluate diffpatterns(splitColumn, "Success","Failure", sample_Count) |
|
Threshold | real |
A real in the range of 0.015 to 1. This value sets the minimal pattern ratio difference between the two sets. The default is 0.05. To use the default value, input the tilde: ~ .Example: T | extend splitColumn = iff(request-responseCode == 200, "Success" , "Failure") | evaluate diffpatterns(splitColumn, "Success","Failure", "~", 0.04) |
|
MaxDimensions | int |
Sets the maximum number of uncorrelated dimensions per result pattern. By specifying a limit, you decrease the query runtime. The default is unlimited. To use the default value, input the tilde: ~ .Example: T | extend splitColumn = iff(request-responseCode == 200, "Success" , "Failure") | evaluate diffpatterns(splitColumn, "Success","Failure", "~", "~", 3) |
|
CustomWildcard | string |
Sets the wildcard value for a specific type in the result table that will indicate that the current pattern doesn't have a restriction on this column. The default is null, except for string columns for which the default is an empty string. If the default is a viable value in the data, a different wildcard value should be used. For example, * . To use the default value, input the tilde: ~ .Example: T | extend splitColumn = iff(request-responseCode == 200, "Success" , "Failure") | evaluate diffpatterns(splitColumn, "Success","Failure", "~", "~", "~", int(-1), double(-1), long(0), datetime(1900-1-1)) |
Returns
diffpatterns
returns a small set of patterns that capture different portions of the data in the two sets (that is, a pattern capturing a large percentage of the rows in the first dataset and low percentage of the rows in the second set). Each pattern is represented by a row in the results.
The result of diffpatterns
returns the following columns:
SegmentId: the identity assigned to the pattern in the current query (note: IDs aren't guaranteed to be the same in repeating queries).
CountA: the number of rows captured by the pattern in Set A (Set A is the equivalent of
where tostring(splitColumn) == SplitValueA
).CountB: the number of rows captured by the pattern in Set B (Set B is the equivalent of
where tostring(splitColumn) == SplitValueB
).PercentA: the percentage of rows in Set A captured by the pattern (100.0 * CountA / count(SetA)).
PercentB: the percentage of rows in Set B captured by the pattern (100.0 * CountB / count(SetB)).
PercentDiffAB: the absolute percentage point difference between A and B (|PercentA - PercentB|) is the main measure of significance of patterns in describing the difference between the two sets.
Rest of the columns: are the original schema of the input and describe the pattern, each row (pattern) represents the intersection of the non-wildcard values of the columns (equivalent of
where col1==val1 and col2==val2 and ... colN=valN
for each non-wildcard value in the row).
For each pattern, columns that aren't set in the pattern (that is, without restriction on a specific value) will contain a wildcard value, which is null by default. See in the Arguments section below how wildcards can be manually changed.
- Note: the patterns are often not distinct. They may be overlapping, and usually don't cover all the original rows. Some rows may not fall under any pattern.
Tip
Example
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , 1 , 0)
| project State , EventType , Source , Damage, DamageCrops
| evaluate diffpatterns(Damage, "0", "1" )
Output
SegmentId | CountA | CountB | PercentA | PercentB | PercentDiffAB | State | EventType | Source | DamageCrops |
---|---|---|---|---|---|---|---|---|---|
0 | 2278 | 93 | 49.8 | 7.1 | 42.7 | Hail | 0 | ||
1 | 779 | 512 | 17.03 | 39.08 | 22.05 | Thunderstorm Wind | |||
2 | 1098 | 118 | 24.01 | 9.01 | 15 | Trained Spotter | 0 | ||
3 | 136 | 158 | 2.97 | 12.06 | 9.09 | Newspaper | |||
4 | 359 | 214 | 7.85 | 16.34 | 8.49 | Flash Flood | |||
5 | 50 | 122 | 1.09 | 9.31 | 8.22 | IOWA | |||
6 | 655 | 279 | 14.32 | 21.3 | 6.98 | Law Enforcement | |||
7 | 150 | 117 | 3.28 | 8.93 | 5.65 | Flood | |||
8 | 362 | 176 | 7.91 | 13.44 | 5.52 | Emergency Manager |