差异模式挖掘(表格数据下钻分析)函数

差异模式挖掘函数基于给定的多属性字段样本,在给定的判别条件下,分析出影响该条件划分的差异化模式集合,帮助您快速诊断导致当前判别条件差异的原因。

diff_patterns

用于比较两个具有相同结构的数据集,并找出表格这两个数据集之间差异的离散属性(维度)模式。专指对于表格数据的模板(频繁集)比较。

语法

diff_patterns($TABLE, $HEADER, $SPLIT_COLUMN, $TEST_VAL, $CONTROL_VAL, $METRIC_COLUMN, $WEIGHT_COLUMN, $PARAM)

参数说明

重要

在使用差异模式统计函数时,必须确保 $SPLIT_COLUMN 中包含 $TEST_VAL 和 $CONTROL_VAL 的行,即测试组和对照组均有数据存在。缺少任意一组的数据将导致无法进行比较。

参数

数据类型

是否必选

说明

$TABLE

row<array<T>, array<E>, ..., array<F>>

diff_patterns 待下钻的输入数据的表格,每一列是一个待下钻的维度列。

$HEADER

array<varchar>

列名,和$TABLE对应,header名字数和table列的数量一致。

$SPLIT_COLUMN

varchar

标注列的列名,用于区分不同的分组的标签列名称,标签类目前只支持string类型。

$TEST_VAL

varchar

测试组数据对应的 $SPLIT_COLUMN 中的值。

需要保证 $TEST_VAL 在 $SPLIT_COLUMN 出现过,不然无法区分两组。通常来说,$TEST_VAL对应于异常值。

$CONTROL_VAL

varchar

对照组数据对应的 $SPLIT_COLUMN 中的值。

需要保证 $TEST_VAL 在 $SPLIT_COLUMN 出现过,不然无法区分两组。$CONTROL_VAL对应于正常值。

$METRIC_COLUMN

double/int

指标列,默认为 '' (没有指标列)。

如果我们不仅关注数据的分组,还关注不同维度组合对于某个指标的影响,比如说'qps'这一列记录了我们关注的指标,那么我们通过增加指标列,来进一步让算法能计算出不同维度组合在指标列上面的差异。

$WEIGHT_COLUMN

varchar

权重列,默认为 '' (没有权重列)。

认为各行的权重相同。如果设置了的话,我们的全部计算过程都会和权重有关。比如说某一行的权重为2而其他行的权重为1,那么这一行相当于出现了两次。

$PARAM

varchar

参见param参数说明

param参数说明

参数名

参数解释

参数类型

是否必填

默认值

取值范围

minimum_support_fraction

输出的pattern在测试组的最低的支持度。比如某个pattern在测试组出现的频率是0.1,那么我们说这个pattern的支持度是0.1。这个参数可以通过控制pattern的敏感度控制pattern的数量。

double

0.05

(0, 1)

allow_sample

是否允许采样计算。如果不允许采样计算的话

bool

true

[true, false]

agg_op

一个pattern会筛选出很多行数据,我们会计算这些行在METRIC_COLUMN的聚合值。这个参数是指定我们考虑的聚合operation。如果METRIC_COLUMN没有取值的话,默认测试组取1,对照组取0。

string

"avg"

["sum", "avg", "max", "min", "count", "p95"]

shapley_value_min_fraction

某个条件的最低的shapley value占比。这个参数可以控制一个pattern里面条件的数量

double

0.1

(0, 1)

示例

  • 查询分析:

    关于"set session enable_remote_functions=true ;set session velox_support_row_constructor_enabled=true; ", 目前功能处于公测阶段,需手动添加该 flag。后续版本将移除此要求,实现自动化支持。
    * | set session enable_remote_functions=true ;set session velox_support_row_constructor_enabled=true; 
    with t0 as (select  JSON_EXTRACT_SCALAR(entity, '$.platform') AS platform,  JSON_EXTRACT_SCALAR(entity, '$.region') AS region, cast(value as double) as value, if((value > 100), 'true', 'false') as anomaly_label from log), 
    t1 as ( select array_agg(platform) as platform, array_agg(region) as region, array_agg(anomaly_label) as anomaly_label, array_agg(value) as value from t0),
    t2 as (select row(platform, region, anomaly_label, value) as table_row from t1),
    t3 as (select diff_patterns(table_row, ARRAY['platform', 'region', 'anomaly_label', 'value'],'anomaly_label', 'true', 'false', 'value') as ret from t2)
    select * from t3
  • 输出结果:

    [["\"platform\"='Amazon Prime'","\"platform\"='Console'","\"platform\"='VR'","\"platform\"='Mobile'","\"platform\"='PC'","\"platform\"='Hulu'","\"platform\"='Netflix' AND \"region\"='Asia'","\"platform\"='YouTube'","\"platform\"='Disney+' AND \"region\"='South America'"],[18720,56921,44516,47520,44640,8640,4320,15840,4320],[0,4999,7324,5760,4320,0,0,1440,0],[0.05248902272841978,0.15960083668399478,0.12481844742405635,0.13324136538752713,0.1251661311216164,0.024225702797732206,0.012112851398866103,0.04441378846250904,0.012112851398866103],[0.0,0.0032988991351195829,0.004833194091941553,0.003801092022062172,0.0028508190165466289,0.0,0.0,0.000950273005515543,0.0],[0.05248902272841978,0.1563019375488752,0.1199852533321148,0.12944027336546497,0.12231531210506977,0.024225702797732206,0.012112851398866103,0.0434635154569935,0.012112851398866103],[608.6946996258738,552.6152340898742,501.3011076154948,525.0147121416072,580.8151715760738,409.282718625793,631.6115865793741,603.7699053925991,526.960199868643],[0.0,58.46518308219622,55.396758003518488,49.12926473747681,44.68946999270535,0.0,0.0,20.76262974302821,0.0],null]

返回结果说明

参数

类型

说明

对应示例

$RET.patterns

array<varchar>

表格模板,频繁集。

["\"platform\"='Amazon Prime'","\"platform\"='Console'","\"platform\"='VR'","\"platform\"='Mobile'","\"platform\"='PC'","\"platform\"='Hulu'","\"platform\"='Netflix' AND \"region\"='Asia'","\"platform\"='YouTube'","\"platform\"='Disney+' AND \"region\"='South America'"]

$RET.test_supports

array<bigint>

测试组中对应模板出现的频次。

[18720,56921,44516,47520,44640,8640,4320,15840,4320],[0,4999,7324,5760,4320,0,0,1440,0]

$RET.control_supports

array<bigint>

对照组中对应模板出现的频次。

通过对比,diff_patterns返回结果中靠前的结果在测试组 (异常组) 里面的出现的频次远高于在对照组 (正常组) 中出现的频次。

0,4999,7324,5760,4320,0,0,1440,0]

$RET.test_ratio

array<double>

测试组(label=1)中对应模板占测试集数据的比例。

[0.05248902272841978,0.15960083668399478,0.12481844742405635,0.13324136538752713,0.1251661311216164,0.024225702797732206,0.012112851398866103,0.04441378846250904,0.012112851398866103]

$RET.control_ratio

array<double>

对照组(label=0)中对应模板占验证集数据的比例。

[0.0,0.0032988991351195829,0.004833194091941553,0.003801092022062172,0.0028508190165466289,0.0,0.0,0.000950273005515543,0.0]

$RET.difference

array<double>

测试组数据相较对照组数据的差异。

[0.05248902272841978,0.1563019375488752,0.1199852533321148,0.12944027336546497,0.12231531210506977,0.024225702797732206,0.012112851398866103,0.0434635154569935,0.012112851398866103]

$RET.test_average_metric

array<double>

满足pattern的数据中 METRIC_COLUMN的平均值。如果METRIC_COLUMN为'',这个返回值不具备参考意义。

[608.6946996258738,552.6152340898742,501.3011076154948,525.0147121416072,580.8151715760738,409.282718625793,631.6115865793741,603.7699053925991,526.960199868643]

$RET.control_average_metric

array<double>

不满足pattern的数据中 METRIC_COLUMN的平均值。如果METRIC_COLUMN为'',这个返回值不具备参考意义。

[0.0,58.46518308219622,55.396758003518488,49.12926473747681,44.68946999270535,0.0,0.0,20.76262974302821,0.0]

$RET.error_msg

varchar/null

报错信息。如果没有报错,则返回值为null

null