频繁模式挖掘函数通过分析多维数据,提取显著差异的属性组合并量化其影响,支持多种参数配置以优化挖掘结果。
get_patterns
get_patterns 是频繁集挖掘的算子,除了挖掘频繁项之外,get_patterns还会对挖掘到的频繁项做合并和去重。专指提取表格数据的模板(频繁集)。
语法
get_patterns($TABLE, $HEADER, $PARAM)
参数说明
参数 | 数据类型 | 是否必选 | 说明 |
$TABLE | row<array<T>, array<E>, ..., array<F>> | 是 | 待挖掘频繁项的输入数据的表格,每一列是一个待挖掘的维度列。 |
$HEADER | array<varchar> | 是 | 列名,和$TABLE对应,header名字数和table列的数量一致。 |
$PARAM | varchar | 否 | 参见param参数说明。 |
param参数说明
参数名 | 参数解释 | 参数类型 | 是否必填 | 默认值 | 取值范围 |
minimum_support_fraction | 输出的pattern在测试组的最低的支持度。比如某个pattern在测试组出现的频率是0.1,那么我们说这个pattern的支持度是0.1。这个参数可以通过控制pattern的敏感度控制pattern的数量。 | double | 否 | 0.05 | (0, 1) |
示例
查询分析:
关于"set session enable_remote_functions=true ;set session velox_support_row_constructor_enabled=true; ", 目前功能处于公测阶段,需手动添加该 flag。后续版本将移除此要求,实现自动化支持。
(*)| set session enable_remote_functions=true ;set session velox_support_row_constructor_enabled=true; with t0 as (select JSON_EXTRACT_SCALAR(entity, '$.platform') AS platform, JSON_EXTRACT_SCALAR(entity, '$.region') AS region, cast(value as double) as value, if((value > 100), 'true', 'false') as anomaly_label from log), t1 as ( select array_agg(platform) as platform, array_agg(region) as region, array_agg(anomaly_label) as anomaly_label, array_agg(value) as value from t0), t2 as (select row(platform, region) as table_row from t1), t3 as (select get_patterns(table_row, ARRAY['platform', 'region']) as ret from t2) select * from t3
输出结果:
[["platform=eBay","platform=edX","platform=Amazon","platform=Skillshare","platform=Shopify","platform=Khan Academy","platform=Coursera","platform=Udemy","platform=Alibaba","platform=Taobao","platform=Snapchat","platform=Amazon Prime","platform=YouTube","platform=Hulu","platform=Peloton","platform=Twitter","platform=Fitbit","platform=Nike Training","platform=LinkedIn","platform=Instagram","platform=Disney+","platform=Strava","platform=MyFitnessPal","platform=Facebook","platform=Netflix","platform=Console","platform=Samsung SmartThings","platform=Apple HomeKit","platform=Mobile","platform=PC","platform=Google Home","platform=VR"],[156960,149760,148320,148320,146880,145440,139680,136800,133920,133920,96480,95040,92160,90720,90720,89280,89280,87840,84960,83520,83520,82080,82080,77760,70560,46080,41760,41760,34560,33120,31680,30240],null,null]
返回结果说明
参数 | 类型 | 说明 | 对应示例 |
$RET.patterns | array<varchar> | 表格模板,频繁集。每个varchar是一个用 AND 连接的表达式,类似 |
|
$RET.test_supports | array<bigint> | 每一个模板出现的频次。 |
|
$RET.labels | array<bigint> | 预留的返回值位置,便于后续自动将数据归类。目前始终为 |
|
$RET.error_msg | array<varchar>/null | 报错信息。如果没有报错,则返回值为 |
|