多变量模式识别函数

日志服务通过异常检测功能,自动识别服务系统中的异常状态及其根源。该功能可自动识别指标的变化是否偏离正常模式,并结合指标当前模式与机器学习技术来进行异常检测。本文主要介绍异常检测功能(多变量模式识别函数)。

多变量模式识别函数列表

函数名称

语法

说明

返回值类型

summarize函数

  • summarize(array(array(double)) data_samples)

  • summarize(array(array(double)) data_samples, array(double) weights)

通过对给定样本及样本权重(可选)进行统计学习,识别并返回多变量的统计模式,输出结果为统计模式。统计模式涵盖多种统计量与联合统计量,例如平均值、标准差、协方差矩阵等。

varchar

merge_summary函数

  • merge_summary(varchar summary1, varchar summary2)

  • merge_summary(varchar summary1, double weight1, varchar summary2, double weight2)

将不同阶段分别用summarize函数学习得到的模式进行合并,包括同一数据集在不同时段学习出的模式,或者来自两个独立数据集各自学习出的模式。

varchar

normalize_vector函数

normalize_vector(varchar summary, array(double) x_vector)

使用summarize函数获得的多变量模式summary,对新样本向量x_vector进行归一化处理,确保其每个分量都被映射至[0, 1]的区间。

array(double)

standardize_vector函数

standardize_vector(varchar summary, array(double) x_vector)

使用summarize函数获得的多变量模式summary,对新样本向量x_vector进行标准化处理,将向量分量标准化为均值0和标准差1的某个值。

array(double)

mah_distance函数

mah_distance(varchar summary, array(double) x_vector)

使用summarize函数获得的多变量模式summary,对新样本x_vector算其马氏距离。马氏距离能够有效处理不同变量间的尺度差异问题,通过标准化给定样本向量x_vector到向量重心的距离来进行衡量。当该距离值为1时,表示样本向量与重心之间的距离等于所有向量到重心的平均距离。

double

standard_distance函数

standard_distance(varchar summary, double metric_value, int element_index)

使用summarize函数获得的多变量模式summary,计算新样本metric_value的标准化距离。标准化距离和马氏距离类似,马氏距离是计算由多个指标构成的一个向量到重心的标准化距离,而标准化距离是单个指标到重心的距离,其中element_index是该指标在向量中的索引,metric_value是要标准化的指标值。

double

anomaly_level函数

  • anomaly_level(varchar summary, array(double) x_vector)

  • anomaly_level(varchar summary, array(double) x_vector, int element_index)

使用summarize函数获得的多变量模式summary,对新样本x_vector计算其马氏距离,接着,通过对结果进行向下取整处理,得到1、2、3、4等,来分别表示不同级别的异常概率,具体为:0.1(一级异常)、0.01(二级异常)、0.001(三级异常)、0.0001(四级异常)等。异常等级增加表明概率减小,样本点的可疑性增加。用户通常会对异常检测结果设置阈值过滤,例如仅保留四级异常及以上的异常。

若指定了element_index,则仅计算向量指定索引分量的异常概率;若未指定,则计算所有分量的异常概率。

array(double)

summarize函数

通过对给定样本及样本权重(可选)进行统计学习,识别并返回多变量的统计模式,输出结果为统计模式。统计模式涵盖多种统计量与联合统计量,例如平均值、标准差、协方差矩阵等。

varchar summarize(array(array(double)) data_samples)

varchar summarize(array(array(double)) data_samples, array(double) weights)

参数

说明

data_samples

一个二维数组,可以看成一个二维表格。每列表示一个变量,每行表示一次观测样本。

weights

可选参数。表示样本的权重,表示为一个与data_samples第一维度长度相同的一维数组。若未指定,则默认所有样本权重相等。

使用示例

  • 查询和分析语句

    * | with data_table as
    (
        select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
        select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
        select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
        select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
        select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
        select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
        select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
        select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
    )
        select entity_group,
            summarize(array_agg(features)) as statistical_summary
        from data_table
        group by entity_group
  • 返回结果

    entity_group

    statistical_summary

    A

    {
      "sampleCount": 8,
      "vectorSize": 4,
      "means": [
        11.5,
        12.5,
        9.25,
        0.0
      ],
      "stdDevs": [
        6.87386354243376,
        6.87386354243376,
        7.361215932167728,
        0.0
      ],
      "variances": [
        47.25,
        47.25,
        54.1875,
        0.0
      ],
      "mins": [
        1.0,
        2.0,
        1.0,
        0.0
      ],
      "maxs": [
        22.0,
        23.0,
        21.0,
        0.0
      ],
      "covariance": [
        [
          47.25,
          47.25,
          19.125,
          0.0
        ],
        [
          47.25,
          47.25,
          19.125,
          0.0
        ],
        [
          19.125,
          19.125,
          54.1875,
          0.0
        ],
        [
          0.0,
          0.0,
          0.0,
          0.0
        ]
      ],
      "correlations": [
        [
          1.0,
          1.0,
          0.37796447300922725,
          0.0
        ],
        [
          1.0,
          1.0,
          0.37796447300922725,
          0.0
        ],
        [
          0.37796447300922725,
          0.37796447300922725,
          1.0,
          0.0
        ],
        [
          0.0,
          0.0,
          0.0,
          1.0
        ]
      ],
      "sums": [
        92.0,
        100.0,
        74.0,
        0.0
      ],
      "weightSum": 8.0,
      "sumProducts": [
        [
          1436.0,
          1528.0,
          1004.0,
          0.0
        ],
        [
          1528.0,
          1628.0,
          1078.0,
          0.0
        ],
        [
          1004.0,
          1078.0,
          1118.0,
          0.0
        ],
        [
          0.0,
          0.0,
          0.0,
          0.0
        ]
      ],
      "isSummarized": true
    }
    

    返回参数说明:

    参数

    说明

    sampleCount

    样本数量。

    vectorSize

    向量长度。

    means

    所有向量各分量的平均值。

    stdDevs

    所有向量各分量的标准差。

    variances

    所有向量各分量的方差。

    mins

    所有向量各分量的最小值。

    maxs

    所有向量各分量的最大值。

    covariance

    所有向量各分量间的协方差矩阵。

    correlations

    所有向量各分量间的相关系数矩阵。

    sums

    所有向量各分量的和。

    weightSum

    所有样本权重的和。

    sumProducts

    合并统计模式时候用到的中间结果。

    isSummarized

    模式统计计算是否正常返回。

    • true:正常返回。

    • false:非正常返回。

merge_summary函数

将不同阶段分别用summarize函数学习得到的模式进行合并,包括同一数据集在不同时段学习出的模式,或者来自两个独立数据集各自学习出的模式。

varchar merge_summary(varchar summary1, varchar summary2)

varchar merge_summary(varchar summary1, double weight1, varchar summary2, double weight2)

参数

说明

summary1

summarize函数学习过程得到的模式。

weight1

summary1模式对应的整体权重。

summary2

summarize函数学习过程得到的模式。

weight2

summary2模式对应的整体权重。

使用示例

  • 查询和分析语句

    * | with data_table_01 as
    (
        select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
        select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
        select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
        select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features
    ),
    
    summaries_01 as
    (
        select entity_group,
            summarize(array_agg(features)) as statistical_summary
        from data_table_01
        group by entity_group
    ),
    
    data_table_02 as
    (
        select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
        select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
        select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
        select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
    ),
    
    summaries_02 as
    (
        select entity_group,
            summarize(array_agg(features)) as statistical_summary
        from data_table_02
        group by entity_group
    )
    select s1.entity_group,
        merge_summary(s1.statistical_summary, s2.statistical_summary) as statistical_summary
    from summaries_01 as s1
        join summaries_02 as s2
        on s1.entity_group = s2.entity_group
  • 返回结果

    statistical_summary为整合后的模式。

    entity_group

    statistical_summary

    2

    {
      "sampleCount": 8,
      "vectorSize": 4,
      "means": [
        11.5,
        12.5,
        9.25,
        0.0
      ],
      "stdDevs": [
        6.87386354243376,
        6.87386354243376,
        7.361215932167728,
        0.0
      ],
      "variances": [
        47.25,
        47.25,
        54.1875,
        0.0
      ],
      "mins": [
        1.0,
        2.0,
        1.0,
        0.0
      ],
      "maxs": [
        22.0,
        23.0,
        21.0,
        0.0
      ],
      "covariance": [
        [
          47.25,
          47.25,
          19.125,
          0.0
        ],
        [
          47.25,
          47.25,
          19.125,
          0.0
        ],
        [
          19.125,
          19.125,
          54.1875,
          0.0
        ],
        [
          0.0,
          0.0,
          0.0,
          0.0
        ]
      ],
      "correlations": [
        [
          1.0,
          1.0,
          0.37796447300922725,
          0.0
        ],
        [
          1.0,
          1.0,
          0.37796447300922725,
          0.0
        ],
        [
          0.37796447300922725,
          0.37796447300922725,
          1.0,
          0.0
        ],
        [
          0.0,
          0.0,
          0.0,
          1.0
        ]
      ],
      "sums": [
        92.0,
        100.0,
        74.0,
        0.0
      ],
      "weightSum": 8.0,
      "sumProducts": [
        [
          1436.0,
          1528.0,
          1004.0,
          0.0
        ],
        [
          1528.0,
          1628.0,
          1078.0,
          0.0
        ],
        [
          1004.0,
          1078.0,
          1118.0,
          0.0
        ],
        [
          0.0,
          0.0,
          0.0,
          0.0
        ]
      ],
      "isSummarized": true
    }

    返回参数说明:

    参数

    说明

    sampleCount

    样本数量。

    vectorSize

    向量长度。

    means

    所有向量各分量的平均值。

    stdDevs

    所有向量各分量的标准差。

    variances

    所有向量各分量的方差。

    mins

    所有向量各分量的最小值。

    maxs

    所有向量各分量的最大值。

    covariance

    所有向量各分量间的协方差矩阵。

    correlations

    所有向量各分量间的相关系数矩阵。

    sums

    所有向量各分量的和。

    weightSum

    所有样本权重的和。

    sumProducts

    合并统计模式时候用到的中间结果。

    isSummarized

    模式统计计算是否正常返回。

    • true:正常返回。

    • false:非正常返回。

normalize_vector函数

使用summarize函数获得的多变量模式summary,对新样本向量x_vector进行归一化处理,确保其每个分量都被映射至[0, 1]的区间。

array(double) normalize_vector(varchar summary, array(double) x_vector)

参数

说明

summary

summarize函数学习过程得到的模式。

x_vector

一组新样本数据。

使用示例

  • 查询和分析语句

    * | with data_table as
    (
        select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
        select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
        select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
        select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
        select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
        select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
        select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
        select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
    ),
    
    summaries as
    (
        select entity_group,
            summarize(array_agg(features)) as statistical_summary
        from data_table
        group by entity_group
    )
    
    select t1.entity_id,
        t1.entity_group,
        normalize_vector(t2.statistical_summary, t1.features) as normalized_features
    from data_table as t1
        join summaries as t2
        on t1.entity_group = t2.entity_group
  • 返回结果

    normalized_features为输入样本向量x_vector归一化处理后的结果。

    entity_id

    entity_group

    normalized_features

    2

    A

    [0.14285714285714286,0.14285714285714286,0.25,0.5]

    4

    A

    [0.42857142857142857,0.42857142857142857,0.0,0.5]

    3

    A

    [0.2857142857142857,0.2857142857142857,0.4,0.5]

    ...

    ...

    ...

standardize_vector函数

使用summarize函数获得的多变量模式summary,对新样本向量x_vector进行标准化处理,将向量分量标准化为均值0和标准差1的某个值。

array(double) standardize_vector(varchar summary, array(double) x_vector)

参数

说明

summary

summarize函数学习过程得到的模式。

x_vector

一组新样本数据。

使用示例

  • 查询和分析语句

    * | with data_table as
    (
        select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
        select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
        select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
        select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
        select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
        select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
        select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
        select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
    ),
    
    summaries as
    (
        select entity_group,
            summarize(array_agg(features)) as statistical_summary
        from data_table
        group by entity_group
    )
    
    select t1.entity_id,
        t1.entity_group,
        standardize_vector(t2.statistical_summary, t1.features) as standardized_features
    from data_table as t1
        join summaries as t2
        on t1.entity_group = t2.entity_group
  • 返回结果

    standardized_features为输入样本向量x_vector标准化处理后的结果。

    entity_id

    entity_group

    standardized_features

    2

    A

    [-1.0910894511799619,-1.0910894511799619,-0.4415031470273609,0.0]

    4

    A

    [-0.21821789023599237,-0.21821789023599237,-1.1207387578386854,0.0]

    3

    A

    [-0.6546536707079771,-0.6546536707079771,-0.03396178054056622,0.0]

    ...

    ...

    ...

mah_distance函数

使用summarize函数获得的多变量模式summary,对新样本x_vector算其马氏距离。马氏距离能够有效处理不同变量间的尺度差异问题,通过标准化样本向量x_vector到向量重心的距离来进行衡量。当该距离值为1时,表示样本向量与重心之间的距离等于所有向量到重心平均距离。

double mah_distance(varchar summary, array(double) x_vector)

参数

说明

summary

summarize函数学习过程得到的模式。

x_vector

一组新样本数据。

使用示例

  • 查询和分析语句

    * | with data_table as
    (
        select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
        select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
        select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
        select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
        select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
        select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
        select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
        select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
    ),
    
    summaries as
    (
        select entity_group,
            summarize(array_agg(features)) as statistical_summary
        from data_table
        group by entity_group
    )
    
    select t1.entity_id,
        t1.entity_group,
        mah_distance(t2.statistical_summary, t1.features) as std_distance
    from data_table as t1
        join summaries as t2
        on t1.entity_group = t2.entity_group
  • 返回结果

    std_distance为输入样本向量x_vector的马氏距离。

    entity_id

    entity_group

    std_distance

    8

    A

    2.386927730244857

    7

    A

    1.6809080087793125

    1

    A

    1.5554594371997328

    ...

    ...

    ...

standard_distance函数

使用summarize函数获得的多变量模式summary,计算新样本metric_value的标准化距离。使用summarize函数获得的多变量模式summary,计算新样本metric_value的标准化距离。标准化距离和马氏距离类似,马氏距离是计算由多个指标构成的一个向量到重心的标准化距离,而标准化距离是单个指标到重心的距离,其中element_index是该指标在向量中的索引(分量的索引是从0开始的,而非从1开始),metric_value是要标准化的指标值。

double standard_distance(varchar summary, double metric_value, int element_index)

参数

说明

summary

summarize函数学习过程得到的模式。

metric_value

新样本数据。

element_index

summary数组指定分量的索引。分量的索引是从0开始的,而非从1开始。

使用示例

  • 查询和分析语句

    * | with data_table as
    (
        select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
        select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
        select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
        select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
        select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
        select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
        select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
        select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
    ),
    
    summaries as
    (
        select entity_group,
            summarize(array_agg(features)) as statistical_summary
        from data_table
        group by entity_group
    )
    
    select t1.entity_id,
        t1.entity_group,
        standard_distance(t2.statistical_summary, 30, 1) as std_distance
    from data_table as t1
        join summaries as t2
        on t1.entity_group = t2.entity_group
    
  • 返回结果

    std_distance为输入样本metric_value指定索引的标准化距离。

    entity_id

    entity_group

    std_distance

    8

    A

    2.386927730244857

    7

    A

    1.6809080087793125

    1

    A

    1.5554594371997328

    ...

    ...

    ...

anomaly_level函数

使用summarize函数获得的多变量模式summary,对新样本x_vector计算其马氏距离,接着,通过对结果进行向下取整处理,得到1、2、3、4等结果,来分别表示不同级别的异常概率,异常概率的级别具体为:0.1(一级异常)、0.01(二级异常)、0.001(三级异常)、0.0001(四级异常)等。异常等级增加表明概率减小,样本点的可疑性增加。用户通常会对异常检测结果设置阈值过滤,例如仅保留四级异常及以上的异常。

若指定了element_index,则仅计算向量指定索引分量的异常概率;若未指定,则计算所有分量的异常概率。

double anomaly_level(varchar summary, array(double) x_vector)

double anomaly_level(varchar summary, array(double) x_vector, int element_index)

参数

说明

summary

summarize函数学习过程得到的模式。

x_vector

一组新样本数据。

element_index

可选参数。x_vector数组指定索引的元素。

使用示例

  • 查询和分析语句

    * |
    with dummy as
    (
        select sequence(1, 1000) as seq_data,
            count(*) as record_count from log
    ),
    
    sample_data as
    (
        select 'G1' as group_id,
            s.seq_num,
             -- 产生1000个二维的随机向量数据点,向量中心在(100, 5000)这个位置,两个分量的标准差分别为20500
            inverse_normal_cdf(100, 20, random()) as x1,
            inverse_normal_cdf(5000, 500, rand()) as x2
        from dummy,
            unnest(seq_data) as s(seq_num)
    ),
    
    data_summary as
    (
        select group_id,
            summarize(array_agg(array[x1, x2])) as metric_summary
        from sample_data
        group by group_id
    ),
    
    new_data as
    (
        select 'G1' as group_id, 1001 as object_id, 100.0 as x1, 5000.0 as x2
        union all select 'G1' as group_id, 1002 as object_id, 118.0 as x1, 5450.0 as x2
        union all select 'G1' as group_id, 1003 as object_id, 138.0 as x1, 5950.0 as x2
        union all select 'G1' as group_id, 1004 as object_id, 158.0 as x1, 6450.0 as x2
        union all select 'G1' as group_id, 1005 as object_id, 178.0 as x1, 6950.0 as x2
        union all select 'G1' as group_id, 1006 as object_id, 198.0 as x1, 7450.0 as x2
        union all select 'G1' as group_id, 1007 as object_id, 318.0 as x1, 10000.0 as x2
    )
    
    select n.group_id,
        json_extract(s.metric_summary, '$.means') as metric_vector_mean,
        json_extract(s.metric_summary, '$.covariance') as metric_covariance,
        n.object_id,
        n.x1,
        n.x2,
        anomaly_level(s.metric_summary, array[x1, x2]) as anomaly_level
    from data_summary as s
        join new_data as n
            on s.group_id = n.group_id
    order by n.group_id, n.object_id
    limit 100000
    
  • 返回结果

    anomaly_level为输入样本向量x_vector的异常概率。

    group_id

    object_id

    anomaly_level

    G1

    1007

    13.0

    G1

    1006

    5.0

    G1

    1005

    4.0

    ...

    ...

    ...