Approximate functions

更新时间:
复制 MD 格式

Approximate functions estimate distinct counts, percentiles, and histograms with low computational cost.

SLS supports the following approximate functions.

Important If you want to use strings in analytic statements, you must enclose strings in single quotation marks (''). Strings that are not enclosed or enclosed in double quotation marks ("") indicate field names or column names. For example, 'status' indicates the status string, and status or "status" indicates the status log field.

Function

Syntax

Description

SQL support

SPL support

approx_distinct

approx_distinct(x)

Estimates the distinct value count in x. Default standard error: 2.3%.

approx_distinct(x, e)

Estimates the distinct value count in x with a custom standard error.

approx_percentile

approx_percentile(x, percentage)

Returns the approximate percentage percentile of x (sorted ascending).

approx_percentile(x, array[percentage01, percentage02...])

Returns the approximate values of x at the percentage01 and percentage02 percentiles (sorted ascending).

approx_percentile(x, weight, percentage)

Returns the approximate percentage percentile of x weighted by x × weight (sorted ascending).

approx_percentile(x, weight, array[percentage01, percentage02...])

Returns the approximate values of x at the percentage01 and percentage02 percentiles, weighted by x × weight.

approx_percentile(x, weight, percentage, accuracy)

Returns the approximate percentage percentile of x weighted by x × weight, with configurable accuracy.

numeric_histogram

numeric_histogram(bucket, x)

Computes an approximate histogram for x with the specified bucket count. Returns a JSON object.

numeric_histogram(bucket, x, weight)

Computes a weighted approximate histogram for x with the specified bucket count. Returns a JSON object.

numeric_histogram_u

numeric_histogram_u(bucket, x)

Computes an approximate histogram for x with the specified bucket count. Returns a multi-row, multi-column table.

approx_most_frequent

approx_most_frequent(k, x)

Estimates the top k most frequent values in column x and their approximate counts. Returns a map.

Approx_distinct function

The approx_distinct function estimates the distinct value count in x.

Syntax

  • Estimates the distinct value count in x. Default standard error: 2.3%.

    approx_distinct(x)
  • Estimates the distinct value count in x with a custom standard error.

    approx_distinct(x, e)

Parameters

Parameter

Description

x

Column name. Supports any data type.

e

Custom standard error. Valid range: [0.0115, 0.26].

Return value type

BIGINT

Examples

  • Example 1: Calculate PV with count and estimate UV with approx_distinct on the client_ip field. Standard error: 2.3%.

    • Query statement

      * | SELECT count(*) AS PV, approx_distinct(client_ip) AS UV
    • The query returns a PV of 941,787 and a UV of 723,040.

  • Example 2: Calculate PV and estimate UV on the client_ip field with a custom standard error of 10%.

    • Query statement

      * | SELECT count(*) AS PV, approx_distinct(client_ip,0.1) AS UV
    • The query returns a PV of 9,095 and a UV of 7,946.

Approx_percentile function

The approx_percentile function returns the approximate percentile of x at a given percentage. Results are approximate, without stability guarantees.

Syntax

  • Returns the approximate percentage percentile of x (sorted ascending). Return type: double.

    approx_percentile(x, percentage)
  • Returns the approximate values of x at the percentage01 and percentage02 percentiles (sorted ascending). Return type: array(double,double).

    approx_percentile(x, array[percentage01, percentage02...])
  • Returns the approximate percentage percentile of x weighted by x × weight. Return type: double.

    approx_percentile(x, weight, percentage)
  • Returns the approximate values of x at the percentage01 and percentage02 percentiles, weighted by x × weight. Return type: array(double,double).

    approx_percentile(x, weight, array[percentage01, percentage02...])
  • Returns the approximate percentage percentile of x weighted by x × weight, with configurable accuracy. Return type: double.

    approx_percentile(x, weight, percentage, accuracy)

Parameters

Parameter

Description

x

Column name. Must be DOUBLE type.

percentage

Percentile value. Valid range: [0, 1].

accuracy

Accuracy value. Valid range: (0, 1).

weight

Weight. Must be an integer greater than 1.

Values are sorted by x × weight.

Return value type

DOUBLE or ARRAY(DOUBLE)

Examples

  • Example 1: Return the approximate 50th percentile (median) of request_time.

    • Query statement

      * | SELECT approx_percentile(request_time, 0.5)
    • The query returns 45.0, which is the approximate median (50th percentile) of request_time.

  • Example 2: Return the approximate 10th, 20th, and 70th percentiles of request_time.

    • Query statement

      * | SELECT approx_percentile(request_time, array[0.1, 0.2, 0.7])
    • The query returns [17.0, 24.0, 59.0], which are the approximate values of request_time at the 10th, 20th, and 70th percentiles, respectively.

  • Example 3: Weighted 50th percentile of request_time. Weight: 100 if request_time < 20, else 10.

    • Query statement

      * |
      SELECT
        approx_percentile(
          request_time,
          CASE
            WHEN request_time < 20 THEN 100
            ELSE 10
          END,
          0.5
        )
    • The query returns 18.0.

  • Example 4: Weighted 80th and 90th percentiles of request_time. Weight: 100 if request_time < 20, else 10.

    • Query statement

      * |
      SELECT
        approx_percentile(
          request_time,
          CASE
            WHEN request_time < 20 THEN 100
            ELSE 10
          END,
          array [0.8, 0.9]
        )
    • The query returns [48.0, 64.0].

  • Example 5: Weighted 50th percentile of request_time, accuracy 0.2. Weight: 100 if request_time < 20, else 10.

    • Query statement

      * |
      SELECT
        approx_percentile(
          request_time,
          CASE
            WHEN request_time < 20 THEN 100
            ELSE 10
          END,
          0.5,
          0.2
        )

Numeric_histogram function

The numeric_histogram function computes an approximate histogram of x with a given bucket count. Returns a JSON object.

Syntax

  • Computes an approximate histogram for x with the specified bucket count.

    numeric_histogram(bucket, x)
  • Computes a weighted approximate histogram for x with the specified bucket count.

    numeric_histogram(bucket, x, weight)

Parameters

Parameter

Description

bucket

Number of histogram buckets. Must be BIGINT.

x

Column name. Must be DOUBLE type.

weight

Weight. Must be an integer greater than 0.

Values are grouped by x × weight.

Return value type

JSON

Examples

  • Example 1: Compute an approximate histogram of request durations for POST requests.

    • Query statement

      request_method:POST | SELECT numeric_histogram(10, request_time)
    • Returns a JSON object mapping bucket upper bounds to counts, e.g. "45.03638445951907":11351.0, "31.617058096415327":16180.0, and "51.0979254315973":13786.0.

  • Example 2: Compute a weighted approximate histogram of request durations for POST requests.

    • Query statement

      request_method:POST | SELECT numeric_histogram(10, request_time, CASE WHEN request_time<20 THEN 100 ELSE 10 END)
    • Returns a JSON column _col0 with 10 key-value pairs mapping ranges to frequencies, e.g. "60.63632633267428":268070.0, "76.41340599455032":275250.0, and "15.028066666666671":1500000.0.

Numeric_histogram_u function

The numeric_histogram_u function computes an approximate histogram of x with a given bucket count. Returns a multi-row, multi-column table.

Syntax

numeric_histogram_u(bucket, x)

Parameters

Parameter

Description

bucket

Number of histogram buckets. Must be BIGINT.

x

Column name. Must be DOUBLE type.

Return value type

A table with bucket_avg and count columns.

Examples

Compute an approximate histogram of POST request durations.

  • Query statement

    request_method:POST | SELECT numeric_histogram_u(10, request_time)
  • Returns 10 rows with bucket_avg and count columns. For example: bucket_avg 14.20, count 18806.0; bucket_avg 70.78, count 13442.0.

Approx_most_frequent function

Estimates the top k most frequent values in column x and their approximate counts.

Syntax

approx_most_frequent(k, x)

Parameters

Parameter

Description

k

Number of top frequent values to return. For example, 5 returns the top 5 values and their approximate counts.

x

Column name. Must be VARCHAR type.

Return value

map(VARCHAR, BIGINT)

Example

Get the three most frequent values from the content field.

  • Sample data

    content: 
    'A'
    'B'
    'A'
    'C'
    'A'
    'B'
    'C'
    'D'
    'E'
  • Query statement

    SELECT approx_most_frequent(3, content)
  • Output

    The query returns {"A":3, "B":2, "C":2}.