SLS approximate functions: distinct count, percentile, histogram-Simple Log Service(SLS)-阿里云帮助中心

Approximate functions estimate distinct counts, percentiles, and histograms with low computational cost.

SLS supports the following approximate functions.

Important If you want to use strings in analytic statements, you must enclose strings in single quotation marks (''). Strings that are not enclosed or enclosed in double quotation marks ("") indicate field names or column names. For example, 'status' indicates the status string, and status or "status" indicates the status log field.

Function	Syntax	Description	SQL support	SPL support
approx_distinct	approx_distinct(x)	Estimates the distinct value count in x. Default standard error: 2.3%.	✔	❌
approx_distinct	approx_distinct(x, e)	Estimates the distinct value count in x with a custom standard error.	✔	❌
approx_percentile	approx_percentile(x, percentage)	Returns the approximate percentage percentile of x (sorted ascending).	✔	❌
	approx_percentile(x, array[percentage01, percentage02...])	Returns the approximate values of x at the percentage01 and percentage02 percentiles (sorted ascending).	✔	❌
	approx_percentile(x, weight, percentage)	Returns the approximate percentage percentile of x weighted by x × weight (sorted ascending).	✔	❌
	approx_percentile(x, weight, array[percentage01, percentage02...])	Returns the approximate values of x at the percentage01 and percentage02 percentiles, weighted by x × weight.	✔	❌
	approx_percentile(x, weight, percentage, accuracy)	Returns the approximate percentage percentile of x weighted by x × weight, with configurable accuracy.	✔	❌
numeric_histogram	numeric_histogram(bucket, x)	Computes an approximate histogram for x with the specified bucket count. Returns a JSON object.	✔	❌
numeric_histogram	numeric_histogram(bucket, x, weight)	Computes a weighted approximate histogram for x with the specified bucket count. Returns a JSON object.	✔	❌
numeric_histogram_u	numeric_histogram_u(bucket, x)	Computes an approximate histogram for x with the specified bucket count. Returns a multi-row, multi-column table.	✔	❌
approx_most_frequent	`approx_most_frequent(k, x)`	Estimates the top `k` most frequent values in column `x` and their approximate counts. Returns a map.	✔	❌

Approx_distinct function

The approx_distinct function estimates the distinct value count in x.

Syntax

Estimates the distinct value count in x. Default standard error: 2.3%.
```
approx_distinct(x)
```
Estimates the distinct value count in x with a custom standard error.
```
approx_distinct(x, e)
```

Parameters

Parameter	Description
x	Column name. Supports any data type.
e	Custom standard error. Valid range: [0.0115, 0.26].

Return value type

BIGINT

Examples

Example 1: Calculate PV with count and estimate UV with approx_distinct on the client_ip field. Standard error: 2.3%.
- Query statement
```
* | SELECT count(*) AS PV, approx_distinct(client_ip) AS UV
```
- The query returns a PV of 941,787 and a UV of 723,040.
Example 2: Calculate PV and estimate UV on the client_ip field with a custom standard error of 10%.
- Query statement
```
* | SELECT count(*) AS PV, approx_distinct(client_ip,0.1) AS UV
```
- The query returns a PV of 9,095 and a UV of 7,946.

Approx_percentile function

The approx_percentile function returns the approximate percentile of x at a given percentage. Results are approximate, without stability guarantees.

Syntax

Returns the approximate percentage percentile of x (sorted ascending). Return type: double.
```
approx_percentile(x, percentage)
```
Returns the approximate values of x at the percentage01 and percentage02 percentiles (sorted ascending). Return type: array(double,double).
```
approx_percentile(x, array[percentage01, percentage02...])
```
Returns the approximate percentage percentile of x weighted by x × weight. Return type: double.
```
approx_percentile(x, weight, percentage)
```
Returns the approximate values of x at the percentage01 and percentage02 percentiles, weighted by x × weight. Return type: array(double,double).
```
approx_percentile(x, weight, array[percentage01, percentage02...])
```
Returns the approximate percentage percentile of x weighted by x × weight, with configurable accuracy. Return type: double.
```
approx_percentile(x, weight, percentage, accuracy)
```

Parameters

Parameter	Description
x	Column name. Must be DOUBLE type.
percentage	Percentile value. Valid range: [0, 1].
accuracy	Accuracy value. Valid range: (0, 1).
weight	Weight. Must be an integer greater than 1. Values are sorted by x × weight.

Return value type

DOUBLE or ARRAY(DOUBLE)

Examples

Example 1: Return the approximate 50th percentile (median) of request_time.
- Query statement
```
* | SELECT approx_percentile(request_time, 0.5)
```
- The query returns 45.0, which is the approximate median (50th percentile) of request_time.
Example 2: Return the approximate 10th, 20th, and 70th percentiles of request_time.
- Query statement
```
* | SELECT approx_percentile(request_time, array[0.1, 0.2, 0.7])
```
- The query returns [17.0, 24.0, 59.0], which are the approximate values of request_time at the 10th, 20th, and 70th percentiles, respectively.

Example 3: Weighted 50th percentile of request_time. Weight: 100 if request_time < 20, else 10.

Query statement

* |
SELECT
  approx_percentile(
    request_time,
    CASE
      WHEN request_time < 20 THEN 100
      ELSE 10
    END,
    0.5
  )

The query returns 18.0.

Example 4: Weighted 80th and 90th percentiles of request_time. Weight: 100 if request_time < 20, else 10.
- Query statement
```
* |
SELECT
  approx_percentile(
    request_time,
    CASE
      WHEN request_time < 20 THEN 100
      ELSE 10
    END,
    array [0.8, 0.9]
  )
```
- The query returns [48.0, 64.0].

Example 5: Weighted 50th percentile of request_time, accuracy 0.2. Weight: 100 if request_time < 20, else 10.

Query statement

* |
SELECT
  approx_percentile(
    request_time,
    CASE
      WHEN request_time < 20 THEN 100
      ELSE 10
    END,
    0.5,
    0.2
  )

Numeric_histogram function

The numeric_histogram function computes an approximate histogram of x with a given bucket count. Returns a JSON object.

Syntax

Computes an approximate histogram for x with the specified bucket count.
```
numeric_histogram(bucket, x)
```
Computes a weighted approximate histogram for x with the specified bucket count.
```
numeric_histogram(bucket, x, weight)
```

Parameters

Parameter	Description
bucket	Number of histogram buckets. Must be BIGINT.
x	Column name. Must be DOUBLE type.
weight	Weight. Must be an integer greater than 0. Values are grouped by x × weight.

Return value type

JSON

Examples

Example 1: Compute an approximate histogram of request durations for POST requests.
- Query statement
```
request_method:POST | SELECT numeric_histogram(10, request_time)
```
- Returns a JSON object mapping bucket upper bounds to counts, e.g. "45.03638445951907":11351.0, "31.617058096415327":16180.0, and "51.0979254315973":13786.0.
Example 2: Compute a weighted approximate histogram of request durations for POST requests.
- Query statement
```
request_method:POST | SELECT numeric_histogram(10, request_time, CASE WHEN request_time<20 THEN 100 ELSE 10 END)
```
- Returns a JSON column _col0 with 10 key-value pairs mapping ranges to frequencies, e.g. "60.63632633267428":268070.0, "76.41340599455032":275250.0, and "15.028066666666671":1500000.0.

Numeric_histogram_u function

The numeric_histogram_u function computes an approximate histogram of x with a given bucket count. Returns a multi-row, multi-column table.

Syntax

numeric_histogram_u(bucket, x)

Parameters

Parameter	Description
bucket	Number of histogram buckets. Must be BIGINT.
x	Column name. Must be DOUBLE type.

Return value type

A table with bucket_avg and count columns.

Examples

Compute an approximate histogram of POST request durations.

Query statement

request_method:POST | SELECT numeric_histogram_u(10, request_time)

Returns 10 rows with bucket_avg and count columns. For example: bucket_avg 14.20, count 18806.0; bucket_avg 70.78, count 13442.0.

Approx_most_frequent function

Estimates the top k most frequent values in column x and their approximate counts.

Syntax

approx_most_frequent(k, x)

Parameters

Parameter	Description
k	Number of top frequent values to return. For example, 5 returns the top 5 values and their approximate counts.
x	Column name. Must be VARCHAR type.

Return value

map(VARCHAR, BIGINT)

Example

Get the three most frequent values from the content field.

Sample data

content: 
'A'
'B'
'A'
'C'
'A'
'B'
'C'
'D'
'E'

Query statement
```
SELECT approx_most_frequent(3, content)
```
Output

The query returns {"A":3, "B":2, "C":2}.