Approximate functions estimate distinct counts, percentiles, and histograms with low computational cost.
SLS supports the following approximate functions.
|
Function |
Syntax |
Description |
SQL support |
SPL support |
|
approx_distinct(x) |
Estimates the distinct value count in x. Default standard error: 2.3%. |
✔ |
❌ |
|
|
approx_distinct(x, e) |
Estimates the distinct value count in x with a custom standard error. |
✔ |
❌ |
|
|
approx_percentile(x, percentage) |
Returns the approximate percentage percentile of x (sorted ascending). |
✔ |
❌ |
|
|
approx_percentile(x, array[percentage01, percentage02...]) |
Returns the approximate values of x at the percentage01 and percentage02 percentiles (sorted ascending). |
✔ |
❌ |
|
|
approx_percentile(x, weight, percentage) |
Returns the approximate percentage percentile of x weighted by x × weight (sorted ascending). |
✔ |
❌ |
|
|
approx_percentile(x, weight, array[percentage01, percentage02...]) |
Returns the approximate values of x at the percentage01 and percentage02 percentiles, weighted by x × weight. |
✔ |
❌ |
|
|
approx_percentile(x, weight, percentage, accuracy) |
Returns the approximate percentage percentile of x weighted by x × weight, with configurable accuracy. |
✔ |
❌ |
|
|
numeric_histogram(bucket, x) |
Computes an approximate histogram for x with the specified bucket count. Returns a JSON object. |
✔ |
❌ |
|
|
numeric_histogram(bucket, x, weight) |
Computes a weighted approximate histogram for x with the specified bucket count. Returns a JSON object. |
✔ |
❌ |
|
|
numeric_histogram_u(bucket, x) |
Computes an approximate histogram for x with the specified bucket count. Returns a multi-row, multi-column table. |
✔ |
❌ |
|
|
|
Estimates the top |
✔ |
❌ |
Approx_distinct function
The approx_distinct function estimates the distinct value count in x.
Syntax
-
Estimates the distinct value count in x. Default standard error: 2.3%.
approx_distinct(x) -
Estimates the distinct value count in x with a custom standard error.
approx_distinct(x, e)
Parameters
|
Parameter |
Description |
|
x |
Column name. Supports any data type. |
|
e |
Custom standard error. Valid range: [0.0115, 0.26]. |
Return value type
BIGINT
Examples
-
Example 1: Calculate PV with
countand estimate UV withapprox_distincton theclient_ipfield. Standard error: 2.3%.-
Query statement
* | SELECT count(*) AS PV, approx_distinct(client_ip) AS UV -
The query returns a PV of 941,787 and a UV of 723,040.
-
-
Example 2: Calculate PV and estimate UV on the
client_ipfield with a custom standard error of 10%.-
Query statement
* | SELECT count(*) AS PV, approx_distinct(client_ip,0.1) AS UV -
The query returns a PV of 9,095 and a UV of 7,946.
-
Approx_percentile function
The approx_percentile function returns the approximate percentile of x at a given percentage. Results are approximate, without stability guarantees.
Syntax
-
Returns the approximate percentage percentile of x (sorted ascending). Return type: double.
approx_percentile(x, percentage) -
Returns the approximate values of x at the percentage01 and percentage02 percentiles (sorted ascending). Return type: array(double,double).
approx_percentile(x, array[percentage01, percentage02...]) -
Returns the approximate percentage percentile of x weighted by x × weight. Return type:
double.approx_percentile(x, weight, percentage) -
Returns the approximate values of x at the percentage01 and percentage02 percentiles, weighted by x × weight. Return type: array(double,double).
approx_percentile(x, weight, array[percentage01, percentage02...]) -
Returns the approximate percentage percentile of x weighted by x × weight, with configurable accuracy. Return type: double.
approx_percentile(x, weight, percentage, accuracy)
Parameters
|
Parameter |
Description |
|
x |
Column name. Must be DOUBLE type. |
|
percentage |
Percentile value. Valid range: [0, 1]. |
|
accuracy |
Accuracy value. Valid range: (0, 1). |
|
weight |
Weight. Must be an integer greater than 1. Values are sorted by x × weight. |
Return value type
DOUBLE or ARRAY(DOUBLE)
Examples
-
Example 1: Return the approximate 50th percentile (median) of request_time.
-
Query statement
* | SELECT approx_percentile(request_time, 0.5) -
The query returns
45.0, which is the approximate median (50th percentile) ofrequest_time.
-
-
Example 2: Return the approximate 10th, 20th, and 70th percentiles of request_time.
-
Query statement
* | SELECT approx_percentile(request_time, array[0.1, 0.2, 0.7]) -
The query returns
[17.0, 24.0, 59.0], which are the approximate values ofrequest_timeat the 10th, 20th, and 70th percentiles, respectively.
-
-
Example 3: Weighted 50th percentile of request_time. Weight: 100 if request_time < 20, else 10.
-
Query statement
* | SELECT approx_percentile( request_time, CASE WHEN request_time < 20 THEN 100 ELSE 10 END, 0.5 ) -
The query returns
18.0.
-
-
Example 4: Weighted 80th and 90th percentiles of request_time. Weight: 100 if request_time < 20, else 10.
-
Query statement
* | SELECT approx_percentile( request_time, CASE WHEN request_time < 20 THEN 100 ELSE 10 END, array [0.8, 0.9] ) -
The query returns
[48.0, 64.0].
-
-
Example 5: Weighted 50th percentile of request_time, accuracy 0.2. Weight: 100 if request_time < 20, else 10.
-
Query statement
* | SELECT approx_percentile( request_time, CASE WHEN request_time < 20 THEN 100 ELSE 10 END, 0.5, 0.2 )
-
Numeric_histogram function
The numeric_histogram function computes an approximate histogram of x with a given bucket count. Returns a JSON object.
Syntax
-
Computes an approximate histogram for x with the specified bucket count.
numeric_histogram(bucket, x) -
Computes a weighted approximate histogram for x with the specified bucket count.
numeric_histogram(bucket, x, weight)
Parameters
|
Parameter |
Description |
|
bucket |
Number of histogram buckets. Must be BIGINT. |
|
x |
Column name. Must be DOUBLE type. |
|
weight |
Weight. Must be an integer greater than 0. Values are grouped by x × weight. |
Return value type
JSON
Examples
-
Example 1: Compute an approximate histogram of request durations for POST requests.
-
Query statement
request_method:POST | SELECT numeric_histogram(10, request_time) -
Returns a JSON object mapping bucket upper bounds to counts, e.g.
"45.03638445951907":11351.0,"31.617058096415327":16180.0, and"51.0979254315973":13786.0.
-
-
Example 2: Compute a weighted approximate histogram of request durations for POST requests.
-
Query statement
request_method:POST | SELECT numeric_histogram(10, request_time, CASE WHEN request_time<20 THEN 100 ELSE 10 END) -
Returns a JSON column
_col0with 10 key-value pairs mapping ranges to frequencies, e.g."60.63632633267428":268070.0,"76.41340599455032":275250.0, and"15.028066666666671":1500000.0.
-
Numeric_histogram_u function
The numeric_histogram_u function computes an approximate histogram of x with a given bucket count. Returns a multi-row, multi-column table.
Syntax
numeric_histogram_u(bucket, x)
Parameters
|
Parameter |
Description |
|
bucket |
Number of histogram buckets. Must be BIGINT. |
|
x |
Column name. Must be DOUBLE type. |
Return value type
A table with bucket_avg and count columns.
Examples
Compute an approximate histogram of POST request durations.
-
Query statement
request_method:POST | SELECT numeric_histogram_u(10, request_time) -
Returns 10 rows with
bucket_avgandcountcolumns. For example: bucket_avg 14.20, count 18806.0; bucket_avg 70.78, count 13442.0.
Approx_most_frequent function
Estimates the top k most frequent values in column x and their approximate counts.
Syntax
approx_most_frequent(k, x)
Parameters
|
Parameter |
Description |
|
k |
Number of top frequent values to return. For example, 5 returns the top 5 values and their approximate counts. |
|
x |
Column name. Must be VARCHAR type. |
Return value
map(VARCHAR, BIGINT)
Example
Get the three most frequent values from the content field.
-
Sample data
content: 'A' 'B' 'A' 'C' 'A' 'B' 'C' 'D' 'E' -
Query statement
SELECT approx_most_frequent(3, content) -
Output
The query returns
{"A":3, "B":2, "C":2}.