This topic describes how to develop and use Java user-defined functions (UDFs).
Background
As of v2.2.0, StarRocks supports user-defined functions (UDFs) written in Java.
As of v3.0, StarRocks supports global UDFs. You only need to add the GLOBAL keyword to the relevant SQL statements (CREATE, SHOW, and DROP) to apply them globally. This saves you from executing the statements for each database individually. You can develop custom functions based on your business scenarios to extend StarRocks' capabilities.
StarRocks currently supports the following types of UDFs:
-
scalar user-defined function (scalar UDF)
-
user-defined aggregate function (UDAF)
-
user-defined window function (UDWF)
-
user-defined table-valued function (UDTF)
Prerequisites
Before you use Java UDFs in StarRocks, ensure you meet the following requirements:
-
You must have Apache Maven installed to create and develop Java projects.
-
You must have JDK 1.8 installed on your server.
-
Enable the UDF feature by setting the FE configuration parameter
enable_udftoTRUEon the Instance Configuration page. Then, restart the instance to apply the change.
Data type mapping
|
SQL type |
Java type |
|
BOOLEAN |
java.lang.Boolean |
|
TINYINT |
java.lang.Byte |
|
SMALLINT |
java.lang.Short |
|
INT |
java.lang.Integer |
|
BIGINT |
java.lang.Long |
|
FLOAT |
java.lang.Float |
|
DOUBLE |
java.lang.Double |
|
STRING/VARCHAR |
java.lang.String |
Develop and use UDFs
You will create a Maven project and write the function in Java.
Step 1: Create a Maven project
Create a Maven project with the following basic directory structure.
project
|--pom.xml
|--src
| |--main
| | |--java
| | |--resources
| |--test
|--target
Step 2: Add dependencies
Add the following dependencies to the pom.xml file.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>udf</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.76</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<version>2.10</version>
<executions>
<execution>
<id>copy-dependencies</id>
<phase>package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<outputDirectory>${project.build.directory}/lib</outputDirectory>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.3.0</version>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
</plugins>
</build>
</project>
Step 3: Develop the UDF
Develop a scalar UDF
A scalar UDF processes one row at a time, returning a single value for each row. Typical scalar functions include UPPER, LOWER, ROUND, and ABS.
The following example shows how to extract data from a JSON object. In some business scenarios, the value of a field in JSON data might be a JSON string instead of a JSON object. To extract the nested JSON string, you must use a nested call to the GET_JSON_STRING function, such as GET_JSON_STRING(GET_JSON_STRING('{"key":"{\\"k0\\":\\"v0\\"}"}', "$.key"), "$.k0").
To simplify the SQL statement, you can develop a UDF to extract the JSON string directly, for example, MY_UDF_JSON_GET('{"key":"{\\"k0\\":\\"v0\\"}"}', "$.key.k0").
package com.starrocks.udf.sample;
import com.alibaba.fastjson.JSONPath;
public class UDFJsonGet {
public final String evaluate(String jsonObj, String key) {
if (jsonObj == null || key == null) return null;
try {
// The JSONPath library can fully expand the path, even if a field value is a JSON-formatted string.
return JSONPath.read(jsonObj, key).toString();
} catch (Exception e) {
return null;
}
}
}
Your custom class must implement the following method.
The data types of the input arguments and the return value in the method must match those declared in the CREATE FUNCTION statement in Step 6. The data type mapping must conform to the rules in Data type mapping.
|
Method |
Description |
|
TYPE1 evaluate(TYPE2, ...) |
The |
Develop a UDAF
A UDAF operates on a group of rows and returns a single value. Typical aggregate functions include SUM, COUNT, MAX, and MIN. These functions aggregate data from multiple rows within each GROUP BY group and output a single result for each group.
The following example uses the MY_SUM_INT function. Unlike the built-in SUM function, which returns a BIGINT value, MY_SUM_INT takes an INT argument and returns an INT.
package com.starrocks.udf.sample;
public class SumInt {
public static class State {
int counter = 0;
public int serializeLength() { return 4; }
}
public State create() {
return new State();
}
public void destroy(State state) {
}
public final void update(State state, Integer val) {
if (val != null) {
state.counter+= val;
}
}
public void serialize(State state, java.nio.ByteBuffer buff) {
buff.putInt(state.counter);
}
public void merge(State state, java.nio.ByteBuffer buffer) {
int val = buffer.getInt();
state.counter += val;
}
public Integer finalize(State state) {
return state.counter;
}
}
Your custom class must implement the following methods.
The data types of the input arguments and the return value in the methods must match those declared in the CREATE FUNCTION statement in Step 6. The data type mapping must conform to the rules in Data type mapping.
|
Method |
Description |
|
State create() |
Creates a state object. |
|
void destroy(State) |
Destroys the state object. |
|
void update(State, ...) |
Updates the state. The first argument is the state object, and the subsequent arguments are the input arguments declared for the function. One or more input arguments are supported. |
|
void serialize(State, ByteBuffer) |
Serializes the state. |
|
void merge(State, ByteBuffer) |
Merges a serialized state into the current state. |
|
TYPE finalize(State) |
Computes and returns the final result from the state. |
When developing a UDAF, you must use the java.nio.ByteBuffer buffer class to store the intermediate result and the serializeLength method to specify the length of the serialized intermediate result.
|
Class and method |
Description |
|
java.nio.ByteBuffer() |
This buffer class stores the intermediate result. Because the result is serialized for transfer between nodes, you must use |
|
serializeLength() |
The length in bytes of the serialized intermediate result. The |
Note the following requirements for java.nio.ByteBuffer serialization:
-
Do not rely on the
remaining()method ofByteBufferto deserialize the state. -
Do not call the
clear()method on theByteBuffer. -
The value returned by
serializeLength()must match the actual length of the data written to the buffer. A mismatch causes serialization and deserialization errors.
Develop a UDWF
A UDWF is a special type of aggregate function. Unlike a regular aggregate function, a window function calculates a value over a group of rows (a window) and returns a separate result for each row. Typically, a window function includes an OVER clause that divides rows into partitions. The function then computes a result for each row based on the window of rows it belongs to.
The following example uses the MY_WINDOW_SUM_INT function. Unlike the built-in SUM function, which returns a BIGINT value, MY_WINDOW_SUM_INT takes an INT argument and returns an INT.
package com.starrocks.udf.sample;
public class WindowSumInt {
public static class State {
int counter = 0;
public int serializeLength() { return 4; }
@Override
public String toString() {
return "State{" +
"counter=" + counter +
'}';
}
}
public State create() {
return new State();
}
public void destroy(State state) {
}
public void update(State state, Integer val) {
if (val != null) {
state.counter+=val;
}
}
public void serialize(State state, java.nio.ByteBuffer buff) {
buff.putInt(state.counter);
}
public void merge(State state, java.nio.ByteBuffer buffer) {
int val = buffer.getInt();
state.counter += val;
}
public Integer finalize(State state) {
return state.counter;
}
public void reset(State state) {
state.counter = 0;
}
public void windowUpdate(State state,
int peer_group_start, int peer_group_end,
int frame_start, int frame_end,
Integer[] inputs) {
for (int i = (int)frame_start; i < (int)frame_end; ++i) {
state.counter += inputs[i];
}
}
}
Because a window function is a special type of aggregate function, your custom class must implement all the required UDAF methods, plus the windowUpdate() method.
The data types of the input arguments and the return value in the methods must match those declared in the CREATE FUNCTION statement in Step 6. The data type mapping must conform to the rules in Data type mapping.
Additional required method
|
Method |
Description |
|
|
Updates the window data. For detailed information about window functions, see Window functions. For each input row, the corresponding window information is retrieved to update the intermediate result.
|
Develop a UDTF
A UDTF takes a single row as input and produces a table of output rows. UDTFs are often used for operations like splitting a column into multiple rows.
Currently, UDTFs only support returning multiple rows with a single column.
The following example uses the MY_UDF_SPLIT function. It splits a string by using a space as the delimiter. Both the input argument and return value are of the STRING type.
package com.starrocks.udf.sample;
public class UDFSplit{
public String[] process(String in) {
if (in == null) return null;
return in.split(" ");
}
}
Your custom class must implement the following method.
The data types of the input argument and the return value in the method must match those declared in the CREATE FUNCTION statement in Step 6. The data type mapping must conform to the rules in Data type mapping.
|
Method |
Description |
|
TYPE[] process() |
The |
Step 4: Package the Java project
Run the following command to package the Java project.
mvn package
This command generates two JAR files in the target directory: udf-1.0-SNAPSHOT.jar and udf-1.0-SNAPSHOT-jar-with-dependencies.jar.
Step 5: Upload the project
Upload the udf-1.0-SNAPSHOT-jar-with-dependencies.jar file to an OSS bucket and grant public read permission on the JAR file. For more information, see Simple upload and Bucket ACL.
In Step 6, the FE validates the UDF's JAR file and calculates a checksum. The BE then downloads and executes the JAR file.
Step 6: Create the UDF in StarRocks
StarRocks provides two namespaces for UDFs: database-level and global-level.
-
If you do not require visibility isolation for your UDFs, you can create a global UDF. When referencing a global UDF, you can call it directly by its function name without any catalog or database prefix, which simplifies access.
-
If you require visibility isolation or need to create UDFs with the same name in different databases, you can create a database-level UDF. If your current session is within that database, you can call the function directly by its name. If your session is in a different catalog or database, you must use a fully qualified name, such as
catalog.database.function.
Creating a global UDF requires the system-level CREATE GLOBAL FUNCTION permission. Creating a database-level UDF requires the CREATE FUNCTION permission on the database. Using any UDF requires the USAGE permission on it. For information about how to grant permissions, see GRANT.
After you upload the JAR file, create the corresponding UDF in StarRocks. To create a global UDF, simply add the GLOBAL keyword to the SQL statement.
Syntax
CREATE [GLOBAL][AGGREGATE | TABLE] FUNCTION function_name(arg_type [, ...])
RETURNS return_type
[PROPERTIES ("key" = "value" [, ...]) ]
Parameters
|
Parameter |
Required |
Description |
|
GLOBAL |
No |
Specifies that the UDF is a global UDF. Supported since StarRocks v3.0. |
|
AGGREGATE |
No |
Specifies that the function to be created is a UDAF or UDWF. |
|
TABLE |
No |
Specifies that the function to be created is a UDTF. |
|
function_name |
Yes |
The name of the function. It can include a database name, such as |
|
arg_type |
Yes |
The data type of a function argument. For supported data types, see Data type mapping. |
|
return_type |
Yes |
The data type of the function's return value. For supported data types, see Data type mapping. |
|
properties |
Yes |
Properties related to the function. Different properties are required for different types of UDFs. See the following examples for details. |
Create a scalar UDF
Run the following command to create the scalar UDF from the previous example in StarRocks.
CREATE [GLOBAL] FUNCTION MY_UDF_JSON_GET(string, string)
RETURNS string
PROPERTIES (
"symbol" = "com.starrocks.udf.sample.UDFJsonGet",
"type" = "StarrocksJar",
"file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);
|
Parameter |
Description |
|
symbol |
The fully qualified class name of the UDF, in the format |
|
type |
The type of UDF. For Java UDFs, set this to |
|
file |
The HTTP path to the UDF's JAR file. This should be the HTTP URL of the file in OSS, preferably using an internal endpoint. The format is |
Create a UDAF
Run the following command to create the UDAF from the previous example in StarRocks.
CREATE [GLOBAL] AGGREGATE FUNCTION MY_SUM_INT(INT)
RETURNS INT
PROPERTIES
(
"symbol" = "com.starrocks.udf.sample.SumInt",
"type" = "StarrocksJar",
"file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);
The parameters in the PROPERTIES clause are the same as those described in Create a scalar UDF.
Create a UDWF
Run the following command to create the UDWF from the previous example in StarRocks.
CREATE [GLOBAL] AGGREGATE FUNCTION MY_WINDOW_SUM_INT(Int)
RETURNS Int
PROPERTIES
(
"analytic" = "true",
"symbol" = "com.starrocks.udf.sample.WindowSumInt",
"type" = "StarrocksJar",
"file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);
The analytic property identifies the function as a window function. For a UDWF, set this to true. The other parameters are the same as those described in Create a scalar UDF.
Create a UDTF
Run the following command to create the UDTF from the previous example in StarRocks.
CREATE [GLOBAL] TABLE FUNCTION MY_UDF_SPLIT(string)
RETURNS string
PROPERTIES
(
"symbol" = "com.starrocks.udf.sample.UDFSplit",
"type" = "StarrocksJar",
"file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);
The parameters in the PROPERTIES clause are the same as those described in Create a scalar UDF.
Step 7: Use the UDF
After creating the UDF, you can test and use it.
Use the scalar UDF
Run the following command to use the scalar UDF created in Step 6.
SELECT MY_UDF_JSON_GET('{"key":"{\\"in\\":2}"}', '$.key.in');
Use the UDAF
Run the following command to use the UDAF created in Step 6.
SELECT MY_SUM_INT(col1);
Use the UDWF
Run the following command to use the UDWF created in Step 6.
SELECT MY_WINDOW_SUM_INT(intcol)
OVER (PARTITION BY intcol2
ORDER BY intcol3
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test_basic;
Use the UDTF
Run the following command to use the UDTF from the previous example.
-- Assume that table t1 exists with columns a, b, and c1.
SELECT t1.a,t1.b,t1.c1 FROM t1;
> output:
1,2.1,"hello world"
2,2.2,"hello UDTF."
-- Use the MY_UDF_SPLIT() function.
SELECT t1.a,t1.b, MY_UDF_SPLIT FROM t1, MY_UDF_SPLIT(t1.c1);
> output:
1,2.1,"hello"
1,2.1,"world"
2,2.2,"hello"
2,2.2,"UDTF."
-
In the SELECT list, the first
MY_UDF_SPLITis the default column alias for the output of theMY_UDF_SPLITfunction. -
Currently, you cannot specify a table alias or column aliases for the UDTF's output, such as with
AS t2(f1).
View UDFs
Run the following command to view information about UDFs.
SHOW [GLOBAL] FUNCTIONS;
Delete UDFs
Run the following command to delete a specified UDF.
DROP [GLOBAL] FUNCTION <function_name>(arg_type [, ...]);
FAQ
Q: Can I use static variables when developing a UDF? Will static variables from different UDFs interfere with each other?
A: Yes, you can use static variables. Static variables are isolated between different UDFs and do not interfere with each other, even if the UDFs share the same class name.