Java UDF

更新时间:
复制 MD 格式

This topic describes how to develop and use Java user-defined functions (UDFs).

Background

As of v2.2.0, StarRocks supports user-defined functions (UDFs) written in Java.

As of v3.0, StarRocks supports global UDFs. You only need to add the GLOBAL keyword to the relevant SQL statements (CREATE, SHOW, and DROP) to apply them globally. This saves you from executing the statements for each database individually. You can develop custom functions based on your business scenarios to extend StarRocks' capabilities.

StarRocks currently supports the following types of UDFs:

  • scalar user-defined function (scalar UDF)

  • user-defined aggregate function (UDAF)

  • user-defined window function (UDWF)

  • user-defined table-valued function (UDTF)

Prerequisites

Before you use Java UDFs in StarRocks, ensure you meet the following requirements:

  • You must have Apache Maven installed to create and develop Java projects.

  • You must have JDK 1.8 installed on your server.

  • Enable the UDF feature by setting the FE configuration parameter enable_udf to TRUE on the Instance Configuration page. Then, restart the instance to apply the change.

Data type mapping

SQL type

Java type

BOOLEAN

java.lang.Boolean

TINYINT

java.lang.Byte

SMALLINT

java.lang.Short

INT

java.lang.Integer

BIGINT

java.lang.Long

FLOAT

java.lang.Float

DOUBLE

java.lang.Double

STRING/VARCHAR

java.lang.String

Develop and use UDFs

You will create a Maven project and write the function in Java.

Step 1: Create a Maven project

Create a Maven project with the following basic directory structure.

project
|--pom.xml
|--src
|  |--main
|  |  |--java
|  |  |--resources
|  |--test
|--target

Step 2: Add dependencies

Add the following dependencies to the pom.xml file.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>udf</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.76</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <version>2.10</version>
                <executions>
                    <execution>
                        <id>copy-dependencies</id>
                        <phase>package</phase>
                        <goals>
                            <goal>copy-dependencies</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>${project.build.directory}/lib</outputDirectory>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.3.0</version>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

Step 3: Develop the UDF

Develop a scalar UDF

A scalar UDF processes one row at a time, returning a single value for each row. Typical scalar functions include UPPER, LOWER, ROUND, and ABS.

The following example shows how to extract data from a JSON object. In some business scenarios, the value of a field in JSON data might be a JSON string instead of a JSON object. To extract the nested JSON string, you must use a nested call to the GET_JSON_STRING function, such as GET_JSON_STRING(GET_JSON_STRING('{"key":"{\\"k0\\":\\"v0\\"}"}', "$.key"), "$.k0").

To simplify the SQL statement, you can develop a UDF to extract the JSON string directly, for example, MY_UDF_JSON_GET('{"key":"{\\"k0\\":\\"v0\\"}"}', "$.key.k0").

package com.starrocks.udf.sample;
import com.alibaba.fastjson.JSONPath;

public class UDFJsonGet {
    public final String evaluate(String jsonObj, String key) {
        if (jsonObj == null || key == null) return null;
        try {
            // The JSONPath library can fully expand the path, even if a field value is a JSON-formatted string.
            return JSONPath.read(jsonObj, key).toString();
        } catch (Exception e) {
            return null;
        }
    }
}

Your custom class must implement the following method.

Note

The data types of the input arguments and the return value in the method must match those declared in the CREATE FUNCTION statement in Step 6. The data type mapping must conform to the rules in Data type mapping.

Method

Description

TYPE1 evaluate(TYPE2, ...)

The evaluate method is the entry point for the UDF and must be a public member method.

Develop a UDAF

A UDAF operates on a group of rows and returns a single value. Typical aggregate functions include SUM, COUNT, MAX, and MIN. These functions aggregate data from multiple rows within each GROUP BY group and output a single result for each group.

The following example uses the MY_SUM_INT function. Unlike the built-in SUM function, which returns a BIGINT value, MY_SUM_INT takes an INT argument and returns an INT.

package com.starrocks.udf.sample;

public class SumInt {
    public static class State {
        int counter = 0;
        public int serializeLength() { return 4; }
    }

    public State create() {
        return new State();
    }

    public void destroy(State state) {
    }

    public final void update(State state, Integer val) {
        if (val != null) {
            state.counter+= val;
        }
    }

    public void serialize(State state, java.nio.ByteBuffer buff) {
        buff.putInt(state.counter);
    }

    public void merge(State state, java.nio.ByteBuffer buffer) {
        int val = buffer.getInt();
        state.counter += val;
    }

    public Integer finalize(State state) {
        return state.counter;
    }
}

Your custom class must implement the following methods.

Note

The data types of the input arguments and the return value in the methods must match those declared in the CREATE FUNCTION statement in Step 6. The data type mapping must conform to the rules in Data type mapping.

Method

Description

State create()

Creates a state object.

void destroy(State)

Destroys the state object.

void update(State, ...)

Updates the state. The first argument is the state object, and the subsequent arguments are the input arguments declared for the function. One or more input arguments are supported.

void serialize(State, ByteBuffer)

Serializes the state.

void merge(State, ByteBuffer)

Merges a serialized state into the current state.

TYPE finalize(State)

Computes and returns the final result from the state.

When developing a UDAF, you must use the java.nio.ByteBuffer buffer class to store the intermediate result and the serializeLength method to specify the length of the serialized intermediate result.

Class and method

Description

java.nio.ByteBuffer()

This buffer class stores the intermediate result. Because the result is serialized for transfer between nodes, you must use serializeLength() to specify its serialized size.

serializeLength()

The length in bytes of the serialized intermediate result. The serializeLength() method must return an INT. For example, State { int counter = 0; public int serializeLength() { return 4; }} indicates that the intermediate result is an INT and its serialized length is 4 bytes. You can customize this based on your needs. For example, if the intermediate result is a LONG with a length of 8 bytes, you would use State { long counter = 0; public int serializeLength() { return 8; }}.

Note

Note the following requirements for java.nio.ByteBuffer serialization:

  • Do not rely on the remaining() method of ByteBuffer to deserialize the state.

  • Do not call the clear() method on the ByteBuffer.

  • The value returned by serializeLength() must match the actual length of the data written to the buffer. A mismatch causes serialization and deserialization errors.

Develop a UDWF

A UDWF is a special type of aggregate function. Unlike a regular aggregate function, a window function calculates a value over a group of rows (a window) and returns a separate result for each row. Typically, a window function includes an OVER clause that divides rows into partitions. The function then computes a result for each row based on the window of rows it belongs to.

The following example uses the MY_WINDOW_SUM_INT function. Unlike the built-in SUM function, which returns a BIGINT value, MY_WINDOW_SUM_INT takes an INT argument and returns an INT.

package com.starrocks.udf.sample;

public class WindowSumInt {    
    public static class State {
        int counter = 0;
        public int serializeLength() { return 4; }
        @Override
        public String toString() {
            return "State{" +
                    "counter=" + counter +
                    '}';
        }
    }

    public State create() {
        return new State();
    }

    public void destroy(State state) {

    }

    public void update(State state, Integer val) {
        if (val != null) {
            state.counter+=val;
        }
    }

    public void serialize(State state, java.nio.ByteBuffer buff) {
        buff.putInt(state.counter);
    }

    public void merge(State state, java.nio.ByteBuffer buffer) {
        int val = buffer.getInt();
        state.counter += val;
    }

    public Integer finalize(State state) {
        return state.counter;
    }

    public void reset(State state) {
        state.counter = 0;
    }

    public void windowUpdate(State state,
                            int peer_group_start, int peer_group_end,
                            int frame_start, int frame_end,
                            Integer[] inputs) {
        for (int i = (int)frame_start; i < (int)frame_end; ++i) {
            state.counter += inputs[i];
        }
    }
}

Because a window function is a special type of aggregate function, your custom class must implement all the required UDAF methods, plus the windowUpdate() method.

Note

The data types of the input arguments and the return value in the methods must match those declared in the CREATE FUNCTION statement in Step 6. The data type mapping must conform to the rules in Data type mapping.

Additional required method

Method

Description

void windowUpdate(State state, int, int, int , int, ...)

Updates the window data. For detailed information about window functions, see Window functions. For each input row, the corresponding window information is retrieved to update the intermediate result.

  • peer_group_start: The starting position of the current partition.

    A partition is a group of rows that share the same value for the column specified in the PARTITION BY clause.

  • peer_group_end: The ending position of the current partition.

  • frame_start: The starting position of the current window frame.

    A window frame is a subset of rows within a partition that defines the scope of computation for the current row. For example, ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING defines a frame that includes the current row, the row before it, and the row after it.

  • frame_end: The ending position of the current window frame.

  • inputs: The input data for a window, provided as a wrapper class array. The wrapper class must correspond to the input data type. In this example, the input data type is INT, so the wrapper class array is Integer[].

Develop a UDTF

A UDTF takes a single row as input and produces a table of output rows. UDTFs are often used for operations like splitting a column into multiple rows.

Note

Currently, UDTFs only support returning multiple rows with a single column.

The following example uses the MY_UDF_SPLIT function. It splits a string by using a space as the delimiter. Both the input argument and return value are of the STRING type.

package com.starrocks.udf.sample;

public class UDFSplit{
    public String[] process(String in) {
        if (in == null) return null;
        return in.split(" ");
    }
}

Your custom class must implement the following method.

Note

The data types of the input argument and the return value in the method must match those declared in the CREATE FUNCTION statement in Step 6. The data type mapping must conform to the rules in Data type mapping.

Method

Description

TYPE[] process()

The process() method is the entry point for the UDTF and must return an array.

Step 4: Package the Java project

Run the following command to package the Java project.

mvn package

This command generates two JAR files in the target directory: udf-1.0-SNAPSHOT.jar and udf-1.0-SNAPSHOT-jar-with-dependencies.jar.

Step 5: Upload the project

Upload the udf-1.0-SNAPSHOT-jar-with-dependencies.jar file to an OSS bucket and grant public read permission on the JAR file. For more information, see Simple upload and Bucket ACL.

Note

In Step 6, the FE validates the UDF's JAR file and calculates a checksum. The BE then downloads and executes the JAR file.

Step 6: Create the UDF in StarRocks

StarRocks provides two namespaces for UDFs: database-level and global-level.

  • If you do not require visibility isolation for your UDFs, you can create a global UDF. When referencing a global UDF, you can call it directly by its function name without any catalog or database prefix, which simplifies access.

  • If you require visibility isolation or need to create UDFs with the same name in different databases, you can create a database-level UDF. If your current session is within that database, you can call the function directly by its name. If your session is in a different catalog or database, you must use a fully qualified name, such as catalog.database.function.

Note

Creating a global UDF requires the system-level CREATE GLOBAL FUNCTION permission. Creating a database-level UDF requires the CREATE FUNCTION permission on the database. Using any UDF requires the USAGE permission on it. For information about how to grant permissions, see GRANT.

After you upload the JAR file, create the corresponding UDF in StarRocks. To create a global UDF, simply add the GLOBAL keyword to the SQL statement.

Syntax

CREATE [GLOBAL][AGGREGATE | TABLE] FUNCTION function_name(arg_type [, ...])
RETURNS return_type
[PROPERTIES ("key" = "value" [, ...]) ]

Parameters

Parameter

Required

Description

GLOBAL

No

Specifies that the UDF is a global UDF. Supported since StarRocks v3.0.

AGGREGATE

No

Specifies that the function to be created is a UDAF or UDWF.

TABLE

No

Specifies that the function to be created is a UDTF.

function_name

Yes

The name of the function. It can include a database name, such as db1.my_func. If a function_name includes a database name, the UDF is created in that database. Otherwise, it is created in the current database. The combination of function name and arguments must be unique within the database. However, you can overload a function by creating another with the same name but different argument types.

arg_type

Yes

The data type of a function argument. For supported data types, see Data type mapping.

return_type

Yes

The data type of the function's return value. For supported data types, see Data type mapping.

properties

Yes

Properties related to the function. Different properties are required for different types of UDFs. See the following examples for details.

Create a scalar UDF

Run the following command to create the scalar UDF from the previous example in StarRocks.

CREATE [GLOBAL] FUNCTION MY_UDF_JSON_GET(string, string) 
RETURNS string
PROPERTIES (
    "symbol" = "com.starrocks.udf.sample.UDFJsonGet", 
    "type" = "StarrocksJar",
    "file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);

Parameter

Description

symbol

The fully qualified class name of the UDF, in the format <package_name>.<class_name>.

type

The type of UDF. For Java UDFs, set this to StarrocksJar.

file

The HTTP path to the UDF's JAR file. This should be the HTTP URL of the file in OSS, preferably using an internal endpoint. The format is http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/<jar_package_name>.

Create a UDAF

Run the following command to create the UDAF from the previous example in StarRocks.

CREATE [GLOBAL] AGGREGATE FUNCTION MY_SUM_INT(INT) 
RETURNS INT
PROPERTIES 
( 
    "symbol" = "com.starrocks.udf.sample.SumInt", 
    "type" = "StarrocksJar",
    "file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);

The parameters in the PROPERTIES clause are the same as those described in Create a scalar UDF.

Create a UDWF

Run the following command to create the UDWF from the previous example in StarRocks.

CREATE [GLOBAL] AGGREGATE FUNCTION MY_WINDOW_SUM_INT(Int)
RETURNS Int
PROPERTIES 
(
    "analytic" = "true",
    "symbol" = "com.starrocks.udf.sample.WindowSumInt", 
    "type" = "StarrocksJar", 
    "file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);

The analytic property identifies the function as a window function. For a UDWF, set this to true. The other parameters are the same as those described in Create a scalar UDF.

Create a UDTF

Run the following command to create the UDTF from the previous example in StarRocks.

CREATE [GLOBAL] TABLE FUNCTION MY_UDF_SPLIT(string)
RETURNS string
PROPERTIES 
(
    "symbol" = "com.starrocks.udf.sample.UDFSplit", 
    "type" = "StarrocksJar", 
    "file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);

The parameters in the PROPERTIES clause are the same as those described in Create a scalar UDF.

Step 7: Use the UDF

After creating the UDF, you can test and use it.

Use the scalar UDF

Run the following command to use the scalar UDF created in Step 6.

SELECT MY_UDF_JSON_GET('{"key":"{\\"in\\":2}"}', '$.key.in');

Use the UDAF

Run the following command to use the UDAF created in Step 6.

SELECT MY_SUM_INT(col1);

Use the UDWF

Run the following command to use the UDWF created in Step 6.

SELECT MY_WINDOW_SUM_INT(intcol) 
            OVER (PARTITION BY intcol2
                  ORDER BY intcol3
                  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test_basic;

Use the UDTF

Run the following command to use the UDTF from the previous example.

-- Assume that table t1 exists with columns a, b, and c1.
SELECT t1.a,t1.b,t1.c1 FROM t1;
> output:
1,2.1,"hello world"
2,2.2,"hello UDTF."

-- Use the MY_UDF_SPLIT() function.
SELECT t1.a,t1.b, MY_UDF_SPLIT FROM t1, MY_UDF_SPLIT(t1.c1); 
> output:
1,2.1,"hello"
1,2.1,"world"
2,2.2,"hello"
2,2.2,"UDTF."
Note
  • In the SELECT list, the first MY_UDF_SPLIT is the default column alias for the output of the MY_UDF_SPLIT function.

  • Currently, you cannot specify a table alias or column aliases for the UDTF's output, such as with AS t2(f1).

View UDFs

Run the following command to view information about UDFs.

SHOW [GLOBAL] FUNCTIONS;

Delete UDFs

Run the following command to delete a specified UDF.

DROP [GLOBAL] FUNCTION <function_name>(arg_type [, ...]);

FAQ

Q: Can I use static variables when developing a UDF? Will static variables from different UDFs interfere with each other?

A: Yes, you can use static variables. Static variables are isolated between different UDFs and do not interfere with each other, even if the UDFs share the same class name.