This topic answers frequently asked questions about EMR Serverless StarRocks.
How do I access OSS across accounts?
By default, EMR Serverless StarRocks provides password-free access to OSS buckets in the same account. To access OSS resources in another account, you must disable this default, configure the target account's AccessKey pair, and apply the new configuration.
Disable password-free access: On the Parameter Configuration tab, clear the values of the following configuration items in the specified files.
core-site.xml
fs.oss.credentials.provider =
jindosdk.cfg
fs.oss.provider.format =
fs.oss.provider.endpoint =
Add the AccessKey pair for the target account: On the Parameter Configuration tab, click Add Configuration Item and add the following configurations to the specified files.
core-site.xml
fs.oss.accessKeyId = AccessKey ID of the target account
fs.oss.accessKeySecret = AccessKey Secret of the target account
jindosdk.cfg
fs.oss.accessKeyId = AccessKey ID of the target account
fs.oss.accessKeySecret = AccessKey Secret of the target account
Apply the configuration: On the Parameter Configuration tab, click Submit Parameters.
Use UDF and JDBC connector drivers
Before you use UDF and JDBC drivers, you must obtain the required JAR files from an external source.
Upload the JAR files to OSS. For more information, see Upload files.
When you upload the files, set the object ACL to Public Read/Write to grant the JAR files public read and write permissions.
Obtain the URL for each JAR file.
In the OSS console, find the link for each successfully uploaded JAR file. Use the HTTP URL of the internal endpoint, which must be in one of the following formats:
For a JDBC driver:
http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/mysql-connector-java-*.jar
For a UDF:
http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/<jar_package_name>
Use the JAR files. For more information, see Java UDF and JDBC Catalog.
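The internal-endpoint URL format described above can be assembled programmatically. The following Python sketch is only an illustration: the helper name build_internal_jar_url and the bucket, region, and object path values are hypothetical placeholders, not values from this topic.

```python
def build_internal_jar_url(bucket: str, region: str, object_key: str) -> str:
    """Return the HTTP URL of an OSS object on the internal endpoint.

    Follows the format http://<bucket>.oss-<region>-internal.aliyuncs.com/<key>.
    """
    if not bucket or not region or not object_key:
        raise ValueError("bucket, region, and object_key are all required")
    # Strip a leading slash so the URL does not contain '//' after the host.
    key = object_key.lstrip("/")
    return f"http://{bucket}.oss-{region}-internal.aliyuncs.com/{key}"


# Hypothetical values for illustration only.
udf_url = build_internal_jar_url("my-bucket", "cn-hangzhou", "udf/my_udf.jar")
print(udf_url)
```

Building the URL from its parts makes it harder to paste a public-endpoint link by mistake, because the -internal suffix is always added.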
How do I reset the instance password?
Resetting the instance password interrupts client-server connections. To minimize production impact, perform this operation during off-peak hours.
Only users with the AliyunEMRStarRocksFullAccess permission can reset the password.
Go to the instance details page.
Log in to the E-MapReduce console.
In the left-side navigation pane, choose EMR Serverless > StarRocks.
Click the name of the target instance.
On the Instance Details page, in the Basic Information section, click Reset Password.
In the dialog box that appears, enter and confirm the new password, and then click OK.
Error writing data to Paimon tables
Symptom: When you use StarRocks to write data to a Paimon table, you may receive the following error message:
(5025, 'Backend node not found. Check if any backend node is down.')
Cause: A permission check in Paimon tables can prevent StarRocks from correctly identifying BE nodes during write operations.
Solution:
Upgrade the version (Recommended): If your instance version is earlier than one of the following, perform a minor version update to apply the fix.
StarRocks 3.2: 3.2.11-1.89 or later
StarRocks 3.3: 3.3.8-1.88 or later
Workaround: On the Parameter Configuration tab of the StarRocks instance, add the following configuration item to the core-site.xml file.
dlf.permission.clientCheck=false
When creating a foreign table in StarRocks, if you receive the not a RAM user error, what should you do?
Symptom: When creating a foreign table in StarRocks, you may receive the following error message:
current user is not a RAM user
Cause: This error is caused by insufficient permissions or an outdated instance version.
Solution:
Check the RAM user permissions: Ensure that the Resource Access Management (RAM) user has the required permissions for StarRocks. For more information, see Grant permissions to a RAM user.
If the permissions are correct, check and upgrade the kernel version on the StarRocks Instance Details page.
If your instance version is earlier than one of the following, perform a minor version update to apply the fix.
StarRocks 3.2: 3.2.11-1.89 or later
StarRocks 3.3: 3.3.8-1.88 or later
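The version check above can be sketched in Python. This is a minimal illustration that assumes the instance version string has the same dotted form as the versions listed above (for example, 3.2.11-1.89) and that comparing each numeric component in order is the intended ordering. The function names and the sample input versions are hypothetical.

```python
import re

# Minimum fixed versions listed above, keyed by (major, minor) release line.
MIN_FIXED = {
    (3, 2): (3, 2, 11, 1, 89),
    (3, 3): (3, 3, 8, 1, 88),
}


def parse_version(version: str) -> tuple:
    """Turn '3.2.11-1.89' into (3, 2, 11, 1, 89) for numeric comparison."""
    parts = re.split(r"[.-]", version.strip())
    return tuple(int(p) for p in parts)


def needs_minor_update(version: str) -> bool:
    """Return True if the version is earlier than the fixed version of its line."""
    parsed = parse_version(version)
    minimum = MIN_FIXED.get(parsed[:2])
    if minimum is None:
        raise ValueError(f"no fixed version listed for release line {parsed[:2]}")
    return parsed < minimum


print(needs_minor_update("3.2.9-1.80"))   # True: 3.2.9 is earlier than 3.2.11
print(needs_minor_update("3.3.8-1.88"))   # False: already at the fixed version
```

Comparing tuples of integers avoids the string-comparison mistake of treating 3.2.9 as later than 3.2.11.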
Error with semicolons in the SQL Editor
Symptom: When you run an SQL statement containing a semicolon (;) in the SQL Editor, you receive an error. The error message includes "the most similar input is {a legal identifier}". The error code is 1064, and the details also include "Unexpected input '<EOF>'", indicating a syntax error at line 3, column 11.
Cause: The SQL Editor uses the semicolon (;) as a statement terminator by default. If your SQL statement contains a semicolon (;), for example inside a string literal, a syntax parsing error occurs.
Solution:
Set a custom delimiter.
Before you run an SQL statement that contains a semicolon, set a custom delimiter to prevent syntax parsing errors. For example, you can change the delimiter to $$.
delimiter $$
Run the SQL statement that contains a semicolon. An example is shown below:
INSERT INTO sr_test VALUES (1, 'asdsd,asdsads'), (2, 'sadsad;asdsads');
Restore the default delimiter.
After the SQL statement is executed, restore the default delimiter (;) so that subsequent SQL operations can run as expected.
delimiter ;
Verify the result.
Run a query to verify that the data was inserted correctly.
delimiter ;
SELECT * FROM sr_test;
Output:
   test_id  test_desc
0  1        asdsd,asdsads
1  2        sadsad;asdsads
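To see why the default terminator breaks this statement, consider a deliberately simplified model of statement splitting. The Python sketch below is an assumption made for illustration, not the SQL Editor's actual parser: it splits a script on the delimiter without honoring SQL quoting, which is enough to reproduce the effect.

```python
def split_statements(script: str, delimiter: str = ";") -> list:
    """Naively split a script on the delimiter, ignoring SQL quoting rules."""
    return [part.strip() for part in script.split(delimiter) if part.strip()]


sql = "INSERT INTO sr_test VALUES (1, 'asdsd,asdsads'), (2, 'sadsad;asdsads')"

# With ';' as the terminator, the string literal is cut in half.
broken = split_statements(sql + ";", delimiter=";")
print(len(broken))      # 2 fragments, neither of which is valid SQL

# With '$$' as the terminator, the statement stays in one piece.
intact = split_statements(sql + "$$", delimiter="$$")
print(len(intact))      # 1
```

The first fragment ends in the middle of a string literal, which matches the unexpected end-of-input reported in the error.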
Failure to import data or access foreign tables
Symptom: When you use EMR Serverless StarRocks to import data or access a foreign table, the import or connection may fail if the destination is a public IP address.
Cause: An EMR Serverless StarRocks instance runs in a Virtual Private Cloud (VPC) environment by default, which may not have direct access to the internet. Therefore, requests to public resources, such as for data imports or foreign table queries, fail unless internet access is configured.
Solution: You can deploy an Internet NAT gateway in the VPC and enable the SNAT feature. This allows the EMR Serverless StarRocks instance to access public resources through the gateway. For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet.
Prevent connection closure by SLB/CLB idle timeout
Symptom: When using SLB with a StarRocks instance, the SLB forcibly closes the client connection if an SQL query runs for more than 900 seconds, preventing the query from returning a result. For more information about enabling SLB, see Manage gateways.
Cause: SLB closes any TCP connection that is idle for more than 900 seconds. This can happen during a long-running SQL query, interrupting the connection before StarRocks returns a result.
Solution: Configure client-side TCP Keepalive parameters to prevent the SLB from closing idle connections.
Global kernel parameter settings (system-level)
Modify the operating system's kernel parameters to enable and configure appropriate TCP Keepalive settings for all TCP connections. This helps monitor the status of network connections. The following table describes the parameters to be configured.
Parameter | Description | Recommended value
Linux: net.ipv4.tcp_keepalive_time; FreeBSD/macOS: net.inet.tcp.keepidle | The period of inactivity after which the first Keepalive probe is sent. | 600 seconds
Linux: net.ipv4.tcp_keepalive_intvl; FreeBSD/macOS: net.inet.tcp.keepintvl | The interval between Keepalive probe retransmissions. | 60 seconds
Linux: net.ipv4.tcp_keepalive_probes; FreeBSD/macOS: net.inet.tcp.keepcnt | The number of consecutive failed probes after which the connection is dropped. | 5
Note: The Linux parameters are specified in seconds. On FreeBSD and macOS, the net.inet.tcp.keepidle and net.inet.tcp.keepintvl sysctl values are specified in milliseconds.
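The recommended values can be checked against the 900-second idle timeout with a little arithmetic. The following Python sketch assumes typical TCP Keepalive behavior: the first probe is sent after the idle period, and an unresponsive peer is dropped after all retries fail. The function name is hypothetical.

```python
SLB_IDLE_TIMEOUT_S = 900  # idle timeout stated above


def keepalive_timeline(idle_s: int, interval_s: int, probes: int) -> dict:
    """Summarize when probes are sent and when a dead peer is detected."""
    return {
        # The first probe goes out after idle_s seconds of silence.
        "first_probe_s": idle_s,
        # A probe must cross the SLB before its idle timer expires.
        "keeps_slb_alive": idle_s < SLB_IDLE_TIMEOUT_S,
        # If no probe is answered, the connection is dropped after all retries.
        "dead_peer_detected_s": idle_s + interval_s * probes,
    }


print(keepalive_timeline(600, 60, 5))
```

With the recommended values, the first probe is sent at 600 seconds, inside the 900-second window, and an unresponsive peer is detected after about 600 + 60 × 5 = 900 seconds.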
Linux
Apply settings temporarily
# Set global Keepalive parameters (root permissions required)
sudo sysctl -w net.ipv4.tcp_keepalive_time=600    # Corresponds to keepidle (600 seconds)
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=60    # Corresponds to keepintvl (60 seconds)
sudo sysctl -w net.ipv4.tcp_keepalive_probes=5    # Corresponds to keepcount (5)
Apply settings permanently
Add the following content to /etc/sysctl.conf and run sysctl -p to apply the settings.
# Run as root
echo "net.ipv4.tcp_keepalive_time = 600" >> /etc/sysctl.conf
echo "net.ipv4.tcp_keepalive_intvl = 60" >> /etc/sysctl.conf
echo "net.ipv4.tcp_keepalive_probes = 5" >> /etc/sysctl.conf
sysctl -p
FreeBSD/macOS
Apply settings temporarily
# Set global Keepalive parameters (root permissions required)
# On FreeBSD and macOS, keepidle and keepintvl are specified in milliseconds.
sudo sysctl -w net.inet.tcp.keepidle=600000    # 600 seconds
sudo sysctl -w net.inet.tcp.keepintvl=60000    # 60 seconds
sudo sysctl -w net.inet.tcp.keepcnt=5
Apply settings permanently
Add the following content to /etc/sysctl.conf.
# Run as root
echo "net.inet.tcp.keepidle = 600000" >> /etc/sysctl.conf
echo "net.inet.tcp.keepintvl = 60000" >> /etc/sysctl.conf
echo "net.inet.tcp.keepcnt = 5" >> /etc/sysctl.conf
Application-level settings
You can use language-specific APIs to set TCP Keepalive parameters for a single connection.
Java
The Java standard library exposes SO_KEEPALIVE through java.net.Socket. On JDK 11 and later, you can also set the per-connection Keepalive timers by using the options in jdk.net.ExtendedSocketOptions.
Note: The following code requires JDK 11 or later and operating system support for the extended options (for example, on Linux). On platforms that do not support an option, setOption throws UnsupportedOperationException. We recommend that you test for compatibility before use in a production environment.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.StandardSocketOptions;
import jdk.net.ExtendedSocketOptions;

public class TcpKeepaliveExample {
    public static void main(String[] args) {
        try (Socket socket = new Socket()) {
            // 1. Enable Keepalive
            socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
            // 2. Set Keepalive parameters (requires operating system support)
            // Set Keepidle (idle time, in seconds)
            socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 600);
            // Set Keepintvl (retransmission interval, in seconds)
            socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 60);
            // Set Keepcount (number of failed probes)
            socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 5);
            // Connect to the server
            socket.connect(new InetSocketAddress("example.com", 80));
            // ... Other operations ...
        } catch (IOException | UnsupportedOperationException e) {
            e.printStackTrace();
        }
    }
}
Python
The Python socket module supports direct configuration of TCP Keepalive parameters.
Note: Different operating systems may use different parameter names. For example, macOS may require TCP_KEEPALIVE instead of TCP_KEEPIDLE. Some parameters may require root permissions to set.
import socket

def create_keepalive_socket():
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # 1. Enable Keepalive
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # 2. Set Keepalive parameters (Linux/FreeBSD)
    # Keepidle: 600 seconds
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 600)
    # Keepintvl: 60 seconds
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)
    # Keepcount: 5
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
    return sock

# Example
sock = create_keepalive_socket()
sock.connect(("example.com", 80))
# ... Other operations ...
sock.close()
Golang
The Golang net package provides basic Keepalive configuration. However, you must use the low-level syscall package to set detailed parameters.
Note: Different operating systems may use different parameter names. Some parameters may require root permissions to set.
package main

import (
    "fmt"
    "net"
    "syscall"
)

func main() {
    // Create a TCP connection
    conn, err := net.Dial("tcp", "example.com:80")
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Get the underlying file descriptor
    file, err := conn.(*net.TCPConn).File()
    if err != nil {
        panic(err)
    }
    defer file.Close()
    fd := int(file.Fd())

    // Enable Keepalive
    err = syscall.SetsockoptInt(fd, syscall.SOL_SOCKET, syscall.SO_KEEPALIVE, 1)
    if err != nil {
        panic(fmt.Errorf("set SO_KEEPALIVE: %v", err))
    }
    // Set Keepidle (idle time)
    err = syscall.SetsockoptInt(fd, syscall.IPPROTO_TCP, syscall.TCP_KEEPIDLE, 600)
    if err != nil {
        panic(fmt.Errorf("set TCP_KEEPIDLE: %v", err))
    }
    // Set Keepintvl (retransmission interval)
    err = syscall.SetsockoptInt(fd, syscall.IPPROTO_TCP, syscall.TCP_KEEPINTVL, 60)
    if err != nil {
        panic(fmt.Errorf("set TCP_KEEPINTVL: %v", err))
    }
    // Set Keepcount (number of failures)
    err = syscall.SetsockoptInt(fd, syscall.IPPROTO_TCP, syscall.TCP_KEEPCNT, 5)
    if err != nil {
        panic(fmt.Errorf("set TCP_KEEPCNT: %v", err))
    }
    // ... Other operations ...
}
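For Python clients, the platform difference mentioned in the Python note (TCP_KEEPALIVE on macOS in place of TCP_KEEPIDLE) can be handled by looking up the constants at run time. The following is a hedged sketch, not a definitive implementation: the helper name enable_keepalive is hypothetical, and the numeric fallback 0x10 assumes the value of TCP_KEEPALIVE in the macOS system headers.

```python
import socket
import sys


def enable_keepalive(sock: socket.socket, idle: int = 600,
                     interval: int = 60, count: int = 5) -> None:
    """Enable TCP Keepalive and set the timers where the platform allows it."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Linux and FreeBSD expose TCP_KEEPIDLE; macOS names it TCP_KEEPALIVE.
    idle_opt = getattr(socket, "TCP_KEEPIDLE", None)
    if idle_opt is None and sys.platform == "darwin":
        # Assumption: 0x10 is TCP_KEEPALIVE in the macOS headers, used as a
        # fallback for Python builds that do not export the constant.
        idle_opt = getattr(socket, "TCP_KEEPALIVE", 0x10)
    if idle_opt is not None:
        sock.setsockopt(socket.IPPROTO_TCP, idle_opt, idle)

    # Set interval and count only if this Python build exposes the constants.
    for name, value in (("TCP_KEEPINTVL", interval), ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

The options are set before connect, so the same helper can be applied to a socket that is later handed to a database driver, provided the driver accepts a pre-configured socket.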
"Not a OLAP table" error during data backup
Symptom: When you back up data by creating a snapshot in StarRocks, you receive the following error message:
Unexpected exception: Table '<table_name>' is not a OLAP table
Causes:
The instance is in shared-data mode.
StarRocks shared-data instances do not support data backup and recovery. This feature is only available for shared-nothing instances.
The table engine type is incompatible.
The StarRocks backup feature only supports tables that use the OLAP engine. This error occurs if the table engine is not OLAP.
Solution:
Check the instance type.
In the StarRocks Instance List, check the Instance Type. If the instance is of the Shared-data type, it does not support data backup and recovery. We recommend using a shared-nothing instance to enable backup and recovery. For more information, see Backup and recovery.
Check the table engine type.
Examine the DDL definition of the target table to confirm whether ENGINE=OLAP is set.
SHOW CREATE TABLE <table_name>;
If the table engine is not OLAP, recreate the table according to your business needs and ensure that you specify ENGINE=OLAP.
"Primary-key index exceeds the limit" error on import
Symptom
When you write data to a Primary Key model table, the following error occurs:
msg: Cancelled, msg: Primary-key index exceeds the limit. tablet_id: 2506733, consumption: 33176971421, limit: 32116807950. Memory stats of top five tablets: 3656508(4465M) 3656496(4464M) 3656544(4464M) 3656520(4462M) 3656532(4461M): be: backend-0.backend.xxx.svc.cluster.local.xxx
Troubleshooting approach
Analyze the current memory usage of the BE node to identify any resource bottlenecks.
Determine the cluster type (shared-nothing or shared-data) and check the corresponding configuration parameters.
Detailed troubleshooting steps
Analyze memory usage
This error occurs because the primary key index's memory consumption has exceeded the BE node's memory limit.
Check the mem_limit configuration of the BE node to assess its available memory capacity. You can obtain this information by running SHOW FRONTENDS or SHOW BACKENDS.
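To judge how far the index is over its limit, the two counters in the error message can be extracted and converted to a readable unit. The following Python sketch assumes that consumption and limit are reported in bytes, which the error text does not state explicitly. The function name is hypothetical.

```python
import re

ERROR = (
    "Primary-key index exceeds the limit. tablet_id: 2506733, "
    "consumption: 33176971421, limit: 32116807950."
)


def parse_index_limit_error(message: str) -> dict:
    """Extract the counters from the error text and convert them to GiB."""
    match = re.search(r"consumption:\s*(\d+),\s*limit:\s*(\d+)", message)
    if match is None:
        raise ValueError("message does not contain consumption/limit counters")
    consumption, limit = (int(g) for g in match.groups())
    gib = 1024 ** 3  # assumption: the counters are byte counts
    return {
        "consumption_gib": round(consumption / gib, 2),
        "limit_gib": round(limit / gib, 2),
        "over_by_gib": round((consumption - limit) / gib, 2),
    }


print(parse_index_limit_error(ERROR))
```

Under that assumption, the sample message corresponds to roughly 30.9 GiB of index memory against a limit of about 29.9 GiB.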
Solutions
Solution 1: Enable the persistent index (Recommended)
Shared-nothing: Set enable_persistent_index to true.
Shared-data: Set persistent_index_type to cloud_native.
Solution 2: Implement an effective partitioning strategy
Partition the primary key table effectively (for example, by time or region) to avoid writing to the entire table at once.
After partitioning, each write operation affects only a subset of partitions. The primary key index only needs to load data for the corresponding partitions, thereby reducing the memory pressure of a single write operation.
"Reached timeout" error during import
Symptoms
A Flink job fails to import data into StarRocks and reports the following error:
Message: [E1008] Reached timeout=7500ms @x.x.x.x:8060.
An INSERT INTO job fails and reports the following error:
java.sql.SQLException: [E1008] Reached timeout=7500ms @10.106.7.182:8060.
Troubleshooting approach
If the timeout value in the error message is not 30000 ms (the default value for rpc_connect_timeout_ms), check whether the rpc_connect_timeout_ms parameter on the BE node has been manually adjusted.
For INSERT INTO import jobs, check whether the query_timeout parameter is set (for example, query_timeout = 15). Currently, the internal logic of StarRocks sets the RPC timeout threshold to half of the query_timeout value, which is specified in seconds, and expresses the result in milliseconds (ms). Therefore, if query_timeout=15, the corresponding timeout is 7,500 ms.
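The relationship between query_timeout and the timeout in the error message can be written out as a short calculation. The following Python sketch only restates the rule described above. The function names are hypothetical.

```python
DEFAULT_RPC_CONNECT_TIMEOUT_MS = 30000  # default of rpc_connect_timeout_ms


def rpc_timeout_from_query_timeout(query_timeout_s: int) -> int:
    """Half of query_timeout (seconds), expressed in milliseconds."""
    return query_timeout_s * 1000 // 2


def likely_source(timeout_ms: int, query_timeout_s: int) -> str:
    """Guess which setting produced the timeout seen in the error message."""
    if timeout_ms == rpc_timeout_from_query_timeout(query_timeout_s):
        return "query_timeout"
    if timeout_ms != DEFAULT_RPC_CONNECT_TIMEOUT_MS:
        return "rpc_connect_timeout_ms (manually adjusted)"
    return "rpc_connect_timeout_ms (default)"


print(rpc_timeout_from_query_timeout(15))   # 7500
print(likely_source(7500, 15))              # query_timeout
```

Running the calculation in reverse is also useful: a reported timeout of 7500 ms points to a query_timeout of 15 seconds.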
Detailed troubleshooting steps
If you have modified the rpc_connect_timeout_ms parameter on the BE node, we recommend restoring it to the default value (30000 ms) to avoid false timeouts.
The error Reached timeout=7500ms usually indicates that the brpc thread load on the BE node is high, which causes delays in processing RPC requests and in turn triggers a timeout.
Analyze the data distribution of the target table by running SHOW TABLET FROM target_database.target_table ORDER BY RowCount DESC; to check whether the table's tablet distribution is reasonable. For example, if the data volume of a single tablet far exceeds the recommended range of 1 GB to 10 GB, some BE nodes may experience high load, affecting write performance.
Solutions
Solution 1 (Recommended)
Optimize the table's bucketing strategy by selecting an appropriate high-cardinality column as the bucketing key for DISTRIBUTED BY HASH(...) to ensure even data distribution across tablets.
Solution 2 (Temporary mitigation)
If the CPU and I/O load on the BE nodes are not at their limits, you can tune the following parameters:
Increase brpc_num_threads: The default value is the number of BE CPU cores. You can try adjusting it to 2 or 4 times the original value. Do not exceed 4 times to avoid increased thread contention.
Increase flush_thread_num_per_store: The default value is 2. You can adjust it to 4 to improve data flushing concurrency.
"NULL value in non-nullable column" error
Symptom: When importing data into a StarRocks table, the following error occurs:
Error: NULL value in non-nullable column 'xxx'.
Cause: Writing a NULL value to a NOT NULL column violates the table schema, causing the import job to fail.
Solutions
Solution 1: Fix the source data
Before writing data to StarRocks, filter or replace NULL values to ensure the data conforms to the target table's constraints.
Solution 2: Modify the table schema
If your business logic allows the field to be null, modify the table schema to remove the NOT NULL constraint.
"Too many versions" error with Flink connector
Symptom
While continuously writing data to a StarRocks table using the Flink Connector, the following error occurs:
because of too many versions, current/limit: 1009/1000.
Cause: In StarRocks' Primary Key model or Unique Key model (with merge-on-write), each data import generates a new version. To prevent metadata bloat and ensure query performance, the system limits each partition to a maximum of 1000 versions by default.
Solution
Check the partition compaction score.
Run the following SQL statement to check the compaction pressure on the partitions of the target table:
SELECT TABLE_NAME, PARTITION_NAME, AvgCS AS avg_compaction_score, MaxCS AS max_compaction_score
FROM information_schema.partitions_meta
WHERE TABLE_NAME = 'your_table_name';
If the MaxCS (maximum compaction score) of the affected partition is much higher than 100, a large number of small versions are pending merge, and compaction has not completed in time.
Manually trigger compaction.
Run the following command:
ALTER TABLE your_db.your_table COMPACT PARTITION your_partition_name;
Optimize Flink sink parameters.
Increase the Flink sink parameters, including sink.buffer-flush.max-bytes, sink.buffer-flush.max-rows, and sink.buffer-flush.interval-ms, to reduce the import frequency and avoid generating too many small versions.
"Failed to get status for file" error
Symptom
When querying a table in an external data lake (such as Paimon or Iceberg), the query fails with the following error:
(1064, 'Failed to get status for file: oss://data-lakehouse-oss-normal/dataware.db/dwd_annotation2_user/metadata/00097-10647858-814a-499e-b300-51c570ee7ee0.metadata.json').
The OSS API returns the following error message:
<Error>
  <Code>AccessDenied</Code>
  <Message>You have no right to access this object because of bucket acl.</Message>
  <RequestId>68EC744EB6CD8C3539FAB32A</RequestId>
  <HostId>data-lakehouse-oss-normal.oss-cn-shenzhen-internal.aliyuncs.com</HostId>
  <EC>0003-00000001</EC>
  <RecommendDoc>https://api.aliyun.com/troubleshoot?q=0003-00000001</RecommendDoc>
</Error>
Cause: When querying an external data lake (such as Hive, Iceberg, or Hudi) with the External Catalog feature, StarRocks requires access to object storage like Alibaba Cloud OSS. Improper permissions or configurations cause failures when reading metadata or data files.
Solution
Verify the AccessKey pair validity: Confirm that the configured accessKeyId and accessKeySecret can access the target OSS bucket.
If the accessKeyId and accessKeySecret are configured correctly, check whether you are performing cross-account access to an OSS bucket. For cross-account access, you need to modify the configurations. For details, see How do I access OSS across accounts?.
Check disk usage of persistent indexes
After you enable the persistent index (enable_persistent_index = true or persistent_index_type = 'cloud_native'), the primary key index is stored on disk. You can check its disk usage by querying the information_schema.be_tablets table.
-- Query the index size of tables, ordered by index size in descending order
SELECT
tables_config.TABLE_NAME,
t1.TABLE_ID,
t1.index_sum_mb
FROM (
-- Calculate the total index size (MB) for each table
SELECT
TABLE_ID,
SUM(INDEX_DISK)/1024/1024 AS index_sum_mb
FROM information_schema.be_tablets
GROUP BY TABLE_ID
) t1
JOIN information_schema.tables_config AS tables_config ON tables_config.TABLE_ID = t1.TABLE_ID
ORDER BY index_sum_mb DESC
-- Optional: Add a LIMIT clause to restrict the number of results to avoid large outputs
-- LIMIT 100
;
View in-progress write transactions and tablets
Track ongoing or recently completed import jobs to identify which tablets are being written to.
SELECT
txn_table.*,
tc.table_name
FROM (
SELECT
bt.TABLET_ID,
bt.COMMIT_TIME,
bt.PUBLISH_TIME,
bt.TABLE_ID
FROM information_schema.be_txns bt
JOIN information_schema.be_tablets btt
ON bt.TABLET_ID = btt.TABLET_ID
) AS txn_table
JOIN information_schema.tables_config tc
    ON txn_table.TABLE_ID = tc.TABLE_ID;
Analyze CPU or memory load spikes
Use the audit log to identify queries with high resource consumption.
SELECT
queryId,
timestamp,
ROUND(memCostBytes / 1024 / 1024 / 1024, 2) AS memCostGB,
cpuCostNs
FROM _starrocks_audit_db_.starrocks_audit_tbl
WHERE timestamp BETWEEN '2025-xx-xx hh:mm:ss' AND '2025-xx-xx hh:mm:ss'
ORDER BY cpuCostNs DESC, memCostGB DESC
LIMIT 20;
Analyze I/O load spikes
I/O spikes are typically caused by large-scale scans, such as full table scans or queries that miss partitions or indexes.
SELECT
queryId,
timestamp,
ROUND(scanBytes / 1024 / 1024 / 1024, 2) AS scanTotalGB
FROM _starrocks_audit_db_.starrocks_audit_tbl
WHERE timestamp BETWEEN '2025-xx-xx hh:mm:ss' AND '2025-xx-xx hh:mm:ss'
ORDER BY scanTotalGB DESC
LIMIT 20;
"Insufficient storage" error during BE node scale-in
Symptom
When you scale in BE nodes from the console of a fully managed StarRocks cluster, the following error is reported:
invalid status: [insufficient storage].
Cause: The scale-in validation fails. The cluster must meet the following storage condition after the scale-in, or the operation will be rejected:
Used Storage < Total Capacity after Scale-in × 0.7
The terms are defined as follows:
Total Capacity after Scale-in = The sum of the total capacity of the remaining nodes (Total Nodes - Scaled-in Nodes).
Used Storage = The sum of (Total Capacity - Available Capacity) for all nodes.
Obtain the TotalCapacity and AvailCapacity values by running SHOW BACKENDS.
Solution
Check the current cluster capacity.
Run the following SQL command to get the total and available capacity of each BE node:
SHOW BACKENDS\G
Focus on the TotalCapacity and AvailCapacity fields to calculate whether the used storage exceeds 70% of the total capacity after the scale-in.
Expand the disk and retry.
If the disk capacity is insufficient, expand the disk size of the BE node in the EMR console. After you ensure that Used Storage < Total Capacity after Scale-in × 0.7, retry the scale-in operation.
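The validation rule above can be evaluated before you start a scale-in. The following Python sketch applies the formula to values that you would read from the TotalCapacity and AvailCapacity fields. The function name and the sample cluster are hypothetical, and all capacities are assumed to be in the same unit.

```python
def can_scale_in(nodes: list, remove: set, threshold: float = 0.7) -> bool:
    """Apply the scale-in rule: used storage < remaining capacity * threshold.

    nodes  -- list of (total_capacity, avail_capacity) per BE node, same unit
    remove -- indexes of the nodes that would be removed
    """
    # Used storage is summed over ALL current nodes.
    used = sum(total - avail for total, avail in nodes)
    # Capacity after scale-in counts only the nodes that remain.
    remaining = sum(total for i, (total, _) in enumerate(nodes) if i not in remove)
    return used < remaining * threshold


# Hypothetical cluster: three BE nodes with 1000 GB each and 600 GB free.
cluster = [(1000, 600), (1000, 600), (1000, 600)]
print(can_scale_in(cluster, remove={2}))        # True: used 1200 < 2000 * 0.7
print(can_scale_in(cluster, remove={1, 2}))     # False: used 1200 >= 1000 * 0.7
```

Note that used storage includes the data on the nodes being removed, because that data must fit on the remaining nodes.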
"RAM.Permission.NotAllow" error in StarRocks Manager
Symptom
When a RAM user logs in to the StarRocks Manager page for a fully managed cluster, the error You are not authorized to perform the operation (code: RAM.Permission.NotAllow) appears.
Cause: The RAM user lacks the necessary permissions for EMR Serverless StarRocks, preventing access to the StarRocks Manager page. For more information, see Grant permissions to a RAM user.
Solution
Solution 1: Attach a system policy.
Log in to the RAM console and attach the AliyunEMRStarRocksFullAccess system policy to the target RAM user to grant full operational permissions for EMR Serverless StarRocks.
Solution 2: Grant specific permissions.
To avoid granting full access, identify the specific missing permission from the RequestId in the error message, and then grant only that permission to the RAM user in the RAM console. For example, if the emr-serverless-starrocks:ListInstances permission is missing, you can create a custom policy to grant it individually.
To view all permissions included in this system policy, search for AliyunEMRStarRocksFullAccess on the Policies page in the RAM console.
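A custom policy for Solution 2 is a small JSON document. The following Python sketch generates one for the ListInstances example above. It is an illustration under stated assumptions: the helper name build_custom_policy is hypothetical, and "Resource": "*" is a placeholder that you should narrow to specific resources where possible.

```python
import json


def build_custom_policy(actions: list) -> str:
    """Return a RAM policy document that allows only the given actions."""
    policy = {
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                # Assumption: no resource scoping; narrow this where possible.
                "Resource": "*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


print(build_custom_policy(["emr-serverless-starrocks:ListInstances"]))
```

You can paste the generated JSON into the script editor when you create a custom policy in the RAM console, and add further actions to the list as more missing permissions are identified.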