This topic answers common questions about StarRocks data import.
-
General questions
-
How do I resolve the "close index channel failed" or "too many tablet versions" error?
-
How do I resolve the "ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel" error?
-
How do I troubleshoot a remote procedure call (RPC) timeout that occurs during data loading?
-
How do I resolve the "Value count does not match column count" error?
-
How do I resolve the "Too many open files" error found in the BE service logs during data loading?
-
How do I resolve the "increase config load_process_max_memory_limit_percent" error?
-
How do I resolve errors that occur when reading data from an OSS external table using JindoSDK?
-
How do I resolve the "transmit chunk rpc failed" error during data loading?
-
What should I do if the memory used by the primary key index exceeds the limit during data loading?
-
Why do I get a cancellation error when reading data using a Spark connector?
-
How do I troubleshoot slow transaction processing when running a load job with the Spark connector?
-
Stream load
-
Routine load
-
Broker load
-
How do I fix garbled characters that appear during a broker load?
-
Why does a broker load job report no errors, but I cannot query any data?
-
How do I resolve the "failed to send batch" or "TabletWriter add batch with unknown id" error?
-
How do I troubleshoot a load job that takes too long to finish?
-
How do I configure access to an Apache HDFS cluster in high availability (HA) mode?
-
INSERT INTO
-
Real-time sync from MySQL to StarRocks
-
Flink connector
-
How do I resolve a load failure when using the exactly-once transactional API?
-
Why are the BE service's memory and CPU fully utilized even when no queries are running?
-
Why doesn't the BE service release its allocated memory back to the operating system?
-
Does the sink.buffer-flush.interval-ms parameter take effect in the Flink connector for StarRocks?
-
DataX writer
-
Spark load
"Close index channel failed" or "too many tablet versions" error
-
Cause
This issue is caused by frequent data imports. When new data versions are created faster than the system can merge them through compaction, the number of uncompacted versions exceeds the system limit.
-
Solution
The default limit for uncompacted versions is 1,000. You can resolve this issue with the following methods:
-
Increase the batch size for each data import and reduce the import frequency.
-
Modify the following parameters in the BE configuration file be.conf to accelerate compaction.
cumulative_compaction_num_threads_per_disk = 4
base_compaction_num_threads_per_disk = 2
cumulative_compaction_check_interval_seconds = 2
-
The "Label Already Exists" error
-
Problem
This error indicates that a load job with the same label is already running or has successfully completed in the same database on your StarRocks cluster.
-
Cause
Stream Load submits load jobs over HTTP, and most HTTP clients have a built-in retry mechanism. When the StarRocks cluster receives the initial request, it starts the load job. However, if the cluster fails to respond before the client's request times out, the client may automatically resend the request. The cluster is already processing the original request, so it rejects the duplicate and returns the Label Already Exists error.
-
Solution
Check for label conflicts between different load methods or for duplicate load job submissions. Troubleshoot the issue as follows:
-
Search the primary FE logs for the job label. If the same label appears twice, this confirms the client resubmitted the request.
Note: In a StarRocks cluster, load job labels are unique within a database, regardless of the load method. Therefore, a conflict can occur if different load jobs use the same label.
-
Run the SHOW LOAD WHERE LABEL = "xxx" command to check whether a load job with the same label exists in the FINISHED state. Replace xxx with the label you are checking.
As a best practice, estimate the load time based on the data volume of your request. Then, set the client's request timeout to be longer than that estimate. This prevents the client from prematurely resubmitting the request.
-
The "ETL_QUALITY_UNSATISFIED" error
Run the SHOW LOAD command. In the command output, find the URL to inspect the error data. Common error types include:
-
convert csv string to INT failed. This error indicates that a string in a column of the data file could not be converted to its corresponding data type. For example, the string abc cannot be converted to a numeric type.
-
the length of input is too long than schema. The length of a field in the data file exceeds the length defined in the table schema. For example, a fixed-length string is longer than its defined length, or an INT-type field exceeds 4 bytes.
-
actual column number is less than schema column number. When split by the specified delimiter, a row in the data file results in fewer columns than defined in the schema. This may indicate that the delimiter is incorrect.
-
actual column number is more than schema column number. When split by the specified delimiter, a row in the data file results in more columns than defined in the schema.
-
the frac part length longer than schema scale. The fractional part of a DECIMAL-type column in the data file exceeds the scale defined in the schema.
-
the int part length longer than schema precision. The integer part of a DECIMAL-type column in the data file contains more digits than the precision defined in the schema.
-
there is no corresponding partition for this key. The value of the partition column for a row in the data file does not fall within any defined partition range.
Error 1064: Failed to find enough host
When creating a table, add "replication_num" = "1" to the table properties.
Resolve 'Too many open files' error
Follow these steps:
-
Increase the operating system's file handle limit.
-
Decrease the values of the base_compaction_num_threads_per_disk and cumulative_compaction_num_threads_per_disk parameters (the default value for both is 1). For instructions on how to modify configuration items, see Modify configuration items.
-
If the issue persists, scale out your cluster or reduce the import frequency.
The "increase config load_process_max_memory_limit_percent" error
To resolve the "increase config load_process_max_memory_limit_percent" error during data import, go to the Parameter Settings tab for your StarRocks instance and increase the values of the load_process_max_memory_limit_bytes and load_process_max_memory_limit_percent parameters.
Troubleshoot RPC timeouts
Check the write_buffer_size setting in the BE configuration file. This parameter controls the size threshold for memory blocks on a BE, and its default value is 100 MB. If the threshold is too large, it may cause a remote procedure call (RPC) timeout. In this case, adjust write_buffer_size along with the tablet_writer_rpc_timeout_sec parameter. For more on BE parameters, see parameter configuration.
"Value count does not match column count" error
-
Problem
An import job returns the "Value count does not match column count" error when the number of columns parsed from the source data does not match the number of columns in the destination table.
-
Cause
This error occurs when the column delimiter specified in the import command or import statement does not match the one in the source data. For example, if a three-column file in CSV format uses a comma (,) as its delimiter, but the import command or import statement specifies a tab character (\t), the file is incorrectly parsed as a single column.
-
Solution
Change the column delimiter in the import command or import statement to match the one in your source data, and then retry the import.
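The parsing behavior described above can be reproduced outside StarRocks. The following Python sketch is only an illustration, not how StarRocks parses files internally: it splits one sample row with the wrong and the right delimiter and prints the resulting value counts. The function name and the sample row are hypothetical.

```python
def count_values(row, delimiter):
    """Return how many values a row yields when split by the given delimiter."""
    return len(row.split(delimiter))


# Hypothetical sample: one row of a three-column, comma-delimited file.
row = "1,2021-06-01,9.99"
expected_columns = 3

wrong = count_values(row, "\t")  # the row contains no tab, so it stays in one piece
right = count_values(row, ",")

print(f"split by tab:   {wrong} value(s), table expects {expected_columns}")
print(f"split by comma: {right} value(s), table expects {expected_columns}")
```

Running a check like this against a few rows of the source file before submitting the import is a quick way to confirm which delimiter the file actually uses.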
Choose an import method
For guidance on choosing an import method, see data import.
Factors affecting import performance
Several factors can affect import performance:
-
Memory
A high number of tablets uses more memory. Estimate the size of each tablet as described in "How to configure bucketing?"
-
Disk I/O and network bandwidth
A network bandwidth of 50 Mbit/s to 100 Mbit/s is typically sufficient.
-
Import batch size and frequency
-
The recommended batch size for Stream Load is 10 MB to 100 MB.
-
Broker Load is optimized for large batch sizes.
-
Avoid a high import frequency. For SATA disks, do not exceed one task per second.
-
Handle header rows during Stream Load
Stream Load does not natively support identifying or skipping header rows in a text file; it treats all rows as data. If your source file includes a header row, you can use one of the following methods to handle it:
-
Modify the settings in your export tool to re-export the text file without the header row.
-
Use the sed -i '1d' filename command to delete the first row of the text file.
-
In your Stream Load statement, use the -H "where: <column_name> != 'header_string'" option to filter out the header row. StarRocks first attempts to convert the data before applying the filter. If a string in the header row fails to convert to the target data type, it becomes null. Therefore, this method requires that the target columns in your StarRocks table are nullable (not defined as NOT NULL).
-
In your Stream Load statement, add the -H "max_filter_ratio:0.01" option. This sets a tolerance ratio for the import job, allowing it to ignore the header row error. You can set the ratio to 1% or less, but ensure the value is high enough to tolerate one error row relative to your total data volume. Even with this setting, the ErrorURL in the response will still report the error, but the overall import job will succeed. Avoid setting the tolerance ratio too high, as this might mask genuine data quality issues.
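If sed is not available, the header row can also be dropped with a small script before the file is submitted. The following Python sketch is one possible way to do this, assuming a UTF-8 text file whose first line is the header. The function name and file names are hypothetical.

```python
import os
import tempfile


def strip_header(src_path, dst_path):
    """Copy src_path to dst_path without its first line; return data rows written."""
    rows = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        next(src, None)  # skip the header row; no error if the file is empty
        for line in src:
            dst.write(line)
            rows += 1
    return rows


# Demo on a throwaway file with a hypothetical header and two data rows.
with tempfile.TemporaryDirectory() as tmp:
    with_header = os.path.join(tmp, "with_header.csv")
    no_header = os.path.join(tmp, "no_header.csv")
    with open(with_header, "w", encoding="utf-8") as f:
        f.write("id,name\n1,alice\n2,bob\n")
    print(strip_header(with_header, no_header), "data rows written")
```

Unlike sed -i, this sketch writes a new file and leaves the original untouched, which makes it easier to retry if the wrong file was processed.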
Transform non-standard partition key data
StarRocks lets you transform data during the load process, so a column stored in a non-standard format can still be loaded as the partition key.
For example, suppose the data file TEST is in CSV format and contains four columns: NO, DATE, VERSION, and PRICE. However, the DATE column has a non-standard format, such as 202106.00. To use DATE as the partition column in your StarRocks table, you must first create a table in StarRocks and define the data type of the DATE column as DATE, DATETIME, or INT. Then, specify the following setting in your Stream Load statement to transform the data on the fly:
-H "columns: NO,DATE_1, VERSION, PRICE, DATE=LEFT(DATE_1,6)"
In this statement, DATE_1 is a temporary variable that holds the raw data. The LEFT() function transforms this data, and the result is assigned to the DATE column in the StarRocks table. Note that you must first list temporary names for all columns in the CSV file and then define the transformation expressions. Column transformation supports scalar functions, which include non-aggregate functions and window functions.
"body exceed max size: 10737418240, limit: 10737418240" error
-
Cause
The source data file exceeds the 10 GB size limit for a Stream Load job.
-
Solution
-
Split the data file into smaller files using seq -w 0 n.
-
Run curl -XPOST http://be_host:http_port/api/update_config?streaming_load_max_mb=<file_size> to adjust the streaming_load_max_mb BE configuration item. For more information about BE configuration items, see parameter configuration.
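As an illustration of the first option, the following Python sketch splits a text file into parts that each stay under a size limit, cutting only at line boundaries so that no row is broken in two. It is a sketch under stated assumptions rather than a StarRocks tool: the function name, the part_NNNN naming scheme, and the tiny demo limit are all hypothetical.

```python
import os
import tempfile
from typing import List


def split_file(src_path: str, max_bytes: int, out_dir: str) -> List[str]:
    """Split src_path into parts of at most max_bytes each, on line boundaries.

    A single line longer than max_bytes is written to a part of its own.
    """
    parts: List[str] = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        for line in src:
            # Start a new part for the first line, or when this line would overflow.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                part_path = os.path.join(out_dir, f"part_{len(parts):04d}")
                out = open(part_path, "wb")
                parts.append(part_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts


# Demo: five 5-byte rows with a 10-byte limit (a real limit would be far larger).
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "data.csv")
    with open(source, "wb") as f:
        f.write(b"aaaa\n" * 5)
    for path in split_file(source, 10, tmp):
        print(os.path.basename(path), os.path.getsize(path), "bytes")
```

Each resulting part can then be submitted as its own Stream Load job.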
-
Improve import performance
Method 1: Increase task concurrency
This method may consume more CPU resources and result in too many import versions.
Split an import job into as many import tasks as possible to run them in parallel. The following formula determines the actual task concurrency, which is capped by the number of BE nodes or consumption partitions.
min(alive_be_number, partition_number, desired_concurrent_number, max_routine_load_task_concurrent_num)
The parameters are as follows:
-
alive_be_number: The number of live BEs.
-
partition_number: The number of consumption partitions.
-
desired_concurrent_number: The desired task concurrency for a single Routine Load import job. The default value is 3.
-
If you have not created the import job, set this parameter using CREATE ROUTINE LOAD.
-
If the import job already exists, modify this parameter using ALTER ROUTINE LOAD.
-
-
max_routine_load_task_concurrent_num: The maximum task concurrency for a Routine Load import job. The default value is 5. This is an FE dynamic parameter. For more information, including how to set it, see parameter configuration.
Therefore, if the number of consumption partitions and BE nodes is greater than the values of the desired_concurrent_number and max_routine_load_task_concurrent_num parameters, you can increase the actual task concurrency by increasing the values of desired_concurrent_number and max_routine_load_task_concurrent_num.
For example, assume you have 7 consumption partitions, 5 live BEs, and max_routine_load_task_concurrent_num is 5 (its default value). To maximize the actual task concurrency, set desired_concurrent_number to 5 (up from the default of 3). The resulting actual task concurrency is min(5,7,5,5), or 5.
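The formula can be checked with a few lines of Python. This is only a worked calculation of the min() expression above, using the default values stated in this section. The function name is hypothetical.

```python
def actual_task_concurrency(alive_be_number,
                            partition_number,
                            desired_concurrent_number=3,
                            max_routine_load_task_concurrent_num=5):
    """Evaluate the concurrency formula; the defaults are the ones stated above."""
    return min(alive_be_number,
               partition_number,
               desired_concurrent_number,
               max_routine_load_task_concurrent_num)


# 5 live BEs and 7 consumption partitions, as in the example above.
print(actual_task_concurrency(5, 7))                               # defaults apply
print(actual_task_concurrency(5, 7, desired_concurrent_number=5))  # raised to 5
```

With the defaults, desired_concurrent_number is the limiting term. Once it is raised to 5, the number of live BEs and max_routine_load_task_concurrent_num become the cap.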
Method 2: Increase data volume per task
This method can increase data import latency.
Either the max_routine_load_batch_size or the routine_load_task_consume_second parameter determines the consumption limit for a single Routine Load import task. An import task completes when it meets the condition of either parameter. Both are FE configuration items. For more information, including how to set them, see parameter configuration.
You can analyze the logs in be/log/be.INFO to determine which of these parameters limits the data consumption of a single import task. You can then increase the value of that parameter to allow each task to consume more data.
I0325 20:27:50.410579 15259 data_consumer_group.cpp:131] consumer group done: 41448fb1a0ca59ad-30e34dabfa7e47a0. consume time(ms)=3261, received rows=179190, received bytes=9855450, eos: 1, left_time: -261, left_bytes: 514432550, blocking get time(us): 3065086, blocking put time(us): 24855
Normally, left_bytes >=0 in the log means the task did not read enough data to reach the max_routine_load_batch_size limit within the routine_load_task_consume_second time frame. This means the scheduled batch of import tasks can consume all available data from Kafka with no consumption latency. In this situation, you can increase the value of routine_load_task_consume_second to allow each import task to consume more data.
If left_bytes < 0, it means the task read enough data to reach the max_routine_load_batch_size limit before the routine_load_task_consume_second time limit was reached. This means each batch of scheduled import tasks fills with data from Kafka, likely creating a consumption backlog. In this scenario, increase the value of max_routine_load_batch_size.
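The decision rule in the two paragraphs above can be sketched as a small log check. The following Python snippet assumes the log line format shown in the sample above and only reads the left_bytes field. The function name is hypothetical, and the sample line is shortened.

```python
import re

# Shortened version of the sample log line shown above.
SAMPLE_LINE = (
    "I0325 20:27:50.410579 15259 data_consumer_group.cpp:131] consumer group done: "
    "41448fb1a0ca59ad-30e34dabfa7e47a0. consume time(ms)=3261, received rows=179190, "
    "received bytes=9855450, eos: 1, left_time: -261, left_bytes: 514432550"
)


def parameter_to_increase(log_line):
    """Apply the rule above: pick the parameter that limited this task."""
    match = re.search(r"left_bytes:\s*(-?\d+)", log_line)
    if match is None:
        raise ValueError("no left_bytes field found in the log line")
    left_bytes = int(match.group(1))
    if left_bytes >= 0:
        # The time limit was reached before the batch size limit.
        return "routine_load_task_consume_second"
    # The batch size limit was reached first.
    return "max_routine_load_batch_size"


print(parameter_to_increase(SAMPLE_LINE))
```

For the sample line, left_bytes is positive, so the task was limited by time and routine_load_task_consume_second is the parameter to raise.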
Troubleshoot PAUSED or CANCELLED Routine Load jobs
Use the returned error message to troubleshoot and resolve the issue:
-
Symptom: The import job status changes to PAUSED, and ReasonOfStateChanged returns the error Broker: Offset out of range.
-
Cause: The consumer offset for the import job does not exist in the Kafka partition.
-
Solution: Run the SHOW ROUTINE LOAD command and check the Progress parameter to find the job's latest consumer offset. Then, verify that a message with this offset exists in the Kafka partition. If not, there are two possible causes:
-
The consumer offset specified when the import job was created points to a future time.
-
Kafka removed the message at the specified offset before the import job could consume it. Configure a Kafka log cleanup policy that matches the job's import speed. For example, you can adjust parameters such as log.retention.hours and log.retention.bytes.
-
Symptom: The import job status changes to PAUSED.
-
Cause: An import task generated more error rows than the max_error_number threshold allows.
-
Solution: Inspect the ReasonOfStateChanged and ErrorLogUrls fields to identify the cause.
-
If the data source contains data in an invalid format, correct the format. After you correct the data, run the RESUME ROUTINE LOAD command to resume the job.
-
If StarRocks cannot parse the data format, increase the max_error_number error row threshold.
-
Run the SHOW ROUTINE LOAD command to view the current max_error_number.
-
Run the ALTER ROUTINE LOAD command to increase the max_error_number threshold.
-
Run the RESUME ROUTINE LOAD command to resume the job.
-
Symptom: The import job status changes to CANCELLED.
-
Cause: An exception may have occurred during the import task. For example, the destination table was deleted.
-
Solution: Inspect the ReasonOfStateChanged or ErrorLogUrls fields to identify and resolve the issue. However, a CANCELLED job cannot be resumed, even after you resolve the underlying issue.
-
Routine Load and exactly-once semantics
Routine Load guarantees exactly-once semantics.
Each import task runs as a single transaction. If an error occurs during execution, the transaction aborts, and the FE does not update the offsets for the relevant partitions. When the FE next schedules the task, it resumes consumption from the last saved offsets, ensuring exactly-once semantics.
Error: Broker: Offset out of range
Run the SHOW ROUTINE LOAD command to find the latest offset, and then use a Kafka client to check if data exists at that offset. This error can be caused by the following:
-
The specified offset is in the future.
-
Kafka clears the data at a specific offset before it can be imported. You need to set reasonable log cleanup parameters, such as log.retention.hours and log.retention.bytes, based on the StarRocks import speed.
Can a finished Broker Load job be rerun?
No, you cannot rerun a Broker Load job that is in the FINISHED state. To ensure exactly-once semantics, you cannot reuse the label of a completed load job. As a workaround, use SHOW LOAD to find the original job, copy its configuration, assign a new label, and submit a new load job.
Garbled content in Broker Load
Use the error URL to view the garbled content. For example:
Reason: column count mismatch, expect=6 real=1. src line: [$交通]; zcI~跟团+v]; count mismatch, expect=6 real=2. src line: [租erD食休闲娱乐
This issue is caused by an incorrect FORMAT AS parameter in the import job request. To fix it, set the FORMAT AS parameter to match your source file's format and retry the import job.
Date fields are 8 hours ahead with Broker Load
-
Cause
The cause is a timezone mismatch. The StarRocks table and the Broker Load job are configured with China Standard Time, but the server uses UTC. This mismatch shifts the date field forward by 8 hours during the load.
-
Solution
Remove the timezone parameter during table creation.
Broker Load error: Cannot cast '<slot 6>' from VARCHAR to ARRAY<VARCHAR(30)>
-
Cause
This error occurs because the column names in the source data file do not match those in the StarRocks table. When the SET clause is executed, StarRocks performs type inference, but the data type conversion fails when the cast function is called.
-
Solution
Ensure that the column names in the source file and the target table are identical. This eliminates the need for a SET clause. As a result, StarRocks does not call the cast function for data type conversion, which allows the import to succeed.
Missing data after successful Broker Load
Broker Load is an asynchronous load method. Although a statement to create a load job may complete without errors, the job itself may not have succeeded. Run the SHOW LOAD statement to check the result status and errmsg of the load job. You can then adjust the parameters and retry the job.
The "failed to send batch" or "TabletWriter add batch with unknown id" error
This error is caused by a data write timeout. To resolve it, modify the system variable query_timeout and the BE configuration item streaming_load_rpc_max_alive_time_sec. For more information about BE configuration items, see parameter configuration.
Resolve the "LOAD-RUN-FAIL; msg:OrcScannerAdapter::init_include_columns. col name = xxx not found" error
When importing data in Parquet or ORC format, ensure the column names in the file header match those in the StarRocks table.
For example, the following code maps the columns tmp_c1 and tmp_c2 in a Parquet or ORC file to the name and id columns in the StarRocks table. Without a SET clause, the mapping is based on the columns specified in the column_list parameter.
(tmp_c1,tmp_c2)
SET
(
id=tmp_c2,
name=tmp_c1
)
When loading an ORC file generated directly by Apache Hive with a table header of (_col0, _col1, _col2, ...), the load job may fail with an "Invalid Column Name" error. In this case, use the SET clause to define column conversion rules.
Troubleshoot a slow load job
In the FE log file fe.log, find the load job ID by searching for its label. Then, in the BE log file be.INFO, use this ID to find the root cause from the log context.
Configure access to an HDFS HA cluster
Configuring NameNode high availability (HA) allows the client to automatically identify the new active NameNode during a failover. Configure the following parameters to access an HDFS cluster in HA mode.
Parameter | Description |
dfs.nameservices | A custom name for the HDFS service. For example, set dfs.nameservices to my-ha. |
dfs.ha.namenodes.xxx | Specifies the logical IDs for the NameNodes. Separate multiple IDs with a comma (,). Replace xxx with the name you set in dfs.nameservices. For example, set dfs.ha.namenodes.my-ha to my_namenode1,my_namenode2. |
dfs.namenode.rpc-address.xxx.nn | The RPC address of the NameNode, in the host:port format. In this parameter, nn represents a NameNode ID you configured in dfs.ha.namenodes.xxx. For example, set dfs.namenode.rpc-address.my-ha.my_namenode1 to nn1-host:rpc_port. |
dfs.client.failover.proxy.provider.xxx | The provider class that the client uses to connect to the NameNodes. The default value is org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider. |
You can use HA mode with either simple authentication or Kerberos authentication to access your cluster. The following example uses simple authentication to access an HA HDFS cluster:
(
"username"="user",
"password"="passwd",
"dfs.nameservices" = "my-ha",
"dfs.ha.namenodes.my-ha" = "my_namenode1,my_namenode2",
"dfs.namenode.rpc-address.my-ha.my_namenode1" = "nn1-host:rpc_port",
"dfs.namenode.rpc-address.my-ha.my_namenode2" = "nn2-host:rpc_port",
"dfs.client.failover.proxy.provider.my-ha" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
)
You can also add the HDFS cluster configurations to the hdfs-site.xml file. In that case, when a broker process reads information from the HDFS cluster, it requires only the cluster's file path and authentication information.
Configure Hadoop ViewFS Federation
Copy the core-site.xml and hdfs-site.xml configuration files to the broker/conf directory.
If you use a custom file system, also copy the corresponding .jar files to the broker/lib directory.
"Can't get Kerberos realm" error
-
Verify that the /etc/krb5.conf file is configured on every machine that hosts a broker.
-
If an error still occurs after configuration, add -Djava.security.krb5.conf=/etc/krb5.conf to the end of the JAVA_OPTS variable in the broker's startup script.
Slow performance of single-row INSERT INTO
The INSERT INTO statement is designed for batch writes. This means a single write incurs roughly the same overhead as a batch write. Therefore, for OLAP workloads, avoid using INSERT INTO statements for single-row insertions.
Resolve the "index channel has intolerable failure" error
This error occurs because the streaming load RPC times out. To resolve this issue, adjust the relevant RPC timeout parameters in the configuration file.
Modify the following two system parameters in the BE configuration file be.conf, and then restart the cluster to apply the changes.
-
streaming_load_rpc_max_alive_time_sec: The timeout for the streaming load RPC, defaulting to 1200 seconds.
-
tablet_writer_rpc_timeout_sec: The timeout for the TabletWriter, defaulting to 600 seconds.
Resolve "execute timeout" error with INSERT INTO SELECT
This error indicates a query timeout. To resolve this, adjust the session variable query_timeout. This parameter defaults to 600 seconds.
Example:
set query_timeout = xx;
Flink job error: "One or more required options are missing"
-
Cause
In the StarRocks-migrate-tools (SMT) configuration file config_prod.conf, multiple rule groups such as [table-rule.1] and [table-rule.2] are configured, but necessary configuration is missing.
-
Solution
Check whether the database, table, and Flink connector information is configured for each rule group, such as [table-rule.1] and [table-rule.2].
Automatic restart of failed tasks
Flink uses checkpointing and a restart strategy to automatically restart failed tasks.
For example, to enable checkpointing and use the default fixed-delay restart strategy, configure the following parameters in the flink-conf.yaml configuration file.
# unit: ms
execution.checkpointing.interval: 300000
state.backend: filesystem
state.checkpoints.dir: file:///tmp/flink-checkpoints-directory
The parameters are as follows:
-
execution.checkpointing.interval: The interval between checkpoints, in milliseconds (ms). A value greater than 0 enables checkpointing.
-
state.backend: When checkpointing is enabled, Flink persists the application state with each checkpoint to prevent data loss and ensure consistency during recovery. The storage format, persistence method, and storage location for the state depend on the selected state backend. For more information, see state backends.
-
state.checkpoints.dir: The directory where Flink stores checkpoint data.
Manually stop and restore a Flink job
When you stop a Flink job, you can manually trigger a Savepoint. A Savepoint is a consistent snapshot of the execution state of a streaming job, created through the Checkpointing mechanism. You can later restore the Flink job from a specified Savepoint.
Stop a job and create a Savepoint. This automatically triggers a Savepoint for the specified Flink job before stopping it. You can also specify a target directory to store the Savepoint.
Example command:
bin/flink stop --type [native/canonical] --savepointPath [:targetDirectory] :jobId
Parameters:
-
jobId: You can find the Flink job ID in the Flink Web UI or by running the flink list -running command.
-
targetDirectory: You can also configure a default directory for Savepoints in the Flink configuration file flink-conf.yaml by setting the state.savepoints.dir parameter. If this parameter is set, Savepoints are stored in this directory, making a directory specification optional when you trigger a Savepoint.
state.savepoints.dir: [file:// or hdfs://]/home/user/savepoints_dir
To restore the Flink job, specify the Savepoint when you resubmit the job.
./flink run -c com.starrocks.connector.flink.tools.ExecuteSQL -s savepoints_dir/savepoints-xxxxxxxx flink-connector-starrocks-xxxx.jar -f flink-create.all.sql
Exactly-once transaction load failure
-
Problem: You receive the following error message:
com.starrocks.data.load.stream.exception.StreamLoadFailException: {
    "TxnId": 3382****,
    "Label": "502c2770-cd48-423d-b6b7-9d8f9a59****",
    "Status": "Fail",
    "Message": "timeout by txn manager",
    "NumberTotalRows": 1637,
    "NumberLoadedRows": 1637,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 4284214,
    "LoadTimeMs": 120294,
    "BeginTxnTimeMs": 0,
    "StreamLoadPlanTimeMs": 7,
    "ReadDataTimeMs": 9,
    "WriteDataTimeMs": 120278,
    "CommitAndPublishTimeMs": 0
}
-
Cause: The sink.properties.timeout value is shorter than the Flink checkpoint interval, leading to a transaction timeout.
-
Solution: Increase the value of sink.properties.timeout so that it exceeds the Flink checkpoint interval.
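The comparison behind this fix can be sketched as a quick configuration check. The Python snippet below assumes that sink.properties.timeout is expressed in seconds and that the checkpoint interval is expressed in milliseconds, as execution.checkpointing.interval is. Verify both units against your connector version. The function name and the two sample timeout values are hypothetical.

```python
def timeout_exceeds_checkpoint(sink_timeout_s, checkpoint_interval_ms):
    """Return True if the sink timeout is longer than the checkpoint interval.

    Assumption: the timeout is in seconds and the interval is in milliseconds.
    """
    return sink_timeout_s * 1000 > checkpoint_interval_ms


checkpoint_interval_ms = 300000  # a 5-minute checkpoint interval

for timeout_s in (120, 600):  # hypothetical sink.properties.timeout values
    ok = timeout_exceeds_checkpoint(timeout_s, checkpoint_interval_ms)
    print(f"timeout={timeout_s}s, checkpoint={checkpoint_interval_ms}ms -> "
          f"{'ok' if ok else 'too short'}")
```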
8-hour time lag with flink-connector-jdbc_2.11
-
Problem: The localTimestamp function in Flink generates correct timestamps, but they lag by 8 hours when written to a StarRocks sink table. This issue occurs even though both the Flink and StarRocks servers are set to the Asia/Shanghai time zone. The environment uses Flink 1.12 and the flink-connector-jdbc_2.11 driver.
-
Solution: To resolve this, explicitly configure the time zone by setting the 'server-time-zone' = 'Asia/Shanghai' parameter in your Flink sink table definition and appending &serverTimezone=Asia/Shanghai to the url parameter. The following example demonstrates the correct configuration.
CREATE TABLE sk (
    sid int,
    local_dtm TIMESTAMP,
    curr_dtm TIMESTAMP
)
WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://192.168.**.**:9030/sys_device?characterEncoding=utf-8&serverTimezone=Asia/Shanghai',
    'table-name' = 'sink',
    'driver' = 'com.mysql.jdbc.Driver',
    'username' = 'sr',
    'password' = 'sr123',
    'server-time-zone' = 'Asia/Shanghai'
);
High BE memory and CPU usage at idle
The high CPU usage is temporary and occurs when the BE periodically collects statistics. High memory usage is expected. To optimize performance, a BE retains up to 10 GB of memory by default, managing it internally instead of releasing it back to the operating system. You can adjust this limit using the tc_use_memory_min parameter.
The tc_use_memory_min parameter specifies the minimum amount of memory retained by TCMalloc. The default value is 10737418240 bytes (10 GB). StarRocks releases idle memory back to the operating system only when its memory usage exceeds this value. You can configure this parameter in the be.conf file on the Configure tab of the StarRocks service in the EMR console. For more information about BE configuration, see Parameter Configuration.
BE memory is not released to the OS
The database requests large blocks of memory from the operating system. To avoid the high overhead of large-scale memory allocation, it pre-allocates extra memory and delays its release for reuse. To observe this behavior, monitor memory usage in a test environment over an extended period to verify that the memory is eventually released.
Unresolved Flink connector dependencies
-
Cause: The dependencies cannot be resolved because the mirror section in /etc/maven/settings.xml is not configured with the Alibaba Cloud mirror address.
-
Solution: In the mirror section of /etc/maven/settings.xml, set the mirror address to the Alibaba Cloud public repository, https://maven.aliyun.com/repository/public.
Behavior of sink.buffer-flush.interval-ms
-
Problem: If sink.buffer-flush.interval-ms is 15s and the checkpoint interval is 5 minutes, does sink.buffer-flush.interval-ms still take effect?
Option | Required | Default | Type | Description |
sink.buffer-flush.interval-ms | NO | 300000 | String | The flushing time interval. Range: [1000ms, 3600000ms]. |
-
Solution: A flush is triggered when any of the following parameters reaches its threshold:
sink.buffer-flush.max-rows
sink.buffer-flush.max-bytes
sink.buffer-flush.interval-ms
This behavior is independent of the checkpoint interval setting. The checkpoint interval applies only to exactly once semantics; at least once semantics rely on sink.buffer-flush.interval-ms.
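The "any threshold" rule from the answer above can be illustrated with a small model. The following Python sketch is not the connector's implementation. It only shows that the three conditions are combined with a logical OR. The class name and the row and byte limits are hypothetical, and the 15-second interval comes from the question.

```python
from dataclasses import dataclass


@dataclass
class FlushThresholds:
    max_rows: int     # stands in for sink.buffer-flush.max-rows
    max_bytes: int    # stands in for sink.buffer-flush.max-bytes
    interval_ms: int  # stands in for sink.buffer-flush.interval-ms

    def should_flush(self, buffered_rows, buffered_bytes, ms_since_last_flush):
        """Flush as soon as any one of the three thresholds is reached."""
        return (
            buffered_rows >= self.max_rows
            or buffered_bytes >= self.max_bytes
            or ms_since_last_flush >= self.interval_ms
        )


# interval_ms is the 15 s from the question; the other two limits are hypothetical.
thresholds = FlushThresholds(max_rows=500_000,
                             max_bytes=64 * 1024 * 1024,
                             interval_ms=15_000)

print(thresholds.should_flush(1_000, 4_096, 15_000))  # interval reached
print(thresholds.should_flush(1_000, 4_096, 2_000))   # no threshold reached yet
```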
Data updates with DataX
DataX supports data updates with the primary key model. To enable this, add the _op field to the reader section of your JSON configuration file.
Handle keywords in names for DataX
Enclose the corresponding field name in backticks (`).
Error: HADOOP-CONF-DIR or YARN-CONF-DIR not set
This error occurs when using Spark Load because the HADOOP_CONF_DIR environment variable is not set in the spark-env.sh file on the Spark client.
Error: Cannot run program "xxx/bin/spark-submit": error=2, No such file or directory
This error indicates that when using Spark Load, the spark_home_default_dir parameter is either unspecified or points to an incorrect Spark client root directory.
"File xxx/jars/spark-2x.zip does not exist" error
This Spark Load error indicates that the spark-resource-path configuration option does not point to the packaged ZIP file. Verify that the file path and file name are correct.
Resolve "YARN client does not exist" error
This error occurs during Spark Load when the yarn_client_path configuration option does not point to the YARN executable file.
High disk usage triggers safe mode
-
Problem: When the disk usage of a BE node in the cluster exceeds the threshold, the system automatically enters safe mode, blocking new data write operations. This issue often results from the accumulation of large amounts of garbage data in the trash directory after large-scale delete or update operations.
-
Cause: Insufficient disk space.
When disk usage exceeds the threshold, the system triggers safe mode to block further data write operations and prevent the disk from becoming full.
Safe mode is triggered when the available space on a disk drops below a calculated threshold. This threshold is the minimum of the following two values:
-
10% of the total capacity of a single disk (that is, 10% * diskInfo.getTotalCapacityB()).
-
The value of the safe_mode_check_disk_space configuration parameter, which defaults to 100 GB (107,374,182,400 bytes).
The formula is as follows:
double safeModeCheckDiskCapacity = Math.min(0.10 * diskInfo.getTotalCapacityB(), Config.safe_mode_check_disk_space);
-
Solutions:
-
Option 1 (Recommended): Expand disk capacity.
If you have insufficient disk space, we recommend expanding the disk capacity. This resolves the issue at its root and prevents safe mode from being triggered frequently. For more information, see Disk Expansion.
-
Option 2: Adjust the safe mode threshold.
If you cannot immediately expand the disk, you can adjust the safe mode threshold as a temporary solution.
On the Parameter Settings page, add the FE configuration parameter safe_mode_check_disk_space.
Note: You can lower this value based on your business requirements, but we do not recommend setting it below 50 GB (53,687,091,200 bytes). Lowering this threshold increases the risk of the disk becoming full, which can make the instance unavailable.
After the adjustment, the minimum required available space is the smaller of total capacity * 10% and the adjusted value of safe_mode_check_disk_space. With the default value of 100 GB, a disk with less than 1000 GB of total capacity uses total capacity * 10%, and a disk with 1000 GB or more uses the value of safe_mode_check_disk_space.
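The threshold calculation can be checked with a few sample numbers. The following Python sketch mirrors the Math.min formula shown in the Cause section; the function name safe_mode_threshold_bytes and the sample disk sizes are illustrative assumptions, not part of StarRocks.

```python
GIB = 1024 ** 3  # bytes per GB as used in the text (100 GB = 107,374,182,400 bytes)

DEFAULT_SAFE_MODE_CHECK_DISK_SPACE = 100 * GIB


def safe_mode_threshold_bytes(
        total_capacity_bytes: int,
        safe_mode_check_disk_space: int = DEFAULT_SAFE_MODE_CHECK_DISK_SPACE,
) -> float:
    """Mirror Math.min(0.10 * total capacity, safe_mode_check_disk_space)."""
    return min(0.10 * total_capacity_bytes, safe_mode_check_disk_space)


# Default setting: small disks are governed by the 10% rule,
# large disks by the configured value.
for capacity_gb in (500, 1000, 2000):
    threshold = safe_mode_threshold_bytes(capacity_gb * GIB)
    print(f"{capacity_gb} GB disk -> {threshold / GIB:.0f} GB must stay free")

# Lowered setting (50 GB, the recommended floor) on an assumed 800 GB disk.
lowered = safe_mode_threshold_bytes(800 * GIB, 50 * GIB)
print(f"800 GB disk, 50 GB setting -> {lowered / GIB:.0f} GB must stay free")
```

Because the result is a minimum of the two values, the point at which the configured value takes over sits at ten times that value: 1000 GB of total capacity for the 100 GB default, and 500 GB if the parameter is lowered to 50 GB.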
-
Queries canceled when reading with Spark Connector
-
Problem: When you use the Spark Connector to read data from StarRocks, a query might be unexpectedly canceled. This error typically occurs with long-running queries when an automatic cleanup process deletes the expired data versions they need.
-
Cause: If a query's execution time exceeds the configured retention period for data versions, the versions that the query requires may be garbage-collected before it completes. As a result, the task is canceled because it cannot find the necessary data.
-
Solution: On the Parameter Settings page, increase the lake_autovacuum_grace_period_minutes parameter. The default value is 30 minutes. You can set a higher value based on your actual query requirements, for example, 60 minutes or more, to prevent the premature deletion of data required by long-running queries.
Flink job fails with "Could not resolve host"
To resolve this issue, set the sink.version=V1 parameter in your Flink job. By default, the connector uses version V2, which has stability and compatibility issues. Version V1 is more stable. For more information, see the open source StarRocks documentation for Apache Flink.
Handle excessive memory consumption of primary key index
Consider the following optimization solutions:
-
Create a partitioned table: For tables that use the primary key model, creating partitions prevents the primary key index for the entire table from being loaded into memory at once during data writes. This reduces memory pressure.
-
Enable the persistent primary key index: This feature persists the primary key index on disk instead of storing it in memory, which reduces memory consumption. Run the following SQL statement to enable it.
ALTER TABLE xxx SET ("enable_persistent_index"="true");
JindoSDK read errors from OSS external tables
-
Problem: An error occurs when you use JindoSDK to read StarRocks external table data from OSS.
-
Cause: This issue typically occurs because the pread implementation of JindoSDK does not function as expected, which results in errors when you read data.
-
Solution: On the Parameter Settings page, adjust the following settings in the jindosdk.cfg file to optimize the memory buffer and reduce the occurrence of errors.
-
fs.oss.memory.buffer.size.watermark: The watermark for the memory buffer. Set this value to 0.6 (the default value is 0.3).
-
fs.oss.memory.buffer.size.max.mb: The maximum size of the memory buffer, in MB. We recommend that you set this value to 6144 (6 GB). The default value is 6144 MB.
-
Disk space spike during data import
-
Problem: When you import data into StarRocks, you might observe an unexpected increase in disk space usage, particularly in the partition marked as "other". This can affect system stability and performance.
-
Possible cause: This issue is typically caused by lingering temporary or obsolete files from the data import process. The most common reason is an excessive buildup of files in the trash.
-
Solution: Adjust the trash_file_expire_time_sec parameter to accelerate the cleanup of files from the trash. This parameter defines how long, in seconds, files remain in the trash before being permanently deleted. Reducing this value releases disk space more quickly.
Slow transactions with the Spark connector
-
Problem: You may experience slow transaction processing when running a data import job with the Spark connector on a fully managed StarRocks cluster in Alibaba Cloud E-MapReduce. This issue typically occurs when the number of pending transactions reaches the system's default limit, which significantly slows subsequent data imports.
-
Solution: To optimize performance, set the lake_enable_batch_publish_version parameter to true.
Resolve the "transmit chunk rpc failed" error
-
Problem: You encounter the following error message when you import data.
transmit chunk rpc failed [dest_instance_id=5a2489c6-f0d8-11ee-abf6-061aec******] [dest=172.17.**.**:8060] detail: brpc failed, error=Host is down, error_text=[E112]Not connected to 172.17.**.**:8060 yet, server_id=904 [R1][E112]Not connected to 172.17.**.**:8060 yet, server_id=904 [R2][E112]Not connected to 172.17.**.**:8060 yet, server_id=904 [R3][E112]Not connected to 172.17.**.**:8060 yet, server_id=904
-
Possible cause: This error typically indicates congestion during RPC communication.
-
Solution: You can resolve this issue using one of the following two methods.
-
Adjust the unwritten data threshold. The current state may be OVER CROWDED, which indicates that the amount of unwritten data at the RPC source exceeds the threshold set by brpc_socket_max_unwritten_bytes. The default value of this parameter is 1 GB. We recommend that you increase this value appropriately to avoid the OVER CROWDED error.
-
Increase the RPC message body size. The error can also occur when the RPC body size exceeds the brpc_max_body_size limit, which has a default value of 2 GB. This can happen if a query contains oversized strings or bitmap type data. You can resolve this issue by increasing the BE parameter brpc_max_body_size.
-
Resolve the "Cancelled because of runtime state is cancelled" error
-
Problem: The error "Cancelled because of runtime state is cancelled" occurs when you run a Stream Load job without setting the sink.buffer-flush.interval-ms parameter.
-
Possible cause: This issue typically occurs during data loading into StarRocks. The sink.buffer-flush.interval-ms parameter defines the flush interval for the data buffer, determining how often accumulated data is written to the target storage. If you do not explicitly set this parameter, the system uses a default value. This default value may be too large, causing data to remain in memory for an extended period without being flushed. This can lead to the runtime state being canceled due to a timeout or other resource constraints, which triggers the error.
-
Solution: Explicitly set the sink.buffer-flush.interval-ms parameter in your code to override the default value. This lets you adjust the buffer flush frequency for your specific use case, ensuring a more efficient and stable data transfer.
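As a hedged illustration of setting the interval explicitly, the Python sketch below validates a candidate value against the range given for this option earlier in this topic (1000 ms to 3600000 ms) and renders it as an option line. The helper name flush_interval_option is hypothetical, and the quoted 'key' = 'value' output assumes the option is passed in a Flink SQL WITH clause; adapt it to however your job supplies sink options.

```python
MIN_INTERVAL_MS = 1_000      # lower bound of the documented range
MAX_INTERVAL_MS = 3_600_000  # upper bound of the documented range


def flush_interval_option(interval_ms: int) -> str:
    """Validate the interval and render it as a 'key' = 'value' option line."""
    if not MIN_INTERVAL_MS <= interval_ms <= MAX_INTERVAL_MS:
        raise ValueError(
            "sink.buffer-flush.interval-ms must be within "
            f"[{MIN_INTERVAL_MS}, {MAX_INTERVAL_MS}] ms, got {interval_ms}"
        )
    return f"'sink.buffer-flush.interval-ms' = '{interval_ms}'"


# Example: flush every 15 seconds instead of relying on the default.
print(flush_interval_option(15_000))
```

Rejecting out-of-range values before the job starts makes a misconfigured interval fail fast rather than surface later as a canceled load.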