An OOM (Out of Memory) error occurs when a query exceeds available memory in Hologres. This topic explains how to monitor memory usage, identify OOM errors, and resolve them.
Analyze memory consumption
-
View memory consumption
-
Total consumption: The Hologres console shows aggregated memory consumption across all nodes. For more information, see Metrics.
-
Per-query consumption: The
memory_bytesfield approximates per-query memory consumption. This value may be inaccurate. For more information, see Get and analyze slow query logs.
-
-
Handle high memory usage
Monitor overall memory usage in the Hologres console (see Metrics). Sustained usage above 80% is high. Hologres pre-allocates memory for metadata and cache, so idle usage of 30-50% is normal. Usage near 100% degrades stability and performance.
-
Causes
-
High memory consumption from metadata
Metadata memory grows with table volume and can cause high usage even when no tasks run. Keep each Table Group under 10,000 tables (including partitions, excluding foreign tables). Too many shards in a Table Group increases fragmentation and metadata overhead.
-
High memory consumption from computation
High query memory typically results from scanning large data volumes or complex operations such as multiple
COUNT DISTINCTfunctions, complexJOINoperations,GROUP BYon multiple columns, or window functions.
-
-
Key impacts
-
Stability
Excessive memory consumption, especially from metadata, reduces memory available for queries and can cause sporadic errors such as
SERVER_INTERNAL_ERROR,ERPC_ERROR_CONNECTION_CLOSED, orTotal memory used by all existing queries exceeded memory limitation. -
Performance
High memory usage from excessive metadata depletes cache space, lowering cache hit rates and increasing query latency.
-
-
Solutions
-
If high memory usage is caused by excessive metadata: Use the
hg_table_infotable to manage your tables. For more information, see Get and analyze table statistics. Delete unused data or tables and reduce unnecessary partitions to free memory. -
If high memory usage is caused by computation: Optimize SQL for write and query use cases separately. For more information, see Resolve OOM errors during queries and Resolve OOM errors during data import and export.
-
General solution: Scale up compute and storage resources. Instance list.
-
-
Identify OOM errors
An OOM error occurs when computation memory exceeds its allocated limit (e.g., 20 GB or more). A typical error message:
Total memory used by all existing queries exceeded memory limitation.
memory usage for existing queries=(2031xxxx,184yy)(2021yyyy,85yy)(1021121xxxx,6yy)(2021xxx,18yy)(202xxxx,14yy); Used/Limit: xy1/xy2 quota/sum_quota: zz/100
Interpret the error message as follows:
-
queries=(query_id, memory_used_by_query)Each entry, such as
queries=(2031xxxx,184yy), shows per-query memory consumption. For example,queries=(2031xxxx,18441803528)means queryquery_id=2031xxxxconsumed ~18 GB on a single node. The top 5 memory-intensive queries are listed. For more information, see Get and analyze slow query logs. -
Used/Limit: xy1/xy2Shows
compute_memory_used_on_node/compute_memory_limit_on_nodein bytes.Usedis the total compute memory consumed by all running queries on that node. For example,Used/Limit: 33288093696/33114697728means queries used 33.2 GB, exceeding the 33.1 GB limit and triggering OOM. -
quota/sum_quota: zz/100zzis the percentage of total instance resources allocated to a resource group. For example,quota/sum_quota: 50/100means the resource group uses 50% of total instance resources.
Basic causes of OOM errors
Hologres prioritizes in-memory computation for optimal query efficiency. Unlike systems that spill to disk when memory is insufficient, Hologres raises an OOM error directly when a query exceeds available memory.
Memory allocation and limits
A Hologres instance operates as a distributed system, comprising multiple nodes whose quantity varies with instance specifications. For more details, see Instance management.
Each node typically has 16 vCPUs and 64 GB of memory. An OOM error occurs if any single node exhausts its memory. The 64 GB is partitioned for query computation, backend processes, cache, and metadata. Before V1.1.24, compute memory was capped at 20 GB. V1.1.24 and later dynamically allocate available memory to queries when metadata consumption is low.
Resolve OOM errors during queries
-
Causes.
-
Incorrect execution plans: This can be due to inaccurate statistics, improper join order, or other optimization issues.
-
High query concurrency: Many queries simultaneously consuming substantial memory.
-
Complex queries: Inherently complex queries or those scanning large data volumes.
-
UNION ALLoperations: Queries containingUNION ALLcan increase executor parallelism, leading to higher memory usage. -
Insufficient resource group allocation: A resource group is configured but allocated inadequate resources.
-
Data skew or shard pruning: These can cause unbalanced load and high memory pressure on specific nodes.
-
-
Analysis and solutions:
-
Cause: Insufficient resource group allocation
Solution: Use the Serverless Computing feature to supplement your instance’s dedicated resources with additional compute capacity. For an overview and usage instructions, refer to Serverless computing and Work with serverless computing.
In Hologres V3.0 and later, Query Queues automatically rerun OOM queries on serverless computing resources. Control large queries.
-
Cause: Incorrect execution plan
-
Type 1: Inaccurate statistics
Run
EXPLAIN <SQL>to view the execution plan. As shown in the figure below,rows=1000indicates missing or inaccurate statistics, leading to an inefficient execution plan that consumes excessive resources and triggers an OOM error.
Solutions include the following:
-
Run the
ANALYZE <tablename>command to update table statistics. -
Enable auto analyze to automatically update statistics. For more information, see ANALYZE and AUTO ANALYZE.
-
-
Type 2: Incorrect join order
In a Hash Join, the smaller table should be the build side. Use
EXPLAIN <SQL>to check the execution plan. If the larger table builds the hash table, the join order is inefficient and can cause OOM. Common reasons:-
Outdated table statistics. For example, in the figure below, the upper table’s statistics were not updated, resulting in
rows=1000.
-
The optimizer failed to generate an optimal execution plan.
Solutions:
-
Run
ANALYZE <tablename>on all tables involved in the join to ensure up-to-date statistics. This helps the optimizer determine the correct join order. -
If the join order remains incorrect after running
ANALYZE <tablename>, adjust a GUC parameter. Setoptimizer_join_order = queryto force the optimizer to follow the join sequence specified in the SQL statement. This approach is particularly suitable for complex queries.SET optimizer_join_order = query; SELECT * FROM a JOIN b ON a.id = b.id; -- Table b is used as the build side of the hash table.You can also adjust the join order policy as needed.
Parameter
Description
set optimizer_join_order = <value>
This parameter controls the optimizer's Join Order algorithm. Valid values:
-
query: Does not perform Join Order transformation. Joins are executed strictly in the order specified in the SQL query. This setting incurs the lowest optimizer overhead. -
greedy: Employs a greedy algorithm to explore possible Join Orders. This option results in moderate optimizer overhead. -
exhaustive(default): Uses a dynamic planning algorithm for Join Order transformation. It aims to generate the optimal execution plan but comes with the highest optimizer overhead.
-
-
-
Type 3: Incorrect hash table estimation
In a hash join, the smaller input should build the hash table. However, query complexity or inaccurate statistics can cause the system to select a larger relation as the build input, creating an oversized hash table that triggers OOM.
As shown in the figure below,
Hash (cost=727353.45..627353.35 , rows=970902134 width=94)represents the build input, androws=970902134indicates the estimated data volume for building the hash table. If the actual table contains less data, the estimation is inaccurate.
Solutions:
-
Verify statistics: Check if the subquery’s table statistics are current and accurate. If not, run
ANALYZE <tablename>to update them. -
Disable hash table estimation: Turn off the execution engine's hash table estimation using the following parameter:
NoteThis parameter defaults to
off. However, it may have been enabled in certain tuning scenarios. If it is currently enabled, ensure you set it back tooff.SET hg_experimental_enable_estimate_hash_table_size =off;
-
-
Type 4: Broadcasting a large table
Broadcasting copies data to all shards and is efficient only for small tables with few shards. During joins, the build input is broadcast to every shard. A large dataset or excessive shard count can consume substantial memory, causing OOM errors.
For example, an 80-million-row table might show only 1 estimated row in the execution plan. The actual broadcast of all 80 million rows consumes excessive memory, triggering OOM.

Solutions:
-
Check whether the estimated row count in the execution plan matches reality. If not, run
ANALYZE tablenameto update statistics. -
Disable broadcasting and rewrite it as a redistribution operator using the following GUC parameter.
SET optimizer_enable_motion_broadcast = off;
-
-
-
Cause: High query concurrency
If QPS spikes significantly, or the OOM error shows
HGERR_detl memory usage for existing queries=(2031xxxx,184yy)(2021yyyy,85yy)(1021121xxxx,6yy)(2021xxx,18yy)(202xxxx,14yy);with each query using minimal memory, high concurrency is the likely cause. Solutions:-
Reduce write concurrency: If write operations are contributing, reduce their concurrency. For more information, see Resolve OOM errors during data import and export.
-
Implement read/write splitting: Deploy a read/write splitting architecture with primary and secondary instances (shared storage).
-
Increase the compute specifications of your instance.
-
-
Cause: Complex query
If a single query triggers OOM due to its complexity or large data volume, consider these approaches:
-
Pre-compute data: Write pre-computed data into Hologres to avoid large-scale ETL operations within Hologres.
-
Add filter conditions.
-
Optimize SQL: Use techniques like Fixed Plan or Count Distinct optimization. For more information, see Optimize internal table query performance.
-
-
Cause: UNION ALL
As shown below, when an SQL statement contains many
UNION ALLsubqueries, the executor processes them concurrently. This can overload memory and cause an OOM error.subquery1 UNION ALL subquery2 UNION ALL subquery3 ...Solution: Force serial execution using the following parameters to mitigate OOM errors. Be aware that this will result in slower query performance.
SET hg_experimental_hqe_union_all_type=1; SET hg_experimental_enable_fragment_instance_delay_open=on; -
Cause: Inadequate resource group configuration
An OOM error reports:
memory usage for existing queries=(3019xxx,37yy)(3022xxx,37yy)(3023xxx,35yy)(4015xxx,30yy)(2004xxx,2yy); Used/Limit: xy1/xy2 quota/sum_quota: zz/100. If zz is small—for example, 10 (only 10% of resources allocated)—queries in that group have limited memory, increasing OOM likelihood.
Solution: Reset the resource group quota. Allocate at least
30%of the instance’s total resources to each resource group. -
Cause: Data skew or shard pruning
If overall memory usage is low but OOM still occurs, data skew or shard pruning may be concentrating memory pressure on specific nodes.
NoteShard pruning is a query optimization technique that scans only a subset of shards, rather than all of them.
-
Check for data skew: Use the following SQL query. The
hg_shard_idis a built-in hidden field in every table that indicates the shard where each row resides.SELECT hg_shard_id, count(1) FROM t1 GROUP BY hg_shard_id; -
Inspect shard pruning: Inspect the execution plan for indications of Shard Pruning. For example, if the shard selector shows
l0[1], it means only one specific shard's data was selected for the query.-- The distribution key is x. Based on the filter condition x=1, you can quickly locate the shard. SELECT count(1) FROM bbb WHERE x=1 GROUP BY y;
Solutions:
-
Design an appropriate distribution key to prevent data skew.
-
If business logic inherently causes data skew, modify the application logic accordingly.
-
-
Cause: High-cardinality multi-stage GROUP BY
In Hologres V3.0 and later, multi-stage aggregations on high-cardinality data can cause OOM when
GROUP BYcolumns do not align with the distribution key (the distribution key is not a subset of theGROUP BYkey). Each concurrent instance maintains a large hash table, creating high memory pressure. To mitigate this, set the following parameter:-- Use a GUC parameter to set the maximum number of rows in the aggregation hash table. The following SQL statement indicates that the partial_agg_hash_table can have a maximum of 8192 rows. The default value is 0, which indicates no limit. SET hg_experimental_partial_agg_hash_table_size = 8192;
-
Resolve OOM errors during data import and export
OOM errors can occur during data transfers in Hologres, including between internal tables, interactions with foreign tables, and imports from MaxCompute.
-
Solution 1: Use Serverless Computing for imports and exports
Use Serverless Computing to supplement your instance’s resources for import and export tasks, avoiding resource contention. For an overview, see Serverless computing. For usage instructions, see Work with serverless computing.
-
Solution 2: Control scan concurrency for wide tables or columns
In MaxCompute imports, OOM errors can arise from wide tables or columns combined with high scan concurrency. Use the following parameters to control concurrency.
-
Control scan concurrency for wide tables (common scenario)
NoteApply the following parameters along with your SQL statement. Prioritize the first two parameters. If an OOM error persists, reduce their values further.
-- Set the maximum concurrency for accessing foreign tables. The default value equals the instance's vCPU count. The maximum value is 128. Do not set a large value to prevent queries on foreign tables, especially in data import scenarios, from affecting other queries and causing system busy errors. This parameter is effective in Hologres V1.1 and later. SET hg_foreign_table_executor_max_dop = 32; -- Adjust the batch size for each read from a MaxCompute table. The default value is 8192. SET hg_experimental_query_batch_size = 4096; -- Set the maximum concurrency for executing DML statements when accessing foreign tables. The default value is 32. This parameter is optimized for data import and export scenarios to prevent import operations from consuming excessive system resources. This parameter is effective in Hologres V1.1 and later. SET hg_foreign_table_executor_dml_max_dop = 16; -- Set the split size for accessing MaxCompute tables. This parameter can adjust concurrency. The default value is 64 MB. If the table is large, increase this value to prevent too many splits from affecting performance. This parameter is effective in Hologres V1.1 and later. SET hg_foreign_table_split_size = 128; -
Control scan concurrency for wide columns
If you’ve already tuned parameters for wide tables but still encounter OOM errors, check whether your data includes wide columns. If so, adjust the following parameters to resolve the issue.
-- Adjust the shuffle parallelism for wide columns to reduce data accumulation. SET hg_experimental_max_num_record_batches_in_buffer = 32; -- Adjust the batch size for each read from a MaxCompute table. The default value is 8192. SET hg_experimental_query_batch_size=128;
-
-
Cause: Excessive duplicate data in a foreign table
When a foreign table contains substantial duplicate data, import performance degrades and can cause OOM errors. For example, a table of 100 million rows with 80 million duplicates is highly duplicated. Assess duplication based on your business context.
Solution: Deduplicate data before import, or import in smaller batches.
