Metrics provided by the advanced monitoring and alert service - Elasticsearch (ES) - Alibaba Cloud Help Center

The advanced monitoring and alert service provides a wide range of metrics for Elasticsearch (ES). Basic metrics include resource usage information, such as cluster status, node count, and index count. They also include concurrent performance metrics, such as the read and write queries per second (QPS) of a cluster or node, and network monitoring metrics. These metrics help you better understand your ES cluster usage. You can use the advanced monitoring and alert service to view the basic metrics dashboard for your cluster, create custom alert rules to monitor cluster performance in real time, and send alert notifications. This topic describes the metrics on the default basic metrics dashboard.

The supported advanced monitoring metrics vary based on the version of your Alibaba Cloud ES instance.

Only kernel-enhanced instances support advanced monitoring metrics related to index write QPS and query QPS. For more information about these metrics, see DPI engine metrics.
ES V6.7 instances with shared elastic storage enabled do not support disk usage metrics. For the specific supported metrics, see the console.

Table 1. Basic metrics and their descriptions

Note

The metrics for the cluster, index, Node JVM, and Thread_pool dimensions are provided by the ES module. For more information, see Elasticsearch Fields.
When you monitor cluster-level QPS metrics, you may observe instability due to cluster jitter. If this happens, see the relevant metrics in Kibana monitoring. Both advanced monitoring and Kibana monitoring are affected by cluster stability. However, cluster jitter can cause monitoring spikes, negative values, or a lack of data for advanced monitoring QPS metrics. In Kibana, cluster jitter is more likely to cause a lack of monitoring data.

Category	Metric	Description
cluster	aliyunes.elasticsearch.index.summary.total.indexing.index_total_qps	The total write QPS of the cluster. This metric indicates the number of documents written to the cluster per second. Details are as follows: If a client sends one write request that contains a single document to the cluster within one second, the write QPS is 1. If multiple write requests are sent within one second, the values are aggregated. If multiple documents are written in a batch in a single write request using the _bulk API within one second, the write QPS is the total number of documents pushed in the request. If multiple batch write requests are sent using the _bulk API within one second, the values are aggregated.
	aliyunes.elasticsearch.index.summary.total.search.query_total_qps	The total query QPS of the cluster. This metric indicates the number of shard-level query operations that the cluster executes per second, so it depends on the number of primary shards in the queried index. For example, if the queried index has five primary shards, one query request counts as a QPS of 5.
	aliyunes.elasticsearch.cluster.stats.status	The cluster status. Valid values: 0 (green), 1 (yellow), and 2 (red).
	aliyunes.elasticsearch.cluster.stats.indices.shards.count	The number of shards.
	aliyunes.elasticsearch.cluster.stats.indices.total	The number of indexes.
	aliyunes.elasticsearch.cluster.stats.nodes.count	The number of nodes.
	aliyunes.elasticsearch.aliyun_auto_snapshot.latest_duration.ms	The duration of the latest snapshot. Unit: ms.
	aliyunes.elasticsearch.cluster.stats.indices.fielddata.memory.bytes	The memory usage of fielddata. Unit: bytes.
	aliyunes.elasticsearch.cluster.stats.indices.shards.primaries	The number of primary shards.
index	aliyunes.elasticsearch.index.segments.memory.bytes	The memory usage of index segments. Unit: bytes.
	aliyunes.elasticsearch.index.store.size.bytes	The storage size of the index. Unit: bytes.
	aliyunes.elasticsearch.index.segments.stored_fields_memory.bytes	The memory size of stored fields in segments. Unit: bytes.
	aliyunes.elasticsearch.index.segments.count	The number of index segments.
Node Resource	aliyunes.ecs.node_stats_process_cpu_percent_raw	The average CPU utilization of the node.
	aliyunes.ecs.node_stats_os_cpu_load_average_1m_raw	The one-minute load average of the node.
	aliyunes.ecs.node_stats_os_per_cpu_load_average_1m_raw	The one-minute load average per CPU on the node.
	aliyunes.elasticsearch.node.stats.jvm.mem.heap_used_percent	The JVM heap memory usage.
	aliyunes.ecs.node_stats_system_disk_space_usage	The system disk usage.
	aliyunes.ecs.node_stats_fs_data_disk_total_usage	The disk usage of the node.
Node Network	aliyunes.ecs.node_stats_networkin_packages	The number of incoming network packets for the node.
	aliyunes.ecs.node_stats_networkout_packages	The number of outgoing network packets for the node.
	aliyunes.ecs.node_stats_networkin_rate	The inbound network traffic rate for the node.
	aliyunes.ecs.node_stats_networkout_rate	The outbound network traffic rate for the node.
	aliyunes.ecs.node_stats_tcp_established	The number of TCP connections for the node.
Node Disk	aliyunes.ecs.node_stats_data_disk_r	The number of read requests completed per second.
	aliyunes.ecs.node_stats_data_disk_rm	The amount of data read per second. Unit: MB.
	aliyunes.ecs.node_stats_data_disk_w	The number of write requests completed per second.
	aliyunes.ecs.node_stats_data_disk_wm	The amount of data written per second. Unit: MB.
	aliyunes.ecs.node_stats_data_disk_r_await	The average wait time for each read request. Unit: ms.
	aliyunes.ecs.node_stats_data_disk_w_await	The average wait time for each write request. Unit: ms.
	aliyunes.ecs.node_stats_data_disk_svctm	The average service time for each request. Unit: ms.
	aliyunes.ecs.node_stats_data_disk_util	The utilization of the device.
	aliyunes.ecs.node_stats_data_disk_avgqu_sz	The average length of the request queue.
Node JVM	aliyunes.elasticsearch.node.stats.jvm.mem.heap_used_percent	The heap usage.
	aliyunes.elasticsearch.node.stats.jvm.mem.pools.old.used.bytes	The usage of the old generation space. Unit: bytes.
	aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.ms	The time consumed by old generation garbage collection (GC). Unit: ms.
	aliyunes.elasticsearch.node.stats.jvm.gc.collectors.young.collection.ms	The time consumed by young generation GC. Unit: ms.
	aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.count	The frequency of old generation GC.
	aliyunes.elasticsearch.node.stats.jvm.gc.collectors.young.collection.count	The frequency of young generation GC.
	aliyunes.elasticsearch.node.stats.jvm.mem.pools.survivor.used.bytes	The amount of memory currently used by the survivor space. Unit: bytes.
	aliyunes.elasticsearch.node.stats.jvm.mem.pools.survivor.max.bytes	The maximum amount of memory that can be used by the survivor space. Unit: bytes.
	aliyunes.elasticsearch.node.stats.jvm.mem.pools.old.peak.bytes	The peak memory usage of the JVM old generation space. Unit: bytes.
	aliyunes.elasticsearch.node.jvm.memory.nonheap.init.bytes	The initial non-heap memory of the JVM. Unit: bytes.
	aliyunes.elasticsearch.node.jvm.memory.nonheap.max.bytes	The maximum non-heap memory usage. Unit: bytes.
Thread_pool	aliyunes.elasticsearch.node.stats.thread_pool.search.threads	The total number of threads in the search thread pool.
	aliyunes.elasticsearch.node.stats.thread_pool.search.rejected	The number of rejected requests in the search thread pool.
	aliyunes.elasticsearch.node.stats.thread_pool.search.queue	The number of queued requests in the search thread pool.
	aliyunes.elasticsearch.node.stats.thread_pool.generic.queue	The number of queued requests in the generic thread pool.
	aliyunes.elasticsearch.node.stats.thread_pool.generic.threads	The total number of threads in the generic thread pool.
	aliyunes.elasticsearch.node.stats.thread_pool.generic.rejected	The number of rejected requests in the generic thread pool.
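The counting rules for the two cluster-level QPS metrics in Table 1, and the numeric encoding of the cluster status, can be sketched as follows. This is a minimal illustration only: the function names `write_qps` and `query_qps` and the `CLUSTER_STATUS` mapping are hypothetical helpers and are not part of the monitoring service. The query example assumes that every request in the second targets a single index with the given number of primary shards.

```python
# Hypothetical helpers that mirror the counting rules described in Table 1.
# They are not an API of the advanced monitoring and alert service.

# Numeric values reported by aliyunes.elasticsearch.cluster.stats.status.
CLUSTER_STATUS = {0: "green", 1: "yellow", 2: "red"}


def write_qps(docs_per_request):
    """Write QPS for one second.

    docs_per_request lists how many documents each write request sent in
    that second carried: 1 for a single-document request, or the batch
    size for a _bulk request. The values are aggregated.
    """
    return sum(docs_per_request)


def query_qps(query_requests, primary_shards):
    """Query QPS for one second.

    Assumes all requests target one index that has `primary_shards`
    primary shards, so each request counts once per primary shard.
    """
    return query_requests * primary_shards


print(write_qps([1]))          # one single-document request
print(write_qps([500, 300]))   # two _bulk requests in the same second
print(query_qps(1, 5))         # one query against five primary shards
print(CLUSTER_STATUS[1])       # status value 1
```

Under these rules, a single `_bulk` request that carries 500 documents contributes 500 to the write QPS rather than 1, which is why bulk-heavy workloads show a write QPS far above their request rate.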

Some metrics are calculated as a rate, which represents the growth over a period of time. This monitoring data is not completely precise and may have a margin of error. The data is mainly used to identify trends. If the data changes slowly, the changes are likely to be averaged out.

For example, consider the old GC count metric `aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.count`. The rate is calculated from the change in the cumulative count between two data points, divided by the time between them. If monitoring displays one data point per minute, and the cumulative GC count is 1,000 at the beginning of a minute and 1,001 at the end of that minute, the rate is calculated as (1,001 - 1,000) / 60, which is approximately 0.0167 per second.
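The calculation in this example can be written as a short function. This is a sketch of the arithmetic only, not the service's actual implementation; the name `rate_per_second` and its parameters are assumptions introduced for illustration.

```python
def rate_per_second(previous_count, current_count, interval_seconds):
    """Rate of growth of a cumulative counter between two data points.

    previous_count and current_count are the cumulative values at the
    start and end of the interval; interval_seconds is the time between
    the two data points.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (current_count - previous_count) / interval_seconds


# One data point per minute: 1,000 at the start, 1,001 at the end.
rate = rate_per_second(1000, 1001, 60)
print(f"{rate:.4f}")  # prints 0.0167
```

Because the result is averaged over the whole interval, a single GC within one minute appears as a small fractional value rather than as a discrete event, which is why the data is better suited to spotting trends than to exact counting.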

The rate capability is enabled for the following metrics:

"metric": "aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.ms"
"metric": "aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.count"
"metric": "aliyunes.elasticsearch.node.stats.jvm.gc.collectors.young.collection.count"
"metric": "aliyunes.elasticsearch.node.stats.jvm.gc.collectors.young.collection.ms"
"metric": "aliyunes.elasticsearch.node.stats.thread_pool.search.rejected"
"metric": "aliyunes.elasticsearch.node.stats.thread_pool.write.rejected"
"metric": "aliyunes.elasticsearch.node.stats.thread_pool.generic.rejected"
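
When you process these metrics in your own scripts, it can help to know whether a value is a rate or a raw reading. The following sketch keeps the list above in a set and checks membership. The names `RATE_ENABLED_METRICS` and `is_rate_metric` are hypothetical and exist only for this illustration; the metric names are copied from the list above.

```python
# Metrics for which the rate capability is enabled, copied from the list above.
RATE_ENABLED_METRICS = frozenset({
    "aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.ms",
    "aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.count",
    "aliyunes.elasticsearch.node.stats.jvm.gc.collectors.young.collection.count",
    "aliyunes.elasticsearch.node.stats.jvm.gc.collectors.young.collection.ms",
    "aliyunes.elasticsearch.node.stats.thread_pool.search.rejected",
    "aliyunes.elasticsearch.node.stats.thread_pool.write.rejected",
    "aliyunes.elasticsearch.node.stats.thread_pool.generic.rejected",
})


def is_rate_metric(metric):
    """Return True if the metric is reported as a rate (hypothetical helper)."""
    return metric in RATE_ENABLED_METRICS


print(is_rate_metric(
    "aliyunes.elasticsearch.node.stats.thread_pool.search.rejected"))
print(is_rate_metric(
    "aliyunes.elasticsearch.node.stats.thread_pool.search.queue"))
```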