The advanced monitoring and alert service provides a wide range of metrics for Elasticsearch (ES). Basic metrics include resource usage information, such as cluster status, node count, and index count. They also include concurrent performance metrics, such as the read and write queries per second (QPS) of a cluster or node, and network monitoring metrics. These metrics help you better understand your ES cluster usage. You can use the advanced monitoring and alert service to view the basic metrics dashboard for your cluster, create custom alert rules to monitor cluster performance in real time, and send alert notifications. This topic describes the metrics on the default basic metrics dashboard.
The supported advanced monitoring metrics vary based on the version of your Alibaba Cloud ES instance.
Only kernel-enhanced instances support advanced monitoring metrics related to index write QPS and query QPS. For more information about these metrics, see DPI engine metrics.
ES V6.7 instances with shared elastic storage enabled do not support disk usage metrics. For the specific supported metrics, see the console.
Table 1. Basic metrics and their descriptions
The metrics for the cluster, index, Node JVM, and Thread_pool dimensions are provided by the ES module. For more information, see Elasticsearch Fields.
When you monitor cluster-level QPS metrics, the values may be unstable because of cluster jitter. If this happens, check the corresponding metrics in Kibana monitoring. Both advanced monitoring and Kibana monitoring are affected by cluster stability, but in different ways: in advanced monitoring, cluster jitter can cause spikes, negative values, or missing data for QPS metrics, whereas in Kibana it is more likely to cause missing data.
Category | Metric | Description |
cluster | aliyunes.elasticsearch.index.summary.total.indexing.index_total_qps | The total write QPS of the cluster. This metric indicates the number of documents written to the cluster per second. Details are as follows: |
cluster | aliyunes.elasticsearch.index.summary.total.search.query_total_qps | The total query QPS of the cluster. This metric indicates the number of queries executed by the cluster per second. The query QPS is related to the number of primary shards in the index to be queried. For example, if the index to be queried has five primary shards, one query request corresponds to a QPS of 5. |
cluster | aliyunes.elasticsearch.cluster.stats.status | The cluster status. The following states are supported: |
cluster | aliyunes.elasticsearch.cluster.stats.indices.shards.count | The number of shards. |
cluster | aliyunes.elasticsearch.cluster.stats.indices.total | The number of indexes. |
cluster | aliyunes.elasticsearch.cluster.stats.nodes.count | The number of nodes. |
cluster | aliyunes.elasticsearch.aliyun_auto_snapshot.latest_duration.ms | The duration of the latest snapshot. Unit: ms. |
cluster | aliyunes.elasticsearch.cluster.stats.indices.fielddata.memory.bytes | The memory usage of fielddata. Unit: bytes. |
cluster | aliyunes.elasticsearch.cluster.stats.indices.shards.primaries | The number of primary shards. |
index | aliyunes.elasticsearch.index.segments.memory.bytes | The memory usage of index segments. Unit: bytes. |
index | aliyunes.elasticsearch.index.store.size.bytes | The storage size of the index. Unit: bytes. |
index | aliyunes.elasticsearch.index.segments.stored_fields_memory.bytes | The memory size of stored fields in segments. Unit: bytes. |
index | aliyunes.elasticsearch.index.segments.count | The number of index segments. |
Node Resource | aliyunes.ecs.node_stats_process_cpu_percent_raw | The average CPU utilization of the node. |
Node Resource | aliyunes.ecs.node_stats_os_cpu_load_average_1m_raw | The one-minute load average of the node. |
Node Resource | aliyunes.ecs.node_stats_os_per_cpu_load_average_1m_raw | The one-minute load average per CPU on the node. |
Node Resource | aliyunes.elasticsearch.node.stats.jvm.mem.heap_used_percent | The JVM heap memory usage. |
Node Resource | aliyunes.ecs.node_stats_system_disk_space_usage | The system disk usage. |
Node Resource | aliyunes.ecs.node_stats_fs_data_disk_total_usage | The disk usage of the node. |
Node Network | aliyunes.ecs.node_stats_networkin_packages | The number of incoming network packets for the node. |
Node Network | aliyunes.ecs.node_stats_networkout_packages | The number of outgoing network packets for the node. |
Node Network | aliyunes.ecs.node_stats_networkin_rate | The inbound network traffic rate for the node. |
Node Network | aliyunes.ecs.node_stats_networkout_rate | The outbound network traffic rate for the node. |
Node Network | aliyunes.ecs.node_stats_tcp_established | The number of TCP connections for the node. |
Node Disk | aliyunes.ecs.node_stats_data_disk_r | The number of read requests completed per second. |
Node Disk | aliyunes.ecs.node_stats_data_disk_rm | The amount of data read per second. Unit: MB. |
Node Disk | aliyunes.ecs.node_stats_data_disk_w | The number of write requests completed per second. |
Node Disk | aliyunes.ecs.node_stats_data_disk_wm | The amount of data written per second. Unit: MB. |
Node Disk | aliyunes.ecs.node_stats_data_disk_r_await | The average wait time for each read request. Unit: ms. |
Node Disk | aliyunes.ecs.node_stats_data_disk_w_await | The average wait time for each write request. Unit: ms. |
Node Disk | aliyunes.ecs.node_stats_data_disk_svctm | The average service time for each request. Unit: ms. |
Node Disk | aliyunes.ecs.node_stats_data_disk_util | The utilization of the device. |
Node Disk | aliyunes.ecs.node_stats_data_disk_avgqu_sz | The average length of the request queue. |
Node JVM | aliyunes.elasticsearch.node.stats.jvm.mem.heap_used_percent | The heap usage. |
Node JVM | aliyunes.elasticsearch.node.stats.jvm.mem.pools.old.used.bytes | The usage of the old generation space. Unit: bytes. |
Node JVM | aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.ms | The time consumed by old generation garbage collection (GC). Unit: ms. |
Node JVM | aliyunes.elasticsearch.node.stats.jvm.gc.collectors.young.collection.ms | The time consumed by young generation GC. Unit: ms. |
Node JVM | aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.count | The frequency of old generation GC. |
Node JVM | aliyunes.elasticsearch.node.stats.jvm.gc.collectors.young.collection.count | The frequency of young generation GC. |
Node JVM | aliyunes.elasticsearch.node.stats.jvm.mem.pools.survivor.used.bytes | The amount of memory currently used by the survivor space. Unit: bytes. |
Node JVM | aliyunes.elasticsearch.node.stats.jvm.mem.pools.survivor.max.bytes | The maximum amount of memory that can be used by the survivor space. Unit: bytes. |
Node JVM | aliyunes.elasticsearch.node.stats.jvm.mem.pools.old.peak.bytes | The peak memory usage of the JVM old generation space. Unit: bytes. |
Node JVM | aliyunes.elasticsearch.node.jvm.memory.nonheap.init.bytes | The initial non-heap memory of the JVM. Unit: bytes. |
Node JVM | aliyunes.elasticsearch.node.jvm.memory.nonheap.max.bytes | The maximum non-heap memory of the JVM. Unit: bytes. |
Thread_pool | aliyunes.elasticsearch.node.stats.thread_pool.search.threads | The total number of threads in the search thread pool. |
Thread_pool | aliyunes.elasticsearch.node.stats.thread_pool.search.rejected | The number of rejected requests in the search thread pool. |
Thread_pool | aliyunes.elasticsearch.node.stats.thread_pool.search.queue | The number of queued requests in the search thread pool. |
Thread_pool | aliyunes.elasticsearch.node.stats.thread_pool.generic.queue | The number of queued requests in the generic thread pool. |
Thread_pool | aliyunes.elasticsearch.node.stats.thread_pool.generic.threads | The total number of threads in the generic thread pool. |
Thread_pool | aliyunes.elasticsearch.node.stats.thread_pool.generic.rejected | The number of rejected requests in the generic thread pool. |
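
The relationship between query requests and the cluster query QPS described in the table can be sketched as follows. This is only an illustration of the arithmetic, assuming that every request is counted once per primary shard of the queried index, as in the five-shard example. The function name `reported_query_qps` is hypothetical and is not part of any Alibaba Cloud or Elasticsearch API.

```python
def reported_query_qps(requests_per_second, primary_shards):
    """Estimate the query QPS shown by the cluster metric.

    Assumption: each query request is counted once for every primary
    shard of the index that it targets.
    """
    if primary_shards < 1:
        raise ValueError("an index has at least one primary shard")
    return requests_per_second * primary_shards


# One request per second against an index with five primary shards.
print(reported_query_qps(1, 5))
```

A practical consequence is that a rise in this metric can come from more requests or from queries that target indexes with more primary shards.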
Some metrics are calculated as a rate, which represents the growth of a cumulative value over a period of time. Rate data is approximate and may have a margin of error, so use it mainly to identify trends. If the underlying value changes slowly, the changes are likely to be averaged out.
For example, consider the old GC count metric `aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.count`. The rate is calculated from the change in the count value between two data points. If monitoring displays one data point per minute, and the cumulative GC count is 1,000 at the beginning of a minute and 1,001 at the end of that minute, the rate is calculated as (1,001 - 1,000) / 60, which is approximately 0.0167 collections per second.
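
The rate calculation in this example can be written as a small function. This is a minimal sketch of the arithmetic only, not the monitoring service's actual implementation; the function name `per_second_rate` and its parameters are hypothetical.

```python
def per_second_rate(previous, current, interval_seconds):
    """Return the per-second growth of a cumulative counter
    between two consecutive data points."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (current - previous) / interval_seconds


# Cumulative old GC count: 1,000 at the start of the minute, 1,001 at the end.
rate = per_second_rate(previous=1000, current=1001, interval_seconds=60)
print(f"{rate:.4f} old GC collections per second")  # prints 0.0167 ...
```

Because the result is an average over the whole interval, a single collection within the minute appears as a small constant rate rather than as a discrete event.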
The rate capability is enabled for the following metrics:
"metric": "aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.ms"
"metric": "aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.count"
"metric": "aliyunes.elasticsearch.node.stats.jvm.gc.collectors.young.collection.count"
"metric": "aliyunes.elasticsearch.node.stats.jvm.gc.collectors.young.collection.ms"
"metric": "aliyunes.elasticsearch.node.stats.thread_pool.search.rejected"
"metric": "aliyunes.elasticsearch.node.stats.thread_pool.write.rejected"
"metric": "aliyunes.elasticsearch.node.stats.thread_pool.generic.rejected"