Basic metrics

The advanced monitoring and alert service provides a wide range of metrics for Elasticsearch (ES). Basic metrics cover resource usage (such as cluster status, node count, and index count), performance (such as the read and write queries per second (QPS) of a cluster or node), and network monitoring. These metrics help you understand how your ES cluster is being used. With the service, you can view the basic metrics dashboard for your cluster, create custom alert rules to monitor cluster performance in real time, and send alert notifications. This topic describes the metrics on the default basic metrics dashboard.

The supported advanced monitoring metrics vary based on the version of your Alibaba Cloud ES instance.

  • Only kernel-enhanced instances support advanced monitoring metrics related to index write QPS and query QPS. For more information about these metrics, see DPI engine metrics.

  • ES V6.7 instances with shared elastic storage enabled do not support disk usage metrics. To see which metrics are supported for your instance, check the console.

Table 1. Basic metrics and their descriptions

Note
  • The metrics for the cluster, index, Node JVM, and Thread_pool dimensions are provided by the ES module. For more information, see Elasticsearch Fields.

  • Cluster-level QPS metrics may appear unstable when the cluster experiences jitter. If this happens, check the corresponding metrics in Kibana monitoring. Both advanced monitoring and Kibana monitoring are affected by cluster stability, but in different ways: in advanced monitoring, jitter can cause spikes, negative values, or missing data in QPS metrics, whereas in Kibana it is more likely to cause missing data.

| Category | Metric | Description |
| --- | --- | --- |
| cluster | `aliyunes.elasticsearch.index.summary.total.indexing.index_total_qps` | The total write QPS of the cluster, that is, the number of documents written to the cluster per second. A write request that contains a single document counts as 1. A batch write request sent through the `_bulk` API counts as the total number of documents in the request. The values of all requests sent within the same second are aggregated. |
| cluster | `aliyunes.elasticsearch.index.summary.total.search.query_total_qps` | The total query QPS of the cluster, that is, the number of queries executed by the cluster per second. The value depends on the number of primary shards in the queried index. For example, if the index has five primary shards, one query request corresponds to a QPS of 5. |
| cluster | `aliyunes.elasticsearch.cluster.stats.status` | The cluster status. Valid values: 0 (green), 1 (yellow), and 2 (red). |
| cluster | `aliyunes.elasticsearch.cluster.stats.indices.shards.count` | The number of shards. |
| cluster | `aliyunes.elasticsearch.cluster.stats.indices.total` | The number of indexes. |
| cluster | `aliyunes.elasticsearch.cluster.stats.nodes.count` | The number of nodes. |
| cluster | `aliyunes.elasticsearch.aliyun_auto_snapshot.latest_duration.ms` | The duration of the latest snapshot. Unit: ms. |
| cluster | `aliyunes.elasticsearch.cluster.stats.indices.fielddata.memory.bytes` | The memory usage of fielddata. Unit: bytes. |
| cluster | `aliyunes.elasticsearch.cluster.stats.indices.shards.primaries` | The number of primary shards. |
| index | `aliyunes.elasticsearch.index.segments.memory.bytes` | The memory usage of index segments. Unit: bytes. |
| index | `aliyunes.elasticsearch.index.store.size.bytes` | The storage size of the index. Unit: bytes. |
| index | `aliyunes.elasticsearch.index.segments.stored_fields_memory.bytes` | The memory size of stored fields in segments. Unit: bytes. |
| index | `aliyunes.elasticsearch.index.segments.count` | The number of index segments. |
| Node Resource | `aliyunes.ecs.node_stats_process_cpu_percent_raw` | The average CPU utilization of the node. |
| Node Resource | `aliyunes.ecs.node_stats_os_cpu_load_average_1m_raw` | The one-minute load average of the node. |
| Node Resource | `aliyunes.ecs.node_stats_os_per_cpu_load_average_1m_raw` | The one-minute load average per CPU on the node. |
| Node Resource | `aliyunes.elasticsearch.node.stats.jvm.mem.heap_used_percent` | The JVM heap memory usage. |
| Node Resource | `aliyunes.ecs.node_stats_system_disk_space_usage` | The system disk usage. |
| Node Resource | `aliyunes.ecs.node_stats_fs_data_disk_total_usage` | The data disk usage of the node. |
| Node Network | `aliyunes.ecs.node_stats_networkin_packages` | The number of incoming network packets for the node. |
| Node Network | `aliyunes.ecs.node_stats_networkout_packages` | The number of outgoing network packets for the node. |
| Node Network | `aliyunes.ecs.node_stats_networkin_rate` | The inbound network traffic rate for the node. |
| Node Network | `aliyunes.ecs.node_stats_networkout_rate` | The outbound network traffic rate for the node. |
| Node Network | `aliyunes.ecs.node_stats_tcp_established` | The number of established TCP connections for the node. |
| Node Disk | `aliyunes.ecs.node_stats_data_disk_r` | The number of read requests completed per second. |
| Node Disk | `aliyunes.ecs.node_stats_data_disk_rm` | The amount of data read per second. Unit: MB. |
| Node Disk | `aliyunes.ecs.node_stats_data_disk_w` | The number of write requests completed per second. |
| Node Disk | `aliyunes.ecs.node_stats_data_disk_wm` | The amount of data written per second. Unit: MB. |
| Node Disk | `aliyunes.ecs.node_stats_data_disk_r_await` | The average wait time for each read request. Unit: ms. |
| Node Disk | `aliyunes.ecs.node_stats_data_disk_w_await` | The average wait time for each write request. Unit: ms. |
| Node Disk | `aliyunes.ecs.node_stats_data_disk_svctm` | The average service time for each request. Unit: ms. |
| Node Disk | `aliyunes.ecs.node_stats_data_disk_util` | The utilization of the data disk device. |
| Node Disk | `aliyunes.ecs.node_stats_data_disk_avgqu_sz` | The average length of the request queue. |
| Node JVM | `aliyunes.elasticsearch.node.stats.jvm.mem.heap_used_percent` | The JVM heap memory usage. |
| Node JVM | `aliyunes.elasticsearch.node.stats.jvm.mem.pools.old.used.bytes` | The usage of the old generation space. Unit: bytes. |
| Node JVM | `aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.ms` | The time consumed by old generation garbage collection (GC). Unit: ms. |
| Node JVM | `aliyunes.elasticsearch.node.stats.jvm.gc.collectors.young.collection.ms` | The time consumed by young generation GC. Unit: ms. |
| Node JVM | `aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.count` | The number of old generation GC runs. |
| Node JVM | `aliyunes.elasticsearch.node.stats.jvm.gc.collectors.young.collection.count` | The number of young generation GC runs. |
| Node JVM | `aliyunes.elasticsearch.node.stats.jvm.mem.pools.survivor.used.bytes` | The amount of memory currently used by the survivor space. Unit: bytes. |
| Node JVM | `aliyunes.elasticsearch.node.stats.jvm.mem.pools.survivor.max.bytes` | The maximum amount of memory that can be used by the survivor space. Unit: bytes. |
| Node JVM | `aliyunes.elasticsearch.node.stats.jvm.mem.pools.old.peak.bytes` | The peak memory usage of the JVM old generation space. Unit: bytes. |
| Node JVM | `aliyunes.elasticsearch.node.jvm.memory.nonheap.init.bytes` | The initial non-heap memory of the JVM. Unit: bytes. |
| Node JVM | `aliyunes.elasticsearch.node.jvm.memory.nonheap.max.bytes` | The maximum non-heap memory of the JVM. Unit: bytes. |
| Thread_pool | `aliyunes.elasticsearch.node.stats.thread_pool.search.threads` | The total number of threads in the search thread pool. |
| Thread_pool | `aliyunes.elasticsearch.node.stats.thread_pool.search.rejected` | The number of rejected requests in the search thread pool. |
| Thread_pool | `aliyunes.elasticsearch.node.stats.thread_pool.search.queue` | The number of queued requests in the search thread pool. |
| Thread_pool | `aliyunes.elasticsearch.node.stats.thread_pool.generic.queue` | The number of queued requests in the generic thread pool. |
| Thread_pool | `aliyunes.elasticsearch.node.stats.thread_pool.generic.threads` | The total number of threads in the generic thread pool. |
| Thread_pool | `aliyunes.elasticsearch.node.stats.thread_pool.generic.rejected` | The number of rejected requests in the generic thread pool. |
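The write QPS, query QPS, and cluster status definitions in the table can be illustrated with a short worked example. The following Python sketch only restates the arithmetic described above. The names `ClusterStatus`, `write_qps`, and `query_qps` are hypothetical helpers made up for illustration and are not part of the monitoring service or any ES API.

```python
from enum import IntEnum


class ClusterStatus(IntEnum):
    """Numeric values reported by aliyunes.elasticsearch.cluster.stats.status."""
    GREEN = 0
    YELLOW = 1
    RED = 2


def write_qps(docs_per_request):
    """Write QPS for one second: total documents across all write requests.

    docs_per_request lists the document count of each write request sent
    within the same second (1 for a single-document request, N for a
    _bulk request that carries N documents).
    """
    return sum(docs_per_request)


def query_qps(primary_shards_per_request):
    """Query QPS for one second: each query request counts once per
    primary shard of the index it queries."""
    return sum(primary_shards_per_request)


# One single-document request plus one _bulk request with 500 documents.
print(write_qps([1, 500]))            # 501
# One query request against an index with five primary shards.
print(query_qps([5]))                 # 5
# Decode a numeric status value into its color.
print(ClusterStatus(1).name.lower())  # yellow
```

Because the query QPS scales with the number of primary shards, the same client request rate can produce very different values on indexes with different shard counts.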

Some metrics are calculated as a rate, which represents the growth over a period of time. This monitoring data is not completely precise and may have a margin of error. The data is mainly used to identify trends. If the data changes slowly, the changes are likely to be averaged out.

For example, consider the old GC count metric `aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.count`. The rate is calculated from the change in the count value between two data points. If monitoring displays one data point per minute, and the cumulative GC count is 1,000 at the beginning of a minute and 1,001 at the end of that minute, the rate is calculated as (1,001 - 1,000) / 60.
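The calculation in this example can be written out as a minimal Python sketch. The function name `per_second_rate` and its parameters are assumptions made for illustration and do not reflect how the service implements the rate internally.

```python
def per_second_rate(previous_value, current_value, interval_seconds):
    """Rate of a cumulative counter between two consecutive data points."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (current_value - previous_value) / interval_seconds


# Cumulative old GC count: 1,000 at the start of the minute, 1,001 at the end.
rate = per_second_rate(1000, 1001, 60)
print(round(rate, 4))  # 0.0167
```

A single collection within a one-minute window therefore appears as roughly 0.0167 per second, which is why slow changes look averaged out on the dashboard.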

The rate capability is enabled for the following metrics:

  • `aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.ms`

  • `aliyunes.elasticsearch.node.stats.jvm.gc.collectors.old.collection.count`

  • `aliyunes.elasticsearch.node.stats.jvm.gc.collectors.young.collection.count`

  • `aliyunes.elasticsearch.node.stats.jvm.gc.collectors.young.collection.ms`

  • `aliyunes.elasticsearch.node.stats.thread_pool.search.rejected`

  • `aliyunes.elasticsearch.node.stats.thread_pool.write.rejected`

  • `aliyunes.elasticsearch.node.stats.thread_pool.generic.rejected`