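This topic describes how to use a self-managed Prometheus instance to collect metrics from the control plane components of an ACK Pro cluster: API Server, etcd, Scheduler, Kube Controller Manager (KCM), and Cloud Controller Manager (CCM). It also provides recommended alert configurations.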
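Prerequisites
- The self-managed Prometheus instance can access the API server of the ACK Pro cluster and has read permission on /metrics.
- The self-managed Prometheus instance can be deployed either inside or outside the ACK Pro cluster.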
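The read permission on /metrics mentioned in the prerequisites can be granted through Kubernetes RBAC. The following is a minimal sketch, not a definitive setup: the `monitoring` namespace, the `prometheus` service account, and the `ack-control-plane-metrics-reader` role name are assumptions made for illustration. The second rule is included on the assumption that Prometheus discovers targets with `kubernetes_sd_configs` and `role: endpoints`, as the jobs in this topic do.

```yaml
# All names below (role, service account, namespace) are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ack-control-plane-metrics-reader
rules:
# Read access to the /metrics non-resource URL served by the API server.
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
# Lets Prometheus list and watch the objects used by endpoints discovery.
- apiGroups: [""]
  resources: ["endpoints", "services", "pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ack-control-plane-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ack-control-plane-metrics-reader
subjects:
# The service account that the Prometheus pod runs as.
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
```

If Prometheus runs outside the cluster, the token of the same service account can be used as the bearer token in the scrape job.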
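Background information
ACK Pro exposes monitoring metrics for the core control plane components and provides preconfigured dashboards for them based on ARMS. The components are APIServer, Cloud Controller Manager, etcd, Kube Controller Manager, and Scheduler. If you use ARMS monitoring, the ARMS agent collects the metrics automatically and displays them on the dashboards in real time. If you want to collect the control plane metrics with a self-managed Prometheus instance, configure alerts for them, and integrate them with your own monitoring system, follow the configurations in this topic.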
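Prometheus scrape configuration
To collect the control plane metrics of an ACK Pro cluster with a self-managed Prometheus instance, first add scrape jobs to the Prometheus configuration file prometheus.yaml. The configuration file has the following format: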
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'
# Scrape configurations: one job per control plane component.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: ack-api-server
    ......
  - job_name: ack-etcd
    ......
  - job_name: ack-scheduler
    ......
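Each core component corresponds to one job. For the specific configuration, see the metric list of the corresponding component. For how to write prometheus.yaml for community Prometheus, see Configuration.
For information about the community Prometheus Operator solution and the ack-prometheus-operator component in the ACK marketplace, see Open source Prometheus monitoring. For custom scrape configuration, see the official Prometheus Operator community documentation, Prometheus Operator.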
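With Prometheus Operator, raw scrape jobs such as the ones in this topic are commonly supplied through the `additionalScrapeConfigs` field of the Prometheus custom resource, which points to a key in a Secret. The following sketch assumes that approach; the Secret name, key name, Prometheus object name, and `monitoring` namespace are all hypothetical, and the job body is left as a placeholder.

```yaml
# Secret that holds the raw scrape jobs. All names are hypothetical.
apiVersion: v1
kind: Secret
metadata:
  name: ack-control-plane-scrape-configs
  namespace: monitoring
stringData:
  scrape-configs.yaml: |
    - job_name: ack-api-server
      # ... same fields as the ack-api-server job in this topic ...
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  # Appends the jobs stored in the Secret to the generated configuration.
  additionalScrapeConfigs:
    name: ack-control-plane-scrape-configs
    key: scrape-configs.yaml
```

The Secret must be in the same namespace as the Prometheus object that references it.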
Prometheus alert rule configuration
For how to configure alerts in community Prometheus, see Alerting_rules.
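The alert rules listed for each component later in this topic are individual rule entries. In community Prometheus, such entries are placed under a rule group in a rule file, and the rule file is referenced from prometheus.yaml through `rule_files`. The following sketch shows that wrapping for the kube-apiserver rule; the file name and group name are assumptions.

```yaml
# File: ack-control-plane-rules.yaml (hypothetical file name)
groups:
- name: ack-control-plane          # hypothetical group name
  rules:
  - alert: AckApiServerWarning
    annotations:
      message: APIServer is not available in last 5 minutes. Please check the prometheus job and target status.
    expr: |
      (absent(up{job="ack-api-server",pod!=""}) or (count(up{job="ack-api-server",pod!=""}) <= 1)) == 1
    for: 5m
    labels:
      severity: critical
---
# In prometheus.yaml: load the rule file above.
rule_files:
- ack-control-plane-rules.yaml
```

Before reloading Prometheus, the rule file can be checked with `promtool check rules ack-control-plane-rules.yaml`.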
Monitoring from inside the ACK Pro cluster
In-cluster monitoring means that Prometheus is deployed in the ACK Pro cluster that it monitors.
kube-apiserver
- Prometheus scrape configuration
  - job_name: ack-api-server
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /metrics
    scheme: https
    honor_labels: true
    honor_timestamps: true
    params:
      hosting: ["true"]
      job: ["apiserver"]
    kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names: [default]
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes, insecure_skip_verify: false}
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: apiserver
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_service_label_provider]
      separator: ;
      regex: kubernetes
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: https
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Node;(.*)
      target_label: node
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Pod;(.*)
      target_label: pod
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: service
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: job
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: (.+)
      target_label: job
      replacement: ${1}
      action: replace
    - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
- Prometheus alert rules
  - alert: AckApiServerWarning
    annotations:
      message: APIServer is not available in last 5 minutes. Please check the prometheus job and target status.
    expr: |
      (absent(up{job="ack-api-server",pod!=""}) or (count(up{job="ack-api-server",pod!=""}) <= 1)) == 1
    for: 5m
    labels:
      severity: critical
- Metric list
  For the metrics collected from kube-apiserver, see the kube-apiserver metric list.
etcd
- Prometheus scrape configuration
  - job_name: ack-etcd
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /metrics
    scheme: https
    honor_labels: true
    honor_timestamps: true
    params:
      hosting: ["true"]
      job: ["etcd"]
    kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names: [default]
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes, insecure_skip_verify: false}
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: apiserver
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_service_label_provider]
      separator: ;
      regex: kubernetes
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: https
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Node;(.*)
      target_label: node
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Pod;(.*)
      target_label: pod
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: service
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: job
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: (.+)
      target_label: job
      replacement: ${1}
      action: replace
    - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
- Prometheus alert rules
  - alert: AckETCDWarning
    annotations:
      message: Etcd cluster has no leader in last 5 minutes, please check whether the cluster is overloaded and contact ACK team.
    expr: |
      sum_over_time(etcd_server_has_leader[5m]) == 0
    for: 5m
    labels:
      severity: critical
  - alert: AckETCDWarning
    annotations:
      message: Etcd is not available in last 5 minutes. Please check the prometheus job and target status.
    expr: |
      (absent(up{job="ack-etcd",pod!=""}) or (count(up{job="ack-etcd",pod!=""}) <= 2)) == 1
    for: 5m
    labels:
      severity: critical
- Metric list
  For the metrics collected from etcd, see the etcd metric list.
kube-scheduler
- Prometheus scrape configuration
  - job_name: ack-scheduler
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /metrics
    scheme: https
    honor_labels: true
    honor_timestamps: true
    params:
      hosting: ["true"]
      job: ["ack-scheduler"]
    kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names: [default]
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes, insecure_skip_verify: false}
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: apiserver
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_service_label_provider]
      separator: ;
      regex: kubernetes
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: https
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Node;(.*)
      target_label: node
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Pod;(.*)
      target_label: pod
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: service
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: job
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: (.+)
      target_label: job
      replacement: ${1}
      action: replace
    - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
- Prometheus alert rules
  - alert: AckSchedulerWarning
    annotations:
      message: Scheduler is not available in last 3 minutes. Please check the prometheus job and target status.
    expr: |
      (absent(up{job="ack-scheduler",pod!=""}) or (count(up{job="ack-scheduler",pod!=""}) == 0)) == 1
    for: 3m
    labels:
      severity: critical
- Metric list
  For the metrics collected from kube-scheduler, see the kube-scheduler metric list.
kube-controller-manager
- Prometheus scrape configuration
  - job_name: ack-kcm
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /metrics
    scheme: https
    honor_labels: true
    honor_timestamps: true
    params:
      hosting: ["true"]
      job: ["kube-controller-manager"]
    kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names: [default]
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes, insecure_skip_verify: false}
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: apiserver
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_service_label_provider]
      separator: ;
      regex: kubernetes
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: https
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Node;(.*)
      target_label: node
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Pod;(.*)
      target_label: pod
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: service
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: job
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: (.+)
      target_label: job
      replacement: ${1}
      action: replace
    - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
- Prometheus alert rules
  - alert: AckKCMWarning
    annotations:
      message: KCM is not available in last 3 minutes. Please check the prometheus job and target status.
    expr: |
      (absent(up{job="ack-kube-controller-manager",pod!=""}) or (count(up{job="ack-kube-controller-manager",pod!=""}) == 0)) == 1
    for: 3m
    labels:
      severity: critical
- Metric list
  For the metrics collected from kube-controller-manager, see the kube-controller-manager metric list.
cloud-controller-manager
- Prometheus scrape configuration
  - job_name: ack-cloud-controller-manager
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /metrics
    scheme: https
    honor_labels: true
    honor_timestamps: true
    params:
      hosting: ["true"]
      job: ["ack-cloud-controller-manager"]
    kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names: [default]
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config: {ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, server_name: kubernetes, insecure_skip_verify: false}
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: apiserver
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_service_label_provider]
      separator: ;
      regex: kubernetes
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: https
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Node;(.*)
      target_label: node
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
      separator: ;
      regex: Pod;(.*)
      target_label: pod
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: service
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: job
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_service_label_component]
      separator: ;
      regex: (.+)
      target_label: job
      replacement: ${1}
      action: replace
    - {separator: ;, regex: (.*), target_label: endpoint, replacement: https, action: replace}
- Prometheus alert rules
  - alert: AckCCMWarning
    annotations:
      message: CCM is not available in last 3 minutes. Please check the prometheus job and target status.
    expr: |
      (absent(up{job="ack-cloud-controller-manager",pod!=""}) or (count(up{job="ack-cloud-controller-manager",pod!=""}) <= 0)) == 1
    for: 3m
    labels:
      severity: critical
- Metric list
  For the metrics collected from cloud-controller-manager, see the cloud-controller-manager metric list.
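Monitoring from outside the ACK Pro cluster
To monitor the Kubernetes cluster with a Prometheus instance deployed outside the ACK Pro cluster, see Configuration and Monitoring kubernetes with prometheus from outside of k8s cluster. The main configuration is as follows: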
- job_name: 'out-of-k8s-scrape-job'
  scheme: https
  tls_config:
    ca_file: /etc/prometheus/kubernetes-ca.crt
  bearer_token: '<SERVICE ACCOUNT BEARER TOKEN>'
  kubernetes_sd_configs:
  - api_server: 'https://<KUBERNETES URL>'
    role: node
    tls_config:
      ca_file: /etc/prometheus/kubernetes-ca.crt
    bearer_token: '<SERVICE ACCOUNT BEARER TOKEN>'
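To avoid embedding the token in the configuration file, the same job can read the token from a file. The following variant is a sketch under two assumptions: the Prometheus version in use supports the `authorization` block, and the service account token has been saved to the hypothetical path `/etc/prometheus/kubernetes-token`.

```yaml
- job_name: 'out-of-k8s-scrape-job'
  scheme: https
  tls_config:
    ca_file: /etc/prometheus/kubernetes-ca.crt
  # Credentials used when scraping the discovered targets.
  authorization:
    # Hypothetical path; the file contains only the service account token.
    credentials_file: /etc/prometheus/kubernetes-token
  kubernetes_sd_configs:
  - api_server: 'https://<KUBERNETES URL>'
    role: node
    tls_config:
      ca_file: /etc/prometheus/kubernetes-ca.crt
    # Credentials used when querying the API server for service discovery.
    authorization:
      credentials_file: /etc/prometheus/kubernetes-token
```

The `authorization` type defaults to Bearer, so only the credentials file needs to be set.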
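Verify the result
- Log on to the self-managed Prometheus console and switch to the Graph page.
- Enter up and check whether all control plane components are displayed.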
up
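Expected output:
Important
- up{instance="x.x.x.x:6443", job="ack-api-server"} is the status of the endpoint that acts as the proxy. x.x.x.x is the IP address of the kubernetes Service in the default namespace of the cluster. This IP address differs from cluster to cluster.
- up{instance="controlplane-xyz", job="ack-api-server", pod="controlplane-xyz"} is the status of a specific control plane pod. You can use this up metric as a liveness check for control plane pods.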
- Enter the following metric and check whether it is displayed correctly.
apiserver_request_total{job="ack-api-server"}
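Expected output:
If the page displays the queried metric and its data, the self-managed Prometheus instance is collecting the control plane core component metrics correctly.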
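The per-pod up series described in the verification step can also drive a liveness alert. The following rule is a sketch only: the group name and alert name are hypothetical, and it assumes that the per-pod up series reports 0 when the corresponding control plane pod is unhealthy.

```yaml
groups:
- name: ack-control-plane-liveness     # hypothetical group name
  rules:
  - alert: AckControlPlanePodDown      # hypothetical alert name
    # Fires for each API server pod whose up series has been 0 for 5 minutes.
    expr: up{job="ack-api-server", pod!=""} == 0
    for: 5m
    labels:
      severity: warning
    annotations:
      message: Control plane pod {{ $labels.pod }} has reported up == 0 for 5 minutes.
```

The same pattern applies to the other components by changing the job label in the selector.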