Prometheus Remote Write Configuration

Last updated: 2019-04-16 17:56:35

Alibaba Cloud TSDB instances of different types are provisioned with different maximum write TPS limits. These limits protect an instance from being overwhelmed by excessive write traffic and becoming unavailable. When the write TPS exceeds the maximum allowed for an instance, the instance's throttling rules are triggered and write requests fail. You therefore need to tune Prometheus's remote_write configuration to the TSDB instance type so that the metrics Prometheus collects are written to TSDB smoothly and reliably.
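
To confirm whether such throttling is actually causing write failures, the remote-storage metrics that Prometheus exposes about itself are a convenient signal. The snippet below is a minimal sketch rather than part of the official setup: it queries the Prometheus HTTP API for the rate of failed remote-write samples. The metric name prometheus_remote_storage_failed_samples_total and the local server address are assumptions here and can differ between Prometheus versions.

  # Minimal sketch: check whether remote writes are failing (e.g. due to TSDB throttling).
  # Assumption: the metric name below matches your Prometheus version; newer releases
  # renamed the remote-storage metrics, so adjust the query if necessary.
  import requests

  PROMETHEUS_URL = "http://localhost:9090"  # assumed address of the local Prometheus server

  def failed_remote_write_rate(window="5m"):
      """Return the per-second rate of samples that failed to be written to remote storage."""
      query = f"rate(prometheus_remote_storage_failed_samples_total[{window}])"
      resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
      resp.raise_for_status()
      result = resp.json()["data"]["result"]
      return [(series["metric"], float(series["value"][1])) for series in result]

  if __name__ == "__main__":
      for labels, rate in failed_remote_write_rate():
          # A sustained non-zero rate suggests the TSDB instance's write limit is being hit.
          print(labels, rate)

A sustained non-zero failure rate, once network problems are ruled out, usually means the queue settings discussed below need to be adjusted to the instance's write limit.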

All remote_write configuration items are documented on the official Prometheus website; this topic only covers best practices for the write configuration when connecting Prometheus to Alibaba Cloud TSDB. To improve write efficiency, Prometheus first buffers the samples it collects in an in-memory queue and then sends them to remote storage in batches. The parameters of this in-memory queue have a significant impact on how efficiently Prometheus writes to remote storage; the main configuration items are listed below.

  # Configures the queue used to write to remote storage.
  queue_config:
    # Number of samples to buffer per shard before we start dropping them.
    [ capacity: <int> | default = 10000 ]
    # Maximum number of shards, i.e. amount of concurrency.
    [ max_shards: <int> | default = 1000 ]
    # Minimum number of shards, i.e. amount of concurrency.
    [ min_shards: <int> | default = 1 ]
    # Maximum number of samples per send.
    [ max_samples_per_send: <int> | default = 100 ]
    # Maximum time a sample will wait in buffer.
    [ batch_send_deadline: <duration> | default = 5s ]
    # Maximum number of times to retry a batch on recoverable errors.
    [ max_retries: <int> | default = 3 ]
    # Initial retry delay. Gets doubled for every retry.
    [ min_backoff: <duration> | default = 30ms ]
    # Maximum retry delay.
    [ max_backoff: <duration> | default = 100ms ]

Of the items above, min_shards is only supported in Prometheus V2.6.0 and later; versions before V2.6.0 behave as if it were 1, so unless you have a specific need you can leave this parameter unset.

Among these parameters, max_shards and max_samples_per_send determine the maximum TPS at which Prometheus writes to remote storage. Assuming that sending 100 samples takes 100 ms, the default configuration above gives a maximum write TPS of 1000 * 100 / 0.1 s = 1,000,000 samples/s. If the maximum write TPS of the TSDB instance you purchased is lower than 1,000,000/s, the instance's throttling rules are easily triggered and writes fail. The table below lists recommended remote_write settings for each TSDB instance type; adjust them as needed for your workload (a short sketch after the table shows how the max_shards values follow from the write limits).

TSDB instance type    Write throughput (data points/s)    Recommended configuration
mlarge                5000                                capacity: 10000, max_samples_per_send: 500, max_shards: 1
large                 10000                               capacity: 10000, max_samples_per_send: 500, max_shards: 2
3xlarge               30000                               capacity: 10000, max_samples_per_send: 500, max_shards: 6
4xlarge               40000                               capacity: 10000, max_samples_per_send: 500, max_shards: 8
6xlarge               60000                               capacity: 10000, max_samples_per_send: 500, max_shards: 12
12xlarge              120000                              capacity: 10000, max_samples_per_send: 500, max_shards: 24
24xlarge              240000                              capacity: 10000, max_samples_per_send: 500, max_shards: 48
48xlarge              480000                              capacity: 10000, max_samples_per_send: 500, max_shards: 96
96xlarge              960000                              capacity: 10000, max_samples_per_send: 500, max_shards: 192
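
For reference, the max_shards values in the table can be reproduced with a small calculation. The sketch below is illustrative only: following the reasoning above, it assumes that one batch of max_samples_per_send = 500 samples takes roughly 0.1 s to send, so a single shard sustains about 5000 data points per second; the 0.1 s send latency and the helper name are assumptions, not measured values.

  # Minimal sketch: derive the recommended max_shards for each TSDB instance type.
  # Assumption: one send of 500 samples takes about 0.1 s (as in the estimate above),
  # so a single shard sustains roughly 5000 data points per second.
  MAX_SAMPLES_PER_SEND = 500
  SECONDS_PER_SEND = 0.1  # assumed round-trip time for one batch

  # Maximum write throughput of each TSDB instance type (data points per second).
  TSDB_WRITE_LIMITS = {
      "mlarge": 5000, "large": 10000, "3xlarge": 30000,
      "4xlarge": 40000, "6xlarge": 60000, "12xlarge": 120000,
      "24xlarge": 240000, "48xlarge": 480000, "96xlarge": 960000,
  }

  def recommended_max_shards(write_limit_tps: int) -> int:
      """Largest shard count whose combined throughput stays within the instance limit."""
      per_shard_tps = MAX_SAMPLES_PER_SEND / SECONDS_PER_SEND  # ~5000 samples/s per shard
      return max(1, int(write_limit_tps // per_shard_tps))

  if __name__ == "__main__":
      for spec, limit in TSDB_WRITE_LIMITS.items():
          print(f"{spec:>8}: write limit {limit:>6} -> max_shards {recommended_max_shards(limit)}")

Running this prints exactly the max_shards column of the table; if your observed send latency differs from 0.1 s, scale the shard count accordingly.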

Taking a TSDB instance of the mlarge type as an example, a complete reference Prometheus configuration looks like this:

  # my global config
  global:
    scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
    evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
    # scrape_timeout is set to the global default (10s).

  # Alertmanager configuration
  alerting:
    alertmanagers:
    - static_configs:
      - targets:
        # - alertmanager:9093

  # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
  rule_files:
    # - "first_rules.yml"
    # - "second_rules.yml"

  # A scrape configuration containing exactly one endpoint to scrape:
  # Here it's Prometheus itself.
  scrape_configs:
    # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
    - job_name: 'prometheus'
      # metrics_path defaults to '/metrics'
      # scheme defaults to 'http'.
      static_configs:
      - targets: ['localhost:9090']

  # Remote write configuration (TSDB).
  remote_write:
    - url: "http://ts-xxxxxxxxxxxx.hitsdb.rds.aliyuncs.com:3242/api/prom_write"
      # Configures the queue used to write to remote storage.
      queue_config:
        # Number of samples to buffer per shard before we start dropping them.
        capacity: 10000
        # Maximum number of shards, i.e. amount of concurrency.
        max_shards: 1
        # Maximum number of samples per send.
        max_samples_per_send: 500

  # Remote read configuration (TSDB).
  remote_read:
    - url: "http://ts-xxxxxxxxxxxx.hitsdb.rds.aliyuncs.com:3242/api/prom_read"
      read_recent: true