ZooKeeper Monitor Guide
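New Metrics System
The New Metrics System feature, available since 3.6.0, provides rich metrics that help users monitor ZooKeeper across topics such as znodes, network, disk, quorum, leader election, clients, security, failures, watches/sessions, and request processors.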
Metrics
All the metrics are defined in ServerMetrics.java.
Prometheus
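- Running a Prometheus monitoring service is the easiest way to ingest and record ZooKeeper's metrics.
- Prerequisites:
  - Enable the Prometheus MetricsProvider by setting metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider in zoo.cfg.
  - The port can also be configured by setting metricsProvider.httpPort (default: 7000).
  - Install Prometheus: go to the download page on the official website and download the latest release.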
- Configure Prometheus's scraper to target your ZooKeeper cluster endpoints:

      cat > /tmp/test-zk.yaml <<EOF
      global:
        scrape_interval: 10s
      scrape_configs:
        - job_name: test-zk
          static_configs:
          - targets: ['192.168.10.32:7000','192.168.10.33:7000','192.168.10.34:7000']
      EOF
      cat /tmp/test-zk.yaml
- Set up the Prometheus handler:

      nohup /tmp/prometheus \
          --config.file /tmp/test-zk.yaml \
          --web.listen-address ":9090" \
          --storage.tsdb.path "/tmp/test-zk.data" >> /tmp/test-zk.log 2>&1 &
- Prometheus will now scrape the ZooKeeper metrics every 10 seconds.
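Before relying on Prometheus, it can help to read a metric straight from one server's metrics endpoint. The following is a minimal sketch, not part of ZooKeeper: zk_metric is a hypothetical helper name, and it assumes the provider serves the Prometheus text format at /metrics on the configured port and that the metric is an unlabeled sample such as znode_count.

```shell
#!/bin/sh
# zk_metric NAME: read Prometheus text-format metrics on stdin and print
# the value of the unlabeled sample called NAME (hypothetical helper).
zk_metric() {
    # "# HELP" and "# TYPE" lines have "#" as their first field,
    # so they never match NAME and are skipped.
    awk -v name="$1" '$1 == name { print $2 }'
}
```

For example, piping the output of curl -s http://192.168.10.32:7000/metrics into zk_metric znode_count should print that server's current znode count, assuming the endpoint is reachable from where you run it.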
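Alerting with Prometheus
- We recommend reading the official Prometheus alerting page to learn some alerting principles.
- We recommend using Prometheus Alertmanager, which makes it more convenient to receive alerts by email or instant message (via webhook).
- We provide an alerting example covering the metrics that deserve special attention. Note: this is for reference only; adjust it to your actual situation and resource environment.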

      # Use ./promtool check rules rules/zk.yml to check the correctness of the config file.
      cat rules/zk.yml

      groups:
      - name: zk-alert-example
        rules:
        - alert: ZooKeeper server is down
          expr: up == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Instance {{ $labels.instance }} ZooKeeper server is down"
            description: "{{ $labels.instance }} of job {{$labels.job}} ZooKeeper server is down: [{{ $value }}]."

        - alert: create too many znodes
          expr: znode_count > 1000000
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} create too many znodes"
            description: "{{ $labels.instance }} of job {{$labels.job}} create too many znodes: [{{ $value }}]."

        - alert: create too many connections
          expr: num_alive_connections > 50 # suppose we use the default maxClientCnxns: 60
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} create too many connections"
            description: "{{ $labels.instance }} of job {{$labels.job}} create too many connections: [{{ $value }}]."

        - alert: znode total occupied memory is too big
          expr: approximate_data_size /1024 /1024 > 1 * 1024 # more than 1024 MB(1 GB)
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} znode total occupied memory is too big"
            description: "{{ $labels.instance }} of job {{$labels.job}} znode total occupied memory is too big: [{{ $value }}] MB."

        - alert: set too many watch
          expr: watch_count > 10000
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} set too many watch"
            description: "{{ $labels.instance }} of job {{$labels.job}} set too many watch: [{{ $value }}]."

        - alert: a leader election happens
          expr: increase(election_time_count[5m]) > 0
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} a leader election happens"
            description: "{{ $labels.instance }} of job {{$labels.job}} a leader election happens: [{{ $value }}]."

        - alert: open too many files
          expr: open_file_descriptor_count > 300
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} open too many files"
            description: "{{ $labels.instance }} of job {{$labels.job}} open too many files: [{{ $value }}]."

        - alert: fsync time is too long
          expr: rate(fsynctime_sum[1m]) > 100
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} fsync time is too long"
            description: "{{ $labels.instance }} of job {{$labels.job}} fsync time is too long: [{{ $value }}]."

        - alert: take snapshot time is too long
          expr: rate(snapshottime_sum[5m]) > 100
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} take snapshot time is too long"
            description: "{{ $labels.instance }} of job {{$labels.job}} take snapshot time is too long: [{{ $value }}]."

        - alert: avg latency is too high
          expr: avg_latency > 100
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} avg latency is too high"
            description: "{{ $labels.instance }} of job {{$labels.job}} avg latency is too high: [{{ $value }}]."

        - alert: JvmMemoryFillingUp
          expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "JVM memory filling up (instance {{ $labels.instance }})"
            description: "JVM memory is filling up (> 80%)\n labels: {{ $labels }} value = {{ $value }}\n"
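Alert rules only take effect once the Prometheus configuration loads them, and alerts only reach Alertmanager if it is listed as a target. The sketch below writes one possible configuration that does both. The file locations (/tmp/test-zk-alerts.yaml, /tmp/rules/zk.yml) and the Alertmanager address 127.0.0.1:9093 (its default port) are assumptions made for illustration; adjust them to your environment.

```shell
#!/bin/sh
# Assumed output location; override by exporting CONF before running.
CONF="${CONF:-/tmp/test-zk-alerts.yaml}"

# The quoted 'EOF' stops the shell from expanding anything in the file body.
cat > "$CONF" <<'EOF'
global:
  scrape_interval: 10s
# Load the example alert rules (assumed path).
rule_files:
  - /tmp/rules/zk.yml
# Forward firing alerts to an Alertmanager assumed to run locally.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['127.0.0.1:9093']
scrape_configs:
  - job_name: test-zk
    static_configs:
      - targets: ['192.168.10.32:7000','192.168.10.33:7000','192.168.10.34:7000']
EOF
cat "$CONF"
```

Running ./promtool check config against the generated file is a reasonable way to validate it before restarting Prometheus with --config.file pointing at it.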
Grafana
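- Grafana has built-in Prometheus support; just add a Prometheus data source:

      Name:   test-zk
      Type:   Prometheus
      Url:    http://127.0.0.1:9090
      Access: proxy

- Then download and import the default ZooKeeper dashboard template and customize it.
- Users who have good improvements to contribute can request a Grafana dashboard account by writing to dev@zookeeper.apache.org.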
InfluxDB
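InfluxDB is an open source time series database that is often used to store metrics from ZooKeeper. You can download the open source version or create a free account on InfluxDB Cloud. In either case, configure the Apache Zookeeper Telegraf plugin to start collecting metrics from your ZooKeeper clusters and storing them in your InfluxDB instance. There is also an Apache Zookeeper InfluxDB template that includes the Telegraf configuration and a dashboard so you can get set up right away.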
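As a rough illustration of the Telegraf setup, the sketch below writes a minimal configuration that pairs the zookeeper input plugin with the influxdb_v2 output plugin. Every value in it is an assumption: the output path, the server address and client port 2181, the InfluxDB URL, and the example-org and zookeeper names are placeholders, and the token is expected to be supplied by Telegraf from an INFLUX_TOKEN environment variable.

```shell
#!/bin/sh
# Assumed output location; override by exporting TELEGRAF_CONF before running.
TELEGRAF_CONF="${TELEGRAF_CONF:-/tmp/telegraf-zk.conf}"

# The quoted 'EOF' keeps $INFLUX_TOKEN literal so Telegraf, not this
# shell, substitutes it when it loads the file.
cat > "$TELEGRAF_CONF" <<'EOF'
# Collect ZooKeeper statistics from one server's client port (placeholder address).
[[inputs.zookeeper]]
  servers = ["192.168.10.32:2181"]

# Write the collected metrics to InfluxDB (placeholder URL, org, and bucket).
[[outputs.influxdb_v2]]
  urls = ["http://127.0.0.1:8086"]
  token = "$INFLUX_TOKEN"
  organization = "example-org"
  bucket = "zookeeper"
EOF
cat "$TELEGRAF_CONF"
```

The zookeeper input gathers its data through the mntr four letter word, so that command needs to be allowed on the servers it polls.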
JMX
More details can be found here.
Four letter words
More details can be found here.
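For quick checks without Prometheus, the mntr four letter word prints one tab-separated key and value per line. The helper below is a small sketch for pulling a single value out of that output; mntr_value is a hypothetical name, and the command must be allowed through the 4lw.commands.whitelist setting in zoo.cfg for the server to answer it.

```shell
#!/bin/sh
# mntr_value KEY: read "mntr" output on stdin (tab-separated key/value
# lines) and print the value stored under KEY (hypothetical helper).
mntr_value() {
    awk -F '\t' -v key="$1" '$1 == key { print $2 }'
}
```

For example, echo mntr | nc 192.168.10.32 2181 | mntr_value zk_server_state should print that server's current role, assuming nc is installed and the client port is reachable.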