For nginx monitoring rules, see: Awesome Prometheus alerts | Collection of alerting rules
In practice, the rules listed there may need minor changes before they can be used:
groups:
- name: Nginx
  rules:
  # Note: the denominator in the two error-rate rules sums every code series.
  # If your exporter also exposes an aggregate series (for example code="total"),
  # narrow the denominator so that requests are not counted twice.
  - alert: NginxHighHttp4xxErrorRate
    expr: sum(rate(nginx_server_requests{code="4xx", host=~"[a-zA-Z0-9][-a-zA-Z0-9]{0,62}(.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})+.?"}[1m])) by (host, instance) / sum(rate(nginx_server_requests[1m])) by (host, instance) * 100 > 30
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Nginx high HTTP 4xx error rate (instance {{ $labels.instance }})
      description: "Too many HTTP requests with status 4xx (> 30%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: NginxHighHttp5xxErrorRate
    expr: sum(rate(nginx_server_requests{code="5xx", host=~"[a-zA-Z0-9][-a-zA-Z0-9]{0,62}(.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})+.?"}[1m])) by (host, instance) / sum(rate(nginx_server_requests[1m])) by (host, instance) * 100 > 30
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
      description: "Too many HTTP requests with status 5xx (> 30%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  # Note: histogram_quantile needs bucket series that carry an `le` label.
  # Check that nginx_server_requestMsec is exposed as a histogram by your
  # exporter before relying on this rule.
  - alert: NginxLatencyHigh
    expr: histogram_quantile(0.99, sum(rate(nginx_server_requestMsec[2m])) by (host, node)) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Nginx latency high (instance {{ $labels.instance }})
      description: "Nginx p99 latency is higher than 3 seconds\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
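The rules above change the matching expressions for 4xx and 5xx responses. The nginx_server_requests metric is collected by installing nginx-vts-exporter, not the nginx-lua-prometheus library used in the referenced article. In addition, the host matcher uses the regular expression [a-zA-Z0-9][-a-zA-Z0-9]{0,62}(.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})+.? to select domain names, which avoids matching unwanted host values such as * or _.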
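To see what the host regular expression actually keeps, here is a minimal sketch in Python. It assumes Python's `re` module behaves like Prometheus's regex engine for this particular pattern, and it uses `re.fullmatch` because Prometheus anchors label regex matchers at both ends. The helper name `host_is_selected` and the sample host values are hypothetical and only serve as illustration.

```python
import re

# Host pattern copied verbatim from the alert rules in this post.
HOST_PATTERN = r"[a-zA-Z0-9][-a-zA-Z0-9]{0,62}(.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})+.?"


def host_is_selected(host: str) -> bool:
    """Return True if the host=~ matcher would keep a series with this host label."""
    # Prometheus regex matchers are fully anchored, so fullmatch is the
    # closest equivalent in Python's re module.
    return re.fullmatch(HOST_PATTERN, host) is not None


# Hypothetical host label values, for illustration only.
samples = ["www.example.com", "example.com.", "*", "_", "localhost"]

for host in samples:
    print(f"{host!r}: {host_is_selected(host)}")
```

With these samples, `*` and `_` are rejected while `www.example.com` and `example.com.` are kept. Note that `localhost` is also kept: the `.` in the pattern is unescaped, so it matches any character and not only a literal dot. If only dotted domain names should pass, the dots would need to be escaped in the rule.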