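Observability Stack: Prometheus + Grafana + Loki
A close look at the three pillars of cloud-native observability, and how to build a complete monitoring and alerting stack with Prometheus, Grafana, and Loki.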
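Overview
Observability is a core capability for operating cloud-native applications. It rests on three pillars: metrics, logs, and traces.
Learning objectives:
- Understand the three pillars of observability and how they map to SRE practice
- Collect and query metrics with Prometheus
- Configure Grafana dashboards
- Collect and query logs with Loki
- Configure alerting with Alertmanager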
Observability Overview
The Three Pillars
┌─────────────────────────────────────────────────────────────────┐
│               The Three Pillars of Observability                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────┐    ┌───────────────┐    ┌───────────────┐    │
│  │    Metrics    │    │     Logs      │    │    Traces     │    │
│  ├───────────────┤    ├───────────────┤    ├───────────────┤    │
│  │ Prometheus    │    │ Loki / ELK    │    │ Jaeger/Zipkin │    │
│  │ Graphite      │    │ Fluentd       │    │ OpenTelemetry │    │
│  │ InfluxDB      │    │ Filebeat      │    │               │    │
│  │               │    │               │    │               │    │
│  │ "What         │    │ "Why it       │    │ "How it       │    │
│  │  happened"    │    │  happened"    │    │  happened"    │    │
│  └───────────────┘    └───────────────┘    └───────────────┘    │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                  Observability Platform                   │  │
│  │    Prometheus + Grafana + Loki + Jaeger + Alertmanager    │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
SRE Observability Practices
┌─────────────────────────────────────────────────────────────────┐
│                   SRE Observability Practices                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Golden Signals:                                                │
│    - Latency:    P50/P99, latency distribution                  │
│    - Traffic:    QPS, RPS                                       │
│    - Errors:     error rate, 5xx responses                      │
│    - Saturation: CPU/memory, queue length                       │
│                                                                 │
│  USE method (resources):                                        │
│    - Utilization                                                │
│    - Saturation                                                 │
│    - Errors                                                     │
│                                                                 │
│  RED method (services):                                         │
│    - Rate:     request rate                                     │
│    - Errors:   error rate                                       │
│    - Duration: latency                                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
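To make the RED method concrete, here is a minimal sketch that summarises one window of requests as a rate, an error ratio, and a P99 duration. The `Request` record and the `red_summary` function are hypothetical names chosen for illustration, and the nearest-rank percentile is a simplification of what a real metrics backend computes from histograms.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request, in seconds


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarise a window of requests as Rate, Errors, and Duration."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank P99: the smallest value with at least 99% of samples at or below it.
    p99 = durations[max(0, math.ceil(0.99 * total) - 1)] if total else 0.0
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p99_s": p99,
    }
```

In practice these three numbers come from PromQL over counters and histograms rather than from raw request records, but the definitions are the same.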
Prometheus
Prometheus Architecture
┌─────────────────────────────────────────────────────────────────┐
│                     Prometheus Architecture                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                     Prometheus Server                     │  │
│  │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  │  │
│  │  │    Scrape     │  │     TSDB      │  │     HTTP      │  │  │
│  │  │    Manager    │  │   (storage)   │  │    Server     │  │  │
│  │  └───────┬───────┘  └───────────────┘  └───────────────┘  │  │
│  └──────────┼────────────────────────────────────────────────┘  │
│             │ pull /metrics                                     │
│             ▼                                                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                          Targets                          │  │
│  │    ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐    │  │
│  │    │ K8s Pod  │ │ Node     │ │ App      │ │ Service  │    │  │
│  │    │(cAdvisor)│ │ Exporter │ │ Metrics  │ │ Endpoint │    │  │
│  │    └──────────┘ └──────────┘ └──────────┘ └──────────┘    │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  Components:                                                    │
│    - Scrape: pulls metrics from targets                         │
│    - TSDB: time-series database storage                         │
│    - PromQL: query language                                     │
│    - Alertmanager: alerting                                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
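What a target returns when Prometheus pulls `/metrics` is plain text in the exposition format. The sketch below renders one counter family in that format. `render_counter` is a hypothetical helper written for this article, and it skips the escaping of special characters in label values that a real client library performs.

```python
def render_counter(name: str, help_text: str, samples: list[tuple[dict[str, str], float]]) -> str:
    """Render one counter family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Labels are rendered as key="value" pairs inside braces.
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"
```

In a real service you would normally rely on an official client library instead of formatting this by hand; the sketch only shows what the scrape manager reads on each pull.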
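Installing Prometheus
# Option 1: install kube-prometheus-stack with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
# Option 2: install only the Prometheus Operator and its CRDs, then define your own resources.
# Use "create" rather than "apply": the CRDs in the bundle are too large for the client-side apply annotation.
kubectl create -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.73.0/bundle.yaml
# Access the Grafana UI at http://localhost:3000 (the Service name assumes the Helm release is called "prometheus")
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Access the Prometheus UI at http://localhost:9090
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090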
ServiceMonitor
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  labels:
    release: prometheus   # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics       # name of the Service port, not its number
      interval: 15s       # scrape interval
      path: /metrics
      scheme: http
      tlsConfig:          # only takes effect when scheme is https
        insecureSkipVerify: true
  namespaceSelector:
    matchNames:
      - production
  targetLabels:           # Service labels copied onto the scraped series
    - app
    - environment
PodMonitor
# podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: myapp-pod
  labels:
    release: prometheus   # must match the Prometheus podMonitorSelector
spec:
  selector:
    matchLabels:
      app: myapp
  podMetricsEndpoints:
    - port: metrics       # name of the container port
      path: /metrics
      interval: 15s
  namespaceSelector:
    matchNames:
      - production
Prometheus Configuration
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093
    rule_files:
      - /etc/prometheus/rules/*.yml
    scrape_configs:
      # Kubernetes API server
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          # Three source labels are joined with ";" to match the three-part regex
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      # Kubernetes nodes (the kubelet serves metrics over HTTPS)
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      # Kubernetes pods that opt in through prometheus.io/* annotations
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Keep only pods annotated with prometheus.io/scrape: "true"
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          # Override the metrics path when prometheus.io/path is set
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          # Rewrite the scrape address to use the port from prometheus.io/port
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
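The last relabel rule is the hardest one to read, so here is a small sketch of how a `replace` action evaluates it: the source label values are joined with `;`, the regex must match the whole joined string, and `$1`/`$2` in the replacement refer to capture groups. `relabel_replace` is a hypothetical function, and Python's `re` module stands in for the RE2 engine Prometheus uses, which behaves the same for this pattern.

```python
import re
from typing import Optional


def relabel_replace(source_values: list[str], regex: str, replacement: str,
                    separator: str = ";") -> Optional[str]:
    """Return the new label value, or None when the regex does not match."""
    joined = separator.join(source_values)
    # Prometheus anchors relabel regexes at both ends, hence fullmatch.
    match = re.fullmatch(regex, joined)
    if match is None:
        return None  # the target label is left unchanged
    # Substitute $1, $2, ... with the corresponding capture groups.
    return re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), replacement)


ADDRESS_REGEX = r"([^:]+)(?::\d+)?;(\d+)"

# A pod at 10.0.0.5:8080 annotated with prometheus.io/port: "9102"
new_address = relabel_replace(["10.0.0.5:8080", "9102"], ADDRESS_REGEX, "$1:$2")
```

When the port annotation is missing, the joined string ends in `;` with no digits after it, the regex fails to match, and the original address is kept.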
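PromQL Queries
Basic Syntax
# Basic queries
# Scrape health of every target (1 = up, 0 = down)
up
# Select a metric by name
container_cpu_usage_seconds_total
# Filter by labels
container_cpu_usage_seconds_total{pod="myapp-12345", namespace="production"}
# Aggregate the per-second rate over the last 5 minutes
sum(rate(container_cpu_usage_seconds_total[5m]))
# Percentile (P99) estimated from histogram buckets
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))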
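Since `rate()` appears in nearly every query that follows, it helps to see what it computes. The sketch below is a simplified model: it sums the increases between consecutive counter samples, treats a drop as a counter reset, and divides by the time covered. `simple_rate` is a hypothetical name, and unlike the real function it does not extrapolate to the edges of the range window.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter.

    samples: (timestamp_seconds, counter_value) pairs in time order.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero (for example after a
        # process restart), so the whole current value counts as new increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed
```

This is why `rate()` should only be applied to counters: on a gauge, every decrease would be misread as a reset.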
Common Query Templates
# Golden-signal queries
# Latency - P99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
# Latency - P50
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
# Traffic - requests per second
sum(rate(http_requests_total[5m])) by (service)
# Errors - ratio of 5xx responses to all responses
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
# CPU usage as a percentage of the CPU limit
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) /
sum(container_spec_cpu_quota / container_spec_cpu_period) by (pod) * 100
# Memory usage as a percentage of the memory limit (containers without a limit are skipped)
container_memory_working_set_bytes / (container_spec_memory_limit_bytes > 0) * 100
# Number of pods per namespace
count(kube_pod_info) by (namespace)
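The latency queries rely on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. Here is a sketch of that idea, assuming positive bucket bounds and `0 < q <= 1`. The function name mirrors the PromQL one for readability, but this is an illustration and not the Prometheus source.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (le upper bound, cumulative count) pairs sorted by bound,
    ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls into +Inf: the best available answer is the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)
```

The result can never be more precise than the bucket layout allows, so a P99 alert threshold should sit on or near a bucket boundary.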
Alerting Rules
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  labels:
    app: myapp
    release: prometheus   # must match the Prometheus ruleSelector
spec:
  groups:
    - name: myapp.rules
      rules:
        # Error rate above 5% for 5 minutes
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
            sum(rate(http_requests_total[5m])) by (service) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
        # P99 latency above 1 second for 5 minutes
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency on {{ $labels.service }}"
            description: "P99 latency is {{ $value }}s"
        # More than 5 container restarts within an hour
        - alert: PodRestartingTooMuch
          expr: |
            increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting too much"
        # Working-set memory above 90% of the limit (containers without a limit are skipped)
        - alert: HighMemoryUsage
          expr: |
            (container_memory_working_set_bytes / (container_spec_memory_limit_bytes > 0)) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Memory usage is high on {{ $labels.namespace }}/{{ $labels.pod }}"
            description: "Memory usage is {{ $value | humanizePercentage }}"
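The `for` field in these rules is what separates a brief spike from a real alert. The sketch below models the resulting state machine: an alert is inactive while its expression returns nothing, pending once it starts returning results, and firing after it has stayed true for the whole `for` duration. `alert_states` is a hypothetical function that assumes evaluations happen at a fixed interval starting at time zero.

```python
from typing import Optional


def alert_states(condition_by_eval: list[bool], eval_interval_s: int, for_s: int) -> list[str]:
    """Alert state after each rule evaluation."""
    states: list[str] = []
    active_since: Optional[int] = None  # time at which the condition last became true
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not is_true:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states
```

Note how a single false evaluation sends the alert back to the start, which is why a flapping condition with a long `for` may never fire at all.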
Grafana
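Installation and Configuration
# Access Grafana at http://localhost:3000
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Default credentials for the Helm install: admin / prom-operator
# Read the admin password from the Secret
kubectl get secret prometheus-grafana -n monitoring -o jsonpath='{.data.admin-password}' | base64 -d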
Data Source Configuration
# grafana-datasource.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"   # label the Grafana sidecar watches for (kube-prometheus-stack default)
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus-operated:9090
        isDefault: true
        uid: prometheus
        editable: false
      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100   # Service name assumes the Helm release is called "loki"
        editable: false
Dashboard Configuration
# grafana-dashboard.yaml
# For sidecar provisioning the JSON must be the dashboard model itself;
# the {"dashboard": {...}} wrapper is only used by the HTTP API.
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  myapp-dashboard.json: |
    {
      "title": "MyApp Overview",
      "uid": "myapp-overview",
      "panels": [
        {
          "title": "Request Rate",
          "type": "timeseries",
          "targets": [
            {
              "expr": "sum(rate(http_requests_total[5m])) by (service)",
              "legendFormat": "{{service}}"
            }
          ]
        },
        {
          "title": "Error Rate",
          "type": "stat",
          "targets": [
            {
              "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
            }
          ]
        },
        {
          "title": "Latency P99",
          "type": "gauge",
          "targets": [
            {
              "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
            }
          ]
        }
      ]
    }
Importing Popular Dashboards
# Import community dashboards by ID (Dashboards > Import in the Grafana UI)
# Kubernetes cluster monitoring
# ID: 10856
# Kubernetes pod monitoring
# ID: 10518
# Node Exporter Full
# ID: 1860
# Prometheus Overview
# ID: 3662
Log Collection with Loki
Loki Architecture
┌─────────────────────────────────────────────────────────────────┐
│                        Loki Architecture                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                          Grafana                          │  │
│  └─────────────────────────────┬─────────────────────────────┘  │
│                                │                                │
│                                ▼                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                        Loki Server                        │  │
│  │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  │  │
│  │  │  Distributor  │  │   Ingester    │  │    Querier    │  │  │
│  │  │ (receive logs)│  │    (store)    │  │    (query)    │  │  │
│  │  └───────────────┘  └───────┬───────┘  └───────┬───────┘  │  │
│  │                             ├──────────────────┘          │  │
│  │                             ▼                             │  │
│  │                   ┌───────────────────┐                   │  │
│  │                   │  Object Storage   │                   │  │
│  │                   │    (S3/MinIO)     │                   │  │
│  │                   └───────────────────┘                   │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ▲                                │
│                                │                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    Promtail / Fluentd                     │  │
│  │           (collect logs and ship them to Loki)            │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
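Installing Loki
# Install the Loki stack (Loki + Promtail) with Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Grafana and Prometheus are disabled here because kube-prometheus-stack already provides them
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.enabled=false \
  --set prometheus.enabled=false \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=50Gi
# Alternatively, install Loki on its own (use one option or the other, not both)
helm install loki grafana/loki \
  --namespace monitoring \
  --create-namespace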
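Promtail Configuration
# promtail-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: monitoring
data:
  promtail-config.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0
    positions:
      filename: /tmp/positions.yaml   # read offsets; use a persistent path to survive restarts
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      # Kubernetes pod logs
      - job_name: kubernetes
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          # Expose the pod's "app" label as "service", which the LogQL examples query on
          - source_labels: [__meta_kubernetes_pod_label_app]
            target_label: service
          - action: replace
            target_label: job
            replacement: kubernetes-pods
          # Tell Promtail which files on the node belong to this pod
          - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
            separator: /
            target_label: __path__
            replacement: /var/log/pods/*$1/*.log
      # Static file-based job; it reads the same directory as the job above,
      # so enable only one of the two to avoid duplicate log lines
      - job_name: app-logs
        static_configs:
          - targets:
              - localhost
            labels:
              job: app-logs
              __path__: /var/log/pods/*/*/*.log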
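LogQL Queries
# Basic stream selector
{service="myapp"}
# Filter on several labels
{service="myapp", namespace="production"}
# Lines containing a string
{service="myapp"} |= "ERROR"
# Lines not containing a string
{service="myapp"} != "DEBUG"
# Regex match (backticks avoid having to escape the backslash)
{service="myapp"} |~ `user_id=\d+`
# Number of log lines per namespace over 5 minutes
sum(count_over_time({service="myapp"}[5m])) by (namespace)
# Error lines per second
rate({service="myapp"} |= "ERROR" [5m])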
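The line filters in these queries chain from left to right, and each one narrows the result of the previous stage. The sketch below reproduces that behaviour on an in-memory list of lines. `apply_line_filters` is a hypothetical helper, and Python's `re` stands in for the Go regex engine Loki uses, so exotic patterns may behave differently.

```python
import re
from typing import Callable

# One predicate per LogQL line-filter operator.
LINE_FILTERS: dict[str, Callable[[str, str], bool]] = {
    "|=": lambda line, pattern: pattern in line,
    "!=": lambda line, pattern: pattern not in line,
    # Regex line filters are unanchored: a match anywhere in the line counts.
    "|~": lambda line, pattern: re.search(pattern, line) is not None,
    "!~": lambda line, pattern: re.search(pattern, line) is None,
}


def apply_line_filters(lines: list[str], filters: list[tuple[str, str]]) -> list[str]:
    """Keep the lines that pass every (operator, pattern) stage."""
    return [
        line for line in lines
        if all(LINE_FILTERS[op](line, pattern) for op, pattern in filters)
    ]
```

Plain substring filters are cheaper than regex filters, so putting `|=` or `!=` first in a pipeline tends to help query speed.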
Example Dashboard
# logs-dashboard.json
{
  "dashboard": {
    "title": "Application Logs",
    "panels": [
      {
        "title": "Error Logs",
        "type": "logs",
        "targets": [
          {
            "expr": "{service=\"myapp\"} |= \"ERROR\"",
            "legendFormat": "{{pod}}"
          }
        ],
        "options": {
          "showLabels": true,
          "showCommonLabels": true,
          "showTime": true
        }
      },
      {
        "title": "Log Volume by Level",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum by (level) (count_over_time({service=\"myapp\"}[5m]))"
          }
        ]
      }
    ]
  }
}
Alerting with Alertmanager
Alertmanager Configuration
# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      # email_configs require SMTP settings; these two values are placeholders
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'default-receiver'
      routes:
        - matchers:
            - severity="critical"
          receiver: 'critical-receiver'
          continue: true   # keep evaluating the sibling routes below
        - matchers:
            - severity="warning"
          receiver: 'warning-receiver'
    receivers:
      - name: 'default-receiver'
        webhook_configs:
          - url: 'http://alert-webhook:5000/alerts'
      - name: 'critical-receiver'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/xxx'
            channel: '#critical-alerts'
            send_resolved: true
            title: 'Critical Alert: {{ .GroupLabels.alertname }}'
            # A notification covers a group of alerts, so use the Common* fields
            text: |
              *Alert:* {{ .GroupLabels.alertname }}
              *Severity:* {{ .CommonLabels.severity }}
              *Description:* {{ .CommonAnnotations.description }}
      - name: 'warning-receiver'
        email_configs:
          - to: 'oncall@example.com'
            send_resolved: true
    inhibit_rules:
      # A firing critical alert mutes the matching warning alert
      - source_matchers:
          - severity="critical"
        target_matchers:
          - severity="warning"
        equal: ['alertname', 'cluster', 'service']
Alert Routing
┌─────────────────────────────────────────────────────────────────┐
│                      Alertmanager Routing                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  root route (receiver: default-receiver)                        │
│   │                                                             │
│   ├── severity=critical ──▶ critical-receiver (Slack)           │
│   │     continue: true, so sibling routes are still evaluated   │
│   │                                                             │
│   ├── severity=warning ───▶ warning-receiver (Email)            │
│   │                                                             │
│   └── no child route matched ──▶ default-receiver (webhook)     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
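The routing tree is easier to reason about as code. The sketch below walks a tree in the way the Alertmanager documentation describes: child routes are tried in order, a matching child stops the search unless it sets `continue`, and the parent's receiver is used only when no child matched. `Route` and `resolve` are hypothetical names, and matching is reduced to exact label equality.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict[str, str] = field(default_factory=dict)  # empty matches everything
    continue_: bool = False
    routes: list["Route"] = field(default_factory=list)


def resolve(route: Route, labels: dict[str, str]) -> list[str]:
    """Receivers that would be notified for an alert with these labels."""
    if any(labels.get(key) != value for key, value in route.match.items()):
        return []
    receivers: list[str] = []
    for child in route.routes:
        matched = resolve(child, labels)
        receivers.extend(matched)
        if matched and not child.continue_:
            break  # first match wins unless the child asked to continue
    # The route's own receiver is the fallback when no child matched.
    return receivers or [route.receiver]


# The tree from the configuration in this article.
root = Route(
    receiver="default-receiver",
    routes=[
        Route("critical-receiver", {"severity": "critical"}, continue_=True),
        Route("warning-receiver", {"severity": "warning"}),
    ],
)
```

With this tree, `continue: true` on the critical route changes nothing, because a critical alert cannot also match `severity=warning`; it only matters once a later sibling route could match the same alert.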
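Alert Status
# List active alerts
amtool alert query --alertmanager.url=http://localhost:9093
# List alert groups (through the v2 HTTP API)
curl -s http://localhost:9093/api/v2/alerts/groups
# Silence an alert for two hours (a comment is required by default)
amtool silence add alertname=HighErrorRate --duration=2h --comment="investigating" \
  --alertmanager.url=http://localhost:9093
# List silences
amtool silence query --alertmanager.url=http://localhost:9093
# Validate a configuration file offline
amtool check-config alertmanager.yml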
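Alerting Best Practices
Alert Severity Levels
# Severity policy, shown as rule fragments (expr omitted for brevity).
# Alertmanager routes must match on the same severity values used here.
# P1 - critical (respond immediately)
#   - service unavailable
#   - risk of data loss
#   - security incident
- alert: ServiceDown
  labels:
    severity: critical
  annotations:
    runbook: https://wiki.example.com/runbooks/service-down
# P2 - high (respond within 30 minutes)
#   - severe performance degradation
#   - elevated error rate
- alert: HighErrorRate
  labels:
    severity: high
# P3 - medium (respond within 2 hours)
#   - high resource usage
#   - failures in non-critical features
- alert: HighMemoryUsage
  labels:
    severity: medium
# P4 - low (handle as planned work)
#   - warning thresholds
#   - capacity-planning reminders
- alert: DiskSpaceWarning
  labels:
    severity: low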
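Managing Alert Fatigue
# Grouping
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s        # wait so alerts in the same group go out in one notification
  group_interval: 5m     # minimum time before notifying about new alerts in an existing group
  repeat_interval: 12h   # how often a still-firing alert is re-sent
# Inhibition
inhibit_rules:
  # While an instance is down, mute the other instance-level alerts for the same service
  - source_match:
      alertname: 'InstanceDown'
    target_match_re:
      alertname: '.*Instance.*'
    equal: ['cluster', 'service']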
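To see what the inhibition rule above actually mutes, here is a sketch of the matching logic as I understand it from the Alertmanager documentation: a target alert is muted when some firing alert matches the source side and both carry the same values for every label in `equal`. Alerts that match both sides of a rule do not inhibit each other, which is what stops `InstanceDown` from muting itself. `is_inhibited` and the dictionary layout of `rule` are assumptions made for illustration.

```python
import re

Alert = dict[str, str]


def matches_exact(alert: Alert, matchers: dict[str, str]) -> bool:
    return all(alert.get(key, "") == value for key, value in matchers.items())


def matches_regex(alert: Alert, matchers: dict[str, str]) -> bool:
    # Alertmanager regex matchers are anchored, hence fullmatch.
    return all(re.fullmatch(pattern, alert.get(key, "")) for key, pattern in matchers.items())


def is_inhibited(target: Alert, firing: list[Alert], rule: dict) -> bool:
    """True when a firing source alert mutes `target` under `rule`."""
    if not matches_regex(target, rule["target_match_re"]):
        return False
    target_is_also_source = matches_exact(target, rule["source_match"])
    for source in firing:
        if not matches_exact(source, rule["source_match"]):
            continue
        if target_is_also_source and matches_regex(source, rule["target_match_re"]):
            continue  # alerts matching both sides do not inhibit each other
        # A missing label is treated the same as an empty one.
        if all(source.get(label, "") == target.get(label, "") for label in rule["equal"]):
            return True
    return False


RULE = {
    "source_match": {"alertname": "InstanceDown"},
    "target_match_re": {"alertname": ".*Instance.*"},
    "equal": ["cluster", "service"],
}
```

The `equal` list is the safety catch here: without it, one instance going down would silence instance alerts for every service in every cluster.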
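FAQ and Common Pitfalls
Q1: Prometheus is losing data?
# Troubleshooting steps (pod names such as prometheus-0 are placeholders; list the real ones with kubectl get pods)
# 1. Inspect the TSDB directory
kubectl exec -it prometheus-0 -n monitoring -- ls -la /prometheus
# 2. Check the Prometheus logs
kubectl logs prometheus-0 -n monitoring | grep -i "tsdb"
# 3. Increase the storage size (--reuse-values keeps settings such as retention from the original install)
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi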
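Q2: Alerts are not firing?
# Check that the rule objects exist
kubectl get prometheusrule -n monitoring
# Look for rule errors in the Prometheus logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus | grep -i "rule"
# Validate the rule files (run through sh so the glob expands inside the container)
kubectl exec -it prometheus-0 -n monitoring -- sh -c 'promtool check rules /etc/prometheus/rules/*.yml'
# Check which Alertmanagers Prometheus is connected to
kubectl exec -it prometheus-0 -n monitoring -- wget -qO- http://localhost:9090/api/v1/alertmanagers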
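Q3: A Grafana dashboard shows no data?
# 1. Check the data source configuration
kubectl get configmap -n monitoring grafana-datasources
# 2. Run the panel's query directly in the Prometheus UI
# http://prometheus:9090
# 3. Check that the ServiceMonitor carries the release label Prometheus selects on
kubectl get servicemonitor -A -o yaml | grep release
# 4. Restart Grafana so it reloads its provisioned configuration
kubectl rollout restart deployment prometheus-grafana -n monitoring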
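Q4: How can alert noise be reduced?
# 1. Set a "for" duration on the rule
- alert: HighErrorRate
  expr: rate(http_errors_total[5m]) > 0.05
  for: 5m                # fire only after the condition has held for 5 minutes
# 2. Set a repeat interval on the route
route:
  repeat_interval: 12h   # re-send a still-firing alert at most every 12 hours
  # 3. Group related alerts into one notification
  group_by: ['alertname', 'cluster', 'service']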
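Grouping is the setting with the largest effect on notification volume, so here is a sketch of what `group_by` does: alerts that share the same values for the listed labels land in one group, and each group produces one notification. `group_alerts` is a hypothetical function that ignores timing (`group_wait`, `group_interval`) and treats a missing label as an empty value.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict[str, str]], group_by: list[str]) -> dict[tuple[str, ...], list[dict[str, str]]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: dict[tuple[str, ...], list[dict[str, str]]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "HighErrorRate", "cluster": "prod", "service": "api", "pod": "api-1"},
    {"alertname": "HighErrorRate", "cluster": "prod", "service": "api", "pod": "api-2"},
    {"alertname": "HighErrorRate", "cluster": "prod", "service": "web", "pod": "web-1"},
]
grouped = group_alerts(alerts, ["alertname", "cluster", "service"])
```

Here three firing alerts collapse into two notifications. Dropping `service` from `group_by` would collapse them into one, at the cost of a less specific message.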
Summary
┌─────────────────────────────────────────────────────────────────┐
│                          Key Takeaways                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Three pillars of observability                                 │
│  ├── Metrics: what happened (Prometheus)                        │
│  ├── Logs: why it happened (Loki/ELK)                           │
│  └── Traces: how it happened (Jaeger)                           │
│                                                                 │
│  Prometheus                                                     │
│  ├── ServiceMonitor/PodMonitor auto-discovery                   │
│  ├── PromQL, a powerful query language                          │
│  └── Alerting rule definitions                                  │
│                                                                 │
│  Grafana                                                        │
│  ├── Multiple data sources                                      │
│  ├── Rich visualization types                                   │
│  └── Reusable dashboard templates                               │
│                                                                 │
│  Loki                                                           │
│  ├── LogQL query language                                       │
│  ├── Grafana integration                                        │
│  └── Resource-efficient                                         │
│                                                                 │
│  Alertmanager                                                   │
│  ├── Severity-based alert routing                               │
│  ├── Inhibition and silencing                                   │
│  └── Multi-channel notifications                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
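Questions for Reflection
- How would you design an alerting strategy that avoids alert fatigue?
- What are the strengths and limitations of each of the three pillars of observability?
- How would you tune Prometheus performance in a large cluster?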
References
- Prometheus Documentation
- Grafana Documentation
- Loki Documentation
- SRE Book - Monitoring Distributed Systems
Coming Up Next
The next article covers security best practices, including:
- RBAC access control
- Pod Security Policy / PSA
- NetworkPolicy network isolation
- Image security and scanning
Stay tuned!