Observability Stack: Prometheus + Grafana + Loki

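Overview

Observability is a core capability for operating cloud-native applications. It rests on three pillars: metrics, logs, and traces. This article walks through a stack built on Prometheus, Grafana, and Loki, with AlertManager handling notifications.

Learning objectives:

Understand the three pillars of observability and the related SRE practices
Collect and query metrics with Prometheus
Configure Grafana dashboards
Collect and query logs with Loki
Configure alerting with AlertManager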

Observability Overview

The Three Pillars

┌─────────────────────────────────────────────────────────────────┐
│               The Three Pillars of Observability                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │     Metrics     │  │      Logs       │  │     Traces      │  │
│  ├─────────────────┤  ├─────────────────┤  ├─────────────────┤  │
│  │ Prometheus      │  │ Loki / ELK      │  │ Jaeger / Zipkin │  │
│  │ Graphite        │  │ Fluentd         │  │ OpenTelemetry   │  │
│  │ InfluxDB        │  │ Filebeat        │  │                 │  │
│  │                 │  │                 │  │                 │  │
│  │ What happened   │  │ Why it happened │  │ How it happened │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                  Observability Platform                   │  │
│  │                                                           │  │
│  │    Prometheus + Grafana + Loki + Jaeger + AlertManager    │  │
│  │                                                           │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

SRE Observability Practices

┌─────────────────────────────────────────────────────────────────┐
│                   SRE Observability Practices                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Golden Signals:                                               │
│                                                                 │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │   Latency   │ │   Traffic   │ │   Errors    │ │ Saturation  │ │
│ ├─────────────┤ ├─────────────┤ ├─────────────┤ ├─────────────┤ │
│ │ P50 / P99   │ │ QPS         │ │ Error rate  │ │ CPU/memory  │ │
│ │ Distribution│ │ RPS         │ │ 5xx         │ │ Queue depth │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│                                                                 │
│   USE method (resources):                                       │
│   - Utilization: how busy the resource is                       │
│   - Saturation: queued work waiting for it                      │
│   - Errors: count of error events                               │
│                                                                 │
│   RED method (services):                                        │
│   - Rate: requests per second                                   │
│   - Errors: share of failed requests                            │
│   - Duration: request latency                                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
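
To make the RED method concrete, here is a minimal Python sketch that derives rate, error ratio, and P99 duration from raw request records. The `Request` type, the `red_summary` helper, and the sample data are all hypothetical and only illustrate the arithmetic; in practice Prometheus computes these values from counters and histograms.

```python
from dataclasses import dataclass
import math


@dataclass
class Request:
    timestamp: float   # seconds since the start of the window
    status: int        # HTTP status code
    duration: float    # request duration in seconds


def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration (RED) for one service over a window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration for r in requests)
    # Nearest-rank P99: smallest value with at least 99% of samples at or below it.
    p99 = durations[math.ceil(0.99 * total) - 1] if total else 0.0
    return {
        "rate_rps": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p99_seconds": p99,
    }


# Hypothetical sample: 100 requests in 60 seconds, every 20th one fails.
sample = [
    Request(i, 500 if i % 20 == 0 else 200, 0.05 + 0.01 * (i % 10))
    for i in range(100)
]
summary = red_summary(sample, window_seconds=60)
print(summary)
```

The three dictionary keys map one-to-one onto Rate, Errors, and Duration in the diagram above.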

Prometheus

Prometheus Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Prometheus Architecture                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    Prometheus Server                    │   │
│   │                                                         │   │
│   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │   │
│   │  │   Scrape    │  │    TSDB     │  │    HTTP     │      │   │
│   │  │   Manager   │  │  (storage)  │  │   Server    │      │   │
│   │  └──────┬──────┘  └─────────────┘  └─────────────┘      │   │
│   │         │                                               │   │
│   └─────────┼───────────────────────────────────────────────┘   │
│             │  scrape (HTTP pull)                               │
│             ▼                                                   │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                         Targets                         │   │
│   │                                                         │   │
│   │   ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐   │   │
│   │   │ K8s Pod  │ │   Node   │ │   App    │ │ Service  │   │   │
│   │   │(cAdvisor)│ │ Exporter │ │ Metrics  │ │ Endpoint │   │   │
│   │   └──────────┘ └──────────┘ └──────────┘ └──────────┘   │   │
│   │                                                         │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│   Components:                                                   │
│   - Scrape: pulls metrics from targets                          │
│   - TSDB: time-series database storage                          │
│   - PromQL: query language                                      │
│   - AlertManager: alert routing and notification                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
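
When the Scrape Manager pulls from a target, the target answers with plain text in the Prometheus exposition format. The sketch below renders one counter family in that format. `render_metrics` is a hypothetical helper written for illustration, and it skips label-value escaping, which a real client library handles for you.

```python
def render_metrics(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Label values are assumed to need no escaping in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter values for a service called myapp.
body = render_metrics(
    "http_requests_total",
    "Total HTTP requests.",
    {
        (("service", "myapp"), ("status", "200")): 1027,
        (("service", "myapp"), ("status", "500")): 3,
    },
)
print(body)
```

Each sample line has the shape `metric{label="value",...} number`, which is what a `/metrics` endpoint returns on every scrape.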

Installing Prometheus

# Option 1: install kube-prometheus-stack with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

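# Option 2: install only the Prometheus Operator (CRDs and operator, no Prometheus instance)
# `kubectl create` is used because the bundled CRDs can exceed the annotation size limit of client-side `kubectl apply`
kubectl create -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.73.0/bundle.yaml

# Access the Prometheus UI at http://localhost:9090 (Option 1 install)
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090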

ServiceMonitor

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  labels:
    release: prometheus       # must match the Prometheus serviceMonitorSelector (the Helm release name by default)
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 15s            # scrape interval
    path: /metrics
    scheme: http             # plain HTTP, so no tlsConfig is needed
  namespaceSelector:
    matchNames:
    - production
  targetLabels:
  - app
  - environment

PodMonitor

# podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: myapp-pod
spec:
  selector:
    matchLabels:
      app: myapp
  podMetricsEndpoints:
  - port: metrics
    path: /metrics
    interval: 15s
  namespaceSelector:
    matchNames:
    - production

Prometheus Configuration

# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - alertmanager:9093

    rule_files:
    - /etc/prometheus/rules/*.yml

    scrape_configs:
    # Kubernetes API Server
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
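      relabel_configs:
      # Keep only the https port of the default/kubernetes service
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    # Kubernetes Nodes
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      scheme: https            # the kubelet serves metrics over HTTPS
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)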

    # Kubernetes Pods
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
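
The last relabel rule above is the least obvious one, so here is a minimal Python sketch of what a `replace` action does. Prometheus joins the source labels with `;`, matches the fully anchored regex, and writes the expanded replacement into the target label. `relabel_replace` is a hypothetical helper, and Python's `re` module stands in for the RE2 engine Prometheus uses.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Apply a Prometheus-style `replace` relabel rule to one target's labels."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)    # the regex is anchored at both ends
    if match is None:
        return labels                      # no match: labels are left unchanged
    # Translate $1, $2, ... into Python's \g<1>, \g<2>, ... and expand.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


PORT_ANNOTATION = "__meta_kubernetes_pod_annotation_prometheus_io_port"
rule = dict(
    source_labels=["__address__", PORT_ANNOTATION],
    regex=r"([^:]+)(?::\d+)?;(\d+)",
    replacement="$1:$2",
    target_label="__address__",
)

# Hypothetical pod: discovered on port 8080, annotated to expose metrics on 9102.
pod = {"__address__": "10.0.0.5:8080", PORT_ANNOTATION: "9102"}
rewritten = relabel_replace(pod, **rule)
print(rewritten["__address__"])
```

If a pod has no port annotation, the joined string ends in a bare `;`, the regex does not match, and the discovered address is kept as-is.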

PromQL Queries

Basic Syntax

# Basic queries
# Scrape health of every target (1 = up, 0 = down)
up

# Select a metric by name
container_cpu_usage_seconds_total

# Filter by labels
container_cpu_usage_seconds_total{pod="myapp-12345", namespace="production"}

# Aggregation
sum(rate(container_cpu_usage_seconds_total[5m]))

# Quantile: P99 latency estimated from histogram buckets
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
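
Most of the queries in this article wrap a counter in `rate(...[5m])`. As a rough mental model, the Python sketch below averages the per-second increase of a counter over a window and treats any drop in value as a counter reset. `simple_rate` is a hypothetical name, and the sketch deliberately leaves out the extrapolation to the window edges that the real `rate()` performs.

```python
def simple_rate(samples):
    """Average per-second increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset, as Prometheus does.
    Unlike the real rate(), this sketch does not extrapolate to the window edges.
    """
    if len(samples) < 2:
        return None  # rate() also needs at least two samples in the window
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarted from 0, so the new value is the increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrapes every 15s; the process restarts between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 30), (60, 60)]
per_second = simple_rate(scrapes)
print(per_second)
```

This is why `rate()` should only be applied to counters: a gauge that goes down would be misread as a reset.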

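Common Query Templates

# Golden-signal queries

# Latency - P99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Latency - P50
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# QPS
sum(rate(http_requests_total[5m])) by (service)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)

# CPU usage as a percentage of the CPU limit
# (container!="" skips the pod-level cgroup series, which would double-count)
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod) /
sum(container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""}) by (pod) * 100

# Memory usage as a percentage of the memory limit
# (the "> 0" filter drops containers without a limit, which would divide by zero)
container_memory_working_set_bytes / (container_spec_memory_limit_bytes > 0) * 100

# Pod count per namespace
count(kube_pod_info) by (namespace)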
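
The latency queries above rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from raw observations. The Python sketch below reproduces the core idea: find the bucket that contains the target rank, then interpolate linearly inside it. The bucket values are made up, and the function is a simplified illustration for 0 < q <= 1, not the exact Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets [(upper_bound, count), ...].

    Buckets must be sorted by upper bound and end with math.inf, mirroring
    the `le` label of a Prometheus histogram. Intended for 0 < q <= 1.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical request-duration histogram (seconds): 100 observations in total.
latency_buckets = [(0.1, 40), (0.5, 90), (1.0, 99), (math.inf, 100)]
p50 = histogram_quantile(0.50, latency_buckets)
p99 = histogram_quantile(0.99, latency_buckets)
print(p50, p99)
```

Because of the interpolation, the result can never be more precise than the bucket boundaries, so bucket layout should bracket the latencies you alert on.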

Alerting Rules

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  labels:
    app: myapp
    release: prometheus
spec:
  groups:
  - name: myapp.rules
    rules:
    # Basic alert: error rate
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
        sum(rate(http_requests_total[5m])) by (service) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate on {{ $labels.service }}"
        description: "Error rate is {{ $value | humanizePercentage }}"

    # Service latency
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High latency on {{ $labels.service }}"
        description: "P99 latency is {{ $value }}s"

    # Pod restarts
    - alert: PodRestartingTooMuch
      expr: |
        increase(kube_pod_container_status_restarts_total[1h]) > 5
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting too much"

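    # Memory usage
    # ("> 0" skips containers without a memory limit, which would otherwise divide by zero and always fire)
    - alert: HighMemoryUsage
      expr: |
        (container_memory_working_set_bytes / (container_spec_memory_limit_bytes > 0)) > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Memory usage is high on {{ $labels.namespace }}/{{ $labels.pod }}"
        description: "Memory usage is {{ $value | humanizePercentage }}"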
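
The `for` field in the rules above decides when an alert moves from pending to firing. The Python sketch below walks through that state machine for a series of rule evaluations. The function name and the sample timeline are hypothetical, and the model ignores details such as `keep_firing_for`.

```python
def alert_states(condition_by_eval, eval_interval, for_duration):
    """Return inactive / pending / firing for each rule evaluation.

    `condition_by_eval` holds one boolean per evaluation: whether `expr`
    returned a result at that point. Times are in seconds.
    """
    active_since = None
    states = []
    for step, condition in enumerate(condition_by_eval):
        now = step * eval_interval
        if not condition:
            active_since = None          # any gap resets the `for` timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_duration else "pending")
    return states


# Hypothetical timeline: evaluated every 60s with `for: 5m`.
# The condition holds for 3 evaluations, clears once, then holds for 7 more.
timeline = [True] * 3 + [False] + [True] * 7
states = alert_states(timeline, eval_interval=60, for_duration=300)
print(states)
```

The single clear evaluation in the middle restarts the timer, which is how `for` filters out short spikes.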

Grafana

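Installation and Configuration

# Access Grafana at http://localhost:3000
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Default credentials with the Helm install: admin / prom-operator
# Read the actual password from the secret
kubectl get secret prometheus-grafana -n monitoring -o jsonpath='{.data.admin-password}' | base64 -d

Data Source Configuration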

# grafana-datasource.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      access: proxy
      url: http://prometheus-operated:9090
      isDefault: true
      uid: prometheus
      editable: false
    - name: Loki
      type: loki
      access: proxy
      url: http://loki:3100    # Loki service name when the Helm release is named "loki"
      editable: false

Dashboard Configuration

# grafana-dashboard.yaml
# The dashboard sidecar loads this JSON as a dashboard model directly,
# so it is not wrapped in the {"dashboard": ...} envelope used by the HTTP API.
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  myapp-dashboard.json: |
    {
      "title": "MyApp Overview",
      "uid": "myapp-overview",
      "panels": [
        {
          "title": "Request Rate",
          "type": "timeseries",
          "targets": [
            {
              "expr": "sum(rate(http_requests_total[5m])) by (service)",
              "legendFormat": "{{service}}"
            }
          ]
        },
        {
          "title": "Error Rate",
          "type": "stat",
          "targets": [
            {
              "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
            }
          ]
        },
        {
          "title": "Latency P99",
          "type": "gauge",
          "targets": [
            {
              "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
            }
          ]
        }
      ]
    }

Importing Common Dashboards

# Import community dashboards by ID
# Kubernetes cluster monitoring
# ID: 10856

# Kubernetes pod monitoring
# ID: 10518

# Node Exporter Full
# ID: 1860

# Prometheus Overview
# ID: 3662

Loki Log Collection

Loki Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Loki Architecture                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                         Grafana                         │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                │                                │
│                                ▼                                │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                       Loki Server                       │   │
│   │                                                         │   │
│   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │   │
│   │  │ Distributor │  │  Ingester   │  │   Querier   │      │   │
│   │  │  (receive)  │  │   (store)   │  │   (query)   │      │   │
│   │  └─────────────┘  └──────┬──────┘  └──────┬──────┘      │   │
│   │                          │                │             │   │
│   │                          └────────┬───────┘             │   │
│   │                                   ▼                     │   │
│   │                          ┌──────────────────┐           │   │
│   │                          │  Object Storage  │           │   │
│   │                          │   (S3 / MinIO)   │           │   │
│   │                          └──────────────────┘           │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                ▲                                │
│                                │                                │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                   Promtail / Fluentd                    │   │
│   │         (collects logs and ships them to Loki)          │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

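Installing Loki

# Install the Loki stack (Loki + Promtail) with Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Grafana and Prometheus are disabled here because kube-prometheus-stack already provides them;
# set them to true only if this chart is the sole monitoring install
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.enabled=false \
  --set prometheus.enabled=false \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=50Gi

# Or install Loki on its own
helm install loki grafana/loki \
  --namespace monitoring \
  --create-namespace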

Promtail Configuration

# promtail-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: monitoring
data:
  promtail-config.yaml: |
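    server:
      http_listen_port: 9080
      grpc_listen_port: 0

    positions:
      filename: /tmp/positions.yaml   # read offsets; lost on pod restart unless this path is persisted

    clients:
    - url: http://loki:3100/loki/api/v1/push

    scrape_configs:
    # Kubernetes pod logs
    - job_name: kubernetes
      kubernetes_sd_configs:
      - role: pod
      pipeline_stages:
      - cri: {}                       # parse the CRI log format written by the container runtime
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      # Assumes pods carry an `app` label; it becomes the `service` label used in the LogQL queries
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: service
      - action: replace
        target_label: job
        replacement: kubernetes-pods
      # Tell Promtail which log files on the node belong to each discovered container
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log

    # Static file logs
    # Note: this glob also matches the pod logs collected above; keep only one of the two jobs for pod logs
    - job_name: app-logs
      static_configs:
      - targets:
        - localhost
        labels:
          job: app-logs
          __path__: /var/log/pods/*/*/*.log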

LogQL Queries

# Basic query
{service="myapp"}

# Filter by labels
{service="myapp", namespace="production"}

# Search log content
{service="myapp"} |= "ERROR"

# Exclude a keyword
{service="myapp"} != "DEBUG"

# Regex match (backticks avoid having to escape the backslash)
{service="myapp"} |~ `user_id=\d+`

# Aggregation
sum(count_over_time({service="myapp"}[5m])) by (namespace)

# Per-second rate of error log lines
rate({service="myapp"} |= "ERROR"[5m])
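
To show what the line filters and range functions above do, here is a small Python sketch that mimics `|=`, `!=`, `|~`, and `count_over_time` on in-memory log lines. The helper names and log lines are hypothetical. Loki applies the same logic to the streams selected by the label matcher.

```python
import re


def line_filter(lines, contains=None, not_contains=None, regex=None):
    """Mimic the LogQL line filters |=, != and |~ on (timestamp, line) pairs."""
    pattern = re.compile(regex) if regex else None
    kept = []
    for ts, line in lines:
        if contains is not None and contains not in line:
            continue
        if not_contains is not None and not_contains in line:
            continue
        if pattern is not None and not pattern.search(line):
            continue
        kept.append((ts, line))
    return kept


def count_over_time(lines, now, window):
    """Count log lines whose timestamp falls in (now - window, now]."""
    return sum(1 for ts, _ in lines if now - window < ts <= now)


# Hypothetical log stream for {service="myapp"}; timestamps are in seconds.
logs = [
    (10, "INFO user_id=42 login ok"),
    (70, "ERROR user_id=42 payment failed"),
    (130, "DEBUG cache miss"),
    (250, "ERROR user_id=7 timeout"),
    (400, "ERROR db unreachable"),
]

errors = line_filter(logs, contains="ERROR")
error_count = count_over_time(errors, now=300, window=300)   # like count_over_time(... |= "ERROR" [5m])
error_rate = error_count / 300                               # like rate(... |= "ERROR" [5m])
print(error_count, error_rate)
```

`rate` is simply `count_over_time` divided by the window length in seconds, which is why the two queries differ only by a constant factor.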

Dashboard Example

# logs-dashboard.json
{
  "dashboard": {
    "title": "Application Logs",
    "panels": [
      {
        "title": "Error Logs",
        "type": "logs",
        "targets": [
          {
            "expr": "{service=\"myapp\"} |= \"ERROR\"",
            "legendFormat": "{{pod}}"
          }
        ],
        "options": {
          "showLabels": true,
          "showCommonLabels": true,
          "showTime": true
        }
      },
      {
        "title": "Log Volume by Level",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum by (level) (count_over_time({service=\"myapp\"}[5m]))"
          }
        ]
      }
    ]
  }
}

AlertManager Alerting

AlertManager Configuration

# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      # email_configs needs SMTP settings; the two values below are placeholders
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'

    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'default-receiver'
      routes:
      - matchers:
        - severity="critical"
        receiver: 'critical-receiver'
        continue: true          # keep evaluating the sibling routes below
      - matchers:
        - severity="warning"
        receiver: 'warning-receiver'

    receivers:
    - name: 'default-receiver'
      webhook_configs:
      - url: 'http://alert-webhook:5000/alerts'

    - name: 'critical-receiver'
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#critical-alerts'
        send_resolved: true
        title: 'Critical Alert: {{ .GroupLabels.alertname }}'
        # A notification covers a group of alerts, so per-alert fields are read inside range
        text: |
          *Alert:* {{ .GroupLabels.alertname }}
          *Severity:* {{ .CommonLabels.severity }}
          {{ range .Alerts }}*Description:* {{ .Annotations.description }}
          {{ end }}

    - name: 'warning-receiver'
      email_configs:
      - to: 'oncall@example.com'
        send_resolved: true

    inhibit_rules:
    - source_matchers: [severity="critical"]
      target_matchers: [severity="warning"]
      equal: ['alertname', 'cluster', 'service']

Alert Routing

┌─────────────────────────────────────────────────────────────────┐
│                      AlertManager Routing                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   root route (receiver: default-receiver)                       │
│       │                                                         │
│       ├── severity=critical ──▶ critical-receiver (Slack)       │
│       │     continue: true, next sibling is evaluated too       │
│       │                                                         │
│       ├── severity=warning ───▶ warning-receiver (Email)        │
│       │                                                         │
│       └── (no child matched) ─▶ default-receiver (Webhook)      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
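
The routing semantics are easy to misread, so here is a minimal Python sketch of how a route tree is walked. Child routes are tried in order, a match stops the search unless `continue: true` is set, and the parent's receiver is used only when no child matches. The `route_alert` helper and the dictionary layout are hypothetical, and only exact-match label conditions are modeled.

```python
def route_alert(labels, route):
    """Return the receivers an alert is delivered to, walking the route tree."""
    receivers = []
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child["match"].items()):
            receivers.extend(route_alert(labels, child))
            if not child.get("continue", False):
                break        # stop at the first match unless `continue: true`
    # No child matched: the alert is handled by this node's own receiver.
    return receivers or [route["receiver"]]


# Mirrors the route section of the AlertManager configuration in this article.
root = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-receiver", "continue": True},
        {"match": {"severity": "warning"}, "receiver": "warning-receiver"},
    ],
}

for severity in ("critical", "warning", "info"):
    print(severity, route_alert({"severity": severity}, root))
```

In this configuration, `continue: true` on the critical route only means the warning route is also checked. A critical alert does not additionally go to `default-receiver`.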

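Alert Status

# List active alerts
amtool alert query --alertmanager.url=http://localhost:9093

# List alert groups (via the AlertManager HTTP API)
curl -s http://localhost:9093/api/v2/alerts/groups

# Silence an alert for two hours
amtool silence add alertname=HighErrorRate --comment="planned maintenance" --duration=2h --alertmanager.url=http://localhost:9093

# List silences
amtool silence query --alertmanager.url=http://localhost:9093

# Validate a local configuration file
amtool check-config alertmanager.yml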

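Alerting Best Practices

Alert Severity Levels

# Severity strategy (rule outline; `expr` is omitted for brevity)

# P1 - Critical (respond immediately)
# - Service unavailable
# - Risk of data loss
# - Security incident
- alert: ServiceDown
  labels:
    severity: critical
  annotations:
    runbook: https://wiki.example.com/runbooks/service-down

# P2 - High (respond within 30 minutes)
# - Severe performance degradation
# - Rising error rate
- alert: HighErrorRate
  labels:
    severity: high

# P3 - Medium (respond within 2 hours)
# - High resource usage
# - Non-critical feature failures
- alert: HighMemoryUsage
  labels:
    severity: medium

# P4 - Low (handle as planned work)
# - Warning thresholds
# - Capacity-planning reminders
- alert: DiskSpaceWarning
  labels:
    severity: low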

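Managing Alert Fatigue

# Grouping and inhibition
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s           # wait so the first alerts of a new group are sent together
  group_interval: 5m        # minimum wait before notifying about new alerts in an existing group
  repeat_interval: 12h      # re-send a notification for still-firing alerts after this long

# Inhibition rules
inhibit_rules:
  # When an instance is down, mute the other instance-level alerts of the same cluster and service
  - source_matchers: [alertname="InstanceDown"]
    target_matchers: [alertname=~".*Instance.*"]
    equal: ['cluster', 'service']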
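
Inhibition can be pictured as a check that runs before each notification: a target alert is muted when some firing source alert matches the rule and agrees with it on every label listed in `equal`. The Python sketch below models this with the critical-over-warning rule from the AlertManager configuration earlier. `is_inhibited` is a hypothetical helper, and only exact-match conditions are handled, not regex matchers.

```python
def is_inhibited(target, firing_alerts, rule):
    """Check whether `target` is muted by any firing alert under one inhibit rule."""
    if not all(target.get(k) == v for k, v in rule["target_match"].items()):
        return False
    for source in firing_alerts:
        if source is target:
            continue
        if not all(source.get(k) == v for k, v in rule["source_match"].items()):
            continue
        # Source and target must agree on every label listed in `equal`
        # (a missing label and an empty label count as equal).
        if all(source.get(k, "") == target.get(k, "") for k in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "cluster", "service"],
}

# Hypothetical firing alerts.
critical = {"alertname": "HighErrorRate", "severity": "critical", "cluster": "prod", "service": "myapp"}
same_svc = {"alertname": "HighErrorRate", "severity": "warning", "cluster": "prod", "service": "myapp"}
other_svc = {"alertname": "HighErrorRate", "severity": "warning", "cluster": "prod", "service": "billing"}
firing = [critical, same_svc, other_svc]

print(is_inhibited(same_svc, firing, rule))
print(is_inhibited(other_svc, firing, rule))
```

The `equal` list is what keeps inhibition scoped: the warning for `billing` still notifies because no critical alert shares its `service` label.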

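FAQ and Pitfalls

Q1: Prometheus data is missing?

# Troubleshooting steps
# Find the Prometheus pod name first, then substitute it for <prometheus-pod> below
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus

# 1. Inspect the TSDB directory (mounted at /prometheus)
kubectl exec -it <prometheus-pod> -n monitoring -- ls -la /prometheus

# 2. Check the Prometheus logs
kubectl logs <prometheus-pod> -n monitoring | grep -i "tsdb"

# 3. Increase storage (--reuse-values keeps the settings from the original install)
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi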

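Q2: Alerts are not firing?

# Check that the rules are loaded
kubectl get prometheusrule -n monitoring

# Check the Prometheus logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus -c prometheus | grep -i "rule"

# Validate the rule files (sh -c makes the glob expand inside the container)
# Replace <prometheus-pod> and <alertmanager-pod> with names from `kubectl get pods -n monitoring`
kubectl exec -it <prometheus-pod> -n monitoring -- sh -c 'promtool check rules /etc/prometheus/rules/*/*.yaml'

# Show the configuration AlertManager is running with
kubectl exec -it <alertmanager-pod> -n monitoring -- amtool config show --alertmanager.url=http://localhost:9093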

Q3: Grafana dashboards show no data?

# 1. Check the data source
kubectl get configmap -n monitoring grafana-datasources

# 2. Test the query in Prometheus
# Open the Prometheus UI directly: http://prometheus:9090

# 3. Check the ServiceMonitor labels
kubectl get servicemonitor -A -o yaml | grep release

# 4. Restart Grafana so it reloads the provisioned data sources and dashboards
kubectl rollout restart deployment prometheus-grafana -n monitoring

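Q4: How can alert noise be reduced?

# 1. Set `for` (how long the condition must hold)
- alert: HighErrorRate
  expr: rate(http_errors_total[5m]) > 0.05
  for: 5m              # fire only after the condition has held for 5 minutes

# 2. Set the repeat interval
route:
  repeat_interval: 12h  # re-notify about a still-firing alert at most every 12 hours

# 3. Group related alerts into one notification (also set under route)
group_by: ['alertname', 'cluster', 'service']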

Summary

┌─────────────────────────────────────────────────────────────────┐
│                          Key Takeaways                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  The three pillars of observability                             │
│  ├── Metrics: what happened (Prometheus)                        │
│  ├── Logs: why it happened (Loki/ELK)                           │
│  └── Traces: how it happened (Jaeger)                           │
│                                                                 │
│  Prometheus                                                     │
│  ├── Auto-discovery via ServiceMonitor/PodMonitor               │
│  ├── PromQL, a powerful query language                          │
│  └── Alerting rule definitions                                  │
│                                                                 │
│  Grafana                                                        │
│  ├── Multiple data sources                                      │
│  ├── Rich visualization types                                   │
│  └── Reusable dashboard templates                               │
│                                                                 │
│  Loki                                                           │
│  ├── LogQL query language                                       │
│  ├── Grafana integration                                        │
│  └── Resource-efficient                                         │
│                                                                 │
│  AlertManager                                                   │
│  ├── Severity-based alert routing                               │
│  ├── Inhibition and silencing                                   │
│  └── Multi-channel notifications                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

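Review Questions

How would you design an alerting strategy that avoids alert fatigue?
What are the strengths and limitations of each of the three pillars of observability?
How can Prometheus performance be optimized in large clusters?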

Coming Up Next

The next article covers security best practices, including:

RBAC access control
Pod Security Policy / PSA
NetworkPolicy network isolation
Image security and scanning

Stay tuned!