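Performance Tuning and Troubleshooting: An SRE Practice Guide

Overview

Troubleshooting and performance tuning are core SRE skills. This article covers both in the context of Kubernetes.

Learning objectives:

Master a troubleshooting methodology for Kubernetes
Learn to use diagnostic tools to analyze problems
Understand common failure scenarios and their fixes
Master performance optimization techniques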

Troubleshooting Methodology

Troubleshooting with the golden signals

┌─────────────────────────────────────────────────────────────────┐
│                    Troubleshooting flow                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   User reports a problem                                        │
│       │                                                         │
│       ▼                                                         │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                 Symptom analysis                        │   │
│   │                                                         │   │
│   │  1. High latency?   ────▶  Measure latency distribution │   │
│   │  2. Many errors?    ────▶  Check error rate and types   │   │
│   │  3. Unavailable?    ────▶  Verify health status         │   │
│   │  4. Slow responses? ────▶  Find throughput bottlenecks  │   │
│   │                                                         │   │
│   └─────────────────────────────────────────────────────────┘   │
│       │                                                         │
│       ▼                                                         │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                 Root cause analysis                     │   │
│   │                                                         │   │
│   │  Layers to check:                                       │   │
│   │  1. Application (code, logs, configuration)             │   │
│   │  2. Network (Service, Ingress, DNS)                     │   │
│   │  3. Storage (PV, PVC, storage driver)                   │   │
│   │  4. Node (resources, capacity, scheduling)              │   │
│   │  5. Cluster (API Server, etcd)                          │   │
│   │                                                         │   │
│   └─────────────────────────────────────────────────────────┘   │
│       │                                                         │
│       ▼                                                         │
│   Fix and verify                                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Command cheat sheet

# ==================== Basic status checks ====================
# Node status
kubectl get nodes

# Pod status (all namespaces)
kubectl get pods --all-namespaces

# Detailed Pod status
kubectl describe pod <pod-name> -n <namespace>

# Events, sorted by time
kubectl get events --sort-by='.lastTimestamp'

# ==================== Log analysis ====================
# Application logs
kubectl logs <pod-name> -n <namespace>

# Logs from the previous container instance (after a restart)
kubectl logs <pod-name> -n <namespace> --previous

# Stream logs in real time
kubectl logs -f <pod-name> -n <namespace>

# ==================== Network checks ====================
# Service details
kubectl get svc -n <namespace>
kubectl describe svc <svc-name> -n <namespace>

# Endpoints check
kubectl get endpoints <svc-name> -n <namespace>

# ==================== Resource checks ====================
# Node resource usage (requires metrics-server)
kubectl top nodes

# Pod resource usage
kubectl top pods -n <namespace>

# Detailed resource allocation on a node
kubectl describe node <node-name>
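
The status checks above can be condensed into a quick triage filter. The following is a minimal sketch, not part of the original command list: the function name list_unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods --all-namespaces (NAMESPACE, NAME, READY, STATUS, ...).

```shell
#!/usr/bin/env bash
# list_unhealthy_pods (hypothetical helper): reads the output of
# `kubectl get pods --all-namespaces` on stdin and prints
# "namespace/pod ready status" for Pods that need attention.
list_unhealthy_pods() {
  awk 'NR > 1 {
    # Finished Jobs are expected to be not ready, so skip them.
    if ($4 == "Completed") next
    # READY looks like "1/2": ready containers / total containers.
    split($3, ready, "/")
    if ($4 != "Running" || ready[1] != ready[2]) print $1 "/" $2, $3, $4
  }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pods --all-namespaces | list_unhealthy_pods
```

Flagging Running Pods whose READY count is incomplete catches Pods that are up but failing their readiness probe, which a plain STATUS filter would miss.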

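Diagnosing Common Problems

Pod Pending

# Start with the Pod's events
kubectl describe pod <pod-name> -n <namespace> | grep -A 20 "Events:"

# Common cause 1: insufficient resources
kubectl describe nodes
# Compare Allocatable CPU/memory with the "Allocated resources" section

# Common cause 2: nodeSelector does not match any node
kubectl get nodes -l region=us-west
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeSelector}'

# Common cause 3: taints are not tolerated
kubectl describe nodes | grep Taints
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.tolerations}'

# Common cause 4: storage cannot be mounted
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Volumes:"
kubectl get pvc -n <namespace>

Pod CrashLoopBackOff

# Find the exit reason
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Last State:"

# Read the logs of the crashed container
kubectl logs <pod-name> -n <namespace> --previous

# Common cause 1: OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Last State"
# Look for Exit Code 137 and compare usage with the memory limit

# Common cause 2: configuration error
# The application cannot find its configuration file or environment variables

# Common cause 3: a dependency is unreachable
# Database connection failures, Redis timeouts

# Common cause 4: failing health checks
kubectl describe pod <pod-name> -n <namespace> | grep -E -A 5 "Liveness|Readiness"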
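
Exit codes above 128 mean the container was terminated by a signal (exit code minus 128), which is why 137 points at SIGKILL and often at the OOM killer. A minimal sketch of that arithmetic follows; the function name explain_exit_code and the wording of its hints are illustrative assumptions, not kubectl output.

```shell
#!/usr/bin/env bash
# explain_exit_code (hypothetical helper): turn a container exit code
# into a first hint about why the container stopped.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited normally"
  elif [ "$code" -gt 128 ] && [ "$code" -le 255 ]; then
    # Codes above 128 encode "terminated by signal (code - 128)".
    sig=$((code - 128))
    case "$sig" in
      9)  echo "killed by signal 9 (SIGKILL); check for OOMKilled" ;;
      15) echo "killed by signal 15 (SIGTERM); the container was asked to stop" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  else
    echo "application exited with error code $code; check the logs"
  fi
}

# Usage against a live cluster (requires kubectl access):
#   explain_exit_code "$(kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```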

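Service unavailable

# Troubleshooting steps
# 1. Check the Endpoints
kubectl get endpoints <svc-name> -n <namespace>

# 2. Check the Service's Pod selector
kubectl get svc <svc-name> -n <namespace> -o yaml | grep -A 5 selector

# 3. Check that Pods matching the selector are running
kubectl get pods -n <namespace> -l key=value

# 4. Test connectivity from inside the cluster
kubectl run test --rm -it --image=busybox -n <namespace> -- sh
# wget -qO- http://<svc-name>:<port>

# Common cause: the selector does not match the Pod labels
# Expected: app=backend, actual: app=backend-1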
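
A Service selects a Pod only if every key=value pair in its selector appears among the Pod's labels. The check can be sketched as follows; selector_matches is a hypothetical helper that takes both sides as comma-separated key=value lists, the same shape the LABELS column of kubectl get pods --show-labels uses.

```shell
#!/usr/bin/env bash
# selector_matches SELECTOR LABELS (hypothetical helper)
# Returns 0 when every key=value pair of SELECTOR is present in LABELS.
selector_matches() {
  selector=$1
  labels=",$2,"            # wrap in commas so each pair can be matched whole
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                       # this pair is present
      *) IFS=$old_ifs; return 1 ;;          # missing pair: no match
    esac
  done
  IFS=$old_ifs
  return 0
}

# Example: app=backend does not select a Pod labelled app=backend-1.
if selector_matches "app=backend" "app=backend-1,tier=api"; then
  echo "selector matches"
else
  echo "selector does NOT match"
fi
```

Matching on whole pairs matters here: a substring comparison would wrongly treat app=backend as matching app=backend-1.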

Network problems

# 1. Check DNS resolution
kubectl exec -it <pod-name> -- nslookup <service-name>
kubectl exec -it <pod-name> -- cat /etc/resolv.conf

# 2. Check network connectivity (use a Pod IP; a ClusterIP may not answer ICMP)
kubectl exec -it <pod-name> -- ping <target-ip>

# 3. Check the port
kubectl exec -it <pod-name> -- nc -zv <target-ip> <port>

# 4. Inspect iptables rules (on the node; applies to kube-proxy in iptables mode)
iptables -t nat -L -n | grep <service-name>

# 5. Check kube-proxy status
kubectl get pods -n kube-system -l k8s-app=kube-proxy
kubectl logs -n kube-system kube-proxy-<pod> --tail 50

# 6. Use a network diagnostics image
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- sh

Storage problems

# 1. Check PVC status
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

# 2. Check PV status
kubectl get pv
kubectl describe pv <pv-name>

# 3. Check storage classes
kubectl get storageclass

# 4. Look for Pod mount errors (FailedMount and similar appear in the events)
kubectl describe pod <pod-name> -n <namespace> | grep -A 20 "Events:"

# 5. Check the CSI driver
kubectl get pods -n kube-system | grep csi

# Common cause: the storage class does not exist
# Fix: specify the correct storageClassName

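Performance Analysis Methods

Resource bottleneck analysis

# 1. Node resource usage
kubectl top nodes

# Example output:
# NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
# node-1   2000m        50%    4096Mi          60%
# node-2   3000m        75%    6144Mi          90%

# 2. Pod resource usage
kubectl top pods -n <namespace> --sort-by=memory

# 3. Detailed node information
kubectl describe node <node-name> | grep -A 10 "Allocated resources"

# 4. Analyze with Prometheus (PromQL; assumes node_exporter and cAdvisor metrics)
# Node CPU utilization (%)
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# Node memory utilization (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Pod memory usage relative to its limit
container_memory_working_set_bytes / container_spec_memory_limit_bytes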
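
To spot saturated nodes quickly, the kubectl top nodes output can be filtered by a utilization threshold. This is a minimal sketch under stated assumptions: hot_nodes is a hypothetical helper, and it relies on the default column order NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%.

```shell
#!/usr/bin/env bash
# hot_nodes [THRESHOLD] (hypothetical helper): reads `kubectl top nodes`
# output on stdin and prints nodes whose CPU% or MEMORY% is at or above
# THRESHOLD (default 80).
hot_nodes() {
  threshold=${1:-80}
  awk -v t="$threshold" 'NR > 1 {
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)     # strip the percent sign
    if (cpu + 0 >= t + 0 || mem + 0 >= t + 0)
      print $1, "cpu=" cpu "%", "mem=" mem "%"
  }'
}

# Usage against a live cluster (requires metrics-server):
#   kubectl top nodes | hot_nodes 80
```

With the example output shown earlier and a threshold of 80, only node-2 is reported, because its memory utilization is 90%.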

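Application performance analysis

# 1. Check application latency
# Test with curl or hey
kubectl run curl --rm -it --image=curlimages/curl -- \
  curl -s -o /dev/null -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' http://<service>

# 2. Inspect application metrics (Prometheus)
# HTTP request latency (P99)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Database connection pool utilization (metric names depend on the application)
db_connections_active / db_connections_max

# 3. Per-container usage with kubectl top
kubectl top pod -n <namespace> --containers

# 4. Profiling
# Go: pprof (save the heap profile locally, then inspect it)
kubectl exec <pod-name> -- curl -s localhost:6060/debug/pprof/heap > heap.pprof
go tool pprof heap.pprof

# Java: jstat/jstack
kubectl exec -it <pod-name> -- jstack <pid>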

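Network performance analysis

# 1. Bandwidth test
# The image's entrypoint is iperf3, so pass only its arguments.
# --expose creates a Service named "iperf" that the client can resolve.
kubectl run iperf --image=networkstatic/iperf3 --port=5201 --expose -- -s

kubectl run iperf-client --rm -it --image=networkstatic/iperf3 -- \
  -c iperf -t 60

# Clean up the server when done
kubectl delete pod,svc iperf

# 2. DNS latency
kubectl run dnstest --rm -it --image=busybox -- sh
# for i in $(seq 1 10); do time nslookup kubernetes.default; done

# 3. Connection statistics
kubectl exec -it <pod-name> -- ss -s

# 4. Verify network policies
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <policy-name> -n <namespace>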

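Log and Event Analysis

Pod event analysis

# 1. View recent events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -50

# 2. Filter events by resource
kubectl get events --field-selector involvedObject.name=<pod-name>

# 3. View events of a specific type
kubectl get events --field-selector type=Warning

# 4. Common Warning events
# - FailedScheduling: the Pod could not be scheduled
# - FailedMount: a volume could not be mounted
# - FailedCreate: a controller could not create the Pod
# - BackOff: restart back-off after repeated crashes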
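
When a namespace produces many events, counting Warning events by reason shows which of the failure types above dominates. A minimal sketch: count_event_reasons is a hypothetical helper, and it assumes the single-namespace column layout of kubectl get events (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE); with --all-namespaces an extra leading column shifts the fields.

```shell
#!/usr/bin/env bash
# count_event_reasons (hypothetical helper): reads `kubectl get events`
# output on stdin and prints "count reason" for Warning events,
# most frequent first.
count_event_reasons() {
  awk 'NR > 1 && $2 == "Warning" { count[$3]++ }
       END { for (r in count) print count[r], r }' |
    sort -k1,1nr -k2,2
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get events -n <namespace> | count_event_reasons
```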

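Application log analysis

# 1. Stream logs
kubectl logs -f <pod-name> -n <namespace> --tail=100

# 2. Logs from one container of a multi-container Pod
kubectl logs <pod-name> -n <namespace> -c <container-name>

# 3. Filter logs
kubectl logs <pod-name> -n <namespace> | grep "ERROR"

# 4. Log aggregation
# With Loki (a LogQL query, here run through logcli)
logcli query '{app="myapp"} |= "ERROR"'

# With ELK
# Configure filter rules in Logstash

# 5. Structured (JSON) logs: pretty-print JSON lines, pass other lines through
kubectl logs <pod-name> -n <namespace> | jq -R 'fromjson? // .'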

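Cluster component log analysis

# The Pod names below assume a control plane running as static Pods (for example kubeadm)
# 1. API Server logs
kubectl logs -n kube-system kube-apiserver-<node> --tail=100

# 2. Controller Manager logs
kubectl logs -n kube-system kube-controller-manager-<node> --tail=100

# 3. Scheduler logs
kubectl logs -n kube-system kube-scheduler-<node> --tail=100

# 4. etcd logs
kubectl logs -n kube-system etcd-<node> --tail=100

# 5. Component health (componentstatuses is deprecated since v1.19; prefer the API server health endpoint)
kubectl get --raw='/readyz?verbose'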

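Advanced Diagnostic Tools

kubectl debug

# kubectl debug is built into recent kubectl releases; no separate installation is needed

# Attach an ephemeral debug container to a running Pod
kubectl debug pod/myapp -it --image=nicolaka/netshoot

# Copy the Pod into a debug Pod with a shared process namespace
kubectl debug <pod-name> -it --image=nicolaka/netshoot \
  --share-processes --copy-to=debug-pod

# Debug a node (the node's filesystem is mounted at /host)
kubectl debug node/<node-name> -it --image=nicolaka/netshoot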

stern log tool

# Install stern
brew install stern

# Tail logs from all Pods matching a prefix
stern <pod-prefix> -n <namespace>

# Multiple namespaces
stern . -n production,staging

# Filter logs
stern myapp --since=5m
stern myapp --include ERROR
stern myapp --exclude DEBUG

# Highlight keywords
stern myapp --highlight=ERROR

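tmux diagnostics dashboard

# Create a detached session; pane 1 shows node status
tmux new-session -d -s k8s-diag -n nodes 'watch kubectl get nodes'

# Pane 2: Pod status
tmux split-window -h -t k8s-diag:nodes 'watch kubectl get pods --all-namespaces'

# Pane 3: events
tmux split-window -v -t k8s-diag:nodes 'watch kubectl get events --sort-by=.lastTimestamp'

# Second window: resource usage (quote both commands so watch runs them together)
tmux new-window -t k8s-diag: -n resources "watch 'kubectl top nodes; kubectl top pods -A'"

# Attach to the dashboard
tmux attach -t k8s-diag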

Common Failure Cases

Case 1: OOMKilled causes a service outage

┌─────────────────────────────────────────────────────────────────┐
│                    Failure: Pod is OOMKilled                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Symptom: app suddenly unresponsive; recovers after a restart  │
│                                                                 │
│   Investigation:                                                │
│   $ kubectl describe pod myapp-xxx | grep -A 5 "Last State"     │
│                                                                 │
│   Result:                                                       │
│   Last State:                                                   │
│     Terminated:                                                 │
│       Exit Code: 137                                            │
│       Reason: OOMKilled                                         │
│       Message:                                                  │
│         Task in container "app" was terminated because the      │
│         memory limit was exceeded.                              │
│                                                                 │
│   Root cause: the memory limit is lower than the actual demand  │
│                                                                 │
│   Fix:                                                          │
│   1. Raise the memory limit                                     │
│   kubectl set resources deployment myapp -c app \               │
│     --limits=memory=1Gi                                         │
│                                                                 │
│   2. Check for a memory leak                                    │
│   kubectl exec myapp-xxx -- curl -s \                           │
│     localhost:6060/debug/pprof/heap > heap.pprof                │
│                                                                 │
│   3. Long-term improvements                                     │
│   - Use VPA to adjust resources automatically                   │
│   - Set sensible initial requests and limits                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Case 2: Label mismatch makes a Service unreachable

┌─────────────────────────────────────────────────────────────────┐
│                    Failure: Service is unreachable              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Symptom: the frontend cannot reach the backend API            │
│                                                                 │
│   Investigation:                                                │
│   $ kubectl get pods                                            │
│   NAME           READY   STATUS                                 │
│   backend-xxx    1/1     Running                                │
│                                                                 │
│   $ kubectl get svc backend                                     │
│   NAME      TYPE        CLUSTER-IP    PORT(S)                   │
│   backend   ClusterIP   10.96.0.100   8080/TCP                  │
│                                                                 │
│   $ kubectl get endpoints backend                               │
│   NAME      ENDPOINTS                                           │
│   backend   <none>                                              │
│                                                                 │
│   Root cause: empty Endpoints mean the selector matches no Pod  │
│                                                                 │
│   Check:                                                        │
│   $ kubectl get pods --show-labels                              │
│   LABELS: app=backend-new                                       │
│                                                                 │
│   Service selector: app=backend                                 │
│   Pod labels:       app=backend-new                             │
│   → The labels do not match!                                    │
│                                                                 │
│   Fix:                                                          │
│   1. Update the Pod labels or the Service selector              │
│   kubectl label pods -l app=backend-new app=backend --overwrite │
│                                                                 │
│   2. Verify the fix                                             │
│   $ kubectl get endpoints backend                               │
│   ENDPOINTS: 10.0.0.1:8080,10.0.0.2:8080                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Case 3: Storage mount failure leaves a Pod Pending

┌─────────────────────────────────────────────────────────────────┐
│                    Failure: PVC cannot be mounted               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Symptom: the Pod stays in the Pending state                   │
│                                                                 │
│   Investigation:                                                │
│   $ kubectl describe pod myapp-xxx | grep -A 20 "Events:"       │
│                                                                 │
│   Events:                                                       │
│     Type     Reason                Age   From                   │
│     Warning  FailedScheduling      5m    default-scheduler      │
│     Warning  FailedAttachVolume    5m    attachdetach...        │
│                                                                 │
│   $ kubectl describe pvc data-pvc                               │
│                                                                 │
│   Status:  Pending                                              │
│   Reason:  Waiting for first consumer to create the volume      │
│                                                                 │
│   Root cause: StorageClass misconfiguration; delayed binding    │
│   (WaitForFirstConsumer) never completes                        │
│                                                                 │
│   Fix:                                                          │
│   1. Check the StorageClass                                     │
│   $ kubectl get storageclass                                    │
│   $ kubectl describe storageclass fast-storage                  │
│                                                                 │
│   2. Check the CSI driver status                                │
│   $ kubectl get pods -n kube-system | grep csi                  │
│                                                                 │
│   3. Repair or replace the StorageClass                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

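Performance Optimization in Practice

Scheduling optimization

# 1. Pod affinity and anti-affinity (fragment; selector and containers omitted)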
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cache
spec:
  template:
    spec:
      affinity:
        # Prefer co-locating with Pods of the same type (cache locality)
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: cache
              topologyKey: kubernetes.io/hostname

        # Prefer keeping away from latency-sensitive applications
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 50
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: latency-sensitive
              topologyKey: kubernetes.io/hostname

# 2. Node affinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: node-type
            operator: In
            values: ["compute-optimized"]

# 3. Tolerations for taints
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"

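Resource optimization

# 1. Automatic adjustment with VPA
# (avoid combining updateMode "Auto" with an HPA that scales the same workload on CPU or memory)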
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Auto"

---
# 2. Horizontal scaling with HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

---
# 3. Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: myapp
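
To see how the HPA settings above turn into a replica count, the documented scaling rule is desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), clamped to minReplicas and maxReplicas. The following is a simplified sketch of that arithmetic only: hpa_desired_replicas is a hypothetical helper, and it ignores the controller's tolerance band and stabilization window.

```shell
#!/usr/bin/env bash
# hpa_desired_replicas CURRENT UTILIZATION TARGET MIN MAX (hypothetical helper)
# Simplified HPA rule: ceil(CURRENT * UTILIZATION / TARGET), clamped to [MIN, MAX].
hpa_desired_replicas() {
  awk -v cur="$1" -v util="$2" -v target="$3" -v min="$4" -v max="$5" 'BEGIN {
    raw = cur * util / target
    desired = int(raw)
    if (desired < raw) desired++            # round up (ceil)
    if (desired < min + 0) desired = min + 0
    if (desired > max + 0) desired = max + 0
    print desired
  }'
}

# With the manifest above (target 70%, minReplicas 3, maxReplicas 10):
hpa_desired_replicas 3 90 70 3 10    # 3 replicas at 90% CPU -> scale to 4
hpa_desired_replicas 8 140 70 3 10   # 8 replicas at 140% CPU -> capped at 10
```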

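Network optimization

# 1. Use a high-performance CNI
# Cilium (eBPF-based) can offer better network performance

# 2. Tune the kube-proxy mode
# Check the current mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

# Switch to IPVS mode (nodes need the IPVS kernel modules), then restart kube-proxy
kubectl get configmap kube-proxy -n kube-system -o yaml | sed 's/mode: ""/mode: "ipvs"/' | kubectl apply -f -
kubectl rollout restart daemonset kube-proxy -n kube-system

# 3. Tune Service connections
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  sessionAffinity: ClientIP    # session stickiness based on client IP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800    # 3-hour timeout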

Monitoring and Alerting

Key monitoring metrics

# Golden-signal alerts (Prometheus alerting rules)

# 1. Latency
- alert: HighLatency
  expr: |
    histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High P99 latency"

# 2. Error rate
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
    sum(rate(http_requests_total[5m])) by (service) > 0.01
  for: 2m
  labels:
    severity: critical

# 3. Saturation
- alert: HighCPUUsage
  expr: |
    sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod) /
    sum(container_spec_cpu_quota / container_spec_cpu_period) by (namespace, pod) > 0.9
  for: 10m

- alert: HighMemoryUsage
  # Containers without a memory limit report a limit of 0 and are excluded
  expr: |
    container_memory_working_set_bytes / (container_spec_memory_limit_bytes > 0) > 0.9
  for: 10m

# 4. Availability
- alert: PodNotReady
  expr: |
    kube_pod_status_ready{condition="true"} == 0
  for: 5m

Health check configuration

# Application health checks
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
      - name: app
        # Startup probe (allows enough time for the application to start)
        startupProbe:
          httpGet:
            path: /ready
            port: 8080
          failureThreshold: 30
          periodSeconds: 10

        # Liveness probe (checks that the application is still running)
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3

        # Readiness probe (checks that the application can receive traffic)
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 3
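
The probe settings above translate into time budgets. As a rough sketch, the window a probe may keep failing before Kubernetes acts is about failureThreshold times periodSeconds; probe_window_seconds is a hypothetical helper, and the estimate ignores timeoutSeconds and initialDelaySeconds.

```shell
#!/usr/bin/env bash
# probe_window_seconds FAILURE_THRESHOLD PERIOD_SECONDS (hypothetical helper)
# Approximate time a probe may keep failing before Kubernetes acts on it.
probe_window_seconds() {
  echo $(( $1 * $2 ))
}

probe_window_seconds 30 10   # startupProbe: up to 300 s allowed for startup
probe_window_seconds 3 10    # livenessProbe: restart after about 30 s of failures
probe_window_seconds 3 5     # readinessProbe: marked not ready after about 15 s
```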

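Summary

┌─────────────────────────────────────────────────────────────────┐
│                    Key takeaways                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Troubleshooting methodology                                    │
│  ├── Symptom → locate → analyze → fix → verify                  │
│  ├── Golden signals: latency, errors, traffic, saturation       │
│  └── Layered checks: app → network → storage → node → cluster   │
│                                                                 │
│  Common problems                                                │
│  ├── Pod Pending: insufficient resources, scheduling constraints│
│  ├── CrashLoopBackOff: OOM, configuration errors, dependencies  │
│  ├── Service unavailable: empty Endpoints, selector mismatch    │
│  └── Storage: PVC Pending, mount failures                       │
│                                                                 │
│  Performance optimization                                       │
│  ├── Scheduling: affinity, taints and tolerations               │
│  ├── Resources: VPA, HPA, PDB                                   │
│  └── Network: CNI, IPVS, Service configuration                  │
│                                                                 │
│  Tools                                                          │
│  ├── kubectl: basic diagnostics                                 │
│  ├── kubectl debug: advanced debugging                          │
│  ├── stern: log aggregation                                     │
│  └── Prometheus/Grafana: monitoring and visualization           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘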

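Questions for Reflection

How would you design an automated self-healing system?
How do you balance cost and reliability when optimizing performance?
When a complex distributed system fails, how do you locate the root cause quickly?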

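Coming Up Next

The final article in this series takes a deep dive into the Kube-Scheduler source code, covering:

The scheduling framework and plugins
Scheduling algorithms and priorities
A source-level walkthrough of the scheduling flow
Building a custom scheduler in practice

Stay tuned!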