
DevOps and SRE Practices

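Claude Code is a capable assistant for DevOps and SRE workflows: diagnosing infrastructure problems, drafting postmortems, and managing Infrastructure as Code (IaC). This guide introduces the FIRE framework, a systematic approach to infrastructure diagnosis, along with practical patterns for incident response, Kubernetes troubleshooting, and IaC.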


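Infrastructure diagnosis with Claude Code follows four phases:

F - First Response → Tell Claude the symptoms + context
I - Investigate → Claude analyzes logs, metrics, configuration
R - Remediate → Claude proposes a fix (human approval required)
E - Evaluate → Postmortem, documentation, prevention measures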

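First Response

Describe the failure and provide environment context. Claude triages and sets the order of checks.

Investigate

Run the commands and paste the output. Claude analyzes, correlates, and forms hypotheses.

Remediate

Claude proposes a fix and explains its impact. You approve or reject it.

Evaluate

Claude drafts the postmortem, documentation, and prevention recommendations.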

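| Scenario | Jump to |
| --- | --- |
| Handling a production incident | Kubernetes troubleshooting |
| First time using Claude for DevOps | First infrastructure diagnosis |
| Want to automate runbooks | Incident response patterns |
| Need ready-made prompts | Prompt cheat sheet |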

The FIRE framework applied to a Pod in CrashLoopBackOff:

  1. First Response: provide context

    claude "I have a pod in CrashLoopBackOff in the payment-service namespace.
    Environment: production
    Cluster: gke-prod-us-east1
    Started: 10 minutes ago
    What are the first 3 things I should check?"
  2. Investigate: run the suggested command and share the output

    kubectl describe pod payment-api-7d4b8c6f5-x2j9k -n payment-service
    # Copy the output and paste it to Claude
  3. Remediate: Claude proposes, you approve

    claude "Based on this describe output, what's the fix?
    CONSTRAINT: Show me the exact command but don't execute anything."
  4. Evaluate: record the incident

    claude "Create a brief incident note for our wiki:
    - What happened
    - Root cause
    - Fix applied
    - Prevention recommendation"
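
## CrashLoopBackOff triage (by priority)
1. **Check the exit code and recent logs**
kubectl logs payment-api-7d4b8c6f5-x2j9k -n payment-service --previous
Most common causes: OOM (exit code 137), configuration error (exit code 1), dependency failure
2. **Check Pod events and resource limits**
kubectl describe pod payment-api-7d4b8c6f5-x2j9k -n payment-service
Look for: OOMKilled, ImagePullBackOff, liveness probe failures
3. **Check recent deployment changes**
kubectl rollout history deployment/payment-api -n payment-service
If there was a recent change: consider rolling back first while you investigate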

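Install a Kubernetes MCP server for persistent cluster context (confirm the package name on the npm registry before running it):

// ~/.claude.json (or .mcp.json)
{
  "mcpServers": {
    "kubernetes": {
      "command": "npx",
      "args": ["-y", "@anthropic/mcp-kubernetes"]
    }
  }
}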

Replace values in <angle brackets> with your own.

kubectl describe pod <pod> -n <ns> | claude "Analyze this CrashLoopBackOff:
1. What's the exit code and what does it mean?
2. Check the last 5 restarts pattern (timing, consistent or escalating?)
3. Suggest 3 most likely root causes based on the events
4. Give me the exact commands to investigate each hypothesis"

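Common causes Claude typically identifies

| Exit code | Meaning |
| --- | --- |
| 137 | SIGKILL (128 + 9); usually OOMKilled (memory limit exceeded) |
| 1 | Application error (misconfiguration, missing dependency) |
| 143 | SIGTERM (128 + 15); the process was asked to stop, e.g. during pod shutdown or eviction |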
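
Exit codes above 128 conventionally encode the terminating signal as 128 plus the signal number, which is why 137 corresponds to SIGKILL (9) and 143 to SIGTERM (15). The sketch below shows that arithmetic; the function name `explain_exit_code` is hypothetical and the descriptions are shortened.

```shell
# Hypothetical helper: translate a container exit code into a short explanation.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "0: clean exit"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))   # codes above 128 mean "killed by signal (code - 128)"
    case $sig in
      9)  echo "$code: SIGKILL (often OOMKilled)" ;;
      15) echo "$code: SIGTERM" ;;
      *)  echo "$code: terminated by signal $sig" ;;
    esac
  else
    echo "$code: application error"
  fi
}
```

Calling `explain_exit_code 139`, for example, reports signal 11 (a segmentation fault), which the table does not list.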

(kubectl top pods -n <ns>; kubectl describe pod <pod> -n <ns>) | claude "This pod was OOMKilled:
1. Compare requests vs limits vs actual usage
2. Is this a memory leak or under-provisioning?
3. If leak: what patterns in the container suggest investigation paths?
4. If under-provisioned: suggest optimal resource settings based on this data"

Memory leak tracing

claude "The pod has been restarting every 2 hours with OOMKilled.
Memory grows linearly from 200Mi to 512Mi limit before crash.
Language: Node.js 18
What are the top 3 things to check for memory leaks in this stack?"
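
The figures in that prompt already allow a rough growth-rate estimate, which is useful context to give Claude: memory climbs from 200Mi to the 512Mi limit over about 2 hours. A sketch of the arithmetic follows; the helper name is hypothetical.

```shell
# Hypothetical helper: average memory growth in Mi per minute.
# Args: starting usage (Mi), memory limit (Mi), minutes between restarts.
leak_rate_mi_per_min() {
  awk -v start="$1" -v limit="$2" -v minutes="$3" \
    'BEGIN { printf "%.1f\n", (limit - start) / minutes }'
}

# Values from the prompt: 200Mi -> 512Mi in 120 minutes.
leak_rate_mi_per_min 200 512 120   # prints 2.6
```

A steady rate like this points toward a leak; usage that jumps to the limit under load points toward under-provisioning.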

kubectl describe pod <pod> -n <ns> | claude "ImagePullBackOff diagnosis:
1. Is this an auth issue, network issue, or wrong image name?
2. What's the exact error message telling us?
3. Give me commands to verify the image exists and credentials work"

(kubectl describe pod <pod> -n <ns>; kubectl describe nodes) | claude "Pod stuck in Pending:
1. Is this resource constraints, node selectors, or affinity rules?
2. Which nodes were considered and why rejected?
3. What's the quickest fix vs proper solution?"

(kubectl get svc,endpoints -n <ns>; kubectl describe svc <svc> -n <ns>) | claude "Service not reachable:
1. Are there healthy endpoints?
2. Is the selector matching pods correctly?
3. Is it a network policy blocking traffic?
Give me diagnostic commands for each possibility"

Scenario: an e-commerce platform, a 3 AM alert, and the checkout service returning 503s.

  1. First Response

    claude "INCIDENT: checkout-service returning 503s, started 10 min ago.
    Impact: 100% of checkout attempts failing.
    Environment: AWS EKS production, us-east-1.
    Recent changes: deployment 2 hours ago (new feature flag logic).
    What's the fastest diagnostic path?"
  2. Investigate (Claude suggests checking the Pods first)

    kubectl get pods -n checkout -l app=checkout-service
    # Output: 3 of 5 Pods are in CrashLoopBackOff
    kubectl logs checkout-service-xxx -n checkout --previous | tail -50 | claude "Analyze crash logs"
    # Claude identifies a nil pointer dereference in the feature flag code
  3. Remediate

    claude "Root cause identified: nil pointer in feature flag logic from recent deploy.
    Options:
    A) Rollback to previous version
    B) Hotfix the nil check
    Which is faster and safer at 3 AM?"
    # Claude recommends: roll back (faster, returns to a known-good state, fix properly tomorrow)
    kubectl rollout undo deployment/checkout-service -n checkout
    # Service recovers within 2 minutes
  4. Evaluate (the next day)

    claude "Create postmortem from this incident:
    Timeline: 3:02 AM alert, 3:15 AM root cause found, 3:17 AM rollback, 3:19 AM resolved
    Root cause: Feature flag nil pointer from commit abc123
    Impact: 17 minutes checkout downtime
    Format: Blameless, focused on prevention"

Result: an MTTR of about 17 minutes (3:02 AM alert to 3:19 AM resolution), a clear postmortem, and concrete prevention measures.
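
Recovery time follows directly from the postmortem timeline, so it is worth computing rather than estimating. A minimal sketch, assuming same-day `HH:MM` timestamps in 24-hour form; the helper name is hypothetical.

```shell
# Hypothetical helper: minutes between two same-day HH:MM timestamps.
minutes_between() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, s, ":"); split(b, e, ":")
    print (e[1] * 60 + e[2]) - (s[1] * 60 + s[2])
  }'
}

minutes_between 03:02 03:19   # alert -> resolved, prints 17
minutes_between 03:02 03:15   # alert -> root cause found, prints 13
```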


# Collect logs from the related services
kubectl logs -l app=api-gateway -n ingress --since=10m > gateway.log
kubectl logs -l app=auth-service -n auth --since=10m > auth.log
kubectl logs -l app=payment-service -n payment --since=10m > payment.log
# Correlate them
cat gateway.log auth.log payment.log | claude "Correlate these logs:
1. Find the request flow for failed transactions
2. Identify where the failure originates
3. Are there patterns in timing or specific endpoints?
4. Create a timeline of events"
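
Plain concatenation drops the information about which service wrote each line, which makes correlation harder. One possible refinement is to prefix every line with its source file before piping; the helper name `tag_logs` is hypothetical.

```shell
# Hypothetical helper: prefix every line with the file it came from, so the
# originating service is still visible after the logs are concatenated.
tag_logs() {
  awk '{ print "[" FILENAME "] " $0 }' "$@"
}

# Usage (file names from the example above):
#   tag_logs gateway.log auth.log payment.log | claude "Correlate these logs: ..."
```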

grep -E "ERROR|WARN|Exception" app.log | claude "Analyze error patterns:
1. Cluster similar errors (group by type, not timestamp)
2. What's the most frequent vs most severe?
3. Which errors are correlated (same root cause)?
4. Prioritize investigation order"

claude "I need a PromQL query to:
- Show p99 latency for the payment-service
- Group by endpoint
- Alert if > 500ms for 5 minutes
Include the alert rule YAML too"

  1. First Response (30 seconds)

    claude "INCIDENT: [specific symptom]
    Context: [service name], [environment], [start time]
    Recent changes: [deployments, infrastructure changes, traffic shifts]
    Current impact: [% of users affected, revenue impact]
    What are the 3 most critical things to check first?"
  2. Investigate (2-5 minutes)

    # Claude suggests checking Pod health first
    kubectl get pods -n checkout | claude "Quick assessment of this pod list"
    # Then check recent logs
    kubectl logs -l app=checkout -n checkout --since=5m | head -100 | claude "Analyze for error patterns"
  3. Remediate (approval required)

    claude "Based on investigation:
    - Root cause: [your assessment]
    - Evidence: [key findings]
    Propose remediation options:
    1. Quick mitigation (restore service)
    2. Proper fix (address root cause)
    CONSTRAINT: I need to approve before any action. Show exact commands."
  4. Evaluate (after the incident, not during it)

    claude "Create incident postmortem:
    Timeline: [timestamped events]
    Format: Blameless, focus on systems not people
    Include: Action items with owners"
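
During an incident it can help to fill the First Response template from arguments rather than editing the text by hand. The sketch below is one way to do that; the function name and argument order are assumptions, not part of any tool.

```shell
# Hypothetical helper: fill the First Response template from arguments.
# Args: symptom, service, environment, start time, recent changes, impact.
first_response_prompt() {
  cat <<EOF
INCIDENT: $1
Context: $2, $3, $4
Recent changes: $5
Current impact: $6
What are the 3 most critical things to check first?
EOF
}

# Usage:
#   claude "$(first_response_prompt 'checkout returning 503s' checkout-service \
#     production '10 min ago' 'deploy 2 hours ago' '100% of checkouts failing')"
```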

claude "Generate incident update for stakeholders:
Incident: Checkout service degradation
Current status: Mitigated, monitoring
Impact: 15 minutes of 30% checkout failures
ETA to full resolution: 2 hours (proper fix in next deploy)
Audience: Non-technical executives
Tone: Professional, reassuring, factual
Length: 3 sentences max"

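Example output

Our checkout service experienced a disruption of about 15 minutes that affected roughly 30% of transactions; it has now been restored. The issue was caused by a software defect in a recent update, which was quickly rolled back. A permanent fix will ship in the next scheduled maintenance window and is not expected to affect customers.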

# Agent 1: timeline reconstruction
claude "You are an incident timeline analyst.
From these logs and Slack messages, reconstruct a precise timeline:
[paste logs and communication records]"
# Agent 2: root cause analysis
claude "You are a root cause analyst.
Given this timeline, perform 5-whys analysis:
[paste Agent 1's timeline]"
# Agent 3: prevention recommendations
claude "You are an SRE process improvement specialist.
Given this root cause analysis:
[paste Agent 2's root cause analysis]
Output: Prioritized prevention measures, effort estimates"
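
The three stages can also be chained so that each receives the previous stage's output without manual pasting. This is a sketch under assumptions: it relies on the CLI's print mode (`claude -p`), and the `llm` wrapper and `postmortem_pipeline` names are hypothetical.

```shell
# Hypothetical wrapper around the CLI. `claude -p` prints one response and
# exits; redefine llm() with a stub to dry-run the pipeline.
llm() { claude -p "$1"; }

# Chain the three roles: each stage receives the previous stage's output.
# Arg: path to a file containing the raw logs and messages.
postmortem_pipeline() {
  evidence=$(cat "$1")
  timeline=$(llm "You are an incident timeline analyst. From these logs and Slack messages, reconstruct a precise timeline:
$evidence")
  rca=$(llm "You are a root cause analyst. Given this timeline, perform 5-whys analysis:
$timeline")
  llm "You are an SRE process improvement specialist. Given this root cause analysis:
$rca
Output: Prioritized prevention measures, effort estimates"
}

# Usage: postmortem_pipeline incident-evidence.txt > prevention.md
```

Reviewing the intermediate timeline before running the later stages keeps a human in the loop, in line with the approval step in the FIRE framework.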
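
| Metric | Before Claude | After 3 months |
| --- | --- | --- |
| Average MTTR | 45 minutes | 18 minutes (60% reduction) |
| Postmortem completion | Often delayed or skipped | 95% completed within 24 hours |
| Knowledge sharing | Siloed among individual engineers | Claude-generated runbooks available to the whole team |

Key finding: the biggest gain was not speed but consistency and documentation.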


terraform plan -out=tfplan && terraform show -no-color tfplan | claude "Review this Terraform plan:
1. Any dangerous changes? (data loss, downtime)
2. Are the changes what we expect?
3. Any missing changes we should add?
4. Cost implications if visible"
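
Before asking for a review, a quick local check can flag plans that destroy anything. The sketch below reads plan text on stdin and extracts the destroy count from Terraform's summary line (`Plan: N to add, N to change, N to destroy.`); the helper name is hypothetical, and the pattern assumes uncolored English output.

```shell
# Hypothetical pre-check: read plan text on stdin and print how many resources
# the summary line says will be destroyed (0 if there is no summary line).
plan_destroy_count() {
  n=$(sed -n -E 's/^Plan:.*[^0-9]([0-9]+) to destroy\..*/\1/p' | head -n 1)
  echo "${n:-0}"
}

# Usage (tfplan is a saved plan file):
#   terraform show -no-color tfplan | plan_destroy_count
```

A non-zero count is a cue to read the plan yourself as well as asking Claude about data loss and downtime.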

claude "Generate a Terraform module for:
- AWS ECS Fargate service
- With ALB and target group
- Auto-scaling based on CPU
- Secrets from SSM Parameter Store
Follow these conventions:
- Use for_each over count
- All resources tagged with var.tags
- Output the service URL and ARN"

claude "I need to move a resource to a different state file:
Current state: terraform-prod/terraform.tfstate
Resource: aws_s3_bucket.logs
Target state: terraform-shared/terraform.tfstate
What's the safest procedure? Include rollback steps."

terraform plan -detailed-exitcode 2>&1 | tee drift.txt
cat drift.txt | claude "Analyze this Terraform drift:
1. What changed outside of Terraform?
2. Is this drift expected (manual change) or concerning?
3. Should we import the changes or revert to Terraform state?
4. What's the safest remediation path?"
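
With `-detailed-exitcode`, `terraform plan` exits 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. Piping through `tee` hides that status, so a scripted drift check may prefer to capture it first. A sketch, with a hypothetical helper name:

```shell
# Map the exit status of `terraform plan -detailed-exitcode` to a label.
# 0 = no changes, 1 = error, 2 = changes present.
plan_status() {
  case $1 in
    0) echo "in-sync" ;;
    1) echo "error" ;;
    2) echo "changes-detected" ;;
    *) echo "unknown" ;;
  esac
}

# Usage: redirect instead of piping so $? still belongs to terraform.
#   terraform plan -detailed-exitcode -no-color > drift.txt 2>&1
#   plan_status $?
```

Status 2 covers both real drift and configuration changes that have not been applied yet, so the plan text still needs to be read to tell them apart.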

cat playbook.yml | claude "Review this Ansible playbook:
1. Idempotency issues?
2. Security concerns?
3. Error handling gaps?
4. Performance optimizations?"

claude "Generate an Ansible role for:
- Installing and configuring Nginx
- SSL certificates via Let's Encrypt (certbot)
- Hardened configuration (disable server tokens, etc.)
- Log rotation
Follow best practices:
- Use handlers for service restarts
- Variables in defaults/main.yml
- Include molecule tests structure"

cat application.yaml | claude "Review this ArgoCD Application:
1. Sync policy appropriate for the environment?
2. Resource health checks defined?
3. Any sync wave ordering issues?
4. Namespace and project permissions correct?"

claude "Generate Helm values for deploying [application] to:
- Environment: staging
- Resources: Limited (cost-conscious)
- Replicas: 2
- Ingress: Internal only
- Secrets: From external-secrets operator
Base chart: [chart name]
Include comments explaining each value"

tfsec . --format=json | claude "Analyze these security findings:
1. Prioritize by severity and exploitability
2. Which are false positives in our context?
3. For real issues: what's the fix?
4. Which can we ignore with a documented reason?"

cat iam-policy.json | claude "Review this IAM policy:
1. Does it follow least privilege?
2. Any overly permissive actions? (*, admin, etc.)
3. Resource constraints appropriate?
4. Suggest a more restrictive version that still works"

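| Model | Input (per 1M tokens) | Output (per 1M tokens) |
| --- | --- | --- |
| Sonnet 4 | $3 | $15 |
| Opus 4 | $15 | $75 |

Typical DevOps session: 20K-50K tokens, roughly $0.10-$0.50 at Sonnet 4 rates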
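
The session estimate can be reproduced from the per-million-token prices. The sketch below is a worked example; the helper name and the 40K input / 10K output split are assumptions chosen for illustration.

```shell
# Hypothetical helper: USD cost of one session.
# Args: input tokens, output tokens, input $/1M tokens, output $/1M tokens.
session_cost() {
  awk -v in_tok="$1" -v out_tok="$2" -v in_price="$3" -v out_price="$4" \
    'BEGIN { printf "%.2f\n", (in_tok * in_price + out_tok * out_price) / 1000000 }'
}

# 50K-token session at Sonnet 4 rates, assuming 40K input and 10K output.
session_cost 40000 10000 3 15    # prints 0.27
# Same session at Opus 4 rates.
session_cost 40000 10000 15 75   # prints 1.35
```

Output tokens cost five times as much as input tokens on both models, so the input/output split moves the total more than the raw token count does.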

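Cost control strategies

  1. Use Sonnet (the default) for routine tasks
  2. Use Opus for complex multi-system analysis
  3. Run /compact to compress context when a conversation grows long
  4. Avoid pasting full logs; filter the relevant parts with grep first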
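
The last strategy can be made routine with a small filter that keeps only error-like lines and caps how many are sent. A minimal sketch; the helper name, the pattern, and the default cap of 200 lines are assumptions to adapt.

```shell
# Hypothetical helper: keep only error-like lines from a log file and cap the
# result at the most recent N matches (default 200).
# Args: log file, optional maximum number of lines.
filter_log() {
  grep -E 'ERROR|WARN|Exception' "$1" | tail -n "${2:-200}"
}

# Usage:
#   filter_log app.log 100 | claude "Analyze error patterns"
```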
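
| Data type | Why not | Alternative |
| --- | --- | --- |
| API keys, tokens | May be cached or logged | Use a placeholder: <API_KEY> |
| Production secrets | Security risk | Describe the type of secret, not its value |
| Customer PII | Privacy compliance requirements | Use anonymized examples |
| Incident details containing PII | Legal liability | Redact before sharing |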
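
Redaction is easier to apply consistently as a filter in front of every pipe. The sketch below masks values assigned to names containing KEY, TOKEN, SECRET, or PASSWORD, as well as bearer tokens and email addresses. The function name and patterns are assumptions; they are deliberately broad, will not catch every secret format, and do not replace reviewing the text before sharing it.

```shell
# Hypothetical filter: mask common secret and PII patterns on stdin.
redact() {
  sed -E \
    -e 's/([A-Za-z0-9_]*(KEY|TOKEN|SECRET|PASSWORD|key|token|secret|password)[A-Za-z0-9_]*[=:][[:space:]]*)[^[:space:]"]+/\1<REDACTED>/g' \
    -e 's/(Bearer )[A-Za-z0-9._-]+/\1<REDACTED>/g' \
    -e 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/<EMAIL>/g'
}

# Usage:
#   kubectl logs <pod> -n <ns> | redact | claude "Analyze for error patterns"
```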
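
  • Complex root cause analysis spanning multiple interacting systems
  • Documentation (postmortems, runbooks, operating procedures)
  • Learning new tools (unfamiliar cloud services, new Kubernetes features)
  • Validating your hypotheses (as a second opinion)
  • Batch work (generating configuration for multiple environments)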

Symptom → Prompt

CrashLoopBackOff
    kubectl describe pod <pod> -n <ns> | claude "Exit code meaning? 3 likely causes?"
OOMKilled
    (kubectl top pods -n <ns>; kubectl describe pod <pod> -n <ns>) | claude "Leak or under-provisioned?"
ImagePullBackOff
    kubectl describe pod <pod> -n <ns> | claude "Auth, network, or wrong image?"
Pending
    (kubectl describe pod <pod> -n <ns>; kubectl describe nodes) | claude "Resource, selector, or affinity?"
Service unreachable
    kubectl get svc,endpoints -n <ns> | claude "Healthy endpoints? Selector matching?"

Symptom → Prompt

High latency
    [metrics] | claude "Bottleneck location? Compute, network, or dependency?"
Disk full
    (df -h; du -sh /* 2>/dev/null) | claude "What's consuming space? Safe to delete?"
Connection refused
    netstat -tlnp | claude "Service listening? Port correct? Firewall?"
SSL certificate expiry
    echo | openssl s_client -connect host:443 2>/dev/null | openssl x509 -noout -dates | claude "Days until expiry? Renewal steps?"
DNS problems
    dig +trace domain | claude "Where does resolution fail?"

Task → Prompt

Plan review
    terraform plan -no-color | claude "Dangerous changes? Missing? Cost impact?"
Drift analysis
    terraform plan -detailed-exitcode -no-color | claude "What drifted? Expected? Fix?"
Module request
    claude "Generate Terraform module for [resource] with [requirements]"

| Server | Purpose | Install command |
| --- | --- | --- |
| Kubernetes | Direct cluster access | npx -y @anthropic/mcp-kubernetes |
| AWS | AWS API access | npx -y @anthropic/mcp-aws |
| GCP | GCP API access | npx -y @anthropic/mcp-gcp |

Configuration location: ~/.claude.json (the "mcpServers" field)