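Claude Code is a capable assistant for DevOps and SRE workflows, from diagnosing infrastructure problems to drafting postmortems and managing infrastructure as code (IaC). This guide introduces the FIRE framework, a systematic approach to infrastructure diagnosis, along with practical patterns for incident response, Kubernetes troubleshooting, and infrastructure as code.
Infrastructure diagnosis with Claude Code follows four phases: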
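
    F - First Response → Tell Claude the symptoms + context
    I - Investigate    → Claude analyzes logs, metrics, and configuration
    R - Remediate      → Claude proposes fixes (human approval)
    E - Evaluate       → Postmortem, documentation, prevention

First Response
Describe the symptoms and provide environment context. Claude triages and prioritizes what to check.
Investigate
Run commands and paste the output. Claude analyzes, correlates, and forms hypotheses.
Remediate
Claude proposes fixes and explains their impact. You decide whether to approve or reject.
Evaluate
Claude generates the postmortem, documentation, and prevention recommendations.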
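The First Response phase boils down to handing Claude a symptom plus context. As a minimal sketch, assuming a hypothetical helper named `fire_prompt` (not part of Claude Code), the prompt can be assembled from three facts so nothing is forgotten under pressure:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a First Response prompt from the facts you have.
# Arguments: symptom, environment, start time.
fire_prompt() {
  local symptom="$1" environment="$2" started="$3"
  printf 'INCIDENT: %s\nEnvironment: %s\nStarted: %s\nWhat are the first 3 things I should check?\n' \
    "$symptom" "$environment" "$started"
}

# Print the prompt so it can be reviewed before it is sent anywhere.
fire_prompt "pod in CrashLoopBackOff in payment-service" "production" "10 minutes ago"
```

Once the text looks right, it could be passed along as `claude "$(fire_prompt ...)"`. Printing first keeps a human in the loop, in the same spirit as the approval step in Remediate.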
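| Situation | Jump to |
|---|---|
| Handling a live production incident | Kubernetes troubleshooting |
| Using Claude for DevOps for the first time | Your first infrastructure diagnosis |
| Want to automate runbooks | Incident response patterns |
| Need ready-made prompts | Quick-reference prompts |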
The FIRE framework in practice, using a Pod CrashLoopBackOff scenario:

First Response: provide context

    claude "I have a pod in CrashLoopBackOff in the payment-service namespace.
    Environment: production
    Cluster: gke-prod-us-east1
    Started: 10 minutes ago
    What are the first 3 things I should check?"

Investigate: run the suggested commands and share the output

    kubectl describe pod payment-api-7d4b8c6f5-x2j9k -n payment-service
    # Copy the output and paste it to Claude

Remediate: Claude proposes, you approve

    claude "Based on this describe output, what's the fix?
    CONSTRAINT: Show me the exact command but don't execute anything."

Evaluate: record the incident

    claude "Create a brief incident note for our wiki:
    - What happened
    - Root cause
    - Fix applied
    - Prevention recommendation"

## CrashLoopBackOff triage (by priority)

1. **Check the exit code and recent logs**
   `kubectl logs payment-api-7d4b8c6f5-x2j9k -n payment-service --previous`
   Most common causes: OOM (exit code 137), configuration error (exit code 1), dependency failure
2. **Check Pod events and resource limits**
   `kubectl describe pod payment-api-7d4b8c6f5-x2j9k -n payment-service`
   Look for: OOMKilled, ImagePullBackOff, liveness probe failures
3. **Check recent deployment changes**
   `kubectl rollout history deployment/payment-api -n payment-service`
   If there was a recent change: consider rolling back first while you investigate

Install a K8s MCP server for persistent cluster context by adding it to `~/.claude.json` (or `.mcp.json`):

    {
      "mcpServers": {
        "kubernetes": {
          "command": "npx",
          "args": ["-y", "@anthropic/mcp-kubernetes"]
        }
      }
    }

Replace the `<bracketed>` values with your own.

    kubectl describe pod <pod> -n <ns> | claude "Analyze this CrashLoopBackOff:
    1. What's the exit code and what does it mean?
    2. Check the last 5 restarts pattern (timing, consistent or escalating?)
    3. Suggest 3 most likely root causes based on the events
    4. Give me the exact commands to investigate each hypothesis"

Common causes Claude typically identifies:

| Exit code | Meaning |
|---|---|
| 137 | SIGKILL, typically OOMKilled (memory limit exceeded) |
| 1 | Application error (misconfiguration, missing dependency) |
| 143 | SIGTERM (the process was asked to terminate, for example during a rollout or eviction) |
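The codes in this table follow the usual shell convention that a status above 128 means "terminated by signal (status minus 128)": 137 is 128 + 9 (SIGKILL) and 143 is 128 + 15 (SIGTERM). A small sketch of that decoding, where `explain_exit_code` is a hypothetical helper name and the wording of each message is illustrative:

```shell
#!/usr/bin/env bash
# Hypothetical helper: explain a container exit code.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
explain_exit_code() {
  local code="$1" sig
  if [ "$code" -eq 0 ]; then
    echo "exit 0: completed successfully"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "exit $code: killed by SIGKILL (signal 9), often OOMKilled" ;;
      15) echo "exit $code: terminated by SIGTERM (signal 15)" ;;
      *)  echo "exit $code: terminated by signal $sig" ;;
    esac
  else
    echo "exit $code: application error"
  fi
}

explain_exit_code 137
explain_exit_code 143
explain_exit_code 1
```

A 137 alone does not prove an out-of-memory kill; the `OOMKilled` reason in the `kubectl describe pod` output is what confirms it.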
OOMKilled:

    { kubectl top pods -n <ns>; kubectl describe pod <pod> -n <ns>; } | claude "This pod was OOMKilled:
    1. Compare requests vs limits vs actual usage
    2. Is this a memory leak or under-provisioning?
    3. If leak: what patterns in the container suggest investigation paths?
    4. If under-provisioned: suggest optimal resource settings based on this data"

Memory leak tracking:

    claude "The pod has been restarting every 2 hours with OOMKilled.
    Memory grows linearly from 200Mi to 512Mi limit before crash.
    Language: Node.js 18
    What are the top 3 things to check for memory leaks in this stack?"

ImagePullBackOff:

    kubectl describe pod <pod> -n <ns> | claude "ImagePullBackOff diagnosis:
    1. Is this an auth issue, network issue, or wrong image name?
    2. What's the exact error message telling us?
    3. Give me commands to verify the image exists and credentials work"

Pod stuck in Pending:

    { kubectl describe pod <pod> -n <ns>; kubectl describe nodes; } | claude "Pod stuck in Pending:
    1. Is this resource constraints, node selectors, or affinity rules?
    2. Which nodes were considered and why rejected?
    3. What's the quickest fix vs proper solution?"

Service unreachable:

    { kubectl get svc,endpoints -n <ns>; kubectl describe svc <svc> -n <ns>; } | claude "Service not reachable:
    1. Are there healthy endpoints?
    2. Is the selector matching pods correctly?
    3. Is it a network policy blocking traffic?
    Give me diagnostic commands for each possibility"

Scenario: an e-commerce platform, a 3 AM alert, and the checkout service returning 503s.

First Response

    claude "INCIDENT: checkout-service returning 503s, started 10 min ago.
    Impact: 100% of checkout attempts failing.
    Environment: AWS EKS production, us-east-1.
    Recent changes: deployment 2 hours ago (new feature flag logic).
    What's the fastest diagnostic path?"

Investigate (Claude suggests checking the Pods first)

    kubectl get pods -n checkout -l app=checkout-service
    # Output: 3 of 5 Pods are in CrashLoopBackOff

    kubectl logs checkout-service-xxx -n checkout --previous | tail -50 | claude "Analyze crash logs"
    # Claude identifies a nil pointer dereference in the feature flag code

Remediate

    claude "Root cause identified: nil pointer in feature flag logic from recent deploy.
    Options:
    A) Rollback to previous version
    B) Hotfix the nil check
    Which is faster and safer at 3 AM?"
    # Claude recommends the rollback (faster, known-good state, fix properly tomorrow)

    kubectl rollout undo deployment/checkout-service -n checkout
    # Service recovers within 2 minutes

Evaluate (the next day)

    claude "Create postmortem from this incident:
    Timeline: 3:02 AM alert, 3:15 AM root cause found, 3:17 AM rollback, 3:19 AM resolved
    Root cause: Feature flag nil pointer from commit abc123
    Impact: 15 minutes checkout downtime
    Format: Blameless, focused on prevention"

Result: an MTTR of about 15 minutes, a clear postmortem, and concrete prevention measures.

    # Collect logs from the related services
    kubectl logs -l app=api-gateway -n ingress --since=10m > gateway.log
    kubectl logs -l app=auth-service -n auth --since=10m > auth.log
    kubectl logs -l app=payment-service -n payment --since=10m > payment.log

    # Analyze correlations
    cat gateway.log auth.log payment.log | claude "Correlate these logs:
    1. Find the request flow for failed transactions
    2. Identify where the failure originates
    3. Are there patterns in timing or specific endpoints?
    4. Create a timeline of events"

    grep -E "ERROR|WARN|Exception" app.log | claude "Analyze error patterns:
    1. Cluster similar errors (group by type, not timestamp)
    2. What's the most frequent vs most severe?
    3. Which errors are correlated (same root cause)?
    4. Prioritize investigation order"

    claude "I need a PromQL query to:
    - Show p99 latency for the payment-service
    - Group by endpoint
    - Alert if > 500ms for 5 minutes
    Include the alert rule YAML too"

First Response (30 seconds)

    claude "INCIDENT: [specific symptom]
    Context: [service name], [environment], [start time]
    Recent changes: [deployments, infrastructure changes, traffic fluctuations]
    Current impact: [% of users affected, revenue impact]
    What are the 3 most critical things to check first?"

Investigate (2-5 minutes)

    # Claude suggests checking Pod health first
    kubectl get pods -n checkout | claude "Quick assessment of this pod list"

    # Then check recent logs
    kubectl logs -l app=checkout -n checkout --since=5m | head -100 | claude "Analyze for error patterns"

Remediate (approval required)

    claude "Based on investigation:
    - Root cause: [your assessment]
    - Evidence: [key findings]

    Propose remediation options:
    1. Quick mitigation (restore service)
    2. Proper fix (address root cause)

    CONSTRAINT: I need to approve before any action. Show exact commands."

Evaluate (after the incident, not during it)

    claude "Create incident postmortem:
    Timeline: [timestamped events]
    Format: Blameless, focus on systems not people
    Include: Action items with owners"

Stakeholder update:

    claude "Generate incident update for stakeholders:

    Incident: Checkout service degradation
    Current status: Mitigated, monitoring
    Impact: 15 minutes of 30% checkout failures
    ETA to full resolution: 2 hours (proper fix in next deploy)

    Audience: Non-technical executives
    Tone: Professional, reassuring, factual
    Length: 3 sentences max"

Example output:
Our checkout service experienced a disruption lasting about 15 minutes that affected roughly 30% of transactions, and service has since been restored. The issue was caused by a software defect in a recent update, which was promptly rolled back. We will deploy a permanent fix in the next scheduled maintenance window and expect no customer impact.
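
    # Agent 1: timeline reconstruction
    claude "You are an incident timeline analyst.
    From these logs and Slack messages, reconstruct a precise timeline:
    [paste logs and communications]"

    # Agent 2: root cause analysis
    claude "You are a root cause analyst.
    Given this timeline, perform 5-whys analysis:
    [paste Agent 1's timeline]"

    # Agent 3: prevention recommendations
    claude "You are an SRE process improvement specialist.
    Given this root cause analysis:
    [paste Agent 2's root cause analysis]
    Output: Prioritized prevention measures, effort estimates"

| Metric | Before Claude | After 3 months |
|---|---|---|
| Average MTTR | 45 minutes | 18 minutes (60% lower) |
| Postmortem completion | Often delayed or skipped | 95% completed within 24 hours |
| Knowledge sharing | Siloed among individual engineers | Claude-generated runbooks available to everyone |

Key finding: the biggest gain was not speed but consistency and documentation.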
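The 60% figure in the table follows directly from the two MTTR values: (45 - 18) / 45 = 0.6. As a quick sketch for tracking your own numbers, with `mttr_reduction_pct` as a hypothetical helper name and integer minutes assumed:

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage reduction between two MTTR values in minutes.
# Uses integer arithmetic, so the result is truncated toward zero.
mttr_reduction_pct() {
  local before="$1" after="$2"
  echo $(( (before - after) * 100 / before ))
}

mttr_reduction_pct 45 18
```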
Terraform:

    terraform plan -out=tfplan && terraform show -no-color tfplan | claude "Review this Terraform plan:
    1. Any dangerous changes? (data loss, downtime)
    2. Are the changes what we expect?
    3. Any missing changes we should add?
    4. Cost implications if visible"

    claude "Generate a Terraform module for:
    - AWS ECS Fargate service
    - With ALB and target group
    - Auto-scaling based on CPU
    - Secrets from SSM Parameter Store

    Follow these conventions:
    - Use for_each over count
    - All resources tagged with var.tags
    - Output the service URL and ARN"

    claude "I need to move a resource to a different state file:
    Current state: terraform-prod/terraform.tfstate
    Resource: aws_s3_bucket.logs
    Target state: terraform-shared/terraform.tfstate

    What's the safest procedure? Include rollback steps."

    terraform plan -detailed-exitcode 2>&1 | tee drift.txt

    cat drift.txt | claude "Analyze this Terraform drift:
    1. What changed outside of Terraform?
    2. Is this drift expected (manual change) or concerning?
    3. Should we import the changes or revert to Terraform state?
    4. What's the safest remediation path?"

Ansible:

    cat playbook.yml | claude "Review this Ansible playbook:
    1. Idempotency issues?
    2. Security concerns?
    3. Error handling gaps?
    4. Performance optimizations?"

    claude "Generate an Ansible role for:
    - Installing and configuring Nginx
    - SSL certificates via Let's Encrypt (certbot)
    - Hardened configuration (disable server tokens, etc.)
    - Log rotation

    Follow best practices:
    - Use handlers for service restarts
    - Variables in defaults/main.yml
    - Include molecule tests structure"

ArgoCD and Helm:

    cat application.yaml | claude "Review this ArgoCD Application:
    1. Sync policy appropriate for the environment?
    2. Resource health checks defined?
    3. Any sync wave ordering issues?
    4. Namespace and project permissions correct?"

    claude "Generate Helm values for deploying [application] to:
    - Environment: staging
    - Resources: Limited (cost-conscious)
    - Replicas: 2
    - Ingress: Internal only
    - Secrets: From external-secrets operator

    Base chart: [chart name]
    Include comments explaining each value"

Security review:

    tfsec . --format=json | claude "Analyze these security findings:
    1. Prioritize by severity and exploitability
    2. Which are false positives in our context?
    3. For real issues: what's the fix?
    4. Which can we ignore with a documented reason?"

    cat iam-policy.json | claude "Review this IAM policy:
    1. Does it follow least privilege?
    2. Any overly permissive actions? (*, admin, etc.)
    3. Resource constraints appropriate?
    4. Suggest a more restrictive version that still works"

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Sonnet 4 | $3 | $15 |
| Opus 4 | $15 | $75 |

Typical DevOps session: 20K-50K tokens, or roughly $0.10-$0.50.
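Session cost is the input tokens times the input price plus the output tokens times the output price, each divided by one million. A rough sketch of that arithmetic follows. `estimate_cost` is a hypothetical helper, and the 40K input / 10K output split is an assumed example, not a measured session:

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate session cost in dollars.
# Arguments: input tokens, output tokens, input price per 1M, output price per 1M.
estimate_cost() {
  local in_tokens="$1" out_tokens="$2" in_price="$3" out_price="$4"
  awk -v i="$in_tokens" -v o="$out_tokens" -v ip="$in_price" -v op="$out_price" \
    'BEGIN { printf "%.2f\n", (i * ip + o * op) / 1000000 }'
}

# Assumed split: 40K input + 10K output tokens at the Sonnet 4 prices above.
estimate_cost 40000 10000 3 15
```

Output tokens cost five times as much as input tokens in this table, so long generated reports move the total more than large pasted logs do.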
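Cost control strategies:
Use /compact to compress the context

| Data type | Why not to share it | Alternative |
|---|---|---|
| API keys, tokens | May be cached or logged | Use a placeholder: `<API_KEY>` |
| Production secrets | Security risk | Describe the type of secret, not its actual value |
| Customer PII | Privacy compliance requirements | Use anonymized examples |
| Incident details containing PII | Legal liability | Redact before sharing |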
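Redaction can be applied mechanically before anything is piped to Claude. The following is a minimal sketch under stated assumptions: `redact_secrets` is a hypothetical filter, and its three patterns (bearer tokens, lowercase `key=value` style secrets, email addresses) are illustrative and far from a complete secret scanner.

```shell
#!/usr/bin/env bash
# Hypothetical filter: mask obvious secrets before log text leaves your machine.
# The patterns are illustrative assumptions, not a complete secret scanner.
redact_secrets() {
  sed -E \
    -e 's/(Authorization: Bearer )[A-Za-z0-9._-]+/\1<TOKEN>/g' \
    -e 's/((api[_-]?key|password|secret|token)[=:][[:space:]]*)[^[:space:]"]+/\1<REDACTED>/g' \
    -e 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/<EMAIL>/g'
}

# Sample line with made-up values.
printf '%s\n' 'api_key=sk-12345 user=alice@example.com' | redact_secrets
```

It would sit in the middle of a pipeline, for example between `kubectl logs ...` and `claude "..."`. Review the filtered output by eye the first few times, since anything the patterns miss still gets sent.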
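| Limitation | Workaround |
|---|---|
| No real-time view of cluster state | Use a K8s MCP server or paste kubectl output |
| Cannot call cloud APIs directly | Use an MCP server or share CLI output |
| Context window limit (about 100K tokens) | Focus on the relevant logs rather than importing everything |
| No persistent memory | Use CLAUDE.md to store project context |
| Risk of hallucination | Always verify commands before running them |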
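To stay within the context window, a log can be reduced to its distinct error lines before it is shared. One possible sketch is shown here. `summarize_errors` is a hypothetical helper, and it assumes each log line begins with a single timestamp field followed by a space:

```shell
#!/usr/bin/env bash
# Hypothetical helper: shrink a log to its distinct error lines before sharing it.
# Assumes each line starts with one timestamp field followed by the message.
# Argument: maximum number of distinct lines to keep (default 50).
summarize_errors() {
  local max_lines="${1:-50}"
  grep -E 'ERROR|WARN|Exception' \
    | cut -d' ' -f2- \
    | sort | uniq -c | sort -rn \
    | head -n "$max_lines"
}

# Made-up sample log lines.
printf '%s\n' \
  '2024-01-01T03:02:01Z ERROR db timeout' \
  '2024-01-01T03:02:02Z INFO request ok' \
  '2024-01-01T03:02:03Z ERROR db timeout' \
  '2024-01-01T03:02:04Z WARN retrying' \
  | summarize_errors 10
```

Dropping the timestamp is what lets identical messages collapse into one counted line. The trade-off is that timing information is lost, so keep the raw log on hand for timeline questions.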
| Symptom | Prompt |
|---|---|
| CrashLoopBackOff | `kubectl describe pod <pod> -n <ns> \| claude "Exit code meaning? 3 likely causes?"` |
| OOMKilled | `{ kubectl top pods; kubectl describe pod <pod>; } \| claude "Leak or under-provisioned?"` |
| ImagePullBackOff | `kubectl describe pod <pod> \| claude "Auth, network, or wrong image?"` |
| Pending | `{ kubectl describe pod <pod>; kubectl describe nodes; } \| claude "Resource, selector, or affinity?"` |
| Service unreachable | `kubectl get svc,endpoints \| claude "Healthy endpoints? Selector matching?"` |
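When two commands feed one prompt, shell precedence matters: in `a && b | consumer` only `b` is piped, while `{ a; b; } | consumer` sends both outputs. The sketch below demonstrates the difference with stand-in functions. `show_usage`, `show_events`, and `count_lines` are hypothetical placeholders for the kubectl commands and for `claude`:

```shell
#!/usr/bin/env bash
# Stand-ins for two diagnostic commands (hypothetical; replace with kubectl calls).
show_usage()  { echo "usage: payment-api 498Mi"; }
show_events() { echo "event: OOMKilled"; }

# Stand-in for the consumer: reports how many lines it received.
count_lines() { grep -c ''; }

# Without grouping, only the second command feeds the pipe.
# (The first command's output is discarded here so only the count is captured.)
ungrouped="$(show_usage >/dev/null && show_events | count_lines)"

# With grouping, both outputs reach the consumer.
grouped="$( { show_usage; show_events; } | count_lines )"

echo "ungrouped=$ungrouped grouped=$grouped"
```

The braces need the spaces and the trailing semicolon shown here. A subshell, `( a; b ) | consumer`, would work equally well.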
| Symptom | Prompt |
|---|---|
| High latency | `[metrics] \| claude "Bottleneck location? Compute, network, or dependency?"` |
| Disk full | `{ df -h; du -sh /*; } \| claude "What's consuming space? Safe to delete?"` |
| Connection refused | `netstat -tlnp \| claude "Service listening? Port correct? Firewall?"` |
| Expired SSL certificate | `openssl s_client -connect host:443 </dev/null 2>/dev/null \| openssl x509 -noout -dates \| claude "Days until expiry? Renewal steps?"` |
| DNS problem | `dig +trace domain \| claude "Where does resolution fail?"` |
| Task | Prompt |
|---|---|
| Plan review | `terraform plan \| claude "Dangerous changes? Missing? Cost impact?"` |
| Drift analysis | `terraform plan -detailed-exitcode \| claude "What drifted? Expected? Fix?"` |
| Module request | `claude "Generate Terraform module for [resource] with [requirements]"` |
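With `-detailed-exitcode`, `terraform plan` reports its result through the exit status: 0 means the plan succeeded with no changes, 1 means an error, and 2 means the plan succeeded and changes are present. A small sketch that turns the status into a message, with `describe_plan_status` as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate the status of `terraform plan -detailed-exitcode`.
describe_plan_status() {
  case "$1" in
    0) echo "no changes: live infrastructure matches the configuration" ;;
    1) echo "error: the plan itself failed" ;;
    2) echo "changes present: review for drift before applying" ;;
    *) echo "unexpected exit status: $1" ;;
  esac
}

# Typical use (not run here):
#   terraform plan -detailed-exitcode -no-color > drift.txt 2>&1
#   describe_plan_status "$?"
describe_plan_status 2
```

In a pipeline such as `terraform plan -detailed-exitcode | claude "..."`, `$?` reflects the last command, not Terraform. In bash, `${PIPESTATUS[0]}` holds Terraform's own status, or the plan can be written to a file first as in the commented usage.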
| Server | Purpose | Install command |
|---|---|---|
| Kubernetes | Direct cluster access | `npx -y @anthropic/mcp-kubernetes` |
| AWS | AWS API access | `npx -y @anthropic/mcp-aws` |
| GCP | GCP API access | `npx -y @anthropic/mcp-gcp` |

Configuration location: ~/.claude.json (the "mcpServers" field)