7 - The Resilience Engineer's Guide to Claude Code: Debug Like a Pro

November 06, 2025

Introduction

Here's the truth about debugging: Most people treat symptoms, not diseases.

Claude Code crashes? Restart it. Slow responses? Clear context. Hooks failing? Disable them.

That's not engineering. That's whack-a-mole.

What if instead of firefighting, you built resilient systems that self-heal? What if you applied industrial-grade diagnostic frameworks from distributed systems engineering to your AI development workflow?

This isn't a "common errors and fixes" list. This is a resilience engineering handbook for Claude Code, combining:

  • Root Cause Analysis (5 Whys, Fishbone, Fault Tree)
  • Resilience Patterns (Circuit Breaker, Retry, Bulkhead, Fallback)
  • Observability (Metrics, Logs, Traces)
  • Self-Healing Automation (Prevention > detection > cure)

By the end, you won't just fix problems; you'll prevent them and build systems that recover automatically.

Let's transform you from a bug fixer into a resilience engineer.


Part 1: The Diagnostic Mindset

The Three Levels of Problem-Solving 📊

| Level | Approach | Example | Result | Engineering Quality |
|-------|----------|---------|--------|---------------------|
| Level 1: Symptom Treatment | Try again blindly | Claude crashes → restart → works | Got lucky | ❌ Amateur |
| Level 2: Direct Cause | Fix immediate issue | Claude crashes → "Out of memory" → restart with more RAM | Temporary fix | ⚠️ Competent |
| Level 3: Root Cause | Trace causal chain to origin | OOM → Loading entire codebase → No file filtering → Missing CLAUDE.md → No validation | Permanent prevention | ✅ Professional |

Level 3 is engineering.

The Diagnostic Stack (5 Layers)

Every robust system has layers of defense. Think of this like network security layers - each one catches what the previous missed.

| Layer | Focus | Timing | Key Activities | Goal |
|-------|-------|--------|----------------|------|
| 1️⃣ Prevention | Stop problems before they start | Proactive | Health checks, config validation, dependency verification | Eliminate failure causes |
| 2️⃣ Detection | Spot issues immediately | Reactive | Error pattern recognition, baselines, anomaly detection | Fast problem identification |
| 3️⃣ Diagnosis | Understand the "why" | Analytical | Root cause analysis, causal chains, pattern correlation | Find true source |
| 4️⃣ Recovery | Fix and restore | Corrective | Auto-remediation, fallbacks, state restoration | Return to normal operation |
| 5️⃣ Learning | Prevent future occurrences | Evolutionary | Error database, runbooks, process refinement | Continuous improvement |

The goal: Build systems that prevent, detect, diagnose, recover, and learn from failures automatically.

Figure 1: The 5-Layer Diagnostic Stack - Each layer builds resilience from bottom (prevention) to top (learning)

💡 Memory Anchor: Think P-D-D-R-L = "Paddle" through problems systematically.


Part 2: The Prevention Layer

The claude doctor Command

Your first line of defense (learn more in the official docs):

$ claude doctor

What it checks:

| Component | What's Validated | Why It Matters |
|-----------|------------------|----------------|
| ✅ Node.js version | 18+ required | Incompatible versions cause crashes |
| ✅ npm configuration | Proper setup | Permission errors blocked |
| ✅ PATH setup | Binary accessible | "Command not found" prevented |
| ✅ API authentication | Valid key | Connection failures avoided |
| ✅ Configuration files | Syntax & structure | Malformed configs caught early |
| ✅ MCP servers | Connectivity | Tools available when needed |
| ✅ Tool availability | ripgrep, git, etc. | Search & version control work |

Sample Output:

Claude Code Health Check
========================
✓ Node.js: v20.11.0 (OK)
✓ npm: 10.2.4 (OK)
✓ PATH: Claude binary found
✓ Auth: Valid API key
✗ ripgrep: NOT FOUND (install for search functionality)
⚠ CLAUDE.md: Multiple files found (may cause conflicts)
✓ MCP Servers: 2/2 connected

Recommendation: Install ripgrep, consolidate CLAUDE.md files

Fix issues before starting work.

Pre-Flight Health Check Script

Create .claude/scripts/health-check.sh:

#!/bin/bash

set -e

echo "🏥 Claude Code Health Check"
echo "=========================="

# Check Node.js version
NODE_VERSION=$(node -v | cut -d'v' -f2)
REQUIRED_VERSION="18.0.0"

if [ "$(printf '%s\n' "$REQUIRED_VERSION" "$NODE_VERSION" | sort -V | head -n1)" = "$REQUIRED_VERSION" ]; then
  echo "✓ Node.js: v$NODE_VERSION (OK)"
else
  echo "✗ Node.js: v$NODE_VERSION (UPGRADE REQUIRED)"
  exit 1
fi

# Check for conflicting CLAUDE.md files
CLAUDE_MD_COUNT=$(find . -name "CLAUDE.md" -type f | wc -l)
if [ "$CLAUDE_MD_COUNT" -gt 1 ]; then
  echo "⚠ CLAUDE.md: Found $CLAUDE_MD_COUNT files (potential conflicts)"
  find . -name "CLAUDE.md" -type f
else
  echo "✓ CLAUDE.md: Single file (OK)"
fi

# Check ripgrep installation
if command -v rg &> /dev/null; then
  echo "✓ ripgrep: Installed"
else
  echo "⚠ ripgrep: NOT FOUND (search features disabled)"
fi

# Check available RAM
if [ "$(uname)" = "Darwin" ]; then
  # macOS
  TOTAL_RAM=$(sysctl -n hw.memsize | awk '{print int($1/1024/1024/1024)}')
else
  # Linux
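  # free -g rounds down (a 16GB machine reports 15), so round to the nearest GB instead
  TOTAL_RAM=$(free -m | awk '/^Mem:/{print int(($2 + 512) / 1024)}')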
fi

if [ "$TOTAL_RAM" -ge 16 ]; then
  echo "βœ“ RAM: ${TOTAL_RAM}GB (OK)"
else
  echo "⚠ RAM: ${TOTAL_RAM}GB (16GB+ recommended)"
fi

# Validate MCP server configs
if [ -f ~/.claude/config.json ]; then
  if python3 -m json.tool ~/.claude/config.json > /dev/null 2>&1; then
    echo "✓ MCP Config: Valid JSON"
  else
    echo "✗ MCP Config: INVALID JSON"
    exit 1
  fi
else
  echo "ℹ MCP Config: Not configured"
fi

# Check for tool conflicts
if pgrep -x "aider" > /dev/null; then
  echo "⚠ Warning: aider is running (may conflict)"
fi

echo ""
echo "Health check complete!"
echo "Run: claude"

Make executable:

chmod +x .claude/scripts/health-check.sh

Add to SessionStart hook (.claude/settings.json):

{
  "hooks": {
    "SessionStart": [
      { "hooks": [{ "type": "command", "command": ".claude/scripts/health-check.sh" }] }
    ]
  }
}

Now every Claude session starts with automated health validation! ✨

Configuration Health Scoring

Create .claude/scripts/config-score.sh:

#!/bin/bash

SCORE=100

# Deduct points for issues
[ ! -f CLAUDE.md ] && SCORE=$((SCORE-20)) && echo "❌ -20: Missing CLAUDE.md"
[ "$(find . -name "CLAUDE.md" | wc -l)" -gt 1 ] && SCORE=$((SCORE-15)) && echo "⚠️  -15: Multiple CLAUDE.md files"
! command -v rg &> /dev/null && SCORE=$((SCORE-10)) && echo "⚠️  -10: ripgrep not installed"
RAM_GB=$(free -g 2>/dev/null | awk '/^Mem:/{print $2}')  # empty where `free` is unavailable (e.g. macOS)
[ -n "$RAM_GB" ] && [ "$RAM_GB" -lt 16 ] && SCORE=$((SCORE-10)) && echo "⚠️  -10: Low RAM (<16GB)"
[ ! -d .claude/commands ] && SCORE=$((SCORE-5)) && echo "ℹ️  -5: No custom commands"
[ ! -f ~/.claude/config.json ] && SCORE=$((SCORE-5)) && echo "ℹ️  -5: No MCP config"

echo ""
echo "Configuration Health: $SCORE/100"

if [ $SCORE -ge 90 ]; then
  echo "🟢 Excellent configuration"
elif [ $SCORE -ge 70 ]; then
  echo "🟡 Good configuration (some improvements possible)"
elif [ $SCORE -ge 50 ]; then
  echo "🟠 Fair configuration (issues detected)"
else
  echo "🔴 Poor configuration (fix critical issues)"
  exit 1
fi

Run before important tasks:

./.claude/scripts/config-score.sh

Part 3: Error Taxonomy & Pattern Recognition

The 40% Rule 🎯

Research shows: 40% of Claude Code crashes stem from corrupted installs or permission errors.

That means two out of every five problems can be prevented with proper installation and permissions.

Error Category Matrix

| Category | % of Issues | Severity | Preventable? | Time to Fix | Prevention Strategy |
|----------|-------------|----------|--------------|-------------|---------------------|
| 🔧 Installation | 25% | High | ✅ Yes | 5-10 min | Use nvm, validate PATH |
| 🔐 Permissions | 15% | High | ✅ Yes | 2-5 min | Never use sudo with npm |
| 🧠 Context Management | 20% | Medium | ✅ Yes | 1-2 min | 3-File Rule, auto-compact |
| 🌐 Network/API | 15% | Medium | ⚠️ Partial | 5-15 min | Retry logic, fallbacks |
| ⚙️ Configuration | 10% | Low | ✅ Yes | 2-5 min | Validation scripts |
| ⚡ Performance | 10% | Low | ✅ Yes | 1-3 min | Monitor context size |
| 🔌 Integration (IDE/MCP) | 5% | Low | ⚠️ Partial | 10-20 min | Health checks, circuit breakers |

By the table's own figures, the categories marked fully preventable add up to 80% of issues, all addressable with proper setup.
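The shares in the matrix can be tallied directly, which also shows where the 40% figure comes from (Installation 25% + Permissions 15%). The sketch below copies the percentages from the table; `sum_share` is a hypothetical helper name, not part of any tool.

```shell
#!/bin/bash
# sum_share FLAG: add up the "% of Issues" column for rows whose
# "Preventable?" column matches FLAG (yes or partial).
# The figures are copied from the Error Category Matrix.
sum_share() {
  local want=$1 total=0 name share flag
  while IFS=: read -r name share flag; do
    if [ "$flag" = "$want" ]; then
      total=$(( total + share ))
    fi
  done <<EOF
Installation:25:yes
Permissions:15:yes
Context Management:20:yes
Network/API:15:partial
Configuration:10:yes
Performance:10:yes
Integration:5:partial
EOF
  echo "$total"
}

echo "Fully preventable:     $(sum_share yes)%"
echo "Partially preventable: $(sum_share partial)%"
```

Keeping the data in one place makes it easy to re-run the tally if you replace these shares with numbers from your own incident log.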

Figure 2: Error Decision Tree - Quick diagnostic paths for any Claude Code issue

Installation Errors (25% of Issues) 🔧

Error Pattern 1: "Command not found: claude"

5 Whys Analysis:

| Question | Answer | Layer |
|----------|--------|-------|
| 1. Why not found? | Not in PATH | Direct symptom |
| 2. Why not in PATH? | npm global bin not configured | Configuration |
| 3. Why not configured? | User installed with sudo | Installation method |
| 4. Why sudo? | Permission error without sudo | Permission setup |
| 5. Why permission error? | npm default directory requires root | ROOT CAUSE |

Solution (Fix root cause):

# Configure npm to use user directory
mkdir -p ~/.npm-global
npm config set prefix '~/.npm-global'

# Add to PATH
echo 'export PATH=~/.npm-global/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

# Reinstall Claude
npm install -g @anthropic-ai/claude-code

Prevention Checklist:

  • ✅ Use nvm (Node Version Manager) from the start
  • ✅ Never install npm packages with sudo
  • ✅ Validate PATH in health check
  • ✅ Document installation procedure

Error Pattern 2: "Permission denied" on macOS/Linux

Solutions Comparison:

| Option | Time | Difficulty | Permanence | Recommended? |
|--------|------|------------|------------|--------------|
| Fix npm permissions | 2 min | Easy | Permanent | ✅ Best for quick fix |
| Use nvm | 5 min | Easy | Permanent | ✅ Best long-term |
| Fix directory permissions | 3 min | Medium | Risky | ⚠️ Use with caution |

Option 1: Fix npm permissions (Recommended)

mkdir -p ~/.npm-global
npm config set prefix '~/.npm-global'
export PATH=~/.npm-global/bin:$PATH
npm install -g @anthropic-ai/claude-code

Option 2: Use nvm (Best long-term)

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash
nvm install 20
nvm use 20
npm install -g @anthropic-ai/claude-code

Error Pattern 3: Node.js version incompatible

Direct Fix:

# Check current version
node --version

# Upgrade with nvm
nvm install 20
nvm use 20
nvm alias default 20

# Verify
node --version  # Should be 20.x.x

# Reinstall Claude
npm install -g @anthropic-ai/claude-code

Prevention:

  • ✅ Add Node version check to health script
  • ✅ Use .nvmrc file in projects
  • ✅ Document required versions in CLAUDE.md

Context Management Errors (20% of Issues) 🧠

Error Pattern 1: "Out of context" / "Conversation too long"

5 Whys Chain:

❓ Why out of context?
↓
Token limit reached (200,000)
↓
❓ Why limit reached?
↓
Large files loaded
↓
❓ Why large files?
↓
Entire directory referenced (@src/)
↓
❓ Why entire directory?
↓
No file filtering used
↓
❓ Why no filtering?
↓
User didn't know 3-File Rule
↓
🎯 ROOT CAUSE: Lack of education

Immediate Fix Options:

| Command | Effect | When to Use | Context Preserved? |
|---------|--------|-------------|--------------------|
| /compact | Summarize & compress | Same topic, need history | ✅ Partial (summary) |
| /clear | Start fresh | New topic, confusion | ❌ No |
| /status | Check size first | Before deciding | ✅ Yes |

Long-term Solution: The 3-File Rule 🎯

# ❌ Instead of this:
> Review @src/

# ✅ Do this:
> Review @src/auth/login.js @src/middleware/auth.js @tests/auth.test.js

Prevention Hook (PreToolUse):

#!/bin/bash
# .claude/hooks/pre-tool-use.sh

# Detect large directory references
# (matches "@dir/" anywhere in the input, not only at the very end)
if echo "$CLAUDE_TOOL_INPUT" | grep -Eq '@[^[:space:]]*/([[:space:]]|$)'; then
  echo "⚠️  Warning: You're referencing entire directory"
  echo "💡 Tip: Use specific files for better results"
  echo ""
  read -p "Continue anyway? (y/n): " confirm
  [[ $confirm != "y" ]] && exit 1
fi
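The directory check in the hook can be pulled out into small functions and tried against sample prompts before it is wired into anything. This is a minimal sketch: `has_directory_ref` and `list_refs` are hypothetical names, and the regular expression is only a heuristic for "an @-reference that ends in a slash".

```shell
#!/bin/bash
# has_directory_ref TEXT: succeeds when TEXT contains an @-reference ending
# in "/" (for example "@src/"), which suggests a whole-directory reference.
has_directory_ref() {
  printf '%s\n' "$1" | grep -Eq '@[^[:space:]]*/([[:space:]]|$)'
}

# list_refs TEXT: print each @-reference in TEXT on its own line.
list_refs() {
  printf '%s\n' "$1" | grep -Eo '@[^[:space:]]+' || true
}

prompt='Review @src/auth/login.js @src/middleware/auth.js @tests/auth.test.js'
if has_directory_ref "$prompt"; then
  echo "Warning: directory reference found"
else
  echo "OK: $(list_refs "$prompt" | wc -l | tr -d ' ') specific files referenced"
fi
```

Testing the pattern on strings first makes false positives (such as a path that merely contains a slash) visible before the hook starts interrupting real work.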

Performance Errors (10% of Issues) ⚡

Error Pattern: Slow responses (>30s per query)

Diagnostic Decision Tree:

| Check | Command | Normal Range | Action if Abnormal |
|-------|---------|--------------|--------------------|
| Context size | > /status | <60% | /compact if >60% |
| Model used | Ask Claude | Sonnet 4.5 | Switch to faster model |
| Files loaded | /status | <10 files | Remove unnecessary files |
| System CPU | top \| grep claude | <50% | Check for background processes |
| System RAM | ps aux \| grep claude | <4GB | Restart session |
| Network latency | ping api.anthropic.com | <100ms | Check connection |

Token-Saving Hack (from community): 🧠

Instead of asking Claude to read files and make changes:

# ❌ Expensive (loads files into context):
> Read all files in src/components/ and update import paths
# Uses ~50,000 tokens

Do this:

# ✅ Efficient (generates automation):
> Create a bash script that:
> 1. Finds all files in src/components/
> 2. Updates import paths with sed
> 3. Runs and validates changes
> 4. Deletes itself when done

# Uses ~5,000 tokens (10x reduction!)

Why it works: Claude writes the script once and the script runs locally, so the files never have to be loaded into context.
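For a sense of what such a generated script might look like, here is a minimal sketch. `update_imports` is a hypothetical function name, the `.js` filter and the example paths are assumptions, and the demo runs against a throwaway directory rather than a real project.

```shell
#!/bin/bash
# update_imports DIR OLD NEW: replace OLD with NEW in every .js file under DIR.
# OLD is treated as a sed regular expression; neither value may contain "|".
# File names containing newlines are not supported by this sketch.
update_imports() {
  local dir=$1 old=$2 new=$3 file
  find "$dir" -type f -name '*.js' |
    while IFS= read -r file; do
      if grep -q -- "$old" "$file"; then
        # Write to a temp file and move it into place; this avoids the
        # GNU/BSD differences in `sed -i`.
        sed "s|$old|$new|g" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
        echo "updated: $file"
      fi
    done
}

# Demo on a temporary directory so nothing in a real project is touched.
demo_dir=$(mktemp -d)
printf "import { a } from '../utils/a';\n" > "$demo_dir/one.js"
printf "const b = 1;\n" > "$demo_dir/two.js"
update_imports "$demo_dir" '\.\./utils/' '@/utils/'
cat "$demo_dir/one.js"
```

Run a script like this on a clean git working tree so that `git diff` shows exactly what changed and `git checkout .` can undo it.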


Part 4: The Diagnostic Toolkit

Built-In Diagnostic Flags

| Flag | What It Shows | When to Use | Output Level |
|------|---------------|-------------|--------------|
| --verbose | Tool calls, file ops, token usage, performance | Debugging unexpected behavior | 🔊 Detailed |
| --mcp-debug | MCP connections, protocol exchanges, auth flows | MCP tools not working | 🔊 Detailed |
| --no-hooks | Disables all hooks temporarily | Suspecting hook failures | 🔇 Normal |

Example Usage:

# Diagnose tool issues
$ claude --verbose

# Debug MCP problems
$ claude --mcp-debug

# Bypass hooks for testing
$ claude --no-hooks

Session Inspection

Quick Reference Table:

| Task | Command | Use Case |
|------|---------|----------|
| List sessions | ls ~/.claude/sessions/ | Find specific session |
| View latest | cat ~/.claude/sessions/$(ls -t ~/.claude/sessions/ \| head -1) \| jq . | Inspect current state |
| Extract messages | cat ~/.claude/sessions/latest.json \| jq '.messages[] \| {role, content}' | Review conversation |
| Count tokens (estimate) | cat ~/.claude/sessions/latest.json \| jq '.messages[] \| .content' \| wc -w | Check context size |

💡 Token Estimation: words × 1.3 ≈ tokens
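The rule of thumb above is easy to apply with integer shell arithmetic. The helper names below are hypothetical, and the 200,000-token default simply mirrors the limit quoted earlier in this guide; treat the result as a rough estimate, not a tokenizer.

```shell
#!/bin/bash
# estimate_tokens WORDS: apply the words x 1.3 rule of thumb (integer math).
estimate_tokens() {
  echo $(( $1 * 13 / 10 ))
}

# context_pct TOKENS [LIMIT]: percentage of the context window in use.
context_pct() {
  echo $(( $1 * 100 / ${2:-200000} ))
}

words=100000
tokens=$(estimate_tokens "$words")
echo "$words words ~ $tokens tokens ($(context_pct "$tokens")% of the window)"
```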

Custom Diagnostic Scripts

Context Analyzer (.claude/scripts/context-analyzer.sh):

#!/bin/bash

SESSION_FILE=~/.claude/sessions/$(ls -t ~/.claude/sessions/ | head -1)

echo "📊 Context Analysis"
echo "=================="

# Message count (jq reads the file directly; the path is quoted in case it contains spaces)
MSG_COUNT=$(jq '.messages | length' "$SESSION_FILE")
echo "Messages: $MSG_COUNT"

# Token estimate
WORDS=$(jq '.messages[] | .content' "$SESSION_FILE" | wc -w)
TOKENS=$(($WORDS * 13 / 10))
echo "Estimated tokens: $TOKENS"

# Context utilization
LIMIT=200000
PCT=$(($TOKENS * 100 / $LIMIT))
echo "Context usage: $PCT%"

if [ $PCT -gt 80 ]; then
  echo "⚠️  WARNING: High context usage - consider /compact"
elif [ $PCT -gt 60 ]; then
  echo "ℹ️  Moderate context usage - monitor"
else
  echo "✓ Healthy context usage"
fi

# Files referenced
echo ""
echo "Files referenced:"
cat $SESSION_FILE | jq '.messages[] | .content' | grep -o '@[^[:space:]]*' | sort -u

Part 5: Resilience Patterns

Figure 3: Resilience Patterns - From Circuit Breaker through Fallback to Graceful Degradation

Resilience Pattern Comparison

| Pattern | Problem Solved | When to Use | Benefits | Complexity |
|---------|----------------|-------------|----------|------------|
| Circuit Breaker | Repeated failures blocking workflow | Flaky subagents or services | Fails fast, auto-recovery | Medium |
| Retry + Backoff | Transient network errors | Temporary failures | High success rate | Low |
| Fallback | Primary option unavailable | Critical path needs guarantee | Always operational | Low |
| Bulkhead | Resource exhaustion cascades | Parallel workloads | Failure isolation | High |

Circuit Breaker Pattern 🔴→🟢

Problem: Flaky subagent keeps failing, blocking main workflow

Implementation (.claude/scripts/circuit-breaker.sh):

#!/bin/bash

SUBAGENT_NAME=$1
FAILURE_COUNT_FILE="/tmp/claude-circuit-$SUBAGENT_NAME.count"
MAX_FAILURES=3
COOLDOWN_SECONDS=300

# Initialize or read failure count
if [ -f $FAILURE_COUNT_FILE ]; then
  FAILURES=$(cat $FAILURE_COUNT_FILE)
  # Try GNU stat first: on Linux, `stat -f` means "filesystem info" and would pollute the value
  LAST_FAILURE=$(stat -c %Y "$FAILURE_COUNT_FILE" 2>/dev/null || stat -f %m "$FAILURE_COUNT_FILE")
  NOW=$(date +%s)

  # Reset after cooldown period
  if [ $(($NOW - $LAST_FAILURE)) -gt $COOLDOWN_SECONDS ]; then
    echo 0 > $FAILURE_COUNT_FILE
    FAILURES=0
  fi
else
  echo 0 > $FAILURE_COUNT_FILE
  FAILURES=0
fi

# Check if circuit is open
if [ $FAILURES -ge $MAX_FAILURES ]; then
  echo "🔴 Circuit OPEN for $SUBAGENT_NAME (too many failures)"
  echo "Cooldown: $COOLDOWN_SECONDS seconds"
  exit 1
fi

# Try to execute subagent
# (claude-subagent is a placeholder; substitute the command that runs your subagent)
if ! claude-subagent "$SUBAGENT_NAME"; then
  # Increment failure count
  echo $(($FAILURES + 1)) > $FAILURE_COUNT_FILE
  echo "⚠️  Subagent failed ($((FAILURES + 1))/$MAX_FAILURES)"
  exit 1
fi

# Success - reset counter
echo 0 > $FAILURE_COUNT_FILE
echo "✓ Subagent succeeded"

Circuit States:

| State | Condition | Behavior | Visual |
|-------|-----------|----------|--------|
| CLOSED | Failures < threshold | Normal operation | 🟢 |
| OPEN | Failures ≥ threshold | Fail fast | 🔴 |
| HALF-OPEN | After cooldown | Test recovery | 🟡 |
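The three states can be expressed as a pure function of the failure count and the time since the last failure, which makes the transitions easy to test in isolation. This is a sketch under assumptions: `circuit_state` is a hypothetical name, and the defaults (3 failures, 300 seconds) mirror the script above.

```shell
#!/bin/bash
# circuit_state FAILURES LAST_FAILURE_EPOCH NOW_EPOCH [MAX_FAILURES] [COOLDOWN_SECONDS]
# Prints CLOSED, OPEN, or HALF-OPEN.
circuit_state() {
  local failures=$1 last=$2 now=$3 max=${4:-3} cooldown=${5:-300}
  if [ "$failures" -lt "$max" ]; then
    echo "CLOSED"      # below threshold: normal operation
  elif [ $(( now - last )) -gt "$cooldown" ]; then
    echo "HALF-OPEN"   # cooldown elapsed: allow one trial call
  else
    echo "OPEN"        # still cooling down: fail fast
  fi
}

now=$(date +%s)
echo "1 failure, just now:      $(circuit_state 1 "$now" "$now")"
echo "3 failures, just now:     $(circuit_state 3 "$now" "$now")"
echo "3 failures, 10 min ago:   $(circuit_state 3 $(( now - 600 )) "$now")"
```

In the HALF-OPEN state a single trial call decides the next transition: success closes the circuit, failure opens it again for another cooldown.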

Retry with Exponential Backoff 🔄

Backoff Schedule:

| Attempt | Delay | Total Wait | Success Rate |
|---------|-------|------------|--------------|
| 1 | 0s | 0s | ~70% |
| 2 | 2s | 2s | ~85% |
| 3 | 4s | 6s | ~93% |
| 4 | 8s | 14s | ~97% |
| 5 | 16s | 30s | ~99% |
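The Delay and Total Wait columns follow from one rule: attempt 1 runs immediately, and attempt N waits base^(N-1) seconds. The sketch below reproduces those two columns; the function names are hypothetical, and the success rates in the table are not modeled.

```shell
#!/bin/bash
# backoff_delay ATTEMPT [BASE]: seconds to wait before ATTEMPT.
# Attempt 1 runs immediately; attempt N waits BASE^(N-1) seconds.
backoff_delay() {
  local attempt=$1 base=${2:-2} delay=1 i=1
  if [ "$attempt" -le 1 ]; then
    echo 0
    return
  fi
  while [ "$i" -lt "$attempt" ]; do
    delay=$(( delay * base ))
    i=$(( i + 1 ))
  done
  echo "$delay"
}

# total_wait ATTEMPT [BASE]: cumulative wait up to and including ATTEMPT.
total_wait() {
  local attempt=$1 base=${2:-2} total=0 n=1
  while [ "$n" -le "$attempt" ]; do
    total=$(( total + $(backoff_delay "$n" "$base") ))
    n=$(( n + 1 ))
  done
  echo "$total"
}

for n in 1 2 3 4 5; do
  echo "attempt $n: delay $(backoff_delay "$n")s, total $(total_wait "$n")s"
done
```

Production retry loops usually add random jitter to each delay so that many clients do not retry in lockstep; that is left out here to keep the arithmetic visible.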

Implementation (.claude/scripts/retry.sh):

#!/bin/bash

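# The command to retry is this script's own arguments. Running "$@" directly
# (instead of eval on a flattened string) preserves the caller's quoting.
MAX_ATTEMPTS=5
BASE_DELAY=2

for attempt in $(seq 1 $MAX_ATTEMPTS); do
  echo "Attempt $attempt/$MAX_ATTEMPTS..."

  if "$@"; then
    echo "✓ Success on attempt $attempt"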
    exit 0
  fi

  if [ $attempt -lt $MAX_ATTEMPTS ]; then
    DELAY=$(($BASE_DELAY ** $attempt))
    echo "⏳ Waiting ${DELAY}s before retry..."
    sleep $DELAY
  fi
done

echo "✗ Failed after $MAX_ATTEMPTS attempts"
exit 1

Fallback Strategy 🔀

Model Fallback Hierarchy:

| Priority | Model | Use Case | Speed | Quality | Cost |
|----------|-------|----------|-------|---------|------|
| Primary | Opus 4.5 | Best quality needed | Slow | ⭐⭐⭐⭐⭐ | $$$ |
| Fallback 1 | Sonnet 4.5 | Balanced | Medium | ⭐⭐⭐⭐ | $$ |
| Fallback 2 | Haiku | Speed critical | Fast | ⭐⭐⭐ | $ |

Implementation (.claude/scripts/smart-claude.sh):

#!/bin/bash

PROMPT="$*"  # join all arguments into one prompt string

# Try Opus first
if claude --model=opus-4.5 -p "$PROMPT" 2>/dev/null; then
  exit 0
fi

echo "⚠️  Opus unavailable, falling back to Sonnet..."

# Try Sonnet
if claude --model=sonnet-4.5 -p "$PROMPT" 2>/dev/null; then
  exit 0
fi

echo "⚠️  Sonnet unavailable, falling back to Haiku..."

# Try Haiku
if claude --model=haiku -p "$PROMPT"; then
  exit 0
fi

echo "✗ All models unavailable"
exit 1

Bulkhead Isolation 🚢

Resource Limits:

| Resource | Default | Recommended Limit | Reason |
|----------|---------|-------------------|--------|
| Memory | Unlimited | 2GB per subagent | Prevent OOM crashes |
| CPU | Unlimited | 50% per subagent | Fair scheduling |
| Processes | Unlimited | 100 per subagent | Prevent fork bombs |
| File handles | Unlimited | 1024 per subagent | Prevent exhaustion |

Implementation (.claude/agents/isolated-subagent.sh):

#!/bin/bash

AGENT_NAME=$1
MEMORY_LIMIT="2G"  # 2GB max
CPU_LIMIT="0.5"    # Docker's --cpus counts CPUs, so 0.5 = 50% of one core

# Run subagent in isolated container (if Docker available)
# claude-agent:latest is a placeholder image name
if command -v docker &> /dev/null; then
  docker run --rm \
    --memory="$MEMORY_LIMIT" \
    --cpus="$CPU_LIMIT" \
    -v "$(pwd)":/workspace \
    claude-agent:latest \
    "@$AGENT_NAME"
else
  # Fallback to process limits
  ulimit -v 2097152  # 2GB virtual memory
  nice -n 10 claude @$AGENT_NAME
fi
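One detail of the non-Docker branch is worth isolating: a `ulimit` set in the main script applies to everything that runs afterwards. Wrapping the limit in a subshell confines it to a single command. The sketch below shows this for the open-file limit from the table; `run_limited` is a hypothetical helper name.

```shell
#!/bin/bash
# run_limited MAX_OPEN_FILES COMMAND [ARGS...]
# Runs COMMAND in a subshell with a lowered open-file limit, so the limit
# applies to that command only and the calling shell is unaffected.
run_limited() {
  local max_files=$1
  shift
  (
    ulimit -n "$max_files" || exit 1
    "$@"
  )
}

# Demo: the child shell reports the limit it was started with.
run_limited 1024 sh -c 'echo "open-file limit inside: $(ulimit -n)"' ||
  echo "could not apply the limit on this system"
```

The same subshell pattern works for other `ulimit` resources, such as `-v` for virtual memory, although which limits are enforced varies by operating system.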

Part 6: Root Cause Analysis in Action

Case Study 1: Slow Responses (5 Whys) 🐌

5 Whys Analysis Template:

| # | Question | Answer | Category |
|---|----------|--------|----------|
| Q1 | Why are responses slow? | High latency per request | Symptom |
| Q2 | Why is latency high? | Large token count per message | Direct cause |
| Q3 | Why large token count? | Loading entire codebase every request | Behavior |
| Q4 | Why loading entire codebase? | Using @src/ directory reference | Usage pattern |
| Q5 | Why using directory reference? | User doesn't know about 3-File Rule | ROOT CAUSE |

Solution Hierarchy:

| Level | Timeframe | Solution | Impact |
|-------|-----------|----------|--------|
| 1. Immediate | Now | Use /compact, switch to specific files | This session |
| 2. Short-term | Today | Document 3-File Rule in team wiki | This project |
| 3. Long-term | This week | Add PreToolUse hook warning on directory refs | All projects |
| 4. Systematic | This month | Create onboarding training on context management | Organization |

Case Study 2: Hook Failures (Fishbone Diagram) 🐟

Figure 4: Fishbone (Ishikawa) Diagram - Root cause analysis for hook failures across 4 categories

Root Causes by Category:

| Category | Root Cause | Fix Strategy | Prevention |
|----------|------------|--------------|------------|
| 👥 People | Team unaware hooks exist | Document in README | Team training |
| 📋 Process | No logging for debugging | Add structured logging | Standard procedure |
| 🔧 Technology | Missing dependencies (prettier) | Add dependency check | Installation script |
| 🌍 Environment | File system latency | Add timeout handling | Performance monitoring |

Solutions (Address each root cause):

People:

# Document hooks in README.md
cat >> README.md <<EOF

## Claude Code Hooks

We use automated hooks for:
- PostToolUse: Auto-formatting with Prettier
- PreToolUse: Safety validation
- SessionStart: Environment checks

See .claude/hooks/ for details.
EOF

Process:

#!/bin/bash
# .claude/hooks/post-tool-use.sh

LOG_FILE=~/.claude/hook-logs/post-tool-use.log
mkdir -p ~/.claude/hook-logs

echo "[$(date)] Starting PostToolUse hook" >> $LOG_FILE

for file in $CLAUDE_FILE_PATHS; do
  if [[ $file == *.js ]]; then
    echo "[$(date)] Formatting $file" >> $LOG_FILE
    if ! prettier --write "$file" 2>>$LOG_FILE; then
      echo "[$(date)] ERROR formatting $file" >> $LOG_FILE
      exit 1
    fi
  fi
done

echo "[$(date)] PostToolUse hook complete" >> $LOG_FILE

Technology:

#!/bin/bash
# .claude/hooks/post-tool-use.sh
# Add dependency check (the shebang must be the first line of the file)

if ! command -v prettier &> /dev/null; then
  echo "ERROR: prettier not installed"
  echo "Install: npm install -g prettier"
  exit 1
fi

# Rest of hook...

Environment:

#!/bin/bash
# Add timeout for file operations
# (`timeout` is part of GNU coreutils; on macOS it is usually installed as `gtimeout`)

timeout 10s prettier --write "$file" || {
  echo "WARNING: Formatting timeout for $file"
  # Continue anyway
}

Case Study 3: Context Confusion (Fault Tree) 🌳

Diagnostic Checklist:

| Step | Check | Command | Expected Result | Action if Failed |
|------|-------|---------|-----------------|------------------|
| 1 | Multiple CLAUDE.md files | find . -name "CLAUDE.md" | 1 file found | Consolidate |
| 2 | Conflicting rules | Review each file | Consistent rules | Merge or prioritize |
| 3 | Load order priority | Check hierarchy | Project > Personal | Document order |
| 4 | Session state | > /status | Clean context | /clear if confused |

Root Cause: Conflicting rules in different CLAUDE.md files

Solution:

# Consolidate to single source of truth
# Keep project root CLAUDE.md, remove others
rm ./src/backend/CLAUDE.md

# Update root CLAUDE.md with context-specific rules
cat >> ./CLAUDE.md <<EOF

## Code Style by Directory

### Modern code (src/api/, src/services/)
ALWAYS: Use async/await

### Legacy code (src/legacy/)
ALWAYS: Use callbacks (maintain consistency)
EOF

Prevention:

#!/bin/bash
# .claude/scripts/health-check.sh
# Add to health check

CLAUDE_MD_FILES=$(find . -name "CLAUDE.md" -type f)
COUNT=$(echo "$CLAUDE_MD_FILES" | wc -l)

if [ $COUNT -gt 1 ]; then
  echo "⚠️  Multiple CLAUDE.md files detected:"
  echo "$CLAUDE_MD_FILES"
  echo ""
  echo "Recommendation: Consolidate to single file"
  echo "See: https://docs.claude.com/claude-code/configuration#hierarchy"
fi

Case Study 4: MCP Issues (Observability) 🔍

MCP Troubleshooting Matrix:

| Symptom | Check | Tool | Expected | Action |
|---------|-------|------|----------|--------|
| Tools not available | Server status | cat ~/.claude/config.json | Valid config | Fix JSON |
| Connection timeout | Network | curl -v [MCP_URL] | HTTP 200 | Check DNS/firewall |
| Authentication failed | Credentials | --mcp-debug | No auth errors | Update credentials |
| Slow responses | Performance | ping [MCP_HOST] | <100ms | Check network |

Diagnostic Flow:

  1. Metrics: Check config exists and is valid
  2. Logs: Run with --mcp-debug
  3. Traces: Test endpoint directly with curl

Example: DNS Resolution Failure

# Test MCP endpoint directly
$ curl -v https://mcp.company.com/github

# Output:
* Could not resolve host: mcp.company.com
* Closing connection 0
curl: (6) Could not resolve host: mcp.company.com

Root Cause: DNS resolution failure

Solution:

# Check DNS
$ nslookup mcp.company.com
# Server failed

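# Fix: Switch to public DNS temporarily
# (back up the current file first; where /etc/resolv.conf is auto-generated,
# the change may be overwritten by the system's resolver service)
sudo cp /etc/resolv.conf /etc/resolv.conf.bak
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf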

# Verify
$ curl https://mcp.company.com/github
# Success!
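The `curl: (6)` in the output above is curl's exit status, and it can drive triage automatically. The sketch below maps a few exit codes documented in the curl manual to a next step; `explain_curl_exit` and `probe` are hypothetical helper names, and the suggested actions are illustrative.

```shell
#!/bin/bash
# explain_curl_exit CODE: map a curl exit status to a likely next step.
explain_curl_exit() {
  case "$1" in
    0)     echo "ok: endpoint reachable" ;;
    6)     echo "dns: could not resolve host, check DNS settings" ;;
    7)     echo "connect: could not connect, check firewall or server status" ;;
    28)    echo "timeout: operation timed out, check network latency" ;;
    35|60) echo "tls: SSL or certificate problem, check certificates" ;;
    *)     echo "other: see the EXIT CODES section of the curl manual" ;;
  esac
}

# probe URL: run curl quietly against URL and explain the result.
probe() {
  local status=0
  curl -sS -o /dev/null --max-time 10 "$1" || status=$?
  explain_curl_exit "$status"
}

# Exit status 6 is what the DNS failure shown above produces.
explain_curl_exit 6
```

Calling `probe` with your MCP server's URL turns the manual curl check into a single line suitable for a health-check script.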

Part 7: Recovery Playbooks

Figure 5: Recovery Workflow - The Detect → Diagnose → Recover → Learn → Prevent cycle

Playbook Quick Reference

| Playbook | Scenario | Time to Fix | Difficulty | Prerequisites |
|----------|----------|-------------|------------|---------------|
| #1: Installation | Claude command not found | 5-10 min | Easy | npm access |
| #2: Context Overflow | "Out of context" error | 1-2 min | Easy | Current session |
| #3: MCP Connection | Tools unavailable | 10-15 min | Medium | Server access |
| #4: Hook Failure | Operations blocked | 5-10 min | Medium | Hook scripts |
| #5: Performance | Progressively slower | 3-5 min | Easy | System access |

Playbook 1: Installation Recovery 🔧

Diagnostic Decision Tree:

Claude command not found
         ↓
    Is it installed?
    ├─ NO → npm install -g @anthropic-ai/claude-code
    └─ YES → Check PATH
              ├─ Not in PATH → Add npm bin to PATH
              └─ In PATH → Check permissions
                           ├─ Permission errors → Fix npm config
                           └─ Other → Reinstall

Steps:

| Step | Action | Command | Expected Result |
|------|--------|---------|-----------------|
| 1 | Verify installation | npm list -g @anthropic-ai/claude-code | Package listed |
| 2 | If not installed | npm install -g @anthropic-ai/claude-code | Successful install |
| 3 | If installed but not found | export PATH=$(npm config get prefix)/bin:$PATH | Claude accessible |
| 4 | If permission errors | See permission fix section | No sudo needed |
| 5 | Verify | which claude && claude --version | Version displayed |

Playbook 2: Context Overflow Recovery 🧠

Decision Matrix:

| Can Lose History? | Current Work Important? | Need Specific Info? | Recommended Action |
|-------------------|-------------------------|---------------------|--------------------|
| ✅ Yes | N/A | N/A | /clear - fastest |
| ❌ No | ✅ Yes | ❌ No | /compact - preserves summary |
| ❌ No | ✅ Yes | ✅ Yes | Export → Clear → Reload relevant parts |

Export and Reload Pattern:

# In Claude:
> Export our conversation to /tmp/session-backup.md

# Clear context
> /clear

# Reload only relevant parts
> Read the architecture decisions from @/tmp/session-backup.md
> Now let's continue with...

Playbook 3: MCP Connection Recovery 🔌

6-Step Recovery Process:

| Step | Action | Purpose | Command |
|------|--------|---------|---------|
| 1️⃣ | Check server status | Identify which servers are failing | claude --mcp-debug |
| 2️⃣ | Test connectivity | Verify network path | curl -v [MCP_URL] |
| 3️⃣ | Validate config | Check JSON syntax | python3 -m json.tool ~/.claude/config.json |
| 4️⃣ | Restart servers | Clear stuck state | pkill -f mcp-server && npm run mcp-server & |
| 5️⃣ | Restart Claude | Fresh connection | claude --mcp-debug |
| 6️⃣ | Check logs | Find root cause | tail -f ~/.claude/logs/mcp-*.log |

Playbook 4: Hook Failure Recovery ⚡

Emergency Bypass:

# Start Claude without hooks
claude --no-hooks

Diagnosis Steps:

| Step | Check | Command | Look For |
|------|-------|---------|----------|
| 1 | Script exists | ls -la .claude/hooks/ | Files present |
| 2 | Executable | ls -la .claude/hooks/pre-tool-use.sh | rwx permissions |
| 3 | Syntax errors | bash -x .claude/hooks/pre-tool-use.sh | No errors |
| 4 | Dependencies | command -v jq prettier | All tools found |
| 5 | Debug logging | Check /tmp/claude-hook-debug.log | Error messages |

Add Debug Logging:

#!/bin/bash
# .claude/hooks/pre-tool-use.sh

set -x  # Enable debug output
exec 2>> /tmp/claude-hook-debug.log  # Log errors

# Your hook logic...

Playbook 5: Performance Degradation Recovery ⚡

Diagnostic Workflow:

| Symptom | Check | Threshold | Action | Expected Improvement |
|---------|-------|-----------|--------|----------------------|
| Slow responses | Context size | >60% | /compact | 2-3x faster |
| High memory | Process memory | >4GB | Restart session | Fresh state |
| Network lag | API latency | >500ms | Check connection | Faster network |
| Wrong model | Model used | Opus | Switch to Sonnet | 3-5x faster |

Quick Fix Commands:

# 1. Check context size
> /status

# 2. If >60% full: Compact
> /compact

# 3. Check system resources
top -p "$(pgrep -d, claude)"   # Linux top; -d, joins multiple PIDs with commas
ps aux | grep '[c]laude' | awk '{print $6/1024 " MB"}'   # [c] keeps grep from matching itself

# 4. If high memory: Restart
> exit
claude

# 5. If network latency: Test
ping api.anthropic.com

# 6. If model slow: Switch
claude --model=sonnet-4.5

Part 8: Observability for Claude Code

The Observability Triad 📊

Following industry-standard observability practices:

| Pillar | Focus | Question Answered | Data Type | Retention |
|--------|-------|-------------------|-----------|-----------|
| 📊 Metrics | What's happening | How much? How fast? | Quantitative (numbers) | 30 days |
| 📝 Logs | Why it's happening | What failed? Why? | Qualitative (events) | 7 days |
| 🔗 Traces | How it's happening | What's the flow? | Causal (sequences) | 24 hours |

Figure 6: The Observability Triad - Metrics, Logs, and Traces working together for complete system visibility

Implementing Metrics Collection 📊

Key Metrics to Track:

| Metric | Unit | Healthy Range | Warning | Critical |
|--------|------|---------------|---------|----------|
| Token usage | Tokens | <100k | 100k-150k | >150k |
| Response time | Seconds | <5s | 5-15s | >15s |
| Session duration | Minutes | 15-60m | 60-120m | >120m |
| Message count | Count | <50 | 50-100 | >100 |
| Context utilization | Percentage | <60% | 60-80% | >80% |
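The Healthy / Warning / Critical bands in the table reduce to two thresholds per metric. A minimal sketch of that classification follows; `classify_metric` is a hypothetical name, it handles integers only, and it ignores the lower bound shown for session duration.

```shell
#!/bin/bash
# classify_metric VALUE WARN_AT CRITICAL_ABOVE
# healthy:  VALUE <  WARN_AT
# warning:  WARN_AT <= VALUE <= CRITICAL_ABOVE
# critical: VALUE >  CRITICAL_ABOVE
classify_metric() {
  local value=$1 warn=$2 crit=$3
  if [ "$value" -gt "$crit" ]; then
    echo "critical"
  elif [ "$value" -ge "$warn" ]; then
    echo "warning"
  else
    echo "healthy"
  fi
}

echo "tokens 120000: $(classify_metric 120000 100000 150000)"
echo "context 45%:   $(classify_metric 45 60 80)"
echo "messages 101:  $(classify_metric 101 50 100)"
```

Keeping thresholds as arguments rather than constants lets one function serve every row of the table.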

Metrics Collector (.claude/scripts/metrics.sh):

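#!/bin/bash
# Appends one JSON object per run to a daily metrics file.
# The session file path and its .messages[].content layout are assumptions;
# point SESSION_FILE at wherever your session transcript actually lives.

SESSION_FILE="$HOME/.claude/sessions/latest.json"
METRICS_FILE="$HOME/.claude/metrics/$(date +%Y-%m-%d).json"
mkdir -p "$HOME/.claude/metrics"

# Token usage: rough estimate (word count x 1.3), truncated to an integer
# so that the shell arithmetic and JSON output below stay valid
TOKENS=$(jq -r '.messages[] | .content' "$SESSION_FILE" 2>/dev/null | wc -w | awk '{printf "%d", $1 * 1.3}')

# Session duration: file birth time via GNU stat, falling back to BSD/macOS stat
NOW=$(date +%s)
START_TIME=$(stat -c %W "$SESSION_FILE" 2>/dev/null || stat -f %B "$SESSION_FILE" 2>/dev/null)
# Birth time can be missing or 0 on some filesystems; fall back to "now"
[ "${START_TIME:-0}" -gt 0 ] 2>/dev/null || START_TIME=$NOW
DURATION=$(( NOW - START_TIME ))
[ "$DURATION" -gt 0 ] || DURATION=1   # avoid division by zero below

# Message count
MSG_COUNT=$(jq '.messages | length' "$SESSION_FILE" 2>/dev/null)

# Log metrics (jq builds the JSON, so missing values cannot corrupt the file)
jq -cn \
  --arg ts "$(date -Iseconds)" \
  --argjson tokens "${TOKENS:-0}" \
  --argjson duration "$DURATION" \
  --argjson messages "${MSG_COUNT:-0}" \
  '{timestamp: $ts, tokens: $tokens, duration_seconds: $duration,
    message_count: $messages, tokens_per_minute: ($tokens * 60 / $duration | floor)}' \
  >> "$METRICS_FILE"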

Register it as a Stop hook in .claude/settings.json:

{
  "hooks": {
    "Stop": [
      { "hooks": [ { "type": "command", "command": ".claude/scripts/metrics.sh" } ] }
    ]
  }
}

Log Aggregation 📝

Log Levels:

| Level | Use Case | Example | Retention |
|-------|---------|---------|-----------|
| ERROR | Failures requiring action | Hook failed, API error | 30 days |
| WARN | Potential issues | High context usage | 14 days |
| INFO | Normal operations | Session started | 7 days |
| DEBUG | Detailed diagnostics | Tool invocations | 1 day |
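
Retention only helps if something enforces it. The sketch below prunes log files by level using the retention values from the table. It assumes one file per level and day (for example `error-2025-11-06.jsonl`); that naming scheme and both function names are hypothetical, so adapt them to your own layout.

```shell
#!/bin/bash
# retention_days LEVEL -> number of days to keep logs of that level
retention_days() {
  case "$1" in
    ERROR) echo 30 ;;
    WARN)  echo 14 ;;
    INFO)  echo 7 ;;
    DEBUG) echo 1 ;;
    *)     return 1 ;;
  esac
}

# prune_logs DIR: delete DIR/<level>-*.jsonl files older than their retention.
# The per-level file naming is an assumption made for this sketch.
prune_logs() {
  local dir=$1 level days prefix
  for level in ERROR WARN INFO DEBUG; do
    days=$(retention_days "$level")
    prefix=$(printf '%s' "$level" | tr '[:upper:]' '[:lower:]')
    # -mtime +N matches files last modified more than N whole days ago
    find "$dir" -maxdepth 1 -type f -name "${prefix}-*.jsonl" -mtime +"$days" -delete
  done
}
```

Running `prune_logs ~/.claude/logs` from a daily cron job or a session-end hook would keep the directory within the retention windows.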

Structured Logging Hook:

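#!/bin/bash
# .claude/hooks/user-prompt-submit.sh
# Assumes Claude Code passes the hook payload as JSON on stdin,
# with the submitted prompt text in the .prompt field.

LOG_FILE="$HOME/.claude/logs/prompts.jsonl"
mkdir -p "$HOME/.claude/logs"

# Log in JSON Lines format (one JSON object per line).
# jq does the escaping, so quotes or newlines in a prompt cannot break the log.
jq -c --arg ts "$(date -Iseconds)" --arg project "$(basename "$PWD")" --arg user "$(whoami)" \
  '{timestamp: $ts, prompt: .prompt, project: $project, user: $user}' \
  >> "$LOG_FILE"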

Search Logs:

| Query | Command | Use Case |
|-------|---------|---------|
| Find keyword | cat ~/.claude/logs/prompts.jsonl \| jq 'select(.prompt \| contains("refactor"))' | Track specific topics |
| Count by project | cat ~/.claude/logs/prompts.jsonl \| jq -r '.project' \| sort \| uniq -c | Usage analytics |
| Last hour | cat ~/.claude/logs/prompts.jsonl \| jq "select(.timestamp > \"$HOUR_AGO\")" | Recent activity |
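
The "Last hour" query expects a HOUR_AGO variable that has to be set first. Here is one way to do it, as a sketch: the helper name `hour_ago` is illustrative, and the comparison is a plain string comparison, so it assumes every log entry was written with `date -Iseconds` in the same UTC offset.

```shell
#!/bin/bash
# hour_ago: ISO-8601 timestamp for one hour ago, in the same format that
# `date -Iseconds` writes into the log. Tries GNU date first, then BSD/macOS.
hour_ago() {
  date -Iseconds -d '1 hour ago' 2>/dev/null || date -v-1H -Iseconds
}

HOUR_AGO=$(hour_ago)
echo "Showing entries newer than: $HOUR_AGO"
```

With HOUR_AGO set in the current shell, the query from the table can be run as written.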

Distributed Tracing 🔗

Trace Events to Capture:

| Event Type | Trigger | Data Captured | Use Case |
|-----------|---------|---------------|---------|
| session_start | Claude launches | Timestamp, project, config | Performance baseline |
| prompt_submit | User sends prompt | Prompt text, token count | User behavior analysis |
| pre_tool_use | Before tool call | Tool name, input | Workflow understanding |
| post_tool_use | After tool call | Tool result, duration | Performance tracking |
| session_end | Claude exits | Total tokens, duration | Session analysis |

Trace Implementation (.claude/scripts/trace.sh):

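#!/bin/bash
# .claude/scripts/trace.sh
# Source this file (". .claude/scripts/trace.sh") in the shell that launches
# claude: exported functions are only visible to bash child processes.

TRACE_ID=$(uuidgen)
TRACE_FILE="$HOME/.claude/traces/$TRACE_ID.json"
mkdir -p "$HOME/.claude/traces"

echo "{\"trace_id\":\"$TRACE_ID\",\"events\":[]}" > "$TRACE_FILE"

# Function to log trace event
trace_event() {
  local event_type=$1
  local event_data=$2

  # --arg lets jq escape the values; write to a temp file, then swap it in
  jq --arg ts "$(date -Iseconds)" --arg type "$event_type" --arg data "$event_data" \
    '.events += [{timestamp: $ts, type: $type, data: $data}]' \
    "$TRACE_FILE" > "$TRACE_FILE.tmp" && mv "$TRACE_FILE.tmp" "$TRACE_FILE"
}

# Export for use in hooks (TRACE_FILE too, since trace_event depends on it)
export TRACE_ID TRACE_FILE
export -f trace_event

echo "Trace started: $TRACE_ID"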

Analyze Trace:

# View complete trace
cat ~/.claude/traces/$TRACE_ID.json | jq .

# Timeline
cat ~/.claude/traces/$TRACE_ID.json | jq '.events[] | "\(.timestamp): \(.type) - \(.data)"'

Part 9: Zimbra Troubleshooting Case Studies

Case Study 1: Mail Delivery Failures 📧

Diagnostic Workflow:

| Step | Action | Command | Finding |
|------|--------|---------|---------|
| 1️⃣ Check queue | Identify stuck messages | postqueue -p \| grep gmail.com | 47 messages stuck |
| 2️⃣ Check logs | Find error pattern | tail -f /var/log/maillog \| grep gmail.com | "Network is unreachable" |
| 3️⃣ 5 Whys | Root cause analysis | See table below | IPv6 misconfiguration |
| 4️⃣ Fix | Disable or configure IPv6 | postconf -e "inet_protocols = ipv4" | Issue resolved |
| 5️⃣ Verify | Flush queue | postqueue -f | Messages delivered |

5 Whys Analysis:

| # | Question | Answer | Category |
|---|----------|--------|----------|
| Q1 | Why are emails stuck? | Can't connect to Gmail MX servers | Symptom |
| Q2 | Why can't connect? | Network unreachable error | Network |
| Q3 | Why network unreachable? | IPv6 connection failing | Protocol |
| Q4 | Why IPv6 failing? | Server has IPv6 address but no IPv6 route | Configuration |
| Q5 | Why configured without route? | Auto-configuration enabled IPv6 without admin knowledge | ROOT CAUSE |
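
The condition found at Q4 (a global IPv6 address with no IPv6 default route) can be checked directly on a Linux host. The sketch below uses `ip` from iproute2. The function name `ipv6_route_gap` is illustrative, and it takes the command output as arguments so the logic can be exercised without touching the network configuration.

```shell
#!/bin/bash
# ipv6_route_gap ADDR_OUTPUT ROUTE_OUTPUT
# Returns 0 (gap found) when a global IPv6 address exists but no default route does.
ipv6_route_gap() {
  local addrs=$1 routes=$2
  printf '%s\n' "$addrs" | grep -q 'inet6' || return 1       # no global IPv6 address
  printf '%s\n' "$routes" | grep -q '^default' && return 1   # a default route exists
  return 0
}

# Read-only check against the current host (skipped when `ip` is unavailable)
if command -v ip >/dev/null 2>&1; then
  if ipv6_route_gap "$(ip -6 addr show scope global 2>/dev/null)" \
                    "$(ip -6 route show default 2>/dev/null)"; then
    echo "Global IPv6 address without a default route: outbound IPv6 will fail"
  else
    echo "No IPv6 address/route mismatch detected"
  fi
fi
```

Running a check like this as part of a pre-flight health check would surface the misconfiguration before mail starts queueing.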

Solution Options:

| Option | Time | Risk | Permanence | Recommended? |
|--------|------|------|-----------|-------------|
| Disable IPv6 in Postfix | 2 min | Low | Temporary | ✅ Quick fix |
| Configure IPv6 routing | 1 hour | Medium | Permanent | ✅ Long-term |
| Use IPv4 only globally | 5 min | Low | Permanent | ⚠️ Limits future |

Case Study 2: LDAP Authentication Issues 🔐

Root Cause Matrix:

| Category | Root Cause | Impact | Fix Priority | Time to Fix |
|----------|-----------|--------|-------------|-------------|
| People | Users using correct passwords | None | N/A | - |
| Process | No LDAP validation before deploy | High | 🔴 Critical | Add to CI/CD |
| Technology | LDAP bind DN misconfigured | Critical | 🔴 Critical | 5 min |
| Environment | LDAP server load spike | Medium | 🟡 Important | Scale infrastructure |

Diagnostic with Claude Code:

# Use Claude to check LDAP config
claude

> Analyze Zimbra LDAP configuration:
> @/opt/zimbra/conf/localconfig.xml
>
> Check for:
> 1. Bind DN correctness
> 2. Password encryption
> 3. Server connectivity settings

# Claude identifies:
# ❌ ldap_bind_dn: uid=zimbra,cn=users,dc=example,dc=com
# ✅ Should be: uid=zimbra,cn=admins,dc=example,dc=com

Solution:

# Fix bind DN
zmlocalconfig -e ldap_bind_dn="uid=zimbra,cn=admins,dc=example,dc=com"

# Restart Zimbra so LDAP picks up the change
# (zmcontrol restarts the whole stack; it does not take a service name)
zmcontrol restart

# Verify
ldapsearch -x -H ldap://localhost -D "uid=zimbra,cn=admins,dc=example,dc=com" -W

Prevention Hook (.claude/hooks/pre-tool-use.sh):

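#!/bin/bash
# Assumes Claude Code passes the hook payload as JSON on stdin and that
# file-editing tools report their target as .tool_input.file_path.
FILE_PATH=$(jq -r '.tool_input.file_path // empty')

# Detect Zimbra config changes; anything else passes straight through
case "$FILE_PATH" in
  /opt/zimbra/conf/*) ;;
  *) exit 0 ;;
esac

echo "⚠️  Zimbra configuration change detected" >&2
echo "🔍 Running validation..." >&2

# Basic sanity check, not full validation: reading ldap_url fails
# if localconfig.xml cannot be parsed
if ! zmlocalconfig ldap_url >/dev/null 2>&1; then
  echo "❌ LDAP configuration invalid" >&2
  echo "Fix errors before proceeding" >&2
  exit 2   # exit code 2 blocks the tool call; other non-zero codes do not
fi

echo "✅ Configuration valid" >&2

# Hooks run non-interactively, so back up unconditionally instead of prompting
BACKUP_DIR="$HOME/.zimbra-backups/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
cp -r /opt/zimbra/conf/ "$BACKUP_DIR/"
echo "📦 Backup saved: $BACKUP_DIR" >&2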

Case Study 3: Performance Degradation 🐌

Performance Metrics:

| Metric | Normal | Warning | Critical | Action |
|--------|--------|---------|---------|---------|
| Mailbox threads | Running | Stopped | Not responding | Restart mailbox |
| LDAP threads | Running | Slow | Stopped | Restart LDAP |
| Mail queue size | <50 | 50-100 | >100 | Flush queue |
| LDAP response time | <50ms | 50-500ms | >500ms | Optimize queries |

Metrics Collection (.claude/scripts/zimbra-metrics.sh):

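#!/bin/bash

METRICS_FILE="$HOME/zimbra-metrics/$(date +%Y-%m-%d).json"
mkdir -p "$HOME/zimbra-metrics"

# Collect metrics
# zmcontrol status prints "<service> <state>", so the state is the last field
MAILBOX_THREADS=$(zmcontrol status | awk '$1 == "mailbox" {print $NF}')
LDAP_THREADS=$(zmcontrol status | awk '$1 == "ldap" {print $NF}')
# Last line is "-- N Kbytes in M Requests." or "Mail queue is empty"
MAIL_QUEUE=$(postqueue -p | tail -1 | awk '{print (($5 ~ /^[0-9]+$/) ? $5 : 0)}')
# `time` reports on the shell's stderr and cannot be captured through a pipe,
# so measure elapsed wall-clock time directly (GNU date, nanosecond resolution)
START_NS=$(date +%s%N)
ldapsearch -x -b "" -s base >/dev/null 2>&1
LDAP_RESPONSE_MS=$(( ($(date +%s%N) - START_NS) / 1000000 ))

# Log metrics (numeric values are written unquoted so jq can compare them)
cat >> "$METRICS_FILE" <<EOF
{
  "timestamp": "$(date -Iseconds)",
  "mailbox_threads": "$MAILBOX_THREADS",
  "ldap_threads": "$LDAP_THREADS",
  "mail_queue_size": $MAIL_QUEUE,
  "ldap_response_ms": $LDAP_RESPONSE_MS
}
EOF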

Pattern Detected by Claude:

🔍 Pattern detected: LDAP queries spiking every 5 minutes
📊 Average: 50ms, Spike: 2000ms
⏰ Timing correlates with sync job

Root Cause: Inefficient LDAP sync job
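
A spike pattern like this can also be spotted offline from collected samples. This sketch flags any sample whose latency exceeds a multiple of the median. The function name `detect_spikes`, the "TIME LATENCY_MS" input format, and the sample values are all illustrative; they are not measurements from this case study.

```shell
#!/bin/bash
# detect_spikes FACTOR: reads "TIME LATENCY_MS" lines on stdin and prints the
# samples whose latency exceeds FACTOR x the median latency.
detect_spikes() {
  local factor=$1 samples median
  samples=$(cat)
  # The median is robust against the spikes we are trying to find
  median=$(printf '%s\n' "$samples" | awk '{print $2}' | sort -n |
    awk '{v[NR] = $1} END {if (NR) print ((NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2)}')
  printf '%s\n' "$samples" |
    awk -v m="$median" -v f="$factor" '$2 > m * f {print $1, $2 "ms"}'
}

# Illustrative samples only
detect_spikes 10 <<'EOF'
10:00 2000
10:01 48
10:02 52
10:03 50
10:04 49
10:05 1950
10:06 51
EOF
```

With these samples the median is 51 ms, so the 10:00 and 10:05 entries are reported. A regular gap between flagged timestamps is the cue to look for a scheduled job.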

Solution:

# Optimize sync job
vim /opt/zimbra/conf/sync-config.xml

# Change:
# <interval>300</interval>  <!-- 5 minutes -->
# To:
# <interval>3600</interval> <!-- 1 hour -->

# Add connection pooling
# <pool_size>10</pool_size>

# Restart (zmcontrol restarts the whole stack; it does not take a service name)
zmcontrol restart

FAQ

How do I know if an issue is Claude Code or my system? 🔍

Test Isolation:

| Check | Command | Claude Issue? | System Issue? |
|-------|---------|--------------|--------------|
| Run diagnostics | claude doctor | ❌ Checks fail | ✅ Checks pass |
| System resources | top; free -h; df -h | ✅ Normal | ❌ High usage |
| Network | ping api.anthropic.com | ✅ Fast (<100ms) | ❌ Slow/unreachable |
| Other tools | Test npm, git, etc. | ✅ Work | ❌ Also broken |

When should I use /compact vs /clear? 🧠

| Scenario | Use /compact | Use /clear | Reason |
|----------|---------------|-------------|---------|
| Same topic, large context | ✅ | ❌ | Preserves history |
| New unrelated task | ❌ | ✅ | Fresh start needed |
| Claude seems confused | ❌ | ✅ | Context contaminated |
| Refinement of same work | ✅ | ❌ | Keep decisions made |
| Switching features | ❌ | ✅ | Different context |

Rule of thumb: Compact for refinement, clear for reset.

How can I prevent 40% of crashes (corrupted installs)? 🎯

Three-Step Prevention:

| Step | Action | Time | Impact |
|------|--------|------|--------|
| 1️⃣ Use nvm | Install Node Version Manager | 5 min | Avoids most global-install permission issues |
| 2️⃣ Install properly | npm install -g @anthropic-ai/claude-code | 2 min | Clean installation |
| 3️⃣ Validate | claude doctor | 1 min | Catches common setup problems early |

What's the fastest way to diagnose slow responses? ⚡

Quick Diagnostic (2-minute protocol):

| Step | Action | Expected Time | Fix |
|------|--------|--------------|-----|
| 1 | > /status | 5 sec | Check context size |
| 2 | If >60% | 10 sec | > /compact |
| 3 | Ask which model | 5 sec | Switch if needed |
| 4 | Use 3-File Rule | Ongoing | Specific files only |
| 5 | Restart session | 30 sec | Last resort |

How do I debug hooks that fail silently? 🔇

Debug Strategy:

| Level | Technique | Implementation | Output Location |
|-------|-----------|----------------|----------------|
| Basic | Enable debug mode | set -x in hook | stderr |
| Standard | Log to file | exec 2>> /tmp/hook.log | /tmp/hook.log |
| Advanced | Structured logging | JSON format logs | ~/.claude/logs/ |
| Expert | Distributed tracing | trace_event calls | ~/.claude/traces/ |

Quick Debug Template:

#!/bin/bash
# .claude/hooks/my-hook.sh

exec 2>> /tmp/claude-hook-debug.log  # Redirect stderr to a log file first
set -x  # Then enable debug mode, so the whole trace lands in the log

# Your hook logic...

echo "Hook completed successfully"

Can I monitor Claude Code like a production service? 📊

Yes! Implement the Observability Triad:

| Component | Tracks | Storage | Retention | Query Method |
|-----------|--------|---------|-----------|-------------|
| Metrics | Token usage, response time, session duration | JSON files | 30 days | jq queries |
| Logs | Prompts, errors, hook execution | JSONL files | 7 days | grep/jq |
| Traces | Complete workflow from prompt to result | JSON files | 24 hours | jq traces |

See "Part 8: Observability" for full implementation.

What's the best way to handle flaky MCP servers? 🔌

Resilience Strategy Comparison:

| Approach | Fail Fast? | Auto-Recovery? | Implementation | Best For |
|----------|-----------|---------------|----------------|----------|
| Circuit Breaker | ✅ Yes | ✅ Yes | Medium complexity | Repeated failures |
| Retry + Backoff | ❌ No | ✅ Yes | Low complexity | Transient errors |
| Fallback | ✅ Yes | ✅ Yes | Low complexity | Alternative available |
| Bulkhead | N/A | ✅ Yes | High complexity | Resource isolation |

Recommended: Circuit breaker (see "Part 5: Resilience Patterns")
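
For purely transient errors, the Retry + Backoff row is the cheapest place to start. Here is a minimal sketch of the pattern as a shell helper. The name `retry_with_backoff` and its argument order are illustrative, and the delay must be a whole number of seconds.

```shell
#!/bin/bash
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay after each failure. Returns the exit status of the last attempt.
retry_with_backoff() {
  local max_attempts=$1 delay=$2 attempt=1 status=0
  shift 2
  while true; do
    "$@" && return 0
    status=$?
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts (last exit $status)" >&2
      return "$status"
    fi
    echo "attempt $attempt failed (exit $status); retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

A call such as `retry_with_backoff 4 1 some-health-check` (where `some-health-check` stands in for whatever command probes your MCP server) waits 1, 2, and 4 seconds between attempts. Unlike a circuit breaker, this never stops trying early, which is why it suits transient errors rather than repeated failures.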


Conclusion

You now have an industrial-grade troubleshooting framework for Claude Code:

Your Resilience Arsenal ⚡

| Component | Covered? | Key Techniques | Impact |
|-----------|---------|---------------|---------|
| ✅ Diagnostic Stack | Yes | 5 layers: P-D-D-R-L | Systematic approach |
| ✅ Root Cause Analysis | Yes | 5 Whys, Fishbone, Fault Tree | Find true source |
| ✅ Resilience Patterns | Yes | Circuit Breaker, Retry, Bulkhead, Fallback | Self-healing systems |
| ✅ Observability | Yes | Metrics, Logs, Traces | Complete visibility |
| ✅ Recovery Playbooks | Yes | 5 complete runbooks | Fast resolution |
| ✅ Automation | Yes | Self-healing hooks | Prevent and recover |
| ✅ Case Studies | Yes | Real-world Zimbra diagnostics | Practical application |

But here's the key insight: This isn't about fixing problems faster.

It's about building systems that prevent problems and recover automatically when they occur.

The Challenge: Build Your Resilience Stack 🚀

Week-by-Week Implementation:

| Week | Focus | Tasks | Success Metrics |
|------|-------|-------|----------------|
| Week 1: Prevention | Stop problems before they start | Health checks, validators, pre-flight checks, CLAUDE.md docs | 40% fewer issues |
| Week 2: Detection | Spot issues immediately | Metrics collection, structured logging, baselines, anomaly detection | <10s detection time |
| Week 3: Diagnosis | Understand the "why" | Learn 5 Whys, create Fishbone templates, practice Fault Tree, document workflows | Root cause in <5min |
| Week 4: Recovery | Fix and restore fast | Circuit breakers, retry logic, recovery playbooks, self-healing automation | <1min recovery |

Your Mission: After 4 weeks, you should have a resilient Claude Code system that:

  • ✅ Prevents 60%+ of issues before they occur
  • ✅ Detects 90%+ of problems within seconds
  • ✅ Diagnoses root causes systematically
  • ✅ Recovers automatically from failures

Measurement Framework 📊

| Metric | Before | Target After 4 Weeks | Measurement |
|--------|--------|---------------------|-------------|
| Issues per week | Baseline | -60% | Track incidents |
| Mean time to detect | Minutes | <10 seconds | Observability data |
| Mean time to resolve | Hours | <5 minutes | Playbook usage |
| Prevention rate | 0% | 60% | Health checks |
| Auto-recovery rate | 0% | 40% | Resilience patterns |
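
The percentage targets are simple ratios over your own incident counts. Here is a small sketch for checking the "Issues per week" target with integer arithmetic. The function name `percent_reduction` is illustrative and the counts in the example are made up.

```shell
#!/bin/bash
# percent_reduction BEFORE AFTER -> integer percentage drop from BEFORE to AFTER
percent_reduction() {
  local before=$1 after=$2
  if [ "$before" -le 0 ]; then
    echo "BEFORE must be greater than 0" >&2
    return 1
  fi
  echo $(( (before - after) * 100 / before ))
}

# Example with made-up counts: 10 incidents per week before, 4 after
percent_reduction 10 4
```

The example prints 60, which would exactly meet the -60% target.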

Then measure: How many hours did you save by preventing fires instead of fighting them?

For more resources, check the official GitHub repository for community-contributed troubleshooting guides and diagnostic tools.




P.S. If you're still just restarting Claude when it crashes instead of asking "Why did it crash?", you're not engineering, you're hoping. Hope is not a strategy. Build resilience instead.