
7 - The Resilience Engineer's Guide to Claude Code: Debug Like a Pro
Table of Contents
- Introduction
- Part 1: The Diagnostic Mindset
- Part 2: The Prevention Layer
- Part 3: Error Taxonomy & Pattern Recognition
- Part 4: The Diagnostic Toolkit
- Part 5: Resilience Patterns
- Part 6: Root Cause Analysis in Action
- Part 7: Recovery Playbooks
- Part 8: Observability for Claude Code
- Part 9: Zimbra Troubleshooting Case Studies
- FAQ
- Conclusion
Introduction
Here's the truth about debugging: Most people treat symptoms, not diseases.
Claude Code crashes? Restart it. Slow responses? Clear context. Hooks failing? Disable them.
That's not engineering. That's whack-a-mole.
What if instead of firefighting, you built resilient systems that self-heal? What if you applied industrial-grade diagnostic frameworks from distributed systems engineering to your AI development workflow?
This isn't a "common errors and fixes" list. This is a resilience engineering handbook for Claude Code, combining:
- Root Cause Analysis (5 Whys, Fishbone, Fault Tree)
- Resilience Patterns (Circuit Breaker, Retry, Bulkhead, Fallback)
- Observability (Metrics, Logs, Traces)
- Self-Healing Automation (Prevention > detection > cure)
By the end, you won't just fix problems: you'll prevent them and build systems that recover automatically.
Let's transform you from a bug fixer into a resilience engineer.
Part 1: The Diagnostic Mindset
The Three Levels of Problem-Solving
| Level | Approach | Example | Result | Engineering Quality |
|-------|----------|---------|--------|---------------------|
| Level 1: Symptom Treatment | Try again blindly | Claude crashes → restart → works | Got lucky | ❌ Amateur |
| Level 2: Direct Cause | Fix immediate issue | Claude crashes → "Out of memory" → restart with more RAM | Temporary fix | ⚠️ Competent |
| Level 3: Root Cause | Trace causal chain to origin | OOM → Loading entire codebase → No file filtering → Missing CLAUDE.md → No validation | Permanent prevention | ✅ Professional |
Level 3 is engineering.
The Diagnostic Stack (5 Layers)
Every robust system has layers of defense. Think of it like layered network security: each layer catches what the previous one missed.
| Layer | Focus | Timing | Key Activities | Goal |
|-------|-------|--------|----------------|------|
| 1️⃣ Prevention | Stop problems before they start | Proactive | Health checks, config validation, dependency verification | Eliminate failure causes |
| 2️⃣ Detection | Spot issues immediately | Reactive | Error pattern recognition, baselines, anomaly detection | Fast problem identification |
| 3️⃣ Diagnosis | Understand the "why" | Analytical | Root cause analysis, causal chains, pattern correlation | Find true source |
| 4️⃣ Recovery | Fix and restore | Corrective | Auto-remediation, fallbacks, state restoration | Return to normal operation |
| 5️⃣ Learning | Prevent future occurrences | Evolutionary | Error database, runbooks, process refinement | Continuous improvement |
The goal: Build systems that prevent, detect, diagnose, recover, and learn from failures automatically.
Figure 1: The 5-Layer Diagnostic Stack - Each layer builds resilience from bottom (prevention) to top (learning)
💡 Memory Anchor: Think P-D-D-R-L = "Paddle" through problems systematically.
Part 2: The Prevention Layer
The claude doctor Command
Your first line of defense (learn more in the official docs):
$ claude doctor
What it checks:
| Component | What's Validated | Why It Matters |
|-----------|------------------|----------------|
| Node.js version | 18+ required | Incompatible versions cause crashes |
| npm configuration | Proper setup | Permission errors blocked |
| PATH setup | Binary accessible | "Command not found" prevented |
| API authentication | Valid key | Connection failures avoided |
| Configuration files | Syntax & structure | Malformed configs caught early |
| MCP servers | Connectivity | Tools available when needed |
| Tool availability | ripgrep, git, etc. | Search & version control work |
Sample Output:
Claude Code Health Check
========================
✓ Node.js: v20.11.0 (OK)
✓ npm: 10.2.4 (OK)
✓ PATH: Claude binary found
✓ Auth: Valid API key
✗ ripgrep: NOT FOUND (install for search functionality)
⚠ CLAUDE.md: Multiple files found (may cause conflicts)
✓ MCP Servers: 2/2 connected
Recommendation: Install ripgrep, consolidate CLAUDE.md files
Fix issues before starting work.
Pre-Flight Health Check Script
Create .claude/scripts/health-check.sh:
#!/bin/bash
set -e
echo "🏥 Claude Code Health Check"
echo "=========================="
# Check Node.js version
NODE_VERSION=$(node -v | cut -d'v' -f2)
REQUIRED_VERSION="18.0.0"
if [ "$(printf '%s\n' "$REQUIRED_VERSION" "$NODE_VERSION" | sort -V | head -n1)" = "$REQUIRED_VERSION" ]; then
  echo "✓ Node.js: v$NODE_VERSION (OK)"
else
  echo "✗ Node.js: v$NODE_VERSION (UPGRADE REQUIRED)"
  exit 1
fi
# Check for conflicting CLAUDE.md files
CLAUDE_MD_COUNT=$(find . -name "CLAUDE.md" -type f | wc -l)
if [ "$CLAUDE_MD_COUNT" -gt 1 ]; then
  echo "⚠ CLAUDE.md: Found $CLAUDE_MD_COUNT files (potential conflicts)"
  find . -name "CLAUDE.md" -type f
else
  echo "✓ CLAUDE.md: Single file (OK)"
fi
# Check ripgrep installation
if command -v rg &> /dev/null; then
  echo "✓ ripgrep: Installed"
else
  echo "✗ ripgrep: NOT FOUND (search features disabled)"
fi
# Check available RAM
if [ "$(uname)" = "Darwin" ]; then
  # macOS
  TOTAL_RAM=$(sysctl -n hw.memsize | awk '{print int($1/1024/1024/1024)}')
else
  # Linux
  TOTAL_RAM=$(free -g | awk '/^Mem:/{print $2}')
fi
if [ "$TOTAL_RAM" -ge 16 ]; then
  echo "✓ RAM: ${TOTAL_RAM}GB (OK)"
else
  echo "⚠ RAM: ${TOTAL_RAM}GB (16GB+ recommended)"
fi
# Validate MCP server configs
if [ -f ~/.claude/config.json ]; then
  if python3 -m json.tool ~/.claude/config.json > /dev/null 2>&1; then
    echo "✓ MCP Config: Valid JSON"
  else
    echo "✗ MCP Config: INVALID JSON"
    exit 1
  fi
else
  echo "ℹ MCP Config: Not configured"
fi
# Check for tool conflicts
if pgrep -x "aider" > /dev/null; then
  echo "⚠ Warning: aider is running (may conflict)"
fi
echo ""
echo "Health check complete!"
echo "Run: claude"
Make executable:
chmod +x .claude/scripts/health-check.sh
Add to SessionStart hook (.claude/settings.json):
{
  "hooks": {
    "SessionStart": [
      { "hooks": [{ "type": "command", "command": ".claude/scripts/health-check.sh" }] }
    ]
  }
}
Now every Claude session starts with automated health validation!
Configuration Health Scoring
Create .claude/scripts/config-score.sh:
#!/bin/bash
SCORE=100
# Deduct points for issues
[ ! -f CLAUDE.md ] && SCORE=$((SCORE-20)) && echo "❌ -20: Missing CLAUDE.md"
[ "$(find . -name "CLAUDE.md" | wc -l)" -gt 1 ] && SCORE=$((SCORE-15)) && echo "⚠️ -15: Multiple CLAUDE.md files"
! command -v rg &> /dev/null && SCORE=$((SCORE-10)) && echo "⚠️ -10: ripgrep not installed"
# free is Linux-only; skip the RAM check when it is unavailable (e.g. macOS)
RAM_GB=$(free -g 2>/dev/null | awk '/^Mem:/{print $2}')
[ -n "$RAM_GB" ] && [ "$RAM_GB" -lt 16 ] && SCORE=$((SCORE-10)) && echo "⚠️ -10: Low RAM (<16GB)"
[ ! -d .claude/commands ] && SCORE=$((SCORE-5)) && echo "ℹ️ -5: No custom commands"
[ ! -f ~/.claude/config.json ] && SCORE=$((SCORE-5)) && echo "ℹ️ -5: No MCP config"
echo ""
echo "Configuration Health: $SCORE/100"
if [ "$SCORE" -ge 90 ]; then
  echo "🟢 Excellent configuration"
elif [ "$SCORE" -ge 70 ]; then
  echo "🟡 Good configuration (some improvements possible)"
elif [ "$SCORE" -ge 50 ]; then
  echo "🟠 Fair configuration (issues detected)"
else
  echo "🔴 Poor configuration (fix critical issues)"
  exit 1
fi
Run before important tasks:
./.claude/scripts/config-score.sh
Part 3: Error Taxonomy & Pattern Recognition
The 40% Rule
Research shows: 40% of Claude Code crashes stem from corrupted installs or permission errors.
That means nearly half of all problems can be prevented with proper installation and permissions.
Error Category Matrix
| Category | % of Issues | Severity | Preventable? | Time to Fix | Prevention Strategy |
|----------|-------------|----------|--------------|-------------|---------------------|
| Installation | 25% | High | ✅ Yes | 5-10 min | Use nvm, validate PATH |
| Permissions | 15% | High | ✅ Yes | 2-5 min | Never use sudo with npm |
| Context Management | 20% | Medium | ✅ Yes | 1-2 min | 3-File Rule, auto-compact |
| Network/API | 15% | Medium | ⚠️ Partial | 5-15 min | Retry logic, fallbacks |
| Configuration | 10% | Low | ✅ Yes | 2-5 min | Validation scripts |
| Performance | 10% | Low | ✅ Yes | 1-3 min | Monitor context size |
| Integration (IDE/MCP) | 5% | Low | ⚠️ Partial | 10-20 min | Health checks, circuit breakers |
60%+ of issues are fully preventable with proper setup.
Figure 2: Error Decision Tree - Quick diagnostic paths for any Claude Code issue
Installation Errors (25% of Issues)
Error Pattern 1: "Command not found: claude"
5 Whys Analysis:
| Question | Answer | Layer |
|----------|--------|-------|
| 1. Why not found? | Not in PATH | Direct symptom |
| 2. Why not in PATH? | npm global bin not configured | Configuration |
| 3. Why not configured? | User installed with sudo | Installation method |
| 4. Why sudo? | Permission error without sudo | Permission setup |
| 5. Why permission error? | npm default directory requires root | ROOT CAUSE |
Solution (Fix root cause):
# Configure npm to use user directory
mkdir -p ~/.npm-global
npm config set prefix '~/.npm-global'
# Add to PATH
echo 'export PATH=~/.npm-global/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
# Reinstall Claude
npm install -g @anthropic-ai/claude-code
Prevention Checklist:
- ✅ Use nvm (Node Version Manager) from the start
- ✅ Never install npm packages with sudo
- ✅ Validate PATH in health check
- ✅ Document installation procedure
Error Pattern 2: "Permission denied" on macOS/Linux
Solutions Comparison:
| Option | Time | Difficulty | Permanence | Recommended? |
|--------|------|------------|------------|--------------|
| Fix npm permissions | 2 min | Easy | Permanent | ✅ Best for quick fix |
| Use nvm | 5 min | Easy | Permanent | ✅ Best long-term |
| Fix directory permissions | 3 min | Medium | Risky | ⚠️ Use with caution |
Option 1: Fix npm permissions (Recommended)
mkdir -p ~/.npm-global
npm config set prefix '~/.npm-global'
export PATH=~/.npm-global/bin:$PATH
npm install -g @anthropic-ai/claude-code
Option 2: Use nvm (Best long-term)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash
nvm install 20
nvm use 20
npm install -g @anthropic-ai/claude-code
Error Pattern 3: Node.js version incompatible
Direct Fix:
# Check current version
node --version
# Upgrade with nvm
nvm install 20
nvm use 20
nvm alias default 20
# Verify
node --version # Should be 20.x.x
# Reinstall Claude
npm install -g @anthropic-ai/claude-code
Prevention:
- ✅ Add Node version check to health script
- ✅ Use .nvmrc file in projects
- ✅ Document required versions in CLAUDE.md
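The first two prevention items can be combined in a small check. The sketch below is illustrative only: `version_ge` and `check_nvmrc` are hypothetical helper names, and it assumes `.nvmrc` holds a plain numeric version such as `20` or `20.11.0` (not an alias like `lts/*`).

```bash
#!/bin/bash
# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Hypothetical helper: compare the running Node.js against the project's .nvmrc.
# Assumes .nvmrc holds a plain numeric version (an assumption, not an nvm rule).
check_nvmrc() {
  local required current
  [ -f .nvmrc ] || { echo ".nvmrc not found"; return 1; }
  required=$(tr -d 'v[:space:]' < .nvmrc)
  current=$(node -v | tr -d 'v')
  if version_ge "$current" "$required"; then
    echo "Node.js v$current satisfies .nvmrc ($required)"
  else
    echo "Node.js v$current is older than .nvmrc ($required); run: nvm use"
    return 1
  fi
}
```

Calling `check_nvmrc` from the health check script would surface a version mismatch before a session starts.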
Context Management Errors (20% of Issues)
Error Pattern 1: "Out of context" / "Conversation too long"
5 Whys Chain:
Why out of context?
↓
Token limit reached (200,000)
↓
Why limit reached?
↓
Large files loaded
↓
Why large files?
↓
Entire directory referenced (@src/)
↓
Why entire directory?
↓
No file filtering used
↓
Why no filtering?
↓
User didn't know 3-File Rule
↓
ROOT CAUSE: Lack of education
Immediate Fix Options:
| Command | Effect | When to Use | Context Preserved? |
|---------|--------|-------------|-------------------|
| /compact | Summarize & compress | Same topic, need history | ✅ Partial (summary) |
| /clear | Start fresh | New topic, confusion | ❌ No |
| /status | Check size first | Before deciding | ✅ Yes |
Long-term Solution: The 3-File Rule
# ❌ Instead of this:
> Review @src/
# ✅ Do this:
> Review @src/auth/login.js @src/middleware/auth.js @tests/auth.test.js
Prevention Hook (PreToolUse):
#!/bin/bash
# .claude/hooks/pre-tool-use.sh
# Detect large directory references
if echo "$CLAUDE_TOOL_INPUT" | grep -q "@.*/$"; then
  echo "⚠️ Warning: You're referencing an entire directory"
  echo "💡 Tip: Use specific files for better results"
  echo ""
  read -p "Continue anyway? (y/n): " confirm
  [[ $confirm != "y" ]] && exit 1
fi
Performance Errors (10% of Issues)
Error Pattern: Slow responses (>30s per query)
Diagnostic Decision Tree:
| Check | Command | Normal Range | Action if Abnormal |
|-------|---------|--------------|-------------------|
| Context size | > /status | <60% | /compact if >60% |
| Model used | Ask Claude | Sonnet 4.5 | Switch to faster model |
| Files loaded | /status | <10 files | Remove unnecessary files |
| System CPU | top \| grep claude | <50% | Check for background processes |
| System RAM | ps aux \| grep claude | <4GB | Restart session |
| Network latency | ping api.anthropic.com | <100ms | Check connection |
Token-Saving Hack (from community):
Instead of asking Claude to read files and make changes:
# ❌ Expensive (loads files into context):
> Read all files in src/components/ and update import paths
# Uses ~50,000 tokens
Do this:
# ✅ Efficient (generates automation):
> Create a bash script that:
> 1. Finds all files in src/components/
> 2. Updates import paths with sed
> 3. Runs and validates changes
> 4. Deletes itself when done
# Uses ~5,000 tokens (10x reduction!)
Why it works: Claude writes the script once and the script runs locally, so the files themselves never need to be loaded into context.
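For illustration, the kind of script such a prompt might produce could look like the sketch below. The function name `update_imports` and the example path prefixes are assumptions made for this example, not output from Claude; the old prefix is treated as a sed pattern, so a `.` in it matches any character.

```bash
#!/bin/bash
# Hypothetical helper: rewrite one import-path prefix in every .js file under a directory.
# $1 = directory, $2 = old prefix (used as a sed pattern), $3 = new prefix.
update_imports() {
  local dir=$1 old=$2 new=$3
  # -print0 with read -d '' keeps filenames containing spaces intact
  find "$dir" -type f -name '*.js' -print0 |
    while IFS= read -r -d '' file; do
      # Write to a temp file and move it into place, avoiding sed -i portability differences
      sed "s|from '$old|from '$new|g" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    done
}
```

A call such as `update_imports src/components '../utils/' '@/utils/'` touches the files on disk without any of their contents passing through the conversation.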
Part 4: The Diagnostic Toolkit
Built-In Diagnostic Flags
| Flag | What It Shows | When to Use | Output Level |
|------|--------------|-------------|-------------|
| --verbose | Tool calls, file ops, token usage, performance | Debugging unexpected behavior | Detailed |
| --mcp-debug | MCP connections, protocol exchanges, auth flows | MCP tools not working | Detailed |
| --no-hooks | Disables all hooks temporarily | Suspecting hook failures | Normal |
Example Usage:
# Diagnose tool issues
$ claude --verbose
# Debug MCP problems
$ claude --mcp-debug
# Bypass hooks for testing
$ claude --no-hooks
Session Inspection
Quick Reference Table:
| Task | Command | Use Case |
|------|---------|---------|
| List sessions | ls ~/.claude/sessions/ | Find specific session |
| View latest | cat ~/.claude/sessions/$(ls -t ~/.claude/sessions/ \| head -1) \| jq . | Inspect current state |
| Extract messages | cat ~/.claude/sessions/latest.json \| jq '.messages[] \| {role, content}' | Review conversation |
| Count tokens (estimate) | cat ~/.claude/sessions/latest.json \| jq '.messages[] \| .content' \| wc -w | Check context size |
💡 Token Estimation: words × 1.3 ≈ tokens
Custom Diagnostic Scripts
Context Analyzer (.claude/scripts/context-analyzer.sh):
#!/bin/bash
SESSION_FILE=~/.claude/sessions/$(ls -t ~/.claude/sessions/ | head -1)
echo "Context Analysis"
echo "=================="
# Message count
MSG_COUNT=$(jq '.messages | length' "$SESSION_FILE")
echo "Messages: $MSG_COUNT"
# Token estimate (words x 1.3, in integer math)
WORDS=$(jq '.messages[] | .content' "$SESSION_FILE" | wc -w)
TOKENS=$((WORDS * 13 / 10))
echo "Estimated tokens: $TOKENS"
# Context utilization
LIMIT=200000
PCT=$((TOKENS * 100 / LIMIT))
echo "Context usage: $PCT%"
if [ "$PCT" -gt 80 ]; then
  echo "⚠️ WARNING: High context usage - consider /compact"
elif [ "$PCT" -gt 60 ]; then
  echo "ℹ️ Moderate context usage - monitor"
else
  echo "✓ Healthy context usage"
fi
# Files referenced
echo ""
echo "Files referenced:"
jq '.messages[] | .content' "$SESSION_FILE" | grep -o '@[^[:space:]]*' | sort -u
Part 5: Resilience Patterns
Figure 3: Resilience Patterns - From Circuit Breaker through Fallback to Graceful Degradation
Resilience Pattern Comparison
| Pattern | Problem Solved | When to Use | Benefits | Complexity |
|---------|----------------|-------------|----------|------------|
| Circuit Breaker | Repeated failures blocking workflow | Flaky subagents or services | Fails fast, auto-recovery | Medium |
| Retry + Backoff | Transient network errors | Temporary failures | High success rate | Low |
| Fallback | Primary option unavailable | Critical path needs guarantee | Always operational | Low |
| Bulkhead | Resource exhaustion cascades | Parallel workloads | Failure isolation | High |
Circuit Breaker Pattern
Problem: Flaky subagent keeps failing, blocking main workflow
Implementation (.claude/scripts/circuit-breaker.sh):
#!/bin/bash
SUBAGENT_NAME=$1
FAILURE_COUNT_FILE="/tmp/claude-circuit-$SUBAGENT_NAME.count"
MAX_FAILURES=3
COOLDOWN_SECONDS=300
# Initialize or read failure count
if [ -f "$FAILURE_COUNT_FILE" ]; then
  FAILURES=$(cat "$FAILURE_COUNT_FILE")
  # File modification time: GNU stat first, then BSD/macOS stat
  LAST_FAILURE=$(stat -c %Y "$FAILURE_COUNT_FILE" 2>/dev/null || stat -f %m "$FAILURE_COUNT_FILE")
  NOW=$(date +%s)
  # Reset after cooldown period
  if [ $((NOW - LAST_FAILURE)) -gt $COOLDOWN_SECONDS ]; then
    echo 0 > "$FAILURE_COUNT_FILE"
    FAILURES=0
  fi
else
  echo 0 > "$FAILURE_COUNT_FILE"
  FAILURES=0
fi
# Check if circuit is open
if [ "$FAILURES" -ge $MAX_FAILURES ]; then
  echo "🔴 Circuit OPEN for $SUBAGENT_NAME (too many failures)"
  echo "Cooldown: $COOLDOWN_SECONDS seconds"
  exit 1
fi
# Try to execute subagent (claude-subagent stands in for however you launch it)
if ! claude-subagent "$SUBAGENT_NAME"; then
  # Increment failure count
  echo $((FAILURES + 1)) > "$FAILURE_COUNT_FILE"
  echo "⚠️ Subagent failed ($((FAILURES + 1))/$MAX_FAILURES)"
  exit 1
fi
# Success - reset counter
echo 0 > "$FAILURE_COUNT_FILE"
echo "✓ Subagent succeeded"
Circuit States:
| State | Condition | Behavior | Visual |
|-------|-----------|----------|--------|
| CLOSED | Failures < threshold | Normal operation | 🟢 |
| OPEN | Failures ≥ threshold | Fail fast | 🔴 |
| HALF-OPEN | After cooldown | Test recovery | 🟡 |
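The three states can be made explicit with a small pure function. This is a sketch, not part of the script above: `circuit_state` is a hypothetical helper, and its defaults (3 failures, 300 seconds) simply reuse the thresholds from circuit-breaker.sh.

```bash
#!/bin/bash
# Hypothetical helper: derive the circuit state from a failure count and the
# age (in seconds) of the most recent failure.
# $1 = failures, $2 = seconds since last failure,
# $3 = failure threshold (default 3), $4 = cooldown in seconds (default 300).
circuit_state() {
  local failures=$1 seconds_since_failure=$2
  local max_failures=${3:-3} cooldown=${4:-300}
  if [ "$failures" -lt "$max_failures" ]; then
    echo "CLOSED"      # below threshold: normal operation
  elif [ "$seconds_since_failure" -gt "$cooldown" ]; then
    echo "HALF-OPEN"   # cooldown elapsed: allow one trial call
  else
    echo "OPEN"        # at or above threshold and still cooling down: fail fast
  fi
}
```

Keeping the decision in one function makes the thresholds easy to test without touching the counter file.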
Retry with Exponential Backoff
Backoff Schedule:
| Attempt | Delay | Total Wait | Success Rate |
|---------|-------|------------|--------------|
| 1 | 0s | 0s | ~70% |
| 2 | 2s | 2s | ~85% |
| 3 | 4s | 6s | ~93% |
| 4 | 8s | 14s | ~97% |
| 5 | 16s | 30s | ~99% |
Implementation (.claude/scripts/retry.sh):
#!/bin/bash
COMMAND="$*"
MAX_ATTEMPTS=5
BASE_DELAY=2
for attempt in $(seq 1 $MAX_ATTEMPTS); do
  echo "Attempt $attempt/$MAX_ATTEMPTS..."
  if eval "$COMMAND"; then
    echo "✓ Success on attempt $attempt"
    exit 0
  fi
  if [ "$attempt" -lt "$MAX_ATTEMPTS" ]; then
    DELAY=$((BASE_DELAY ** attempt))
    echo "⏳ Waiting ${DELAY}s before retry..."
    sleep "$DELAY"
  fi
done
echo "✗ Failed after $MAX_ATTEMPTS attempts"
exit 1
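To sanity-check the delay and total-wait columns of the backoff schedule, the arithmetic can be reproduced without sleeping. `backoff_schedule` is a hypothetical helper written for this illustration; it uses the same base of 2 as retry.sh.

```bash
#!/bin/bash
# Print "attempt delay total" for each attempt.
# $1 = max attempts (default 5), $2 = base delay in seconds (default 2).
backoff_schedule() {
  local max_attempts=${1:-5} base=${2:-2}
  local total=0 delay attempt
  for attempt in $(seq 1 "$max_attempts"); do
    if [ "$attempt" -eq 1 ]; then
      delay=0                              # first attempt runs immediately
    else
      delay=$(( base ** (attempt - 1) ))   # wait that precedes this attempt
    fi
    total=$(( total + delay ))
    echo "$attempt ${delay}s ${total}s"
  done
}
```

Running `backoff_schedule 5 2` lists the waits 0s, 2s, 4s, 8s, 16s, for a cumulative 30 seconds by the fifth attempt.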
Fallback Strategy
Model Fallback Hierarchy:
| Priority | Model | Use Case | Speed | Quality | Cost |
|----------|-------|----------|-------|---------|------|
| Primary | Opus 4.5 | Best quality needed | Slow | ★★★★★ | $$$ |
| Fallback 1 | Sonnet 4.5 | Balanced | Medium | ★★★★ | $$ |
| Fallback 2 | Haiku | Speed critical | Fast | ★★★ | $ |
Implementation (.claude/scripts/smart-claude.sh):
#!/bin/bash
PROMPT="$*"
# Try Opus first
if claude --model=opus-4.5 -p "$PROMPT" 2>/dev/null; then
  exit 0
fi
echo "⚠️ Opus unavailable, falling back to Sonnet..."
# Try Sonnet
if claude --model=sonnet-4.5 -p "$PROMPT" 2>/dev/null; then
  exit 0
fi
echo "⚠️ Sonnet unavailable, falling back to Haiku..."
# Try Haiku
if claude --model=haiku -p "$PROMPT"; then
  exit 0
fi
echo "✗ All models unavailable"
exit 1
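The same idea generalizes beyond model selection. The sketch below is a hypothetical helper, `try_in_order`, that runs candidate commands until one succeeds; each argument is assumed to be a complete command string.

```bash
#!/bin/bash
# Run each candidate command in order and stop at the first one that succeeds.
# Returns 0 on the first success, 1 if every candidate fails.
try_in_order() {
  local candidate
  for candidate in "$@"; do
    if bash -c "$candidate"; then
      return 0
    fi
    # Report on stderr so the winning command's stdout stays clean
    echo "Fallback: '$candidate' failed, trying next option..." >&2
  done
  return 1
}
```

Because the fallback notices go to stderr, the output of whichever command succeeds can still be piped or captured.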
Bulkhead Isolation
Resource Limits:
| Resource | Default | Recommended Limit | Reason |
|----------|---------|-------------------|--------|
| Memory | Unlimited | 2GB per subagent | Prevent OOM crashes |
| CPU | Unlimited | 50% per subagent | Fair scheduling |
| Processes | Unlimited | 100 per subagent | Prevent fork bombs |
| File handles | Unlimited | 1024 per subagent | Prevent exhaustion |
Implementation (.claude/agents/isolated-subagent.sh):
#!/bin/bash
AGENT_NAME=$1
MEMORY_LIMIT="2g"  # 2GB max
CPU_LIMIT="0.5"    # 50% of one CPU (--cpus takes a CPU count, not a percentage)
# Run subagent in isolated container (if Docker available)
if command -v docker &> /dev/null; then
  docker run --rm \
    --memory="$MEMORY_LIMIT" \
    --cpus="$CPU_LIMIT" \
    -v "$(pwd)":/workspace \
    claude-agent:latest \
    "@$AGENT_NAME"
else
  # Fallback to process limits
  ulimit -v 2097152  # 2GB virtual memory (value is in KB)
  nice -n 10 claude "@$AGENT_NAME"
fi
Part 6: Root Cause Analysis in Action
Case Study 1: Slow Responses (5 Whys)
5 Whys Analysis Template:
| # | Question | Answer | Category |
|---|----------|--------|----------|
| Q1 | Why are responses slow? | High latency per request | Symptom |
| Q2 | Why is latency high? | Large token count per message | Direct cause |
| Q3 | Why large token count? | Loading entire codebase every request | Behavior |
| Q4 | Why loading entire codebase? | Using @src/ directory reference | Usage pattern |
| Q5 | Why using directory reference? | User doesn't know about 3-File Rule | ROOT CAUSE |
Solution Hierarchy:
| Level | Timeframe | Solution | Impact |
|-------|-----------|----------|--------|
| 1. Immediate | Now | Use /compact, switch to specific files | This session |
| 2. Short-term | Today | Document 3-File Rule in team wiki | This project |
| 3. Long-term | This week | Add PreToolUse hook warning on directory refs | All projects |
| 4. Systematic | This month | Create onboarding training on context management | Organization |
Case Study 2: Hook Failures (Fishbone Diagram)
Figure 4: Fishbone (Ishikawa) Diagram - Root cause analysis for hook failures across 4 categories
Root Causes by Category:
| Category | Root Cause | Fix Strategy | Prevention |
|----------|------------|--------------|------------|
| People | Team unaware hooks exist | Document in README | Team training |
| Process | No logging for debugging | Add structured logging | Standard procedure |
| Technology | Missing dependencies (prettier) | Add dependency check | Installation script |
| Environment | File system latency | Add timeout handling | Performance monitoring |
Solutions (Address each root cause):
People:
# Document hooks in README.md
cat >> README.md <<EOF
## Claude Code Hooks
We use automated hooks for:
- PostToolUse: Auto-formatting with Prettier
- PreToolUse: Safety validation
- SessionStart: Environment checks
See .claude/hooks/ for details.
EOF
Process:
#!/bin/bash
# .claude/hooks/post-tool-use.sh
LOG_FILE=~/.claude/hook-logs/post-tool-use.log
mkdir -p ~/.claude/hook-logs
echo "[$(date)] Starting PostToolUse hook" >> "$LOG_FILE"
for file in $CLAUDE_FILE_PATHS; do
  if [[ $file == *.js ]]; then
    echo "[$(date)] Formatting $file" >> "$LOG_FILE"
    if ! prettier --write "$file" 2>>"$LOG_FILE"; then
      echo "[$(date)] ERROR formatting $file" >> "$LOG_FILE"
      exit 1
    fi
  fi
done
echo "[$(date)] PostToolUse hook complete" >> "$LOG_FILE"
Technology:
#!/bin/bash
# .claude/hooks/post-tool-use.sh
# Add dependency check
if ! command -v prettier &> /dev/null; then
  echo "ERROR: prettier not installed"
  echo "Install: npm install -g prettier"
  exit 1
fi
# Rest of hook...
Environment:
#!/bin/bash
# Add timeout for file operations
timeout 10s prettier --write "$file" || {
  echo "WARNING: Formatting timeout for $file"
  # Continue anyway
}
Case Study 3: Context Confusion (Fault Tree)
Diagnostic Checklist:
| Step | Check | Command | Expected Result | Action if Failed |
|------|-------|---------|----------------|-----------------|
| 1 | Multiple CLAUDE.md files | find . -name "CLAUDE.md" | 1 file found | Consolidate |
| 2 | Conflicting rules | Review each file | Consistent rules | Merge or prioritize |
| 3 | Load order priority | Check hierarchy | Project > Personal | Document order |
| 4 | Session state | > /status | Clean context | /clear if confused |
Root Cause: Conflicting rules in different CLAUDE.md files
Solution:
# Consolidate to single source of truth
# Keep project root CLAUDE.md, remove others
rm ./src/backend/CLAUDE.md
# Update root CLAUDE.md with context-specific rules
cat >> ./CLAUDE.md <<EOF
## Code Style by Directory
### Modern code (src/api/, src/services/)
ALWAYS: Use async/await
### Legacy code (src/legacy/)
ALWAYS: Use callbacks (maintain consistency)
EOF
Prevention:
#!/bin/bash
# .claude/scripts/health-check.sh
# Add to health check
CLAUDE_MD_FILES=$(find . -name "CLAUDE.md" -type f)
COUNT=$(echo "$CLAUDE_MD_FILES" | wc -l)
if [ "$COUNT" -gt 1 ]; then
  echo "⚠️ Multiple CLAUDE.md files detected:"
  echo "$CLAUDE_MD_FILES"
  echo ""
  echo "Recommendation: Consolidate to single file"
  echo "See: https://docs.claude.com/claude-code/configuration#hierarchy"
fi
Case Study 4: MCP Issues (Observability)
MCP Troubleshooting Matrix:
| Symptom | Check | Tool | Expected | Action |
|---------|-------|------|---------|---------|
| Tools not available | Server status | cat ~/.claude/config.json | Valid config | Fix JSON |
| Connection timeout | Network | curl -v [MCP_URL] | HTTP 200 | Check DNS/firewall |
| Authentication failed | Credentials | --mcp-debug | No auth errors | Update credentials |
| Slow responses | Performance | ping [MCP_HOST] | <100ms | Check network |
Diagnostic Flow:
- Metrics: Check config exists and is valid
- Logs: Run with --mcp-debug
- Traces: Test endpoint directly with curl
Example: DNS Resolution Failure
# Test MCP endpoint directly
$ curl -v https://mcp.company.com/github
# Output:
* Could not resolve host: mcp.company.com
* Closing connection 0
curl: (6) Could not resolve host: mcp.company.com
Root Cause: DNS resolution failure
Solution:
# Check DNS
$ nslookup mcp.company.com
# Server failed
# Fix: Switch to public DNS temporarily (this overwrites the file, so back it up first)
sudo cp /etc/resolv.conf /etc/resolv.conf.bak
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
# Verify
$ curl https://mcp.company.com/github
# Success!
Part 7: Recovery Playbooks
Figure 5: Recovery Workflow - The Detect → Diagnose → Recover → Learn → Prevent cycle
Playbook Quick Reference
| Playbook | Scenario | Time to Fix | Difficulty | Prerequisites |
|----------|----------|-------------|------------|---------------|
| #1: Installation | Claude command not found | 5-10 min | Easy | npm access |
| #2: Context Overflow | "Out of context" error | 1-2 min | Easy | Current session |
| #3: MCP Connection | Tools unavailable | 10-15 min | Medium | Server access |
| #4: Hook Failure | Operations blocked | 5-10 min | Medium | Hook scripts |
| #5: Performance | Progressively slower | 3-5 min | Easy | System access |
Playbook 1: Installation Recovery
Diagnostic Decision Tree:
Claude command not found
↓
Is it installed?
├─ NO → npm install -g @anthropic-ai/claude-code
└─ YES → Check PATH
   ├─ Not in PATH → Add npm bin to PATH
   └─ In PATH → Check permissions
      ├─ Permission errors → Fix npm config
      └─ Other → Reinstall
Steps:
| Step | Action | Command | Expected Result |
|------|--------|---------|----------------|
| 1 | Verify installation | npm list -g @anthropic-ai/claude-code | Package listed |
| 2 | If not installed | npm install -g @anthropic-ai/claude-code | Successful install |
| 3 | If installed but not found | export PATH=$(npm config get prefix)/bin:$PATH | Claude accessible |
| 4 | If permission errors | See permission fix section | No sudo needed |
| 5 | Verify | which claude && claude --version | Version displayed |
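Step 3 appends the npm bin directory to PATH unconditionally. A guard like the sketch below avoids adding duplicate entries on repeated runs; `path_contains` is a hypothetical helper name introduced for this example.

```bash
#!/bin/bash
# Succeeds when the directory in $1 is already an entry in $PATH.
# Wrapping both sides in colons makes the first and last entries match too.
path_contains() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Example use (commented out so the sketch has no side effects):
# NPM_BIN="$(npm config get prefix)/bin"
# path_contains "$NPM_BIN" || export PATH="$NPM_BIN:$PATH"
```

The colon-wrapping trick prevents a partial match such as `/usr/bin` matching `/usr/bin-extra`.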
Playbook 2: Context Overflow Recovery
Decision Matrix:
| Can Lose History? | Current Work Important? | Need Specific Info? | Recommended Action |
|------------------|------------------------|-------------------|-------------------|
| ✅ Yes | N/A | N/A | /clear - fastest |
| ❌ No | ✅ Yes | ❌ No | /compact - preserves summary |
| ❌ No | ✅ Yes | ✅ Yes | Export → Clear → Reload relevant parts |
Export and Reload Pattern:
# In Claude:
> Export our conversation to /tmp/session-backup.md
# Clear context
> /clear
# Reload only relevant parts
> Read the architecture decisions from @/tmp/session-backup.md
> Now let's continue with...
Playbook 3: MCP Connection Recovery
6-Step Recovery Process:
| Step | Action | Purpose | Command |
|------|--------|---------|---------|
| 1️⃣ | Check server status | Identify which servers are failing | claude --mcp-debug |
| 2️⃣ | Test connectivity | Verify network path | curl -v [MCP_URL] |
| 3️⃣ | Validate config | Check JSON syntax | python3 -m json.tool ~/.claude/config.json |
| 4️⃣ | Restart servers | Clear stuck state | pkill -f mcp-server && npm run mcp-server & |
| 5️⃣ | Restart Claude | Fresh connection | claude --mcp-debug |
| 6️⃣ | Check logs | Find root cause | tail -f ~/.claude/logs/mcp-*.log |
Playbook 4: Hook Failure Recovery
Emergency Bypass:
# Start Claude without hooks
claude --no-hooks
Diagnosis Steps:
| Step | Check | Command | Look For |
|------|-------|---------|---------|
| 1 | Script exists | ls -la .claude/hooks/ | Files present |
| 2 | Executable | ls -la .claude/hooks/pre-tool-use.sh | rwx permissions |
| 3 | Syntax errors | bash -n .claude/hooks/pre-tool-use.sh | No errors |
| 4 | Dependencies | command -v jq prettier | All tools found |
| 5 | Debug logging | Check /tmp/claude-hook-debug.log | Error messages |
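The first three checks can be run across every hook in one pass. The sketch below assumes hooks are `.sh` files in a single directory; `check_hooks` is a hypothetical helper, and it uses `bash -n`, which parses a script without executing it.

```bash
#!/bin/bash
# Report hook scripts that are not executable or fail a syntax check.
# $1 = hooks directory (default .claude/hooks). Returns 1 if any problem is found.
check_hooks() {
  local dir=${1:-.claude/hooks} status=0 hook
  for hook in "$dir"/*.sh; do
    [ -e "$hook" ] || continue          # glob matched nothing
    if [ ! -x "$hook" ]; then
      echo "NOT EXECUTABLE: $hook"
      status=1
      continue
    fi
    # bash -n parses the script without running any of it
    if ! bash -n "$hook" 2>/dev/null; then
      echo "SYNTAX ERROR: $hook"
      status=1
    fi
  done
  return $status
}
```

An empty output with exit status 0 means every hook passed both checks.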
Add Debug Logging:
#!/bin/bash
# .claude/hooks/pre-tool-use.sh
set -x # Enable debug output
exec 2>> /tmp/claude-hook-debug.log # Log errors
# Your hook logic...
Playbook 5: Performance Degradation Recovery
Diagnostic Workflow:
| Symptom | Check | Threshold | Action | Expected Improvement |
|---------|-------|-----------|--------|---------------------|
| Slow responses | Context size | >60% | /compact | 2-3x faster |
| High memory | Process memory | >4GB | Restart session | Fresh state |
| Network lag | API latency | >500ms | Check connection | Faster network |
| Wrong model | Model used | Opus | Switch to Sonnet | 3-5x faster |
Quick Fix Commands:
# 1. Check context size
> /status
# 2. If >60% full: Compact
> /compact
# 3. Check system resources (top -p is Linux syntax; it needs a comma-separated PID list)
top -p "$(pgrep -d, claude)"
ps aux | grep '[c]laude' | awk '{print $6/1024 " MB"}'
# 4. If high memory: Restart
> exit
claude
# 5. If network latency: Test
ping api.anthropic.com
# 6. If model slow: Switch
claude --model=sonnet-4.5
Part 8: Observability for Claude Code
The Observability Triad
Following industry-standard observability practices:
| Pillar | Focus | Question Answered | Data Type | Retention |
|--------|-------|-------------------|-----------|-----------|
| Metrics | What's happening | How much? How fast? | Quantitative (numbers) | 30 days |
| Logs | Why it's happening | What failed? Why? | Qualitative (events) | 7 days |
| Traces | How it's happening | What's the flow? | Causal (sequences) | 24 hours |
Figure 6: The Observability Triad - Metrics, Logs, and Traces working together for complete system visibility
Implementing Metrics Collection
Key Metrics to Track:
| Metric | Unit | Healthy Range | Warning | Critical |
|--------|------|---------------|---------|----------|
| Token usage | Tokens | <100k | 100k-150k | >150k |
| Response time | Seconds | <5s | 5-15s | >15s |
| Session duration | Minutes | 15-60m | 60-120m | >120m |
| Message count | Count | <50 | 50-100 | >100 |
| Context utilization | Percentage | <60% | 60-80% | >80% |
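Thresholds like these are easiest to act on when they live in code. As a minimal sketch, the context-utilization row could be encoded as follows; `classify_context` is a hypothetical function name and expects a whole-number percentage.

```bash
#!/bin/bash
# Map a context-utilization percentage to a health level:
# below 60 is healthy, 60 through 80 is a warning, above 80 is critical.
classify_context() {
  local pct=$1
  if [ "$pct" -gt 80 ]; then
    echo "critical"
  elif [ "$pct" -ge 60 ]; then
    echo "warning"
  else
    echo "healthy"
  fi
}
```

The other rows follow the same shape with different cut-off values, so one function per metric keeps alerts consistent with the table.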
Metrics Collector (.claude/scripts/metrics.sh):
#!/bin/bash
METRICS_FILE=~/.claude/metrics/$(date +%Y-%m-%d).json
SESSION_FILE=~/.claude/sessions/latest.json
mkdir -p ~/.claude/metrics
# Token usage (integer estimate: words x 1.3, so later shell arithmetic works)
WORDS=$(jq '.messages[] | .content' "$SESSION_FILE" 2>/dev/null | wc -w)
TOKENS=$((WORDS * 13 / 10))
# Session duration from file birth time: GNU stat first, then BSD/macOS stat
START_TIME=$(stat -c %W "$SESSION_FILE" 2>/dev/null || stat -f %B "$SESSION_FILE")
NOW=$(date +%s)
DURATION=$((NOW - START_TIME))
[ "$DURATION" -gt 0 ] || DURATION=1  # avoid division by zero below
# Message count
MSG_COUNT=$(jq '.messages | length' "$SESSION_FILE" 2>/dev/null)
# Log metrics
cat >> "$METRICS_FILE" <<EOF
{
  "timestamp": "$(date -Iseconds)",
  "tokens": $TOKENS,
  "duration_seconds": $DURATION,
  "message_count": ${MSG_COUNT:-0},
  "tokens_per_minute": $((TOKENS * 60 / DURATION))
}
EOF
Add to Stop hook:
{
  "hooks": {
    "Stop": [
      { "hooks": [{ "type": "command", "command": ".claude/scripts/metrics.sh" }] }
    ]
  }
}
Log Aggregation
Log Levels:
| Level | Use Case | Example | Retention |
|-------|----------|---------|-----------|
| ERROR | Failures requiring action | Hook failed, API error | 30 days |
| WARN | Potential issues | High context usage | 14 days |
| INFO | Normal operations | Session started | 7 days |
| DEBUG | Detailed diagnostics | Tool invocations | 1 day |
Structured Logging Hook:
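#!/bin/bash
# .claude/hooks/user-prompt-submit.sh
LOG_FILE=~/.claude/logs/prompts.jsonl
mkdir -p ~/.claude/logs
# Log in JSON Lines format (one JSON object per line).
# jq builds the object so quotes and newlines in the prompt are escaped correctly.
jq -cn \
  --arg timestamp "$(date -Iseconds)" \
  --arg prompt "$CLAUDE_USER_PROMPT" \
  --arg project "$(basename "$(pwd)")" \
  --arg user "$(whoami)" \
  '{timestamp: $timestamp, prompt: $prompt, project: $project, user: $user}' >> "$LOG_FILE"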
Search Logs:
| Query | Command | Use Case |
|-------|---------|---------|
| Find keyword | cat ~/.claude/logs/prompts.jsonl \| jq 'select(.prompt \| contains("refactor"))' | Track specific topics |
| Count by project | cat ~/.claude/logs/prompts.jsonl \| jq -r '.project' \| sort \| uniq -c | Usage analytics |
| Last hour | cat ~/.claude/logs/prompts.jsonl \| jq "select(.timestamp > \"$HOUR_AGO\")" | Recent activity |
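The "Last hour" query relies on a `$HOUR_AGO` variable that has to be set first. Here is a minimal sketch of one way to compute it in the same shape `date -Iseconds` writes to the log; `hour_ago` is a hypothetical helper name.

```shell
#!/bin/bash
# hour_ago: print the timestamp one hour before now, formatted like
# `date -Iseconds` output (e.g. 2025-01-01T09:00:00+00:00).
hour_ago() {
  if date -d '1 hour ago' -Iseconds >/dev/null 2>&1; then
    date -d '1 hour ago' -Iseconds      # GNU date (Linux)
  else
    # BSD/macOS date: %z prints +0200, so add the colon by hand.
    date -v-1H +%Y-%m-%dT%H:%M:%S%z |
      sed -E 's/([+-][0-9]{2})([0-9]{2})$/\1:\2/'
  fi
}

HOUR_AGO=$(hour_ago)
echo "$HOUR_AGO"
```

The jq filter compares timestamps as strings, so it is only reliable when every log entry was written with the same UTC offset.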
Distributed Tracing
Trace Events to Capture:
| Event Type | Trigger | Data Captured | Use Case |
|-----------|---------|---------------|---------|
| session_start | Claude launches | Timestamp, project, config | Performance baseline |
| prompt_submit | User sends prompt | Prompt text, token count | User behavior analysis |
| pre_tool_use | Before tool call | Tool name, input | Workflow understanding |
| post_tool_use | After tool call | Tool result, duration | Performance tracking |
| session_end | Claude exits | Total tokens, duration | Session analysis |
Trace Implementation (.claude/scripts/trace.sh):
#!/bin/bash
# .claude/scripts/trace.sh
TRACE_ID=$(uuidgen)
TRACE_FILE=~/.claude/traces/$TRACE_ID.json
mkdir -p ~/.claude/traces
jq -n --arg id "$TRACE_ID" '{trace_id: $id, events: []}' > "$TRACE_FILE"
# Append one event to the trace; --arg keeps quotes in the data from breaking the JSON
trace_event() {
  local event_type=$1
  local event_data=$2
  jq --arg ts "$(date -Iseconds)" --arg type "$event_type" --arg data "$event_data" \
    '.events += [{timestamp: $ts, type: $type, data: $data}]' "$TRACE_FILE" > "$TRACE_FILE.tmp" &&
    mv "$TRACE_FILE.tmp" "$TRACE_FILE"
}
# Export for use in hooks. Exports reach only bash processes started from this shell,
# so source this file (". .claude/scripts/trace.sh") rather than executing it
export TRACE_ID TRACE_FILE
export -f trace_event
echo "Trace started: $TRACE_ID"
Analyze Trace:
# View complete trace
jq . ~/.claude/traces/$TRACE_ID.json
# Timeline (-r prints plain strings instead of JSON-quoted ones)
jq -r '.events[] | "\(.timestamp): \(.type) - \(.data)"' ~/.claude/traces/$TRACE_ID.json
Part 9: Zimbra Troubleshooting Case Studies
Case Study 1: Mail Delivery Failures
Diagnostic Workflow:
| Step | Action | Command | Finding |
|------|--------|---------|---------|
| 1️⃣ Check queue | Identify stuck messages | postqueue -p \| grep gmail.com | 47 messages stuck |
| 2️⃣ Check logs | Find error pattern | tail -f /var/log/maillog \| grep gmail.com | "Network is unreachable" |
| 3️⃣ 5 Whys | Root cause analysis | See table below | IPv6 misconfiguration |
| 4️⃣ Fix | Disable or configure IPv6 | postconf -e "inet_protocols = ipv4" | Issue resolved |
| 5️⃣ Verify | Flush queue | postqueue -f | Messages delivered |
5 Whys Analysis:
| # | Question | Answer | Category |
|---|----------|--------|----------|
| Q1 | Why are emails stuck? | Can't connect to Gmail MX servers | Symptom |
| Q2 | Why can't connect? | Network unreachable error | Network |
| Q3 | Why network unreachable? | IPv6 connection failing | Protocol |
| Q4 | Why IPv6 failing? | Server has IPv6 address but no IPv6 route | Configuration |
| Q5 | Why configured without route? | Auto-configuration enabled IPv6 without admin knowledge | ROOT CAUSE |
Solution Options:
| Option | Time | Risk | Permanence | Recommended? |
|--------|------|------|------------|--------------|
| Disable IPv6 in Postfix | 2 min | Low | Temporary | ✅ Quick fix |
| Configure IPv6 routing | 1 hour | Medium | Permanent | ✅ Long-term |
| Use IPv4 only globally | 5 min | Low | Permanent | ⚠️ Limits future |
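Step 1 greps for a domain you already suspect. When you don't know which destination is stuck, grouping the queue by recipient domain finds it. Here is a minimal sketch, assuming the standard `postqueue -p` layout in which each recipient sits alone on an indented line; `queue_by_domain` is a hypothetical helper name.

```shell
#!/bin/bash
# queue_by_domain: read `postqueue -p` output on stdin and print
# "<count> <domain>" per recipient domain, busiest first.
# Only indented lines holding a single address are counted, which
# skips the sender lines and the parenthesised deferral reasons.
queue_by_domain() {
  awk '
    /^[ \t]+[^ \t(]+@[^ \t]+$/ {
      domain = $1
      sub(/^.*@/, "", domain)        # keep only the part after "@"
      count[tolower(domain)]++
    }
    END { for (d in count) print count[d], d }
  ' | sort -k1,1nr -k2,2
}

# Typical use on the mail server (not run here):
#   postqueue -p | queue_by_domain | head
```

A single domain dominating the list, as gmail.com did in this case, suggests a destination-specific problem rather than a local outage.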
Case Study 2: LDAP Authentication Issues
Root Cause Matrix:
| Category | Root Cause | Impact | Fix Priority | Time to Fix |
|----------|------------|--------|--------------|-------------|
| People | Users using correct passwords | None | N/A | - |
| Process | No LDAP validation before deploy | High | 🔴 Critical | Add to CI/CD |
| Technology | LDAP bind DN misconfigured | Critical | 🔴 Critical | 5 min |
| Environment | LDAP server load spike | Medium | 🟡 Important | Scale infrastructure |
Diagnostic with Claude Code:
# Use Claude to check LDAP config
claude
> Analyze Zimbra LDAP configuration:
> @/opt/zimbra/conf/localconfig.xml
>
> Check for:
> 1. Bind DN correctness
> 2. Password encryption
> 3. Server connectivity settings
# Claude identifies:
# ❌ ldap_bind_dn: uid=zimbra,cn=users,dc=example,dc=com
# ✅ Should be: uid=zimbra,cn=admins,dc=example,dc=com
Solution:
# Fix bind DN
zmlocalconfig -e ldap_bind_dn="uid=zimbra,cn=admins,dc=example,dc=com"
# Restart LDAP
zmcontrol restart ldap
# Verify
ldapsearch -x -H ldap://localhost -D "uid=zimbra,cn=admins,dc=example,dc=com" -W
Prevention Hook (.claude/hooks/pre-tool-use.sh):
#!/bin/bash
# Detect Zimbra config changes
if echo "$CLAUDE_FILE_PATHS" | grep -q "/opt/zimbra/conf/"; then
  echo "⚠️ Zimbra configuration change detected"
  echo "Running validation..."
  # Validate LDAP config
  if ! zmlocalconfig -c ldap 2>/dev/null; then
    # Report on stderr and exit with code 2, which Claude Code treats as a blocking error
    echo "❌ LDAP configuration invalid" >&2
    echo "Fix errors before proceeding" >&2
    exit 2
  fi
  echo "✅ Configuration valid"
  # Hooks run without a terminal, so back up unconditionally instead of prompting
  BACKUP_DIR=~/.zimbra-backups/$(date +%Y%m%d-%H%M%S)
  mkdir -p "$BACKUP_DIR"
  cp -r /opt/zimbra/conf/ "$BACKUP_DIR/"
  echo "📦 Backup saved: $BACKUP_DIR"
fi
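The backup step is the part of this hook you most want to trust, so it helps to pull it into a function you can test without touching a real Zimbra install. Here is a minimal sketch; `backup_dir` and its arguments are hypothetical names for illustration.

```shell
#!/bin/bash
# backup_dir SRC DEST_ROOT
# Copies the contents of SRC (dotfiles included) into a new timestamped
# directory under DEST_ROOT and prints that directory's path.
# Returns non-zero if SRC is missing or the copy fails.
backup_dir() {
  local src="$1" dest_root="$2" dest
  if [ ! -d "$src" ]; then
    echo "backup_dir: no such directory: $src" >&2
    return 1
  fi
  dest="$dest_root/$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest" || return 1
  cp -a "$src/." "$dest/" || return 1
  echo "$dest"
}

# In the hook this could replace the inline copy (not run here):
#   BACKUP_DIR=$(backup_dir /opt/zimbra/conf ~/.zimbra-backups) || exit 2
```

Two backups taken within the same second share a directory name, so the second overwrites the first. Add a suffix if that matters for your workflow.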
Case Study 3: Performance Degradation
Performance Metrics:
| Metric | Normal | Warning | Critical | Action |
|--------|--------|---------|----------|--------|
| Mailbox threads | Running | Stopped | Not responding | Restart mailbox |
| LDAP threads | Running | Slow | Stopped | Restart LDAP |
| Mail queue size | <50 | 50-100 | >100 | Flush queue |
| LDAP response time | <50ms | 50-500ms | >500ms | Optimize queries |
Metrics Collection (.claude/scripts/zimbra-metrics.sh):
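#!/bin/bash
METRICS_FILE=~/zimbra-metrics/$(date +%Y-%m-%d).json
mkdir -p ~/zimbra-metrics
# Collect metrics
# The service state (for example "Running" or "Stopped") is the last field of each status line
MAILBOX_THREADS=$(zmcontrol status | grep mailbox | awk '{print $NF}')
LDAP_THREADS=$(zmcontrol status | grep ldap | awk '{print $NF}')
# The last line of postqueue -p reads "-- N Kbytes in M Requests."; an empty queue has no count
MAIL_QUEUE=$(postqueue -p | tail -1 | awk '{print $5}')
# Time one base-scope search in milliseconds. The shell's `time` keyword writes outside
# the pipeline, so its output cannot be captured with grep; %N requires GNU date
START_NS=$(date +%s%N)
ldapsearch -x -b "" -s base >/dev/null 2>&1
LDAP_RESPONSE_TIME=$(( ($(date +%s%N) - START_NS) / 1000000 ))
# Log metrics
cat >> "$METRICS_FILE" <<EOF
{
"timestamp": "$(date -Iseconds)",
"mailbox_threads": "$MAILBOX_THREADS",
"ldap_threads": "$LDAP_THREADS",
"mail_queue_size": "${MAIL_QUEUE:-0}",
"ldap_response_ms": "$LDAP_RESPONSE_TIME"
}
EOF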
Pattern Detected by Claude:
Pattern detected: LDAP queries spiking every 5 minutes
Average: 50ms, Spike: 2000ms
Timing correlates with sync job
Root Cause: Inefficient LDAP sync job
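You don't need Claude to spot a pattern like this once the samples exist. Here is a minimal sketch of a spike detector, assuming the latency samples have been flattened to `<epoch_seconds> <latency_ms>` lines. That input format and the name `find_spikes` are assumptions of this example. It uses the median as the baseline because a few 2000ms samples would drag a plain average upward.

```shell
#!/bin/bash
# find_spikes FACTOR
# Reads "<epoch_seconds> <latency_ms>" lines on stdin and prints, in time
# order, the samples whose latency exceeds FACTOR times the median.
find_spikes() {
  local factor="$1"
  sort -k2,2n | awk -v factor="$factor" '
    { ts[NR] = $1; ms[NR] = $2 }
    END {
      if (NR == 0) exit
      # Input arrives sorted by latency, so the middle row is the median
      # (the lower of the two middle rows when NR is even).
      median = ms[int((NR + 1) / 2)]
      for (i = 1; i <= NR; i++)
        if (ms[i] > factor * median) print ts[i], ms[i]
    }
  ' | sort -k1,1n
}
```

If the timestamps it prints are a constant distance apart (300 seconds in this case study), look for a scheduled job with that interval.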
Solution:
# Optimize sync job
vim /opt/zimbra/conf/sync-config.xml
# Change:
# <interval>300</interval> <!-- 5 minutes -->
# To:
# <interval>3600</interval> <!-- 1 hour -->
# Add connection pooling
# <pool_size>10</pool_size>
# Restart
zmcontrol restart sync
FAQ
How do I know if an issue is Claude Code or my system?
Test Isolation:
| Check | Command | Claude Issue? | System Issue? |
|-------|---------|---------------|---------------|
| Run diagnostics | claude doctor | ❌ Checks fail | ✅ Checks pass |
| System resources | top; free -h; df -h | ✅ Normal | ❌ High usage |
| Network | ping api.anthropic.com | ✅ Fast (<100ms) | ❌ Slow/unreachable |
| Other tools | Test npm, git, etc. | ✅ Work | ❌ Also broken |
When should I use /compact vs /clear?
| Scenario | Use /compact | Use /clear | Reason |
|----------|--------------|------------|--------|
| Same topic, large context | ✅ | ❌ | Preserves history |
| New unrelated task | ❌ | ✅ | Fresh start needed |
| Claude seems confused | ❌ | ✅ | Context contaminated |
| Refinement of same work | ✅ | ❌ | Keep decisions made |
| Switching features | ❌ | ✅ | Different context |
Rule of thumb: Compact for refinement, clear for reset.
How can I prevent 40% of crashes (corrupted installs)?
Three-Step Prevention:
| Step | Action | Time | Impact |
|------|--------|------|--------|
| 1️⃣ Use nvm | Install Node Version Manager | 5 min | Prevents 90% of permission issues |
| 2️⃣ Install properly | npm install -g @anthropic-ai/claude-code | 2 min | Clean installation |
| 3️⃣ Validate | claude doctor | 1 min | Catches 100% of setup problems |
What's the fastest way to diagnose slow responses?
Quick Diagnostic (2-minute protocol):
| Step | Action | Expected Time | Fix |
|------|--------|--------------|-----|
| 1 | > /status | 5 sec | Check context size |
| 2 | If >60% | 10 sec | > /compact |
| 3 | Ask which model | 5 sec | Switch if needed |
| 4 | Use 3-File Rule | Ongoing | Specific files only |
| 5 | Restart session | 30 sec | Last resort |
How do I debug hooks that fail silently?
Debug Strategy:
| Level | Technique | Implementation | Output Location |
|-------|-----------|----------------|----------------|
| Basic | Enable debug mode | set -x in hook | stderr |
| Standard | Log to file | exec 2>> /tmp/hook.log | /tmp/hook.log |
| Advanced | Structured logging | JSON format logs | ~/.claude/logs/ |
| Expert | Distributed tracing | trace_event calls | ~/.claude/traces/ |
Quick Debug Template:
#!/bin/bash
# .claude/hooks/my-hook.sh
set -x # Enable debug mode
exec 2>> /tmp/claude-hook-debug.log # Log to file
# Your hook logic...
echo "Hook completed successfully"
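If you'd rather not edit every hook, a wrapper can do the logging from outside. Here is a minimal sketch; `run_logged` is a hypothetical helper and the log path in the usage note is just an example. It leaves stdout untouched, captures stderr, and records the exit code, which is usually what a silently failing hook is hiding.

```shell
#!/bin/bash
# run_logged LOG_FILE COMMAND [ARGS...]
# Runs COMMAND, appends its stderr plus one summary line (command name,
# exit code, elapsed seconds) to LOG_FILE, and returns COMMAND's exit code.
run_logged() {
  local log_file="$1"; shift
  local start rc
  start=$(date +%s)
  "$@" 2>> "$log_file"
  rc=$?
  printf '%s cmd=%s exit=%d elapsed=%ds\n' \
    "$(date +%Y-%m-%dT%H:%M:%S)" "$1" "$rc" "$(( $(date +%s) - start ))" \
    >> "$log_file"
  return "$rc"
}

# Example (not run here):
#   run_logged /tmp/claude-hook-debug.log .claude/hooks/my-hook.sh
```

Because the wrapper returns the hook's own exit code, whatever invoked the hook still sees the same success or failure.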
Can I monitor Claude Code like a production service?
Yes! Implement the Observability Triad:
| Component | Tracks | Storage | Retention | Query Method |
|-----------|--------|---------|-----------|--------------|
| Metrics | Token usage, response time, session duration | JSON files | 30 days | jq queries |
| Logs | Prompts, errors, hook execution | JSONL files | 7 days | grep/jq |
| Traces | Complete workflow from prompt to result | JSON files | 24 hours | jq traces |
See "Part 8: Observability" for full implementation.
What's the best way to handle flaky MCP servers?
Resilience Strategy Comparison:
| Approach | Fail Fast? | Auto-Recovery? | Implementation | Best For |
|----------|------------|----------------|----------------|----------|
| Circuit Breaker | ✅ Yes | ✅ Yes | Medium complexity | Repeated failures |
| Retry + Backoff | ❌ No | ✅ Yes | Low complexity | Transient errors |
| Fallback | ✅ Yes | ✅ Yes | Low complexity | Alternative available |
| Bulkhead | N/A | ✅ Yes | High complexity | Resource isolation |
Recommended: Circuit breaker (see "Part 5: Resilience Patterns")
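For purely transient errors, the lower-complexity "Retry + Backoff" row is often enough on its own. Here is a minimal sketch with exponential backoff; `retry_with_backoff` is a hypothetical helper, delays are whole seconds, and the health-check URL in the usage comment is made up.

```shell
#!/bin/bash
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, sleeping
# BASE_DELAY, 2*BASE_DELAY, 4*BASE_DELAY, ... seconds between attempts.
# Returns 0 on success, otherwise the exit code of the last attempt.
retry_with_backoff() {
  local max_attempts="$1" delay="$2"; shift 2
  local attempt=1 rc
  while true; do
    "$@"
    rc=$?
    [ "$rc" -eq 0 ] && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after $attempt attempts (exit $rc)" >&2
      return "$rc"
    fi
    echo "retry: attempt $attempt failed, next try in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Example with a made-up local health endpoint (not run here):
#   retry_with_backoff 4 1 curl -fsS http://localhost:3000/health
```

Retries keep hammering a server that is genuinely down, which is why the circuit breaker is the better fit for repeated failures.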
Conclusion
You now have an industrial-grade troubleshooting framework for Claude Code:
Your Resilience Arsenal
| Component | Covered? | Key Techniques | Impact |
|-----------|----------|----------------|--------|
| ✅ Diagnostic Stack | Yes | 5 layers: P-D-D-R-L | Systematic approach |
| ✅ Root Cause Analysis | Yes | 5 Whys, Fishbone, Fault Tree | Find true source |
| ✅ Resilience Patterns | Yes | Circuit Breaker, Retry, Bulkhead, Fallback | Self-healing systems |
| ✅ Observability | Yes | Metrics, Logs, Traces | Complete visibility |
| ✅ Recovery Playbooks | Yes | 5 complete runbooks | Fast resolution |
| ✅ Automation | Yes | Self-healing hooks | Prevent and recover |
| ✅ Case Studies | Yes | Real-world Zimbra diagnostics | Practical application |
But here's the key insight: This isn't about fixing problems faster.
It's about building systems that prevent problems and recover automatically when they occur.
The Challenge: Build Your Resilience Stack
Week-by-Week Implementation:
| Week | Focus | Tasks | Success Metrics |
|------|-------|-------|-----------------|
| Week 1: Prevention | Stop problems before they start | Health checks, validators, pre-flight checks, CLAUDE.md docs | 40% fewer issues |
| Week 2: Detection | Spot issues immediately | Metrics collection, structured logging, baselines, anomaly detection | <10s detection time |
| Week 3: Diagnosis | Understand the "why" | Learn 5 Whys, create Fishbone templates, practice Fault Tree, document workflows | Root cause in <5min |
| Week 4: Recovery | Fix and restore fast | Circuit breakers, retry logic, recovery playbooks, self-healing automation | <1min recovery |
Your Mission: After 4 weeks, you should have a resilient Claude Code system that:
- ✅ Prevents 60%+ of issues before they occur
- ✅ Detects 90%+ of problems within seconds
- ✅ Diagnoses root causes systematically
- ✅ Recovers automatically from failures
Measurement Framework
| Metric | Before | Target After 4 Weeks | Measurement |
|--------|--------|----------------------|-------------|
| Issues per week | Baseline | -60% | Track incidents |
| Mean time to detect | Minutes | <10 seconds | Observability data |
| Mean time to resolve | Hours | <5 minutes | Playbook usage |
| Prevention rate | 0% | 60% | Health checks |
| Auto-recovery rate | 0% | 40% | Resilience patterns |
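These targets only mean something if you compute them the same way every week. Here is a minimal sketch, assuming you keep an incident log with one `<start_epoch> <end_epoch>` pair per line (detection to resolution for mean time to resolve, occurrence to detection for mean time to detect). The log format and both function names are assumptions of this example.

```shell
#!/bin/bash
# mean_minutes: read "<start_epoch> <end_epoch>" lines on stdin and print
# the mean interval in minutes to one decimal place ("n/a" if no rows).
mean_minutes() {
  awk '
    NF >= 2 { total += $2 - $1; n++ }
    END {
      if (n == 0) { print "n/a"; exit }
      printf "%.1f\n", total / n / 60
    }
  '
}

# percent_change BEFORE AFTER: signed change relative to BEFORE,
# e.g. 10 issues per week down to 4 prints -60%.
percent_change() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    if (before == 0) { print "n/a"; exit }
    printf "%+.0f%%\n", (after - before) / before * 100
  }'
}
```

Feed `mean_minutes` last week's incidents and `percent_change` the before and after issue counts to fill in the table.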
Then measure: How many hours did you save by preventing fires instead of fighting them?
For more resources, check the official GitHub repository for community-contributed troubleshooting guides and diagnostic tools.
P.S. If you're still just restarting Claude when it crashes instead of asking "Why did it crash?", you're not engineering; you're hoping. Hope is not a strategy. Build resilience instead.