Cross-Platform Troubleshooting Runbook
Technical incident response patterns for self-hosted WordPress, Drupal, Odoo, Rails, and Next.js systems
Jump by Symptom
Site Down or 5xx Spike
- Confirm blast radius: all routes, subset of routes, or single service dependency.
- Check infrastructure first: CPU saturation, memory pressure, connection pool exhaustion, queue backlog.
- Freeze new deploys and config mutations.
- Evaluate rollback condition within 5 minutes based on user-impact KPI.
- If rollback is cleaner than patch, roll back and continue diagnosis in staging.
Slow After Update
- Compare p50/p95 latency before and after release by route.
- Inspect DB slow query logs and lock wait changes.
- Toggle cache layers one by one (edge/page/object/query cache).
- Identify new synchronous calls and external dependency lag.
- Apply narrow optimization patch, then re-measure against pre-release baseline.
Errors After Release
- Segment by error class: validation, runtime exception, dependency mismatch, migration drift.
- Inspect release artifact diff: packages/modules/plugins/theme/custom code.
- Validate data/schema compatibility for write paths.
- Reproduce in staging with production-like configuration and seed data.
- Choose fast rollback or hotfix path based on data-integrity risk.
Suspected Security Incident
- Contain first: disable compromised accounts, rotate credentials, isolate suspicious nodes.
- Preserve evidence: immutable copies of logs, process snapshots, and request traces.
- Identify entry vector: vulnerable dependency, credential reuse, exposed endpoint, or misconfiguration.
- Patch and validate in staging before production promotion.
- Run post-incident hardening: WAF rules, least privilege, automated advisories, and recovery drills.
Incident Triage Framework (First 15 Minutes)
- Classify impact: availability, data integrity, security, or performance.
- Freeze risky changes: stop deployments and configuration churn.
- Capture evidence: current logs, metrics snapshots, request traces, and failing URLs.
- Define rollback threshold: if business KPI drops below threshold, roll back immediately.
- Assign one commander and one comms owner to avoid coordination deadlocks.
Diagnostic Decision Tree
Symptom A: Global timeout or 5xx spike. Start with infrastructure saturation (CPU, memory, DB connections, queue backlog).
Symptom B: Admin works, frontend fails. Check cache invalidation, route-level middleware, and CDN edge rules.
Symptom C: Some pages fail after update. Check plugin/module compatibility and template overrides.
Symptom D: Latency drift without errors. Investigate slow query percentiles, N+1 patterns, and lock contention.
WordPress Advanced Troubleshooting
Plugin Conflict Isolation: Disable in groups by domain (SEO, caching, security, forms), then binary-search to offending plugin.
Fatal Error Tracing: Enable debug logging and inspect stack traces for hooks executed before failure.
Cache Layer Split Testing: Toggle object cache, page cache, and CDN cache independently to find stale-layer defects.
Database Recovery: Run table checks, verify collation drift, and test point-in-time restoration in staging before production apply.
Drupal Advanced Troubleshooting
Module Dependency Breaks: Map dependency graph before uninstall/upgrade and validate config schema changes.
Container Rebuild Errors: Inspect service definitions and stale compiled container cache artifacts.
Config Sync Drift: Compare active config vs exported config and isolate environment-specific overrides.
Render Cache Poisoning: Validate cache contexts/tags/max-age to prevent cross-user cache leaks.
Odoo Advanced Troubleshooting
Slow Transactions: Profile ORM-heavy flows and identify computed fields with expensive dependencies.
Queue Backpressure: Segment long-running workers from latency-sensitive jobs and enforce queue priorities.
Customization Regression: Validate inherited views and custom modules against updated base models.
Reporting Delays: Move heavy reports to asynchronous jobs and pre-materialize frequent aggregates.
Rails and Next.js Troubleshooting Patterns
Rails: Use query logs plus application traces to isolate N+1 and callback amplification; add bulletproof database indexes before scaling workers.
Next.js: Separate server rendering errors from client hydration mismatches; inspect edge/runtime logs and stale tag revalidation behavior.
React: Profile render waterfalls and expensive effects; verify concurrency-safe state transitions in high-interaction pages.
Stabilization and Recovery Checklist
- Apply narrowest fix first, not the largest rewrite.
- Re-run synthetic checks before user traffic expansion.
- Track error budget burn rate during and after remediation.
- Ship post-incident hardening tasks with owners and dates.
- Document exact trigger and detection gap for prevention.
Need Direct Incident Support?
If you are in a live production issue, we can help with triage, rollback, and root-cause correction.