Cross-Platform Troubleshooting Runbook

Technical incident response patterns for self-hosted WordPress, Drupal, Odoo, Rails, and Next.js systems

Troubleshooting22 min readLast updated: June 2026

Jump by Symptom

Site Down or 5xx Spike

  1. Confirm blast radius: all routes, subset of routes, or single service dependency.
  2. Check infrastructure first: CPU saturation, memory pressure, connection pool exhaustion, queue backlog.
  3. Freeze new deploys and config mutations.
  4. Evaluate rollback condition within 5 minutes based on user-impact KPI.
  5. If rollback is cleaner than patch, roll back and continue diagnosis in staging.

Slow After Update

  1. Compare p50/p95 latency before and after release by route.
  2. Inspect DB slow query logs and lock wait changes.
  3. Toggle cache layers one by one (edge/page/object/query cache).
  4. Identify new synchronous calls and external dependency lag.
  5. Apply narrow optimization patch, then re-measure against pre-release baseline.

Errors After Release

  1. Segment by error class: validation, runtime exception, dependency mismatch, migration drift.
  2. Inspect release artifact diff: packages/modules/plugins/theme/custom code.
  3. Validate data/schema compatibility for write paths.
  4. Reproduce in staging with production-like configuration and seed data.
  5. Choose fast rollback or hotfix path based on data-integrity risk.

Suspected Security Incident

  1. Contain first: disable compromised accounts, rotate credentials, isolate suspicious nodes.
  2. Preserve evidence: immutable copies of logs, process snapshots, and request traces.
  3. Identify entry vector: vulnerable dependency, credential reuse, exposed endpoint, or misconfiguration.
  4. Patch and validate in staging before production promotion.
  5. Run post-incident hardening: WAF rules, least privilege, automated advisories, and recovery drills.

Incident Triage Framework (First 15 Minutes)

  1. Classify impact: availability, data integrity, security, or performance.
  2. Freeze risky changes: stop deployments and configuration churn.
  3. Capture evidence: current logs, metrics snapshots, request traces, and failing URLs.
  4. Define rollback threshold: if business KPI drops below threshold, roll back immediately.
  5. Assign one commander and one comms owner to avoid coordination deadlocks.

Diagnostic Decision Tree

Symptom A: Global timeout or 5xx spike. Start with infrastructure saturation (CPU, memory, DB connections, queue backlog).

Symptom B: Admin works, frontend fails. Check cache invalidation, route-level middleware, and CDN edge rules.

Symptom C: Some pages fail after update. Check plugin/module compatibility and template overrides.

Symptom D: Latency drift without errors. Investigate slow query percentiles, N+1 patterns, and lock contention.

WordPress Advanced Troubleshooting

Plugin Conflict Isolation: Disable in groups by domain (SEO, caching, security, forms), then binary-search to offending plugin.

Fatal Error Tracing: Enable debug logging and inspect stack traces for hooks executed before failure.

Cache Layer Split Testing: Toggle object cache, page cache, and CDN cache independently to find stale-layer defects.

Database Recovery: Run table checks, verify collation drift, and test point-in-time restoration in staging before production apply.

Drupal Advanced Troubleshooting

Module Dependency Breaks: Map dependency graph before uninstall/upgrade and validate config schema changes.

Container Rebuild Errors: Inspect service definitions and stale compiled container cache artifacts.

Config Sync Drift: Compare active config vs exported config and isolate environment-specific overrides.

Render Cache Poisoning: Validate cache contexts/tags/max-age to prevent cross-user cache leaks.

Odoo Advanced Troubleshooting

Slow Transactions: Profile ORM-heavy flows and identify computed fields with expensive dependencies.

Queue Backpressure: Segment long-running workers from latency-sensitive jobs and enforce queue priorities.

Customization Regression: Validate inherited views and custom modules against updated base models.

Reporting Delays: Move heavy reports to asynchronous jobs and pre-materialize frequent aggregates.

Rails and Next.js Troubleshooting Patterns

Rails: Use query logs plus application traces to isolate N+1 and callback amplification; add bulletproof database indexes before scaling workers.

Next.js: Separate server rendering errors from client hydration mismatches; inspect edge/runtime logs and stale tag revalidation behavior.

React: Profile render waterfalls and expensive effects; verify concurrency-safe state transitions in high-interaction pages.

Stabilization and Recovery Checklist

  • Apply narrowest fix first, not the largest rewrite.
  • Re-run synthetic checks before user traffic expansion.
  • Track error budget burn rate during and after remediation.
  • Ship post-incident hardening tasks with owners and dates.
  • Document exact trigger and detection gap for prevention.

Need Direct Incident Support?

If you are in a live production issue, we can help with triage, rollback, and root-cause correction.