Incident Runbook

First response for the most likely operational problems. Each entry: how you'll notice, how to confirm, how to fix, and how to verify recovery. Platform/DevOps actions are at /platform/*; environment domains are qbtime.r2d2dev.com (prod) and staging.qbtime.r2d2dev.com (staging).

Triage order

  1. Is the app up? GET /api/v1/health should return DB status.
  2. Is the cron firing? See "Scheduled reports stopped".
  3. Is a single tenant broken, or all of them? The Fleet health page (/platform/health) shows every company's connection, token, last run, and recent failures at a glance.
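
As a rough illustration of step 1, the health check can be scripted. This is a minimal sketch: the domains come from the intro above, https is assumed, and the helper names (health_url, check_health) are hypothetical. The shape of the response body is not specified here, so the sketch returns it raw rather than parsing it.

```python
import urllib.request
from typing import Tuple

# Domains as listed in the runbook intro; https is an assumption.
DOMAINS = {
    "prod": "qbtime.r2d2dev.com",
    "staging": "staging.qbtime.r2d2dev.com",
}


def health_url(env: str) -> str:
    """Build the health-check URL for 'prod' or 'staging'."""
    return f"https://{DOMAINS[env]}/api/v1/health"


def check_health(env: str, timeout: float = 10.0) -> Tuple[int, str]:
    """GET the health endpoint and return (HTTP status, raw body).

    urlopen raises urllib.error.HTTPError on 4xx/5xx and
    urllib.error.URLError if the host is unreachable; either one
    means the answer to "is the app up?" is no.
    """
    with urllib.request.urlopen(health_url(env), timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

Calling check_health("prod") from a Python shell prints nothing by itself; inspect the returned status and body, and look for the DB status in the body.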

QBT connection / token expired or refresh failing

Notice: a company shows an expired or expiring token, or a connection in the error state, on Fleet health; affected company admins get a qbt_token_expired or qbt_refresh_failed notification; reports for that company stop.

Confirm: open the company's Status page; check Fleet health token state.

Fix: the access token auto-refreshes, so a transient failure usually clears on the next cron cycle. If it persists, the refresh token or client secret is bad:

  1. Have the customer admin re-run Connect for that company (Companies page), which mints a fresh token pair.
  2. If the QBT client secret was rotated in QBT, update it in-app via Rotate secret (Companies page), then re-run Connect.

Verify: Fleet health shows token ok and connection connected; run a manual report for that company.


Report email failed to send

Notice: report_send_failed notification; Fleet health shows a non-zero Failures (7d) count; a report_runs row with status failed.

Confirm: check the cron Worker logs (npx wrangler tail qbtime-cron-prod) for the failing cycle, and the company's recent runs.

Fix: common causes —

  • Email provider creds: verify GRAPH_TENANT_ID, GRAPH_CLIENT_ID, GRAPH_CLIENT_SECRET, and GRAPH_SENDER (or the SendGrid equivalents) in the Pages env. An expired Graph client secret is the usual culprit.
  • Recipients: confirm the report config has valid recipient emails.
  • QBT side: if the data pull failed, treat as the connection incident above.
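
The first cause above starts with a presence check on the Graph settings. A minimal sketch follows, assuming you have copied the variable names and values from the Pages env into a dictionary by hand; missing_settings is a hypothetical helper, not part of the app. It can only catch missing or blank values. An expired client secret still looks present and has to be checked with the email provider.

```python
from typing import List, Mapping, Sequence

# Variable names as listed in this runbook.
GRAPH_SETTINGS = (
    "GRAPH_TENANT_ID",
    "GRAPH_CLIENT_ID",
    "GRAPH_CLIENT_SECRET",
    "GRAPH_SENDER",
)


def missing_settings(
    env: Mapping[str, str],
    required: Sequence[str] = GRAPH_SETTINGS,
) -> List[str]:
    """Return the required names that are absent or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


# Example with placeholder values: the secret is blank, the sender is absent.
example_env = {
    "GRAPH_TENANT_ID": "tenant-placeholder",
    "GRAPH_CLIENT_ID": "client-placeholder",
    "GRAPH_CLIENT_SECRET": "   ",
}
print(missing_settings(example_env))
```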

Runs are idempotent (send-dedupe), so re-firing the cron won't double-send a report that already went out.
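
To illustrate why re-firing is safe, here is a generic sketch of send-dedupe. It is not the app's actual implementation: the key fields (company, report, scheduled slot) and the SendLedger class are assumptions made for the example. A send is recorded only after it succeeds, so a failed send can be retried while a delivered one is skipped.

```python
from typing import Callable, Set, Tuple

# Hypothetical dedupe key: (company id, report id, scheduled slot).
SendKey = Tuple[str, str, str]


class SendLedger:
    """In-memory stand-in for a persistent record of completed sends."""

    def __init__(self) -> None:
        self._sent: Set[SendKey] = set()

    def send_once(self, key: SendKey, send: Callable[[], None]) -> bool:
        """Call send() unless key was already delivered.

        Returns True if a send happened, False if it was skipped.
        If send() raises, the key is not recorded, so a later cycle
        can try again.
        """
        if key in self._sent:
            return False
        send()
        self._sent.add(key)
        return True


deliveries = []
ledger = SendLedger()
key = ("company-1", "weekly-hours", "2024-01-01T09:00Z")
first = ledger.send_once(key, lambda: deliveries.append(key))
second = ledger.send_once(key, lambda: deliveries.append(key))
```

In this example the second call is skipped, so deliveries holds a single entry even though the cycle ran twice.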

Verify: run the report manually (Reports page → Preview → Run & send) and confirm delivery; failure count stops climbing.


Scheduled reports stopped firing

Notice: no reports at the expected times; no recent report_runs rows.

Confirm the chain (cron Worker → internal endpoint → runner):

npx wrangler tail qbtime-cron-prod          # is the Worker waking every 15 min?

Then test the app endpoint directly (PowerShell-safe):

Invoke-WebRequest -Uri "https://qbtime.r2d2dev.com/api/v1/internal/run-due-reports" `
  -Method POST -Headers @{ "X-Internal-Secret" = "<prod INTERNAL_TASK_SECRET>" } |
  Select-Object -ExpandProperty Content

Fix:

  • 403 from the endpoint: the cron Worker's INTERNAL_TASK_SECRET doesn't match the app's for that env. Re-set it: npx wrangler secret put INTERNAL_TASK_SECRET --env production (from cron-worker/), using the app's value.
  • Worker not firing at all: confirm it's deployed as a Worker, not a Pages project (Pages can't run cron). Redeploy: npx wrangler deploy --env production from cron-worker/. The output must show schedule: */15 * * * *.
  • Wrong target: the deploy output must show the correct APP_BASE_URL for the env (a staging Worker pointed at the prod URL is a known past mistake; the --env flag prevents it).
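
For operators not on PowerShell, the same endpoint test can be sketched in Python. The URL path and the X-Internal-Secret header are taken from the command above; build_run_due_request and diagnose are hypothetical helpers, and only the 403 mapping comes from this runbook. The other messages are generic guidance.

```python
import urllib.request


def build_run_due_request(base_url: str, secret: str) -> urllib.request.Request:
    """Build the POST that the cron Worker sends to the app."""
    return urllib.request.Request(
        f"{base_url}/api/v1/internal/run-due-reports",
        data=b"",  # empty body; the endpoint is triggered by the POST itself
        method="POST",
        headers={"X-Internal-Secret": secret},
    )


def diagnose(status: int) -> str:
    """Map an HTTP status from the endpoint to a next step."""
    if status == 403:
        # From the runbook: Worker and app secrets differ for this env.
        return "secret mismatch: re-set INTERNAL_TASK_SECRET on the cron Worker"
    if 200 <= status < 300:
        return "endpoint ok: check that the Worker is firing and its APP_BASE_URL"
    # Not covered by this runbook; treat as an app-side problem.
    return "unexpected status: inspect the app and Worker logs"
```

To send the request, pass it to urllib.request.urlopen inside a try block and catch urllib.error.HTTPError; its code attribute is the status to hand to diagnose.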

Verify: wrangler tail shows "*/15 * * * *" @ ... - Ok at the next quarter-hour.
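
To know how long to keep the tail open, you can compute the next firing time of a */15 * * * * schedule. A minimal sketch, assuming the schedule is evaluated in UTC (as Cloudflare Cron Triggers are); next_quarter_hour is a hypothetical helper.

```python
from datetime import datetime, timedelta


def next_quarter_hour(now: datetime) -> datetime:
    """Return the next :00, :15, :30 or :45 strictly after now."""
    # Round down to the current quarter-hour boundary, then step forward.
    floored = now.replace(
        minute=now.minute - now.minute % 15, second=0, microsecond=0
    )
    return floored + timedelta(minutes=15)


print(next_quarter_hour(datetime(2024, 1, 1, 10, 7, 30)))  # 2024-01-01 10:15:00
```

If the call is made exactly on a boundary (for example 10:45:00), the result is the following boundary (11:00:00), since the current one is already firing.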


Account lockout / admin can't sign in

Notice: admin reports repeated login failure or a lockout message.

Fix:

  • Forgot password: use the self-serve Forgot password? link on the login page, which emails a reset link (valid for 1 hour).
  • Lockout from failed attempts: the lockout is time-boxed and clears on its own; advise the admin to wait, or reset the password (which also lets them in with fresh credentials).
  • MFA device lost: the admin uses a recovery code at the MFA step.

Suspected data tampering

Notice: unexpected changes; audit concerns.

Confirm: verify the append-only audit chain — GET /api/v1/platform/verify-audit (DevOps). A broken chain identifies the first altered row.
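
For intuition on how a chain check can point at the first altered row, here is a generic hash-chain sketch. It is not the app's actual scheme: the row layout, the SHA-256 choice, and the function names are assumptions for illustration. Each row stores a hash of the previous row's hash plus its own payload, so editing any row breaks the match from that row onward.

```python
import hashlib
from typing import List, Optional, Tuple

GENESIS = "0" * 64  # assumed starting value for the first row

# A row is (payload, stored_hash) in this sketch.
Row = Tuple[str, str]


def row_hash(prev_hash: str, payload: str) -> str:
    """Hash a row's payload together with the previous row's hash."""
    return hashlib.sha256(f"{prev_hash}\n{payload}".encode("utf-8")).hexdigest()


def append_row(chain: List[Row], payload: str) -> None:
    """Append a row whose hash links to the current tail of the chain."""
    prev = chain[-1][1] if chain else GENESIS
    chain.append((payload, row_hash(prev, payload)))


def first_broken_row(chain: List[Row]) -> Optional[int]:
    """Return the index of the first row that fails verification, or None."""
    prev = GENESIS
    for index, (payload, stored) in enumerate(chain):
        if row_hash(prev, payload) != stored:
            return index
        prev = stored
    return None


chain: List[Row] = []
for event in ("login admin", "report config changed", "secret rotated"):
    append_row(chain, event)

intact = first_broken_row(chain)
chain[1] = ("report config UNCHANGED", chain[1][1])  # simulate tampering
tampered = first_broken_row(chain)
```

Here intact is None and tampered is 1, the index of the edited row. Recomputing the edited row's own hash would not hide the change, because the next row's stored hash would then fail instead.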

Fix: investigate via the audit log viewer; if data was corrupted, restore from backup (see Backup & Restore) into staging first to inspect, then into prod if warranted.


Bad deploy / need to roll back

App (Pages): redeploy the previous good commit — npm run build && npm run deploy:prod from that commit. Pages also keeps prior deployments you can promote in the dashboard.

Database migration gone wrong: restore from the pre-migration backup you took (see Backup & Restore → "before every production migration"). Always back up immediately before applying a prod migration.

Verify after any rollback: GET /api/v1/health, a sign-in, a manual report run, and audit-chain verification.