Incident Runbook
First response for the most likely operational problems. Each entry: how you'll
notice, how to confirm, how to fix, and how to verify recovery. Platform/DevOps
actions are at /platform/*; environment domains are
qbtime.r2d2dev.com (prod) and staging.qbtime.r2d2dev.com (staging).
Triage order
- Is the app up? GET /api/v1/health should return DB status.
- Is the cron firing? See "Scheduled reports stopped firing".
- Is a single tenant broken, or all of them? The Fleet health page (/platform/health) shows every company's connection, token, last run, and recent failures at a glance.
QBT connection / token expired or refresh failing
Notice: a company shows expired/expiring token or error connection on
Fleet health; affected company admins get a qbt_token_expired /
qbt_refresh_failed notification; reports for that company stop.
Confirm: open the company's Status page; check Fleet health token state.
Fix: the access token auto-refreshes, so a transient failure usually clears on the next cron cycle. If it persists, the refresh token or client secret is bad:
- Have the customer admin re-run Connect for that company (Companies page), which mints a fresh token pair.
- If the QBT client secret was rotated in QBT, update it in-app via Rotate secret (Companies page), then re-run Connect.
Verify: Fleet health shows token ok and connection connected; run a
manual report for that company.
Report email failed to send
Notice: report_send_failed notification; Fleet health shows a non-zero
Failures (7d) count; a report_runs row with status failed.
Confirm: check the cron Worker logs (npx wrangler tail qbtime-cron-prod)
for the failing cycle, and the company's recent runs.
Fix: common causes:
- Email provider creds: verify GRAPH_TENANT_ID / GRAPH_CLIENT_ID / GRAPH_CLIENT_SECRET / GRAPH_SENDER (or SendGrid equivalents) in the Pages env. An expired Graph client secret is the usual culprit.
- Recipients: confirm the report config has valid recipient emails.
- QBT side: if the data pull failed, treat as the connection incident above.
Runs are idempotent (send-dedupe), so re-firing the cron won't double-send a report that already went out.
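The runbook does not describe how send-dedupe is implemented. As an illustration of why a re-fire is safe, the sketch below assumes a dedupe key made of company, report, and schedule slot. The ReportRunner class, its fields, and the key shape are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ReportRunner:
    """Toy runner that remembers which (company, report, slot) sends already happened."""
    sent_keys: set = field(default_factory=set)
    outbox: list = field(default_factory=list)

    def run_due(self, company_id: str, report_id: str, scheduled_for: datetime) -> bool:
        # Hypothetical dedupe key: one send per company, report, and schedule slot.
        key = (company_id, report_id, scheduled_for.isoformat())
        if key in self.sent_keys:
            return False           # already sent for this slot; the re-fire is a no-op
        self.outbox.append(key)    # stand-in for the real email send
        self.sent_keys.add(key)    # recorded only after the send step has run
        return True


runner = ReportRunner()
slot = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
first = runner.run_due("acme", "weekly-hours", slot)   # sends
second = runner.run_due("acme", "weekly-hours", slot)  # skipped as a duplicate
```

Because the key is recorded only after the send step, a run that failed before sending is not marked as sent and will be retried on the next cycle.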
Verify: run the report manually (Reports page → Preview → Run & send) and confirm delivery; failure count stops climbing.
Scheduled reports stopped firing
Notice: no reports at the expected times; no recent report_runs rows.
Confirm the chain (cron Worker → internal endpoint → runner):
npx wrangler tail qbtime-cron-prod # is the Worker waking every 15 min?
Then test the app endpoint directly (PowerShell-safe):
Invoke-WebRequest -Uri "https://qbtime.r2d2dev.com/api/v1/internal/run-due-reports" `
-Method POST -Headers @{ "X-Internal-Secret" = "<prod INTERNAL_TASK_SECRET>" } |
Select-Object -ExpandProperty Content
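For responders who are not on Windows, the same request can be sent with Python's standard library. This is a minimal sketch, not part of the app: the helper names build_run_due_request and trigger_run_due are hypothetical, while the path and the X-Internal-Secret header are taken from the command above.

```python
import urllib.error
import urllib.request


def build_run_due_request(base_url: str, internal_secret: str) -> urllib.request.Request:
    """Build the same POST that the PowerShell command sends."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/internal/run-due-reports",
        method="POST",
        headers={"X-Internal-Secret": internal_secret},
    )


def trigger_run_due(base_url: str, internal_secret: str, timeout: float = 30.0):
    """Send the request and return (status_code, body_text) without raising on HTTP errors."""
    req = build_run_due_request(base_url, internal_secret)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # For example 403 when the secret does not match the app's value for that env.
        return err.code, err.read().decode("utf-8", errors="replace")


# Usage (placeholder secret kept as in the runbook):
# status, body = trigger_run_due("https://qbtime.r2d2dev.com", "<prod INTERNAL_TASK_SECRET>")
```

Returning the status code instead of raising makes it easy to branch on the 403 case.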
Fix:
- 403 from the endpoint: the cron Worker's INTERNAL_TASK_SECRET doesn't match the app's for that env. Re-set it: npx wrangler secret put INTERNAL_TASK_SECRET --env production (from cron-worker/), using the app's value.
- Worker not firing at all: confirm it's deployed as a Worker, not a Pages project (Pages can't run cron). Redeploy: npx wrangler deploy --env production from cron-worker/. The output must show schedule: */15 * * * *.
- Wrong target: the deploy output must show the correct APP_BASE_URL for the env (a staging Worker pointed at the prod URL is a known past mistake; the --env flag prevents it).
Verify: wrangler tail shows "*/15 * * * *" @ ... - Ok at the next
quarter-hour.
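To know when the next log line should appear, the next */15 boundary can be computed from the current time. This is a small sketch: next_quarter_hour is a hypothetical helper, and it works in UTC because Cloudflare evaluates cron triggers in UTC.

```python
from datetime import datetime, timedelta, timezone


def next_quarter_hour(now: datetime) -> datetime:
    """Return the first */15 boundary strictly after `now`."""
    floored = now.replace(minute=now.minute - now.minute % 15, second=0, microsecond=0)
    return floored + timedelta(minutes=15)


now = datetime(2024, 1, 1, 10, 7, 30, tzinfo=timezone.utc)
print(next_quarter_hour(now).isoformat())  # 2024-01-01T10:15:00+00:00
```

If nothing shows up in wrangler tail shortly after that time, go back to the Fix list above.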
Account lockout / admin can't sign in
Notice: admin reports repeated login failure or a lockout message.
Fix:
- Forgot password: use the self-serve Forgot password? link on the login page → email reset link (valid 1 hour).
- Lockout from failed attempts: the lockout is time-boxed and clears on its own; advise the admin to wait, or reset the password (which also lets them in with fresh credentials).
- MFA device lost: the admin uses a recovery code at the MFA step.
Suspected data tampering
Notice: unexpected changes; audit concerns.
Confirm: verify the append-only audit chain with GET /api/v1/platform/verify-audit
(DevOps). If the chain is broken, the check identifies the first altered row.
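The runbook does not specify how the audit chain is built. A common design, assumed here purely for illustration, stores in each row a hash of the previous row's hash plus the row's own content, so that walking the chain pinpoints the first row that was changed. All names below (GENESIS, row_hash, append, first_broken_index) are hypothetical.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # hypothetical prev_hash value for the very first row


def row_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous row's hash together with this row's canonical JSON."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Append a row linked to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev, "hash": row_hash(prev, payload)})


def first_broken_index(chain: list) -> Optional[int]:
    """Return the index of the first row that fails verification, or None."""
    prev = GENESIS
    for i, row in enumerate(chain):
        if row["prev_hash"] != prev or row["hash"] != row_hash(prev, row["payload"]):
            return i
        prev = row["hash"]
    return None


chain = []
for action in ("login", "report_run", "secret_rotated"):
    append(chain, {"action": action})

intact = first_broken_index(chain)        # None: the chain verifies
chain[1]["payload"]["action"] = "edited"  # tamper with the second row
broken_at = first_broken_index(chain)     # 1: the first altered row
```

In this design an attacker who edits one row would have to recompute every later hash to hide the change, which is why the first failing index is a useful starting point for the investigation.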
Fix: investigate via the audit log viewer; if data was corrupted, restore from backup (see Backup & Restore) into staging first to inspect, then into prod if warranted.
Bad deploy / need to roll back
App (Pages): redeploy the previous good commit —
npm run build && npm run deploy:prod from that commit. Pages also keeps prior
deployments you can promote in the dashboard.
Database migration gone wrong: restore from the pre-migration backup you took (see Backup & Restore → "before every production migration"). Always back up immediately before applying a prod migration.
Verify after any rollback: GET /api/v1/health, a sign-in, a manual report
run, and audit-chain verification.