What this is
You run things. A website, a database, a cron job, a server, some API keys, a few subscriptions, maybe a rental property. Each one can break. Most of the time they don't. But when they do, you want a runbook — not a troubleshooting session.
This is not the flywheel. The flywheel discovers friction you didn't know about. This guide handles friction you already know about — known systems, known failure modes, known fixes. The flywheel is R&D. This is operations.
The maintenance guide is a ranked inventory of everything you depend on, with a health check for each one, a runbook for when it breaks, and a schedule that makes sure you actually look. The output is a status dashboard: green, yellow, red. If everything's green, you never see it. If something's yellow, it shows up in the daily briefing. If something's red, it wakes you up.
The five layers
1. What matters
A ranked inventory of the systems you depend on. Not everything — just the ones where downtime costs you. Each system gets a priority tier:
- Critical — down means broken. Your site is unreachable, your database is gone, your deploy pipeline is dead. Fix immediately.
- Important — down means degraded. A cron job isn't running, backups are stale, a cert is about to expire. Fix today.
- Nice-to-have — down means annoying. A monitoring dashboard is stale, a convenience script is broken, a subscription auto-renewed at the wrong tier. Fix when convenient.
2. How to check it
For each system, a health check. A command, a URL, a query — something an agent can run without interpretation.
- `curl -sf https://your-site.xyz | head -1` — did the site respond?
- `psql -c "SELECT 1"` — is the database alive?
- `certbot certificates 2>&1 | grep "Expiry"` — when do certs expire?
- `systemctl is-active nginx` — is nginx running?
- `ls -lt /backups/ | head -3` — is the latest backup recent?
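Checks like these can be wired into a tiny runner that an agent or cron job executes. A minimal POSIX-sh sketch — the check names and commands below are placeholders; substitute the real ones from your inventory:

```shell
#!/bin/sh
# Run "name|command" pairs and print PASS/FAIL per system.
# The two checks at the bottom are stand-ins; real ones look like
# `site|curl -sf https://your-site.xyz` or `web|systemctl is-active nginx`.

run_checks() {  # $1 = newline-separated "name|command" pairs
  printf '%s\n' "$1" | while IFS='|' read -r name cmd; do
    [ -z "$name" ] && continue
    if sh -c "$cmd" >/dev/null 2>&1; then
      printf 'PASS %s\n' "$name"
    else
      printf 'FAIL %s\n' "$name"
    fi
  done
}

run_checks 'tmp_writable|touch /tmp/.hc && rm -f /tmp/.hc
root_disk|df -P /'
```

A FAIL line is the trigger for the matching runbook; PASS lines stay silent, which is the same green-rows-are-silent rule the dashboard follows.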
The check is the runbook's smallest unit. If you can't write a check, you don't understand the system well enough to maintain it.
3. How to fix it
For each known failure mode, a procedure. Not a troubleshooting guide — a script. "If the site is down: check nginx, check DNS, check the deploy. Here are the commands." The fix is a sequence of steps an agent can follow. If the agent can't fix it, it escalates — tells you what it found and what it tried.
Runbooks are per-system, per-failure-mode. One file each. The file describes the symptom, the diagnosis steps, and the fix. The agent reads the file and follows it. You don't need to remember how to restart postgres at 2am — the runbook remembers.
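One runbook file, sketched with the symptom/diagnosis/fix/escalation shape described here — the specific commands are illustrative examples, not a tested procedure:

```
# Runbook: site unreachable (nginx down)

Symptom:    curl -sf https://your-site.xyz fails
Diagnosis:  systemctl is-active nginx    # is it running?
            nginx -t                     # is the config still valid?
            journalctl -u nginx -n 50    # any recent errors?
Fix:        systemctl restart nginx, then re-run the health check
Escalation: if the restart fails, save the journalctl output,
            file an incident, and notify the owner
```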
4. When to check it
A schedule. The cadence matches the priority:
- Critical — checked daily (or continuously via uptime monitor)
- Important — checked weekly
- Nice-to-have — checked monthly
The maintenance agent runs on a cron. It produces a status report. Green rows are silent. Yellow rows show up in the daily briefing. Red rows notify you immediately. The schedule is self-managed — if a system keeps going yellow, the agent proposes upgrading its check frequency.
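If the user already runs cron, the three cadences can be wired up directly. The script path and times here are assumptions, not part of this guide's convention:

```
# Hypothetical crontab: one runner script, invoked per tier
0 7 * * *  /home/you/ops/run-checks.sh critical      # daily, 07:00
0 7 * * 1  /home/you/ops/run-checks.sh important     # weekly, Monday
0 7 1 * *  /home/you/ops/run-checks.sh nice-to-have  # monthly, 1st
```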
5. What changed
A changelog. Every time a system changes — new deploy, config update, dependency upgrade, cert renewal — the maintenance guide logs it. When something breaks, the changelog is the first place you look. "What changed since it was working?"
The changelog is also how you catch drift. If nothing has changed in three months, either the system is perfectly stable or nobody's watching. The maintenance agent flags stale systems for review.
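An entry can be one line per change. The shape below is a suggestion, and every detail in it is invented for illustration:

```
## 2025-06-03
- nginx: config update, enabled gzip
- your-site.xyz: cert renewed via certbot, next expiry 2025-09-01

## 2025-06-01
- postgres: minor version upgrade
```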
The dashboard
The maintenance agent produces a status file after each run. Here's what it looks like:
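A sketch of what the rendered status might read like — the systems, tiers, and timestamps are all hypothetical:

```
status: 2025-06-03 07:00

  GREEN   your-site.xyz    critical    curl check passed
  GREEN   postgres         critical    SELECT 1 ok
  YELLOW  backups          important   newest backup 3 days old
  GREEN   docs-dashboard   nice        refreshed today
```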
The dashboard is a file, not an app. Static HTML, regenerated on each check run. Archived by date. The daily briefing reads it and surfaces anything non-green.
The output structure
Everything lives in one folder. The agent creates it and maintains it across runs.
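Assembled from the file names used later in this guide, the layout looks roughly like:

```
ops/
├── inventory.md     # ranked systems, each with its health check
├── status.html      # the dashboard, regenerated each run
├── schedule.md      # check cadence per tier
├── changelog.md     # what changed, when
├── expirations.md   # certs, domains, renewals, sorted by date
├── runbooks/        # one file per failure mode
├── incidents/       # calls for help from other agents
└── proposals/       # self-improvement proposals
```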
Talking to other systems
Maintenance doesn't exist in isolation. It talks to the other guides through the folder convention:
- Maintenance → Flywheel. A red check the agent can't fix becomes an observation in flywheel/observations/. The flywheel picks it up, identifies the pattern, proposes a structural fix.
- Flywheel → Maintenance. The flywheel discovers a system that keeps failing. It proposes adding the system to the maintenance inventory with a health check and runbook.
- Maintenance → Daily Briefing. The status dashboard feeds the briefing. Yellow and red items appear in the "action required" section.
- Any system → Maintenance. Any agent can write a call for help to ops/incidents/. The maintenance agent picks it up on its next run.
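A call for help is just a small file. This example — the agent name, symptom, and filename — is hypothetical:

```
# ops/incidents/2025-06-03-backups-stale.md

From:    briefing-agent
Symptom: newest file in /backups/ is 4 days old
Tried:   nothing (read-only agent)
Wants:   maintenance agent to diagnose on its next run
```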
Each system has a self-reflection skill — a periodic check that runs with full local context. Maintenance knows its own check history, failure rates, and false positives. It logs friction about its own operation to the flywheel. The systems improve each other.
Common systems to track
Drawn from the examples throughout this guide: websites and domains (uptime, DNS, renewals), databases (connectivity, backups), cron jobs and deploy pipelines, SSL certificates and API keys (expirations), backups (freshness), and subscriptions and contracts (renewal dates and tiers).
The handoff
Build a maintenance system with the user, one layer at a time. Start with inventory, end with automation. Every step has an approval gate.
- Interview. Ask the user: What do you run? What breaks? What worries you? What wakes you up at 2am? What would ruin your day if it went down? Don't assume — some people run servers, some run spreadsheets. Start with whatever they have.
- Build the inventory. Create ops/inventory.md. List each system with: name, what it does, priority tier (critical/important/nice-to-have), how to access it, and who owns it. Rank by priority. Review together.
- Write health checks. For each system in the inventory, write a check command. Something you can run right now that returns pass/fail. Test each check live — run it and show the result. If a check fails, that's your first finding. Write checks to ops/inventory.md alongside each system entry. Verify each check works.
- Run the first status check. Execute every health check. Generate ops/status.html with green/yellow/red results. This is your baseline. If anything is already yellow or red, flag it. Show the dashboard to the user.
- Write runbooks. For each critical and important system, ask: "What goes wrong with this?" Write one runbook per failure mode to ops/runbooks/. Each runbook has: symptom (how you'd notice), diagnosis (commands to run), fix (steps to resolve), escalation (what to do if the fix doesn't work). Review each runbook.
- Build the expirations list. Scan for anything with a deadline: SSL certs, domain renewals, API key expirations, subscription renewals, contracts. Write to ops/expirations.md, sorted by date. Highlight anything within 30 days.
- Set up the schedule. Create ops/schedule.md. Critical systems: daily. Important: weekly. Nice-to-have: monthly. Suggest automation — a cron job that runs health checks and generates the status dashboard. If the user has a daily briefing, connect the status output to it.
- Start the changelog. Create ops/changelog.md. Log today as the first entry: "Maintenance system initialized." Going forward, any change to any tracked system gets an entry.
- Connect to other systems. If the user has a flywheel, set up the observation bridge: red checks that can't be auto-fixed write to flywheel/observations/. If they have a daily briefing, feed the status dashboard into it. If they have neither, the maintenance system works standalone.
- Self-assess. Review the maintenance system's own performance. Are health checks producing false positives? Are runbooks accurate? Are any systems missing from the inventory? Write self-improvement proposals to ops/proposals/ and log observations to the flywheel if one exists. Every proposal should include a lesson: what this failure taught you that future systems should inherit.
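The check-then-publish loop in these steps can be sketched in a few lines of shell. The function name, markup, and inline styling are illustrative, not a prescribed format; in real use you would redirect the output to ops/status.html:

```shell
#!/bin/sh
# Turn PASS/FAIL check results into a minimal static status page.
# Reads lines like "PASS site" on stdin, writes HTML to stdout.

generate_status() {
  echo '<html><body><h1>Status</h1><ul>'
  while read -r state name; do
    case "$state" in
      PASS) color=green ;;   # green rows: quiet, but visible on the page
      FAIL) color=red ;;     # red rows: these trigger a notification
      *)    color=orange ;;  # anything else: treat as yellow/unknown
    esac
    printf '<li style="color:%s">%s: %s</li>\n' "$color" "$name" "$state"
  done
  echo '</ul></body></html>'
}

printf 'PASS site\nFAIL backups\n' | generate_status
```

The dashboard stays a dumb artifact this way: no server, no app, just a file the briefing can read and the archive can keep by date.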
Key rules:
- Health checks must be testable right now — run each one and show the result.
- Runbooks are scripts, not essays — an agent should be able to follow them mechanically.
- Don't touch production systems without explicit approval. If a check requires SSH or elevated access, confirm access before attempting.
- The maintenance system starts simple (inventory + checks) and grows (runbooks + schedule + automation). Don't overwhelm the user with ten runbooks on day one — start with the critical systems and expand.
- Every maintenance proposal should have a lesson in it — not just what to change, but what this incident teaches about how to build or operate better next time.
- Fix Your Papercuts — maintenance prevents known papercuts from recurring. The runbook is a papercut fix, written down permanently.
- Solved Problems Stay Solved — each runbook is a solved problem. You debug it once, write it down, never debug it again.
- Don't Ask Me to Track It — automated checks instead of manual monitoring. The system watches itself.
- The Steering File — the inventory is a steering file for operations. It tells the agent what to care about.
- Memory Is Files — the changelog is memory. Incident reports are memory. The maintenance system remembers what broke and how it was fixed.
- The Folder Is the Interface — systems communicate by writing files to each other's folders. The protocol is markdown in the right place.
- Flywheel — discovers new friction; graduates findings into maintenance when they become permanent ops
- Daily Briefing — surfaces maintenance alerts in the morning briefing
- Wall of Data — the infrastructure that collects data from the systems you maintain
- Zero to Dev — start here if you haven't set up yet
- Google SRE Book — the industry standard on reliability engineering (this guide is the personal-scale version)
- Uptime Kuma — self-hosted monitoring tool; pairs well with the health check pattern
Navigate to your project root, create the ops folder, and hand it to an agent.
macOS/Linux:

cd /path/to/your/project-root && mkdir -p ops/runbooks ops/incidents ops/proposals

Windows (PowerShell):

cd "$env:USERPROFILE\<project-root>"; mkdir ops -Force; mkdir ops\runbooks, ops\incidents, ops\proposals -Force

Replace <project-root> with the folder name you actually use. Then start your agent:

claude

And give it this prompt:
Follow the instructions on this page. If anything looks unsafe or beyond what I'd reasonably want, tell me before doing it: