The 8 Best Claude Code Skills for SREs (2026)

By Skill Index Editorial · Jun 2, 2026

An SRE lives in the gap between "the deploy went out" and "the deploy is fine." The job is making sure the feature does not page someone at 3am, and that when it does, whoever picks up the pager finds the root cause before the error budget is gone. The recurring work is operational: configure the deploy so it is repeatable, gate the release behind a health check, watch the canary, get told when an upstream changes, investigate without guessing, hand off cleanly at shift change, and write down what you learned. The eight skills below map onto that loop. Each is a real, verified Claude Code skill from a plugin with public commit history and a real star count on GitHub.

setup-deploy

From the gstack plugin (104,138 stars, MIT, verified; the largest plugin in the index). It detects your deploy platform (Fly.io, Render, Vercel, Netlify, Heroku, GitHub Actions, or a custom setup), your production URL, your health-check endpoints, and the commands that report deploy status, then writes the whole configuration into CLAUDE.md so every future deploy is automatic. For an SRE the value is the health-check detection: the difference between a deploy that "completed" and one that is actually serving traffic is whether something checked the /healthz endpoint, and this skill makes that check a permanent part of the deploy definition rather than a step someone remembers on a good day.

When to use: once per service, on day one, before the first automated deploy. Re-run it whenever the platform, the production URL, or the health endpoint changes. It is the prerequisite that makes the next skill safe.
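To make the "completed" versus "actually serving traffic" distinction concrete, here is a minimal sketch of a health-check poll in Python. It is not what setup-deploy generates; the function names, the retry count, and the delay are assumptions chosen for illustration, and the only thing taken from the text above is the idea of checking a /healthz endpoint before trusting a deploy.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError, OSError):
        # Connection refused, DNS failure, timeout, or a 4xx/5xx response.
        return False


def wait_until_healthy(
    url: str,
    attempts: int = 10,            # hypothetical default
    delay_seconds: float = 3.0,    # hypothetical default
    probe: Callable[[str], bool] = probe_health,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health endpoint until it passes or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return True
        if attempt < attempts:
            sleep(delay_seconds)
    return False
```

A deploy script would call something like wait_until_healthy("https://your-service.example/healthz") after the platform reports success and treat a False result as a failed deploy. The probe and sleep parameters are injectable so the loop can be exercised without a network.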

land-and-deploy

Also from gstack. A structured workflow that merges the PR, waits for CI and the deploy to finish, then verifies production health via canary checks before it calls the release done. This is the SRE's core discipline encoded as a single command: no merge without green CI, no "done" without a health verification, and a clear stopping point if the canary degrades. The failure mode it removes is the one every reliability team has lived through: the deploy that technically shipped, reported success, and quietly broke production because nobody watched the first minute after rollout.

When to use: every production deploy bigger than a copy fix. Configure it once with setup-deploy above so the health endpoint and platform are already known. Pair it with the devops-engineers skill stack for the CI-pipeline half of the same release path, and with the platform-engineering skill stack for the rollback-and-blast-radius controls around it.
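The gating discipline described above (each stage must pass before the next one runs, and the release is only "done" when the last check passes) can be sketched as a small ordered pipeline. This is an illustration under stated assumptions, not the land-and-deploy implementation: the step names and the ReleaseResult type are hypothetical, and each step is just a callable that returns True or False.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ReleaseResult:
    done: bool            # True only when every step passed
    stopped_at: str       # name of the failed step, or "" when all passed
    completed: List[str]  # steps that passed, in order


def run_release(steps: List[Tuple[str, Callable[[], bool]]]) -> ReleaseResult:
    """Run steps in order and stop at the first one that fails."""
    completed: List[str] = []
    for name, step in steps:
        if not step():
            return ReleaseResult(done=False, stopped_at=name, completed=completed)
        completed.append(name)
    return ReleaseResult(done=True, stopped_at="", completed=completed)
```

A caller might pass steps such as ("ci_green", ...), ("merge", ...), ("deploy_finished", ...), and ("canary_healthy", ...), in that order. The point of the structure is that a failing canary leaves done set to False and names the stage where the release stopped, so "deployed" and "verified" are never confused.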

canary

Also from gstack. Post-deploy canary monitoring: it watches the live application for console errors, performance regressions, and page failures, takes periodic screenshots, compares them against a pre-deploy baseline, and alerts on anomalies. For an SRE this is the detection layer that decides whether a release is promoted or rolled back. The metric that matters is time-to-detect: a regression that exists in production until a user complains has already burned error budget, and the canary shrinks that window from "hours until someone notices" to "minutes until the comparison fails."

When to use: on every deploy, as the automated gate land-and-deploy hands off to. Keep the baseline fresh so the comparison stays meaningful. For deeper page-speed and Core Web Vitals regression tracking on top of it, gstack also ships a benchmark skill that establishes load-time and bundle-size baselines and compares them on every PR — useful when the regression you fear is latency rather than an outright error.
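The baseline comparison at the heart of a canary can be sketched in a few lines. The following is a simplified stand-in, not the canary skill's logic: it assumes every metric is "lower is better" (error counts, load time in milliseconds), and the metric names and the 20% tolerance are hypothetical.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.20,  # hypothetical: allow up to 20% worse than baseline
) -> List[str]:
    """Return the names of metrics that exceed baseline by more than tolerance.

    Assumes all metrics are 'lower is better'. A metric missing from the
    current sample is reported as a regression rather than silently ignored.
    """
    regressions: List[str] = []
    for name, base_value in baseline.items():
        observed = current.get(name)
        if observed is None:
            regressions.append(name)
            continue
        allowed = base_value * (1 + tolerance)
        if observed > allowed:
            regressions.append(name)
    return regressions
```

With a baseline of zero console errors, any error at all is flagged, because the allowed ceiling is also zero. An empty result is the signal to promote; a non-empty one is the signal to roll back or investigate.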

firecrawl-monitor

From the cli plugin (425 stars, verified, by Firecrawl). It detects when content on a website changes and notifies you by webhook or email, with no cron job, scraper, or diff script to maintain. Each watched page is labelled same, new, changed, removed, or error, with snapshot history and per-field diffs, and a built-in judge filters out timestamp and formatting noise so it only fires on real changes. The SRE use is external dependency awareness: watch a vendor's status page, an upstream API's changelog, or a provider's deprecation notice, and route the webhook into your alerting. Half of the incidents that surprise an on-call are changes someone else shipped that you found out about too late.

When to use: point it at every third-party status page, changelog, and pricing or deprecation page your service depends on, and wire the webhook into the same channel your other alerts land in. Pair it with the data-analysts skill stack when the change you are tracking is a number you need to trend rather than an event you need to be paged on.
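The five labels above (same, new, changed, removed, error) can be illustrated with a small classifier over two snapshots of a page. This is a deliberately crude sketch, not how firecrawl-monitor works: in place of its built-in judge it uses a regex that strips ISO-style timestamps and collapses whitespace, and every function name here is an assumption.

```python
import hashlib
import re
from typing import Optional

# Matches ISO-like timestamps such as 2026-06-02T10:15:00Z or 2026-06-02 10:15
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?Z?")


def fingerprint(content: str) -> str:
    """Hash the content after removing timestamps and whitespace differences."""
    without_timestamps = TIMESTAMP.sub("", content)
    collapsed = " ".join(without_timestamps.split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def classify(
    previous: Optional[str],   # last snapshot, or None if never seen
    current: Optional[str],    # new snapshot, or None if the page is gone
    fetch_failed: bool = False,
) -> str:
    """Label a page as same, new, changed, removed, or error."""
    if fetch_failed:
        return "error"
    if current is None:
        return "removed"
    if previous is None:
        return "new"
    if fingerprint(previous) == fingerprint(current):
        return "same"
    return "changed"
```

Only a "changed", "removed", or "error" label would normally be worth routing to an alert channel. The normalization step is what keeps a status page that merely refreshes its "last updated" time from paging anyone.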

investigate

Also from gstack. A systematic debugging workflow with four phases (investigate, analyze, hypothesize, implement) under one Iron Law: no fixes without a confirmed root cause. This is the antidote to the most expensive habit in incident response, which is mitigating the symptom (restart the pod, roll back, scale up) and closing the ticket while the actual cause stays armed for the next time. Mitigation stops the bleeding and is the right first move; the skill is for the part after, where the question is what really happened and what else it touches, so the same outage does not recur on a different service next week.

When to use: on every incident with a real customer impact, and on every "it was fine yesterday" that the mitigation did not fully explain. Run it after the page is resolved, not instead of resolving it. Pair it with the backend-engineers skill stack when the root cause lives in application logic rather than infrastructure.

context-save

Also from gstack. It captures git state, the decisions made so far, and the remaining work into a saved context that any later session, or any other person, can pick up without losing the thread, with a paired context-restore to resume. The SRE use is the shift handoff and the long-running incident. When an incident outlasts your on-call window, the cost of a bad handoff is the next person re-deriving everything you already ruled out, often while the clock is still running. A saved context turns the handoff from a hurried verbal summary into a durable artifact: what we know, what we tried, what we ruled out, what is still open.

When to use: at every shift change during an active incident, and any time you step away from a long investigation you will not finish in one sitting. Save before you log off, not after the next person pings you confused. Pair it with investigate above so the saved context already carries the root-cause work in progress.
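As an illustration of what a durable handoff artifact might contain, the sketch below records git state alongside the four lists named above (known, tried, ruled out, still open) and serializes them to JSON. The format context-save actually writes is not described here, so the Handoff type, its field names, and the JSON layout are all assumptions. The git subcommands used (rev-parse and status --porcelain) are standard.

```python
import json
import subprocess
from dataclasses import asdict, dataclass
from typing import Callable, List, Optional


def git(*args: str) -> str:
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
    return result.stdout.rstrip("\n")


@dataclass
class Handoff:  # hypothetical structure, for illustration only
    branch: str
    commit: str
    status_lines: List[str]  # raw `git status --porcelain` lines
    known: List[str]
    tried: List[str]
    ruled_out: List[str]
    still_open: List[str]


def capture_handoff(
    run_git: Callable[..., str] = git,
    known: Optional[List[str]] = None,
    tried: Optional[List[str]] = None,
    ruled_out: Optional[List[str]] = None,
    still_open: Optional[List[str]] = None,
) -> Handoff:
    """Snapshot the repository state together with the investigation notes."""
    return Handoff(
        branch=run_git("rev-parse", "--abbrev-ref", "HEAD"),
        commit=run_git("rev-parse", "HEAD"),
        status_lines=run_git("status", "--porcelain").splitlines(),
        known=list(known or []),
        tried=list(tried or []),
        ruled_out=list(ruled_out or []),
        still_open=list(still_open or []),
    )


def to_json(handoff: Handoff) -> str:
    """Serialize the handoff so it can be written to a file or pasted in chat."""
    return json.dumps(asdict(handoff), indent=2)
```

The ruled_out list is the part a verbal summary tends to drop, and it is the part that saves the next responder the most time.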

learn

Also from gstack. A persistent store of what the project has learned across sessions, with commands to review, search, prune, and export it, plus a proactive nudge when you ask about a past pattern or wonder "didn't we fix this before?" For an SRE this is the lightweight runbook layer that actually gets used, because it is searched from inside the same tool doing the work rather than living in a wiki nobody opens during an incident. The most wasteful thing a reliability team does is solve the same failure twice; a searchable learnings store is the cheapest defence against institutional amnesia.

When to use: write a learning at the close of every incident and every non-obvious fix, and search it at the start of every investigation before you start from scratch. Prune it on a cadence so stale entries do not mislead. For runbooks and postmortems that need to live as real documents in a knowledge base, obsidian-skills ships a verified obsidian-cli skill (33,445 stars) that reads, creates, and searches notes in an Obsidian vault straight from the command line.
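The write, search, and prune loop described above can be pictured with a tiny in-memory store. This is a sketch of the idea only; it says nothing about how the learn skill stores or ranks entries, and the Learning type, the keyword matching, and the one-year pruning window are assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Learning:  # hypothetical record shape
    recorded_on: date
    summary: str
    tags: List[str]


def search(learnings: List[Learning], query: str) -> List[Learning]:
    """Return entries whose summary or tags contain every query term, newest first."""
    terms = query.lower().split()
    matches = []
    for entry in learnings:
        haystack = (entry.summary + " " + " ".join(entry.tags)).lower()
        if all(term in haystack for term in terms):
            matches.append(entry)
    return sorted(matches, key=lambda entry: entry.recorded_on, reverse=True)


def prune(learnings: List[Learning], today: date, max_age_days: int = 365) -> List[Learning]:
    """Keep only entries recorded within the last max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [entry for entry in learnings if entry.recorded_on >= cutoff]
```

Pruning purely by age is the bluntest possible policy. In practice a team would more likely review old entries than delete them automatically, but the sketch shows where that cadence plugs in.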

retro

Also from gstack. A weekly engineering retrospective that analyzes commit history, work patterns, and code-quality metrics with persistent history and trend tracking, and it is team-aware: it breaks contributions down per person with both praise and growth areas. The SRE framing is the blameless postmortem at a cadence rather than only after a fire. Reliability is a trend, not an event: the retro turns the week's incidents, deploys, and near-misses into a tracked line you can show the team, so the conversation shifts from "who broke it" to "what does the slope say about where we are heading."

When to use: at the end of every week or sprint — the skill proactively suggests itself there. Use the trend tracking as the artifact you bring to the reliability review, the same way the rest of the org reads a dashboard. Pair it with the qa-engineers skill stack for the pre-release quality half of the same reliability picture.
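For a sense of the raw material a weekly trend line is built from, the sketch below buckets commits by ISO week and author. It is not the retro skill's analysis. It assumes input lines in the form "author|YYYY-MM-DD", which is what `git log --pretty=format:'%an|%ad' --date=short` prints, and the function name is hypothetical.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple


def weekly_commit_counts(log_lines: Iterable[str]) -> Dict[Tuple[int, int, str], int]:
    """Count commits per (ISO year, ISO week, author).

    Each line is expected to look like 'Author Name|2026-05-25'.
    Blank lines are skipped.
    """
    counts: Counter = Counter()
    for line in log_lines:
        if not line.strip():
            continue
        # Split on the last '|' so an author name containing '|' still parses.
        author, day = line.rsplit("|", 1)
        iso = date.fromisoformat(day.strip()).isocalendar()
        counts[(iso[0], iso[1], author.strip())] += 1
    return dict(counts)
```

Commit counts alone are a weak proxy for anything. The useful move is plotting a number like this week over week next to incident and deploy counts, so the discussion is about the slope rather than a single bad week.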

How to install

Seven of the eight skills live in the gstack plugin (104,138 stars on garrytan/gstack, MIT, verified; the largest plugin in the index) and one in cli (425 stars, verified, by Firecrawl), so install is a two-marketplace operation. Skill Index has the exact install command on each skill detail page with a copy button.

The highest-ROI sequence: run setup-deploy once per service, then route every release through land-and-deploy and gate it behind canary so no deploy is "done" until production health is verified. Point firecrawl-monitor at every upstream status and changelog page. Keep investigate ready for the next incident, save state with context-save at every shift change, write a learn entry at the close of each incident, and run retro weekly so reliability becomes a trend line you can defend.

Pair the on-call blocks with deep-work sessions via focus.thicket.sh and capture incident evidence as you go with capture.thicket.sh, and the month stops being a string of reactive pages and starts being a reliability program with a slope you can point at.

Frequently Asked Questions

Which Claude Code skill should an SRE install first?

Install gstack first and run setup-deploy once per service. It detects your deploy platform (Fly.io, Render, Vercel, Netlify, Heroku, GitHub Actions, or custom), your production URL, your health-check endpoints, and your deploy-status commands, then writes the configuration into CLAUDE.md so every future deploy is automatic. The gstack plugin has 104,138 stars and an MIT license and is the largest, most active plugin in the index. The reason this is the first install is the health-check detection: the difference between a deploy that 'completed' and one that is actually serving traffic is whether something checked the /healthz endpoint, and setup-deploy makes that check a permanent part of the deploy definition rather than a step someone remembers on a good day. Run it once per service on day one, re-run it whenever the platform or health endpoint changes, and it becomes the prerequisite that makes land-and-deploy safe.

What is the best Claude Code skill for a safe production deploy?

Use land-and-deploy from the gstack plugin (104,138 stars, MIT, verified). It merges the PR, waits for CI and the deploy to finish, and then verifies production health via canary checks before it calls the release done. This is the SRE's core discipline encoded as a single command: no merge without green CI, no 'done' without a health verification, and a clear stopping point if the canary degrades. The failure mode it removes is the one every reliability team has lived through — the deploy that technically shipped, reported success, and quietly broke production because nobody watched the first minute after rollout. Configure it once with setup-deploy so the health endpoint and platform are already known, then route every production deploy bigger than a copy fix through it.

How do I catch a production regression right after deploying?

Use canary from gstack. It watches the live application for console errors, performance regressions, and page failures, takes periodic screenshots, compares them against a pre-deploy baseline, and alerts on anomalies. For an SRE this is the detection layer that decides whether a release is promoted or rolled back, and the metric that matters is time-to-detect: a regression that exists in production until a user complains has already burned error budget, and the canary shrinks that window from hours until someone notices to minutes until the comparison fails. Run it on every deploy as the automated gate that land-and-deploy hands off to, and keep the baseline fresh so the comparison stays meaningful. When the regression you fear is latency rather than an outright error, gstack's benchmark skill establishes Core Web Vitals and bundle-size baselines and compares them on every PR.

How can an SRE get alerted when an upstream dependency changes?

Use firecrawl-monitor from the cli plugin (425 stars, verified, by Firecrawl). It detects when content on a website changes and notifies you by webhook or email — no cron job, no scraper, no diff script to maintain. Each watched page is labelled same, new, changed, removed, or error, with snapshot history and per-field diffs, and a built-in judge filters out timestamp and formatting noise so it only fires on real changes. The SRE use is external dependency awareness: watch a vendor's status page, an upstream API's changelog, or a provider's deprecation notice, and route the webhook straight into the same channel your other alerts land in. Half of the incidents that surprise an on-call are changes someone else shipped that you found out about too late, and this closes that gap.

What is the best Claude Code skill for incident root-cause analysis?

Use investigate from gstack. It runs a four-phase workflow — investigate, analyze, hypothesize, implement — under one Iron Law: no fixes without a confirmed root cause. This is the antidote to the most expensive habit in incident response, which is mitigating the symptom (restart the pod, roll back, scale up) and closing the ticket while the actual cause stays armed for the next time. Mitigation stops the bleeding and is the right first move; the skill is for the part after, where the question is what really happened and what else it touches, so the same outage does not recur on a different service next week. Run it on every incident with real customer impact and on every 'it was fine yesterday' the mitigation did not fully explain — after the page is resolved, not instead of resolving it.

How do SREs hand off an incident at shift change without losing context?

Use context-save from gstack. It captures git state, the decisions made so far, and the remaining work into a saved context that any later session or any other person can pick up without losing the thread, with a paired context-restore to resume. When an incident outlasts your on-call window, the cost of a bad handoff is the next person re-deriving everything you already ruled out, often while the clock is still running. A saved context turns the handoff from a hurried verbal summary into a durable artifact: what we know, what we tried, what we ruled out, what is still open. Save at every shift change during an active incident and any time you step away from a long investigation you will not finish in one sitting — before you log off, not after the next person pings you confused. Pair it with investigate so the saved context already carries the root-cause work in progress.

How do reliability teams stop solving the same incident twice?

Use learn from gstack — a persistent store of what the project has learned across sessions, with commands to review, search, prune, and export it, plus a proactive nudge when you ask about a past pattern or wonder 'didn't we fix this before?' For an SRE this is the lightweight runbook layer that actually gets used, because it is searched from inside the same tool doing the work rather than living in a wiki nobody opens during an incident. Write a learning at the close of every incident and every non-obvious fix, search it at the start of every investigation before starting from scratch, and prune it on a cadence so stale entries do not mislead. When runbooks and postmortems need to live as real documents in a knowledge base, the obsidian-skills plugin's verified obsidian-cli skill (33,445 stars) reads, creates, and searches notes in an Obsidian vault straight from the command line. Pair both with a weekly retro so the team sees reliability as a trend rather than a string of disconnected fires.
