The 8 Best Claude Code Skills for AI Engineers (2026)

By Skill Index Editorial · Jun 4, 2026

Building an LLM application is mostly plumbing the model never sees. The prompt is the easy part; the work is everything around it — the retrieval that feeds it context, the schema that constrains its output, the eval that says whether v2 of the agent beats v1, and the debugging when a run that passed yesterday fails today for no obvious reason. AI engineering is the discipline of making a nondeterministic component behave like a dependable system, and the loop is concrete: ingest grounding data, design the agent, run it, judge it, root-cause it when it breaks, and codify what worked so the next version starts from there. The eight skills below map onto that loop. Each is a real, verified Claude Code skill from a plugin with public commit history and a real star count on GitHub, and Skill Index has the exact install command on each detail page.

firecrawl-search comes from the cli plugin (425 stars, verified, by Firecrawl). It runs a real web search and optionally scrapes the full page content of every result, returning structured JSON with LLM-optimized markdown rather than the thin snippets a basic search API gives back. For an AI engineer this is the retrieval front door: when an agent needs current information it does not have in its index, this is the tool that finds the pages and hands back clean, model-ready text in one step. It is the first stage of a retrieval pipeline you control end to end: search for sources, then escalate to scrape or crawl once you know which pages matter.

When to use: any time an agent needs to answer from live web data, or when you are assembling the seed set for a retrieval index and need to discover the right pages first. Use the --scrape option so the search returns full content and feed that markdown straight into the model as grounding context. Pair it with the ml-engineers skill stack for the model-side evaluation around the retrieval you build here.
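To make "feed that markdown straight into the model as grounding context" concrete, here is a minimal sketch of the assembly step. The result shape (a list of dicts with `url` and `markdown` keys) and the function name `build_grounding_context` are assumptions for illustration, not the documented output of firecrawl-search; adapt the keys to the JSON you actually get back.

```python
from typing import Iterable


def build_grounding_context(results: Iterable[dict], max_chars: int = 8000) -> str:
    """Concatenate scraped pages into one source-labelled context string.

    Assumes each result is a dict with "url" and "markdown" keys (a
    hypothetical shape). Pages are added in order until the approximate
    character budget is spent.
    """
    sections = []
    used = 0
    for result in results:
        markdown = (result.get("markdown") or "").strip()
        if not markdown:
            continue  # skip results that carry no scraped content
        block = f"[source: {result.get('url', 'unknown')}]\n{markdown}"
        if used + len(block) > max_chars:
            break  # stop before the budget is exceeded instead of cutting a page in half
        sections.append(block)
        used += len(block)
    return "\n\n".join(sections)
```

Labelling every block with its source URL keeps the model's answer traceable to a page, which helps later when you need to decide whether a bad answer came from retrieval or from generation.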

firecrawl-crawl is also from the cli plugin (425 stars, verified, by Firecrawl). It bulk-extracts an entire site or site section (all of /docs, say), following links up to a depth or page limit you set, with path filtering and concurrent extraction, and returns each page as clean markdown. For an AI engineer this is how you build a RAG corpus without writing and maintaining a scraper: point it at a documentation site, a knowledge base, or a product wiki, and get back the whole thing as embed-ready text. The honest bottleneck in most retrieval systems is not the vector store or the reranker; it is the quality and coverage of the corpus, and a one-command bulk crawl that returns clean markdown removes the most tedious part of assembling that corpus.

When to use: when you are standing up a RAG index over a docs site and need every page under a path as clean text to chunk and embed. Set the depth and limit deliberately so you ingest the section you mean to and not the entire internet. Pair it with firecrawl-search above — search to discover the right entry points, then crawl the section that matters.
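To see why depth and limit need to be set deliberately, it helps to picture what a bounded crawl does. The sketch below is a toy breadth-first walk over an in-memory link graph; it is not Firecrawl's implementation, and `bounded_crawl`, its parameters, and the `links` dictionary are all hypothetical names chosen for illustration.

```python
from collections import deque


def bounded_crawl(links, start, path_prefix, max_depth, page_limit):
    """Breadth-first walk over a link graph with three guards.

    links: dict mapping a URL path to the paths it links to (a stand-in
    for fetching a page and extracting its links).
    path_prefix: only paths starting with this prefix are followed.
    max_depth: maximum number of link hops from the start page.
    page_limit: maximum number of pages returned.
    """
    if page_limit < 1 or not start.startswith(path_prefix):
        return []
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(visited) < page_limit:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(page, []):
            if target in seen or not target.startswith(path_prefix):
                continue  # already collected, or outside the section we want
            if len(visited) >= page_limit:
                break
            seen.add(target)
            visited.append(target)
            queue.append((target, depth + 1))
    return visited
```

With a path prefix of `/docs`, a link to `/blog/x` is never followed, and raising the depth by one can multiply the page count, which is why the page limit acts as the final safety net.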

firecrawl-parse is also from the cli plugin (425 stars, verified, by Firecrawl). It turns a local file (PDF, DOCX, ODT, RTF, XLSX, HTML, and more) into clean markdown on disk, with options for an AI summary or a direct question against the file. This is the other half of grounding-context preparation: many of the documents an LLM app needs to reason over live as files, not URLs, and a PDF dumped raw into a prompt is full of headers, footers, and layout noise that wastes tokens and confuses the model. firecrawl-parse produces the clean, structured markdown you actually want to chunk and embed, saved to disk so you can read it incrementally with grep and head rather than blowing up the context window. Better source text is the highest-leverage, least-glamorous fix in most retrieval systems.

When to use: any time the grounding data for an app is a local document rather than a web page — parse it first, inspect the markdown, then chunk only the relevant sections into the index. Treat clean parsing as a prerequisite for ingestion, not an afterthought. Pair it with firecrawl-crawl above so web-sourced and file-sourced context land in the same clean-markdown format before they hit your embedder.
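The "chunk only the relevant sections" step can be sketched as a heading-based splitter over the parsed markdown. This is a minimal illustration, assuming the parsed file uses ATX headings (`#`, `##`, and so on); the function names `split_markdown_sections` and `select_sections` are hypothetical and not part of firecrawl-parse.

```python
import re


def split_markdown_sections(markdown: str) -> list[dict]:
    """Split markdown into sections at ATX headings (#, ##, ...).

    Text before the first heading is kept under the title "(preamble)".
    Sections with no body text are dropped.
    """
    sections = []
    title, lines = "(preamble)", []
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            if any(ln.strip() for ln in lines):
                sections.append({"title": title, "text": "\n".join(lines).strip()})
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if any(ln.strip() for ln in lines):
        sections.append({"title": title, "text": "\n".join(lines).strip()})
    return sections


def select_sections(sections, keywords):
    """Keep sections whose title mentions any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [s for s in sections if any(k in s["title"].lower() for k in lowered)]
```

One known limitation of this sketch: a line starting with `#` inside a fenced code block (a shell comment, for example) would be treated as a heading, so inspect the split output before embedding it.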

spec comes from the gstack plugin (104,138 stars, MIT, verified; the largest plugin in the index). It turns a vague intent into a precise, executable spec in five phases, files the result as a GitHub issue, and can spawn an agent in a fresh worktree to build it. For an AI engineer this is the design step that agent work usually skips and pays for later. An LLM agent is only as good as the boundary you draw around it (what it is allowed to do, what counts as success, what the inputs and outputs are), and writing that down as a real spec before you start is the difference between an agent you can evaluate and one you can only eyeball. The five-phase structure forces the ambiguous parts into the open, which is exactly where agent behavior goes wrong if you leave it implicit.

When to use: at the start of any non-trivial agent or feature, before you write the first prompt or tool definition. Run it to pin down the success criteria and the I/O contract, then build against that spec so your later eval has something concrete to score against. Pair it with the backend-engineers skill stack when the agent you are speccing has to integrate with a real service layer.
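One way to give a later eval "something concrete to score against" is to write each success criterion as a named, executable check. The sketch below is an assumption about how you might encode criteria yourself; it is not the format the spec skill produces, and `Criterion`, `score`, and `passes` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One success criterion from the spec, expressed as a predicate."""
    name: str
    check: Callable[[dict], bool]


def score(output: dict, criteria: list) -> dict:
    """Evaluate an agent output against every criterion by name."""
    return {c.name: bool(c.check(output)) for c in criteria}


def passes(output: dict, criteria: list) -> bool:
    """An output passes only if it meets every criterion."""
    return all(score(output, criteria).values())


# Example criteria for a hypothetical question-answering agent.
CRITERIA = [
    Criterion("has_answer", lambda o: bool(o.get("answer"))),
    Criterion("cites_source", lambda o: len(o.get("sources", [])) >= 1),
]
```

Per-criterion results matter more than the overall pass flag: they show which part of the contract a new version of the agent broke.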

codex is also from gstack. It is a wrapper around the OpenAI Codex CLI with three modes: an independent diff review with a pass/fail gate, an adversarial "challenge" mode that actively tries to break your code, and a consult mode with session continuity for follow-ups. The result is a second opinion from a different model than the one you are coding with. For an AI engineer this matters twice over. First, your application code (the retrieval glue, the tool definitions, the output parsers) gets reviewed by a model with no stake in having written it, and the challenge mode is the closest thing to an automated red-team for the brittle parsing logic that LLM apps are full of. Second, it is a working pattern for the thing you are probably building anyway: routing a task to more than one model and letting them check each other rather than trusting a single generation.

When to use: as the review gate before you land any pipeline change, specifically in challenge mode against the parts most likely to fail silently — the JSON extraction, the retry logic, the fallback paths. A different model catches the failure mode your primary model is blind to. Pair it with investigate below so a surfaced bug gets root-caused rather than patched.
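The JSON extraction and retry paths named above are worth seeing in miniature, because they are where silent failures tend to hide. This is a minimal sketch using only the standard library; `extract_json_object` and `call_with_json_retry` are hypothetical helpers, and `generate` stands in for whatever function calls your model.

```python
import json


def extract_json_object(text: str):
    """Return the first JSON object embedded in model output, or None.

    Scans for each "{" and asks the decoder to parse from there, so a
    reply wrapped in prose or a code fence is still handled.
    """
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char != "{":
            continue
        try:
            value, _ = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next one
        if isinstance(value, dict):
            return value
    return None


def call_with_json_retry(generate, prompt: str, max_attempts: int = 3) -> dict:
    """Call generate(prompt) until a reply contains a JSON object.

    Raises ValueError instead of returning a default, so the failure is
    loud rather than silent.
    """
    last_reply = ""
    for _ in range(max_attempts):
        last_reply = generate(prompt)
        parsed = extract_json_object(last_reply)
        if parsed is not None:
            return parsed
    raise ValueError(f"no JSON object after {max_attempts} attempts: {last_reply[:80]!r}")
```

The design choice to raise on exhaustion is deliberate: a fallback that quietly returns an empty dict is exactly the kind of path a challenge-style review should flag.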

investigate is also from gstack. It is a systematic debugging workflow with four phases (investigate, analyze, hypothesize, implement) under one Iron Law: no fixes without a confirmed root cause. For an AI engineer this is the antidote to the most expensive habit in LLM development, which is treating every failure as a prompt-tuning problem. When an agent returns garbage, the reflex is to add another instruction to the system prompt and move on; often the real cause is retrieval that returned the wrong chunk, a schema the model never actually saw, or a tool that errored quietly. The structured workflow forces you past the prompt and into the pipeline, where nondeterministic failures usually live.

When to use: every time a run fails or degrades and the cause is not obvious — especially the "it worked yesterday" failures that prompt tweaks do not fix. Run it before you touch the prompt, so you find out whether the bug is in the model's behavior or in the plumbing feeding it. Pair it with the data-analysts skill stack when the failure is a metric drift you need to trend rather than a single broken run.
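The question of whether a bug sits in the model or in the plumbing can be turned into a check you run before touching the prompt. The sketch below assumes you log a small trace per run; the `RunTrace` fields and the `triage` function are hypothetical and are not part of the investigate skill.

```python
from dataclasses import dataclass, field


@dataclass
class RunTrace:
    """Hypothetical record of one agent run, captured by your own logging."""
    expected_chunk_id: str
    retrieved_chunk_ids: list = field(default_factory=list)
    tool_errors: list = field(default_factory=list)
    schema_in_prompt: bool = True


def triage(trace: RunTrace) -> str:
    """Name the first pipeline stage that looks broken, plumbing first.

    Only when tools, retrieval, and schema delivery all look healthy is
    the failure attributed to the model's behavior.
    """
    if trace.tool_errors:
        return "tool"
    if trace.expected_chunk_id not in trace.retrieved_chunk_ids:
        return "retrieval"
    if not trace.schema_in_prompt:
        return "schema"
    return "model"
```

The ordering encodes the argument of the paragraph above: the three plumbing causes are ruled out before the prompt is considered a suspect.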

skillify is also from gstack. It walks back through the conversation, takes the most recent successful flow, and codifies it into a permanent, tested skill on disk (script, test, and fixture) so the next run executes the saved version instead of re-deriving it. For an AI engineer this is how an agent flow that finally works stops being a one-off you reconstruct from memory and becomes a named, committed, test-covered artifact. The generated test is the quiet payoff: it pins the behavior, so the next time you change the prompt or swap the model you find out immediately whether you improved the flow or broke it. That is the regression check most agent work never has, and its absence is the reason agents that worked in a demo quietly rot in production.

When to use: the moment an agent flow crosses from "experimenting" to "I want this exact behavior every time," before the working version scrolls out of reach. Treat the committed skill as the canonical version of that flow, and let its test be the gate any future change has to clear. Pair it with spec above so the flow you codify is the one that actually met the criteria you wrote down.
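To show what "let its test be the gate" looks like in practice, here is a minimal fixture-driven regression check. It is a sketch under stated assumptions, not the script, test, or fixture format that skillify generates; `run_fixture` and the stand-in flow `extract_priority` are hypothetical.

```python
import json


def extract_priority(ticket: str) -> str:
    """Stand-in for a codified flow: classify a support ticket."""
    text = ticket.lower()
    return "high" if "urgent" in text or "outage" in text else "normal"


def run_fixture(flow, fixture_json: str) -> list:
    """Run the flow on every recorded case and return the mismatches.

    fixture_json is a JSON array of {"input": ..., "expected": ...}
    objects recorded from a run that was known to be good.
    """
    failures = []
    for case in json.loads(fixture_json):
        actual = flow(case["input"])
        if actual != case["expected"]:
            failures.append(
                {"input": case["input"], "expected": case["expected"], "actual": actual}
            )
    return failures


FIXTURE = json.dumps([
    {"input": "URGENT: checkout is down", "expected": "high"},
    {"input": "How do I change my avatar?", "expected": "normal"},
])
```

An empty list from `run_fixture(extract_priority, FIXTURE)` means the change cleared the gate. Exact-match comparison suits post-processed, structured outputs; free-form model text usually needs a looser comparison than the one sketched here.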

learn is also from gstack. It is a persistent store of what the project has learned across sessions, with commands to review, search, prune, and export it, plus a proactive nudge when you ask about a past pattern or wonder "didn't we solve this before?" For an AI engineer this is the eval-and-pattern memory that actually gets used, because it is searched from inside the same tool doing the work rather than living in a doc nobody opens. LLM development generates a steady stream of hard-won findings (this embedding model beat that one on our data, this chunk size stopped the hallucination, this retry pattern fixed the timeout), and the recurring waste is rediscovering the same finding three months later because it lived in a Slack thread that scrolled away. A searchable learnings store is the cheapest defense against re-running an experiment you already ran.

When to use: write a learning every time an eval produces a non-obvious result or a change measurably moves a metric, and search it at the start of any new pipeline before you re-derive a finding you already proved out. Prune on a cadence so stale results do not mislead a future decision. For findings and rubrics that need to live as real, linkable documents in a knowledge base, the obsidian-skills plugin ships a verified obsidian-cli skill (33,445 stars) that reads, creates, and searches notes in an Obsidian vault from the command line.

How to install

Three of the eight skills live in the cli plugin (425 stars, verified, by Firecrawl) and five in gstack (104,138 stars on garrytan/gstack, MIT, verified — the largest and most active plugin in the index), so install is a two-marketplace operation, and Skill Index has the exact install command on each skill detail page with a copy button. The highest-ROI sequence for an LLM-app build: use firecrawl-search, firecrawl-crawl, and firecrawl-parse to assemble clean, embed-ready grounding context from the web and from local files. Write the agent down with spec before you build it, so the success criteria exist before the first prompt. Gate every change through codex — especially challenge mode against the brittle parsing paths — and when something breaks, root-cause it with investigate instead of tuning the prompt blind. When a flow finally works, skillify it into a tested artifact and write the finding into learn so the next iteration starts from your best version, not from scratch. Bracket the deep build sessions with focused blocks via focus.thicket.sh, and the work stops being a string of one-off experiments and starts being a measured pipeline you can defend.

Frequently Asked Questions

Which Claude Code skill should an AI engineer install first?

It depends on where your bottleneck is, but for most LLM-app builds the retrieval layer comes first, so install the cli plugin (425 stars, verified, by Firecrawl) and start with firecrawl-search. It runs a real web search and optionally scrapes the full page content of every result, returning structured JSON with LLM-optimized markdown rather than the thin snippets a basic search API gives back. For an AI engineer this is the retrieval front door: when an agent needs current information it does not have in its index, this is the tool that finds the pages and hands back clean, model-ready text in one step. Use the --scrape option so the search returns full content, not just URLs, and feed that markdown straight into the model as grounding context. If your bottleneck is agent design rather than retrieval, start instead with gstack's spec skill (104,138 stars, MIT, verified) to pin down the success criteria before you write the first prompt.

What is the best Claude Code skill for building a RAG corpus?

Use firecrawl-crawl from the cli plugin (425 stars, verified, by Firecrawl). It bulk-extracts an entire site or site section — all of /docs, say — following links up to a depth or page limit you set, with path filtering and concurrent extraction, and returns each page as clean markdown. For an AI engineer this is how you build a RAG corpus without writing and maintaining a scraper: point it at a documentation site, a knowledge base, or a product wiki, and get back the whole thing as embed-ready text. The honest bottleneck in most retrieval systems is not the vector store or the reranker; it is the quality and coverage of the corpus, and a one-command bulk crawl that returns clean markdown removes the most tedious part of getting that corpus in the first place. Set the depth and limit deliberately so you ingest the section you mean to, and pair it with firecrawl-search to discover the right entry points first and firecrawl-parse for the local files that belong in the same index.

How do I prepare local documents as grounding context for an LLM app?

Use firecrawl-parse from the cli plugin (425 stars, verified, by Firecrawl). It turns a local file — PDF, DOCX, ODT, RTF, XLSX, HTML, and more — into clean markdown on disk, with options for an AI summary or a direct question against the file. The use is grounding-context preparation: a lot of the documents an LLM app needs to reason over live as files, not URLs, and a PDF dumped raw into a prompt is full of headers, footers, and layout noise that wastes tokens and confuses the model. firecrawl-parse produces the clean, structured markdown you actually want to chunk and embed, saved to disk so you read it incrementally with grep and head rather than blowing up the context window. Parse first, inspect the markdown, then chunk only the relevant sections into the index — better source text is the highest-leverage, least-glamorous fix in most retrieval systems.

How should an AI engineer design an agent before building it?

Use spec from the gstack plugin (104,138 stars, MIT, verified — the largest plugin in the index). It turns a vague intent into a precise, executable spec in five phases, files the result as a GitHub issue, and can spawn an agent in a fresh worktree to build it. For an AI engineer this is the design step that agent work usually skips and pays for later: an LLM agent is only as good as the boundary you draw around it — what it is allowed to do, what counts as success, what the inputs and outputs are — and writing that down as a real spec before you start is the difference between an agent you can evaluate and one you can only eyeball. The five-phase structure forces the ambiguous parts into the open, which is exactly where agent behavior goes wrong if you leave it implicit. Run it at the start of any non-trivial agent, then build against the spec so your later eval has concrete success criteria to score against.

How do I get a second-model review of my LLM application code?

Use codex from gstack — a wrapper around the OpenAI Codex CLI with three modes: an independent diff review with a pass/fail gate, an adversarial challenge mode that actively tries to break your code, and a consult mode with session continuity. For an AI engineer it matters twice over. First, your application code — the retrieval glue, the tool definitions, the output parsers — gets reviewed by a model with no stake in having written it, and the challenge mode is the closest thing to an automated red-team for the brittle parsing logic LLM apps are full of. Second, it is a working pattern for the multi-model routing you are probably building anyway: send a task to more than one model and let them check each other rather than trusting a single generation. Run it as the review gate before you land any pipeline change, specifically in challenge mode against the JSON extraction, retry logic, and fallback paths most likely to fail silently, and pair it with investigate so a surfaced bug gets root-caused rather than patched.

How do I debug a nondeterministic LLM failure that prompt tweaks don't fix?

Use investigate from gstack — a systematic debugging workflow with four phases (investigate, analyze, hypothesize, implement) under one Iron Law: no fixes without a confirmed root cause. For an AI engineer this is the antidote to the most expensive habit in LLM development, which is treating every failure as a prompt-tuning problem. When an agent returns garbage, the reflex is to add another instruction to the system prompt and move on; half the time the real cause is retrieval that returned the wrong chunk, a schema the model never actually saw, or a tool that errored quietly. The structured workflow forces you past the prompt and into the pipeline, where nondeterministic failures usually live. Run it every time a run fails or degrades and the cause is not obvious — especially the 'it worked yesterday' failures — before you touch the prompt, so you find out whether the bug is in the model's behavior or in the plumbing feeding it.

How do AI engineers stop re-running experiments they already ran?

Use learn from gstack — a persistent store of what the project has learned across sessions, with commands to review, search, prune, and export it, plus a proactive nudge when you ask about a past pattern or wonder 'didn't we solve this before?' For an AI engineer this is the eval-and-pattern memory that actually gets used, because it is searched from inside the same tool doing the work rather than living in a doc nobody opens. LLM development generates a steady stream of hard-won findings — this embedding model beat that one on our data, this chunk size stopped the hallucination, this retry pattern fixed the timeout — and the recurring waste is rediscovering the same finding months later because it lived in a Slack thread that scrolled away. Write a learning whenever an eval produces a non-obvious result, search it before you re-derive a finding, and prune on a cadence so stale results do not mislead. When findings need to live as real linkable documents, the obsidian-skills plugin's verified obsidian-cli skill (33,445 stars) reads, creates, and searches notes in an Obsidian vault from the command line, and you can pair the build sessions with focused blocks via focus.thicket.sh.
