Prompt engineering is an experimental discipline that pretends to be a writing one. The output is a prompt, but the work is a loop: draft a version, run it against real inputs, score the results, compare it to the last version and to other models, keep what wins, and write down why. The hard part is not wording a prompt cleverly; it is doing that loop with enough rigor that "v4 is better" means something measurable instead of a vibe. The eight skills below map onto that loop — eval and judging, structured output, versioning and reuse, a pattern library, context carry-over across long sessions, and clean grounding data. Each is a real, verified Claude Code skill from a plugin with public commit history and a real star count on GitHub, and Skill Index has the exact install command on each detail page.

benchmark-models

From the gstack plugin (104,138 stars, MIT, verified; the largest plugin in the index). It runs the same prompt through Claude, GPT (via the Codex CLI), and Gemini side by side and compares latency, tokens, cost, and optionally quality via an LLM judge. For a prompt engineer this is the single most useful thing in the index, because it replaces guesswork with data on a question usually settled by instinct: which model is actually best for this prompt. The LLM-judge option is the part that matters most: it turns "this response feels better" into a scored, repeatable comparison you can run again after you change one line of the prompt. It is deliberately separate from gstack's page-performance benchmark skill, so the name does not collide with web-perf work.

When to use: every time you have two candidate prompts or two candidate models and you are about to pick one on instinct. Run it, read the judge score and the token-and-cost columns together, and pick the version that wins on quality at an acceptable cost. Pair it with the ml-engineers skill stack for the model-evaluation context around it.
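As a rough illustration of reading the judge score and the cost column together, here is a minimal Python sketch. The `RunResult` fields, the `pick_winner` helper, and every number are hypothetical; they are not the skill's actual output format, only one way to encode "best quality at an acceptable cost".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One row of a hypothetical comparison table (field names are assumptions)."""
    label: str          # prompt version or model name
    judge_score: float  # higher is better, as scored by an LLM judge
    cost_usd: float     # cost of the run
    latency_s: float    # wall-clock time of the run


def pick_winner(results, max_cost_usd):
    """Return the highest-scoring result within budget, or None if none qualifies."""
    affordable = [r for r in results if r.cost_usd <= max_cost_usd]
    if not affordable:
        return None
    # Ties on score are broken by lower cost, then lower latency.
    return max(affordable, key=lambda r: (r.judge_score, -r.cost_usd, -r.latency_s))


# Made-up numbers, for illustration only.
runs = [
    RunResult("prompt-v3", judge_score=7.8, cost_usd=0.012, latency_s=2.1),
    RunResult("prompt-v4", judge_score=8.6, cost_usd=0.019, latency_s=2.9),
    RunResult("prompt-v4-large-model", judge_score=9.1, cost_usd=0.140, latency_s=6.4),
]

winner = pick_winner(runs, max_cost_usd=0.05)
print(winner.label)  # prints "prompt-v4"
```

Treating cost as a hard ceiling rather than folding it into a weighted score keeps the decision easy to explain: the most expensive run scores highest here, but it never enters the comparison because it is over budget.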

firecrawl-agent

From the cli plugin (425 stars, verified, by Firecrawl). An autonomous extraction agent that navigates complex multi-page sites and returns structured JSON against a schema you supply. The prompt-engineering use is structured output the honest way: instead of begging a prompt to "respond only in JSON" and praying it holds, you hand the tool an explicit JSON schema and get conformant data back. That habit of constraining the shape and then validating against it is the same discipline you want in every production prompt that feeds a downstream system. It is also a fast way to build the input side of an eval set: point it at real pages and get clean, schema-shaped records to test prompts against.

When to use: any time you need real structured data (pricing tiers, product listings, directory entries) as test inputs for a prompt, or any time you are designing a structured-output prompt and want a working schema-constrained pattern to copy. Pair it with the data-analysts skill stack when the extracted records are something you need to analyze for trends rather than feed to a model.
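The "constrain the shape, then validate against it" habit can be sketched in a few lines of standard-library Python. This is a deliberately simplified stand-in for a real JSON Schema validator: `PRICING_TIER_SCHEMA` and `validate_record` are hypothetical names, and the flat field-to-type mapping is an assumption made to keep the example short.

```python
import json

# Hypothetical schema for a pricing-tier record: field name -> accepted Python type(s).
PRICING_TIER_SCHEMA = {
    "name": str,
    "monthly_price_usd": (int, float),
    "features": list,
}


def validate_record(raw_text, schema):
    """Parse model output as JSON and check it against a flat field->type schema.

    Returns (record, errors). record is None when the text is not a JSON object.
    """
    try:
        record = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["top-level value is not an object"]

    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}")
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return record, errors


good = '{"name": "Pro", "monthly_price_usd": 29.0, "features": ["sso"]}'
bad = '{"name": "Pro", "monthly_price_usd": "29"}'

print(validate_record(good, PRICING_TIER_SCHEMA)[1])  # prints []
print(validate_record(bad, PRICING_TIER_SCHEMA)[1])
```

Returning a list of errors instead of raising on the first one makes the result usable as an eval signal: a prompt that produces one wrong field can be told apart from one that produces unparseable text.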

skillify

Also from gstack. It walks back through the conversation, takes the most recent successful flow, and codifies it into a permanent, tested skill on disk (script, test, and fixture) so the next run executes the saved version instead of re-deriving it. For a prompt engineer this is prompt versioning with teeth: the prompt or flow that finally worked stops living in a scrollback you will lose and becomes a named, committed, test-covered artifact. The generated test is the quiet payoff: it pins the behavior, so when you tweak the prompt later you find out immediately whether you improved it or broke it, which is exactly the regression check most prompt work never has.

When to use: the moment a prompt or flow crosses from "experimenting" to "I want this exact thing again," before the working version scrolls out of reach. Treat the committed skill as the canonical version of that prompt. Pair it with benchmark-models above so the version you codify is the one that actually won the comparison.
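To make "the test pins the behavior" concrete, here is a minimal sketch of what a script, fixture, and test trio could look like. It is not the file layout the skill generates; `render_prompt`, `parse_label`, and the fixture values are hypothetical, and the test pins only the deterministic parts of the flow (prompt rendering and response parsing) against a recorded response.

```python
import json

# Script: the prompt that finally worked, plus the parsing around it.
PROMPT_TEMPLATE = (
    "Classify the ticket as bug, billing, or other. Reply as JSON.\n\n"
    "Ticket: {ticket}"
)
ALLOWED_LABELS = {"bug", "billing", "other"}


def render_prompt(ticket):
    """Fill the template with a whitespace-trimmed ticket."""
    return PROMPT_TEMPLATE.format(ticket=ticket.strip())


def parse_label(response_text):
    """Extract the label from a JSON response and reject anything off-list."""
    label = json.loads(response_text)["label"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label}")
    return label


# Fixture: one input and one recorded model response (made up for illustration).
FIXTURE_TICKET = "  Login fails after password reset.  "
FIXTURE_RESPONSE = '{"label": "bug"}'
EXPECTED_PROMPT = (
    "Classify the ticket as bug, billing, or other. Reply as JSON.\n\n"
    "Ticket: Login fails after password reset."
)


# Test: fails the moment an edit changes the rendered prompt or the parsing.
def test_flow_is_pinned():
    assert render_prompt(FIXTURE_TICKET) == EXPECTED_PROMPT
    assert parse_label(FIXTURE_RESPONSE) == "bug"


test_flow_is_pinned()
```

A failing pin is not automatically a regression. It is a prompt to decide on purpose: either the change was intended and the expected value is updated in the same commit, or it was accidental and gets reverted.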

learn

Also from gstack. A persistent store of what the project has learned across sessions, with commands to review, search, prune, and export it, plus a proactive nudge when you ask about a past pattern or wonder "didn't we solve this before?" For a prompt engineer this is the pattern library that actually gets used, because it is searched from inside the same tool doing the work rather than living in a doc nobody opens. The recurring waste in prompt work is rediscovering the same fix (the phrasing that stops a model from hedging, the system-prompt clause that fixes a refusal, the few-shot ordering that lifts accuracy), and a searchable learnings store is the cheapest defense against re-solving it from scratch every time.

When to use: write a learning every time a prompt change produces a non-obvious improvement, and search it at the start of every new prompt before you reinvent a technique you already proved out. Prune on a cadence so stale tricks do not mislead. For prompts and rubrics that need to live as real, linkable documents, pair it with the obsidian-cli skill below.
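The write, search, and prune cycle can be sketched as a toy in-memory store. This says nothing about how the skill stores or indexes its entries; the `Learning` record, the `search` and `prune` helpers, and the two sample entries are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Learning:
    """One recorded finding (a hypothetical record shape)."""
    text: str
    tags: tuple
    recorded: date


def search(learnings, query):
    """Return entries where every query word appears in the text or tags (case-insensitive)."""
    words = query.lower().split()
    hits = []
    for item in learnings:
        haystack = (item.text + " " + " ".join(item.tags)).lower()
        if all(word in haystack for word in words):
            hits.append(item)
    return hits


def prune(learnings, today, max_age_days):
    """Keep only entries recorded within the last max_age_days."""
    return [item for item in learnings if (today - item.recorded).days <= max_age_days]


# Made-up entries, for illustration only.
store = [
    Learning("Putting the hardest few-shot example last lifted accuracy",
             ("few-shot", "ordering"), date(2025, 1, 10)),
    Learning("Asking for a direct answer first reduced hedging",
             ("phrasing",), date(2024, 3, 2)),
]

print([item.text for item in search(store, "few-shot ordering")])
print(len(prune(store, today=date(2025, 2, 1), max_age_days=180)))  # prints 1
```

Pruning by age alone is the crudest possible policy. A trick that stopped working after a model upgrade is stale regardless of its date, so a real cadence would also re-test entries rather than only expire them.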

context-save

Also from gstack. It captures git state, the decisions made so far, and the remaining work into a saved context that a later session can pick up without losing the thread. Prompt iteration is long. By version ten you are holding a dozen tradeoffs in your head: why you dropped the chain-of-thought block, which few-shot example caused the regression, what the judge flagged last run. A context window does not survive a session boundary, and re-deriving that state the next morning is pure waste. The skill turns the end of a session into a durable artifact: what worked, what failed, what is still open.

When to use: at the end of any prompt-iteration session you will not finish in one sitting, and before you step away from a long eval run. Save before you log off, not after you have already lost the thread. Pair it with context-restore below to resume.

context-restore

Also from gstack. The other half of the pair: it loads the most recent saved state, across branches and even across workspace handoffs, so you resume exactly where the last session ended. For a prompt engineer the value is continuity across a multi-day iteration, where the cost of a cold start is re-reading your own diffs to remember which prompt variant was the current best.

When to use: at the start of every session that continues prompt work from a previous day, and any time a teammate hands off an in-progress iteration to you. Run it before you touch the prompt, so you are editing the version that was actually winning. Pair it with context-save above and with learn so the durable wins are searchable even after the session context is gone.
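The save-then-restore round trip amounts to writing a snapshot at the end of a session and loading the newest one at the start of the next. The sketch below shows that shape with JSON files. The file naming, the payload fields, and both function names are assumptions rather than the skills' real format, and the branch name is passed in by hand where the real skill captures git state itself.

```python
import json
import tempfile
from pathlib import Path


def save_context(directory, sequence, branch, decisions, remaining):
    """Write one session snapshot as JSON; zero-padded `sequence` keeps files in order."""
    path = Path(directory) / f"context-{sequence:04d}.json"
    payload = {"branch": branch, "decisions": decisions, "remaining": remaining}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def restore_latest(directory):
    """Load the snapshot with the highest sequence number, or None if there is none."""
    snapshots = sorted(Path(directory).glob("context-*.json"))
    if not snapshots:
        return None
    return json.loads(snapshots[-1].read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    # End of day one and end of day two (made-up session notes).
    save_context(tmp, 1, "prompt-v9", ["dropped chain-of-thought block"], ["rerun judge"])
    save_context(tmp, 2, "prompt-v10", ["removed few-shot example 3"], ["compare cost"])
    # Start of day three: resume from the newest snapshot.
    latest = restore_latest(tmp)

print(latest["branch"])     # prints "prompt-v10"
print(latest["remaining"])  # prints ['compare cost']
```

Keeping every snapshot instead of overwriting a single file means an earlier state stays recoverable when the latest session turns out to have gone down a dead end.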

obsidian-cli

From the obsidian-skills plugin (33,445 stars, MIT, verified). It reads, creates, searches, and manages notes in an Obsidian vault straight from the command line. For a prompt engineer this is the durable home for the prompt library: the place where reusable system prompts, few-shot sets, eval rubrics, and the reasoning behind each version live as real, linkable, searchable documents instead of scattered text files. The advantage over the in-session learn store is permanence and structure: wikilinks let you connect a prompt to its eval results and its judge rubric, so the library is a graph you can navigate rather than a flat log. Build prompts in the tool that runs them, then file the keepers in the vault.

When to use: as the system of record for any prompt you will reuse or share — save the final version, its few-shot examples, and the rubric you judged it against as linked notes. Pair it with learn so the fast in-session capture and the durable vault stay in sync, and with the technical-writers skill stack when the prompts you are documenting are part of a product spec.

firecrawl-parse

Also from the cli plugin (425 stars, verified, by Firecrawl). It turns a local document (PDF, DOCX, XLSX, HTML, and more) into clean markdown on disk, with options for an AI summary or a direct question against the file. The prompt-engineering use is grounding-context preparation: the quality of a RAG or retrieval prompt is capped by the quality of the text you feed it, and a PDF dumped raw into a prompt is full of headers, footers, and layout noise that wastes tokens and confuses the model. This produces the clean markdown you actually want as context, saved to disk so you can read it incrementally with grep and head rather than blowing up the context window. Better grounding data is the highest-leverage, least-glamorous fix in most retrieval prompts.

When to use: any time a prompt needs a local document as grounding context — parse it first, inspect the markdown, then feed only the relevant sections. Treat clean parsing as a prerequisite for any retrieval prompt rather than an afterthought. Pair it with firecrawl-agent above when the grounding data lives on the web rather than on disk.
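Once the markdown is on disk, "feed only the relevant sections" can be as simple as splitting on headings and filtering by keyword. The sketch below assumes the parsed file uses `#`-style headings; `split_sections` and `relevant_sections` are hypothetical helpers, and the naive heading check would misread a `#` line inside a fenced code block.

```python
def split_sections(markdown):
    """Split markdown into (heading, body) pairs on lines that start with '#'."""
    sections, heading, body = [], "", []
    for line in markdown.splitlines():
        if line.startswith("#"):
            if heading or body:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if heading or body:
        sections.append((heading, "\n".join(body).strip()))
    return sections


def relevant_sections(markdown, keywords):
    """Keep sections whose heading or body mentions any keyword (case-insensitive)."""
    wanted = [keyword.lower() for keyword in keywords]
    return [
        (heading, body)
        for heading, body in split_sections(markdown)
        if any(k in heading.lower() or k in body.lower() for k in wanted)
    ]


# A made-up parsed document, for illustration only.
doc = (
    "# Overview\nGeneral intro.\n\n"
    "# Refund policy\nRefunds within 30 days.\n\n"
    "# Contact\nEmail us."
)

context = "\n\n".join(f"## {h}\n{b}" for h, b in relevant_sections(doc, ["refund"]))
print(context)
```

Keyword filtering is only a first pass. Its value is that the grounding context becomes something you can read and check before it reaches the prompt, which is where the inspection step in the workflow above pays off.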

How to install

Five of the eight skills live in the gstack plugin (104,138 stars on garrytan/gstack, MIT, verified; the largest and most active plugin in the index), two in cli (425 stars, verified, by Firecrawl), and one in obsidian-skills (33,445 stars, MIT, verified), so install is a three-marketplace operation. Skill Index has the exact install command on each skill detail page, with a copy button.

The highest-ROI sequence for a prompt engineer's iteration loop: make benchmark-models the gate every prompt-or-model choice passes through, so picks are scored by an LLM judge instead of intuition. Use firecrawl-agent to build schema-shaped test inputs and firecrawl-parse to prepare clean grounding context. When a version finally wins, skillify it into a tested, committed artifact, write the why into learn, and file the keeper in your Obsidian vault via obsidian-cli. Bracket every long iteration with context-save at the end and context-restore at the start so a multi-day prompt never cold-starts. Pair the deep iteration sessions with focused blocks via focus.thicket.sh, and the work stops being a string of disconnected experiments and starts being a measured loop with a version history you can defend.