Research engineering is the discipline of turning a paper-shaped idea into a result you can trust. The job is not writing the model; it is everything around the training loop: chasing down why a run that converged last night diverges today, keeping track of which of forty experiments actually moved the metric, holding the thread across a sweep that takes three days to finish, comparing your implementation against a baseline, reading the literature that tells you which baseline to use, and writing the whole thing up so a reviewer can reproduce it. The hard part is reproducibility under nondeterminism: a single seed change, a silent dtype cast, or a data-loader race can erase a week. The eight skills below map onto that loop: debug (investigate), remember (learn), carry context (context-save and context-restore), evaluate (benchmark-models), research (firecrawl-search), write up (make-pdf), and codify (skillify). Each is a verified Claude Code skill from a plugin with public commit history and a star count on GitHub, and Skill Index has the exact install command on each detail page.
investigate
From the gstack plugin (104,138 stars, MIT, verified; the largest plugin in the index). A systematic debugging workflow with four phases (investigate, analyze, hypothesize, implement) under one Iron Law: no fixes without a confirmed root cause. For a research engineer this is the antidote to the single most expensive failure mode in experimental work, which is changing five things at once when a run breaks and never learning which one mattered. A loss that goes to NaN, an eval that regressed after a refactor, a reproduction that lands two points below the paper: in each case the reflex is to start tweaking the learning rate and launching reruns. The structured workflow forces you to isolate the cause first: was it the seed, the data shuffle, a shape mismatch the framework swallowed, or a mixed-precision cast that overflowed? Finding the actual root cause is what makes the fix transfer to the next run instead of being a coincidence.
When to use: every time a training run, an eval, or a reproduction fails or silently degrades and the cause is not obvious — especially the "it converged yesterday" failures that a blind hyperparameter sweep will never explain. Run it before you launch another GPU-hour of reruns, so the next experiment tests a hypothesis instead of a guess. Pair it with the ml-engineers skill stack for the model-training context around the debugging.
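The "vary one thing" isolation described above can be sketched in a few lines. This is a minimal illustration, not the investigate skill's implementation: `isolate_cause`, `fake_run_ok`, and the config keys are hypothetical names, and the stand-in run function simply pretends that float16 is what breaks training.

```python
from typing import Any, Callable, Dict, List


def isolate_cause(
    good: Dict[str, Any],
    bad: Dict[str, Any],
    run_ok: Callable[[Dict[str, Any]], bool],
) -> List[str]:
    """Return the keys whose change, applied alone to the good config, breaks the run."""
    culprits = []
    for key in sorted(bad):
        if good.get(key) == bad[key]:
            continue  # setting did not change, so there is nothing to test
        trial = dict(good)
        trial[key] = bad[key]  # vary exactly one setting, hold the rest fixed
        if not run_ok(trial):
            culprits.append(key)
    return culprits


def fake_run_ok(cfg: Dict[str, Any]) -> bool:
    # Hypothetical stand-in for a real (short) training run.
    # Assumption for the demo: the run diverges only under float16.
    return cfg["dtype"] != "float16"


good = {"seed": 0, "lr": 3e-4, "dtype": "float32", "shuffle": True}
bad = {"seed": 1, "lr": 1e-3, "dtype": "float16", "shuffle": False}

print(isolate_cause(good, bad, fake_run_ok))  # ['dtype']
```

One caveat with this sketch: it tests each change in isolation, so a failure that only appears when two settings change together would come back as an empty list and need a pairwise pass.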
learn
Also from gstack. A persistent store of what the project has learned across sessions, with commands to review, search, prune, and export it, plus a proactive nudge when you ask about a past pattern or wonder "didn't we already try this?" For a research engineer this is the lab notebook that actually gets read, because it is searched from inside the same tool running the experiments rather than living in a spreadsheet nobody opens. Research generates a relentless stream of hard-won findings: this optimizer beat that one at this batch size, this augmentation hurt on the small split, this seed range is unstable, this preprocessing bug inflated the baseline by a point. The recurring waste is re-running an ablation you already ran three weeks ago because the result lived in a Slack message that scrolled away. A searchable findings store is the cheapest defense against burning compute on an experiment whose answer you already have.
When to use: write a learning every time an experiment produces a non-obvious result or a change measurably moves a metric, and search it before you launch a new sweep so you do not re-derive a finding you already proved out. Prune on a cadence so a stale result from an old data version does not mislead a current decision. Pair it with the data-analysts skill stack when a finding is really a metric trend you need to chart rather than a single fact.
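To make the write, search, and prune cycle concrete, here is a rough in-memory sketch of a findings log. It is not how the learn skill stores anything; `Finding`, `FindingsLog`, the `data_version` field, and the two example findings are all hypothetical and exist only to show the shape of the workflow.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Finding:
    text: str
    data_version: str
    tags: List[str] = field(default_factory=list)


class FindingsLog:
    """In-memory stand-in for a persistent findings store (hypothetical design)."""

    def __init__(self) -> None:
        self.items: List[Finding] = []

    def add(self, text: str, data_version: str, tags: List[str]) -> None:
        self.items.append(Finding(text, data_version, list(tags)))

    def search(self, query: str) -> List[Finding]:
        # Case-insensitive substring match against the text and the tags.
        q = query.lower()
        return [
            f for f in self.items
            if q in f.text.lower() or any(q in t.lower() for t in f.tags)
        ]

    def prune(self, current_version: str) -> int:
        # Drop findings recorded against any other data version; return how many went.
        before = len(self.items)
        self.items = [f for f in self.items if f.data_version == current_version]
        return before - len(self.items)

    def export(self) -> str:
        # One JSON object per line, so the log can live in a plain text file.
        return "\n".join(json.dumps(asdict(f)) for f in self.items)


log = FindingsLog()
log.add("AdamW beat SGD at batch size 256", "v2", ["optimizer"])
log.add("Mixup hurt accuracy on the small split", "v1", ["augmentation"])

print([f.text for f in log.search("adamw")])  # ['AdamW beat SGD at batch size 256']
print(log.prune("v2"))                        # 1 (the v1 finding is dropped)
```

Tagging each finding with the data version it was measured on is the design choice that makes pruning mechanical: a stale result is removed by rule instead of by memory.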
context-save
Also from gstack. It captures git state, the decisions made so far, and the remaining work into a saved context that a later session can pick up without losing the thread. Research work is long by nature. A sweep can run overnight, and a multi-day reproduction holds a dozen open questions in your head: which config is the current best, why you froze the backbone, what the eval flagged on the held-out split, which two runs are still pending. A context window does not survive a session boundary, and reconstructing that state the next morning by re-reading your own diffs and run logs is pure overhead. The skill turns the end of a session into a durable artifact: what worked, what failed, what is still pending.
When to use: at the end of any experiment session you will not finish in one sitting, and before you kick off a long run and step away. Save before you log off, not after you have already lost the thread of which variant was winning. Pair it with context-restore below to resume.
context-restore
Also from gstack. The other half of the pair: it loads the most recent saved state, across branches and even across workspace handoffs, so you resume exactly where the last session ended. For a research engineer the value is continuity across a multi-day experiment, where the cost of a cold start is re-reading scattered run logs to remember which checkpoint was the current best and which ablations are still outstanding. It also makes handoffs real: when a long-running study passes from one person to another, the receiver loads the saved decisions instead of reverse-engineering them from the commit history.
When to use: at the start of every session that continues an experiment from a previous day, and any time an in-progress study is handed off to you. Run it before you launch the next run, so you are extending the configuration that was actually winning rather than a stale branch. Pair it with context-save above and with learn so the durable findings stay searchable even after the session context is gone.
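The save-then-restore round trip can be pictured as writing a small snapshot file at the end of a session and loading the newest one at the start of the next. The sketch below is an assumption about the general pattern, not the gstack file format: `save_context`, `restore_latest`, the JSON layout, and the `context-NNNN.json` naming are hypothetical, and the branch name is passed in by hand rather than read from git so the example runs anywhere.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional


def save_context(
    directory: Path, branch: str, decisions: List[str], remaining: List[str]
) -> Path:
    """Write the next numbered snapshot into the directory and return its path."""
    index = len(list(directory.glob("context-*.json"))) + 1
    snapshot = {
        "index": index,
        "branch": branch,          # assumption: supplied by the caller, not read from git
        "decisions": decisions,    # what was decided and why
        "remaining": remaining,    # work still open when the session ended
    }
    path = directory / f"context-{index:04d}.json"
    path.write_text(json.dumps(snapshot, indent=2))
    return path


def restore_latest(directory: Path) -> Optional[Dict[str, Any]]:
    """Load the highest-numbered snapshot, or None if nothing was ever saved."""
    paths = sorted(directory.glob("context-*.json"))  # zero-padded names sort in order
    if not paths:
        return None
    return json.loads(paths[-1].read_text())


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    save_context(store, "sweep-lr", ["froze the backbone"], ["run seed 3"])
    save_context(
        store,
        "sweep-lr",
        ["froze the backbone", "lr=3e-4 is the current best"],
        ["eval on the held-out split"],
    )
    latest = restore_latest(store)

print(latest["remaining"])  # ['eval on the held-out split']
```

Numbering the snapshots, instead of relying on file timestamps, keeps "most recent" unambiguous even when two saves land within the same clock tick.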
benchmark-models
Also from gstack. It runs the same prompt through Claude, GPT (via the Codex CLI), and Gemini side-by-side and compares latency, tokens, cost, and optionally quality via an LLM judge. For a research engineer whose work involves model-generated outputs (a model-in-the-loop pipeline, a synthetic-data generator, an LLM-judged eval) this is a ready-made cross-model comparison harness, and the LLM-judge option is the part that matters most: it turns "this output looks better" into a scored, repeatable number you can run again after you change one variable. The discipline it encodes is the one every research engineer already lives by (hold everything fixed, vary one thing, compare against a baseline), applied to model choice instead of hyperparameters. It is deliberately separate from gstack's page-performance benchmark skill, so the names do not collide.
When to use: any time your experiment depends on which model generates or judges, and you are about to pick one on instinct. Run it, read the judge score alongside the cost and latency columns, and pick the model that wins on quality at an acceptable budget — then record the result in learn so the next study starts from it. Pair it with the ml-engineers skill stack for the broader model-evaluation context.
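The selection rule in that last step, best judge score within an acceptable budget, is simple enough to write down. The sketch below is only an illustration of the rule: `Result`, `pick_model`, the placeholder model names, and every number are made up for the example and are not output from benchmark-models or measurements of any real model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    model: str
    judge_score: float  # higher is better
    cost_usd: float     # cost per prompt
    latency_s: float    # seconds per prompt


def pick_model(
    results: List[Result], max_cost_usd: float, max_latency_s: float
) -> Optional[Result]:
    """Return the highest-scoring result within budget, or None if none qualifies."""
    within_budget = [
        r for r in results
        if r.cost_usd <= max_cost_usd and r.latency_s <= max_latency_s
    ]
    if not within_budget:
        return None
    # Rank by judge score first; break ties in favor of the cheaper model.
    return max(within_budget, key=lambda r: (r.judge_score, -r.cost_usd))


# Placeholder rows: invented numbers, not real benchmark data.
results = [
    Result("model-a", judge_score=8.1, cost_usd=0.020, latency_s=2.0),
    Result("model-b", judge_score=8.7, cost_usd=0.090, latency_s=3.5),
    Result("model-c", judge_score=7.4, cost_usd=0.004, latency_s=0.9),
]

choice = pick_model(results, max_cost_usd=0.05, max_latency_s=5.0)
print(choice.model)  # model-a
```

Treating the budget as a hard filter and quality as the ranking keeps the decision explainable: the top scorer that was rejected (model-b here) was rejected for a stated cost limit, not on instinct.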
firecrawl-search
From the cli plugin (425 stars, verified, by Firecrawl). It runs a real web search and optionally scrapes the full page content of every result, returning structured JSON with LLM-optimized markdown rather than the thin snippets a basic search API gives back. For a research engineer this is the literature-review front door: when you need the method section of a paper, the exact hyperparameters a baseline used, or the current state of a benchmark leaderboard, this finds the pages and hands back clean, readable text in one step, with no copy-pasting out of a PDF viewer and no fighting a paywalled abstract for the one number you need. Knowing which baseline to reproduce and what configuration it used is half the work of a reproduction, and this is the tool that gets you that context fast.
When to use: at the start of any reproduction or new study, when you need the prior art — the paper that introduced the method, the repo that implements it, the leaderboard that ranks it. Use the full-content option so the search returns the actual method details, not just links, and drop the relevant findings into learn so the next study inherits them. Pair it with investigate above when the literature reveals that your reproduction gap is a known implementation pitfall rather than a bug in your code.
make-pdf
Also from gstack. It turns any markdown file into a publication-quality PDF: proper one-inch margins, intelligent page breaks, page numbers, a cover page, running headers, a clickable table of contents, and a diagonal DRAFT watermark when you want one. For a research engineer this is the last mile that experimental work usually does by hand and badly: the writeup. The results live as markdown next to the code (the experiment log, the ablation table, the methodology), and turning that into a shareable document for a reviewer, a collaborator, or an internal report normally means wrestling with a word processor or a fragile LaTeX setup. A one-command path from the markdown you already wrote to a finished, paginated PDF means the writeup is something you do at the end of every study instead of something you skip.
When to use: any time an experiment report, an ablation summary, or a reproduction writeup needs to leave the repo as a real document — a design review, a results memo, a draft for a paper appendix. Use the DRAFT watermark for the in-progress version so reviewers know it is not final. Pair it with the technical-writers skill stack when the report is part of a larger piece of documentation.
skillify
Also from gstack. It walks back through the conversation, takes the most recent successful flow, and codifies it into a permanent, tested artifact on disk (script, test, and fixture) so the next run executes the saved version instead of re-deriving it. For a research engineer this is reproducibility made concrete: the data-prep step, the eval harness, or the experiment-launch flow that finally works stops being a sequence you reconstruct from memory and becomes a named, committed, test-covered artifact that runs the same way every time. The generated test is the quiet payoff: it pins the behavior, so the next time you change the pipeline you find out immediately whether you improved it or silently broke the thing that made last week's result valid. That regression check is exactly what most research code never has, and its absence is why a result that reproduced in March quietly stops reproducing in June.
When to use: the moment an experiment-prep or eval flow crosses from "I got it working" to "I need this exact behavior on every run," before the working version scrolls out of reach. Treat the committed artifact as the canonical version of that step, and let its test be the gate any future change has to clear. Pair it with context-save above so the flow you codify is the one tied to the run that actually produced the result.
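A pinned regression test of the kind described above can be as small as a fixture, an expected output, and one assertion. The example below is a hand-written sketch, not what skillify generates: `prepare` is a hypothetical data-prep step, and `FIXTURE` and `EXPECTED` are invented rows chosen to exercise its two behaviors.

```python
from typing import Any, Dict, List


def prepare(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical data-prep step: drop unlabeled rows, normalize the text."""
    return [
        {"text": r["text"].strip().lower(), "label": r["label"]}
        for r in rows
        if r.get("label") is not None
    ]


# Fixture: a tiny input that covers both behaviors (filtering and normalizing).
FIXTURE = [
    {"text": "  Hello World ", "label": 1},
    {"text": "no label here", "label": None},
    {"text": "Second ROW", "label": 0},
]

# Pinned output: recorded once from the version that produced the trusted result.
EXPECTED = [
    {"text": "hello world", "label": 1},
    {"text": "second row", "label": 0},
]


def test_prepare_is_pinned() -> None:
    # Any change to prepare() that alters this output must be made on purpose.
    assert prepare(FIXTURE) == EXPECTED


test_prepare_is_pinned()
print("prepare() still matches the pinned output")
```

If a later refactor stops lowercasing the text or starts keeping unlabeled rows, this test fails before the changed pipeline ever feeds an eval.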
How to install
Seven of the eight skills live in the gstack plugin (104,138 stars on garrytan/gstack, MIT, verified; the largest and most active plugin in the index) and one lives in cli (425 stars, verified, by Firecrawl), so install is a two-marketplace operation. Skill Index has the exact install command on each skill detail page, with a copy button. The highest-ROI sequence for a research workflow: when a run breaks, root-cause it with investigate before you spend another GPU-hour on blind reruns, and write the finding into learn so the next study inherits it instead of rediscovering it. Bracket every multi-day sweep with context-save at the end and context-restore at the start so a long experiment never cold-starts. When your work depends on a generated or judged output, settle the model choice with benchmark-models instead of a vibe, and gather the prior art for any reproduction with firecrawl-search. When a study is done, turn the markdown log into a shareable report with make-pdf, and skillify the eval and data-prep flows into tested artifacts so the result stays reproducible months later. Bracket the deep build sessions with focused blocks via focus.thicket.sh. Run that way, the work stops being a string of one-off runs and starts being a measured, reproducible research program you can defend.