ML work looks like software engineering until it doesn't. A failing unit test is a crisp signal; a model that's 0.3 points worse on the eval set is a fog. Runs are expensive, failures are probabilistic, and most of the work is not writing code but deciding what to run next. The skills below are the ones that most directly compress the loop between "I have an idea" and "I know if it worked." All live in verified plugins with real commit history and real star counts.
If your pipeline calls Claude at any point — data labeling, synthetic data generation, evaluation, agent rollouts — this is the one skill you cannot skip. From the flagship skills plugin (121,347 stars). Covers prompt caching, tool use, batch API, memory, citations, thinking, model migration (4.5 → 4.6 → 4.7), and compaction. It bundles the API fluency most ML engineers otherwise learn the expensive way: by overpaying for a month.
When to use: any service or notebook importing anthropic (Python) or @anthropic-ai/sdk (TypeScript). Especially critical for prompt caching: without caching, a 1M-token context is billed at the full input rate on every single call. With caching done right, cache hits cost ~10% of that, and the math on synthetic-data generation stops being ugly.
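A minimal sketch of what "done right" looks like in the Python SDK. The placeholders (LABELING_INSTRUCTIONS, REFERENCE_CORPUS, the example string) and the model id are assumptions to swap for your own; the key move is putting the large, stable prefix first and marking it with a cache_control breakpoint:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LABELING_INSTRUCTIONS = "..."   # hypothetical: your labeling rubric
REFERENCE_CORPUS = "..."        # hypothetical: the big context reused on every call
example = "..."                 # the one part that changes per call

response = client.messages.create(
    model="claude-sonnet-4-5",  # swap in whatever model your pipeline actually uses
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LABELING_INSTRUCTIONS + "\n\n" + REFERENCE_CORPUS,
        "cache_control": {"type": "ephemeral"},  # cache everything up to this block
    }],
    messages=[{"role": "user", "content": f"Label this example:\n{example}"}],
)

# The usage block tells you whether the cache actually hit:
print(response.usage.cache_creation_input_tokens,  # first call: billed at a premium
      response.usage.cache_read_input_tokens)      # later calls: ~10% of base rate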
From the flagship superpowers plugin (162,164 stars). Forces a disciplined hypothesis-and-test loop: state the hypothesis, design the minimal experiment, observe, update. For ML this matters more than for deterministic code — when a run is worse and you can't tell if it's the data pipeline, the loss function, the learning rate, or variance, random changes are a Ouija board.
When to use: any time your eval metric moves and you don't immediately know why. Reproducing with a fixed seed is the first experiment, not the last resort.
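One way to run that first experiment, assuming a PyTorch stack (swap the torch lines for your framework's equivalents):

```python
import os, random
import numpy as np
import torch

def set_seed(seed: int = 1234) -> None:
    """Pin every RNG the run touches. Unreproducible variance isn't a hypothesis."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True  # trade kernel speed for determinism
    torch.backends.cudnn.benchmark = False

set_seed(1234)
# Rerun the eval path here. If the 0.3-point drop vanishes under a fixed seed,
# the "regression" was variance, and the next experiment is measuring spread
# across several seeds, not bisecting commits.
```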
Cross-model benchmark. Runs the same prompt through Claude, GPT (via Codex CLI), and Gemini side-by-side and compares latency, tokens, cost, and optionally quality via an LLM judge. From gstack (78,986 stars). Answers "which model is actually best for this step in the pipeline?" with data instead of vibes.
When to use: before committing to a model for an inference-heavy pipeline stage. Also after every major model release — the relative ordering changes more often than people want to believe, and the cheapest model that clears your quality bar is usually the one your CFO will prefer.
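The skill handles the fan-out and the judging, but the accounting underneath is simple enough to sanity-check by hand. A sketch under stated assumptions: each vendor client is wrapped as call(prompt) -> (text, tokens_in, tokens_out), and PRICES is a made-up table you would replace with current list prices:

```python
import time, statistics

PRICES = {"model-a": (3.00, 15.00), "model-b": (0.25, 1.25)}  # hypothetical $/1M tokens

def bench(call, model, prompts, runs=3):
    """call(prompt) -> (text, tokens_in, tokens_out); any SDK can be wrapped this way."""
    latencies, costs = [], []
    p_in, p_out = PRICES[model]
    for prompt in prompts:
        for _ in range(runs):
            t0 = time.perf_counter()
            _, tokens_in, tokens_out = call(prompt)
            latencies.append(time.perf_counter() - t0)
            costs.append((tokens_in * p_in + tokens_out * p_out) / 1e6)
    return statistics.median(latencies), statistics.mean(costs)

# Decision rule from above: the cheapest model whose quality clears the bar.
# Quality comes from your eval set or an LLM judge, not from this function.
```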
Spawns multiple independent subagents in one go, each working in its own context. ML maps onto this naturally: try five prompt variants in parallel, evaluate on the same held-out set, pick the winner. Sequential A/B iteration wastes hours on IO-bound model calls that could have run concurrently.
When to use: any time you have 2+ independent experiments — prompt variants, hyperparameter settings, retrieval configs, sampling strategies. The skill handles the fan-out; you handle the scoring.
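The same shape in plain Python, for pipelines that fan out at the API level rather than the subagent level. Everything here is a toy stand-in (run_variant fakes an IO-bound call against a two-row held-out set); the structure is the point:

```python
import time
from concurrent.futures import ThreadPoolExecutor

HELDOUT = [{"input": "2+2", "label": "4"}, {"input": "3+3", "label": "6"}]
ANSWERS = {"2+2": "4", "3+3": "6"}  # toy oracle standing in for a model call

def run_variant(variant: str, text: str) -> str:
    time.sleep(0.5)        # simulates the network latency you'd otherwise eat serially
    return ANSWERS[text]   # swap in your real client call here

def evaluate(variant: str) -> float:
    """One prompt variant over the shared held-out set; returns accuracy."""
    outputs = [run_variant(variant, ex["input"]) for ex in HELDOUT]
    return sum(o == ex["label"] for o, ex in zip(outputs, HELDOUT)) / len(HELDOUT)

variants = ["terse", "cot", "few-shot", "schema", "judge-style"]
with ThreadPoolExecutor(max_workers=len(variants)) as pool:
    scores = dict(zip(variants, pool.map(evaluate, variants)))

print(scores, "->", max(scores, key=scores.get))  # same eval set, so max is the winner
```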
A hard gate against claiming work is done before evidence supports it. Runs the verification commands and refuses to declare success until their output actually backs the claim. For ML this is the defense against the most common silent failure: reporting that a training run "converged" because the script exited 0, without looking at the loss curve or the held-out metric.
When to use: before every "the new model is better" claim, every PR that touches data preprocessing, every eval that depends on a sampler or a judge. Evidence before assertions, always.
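A hand-rolled version of the gate, assuming a hypothetical layout where each run writes per-seed eval scores to JSON. The noise threshold is a crude stand-in for a proper significance test, but even this refuses far more bad claims than exit code 0 does:

```python
import json, statistics, sys

with open("runs/baseline/eval_scores.json") as f:   # e.g. [71.2, 71.6, 70.9]
    baseline = json.load(f)
with open("runs/candidate/eval_scores.json") as f:
    candidate = json.load(f)

b, c = statistics.mean(baseline), statistics.mean(candidate)
noise = statistics.stdev(baseline)                  # seed-to-seed spread of the baseline

if c - b <= noise:
    sys.exit(f"NOT VERIFIED: gain {c - b:+.2f} does not clear baseline spread {noise:.2f}")
print(f"verified: {b:.2f} -> {c:.2f} over {len(candidate)} seeds")
```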
Safety guardrails for destructive commands. Warns before rm -rf, DROP TABLE, aws s3 rm --recursive, force pushes, and other operations that don't have an "undo" button. From gstack (78,986 stars).
When to use: always on, but especially when rotating training data, pruning checkpoint directories, or cleaning up S3 buckets of model artifacts. The expensive lesson here is deleting the only copy of a checkpoint that took 40 GPU-hours to produce. Cheap insurance.
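The same principle applied by hand when you prune checkpoints yourself: destructive only on explicit opt-in, dry run by default. The path and the checkpoint naming scheme are assumptions:

```python
import shutil
from pathlib import Path

def prune_checkpoints(run_dir: str, keep: int = 3, dry_run: bool = True) -> None:
    """Delete all but the newest `keep` checkpoints; defaults to a dry run."""
    ckpts = sorted(Path(run_dir).glob("checkpoint-*"), key=lambda p: p.stat().st_mtime)
    doomed = ckpts[:-keep] if keep else ckpts
    for p in doomed:
        if dry_run:
            print(f"would delete {p}")   # review this list before flipping the flag
        else:
            shutil.rmtree(p)

prune_checkpoints("runs/sft-v2", keep=3, dry_run=True)  # hypothetical run directory
```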
Enforces the red-green-refactor cycle: failing test first, watch it fail, implement, watch it pass. For ML, the analogue is writing the eval harness and the assertion about expected behavior before writing the transform or the training loop. Data leakage bugs, off-by-one tokenization bugs, and silent label-mapping regressions all surface the moment you have a cheap test that asserts the thing you actually care about.
When to use: before touching data preprocessing, custom loss code, tokenizer logic, or any function whose bug would produce a plausible-looking but wrong result. These are the bugs that train for hours before revealing themselves.
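A sketch of the failing-test-first step for a label-mapping function. preprocessing.map_labels does not exist yet, which is the point: write this, watch it fail, then implement. The expected ids and the error contract are assumptions you would pin to your actual model config:

```python
import pytest
from preprocessing import map_labels   # hypothetical module under test; fails until written

RAW = [{"text": "great", "sentiment": "pos"},
       {"text": "awful", "sentiment": "neg"}]

def test_label_mapping_matches_model_config():
    mapped = map_labels(RAW)
    # pos must be 1 and neg must be 0; a silent flip would still train fine
    # and produce a plausible-looking, completely wrong model.
    assert [ex["label"] for ex in mapped] == [1, 0]

def test_unknown_label_fails_loudly():
    with pytest.raises(ValueError):
        map_labels([{"text": "meh", "sentiment": "neutral?"}])
```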
How to install
Each skill lives inside a plugin. Add the plugin marketplace once, then install with a single command. The skill detail page has the exact install string.
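The shape of those two commands, with placeholders standing in where the skill detail page supplies the real names:

```
/plugin marketplace add <owner>/<marketplace-repo>   # once per marketplace
/plugin install <plugin-name>@<marketplace-name>     # once per plugin
```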
If you're new to Claude Code plugins and doing ML work, the highest-ROI first install is skills (for claude-api) together with superpowers (for everything else on this list except the two gstack entries). Between them you get API fluency, a disciplined debugging loop, verification gates, and parallel experiment dispatch — the four things that most visibly shorten the iteration cycle on ML work.