eval-quality-workflow

community[skill]

Fix or review a single evaluation against all EVALUATION_CHECKLIST.md standards. Use "fix" mode to refactor an eval into compliance, or "review" mode to assess compliance without making changes. Use when the user asks to fix, review, or check an evaluation's quality, or asks you to run the "Fix An Evaluation" or "Review An Evaluation" workflow. Do NOT use for reviewing ALL evals against a single code quality standard (use code-quality-review-all instead).

$/plugin install inspect_evals

details

Evaluation Quality — Fix or Review

This skill covers two closely related workflows for a single evaluation in src/inspect_evals/:

Fix An Evaluation: Refactor the evaluation to comply with EVALUATION_CHECKLIST.md
Review An Evaluation: Assess compliance without making changes

Identifying the Evaluation

If the user has given you a name, that takes priority. If you were just building an evaluation, or the user has uncommitted code for one specific evaluation, you can assume that's the correct one. If you are not confident which evaluation to look at, ask the user.

Fix An Evaluation

Our standards are in EVALUATION_CHECKLIST.md, with links to BEST_PRACTICES.md and CONTRIBUTING.md. Your job is to refactor the evaluation to meet these standards.

1. Set up the working directory:
   a. If the user provides specific instructions about any step, the user's instructions override these instructions.
   b. If there is no evaluation name, ask the user for one.
   c. The evaluation name is the eval folder name plus its version (from the @task function's version argument). For instance, GPQA version 1.1.2 becomes "gpqa_1_1_2". If a folder with this exact name already exists, append a numbered suffix, as in "gpqa_1_1_2_analysis2". This name will be referred to as <eval_name>.
   d. Create the folder agent_artefacts/<eval_name>/fix if it isn't present.
   e. Whenever you create a .md file as part of this workflow, create it in agent_artefacts/<eval_name>/fix.
   f. Copy EVALUATION_CHECKLIST.md to the folder.
   g. Create a NOTES.md file for miscellaneous helpful notes. Err on the side of taking lots of notes. Create an UNCERTAINTIES.md file to note any uncertainties.
2. Go over each item in EVALUATION_CHECKLIST.md from top to bottom, using the linked documents for context where necessary. It is important to go over every single item in the checklist!
   a. For each checklist item, assess your confidence that you know what is being asked of you. Select Low, Medium, or High.
   b. If you have High confidence, fix the evaluation so that it passes the item if needed, then edit your copy of EVALUATION_CHECKLIST.md to place a check next to the item. It is acceptable to check off an item without making any changes if it already passes the requirement.
   c. If you have Medium confidence, note this in UNCERTAINTIES.md along with any questions you have, then do your best to solve it as per the High confidence workflow.
   d. If you have Low confidence, note this in UNCERTAINTIES.md along with any questions you have, do not attempt to solve it, and leave the item unchecked.
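The working-directory setup in step 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not part of the skill itself: the helper names artefact_name and set_up_workdir are hypothetical, and the checklist is assumed to live at the repository root.

```python
from pathlib import Path
import shutil


def artefact_name(eval_folder: str, version: str, artefacts_root: Path) -> str:
    """Folder name plus version, e.g. ("gpqa", "1.1.2") -> "gpqa_1_1_2"."""
    base = f"{eval_folder}_{version.replace('.', '_')}"
    if not (artefacts_root / base).exists():
        return base
    # The name is taken: append a numbered suffix, starting at "_analysis2".
    n = 2
    while (artefacts_root / f"{base}_analysis{n}").exists():
        n += 1
    return f"{base}_analysis{n}"


def set_up_workdir(repo_root: Path, eval_folder: str, version: str,
                   mode: str = "fix") -> Path:
    """Create agent_artefacts/<eval_name>/<mode> and seed it with the files."""
    artefacts_root = repo_root / "agent_artefacts"
    name = artefact_name(eval_folder, version, artefacts_root)
    workdir = artefacts_root / name / mode
    workdir.mkdir(parents=True, exist_ok=True)

    # Assumption: EVALUATION_CHECKLIST.md sits at the repository root.
    checklist = repo_root / "EVALUATION_CHECKLIST.md"
    if checklist.exists():
        shutil.copy(checklist, workdir / checklist.name)

    # Empty note files, filled in as the checklist pass proceeds.
    for notes_file in ("NOTES.md", "UNCERTAINTIES.md"):
        (workdir / notes_file).touch(exist_ok=True)
    return workdir
```

Passing mode="review" would give the equivalent layout for the review workflow.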

Do not attempt the evaluation report. Tell the user in SUMMARY.md that producing evaluation reports automatically requires a separate workflow (the /eval-report-workflow skill). You should still perform the initial tests to ensure the evaluation runs at all. Test on the model that the example commands in the evaluation's README use. For example, uv run inspect eval inspect_evals/<task_name> --model openai/gpt-5-nano --limit 2, where <task_name> is the evaluation's task name as it appears in the README's example commands, not the versioned <eval_name> used for the artefact folder.
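The smoke test above can be assembled programmatically. The sketch below only builds the argument list; the function name smoke_test_command is hypothetical, and "gpqa_diamond" is used purely as an illustrative task name. The model should be whichever one the evaluation's README uses.

```python
import shlex


def smoke_test_command(task_name: str, model: str, limit: int = 2) -> list[str]:
    """Build the 'does it run at all' command for a single task."""
    return [
        "uv", "run", "inspect", "eval",
        f"inspect_evals/{task_name}",
        "--model", model,
        "--limit", str(limit),
    ]


# Illustrative values only; take the task name and model from the README.
cmd = smoke_test_command("gpqa_diamond", "openai/gpt-5-nano")
print(shlex.join(cmd))
```

To actually run it, the list can be passed to subprocess.run(cmd, check=True) from the repository root, which raises if the evaluation exits with a non-zero status.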

Your task is over when you have examined every single checklist item except for the evaluation report and completed each one that you are capable of. Do a final pass over NOTES.md and UNCERTAINTIES.md, then create a SUMMARY.md file and summarise what you have done. Then inform the user that your task is finished.
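The confidence triage and the stopping condition can be summarised as a small decision table. This is an illustrative sketch only: the names Confidence, Plan, plan_for, and is_finished are hypothetical and do not exist in the repository.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Confidence(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Plan:
    attempt_fix: bool      # try to bring the eval into compliance
    log_uncertainty: bool  # record questions in UNCERTAINTIES.md


def plan_for(confidence: Confidence) -> Plan:
    """Map confidence in understanding a checklist item to an action."""
    if confidence is Confidence.HIGH:
        return Plan(attempt_fix=True, log_uncertainty=False)
    if confidence is Confidence.MEDIUM:
        return Plan(attempt_fix=True, log_uncertainty=True)
    # Low confidence: note the uncertainty and leave the item unchecked.
    return Plan(attempt_fix=False, log_uncertainty=True)


def is_finished(examined: dict[str, Optional[Confidence]]) -> bool:
    """Done once every item has been examined, i.e. given a confidence."""
    return all(conf is not None for conf in examined.values())
```

Note that finishing depends on every item having been examined, not on every item being checked off: Low-confidence items stay unchecked by design.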

Review An Evaluation

Our standards are in EVALUATION_CHECKLIST.md, with links to BEST_PRACTICES.md and CONTRIBUTING.md. Your job is to review the evaluation against the agent-checkable standards without making changes.

1. Set up the working directory:
   a. If the user provides specific instructions about any step, the user's instructions override these instructions.
   b. If there is no evaluation name, ask the user for one.
   c. The evaluation name is the eval folder name plus its version (from the @task function's version argument). For instance, GPQA version 1.1.2 becomes "gpqa_1_1_2". If a folder with this exact name already exists, append a numbered suffix, as in "gpqa_1_1_2_analysis2". This name will be referred to as <eval_name>.
   d. Create the folder agent_artefacts/<eval_name>/review if it isn't present.
   e. Whenever you create a .md file as part of this workflow, create it in agent_artefacts/<eval_name>/review.
   f. Copy EVALUATION_CHECKLIST.md to the folder.
   g. Create a NOTES.md file for miscellaneous helpful notes. Err on the side of taking lots of notes. Create an UNCERTAINTIES.md file to note any uncertainties.
2. Read our EVALUATION_CHECKLIST.md and linked documents.
3. For each item in the agent-runnable checks, go over all possibly relevant files in the evaluation, working from top to bottom and using the linked documents for context where necessary. It is important to go over every single item in the checklist! Note any issues you find in NOTES.md in the following format:
   a. Standard: Describe the standard that has not been met.
   b. Issue: Describe the issue.
   c. Location: If possible, write the file and lines where the issue occurs. If the issue is not localised by line, write only the file. If the issue is not localised by file, write the evaluation name here.
   d. Fix: What fix you would recommend. Go into as much detail as you feel confident in. It's okay if you don't have a strong solution to a given problem, but if you do, describe your intended solution.
   e. Comment: Write a comment that can be used as a GitHub comment and that gives all the relevant information from above. Each comment should begin with (Agent) to make it clear where the comment comes from. We will provide functionality to post these as GitHub comments in future through the 'gh' CLI, so make your comments compatible with this. The comments should be informative, yet polite: the contributor may not have been the one to ask for this review, and external contributors have varying degrees of experience.
4. Write a summary in SUMMARY.md of the evaluation's overall quality, how many issues you found, and major issues if any. The contributor won't see this unless they're the one who asked for this review, so you can be more blunt here.
5. Tell the user your task is done.
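The NOTES.md issue format from the review steps can be modelled as a small record. The Issue class, its to_notes_entry method, and the example values below are hypothetical and only illustrate the five fields and the (Agent) prefix.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    standard: str  # the standard that has not been met
    issue: str     # what is wrong
    location: str  # file and lines, file only, or the evaluation name
    fix: str       # recommended fix, as detailed as confidence allows
    comment: str   # polite, self-contained text for a GitHub comment

    def to_notes_entry(self) -> str:
        """Render the issue as a plain-text block for NOTES.md."""
        comment = self.comment
        # Every comment must begin with "(Agent)" to show where it came from.
        if not comment.startswith("(Agent)"):
            comment = f"(Agent) {comment}"
        return "\n".join([
            f"Standard: {self.standard}",
            f"Issue: {self.issue}",
            f"Location: {self.location}",
            f"Fix: {self.fix}",
            f"Comment: {comment}",
        ])


# Hypothetical example values, not taken from a real review.
example = Issue(
    standard="Functions have type hints",
    issue="record_to_sample has no return annotation",
    location="src/inspect_evals/example_eval/example_eval.py, line 42",
    fix="Annotate the return type as Sample",
    comment="Thanks for the contribution! Could you add a return type here?",
)
print(example.to_notes_entry())
```

Keeping the comment as its own field means it can later be posted unchanged once the 'gh' CLI functionality exists.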

technical

github: UKGovernmentBEIS/inspect_evals
stars: 517
license: MIT
contributors: 100
last commit: 2026-05-29T04:29:08Z
file: .claude/skills/eval-quality-workflow/SKILL.md