check-trajectories-workflow
Use Inspect Scout to analyze agent trajectories from evaluation log files. Runs default and custom scanners to detect external failures, formatting issues, reward hacking, and ethical refusals. Use when the user asks to check or analyze agent trajectories, or asks you to run the "Check Agent Trajectories" workflow.
Check Agent Trajectories
This workflow uses Inspect Scout to automatically analyze agent trajectories. It's faster than manual analysis but may miss nuanced issues.
Workflow Steps
- If the user gives no indication of which log file to use, ask which log file or files they want the workflow run on. You can show them the most recent log file in `logs/` and ask whether that is the one they want.
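One way to find the most recent log file to suggest is sketched below. It assumes log files sit directly under `logs/` and that "most recent" means latest modification time; the helper name `most_recent_log` and the `.eval`/`.json` suffixes are assumptions for illustration, not part of the workflow scripts.

```python
from pathlib import Path
from typing import Optional


def most_recent_log(log_dir: str = "logs") -> Optional[Path]:
    """Return the newest log file in log_dir, or None if there is none."""
    # Assumption: log files end in .eval or .json; adjust to your setup.
    candidates = [
        p
        for p in Path(log_dir).glob("*")
        if p.is_file() and p.suffix in {".eval", ".json"}
    ]
    # Pick the newest by modification time; default=None covers an empty folder.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

If the function returns `None`, there is nothing to suggest and the user has to name a log file themselves.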
- Determine the name of the evaluation the log file comes from, and create a folder titled `agent_artefacts/trajectory_analysis/<eval_name>`. Add a number to the end if the folder already exists.
- Read `agent_artefacts/trajectory_analysis/inspect_scout/scanners.py` to see what the default scanners check for:
  a. outcome_summary: Brief summary of why the agent succeeded or failed.
  b. external_failure: Failed due to CAPTCHAs, rate limiting, network issues, missing dependencies.
  c. formatting_failure: Failed due to incorrect answer formatting despite correct answer.
  d. reward_hacking_success: Succeeded through reward hacking or unintended means.
  e. ethical_refusal: Failed because the agent refused on ethical or safety grounds.

  Additionally, if `agent_artefacts/trajectory_analysis/<eval_name><version>` already exists, check for any scanners contained in the latest version and include those.
- Tell the user what will be checked by default and ask if they want to check for anything else.
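The folder-naming rule from the folder-creation step above (add a number if the folder already exists) could be sketched as follows. The helper name `next_analysis_dir`, the choice to start numbering at 2, and appending the number without a separator are assumptions; the workflow only says to add a number to the end.

```python
from pathlib import Path


def next_analysis_dir(
    eval_name: str,
    base: str = "agent_artefacts/trajectory_analysis",
) -> Path:
    """Return <base>/<eval_name>, or <eval_name>2, <eval_name>3, ... if taken."""
    root = Path(base)
    candidate = root / eval_name
    n = 2  # Assumption: first duplicate gets the suffix 2.
    while candidate.exists():
        candidate = root / f"{eval_name}{n}"
        n += 1
    return candidate
```

The function only picks a path; calling `next_analysis_dir("my_eval").mkdir(parents=True)` would then create it.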
- If the user wants additional checks, create an `eval_scanners.py` file under `agent_artefacts/trajectory_analysis/<eval_name>/` and add Inspect Scout scanners that check for their requirements. Use the existing scanners in `agent_artefacts/trajectory_analysis/inspect_scout/scanners.py` and the Inspect Scout documentation as references. Also copy across any scanners found in a previous version of the `<eval_name>` folder during the earlier check.

  Each custom scanner should be wrapped in an `InspectEvalScanner` object and added to a `SCANNERS` list:

  ```python
  from scanners import InspectEvalScanner

  SCANNERS = [
      InspectEvalScanner(
          name="my_scanner",
          scanner_factory=my_scanner_function,
          invalidates_success=False,  # Set True if flagging invalidates a success
          invalidates_failure=True,  # Set True if flagging invalidates a failure
      ),
  ]
  ```

  Ask the user whether each scanner should invalidate successes, failures, both, or neither.
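To help explain the two flags to the user, here is a minimal sketch of how they might be interpreted when judging a sample. `ScannerFlags` and `sample_is_valid` are hypothetical stand-ins written for this illustration, and the flag values in `EXAMPLE` are assumed rather than read from `scanners.py`; the actual logic in `analyze_validity.py` may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ScannerFlags:
    """Hypothetical stand-in holding a scanner's two invalidation flags."""

    name: str
    invalidates_success: bool = False
    invalidates_failure: bool = False


def sample_is_valid(
    succeeded: bool,
    flagged: Iterable[str],
    scanners: Iterable[ScannerFlags],
) -> bool:
    """A sample's outcome stays valid unless a flagged scanner invalidates it."""
    flagged_names = set(flagged)
    for scanner in scanners:
        if scanner.name not in flagged_names:
            continue
        if succeeded and scanner.invalidates_success:
            return False
        if not succeeded and scanner.invalidates_failure:
            return False
    return True


# Example flag choices (assumed for illustration only).
EXAMPLE = [
    ScannerFlags("reward_hacking_success", invalidates_success=True),
    ScannerFlags("external_failure", invalidates_failure=True),
    ScannerFlags("outcome_summary"),
]
```

Under these assumed settings, a success flagged for reward hacking no longer counts as a genuine success, and a failure flagged as external no longer counts as a genuine failure, while a purely descriptive scanner affects neither.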
- Check how many samples are in the log file with `uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> --dry-run`. If more than 100 samples would be analysed, ask the user whether they want to run all samples or a subset. Tell them that Inspect Evals guidance is to run at least 100 samples, and ask whether they want to run more than that.
- Provide the user with the command they can use to run the scanners. If the user asks you to run it, you can do so.
  a. If no custom scanners: `uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> -o agent_artefacts/trajectory_analysis/<eval_name>/scout_results`, with an optional `--limit <num>` if a subset is run.
  b. If custom scanners were created in the earlier custom-scanner step, add `-n <eval_name>` like so: `uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> -n <eval_name> -o agent_artefacts/trajectory_analysis/<eval_name>/scout_results`

  The `-n <eval_name>` option automatically loads scanners from `agent_artefacts/trajectory_analysis/<eval_name>/eval_scanners.py`. Duplicate scanner names are detected and skipped with a warning.
- Once the command is done, extract the results: `uv run python agent_artefacts/trajectory_analysis/inspect_scout/extract_results.py agent_artefacts/trajectory_analysis/<eval_name>/scout_results`
- Analyze sample validity: `uv run python agent_artefacts/trajectory_analysis/inspect_scout/analyze_validity.py agent_artefacts/trajectory_analysis/<eval_name>/scout_results`
- Review the extracted results and create a summary in `<eval_name>_<model_name>_ANALYSIS.md`. If the file already exists, ask the user whether they want it overwritten or a new file generated by date. If they want it generated by date, use `<eval_name>_<model_name>_<date>_ANALYSIS.md`.
- Tell the user the task is done.
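The two variants of the scanner command described above differ only in the `-n` and `--limit` options, so assembling them can be sketched as one small function. The helper name `build_scan_command` and its parameters are hypothetical; the script path and the `-n`, `-o`, and `--limit` options are taken from the steps above.

```python
from typing import List, Optional

RUNNER = "agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py"


def build_scan_command(
    log_file: str,
    eval_name: str,
    custom_scanners: bool = False,
    limit: Optional[int] = None,
) -> List[str]:
    """Assemble the scanner command as an argument list."""
    cmd = ["uv", "run", "python", RUNNER, log_file]
    if custom_scanners:
        # -n loads scanners from <eval_name>/eval_scanners.py.
        cmd += ["-n", eval_name]
    cmd += ["-o", f"agent_artefacts/trajectory_analysis/{eval_name}/scout_results"]
    if limit is not None:
        # Only added when a subset of samples is run.
        cmd += ["--limit", str(limit)]
    return cmd
```

Joining the list with spaces gives a command to show the user; passing the list to `subprocess.run` would execute it without shell quoting concerns.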
technical
- github: UKGovernmentBEIS/inspect_evals
- stars: 517
- license: MIT
- contributors: 100
- last commit: 2026-05-29T04:29:08Z
- file: .claude/skills/check-trajectories-workflow/SKILL.md