check-trajectories-workflow

Use Inspect Scout to analyze agent trajectories from evaluation log files. Runs default and custom scanners to detect external failures, formatting issues, reward hacking, and ethical refusals. Use when the user asks to check or analyze agent trajectories. Trigger when the user asks you to run the "Check Agent Trajectories" workflow.

$/plugin install inspect_evals

Check Agent Trajectories

This workflow uses Inspect Scout to automatically analyze agent trajectories. It's faster than manual analysis but may miss nuanced issues.

Workflow Steps

  1. If the user does not indicate which log file to use, ask them which log file or files the workflow should run on. You can show them the most recent log file in logs/ and ask whether that is the one they want.

  2. Determine the name of the evaluation the log file belongs to, and create a folder named agent_artefacts/trajectory_analysis/<eval_name>. If that folder already exists, add a number to the end of the folder name.

  3. Read agent_artefacts/trajectory_analysis/inspect_scout/scanners.py to see what the default scanners check for:

    a. outcome_summary: Brief summary of why the agent succeeded or failed.
    b. external_failure: Failed due to CAPTCHAs, rate limiting, network issues, missing dependencies.
    c. formatting_failure: Failed due to incorrect answer formatting despite correct answer.
    d. reward_hacking_success: Succeeded through reward hacking or unintended means.
    e. ethical_refusal: Failed because the agent refused on ethical or safety grounds.

    Additionally, if agent_artefacts/trajectory_analysis/<eval_name><version> exists already, check for any scanners contained in the latest version and include those.

  4. Tell the user what will be checked by default and ask if they want to check for anything else.

  5. If the user wants additional checks, create an eval_scanners.py file under agent_artefacts/trajectory_analysis/<eval_name>/, and add Inspect Scout scanners that check for their requirements. Use the existing scanners in agent_artefacts/trajectory_analysis/inspect_scout/scanners.py and the Inspect Scout documentation as references. Also copy across any scanners found in an existing <eval_name> folder at the end of Step 3.

    Each custom scanner should be wrapped in an InspectEvalScanner object and added to a SCANNERS list:

    from scanners import InspectEvalScanner
    
    SCANNERS = [
        InspectEvalScanner(
            name="my_scanner",
            scanner_factory=my_scanner_function,
            invalidates_success=False,  # Set True if flagging invalidates a success
            invalidates_failure=True,   # Set True if flagging invalidates a failure
        ),
    ]
    

    Ask the user whether each scanner should invalidate successes, failures, both, or neither.

  6. Check how many samples are in the log file with uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> --dry-run. If more than 100 samples would be analysed, ask the user if they want to run all samples or a subset. Tell them that Inspect Evals guidance is to run at least 100 samples, and ask them if they want to run any more than that.

  7. Provide the user with the command they can use to run the scanners. If the user asks you to run it, you can do so.

    a. If no custom scanners: uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> -o agent_artefacts/trajectory_analysis/<eval_name>/scout_results, with an optional --limit <num> if a subset is run.
    b. If custom scanners were created in step 5, add -n <eval_name> like so: uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> -n <eval_name> -o agent_artefacts/trajectory_analysis/<eval_name>/scout_results

    The -n <eval_name> option automatically loads scanners from agent_artefacts/trajectory_analysis/<eval_name>/eval_scanners.py. Duplicate scanner names are detected and skipped with a warning.

  8. Once the command is done, extract the results: uv run python agent_artefacts/trajectory_analysis/inspect_scout/extract_results.py agent_artefacts/trajectory_analysis/<eval_name>/scout_results

  9. Analyze sample validity: uv run python agent_artefacts/trajectory_analysis/inspect_scout/analyze_validity.py agent_artefacts/trajectory_analysis/<eval_name>/scout_results

  10. Review the extracted results and create a summary in <eval_name>_<model_name>_ANALYSIS.md. If the file already exists, check to see if the user wants it overwritten or a new file generated by date. If the user wants it generated by date, use <eval_name>_<model_name>_<date>_ANALYSIS.md.

  11. Tell the user the task is done.
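
The mechanical parts of steps 2, 7, and 10 can be sketched in Python as follows. This is a minimal sketch, not code from the repository: the helper names (next_analysis_dir, scanner_command, analysis_filename) are hypothetical, and the folder numbering scheme (starting at 2, no separator) and the ISO date format are assumptions. Only the paths and the -n, -o, and --limit flags are taken from the steps above.

```python
from __future__ import annotations

from datetime import date
from pathlib import Path

# Paths as written in the workflow steps.
ANALYSIS_ROOT = Path("agent_artefacts/trajectory_analysis")
RUNNER = ANALYSIS_ROOT / "inspect_scout" / "run_all_scanners.py"


def next_analysis_dir(eval_name: str, root: Path = ANALYSIS_ROOT) -> Path:
    """Step 2: return <eval_name>, or <eval_name>2, <eval_name>3, ... if taken."""
    candidate = root / eval_name
    version = 2  # assumption: numbering starts at 2 with no separator
    while candidate.exists():
        candidate = root / f"{eval_name}{version}"
        version += 1
    return candidate


def scanner_command(
    log_file: str,
    eval_name: str,
    custom_scanners: bool = False,
    limit: int | None = None,
) -> list[str]:
    """Step 7: assemble the run_all_scanners.py command as an argument list."""
    cmd = ["uv", "run", "python", RUNNER.as_posix(), log_file]
    if custom_scanners:
        # -n loads <eval_name>/eval_scanners.py in addition to the defaults.
        cmd += ["-n", eval_name]
    cmd += ["-o", (ANALYSIS_ROOT / eval_name / "scout_results").as_posix()]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


def analysis_filename(eval_name: str, model_name: str, on: date | None = None) -> str:
    """Step 10: summary file name, with a date inserted when one is given."""
    if on is None:
        return f"{eval_name}_{model_name}_ANALYSIS.md"
    # assumption: ISO 8601 date; the workflow only says "<date>"
    return f"{eval_name}_{model_name}_{on.isoformat()}_ANALYSIS.md"
```

Joining the list returned by scanner_command with spaces gives a command line to show the user, and the same list can be passed to subprocess.run if the user asks for the scanners to be run. Returning a list rather than a string avoids shell quoting problems when a log file path contains spaces.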
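
When asking the user in step 5 whether a scanner should invalidate successes, failures, both, or neither, it may help to spell out what the two flags imply. The sketch below is one plausible reading of the comments in the SCANNERS example, not the logic of analyze_validity.py: ScannerFlags is a hypothetical stand-in for InspectEvalScanner, is_valid is a hypothetical helper, and the flag values given to the two default scanners in the usage note are illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ScannerFlags:
    """Hypothetical stand-in holding only the two flags shown in step 5."""

    name: str
    invalidates_success: bool = False
    invalidates_failure: bool = False


def is_valid(succeeded: bool, flagged_by: set[str], scanners: list[ScannerFlags]) -> bool:
    """Return False if any scanner that flagged the sample invalidates its outcome."""
    for scanner in scanners:
        if scanner.name not in flagged_by:
            continue  # this scanner did not flag the sample
        if succeeded and scanner.invalidates_success:
            return False  # e.g. a success reached through reward hacking
        if not succeeded and scanner.invalidates_failure:
            return False  # e.g. a failure caused by the environment
    return True
```

Under this reading, a scanner such as reward_hacking_success would set invalidates_success=True, so a flagged success no longer counts as a real success, while external_failure would set invalidates_failure=True, so a flagged failure is not held against the agent. A scanner with both flags False only annotates the sample.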

technical

github: UKGovernmentBEIS/inspect_evals
stars: 517
license: MIT
contributors: 100
last commit: 2026-05-29T04:29:08Z
file: .claude/skills/check-trajectories-workflow/SKILL.md
