inspect_evals/

read-eval-logs

View and analyse Inspect evaluation log files using the Python API. Trigger whenever you need to look at a .eval file yourself without using pre-written scripts.

$/plugin install inspect_evals

Analysing Eval Log Files

This skill covers how to view and analyse Inspect evaluation log files using the Python API and CLI commands.

Quick Reference

CLI Commands

# List all logs in the default log directory (./logs or INSPECT_LOG_DIR)
uv run inspect log list --json

# List logs with specific status
uv run inspect log list --json --status success
uv run inspect log list --json --status error

# List retryable logs (error/cancelled without subsequent success)
uv run inspect log list --json --retryable

# Dump a log file as JSON (works with any format: .eval or .json)
uv run inspect log dump <log_file_path>

# Convert between log formats
uv run inspect log convert source.json --to eval --output-dir log-output
uv run inspect log convert logs/ --to eval --output-dir logs-eval

# Get JSON schema for log files
uv run inspect log schema
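
When the JSON listing is consumed from a script, a small parser keeps the handling in one place. The sketch below assumes the output of `inspect log list --json` is a JSON array of objects with `name` and `task` keys; verify this against your installed version. `group_logs_by_task` is a hypothetical helper, not part of Inspect.

```python
import json
from collections import defaultdict


def group_logs_by_task(listing_json: str) -> dict[str, list[str]]:
    """Group log names by task from `inspect log list --json` output.

    Assumption: the input is a JSON array of objects with "name" and
    "task" keys. Entries without "task" are grouped under "unknown".
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for entry in json.loads(listing_json):
        grouped[entry.get("task", "unknown")].append(entry["name"])
    return dict(grouped)
```

The input string can be captured with `subprocess.run(["uv", "run", "inspect", "log", "list", "--json"], capture_output=True, text=True, check=True).stdout`.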

Interactive Log Viewer

# Start the interactive log viewer (updates automatically)
uv run inspect view

Python API

Key Imports

from inspect_ai.log import (
    # Listing and reading
    list_eval_logs,
    read_eval_log,
    read_eval_log_sample,
    read_eval_log_samples,
    read_eval_log_sample_summaries,
    
    # Writing
    write_eval_log,
    
    # Utilities
    retryable_eval_logs,
    recompute_metrics,
    resolve_sample_attachments,
    
    # Types
    EvalLog,
    EvalLogInfo,
    EvalSample,
    EvalSampleSummary,
)

Listing Logs

# List all logs in default directory
logs = list_eval_logs()

# List logs in a specific directory
logs = list_eval_logs(log_dir="./experiment-logs")

# Filter by format
logs = list_eval_logs(formats=["eval"])  # Only .eval files

# Filter with a custom function (receives header-only EvalLog)
logs = list_eval_logs(filter=lambda log: log.status == "success")

# Non-recursive listing
logs = list_eval_logs(recursive=False)

Reading Logs

# Read full log
log = read_eval_log("path/to/logfile.eval")

# Read header only (fast, excludes samples)
log = read_eval_log("path/to/logfile.eval", header_only=True)

# Read with attachments resolved
log = read_eval_log("path/to/logfile.eval", resolve_attachments=True)

Reading Samples

# Read a single sample by ID and epoch
sample = read_eval_log_sample("path/to/logfile.eval", id=42, epoch=1)

# Read a single sample by UUID
sample = read_eval_log_sample("path/to/logfile.eval", uuid="sample-uuid")

# Stream all samples (memory efficient - one at a time)
for sample in read_eval_log_samples("path/to/logfile.eval"):
    process(sample)

# Stream samples from incomplete logs
for sample in read_eval_log_samples("path/to/logfile.eval", all_samples_required=False):
    process(sample)

# Read sample summaries (fast, includes scoring info)
summaries = read_eval_log_sample_summaries("path/to/logfile.eval")

Filtering Samples

# Read only samples with errors, using the lightweight summaries to decide
# which full samples are worth loading
log_file = "path/to/logfile.eval"
errors = []
for summary in read_eval_log_sample_summaries(log_file):
    if summary.error is not None:
        errors.append(
            read_eval_log_sample(log_file, id=summary.id, epoch=summary.epoch)
        )
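
The same two-stage pattern (filter on summaries, then load only the matching samples) works for any condition, not just errors. A minimal sketch follows; `select_sample_keys` and `load_selected_samples` are hypothetical helper names, and the first only assumes each summary exposes `id` and `epoch`.

```python
from typing import Any, Callable, Iterable


def select_sample_keys(
    summaries: Iterable[Any],
    predicate: Callable[[Any], bool],
) -> list[tuple[Any, int]]:
    """Return (id, epoch) pairs for summaries that satisfy `predicate`."""
    return [(s.id, s.epoch) for s in summaries if predicate(s)]


def load_selected_samples(log_file: str, predicate: Callable[[Any], bool]) -> list[Any]:
    """Load full samples only for the summaries matching `predicate`."""
    # Imported lazily so the pure helper above works without inspect_ai.
    from inspect_ai.log import read_eval_log_sample, read_eval_log_sample_summaries

    keys = select_sample_keys(read_eval_log_sample_summaries(log_file), predicate)
    return [read_eval_log_sample(log_file, id=i, epoch=e) for i, e in keys]
```

For example, `load_selected_samples(log_file, lambda s: s.error is not None)` reproduces the error filter shown above.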

EvalLog Structure

The EvalLog object contains:

| Field | Type | Description |
| --- | --- | --- |
| `version` | `int` | File format version (currently 2) |
| `status` | `str` | `"started"`, `"success"`, `"cancelled"`, or `"error"` |
| `eval` | `EvalSpec` | Task, model, creation time, config |
| `plan` | `EvalPlan` | Solvers and generation config |
| `results` | `EvalResults` | Aggregate scores and metrics |
| `stats` | `EvalStats` | Runtime, model usage statistics |
| `error` | `EvalError` | Error info if `status == "error"` |
| `samples` | `list[EvalSample]` | Individual samples (if not header_only) |
| `reductions` | `list[EvalSampleReductions]` | Multi-epoch reductions |
| `location` | `str` | URI where log was read from |
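
As a rough illustration of how these fields fit together, the sketch below builds a one-line summary from a log. `describe_log` is a hypothetical helper; it assumes `error` and `samples` are `None` when not populated (for example, `samples` after a header-only read).

```python
from typing import Any


def describe_log(log: Any) -> str:
    """Build a one-line summary from an EvalLog-like object."""
    parts = [f"{log.eval.task} / {log.eval.model}", f"status={log.status}"]
    # `error` is only populated for failed runs.
    if log.error is not None:
        parts.append(f"error={log.error.message}")
    # `samples` is only populated when the full log was read.
    if log.samples is not None:
        parts.append(f"samples={len(log.samples)}")
    return ", ".join(parts)
```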

Always Check Status

log = read_eval_log("path/to/logfile.eval")
# `results` can be missing even on a finished run, so guard both
if log.status == "success" and log.results is not None:
    # Safe to analyse results
    for score in log.results.scores:
        print(f"{score.name}: {score.metrics}")

EvalSample Structure

Each sample contains:

| Field | Type | Description |
| --- | --- | --- |
| `id` | `int \| str` | Unique sample ID |
| `epoch` | `int` | Epoch number |
| `input` | `str \| list[ChatMessage]` | Sample input |
| `target` | `str \| list[str]` | Expected target(s) |
| `messages` | `list[ChatMessage]` | Full conversation history |
| `output` | `ModelOutput` | Model's output |
| `scores` | `dict[str, Score]` | Scores from scorers |
| `metadata` | `dict[str, Any]` | Sample metadata |
| `store` | `dict[str, Any]` | State at end of execution |
| `events` | `list[Event]` | Transcript events |
| `error` | `EvalError` | Error if sample failed |
| `total_time` | `float` | Total sample runtime |
| `model_usage` | `dict[str, ModelUsage]` | Token usage |
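
One way to use these fields is to flatten each sample into a single row for tabular inspection. The sketch below is illustrative only: `sample_row` is a hypothetical helper, and it assumes `scores` and `model_usage` may be `None` or empty for samples that never completed.

```python
from typing import Any


def sample_row(sample: Any) -> dict[str, Any]:
    """Flatten an EvalSample-like object into a single summary row."""
    row: dict[str, Any] = {
        "id": sample.id,
        "epoch": sample.epoch,
        "failed": sample.error is not None,
        "total_time": sample.total_time,
    }
    # One column per scorer; tolerate samples that were never scored.
    for scorer, score in (sample.scores or {}).items():
        row[f"score_{scorer}"] = score.value
    # Sum tokens across every model the sample called.
    usages = list((sample.model_usage or {}).values())
    row["input_tokens"] = sum(u.input_tokens for u in usages)
    row["output_tokens"] = sum(u.output_tokens for u in usages)
    return row
```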

Common Analysis Patterns

Get Aggregate Metrics

log = read_eval_log(log_file, header_only=True)
if log.results:
    for score in log.results.scores:
        print(f"Scorer: {score.name}")
        for metric_name, metric in score.metrics.items():
            print(f"  {metric_name}: {metric.value}")
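
When metrics need to be compared or stored rather than printed, a flat mapping is often easier to work with. A minimal sketch, assuming the `results.scores[*].metrics` layout used above; `flatten_metrics` is a hypothetical helper that also applies the status check.

```python
from typing import Any


def flatten_metrics(log: Any) -> dict[str, float]:
    """Return {"scorer/metric": value} for a successful EvalLog-like object."""
    # Unsuccessful or unscored runs have no trustworthy aggregate results.
    if log.status != "success" or log.results is None:
        return {}
    return {
        f"{score.name}/{name}": metric.value
        for score in log.results.scores
        for name, metric in score.metrics.items()
    }
```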

Analyse Failed Samples

log = read_eval_log(log_file)
if log.samples:
    failed = [s for s in log.samples if s.error is not None]
    for sample in failed:
        print(f"Sample {sample.id}: {sample.error.message}")

Extract Model Usage

log = read_eval_log(log_file, header_only=True)
for model, usage in log.stats.model_usage.items():
    print(f"{model}: {usage.input_tokens} in, {usage.output_tokens} out")
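
To total usage across several runs, the per-log loop can be folded into an accumulator. This sketch assumes each log exposes `stats.model_usage` with `input_tokens` and `output_tokens`, as in the snippet above; `total_tokens_by_model` is a hypothetical helper.

```python
from typing import Any, Iterable


def total_tokens_by_model(logs: Iterable[Any]) -> dict[str, dict[str, int]]:
    """Sum input and output tokens per model across EvalLog-like objects."""
    totals: dict[str, dict[str, int]] = {}
    for log in logs:
        for model, usage in log.stats.model_usage.items():
            bucket = totals.setdefault(model, {"input": 0, "output": 0})
            bucket["input"] += usage.input_tokens
            bucket["output"] += usage.output_tokens
    return totals
```

Header-only reads are enough here, so the logs can be loaded with `read_eval_log(path, header_only=True)`.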

Compare Multiple Runs

logs = list_eval_logs(filter=lambda l: l.eval.task == "my_task")
for log_info in logs:
    log = read_eval_log(log_info, header_only=True)
    if log.results and log.results.scores:
        accuracy = log.results.scores[0].metrics.get("accuracy")
        print(f"{log.eval.model}: {accuracy.value if accuracy else 'N/A'}")
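
For a ranked comparison rather than a printout, the same lookup can feed a sort. A sketch under the same assumptions as the loop above (the metric of interest lives on the first scorer); `rank_runs` is a hypothetical helper, and runs without the metric are skipped.

```python
from typing import Any, Iterable


def rank_runs(logs: Iterable[Any], metric: str = "accuracy") -> list[tuple[str, float]]:
    """Return (model, value) pairs sorted from best to worst on `metric`."""
    rows: list[tuple[str, float]] = []
    for log in logs:
        # Skip runs that produced no results or no scorers.
        if not (log.results and log.results.scores):
            continue
        found = log.results.scores[0].metrics.get(metric)
        if found is not None:
            rows.append((log.eval.model, found.value))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```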

Find Retryable Logs

all_logs = list_eval_logs()
retryable = retryable_eval_logs(all_logs)
for log_info in retryable:
    print(f"Can retry: {log_info.name}")

Log File Formats

| Type | Description |
| --- | --- |
| `.eval` | Binary format, ~1/8 size of JSON, fast incremental access |
| `.json` | Text format, human-readable, slower for large files |

Both formats are fully supported by the API and can be mixed within the same log directory.

Working with Large Logs

For logs too large to fit in memory:

Use .eval format - supports compression and incremental access
Read header only - read_eval_log(log_file, header_only=True)
Stream samples - read_eval_log_samples() yields one at a time
Use summaries - read_eval_log_sample_summaries() for quick overview
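
The streaming approach can be sketched as a single pass that keeps only running totals, so memory use stays flat regardless of log size. `streaming_summary` is a hypothetical helper; it accepts any iterable of sample-like objects and assumes `total_time` may be `None` for samples that never ran.

```python
from typing import Any, Iterable


def streaming_summary(samples: Iterable[Any]) -> dict[str, float]:
    """Aggregate over samples one at a time without retaining them."""
    count = 0
    failed = 0
    total_time = 0.0
    for sample in samples:
        count += 1
        if sample.error is not None:
            failed += 1
        total_time += sample.total_time or 0.0
    mean_time = total_time / count if count else 0.0
    return {"samples": count, "failed": failed, "mean_time": mean_time}
```

Passing `read_eval_log_samples(log_file)` as the argument streams directly from disk; add `all_samples_required=False` for incomplete logs.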

Modifying Logs

# Read, modify, and write back
log = read_eval_log(log_file)
# ... modify log ...
write_eval_log(log)  # Uses log.location automatically

# Write to a new location
write_eval_log(log, location="new_path.eval")

# Recompute metrics after score edits
recompute_metrics(log)
write_eval_log(log)
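
The "modify log" step is left open above. One possible edit is overriding a single sample's score before recomputing metrics. The sketch below assumes score objects are mutable and keyed by scorer name in `sample.scores`; `override_score` is a hypothetical helper, not an Inspect API.

```python
from typing import Any


def override_score(log: Any, sample_id: Any, epoch: int, scorer: str, value: Any) -> bool:
    """Set the value of one scorer on one sample; return True if it was found."""
    for sample in log.samples or []:
        if sample.id != sample_id or sample.epoch != epoch:
            continue
        scores = sample.scores or {}
        if scorer in scores:
            scores[scorer].value = value
            return True
    return False
```

After a successful edit, `recompute_metrics(log)` followed by `write_eval_log(log, location="new_path.eval")` refreshes the aggregates and saves a copy, leaving the original file untouched.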

Environment Variables

| Variable | Description |
| --- | --- |
| `INSPECT_LOG_DIR` | Default log directory (default: `./logs`) |
| `INSPECT_EVAL_LOG_FILE_PATTERN` | Log filename pattern (e.g., `{task}_{model}_{id}`) |
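
Scripts that need to know where logs will land can mirror the documented default. This is a sketch of that fallback only, not Inspect's own resolution logic; `default_log_dir` is a hypothetical helper.

```python
import os
from collections.abc import Mapping
from typing import Optional


def default_log_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Return INSPECT_LOG_DIR if set and non-empty, otherwise "./logs"."""
    env = os.environ if env is None else env
    return env.get("INSPECT_LOG_DIR") or "./logs"
```

The optional `env` argument exists so the lookup can be exercised without touching the real environment.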

Related Tools

Inspect Scout - Transcript analysis
Inspect Viz - Data visualization
Log Dataframes - Extract pandas DataFrames from logs (see inspect_ai.analysis)

technical

github: UKGovernmentBEIS/inspect_evals
stars: 517
license: MIT
contributors: 100
last commit: 2026-05-29T04:29:08Z
file: .claude/skills/read-eval-logs/SKILL.md