EZVals

EZVals is an evaluation framework for testing LLMs and agents. What makes EZVals special is that it's built to be used by your coding agent. Your agent writes evals, runs them, analyzes results, and iterates, while you monitor the situation from the Dashboard.

Agent-First

Full CLI that mirrors every UI action. Built-in skill gives your agent eval-specific guidance.

Local Dashboard

Compare runs, annotate, and export from a local web UI.

Flexible

Quick smoke tests or deep multi-model comparisons. Pytest-style assertions, clean code.

Local-Only

Code, data, and results live in your repo. No external platforms or auth.
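To picture what "pytest-style assertions" means in practice, here is a minimal sketch of a correctness eval written as a plain Python test function. Everything in it is illustrative: `answer_question` is a hypothetical stand-in for a real model call, and the dataset and function shape are assumptions, not EZVals' actual API.

```python
# Hypothetical sketch of a pytest-style correctness eval.
# `answer_question` stands in for a real LLM call, and this
# plain-function shape is an assumption, not EZVals' actual API.

DATASET = [
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "What is the capital of France?", "expected": "Paris"},
]

def answer_question(question: str) -> str:
    # Stub model for the sketch: replace with your LLM call.
    canned = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "Paris",
    }
    return canned[question]

def test_answer_correctness():
    for case in DATASET:
        answer = answer_question(case["question"])
        # Plain assert, pytest-style: a failing case raises with context.
        assert case["expected"] in answer, (
            f"Wrong answer for {case['question']!r}: got {answer!r}"
        )

if __name__ == "__main__":
    test_answer_correctness()
```

The point of the pytest-style shape is that your agent can write and run such files with ordinary tooling; no special assertion DSL is needed.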

Get Started

Install the EZVals skill for your coding agent:
npx skills add camronh/evals-skill
Then prompt your agent:
Compare Opus 4.6 vs GPT-5.3 on answer correctness. Show me the comparison.
The agent writes the evals, runs them, and reports back with results. For the comparison example, that looks something like:
I ran both models against your 50-question correctness dataset.

| Model    | Correctness | Avg Latency |
|----------|-------------|-------------|
| Opus 4.6 | 94%         | 1.2s        |
| GPT-5.3  | 87%         | 0.8s        |

Opus scores 7 points higher on correctness but is ~50% slower.

Full comparison: http://127.0.0.1:8000/?compare_run_id=a1b2...&compare_run_id=c3d4...
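Aggregates like the ones in that table reduce to simple arithmetic over per-case results. A minimal sketch, where the record fields (`correct`, `latency_s`) are assumptions for illustration rather than EZVals' actual result schema:

```python
# Sketch of computing per-model aggregates from per-case results.
# The record shape ("correct", "latency_s") is assumed for this
# example; EZVals' real result schema may differ.

def summarize(results: list[dict]) -> dict:
    n = len(results)
    correctness = sum(r["correct"] for r in results) / n
    avg_latency = sum(r["latency_s"] for r in results) / n
    return {"correctness": correctness, "avg_latency_s": avg_latency}

results = [
    {"correct": True, "latency_s": 1.1},
    {"correct": True, "latency_s": 1.3},
    {"correct": False, "latency_s": 1.2},
    {"correct": True, "latency_s": 1.2},
]
summary = summarize(results)  # 75% correct, ~1.2s average latency
```

Running the same summary over each model's run is all the comparison table requires.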
[EZVals comparison view]

Results can be exported as JSON, CSV, Markdown, or PNG from the dashboard or CLI.

Setup Details

Install the library directly, configure your project, and learn how EZVals works under the hood.