EZVals is a lightweight, code-first evaluation framework for testing AI agents and LLM applications. Inspired by Pytest and LangSmith, EZVals lets you write evaluations like you write unit tests!

Install

uv add ezvals --dev

Quick Example

Evaluating a simple sentiment analyzer against a ground truth dataset:
from ezvals import eval, parametrize, EvalContext

@eval(
    input="I love this product!",  # Prompt
    reference="positive",          # Ground Truth
    dataset="sentiment",           # Label for filtering
)
async def test_sentiment_analysis(ctx: EvalContext):
    # Run test and store output in context
    ctx.output = await analyze_sentiment(ctx.input)

    # Evaluate!
    assert ctx.output == ctx.reference, f"Expected {ctx.reference}, got {ctx.output}"
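
The examples assume an analyze_sentiment helper that you supply; it is not part of EZVals. A minimal hypothetical sketch, with a keyword heuristic standing in for a real model call:

# Hypothetical helper assumed by the examples; swap in your actual model client.
async def analyze_sentiment(text: str) -> str:
    """Classify text as positive, negative, or neutral."""
    lowered = text.lower()
    if any(word in lowered for word in ("love", "great", "excellent")):
        return "positive"
    if any(word in lowered for word in ("terrible", "hate", "awful")):
        return "negative"
    return "neutral"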
Or use @parametrize to apply the same eval to a dataset of cases:
@eval(dataset="sentiment")
@parametrize("input,reference", [
    ("I love this product!", "positive"),
    ("This is terrible", "negative"),
    ("It's okay I guess", "neutral"),
])
async def test_sentiment_batch(ctx: EvalContext):
    # Parametrized data is injected into the ctx
    ctx.output = await analyze_sentiment(ctx.input)

    assert ctx.output == ctx.reference, f"Expected {ctx.reference}, got {ctx.output}"
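
Because @parametrize takes plain Python data, the cases can also live in a local file next to your evals. A sketch, assuming a hypothetical sentiment_cases.json containing [input, reference] pairs:

from ezvals import eval, parametrize, EvalContext
import json

# Hypothetical dataset file, e.g. [["I love this product!", "positive"], ...]
with open("sentiment_cases.json") as f:
    CASES = [tuple(row) for row in json.load(f)]

@eval(dataset="sentiment")
@parametrize("input,reference", CASES)
async def test_sentiment_from_file(ctx: EvalContext):
    ctx.output = await analyze_sentiment(ctx.input)
    assert ctx.output == ctx.reference, f"Expected {ctx.reference}, got {ctx.output}"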
Start the web UI to run evals and review results:
ezvals serve sentiment_evals.py

# Or run all in a dir
ezvals serve evals/

Web UI

EZVals spins up a local web UI that makes it easy to filter, run, and rerun evals, and to do deep analysis on the results. All results are stored locally in a .json file for further analysis.
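
Because the results file is plain JSON on disk, a few lines of Python are enough to start digging into it. A sketch that only inspects the structure, since no particular schema is assumed here:

import json
from pathlib import Path

# Use the path printed at the end of your run; this one is just an example.
results = json.loads(Path("./ezvals/cool-cloud-2025.json").read_text())

print(type(results).__name__)
if isinstance(results, list) and results and isinstance(results[0], dict):
    print("Fields on the first record:", sorted(results[0]))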

Agent Mode

The CLI and SDK make it easy for your coding agent to run, analyze, and iterate on the evals!
Ask your agent something like: "Can you run the test_sentiment_batch evals and tell me why the scores are so low?"
Your coding agent would run:
ezvals run sentiment_evals.py::test_sentiment_batch --session sentiment-failed-results

# Eval results saved to ./ezvals/cool-cloud-2025.json
The agent can review the JSON results before presenting a solution to you. It can even implement that solution and rerun the evals to compare before and after!
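
A sketch of that before/after comparison, assuming (hypothetically) that each record in the results file carries a boolean passed field; check your actual schema first:

import json
from pathlib import Path

def pass_rate(path: str) -> float:
    # Hypothetical schema: a list of records with a boolean "passed" field.
    results = json.loads(Path(path).read_text())
    return sum(1 for r in results if r.get("passed")) / max(len(results), 1)

# File names here are placeholders for two sessions you ran.
print(f"before: {pass_rate('ezvals/before.json'):.0%}")
print(f"after:  {pass_rate('ezvals/after.json'):.0%}")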

Existing eval frameworks are frustrating:

Too Opinionated

One function per dataset, rigid patterns. No way to run different logic per test case.

Cloud-Based

Datasets in the cloud. No version control. Code and data live in different places.

UI-Based

Your coding agent can’t run evals, analyze results, or iterate on datasets.
Pytest isn’t the answer either. Tests are pass/fail; evals are for analysis. You want to see all results, latency, cost, and comparisons over time, and Pytest doesn’t do that. EZVals is different: minimal, flexible, agent-friendly. Everything lives locally as code and JSON.

Ready to start?

Follow our quickstart guide to set up EZVals in under 5 minutes.