The cases= argument on @eval lets you generate multiple evaluations from one function.

Basic Usage

from ezvals import eval, EvalContext

@eval(
    dataset="math",
    cases=[
        {"input": {"a": 2, "b": 3}, "reference": 5},
        {"input": {"a": 10, "b": 20}, "reference": 30},
        {"input": {"a": 0, "b": 0}, "reference": 0},
    ],
)
def test_addition(ctx: EvalContext):
    result = ctx.input["a"] + ctx.input["b"]
    ctx.output = result
    assert result == ctx.reference, f"Expected {ctx.reference}, got {result}"
This generates three evaluations with numeric IDs:
  • test_addition[0]
  • test_addition[1]
  • test_addition[2]
Without custom IDs, test variants are numbered sequentially. Provide IDs for readable names.
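The naming rule can be sketched in plain Python (a hypothetical helper, not ezvals internals): each variant name uses the case's `id` when given, otherwise its position in the list.

```python
def variant_names(func_name, cases):
    # Use the case's "id" when present, else fall back to the sequential index.
    return [f"{func_name}[{case.get('id', i)}]" for i, case in enumerate(cases)]

# With no ids, variants are numbered 0, 1, 2:
variant_names("test_addition", [{}, {}, {}])
# → ["test_addition[0]", "test_addition[1]", "test_addition[2]"]
```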

Case Shape

cases must be a list of dicts. Each dict can override any @eval argument plus an id:
@eval(
    dataset="sentiment",
    labels=["prod"],
    cases=[
        {"id": "pos", "input": "I love this!", "reference": "positive"},
        {"id": "neg", "input": "Terrible!", "reference": "negative", "labels": ["edge"]},
    ],
)
def test_sentiment(ctx: EvalContext):
    ctx.output = analyze_sentiment(ctx.input)
    assert ctx.output == ctx.reference

Per-Case Overrides

Case dicts can override:
  • input, reference, metadata, dataset, labels, default_score_key
  • timeout, target, evaluators
  • id (for naming)
Rules:
  • If a key is omitted, the decorator default is used.
  • If a key is present with None, it clears the default.
  • labels merge with the default (duplicates removed); labels: None clears.
  • metadata merges (case wins).
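These rules can be modeled in plain Python (a sketch of the behavior described above, not the actual ezvals implementation): omitted keys keep the decorator default, explicit `None` clears, `labels` merge with duplicates removed, and `metadata` merges with the case winning.

```python
def resolve(defaults, case):
    # Start from the decorator defaults; omitted keys keep them.
    resolved = dict(defaults)
    for key, value in case.items():
        if value is None:
            resolved[key] = None  # a key present with None clears the default
        elif key == "labels":
            base = defaults.get("labels") or []
            # Merge case labels into the defaults, dropping duplicates in order.
            resolved[key] = list(dict.fromkeys(base + value))
        elif key == "metadata":
            # Merge dicts; case entries override decorator entries.
            resolved[key] = {**(defaults.get("metadata") or {}), **value}
        else:
            resolved[key] = value
    return resolved

resolve(
    {"dataset": "sentiment", "labels": ["prod"]},
    {"input": "Terrible!", "labels": ["edge"]},
)
# → {"dataset": "sentiment", "labels": ["prod", "edge"], "input": "Terrible!"}
```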

Custom IDs

@eval(cases=[
    {"id": "low", "input": 0.2},
    {"id": "mid", "input": 0.5},
    {"id": "high", "input": 0.8},
])
def test_thresholds(ctx: EvalContext):
    ...
Generates:
  • test_thresholds[low]
  • test_thresholds[mid]
  • test_thresholds[high]

Explicit Grids

You can build explicit grids using list comprehensions:
MODEL_CASES = [
    {"input": {"model": m, "temperature": t}}
    for m in ["gpt-4", "gpt-3.5"]
    for t in [0.0, 0.7, 1.0]
]

@eval(dataset="models", cases=MODEL_CASES)
def test_model_grid(ctx: EvalContext):
    ctx.output = run_model(ctx.input["model"], ctx.input["temperature"])
    assert ctx.output is not None
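Because each case is a plain dict, the same comprehension can attach readable IDs instead of relying on numeric fallbacks (the `id` format here is just one choice):

```python
import itertools

# Build the grid with itertools.product and a descriptive id per case.
MODEL_CASES = [
    {"id": f"{m}-t{t}", "input": {"model": m, "temperature": t}}
    for m, t in itertools.product(["gpt-4", "gpt-3.5"], [0.0, 0.7, 1.0])
]
# Produces ids like "gpt-4-t0.0", "gpt-4-t0.7", ..., "gpt-3.5-t1.0"
```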

Running Specific Variants

# Run all variants
ezvals run evals.py::test_thresholds

# Run a specific variant
ezvals run evals.py::test_thresholds[low]