The cases= argument on @eval lets you generate multiple evaluations from one function.

Basic Usage

from ezvals import eval, EvalContext

@eval(
    dataset="math",
    cases=[
        {"input": {"a": 2, "b": 3}, "reference": 5},
        {"input": {"a": 10, "b": 20}, "reference": 30},
        {"input": {"a": 0, "b": 0}, "reference": 0},
    ],
)
def test_addition(ctx: EvalContext):
    result = ctx.input["a"] + ctx.input["b"]
    ctx.output = result
    assert result == ctx.reference, f"Expected {ctx.reference}, got {result}"
This generates three evaluations with numeric IDs:
  • test_addition[0]
  • test_addition[1]
  • test_addition[2]
Without custom IDs, test variants are numbered sequentially. Provide IDs for readable names.
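The naming rule can be sketched in plain Python (a hypothetical helper, not ezvals internals): each variant name uses the case's `id` when given, otherwise its position in the list.

```python
def variant_names(func_name, cases):
    # Use the case's "id" when present, else fall back to the sequential index.
    return [f"{func_name}[{case.get('id', i)}]" for i, case in enumerate(cases)]

# With no ids, variants are numbered 0, 1, 2:
variant_names("test_addition", [{}, {}, {}])
# → ["test_addition[0]", "test_addition[1]", "test_addition[2]"]
```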

Case Shape

cases must be a list of dicts. Each dict can override any @eval argument plus an id:
@eval(
    dataset="sentiment",
    labels=["prod"],
    cases=[
        {"id": "pos", "input": "I love this!", "reference": "positive"},
        {"id": "neg", "input": "Terrible!", "reference": "negative", "labels": ["edge"]},
    ],
)
def test_sentiment(ctx: EvalContext):
    ctx.output = analyze_sentiment(ctx.input)
    assert ctx.output == ctx.reference

Per-Case Overrides

Case dicts can override:
  • input, reference, metadata, dataset, labels, default_score_key
  • timeout, target, evaluators
  • id (for naming)
Rules:
  • If a key is omitted, the decorator default is used.
  • If a key is present with None, it clears the default.
  • labels merge with the default (duplicates removed); labels: None clears.
  • metadata merges (case wins).
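These rules can be modeled in plain Python (a sketch of the behavior described above, not the actual ezvals implementation): omitted keys keep the decorator default, explicit `None` clears, `labels` merge with duplicates removed, and `metadata` merges with the case winning.

```python
def resolve(defaults, case):
    # Start from the decorator defaults; omitted keys keep them.
    resolved = dict(defaults)
    for key, value in case.items():
        if value is None:
            resolved[key] = None  # a key present with None clears the default
        elif key == "labels":
            base = defaults.get("labels") or []
            # Merge case labels into the defaults, dropping duplicates in order.
            resolved[key] = list(dict.fromkeys(base + value))
        elif key == "metadata":
            # Merge dicts; case entries override decorator entries.
            resolved[key] = {**(defaults.get("metadata") or {}), **value}
        else:
            resolved[key] = value
    return resolved

resolve(
    {"dataset": "sentiment", "labels": ["prod"]},
    {"input": "Terrible!", "labels": ["edge"]},
)
# → {"dataset": "sentiment", "labels": ["prod", "edge"], "input": "Terrible!"}
```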

Custom IDs

@eval(cases=[
    {"id": "low", "input": 0.2},
    {"id": "mid", "input": 0.5},
    {"id": "high", "input": 0.8},
])
def test_thresholds(ctx: EvalContext):
    ...
Generates:
  • test_thresholds[low]
  • test_thresholds[mid]
  • test_thresholds[high]

Explicit Grids

You can build explicit grids using list comprehensions:
MODEL_CASES = [
    {"input": {"model": m, "temperature": t}}
    for m in ["gpt-4", "gpt-3.5"]
    for t in [0.0, 0.7, 1.0]
]

@eval(dataset="models", cases=MODEL_CASES)
def test_model_grid(ctx: EvalContext):
    ctx.output = run_model(ctx.input["model"], ctx.input["temperature"])
    assert ctx.output is not None
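Because each case is a plain dict, the same comprehension can attach readable IDs instead of relying on numeric fallbacks (the `id` format here is just one choice):

```python
import itertools

# Build the grid with itertools.product and a descriptive id per case.
MODEL_CASES = [
    {"id": f"{m}-t{t}", "input": {"model": m, "temperature": t}}
    for m, t in itertools.product(["gpt-4", "gpt-3.5"], [0.0, 0.7, 1.0])
]
# Produces ids like "gpt-4-t0.0", "gpt-4-t0.7", ..., "gpt-3.5-t1.0"
```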

Running Specific Variants

# Run all variants
ezvals run evals.py::test_thresholds

# Run a specific variant
ezvals run evals.py::test_thresholds[low]