The @eval decorator marks functions as evaluations. Here’s a complete example:
from ezvals import eval, parametrize, EvalContext

@eval(
    dataset="customer_service",
    labels=["production", "critical"],
    input="I want to cancel my subscription",
    metadata={"category": "cancellation"},
    timeout=10.0,
)
async def test_cancellation_handling(ctx: EvalContext):
    ctx.output = await customer_agent(ctx.input)

    assert "cancel" in ctx.output.lower(), "Should address cancellation"
    assert len(ctx.output) > 50, "Response too short"

Configuration Options

Dataset

Groups related evaluations together:
@eval(dataset="sentiment_analysis")
async def test_positive_sentiment(ctx: EvalContext):
    ...

@eval(dataset="sentiment_analysis")
async def test_negative_sentiment(ctx: EvalContext):
    ...
If not specified, dataset defaults to the filename (e.g., evals.py → evals).

Labels

Tags for filtering:
@eval(labels=["production", "fast"])
async def test_quick_response(ctx: EvalContext):
    ...

@eval(labels=["experimental", "slow"])
async def test_complex_reasoning(ctx: EvalContext):
    ...
Filter with the CLI:
ezvals run evals.py --label production

Pre-populated Fields

Set context fields directly in the decorator:
@eval(
    input="What is 2 + 2?",
    reference="4",
    metadata={"difficulty": "easy"}
)
async def test_arithmetic(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference

Default Score Key

Specify the key used for auto-added scores (when an assertion fails, or when scores are recorded via store(scores=...)):
@eval(input="Classify this text", default_score_key="accuracy")
async def test_classification(ctx: EvalContext):
    ctx.output = await classifier(ctx.input)
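    # If this assertion fails, the failure is recorded as a score under
    # the "accuracy" key (the default_score_key above)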
    assert ctx.output in ["positive", "negative", "neutral"]

Timeout

Set a maximum execution time:
@eval(input="complex task", timeout=5.0)
async def test_with_timeout(ctx: EvalContext):
    ctx.output = await slow_agent(ctx.input)
On timeout, the evaluation fails with an error message.

Target Hook

Run a function before the evaluation body:
def call_agent(ctx: EvalContext):
    ctx.output = my_agent(ctx.input)

@eval(input="What's the weather?", target=call_agent)
async def test_weather(ctx: EvalContext):
    # ctx.output already populated by target
    assert "weather" in ctx.output.lower()
This separates agent invocation from assertion logic.
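Because the hook is an ordinary function, it can be reused across evals. A small sketch reusing call_agent from above (the input and assertion are illustrative):
@eval(input="Will it rain tomorrow?", target=call_agent)
async def test_rain(ctx: EvalContext):
    # Same pre-hook as test_weather, different input and assertion
    assert "rain" in ctx.output.lower()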

Evaluators

Post-processing functions that add scores:
def check_length(result):
    return {
        "key": "length",
        "passed": len(result.output) > 10,
        "notes": f"Output length: {len(result.output)}"
    }

@eval(input="test", evaluators=[check_length])
async def test_response(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    # check_length runs after, adds "length" score
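Since evaluators accepts a list, multiple checks can run on the same result. A minimal sketch (check_no_refusal is an illustrative helper following the same score-dict shape as check_length):
def check_no_refusal(result):
    refused = "i can't help" in result.output.lower()
    return {
        "key": "no_refusal",
        "passed": not refused,
        "notes": "Refusal detected" if refused else "No refusal language"
    }

@eval(input="test", evaluators=[check_length, check_no_refusal])
async def test_multiple_checks(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    # Both evaluators run afterward, adding "length" and "no_refusal" scores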

Input Loader

Load test examples dynamically from external sources (databases, APIs):
async def fetch_from_db():
    examples = await db.get_test_cases()
    return [{"input": e.prompt, "reference": e.expected} for e in examples]

@eval(dataset="dynamic", input_loader=fetch_from_db)
async def test_from_database(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference
Each example from the loader becomes a separate eval run. The loader is called lazily at eval time (not at import time), making it ideal for fetching from LangSmith, databases, or APIs.
Loader return format:
  • Return a list of dicts with input, reference, and/or metadata keys
  • Or return objects with .input, .reference, and .metadata attributes (see the sketch below)
input_loader cannot be combined with input=, reference=, or @parametrize.
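A hedged sketch of the object-based format, using a plain dataclass (TestCase and my_agent are illustrative, not part of ezvals), with a sync loader since loaders may be sync or async:
from dataclasses import dataclass, field

@dataclass
class TestCase:
    input: str
    reference: str
    metadata: dict = field(default_factory=dict)

def load_cases():
    # Only .input, .reference, and .metadata are read from each object
    return [
        TestCase(input="2 + 2", reference="4", metadata={"difficulty": "easy"}),
        TestCase(input="17 * 3", reference="51", metadata={"difficulty": "medium"}),
    ]

@eval(dataset="arithmetic", input_loader=load_cases)
async def test_arithmetic_cases(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.reference in ctx.output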

Sync and Async

Both sync and async functions work—just use async def if your code uses await.
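For example, a synchronous eval (my_sync_agent is a placeholder for any non-async call):
@eval(dataset="sync_examples", input="ping")
def test_sync_agent(ctx: EvalContext):
    # Plain def works the same way as async def, just without await
    ctx.output = my_sync_agent(ctx.input)
    assert ctx.output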

Returning Multiple Results

Return a list of EvalResult objects for batch evaluations:
from ezvals import eval, EvalResult

@eval(dataset="batch")
def test_batch():
    results = []
    for prompt in ["hello", "hi", "hey"]:
        output = my_agent(prompt)
        results.append(EvalResult(
            input=prompt,
            output=output,
            scores=[{"key": "valid", "passed": True}]
        ))
    return results

All Options Reference

Option             Type            Description
dataset            str             Group name for the evaluation
labels             list[str]       Tags for filtering
input              Any             Pre-populate ctx.input
reference          Any             Pre-populate ctx.reference
metadata           dict            Pre-populate ctx.metadata
default_score_key  str             Key for auto-added scores
timeout            float           Max execution time in seconds
target             callable        Pre-hook to run before evaluation
evaluators         list[callable]  Post-processing score functions
input_loader       callable        Async/sync function returning examples