The `@eval` decorator marks functions as evaluations. Here's a complete example:

```python
from ezvals import eval, parametrize, EvalContext

@eval(
    dataset="customer_service",
    labels=["production", "critical"],
    input="I want to cancel my subscription",
    metadata={"category": "cancellation"},
    timeout=10.0,
)
async def test_cancellation_handling(ctx: EvalContext):
    ctx.output = await customer_agent(ctx.input)
    assert "cancel" in ctx.output.lower(), "Should address cancellation"
    assert len(ctx.output) > 50, "Response too short"
```
## Configuration Options

### Dataset

Groups related evaluations together:

```python
@eval(dataset="sentiment_analysis")
async def test_positive_sentiment(ctx: EvalContext):
    ...

@eval(dataset="sentiment_analysis")
async def test_negative_sentiment(ctx: EvalContext):
    ...
```

If not specified, the dataset defaults to the filename (e.g., `evals.py` → `evals`).
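For instance, a minimal sketch of that default (assuming this lives in `evals.py`; `quick_check` is a placeholder agent, not part of ezvals):

```python
# In evals.py, with no dataset= given, this eval is grouped under "evals"
@eval(input="ping")
async def test_default_dataset(ctx: EvalContext):
    ctx.output = await quick_check(ctx.input)
    assert ctx.output
```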
### Labels

Tags for filtering:

```python
@eval(labels=["production", "fast"])
async def test_quick_response(ctx: EvalContext):
    ...

@eval(labels=["experimental", "slow"])
async def test_complex_reasoning(ctx: EvalContext):
    ...
```

Filter with the CLI:

```bash
ezvals run evals.py --label production
```
### Pre-populated Fields

Set context fields directly in the decorator:

```python
@eval(
    input="What is 2 + 2?",
    reference="4",
    metadata={"difficulty": "easy"},
)
async def test_arithmetic(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference
```
### Default Score Key

Specify the key under which scores are recorded (used when an assertion fails or when calling `store(scores=...)`):

```python
@eval(input="Classify this text", default_score_key="accuracy")
async def test_classification(ctx: EvalContext):
    ctx.output = await classifier(ctx.input)
    assert ctx.output in ["positive", "negative", "neutral"]
```
### Timeout

Set a maximum execution time:

```python
@eval(input="complex task", timeout=5.0)
async def test_with_timeout(ctx: EvalContext):
    ctx.output = await slow_agent(ctx.input)
```

On timeout, the evaluation fails with an error message.
### Target Hook

Run a function before the evaluation body:

```python
def call_agent(ctx: EvalContext):
    ctx.output = my_agent(ctx.input)

@eval(input="What's the weather?", target=call_agent)
async def test_weather(ctx: EvalContext):
    # ctx.output is already populated by the target
    assert "weather" in ctx.output.lower()
```

This separates agent invocation from assertion logic.
### Evaluators

Post-processing functions that add scores:

```python
def check_length(result):
    return {
        "key": "length",
        "passed": len(result.output) > 10,
        "notes": f"Output length: {len(result.output)}",
    }

@eval(input="test", evaluators=[check_length])
async def test_response(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    # check_length runs afterward and adds a "length" score
```
### Input Loader

Load test examples dynamically from external sources (databases, APIs):

```python
async def fetch_from_db():
    examples = await db.get_test_cases()
    return [{"input": e.prompt, "reference": e.expected} for e in examples]

@eval(dataset="dynamic", input_loader=fetch_from_db)
async def test_from_database(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference
```

Each example from the loader becomes a separate eval run. The loader is called lazily at eval time (not at import time), making it ideal for fetching from LangSmith, databases, or APIs.
Loader return format:

- Return a list of dicts with `input`, `reference`, and/or `metadata` keys
- Or return objects with `.input`, `.reference`, and `.metadata` attributes (see the sketch below)

`input_loader` cannot be combined with `input=`, `reference=`, or `@parametrize`.
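For the object form, any object exposing those attributes should work. Here is a minimal sketch using a hypothetical dataclass (`Example`, `load_examples`, and `my_agent` are illustrative names, not part of ezvals):

```python
from dataclasses import dataclass, field

from ezvals import eval, EvalContext

@dataclass
class Example:
    # Any object with .input, .reference, and .metadata attributes
    # satisfies the loader contract described above.
    input: str
    reference: str
    metadata: dict = field(default_factory=dict)

def load_examples():
    return [
        Example(input="2 + 2", reference="4", metadata={"difficulty": "easy"}),
        Example(input="12 * 12", reference="144"),
    ]

@eval(dataset="arithmetic", input_loader=load_examples)
async def test_arithmetic_examples(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference
```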
## Sync and Async

Both sync and async functions work; just use `async def` if your code uses `await`.
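For example, a purely synchronous eval is declared the same way (a sketch; `sync_model` is a placeholder for any blocking callable):

```python
@eval(dataset="sync_examples", input="Say hello")
def test_sync_response(ctx: EvalContext):
    # Plain def: no await needed for a blocking call
    ctx.output = sync_model(ctx.input)
    assert "hello" in ctx.output.lower()
```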
## Returning Multiple Results

Return a list of `EvalResult` objects for batch evaluations:

```python
from ezvals import eval, EvalResult

@eval(dataset="batch")
def test_batch():
    results = []
    for prompt in ["hello", "hi", "hey"]:
        output = my_agent(prompt)
        results.append(EvalResult(
            input=prompt,
            output=output,
            scores=[{"key": "valid", "passed": True}],
        ))
    return results
```
## All Options Reference

| Option | Type | Description |
|---|---|---|
| `dataset` | `str` | Group name for the evaluation |
| `labels` | `list[str]` | Tags for filtering |
| `input` | `Any` | Pre-populate `ctx.input` |
| `reference` | `Any` | Pre-populate `ctx.reference` |
| `metadata` | `dict` | Pre-populate `ctx.metadata` |
| `default_score_key` | `str` | Key for auto-added scores |
| `timeout` | `float` | Max execution time in seconds |
| `target` | `callable` | Pre-hook to run before evaluation |
| `evaluators` | `list[callable]` | Post-processing score functions |
| `input_loader` | `callable` | Async/sync function returning examples |