The `@eval` decorator marks functions as evaluations. Here's a complete example:

```python
from ezvals import eval, parametrize, EvalContext

@eval(
    dataset="customer_service",
    labels=["production", "critical"],
    input="I want to cancel my subscription",
    metadata={"category": "cancellation"},
    timeout=10.0,
)
async def test_cancellation_handling(ctx: EvalContext):
    ctx.output = await customer_agent(ctx.input)
    assert "cancel" in ctx.output.lower(), "Should address cancellation"
    assert len(ctx.output) > 50, "Response too short"
```
## Configuration Options

### Dataset

Groups related evaluations together:

```python
@eval(dataset="sentiment_analysis")
async def test_positive_sentiment(ctx: EvalContext):
    ...

@eval(dataset="sentiment_analysis")
async def test_negative_sentiment(ctx: EvalContext):
    ...
```

If not specified, the dataset defaults to the filename (e.g., `evals.py` → `evals`).
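For instance, a minimal sketch of that default (assuming this lives in `evals.py`; `quick_check` is a placeholder agent, not part of ezvals):

```python
# In evals.py, with no dataset= given, this eval is grouped under "evals"
@eval(input="ping")
async def test_default_dataset(ctx: EvalContext):
    ctx.output = await quick_check(ctx.input)
    assert ctx.output
```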
### Labels

Tags for filtering:

```python
@eval(labels=["production", "fast"])
async def test_quick_response(ctx: EvalContext):
    ...

@eval(labels=["experimental", "slow"])
async def test_complex_reasoning(ctx: EvalContext):
    ...
```

Filter with the CLI:

```bash
ezvals run evals.py --label production
```
### Pre-populated Fields

Set context fields directly in the decorator:

```python
@eval(
    input="What is 2 + 2?",
    reference="4",
    metadata={"difficulty": "easy"},
)
async def test_arithmetic(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference
```
### Default Score Key

Specify the key under which scores are recorded (used when an assertion fails or when calling `store(scores=...)`):

```python
@eval(input="Classify this text", default_score_key="accuracy")
async def test_classification(ctx: EvalContext):
    ctx.output = await classifier(ctx.input)
    assert ctx.output in ["positive", "negative", "neutral"]
```
### Timeout

Set a maximum execution time:

```python
@eval(input="complex task", timeout=5.0)
async def test_with_timeout(ctx: EvalContext):
    ctx.output = await slow_agent(ctx.input)
```

On timeout, the evaluation fails with an error message.
### Target Hook

Run a function before the evaluation body:

```python
def call_agent(ctx: EvalContext):
    ctx.output = my_agent(ctx.input)

@eval(input="What's the weather?", target=call_agent)
async def test_weather(ctx: EvalContext):
    # ctx.output is already populated by the target
    assert "weather" in ctx.output.lower()
```

This separates agent invocation from assertion logic.
### Evaluators

Post-processing functions that add scores:

```python
def check_length(result):
    return {
        "key": "length",
        "passed": len(result.output) > 10,
        "notes": f"Output length: {len(result.output)}",
    }

@eval(input="test", evaluators=[check_length])
async def test_response(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    # check_length runs afterward and adds a "length" score
```
### Input Loader

Load test examples dynamically from external sources (databases, APIs):

```python
async def fetch_from_db():
    examples = await db.get_test_cases()
    return [{"input": e.prompt, "reference": e.expected} for e in examples]

@eval(dataset="dynamic", input_loader=fetch_from_db)
async def test_from_database(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference
```

Each example from the loader becomes a separate eval run. The loader is called lazily at eval time (not at import time), making it ideal for fetching from LangSmith, databases, or APIs.
Loader return format:

- Return a list of dicts with `input`, `reference`, and/or `metadata` keys
- Or return objects with `.input`, `.reference`, and `.metadata` attributes (see the sketch below)

`input_loader` cannot be combined with `input=`, `reference=`, or `@parametrize`.
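For the object form, any object exposing those attributes should work. Here is a minimal sketch using a hypothetical dataclass (`Example`, `load_examples`, and `my_agent` are illustrative names, not part of ezvals):

```python
from dataclasses import dataclass, field

from ezvals import eval, EvalContext

@dataclass
class Example:
    # Any object with .input, .reference, and .metadata attributes
    # satisfies the loader contract described above.
    input: str
    reference: str
    metadata: dict = field(default_factory=dict)

def load_examples():
    return [
        Example(input="2 + 2", reference="4", metadata={"difficulty": "easy"}),
        Example(input="12 * 12", reference="144"),
    ]

@eval(dataset="arithmetic", input_loader=load_examples)
async def test_arithmetic_examples(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference
```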
## Sync and Async

Both sync and async functions work; just use `async def` if your code uses `await`.
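For example, a purely synchronous eval is declared the same way (a sketch; `sync_model` is a placeholder for any blocking callable):

```python
@eval(dataset="sync_examples", input="Say hello")
def test_sync_response(ctx: EvalContext):
    # Plain def: no await needed for a blocking call
    ctx.output = sync_model(ctx.input)
    assert "hello" in ctx.output.lower()
```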
## Returning Multiple Results

Return a list of `EvalResult` objects for batch evaluations:

```python
from ezvals import eval, EvalResult

@eval(dataset="batch")
def test_batch():
    results = []
    for prompt in ["hello", "hi", "hey"]:
        output = my_agent(prompt)
        results.append(EvalResult(
            input=prompt,
            output=output,
            scores=[{"key": "valid", "passed": True}],
        ))
    return results
```
## All Options Reference

| Option | Type | Description |
|---|---|---|
| `dataset` | `str` | Group name for the evaluation |
| `labels` | `list[str]` | Tags for filtering |
| `input` | `Any` | Pre-populate `ctx.input` |
| `reference` | `Any` | Pre-populate `ctx.reference` |
| `metadata` | `dict` | Pre-populate `ctx.metadata` |
| `default_score_key` | `str` | Key for auto-added scores |
| `timeout` | `float` | Max execution time in seconds |
| `target` | `callable` | Pre-hook to run before evaluation |
| `evaluators` | `list[callable]` | Post-processing score functions |
| `input_loader` | `callable` | Async/sync function returning examples |