Every eval function receives an EvalContext that accumulates evaluation data and builds into an EvalResult.
from ezvals import eval, EvalContext

@eval(input="What is 2 + 2?", dataset="demo")
async def my_eval(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == "4", f"Expected 4, got {ctx.output}"

Pre-populated Fields

Set context fields directly in the @eval decorator:
@eval(
    input="What is the capital of France?",
    reference="Paris",
    dataset="geography",
    metadata={"category": "capitals"}
)
async def test_capital(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference
With @parametrize, special parameter names (input, reference, metadata, trace_data, latency) auto-populate context fields:
@eval(dataset="sentiment")
@parametrize("input,reference", [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
])
async def test_sentiment(ctx: EvalContext):
    ctx.output = await analyze_sentiment(ctx.input)
    assert ctx.output == ctx.reference
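
Other special names follow the same pattern; for example, per-case metadata can ride alongside the input and reference (a sketch assuming the same auto-population behavior, with illustrative metadata values):
@eval(dataset="sentiment")
@parametrize("input,reference,metadata", [
    ("I love this!", "positive", {"source": "reviews"}),
    ("This is terrible", "negative", {"source": "support"}),
])
async def test_sentiment_with_metadata(ctx: EvalContext):
    ctx.output = await analyze_sentiment(ctx.input)
    assert ctx.output == ctx.reference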

Setting Fields

Input, Output, and Reference

ctx.input = "Your test input"
ctx.output = "The model's response"
ctx.reference = "Expected output"  # Optional

Metadata

Store additional context for debugging and analysis:
ctx.metadata["model"] = "gpt-4"
ctx.metadata["temperature"] = 0.7

Scoring with Assertions

Use assertions to score:
ctx.output = await my_agent(ctx.input)

assert ctx.output is not None, "Got no output"
assert "expected" in ctx.output.lower(), "Missing expected content"
Failed assertions become failing scores with the message as notes.
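
For example, a sketch of an assertion-based eval, reusing my_agent from above, with comments describing how each outcome is scored:
@eval(input="What is 2 + 2?", dataset="demo")
async def test_contains_answer(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    # On failure, this message ends up in the failing score's notes
    assert "4" in ctx.output, "Answer does not mention 4"
    # On success, the eval falls through to the default passing score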

Using store() for Explicit Scoring

For numeric scores or multiple named metrics, use store():
# Numeric score (0-1 range)
ctx.store(scores=0.85)

# Multiple scores with different keys
ctx.store(scores=[
    {"passed": True, "key": "format"},
    {"value": 0.9, "key": "relevance"}
])
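
For example, a similarity-style metric can be computed and stored under a named key (a sketch; fuzzy_match is a hypothetical helper returning a float between 0 and 1):
ctx.output = await my_agent(ctx.input)
similarity = fuzzy_match(ctx.output, ctx.reference)
ctx.store(scores={"value": similarity, "key": "similarity"})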

Setting Multiple Fields at Once

store() lets you set multiple context fields in one call:
# From agent result
result = await run_agent(ctx.input)
ctx.store(**result, scores=True)  # Spread agent result

# Or set everything explicitly
ctx.store(
    input="test",
    output="response",
    latency=0.5,
    scores=True,
    messages=[{"role": "user", "content": "test"}],
    metadata={"model": "gpt-4"}
)
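
When spreading an agent result, the dict keys should match store() parameters. For example, run_agent might return something like:
result = {
    "output": "The capital of France is Paris.",
    "latency": 0.42,
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
ctx.store(**result, scores=True)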

Auto-Return Behavior

You don’t need to explicitly return anything—the context automatically builds into an EvalResult when the function completes:
@eval(input="test", dataset="demo")
async def my_eval(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output, "Got empty response"
    # No return needed

Exception Safety

If your evaluation throws an exception, partial data is preserved:
@eval(input="test input", dataset="demo")
async def test_with_error(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    # If this fails, input and output are still recorded
    assert ctx.output == "expected"
The resulting EvalResult will have:
  • input and output preserved
  • error field with the exception message
  • A failing score automatically added
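
Because partial data survives, it can be worth recording debugging details before assertions run (a sketch; the trace URL is illustrative):
@eval(input="test input", dataset="demo")
async def test_with_trace(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    ctx.store(trace_url="https://tracing.example.com/runs/abc123")
    # Even if this assertion fails, the trace URL stays on the result
    assert ctx.output == "expected"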

Default Scoring

If no score is added and no assertions fail, EZVals auto-adds a passing score:
@eval(input="test", dataset="demo", default_score_key="correctness")
async def test_auto_pass(ctx: EvalContext):
    ctx.output = "result"
    # No explicit score - auto-passes with key "correctness"

Custom Parameters

For parametrized tests with custom parameter names, include them in your function signature:
@eval(dataset="demo")
@parametrize("prompt,expected_category", [
    ("I want a refund", "complaint"),
    ("Thank you!", "praise"),
])
async def test_classification(ctx: EvalContext, prompt, expected_category):
    ctx.input = prompt
    ctx.output = await classify(prompt)
    assert ctx.output == expected_category

Run Metadata (for Observability)

When your eval function runs, the context includes metadata about the current run and eval. This is useful for tagging traces in LangSmith or other observability tools.
@eval(dataset="customer_service", labels=["production"])
async def my_eval(ctx: EvalContext):
    # Run-level metadata (same for all evals in a run)
    print(ctx.run_id)        # "1705312200"
    print(ctx.session_name)  # "model-upgrade"
    print(ctx.run_name)      # "baseline"
    print(ctx.eval_path)     # "evals/"

    # Per-eval metadata (from decorator)
    print(ctx.function_name) # "my_eval"
    print(ctx.dataset)       # "customer_service"
    print(ctx.labels)        # ["production"]

    # Tag traces with this metadata
    ctx.output = await my_agent(ctx.input)

Run-Level Metadata

Property        Type         Description
run_id          str | None   Unique run identifier (timestamp)
session_name    str | None   Session name for the run
run_name        str | None   Human-readable run name
eval_path       str | None   Path to eval file(s) being run

Per-Eval Metadata

Property         Type               Description
function_name    str | None         Name of the eval function
dataset          str | None         Dataset from @eval decorator
labels           list[str] | None   Labels from @eval decorator
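
For example, the run and eval metadata can be bundled into tags for your tracing client (a sketch; send_tags is a hypothetical stand-in for your observability SDK):
@eval(dataset="customer_service", labels=["production"])
async def tagged_eval(ctx: EvalContext):
    tags = {
        "run_id": ctx.run_id,
        "run_name": ctx.run_name,
        "session": ctx.session_name,
        "eval": ctx.function_name,
        "dataset": ctx.dataset,
        "labels": ctx.labels,
    }
    send_tags(tags)  # hypothetical: forward to LangSmith or similar
    ctx.output = await my_agent(ctx.input)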

API Reference

Method                                                                                        Description
store(input, output, reference, latency, scores, messages, trace_url, metadata, trace_data)  Set any context fields at once
build()                                                                                       Convert to immutable EvalResult

store() Parameters

Parameter     Type                   Description
input         Any                    The test input
output        Any                    The system output
reference     Any                    Expected output
latency       float                  Execution time in seconds
scores        bool/float/dict/list   Score(s) to add
messages      list                   Conversation messages (sets trace_data.messages)
trace_url     str                    External trace link (sets trace_data.trace_url)
metadata      dict                   Merges into ctx.metadata
trace_data    dict                   Merges into ctx.trace_data
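
The messages and trace_url parameters are shorthands for populating the nested trace data, for example:
ctx.store(
    messages=[{"role": "user", "content": "test"}],
    trace_url="https://example.com/traces/123",
)
# Equivalent to setting trace_data.messages and trace_data.trace_url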

EvalContext Properties

Property      Type        Description
input         Any         The test input
output        Any         The system output
reference     Any         Expected output (optional)
metadata      dict        Custom metadata
trace_data    TraceData   Debug/trace data
latency       float       Execution time in seconds
scores        list        List of Score objects