Every eval function receives an EvalContext that accumulates evaluation data and builds into an EvalResult.
from ezvals import eval, EvalContext

@eval(input="What is 2 + 2?", dataset="demo")
async def my_eval(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == "4", f"Expected 4, got {ctx.output}"

Pre-populated Fields

Set context fields directly in the @eval decorator:
@eval(
    input="What is the capital of France?",
    reference="Paris",
    dataset="geography",
    metadata={"category": "capitals"}
)
async def test_capital(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output == ctx.reference
With @parametrize, special parameter names (input, reference, metadata, trace_data, latency) auto-populate context fields:
@eval(dataset="sentiment")
@parametrize("input,reference", [
    ("I love this!", "positive"),
    ("This is terrible", "negative"),
])
async def test_sentiment(ctx: EvalContext):
    ctx.output = await analyze_sentiment(ctx.input)
    assert ctx.output == ctx.reference
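
Other special names follow the same pattern; for example, per-case metadata can ride alongside the input and reference (a sketch assuming the same auto-population behavior, with illustrative metadata values):
@eval(dataset="sentiment")
@parametrize("input,reference,metadata", [
    ("I love this!", "positive", {"source": "reviews"}),
    ("This is terrible", "negative", {"source": "support"}),
])
async def test_sentiment_with_metadata(ctx: EvalContext):
    ctx.output = await analyze_sentiment(ctx.input)
    assert ctx.output == ctx.reference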

Setting Fields

Input, Output, and Reference

ctx.input = "Your test input"
ctx.output = "The model's response"
ctx.reference = "Expected output"  # Optional

Metadata

Store additional context for debugging and analysis:
ctx.metadata["model"] = "gpt-4"
ctx.metadata["temperature"] = 0.7

Scoring with Assertions

Use assertions to score:
ctx.output = await my_agent(ctx.input)

assert ctx.output is not None, "Got no output"
assert "expected" in ctx.output.lower(), "Missing expected content"
Failed assertions become failing scores with the message as notes.
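
For example, a sketch of an assertion-based eval, reusing my_agent from above, with comments describing how each outcome is scored:
@eval(input="What is 2 + 2?", dataset="demo")
async def test_contains_answer(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    # On failure, this message ends up in the failing score's notes
    assert "4" in ctx.output, "Answer does not mention 4"
    # On success, the eval falls through to the default passing score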

Using store() for Explicit Scoring

For numeric scores or multiple named metrics, use store():
# Numeric score (0-1 range)
ctx.store(scores=0.85)

# Multiple scores with different keys
ctx.store(scores=[
    {"passed": True, "key": "format"},
    {"value": 0.9, "key": "relevance"}
])
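
For example, a similarity-style metric can be computed and stored under a named key (a sketch; fuzzy_match is a hypothetical helper returning a float between 0 and 1):
ctx.output = await my_agent(ctx.input)
similarity = fuzzy_match(ctx.output, ctx.reference)
ctx.store(scores={"value": similarity, "key": "similarity"})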

Setting Multiple Fields at Once

store() lets you set multiple context fields in one call:
# From agent result
result = await run_agent(ctx.input)
ctx.store(**result, scores=True)  # Spread agent result

# Or set everything explicitly
ctx.store(
    input="test",
    output="response",
    latency=0.5,
    scores=True,
    messages=[{"role": "user", "content": "test"}],
    metadata={"model": "gpt-4"}
)
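
When spreading an agent result, the dict keys should match store() parameters. For example, run_agent might return something like:
result = {
    "output": "The capital of France is Paris.",
    "latency": 0.42,
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
ctx.store(**result, scores=True)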

Auto-Return Behavior

You don’t need to explicitly return anything—the context automatically builds into an EvalResult when the function completes:
@eval(input="test", dataset="demo")
async def my_eval(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    assert ctx.output, "Got empty response"
    # No return needed

Exception Safety

If your evaluation throws an exception, partial data is preserved:
@eval(input="test input", dataset="demo")
async def test_with_error(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    # If this fails, input and output are still recorded
    assert ctx.output == "expected"
The resulting EvalResult will have:
  • input and output preserved
  • error field with the exception message
  • A failing score automatically added
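
Because partial data survives, it can be worth recording debugging details before assertions run (a sketch; the trace URL is illustrative):
@eval(input="test input", dataset="demo")
async def test_with_trace(ctx: EvalContext):
    ctx.output = await my_agent(ctx.input)
    ctx.store(trace_url="https://tracing.example.com/runs/abc123")
    # Even if this assertion fails, the trace URL stays on the result
    assert ctx.output == "expected"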

Default Scoring

If no score is added and no assertions fail, EZVals auto-adds a passing score:
@eval(input="test", dataset="demo", default_score_key="correctness")
async def test_auto_pass(ctx: EvalContext):
    ctx.output = "result"
    # No explicit score - auto-passes with key "correctness"

Custom Parameters

For parametrized tests with custom parameter names, include them in your function signature:
@eval(dataset="demo")
@parametrize("prompt,expected_category", [
    ("I want a refund", "complaint"),
    ("Thank you!", "praise"),
])
async def test_classification(ctx: EvalContext, prompt, expected_category):
    ctx.input = prompt
    ctx.output = await classify(prompt)
    assert ctx.output == expected_category

Run Metadata (for Observability)

When your eval function runs, the context includes metadata about the current run and eval. This is useful for tagging traces in LangSmith or other observability tools.
@eval(dataset="customer_service", labels=["production"])
async def my_eval(ctx: EvalContext):
    # Run-level metadata (same for all evals in a run)
    print(ctx.run_id)        # "1705312200"
    print(ctx.session_name)  # "model-upgrade"
    print(ctx.run_name)      # "baseline"
    print(ctx.eval_path)     # "evals/"

    # Per-eval metadata (from decorator)
    print(ctx.function_name) # "my_eval"
    print(ctx.dataset)       # "customer_service"
    print(ctx.labels)        # ["production"]

    # Tag traces with this metadata
    ctx.output = await my_agent(ctx.input)

Run-Level Metadata

Property        Type         Description
run_id          str | None   Unique run identifier (timestamp)
session_name    str | None   Session name for the run
run_name        str | None   Human-readable run name
eval_path       str | None   Path to eval file(s) being run

Per-Eval Metadata

Property         Type               Description
function_name    str | None         Name of the eval function
dataset          str | None         Dataset from @eval decorator
labels           list[str] | None   Labels from @eval decorator
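
For example, the run and eval metadata can be bundled into tags for your tracing client (a sketch; send_tags is a hypothetical stand-in for your observability SDK):
@eval(dataset="customer_service", labels=["production"])
async def tagged_eval(ctx: EvalContext):
    tags = {
        "run_id": ctx.run_id,
        "run_name": ctx.run_name,
        "session": ctx.session_name,
        "eval": ctx.function_name,
        "dataset": ctx.dataset,
        "labels": ctx.labels,
    }
    send_tags(tags)  # hypothetical: forward to LangSmith or similar
    ctx.output = await my_agent(ctx.input)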

API Reference

Method                                                                                        Description
store(input, output, reference, latency, scores, messages, trace_url, metadata, trace_data)  Set any context fields at once
build()                                                                                       Convert to immutable EvalResult

store() Parameters

Parameter     Type                   Description
input         Any                    The test input
output        Any                    The system output
reference     Any                    Expected output
latency       float                  Execution time in seconds
scores        bool/float/dict/list   Score(s) to add
messages      list                   Conversation messages (sets trace_data.messages)
trace_url     str                    External trace link (sets trace_data.trace_url)
metadata      dict                   Merges into ctx.metadata
trace_data    dict                   Merges into ctx.trace_data
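
The messages and trace_url parameters are shorthands for populating the nested trace data, for example:
ctx.store(
    messages=[{"role": "user", "content": "test"}],
    trace_url="https://example.com/traces/123",
)
# Equivalent to setting trace_data.messages and trace_data.trace_url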

EvalContext Properties

Property      Type        Description
input         Any         The test input
output        Any         The system output
reference     Any         Expected output (optional)
metadata      dict        Custom metadata
trace_data    TraceData   Debug/trace data
latency       float       Execution time in seconds
scores        list        List of Score objects