Score represents a single metric or assessment within an evaluation result.

Schema

class Score:
    key: str                    # Required: metric identifier
    value: float | None = None  # Optional: numeric score
    passed: bool | None = None  # Optional: pass/fail status
    notes: str | None = None    # Optional: explanation
At least one of value or passed must be provided.
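
To make this constraint concrete, here is a minimal, hypothetical check over the dict-style scores used throughout this page; it is not the library's own validation, only an illustration of the rule:

def is_valid_score(score: dict) -> bool:
    # A score needs a key plus at least one of value or passed.
    return "key" in score and ("value" in score or "passed" in score)

is_valid_score({"key": "accuracy", "passed": True})  # True
is_valid_score({"key": "accuracy"})                  # False: neither value nor passed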

Fields

key (required)

A string identifier for the metric. Use descriptive names:
{"key": "accuracy"}
{"key": "response_time"}
{"key": "format_valid"}
{"key": "contains_keywords"}

value (optional)

A numeric score, typically in the 0-1 range:
{"key": "confidence", "value": 0.95}
{"key": "similarity", "value": 0.82}
{"key": "quality", "value": 0.7}

passed (optional)

A boolean indicating pass/fail:
{"key": "format_valid", "passed": True}
{"key": "safety_check", "passed": False}

notes (optional)

A human-readable explanation of the result:
{
    "key": "accuracy",
    "passed": True,
    "notes": "Output matches reference exactly"
}

{
    "key": "length",
    "passed": False,
    "notes": "Response too short: 15 chars, expected 50+"
}

Score Types

Boolean Score

Simple pass/fail:
{"key": "correct", "passed": True}
{"key": "format_valid", "passed": False, "notes": "Missing closing bracket"}

Numeric Score

Continuous metric:
{"key": "confidence", "value": 0.87}
{"key": "similarity", "value": 0.92, "notes": "Cosine similarity"}

Combined Score

Both numeric and boolean:
{
    "key": "quality",
    "value": 0.75,
    "passed": True,  # Passes threshold
    "notes": "Score: 0.75 (threshold: 0.6)"
}

Creating Scores

Via store()

The recommended way to record scores:
@eval(dataset="demo", default_score_key="accuracy")
async def my_eval(ctx: EvalContext):
    # Boolean with notes
    ctx.store(scores={"passed": True, "notes": "Exact match"})

    # Numeric
    ctx.store(scores={"value": 0.85, "key": "confidence", "notes": "Confidence score"})

    # Boolean with custom key
    ctx.store(scores={"passed": output.is_valid, "key": "format", "notes": "Valid format"})

Direct Construction

For evaluators or manual creation:
similarity = compute_similarity(output, reference)
score = {
    "key": "semantic_similarity",
    "value": similarity,
    "passed": similarity > 0.8,
    "notes": f"Similarity: {similarity:.2f}"
}
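
The resulting dict is then passed to ctx.store() like any other score (mirroring the store() examples above):

ctx.store(scores=score)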

Multiple Scores

An evaluation can have multiple scores:
@eval(dataset="qa")
async def test_answer(ctx: EvalContext):
    ctx.input = "What is Python?"
    ctx.output = await agent(ctx.input)

    ctx.store(scores=[
        {"passed": is_correct, "key": "accuracy", "notes": "Factually correct"},
        {"passed": is_concise, "key": "brevity", "notes": "Under 100 words"},
        {"value": confidence, "key": "confidence", "notes": "Model confidence"},
        {"passed": is_safe, "key": "safety", "notes": "No harmful content"}
    ])
Result:
{
    "scores": [
        {"key": "accuracy", "passed": true, "notes": "Factually correct"},
        {"key": "brevity", "passed": true, "notes": "Under 100 words"},
        {"key": "confidence", "value": 0.92, "notes": "Model confidence"},
        {"key": "safety", "passed": true, "notes": "No harmful content"}
    ]
}

Score Aggregation

The CLI and Web UI aggregate scores:
Aggregation   Description
Pass Rate     % of scores where passed=True
Average       Mean of all value scores
By Key        Group scores by key for analysis
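
As an illustration only (not the CLI's or Web UI's actual implementation), these aggregations could be computed over a list of score dicts roughly like this:

from collections import defaultdict

def aggregate(scores: list[dict]) -> dict:
    # Illustrative sketch; EZVals computes these for you.
    # Pass rate: fraction of scores that set passed and are True
    with_passed = [s for s in scores if s.get("passed") is not None]
    pass_rate = (
        sum(s["passed"] for s in with_passed) / len(with_passed)
        if with_passed else None
    )

    # Average: mean of all numeric value scores
    values = [s["value"] for s in scores if s.get("value") is not None]
    average = sum(values) / len(values) if values else None

    # By key: group scores by key for per-metric analysis
    by_key = defaultdict(list)
    for s in scores:
        by_key[s["key"]].append(s)

    return {"pass_rate": pass_rate, "average": average, "by_key": dict(by_key)}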

Default Score Key

Set via the @eval decorator to name auto-created scores:
@eval(default_score_key="correctness")
async def my_eval(ctx: EvalContext):
    ctx.input = "test"
    ctx.output = "result"
    ctx.store(scores={"passed": True, "notes": "All checks passed"})
    # Score will have key="correctness"
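
Following the result format shown under Multiple Scores, the stored result would look roughly like:
{
    "scores": [
        {"key": "correctness", "passed": true, "notes": "All checks passed"}
    ]
}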

Auto-Scoring

If no score is added, EZVals creates one automatically:
@eval(dataset="demo", default_score_key="success")
async def test_no_explicit_score(ctx: EvalContext):
    ctx.input = "test"
    ctx.output = "result"
    # Auto-adds: {"key": "success", "passed": True}

Best Practices

# Good - consistent naming
ctx.store(scores=[
    {"key": "accuracy", ...},
    {"key": "format_valid", ...},
    {"key": "response_time", ...}
])

# Avoid - inconsistent keys
ctx.store(scores=[
    {"key": "Accuracy", ...},
    {"key": "is_format_valid", ...},
    {"key": "responseTime", ...}
])

# Good - explains the result
ctx.store(scores={
    "passed": False,
    "key": "accuracy",
    "notes": f"Expected '{expected}', got '{actual}'"
})

# Less helpful
ctx.store(scores={"passed": False, "key": "accuracy", "notes": "Failed"})
# Good - 0-1 range
ctx.store(scores={"value": similarity / 100, "key": "similarity"})

# Harder to interpret
ctx.store(scores={"value": similarity, "key": "similarity"})  # 0-100 range
# Useful for thresholded metrics
score = 0.75
ctx.store(scores={
    "value": score,
    "passed": score >= 0.7,
    "key": "quality",
    "notes": f"Score: {score} (threshold: 0.7)"
})