Score represents a single metric or assessment within an evaluation result.

Schema

class Score:
    key: str                    # Required: metric identifier
    value: float | None = None  # Optional: numeric score
    passed: bool | None = None  # Optional: pass/fail status
    notes: str | None = None    # Optional: explanation
At least one of value or passed must be provided.
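
To make this constraint concrete, here is a minimal, hypothetical check over the dict-style scores used throughout this page; it is not the library's own validation, only an illustration of the rule:

def is_valid_score(score: dict) -> bool:
    # A score needs a key plus at least one of value or passed.
    return "key" in score and ("value" in score or "passed" in score)

is_valid_score({"key": "accuracy", "passed": True})  # True
is_valid_score({"key": "accuracy"})                  # False: neither value nor passed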

Fields

key (required)

A string identifier for the metric. Use descriptive names:
{"key": "accuracy"}
{"key": "response_time"}
{"key": "format_valid"}
{"key": "contains_keywords"}

value (optional)

A numeric score, typically in the 0-1 range:
{"key": "confidence", "value": 0.95}
{"key": "similarity", "value": 0.82}
{"key": "quality", "value": 0.7}

passed (optional)

A boolean indicating pass/fail:
{"key": "format_valid", "passed": True}
{"key": "safety_check", "passed": False}

notes (optional)

A human-readable explanation of the result:
{
    "key": "accuracy",
    "passed": True,
    "notes": "Output matches reference exactly"
}

{
    "key": "length",
    "passed": False,
    "notes": "Response too short: 15 chars, expected 50+"
}

Score Types

Boolean Score

Simple pass/fail:
{"key": "correct", "passed": True}
{"key": "format_valid", "passed": False, "notes": "Missing closing bracket"}

Numeric Score

Continuous metric:
{"key": "confidence", "value": 0.87}
{"key": "similarity", "value": 0.92, "notes": "Cosine similarity"}

Combined Score

Both numeric and boolean:
{
    "key": "quality",
    "value": 0.75,
    "passed": True,  # Passes threshold
    "notes": "Score: 0.75 (threshold: 0.6)"
}

Creating Scores

Via store()

The recommended way to record scores:
@eval(dataset="demo", default_score_key="accuracy")
async def my_eval(ctx: EvalContext):
    # Boolean with notes
    ctx.store(scores={"passed": True, "notes": "Exact match"})

    # Numeric
    ctx.store(scores={"value": 0.85, "key": "confidence", "notes": "Confidence score"})

    # Boolean with custom key
    ctx.store(scores={"passed": output.is_valid, "key": "format", "notes": "Valid format"})

Direct Construction

For evaluators or manual creation:
similarity = compute_similarity(output, reference)
score = {
    "key": "semantic_similarity",
    "value": similarity,
    "passed": similarity > 0.8,
    "notes": f"Similarity: {similarity:.2f}"
}
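
The resulting dict is then passed to ctx.store() like any other score (mirroring the store() examples above):

ctx.store(scores=score)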

Multiple Scores

An evaluation can have multiple scores:
@eval(dataset="qa")
async def test_answer(ctx: EvalContext):
    ctx.input = "What is Python?"
    ctx.output = await agent(ctx.input)

    ctx.store(scores=[
        {"passed": is_correct, "key": "accuracy", "notes": "Factually correct"},
        {"passed": is_concise, "key": "brevity", "notes": "Under 100 words"},
        {"value": confidence, "key": "confidence", "notes": "Model confidence"},
        {"passed": is_safe, "key": "safety", "notes": "No harmful content"}
    ])
Result:
{
    "scores": [
        {"key": "accuracy", "passed": true, "notes": "Factually correct"},
        {"key": "brevity", "passed": true, "notes": "Under 100 words"},
        {"key": "confidence", "value": 0.92, "notes": "Model confidence"},
        {"key": "safety", "passed": true, "notes": "No harmful content"}
    ]
}

Score Aggregation

The CLI and Web UI aggregate scores:
Aggregation   Description
Pass Rate     % of scores where passed=True
Average       Mean of all value scores
By Key        Group scores by key for analysis
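
As an illustration only (not the CLI's or Web UI's actual implementation), these aggregations could be computed over a list of score dicts roughly like this:

from collections import defaultdict

def aggregate(scores: list[dict]) -> dict:
    # Illustrative sketch; EZVals computes these for you.
    # Pass rate: fraction of scores that set passed and are True
    with_passed = [s for s in scores if s.get("passed") is not None]
    pass_rate = (
        sum(s["passed"] for s in with_passed) / len(with_passed)
        if with_passed else None
    )

    # Average: mean of all numeric value scores
    values = [s["value"] for s in scores if s.get("value") is not None]
    average = sum(values) / len(values) if values else None

    # By key: group scores by key for per-metric analysis
    by_key = defaultdict(list)
    for s in scores:
        by_key[s["key"]].append(s)

    return {"pass_rate": pass_rate, "average": average, "by_key": dict(by_key)}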

Default Score Key

Set via the @eval decorator to name auto-created scores:
@eval(default_score_key="correctness")
async def my_eval(ctx: EvalContext):
    ctx.input = "test"
    ctx.output = "result"
    ctx.store(scores={"passed": True, "notes": "All checks passed"})
    # Score will have key="correctness"
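
Following the result format shown under Multiple Scores, the stored result would look roughly like:
{
    "scores": [
        {"key": "correctness", "passed": true, "notes": "All checks passed"}
    ]
}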

Auto-Scoring

If no score is added, EZVals creates one automatically:
@eval(dataset="demo", default_score_key="success")
async def test_no_explicit_score(ctx: EvalContext):
    ctx.input = "test"
    ctx.output = "result"
    # Auto-adds: {"key": "success", "passed": True}

Best Practices

# Good - consistent naming
ctx.store(scores=[
    {"key": "accuracy", ...},
    {"key": "format_valid", ...},
    {"key": "response_time", ...}
])

# Avoid - inconsistent keys
ctx.store(scores=[
    {"key": "Accuracy", ...},
    {"key": "is_format_valid", ...},
    {"key": "responseTime", ...}
])

# Good - explains the result
ctx.store(scores={
    "passed": False,
    "key": "accuracy",
    "notes": f"Expected '{expected}', got '{actual}'"
})

# Less helpful
ctx.store(scores={"passed": False, "key": "accuracy", "notes": "Failed"})
# Good - 0-1 range
ctx.store(scores={"value": similarity / 100, "key": "similarity"})

# Harder to interpret
ctx.store(scores={"value": similarity, "key": "similarity"})  # 0-100 range
# Useful for thresholded metrics
score = 0.75
ctx.store(scores={
    "value": score,
    "passed": score >= 0.7,
    "key": "quality",
    "notes": f"Score: {score} (threshold: 0.7)"
})