Score represents a single metric or assessment within an evaluation result.
Schema
class Score:
    key: str                      # Required: metric identifier
    value: float | None = None    # Optional: numeric score
    passed: bool | None = None    # Optional: pass/fail status
    notes: str | None = None      # Optional: explanation
At least one of value or passed must be provided.
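For example, the first two scores below satisfy this rule, while the third does not because it carries neither field:
{"key": "exact_match", "passed": True}   # valid: passed only
{"key": "similarity", "value": 0.82}     # valid: value only
{"key": "accuracy"}                      # invalid: no value and no passed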
Fields
key (required)
A string identifier for the metric. Use descriptive names:
{"key": "accuracy"}
{"key": "response_time"}
{"key": "format_valid"}
{"key": "contains_keywords"}
value (optional)
A numeric score, typically in the 0-1 range:
{"key": "confidence", "value": 0.95}
{"key": "similarity", "value": 0.82}
{"key": "quality", "value": 0.7}
passed (optional)
A boolean indicating pass/fail:
{"key": "format_valid", "passed": True}
{"key": "safety_check", "passed": False}
notes (optional)
A human-readable explanation of the result:
{
    "key": "accuracy",
    "passed": True,
    "notes": "Output matches reference exactly"
}
{
    "key": "length",
    "passed": False,
    "notes": "Response too short: 15 chars, expected 50+"
}
Score Types
Boolean Score
Simple pass/fail:
{"key": "correct", "passed": True}
{"key": "format_valid", "passed": False, "notes": "Missing closing bracket"}
Numeric Score
Continuous metric:
{"key": "confidence", "value": 0.87}
{"key": "similarity", "value": 0.92, "notes": "Cosine similarity"}
Combined Score
Both numeric and boolean:
{
    "key": "quality",
    "value": 0.75,
    "passed": True,  # Passes threshold
    "notes": "Score: 0.75 (threshold: 0.6)"
}
Creating Scores
Via store()
The recommended way:
@eval(dataset="demo", default_score_key="accuracy")
async def my_eval(ctx: EvalContext):
    # Boolean with notes; key falls back to default_score_key ("accuracy")
    ctx.store(scores={"passed": True, "notes": "Exact match"})
    # Numeric score with an explicit key
    ctx.store(scores={"value": 0.85, "key": "confidence", "notes": "Confidence score"})
    # Boolean with a custom key
    ctx.store(scores={"passed": output.is_valid, "key": "format", "notes": "Valid format"})
Direct Construction
For evaluators or manual creation:
similarity = compute_similarity(output, reference)
score = {
    "key": "semantic_similarity",
    "value": similarity,
    "passed": similarity > 0.8,
    "notes": f"Similarity: {similarity:.2f}"
}
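The resulting dict can then be recorded with ctx.store(scores=score), or included in a list passed to scores alongside other scores.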
Multiple Scores
An evaluation can have multiple scores:
@eval(dataset="qa")
async def test_answer(ctx: EvalContext):
    ctx.input = "What is Python?"
    ctx.output = await agent(ctx.input)
    ctx.store(scores=[
        {"passed": is_correct, "key": "accuracy", "notes": "Factually correct"},
        {"passed": is_concise, "key": "brevity", "notes": "Under 100 words"},
        {"value": confidence, "key": "confidence", "notes": "Model confidence"},
        {"passed": is_safe, "key": "safety", "notes": "No harmful content"}
    ])
Result:
{
    "scores": [
        {"key": "accuracy", "passed": true, "notes": "Factually correct"},
        {"key": "brevity", "passed": true, "notes": "Under 100 words"},
        {"key": "confidence", "value": 0.92, "notes": "Model confidence"},
        {"key": "safety", "passed": true, "notes": "No harmful content"}
    ]
}
Score Aggregation
The CLI and Web UI aggregate scores:
| Aggregation | Description |
|---|---|
| Pass Rate | % of scores where passed=True |
| Average | Mean of all value scores |
| By Key | Group scores by key for analysis |
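These statistics are also easy to reproduce by hand from a flat list of score dicts. The sketch below is illustrative only (aggregate is not an EZVals function) and assumes scores is a list of dicts shaped like the examples above:
from collections import defaultdict

def aggregate(scores: list[dict]) -> dict:
    # Illustrative helper, not part of the EZVals API.
    pass_flags = [s["passed"] for s in scores if s.get("passed") is not None]
    values = [s["value"] for s in scores if s.get("value") is not None]

    by_key = defaultdict(list)
    for s in scores:
        by_key[s["key"]].append(s)

    return {
        # Pass Rate: % of scores where passed=True
        "pass_rate": sum(pass_flags) / len(pass_flags) if pass_flags else None,
        # Average: mean of all value scores
        "average": sum(values) / len(values) if values else None,
        # By Key: scores grouped by key for per-metric analysis
        "by_key": dict(by_key),
    }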
Default Score Key
Set via the decorator to name scores that are stored without an explicit key, including auto-created scores:
@eval(default_score_key="correctness")
async def my_eval(ctx: EvalContext):
    ctx.input = "test"
    ctx.output = "result"
    ctx.store(scores={"passed": True, "notes": "All checks passed"})
    # Score will have key="correctness"
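As in the store() example above, scores stored with an explicit key keep that key; the default applies only when key is omitted.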
Auto-Scoring
If no score is added, EZVals creates one automatically:
@eval(dataset="demo", default_score_key="success")
async def test_no_explicit_score(ctx: EvalContext):
    ctx.input = "test"
    ctx.output = "result"
    # Auto-adds: {"key": "success", "passed": True}
Best Practices
# Good - consistent naming
ctx.store(scores=[
    {"key": "accuracy", ...},
    {"key": "format_valid", ...},
    {"key": "response_time", ...}
])
# Avoid - inconsistent keys
ctx.store(scores=[
    {"key": "Accuracy", ...},
    {"key": "is_format_valid", ...},
    {"key": "responseTime", ...}
])
# Good - explains the result
ctx.store(scores={
    "passed": False,
    "key": "accuracy",
    "notes": f"Expected '{expected}', got '{actual}'"
})
# Less helpful
ctx.store(scores={"passed": False, "key": "accuracy", "notes": "Failed"})
# Good - 0-1 range
ctx.store(scores={"value": similarity / 100, "key": "similarity"})
# Harder to interpret
ctx.store(scores={"value": similarity, "key": "similarity"}) # 0-100 range
# Useful for thresholded metrics
score = 0.75
ctx.store(scores={
    "value": score,
    "passed": score >= 0.7,
    "key": "quality",
    "notes": f"Score: {score} (threshold: 0.7)"
})