0.1.7 - 2026-02-17
- Added: SDK `run(...)` API for programmatic eval execution with `ezvals run` option parity; CLI `ezvals run` now calls this shared core path.
- Changed: SDK `run(...)` now defaults to built-in runtime settings without creating `ezvals.json`; CLI continues to honor config by explicitly opting into config loading.
- Changed: Runner internals now share one execution path for sequential and concurrent result shaping, reducing duplicated callback/loader/result mapping code.
- Changed: Serve startup for eval-path and JSON-path flows now uses a shared server boot helper (port/url/browser/server lifecycle) to reduce duplication.
- Tests: Stabilized the full-rerun E2E control test by asserting new-run artifacts in storage, removing flaky UI timing dependence under `pytest -n auto`.
- Added: Detail sidebar score cards now support inline editing with per-score pencil controls, Save/Cancel actions, and persistence through the existing result PATCH API.
- Changed: Score editing in detail view is type-locked; boolean scores remain boolean and value scores remain value-only, and keyboard save shortcuts were removed in favor of explicit Save clicks.
- Changed: Detail edit pencils for scores and annotation now appear on hover for a cleaner default sidebar presentation.
- Tests: Added E2E coverage for detail score editing, including type-specific controls and persisted updates after page reload.
- Fixed: `ezvals skills add` now installs the full bundled skill directory, including nested `use-cases/` and `ezvals-docs/` folders.
- Tests: Added CLI unit assertions that `skills add` installs nested skill content for both canonical and `.agents/` fallback installs.
- Changed: `ezvals skills add` now uses explicit target flags (`--agents`, `--claude`, `--codex`, `--cursor`, `--windsurf`, `--kiro`, `--roo`) instead of `--agents <name>`.
- Changed: `ezvals skills add` now fails fast when no target flag is provided.
- Changed: When `--agents` is selected, `.agents` is always treated as the canonical source (including combined targets like `--agents --claude`).
- Tests: Added CLI unit coverage for required target flags and `.agents` canonical-priority behavior.
- Added: Agent skill now prompts users to file GitHub issues when they mention EZVals friction, bugs, or feature ideas.
- Changed: Docs navigation reorders examples to show RAG Agent before Granular Evals.
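The SDK-default-vs-CLI-config behavior above can be sketched as a small settings-resolution step. This is an illustrative sketch only: the names `BUILTIN_DEFAULTS` and `resolve_settings` are hypothetical, not the actual ezvals API.

```python
# Hypothetical sketch of settings resolution: SDK callers get built-in
# defaults unless they pass overrides; the CLI opts into config loading
# explicitly by supplying a loaded `config` dict.
BUILTIN_DEFAULTS = {"concurrency": 1, "results_dir": ".ezvals/sessions"}

def resolve_settings(overrides, config=None):
    settings = dict(BUILTIN_DEFAULTS)
    if config:
        settings.update(config)          # CLI path: config beats built-ins
    settings.update({k: v for k, v in overrides.items() if v is not None})
    return settings                      # explicit overrides always win
```

With this shape, `run(...)` called with no arguments never touches `ezvals.json`, while the CLI can still layer config between built-ins and explicit flags.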
0.1.6 - 2026-02-13
- Fixed: Dataset and label chips now truncate long text with ellipsis and expose the full value via hover tooltip in results rows, comparison rows, and filter pills.
- Fixed: Detail sidebar dataset/label chips now truncate long values so long names do not stretch metadata layout.
- Changed: Example eval labels now include intentionally long three-word values so chip truncation behavior is easier to verify in the UI.
- Tests: Added E2E coverage for long dataset/label chip truncation and tooltip behavior.
- Fixed: Dashboard table column resizing now keeps the dragged edge aligned with cursor movement by stabilizing visible header widths at drag start.
- Fixed: Resizing a column no longer unintentionally toggles table sorting on mouse release.
- Tests: Added E2E coverage for column-resize drag behavior (edge tracking, width change, and no-sort-on-drag regression guard).
- Added: Dashboard now includes a slim overflow menu action to reload the EZVals server from the UI when state gets stuck or eval files change.
- Changed: In-app reload now performs a full `ezvals serve` process restart with the same serve arguments and preserves the active bound port.
- Tests: Added unit and E2E coverage for restart re-exec flow, restart API signaling, and overflow-menu restart interaction.
- Added: Detail view now auto-detects chat-style message arrays in `input`, `output`, and `reference`, rendering them in a chat-like “Pretty” mode with a “Raw” JSON toggle.
- Changed: Message normalization/rendering now lives in a reusable UI component (`DataViewer`) used across detail panels and the trace messages drawer.
- Changed: `examples/demo_trace_data.py::test_direct_trace_data` now includes message-array examples for `input`, `output`, and `reference`.
- Tests: Added E2E coverage to verify pretty/raw message rendering and toggling across input/output/reference and the messages pane.
- Changed: Default score-key naming now uses `pass`/`Pass` in runtime fallbacks, examples, tests, and docs where the previous default used `correctness`/`Correctness`.
- Tests: Updated integration, unit, and E2E expectations to assert the `pass` default score key label.
- Changed: PNG export now uses a compact options button with a cleaner, more ergonomic configuration panel.
- Added: PNG export supports editable title, score color tuning, comparison run rename/reorder controls, and footer toggles for test count and average latency.
- Fixed: PNG preview updates no longer flicker while typing configuration inputs.
- Fixed: Comparison-mode PNG exports no longer overlap footer metrics with the run key legend.
- Tests: Added E2E coverage for configurable PNG export options, comparison run reordering in export, and preview regeneration behavior.
- Changed: Detail view metadata now renders as a readable key-value list instead of raw JSON, with human-friendly key labels.
- Fixed: URL metadata values in detail view are now clickable links.
- Tests: Added E2E coverage for metadata label formatting and clickable metadata URLs in detail view.
- Added: Detail sidebar now shows a subtle “Tools” row with unique tool names detected from `trace_data.messages` tool calls.
- Changed: Detail message parsing now supports common message-array tool-call schemas (`tool_calls`, `tool_use`, `tool_call`, and `function_call`) so tool usage is extracted more reliably.
- Tests: Added E2E coverage for mixed message-array tool-call schemas to verify tool name extraction and deduplication in the detail sidebar.
- Added: Run controls now include Pause/Resume in addition to Stop. Pause lets in-flight evaluations finish while holding queued work; Resume continues pending evaluations from the same run.
- Added: `ezvals serve` now auto-builds frontend assets when `ui/src` is newer than `ezvals/static`, reducing stale-UI surprises during local development.
- Changed: Running-state controls are now an icon-only connected cluster (pause/play + stop) with theme-matched styling.
- Fixed: Quick pause/resume interactions no longer spawn overlapping runner loops; resume now unpauses the active run thread instead of starting a duplicate execution path.
- Fixed: Stopped runs retain immediate-cancel behavior, while paused runs continue polling until in-flight rows finish so statuses update without manual refresh.
- Tests: Added integration coverage for pause/resume API behavior and E2E coverage for pause/resume UI and run control state transitions.
- Added: `ezvals serve` now supports `--open`/`--no-open` to control whether the browser opens automatically on startup.
- Changed: CLI and web UI docs now describe browser auto-open as default behavior and document `--no-open` for terminal-only startup flows.
- Tests: Added CLI unit coverage to verify `--no-open` is exposed in help and propagated for both eval-path and JSON-path serve flows.
- Changed: Run metadata on `EvalContext` (`ctx.run_id`, `ctx.session_name`, `ctx.run_name`, `ctx.eval_path`) is now read-only after context initialization.
- Tests: Added unit coverage to verify run metadata fields on `EvalContext` cannot be reassigned.
- Added: Dashboard now keeps shareable URL query state in sync for search, filters, compare selection, and table sort (`sort=<col>,<dir>,<type>`).
- Changed: Canonical comparison links now use repeated `compare_run_id` params without `run_id` in compare mode.
- Changed: Evals skill URL-sharing guidance now shows compare-mode links using the `compare_run_id`-only format.
- Fixed: Launch URLs shaped like `run_id=<id>&compare_run_id=<id>` now reliably hydrate two-run comparison mode.
- Tests: Added E2E coverage for single-compare query hydration, live filter/search URL sync, and sort query hydration/sync.
- Added: Evals skill `running.md` now documents the per-result `EvalResult` output schema (`results[*].result`) with concrete JSON examples.
- Added: Evals skill includes parsing recipes for common agent tasks like finding failing evals, filtering by error text, and extracting/aggregating score values.
- Added: Up/down controls on comparison chips so runs can be reordered directly in compare mode without exiting.
- Changed: Comparison chip order now immediately drives comparison chart bar order and comparison table column order.
- Tests: Added E2E coverage for reordering comparison runs from chip controls.
- Fixed: Detail view now uses viewport-aware defaults for input/output split, sidebar width, and reference panel height so first-time users do not start with broken proportions.
- Tests: Added E2E coverage to verify fresh-session detail layout keeps sidebar width and reference panel height within sane viewport-relative bounds.
- Fixed: Run dropdown now appears when a serve-started unsaved active run has another existing run in the same session, so headless `ezvals run` results are immediately switchable.
- Fixed: Compare action now appears when there is at least one other run available in the current session, even if the active run is not yet saved.
- Tests: Added E2E coverage for the unsaved-active + existing-session-run dropdown scenario to prevent regression.
- Fixed: Expanded table rows no longer force all cells to top-align, preventing the visual jump when clicking rows with short content.
- Tests: Added E2E coverage to verify row clicks keep cell vertical alignment stable for rows with non-overflowing content.
- Fixed: `ezvals serve` now reuses the requested port more reliably on rapid stop/start cycles by probing availability with socket reuse enabled.
- Tests: Added unit coverage to prevent regressions in serve port probing during quick restarts.
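The port-reuse fix above hinges on probing with socket reuse enabled, so a socket lingering in `TIME_WAIT` from a just-stopped server does not make the port look occupied. A minimal stdlib sketch of the idea (the function name `port_is_free` is illustrative, not the actual ezvals internal):

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    # Probe by binding with SO_REUSEADDR: an active listener still fails
    # the bind, but a TIME_WAIT remnant from a rapid stop/start no longer
    # reports the port as busy.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```

This keeps quick restart cycles on the same requested port reliable without stealing a port that another process is actively serving on.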
0.1.5 - 2026-02-07
- Added: PNG export option in the export dropdown. Opens a preview modal with Save and Copy buttons. Renders stats bars, metrics, and branding on a 1200x630 canvas at 2x resolution for crisp output.
- Added: PNG export supports comparison mode with grouped bars, outline-styled run name chips, and per-run colors.
- Added: CI guard workflow to prevent accidentally committing generated UI artifacts.
- Added: Wheel verification step in publish workflow to ensure UI assets are bundled.
- Added: Agent skill guidance for sharing focused results views via direct URLs when the user already has the UI running.
- Changed: Frontend build artifacts (`ezvals/static/`) are no longer tracked in git. The publish workflow builds the UI and bundles assets into the wheel at release time.
- Changed: Actionable error message when serving from a git checkout with missing UI assets.
- Changed: Stats bars and comparison bars now render with gradient fills, color-matched glow shadows, and a glassy highlight effect.
- Changed: Serve startup state now uses explicit readable URL parameters and no longer uses the encoded startup payload.
- Fixed: Comparison mode stats bars and comparison run counts now compute from the currently visible filtered rows instead of full-run aggregates.
- Fixed: Comparison chips showed run IDs instead of run names when comparison runs were loaded from URL parameters.
- Tests: Added E2E coverage for comparison mode to verify stats bar percentages update after applying filters.
- Tests: Added test for missing-UI-assets error response.
- Tests: Added coverage for startup URL hydration and explicit verification that legacy startup payload query input is ignored.
0.1.4 - 2026-02-05
- Changed: Migrated UI source from JSX to non-strict TypeScript (TSX).
- Fixed: Tailwind CSS content config to scan `.ts`/`.tsx` files after TypeScript migration.
- Added: Sticky table header so column names stay visible while scrolling.
0.1.3 - 2026-01-25
- Fixed: PyPI publish workflow failing due to duplicate files in wheel (removed redundant force-include).
- Tests: E2E tests now use dynamic port allocation for parallel execution with pytest-xdist.
0.1.2 - 2026-01-24
- Changed: Replaced `@parametrize` with `cases=` on `@eval` using list-of-dict cases.
- Changed: Case dicts can override `input`, `reference`, `metadata`, `dataset`, `labels`, `default_score_key`, `timeout`, `target`, and `evaluators`; `id` controls variant naming.
- Changed: Web UI rebuilt with React and Vite for improved maintainability. No user-facing behavior changes.
- Removed: `@parametrize` decorator and tuple-based case syntax.
- Fixed: Run button incorrectly showed “Stop” instead of “Run” on fresh page load when no evaluations were running.
- Fixed: Inline run name editing now works when multiple runs exist (dropdown visible).
- Fixed: Race condition in discovery module when clearing cached modules (sys.modules changed during iteration).
- Tests: Updated E2E tests for React UI migration—fixed selectors, element IDs, and interaction patterns.
- Tests: Updated E2E tests to expect dropdown visible when rows are selected (matching current UI behavior).
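The `cases=`/`id` behavior described for this release can be sketched as a simple expansion step. This is a hypothetical sketch of the mechanics (`expand_cases` is an illustrative helper, not the actual ezvals implementation):

```python
import copy

def expand_cases(func_name, cases):
    # Each case dict becomes one variant; `id` controls variant naming,
    # falling back to the positional index. Cases are deep-copied so
    # variants never share mutable state.
    variants = []
    for i, case in enumerate(cases):
        case = copy.deepcopy(case)
        variant_id = case.pop("id", str(i))
        variants.append((f"{func_name}[{variant_id}]", case))
    return variants
```

Under this shape, `@eval(cases=[{"input": ..., "id": "one"}, ...])` would yield pytest-style variant names like `my_eval[one]`.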
0.1.1 - 2026-01-03
- Changed: Stats bar error count now always displays (shows 0 when no errors) for consistent layout.
- Changed: Reorganized stats bar metrics into rows for better visual hierarchy.
- Changed: Updated documentation UI screenshots with new logo.
0.1.0 - 2026-01-02
- Added: Markdown export with stats visualization and filtered data support.
- Added: Export dropdown menu in header with JSON, CSV, and Markdown formats.
- Added: `ezvals export` CLI command for exporting runs to JSON, CSV, or Markdown.
- Added: Markdown export includes ASCII progress bars with color emoji indicators.
- Added: Filtered Markdown export respects current filters and column visibility.
- Added: Comparison mode for viewing 2-4 runs side-by-side. Click “+ Compare” in stats bar to select runs, view grouped bar charts with color-coded metrics, and compare outputs across runs in a unified table with per-run columns.
- Added: `GET /api/runs/{run_id}/data` endpoint for fetching run data without changing the active run.
- Added: Run dropdown selector to switch between past runs in the same session. Dropdown appears when 2+ runs exist.
- Added: Activate run endpoint `POST /api/runs/{run_id}/activate` to switch the active run being viewed.
- Added: Run metadata on `EvalContext` for observability tagging. Access `ctx.run_id`, `ctx.session_name`, `ctx.run_name`, `ctx.eval_path` (run-level) and `ctx.function_name`, `ctx.dataset`, `ctx.labels` (per-eval) inside eval functions for LangSmith/observability integration.
- Changed: Export moved from Settings modal to dedicated dropdown in header.
- Changed: Stats bar chart values now show stacked format—percentage prominent on top, ratio smaller below (e.g., “87%” over “54/62”).
- Changed: Run selection persisted across page reloads within session.
- Fixed: Hot reload now works when editing target modules. Previously, changes to imported modules weren’t picked up on Rerun because Python’s `sys.modules` cache wasn’t cleared.
- Fixed: Rerun after renaming a run now works correctly. Previously, `app.state.run_name` wasn’t synced after rename, causing results to write to a duplicate file.
- Fixed: Progress bar now uses CSS variable `--progress-bar-bg` for proper light/dark mode theming.
- Fixed: Edit run name button now works when dropdown is shown (hides dropdown, shows input).
- Fixed: Run dropdown now hides when rows are selected for rerun.
- Tests: Added E2E tests for comparison mode functionality.
- Tests: Added hot reload test for module cache clearing.
- Tests: Added tests for run metadata injection into EvalContext.
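The hot-reload fix in this release (clearing Python's `sys.modules` cache) and the earlier race-condition fix (mutating `sys.modules` during iteration) can both be sketched in a few lines. This is an illustrative stdlib sketch, not the actual ezvals discovery code:

```python
import sys

def clear_cached_modules(prefix: str) -> int:
    # Iterate over a snapshot (list(sys.modules)) so the dict can shrink
    # while we delete entries; mutating a dict during direct iteration
    # raises RuntimeError. Dropping the entries forces a fresh import of
    # the target module and its submodules on the next Rerun.
    removed = 0
    for name in list(sys.modules):
        if name == prefix or name.startswith(prefix + "."):
            del sys.modules[name]
            removed += 1
    return removed
```

After this, a subsequent `import` of the target module re-executes the file from disk, so edits to imported modules are picked up.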
0.0.2a17 - 2025-12-16
- Added: Per-case `dataset` and `labels` support via `@parametrize` and `input_loader`. Dataset overrides function-level; labels merge (no duplicates).
- Added: `input_loader` parameter on `@eval` decorator for dynamic data loading from external sources (databases, APIs like LangSmith). Loader is called lazily at eval time; each example becomes a separate eval run.
- Added: Load previous runs with `ezvals serve path/to/run.json`. Opens UI with that run loaded; rerun enabled if the source eval file exists, view-only mode if not.
- Added: HTTP API reference documentation for building custom UIs and integrations.
- Added: Inline run name editing—click pencil icon in stats bar to rename runs. Enter or checkmark saves, Escape or blur cancels.
- Added: Click-to-copy on session/run names with “Copied!” tooltip feedback.
- Changed: Stats panel now dynamically updates when filters/search are active. Shows “filtered/total” format for test count, with latency and score chips calculated from visible rows only.
- Changed: Removed separate filtered summary bar—main stats panel now serves as single source of truth for all statistics.
- Changed: Scores column is now sortable in the results table.
- Fixed: Sticky header no longer overlaps content when scrolling.
- Fixed: Progress bar uses subtle animation instead of shimmer effect.
- Fixed: Latency display now shows correct values for filtered results.
- Fixed: Detail page correctly displays reference values and trace URLs.
- Tests: Added unit tests for input_loader functionality.
- Tests: Updated E2E filter test to verify new dynamic stats behavior.
- Tests: Added integration tests for loading previous run JSON files.
- Tests: Added E2E tests for run rename feature and UI fixes.
0.0.2a9 - 2025-12-06
- Changed: Project renamed from Twevals to EZVals. Package is now `ezvals`, CLI command is `ezvals`, config file is `ezvals.json`, results directory is `.ezvals/`.
0.0.2a16 - 2025-12-06
- Changed: Results storage restructured to hierarchical sessions with overwrite-by-name semantics. New directory structure: `.ezvals/sessions/{session_name}/{run_name}_{timestamp}.json`.
- Changed: `serve` command auto-generates session names (each serve = new session). `run` command defaults to the “default” session.
- Changed: Config `results_dir` default updated from `.ezvals/runs` to `.ezvals/sessions`.
- Added: GitHub-style split button for Run controls with Rerun/New Run dropdown options.
- Added: Run mode persistence via localStorage - dropdown selection is sticky across sessions.
- Added: Descriptive subtext in dropdown menu (“Overwrite current run results” / “Create a fresh run in this session”).
- Added: API endpoints for new run: `POST /api/runs/new` (creates without overwrite), `DELETE /api/runs/{run_id}`, `DELETE /api/sessions/{session_name}`.
- Fixed: Run button state no longer sporadically resets after clicking Run due to a race condition in `_hasRunBefore` tracking.
- Fixed: Run name now displays immediately on server start (previously only appeared after first run file was saved).
- Tests: Added E2E tests for split button behavior and run controls.
- Tests: Updated unit tests for new storage structure and session management.
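The documented storage layout (`.ezvals/sessions/{session_name}/{run_name}_{timestamp}.json`) can be sketched as a small path builder. This is an illustrative sketch; `run_file_path` is a hypothetical helper, not the actual storage API:

```python
from pathlib import Path
import time

def run_file_path(results_dir, session_name, run_name, timestamp=None):
    # Mirrors the documented layout: sessions are directories, and each
    # run file is named {run_name}_{timestamp}.json, so rerunning with
    # the same run name in the same session overwrites by name.
    ts = timestamp or time.strftime("%Y%m%d_%H%M%S")
    return Path(results_dir) / "sessions" / session_name / f"{run_name}_{ts}.json"
```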
0.0.2a15 - 2025-12-05
- Changed: `add_output()` and `add_score()` replaced with unified `store()` method. Use `ctx.store(output=..., scores=..., messages=..., trace_url=...)` for all context updates. Same score key overwrites, different key appends.
- Changed: `run_data` renamed to `trace_data` with structured `TraceData` schema. First-class `messages` and `trace_url` properties, plus arbitrary extra fields via dict-style access.
- Added: `--run` flag for `ezvals serve` to auto-run all evaluations on startup.
- Added: Light mode support with theme toggle in UI header.
- Changed: UI completely refactored to client-side rendering with new dashboard design featuring bar charts and score breakdown.
- Added: Three-state filtering for datasets and labels (include/exclude/any) in the web UI.
- Added: Filter persistence across navigation using sessionStorage.
- Added: Keyboard shortcuts for table view (r=refresh, e=export, f=filter).
- Added: Experience specifications (`.spec/`) as canonical source of truth for product behavior.
- Tests: Added E2E tests for keyboard shortcuts and stats bar.
- Tests: Added unit tests for CLI filtering and store() method.
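The `store()` score semantics above (“same score key overwrites, different key appends”) can be sketched as a small merge function. This is an illustrative sketch of the rule, not the actual `EvalContext.store()` implementation:

```python
def store_scores(existing, new_scores):
    # Scores are dicts with a "key" field. Storing a score whose key
    # already exists replaces it in place; a new key appends, so the
    # original ordering of keys is preserved.
    merged = list(existing)
    for score in new_scores:
        for i, prior in enumerate(merged):
            if prior["key"] == score["key"]:
                merged[i] = score
                break
        else:
            merged.append(score)
    return merged
```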
0.0.2a14 - 2025-12-01
- Changed: Documentation rewritten for clarity—simplified examples, removed redundant code patterns, consistent use of direct `ctx.output` assignment.
- Changed: README simplified to match new documentation patterns—assertions as primary scoring, direct property assignment, removed `set_params()` references.
- Changed: Removed `set_params()` method from `EvalContext` (use direct property assignment instead).
- Added: UI screenshot in documentation assets.
0.0.2a13 - 2025-11-30
- Added: `--no-save` flag for `ezvals run` to skip saving results to file (outputs JSON to stdout instead).
- Added: Dataset and label pill filters in the web UI sidebar for quick filtering.
- Added: Shift-click range selection for result checkboxes in the web UI.
- Added: Trace URL link button in result detail page when `run_data.trace_url` is present.
- Added: Footer with GitHub and documentation links in the web UI.
- Changed: Error messages now include full stack traces for better debugging.
- Changed: Errors print immediately to console (in red) during evaluation runs in both serve and run commands.
- Changed: Server logging reduced to warnings only (no access logs) for cleaner output.
- Changed: Web UI header is now sticky for better navigation.
0.0.2a12 - 2025-11-30
- Changed: CLI restructured into explicit subcommands - `ezvals serve <path>` starts the UI, `ezvals run <path>` runs headless. The `--serve` flag is removed.
- Changed: `ezvals serve` no longer auto-runs evaluations. Discovered evals are displayed in the UI with “not_started” status; users click Run to start execution.
- Changed: Removed `--limit`, `--dev`, `--host`, `--list`, and `--quiet` flags from `ezvals serve` to simplify the serve command.
- Changed: `ezvals run` now outputs minimal text by default (optimized for LLM agents) - just “Running…” and “Results saved to…”. Use `--visual` for rich output.
- Added: `--visual` flag for `ezvals run` to show progress dots, results table, and summary (the previous default behavior).
- Added: `--verbose`/`-v` flag for `ezvals run` now shows stdout from eval functions instead of controlling output verbosity.
- Changed: Results are always auto-saved to file. Priority: `--output` flag > config `results_dir` > default `.ezvals/runs/`.
- Changed: Removed `--csv` and `--json` flags from `ezvals run` - results always save as JSON.
- Changed: Concurrency minimum is now 1 (sequential execution). Values < 1 throw an error.
- Added: Support for running selected evaluations from the initial “not_started” state via checkbox selection in the UI.
- Added: `ezvals.json` config file for persisting CLI defaults. Auto-generated on first run with `concurrency` and `results_dir`.
- Added: Settings UI in the web interface to edit runtime options (concurrency, results_dir, timeout). Changes take effect on next run without restarting the server.
- Removed: `rerun_config` from run JSON files (YAGNI - the feature for viewing historical runs didn’t exist).
- Tests: Updated CLI tests to use the new `run` subcommand syntax.
0.0.2a11 - 2025-11-29
- Added: Full-page detail view for individual evaluation results with dedicated URL (`/runs/{run_id}/results/{index}`), enabling new-tab support and multi-monitor workflows.
- Added: Keyboard navigation on detail page (Arrow keys to navigate between results, Escape to return to table).
- Added: Expandable table rows - click any row to expand and view truncated content inline; click function name to navigate to full detail page.
- Added: Scores wrap and show all values when table row is expanded (previously limited to 2 with “+N” overflow).
- Added: Markdown rendering and syntax highlighting for Output content on detail page.
- Added: Collapsible Run Data section on detail page.
- Changed: Scroll position is now preserved when returning from detail page to table view.
- Changed: Removed expandable detail rows from table in favor of row expand/collapse pattern and dedicated detail page.
- Fixed: Server log output no longer experiences a staircase effect when using `--serve`, due to improved terminal handling.
- Added: Server can now be stopped by pressing the `Esc` key when running with `--serve`.
- Changed: EvalContext detection now uses the type annotation (`: EvalContext`) instead of parameter name matching. Any parameter name works as long as it’s typed correctly.
- Added: Session management for grouping related eval runs together with `--session` and `--run-name` CLI flags.
- Added: Auto-generated friendly names (adjective-noun format like “swift-falcon”) when session/run names are not provided.
- Added: File naming with run-name prefix: `{run_name}_{timestamp}.json`.
- Added: Session/run metadata (`session_name`, `run_name`, `run_id`) in run JSON files.
- Added: UI display of current session and run name in the stats bar.
- Added: API endpoints for session management: `GET /api/sessions`, `GET /api/sessions/{name}/runs`, `PATCH /api/runs/{run_id}`.
- Added: Run controls - stop button to cancel running/pending evaluations mid-run.
- Added: Selective rerun - select individual results via checkboxes and rerun only those.
- Added: UI selection checkboxes with select-all functionality and indeterminate state.
- Added: `POST /api/runs/stop` endpoint to cancel running evaluations.
- Changed: `POST /api/runs/rerun` now accepts an optional `indices` parameter for selective reruns.
- Tests: Added E2E tests for run controls (selection, start, stop, selective rerun).
- Changed: Codebase cleanup reducing ~510 lines through consolidating duplicate patterns, simplifying async/sync execution, and removing redundant code.
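The auto-generated friendly names described above (adjective-noun, like “swift-falcon”) can be sketched in a few lines. The word lists here are illustrative placeholders, not the actual ezvals vocabulary:

```python
import random

# Illustrative word lists; the real generator's vocabulary is larger.
ADJECTIVES = ["swift", "calm", "brave"]
NOUNS = ["falcon", "otter", "cedar"]

def friendly_name(rng=random):
    # Adjective-noun pairs give memorable, low-collision default names
    # for sessions and runs when none are provided.
    return f"{rng.choice(ADJECTIVES)}-{rng.choice(NOUNS)}"
```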
0.0.2a10 - 2025-11-28
- Added: `run_evals()` function for programmatic execution of multiple evals with support for functions, paths, concurrency, and all CLI options.
- Added: Direct call support for parametrized evals - calling a parametrized function now runs all variants and returns `List[EvalResult]`.
- Changed: `EvalFunction.__call__` now detects the `__param_sets__` attribute and automatically runs all parametrized variants.
- Added: UI redesign with dark mode support, amber/zinc color scheme, improved typography, and responsive layout.
- Added: Background evaluation execution - UI loads immediately while evals run in background with live status updates.
- Added: Auto-open browser when starting `ezvals --serve` for faster workflow.
- Added: Rerun configuration stored in run JSON for reproducible reruns from UI.
- Changed: Simplified server/CLI code by inlining background execution logic.
0.0.2a9 - 2025-11-23
- Added: `timeout` parameter to `@eval` decorator for setting per-evaluation timeout limits in seconds.
- Added: `--timeout` CLI flag for `ezvals run` command to set a global timeout that overrides individual test timeouts.
- Added: Timeout enforcement for both sync and async functions using `concurrent.futures.ThreadPoolExecutor` and `asyncio.wait_for`.
- Added: Timeout support for target hooks with proper error handling and latency tracking on timeout.
- Tests: Added comprehensive timeout tests covering async/sync functions and target hooks.
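The sync-side timeout enforcement above can be sketched with `ThreadPoolExecutor`. This is an illustrative sketch of the mechanism, not the actual ezvals runner code; returning `None` on timeout stands in for whatever error result the runner records:

```python
import concurrent.futures
import time

def run_with_timeout(fn, timeout_s):
    # Run a sync eval in a worker thread and bound its wall-clock time.
    # future.result(timeout=...) raises TimeoutError if the function is
    # still running when the deadline passes.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None  # stand-in for recording a timeout error result
    finally:
        # Don't block on the (possibly still-running) worker thread.
        pool.shutdown(wait=False)
```

Note the thread itself cannot be killed; it runs to completion in the background while the runner moves on, which is the usual trade-off of thread-based timeouts (async functions use `asyncio.wait_for` instead, which can cancel).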
0.0.2a8 - 2025-11-23
- Added: `--list` flag to `ezvals run` command to list evaluations without running them, preserving all filtering options.
- Changed: Removed standalone `list` command in favor of `run --list`.
- Tests: Added regression test for concurrency output capturing.
0.0.2a7 - 2025-11-23
- Added: File-level defaults via `ezvals_defaults` dictionary - set global properties (dataset, labels, metadata, etc.) at the top of test files that all tests inherit, similar to pytest’s `pytestmark` pattern.
- Added: Deep merge for metadata - when both file defaults and decorator specify metadata, they are merged with decorator values taking precedence on conflicts.
- Added: Deep copy of mutable values (lists, dicts) in file defaults to prevent shared mutation between tests.
- Added: Validation and warnings for unknown keys in the `ezvals_defaults` dictionary.
- Changed: `default_score_key` parameter default changed from “pass” to None to enable file-level defaults, with “pass” still applied as the final fallback via EvalContext.
- Tests: Added 17 comprehensive tests for file defaults including inheritance, overrides, deep merge, mutable value copying, and `default_score_key` priority chain.
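The deep-merge rule for metadata above (file defaults and decorator values merge, decorator wins on conflicts) can be sketched as a recursive merge. This is an illustrative sketch, not the actual ezvals merge code:

```python
import copy

def merge_metadata(file_default, decorator_value):
    # Decorator values win on conflicts; nested dicts merge recursively.
    # Deep copies prevent tests from sharing (and mutating) the same
    # file-level default objects.
    merged = copy.deepcopy(file_default)
    for key, value in decorator_value.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_metadata(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```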
0.0.2a6 - 2025-11-23
- Added: `target` parameter to `@eval` decorator allowing pre-hook functions that run before the evaluation function, enabling separation of agent/LLM invocation from evaluation logic.
- Added: Target hooks receive `EvalContext` and can populate output, metadata, and custom attributes before the eval function executes.
- Added: Target hooks support both sync and async functions, with automatic latency tracking.
- Added: Parametrize integration with targets - parametrized values are automatically available to target hooks via `ctx.input` and `ctx.metadata`.
- Added: Target return value handling - targets can return dicts (treated as output payload) or EvalResult objects for flexible result injection.
- Changed: Parametrize now defaults `ctx.input` to parametrized values when no explicit input is provided, making param data accessible to targets.
- Tests: Added comprehensive unit tests for target functionality including output injection, error handling, async support, and parametrize integration.
0.0.2a5 - 2025-11-22
- Added: `--json` flag for `ezvals run` command to output results as compact JSON to stdout, omitting null values for machine-readable output.
- Added: Pytest-style progress reporting during evaluation execution with colored output (green for pass, red for fail/error).
- Added: Progress display shows one line per file with filename prefix followed by status characters (`.`, `F`, `E`), matching pytest’s output format.
- Added: Detailed failure reporting after progress output showing `dataset::function_name`, error messages, and input/output for failed evaluations.
- Added: Progress hooks (`on_start`, `on_complete`) to `EvalRunner` for extensible progress reporting.
- Changed: Replaced spinner-only progress indicator with real-time pytest-style character output.
- Tests: Added comprehensive tests for progress reporting hooks and CLI progress output validation.
- Tests: Added tests for `--json` flag output format validation and null value omission.
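The status-character mapping above (`.`, `F`, `E`) can be sketched as a small classifier over a result dict. This is an illustrative sketch of the rule, using an assumed result shape with `error` and `scores` fields, not the actual reporter code:

```python
def progress_char(result):
    # '.' = pass, 'F' = fail, 'E' = error, matching pytest's characters.
    if result.get("error"):
        return "E"
    all_passed = all(s.get("passed") for s in result["scores"])
    return "." if all_passed else "F"
```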
0.0.2a4 - 2025-11-22
- Added: Function name filtering using `file.py::function_name` syntax, similar to pytest. Run specific evaluation functions or parametrized variants (e.g., `ezvals run tests.py::my_eval` or `tests.py::my_eval[param1]`).
- Tests: Added comprehensive tests for function filtering including exact matches, parametrized variants, and combined filters with dataset/labels.
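The selector syntax above can be sketched as a small parser. This is an illustrative sketch (`parse_selector` is a hypothetical helper), not the actual ezvals filter implementation:

```python
import re

def parse_selector(selector):
    # Split "tests.py::my_eval[param1]" into (file, function, variant).
    # Missing parts come back as None.
    file_part, _, func_part = selector.partition("::")
    if not func_part:
        return file_part, None, None
    m = re.fullmatch(r"([^\[\]]+)(?:\[(.+)\])?", func_part)
    return file_part, m.group(1), m.group(2)
```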
0.0.2a3 - 2025-11-22
- Fixed: `add_output()` now correctly handles dicts without EvalResult fields (like `{'full_name': 'Kim Diaz'}`) by storing them as-is in the output field, instead of incorrectly treating them as structured EvalResult dicts.
- Fixed: Eval results table now preserves source file order instead of alphabetically sorting functions by name (resolves #2).
- Changed: Replaced `ParametrizedEvalFunction` class with function attributes (`__param_sets__`, `__param_ids__`) for simpler architecture.
- Changed: `generate_eval_functions()` is now a standalone function instead of a class method.
- Changed: Removed special-case handling for parametrized functions in `@eval` decorator, unifying code paths.
- Tests: Added comprehensive tests for `add_output()` behavior with arbitrary dict structures to prevent regression.
- Tests: Added test to verify source file order preservation in discovery.
0.0.2a2 - 2025-11-22
- Fixed: Failed assertions now create failing scores instead of error states, properly treating them as validation failures rather than execution errors.
- Fixed: Output field is now preserved in results table when assertions fail, instead of being overwritten with error messages.
- Changed: Table formatter only displays error in output column when output is empty, preventing loss of actual output data.
- Tests: Added comprehensive tests for assertion handling including edge cases for assertions without messages, non-assertion errors, and multiple assertions.
0.0.2a1 - 2025-11-22
- Fixed: Module discovery now properly handles relative imports by temporarily adding parent directory to sys.path with cleanup.
0.0.2a0 - 2025-11-22
- Added: `EvalContext` (accessible as `ctx`, `context`, or `carrier` parameter) provides a mutable builder pattern for constructing eval results incrementally with methods like `add_output()`, `add_score()`, and `set_params()`.
- Added: Context manager support for `EvalContext` allowing `with` statement usage in eval functions.
- Added: Automatic context injection - functions accepting a `ctx`/`context`/`carrier` parameter receive an `EvalContext` instance automatically.
- Added: Auto-return feature - eval functions using context don’t need explicit `return` statements; context is auto-built at function end.
- Added: Smart `add_output()` method that extracts EvalResult fields (output, latency, run_data, metadata) from dict responses or sets output directly.
- Added: Flexible `add_score()` supporting boolean pass/fail, numeric values, custom keys, and full control via kwargs.
- Added: `set_params()` helper for parametrized tests to set both input and metadata from params.
- Added: Filters in the web UI for dataset, labels, and status with multi-select support and persistence.
- Added: Reference column in results table to display expected/ground truth outputs.
- Changed: Tests that complete without calling `add_score()` now automatically pass with a default “pass” score, similar to pytest behavior.
- Changed: `EvalContext` defaults to “pass” as the default score key (previously required an explicit key).
- Changed: Migrated from Poetry to uv for faster dependency management and installation.
- Changed: `EvalContext.build()` auto-adds `{"key": "pass", "passed": True}` when no scores are provided and no error occurred.
- Changed: `EvalRunner` ensures all results without scores and without errors receive a default passing score.
- Tests: Added comprehensive unit tests for EvalContext methods, context manager pattern, and decorator integration.
- Tests: Added integration tests for context usage patterns, parametrize auto-mapping, and assertion preservation.
- Tests: Added e2e test for advanced UI filters functionality.
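The default-pass rule above (`EvalContext.build()` auto-adds `{"key": "pass", "passed": True}` when there are no scores and no error) is a one-liner to sketch. This is an illustrative sketch, not the actual `build()` implementation:

```python
def finalize_scores(scores, error=None):
    # A result with no explicit scores and no error passed by default,
    # mirroring pytest, where a test that runs to completion passes.
    if not scores and error is None:
        return [{"key": "pass", "passed": True}]
    return scores
```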
0.0.1a0 - 2025-09-04
- Added: Results Web UI with expandable rows, multi-column sorting, column toggles and resizable columns, copy-to-clipboard, and summary chips for score ratios/averages.
- Added: Inline editing in the UI for dataset, labels, metadata (JSON), scores (key/value/passed/notes), and a free-form annotation. Edits are persisted to JSON.
- Added: Actions menu in the UI with Refresh, Rerun full suite, Export JSON, and Export CSV.
- Added: Server endpoints: `PATCH /api/runs/{run_id}/results/{index}` for updates; `POST /api/runs/rerun`; `GET /api/runs/{run_id}/export/{json|csv}`.
- Added: `ezvals serve --dev` hot-reload mode for rapid UI/eval iteration.
- Added: CLI CSV export via `ezvals run ... --csv results.csv` (in addition to JSON `-o`).
- Added: `ResultsStore` for robust, atomic JSON writes under `.ezvals/runs/` with a `latest.json` convenience copy.
- Added: `EvalResult.run_data` for run-specific structured data, displayed in the UI details panel.
- Changed: CLI `run` prints a results table by default and a concise summary below it.
- Tests: Added integration tests for server (JSON flow, export endpoints, rerun) and e2e UI tests (Playwright), plus unit tests for storage behavior.
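The “robust, atomic JSON writes” mentioned for `ResultsStore` are typically done with a write-then-rename pattern. This is an illustrative stdlib sketch of that pattern (`atomic_write_json` is a hypothetical helper), not the actual `ResultsStore` code:

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    # Write to a temp file in the same directory, then atomically replace
    # the target, so readers never observe a half-written results file.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.replace(tmp, path)  # atomic on POSIX when same filesystem
    except BaseException:
        os.unlink(tmp)
        raise
```

Writing the temp file in the same directory as the target is what makes `os.replace` an atomic same-filesystem rename.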

