Create Benchmark
Start a benchmark run that compares a job run output against a ground truth dataset. Produces per-field accuracy scores and overall metrics.
Start a new benchmark run that evaluates your current extraction output against a ground truth dataset. The benchmark compares each document in the dataset entry-by-entry and field-by-field, producing an overall accuracy score and per-field breakdowns.
The typical workflow is: create a benchmark after making extraction pipeline changes, poll GET /v1/quality/benchmarks/:id until status is completed, then inspect results. Run multiple benchmarks against the same dataset over time to track accuracy trends.
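The create-then-poll workflow above can be sketched as a small helper. This is a minimal sketch, not an official client: `wait_for_benchmark`, the `fetch` callable, and the interval and timeout defaults are hypothetical names and values chosen for illustration. `fetch` is assumed to take a path and return the parsed JSON body of the detail endpoint. Only the completed status comes from the text above, so any other status keeps the loop polling until the timeout.

```python
import time
from typing import Any, Callable, Dict

Run = Dict[str, Any]


def wait_for_benchmark(
    fetch: Callable[[str], Run],
    benchmark_id: str,
    interval_s: float = 5.0,    # assumed polling interval, not from the docs
    timeout_s: float = 600.0,   # assumed upper bound, not from the docs
    sleep: Callable[[float], None] = time.sleep,
) -> Run:
    """Poll the benchmark detail endpoint until status is "completed".

    `fetch` is a hypothetical callable that performs the GET request and
    returns the decoded JSON body as a dict.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch(f"/v1/quality/benchmarks/{benchmark_id}")
        if run["status"] == "completed":
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"benchmark {benchmark_id} still {run['status']!r} "
                f"after {timeout_s}s"
            )
        sleep(interval_s)
```

In practice `fetch` would wrap an authenticated HTTP GET against your API host. The base URL and authentication scheme are not covered in this section, so they are left to the caller.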
The response returns the benchmark with status: queued, accuracy_overall: null, and documents_processed: 0. The documents_total field reflects how many entries are in the dataset. Poll the detail endpoint to check status and documents_processed for progress. Once completed, accuracy_overall and accuracy_by_field are populated.
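The progress fields described above can be turned into a one-line status summary. The sketch below is illustrative only: `progress_fraction` and `describe` are hypothetical helper names, and the code assumes the run is the decoded JSON body with the field names shown in this section. It prints accuracy_overall as returned, since the scale of that value is not stated here.

```python
from typing import Any, Dict, Optional


def progress_fraction(run: Dict[str, Any]) -> Optional[float]:
    """Return documents_processed / documents_total, or None if total is 0."""
    total = run.get("documents_total") or 0
    if total <= 0:
        return None
    processed = run.get("documents_processed") or 0
    return processed / total


def describe(run: Dict[str, Any]) -> str:
    """Build a short human-readable summary of a benchmark run."""
    fraction = progress_fraction(run)
    percent = "n/a" if fraction is None else f"{fraction:.0%}"
    accuracy = run.get("accuracy_overall")
    # accuracy_overall stays null until the run is completed
    accuracy_text = "pending" if accuracy is None else str(accuracy)
    processed = run.get("documents_processed") or 0
    total = run.get("documents_total") or 0
    return (
        f"{run['status']}: {processed}/{total} documents "
        f"({percent}), accuracy {accuracy_text}"
    )
```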
Multiple benchmarks can run in parallel against different datasets. Use GET /v1/quality/benchmarks/compare after completion to compare two runs side by side. The dataset_id is fixed at creation — to benchmark against a different dataset, create a new run.
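If you already hold two completed runs from the detail endpoint, a rough side-by-side view can also be computed locally. The sketch below is not the compare endpoint, whose parameters and response shape are not described in this section. `compare_runs` is a hypothetical helper, and it assumes accuracy_by_field is a flat mapping from field name to a numeric score, which this section does not confirm.

```python
from typing import Any, Dict


def compare_runs(baseline: Dict[str, Any], candidate: Dict[str, Any]) -> Dict[str, Any]:
    """Compute accuracy deltas (candidate minus baseline) for two runs.

    Assumes accuracy_by_field maps field name -> number (an assumption).
    """
    for run in (baseline, candidate):
        if run["status"] != "completed":
            raise ValueError(f"run {run['id']} is not completed")
    # Accuracy trends are only meaningful against the same dataset.
    if baseline["dataset_id"] != candidate["dataset_id"]:
        raise ValueError("runs were benchmarked against different datasets")

    base_fields = baseline["accuracy_by_field"]
    cand_fields = candidate["accuracy_by_field"]
    # Only fields present in both runs can be compared.
    shared = sorted(base_fields.keys() & cand_fields.keys())
    return {
        "overall": candidate["accuracy_overall"] - baseline["accuracy_overall"],
        "by_field": {name: cand_fields[name] - base_fields[name] for name in shared},
    }
```

Restricting the comparison to fields present in both runs avoids reporting a misleading delta when the extraction schema changed between runs.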
The new run starts with status queued. Poll the benchmark detail endpoint or list benchmarks to check when the run completes.
/v1/quality/benchmarks
Response fields (201 Created)
Response (201 Created)
{
  "id": "c3d4e5f6-a7b8-9012-cdef-123456789012",
  "name": "Benchmark 2024-09-25",
  "dataset_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "user_schema_id": null,
  "status": "queued",
  "accuracy_overall": null,
  "accuracy_by_field": null,
  "documents_processed": 0,
  "documents_total": 50,
  "duration_ms": null,
  "accuracy_delta": null,
  "compared_to_run_id": null,
  "created_at": "2024-09-25T12:00:00.000Z",
  "completed_at": null,
  "links": {
    "self": "/v1/quality/benchmarks/c3d4e5f6-a7b8-9012-cdef-123456789012",
    "results": "/v1/quality/benchmarks/c3d4e5f6-a7b8-9012-cdef-123456789012/results"
  }
}
Errors
Error responses