Create Benchmark

Start a benchmark run that compares a job run's extraction output against a ground truth dataset. The benchmark evaluates each dataset entry document by document and field by field, producing an overall accuracy score and a per-field breakdown.

The typical workflow is: create a benchmark after making extraction pipeline changes, poll GET /v1/quality/benchmarks/:id until status is completed, then inspect results. Run multiple benchmarks against the same dataset over time to track accuracy trends.
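The polling step of this workflow can be sketched as follows. This is a minimal illustration, not an official client: `fetch` is a hypothetical callable that performs GET /v1/quality/benchmarks/:id and returns the decoded JSON body, and the interval and timeout defaults are arbitrary. Only the `completed` status named above is treated as terminal; any other status keeps the loop going until the timeout.

```python
import time
from typing import Any, Callable, Dict

Benchmark = Dict[str, Any]


def wait_for_benchmark(
    fetch: Callable[[str], Benchmark],
    benchmark_id: str,
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Benchmark:
    """Poll until the run reports status "completed", or raise on timeout.

    `fetch` is a hypothetical helper (an assumption of this sketch) that
    wraps GET /v1/quality/benchmarks/:id and returns the parsed response.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        benchmark = fetch(benchmark_id)
        if benchmark["status"] == "completed":
            return benchmark
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"benchmark {benchmark_id} still {benchmark['status']!r}"
            )
        sleep(interval_s)
```

Passing `fetch` and `sleep` in as parameters keeps the loop independent of any particular HTTP library and makes it easy to exercise with canned responses.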

The response returns the benchmark with status: queued, accuracy_overall: null, and documents_processed: 0. The documents_total field reflects how many entries are in the dataset. Poll the detail endpoint to check status and documents_processed for progress. Once completed, accuracy_overall and accuracy_by_field are populated.
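Progress can be derived from the two counters described above. The helper below is a small sketch; the function name is illustrative, and returning `None` for an empty dataset is a design choice of this example rather than documented API behavior.

```python
from typing import Any, Dict, Optional


def progress_fraction(benchmark: Dict[str, Any]) -> Optional[float]:
    """Return documents_processed / documents_total as a value in [0, 1].

    Returns None when documents_total is missing or zero, since no
    meaningful fraction exists in that case.
    """
    total = benchmark.get("documents_total") or 0
    if total <= 0:
        return None
    processed = benchmark.get("documents_processed") or 0
    return min(processed / total, 1.0)
```

For the freshly created run in the example response (0 of 50 documents processed), this yields `0.0`.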

Multiple benchmarks can run in parallel against different datasets. Use GET /v1/quality/benchmarks/compare after completion to compare two runs side by side. The dataset_id is fixed at creation — to benchmark against a different dataset, create a new run.

Benchmark runs are asynchronous. The endpoint returns immediately with status queued. Poll the benchmark detail endpoint or list benchmarks to check when the run completes.

POST /v1/quality/benchmarks
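A request to this endpoint might be built as shown below. Several details are assumptions, because this page does not document them: the host `https://api.example.com` is a placeholder, the `Authorization: Bearer` header is a guess at the auth scheme, and the body contains only `dataset_id`, the one field the error table marks as required. The sketch builds the request without sending it.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host (assumption)


def build_create_benchmark_request(
    api_key: str, dataset_id: str
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/quality/benchmarks request."""
    body = json.dumps({"dataset_id": dataset_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/quality/benchmarks",
        data=body,
        method="POST",
        headers={
            # Assumed auth scheme; check your API key documentation.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Sending the request with `urllib.request.urlopen(req)` and decoding the body with `json.load` would then give the 201 response described in the next section.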

Response

Response fields (201 Created)

id (string): Benchmark run UUID.

name (string): Benchmark run name.

dataset_id (string): Ground truth dataset ID.

user_schema_id (string | null): User schema ID, if scoped.

status (string): Initial status: queued.

accuracy_overall (null): Null until the run completes.

documents_processed (integer): Always 0 at creation.

documents_total (integer): Total entries in the dataset to evaluate.

created_at (string): ISO 8601 creation timestamp.

completed_at (null): Null until the run completes.

links.self (string): URL to this benchmark run.

links.results (string): URL to the per-document results.

Response (201 Created)

{
  "id": "c3d4e5f6-a7b8-9012-cdef-123456789012",
  "name": "Benchmark 2024-09-25",
  "dataset_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "user_schema_id": null,
  "status": "queued",
  "accuracy_overall": null,
  "accuracy_by_field": null,
  "documents_processed": 0,
  "documents_total": 50,
  "duration_ms": null,
  "accuracy_delta": null,
  "compared_to_run_id": null,
  "created_at": "2024-09-25T12:00:00.000Z",
  "completed_at": null,
  "links": {
    "self": "/v1/quality/benchmarks/c3d4e5f6-a7b8-9012-cdef-123456789012",
    "results": "/v1/quality/benchmarks/c3d4e5f6-a7b8-9012-cdef-123456789012/results"
  }
}

Errors

Error responses

400 (validation_error): Missing required field: dataset_id.

401 (unauthorized): Missing or invalid API key.

404 (not_found): The specified dataset_id does not exist for your workspace.

429 (rate_limited): Too many requests. Retry after the period indicated in the Retry-After header.
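When handling a 429, the wait time can be read from the Retry-After header. The sketch below accepts both forms that HTTP allows for this header (a number of seconds or an HTTP date). Which form this API sends is not stated here, and the one-second fallback is an arbitrary choice of this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(
    header_value: Optional[str],
    now: Optional[datetime] = None,
    default: float = 1.0,
) -> float:
    """Convert a Retry-After header value into a delay in seconds."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Delay-seconds form, e.g. "30".
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 25 Sep 2024 12:00:30 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # Never return a negative delay if the date is already in the past.
    return max((when - now).total_seconds(), 0.0)
```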
