List Benchmarks

List benchmark runs that compare extraction results against ground truth datasets. Each run produces per-field accuracy metrics.

Each run evaluates every document in the dataset and produces an accuracy_overall score along with a per-field breakdown. Use benchmarks to track extraction quality over time and to measure the impact of schema or pipeline changes.

Use this endpoint to see all benchmark runs and their accuracy scores. A typical workflow is to list benchmarks after making schema or pipeline changes, then compare the latest run against previous ones using GET /v1/quality/benchmarks/compare to measure improvement or detect regressions.
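As a rough client-side illustration of that workflow, the sketch below takes the `data` array from a list response and computes the change in accuracy_overall between the two most recent completed runs. It is not a substitute for the compare endpoint. The helper names `parse_ts` and `latest_accuracy_change` are hypothetical, and because this page does not state the order in which runs are returned, the sketch sorts by created_at itself.

```python
from datetime import datetime
from typing import Optional


def parse_ts(value: str) -> datetime:
    # The example timestamps end in "Z"; swap it for an explicit offset so
    # fromisoformat also accepts them on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def latest_accuracy_change(runs: list[dict]) -> Optional[float]:
    """Return accuracy_overall of the newest completed run minus the one before it.

    Returns None when fewer than two completed runs are present.
    """
    completed = [
        r for r in runs
        if r["status"] == "completed" and r["accuracy_overall"] is not None
    ]
    completed.sort(key=lambda r: parse_ts(r["created_at"]))
    if len(completed) < 2:
        return None
    previous, latest = completed[-2], completed[-1]
    return latest["accuracy_overall"] - previous["accuracy_overall"]
```

A negative result would suggest a regression worth inspecting with the compare and results endpoints.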

Each benchmark includes status (queued, running, completed, or failed), accuracy_overall (0-1 score, null while running), accuracy_by_field (per-field breakdown), and documents_processed/documents_total for progress tracking. The accuracy_delta and compared_to_run_id fields support cross-run comparisons.
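The fields above are enough to render a simple progress line for each run. The following is a minimal sketch, assuming a run object has already been parsed into a Python dict; the function name `describe_run` and the output format are illustrative only.

```python
def describe_run(run: dict) -> str:
    """Build a one-line status summary for a benchmark run object."""
    total = run["documents_total"]
    processed = run["documents_processed"]
    # Guard against division by zero for a run with no documents.
    progress = processed / total if total else 0.0
    accuracy = run["accuracy_overall"]
    # accuracy_overall is null (None in Python) while the run is in progress.
    accuracy_text = "n/a" if accuracy is None else f"{accuracy:.2f}"
    return (
        f"{run['name']}: {run['status']}, "
        f"{processed}/{total} documents ({progress:.0%}), "
        f"accuracy {accuracy_text}"
    )
```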

Run a benchmark after each extraction pipeline change. Pair this endpoint with GET /v1/quality/benchmarks/:id/results for a per-document drill-down that shows which fields matched and which diverged, and use the compare endpoint to track accuracy trends across multiple runs.

GET /v1/quality/benchmarks

Response

Response fields

data (array): Array of benchmark run objects.

data[].id (string): Benchmark run UUID.

data[].name (string): Benchmark run name.

data[].dataset_id (string): Ground truth dataset ID used for this run.

data[].user_schema_id (string | null): User schema scoping this benchmark, if any.

data[].status (string): Run status: queued, running, completed, or failed.

data[].accuracy_overall (number | null): Overall accuracy score (0–1). Null while running.

data[].accuracy_by_field (object | null): Per-field accuracy scores. Null while running.

data[].documents_processed (integer): Number of documents evaluated so far.

data[].documents_total (integer): Total documents to evaluate.

data[].duration_ms (integer | null): Total run duration in milliseconds.

data[].created_at (string): ISO 8601 creation timestamp.

data[].completed_at (string | null): ISO 8601 completion timestamp.

data[].links.self (string): URL to this benchmark run.

data[].links.results (string): URL to the per-document results.

pagination.total (integer): Total number of benchmark runs.

pagination.limit (integer): Maximum results per page.

pagination.has_more (boolean): Whether more results exist beyond this page.

pagination.next_cursor (string | null): Cursor to fetch the next page.
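The pagination fields suggest a standard cursor loop. This page does not say how the cursor is passed back to the server, so the sketch below leaves the HTTP call to a caller-supplied function: `fetch_page` is a hypothetical callable that takes a cursor (None for the first page) and returns one parsed response body.

```python
from typing import Callable, Iterator, Optional

# Hypothetical callable: takes a cursor (None for the first page) and returns
# the parsed JSON body of one GET /v1/quality/benchmarks response.
FetchPage = Callable[[Optional[str]], dict]


def iter_benchmarks(fetch_page: FetchPage) -> Iterator[dict]:
    """Yield every benchmark run, following next_cursor until has_more is false."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page["data"]
        pagination = page["pagination"]
        cursor = pagination["next_cursor"]
        # Stop when the server reports no further pages or gives no cursor.
        if not pagination["has_more"] or cursor is None:
            return
```

Keeping the transport separate makes the loop easy to exercise with canned pages before wiring in a real HTTP client.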

Example response

{
  "data": [
    {
      "id": "c3d4e5f6-a7b8-9012-cdef-123456789012",
      "name": "Benchmark 2024-09-25",
      "dataset_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "user_schema_id": null,
      "status": "completed",
      "accuracy_overall": 0.93,
      "accuracy_by_field": {
        "vendor_name": 0.98,
        "total_amount": 0.90,
        "invoice_number": 0.92
      },
      "documents_processed": 50,
      "documents_total": 50,
      "duration_ms": 4200,
      "accuracy_delta": null,
      "compared_to_run_id": null,
      "created_at": "2024-09-25T12:00:00.000Z",
      "completed_at": "2024-09-25T12:00:04.200Z",
      "links": {
        "self": "/v1/quality/benchmarks/c3d4e5f6-a7b8-9012-cdef-123456789012",
        "results": "/v1/quality/benchmarks/c3d4e5f6-a7b8-9012-cdef-123456789012/results"
      }
    }
  ],
  "pagination": {
    "total": 5,
    "limit": 20,
    "has_more": false,
    "next_cursor": null
  }
}
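One way to use a response like this is to flag the fields that drag accuracy down. The sketch below assumes a run object parsed into a Python dict; the function name `weakest_fields` and the threshold value are illustrative choices, not part of the API.

```python
def weakest_fields(run: dict, threshold: float) -> list[tuple[str, float]]:
    """Return (field, score) pairs scoring below threshold, lowest first."""
    # accuracy_by_field is null while a run is in progress; treat that as empty.
    by_field = run["accuracy_by_field"] or {}
    below = [(name, score) for name, score in by_field.items() if score < threshold]
    return sorted(below, key=lambda item: item[1])
```

With the example run above and a threshold of 0.95, this returns total_amount (0.90) followed by invoice_number (0.92).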

Errors

Error responses

401 unauthorized: Missing or invalid API key.

429 rate_limited: Too many requests. Retry after the period indicated in the Retry-After header.
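A client might handle these two errors roughly as follows. This is a sketch under several assumptions: `send` is a hypothetical callable that performs one request and returns a (status, headers, body) tuple, Retry-After is assumed to carry a number of seconds (HTTP also allows a date there), and the one-second fallback and three-attempt limit are arbitrary.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical transport: performs one request and returns
# (status_code, headers, parsed_body). Swap in any HTTP client.
Send = Callable[[], Tuple[int, Mapping[str, str], dict]]


def send_with_retry(
    send: Send,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send(), waiting and retrying when the API answers 429."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status == 429 and attempt < max_attempts:
            # Assumption: Retry-After is given in seconds; fall back to 1 second.
            try:
                delay = float(headers.get("Retry-After", "1"))
            except ValueError:
                delay = 1.0
            sleep(delay)
            continue
        if status == 401:
            # Retrying will not help until the API key is fixed.
            raise PermissionError("Missing or invalid API key")
        if status != 200:
            raise RuntimeError(f"Unexpected status {status}")
        return body
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the wait injectable, which is convenient in tests.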
