Create Ground Truth Dataset

Create a new ground truth dataset, optionally linked to a schema. The dataset defines the expected extraction output used for accuracy benchmarking.

Create an empty ground truth dataset that you can populate with verified entries. Datasets serve as the baseline for benchmark runs that measure extraction accuracy. After creating a dataset, add entries individually or import them in bulk via CSV.

The typical workflow is: create the dataset, then populate it using POST /v1/quality/ground-truth/:id/entries for individual entries or POST /v1/quality/ground-truth/:id/entries/import-csv for bulk import. Once populated, create a benchmark run with POST /v1/quality/benchmarks.
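The workflow above can be sketched as an ordered list of calls. The helper name workflow_endpoints is hypothetical, and the request bodies for the entry and benchmark endpoints are not described on this page, so this sketch only builds the method and path for each step.

```python
def workflow_endpoints(dataset_id=None):
    """Return (method, path) pairs for the workflow, in call order.

    When dataset_id is None, the ":id" placeholder from the docs is kept.
    """
    ds = dataset_id if dataset_id is not None else ":id"
    base = "/v1/quality/ground-truth"
    return [
        ("POST", base),                               # 1. create the dataset
        ("POST", f"{base}/{ds}/entries"),             # 2a. add a single entry
        ("POST", f"{base}/{ds}/entries/import-csv"),  # 2b. or bulk import via CSV
        ("POST", "/v1/quality/benchmarks"),           # 3. create a benchmark run
    ]


if __name__ == "__main__":
    for method, path in workflow_endpoints("a1b2c3d4-e5f6-7890-abcd-ef1234567890"):
        print(method, path)
```

Steps 2a and 2b are alternatives for populating the dataset; either can be repeated before the benchmark run is created.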

The response returns the dataset with document_count: 0 because it is initially empty. The user_schema_id is null unless you associate the dataset with a schema. The links.self URL points to the detail endpoint, where you can retrieve entries or delete the dataset.

For best results, aim for at least 30 to 50 entries per dataset. Linking a dataset to a user_schema_id ensures ground truth field names align with your extraction schema, producing more meaningful benchmark comparisons.

Field keys in expected_data entries should match the field names used in your extraction schema. Unmatched fields are stored but ignored during benchmark comparison.
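Because unmatched fields are silently ignored during comparison, it can help to check keys on the client before submitting an entry. The following is a minimal sketch of such a pre-check. The function name split_expected_keys and the sample field names are hypothetical, and this is not the server's comparison logic, which this page does not document.

```python
def split_expected_keys(expected_data, schema_fields):
    """Separate expected_data into fields the schema knows and keys it does not.

    Returns (matched, unmatched): matched is a dict of key/value pairs whose
    keys appear in schema_fields; unmatched is a sorted list of the other keys.
    """
    known = set(schema_fields)
    matched = {k: v for k, v in expected_data.items() if k in known}
    unmatched = sorted(k for k in expected_data if k not in known)
    return matched, unmatched


if __name__ == "__main__":
    # Hypothetical schema fields and entry, for illustration only.
    schema_fields = ["invoice_number", "total_amount", "due_date"]
    expected_data = {
        "invoice_number": "INV-001",
        "total_amount": 125.50,
        "Due Date": "2024-10-01",  # key does not match "due_date"
    }
    matched, unmatched = split_expected_keys(expected_data, schema_fields)
    print("compared:", sorted(matched))
    print("ignored:", unmatched)
```

A non-empty unmatched list usually points to a naming mismatch, such as different casing or spacing, that would otherwise leave a field out of the benchmark.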

POST /v1/quality/ground-truth
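A request to this endpoint might be built as in the sketch below. Several details are assumptions, not taken from this page: the host api.example.com is a placeholder, the Bearer authorization scheme is a guess at how the API key is sent, and the body fields name, description, and user_schema_id are inferred from the response fields and the 400 error. The sketch only constructs the request; sending it is left commented out.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host (assumption)


def build_create_dataset_request(api_key, name, description=None, user_schema_id=None):
    """Build, but do not send, a POST request that creates a dataset."""
    body = {"name": name}  # name is required (see the 400 error)
    if description is not None:
        body["description"] = description
    if user_schema_id is not None:
        body["user_schema_id"] = user_schema_id
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/quality/ground-truth",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # auth scheme is an assumption
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_create_dataset_request("YOUR_API_KEY", "Invoice Accuracy Set")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send: urllib.request.urlopen(req) and expect a 201 Created response.
```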

Response

Response fields (201 Created)

id (string): Dataset UUID.

name (string): Dataset name.

description (string | null): Optional description.

user_schema_id (string | null): Associated user schema ID, if any.

document_count (integer): Number of entries (0 for newly created datasets).

created_at (string): ISO 8601 creation timestamp.

links.self (string): URL to this dataset.

Response (201 Created)

{
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "name": "Invoice Accuracy Set",
  "description": null,
  "user_schema_id": null,
  "document_count": 0,
  "created_at": "2024-09-01T10:00:00.000Z",
  "links": {
    "self": "/v1/quality/ground-truth/a1b2c3d4-e5f6-7890-abcd-ef1234567890"
  }
}
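The response body above can be read into a typed structure on the client. This is a minimal sketch using only the fields listed on this page. The class name GroundTruthDataset and the function parse_dataset are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional

SAMPLE = """
{
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "name": "Invoice Accuracy Set",
  "description": null,
  "user_schema_id": null,
  "document_count": 0,
  "created_at": "2024-09-01T10:00:00.000Z",
  "links": {"self": "/v1/quality/ground-truth/a1b2c3d4-e5f6-7890-abcd-ef1234567890"}
}
"""


@dataclass
class GroundTruthDataset:
    id: str
    name: str
    description: Optional[str]
    user_schema_id: Optional[str]
    document_count: int
    created_at: str
    self_link: str  # taken from links.self


def parse_dataset(payload: str) -> GroundTruthDataset:
    """Parse a 201 Created response body into a GroundTruthDataset."""
    data = json.loads(payload)
    return GroundTruthDataset(
        id=data["id"],
        name=data["name"],
        description=data.get("description"),        # JSON null becomes None
        user_schema_id=data.get("user_schema_id"),  # JSON null becomes None
        document_count=data["document_count"],
        created_at=data["created_at"],
        self_link=data["links"]["self"],
    )


if __name__ == "__main__":
    dataset = parse_dataset(SAMPLE)
    print(dataset.name, dataset.document_count, dataset.self_link)
```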

Errors

Error responses

400 (validation_error): Missing required field: name.

401 (unauthorized): Missing or invalid API key.

429 (rate_limited): Too many requests. Retry after the period indicated in the Retry-After header.
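A client can use the Retry-After header to decide how long to wait after a 429 response. The sketch below assumes the header carries a number of seconds. HTTP also allows a date in this header, which this sketch does not parse and treats the same as a missing value. The function name retry_delay_seconds and the one-second fallback are hypothetical choices.

```python
def retry_delay_seconds(status, headers, default=1.0):
    """Return how long to wait before retrying, or None if no retry applies.

    Only 429 responses are retried. 400 and 401 indicate a problem with the
    request itself, so retrying them unchanged would not help.
    """
    if status != 429:
        return None
    raw = headers.get("Retry-After")
    try:
        return max(0.0, float(raw))
    except (TypeError, ValueError):
        # Header missing or not a plain number: fall back to the default.
        return default


if __name__ == "__main__":
    print(retry_delay_seconds(429, {"Retry-After": "5"}))
    print(retry_delay_seconds(400, {}))
```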
