Ground Truth

Ground Truth datasets (formerly called "golden samples") are manually created datasets of known-correct field values that Benchmarks measure extraction accuracy against in Talonic. Benchmark runs re-extract the same documents and compare the results field by field against the Ground Truth, producing per-field accuracy scores with AI judge verdicts. In the app, Benchmarks and Ground Truth are not linked from the navigation; open /review/benchmarks directly. They do not sit under a Validation nav section: Validation refers to the in-pipeline checks, while Benchmarks measure accuracy.

To create a Ground Truth entry, select a document and manually enter the correct value for each field. The system stores these known-correct values as the accuracy baseline. When you run a benchmark, the extraction pipeline processes the same document independently, and the results are compared field by field against your Ground Truth.

Benchmark scoring uses an AI judge to evaluate each field comparison. The judge accounts for semantic equivalence — for example, "United States" and "US" may be scored as a match depending on the field type. Per-field accuracy scores let you identify exactly which fields are underperforming and need schema or instruction tuning.

For best results, build Ground Truth from a representative mix of document types and complexity levels. Most teams maintain 5-10 Ground Truth documents per schema and re-run Benchmarks after schema changes, instruction updates, or model upgrades to track quality trends over time.

A typical benchmarking workflow starts after a schema change or model upgrade. Open the Benchmarks surface at /review/benchmarks (direct URL), select the Ground Truth entries you want to benchmark, and click Run Benchmark. The system re-extracts each document independently and compares every field against your known-correct values. The results page shows a per-field accuracy matrix with pass/fail indicators and AI judge verdicts explaining each comparison. Use this data to pinpoint fields that need schema or instruction tuning, or additional extraction context.

Create Ground Truth entries by selecting a document and entering known-correct values for each field
Benchmark runs compare extraction results field-by-field against the Ground Truth baseline
AI judge evaluates semantic equivalence (e.g., "United States" matches "US" for country fields)
Per-field accuracy scores identify exactly which fields are underperforming
Maintain 5-10 Ground Truth documents per schema for representative coverage
Re-run benchmarks after schema changes, instruction updates, or model upgrades
Benchmark results are stored historically for tracking quality trends over time
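
To make the per-field idea concrete, here is a minimal sketch that ranks fields by pass rate, lowest first. It assumes you have already collected verdicts into a hypothetical two-column text format (field name, then pass or fail, one line per field per document). Neither that format nor the pass-rate formula is confirmed as how Talonic computes its scores; the function name field_accuracy is illustrative only.

```shell
# Hypothetical helper: reads "<field> <pass|fail>" lines on stdin and prints
# "<pass rate> <field>" sorted so the weakest fields come first.
# Assumption: accuracy here is simply passes divided by comparisons per field.
field_accuracy() {
  LC_ALL=C awk '
    NF >= 2 {
      total[$1]++                      # comparisons seen for this field
      if ($2 == "pass") passed[$1]++   # comparisons marked as a pass
    }
    END {
      for (f in total) printf "%.2f %s\n", passed[f] / total[f], f
    }
  ' | LC_ALL=C sort -n
}
```

Piping a verdict file into field_accuracy gives a quick list of the fields to look at first when tuning a schema or its instructions.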

Ground Truth via API

The public API exposes Ground Truth datasets under /v1/quality/ground-truth and benchmark runs under /v1/quality/benchmarks. Create a dataset with POST /v1/quality/ground-truth (a name plus optional description), then attach known-correct values per document via /v1/quality/ground-truth/:id/entries. Fetching a single dataset returns its sample entries, and GET /v1/quality/benchmarks/:id/results returns per-document accuracy results for a run.
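
The endpoints above could be wrapped in small shell functions like the following sketch. The paths, the name field, and the optional description field come from the paragraph above. Everything else is an assumption: the function names are illustrative, the use of POST for entries and the JSON Content-Type header are guesses, and the entries request body is not described on this page, so it is read from a file you prepare yourself.

```shell
# Sketch only; nothing here runs until you call one of the functions.
API_BASE="https://api.talonic.com/v1/quality"

# Build the JSON body for a new dataset: a name plus an optional description.
# Values are inserted verbatim, so avoid double quotes or backslashes in them.
dataset_payload() {
  if [ -n "${2:-}" ]; then
    printf '{"name":"%s","description":"%s"}' "$1" "$2"
  else
    printf '{"name":"%s"}' "$1"
  fi
}

# POST /v1/quality/ground-truth
# $1: dataset name, $2: optional description
create_dataset() {
  curl -s -X POST "$API_BASE/ground-truth" \
    -H "Authorization: Bearer $TALONIC_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(dataset_payload "$1" "${2:-}")"
}

# /v1/quality/ground-truth/:id/entries
# $1: dataset id, $2: path to a JSON file holding the known-correct values.
# Assumption: the method is POST and the body is JSON; check the API reference
# for the actual schema before relying on this.
add_entries() {
  curl -s -X POST "$API_BASE/ground-truth/$1/entries" \
    -H "Authorization: Bearer $TALONIC_API_KEY" \
    -H "Content-Type: application/json" \
    -d "@$2"
}

# GET /v1/quality/benchmarks/:id/results
# $1: benchmark run id
get_results() {
  curl -s "$API_BASE/benchmarks/$1/results" \
    -H "Authorization: Bearer $TALONIC_API_KEY"
}
```

With TALONIC_API_KEY exported, a call such as create_dataset "Invoice set" "Q1 invoices" would send the request; the response shape is not shown here because this page does not document it.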

List Ground Truth datasets

curl -s https://api.talonic.com/v1/quality/ground-truth \
  -H "Authorization: Bearer $TALONIC_API_KEY"

# Response:
# {
#   "data": [
#     {
#       "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
#       "name": "Invoice golden set",
#       "user_schema_id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
#       "document_count": 8,
#       "created_at": "2026-03-15T09:00:00Z"
#     }
#   ],
#   "pagination": { "total": 8, "limit": 20, "has_more": false }
# }

Compare two benchmark runs

curl -s "https://api.talonic.com/v1/quality/benchmarks/compare?run_a=<run_a_id>&run_b=<run_b_id>" \
  -H "Authorization: Bearer $TALONIC_API_KEY"

# Response: both runs side by side plus the overall accuracy delta
# (run_a minus run_b; null if either run is unscored).
# {
#   "run_a": { "id": "...", "accuracy_overall": 0.93, ... },
#   "run_b": { "id": "...", "accuracy_overall": 0.89, ... },
#   "accuracy_delta": 0.04
# }
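
Once you have the accuracy_delta value, a small helper can turn it into a readable verdict. The sketch below relies only on what the response above shows: the delta is run_a minus run_b, and it is null when either run is unscored. The function name delta_verdict and the 0.01 tolerance are hypothetical choices for illustration, not part of the API.

```shell
# Classify an accuracy_delta value (run_a minus run_b).
# $1: the delta as it appears in the response, or "null" if a run is unscored.
# $2: optional tolerance; smaller changes are treated as noise.
#     The default of 0.01 is an arbitrary illustrative value.
delta_verdict() {
  delta="$1"
  tol="${2:-0.01}"
  if [ -z "$delta" ] || [ "$delta" = "null" ]; then
    echo "unscored"
    return
  fi
  # awk handles the floating-point comparison that plain sh cannot.
  LC_ALL=C awk -v d="$delta" -v t="$tol" 'BEGIN {
    if (d + 0 > t + 0)        print "run_a higher"
    else if (d + 0 < -(t + 0)) print "run_b higher"
    else                       print "no meaningful change"
  }'
}
```

If jq is installed, its -r '.accuracy_delta' filter prints null for a null delta, so its output can be passed straight to delta_verdict.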

Ground Truth is most valuable when it represents the diversity of your document corpus. Include at least one "clean" document with all fields present and correctly formatted, one document with unusual formatting or missing fields, and one document from each major variation you encounter. This coverage ensures that Benchmarks test the full range of extraction challenges rather than just the easy cases. Re-run Benchmarks after every significant schema change — new field instructions, updated reference tables, or model upgrades — to measure whether the change improved or regressed extraction quality.

Ground Truth is not used during normal extraction — it exists solely for Benchmarks. Changing a Ground Truth entry does not affect how documents are processed. This separation ensures that Ground Truth remains a pure measurement tool without introducing bias into the extraction pipeline.

Frequently asked questions

What is Ground Truth?

Ground Truth datasets (formerly called golden samples) are manually created datasets of known-correct values. Benchmark runs compare extraction results against them for per-field accuracy scoring.

How do benchmark runs work?

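A benchmark run re-extracts the documents in your Ground Truth independently and compares the results field by field against the known-correct values. An AI judge evaluates each comparison, and the run produces per-field accuracy scores together with the judge's verdicts.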

How much Ground Truth should I create?

Most teams maintain 5-10 Ground Truth documents per schema, covering a representative mix of document types and complexity levels. Re-run Benchmarks after schema changes or model upgrades to track quality trends.

Does the AI judge handle semantic equivalence in benchmark scoring?

Yes. The AI judge accounts for semantic equivalence when comparing extracted values against the Ground Truth. For example, "United States" and "US" may be scored as a match depending on the field type and context. This prevents false negatives where the extraction is correct but uses a different surface form than the Ground Truth value.

Can I compare two benchmark runs?

Yes. GET /v1/quality/benchmarks/compare?run_a=<id>&run_b=<id> returns both runs side by side together with the overall accuracy delta (run_a minus run_b). The delta is null when either run has not yet produced an overall accuracy score, so wait for both runs to complete before comparing.
