
Ground Truth

Ground truth datasets, called golden samples, are manually created reference datasets with known-correct values. Create them from Validation → Golden Samples. Benchmark runs compare extraction results against golden samples to produce per-field accuracy scores with AI judge verdicts.

To create a golden sample, select a document and manually enter the correct value for each field. The system stores these known-correct values as the ground truth baseline. When you run a benchmark, the extraction pipeline processes the same document independently, and the results are compared field by field against your golden sample.

Benchmark scoring uses an AI judge to evaluate each field comparison. The judge accounts for semantic equivalence — for example, "United States" and "US" may be scored as a match depending on the field type. Per-field accuracy scores let you identify exactly which fields are underperforming and need schema or instruction tuning.
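The field-by-field scoring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the product's implementation: the `judge` function is a hypothetical stand-in for the AI judge (it only normalizes case and consults a small alias table for country fields), and the field names, field types, and sample values are invented for the example.

```python
from collections import defaultdict

# Hypothetical alias table standing in for the AI judge's semantic matching.
COUNTRY_ALIASES = {"us": "united states", "usa": "united states"}


def judge(field_type, expected, actual):
    """Stand-in for the AI judge: True when the two values count as equivalent."""
    a, b = expected.strip().lower(), actual.strip().lower()
    if field_type == "country":
        # Aliases are only applied to country fields, so the same pair of
        # values can match for one field type and fail for another.
        a, b = COUNTRY_ALIASES.get(a, a), COUNTRY_ALIASES.get(b, b)
    return a == b


def per_field_accuracy(samples, field_types):
    """Return {field: share of golden samples whose extracted value matched}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for golden, extracted in samples:
        for field, expected in golden.items():
            total[field] += 1
            actual = extracted.get(field, "")  # a missing field counts as a miss
            if judge(field_types.get(field, "text"), expected, actual):
                passed[field] += 1
    return {field: passed[field] / total[field] for field in total}


# Invented example data: (golden sample, extraction result) pairs.
samples = [
    ({"country": "United States", "invoice_number": "INV-001"},
     {"country": "US", "invoice_number": "INV-001"}),
    ({"country": "Germany", "invoice_number": "INV-002"},
     {"country": "Germany", "invoice_number": "INV-0002"}),
]
field_types = {"country": "country", "invoice_number": "text"}

scores = per_field_accuracy(samples, field_types)
print(scores)
```

In this toy data the country field matches in both samples, while the invoice number matches in only one, so the invoice number is the field that would need attention.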

For best results, create golden samples from a representative mix of document types and complexity levels. Most teams maintain 5-10 golden samples per schema and re-run benchmarks after schema changes, instruction updates, or model upgrades to track quality trends over time.

A typical benchmarking workflow starts after a schema change or model upgrade. Navigate to Validation → Golden Samples, select the samples you want to benchmark, and click Run Benchmark. The system re-extracts each document independently and compares every field against your known-correct values. The results page shows a per-field accuracy matrix with pass/fail indicators and AI judge verdicts explaining each comparison. Use this data to pinpoint fields that need schema or instruction tuning, or additional extraction context.

  • Create golden samples by selecting a document and entering known-correct values for each field
  • Benchmark runs compare extraction results field-by-field against the golden sample baseline
  • AI judge evaluates semantic equivalence (e.g., "United States" matches "US" for country fields)
  • Per-field accuracy scores identify exactly which fields are underperforming
  • Maintain 5-10 golden samples per schema for representative coverage
  • Re-run benchmarks after schema changes, instruction updates, or model upgrades
  • Benchmark results are stored historically for tracking quality trends over time
List ground truth datasets
curl -s https://api.talonic.com/v1/quality/datasets \
  -H "Authorization: Bearer $TALONIC_API_KEY"

# Response:
# {
#   "datasets": [
#     {
#       "id": "gs_001",
#       "document_name": "Invoice_ACME_sample.pdf",
#       "schema_id": "us_def456",
#       "field_count": 14,
#       "created_at": "2025-03-15T09:00:00Z"
#     }
#   ],
#   "total": 8
# }
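
Once you have this response, a short script can check it against the 5-10 samples per schema guidance. The sketch below is illustrative only: it parses the sample payload shown above rather than calling the API, and the helper names and the threshold constant are hypothetical. The sample lists one dataset while `total` is 8, so a real check would need the full list; how to retrieve it (for example, through pagination) is not covered here and is left as an assumption.

```python
import json
from collections import Counter

# Sample payload copied from the response shape shown above.
response_body = """
{
  "datasets": [
    {
      "id": "gs_001",
      "document_name": "Invoice_ACME_sample.pdf",
      "schema_id": "us_def456",
      "field_count": 14,
      "created_at": "2025-03-15T09:00:00Z"
    }
  ],
  "total": 8
}
"""

MIN_SAMPLES_PER_SCHEMA = 5  # lower bound of the 5-10 guidance


def samples_per_schema(payload):
    """Count how many golden samples exist for each schema_id."""
    return Counter(d["schema_id"] for d in payload["datasets"])


def under_covered(counts, minimum=MIN_SAMPLES_PER_SCHEMA):
    """Return the schema IDs that have fewer golden samples than the minimum."""
    return sorted(s for s, n in counts.items() if n < minimum)


payload = json.loads(response_body)
counts = samples_per_schema(payload)
print(under_covered(counts))
```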

Golden samples are most valuable when they represent the diversity of your document corpus. Include at least one "clean" document with all fields present and correctly formatted, one document with unusual formatting or missing fields, and one document from each major variation you encounter. This coverage ensures that benchmarks test the full range of extraction challenges rather than just the easy cases. Re-run benchmarks after every significant change, such as new field instructions, updated reference tables, or a model upgrade, to measure whether the change improved or regressed extraction quality.
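
Measuring whether a change improved or regressed quality comes down to comparing per-field accuracy between two benchmark runs. The following is a minimal sketch of that comparison, assuming you have already collected each run's per-field accuracy as a plain dictionary; the `compare_runs` helper, the field names, and the numbers are all hypothetical and do not reflect an export format of the product.

```python
def compare_runs(before, after, tolerance=0.0):
    """Classify each field as improved, regressed, or unchanged between two runs."""
    report = {}
    for field in sorted(set(before) | set(after)):
        old, new = before.get(field), after.get(field)
        if old is None or new is None:
            # The field exists in only one run, e.g. it was added to the schema.
            report[field] = "not comparable"
        elif new - old > tolerance:
            report[field] = "improved"
        elif old - new > tolerance:
            report[field] = "regressed"
        else:
            report[field] = "unchanged"
    return report


# Invented per-field accuracy before and after a schema change.
before = {"invoice_number": 0.75, "total_amount": 1.0, "country": 0.5}
after = {"invoice_number": 1.0, "total_amount": 0.75, "country": 0.5}

report = compare_runs(before, after)
for field, status in report.items():
    print(f"{field}: {status}")
```

A non-zero `tolerance` can be useful with small sample counts, where a single document changes a field's accuracy by a large step.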

Golden samples are not used during normal extraction — they exist solely for benchmarking. Changing a golden sample does not affect how documents are processed. This separation ensures that ground truth data remains a pure measurement tool without introducing bias into the extraction pipeline.

Frequently asked questions

What are golden samples?
Golden samples are manually-created reference datasets with known-correct values, used as ground truth. Benchmark runs compare extraction results against them for per-field accuracy scoring.
How do benchmark runs work?
Benchmark runs compare extraction results against golden samples, producing per-field accuracy scores with AI judge verdicts to measure extraction quality.
How many golden samples should I create?
Most teams maintain 5-10 golden samples per schema, covering a representative mix of document types and complexity levels. Re-run benchmarks after schema changes or model upgrades to track quality trends.
Does the AI judge handle semantic equivalence in benchmark scoring?
Yes. The AI judge accounts for semantic equivalence when comparing extracted values against golden sample ground truth. For example, "United States" and "US" may be scored as a match depending on the field type and context. This prevents false negatives where the extraction is correct but uses a different surface form than the golden sample.