
Reference Data

Upload CSV or Excel files as lookup tables for the matching engine and schema reference strategies. These reference datasets represent your "ground truth" — the known records you want to match extracted document data against. Each reference dataset is versioned independently and can be shared across multiple schemas and matching configurations without duplication.

Reference data is the foundation of the matching system. Common examples include customer lists, product catalogs, vendor registries, contract databases, and supplier directories.

When you upload a reference dataset, the platform indexes all columns and rows for fast lookup during matching runs. Because each dataset is versioned independently, you can update your reference data without affecting in-progress matching configurations.

For best results, make sure your reference data is clean and deduplicated before uploading. Include every column you plan to match against, such as names, identifiers, dates, and amounts. To keep a dataset current, re-upload it periodically from your source system, or use the SQL import option to pull directly from a connected database.
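As one way to handle the deduplication step before uploading, the sketch below keeps the header row and the first row seen for each value of a key column. The helper name `dedupe_csv` and the file names are hypothetical. The naive comma split assumes no quoted fields with embedded commas; a CSV-aware tool would be needed for those.

```shell
# dedupe_csv (hypothetical helper): reads CSV on stdin, writes deduplicated CSV to stdout.
# $1 is the 1-based index of the key column, e.g. 1 for vendor_id.
# Assumption: fields contain no embedded commas or newlines.
dedupe_csv() {
  # Always print the header (NR == 1); print a data row only the first time its key appears.
  awk -F',' -v col="$1" 'NR == 1 || !seen[$col]++'
}

# Example usage (file names are illustrative):
# dedupe_csv 1 < vendor_registry_raw.csv > vendor_registry.csv
```

Keeping the first occurrence is an arbitrary choice here. If later rows in your export are more recent, you may prefer to sort by a timestamp column before deduplicating.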

You can also import reference data directly from a connected MSSQL or PostgreSQL database using POST /matching/reference-data/from-sql. The import runs asynchronously: rows are streamed in batches of 500, and column headers appear immediately so you can preview the structure while the import runs.
Upload reference data via CSV
curl -X POST https://api.talonic.com/v1/matching/reference-data \
  -H "Authorization: Bearer $TALONIC_API_KEY" \
  -F "file=@vendor_registry.csv"

# Response:
# {
#   "id": "ref_vendor_001",
#   "name": "vendor_registry",
#   "status": "ready",
#   "row_count": 2450,
#   "columns": ["vendor_id", "vendor_name", "country", "tax_id"],
#   "created_at": "2025-04-22T10:00:00Z"
# }
Import reference data from a SQL database
curl -X POST https://api.talonic.com/v1/matching/reference-data/from-sql \
  -H "Authorization: Bearer $TALONIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "connection_id": "src_sql_001",
    "kind": "table",
    "table_name": "vendors",
    "schema_name": "public"
  }'

# Response (async — poll status until ready):
# {
#   "id": "ref_vendor_002",
#   "status": "importing",
#   "source_meta": {
#     "connection_id": "src_sql_001",
#     "table_name": "vendors"
#   }
# }
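The response above says to poll until the dataset is ready. A minimal polling sketch follows, with loud assumptions: this page does not document a status endpoint, so the `GET /v1/matching/reference-data/{id}` URL inside `fetch_status` is a guess, as are the function names, the retry count, and the delay. Only the `importing` and `ready` status values come from the examples above.

```shell
# extract_status: pull the first "status" value out of a JSON body on stdin.
# A sed one-liner keeps the sketch dependency-free; a JSON-aware tool would be more robust.
extract_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# fetch_status: ASSUMED endpoint, not confirmed by the documentation above.
fetch_status() {
  curl -s "https://api.talonic.com/v1/matching/reference-data/$1" \
    -H "Authorization: Bearer $TALONIC_API_KEY" | extract_status
}

# wait_until_ready ID [MAX_ATTEMPTS] [DELAY_SECONDS]
# Returns 0 once the status is "ready", 1 if the attempts run out first.
wait_until_ready() {
  id="$1"
  max_attempts="${2:-30}"
  delay="${3:-2}"
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$(fetch_status "$id")
    if [ "$status" = "ready" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage:
# wait_until_ready ref_vendor_002 && echo "import finished"
```

The loop is bounded on purpose, so that a stalled import surfaces as a non-zero exit code instead of hanging a script indefinitely.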
Key capabilities:
  • CSV and Excel (XLSX) file uploads for quick one-time imports.
  • SQL database imports that pull reference data directly from connected sources.
  • Versioning: each dataset tracks versions independently.
  • Cross-schema sharing: one dataset can be referenced by multiple schemas and matching configurations.

Frequently asked questions

What file formats are supported for reference data?
CSV and Excel (XLSX) files can be uploaded as reference datasets. Each dataset is versioned and can be shared across multiple schemas.
How is reference data used?
Reference datasets serve two purposes. First, the matching engine uses them for field-to-field comparisons — comparing extracted document values against reference rows using weighted strategies (exact, fuzzy, date_range, numeric_range). Second, reference strategies in schemas use them for code mapping and value resolution, translating labels found in documents into canonical codes defined in the reference dataset.
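The platform's actual scoring logic is not documented on this page. As a toy illustration of what a weighted field-to-field comparison could look like, the sketch below scores one extracted record against one reference row. The function name `score_fields`, the pipe-delimited input format, and the tolerance field are all hypothetical. Only `exact` and `numeric_range` are implemented; `fuzzy` and `date_range` are left out to keep it short.

```shell
# score_fields (hypothetical): reads one comparison per line on stdin, formatted as
#   strategy|weight|extracted_value|reference_value|tolerance
# and prints the matched weight divided by the total weight.
score_fields() {
  awk -F'|' '
    { total += $2 }
    # exact: force string comparison so "0012" and "12" are not treated as equal numbers.
    $1 == "exact" && ($3 "") == ($4 "") { hit += $2 }
    # numeric_range: match when the absolute difference is within the tolerance.
    $1 == "numeric_range" {
      d = $3 - $4
      if (d < 0) d = -d
      if (d <= $5) hit += $2
    }
    END {
      if (total > 0) {
        printf "%.2f\n", hit / total
      } else {
        print "0.00"
      }
    }
  '
}

# Example usage: tax_id must match exactly (weight 0.6), amount within 1 unit (weight 0.4).
# printf 'exact|0.6|DE123|DE123|\nnumeric_range|0.4|100.50|100.00|1\n' | score_fields
```

Normalizing by the total weight keeps the score between 0 and 1 regardless of how the individual weights are chosen.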
Can I import reference data from a database?
Yes. Use the SQL import option to stream rows from a connected SQL database (MSSQL or PostgreSQL). The import runs asynchronously and you can monitor progress while it loads.
What happens if I delete a source connection that was used for a SQL import?
The reference data remains intact. Deleting a source connection does not cascade to reference datasets — the UI shows a "source disconnected" indicator, but the imported data continues to work for matching.
How do I refresh reference data from a SQL source?
Re-run the SQL import using the same connection and table parameters. A new reference dataset is created with the latest data. Update your matching configurations to point to the new dataset version. The previous version remains available for comparison.
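For scheduled refreshes, the re-run can be wrapped in a small script. The sketch below reuses the endpoint and request fields from the SQL import example. The function names are hypothetical, and the payload builder does no JSON escaping, so it assumes identifiers without quotes or backslashes. Updating matching configurations to point at the new dataset is not shown, since that API is not covered on this page.

```shell
# sql_import_payload CONNECTION_ID TABLE_NAME SCHEMA_NAME
# Builds the JSON body for a table import. Assumption: arguments need no JSON escaping.
sql_import_payload() {
  printf '{"connection_id":"%s","kind":"table","table_name":"%s","schema_name":"%s"}' \
    "$1" "$2" "$3"
}

# refresh_reference_data CONNECTION_ID TABLE_NAME SCHEMA_NAME
# Re-runs the SQL import; the response contains the id of the newly created dataset.
refresh_reference_data() {
  curl -s -X POST https://api.talonic.com/v1/matching/reference-data/from-sql \
    -H "Authorization: Bearer $TALONIC_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(sql_import_payload "$1" "$2" "$3")"
}

# Example usage:
# refresh_reference_data src_sql_001 vendors public
```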