Reference Data

A reference dataset is a lookup table, typically uploaded as a CSV or Excel file, that the matching engine and schema reference strategies compare extracted document data against. These datasets represent your system of record: known records such as customer lists, product catalogs, vendor registries, and contract databases. Each reference dataset is versioned independently and can be shared across multiple schemas and matching configurations without duplication.

When you upload a reference dataset, the platform indexes all columns and rows for fast lookup during matching runs. Independent versioning means you can update your reference data without affecting in-progress matching configurations. Besides file upload, you can import reference data directly from a connected SQL database (MSSQL or PostgreSQL) on the reference data page. The import runs asynchronously and streams rows in batches; column headers appear immediately so you can preview the structure.

Preparing and refreshing datasets

For best results, ensure your reference data is clean and deduplicated before uploading. Include all columns that you plan to match against, such as names, identifiers, dates, and amounts. Most teams refresh their reference data periodically by re-uploading from their source system or by re-running the SQL import to pull directly from a connected database.
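
As a rough illustration of the deduplication step, the sketch below drops repeated rows from a CSV before upload, keeping the header and the first row seen for each key. It assumes a simple CSV with no quoted commas or embedded newlines, and `dedupe_csv` is a hypothetical helper name, not a platform tool.

```shell
# Sketch only: dedupe_csv is a hypothetical helper, not part of the platform.
# Assumes a simple CSV: comma-separated, no quoted commas, no embedded newlines.
# Usage: dedupe_csv FILE [KEY_COLUMN]   (KEY_COLUMN is 1-based, default 1)
dedupe_csv() {
  local file=$1 key_col=${2:-1}
  # NR == 1 always prints the header row; every later row is printed only
  # the first time its key column value is seen.
  awk -F',' -v k="$key_col" 'NR == 1 || !seen[$k]++' "$file"
}
```

For example, `dedupe_csv vendors.csv 1 > vendors.clean.csv` would keep one row per `vendor_id` if that is the first column. For files with quoted fields, a CSV-aware tool is the safer choice.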

Deleting a source connection does not cascade to reference datasets imported from it. The UI shows a "source disconnected" indicator, but the imported data continues to work for matching.

Create a reference dataset via the public API

# The API accepts JSON rows directly; CSV/XLSX files are uploaded in the app.
curl -X POST https://api.talonic.com/v1/reference-data \
  -H "Authorization: Bearer $TALONIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "vendor_registry",
    "columns": ["vendor_id", "vendor_name", "country", "tax_id"],
    "data": [
      { "vendor_id": "V-001", "vendor_name": "ACME Corporation", "country": "DE", "tax_id": "DE123456789" },
      { "vendor_id": "V-002", "vendor_name": "Globex GmbH", "country": "DE", "tax_id": "DE987654321" }
    ]
  }'

# Response (201):
# {
#   "id": "b8c9d0e1-…",
#   "name": "vendor_registry",
#   "source_type": "json",
#   "row_count": 2,
#   "columns": ["vendor_id", "vendor_name", "country", "tax_id"],
#   "created_at": "2026-04-22T10:00:00Z"
# }

Inspect a dataset and page through its rows

# Dataset metadata:
curl -s https://api.talonic.com/v1/reference-data/b8c9d0e1-… \
  -H "Authorization: Bearer $TALONIC_API_KEY"

# Rows, paginated (limit capped at 500 per page):
curl -s "https://api.talonic.com/v1/reference-data/b8c9d0e1-…/rows?page=1&limit=100" \
  -H "Authorization: Bearer $TALONIC_API_KEY"
# -> { "data": [ { "vendor_id": "V-001", ... } ], "pagination": { "page": 1, "limit": 100, "total": 2450 } }
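
To read a whole dataset, the paginated call above can be wrapped in a loop. The following is a minimal sketch, assuming `curl` and `jq` are installed and that the response has the `data` and `pagination.total` fields shown above; `rows_url`, `total_pages`, and `fetch_all_rows` are hypothetical helper names.

```shell
# Sketch only: helper names are hypothetical; requires curl and jq.
API_BASE="https://api.talonic.com/v1"

# Build the rows URL for one page of a dataset.
rows_url() {
  local dataset_id=$1 page=$2 limit=$3
  printf '%s/reference-data/%s/rows?page=%s&limit=%s\n' \
    "$API_BASE" "$dataset_id" "$page" "$limit"
}

# Pages needed for `total` rows at `limit` rows per page (ceiling division).
total_pages() {
  local total=$1 limit=$2
  echo $(( (total + limit - 1) / limit ))
}

# Print every row of a dataset as one JSON object per line.
# Usage: fetch_all_rows DATASET_ID [LIMIT]   (LIMIT defaults to the 500 cap)
fetch_all_rows() {
  local dataset_id=$1 limit=${2:-500}
  local page=1 pages=1 body total
  while [ "$page" -le "$pages" ]; do
    # -f makes curl return non-zero on HTTP errors, so the loop stops early.
    body=$(curl -sf "$(rows_url "$dataset_id" "$page" "$limit")" \
      -H "Authorization: Bearer $TALONIC_API_KEY") || return 1
    printf '%s\n' "$body" | jq -c '.data[]'
    # Recompute the page count from the total reported by the API.
    total=$(printf '%s\n' "$body" | jq -r '.pagination.total')
    pages=$(total_pages "$total" "$limit")
    page=$((page + 1))
  done
}
```

With the dataset ID in a variable, `fetch_all_rows "$DATASET_ID" 500 > rows.jsonl` would write one row per line. Using the largest allowed page size keeps the number of requests low.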

CSV and Excel (XLSX) file uploads in the app for quick one-time imports.
SQL database imports for live reference data from connected MSSQL or PostgreSQL sources.
JSON row creation via POST /v1/reference-data for programmatic loading.
Versioning: each dataset tracks versions independently.
Cross-schema sharing: one dataset can be referenced by multiple schemas and matching configurations.

Frequently asked questions

What file formats are supported for reference data?

CSV and Excel (XLSX) files can be uploaded as reference datasets in the app, and JSON rows can be created directly via POST /v1/reference-data. Each dataset is versioned and can be shared across multiple schemas.

How is reference data used?

Reference datasets serve two purposes. First, the matching engine uses them for field-to-field comparisons — comparing extracted document values against reference rows using weighted strategies (exact, fuzzy, date_range, numeric_range). Second, reference strategies in schemas use them for code mapping and value resolution, translating labels found in documents into canonical codes defined in the reference dataset.

Can I import reference data from a database?

Yes. Use the SQL import option on the reference data page to stream rows from a connected SQL database (MSSQL or PostgreSQL). The import runs asynchronously and you can monitor progress while it loads.

What happens if I delete a source connection that was used for a SQL import?

The reference data remains intact. Deleting a source connection does not cascade to reference datasets — the UI shows a "source disconnected" indicator, but the imported data continues to work for matching.

How do I refresh reference data from a SQL source?

Re-run the SQL import with the same connection and table parameters. This creates a new reference dataset containing the latest data. Then update your matching configurations to point to the new dataset version; the previous version remains available for comparison.
