Extract

The extract endpoint is the primary entry point for the Talonic API. Send any document and receive schema-validated structured data with per-cell provenance and confidence scores.

POST /v1/extract

Request

Send a multipart/form-data request with the document file and an optional schema. See authentication for header requirements and schema formats for all schema options.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| file | file | One of file/file_url/document_id | The document file to extract (PDF, DOCX, image, spreadsheet) |
| file_url | string | One of file/file_url/document_id | URL of a publicly accessible document to extract |
| document_id | string | One of file/file_url/document_id | ID of an existing document to re-extract |
| schema | string (JSON) | No | Inline JSON schema mapping field names to types |
| schema_id | string | No | ID of a saved schema to use for extraction |
| instructions | string | No | Natural language instructions to guide extraction |
| include_markdown | boolean | No | Include markdown representation in the response |
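
The parameter rules can be checked on the client before a request is sent. The following is a minimal Python sketch, not an official client: `build_extract_fields` is a hypothetical helper, it assumes that exactly one of file, file_url, and document_id should be supplied, and it assumes the boolean is sent as the string "true". It returns the chosen source name and the text form fields; the file itself would be attached separately as the multipart file part.

```python
import json


def build_extract_fields(*, file_path=None, file_url=None, document_id=None,
                         schema=None, schema_id=None, instructions=None,
                         include_markdown=False):
    """Validate extract parameters and return (source_name, text_form_fields).

    Hypothetical helper. ASSUMPTION: exactly one document source is allowed.
    """
    sources = {"file": file_path, "file_url": file_url, "document_id": document_id}
    given = [name for name, value in sources.items() if value is not None]
    if len(given) != 1:
        raise ValueError(
            "provide exactly one of file, file_url, document_id; got %r" % given
        )

    fields = {}
    if file_url is not None:
        fields["file_url"] = file_url
    if document_id is not None:
        fields["document_id"] = document_id
    if schema is not None:
        # The schema parameter is a JSON string, so serialise the mapping.
        fields["schema"] = json.dumps(schema, separators=(",", ":"))
    if schema_id is not None:
        fields["schema_id"] = schema_id
    if instructions is not None:
        fields["instructions"] = instructions
    if include_markdown:
        # ASSUMPTION: booleans are encoded as "true" in the form body.
        fields["include_markdown"] = "true"
    return given[0], fields
```

For example, `build_extract_fields(file_path="invoice.pdf", schema={"vendor_name": "string"})` returns the source name `"file"` together with a `schema` field holding the serialised JSON.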

Examples

curl -X POST https://api.talonic.com/v1/extract \
  -H "Authorization: Bearer $TALONIC_API_KEY" \
  -F "file=@invoice.pdf" \
  -F 'schema={"vendor_name":"string","invoice_number":"string","total_amount":"number","due_date":"date"}'

Response

For small documents, the API returns 200 OK with the extraction result inline. For larger documents, it returns 202 Accepted with a job ID; poll the jobs endpoint or use webhooks for asynchronous notification.

# 200 — Synchronous response
{
  "extraction_id": "ext_abc123",
  "document_id": "doc_xyz789",
  "schema_id": null,
  "status": "completed",
  "data": {
    "rows": [
      {
        "vendor_name": "Acme Corp",
        "invoice_number": "INV-2026-0042",
        "total_amount": 1250.00,
        "due_date": "2026-05-15"
      }
    ]
  }
}

# 202 — Async response
{
  "job_id": "job_def456",
  "status": "processing",
  "document_id": "doc_xyz789"
}
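
A caller has to branch on the status code, because the two bodies have different shapes. The sketch below shows one way to normalise them in Python. `interpret_extract_response` is a hypothetical helper; it relies only on the fields visible in the two sample bodies above and treats any other status as an error to be handled elsewhere.

```python
def interpret_extract_response(status_code, body):
    """Normalise a 200 or 202 extract response into one dictionary shape.

    Hypothetical helper based only on the sample bodies in this section.
    """
    if status_code == 200:
        # Synchronous result: extracted rows sit under data.rows.
        return {
            "done": True,
            "document_id": body["document_id"],
            "rows": body["data"]["rows"],
            "job_id": None,
        }
    if status_code == 202:
        # Asynchronous result: keep the job ID for polling or webhook matching.
        return {
            "done": False,
            "document_id": body["document_id"],
            "rows": [],
            "job_id": body["job_id"],
        }
    raise RuntimeError("unexpected status code: %d" % status_code)
```

When `done` is false, the returned `job_id` is the value to pass to the jobs endpoint or to match against a later webhook delivery.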

Pipeline processing

The extract endpoint triggers the four-phase pipeline. The Resolve phase classifies the document using the 529-type ontology. The Agent phase extracts values. The Validate phase applies the confidence gate. The Re-read phase cross-checks flagged values. Results include per-cell provenance linking each value to its source region.

Error handling

See the error reference for all error codes. Common errors include file_too_large (exceeds plan limits), unsupported_format, and invalid_schema.

Frequently asked questions

What file formats does the Talonic API support?
PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, XLSM, PNG, JPG, JPEG, GIF, WEBP, TXT, MD, HTML, XML, JSON, EML, CSV, MSG, BMP, and ZIP archives.
How does authentication work?
All API requests require a Bearer token in the Authorization header. API keys carry the tlnc_ prefix and are scoped to a source. Create and manage keys from Settings → API Keys.
What schema formats are supported?
Three formats: JSON Schema (full control), simplified fields (recommended), and flat key-type maps (quick prototyping). Supported types: string, number, integer, boolean, date, array, object, enum.
What are the rate limits?
Per-key rate limits: 100 req/s extraction, 1,000 req/s read, 200 req/s write. Rate-limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) are included on every response.
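
A client can use the rate-limit headers to decide how long to pause. The Python sketch below is illustrative only: `seconds_until_retry` is a hypothetical helper, and it assumes X-RateLimit-Reset carries a Unix timestamp in seconds, which this page does not state. If the header instead holds a number of seconds remaining, the subtraction should be dropped.

```python
import time


def seconds_until_retry(headers, now=None):
    """Return how many seconds to wait before sending the next request.

    Hypothetical helper. ASSUMPTION: X-RateLimit-Reset is Unix epoch seconds.
    `headers` is a plain dict, so the header names are matched case-sensitively.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        # Budget is left in the current window, so no pause is needed.
        return 0.0
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    # Never return a negative delay if the reset time has already passed.
    return max(0.0, reset_at - now)
```

The `now` argument exists so the calculation can be tested without depending on the system clock.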
How do webhooks work?
Webhooks deliver POST requests with HMAC-SHA256 signed JSON payloads. Events: extraction.complete, extraction.failed, document.ingested. Failed deliveries retry with exponential backoff (1min, 5min, 30min, 4hr).
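
A receiver would normally verify the HMAC-SHA256 signature before trusting a payload. The following Python sketch uses only the standard library. It assumes the signature is the hex digest of the raw request body keyed with a shared webhook secret; the encoding, the exact bytes that are signed, and where the signature is delivered are not specified here, so `sign_payload` and `verify_signature` are hypothetical names and the caller is expected to extract the received signature itself.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, raw_body: bytes) -> str:
    """Compute an HMAC-SHA256 hex digest over the raw request body.

    ASSUMPTION: the signature covers the unmodified body and is hex-encoded.
    """
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, raw_body: bytes, received_signature: str) -> bool:
    """Return True when the received signature matches the expected one."""
    expected = sign_payload(secret, raw_body)
    # compare_digest runs in constant time, which avoids timing side channels.
    return hmac.compare_digest(expected, received_signature)
```

Verification should run against the bytes exactly as received; re-serialising the parsed JSON first can change whitespace or key order and make a valid signature fail.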