POST /v1/extract

Extract structured data from PDFs, images, and Word documents with POST /v1/extract: send a file and optional schema, get JSON with per-field confidence.

POST /v1/extract is Talonic's quick extract endpoint: it extracts structured data from a PDF, image, Word document, or text file in a single API call. Send the file (plus an optional schema describing the fields you want) and receive schema-validated JSON with per-field confidence scores. Documents of 5 pages or fewer return synchronously with 200; larger documents return 202 Accepted with a poll URL.

POST /v1/extract

Form data parameters

file (binary): The document file. Supports PDF, PNG, JPG, TIFF, WEBP, DOCX, TXT, CSV. Max 500 MB (lower caps apply on Free and Pro tiers).

file_url (string): URL to fetch the document from. Use this instead of file for remote documents.

document_id (string): Re-extract an existing document by ID instead of uploading a new file.

schema (object | string): The target schema definition. Accepts JSON Schema, simplified fields, or a flat key-type map. Optional: omit it for auto-discovery. See Schema Formats below.

schema_id (string): Use a previously saved schema by ID. Mutually exclusive with schema.

instructions (string): Natural language instructions to guide extraction. E.g. "Focus on the billing section. Amounts are in EUR."

options (object): Additional extraction options. See Options below.

include_markdown (string): Set to "true" to include the OCR-converted markdown of the document in the response. Useful for PDF-to-markdown conversion.

You must provide exactly one of file, file_url, or document_id. The schema is optional: when it is omitted, auto-discovery extracts all fields. Add include_markdown=true to also receive the OCR-converted markdown of the document.
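The parameter rules above can be checked on the client before any bytes are sent. The following is a minimal sketch, not part of any Talonic SDK: build_extract_fields and its has_file flag are hypothetical names, and encoding an object schema as a JSON string form field is an assumption about how the multipart body is expected to look.

```python
import json
from typing import Optional, Union


def build_extract_fields(
    *,
    has_file: bool = False,
    file_url: Optional[str] = None,
    document_id: Optional[str] = None,
    schema: Union[dict, str, None] = None,
    schema_id: Optional[str] = None,
    instructions: Optional[str] = None,
    include_markdown: bool = False,
) -> dict:
    """Return the non-file form fields for POST /v1/extract.

    has_file (hypothetical flag) signals that a binary `file` part
    will be attached to the multipart body separately.
    """
    # Exactly one document source is allowed.
    sources = [has_file, file_url is not None, document_id is not None]
    if sum(sources) != 1:
        raise ValueError("provide exactly one of file, file_url, or document_id")
    # schema and schema_id are documented as mutually exclusive.
    if schema is not None and schema_id is not None:
        raise ValueError("schema and schema_id are mutually exclusive")

    fields: dict = {}
    if file_url is not None:
        fields["file_url"] = file_url
    if document_id is not None:
        fields["document_id"] = document_id
    if schema is not None:
        # Assumption: an object schema travels as a JSON-encoded string.
        fields["schema"] = schema if isinstance(schema, str) else json.dumps(schema)
    if schema_id is not None:
        fields["schema_id"] = schema_id
    if instructions is not None:
        fields["instructions"] = instructions
    if include_markdown:
        fields["include_markdown"] = "true"  # documented as the string "true"
    return fields
```

Failing early this way mirrors the missing_document and ambiguous_document errors the endpoint would otherwise return.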

Request

curl — extract invoice fields from a PDF

The endpoint also honors the Idempotency-Key header: retrying the same request with the same key within 24 hours returns the cached result instead of running a second extraction. See [Idempotency](idempotency).
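One way to use the header is to generate a single key per logical extraction and reuse it on every retry. The sketch below assumes a caller-supplied send(headers) function that performs the HTTP request and returns the status code; send_with_idempotency is a hypothetical helper, and using a UUID as the key value is an assumption, since the key format is not specified here.

```python
import uuid
from typing import Callable, Dict


def send_with_idempotency(
    send: Callable[[Dict[str, str]], int],
    max_attempts: int = 3,
) -> int:
    """Call send(headers) up to max_attempts times, reusing one key."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # One key per logical extraction; every retry carries the same value,
    # so a retry can be answered from the cached result.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    last_error: Exception = ConnectionError("request was never sent")
    for _ in range(max_attempts):
        try:
            return send(headers)
        except ConnectionError as exc:
            last_error = exc  # network failure: retry with the same key
    raise last_error
```

Generating the key outside the loop is the point of the design: a fresh key per attempt would defeat the cache and could trigger a second extraction.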

Response

Response fields (200 Synchronous)

extraction_id (string): UUID of the created extraction record.

request_id (string): Unique request identifier for tracing and support.

status (string): Always "complete" for synchronous responses.

document (object): Source document summary: id, filename, pages, size_bytes, type_detected, language_detected.

data (object): Extracted field values as a key-value map matching your schema.

schema (object): Schema used: source (provided, saved, auto_discovered), id, definition, and save_url to persist it.

confidence (object): Confidence scores: overall (0–1) and fields (per-field score map).

processing (object): Processing metadata: duration_ms, pages_processed, region.

markdown (string | null): OCR-converted markdown of the document. Only present when include_markdown=true.

links (object): Related resource URLs: self (extraction), document, dashboard.

Response (200 — Synchronous)

{
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "request_id": "req_b2c3d4e5f6a78901",
  "status": "complete",
  "document": {
    "id": "c3d4e5f6-a7b8-9012-cdef-123456789012",
    "filename": "invoice-0847.pdf",
    "pages": 2,
    "size_bytes": 184320,
    "type_detected": "invoice",
    "language_detected": "en"
  },
  "data": {
    "vendor_name": "Acme Corp",
    "invoice_number": "INV-2024-0847",
    "total_amount": 14250.00,
    "due_date": "2024-03-15",
    "line_items": [
      { "description": "Enterprise license (annual)", "quantity": 1, "unit_price": 12000.00 },
      { "description": "Implementation services", "quantity": 15, "unit_price": 150.00 }
    ]
  },
  "schema": {
    "source": "provided",
    "id": null,
    "definition": { "type": "object", "properties": { "vendor_name": { "type": "string" } } },
    "save_url": "https://app.talonic.com/schemas/save?from=a1b2c3d4-e5f6-7890-abcd-ef1234567890"
  },
  "confidence": {
    "overall": 0.94,
    "fields": {
      "vendor_name": 0.99,
      "invoice_number": 0.98,
      "total_amount": 0.96,
      "due_date": 0.91,
      "line_items": 0.87
    }
  },
  "processing": {
    "duration_ms": 3420,
    "pages_processed": 2,
    "region": "eu-west"
  },
  "links": {
    "self": "/v1/extractions/a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "document": "/v1/documents/c3d4e5f6-a7b8-9012-cdef-123456789012",
    "dashboard": "https://app.talonic.com/extractions/a1b2c3d4-e5f6-7890-abcd-ef1234567890"
  }
}
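The per-field scores in confidence.fields lend themselves to a simple review gate. As a minimal sketch, the hypothetical helper below lists the fields whose score falls under a threshold; the 0.9 default is an arbitrary illustration, not a recommended value.

```python
from typing import Dict, List


def fields_needing_review(response: dict, threshold: float = 0.9) -> List[str]:
    """Return field names whose confidence score is below threshold."""
    scores: Dict[str, float] = response.get("confidence", {}).get("fields", {})
    return sorted(name for name, score in scores.items() if score < threshold)
```

Applied to the sample response above with the default threshold, only line_items (0.87) would be flagged.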

Response fields (202 Asynchronous)

request_id (string): Unique request identifier for tracing.

status (string): Always "processing" for asynchronous responses.

document (object): Source document summary: id, filename, pages, size_bytes.

poll_url (string): URL to poll for document processing status.

estimated_seconds (integer): Estimated processing time in seconds.

links (object): Related resource URLs: document, extractions, dashboard.

Response (202 — Asynchronous)

{
  "request_id": "req_b2c3d4e5f6a78901",
  "status": "processing",
  "document": {
    "id": "c3d4e5f6-a7b8-9012-cdef-123456789012",
    "filename": "large-report.pdf",
    "pages": 42,
    "size_bytes": 8912640
  },
  "poll_url": "/v1/documents/c3d4e5f6-a7b8-9012-cdef-123456789012",
  "estimated_seconds": 63,
  "links": {
    "document": "/v1/documents/c3d4e5f6-a7b8-9012-cdef-123456789012",
    "extractions": "/v1/documents/c3d4e5f6-a7b8-9012-cdef-123456789012/extractions",
    "dashboard": "https://app.talonic.com/documents/c3d4e5f6-a7b8-9012-cdef-123456789012"
  }
}
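A 202 body like the one above can drive a small polling loop. This is a sketch under stated assumptions: wait_for_document is a hypothetical helper, fetch is a caller-supplied function that GETs a URL and returns the parsed JSON, and the polled document is assumed to expose a status field that stops being "processing" once work has finished. The shape of the polled resource is not described in this section.

```python
import time
from typing import Callable


def wait_for_document(
    accepted: dict,
    fetch: Callable[[str], dict],
    sleep: Callable[[float], None] = time.sleep,
    interval: float = 5.0,
    max_polls: int = 60,
) -> dict:
    """Poll accepted["poll_url"] until its status is no longer "processing"."""
    # Use the server's estimate as the first wait when it is provided.
    sleep(float(accepted.get("estimated_seconds", interval)))
    for _ in range(max_polls):
        body = fetch(accepted["poll_url"])
        # Assumption: the polled resource carries a "status" field.
        if body.get("status") != "processing":
            return body
        sleep(interval)
    raise TimeoutError("document still processing after max_polls attempts")
```

Waiting for estimated_seconds before the first poll avoids a burst of requests that are very likely to come back as still processing.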

Errors

Error responses

400 missing_document: No document source provided. Supply one of: file, file_url, or document_id.

400 ambiguous_document: More than one document source provided. Supply only one of: file, file_url, or document_id.

400 unsupported_file_type: The uploaded file type is not supported. Accepted: PDF, PNG, JPG, TIFF, WEBP, DOCX, TXT, CSV.

400 invalid_options: The options field is not valid JSON.

401 unauthorized: Missing or invalid API key.

413 FILE_TOO_LARGE: The file exceeds your plan tier's upload limit. See Rate Limits for per-tier caps.

422 extraction_failed: Extraction completed but produced no usable output. Check the document quality or schema definition.

429 rate_limited: Too many requests. Check X-RateLimit-Reset for when the window resets.
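Client code typically branches on these statuses. The sketch below maps each documented status to a coarse next step; the action labels and both function names are hypothetical, and treating only 429 as retryable without changes is a reading of the descriptions above rather than documented behavior.

```python
# Status codes taken from the error table above; action labels are illustrative.
ACTIONS = {
    400: "fix_request",
    401: "check_api_key",
    413: "shrink_file_or_change_plan",
    422: "check_document_or_schema",
    429: "wait_and_retry",
}


def next_action(status: int) -> str:
    """Map an error status to a coarse next step for the caller."""
    return ACTIONS.get(status, "unexpected")


def is_retryable(status: int) -> bool:
    # The other documented errors repeat until the request itself changes.
    return status == 429
```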

Cost Headers

Synchronous 200 responses include cost transparency headers so you can track spend per call without a separate API round-trip:

Cost response headers

X-Talonic-Cost-Credits (integer): Credits consumed by this extraction.

X-Talonic-Cost-EUR (number): EUR cost of this extraction.

X-Talonic-Balance-Credits (integer): Remaining credit balance after this call.

X-Talonic-Cells-Resolved-Registry (integer): Fields resolved from the registry (no AI cost).

X-Talonic-Cells-Resolved-AI (integer): Fields resolved by the AI model.

Cost headers on a sync response

Cost headers are only present on synchronous (200) responses. For async (202) extractions, check the credits balance endpoint or listen for the extraction.complete webhook, which includes cost data.
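The five headers can be collected into one typed record for logging or budgeting. This is a minimal sketch: ExtractionCost and parse_cost_headers are hypothetical names, and the function returns None when the headers are absent, as they would be on a 202 response.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Mapping, Optional


@dataclass(frozen=True)
class ExtractionCost:
    credits: int
    eur: Decimal
    balance_credits: int
    cells_registry: int
    cells_ai: int


def parse_cost_headers(headers: Mapping[str, str]) -> Optional[ExtractionCost]:
    """Parse the X-Talonic-* cost headers, or return None if they are absent."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if "x-talonic-cost-credits" not in lowered:
        return None  # e.g. a 202 response, which carries no cost headers
    return ExtractionCost(
        credits=int(lowered["x-talonic-cost-credits"]),
        eur=Decimal(lowered["x-talonic-cost-eur"]),
        balance_credits=int(lowered["x-talonic-balance-credits"]),
        cells_registry=int(lowered["x-talonic-cells-resolved-registry"]),
        cells_ai=int(lowered["x-talonic-cells-resolved-ai"]),
    )
```

Decimal is used for the EUR amount so that summing many calls does not accumulate floating-point error.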

In open (auto-discovery) extraction, when the same field_name appears on multiple rows of a table, all of its values are returned as an array under that key. This means multi-row tables are not collapsed to a single value: a repeated field like line_item_amount comes back as ["120.00", "85.50", "12.00"] rather than just the first value.
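Because a key may therefore hold either a single value or an array, consumers of auto-discovery output may want to normalize before processing. The helpers below are a hypothetical sketch that treats every value as a list and sums numeric strings such as the line_item_amount example.

```python
from decimal import Decimal
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Wrap a single value in a list; pass arrays through unchanged."""
    if value is None:
        return []
    return list(value) if isinstance(value, list) else [value]


def total(value: Any) -> Decimal:
    """Sum a field that may be one numeric string or an array of them."""
    return sum((Decimal(v) for v in as_list(value)), Decimal("0"))
```

With this in place, total(data["line_item_amount"]) works whether the table had one row or several.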