Extract

The top-level extract() method is the primary entry point: send a document and a schema, and receive structured, validated data.

Extract with inline schema
// Assumes `talonic` is an already-initialized SDK client.
const result = await talonic.extract({
  file_path: './invoice.pdf',
  schema: {
    vendor_name: 'string',
    invoice_number: 'string',
    total_amount: 'number',
    line_items: [{
      description: 'string',
      quantity: 'number',
      unit_price: 'number',
    }],
  },
})

console.log(result.data)
// { vendor_name: 'Acme Corp', total_amount: 1234.56, ... }
console.log(result.extraction_id) // 'ext_abc123'
console.log(result.confidence)    // { overall: 0.95, fields: { vendor_name: 0.99, ... } }

The method signature is extract(params: ExtractParams): Promise<WithRateLimit<ExtractResult>>. ExtractParams requires exactly one file source: file (an in-memory Blob, Buffer, or Uint8Array), file_path (a local path read with fs/promises), file_url (a remote URL fetched server-side), or document_id (to re-extract an existing document). Provide at most one schema source: schema (an inline object or JSON string) or schema_id (the UUID of a saved schema). Omitting both triggers auto-discovery, which is not recommended in production because it may return an HTTP 500 error on some deployments.
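
The source rules above can be sketched as a small pre-flight check. This is an illustration only, not the SDK's implementation: SketchError, SourceParams, and validateSources are hypothetical names, and only the error codes (missing_file_source, multiple_file_sources, multiple_schemas) are taken from this page.

```typescript
// Hypothetical stand-in for the SDK's error type; only the `code` values
// come from the documentation.
class SketchError extends Error {
  readonly code: string
  constructor(code: string, message: string) {
    super(message)
    this.code = code
  }
}

// Subset of the documented parameters that select a file or schema source.
interface SourceParams {
  file?: unknown
  file_path?: string
  file_url?: string
  document_id?: string
  schema?: unknown
  schema_id?: string
}

const FILE_SOURCE_KEYS = ['file', 'file_path', 'file_url', 'document_id'] as const

// Throws if the params do not name exactly one file source,
// or if they name both an inline schema and a saved schema.
function validateSources(params: SourceParams): void {
  const given = FILE_SOURCE_KEYS.filter((key) => params[key] !== undefined)
  if (given.length === 0) {
    throw new SketchError('missing_file_source', 'Provide one file source.')
  }
  if (given.length > 1) {
    throw new SketchError(
      'multiple_file_sources',
      `Provide only one file source, got: ${given.join(', ')}`,
    )
  }
  if (params.schema !== undefined && params.schema_id !== undefined) {
    throw new SketchError('multiple_schemas', 'Provide schema or schema_id, not both.')
  }
}
```

Running a check like this before calling extract() in your own code surfaces a misconfigured call without a network round-trip.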

Under the hood, extract() reads the file from disk when you pass file_path (a file_url is fetched server-side instead), uploads the content as multipart form data, and returns the structured result in a single round-trip. The response includes a cost block with credit consumption and a rateLimit object parsed from response headers.

Extract from a URL with options
const result = await talonic.extract({
  file_url: 'https://example.com/contracts/nda-2025.pdf',
  schema_id: 'sch_def456',
  instructions: 'Focus on termination and non-compete clauses',
  include_markdown: true,
  options: {
    page_range: '1-5',
    language_hint: 'en',
    strict: true,
  },
})

console.log(result.markdown)           // raw OCR text (because include_markdown: true)
console.log(result.document.pages)     // 5
console.log(result.processing?.duration_ms) // 2340

Additional parameters fine-tune extraction behavior. The instructions field accepts natural-language guidance forwarded to the extraction engine. Set include_markdown: true to receive the raw OCR-converted markdown alongside structured data. The options object supports page_range (e.g. '1-5' or '1,3,7-10' for PDFs), language_hint (ISO 639-1 code), strict mode (omit fields not in the schema), and include_raw_text. The content_type parameter lets you override MIME type inference when the file extension is misleading.
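
To make the page_range syntax concrete, here is a minimal sketch of how a value such as '1,3,7-10' expands into individual pages. The helper expandPageRange is hypothetical and is not part of the SDK. It assumes pages are 1-based and ranges are inclusive, which this page does not state explicitly.

```typescript
// Hypothetical helper: expands a page_range string into a sorted list of
// unique page numbers. Assumes 1-based, inclusive ranges.
function expandPageRange(spec: string): number[] {
  const pages = new Set<number>()
  for (const part of spec.split(',')) {
    // Each segment is either a single page ("3") or a range ("7-10").
    const match = /^\s*(\d+)(?:\s*-\s*(\d+))?\s*$/.exec(part)
    if (match === null) {
      throw new Error(`Invalid page range segment: "${part}"`)
    }
    const start = Number(match[1])
    const end = match[2] !== undefined ? Number(match[2]) : start
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range segment: "${part}"`)
    }
    for (let page = start; page <= end; page++) {
      pages.add(page)
    }
  }
  return Array.from(pages).sort((a, b) => a - b)
}

console.log(expandPageRange('1,3,7-10')) // [ 1, 3, 7, 8, 9, 10 ]
```

A check like this can catch a malformed range on the client before the document is uploaded.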

For best results, always supply a schema or schema_id rather than relying on auto-discovery. Inline schemas work well for one-off extractions, while saved schemas (via schema_id) keep your extraction definitions consistent across calls and team members.

Extract in-memory bytes
import { readFile } from 'node:fs/promises'

// Pass a Buffer or Uint8Array directly
const buffer = await readFile('./scan.tiff')
const result = await talonic.extract({
  file: buffer,
  filename: 'scan.tiff',
  content_type: 'image/tiff',
  schema: {
    patient_name: 'string',
    date_of_service: 'date',
    diagnosis_codes: ['string'],
    total_charges: 'number',
  },
})

// The ExtractResult includes document metadata
console.log(result.document.mime_type)         // 'image/tiff'
console.log(result.document.size_bytes)        // 4521984
console.log(result.document.language_detected) // 'en'

The ExtractResult response groups its data into several blocks. The extraction_id is a stable identifier you can pass to talonic.extractions.get() or talonic.extractions.getData() to retrieve the result later. The document block includes id, filename, pages, size_bytes, mime_type, type_detected, and language_detected. The confidence block provides an overall score and per-field scores, each between 0 and 1. The schema block shows the source ('inline' or 'saved'), the resolved definition, and a save_url for persisting inline schemas. The links block provides URLs for the extraction, document, and dashboard views.
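
One way to use the confidence block is to flag fields that need manual review. The sketch below is illustrative: lowConfidenceFields and the 0.8 default threshold are hypothetical choices, and the Confidence shape is inferred from the { overall, fields } example shown earlier on this page, assuming a flat map of field names to scores.

```typescript
// Shape inferred from the sample output on this page; assumed to be flat.
interface Confidence {
  overall: number
  fields: Record<string, number>
}

// Hypothetical helper: returns the names of fields scoring below the
// threshold, sorted alphabetically.
function lowConfidenceFields(confidence: Confidence, threshold = 0.8): string[] {
  return Object.keys(confidence.fields)
    // The `?? 0` fallback only satisfies strict index checks; every key
    // returned by Object.keys has a score.
    .filter((name) => (confidence.fields[name] ?? 0) < threshold)
    .sort()
}

const sample: Confidence = {
  overall: 0.95,
  fields: { vendor_name: 0.99, total_amount: 0.62, invoice_number: 0.75 },
}
console.log(lowConfidenceFields(sample)) // [ 'invoice_number', 'total_amount' ]
```

The threshold is a policy decision for your pipeline; the page only guarantees that scores fall between 0 and 1.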

The SDK validates parameters before making the API call. Passing zero file sources throws a TalonicError with code missing_file_source, and passing more than one throws multiple_file_sources. Passing both schema and schema_id throws multiple_schemas. When you use JSON Schema format with properties but no required array, the SDK auto-populates required with all property keys. This prevents the silent-empty-data footgun where the API returns null for every field.
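
The auto-populate behavior can be pictured as follows. This is a sketch of the rule as described above, not the SDK's code: withRequiredDefaults and JsonSchemaLike are hypothetical names, and the sketch only handles the top level of the schema, since this page does not say whether nested objects are treated the same way.

```typescript
// Minimal JSON Schema shape needed for the illustration.
interface JsonSchemaLike {
  type?: string
  properties?: Record<string, unknown>
  required?: string[]
}

// Hypothetical helper: if the schema has `properties` but no `required`
// array, return a copy with every property key marked as required.
// Schemas that already declare `required` are returned unchanged.
function withRequiredDefaults(schema: JsonSchemaLike): JsonSchemaLike {
  if (schema.properties === undefined || schema.required !== undefined) {
    return schema
  }
  return { ...schema, required: Object.keys(schema.properties) }
}

const normalized = withRequiredDefaults({
  type: 'object',
  properties: {
    vendor_name: { type: 'string' },
    total_amount: { type: 'number' },
  },
})
console.log(normalized.required) // [ 'vendor_name', 'total_amount' ]
```

An explicit required array, even an empty one, is left as written, so you can still opt out of the default by declaring it yourself.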

Every extract() response carries cost metadata (credits consumed, EUR value, post-call balance) and rateLimit info. Use these to build budget-aware pipelines without extra API calls.
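
A budget-aware loop might look like the sketch below. It is written against an injected extractOne function rather than the SDK itself, so it makes no claims about the client's exact types. extractWithinBudget and minBalanceCredits are hypothetical names, and the result.cost.balanceCredits path assumes the cost block is exposed as a cost property carrying the balanceCredits field named in the FAQ.

```typescript
// Assumed subset of the cost block; field names follow the FAQ on this page.
interface CostInfo {
  costCredits: number
  balanceCredits: number
}

interface ExtractLike {
  cost: CostInfo
}

// Hypothetical helper: extracts files one at a time and stops as soon as the
// post-call balance drops below the given floor.
async function extractWithinBudget<T extends ExtractLike>(
  paths: string[],
  extractOne: (path: string) => Promise<T>,
  minBalanceCredits: number,
): Promise<{ results: T[]; skipped: string[] }> {
  const pending = [...paths]
  const results: T[] = []
  while (pending.length > 0) {
    const path = pending.shift()
    if (path === undefined) break
    const result = await extractOne(path)
    results.push(result)
    // The balance is reported on every response, so no extra call is needed.
    if (result.cost.balanceCredits < minBalanceCredits) break
  }
  return { results, skipped: pending }
}
```

In real use, extractOne could be something like (path) => talonic.extract({ file_path: path, schema_id }), with the skipped paths queued for a later run once credits are topped up.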

Frequently asked questions

What inputs does talonic.extract() accept?
It accepts file_path (local file), file_url (remote URL), file (in-memory Blob, Buffer, or Uint8Array with filename), or document_id (previously uploaded). Provide a schema or schema_id for the extraction definition.
Does extract() return cost information?
Yes. Every extract response includes a cost block with costCredits, costEur, balanceCredits, cellsResolvedRegistry, and cellsResolvedAi parsed from response headers.
Can I extract the same document with different schemas?
Yes. Pass the document_id from a previous upload along with a new schema or schema_id. The document is not re-uploaded, saving bandwidth and credits.
What does the auto-populate required behavior do?
When you pass a JSON Schema with properties but no required array, the SDK automatically adds all property keys to required. This prevents the silent-empty-data issue where the API returns null with confidence 0 for every field you intended to extract.
What file formats does extract() support?
The SDK supports PDF, PNG, JPG, TIFF, WebP, BMP, GIF, DOCX, DOC, XLSX, XLS, PPTX, PPT, TXT, Markdown, CSV, TSV, JSON, XML, HTML, EML, MSG, and ZIP. MIME types are inferred from the file extension. Use content_type to override when the extension is misleading.
How do I include the raw OCR text in the response?
Set include_markdown: true in the extract params. The response will include a markdown field containing the full OCR-converted text that the extraction engine used as input. This is useful for debugging extraction quality or building custom post-processing.