Quick Start

Extract structured data from a document in a few lines of code: create a client, call extract() with a file and a schema, and read the typed result.

Extract an invoice
import { Talonic } from '@talonic/node'

const talonic = new Talonic({ apiKey: process.env.TALONIC_API_KEY! })

const result = await talonic.extract({
  file_path: './invoice.pdf',
  schema: {
    vendor_name: 'string',
    invoice_number: 'string',
    total_amount: 'number',
    due_date: 'date',
  },
})

console.log(result.data)
// { vendor_name: 'Acme Corp', invoice_number: 'INV-2024-0847', total_amount: 14250, due_date: '2024-03-15' }

The result object contains the extracted data matching your schema, plus rateLimit and cost metadata. The data fields are typed according to your schema definition, so total_amount comes back as a number and due_date as a date string.

You can also pass file_url for remote files, or file together with filename for in-memory bytes (Blob, Buffer, or Uint8Array). For documents already uploaded to your workspace, pass document_id to skip the upload step entirely. The SDK accepts exactly one file source per call and validates this at runtime: it throws a TalonicError with code missing_file_source if no source is given, or multiple_file_sources if more than one is.
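The one-source rule can be illustrated with a small pre-flight check. This is only a sketch: `FileSourceInput` and `checkFileSource` are hypothetical names that are not part of the SDK, and the SDK runs its own validation regardless. The two codes returned are the ones named above.

```typescript
// Hypothetical input shape: just the four file-source keys named in the text.
type FileSourceInput = {
  file_path?: string
  file_url?: string
  file?: Uint8Array // in-memory bytes; the SDK also accepts Blob and Buffer
  document_id?: string
}

type FileSourceCheck = 'ok' | 'missing_file_source' | 'multiple_file_sources'

// Counts how many file sources are set and maps the count to a result code.
function checkFileSource(input: FileSourceInput): FileSourceCheck {
  const provided = [input.file_path, input.file_url, input.file, input.document_id]
    .filter((value) => value !== undefined).length
  if (provided === 0) return 'missing_file_source'
  if (provided > 1) return 'multiple_file_sources'
  return 'ok'
}
```

A check like this may be useful when the extract() arguments are assembled from user input and you would rather report the problem in your own UI than catch the SDK error.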

Extract from a URL
// Extract from a remote file — no local download needed
const result = await talonic.extract({
  file_url: 'https://example.com/reports/q4-2025.pdf',
  schema: {
    report_title: 'string',
    period: 'string',
    revenue: 'number',
    net_income: 'number',
    highlights: ['string'],
  },
})

console.log(result.data.revenue)       // 4250000
console.log(result.confidence?.overall) // 0.94
console.log(result.document.pages)      // 12

All extract() calls are async and return a Promise. The SDK handles retries, timeouts, and error mapping automatically, so you only need a single try/catch around your call for error handling. Retryable failures (429, 5xx, network errors, timeouts) are retried up to maxRetries times with exponential backoff, so transient hiccups do not require manual retry logic.
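To make the retry behavior concrete, here is a minimal sketch of an exponential backoff schedule and a retryability check. It is not the SDK's implementation: `backoffDelays`, `isRetryable`, the 500 ms base delay, and the 8000 ms cap are all assumptions chosen for illustration. Only the retryable conditions (429, 5xx, network errors, timeouts) come from the text above.

```typescript
// Returns the wait before each retry, doubling every attempt up to a cap.
// baseMs and capMs are assumed values, not documented SDK defaults.
function backoffDelays(maxRetries: number, baseMs = 500, capMs = 8000): number[] {
  const delays: number[] = []
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(Math.min(capMs, baseMs * Math.pow(2, attempt)))
  }
  return delays
}

// An undefined status models a network error or timeout with no HTTP response.
function isRetryable(status: number | undefined): boolean {
  return status === undefined || status === 429 || (status >= 500 && status <= 599)
}
```

With these assumed defaults, `backoffDelays(3)` returns `[500, 1000, 2000]`, and `isRetryable(400)` is `false`, which is why a validation error surfaces immediately instead of being retried.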

Extract with error handling
import { Talonic, TalonicAuthError, TalonicValidationError, TalonicError } from '@talonic/node'

const talonic = new Talonic({ apiKey: process.env.TALONIC_API_KEY! })

try {
  const result = await talonic.extract({
    file_path: './receipt.png',
    schema: {
      merchant: 'string',
      date: 'date',
      total: 'number',
      items: [{ name: 'string', price: 'number' }],
    },
  })
  console.log(`Extracted ${result.data.items?.length ?? 0} line items from ${result.document.filename}`)
  console.log(`Cost: ${result.cost?.costCredits} credits, balance: ${result.cost?.balanceCredits}`)
} catch (err) {
  if (err instanceof TalonicAuthError) {
    console.error('Invalid API key — check TALONIC_API_KEY')
  } else if (err instanceof TalonicValidationError) {
    console.error(`Bad request: ${err.message} (code: ${err.code})`)
  } else if (err instanceof TalonicError) {
    console.error(`Talonic error: ${err.code} (status ${err.status}, request ${err.requestId})`)
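  } else {
    // Not a Talonic error: rethrow so unexpected failures are not silently swallowed
    throw err
  }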
}

The extract() response includes rich metadata beyond the extracted data. The document block contains the filename, page count, file size, detected MIME type, and detected language. The optional confidence block provides an overall confidence score and per-field scores. The processing block reports duration, pages processed, and the region that handled the request. Use these fields to build quality gates and observability into your extraction pipeline.
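A quality gate over the confidence block might look like the sketch below. The `overall` field appears in the examples above, but `fields` is a hypothetical name for the per-field scores, and the thresholds are arbitrary; check the actual response shape before relying on it.

```typescript
// Assumed shape: `overall` is shown in the examples; `fields` is a hypothetical
// name for the per-field scores mentioned in the text.
type Confidence = { overall: number; fields?: Record<string, number> }

type GateResult = { pass: boolean; lowFields: string[] }

// Passes only when the overall score and every per-field score clear their thresholds.
function qualityGate(
  confidence: Confidence | undefined,
  minOverall = 0.9,
  minField = 0.8,
): GateResult {
  // The confidence block is optional, so treat a missing block as "needs review".
  if (!confidence) return { pass: false, lowFields: [] }
  const fields = confidence.fields ?? {}
  const lowFields = Object.keys(fields).filter((name) => fields[name] < minField)
  return { pass: confidence.overall >= minOverall && lowFields.length === 0, lowFields }
}
```

Returning the list of low-scoring fields rather than a bare boolean lets a pipeline send only those fields to manual review.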

Re-extract a previously uploaded document with a saved schema
// Use document_id + schema_id to re-extract without re-uploading
const result = await talonic.extract({
  document_id: 'doc_abc123',
  schema_id: 'sch_def456',
  instructions: 'Focus on the indemnification and liability sections',
  include_markdown: true,
})

console.log(result.data)       // structured extraction
console.log(result.markdown)   // raw OCR markdown (when include_markdown is true)
console.log(result.schema)     // { source: 'saved', id: 'sch_def456', definition: { ... } }

This example uses top-level await. If your environment does not support it, wrap the code in an async function.

Frequently asked questions

How do I extract data from a PDF with the Talonic SDK?
Import Talonic, create a client with your API key, and call talonic.extract() with a file path and schema. The result contains structured, schema-validated JSON.
What does the extract result contain?
The result includes a data object with fields matching your schema, plus rateLimit (limit, remaining, resetAt) and cost (credits consumed, EUR value, post-call balance) metadata. It also contains document metadata (filename, pages, size_bytes, mime_type), optional confidence scores (overall and per-field), and processing info (duration_ms, region).
Can I extract from a URL instead of a local file?
Yes. Pass file_url instead of file_path to extract from a remote document. The file is fetched server-side, so you do not need to download it first.
Can I pass natural-language instructions to guide extraction?
Yes. The extract() method accepts an optional instructions parameter where you can provide natural-language guidance like 'Focus on the indemnification section' or 'Extract amounts in USD only'. Instructions are forwarded to the extraction engine alongside the schema to improve accuracy for ambiguous documents.
How do I extract only specific pages from a PDF?
Pass the options.page_range parameter with a page specification like '1-5' or '1,3,7-10'. This tells the extraction engine to process only those pages, which reduces processing time and credit consumption for large documents.
What happens if I pass both schema and schema_id?
The SDK throws a TalonicError with code 'multiple_schemas' immediately, before making any API call. You must provide at most one schema source: either an inline schema object or a schema_id referencing a saved schema.
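
For the page-range question above, the sketch below expands a specification such as '1,3,7-10' into explicit page numbers, which can help sanity-check a value before passing it as options.page_range. `parsePageRange` is a hypothetical helper, not an SDK function, and it assumes only the two forms shown in the answer (single pages and inclusive ranges); the exact grammar the API accepts may differ.

```typescript
// Expands a spec like '1,3,7-10' into a sorted list of unique page numbers.
// Assumes 1-based pages and inclusive ranges.
function parsePageRange(spec: string): number[] {
  const pages: number[] = []
  for (const part of spec.split(',')) {
    const match = /^\s*(\d+)(?:\s*-\s*(\d+))?\s*$/.exec(part)
    if (!match) throw new Error(`Invalid page range segment: "${part}"`)
    const start = Number(match[1])
    const end = match[2] !== undefined ? Number(match[2]) : start
    if (start < 1 || end < start) throw new Error(`Invalid page range segment: "${part}"`)
    for (let page = start; page <= end; page++) {
      if (pages.indexOf(page) === -1) pages.push(page) // skip duplicates from overlapping segments
    }
  }
  return pages.sort((a, b) => a - b)
}
```

Here `parsePageRange('1,3,7-10')` returns `[1, 3, 7, 8, 9, 10]`, and a reversed range such as '5-1' throws.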