Extract data from PDFs as JSON

A PDF is a layout format, not a data format. It describes where ink sits on a page, not what the values mean, which is why pulling clean JSON out of a PDF is harder than it looks. A vendor invoice from Stripe, a three-page bank statement from JPMorgan Chase, a packing list from a freight forwarder in Rotterdam, and a signed order confirmation all carry the same underlying facts a downstream system needs, yet each renders them in a different visual arrangement. Developers who want JSON usually reach for an OCR library, get back a wall of text, and then spend days writing brittle regular expressions to find the invoice number, the line items, and the totals. The moment a supplier changes their template, the parser breaks.

Talonic turns any PDF into structured JSON without that fragility. Upload a document through the dashboard or the API, and the response is a typed JSON object: scalar fields like issue_date in ISO 8601 form and total_amount as a number, nested arrays for line items, and consistent keys across every variation of the source layout. A 2026-04-30 statement and a 2025-11-02 statement from two different banks return the same shape.

Every value arrives with a confidence score between 0 and 1 and a pixel region pointing back to the exact spot on the page it came from, so a number like $18,902.11 can be traced to its source in seconds. The output drops straight into a database, an ERP, or an AI agent, and it also exports to CSV when a spreadsheet is the destination.

What gets extracted from PDFs as JSON

| Field type | Example | Notes |
| --- | --- | --- |
| Scalar fields | invoice_number: "INV-2043" | Typed top-level keys |
| Dates (ISO 8601) | issue_date: "2026-04-30" | |
| Numeric amounts | total_amount: 18902.11 | Numbers, not strings |
| Nested arrays | line_items: [ ... ] | Repeating rows |
| Confidence score | 0.98 | Per field, 0 to 1 |
| Source region | page 2, x:120 y:540 | Provenance pointer |

How extraction works for PDFs as JSON

Talonic does not rely on per-template rules, so a new PDF layout does not require new configuration. Each document is classified and mapped against a versioned schema in the Field Registry, which defines the keys, types, and validation rules the JSON must satisfy. Text is recovered with OCR when the PDF is scanned, then normalized: dates become ISO 8601 strings, currency amounts become numbers in a single sign convention, and repeated page headers and footers are dropped so they never appear as data. Fields that fail validation, such as a total that does not equal the sum of its line items, are flagged rather than silently returned. Every value carries a confidence score and a pixel region pointer, in conformity with DIN SPEC 91491, so low-confidence values can be reviewed against the source page before the JSON is trusted downstream.
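To make the normalization step concrete, here is a minimal sketch of what turning raw text into ISO 8601 dates and signed numbers can look like. It is not Talonic's implementation: the helper names normalize_date and normalize_amount, the short list of accepted date formats, and the rule that parenthesized amounts are negative are all assumptions chosen for illustration.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical list of input formats; a real pipeline would handle many more
# and would need extra context to resolve ambiguous day/month orderings.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%B %d, %Y", "%m/%d/%Y")


def normalize_date(raw: str) -> str:
    """Return the date as an ISO 8601 string (YYYY-MM-DD)."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def normalize_amount(raw: str) -> Decimal:
    """Parse a US-style amount such as "$18,902.11" into a signed number.

    Assumption: accounting-style parentheses, e.g. "(45.00)", mean negative.
    """
    text = raw.strip()
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    text = text.replace("$", "").replace(",", "").strip()
    if text.startswith("-"):
        negative, text = True, text[1:]
    value = Decimal(text)
    return -value if negative else value


print(normalize_date("April 30, 2026"))  # 2026-04-30
print(normalize_amount("$18,902.11"))    # 18902.11
```

Decimal is used here so that amounts survive parsing without float rounding noise; converting to a JSON number happens at serialization time.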

Sample extraction

A 2-page supplier invoice PDF (Stripe, April 2026)

{
  "invoice_number": "INV-2043",
  "issue_date": "2026-04-30",
  "vendor": "Stripe, Inc.",
  "currency": "USD",
  "line_items": [
    {
      "description": "Platform fee",
      "quantity": 1,
      "unit_price": 299,
      "amount": 299
    },
    {
      "description": "Additional usage",
      "quantity": 1240,
      "unit_price": 0.02,
      "amount": 24.8
    }
  ],
  "subtotal": 323.8,
  "tax": 0,
  "total_amount": 323.8
}
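A consumer can sanity-check an extraction like the sample above by recomputing the totals from the line items. The sketch below shows one way to do that on the client side; reconcile is a hypothetical helper, not part of Talonic's API, and the numbers are parsed as Decimal so that the comparison is exact.

```python
import json
from decimal import Decimal

# The numeric parts of the sample invoice, copied from the JSON above.
SAMPLE = """
{
  "line_items": [
    {"description": "Platform fee", "quantity": 1, "unit_price": 299, "amount": 299},
    {"description": "Additional usage", "quantity": 1240, "unit_price": 0.02, "amount": 24.8}
  ],
  "subtotal": 323.8,
  "tax": 0,
  "total_amount": 323.8
}
"""


def reconcile(invoice: dict) -> list[str]:
    """Return a list of human-readable mismatches (empty if consistent)."""
    problems = []
    for item in invoice["line_items"]:
        expected = item["quantity"] * item["unit_price"]
        if expected != item["amount"]:
            problems.append(f"{item['description']}: {expected} != {item['amount']}")
    line_sum = sum((item["amount"] for item in invoice["line_items"]), Decimal(0))
    if line_sum != invoice["subtotal"]:
        problems.append(f"subtotal {invoice['subtotal']} != line item sum {line_sum}")
    if invoice["subtotal"] + invoice["tax"] != invoice["total_amount"]:
        problems.append("total_amount != subtotal + tax")
    return problems


# parse_float/parse_int make json hand every number to Decimal as a string.
invoice = json.loads(SAMPLE, parse_float=Decimal, parse_int=Decimal)
print(reconcile(invoice))  # []
```

For the sample, 1240 × 0.02 = 24.8 and 299 + 24.8 = 323.8, so the list of problems is empty.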

Frequently asked

How is this different from running a PDF through an OCR library?

OCR returns text, not structured data. You still have to locate and label every value yourself. Talonic returns a typed JSON object with named keys, nested line items, and a confidence score per field, so there is no regex layer to write or maintain when a layout changes.

Can I get the JSON from an API?

Yes. Send the document to the extraction endpoint and poll for the result, or receive it by webhook. The response is JSON, and the same data exports to CSV with a query parameter. The Node SDK wraps the calls if you prefer typed methods over raw HTTP.
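The polling pattern mentioned above can be sketched independently of any particular HTTP client. Everything specific in this example is an assumption: fetch_status stands in for whatever call retrieves the job, and the "status" key with the values "processing", "done", and "failed" is a hypothetical response shape, not Talonic's documented one.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval_s: float = 1.0,
    max_interval_s: float = 30.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() with exponential backoff until the job finishes.

    fetch_status is a placeholder for whatever performs the HTTP GET; the
    "status" key and its values are assumed for this sketch.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        if job["status"] == "done":
            return job
        if job["status"] == "failed":
            raise RuntimeError(f"extraction failed: {job.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("extraction did not finish in time")
        sleep(interval_s)
        # Double the wait each round, capped at max_interval_s.
        interval_s = min(interval_s * 2, max_interval_s)


# Stand-in for a real HTTP call: reports "processing" twice, then "done".
responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "done", "result": {"invoice_number": "INV-2043"}},
])
job = poll_until_done(lambda: next(responses), sleep=lambda s: None)
print(job["result"]["invoice_number"])  # INV-2043
```

Injecting the fetch and sleep functions keeps the loop testable without network access; a webhook avoids the loop entirely when the receiving side can expose an endpoint.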

What happens with scanned or photographed PDFs?

Scanned PDFs are run through OCR first, then mapped against the same schema. Each value comes back with a confidence score and a pixel region, so anything the model is unsure about can be checked against the original image rather than trusted blindly.
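A simple way to act on the confidence scores is to route anything below a cut-off to a human review queue, along with its source region. The sketch below assumes a hypothetical per-field shape (value, confidence, region) and an arbitrary threshold of 0.90; the actual response layout may differ.

```python
REVIEW_THRESHOLD = 0.90  # assumed cut-off; tune per field and per use case

# Hypothetical per-field shape: value + confidence + source region.
extracted = {
    "invoice_number": {"value": "INV-2043", "confidence": 0.98,
                       "region": {"page": 1, "x": 120, "y": 80}},
    "issue_date": {"value": "2026-04-30", "confidence": 0.71,
                   "region": {"page": 1, "x": 420, "y": 80}},
    "total_amount": {"value": 323.8, "confidence": 0.95,
                     "region": {"page": 2, "x": 120, "y": 540}},
}


def needs_review(fields: dict, threshold: float = REVIEW_THRESHOLD) -> list[tuple[str, dict]]:
    """Return (name, field) pairs below the threshold, least confident first."""
    flagged = [(name, f) for name, f in fields.items() if f["confidence"] < threshold]
    return sorted(flagged, key=lambda pair: pair[1]["confidence"])


for name, field in needs_review(extracted):
    r = field["region"]
    print(f"{name}: {field['value']!r} ({field['confidence']:.2f}) "
          f"-> page {r['page']}, x:{r['x']} y:{r['y']}")
```

Here only issue_date falls below the threshold, so it is the single field a reviewer would be asked to compare against the page image.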

Does the JSON shape stay the same across different document layouts?

Yes. The schema fixes the keys and types, so an invoice from one vendor and an invoice from another return the same structure. That stability is what lets you map the output into a database or an ERP once and stop touching it.
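Because the keys and types are fixed, one mapping function can serve every vendor. The sketch below loads the sample invoice and a second, made-up invoice into SQLite through the same code path; the table layout, the store helper, and the second invoice are illustrative assumptions, not something the product prescribes.

```python
import sqlite3

SCHEMA = """
CREATE TABLE invoices (
    invoice_number TEXT PRIMARY KEY,
    issue_date     TEXT,   -- ISO 8601, so string order matches date order
    vendor         TEXT,
    currency       TEXT,
    total_amount   REAL
);
CREATE TABLE line_items (
    invoice_number TEXT REFERENCES invoices(invoice_number),
    description    TEXT,
    quantity       REAL,
    unit_price     REAL,
    amount         REAL
);
"""


def store(conn: sqlite3.Connection, invoice: dict) -> None:
    """Insert one extracted invoice and its line items."""
    conn.execute(
        "INSERT INTO invoices VALUES (?, ?, ?, ?, ?)",
        (invoice["invoice_number"], invoice["issue_date"], invoice["vendor"],
         invoice["currency"], invoice["total_amount"]),
    )
    conn.executemany(
        "INSERT INTO line_items VALUES (?, ?, ?, ?, ?)",
        [(invoice["invoice_number"], li["description"], li["quantity"],
          li["unit_price"], li["amount"]) for li in invoice["line_items"]],
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

invoices = [
    {"invoice_number": "INV-2043", "issue_date": "2026-04-30",
     "vendor": "Stripe, Inc.", "currency": "USD", "total_amount": 323.8,
     "line_items": [
         {"description": "Platform fee", "quantity": 1, "unit_price": 299, "amount": 299},
         {"description": "Additional usage", "quantity": 1240, "unit_price": 0.02, "amount": 24.8},
     ]},
    # Made-up second vendor, included only to show the same code path applies.
    {"invoice_number": "EX-0001", "issue_date": "2026-05-02",
     "vendor": "Example Supplier GmbH", "currency": "EUR", "total_amount": 120.0,
     "line_items": [
         {"description": "Consulting", "quantity": 2, "unit_price": 60.0, "amount": 120.0},
     ]},
]
for inv in invoices:
    store(conn, inv)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM line_items").fetchone()[0]
print(count)  # 3
```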

Author note

Reviewed by Talonic engineering, API review · last reviewed 2026-06-20