What fields are extracted from PDFs as JSON?

Talonic returns PDF content as schema-validated, typed JSON fields. Common field types include scalar fields, dates, numeric amounts, and nested arrays. Each value is normalized (dates to ISO 8601, amounts to numbers) and mapped to a stable key, so the output shape stays the same across layouts.

How accurate is extraction from PDFs as JSON, and how is confidence reported?

Every extracted cell carries a confidence score from 0.0 to 1.0 and a provenance pointer back to the source page and region, so low-confidence values can be reviewed against the original before the data is trusted downstream. There is no single accuracy number: confidence is reported per field so you can gate on it.

Can I use PDFs as JSON extraction in production?

Yes. The same engine behind this guide is available as a production REST API and Node SDK with sync, async, and streaming modes, schema versioning, signed webhooks, and EU-resident processing. Start free with an API key, then scale on usage-based pricing.

What does it cost to extract data from PDFs as JSON?

There is a free tier for prototyping and agent evaluation with no credit card. Paid usage is credit-based at 1,000 credits per euro: page ingestion is 100 credits per page and registry-resolved queries are free. See talonic.com/pricing for current tiers.
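
As a rough worked example of the credit arithmetic above: at 1,000 credits per euro and 100 credits per page, ingestion works out to €0.10 per page. The sketch below only restates those two numbers. The function name is illustrative, and it ignores free-tier allowances and anything else listed on the pricing page.

```python
CREDITS_PER_EURO = 1_000   # from the pricing answer above
CREDITS_PER_PAGE = 100     # page ingestion cost, also from above


def ingestion_cost_eur(pages: int) -> float:
    """Estimate the ingestion cost in euros for a given page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    credits = pages * CREDITS_PER_PAGE
    return credits / CREDITS_PER_EURO


print(ingestion_cost_eur(1))    # 0.1
print(ingestion_cost_eur(250))  # 25.0
```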

Extract data from PDFs as JSON

A PDF is a layout format, not a data format. It describes where ink sits on a page, not what the values mean, which is why pulling clean JSON out of a PDF is harder than it looks. A vendor invoice from Stripe, a three-page bank statement from JPMorgan Chase, a packing list from a freight forwarder in Rotterdam, and a signed order confirmation all carry the same underlying facts a downstream system needs, yet each renders them in a different visual arrangement. Developers who want JSON usually reach for an OCR library, get back a wall of text, and then spend days writing brittle regular expressions to find the invoice number, the line items, and the totals. The moment a supplier changes their template, the parser breaks.

Talonic turns any PDF into structured JSON without that fragility. Upload a document through the dashboard or the API, and the response is a typed JSON object: scalar fields like issue_date in ISO 8601 form and total_amount as a number, nested arrays for line items, and consistent keys across every variation of the source layout. A 2026-04-30 statement and a 2025-11-02 statement from two different banks return the same shape.

Every value arrives with a confidence score between 0 and 1 and a pixel region pointing back to the exact spot on the page it came from, so a number like $18,902.11 can be traced to its source in seconds. The output drops straight into a database, an ERP, or an AI agent, and it also exports to CSV when a spreadsheet is the destination.

What gets extracted from PDFs as JSON

Scalar fields: invoice_number: "INV-2043" (typed top-level keys)

Dates (ISO 8601): issue_date: "2026-04-30"

Numeric amounts: total_amount: 18902.11 (numbers, not strings)

Nested arrays: line_items: [ ... ] (repeating rows)

Confidence score: 0.98 (per field, 0 to 1)

Source region: page 2, x:120 y:540 (provenance pointer)

How extraction works for PDFs as JSON

Talonic does not rely on per-template rules, so a new PDF layout does not require new configuration. Each document is classified and mapped against a versioned schema in the Field Registry, which defines the keys, types, and validation rules the JSON must satisfy. Text is recovered with OCR when the PDF is scanned, then normalized: dates become ISO 8601 strings, currency amounts become numbers in a single sign convention, and repeated page headers and footers are dropped so they never appear as data. Fields that fail validation, such as a total that does not equal the sum of its line items, are flagged rather than silently returned. Every cell carries a confidence score and a pixel region pointer in line with DIN SPEC 91491 conformity, so low-confidence values can be reviewed against the source page before the JSON is trusted downstream.
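
To make the normalization step concrete, here is a minimal sketch of what "dates become ISO 8601 strings" and "amounts become numbers in a single sign convention" can look like. It is not Talonic's implementation. The function names, the list of accepted date formats, and the assumption of US-style separators (comma for thousands, dot for decimals) are all illustrative.

```python
from datetime import datetime
from decimal import Decimal

# Assumed set of input formats; a real pipeline would need many more and
# would have to resolve ambiguous day/month orderings explicitly.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%B %d, %Y", "%d %B %Y")


def normalize_date(raw: str) -> str:
    """Return the date as an ISO 8601 string (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def normalize_amount(raw: str) -> float:
    """Return a signed number; parentheses or a leading minus mean negative."""
    text = raw.strip()
    negative = (text.startswith("(") and text.endswith(")")) or text.startswith("-")
    # Keep digits and the decimal point; drop currency symbols and commas.
    digits = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    if not digits:
        raise ValueError(f"unrecognized amount: {raw!r}")
    value = Decimal(digits)
    return float(-value if negative else value)


print(normalize_date("April 30, 2026"))  # 2026-04-30
print(normalize_amount("$18,902.11"))    # 18902.11
print(normalize_amount("(1,234.50)"))    # -1234.5
```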

Sample extraction

A 2-page supplier invoice PDF (Stripe, April 2026)

{
  "invoice_number": "INV-2043",
  "issue_date": "2026-04-30",
  "vendor": "Stripe, Inc.",
  "currency": "USD",
  "line_items": [
    {
      "description": "Platform fee",
      "quantity": 1,
      "unit_price": 299,
      "amount": 299
    },
    {
      "description": "Additional usage",
      "quantity": 1240,
      "unit_price": 0.02,
      "amount": 24.8
    }
  ],
  "subtotal": 323.8,
  "tax": 0,
  "total_amount": 323.8
}
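
The consistency rule mentioned earlier (a total that does not equal the sum of its line items gets flagged) can be checked against this sample in a few lines. The sketch below is one way a consumer might re-verify the arithmetic on their side. The helper names are illustrative, and it assumes exact equality, which real invoices with rounding may not satisfy.

```python
import json
from decimal import Decimal


def dec(value) -> Decimal:
    """Convert a JSON number to Decimal via str to avoid binary float drift."""
    return Decimal(str(value))


def check_totals(doc: dict) -> list:
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    for i, item in enumerate(doc["line_items"]):
        expected = dec(item["quantity"]) * dec(item["unit_price"])
        if expected != dec(item["amount"]):
            problems.append(f"line_items[{i}]: amount {item['amount']} != {expected}")
    line_sum = sum((dec(item["amount"]) for item in doc["line_items"]), Decimal("0"))
    if line_sum != dec(doc["subtotal"]):
        problems.append(f"subtotal {doc['subtotal']} != sum of line items {line_sum}")
    if dec(doc["subtotal"]) + dec(doc["tax"]) != dec(doc["total_amount"]):
        problems.append(f"total_amount {doc['total_amount']} != subtotal + tax")
    return problems


# The sample extraction above, trimmed to the fields the check needs.
invoice = json.loads("""
{
  "line_items": [
    {"description": "Platform fee", "quantity": 1, "unit_price": 299, "amount": 299},
    {"description": "Additional usage", "quantity": 1240, "unit_price": 0.02, "amount": 24.8}
  ],
  "subtotal": 323.8,
  "tax": 0,
  "total_amount": 323.8
}
""")

print(check_totals(invoice))  # []
```

An empty list means the line items, subtotal, tax, and total agree. Any entry in the list is a candidate for manual review against the source page.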

Frequently asked

How is this different from running a PDF through an OCR library?

OCR returns text, not structured data. You still have to locate and label every value yourself. Talonic returns a typed JSON object with named keys, nested line items, and a confidence score per field, so there is no regex layer to write or maintain when a layout changes.

Can I get the JSON from an API?

Yes. Send the document to the extraction endpoint and poll for the result, or receive it by webhook. The response is JSON, and the same data exports to CSV with a query parameter. The Node SDK wraps the calls if you prefer typed methods over raw HTTP.

What happens with scanned or photographed PDFs?

Scanned PDFs are run through OCR first, then mapped against the same schema. Each value comes back with a confidence score and a pixel region, so anything the model is unsure about can be checked against the original image rather than trusted blindly.
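
One way to act on per-field confidence is to split a result into values that pass a threshold and values routed to manual review. The sketch below is hedged: this guide does not show how confidence and region are serialized, so the payload shape, the field values, and the 0.90 threshold are all assumptions for illustration, not the documented response format.

```python
REVIEW_THRESHOLD = 0.90  # assumed cut-off; tune per field and use case


def split_by_confidence(fields: dict, threshold: float = REVIEW_THRESHOLD):
    """Separate trusted values from ones that need a human look."""
    accepted = {}
    needs_review = []
    for name, cell in fields.items():
        if cell["confidence"] >= threshold:
            accepted[name] = cell["value"]
        else:
            # Keep the region so a reviewer can jump to the spot on the page.
            needs_review.append((name, cell["value"], cell["region"]))
    return accepted, needs_review


# Hypothetical payload shape: one object per field with value, confidence, region.
fields = {
    "invoice_number": {"value": "INV-2043", "confidence": 0.98,
                       "region": {"page": 1, "x": 80, "y": 96}},
    "total_amount": {"value": 18902.11, "confidence": 0.71,
                     "region": {"page": 2, "x": 120, "y": 540}},
}

accepted, needs_review = split_by_confidence(fields)
print(accepted)  # {'invoice_number': 'INV-2043'}
for name, value, region in needs_review:
    print(f"review {name}={value} on page {region['page']}")
```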

Does the JSON shape stay the same across different document layouts?

Yes. The schema fixes the keys and types, so an invoice from one vendor and an invoice from another return the same structure. That stability is what lets you map the output into a database or an ERP once and stop touching it.
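
Because the keys and types are fixed, a single mapping function can serve every vendor. The sketch below loads the sample invoice from this guide into an in-memory SQLite database. The table layout and function names are illustrative choices, not something Talonic prescribes.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with one table per level of the JSON."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE invoices (
            invoice_number TEXT PRIMARY KEY,
            issue_date TEXT, vendor TEXT, currency TEXT, total_amount REAL
        );
        CREATE TABLE line_items (
            invoice_number TEXT, description TEXT,
            quantity REAL, unit_price REAL, amount REAL
        );
    """)
    return conn


def store_invoice(conn: sqlite3.Connection, doc: dict) -> None:
    """Write one extracted invoice; the same code works for any vendor layout."""
    conn.execute(
        "INSERT INTO invoices VALUES (?, ?, ?, ?, ?)",
        (doc["invoice_number"], doc["issue_date"], doc["vendor"],
         doc["currency"], doc["total_amount"]),
    )
    conn.executemany(
        "INSERT INTO line_items VALUES (?, ?, ?, ?, ?)",
        [(doc["invoice_number"], item["description"], item["quantity"],
          item["unit_price"], item["amount"]) for item in doc["line_items"]],
    )


doc = {
    "invoice_number": "INV-2043", "issue_date": "2026-04-30",
    "vendor": "Stripe, Inc.", "currency": "USD",
    "line_items": [
        {"description": "Platform fee", "quantity": 1, "unit_price": 299, "amount": 299},
        {"description": "Additional usage", "quantity": 1240, "unit_price": 0.02, "amount": 24.8},
    ],
    "total_amount": 323.8,
}

conn = make_db()
store_invoice(conn, doc)
print(conn.execute("SELECT COUNT(*) FROM line_items").fetchone()[0])  # 2
```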

Author note

Reviewed by Talonic engineering, API review · last reviewed 2026-06-20

Source: RFC 8259: The JavaScript Object Notation (JSON) Data Interchange Format
