What fields are extracted from PDFs as CSV?

Talonic converts PDF tables to CSV with schema-validated, typed fields. Common fields include Row, Typed columns, Normalized dates, and Single header, among others. Each is normalized (dates to ISO 8601, amounts as numbers) and mapped to a stable key so the output shape stays the same across layouts.

How accurate is extraction from PDFs as CSV, and how is confidence reported?

Every extracted cell carries a confidence score from 0.0 to 1.0 and a provenance pointer back to the source page and region, so low-confidence values can be reviewed against the original before the data is trusted downstream. There is no single accuracy number: confidence is reported per field so you can gate on it.

Can I use PDFs as CSV extraction in production?

Yes. The same engine behind this guide is available as a production REST API and Node SDK with sync, async, and streaming modes, schema versioning, signed webhooks, and EU-resident processing. Start free with an API key, then scale on usage-based pricing.

What does it cost to extract data from PDFs as CSV?

There is a free tier for prototyping and agent evaluation with no credit card. Paid usage is credit-based at 1,000 credits per euro: page ingestion is 100 credits per page and registry-resolved queries are free. See talonic.com/pricing for current tiers.
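As a rough worked example of the rates quoted above, the following sketch converts a page count into euros. It assumes page ingestion is the only charge and ignores any free-tier allowance; the function name `ingestion_cost_eur` is made up for illustration, and talonic.com/pricing remains the authority on current tiers.

```python
from decimal import Decimal

# Rates taken from the pricing text above.
CREDITS_PER_EURO = 1000
CREDITS_PER_PAGE = 100

def ingestion_cost_eur(pages: int) -> Decimal:
    """Euro cost of ingesting `pages` pages, assuming ingestion is the only charge."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return Decimal(pages * CREDITS_PER_PAGE) / Decimal(CREDITS_PER_EURO)

print(ingestion_cost_eur(3))    # a 3-page statement: 0.3
print(ingestion_cost_eur(250))  # a 250-page batch: 25
```

At these rates a page works out to 0.10 euro, so a 3-page statement costs 0.30 euro to ingest.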

Extract data from PDFs as CSV

Spreadsheets still run finance, logistics, and operations, which is why so many workflows end with the same request: get this PDF into a CSV I can open in Excel or load into a database. The problem is that a PDF has no rows. It has lines of text positioned on a page, and a table in a PDF is just ink that happens to line up. A bank statement from Wells Fargo, a packing list from a freight forwarder, a remittance advice with forty line items, and a rent roll for a 120-unit building all look like tables to a human and like loose text to a parser. Copy and paste rarely survives the trip: columns merge, multi-line cells split, and the totals row lands in the wrong place.

Talonic reads the structure behind the layout and returns clean CSV. Each detected table becomes rows with consistent columns, one header, and one value per cell, so a 3-page statement collapses into a single tidy file. Numbers come back as numbers, dates are normalized to ISO 8601, and a value like 2026-04-30 or $4,812.00 keeps its meaning instead of becoming free text. Repeated page headers and footers are removed so they do not show up as phantom rows. Every cell also carries a confidence score and a pointer back to the region of the source PDF it came from, so a finance team can audit any figure before it reaches the general ledger. When a workflow needs nested data instead of flat rows, the same extraction is available as JSON from the API.


What gets extracted from PDFs as CSV

Field | Example | Note
Row | one transaction or line item | One record per row
Typed columns | amount: 4812.00 | Numbers stay numeric
Normalized dates | 2026-04-30 | ISO 8601
Single header | date, description, amount | Detected once
Confidence score | 0.97 | Per cell, 0 to 1
Source region | page 3, row 12 | Provenance pointer

How extraction works for PDFs as CSV

Template-based table extraction breaks the moment a column moves, so Talonic does not use fixed templates. Each PDF is classified and aligned to a versioned schema in the Field Registry that names the columns and their types. Tables that wrap across pages are stitched into one continuous set of rows, and the header is detected once rather than repeated. Multi-line cells are kept in a single field instead of spilling into new rows, and numeric columns are typed so totals can be checked against their line items. Scanned PDFs are read with OCR before the same alignment runs. Every cell is returned with a confidence score and a pixel-region pointer, consistent with DIN SPEC 91491, so a low-confidence cell can be reviewed against the source page before the CSV is loaded anywhere.
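The review step described above can be sketched as a simple confidence gate. The cell shape used here (`column`, `value`, `confidence`, `page`, `row`) is a hypothetical stand-in, not the documented response format, and both the 0.9 threshold and the 0.62 score are arbitrary illustrative values.

```python
from typing import Any

# Hypothetical cell shape for illustration; the real response may differ.
Cell = dict[str, Any]

def split_by_confidence(
    cells: list[Cell], threshold: float = 0.9
) -> tuple[list[Cell], list[Cell]]:
    """Return (accepted, needs_review) based on each cell's confidence score."""
    accepted = [c for c in cells if c["confidence"] >= threshold]
    needs_review = [c for c in cells if c["confidence"] < threshold]
    return accepted, needs_review

cells = [
    {"column": "amount", "value": 4812.00, "confidence": 0.97, "page": 3, "row": 12},
    {"column": "date", "value": "2026-04-30", "confidence": 0.62, "page": 3, "row": 12},
]

accepted, needs_review = split_by_confidence(cells)
for cell in needs_review:
    # The provenance fields tell a reviewer where to look in the source PDF.
    print(f"review {cell['column']} on page {cell['page']}, row {cell['row']}")
```

The right threshold depends on how costly a wrong value is downstream; a general-ledger load would likely warrant a stricter gate than a prototype.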

Sample extraction

A 3-page Wells Fargo statement, transactions table

{
  "columns": [
    "date",
    "description",
    "amount",
    "balance"
  ],
  "rows": [
    [
      "2026-04-03",
      "ACH CREDIT STRIPE PAYOUT",
      2450,
      14930.55
    ],
    [
      "2026-04-05",
      "CHECK 2174 ABC SUPPLIES",
      -842.16,
      14088.39
    ],
    [
      "2026-04-15",
      "WIRE OUT VENDOR PAYMENT",
      -5000,
      9088.39
    ]
  ]
}
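
To show how this JSON shape maps onto a flat file, here is a minimal sketch that writes the sample table as CSV with Python's standard `csv` module and checks that the running balance is consistent with the amounts. It is an illustration only, not Talonic's serializer; the helper names `to_csv` and `balances_reconcile` are made up for this example.

```python
import csv
import io
from decimal import Decimal

# The sample table from above, as a Python dict.
table = {
    "columns": ["date", "description", "amount", "balance"],
    "rows": [
        ["2026-04-03", "ACH CREDIT STRIPE PAYOUT", 2450, 14930.55],
        ["2026-04-05", "CHECK 2174 ABC SUPPLIES", -842.16, 14088.39],
        ["2026-04-15", "WIRE OUT VENDOR PAYMENT", -5000, 9088.39],
    ],
}

def to_csv(table: dict) -> str:
    """Serialize columns and rows to CSV text with a single header line."""
    buffer = io.StringIO()
    # The default "excel" dialect quotes fields that contain commas or quotes
    # and ends lines with CRLF, in line with RFC 4180.
    writer = csv.writer(buffer)
    writer.writerow(table["columns"])
    writer.writerows(table["rows"])
    return buffer.getvalue()

def balances_reconcile(table: dict) -> bool:
    """Check that each balance equals the previous balance plus the amount."""
    amount_idx = table["columns"].index("amount")
    balance_idx = table["columns"].index("balance")
    rows = table["rows"]
    for prev, curr in zip(rows, rows[1:]):
        # Decimal(str(...)) avoids binary floating-point rounding noise.
        expected = Decimal(str(prev[balance_idx])) + Decimal(str(curr[amount_idx]))
        if expected != Decimal(str(curr[balance_idx])):
            return False
    return True

csv_text = to_csv(table)
print(csv_text)
print("balances reconcile:", balances_reconcile(table))
```

Because the amounts arrive as numbers rather than strings, the reconciliation needs no parsing of currency symbols or thousands separators.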

Frequently asked

Will multi-page tables come out as one CSV?

Yes. Tables that continue across pages are stitched into a single set of rows with one header. The page breaks, repeated headers, and footers from the original PDF are removed so they do not appear as extra rows.
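
The idea of stitching can be illustrated with a deliberately simplified sketch. It assumes each page has already been parsed into a list of rows and that a repeated header is an exact copy of the first one; this is not Talonic's method, which the guide describes as schema alignment rather than string matching, and `stitch_pages` is a made-up name.

```python
def stitch_pages(pages: list[list[list[str]]]) -> list[list[str]]:
    """Merge per-page tables into one table, keeping the header only once."""
    if not pages or not pages[0]:
        return []
    header = pages[0][0]
    stitched = [header]
    for page in pages:
        for row in page:
            if row == header:
                continue  # drop the header wherever it reappears
            stitched.append(row)
    return stitched

page_1 = [
    ["date", "description", "amount"],
    ["2026-04-03", "ACH CREDIT STRIPE PAYOUT", "2450"],
]
page_2 = [
    ["date", "description", "amount"],  # header repeated after the page break
    ["2026-04-05", "CHECK 2174 ABC SUPPLIES", "-842.16"],
]

for row in stitch_pages([page_1, page_2]):
    print(row)
```

Exact matching fails as soon as a repeated header is reworded or OCR introduces a typo, which is one reason fixed rules like this tend to be brittle on real documents.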

Do numbers and dates stay typed, or become text?

Numbers come back as numbers and dates are normalized to ISO 8601, so 04/30/26 and 30 Apr 2026 both become 2026-04-30. That means a CSV loaded into Excel or a database sorts and sums correctly instead of treating values as strings.
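
For intuition, the two inputs mentioned above can be normalized with Python's standard `datetime` module. This sketch is not the engine's implementation: the format list is an assumption, it treats slash dates as US month-first, and `%b` month names depend on an English locale.

```python
from datetime import datetime

# Assumed input formats, tried in order; a real pipeline would need more.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%y", "%d %b %Y")

def to_iso_date(text: str) -> str:
    """Parse a date string in one of KNOWN_FORMATS and return ISO 8601."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognized date format: {text!r}")

print(to_iso_date("04/30/26"))     # 2026-04-30
print(to_iso_date("30 Apr 2026"))  # 2026-04-30
```

A string such as 04/05/26 is ambiguous between April 5 and 5 May, so a fixed format list like this one has to commit to a convention up front.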

What about scanned PDFs or photos of a document?

They are read with OCR first, then aligned to the same columns. Each cell carries a confidence score and a region pointer, so any value the model is unsure about can be checked against the source image.

Can I also get JSON instead of CSV?

Yes. The same extraction returns JSON from the API when you need nested structures like line items inside an invoice. CSV is selected with a query parameter for spreadsheet and database loads.


Author note

Reviewed by Talonic engineering, API review · last reviewed 2026-06-20

Source: RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files

Related extraction guides

Extract data from bank statements
Extract data from invoices
Extract data from income statements