Extract data from PDFs as CSV
Spreadsheets still run finance, logistics, and operations, which is why so many workflows end with the same request: get this PDF into a CSV I can open in Excel or load into a database. The problem is that a PDF has no rows. It has lines of text positioned on a page, and a table in a PDF is just ink that happens to line up. A bank statement from Wells Fargo, a packing list from a freight forwarder, a remittance advice with forty line items, and a rent roll for a 120-unit building all look like tables to a human and like loose text to a parser. Copy and paste rarely survives the trip: columns merge, multi-line cells split, and the totals row lands in the wrong place.
Talonic reads the structure behind the layout and returns clean CSV. Each detected table becomes rows with consistent columns, one header, and one value per cell, so a 3-page statement collapses into a single tidy file. Numbers come back as numbers, dates are normalized to ISO 8601, and a value like 2026-04-30 or $4,812.00 keeps its meaning instead of becoming free text. Repeated page headers and footers are removed so they do not show up as phantom rows.
Every cell also carries a confidence score and a pointer back to the region of the source PDF it came from, so a finance team can audit any figure before it reaches the general ledger. When a workflow needs nested data instead of flat rows, the same extraction is available as JSON from the API.
What gets extracted from PDFs as CSV
How extraction works for PDFs as CSV
Template-based table extraction breaks the moment a column moves, so Talonic does not use fixed templates. Each PDF is classified and aligned to a versioned schema in the Field Registry that names the columns and their types. Tables that wrap across pages are stitched into one continuous set of rows, and the header is detected once rather than repeated. Multi-line cells are kept in a single field instead of spilling into new rows, and numeric columns are typed so totals can be checked against their line items. Scanned PDFs are read with OCR before the same alignment runs. Every cell is returned with a confidence score and a pixel-region pointer, in line with DIN SPEC 91491, so a low-confidence cell can be reviewed against the source page before the CSV is loaded anywhere.
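Because numeric columns arrive typed, a consistency check of the kind described above takes only a few lines. The following is a minimal sketch, not Talonic's implementation: it assumes rows laid out as date, description, amount, balance, and the helper names `to_decimal` and `balance_mismatches` are hypothetical.

```python
from decimal import Decimal

# Illustrative rows: date, description, amount, running balance.
rows = [
    ["2026-04-03", "ACH CREDIT STRIPE PAYOUT", 2450, 14930.55],
    ["2026-04-05", "CHECK 2174 ABC SUPPLIES", -842.16, 14088.39],
    ["2026-04-15", "WIRE OUT VENDOR PAYMENT", -5000, 9088.39],
]


def to_decimal(value):
    # Convert through str() so 14930.55 becomes Decimal("14930.55")
    # rather than the long binary expansion of the float.
    return Decimal(str(value))


def balance_mismatches(rows):
    """Return indexes of rows whose balance != previous balance + amount."""
    bad = []
    for i in range(1, len(rows)):
        expected = to_decimal(rows[i - 1][3]) + to_decimal(rows[i][2])
        if expected != to_decimal(rows[i][3]):
            bad.append(i)
    return bad


print(balance_mismatches(rows))
```

An empty list means every running balance agrees with the line items above it. Any index that does appear marks a row worth reviewing against the source page before the CSV is loaded.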
Sample extraction
A 3-page Wells Fargo statement, transactions table
{
  "columns": ["date", "description", "amount", "balance"],
  "rows": [
    ["2026-04-03", "ACH CREDIT STRIPE PAYOUT", 2450, 14930.55],
    ["2026-04-05", "CHECK 2174 ABC SUPPLIES", -842.16, 14088.39],
    ["2026-04-15", "WIRE OUT VENDOR PAYMENT", -5000, 9088.39]
  ]
}
Frequently asked
Will multi-page tables come out as one CSV?
Yes. Tables that continue across pages are stitched into a single set of rows with one header. The page breaks, repeated headers, and footers from the original PDF are removed so they do not appear as extra rows.
Do numbers and dates stay typed, or become text?
Numbers come back as numbers and dates are normalized to ISO 8601, so 04/30/26 and 30 Apr 2026 both become 2026-04-30. That means a CSV loaded into Excel or a database sorts and sums correctly instead of treating values as strings.
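To illustrate what normalizing to ISO 8601 means, here is a small sketch using only Python's standard library. It is not how Talonic performs the conversion: the list of accepted formats and the function name `to_iso` are assumptions made for this example, and it reads 04/30/26 in US month-first order.

```python
from datetime import datetime

# Assumed input formats for illustration only; real documents vary far more.
FORMATS = ["%m/%d/%y", "%d %b %Y", "%Y-%m-%d"]


def to_iso(text):
    """Try each known format and return the date as YYYY-MM-DD."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognized date: {text!r}")


for raw in ["04/30/26", "30 Apr 2026"]:
    print(raw, "->", to_iso(raw))
```

Once every date has the same YYYY-MM-DD shape, plain string sorting already matches chronological order, which is why the normalized column sorts correctly in a spreadsheet or database.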
What about scanned PDFs or photos of a document?
They are read with OCR first, then aligned to the same columns. Each cell carries a confidence score and a region pointer, so any value the model is unsure about can be checked against the source image.
Can I also get JSON instead of CSV?
Yes. The same extraction returns JSON from the API when you need nested structures like line items inside an invoice. For spreadsheet and database loads, CSV is selected with a query parameter.
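If you already hold the JSON form, flattening it to CSV locally is straightforward. The sketch below assumes the columns and rows layout shown in the sample extraction; the function name `table_to_csv` is hypothetical and the code makes no calls to the Talonic API.

```python
import csv
import io

# Table in the same shape as the sample extraction (two rows for brevity).
table = {
    "columns": ["date", "description", "amount", "balance"],
    "rows": [
        ["2026-04-03", "ACH CREDIT STRIPE PAYOUT", 2450, 14930.55],
        ["2026-04-05", "CHECK 2174 ABC SUPPLIES", -842.16, 14088.39],
    ],
}


def table_to_csv(table):
    """Write one header row followed by the data rows, return CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(table["columns"])
    writer.writerows(table["rows"])
    return buf.getvalue()


print(table_to_csv(table))
```

Using the `csv` module rather than joining strings by hand means a description that contains a comma or a quote is escaped correctly.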
Author note
Reviewed by Talonic engineering, API review · last reviewed 2026-06-20