Extract data from invoices
Invoices are the heartbeat of B2B commerce. Every accounts payable team handles thousands of them: a SaaS provider's monthly recurring invoice, a one-off line-item invoice from a marketing agency, a multi-page commercial invoice with 200 SKUs from an apparel supplier in Vietnam. The semantics are stable. The format never is. Vendors use whatever template their billing system produces, and the result lands in your inbox as a PDF that nobody can import into NetSuite, QuickBooks, or SAP without re-keying.
The cost of that re-keying compounds: missed early-payment discounts, late-payment penalties, three-way-matching backlogs, accruals that fail audit because line items were dropped during data entry, and days payable outstanding (DPO) that creeps up because invoices sit in a manual queue.
The easy fields are not the hard part. Invoice number, vendor name, total amount, due date: most extractors get those right. What breaks at scale is the line-item table and everything around it. Multi-page invoices where the table continues across pages with repeated headers. Discounts applied to individual lines versus the subtotal. Shipping and handling charges that appear as a separate line, sometimes taxable, sometimes not. Mixed-currency invoices where one line is in USD and another in EUR (rare but real for international service providers). VAT or sales tax that needs to be itemized by jurisdiction. Vendor remit-to addresses that differ from the legal entity name on the invoice header. PO numbers that appear once in the header and again per line. Payment terms expressed as Net 30, 2/10 Net 30, or a specific due date.
Talonic returns the full invoice structure as JSON or CSV. Line items come back as a structured array with description, quantity, unit price, line total, and any per-line tax or discount. Header fields such as invoice number, dates, vendor, buyer, currency, and totals are captured once. Payment terms are preserved as strings rather than parsed into rigid fields, because the variation is too high to standardize.
Every extracted cell carries a confidence score and a pixel-region reference so any disputed amount can be audited against the source PDF before approval.
What gets extracted from invoices
How extraction works for invoices
Invoices arrive as PDFs, scans, or images, and the format depends entirely on the vendor's billing software. Talonic classifies each invoice and runs it through the Invoice schema in the Field Registry, which encodes both the header fields and the line-item table without per-template configuration. Multi-page line tables are stitched into a single table, with repeated headers filtered out; discounts and shipping are captured as flagged line items rather than dropped into a generic notes field. Currency is normalized to ISO 4217 codes. Per-line tax is itemized when shown on the source. Cell-level confidence scores and pixel-region provenance conform to DIN SPEC 91491, so AP teams can audit any disputed amount before approving payment.
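To make the stitching step concrete, here is a minimal Python sketch of one way to merge per-page tables while dropping repeated header rows. It is an illustration only, not Talonic's implementation: the function name stitch_line_tables, the list-of-rows input shape, and the case-insensitive header comparison are all assumptions made for this example.

```python
def _normalize(row):
    """Collapse whitespace and case so 'Unit  Price' matches 'unit price'."""
    return [" ".join(str(cell).split()).lower() for cell in row]


def stitch_line_tables(pages):
    """Merge per-page tables into one table with a single header row.

    Hypothetical input shape: `pages` is a list of tables, one per page,
    and each table is a list of rows (lists of cell strings). The first
    row of the first non-empty page is treated as the header.
    """
    rows = [row for table in pages for row in table]
    if not rows:
        return []
    header_key = _normalize(rows[0])
    # Keep the first header, then drop any later row that repeats it.
    body = [row for row in rows[1:] if _normalize(row) != header_key]
    return [rows[0]] + body


# Illustrative two-page table built from the sample invoice's line items.
PAGES = [
    [
        ["Description", "Qty", "Unit Price", "Total"],
        ["Annual subscription, Pro plan", "5", "1200", "6000"],
        ["Onboarding services, 8 hours", "8", "250", "2000"],
    ],
    [
        ["description ", "QTY", "Unit  price", "Total"],  # repeated header
        ["Volume discount", "1", "-250", "-250"],
    ],
]

STITCHED = stitch_line_tables(PAGES)
```

A real pipeline would likely need fuzzier header matching (OCR noise, translated headers), which this sketch does not attempt.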
Sample extraction
A 2-page commercial invoice in USD with three line items
{
  "invoice_number": "INV-2026-00471",
  "invoice_date": "2026-04-08",
  "due_date": "2026-05-08",
  "vendor": "Acme Software, Inc.",
  "buyer": "Globex Logistics LLC",
  "currency": "USD",
  "purchase_order_number": "PO-2026-01102",
  "line_items": [
    {
      "description": "Annual subscription, Pro plan",
      "quantity": 5,
      "unit_price": 1200,
      "line_total": 6000
    },
    {
      "description": "Onboarding services, 8 hours",
      "quantity": 8,
      "unit_price": 250,
      "line_total": 2000
    },
    {
      "description": "Volume discount",
      "quantity": 1,
      "unit_price": -250,
      "line_total": -250
    }
  ],
  "totals": {
    "subtotal": 7750,
    "tax": 670,
    "total": 8420
  },
  "payment_terms": "Net 30, 2% if paid within 10 days"
}
Frequently asked
Can Talonic parse line-item tables that span multiple pages?
Yes. Multi-page invoices return a single ordered line-item array. Repeated table headers on pages 2 and beyond are filtered automatically. The aggregate subtotal, tax, and total are tied out against the sum of line totals; mismatches raise a validation flag.
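The same tie-out can be reproduced downstream on the JSON output. The sketch below is an assumption-laden illustration rather than Talonic's validation logic: the function name tie_out, the flag strings, and the one-cent tolerance are hypothetical, and it assumes tax appears only in the totals block, as in the sample extraction.

```python
from decimal import Decimal

# Line items and totals copied from the sample extraction.
SAMPLE = {
    "line_items": [
        {"description": "Annual subscription, Pro plan", "line_total": 6000},
        {"description": "Onboarding services, 8 hours", "line_total": 2000},
        {"description": "Volume discount", "line_total": -250},
    ],
    "totals": {"subtotal": 7750, "tax": 670, "total": 8420},
}


def _dec(value):
    """Convert via str so float inputs do not drag in binary rounding noise."""
    return Decimal(str(value))


def tie_out(invoice, tolerance=Decimal("0.01")):
    """Return a list of validation flags; an empty list means it ties out."""
    flags = []
    totals = invoice["totals"]
    line_sum = sum(
        (_dec(item["line_total"]) for item in invoice["line_items"]),
        Decimal("0"),
    )
    subtotal = _dec(totals["subtotal"])
    if abs(line_sum - subtotal) > tolerance:
        flags.append("subtotal != sum of line totals")
    if abs(subtotal + _dec(totals["tax"]) - _dec(totals["total"])) > tolerance:
        flags.append("total != subtotal + tax")
    return flags
```

For the sample invoice the line totals sum to 6000 + 2000 - 250 = 7750, and 7750 + 670 = 8420, so tie_out(SAMPLE) returns an empty list.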
How are discounts and shipping charges handled?
Discounts and shipping appear as their own line items in the array, with a sign matching their effect on the total (discounts negative, shipping positive). Per-line VAT or sales tax is preserved on the line item when shown that way on the source; otherwise the aggregate tax sits in the totals block.
Does it handle invoices in non-Latin scripts or right-to-left languages?
Yes. Talonic processes invoices in Arabic, Hebrew, Chinese, Japanese, Korean, Cyrillic, and most other scripts. The field semantics remain the same; only the source text differs. Currency codes are normalized to ISO 4217 regardless of the script.
What about scanned invoices versus digital PDFs?
Both work. Scanned invoices are OCRed and run through the same schema, returning per-cell confidence. Digital PDFs extract at higher confidence because the text layer is already present. Faxed invoices and photographs of paper invoices are also supported.
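Per-cell confidence is most useful when it drives a review queue. The following Python sketch shows one possible routing rule. The cell shape is an assumption: this page states only that each cell carries a confidence score and a pixel-region reference, so the keys field, confidence, page, and bbox, the 0.90 threshold, and the sample scores are all made up for illustration.

```python
def cells_needing_review(cells, threshold=0.90):
    """Return cells below the threshold, least confident first.

    Hypothetical cell shape (not a documented Talonic format):
    {"field": str, "value": any, "confidence": float,
     "page": int, "bbox": [x0, y0, x1, y1]}
    """
    flagged = [cell for cell in cells if cell["confidence"] < threshold]
    return sorted(flagged, key=lambda cell: cell["confidence"])


# Illustrative cells; the confidence values and regions are invented.
CELLS = [
    {"field": "invoice_number", "value": "INV-2026-00471",
     "confidence": 0.99, "page": 1, "bbox": [72, 90, 260, 110]},
    {"field": "totals.tax", "value": 670,
     "confidence": 0.88, "page": 2, "bbox": [420, 610, 500, 630]},
    {"field": "line_items[2].line_total", "value": -250,
     "confidence": 0.71, "page": 2, "bbox": [420, 300, 500, 320]},
]

REVIEW_QUEUE = cells_needing_review(CELLS)
```

A reviewer would then open each flagged cell's page and region in the source document and confirm or correct the value before approval. A scanned or faxed invoice would typically put more cells in this queue than a digital PDF.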
Is the output ready for QuickBooks, NetSuite, or SAP?
The structured output exports cleanly to CSV for spreadsheets and to JSON for AP automation, ERPs, and analytics. Line items are a flat array of rows; header fields appear once. Mapping into a destination system's chart of accounts happens downstream of extraction.
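As an example of the downstream step, the sketch below flattens the JSON output into one CSV row per line item using Python's standard csv module. The layout is an assumption chosen for spreadsheet import: it repeats the header fields on every row, which may differ from the CSV that Talonic itself exports. The function name invoice_to_csv and the column order are hypothetical.

```python
import csv
import io

HEADER_FIELDS = [
    "invoice_number", "invoice_date", "due_date",
    "vendor", "buyer", "currency", "purchase_order_number",
]
LINE_FIELDS = ["description", "quantity", "unit_price", "line_total"]

# Trimmed copy of the sample extraction.
SAMPLE_INVOICE = {
    "invoice_number": "INV-2026-00471",
    "invoice_date": "2026-04-08",
    "due_date": "2026-05-08",
    "vendor": "Acme Software, Inc.",
    "buyer": "Globex Logistics LLC",
    "currency": "USD",
    "purchase_order_number": "PO-2026-01102",
    "line_items": [
        {"description": "Annual subscription, Pro plan",
         "quantity": 5, "unit_price": 1200, "line_total": 6000},
        {"description": "Onboarding services, 8 hours",
         "quantity": 8, "unit_price": 250, "line_total": 2000},
        {"description": "Volume discount",
         "quantity": 1, "unit_price": -250, "line_total": -250},
    ],
}


def invoice_to_csv(invoice):
    """Return CSV text with one row per line item, header fields repeated."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=HEADER_FIELDS + LINE_FIELDS, lineterminator="\n"
    )
    writer.writeheader()
    for item in invoice["line_items"]:
        row = {name: invoice.get(name, "") for name in HEADER_FIELDS}
        row.update({name: item.get(name, "") for name in LINE_FIELDS})
        writer.writerow(row)
    return buffer.getvalue()


CSV_TEXT = invoice_to_csv(SAMPLE_INVOICE)
```

Values that contain commas, such as the vendor name, are quoted by the csv writer automatically. Account coding and other ERP-specific mapping would still be applied after this step.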
Author note
Reviewed by Talonic engineering, schema review · last reviewed 2026-05-11