FOR DEVELOPERS

A data extraction API that returns schema-validated JSON

One endpoint. POST a document, get typed JSON back, with a confidence score and provenance on every field. Auto-detect the fields or send your own schema and get exactly that shape. The same call handles invoices, contracts, bank statements, and 529 document types across 25+ formats.

Start free, get an API key → Read the API docs →

One call: /v1/extract

The whole surface is a single POST. Send a file and, optionally, the shape you want. The response is a typed object plus per-field confidence and provenance. The synchronous path needs no job orchestration, and the async and streaming modes share the same response contract when you need them.

Request

curl -X POST https://api.talonic.com/v1/extract \
  -H "Authorization: Bearer $TALONIC_API_KEY" \
  -F "file=@invoice.pdf" \
  -F 'schema={"vendor":"string","total_eur":"number","due_date":"date"}'
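The same multipart request can be sketched in TypeScript with the built-in fetch and FormData of Node 18+. This is a minimal sketch, not the official SDK: the helper names buildExtractForm and extract are hypothetical, and only the URL, the Authorization header, and the file and schema form fields come from the curl call above.

```typescript
const EXTRACT_URL = "https://api.talonic.com/v1/extract";

// Hypothetical helper: build the same two form fields the curl call sends.
// The schema is optional, mirroring the "send it bare" path.
function buildExtractForm(
  file: Blob,
  filename: string,
  schema?: Record<string, string>
): FormData {
  const form = new FormData();
  form.append("file", file, filename);
  if (schema !== undefined) {
    form.append("schema", JSON.stringify(schema));
  }
  return form;
}

// Hypothetical helper: POST the form. Content-Type is left unset on purpose
// so fetch can add the multipart boundary itself.
async function extract(apiKey: string, form: FormData): Promise<unknown> {
  const res = await fetch(EXTRACT_URL, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });
  if (!res.ok) {
    throw new Error(`extract failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

Splitting form construction from the network call keeps the request shape easy to unit test without touching the API.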

Response

{
  "data": {
    "vendor": "Acme Corp",
    "total_eur": 1500.00,
    "due_date": "2026-07-15"
  },
  "confidence": {
    "vendor": 0.99,
    "total_eur": 0.97,
    "due_date": 0.92
  },
  "provenance": {
    "total_eur": { "page": 1, "region": [0.62, 0.81, 0.78, 0.84] }
  }
}

Omit the schema field and the API auto-detects every field in the document instead. Send a richer JSON Schema and it validates types, required fields, and nested arrays such as line items.
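Per-field confidence makes it straightforward to route uncertain values to review. The sketch below assumes the response shape shown in the example above. The ExtractResponse type, the fieldsNeedingReview helper, and the 0.95 threshold are illustrative choices, not part of the API.

```typescript
// Response shape inferred from the example response; treat it as a sketch,
// not the official SDK type.
interface ExtractResponse<T extends Record<string, unknown>> {
  data: T;
  confidence: Record<string, number>;
  provenance: Record<string, { page: number; region: number[] }>;
}

// Hypothetical helper: list the fields whose confidence is below a threshold.
// A field with no confidence entry is treated as 0 and therefore flagged.
function fieldsNeedingReview<T extends Record<string, unknown>>(
  response: ExtractResponse<T>,
  threshold: number
): string[] {
  return Object.keys(response.data).filter(
    (field) => (response.confidence[field] ?? 0) < threshold
  );
}

// The example response from above, reused as sample input.
const sample: ExtractResponse<{ vendor: string; total_eur: number; due_date: string }> = {
  data: { vendor: "Acme Corp", total_eur: 1500.0, due_date: "2026-07-15" },
  confidence: { vendor: 0.99, total_eur: 0.97, due_date: 0.92 },
  provenance: { total_eur: { page: 1, region: [0.62, 0.81, 0.78, 0.84] } },
};

console.log(fieldsNeedingReview(sample, 0.95)); // [ 'due_date' ]
```

With a 0.95 threshold only due_date (0.92) is flagged. The right threshold depends on how costly a wrong value is in your pipeline.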

Auto-schema or your own schema

Schema is the contract. When you do not know what is in a document, send it bare and read back everything the engine finds. When your code already knows the shape it needs, which is every time an agent calls a function, send the schema and get exactly that object. Three schema formats are accepted: full JSON Schema for maximum control, a simplified fields list, and a flat key-type map for the quickest path.
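To see how the quickest format relates to the most explicit one, here is a sketch that expands a flat key-type map into a standard JSON Schema object. The mapping is an assumption for illustration: it treats "date" as a string with the JSON Schema date format and marks every key as required. How the API itself interprets each flat type may differ, and flatToJsonSchema is a hypothetical helper, not an SDK function.

```typescript
// The three flat types used in the request example above.
type FlatType = "string" | "number" | "date";
type FlatSchema = Record<string, FlatType>;

interface JsonSchemaObject {
  type: "object";
  properties: Record<string, { type: string; format?: string }>;
  required: string[];
  additionalProperties: boolean;
}

// Hypothetical helper: expand a flat key-type map into JSON Schema.
// Assumption: "date" becomes { type: "string", format: "date" }.
function flatToJsonSchema(flat: FlatSchema): JsonSchemaObject {
  const properties: JsonSchemaObject["properties"] = {};
  for (const [key, kind] of Object.entries(flat)) {
    properties[key] =
      kind === "date" ? { type: "string", format: "date" } : { type: kind };
  }
  return {
    type: "object",
    properties,
    required: Object.keys(flat), // assumption: every listed key is required
    additionalProperties: false,
  };
}

const expanded = flatToJsonSchema({
  vendor: "string",
  total_eur: "number",
  due_date: "date",
});
console.log(JSON.stringify(expanded, null, 2));
```

Starting from the flat map and moving to full JSON Schema only when you need nested arrays or stricter validation keeps early integrations small.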

Because output is schema-validated, the JSON shape stays stable across vendors and layouts. A redesigned invoice from an existing supplier still returns the same keys, so downstream code does not break on a layout change.

Node SDK and MCP server

The Node SDK and the MCP server are thin wrappers over the same REST API. With the MCP server, an AI agent in Claude, Cursor, or any MCP client calls extraction as a native tool, with no glue code required.

npm install @talonic/node

import { Talonic } from "@talonic/node";

const client = new Talonic({ apiKey: process.env.TALONIC_API_KEY! });

const result = await client.extract({
  file_path: "invoice.pdf",
  schema: { vendor: "string", total_eur: "number" }
});
// result.data, result.confidence, result.provenance

Hosted MCP at https://mcp.talonic.com/mcp means zero install. For other languages, the OpenAPI spec drives codegen directly. See the developer guide for the three modes, auth, and cost headers.

Transparent, usage-based pricing

Start free with 5,000 credits per month and no credit card. Paid usage is credit-based at 1,000 credits per euro: page ingestion is 100 credits per page, and queries that resolve from already-ingested data are free. There are no per-seat or per-connector fees. Every synchronous response carries cost headers, so an agent reads exactly what a call cost and what balance remains.
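The credit arithmetic is simple enough to check in a few lines. The sketch below uses only the rates stated above (1,000 credits per euro, 100 credits per page, 5,000 free credits per month). The function names are illustrative, and it does not model how free credits combine with paid usage.

```typescript
// Rates as stated on this page.
const CREDITS_PER_EURO = 1000;
const CREDITS_PER_PAGE = 100;
const FREE_CREDITS_PER_MONTH = 5000;

// Credits consumed by ingesting a number of pages.
function ingestionCredits(pages: number): number {
  return pages * CREDITS_PER_PAGE;
}

// Euro value of a credit amount at the paid rate.
function creditsToEuros(credits: number): number {
  return credits / CREDITS_PER_EURO;
}

// Pages the free tier covers each month: 5000 / 100 = 50.
const freePagesPerMonth = FREE_CREDITS_PER_MONTH / CREDITS_PER_PAGE;

console.log(freePagesPerMonth); // 50
console.log(creditsToEuros(ingestionCredits(1))); // 0.1
console.log(creditsToEuros(ingestionCredits(120))); // 12
```

So one ingested page costs 0.10 euro at the paid rate, the free tier covers 50 pages per month, and a 120-page batch comes to 12 euro of credits.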

See pricing for current tiers, or the API reference for the full endpoint catalog, error envelope, and webhook events.

Frequently asked questions

What does the data extraction API do?

It turns a document into structured, typed JSON. You POST a file to /v1/extract and get back fields as a schema-validated object, with a confidence score and provenance on every value. It covers invoices, bank statements, contracts, receipts, and 529 document types across 25+ file formats, so one integration replaces a stack of format-specific parsers.

Do I have to define a schema?

No. Send a document with no schema and the API auto-detects the fields and returns everything it finds. Send a document with your own schema, as JSON Schema, a simplified fields list, or a flat key-type map, and you get back exactly that shape. Most production callers send a schema because their code already knows the shape it needs.

How is confidence reported?

Per field, not per document. Every value carries a confidence score from 0.0 to 1.0 and a provenance pointer to the source page and region. Synchronous responses also include cost headers (X-Talonic-Cost-Credits, X-Talonic-Balance-Credits) so an agent can track spend on every call without a second request.
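A minimal sketch of reading those two headers from a response follows. It assumes the header values are plain numeric strings, which this page does not specify. The readCallCost helper and the HeaderReader interface are hypothetical. HeaderReader matches the get method of the standard fetch Headers object, so response.headers can be passed in directly.

```typescript
// Minimal view of a header collection. The fetch Headers object satisfies it.
interface HeaderReader {
  get(name: string): string | null;
}

interface CallCost {
  costCredits: number | null;
  balanceCredits: number | null;
}

// Parse a header value as a number. Returns null when the header is
// missing, empty, or not numeric.
function parseCredits(value: string | null): number | null {
  if (value === null || value.trim() === "") {
    return null;
  }
  const parsed = Number(value);
  return Number.isFinite(parsed) ? parsed : null;
}

// Hypothetical helper: read the two cost headers named above.
// Assumption: both carry plain numeric credit amounts.
function readCallCost(headers: HeaderReader): CallCost {
  return {
    costCredits: parseCredits(headers.get("X-Talonic-Cost-Credits")),
    balanceCredits: parseCredits(headers.get("X-Talonic-Balance-Credits")),
  };
}

// Usage with fetch: const cost = readCallCost(response.headers);
```

Returning null rather than 0 for a missing header keeps an agent from mistaking an absent value for a free call.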

Is there an SDK or MCP server?

Yes. The Node SDK (@talonic/node) and the MCP server (@talonic/mcp) are thin wrappers over the same REST API, so an AI agent can call extraction as a tool with no glue code. For other languages, the OpenAPI spec at talonic.com/openapi.json drives codegen directly.

How much does the data extraction API cost?

There is a free tier with 5,000 credits per month and no credit card. Paid usage is credit-based at 1,000 credits per euro: page ingestion is 100 credits per page, and queries that resolve from already-ingested data are free. Pricing is published at talonic.com/pricing, with no per-seat or per-connector fees.

Where does extraction run?

On EU-resident infrastructure: Microsoft Azure in Germany West Central, with Mistral Large as the primary model. Data does not leave EU jurisdiction for customers who require it. The API is aligned with GDPR and ISO 27001.

Make your first extraction call

Get an API key and POST a document. Read back schema-validated JSON with confidence and provenance in the same response.

Parsing invoices specifically? See invoice parsing and invoice data extraction.

Start free, get an API key → Book a demo →