Documents

The documents resource lets you manage uploaded files in your workspace. Every call to extract() creates a document automatically, but you can also list, inspect, re-extract, filter, and delete documents independently.

Core document operations
// List documents with cursor-based pagination
const docs = await talonic.documents.list({ limit: 50 })
console.log(docs.data.map(d => d.filename))
console.log(docs.pagination) // { next_cursor: '...', has_more: true }

// Get a single document with full metadata
const doc = await talonic.documents.get('doc_abc123')
console.log(doc.filename)           // 'invoice.pdf'
console.log(doc.status)             // 'completed'
console.log(doc.pages)              // 3
console.log(doc.triage)             // { sensitivity: 'internal', pii_detected: true, ... }

// Get OCR markdown
const md = await talonic.documents.getMarkdown('doc_abc123')
console.log(md.markdown)            // '# Invoice\n\nVendor: Acme Corp...'

The list() method accepts ListDocumentsParams, which supports filtering by source_id, status ('pending', 'processing', 'completed', 'error'), date range (after and before, as ISO 8601 strings), and full-text search (search) across filenames and extracted content. Pagination is cursor-based: pass limit to set the page size and, to fetch the next page, pass cursor set to the previous response's pagination.next_cursor. The legacy page and per_page parameters are still accepted as aliases, but cursor-based pagination is the canonical form. Every response includes a pagination object with next_cursor and has_more.
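The cursor loop described above can be sketched as a small helper. This is only an illustration: the Pagination, Page, and ListFn types, the collectAll helper, and the in-memory fakeList stand-in are all hypothetical names, and the response shape is assumed from the pagination object shown earlier rather than taken from the SDK's type definitions.

```typescript
// Shapes assumed from the prose above; the real SDK types may differ.
interface Pagination { next_cursor: string | null; has_more: boolean }
interface Page<T> { data: T[]; pagination: Pagination }
type ListFn<T> = (params: { limit: number; cursor?: string }) => Promise<Page<T>>

// Fetch every page by feeding each response's next_cursor into the next call.
async function collectAll<T>(list: ListFn<T>, limit = 50): Promise<T[]> {
  const items: T[] = []
  let cursor: string | undefined
  while (true) {
    const page = await list({ limit, cursor })
    items.push(...page.data)
    // Stop when the server reports no further pages (or omits the cursor).
    if (!page.pagination.has_more || !page.pagination.next_cursor) return items
    cursor = page.pagination.next_cursor
  }
}

// Hypothetical in-memory stand-in for a list endpoint, so the sketch runs offline.
// It encodes the next start index as the cursor string.
function fakeList(names: string[]): ListFn<{ filename: string }> {
  return async ({ limit, cursor }) => {
    const start = cursor ? Number(cursor) : 0
    const end = Math.min(start + limit, names.length)
    const has_more = end < names.length
    return {
      data: names.slice(start, end).map(filename => ({ filename })),
      pagination: { next_cursor: has_more ? String(end) : null, has_more },
    }
  }
}
```

Against the real client, the same helper would presumably be called as collectAll(params => talonic.documents.list(params)), assuming list() resolves to the { data, pagination } shape shown earlier. Taking the list function as a parameter keeps the loop testable without network access.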

Use getMarkdown() to retrieve the raw OCR output for a document. This is useful for debugging extraction quality or building custom post-processing pipelines on top of the parsed text.
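As one example of post-processing the parsed text, the sketch below pulls an outline of headings out of a markdown string such as the md.markdown value shown earlier. The Heading type and extractHeadings function are hypothetical helpers, not part of the SDK, and the matching is deliberately simple: it recognizes only ATX-style headings (# through ######) and does not skip headings that appear inside fenced code blocks.

```typescript
// Hypothetical helper types; not part of the SDK.
interface Heading { level: number; text: string }

// Return every ATX heading ("# Title" through "###### Title") in document order.
function extractHeadings(markdown: string): Heading[] {
  const headings: Heading[] = []
  for (const line of markdown.split(/\r?\n/)) {
    // One to six '#' characters, at least one space, then the heading text.
    const match = /^(#{1,6})\s+(.+?)\s*$/.exec(line)
    if (match) headings.push({ level: match[1].length, text: match[2] })
  }
  return headings
}
```

A quick look at the outline is often enough to tell whether OCR captured the document's structure before digging into individual field values.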

Filter documents by extracted field values
// Filter documents using composable conditions on extracted fields
const filtered = await talonic.documents.filter({
  conditions: [
    { field: 'vendor.name', operator: 'eq', value: 'Acme Corp' },
    { field: 'total_amount', operator: 'gt', value: 10000 },
  ],
  sort: { field: 'invoice_date', direction: 'desc' },
  limit: 25,
})

console.log(filtered.total)      // 47
console.log(filtered.documents)  // [{ id: '...', filename: '...', fieldValues: { ... } }, ...]

The filter() method lets you query documents by extracted field values using composable conditions. Each condition specifies a field (canonical name like 'vendor.name') or fieldId (UUID), an operator (eq, neq, gt, gte, lt, lte, between, contains, is_empty, is_not_empty), and a value. The between operator also accepts valueTo for range queries. Results include fieldValues with the matched field data for each document hit. You can optionally scope results to a specific source connection with source_connection_id.
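To make the condition shape concrete, here is a minimal sketch of typed condition objects with a client-side sanity check. The operator names come from the paragraph above; the Operator, Scalar, and Condition types and the between and validate helpers are hypothetical. Three points are assumptions rather than documented behavior: that value is the lower bound and valueTo the upper bound of a between range, that exactly one of field or fieldId should be set, and that is_empty and is_not_empty take no value.

```typescript
// Operator names are from the documentation text; the type names are hypothetical.
type Operator =
  | 'eq' | 'neq' | 'gt' | 'gte' | 'lt' | 'lte'
  | 'between' | 'contains' | 'is_empty' | 'is_not_empty'
type Scalar = string | number | boolean

interface Condition {
  field?: string    // canonical name such as 'vendor.name'
  fieldId?: string  // UUID alternative to `field`
  operator: Operator
  value?: Scalar
  valueTo?: Scalar  // only meaningful for 'between'
}

// Build a range condition. Assumption: value is the lower bound, valueTo the upper.
function between(field: string, from: Scalar, to: Scalar): Condition {
  return { field, operator: 'between', value: from, valueTo: to }
}

// Client-side check that catches malformed conditions before a request is sent.
// The rules encoded here are assumptions, not the server's validation logic.
function validate(c: Condition): string[] {
  const problems: string[] = []
  if ((c.field === undefined) === (c.fieldId === undefined)) {
    problems.push('set exactly one of field or fieldId')
  }
  const unary = c.operator === 'is_empty' || c.operator === 'is_not_empty'
  if (!unary && c.value === undefined) problems.push(c.operator + ' needs a value')
  if (c.operator === 'between' && c.valueTo === undefined) problems.push('between needs valueTo')
  return problems
}
```

An array of such objects would go in the conditions property of the filter() call, for example conditions: [between('total_amount', 1000, 5000)].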

Re-extract and delete
// Re-run extraction on an existing document (e.g. after schema update)
const reExtracted = await talonic.documents.reExtract('doc_abc123')
console.log(reExtracted.status)  // 'processing'
console.log(reExtracted.message) // 'Re-extraction started'

// Delete a document and all associated extractions (irreversible)
const deleted = await talonic.documents.delete('doc_abc123')
console.log(deleted.deleted) // true

The get() method returns a Document object with full metadata, including triage classification data when available. The triage block contains sensitivity ('public', 'internal', or 'restricted'), department, jurisdiction (an ISO country code), pii_detected, pii_categories, regulated_data, and confidentiality_marking. The processing_log array lists each pipeline step with its status, duration, and detail. These fields are populated progressively as the document moves through the processing pipeline, so triage may be null for a newly uploaded document.
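Because these fields fill in progressively, code that runs right after an upload or a re-extraction may want to poll until the document reaches a terminal status and to read triage defensively. The following is a minimal sketch under stated assumptions: the status values are the ones listed for list(), 'completed' and 'error' are treated as terminal, and DocLike, waitUntilSettled, and hasPii are hypothetical names. The fetch function is passed in so the loop does not depend on any particular client.

```typescript
type DocStatus = 'pending' | 'processing' | 'completed' | 'error'

// Minimal slice of a document; the real Document type has more fields.
interface DocLike {
  status: DocStatus
  triage?: { pii_detected?: boolean } | null
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

// Poll until the document is 'completed' or 'error', or give up after maxAttempts.
async function waitUntilSettled<D extends DocLike>(
  fetchDoc: () => Promise<D>,
  intervalMs = 2000,
  maxAttempts = 30,
): Promise<D> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const doc = await fetchDoc()
    if (doc.status === 'completed' || doc.status === 'error') return doc
    if (attempt < maxAttempts) await sleep(intervalMs)
  }
  throw new Error('Document still processing after ' + maxAttempts + ' attempts')
}

// Read a triage flag without assuming the block has been populated yet.
function hasPii(doc: DocLike): boolean {
  return doc.triage?.pii_detected ?? false
}
```

With the real client this might be invoked as waitUntilSettled(() => talonic.documents.get('doc_abc123')), assuming get() resolves to an object with a status field as shown earlier. A fixed interval keeps the sketch short; production code would likely add backoff.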

Deleting a document also removes all associated extractions. If you need to preserve extraction history, fetch the data with extractions.getData() before deleting.
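One way to make the "fetch before deleting" advice hard to get wrong is to sequence the three steps in a single helper that only deletes once the data has been fetched and stored. This is a sketch, not SDK code: ArchiveDeps and archiveThenDelete are hypothetical, and because the exact signature of extractions.getData() is not shown here, the fetch, save, and delete steps are supplied by the caller as plain functions.

```typescript
// Caller-supplied steps; wrap the real SDK calls in these (signatures assumed).
interface ArchiveDeps<T> {
  fetchData: (documentId: string) => Promise<T>          // e.g. a wrapper around extractions.getData()
  save: (documentId: string, data: T) => Promise<void>   // write to disk, a bucket, a database...
  remove: (documentId: string) => Promise<unknown>       // e.g. a wrapper around documents.delete()
}

// Fetch, then persist, then delete. If fetching or saving throws,
// the delete step is never reached and the document is left intact.
async function archiveThenDelete<T>(documentId: string, deps: ArchiveDeps<T>): Promise<T> {
  const data = await deps.fetchData(documentId)
  await deps.save(documentId, data)
  await deps.remove(documentId)
  return data
}
```

Since deletion is irreversible, ordering the calls this way means a failed export leaves the document and its extractions untouched.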

Frequently asked questions

How do I list documents with the Talonic SDK?
Call talonic.documents.list() with optional parameters like limit, cursor, status, source_id, after, before, and search. Results are paginated using cursor-based navigation.
What does getMarkdown() return?
It returns the raw OCR markdown output for a document, which is the text the extraction engine uses as input. Useful for debugging or building custom processing pipelines.
Does deleting a document remove its extractions?
Yes. Deleting a document removes all associated extraction results. Retrieve any data you need before calling delete().
How do I filter documents by extracted field values?
Use talonic.documents.filter() with an array of conditions. Each condition specifies a field name or fieldId, an operator (eq, gt, contains, between, etc.), and a value. You can combine multiple conditions, add sort order, and paginate with page and limit parameters.
What is the triage field on documents?
The triage block contains automatic classification output including sensitivity tier, department, jurisdiction, PII detection, regulated data flags, and confidentiality markings. It is populated by the Talonic classification pipeline after document ingestion and may be null for newly uploaded documents.