Four-Phase Pipeline

Every document in Talonic passes through a four-phase extraction pipeline: Resolve, Agent, Validate, and Re-read. Each phase adds confidence scores and per-cell provenance.

Pipeline overview

| Phase | Purpose | Output |
| --- | --- | --- |
| 1. Resolve | Parse document, classify type, match schema fields to Field Registry | Document model with field mappings |
| 2. Agent | AI extraction of values from document regions | Raw extraction with bounding boxes |
| 3. Validate | Type checking, confidence scoring, constraint enforcement | Validated data with confidence scores |
| 4. Re-read | Cross-check low-confidence values against source | Final data with provenance traces |

Phase 1: Resolve

The Resolve phase parses the document into a structured model. It identifies the document type using the 529-type ontology, maps schema fields to Field Registry entries, and prepares the extraction plan. For documents submitted without a schema, this phase also performs field discovery.

Phase 2: Agent

The Agent phase runs AI extraction against the document model. It locates values in the source document and extracts them with bounding box coordinates. Each extraction is tied to a specific region of the original document, forming the basis for per-cell provenance.
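To make the link between an extracted value and its source region concrete, here is a minimal sketch of what such a record could look like. The class names, field names, and coordinate layout (page number plus x, y, width, height) are assumptions for illustration only, not Talonic's actual output schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    # Hypothetical layout: page index plus a rectangle in page units.
    page: int
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class Extraction:
    # One extracted value tied to the region it was read from.
    field_name: str
    value: str
    box: BoundingBox

    def provenance(self) -> str:
        """Describe where in the source document this value came from."""
        b = self.box
        return f"{self.field_name} <- page {b.page} at ({b.x}, {b.y}), {b.width}x{b.height}"


ex = Extraction("invoice_number", "INV-1042", BoundingBox(1, 72.0, 96.5, 120.0, 14.0))
print(ex.provenance())
```

Keeping the box on the extraction record itself means each cell can later be traced back to its source region without a separate lookup.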

Phase 3: Validate

Validation applies type constraints, format checks, and the confidence gate. Values whose confidence score meets the gate threshold are accepted automatically. Values below the gate are flagged for human review. The Schema Graph enforces field-level constraints from the schema definition.

Confidence gate

The confidence gate is a configurable threshold (default 0.85) that determines whether extracted values are accepted automatically or flagged for review. Adjust the gate per schema to balance automation with accuracy. Values below the gate remain in the extraction output but are marked as requiring human verification.
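The gate logic described above can be sketched in a few lines. The `Cell` class and `apply_gate` function are hypothetical names, and treating a value exactly at the threshold as accepted is an assumption; only the 0.85 default and the "flag but keep" behavior come from the text.

```python
from dataclasses import dataclass

DEFAULT_GATE = 0.85  # default threshold stated in the text


@dataclass
class Cell:
    # Hypothetical shape of one extracted value.
    field_name: str
    value: str
    confidence: float
    needs_review: bool = False


def apply_gate(cells, gate=DEFAULT_GATE):
    """Flag cells below the gate for review; every cell stays in the output."""
    for cell in cells:
        # Assumption: a score equal to the gate counts as accepted.
        cell.needs_review = cell.confidence < gate
    return cells


cells = apply_gate([
    Cell("invoice_number", "INV-1042", 0.97),
    Cell("due_date", "2024-03-01", 0.62),
])
flagged = [c.field_name for c in cells if c.needs_review]
print(flagged)
```

Passing a different `gate` value per call mirrors the per-schema adjustment: a higher gate sends more values to review, a lower one accepts more automatically.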

Phase 4: Re-read

The Re-read phase cross-checks flagged and low-confidence values against the source document. It performs a second pass using different extraction strategies to improve accuracy. Values that improve above the confidence gate after re-reading are accepted. The reasoning trace records both extraction attempts.
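As a rough illustration of the second pass, the sketch below appends a new attempt to a flagged cell, re-evaluates it against the gate, and keeps both attempts as a simple trace. All names (`Attempt`, `CellResult`, `reread`, the strategy labels) are hypothetical, and choosing the highest-confidence attempt as the final value is an assumption rather than documented behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    strategy: str
    value: str
    confidence: float


@dataclass
class CellResult:
    field_name: str
    attempts: list = field(default_factory=list)  # trace of every attempt
    accepted: bool = False

    @property
    def best(self) -> Attempt:
        # Assumption: the highest-confidence attempt wins.
        return max(self.attempts, key=lambda a: a.confidence)


def reread(cell, second_pass, gate=0.85):
    """Run a second pass on a flagged cell and re-check it against the gate."""
    if cell.accepted:
        return cell  # already accepted, nothing to re-read
    cell.attempts.append(second_pass(cell.field_name))
    cell.accepted = cell.best.confidence >= gate
    return cell


cell = CellResult("total", [Attempt("agent", "1,280.00", 0.71)])
reread(cell, lambda name: Attempt("targeted_reread", "1,280.00", 0.93))
print(cell.accepted, [a.strategy for a in cell.attempts])
```

Because the first attempt is never discarded, the trace shows both how the value was originally read and why it was eventually accepted.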

Monitoring pipeline progress

The extract endpoint returns a job ID for asynchronous tracking. Track pipeline progress through the jobs API: each job reports its current phase and completion percentage. Use webhooks to receive a notification when extraction completes.
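Where webhooks are not an option, polling is the usual fallback. The sketch below shows a generic polling loop; it deliberately takes the status lookup as a caller-supplied function, because the real jobs API URL, authentication, and response fields are not given here. The `status`, `phase`, and `percent` keys and the `completed`/`failed` values are assumptions.

```python
import time


def wait_for_job(fetch_status, job_id, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll a job until it finishes, printing phase and completion percentage.

    fetch_status is a hypothetical caller-supplied function returning a dict
    such as {"status": "running", "phase": "validate", "percent": 60}.
    """
    waited = 0.0
    while True:
        job = fetch_status(job_id)
        print(f"{job_id}: phase={job['phase']} {job['percent']}%")
        if job["status"] in ("completed", "failed"):
            return job
        if waited >= timeout:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        sleep(interval)
        waited += interval


# Stand-in for a real HTTP call so the sketch runs without network access.
_responses = iter([
    {"status": "running", "phase": "resolve", "percent": 25},
    {"status": "running", "phase": "agent", "percent": 50},
    {"status": "completed", "phase": "re-read", "percent": 100},
])
result = wait_for_job(lambda job_id: next(_responses), "job_123", sleep=lambda s: None)
```

In real use, `fetch_status` would wrap an authenticated request to the jobs API with the job ID returned by the extract endpoint.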

Frequently asked questions

What is the Field Registry?
The unified knowledge graph of all canonical fields discovered across documents. Fields are organized into three tiers based on frequency: Tier 1 (core), Tier 2 (established), Tier 3 (emerging).

How does the 4-phase extraction pipeline work?
Phase 1 (Resolve) fills ~30% of cells from graph matches, with no AI needed. Phase 2 (Agent) uses AI strategies. Phase 3 (Validate) runs cross-field checks. Phase 4 (Re-read) fills remaining gaps with targeted document re-reading.

What are cases in Talonic?
Cases are groups of 2+ documents connected through shared entities (names, reference numbers, project codes). They are automatically discovered by the linking pipeline and include evidence chains and AI narration.

How does the confidence gate work?
Once a cell is filled with confidence ≥ 0.7, no later pipeline phase can overwrite it. This prevents high-confidence lookup results (0.95) from being replaced by lower-confidence agent extractions (0.65).

What file formats are supported?
25+ formats across three paths: text fast-path (TXT, MD, HTML, JSON, CSV), AI Vision (PNG, JPG, GIF, WEBP), and OCR (PDF, DOCX, PPTX, XLSX, MSG, BMP). ZIP archives are unpacked automatically.
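
The write-once behavior described in the confidence gate answer above can be sketched as follows. The function and dictionary layout are hypothetical; only the 0.7 threshold and the 0.95 versus 0.65 example come from the FAQ. Letting an unlocked cell be overwritten by any later write is an assumption.

```python
LOCK_THRESHOLD = 0.7  # value quoted in the FAQ answer


def write_cell(table, key, value, confidence, phase):
    """Write a cell unless an earlier phase already filled it confidently."""
    current = table.get(key)
    if current is not None and current["confidence"] >= LOCK_THRESHOLD:
        return False  # locked: later phases cannot overwrite it
    # Assumption: unlocked cells accept any later write.
    table[key] = {"value": value, "confidence": confidence, "phase": phase}
    return True


table = {}
write_cell(table, "vendor", "Acme GmbH", 0.95, "resolve")          # lookup result
overwritten = write_cell(table, "vendor", "Acme", 0.65, "agent")   # rejected
print(overwritten, table["vendor"]["value"])
```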