Schema Features Reference

Every field in a template supports advanced features beyond the basic name and type. These features control how values are extracted, validated, transformed, and delivered. You can layer features independently — for example, a single field can have a format constraint, a reference table for code lookup, modifiers for post-processing, and an output name remap for delivery. Features compose without conflicts, giving you fine-grained control over every aspect of the extraction and output pipeline.

Field features

| Parameter | Type | Description |
| --- | --- | --- |
| Format constraint | regex | Validates extracted values against a regex pattern. Failing values can be emptied, flagged, or replaced with a constant. |
| Modifiers | pipeline | Post-processing transforms applied in order: format (date/number), alias (value mapping), max_length (truncation). |
| Constraints | validation | Rules evaluated after modifiers: required, enum, date-format, length, cross-field expressions. |
| Bypass strategy | skip LLM | Fields that don't need extraction: constant (fixed value), generator (auto-ID), reference (lookup from reference table). |
| Reference table | key-value | Inline lookup table for code mapping (e.g., country name → ISO code). Also supports multi-hop resolution chains. |
| Manual instruction | text | User-written extraction directive. Overrides the AI-synthesized master instruction from the field registry. |
| Capture submoves | array | Ordered execution: match (field matching), compute (calculation), reason (LLM inference). |
| Output name | string | Renamed field in export output. The internal name stays the same. |

When configuring a field, start with the basics (name, type, and registry mapping), then layer on advanced features as needed. For example, add a format constraint to enforce a date pattern, attach a reference table for code lookups, or define capture submoves to control the exact extraction sequence.

  • Format constraint — Regex validation with configurable mismatch behavior (clear, flag, or replace).
  • Modifiers — Post-processing pipeline: format (date/number conversion), alias (value mapping), max_length (truncation).
  • Constraints — Validation rules: required, enum, date-format, length, cross-field expressions.
  • Bypass strategy — Skip AI extraction: constant value, deterministic ID generator, or reference table lookup.
  • Reference table — Key-value pairs for code mapping with a 3-tier lookup cascade (normalization, fuzzy, AI).
  • Manual instruction — User-written extraction directive that overrides the AI-synthesized master instruction.
  • Capture submoves — Ordered extraction sequence: match (field matching), compute (calculation), reason (LLM inference).
  • Output name — Remap the field name in delivery and export output without changing the internal schema name.
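The three-tier lookup cascade named in the reference table bullet (normalization, fuzzy, AI) can be pictured roughly as follows. This is an illustrative sketch, not the platform's implementation: the function names, the 0.8 fuzzy cutoff, and the use of Python's difflib are all assumptions, and the AI tier is reduced to an optional callback.

```python
import difflib
from typing import Callable, Optional

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different keys compare equal."""
    return " ".join(text.lower().split())

def lookup(
    raw: str,
    table: dict[str, str],
    ai_fallback: Optional[Callable[[str], Optional[str]]] = None,
    cutoff: float = 0.8,  # assumed threshold, not a documented value
) -> Optional[str]:
    keys = {normalize(k): v for k, v in table.items()}
    needle = normalize(raw)

    # Tier 1: exact match after normalization.
    if needle in keys:
        return keys[needle]

    # Tier 2: fuzzy match against the normalized keys.
    close = difflib.get_close_matches(needle, list(keys), n=1, cutoff=cutoff)
    if close:
        return keys[close[0]]

    # Tier 3: hand the value to a model (stubbed here as a plain callback).
    return ai_fallback(needle) if ai_fallback else None

# Hypothetical country name -> ISO code table.
countries = {"Germany": "DE", "France": "FR", "United States": "US"}
```

With this table, `lookup("  GERMANY ", countries)` resolves in the first tier and a misspelling such as `"Germny"` resolves in the second. Ordering the tiers from cheapest to most expensive means a model is only consulted when the deterministic tiers find nothing.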

The modifier pipeline runs in a fixed order during Phase 4 of the extraction pipeline: format transforms first (converting dates or numbers to your target format), then alias mapping (replacing values using a lookup), and finally max_length truncation. Constraint evaluation happens after all modifiers have been applied, so constraints validate the final transformed value, not the raw extraction.
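To make the ordering concrete, here is a minimal Python sketch of a format, alias, max_length pipeline followed by constraint checks. Only the stage names and their order come from this page; the shape of the `format` and `alias` configuration (`type`, `from`, `to`, a plain mapping) and both function names are assumptions made for illustration.

```python
from datetime import datetime

def apply_modifiers(value: str, modifiers: dict) -> str:
    # 1. format: convert a date between strftime patterns (hypothetical config shape).
    fmt = modifiers.get("format")
    if fmt and fmt.get("type") == "date":
        value = datetime.strptime(value, fmt["from"]).strftime(fmt["to"])

    # 2. alias: replace the value if it appears in the mapping.
    value = modifiers.get("alias", {}).get(value, value)

    # 3. max_length: truncate last, so it sees the transformed value.
    max_length = modifiers.get("max_length")
    if max_length is not None:
        value = value[:max_length]
    return value

def check_constraints(value: str, constraints: dict) -> list[str]:
    """Return the names of violated constraints; runs on the post-modifier value."""
    errors = []
    if constraints.get("required") and not value:
        errors.append("required")
    allowed = constraints.get("enum")
    if allowed is not None and value not in allowed:
        errors.append("enum")
    return errors

currency_modifiers = {"alias": {"United States Dollar": "USD"}, "max_length": 3}
final = apply_modifiers("United States Dollar", currency_modifiers)
violations = check_constraints(final, {"required": True, "enum": ["USD", "EUR"]})
```

Because the enum check runs on `final` rather than on the raw text, the aliased value passes a constraint that the raw extraction would have failed.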

For best results, use manual instructions sparingly and only for fields that the registry cannot match. A well-written instruction should describe the field in plain language, specify where in the document to look, and note any formatting expectations. Avoid vague instructions like "extract the value" — instead, write something like "Extract the net payment amount from the invoice summary section, excluding VAT."

Add a field with a manual instruction, format constraint, modifier, and output name
curl -X POST https://api.talonic.com/v1/schemas/us_def456/fields \
  -H "Authorization: Bearer $TALONIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "display_name": "Purchase Order Number",
    "data_type": "string",
    "manual_instruction": "Extract the PO number from the order reference section",
    "constraints": {
      "format": {
        "type": "regex",
        "pattern": "PO-\\d{6}",
        "on_format_mismatch": "flag"
      }
    },
    "modifiers": {
      "max_length": 20
    },
    "output_name": "po_number"
  }'
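
The effect of `on_format_mismatch` in the request above can be sketched locally. This Python snippet is a guess at the behavior, not the service's code: it assumes the pattern must match the whole value, that "flag" keeps the value and marks it for review, and that a replacement constant is supplied under a hypothetical `replacement` key.

```python
import re
from typing import Optional

def apply_format_constraint(value: str, constraint: dict) -> tuple[Optional[str], bool]:
    """Return (value, flagged). Assumes the pattern must match the whole value."""
    if re.fullmatch(constraint["pattern"], value):
        return value, False

    action = constraint.get("on_format_mismatch", "flag")
    if action == "clear":
        return None, False  # empty the field
    if action == "replace":
        return constraint["replacement"], False  # hypothetical key name
    return value, True  # "flag": keep the value, mark it for review

po_rule = {"type": "regex", "pattern": r"PO-\d{6}", "on_format_mismatch": "flag"}
```

Under these assumptions, `apply_format_constraint("PO-123456", po_rule)` passes the value through unflagged, while a truncated value such as `"PO-12"` is returned unchanged with the flag set.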
Configure a bypass strategy (constant value)
curl -X PATCH https://api.talonic.com/v1/schemas/us_def456/fields/fld_xyz \
  -H "Authorization: Bearer $TALONIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "strategy": "constant",
    "constant_value": "USD"
  }'

# The field will always resolve to "USD" without any LLM call.
# Bypass strategies execute during Phase 1, before AI extraction.
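
As a rough mental model, bypass resolution can be thought of as a dispatch on `strategy` that runs before any model call, with LLM extraction used only when the bypass yields nothing. The Python sketch below is illustrative: `strategy` and `constant_value` appear in the request above, while the hash-based generator, the `reference_table` and `lookup_key` keys, and both function names are assumptions.

```python
import hashlib
from typing import Callable, Optional

def resolve_bypass(field: dict, document_id: str) -> Optional[str]:
    strategy = field.get("strategy")
    if strategy == "constant":
        return field.get("constant_value")
    if strategy == "generator":
        # Deterministic ID: the same document and field always give the same value.
        seed = f"{document_id}:{field['name']}".encode()
        return hashlib.sha256(seed).hexdigest()[:12]
    if strategy == "reference":
        # Hypothetical keys: look up a known source value in an inline table.
        return field.get("reference_table", {}).get(field.get("lookup_key"))
    return None

def resolve_field(field: dict, document_id: str, llm_extract: Callable[[dict], str]) -> str:
    value = resolve_bypass(field, document_id)
    if value is not None:
        return value  # resolved without a model call
    return llm_extract(field)  # fall through to extraction

currency = {"name": "currency", "strategy": "constant", "constant_value": "USD"}
```

Here `resolve_field(currency, "doc_1", llm_extract=...)` returns "USD" without ever invoking the callback.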

Schema features can be combined to build sophisticated field definitions. For example, a "Vendor Code" field might use a reference table for code mapping, a format constraint to validate the output format (^V\d{5}$), an alias modifier to normalize legacy codes, and an output name remap for the downstream ERP system. Each feature operates at a different stage of the pipeline — bypass strategies in Phase 1, extraction instructions in Phase 2, reference table lookups in Phases 1 and 3, and modifiers plus format constraints in Phase 4 — so they compose without conflicts.
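
The "Vendor Code" example can be traced end to end with a small Python sketch. Everything here other than the `^V\d{5}$` pattern is invented for illustration: the field dictionary layout, the sample table entries, the `erp_vendor_id` output name, and the `process` function are assumptions, and the real pipeline performs these steps in separate phases rather than in one function.

```python
import re

vendor_field = {
    "name": "vendor_code",
    "reference_table": {"acme corp": "V00017", "globex": "OLD-42"},
    "alias": {"OLD-42": "V00042"},
    "pattern": r"^V\d{5}$",
    "output_name": "erp_vendor_id",
}

def process(field: dict, extracted: str) -> tuple[dict, bool]:
    # Reference table lookup (Phases 1 and 3); keep the raw text if no key matches.
    value = field["reference_table"].get(extracted.strip().lower(), extracted)
    # Alias modifier (Phase 4) normalizes legacy codes.
    value = field["alias"].get(value, value)
    # Format constraint (Phase 4) validates the final, transformed value.
    flagged = re.search(field["pattern"], value) is None
    # Output name remap changes only the delivered key, not the internal name.
    return {field["output_name"]: value}, flagged
```

In this sketch `process(vendor_field, "Globex")` maps the vendor name to a legacy code, normalizes it through the alias, and delivers a value that satisfies the pattern under the remapped key. An unrecognized vendor passes through unchanged and is flagged by the format check.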

For the complete JSON Schema specification with all features, see the Full Schema Reference in the Platform Guide.

Frequently asked questions

What advanced features do schema fields support?
Format constraints (regex), modifiers (format/alias/max_length), constraints (required/enum/date-format), bypass strategies, reference tables, manual instructions, capture submoves, and output name remapping.
Can I override AI extraction instructions with my own?
Yes. Use the Manual instruction feature on a schema field. User-written instructions override the AI-synthesized master instruction from the field registry.
In what order are modifiers applied to extracted values?
Modifiers run in a fixed order: format (date/number conversion) first, then alias (value mapping), then max_length (truncation). Constraints are evaluated after all modifiers complete.
How do I configure a field to skip LLM extraction entirely?
Use a bypass strategy on the field. Set strategy to "constant" for a fixed value, "generator" for deterministic IDs, or "reference" for lookup-based resolution. Bypass strategies execute during Phase 1 at zero AI cost. If the bypass fails to produce a value, the field falls through to LLM extraction as a safety net.