Schema Features Reference

Every field in a template supports advanced features beyond the basic name and type. These features control how values are extracted, validated, transformed, and delivered. You can layer features independently: a single field can have a format constraint, a reference table for code lookup, modifiers for post-processing, and an output name remap for delivery. Features compose without conflicts, giving you fine-grained control over every aspect of the extraction and output pipeline.

In the production Spec and Pipeline path, value normalization moves out of the field and into Data Policies. The per-field resolution features below (modifiers, reference-table lookups, and resolution chains) are the legacy on-field form used by the quick Job tier. When you compose a Spec, express the same transforms and lookups as Data Policies attached to the rail's Resolution stage, where they are versioned, reusable across schemas, and applied as their own pipeline phase. Format constraints, bypass strategies, and output remaps remain field-level in both paths.

Field features

Feature	Kind	Description
Format constraint	regex	Validates extracted values against a regex pattern. Failing values can be emptied, flagged, or replaced with a constant.
Modifiers	pipeline	Post-processing transforms applied in order: format (date/number), alias (value mapping), max_length (truncation).
Constraints	validation	Rules evaluated after modifiers: required, enum, date-format, length, cross-field expressions.
Bypass strategy	skip LLM	Fields that don't need extraction: constant (fixed value), generator (auto-ID), reference (lookup from reference table).
Reference table	key-value	Inline lookup table for code mapping (e.g., country name → ISO code). Also supports multi-hop resolution chains.
Manual instruction	text	User-written extraction directive. Overrides the AI-synthesized master instruction from the field registry.
Capture submoves	array	Ordered execution: match (field matching), compute (calculation), reason (LLM inference).
Output name	string	Renamed field in export output. The internal name stays the same.

When configuring a field, start with the basics (name, type, and registry mapping), then layer on advanced features as needed. For example, add a format constraint to enforce a date pattern, attach a reference table for code lookups, or define capture submoves to control the exact extraction sequence.

Format constraint — Regex validation with configurable mismatch behavior (clear, flag, or replace).
Modifiers — Post-processing pipeline: format (date/number conversion), alias (value mapping), max_length (truncation).
Constraints — Validation rules: required, enum, date-format, length, cross-field expressions.
Bypass strategy — Skip AI extraction: constant value, deterministic ID generator, or reference table lookup.
Reference table — Key-value pairs for code mapping with a 3-tier lookup cascade (normalization, fuzzy, AI).
Manual instruction — User-written extraction directive that overrides the AI-synthesized master instruction.
Capture submoves — Ordered extraction sequence: match (field matching), compute (calculation), reason (LLM inference).
Output name — Remap the field name in delivery and export output without changing the internal schema name.
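The reference table entry above mentions a 3-tier lookup cascade (normalization, fuzzy, AI). The following Python sketch illustrates that idea only; it is not the product's implementation. The names `lookup_code`, `normalize`, and `ai_fallback`, the sample table, and the 0.8 similarity cutoff are all assumptions made for illustration, and the fuzzy tier simply uses the standard library's `difflib`.

```python
from difflib import get_close_matches
from typing import Callable, Dict, Optional

# Hypothetical inline reference table: country name -> ISO code.
COUNTRY_CODES = {"Germany": "DE", "France": "FR", "United States": "US"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def lookup_code(
    raw: str,
    table: Dict[str, str],
    ai_fallback: Optional[Callable[[str], Optional[str]]] = None,
    cutoff: float = 0.8,  # assumed threshold, not a documented value
) -> Optional[str]:
    normalized = {normalize(k): v for k, v in table.items()}
    key = normalize(raw)

    # Tier 1: exact match after normalization.
    if key in normalized:
        return normalized[key]

    # Tier 2: fuzzy match against the normalized keys.
    close = get_close_matches(key, list(normalized), n=1, cutoff=cutoff)
    if close:
        return normalized[close[0]]

    # Tier 3: defer to an AI resolver, represented here by a plain callback.
    return ai_fallback(raw) if ai_fallback else None


print(lookup_code("  germany ", COUNTRY_CODES))  # DE (tier 1)
print(lookup_code("Germny", COUNTRY_CODES))      # DE (tier 2)
print(lookup_code("Atlantis", COUNTRY_CODES))    # None (no AI resolver supplied)
```

Ordering the tiers from cheapest to most expensive means the AI resolver is only consulted when the deterministic tiers find nothing.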

The modifier pipeline runs in a fixed order during Phase 4 of the extraction pipeline: format transforms first (converting dates or numbers to your target format), then alias mapping (replacing values using a lookup), and finally max_length truncation. Constraint evaluation happens after all modifiers have been applied, so constraints validate the final transformed value, not the raw extraction.
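The ordering described above can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the actual engine: `apply_modifiers` and `check_constraints` are hypothetical helpers, and the `format` modifier's `from`/`to` keys and the shape of the `alias` map are invented for the example (only `max_length` appears in the sample configuration on this page).

```python
from datetime import datetime
from typing import Any, Dict, List


def apply_modifiers(value: str, modifiers: Dict[str, Any]) -> str:
    """Apply modifiers in the fixed order: format, alias, max_length."""
    # 1. format: convert a date from one layout to another (hypothetical keys).
    fmt = modifiers.get("format")
    if fmt:
        value = datetime.strptime(value, fmt["from"]).strftime(fmt["to"])

    # 2. alias: replace the value if it appears in the mapping.
    value = modifiers.get("alias", {}).get(value, value)

    # 3. max_length: truncate last, so aliases see the full value.
    max_length = modifiers.get("max_length")
    if max_length is not None:
        value = value[:max_length]
    return value


def check_constraints(value: str, constraints: Dict[str, Any]) -> List[str]:
    """Return the names of violated constraints for the final value."""
    errors = []
    if constraints.get("required") and not value:
        errors.append("required")
    allowed = constraints.get("enum")
    if allowed is not None and value not in allowed:
        errors.append("enum")
    return errors


modifiers = {"alias": {"United States Dollar": "USD"}, "max_length": 3}
constraints = {"required": True, "enum": ["USD", "EUR"]}

final = apply_modifiers("United States Dollar", modifiers)
print(final)                                  # USD
print(check_constraints(final, constraints))  # []
```

Because constraints see only the transformed value, the raw string "United States Dollar" would fail the `enum` check, while the aliased result passes. Running truncation before the alias step would instead yield "Uni", which is why the order matters.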

For best results, use manual instructions sparingly and only for fields that the registry cannot match. A well-written instruction should describe the field in plain language, specify where in the document to look, and note any formatting expectations. Avoid vague instructions like "extract the value" — instead, write something like "Extract the net payment amount from the invoice summary section, excluding VAT."

Example field configuration: format constraint, modifier, output remap

{
  "display_name": "Purchase Order Number",
  "data_type": "string",
  "manual_instruction": "Extract the PO number from the order reference section",
  "constraints": {
    "format": {
      "type": "regex",
      "pattern": "PO-\\d{6}",
      "on_format_mismatch": "flag"
    }
  },
  "modifiers": {
    "max_length": 20
  },
  "output_name": "po_number"
}
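To make the mismatch behaviors concrete, here is a small Python sketch of how a format constraint like the one above might be evaluated. It rests on several assumptions: `apply_format_constraint` is a hypothetical helper, the pattern is assumed to apply to the whole value (hence `re.fullmatch`), and the literal values "clear" and "replace" are inferred from the feature description, since only "flag" appears in the sample configuration.

```python
import re
from typing import Optional, Tuple


def apply_format_constraint(
    value: str,
    pattern: str,
    on_format_mismatch: str,
    replacement: Optional[str] = None,
) -> Tuple[Optional[str], bool]:
    """Return (resulting value, flagged) after checking value against pattern."""
    # Assumption: the whole value must match, not just a substring.
    if re.fullmatch(pattern, value):
        return value, False

    if on_format_mismatch == "clear":
        return None, False        # empty the field
    if on_format_mismatch == "flag":
        return value, True        # keep the value, mark it for review
    if on_format_mismatch == "replace":
        return replacement, False  # substitute a constant
    raise ValueError(f"unknown mismatch behavior: {on_format_mismatch}")


print(apply_format_constraint("PO-123456", r"PO-\d{6}", "flag"))  # ('PO-123456', False)
print(apply_format_constraint("PO-12", r"PO-\d{6}", "flag"))      # ('PO-12', True)
print(apply_format_constraint("PO-12", r"PO-\d{6}", "clear"))     # (None, False)
```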

Example field configuration: constant bypass strategy

{
  "display_name": "Currency",
  "data_type": "string",
  "strategy": "constant",
  "constant_value": "USD"
}

The field always resolves to "USD" without any LLM call. Bypass strategies execute during Phase 1, before AI extraction.
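The following Python sketch models how bypass strategies could resolve a field before extraction, including the fall-through to LLM extraction when a bypass produces no value. It is an assumption-laden illustration: `try_bypass` and `resolve_field` are hypothetical helpers, the `generator_seed` and `reference_table` keys are invented for the example, and the UUIDv5 generator is just one way to obtain a deterministic ID. Only `strategy` and `constant_value` come from the configuration above.

```python
import uuid
from typing import Any, Callable, Dict, Optional


def try_bypass(field: Dict[str, Any], lookup_key: Optional[str] = None) -> Optional[str]:
    """Resolve a field without an LLM call, or return None if that fails."""
    strategy = field.get("strategy")
    if strategy == "constant":
        return field.get("constant_value")
    if strategy == "generator":
        seed = field.get("generator_seed")  # hypothetical key
        # Same seed always yields the same ID.
        return str(uuid.uuid5(uuid.NAMESPACE_URL, seed)) if seed else None
    if strategy == "reference":
        table = field.get("reference_table", {})  # hypothetical key
        return table.get(lookup_key)
    return None


def resolve_field(
    field: Dict[str, Any],
    llm_extract: Callable[[Dict[str, Any]], str],
    lookup_key: Optional[str] = None,
) -> str:
    """Try the bypass first; fall through to LLM extraction as a safety net."""
    value = try_bypass(field, lookup_key)
    if value is not None:
        return value
    return llm_extract(field)


currency = {
    "display_name": "Currency",
    "data_type": "string",
    "strategy": "constant",
    "constant_value": "USD",
}
print(resolve_field(currency, llm_extract=lambda f: "from-llm"))  # USD
```

In this sketch the `llm_extract` callback stands in for the AI extraction phase and is only invoked when the bypass returns nothing.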

Schema features can be combined to build sophisticated field definitions. For example, a "Vendor Code" field might use a reference table for code mapping, a format constraint to validate the output format (^V\d{5}$), an alias modifier to normalize legacy codes, and an output name remap for the downstream ERP system. Each feature operates at a different stage of the pipeline — bypass strategies in Phase 1, extraction instructions in Phase 2, reference table lookups in Phases 1 and 3, and modifiers plus format constraints in Phase 4 — so they compose without conflicts.

Three features have their own deep-dive pages: Format Constraints for regex validation and mismatch behaviors, Bypass Strategies for skipping LLM extraction entirely, and Reference Tables for the 3-tier code lookup cascade.

Frequently asked questions

What advanced features do schema fields support?

Format constraints (regex), modifiers (format/alias/max_length), constraints (required/enum/date-format), bypass strategies, reference tables, manual instructions, capture submoves, and output name remapping.

Can I override AI extraction instructions with my own?

Yes. Use the Manual instruction feature on a schema field. User-written instructions override the AI-synthesized master instruction from the field registry.

In what order are modifiers applied to extracted values?

Modifiers run in a fixed order: format (date/number conversion) first, then alias (value mapping), then max_length (truncation). Constraints are evaluated after all modifiers complete.

How do I configure a field to skip LLM extraction entirely?

Use a bypass strategy on the field. Set strategy to "constant" for a fixed value, "generator" for deterministic IDs, or "reference" for lookup-based resolution. Bypass strategies execute during Phase 1 at zero AI cost. If the bypass fails to produce a value, the field falls through to LLM extraction as a safety net.
