Format Constraints

Format constraints apply regex-based validation to schema fields. They are evaluated post-extraction in Phase 4 of the pipeline, after all transforms have been applied. Original values are preserved for audit in original_extractions. This means you can always review what the AI originally extracted before the constraint was applied, giving you full visibility into the extraction pipeline even when values are cleared or replaced.

Mismatch behaviors

Parameter   Type       Description
empty       default    If the extracted value does not match the regex, the cell is cleared. The original is preserved for audit.
flag        behavior   The value is kept but flagged with a format_applied indicator. Visible as an amber dot in the results grid.
constant    behavior   The value is replaced with a constant you specify (e.g. "INVALID", "N/A").
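
The three behaviors can be modeled as a small function. This is an illustrative Python sketch only, not the actual implementation: the function name apply_format_constraint, its (value, flagged) return shape, and the use of Python's re module in place of the real evaluator are all assumptions made for the example.

```python
import re
from typing import Optional, Tuple


def apply_format_constraint(
    value: str,
    pattern: str,
    on_format_mismatch: str = "empty",
    constant: Optional[str] = None,
) -> Tuple[Optional[str], bool]:
    """Return (final_value, flagged) for one extracted value (hypothetical model)."""
    if re.search(pattern, value):
        return value, False  # match: the value passes through unchanged
    if on_format_mismatch == "empty":
        return None, False  # mismatch: the cell is cleared
    if on_format_mismatch == "flag":
        return value, True  # mismatch: the value is kept and marked for review
    if on_format_mismatch == "constant":
        return constant, False  # mismatch: replaced with the sentinel value
    raise ValueError(f"unknown mismatch behavior: {on_format_mismatch}")


print(apply_format_constraint("PO 123456", r"^PO-\d{6}$", "flag"))
print(apply_format_constraint("PO 123456", r"^PO-\d{6}$", "constant", "INVALID"))
```

In this model the original value is never mutated, which mirrors the idea that the raw extraction stays available for audit regardless of the behavior chosen.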

Define format constraints in the schema field editor. The pattern uses standard regex syntax with support for inline flags like (?i) for case-insensitive matching. The editor provides a live test input so you can verify the pattern against sample values before saving. This immediate feedback loop helps you catch overly strict or overly permissive patterns before they affect real extraction runs.
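
To illustrate what the inline flag changes, the sketch below compares a strict pattern with its (?i) variant. Python's re module stands in for the evaluator here, which is an assumption; it also accepts (?i) at the start of a pattern, but other edge cases may differ between engines.

```python
import re

strict = re.compile(r"^PO-\d{6}$")
relaxed = re.compile(r"(?i)^PO-\d{6}$")  # (?i) makes the literal "PO" case-insensitive

for value in ["PO-123456", "po-123456"]:
    strict_ok = strict.search(value) is not None
    relaxed_ok = relaxed.search(value) is not None
    print(f"{value}: strict={strict_ok} relaxed={relaxed_ok}")
```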

Format constraints are especially useful for fields with strict formatting requirements in downstream systems. For example, a purchase order number that must follow the pattern PO-\d{6} or a date that must match \d{4}-\d{2}-\d{2}. By catching format violations at extraction time, you avoid importing malformed data into your ERP, accounting, or analytics systems.

Choose the mismatch behavior based on your data quality requirements. Use empty (the default) when you prefer no data over bad data — the downstream system will see a blank cell. Use flag when you want to review mismatches manually before deciding — flagged cells appear with an amber dot in the results grid. Use constant when your downstream system needs a specific sentinel value like "N/A" or "INVALID" to trigger its own error handling.

Add a format constraint to a schema field
curl -X PATCH https://api.talonic.com/v1/schemas/us_def456/fields/fld_po_number \
  -H "Authorization: Bearer $TALONIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "constraints": {
      "format": {
        "type": "regex",
        "pattern": "^PO-\\d{6}$",
        "on_format_mismatch": "flag"
      }
    }
  }'

# Values matching PO-123456 pass through unchanged.
# Values like "PO 123456" or "123456" are flagged with an amber dot.
# Original values are always preserved in original_extractions for audit.
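
Note the doubled backslashes in the request body: inside a JSON string, \\d decodes to the regex token \d. The Python sketch below builds the same body with json.dumps so that the escaping is handled automatically. It only constructs the payload and does not call the API.

```python
import json

# The regex as you would type it in the schema field editor.
pattern = r"^PO-\d{6}$"

body = json.dumps(
    {
        "constraints": {
            "format": {
                "type": "regex",
                "pattern": pattern,
                "on_format_mismatch": "flag",
            }
        }
    },
    indent=2,
)

# The serialized "pattern" value contains \\d, as in the curl example.
print(body)
```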

Date format constraint with optional time component
# Validate ISO date format with optional time component:
curl -X PATCH https://api.talonic.com/v1/schemas/us_def456/fields/fld_date \
  -H "Authorization: Bearer $TALONIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "constraints": {
      "format": {
        "type": "regex",
        "pattern": "^\\d{4}-\\d{2}-\\d{2}(T\\d{2}:\\d{2}:\\d{2})?$",
        "on_format_mismatch": "empty"
      }
    }
  }'

# Values like "2025-03-15" or "2025-03-15T14:30:00" pass.
# Values like "March 15, 2025" are cleared (on_format_mismatch: "empty").
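
The date pattern can be checked locally before saving it. The sketch below uses Python's re module as a stand-in for the evaluator (an assumption; engines can differ on edge cases). It also shows a limitation worth knowing: the pattern validates shape only, so an impossible date such as 2025-13-45 still matches.

```python
import re

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2})?$")

samples = ["2025-03-15", "2025-03-15T14:30:00", "March 15, 2025", "2025-13-45"]

# Map each sample to True (matches the pattern) or False (mismatch).
results = {s: DATE_PATTERN.search(s) is not None for s in samples}

for sample, ok in results.items():
    print(f"{sample!r}: {'match' if ok else 'mismatch'}")
```

If calendar validity matters downstream, a regex alone is not enough; it would need to be paired with a separate date-parsing check.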

Format constraints are one of the most effective tools for ensuring downstream system compatibility. Many ERP and accounting systems reject records with malformed identifiers, dates in an unexpected format, or amounts containing unexpected characters. Catching these issues at extraction time keeps bad data from reaching those systems at all, and the three mismatch behaviors let you choose the trade-off: "empty" when no data is better than bad data, "flag" when you want human review before deciding, and "constant" when your downstream system needs a specific sentinel value to trigger error handling.

The regex evaluator includes ReDoS protection: nested quantifiers are rejected and input is capped at 1,000 characters. Use the (?i) inline flag for case-insensitive matching. Format constraints support standard JavaScript regex syntax, so you can use character classes, alternation, and lookahead assertions for complex validation patterns.
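
To give a feel for what these two safeguards guard against, here is a rough Python sketch. It is not the evaluator's actual logic: the heuristic regex, the helper names has_nested_quantifier and capped, and the choice to truncate (rather than reject) over-long input are all assumptions made for illustration.

```python
import re

MAX_INPUT_CHARS = 1000  # the cap stated above

# Rough heuristic: a group whose body contains + or * and which is itself
# followed by +, * or {. It inspects innermost groups only and ignores
# escapes and character classes, so it is far from a complete detector.
NESTED_QUANTIFIER = re.compile(r"\([^()]*[+*][^()]*\)[+*{]")


def has_nested_quantifier(pattern: str) -> bool:
    """Return True if the pattern looks like it nests quantifiers, e.g. (a+)+."""
    return NESTED_QUANTIFIER.search(pattern) is not None


def capped(value: str) -> str:
    """Assumption: over-long input is truncated; the service may handle it differently."""
    return value[:MAX_INPUT_CHARS]


print(has_nested_quantifier("(a+)+"))         # a classic backtracking-prone pattern
print(has_nested_quantifier(r"^PO-\d{6}$"))   # the purchase order pattern
print(len(capped("x" * 5000)))
```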

Frequently asked questions

What are format constraints?
Format constraints apply regex-based validation to schema fields, evaluated post-extraction in Phase 4 after all transforms have been applied. Mismatch behaviors: empty (clear the cell, the default), flag (keep the value but show an amber dot in the results grid), or constant (replace with a fixed value like "INVALID" or "N/A"). The constraint validates the final transformed value, not the raw extraction.
Are original values preserved when format constraints clear a cell?
Yes. Original values are always preserved for audit in the original_extractions table, regardless of the mismatch behavior applied. This means you can always review what the AI originally extracted before the constraint was applied, giving you full visibility into the extraction pipeline.
Can I use case-insensitive regex patterns?
Yes. Use the (?i) inline flag at the start of your pattern for case-insensitive matching. The evaluator supports standard JavaScript regex syntax including character classes, alternation, and lookahead assertions. ReDoS protection is built in — nested quantifiers are rejected and input is capped at 1,000 characters.
What happens if my regex pattern causes performance issues?
The evaluator includes built-in ReDoS (Regular Expression Denial of Service) protection. Patterns with nested quantifiers like (a+)+ are automatically rejected at save time. Additionally, input values are capped at 1,000 characters, preventing pathological backtracking on unexpectedly long strings. These safeguards run transparently — you do not need to optimize your patterns manually for performance.