
Extract data from Form 1040

Form 1040 is the US individual income tax return. Every US taxpayer with a filing obligation files one each year by April 15 (or October 15 with an extension). The form has been the centerpiece of the US tax code since 1913, but its modern layout is the post-Tax Cuts and Jobs Act version with the consolidated single-page front and a list of schedules attached as needed: Schedule A for itemized deductions, Schedule B for interest and dividends, Schedule C for self-employment, Schedule D for capital gains, Schedule E for rental and partnership income, and Schedule SE for self-employment tax.

Lenders verifying applicant income for mortgages and small business loans, accountants preparing the next year's return, financial advisors reviewing a client's tax position, and the IRS itself all need to read the 1040 quickly and accurately.

The hard parts are the line numbers and the schedules. Line 1a is wages, salaries, and tips. Line 2b is taxable interest. Line 3a is qualified dividends. Line 7 is capital gain or loss. Line 11 is adjusted gross income (AGI), the most cited number on the form because lenders, financial aid offices, and the IRS itself use AGI as the bright-line input to dozens of downstream calculations. Line 12 is the standard or itemized deduction. Line 15 is taxable income. Line 16 is the tax owed before credits. Line 24 is the total tax. Line 33 is the total payments. Line 34 is the overpayment (the refund, when payments exceed the tax), and Line 37 is the amount owed (when the tax exceeds payments). Dependents, filing status (Single, MFJ, MFS, HoH, QSS), and the taxpayer/spouse identification block sit at the top.

Talonic extracts every line of the 1040, the taxpayer and spouse identification block, dependent rows, and any attached schedules. Amounts are returned as decimal dollar values: cents are preserved when the source shows them and set to zero when the source rounds to whole dollars. Per-cell confidence and pixel-region provenance let a lender or accountant audit any number against the source 1040 before relying on it for an income-verification decision.
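The rule for amounts (cents kept when printed, zero cents when the source shows whole dollars) can be illustrated with a short sketch. The helper name `parse_amount` is hypothetical and is not part of any Talonic API; treating parenthesized figures as negative is an assumption based on the usual convention on tax forms.

```python
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Normalize a printed dollar amount to a Decimal with two decimal places.

    Hypothetical helper for illustration only.
    """
    cleaned = text.strip().replace("$", "").replace(",", "")
    # Assumption: an amount printed in parentheses is a negative number.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    # quantize pads whole-dollar input with ".00" and keeps printed cents.
    amount = Decimal(cleaned).quantize(Decimal("0.01"))
    return -amount if negative else amount


print(parse_amount("$148,500"))   # whole dollars in the source
print(parse_amount("$1,420.50"))  # cents shown in the source
```

Using `Decimal` rather than `float` avoids binary rounding surprises when amounts are later summed or compared.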

What gets extracted from Form 1040

Field                                    Example value
Tax Year                                 2024
Filing Status                            Married Filing Jointly
Taxpayer Name & SSN                      Jane Q. Smith, XXX-XX-1234
Spouse Name & SSN                        John R. Smith, XXX-XX-5678
Dependents                               2 qualifying children
Line 1a: Wages                           $148,500.00
Line 11: AGI                             $152,200.00
Line 12: Standard / Itemized Deduction   $29,200.00 (standard, MFJ 2024)
Line 16: Tax                             $19,418.00
Line 24: Total Tax                       $19,418.00
Line 34 / 37: Refund or Amount Owed      $1,420.00 refund

How extraction works for Form 1040

Form 1040 layouts shift slightly from year to year. Talonic classifies each 1040 by tax year and revision (2023 vs 2024 vs prior versions) and runs it through the personal tax schema in the Field Registry, which encodes every numbered line plus the identification blocks. SSNs are detected and partially masked at extraction by default for downstream privacy. AGI on Line 11 is captured as a first-class field because so many downstream workflows (lender income verification, FAFSA, IRMAA Medicare premium calculation) depend on it. Attached schedules are detected when bundled in the same PDF and extracted as separate schema instances linked to the parent 1040. Per-cell confidence with pixel-region provenance is reported in conformity with DIN SPEC 91491, so an underwriter or tax preparer can audit Line 11 or any other figure against the source 1040 before relying on it.
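As a rough illustration of how per-cell confidence and pixel-region provenance might be consumed downstream, the sketch below flags low-confidence cells for manual review. Everything in it is an assumption made for the example: the `ExtractedCell` shape, the field names, the 0.95 threshold, and the confidence and bounding-box values are invented and do not describe Talonic's actual output format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ExtractedCell:
    field: str                       # e.g. "line_11_agi"
    value: str                       # value as read from the form
    confidence: float                # assumed to lie in [0.0, 1.0]
    page: int                        # 1-based page number in the source PDF
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) pixel region on that page


def needs_review(cells: List[ExtractedCell], threshold: float = 0.95) -> List[ExtractedCell]:
    """Return the cells below the confidence threshold, least confident first."""
    flagged = [c for c in cells if c.confidence < threshold]
    return sorted(flagged, key=lambda c: c.confidence)


# Invented values, for illustration only.
cells = [
    ExtractedCell("line_11_agi", "152200.00", 0.99, 1, (412, 880, 560, 904)),
    ExtractedCell("line_16_tax", "19418.00", 0.81, 2, (412, 120, 560, 144)),
    ExtractedCell("line_1a_wages", "148500.00", 0.93, 1, (412, 610, 560, 634)),
]
for cell in needs_review(cells):
    # The page and bbox tell a reviewer exactly where to look on the source.
    print(cell.field, cell.confidence, cell.page, cell.bbox)
```

A reviewer would then compare each flagged value with the indicated region of the source page before the figure is used.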

Sample extraction

A simplified 2024 Form 1040 for a married couple filing jointly

{
  "tax_year": 2024,
  "filing_status": "Married Filing Jointly",
  "taxpayer": {
    "name": "Jane Q. Smith",
    "ssn": "XXX-XX-1234"
  },
  "spouse": {
    "name": "John R. Smith",
    "ssn": "XXX-XX-5678"
  },
  "dependents": [
    {
      "name": "Emma Smith",
      "relationship": "Daughter",
      "ssn": "XXX-XX-2222"
    },
    {
      "name": "Liam Smith",
      "relationship": "Son",
      "ssn": "XXX-XX-3333"
    }
  ],
  "line_1a_wages": 148500,
  "line_2b_taxable_interest": 412,
  "line_3a_qualified_dividends": 1860,
  "line_7_capital_gain": 1428,
  "line_11_agi": 152200,
  "line_12_deduction": 29200,
  "line_15_taxable_income": 123000,
  "line_16_tax": 19418,
  "line_24_total_tax": 19418,
  "line_33_total_payments": 20838,
  "line_34_refund": 1420,
  "line_37_amount_owed": 0
}
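The figures in this sample are internally consistent, and a consumer of the extraction can verify that with simple arithmetic. The sketch below is one possible check, not a Talonic feature. It makes a simplifying assumption: taxable income is taken as AGI minus the Line 12 deduction, which ignores the qualified business income deduction on Line 13 (absent from this sample). The function name `check_return` is hypothetical.

```python
from typing import Dict, List

# The relevant fields from the sample extraction above.
sample = {
    "line_11_agi": 152200,
    "line_12_deduction": 29200,
    "line_15_taxable_income": 123000,
    "line_24_total_tax": 19418,
    "line_33_total_payments": 20838,
    "line_34_refund": 1420,
    "line_37_amount_owed": 0,
}


def check_return(r: Dict[str, int]) -> List[str]:
    """Return a list of arithmetic inconsistencies; an empty list means none found."""
    issues = []

    # Simplification: ignores the Line 13 deduction; taxable income cannot go below zero.
    expected_taxable = max(0, r["line_11_agi"] - r["line_12_deduction"])
    if r["line_15_taxable_income"] != expected_taxable:
        issues.append(f"line 15 should be {expected_taxable}")

    # Payments above total tax produce an overpayment; the reverse produces an amount owed.
    balance = r["line_33_total_payments"] - r["line_24_total_tax"]
    if r["line_34_refund"] != max(0, balance):
        issues.append(f"line 34 should be {max(0, balance)}")
    if r["line_37_amount_owed"] != max(0, -balance):
        issues.append(f"line 37 should be {max(0, -balance)}")

    return issues


print(check_return(sample))  # prints []
```

Here 152,200 - 29,200 = 123,000 matches Line 15, and 20,838 - 19,418 = 1,420 matches the refund on Line 34.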

Frequently asked

Is AGI (Line 11) captured as a first-class field?

Yes. AGI is the single most-consumed value on the 1040 (used in lender underwriting, FAFSA, IRMAA, IRA contribution limits, and many other downstream workflows), so it surfaces as a top-level field rather than buried in a generic line array. Other lines are also available individually.

How are dependents handled?

Dependents are returned as an array. Each entry has a name, relationship to the taxpayer, SSN (typically masked on consumer copies), and credit-eligibility indicators (CTC, ODC). Multiple dependents are preserved in source order.
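A minimal sketch of what consuming the dependent array might look like follows. The `Dependent` class and its `ctc` / `odc` boolean fields are assumed names for the credit-eligibility indicators; the simplified sample on this page does not show them, so the exact keys in real output may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Dependent:
    name: str
    relationship: str
    ssn: str    # masked by default, e.g. "XXX-XX-2222"
    ctc: bool   # assumed flag: child tax credit box checked
    odc: bool   # assumed flag: credit for other dependents box checked


def credit_counts(dependents: List[Dependent]) -> Dict[str, int]:
    """Count how many dependents are marked for each credit."""
    return {
        "ctc": sum(1 for d in dependents if d.ctc),
        "odc": sum(1 for d in dependents if d.odc),
    }


# Source order is kept: the list order mirrors the rows on the form.
dependents = [
    Dependent("Emma Smith", "Daughter", "XXX-XX-2222", ctc=True, odc=False),
    Dependent("Liam Smith", "Son", "XXX-XX-3333", ctc=True, odc=False),
]
print(credit_counts(dependents))
```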

Can it handle attached schedules (A, B, C, D, E, SE)?

Yes, when the schedules are bundled in the same PDF as the parent 1040. Each schedule extracts as a separate schema instance with a link back to the parent return. Schedule C and Schedule E are extracted at line-item granularity for use cases like rental property analysis.
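One way to work with schedules that arrive as separate instances linked to a parent return is to group them by the parent's identifier. The sketch below assumes each instance is a dictionary with hypothetical `schema`, `id`, and `parent_id` keys; these names are illustrative, not the documented output shape.

```python
from typing import Dict, List


def group_schedules(instances: List[dict]) -> Dict[str, dict]:
    """Attach each schedule instance to its parent 1040, keyed by the parent's id."""
    returns = {
        inst["id"]: {"return": inst, "schedules": {}}
        for inst in instances
        if inst["schema"] == "form_1040"
    }
    for inst in instances:
        if inst["schema"] == "form_1040":
            continue
        parent = returns.get(inst.get("parent_id"))
        if parent is None:
            continue  # schedule without a matching parent in this batch is skipped
        # A return can carry several copies of one schedule (e.g. two Schedule Cs).
        parent["schedules"].setdefault(inst["schema"], []).append(inst)
    return returns


instances = [
    {"schema": "form_1040", "id": "r1", "tax_year": 2024},
    {"schema": "schedule_c", "id": "s1", "parent_id": "r1"},
    {"schema": "schedule_c", "id": "s2", "parent_id": "r1"},
    {"schema": "schedule_e", "id": "s3", "parent_id": "r1"},
]
grouped = group_schedules(instances)
print(sorted(grouped["r1"]["schedules"]))
```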

What about prior-year 1040 versions?

The schema covers the post-TCJA 1040 versions, tax years 2018 through 2024. Earlier versions had a different layout (two-page front, different line numbers); pre-2018 returns can be extracted with the legacy schema if requested. Most income-verification workflows look at the two most recent years of returns, which fall well within the covered range.
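The year-based split described here amounts to a simple routing decision, sketched below. The schema identifiers `form_1040_post_tcja` and `form_1040_legacy` and the function name are made up for the example; only the 2018 through 2024 range and the existence of a legacy schema come from the text above.

```python
def schema_for_tax_year(tax_year: int) -> str:
    """Pick a schema name for a return based on its tax year (hypothetical names)."""
    if 2018 <= tax_year <= 2024:
        return "form_1040_post_tcja"
    if 1913 <= tax_year < 2018:
        # Older layouts use the legacy schema, available on request.
        return "form_1040_legacy"
    raise ValueError(f"no schema available for tax year {tax_year}")


for year in (2017, 2018, 2024):
    print(year, schema_for_tax_year(year))
```

Raising an error for years outside both ranges keeps an unsupported revision from being silently parsed with the wrong line mapping.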

Are SSNs preserved fully or masked?

By default, SSNs are masked to the last 4 digits (XXX-XX-1234) at extraction. Full SSN retention is supported when the use case requires it (e.g., IRS internal reconciliation) and the storage destination meets the corresponding privacy requirements. The default is conservative.
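The default masking format can be reproduced with a few lines of code. This is a minimal sketch assuming the SSN arrives as nine digits with optional hyphens; `mask_ssn` is a hypothetical name, and the real extraction step may detect and mask SSNs differently.

```python
import re

# Nine digits, with or without the usual hyphens.
_SSN_PATTERN = re.compile(r"^(\d{3})-?(\d{2})-?(\d{4})$")


def mask_ssn(ssn: str) -> str:
    """Mask an SSN to its last four digits, e.g. 'XXX-XX-1234'."""
    match = _SSN_PATTERN.match(ssn.strip())
    if match is None:
        # The input is deliberately left out of the message to avoid leaking it into logs.
        raise ValueError("value is not a recognizable SSN")
    return f"XXX-XX-{match.group(3)}"


print(mask_ssn("000-00-1234"))  # prints XXX-XX-1234
```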


Author note

Reviewed by Talonic engineering, tax-form subject-matter review · last reviewed 2026-05-12