Extract data from W-2 forms

The IRS Form W-2 reports an employee's annual wages and the taxes withheld from them. It is one of the most structured documents in American finance, yet turning it into data still trips people up. A tax-prep firm onboarding clients in February receives W-2s by the thousand: clean PDFs downloaded from ADP, phone photos of a paper copy, and scans faxed from an employer. Every one carries the same numbered boxes, and the numbers feed directly into a Form 1040. Box 1 is wages subject to federal income tax, Box 2 is the federal income tax withheld, Box 3 and Box 5 are the Social Security and Medicare wages, and Boxes 15 through 17 carry the state, state wages, and state income tax for each state the employee worked in.

The difficulty is that the layout is only rigid on paper. A scanned W-2 skews and the box grid shifts, so naive position-based extraction reads the Box 2 amount into Box 1. Box 12 carries up to four coded entries (for example Code D for a 401(k) deferral, Code DD for employer-sponsored health coverage, Code W for employer HSA contributions), and each one- or two-letter code has to stay bound to its amount. An employee with two state lines has two rows of Boxes 15 through 17 that must not collapse into one. The employer identification number in Box b follows a strict format of two digits, a hyphen, and seven digits, which makes it a useful integrity check.

Talonic reads the W-2 by its box numbers rather than by pixel position, so Box 1 wages and Box 2 withholding land in the right fields even on a skewed scan. Box 12 codes are captured as code-and-amount pairs, and multiple state lines are kept as separate records, so a preparer can file from clean data.

What gets extracted from W-2 forms

Employee Name: Sarah Mitchell
Employer EIN (Box b): 12-3456789
Box 1 Wages: $78,400.00
Box 2 Federal Tax Withheld: $11,260.00
Box 3 Social Security Wages: $78,400.00
Box 5 Medicare Wages: $78,400.00
Box 12 Codes: D: $6,000.00; DD: $14,200.00
Box 15 State: CA
Box 16 State Wages: $78,400.00
Box 17 State Income Tax: $4,120.00

How extraction works for W-2 forms

W-2s arrive as payroll-provider PDFs, employer printouts, and phone photos. The official IRS layout is consistent enough that the right anchor is the box number, not the position on the page. Talonic reads the form against the W-2 box map in the Field Registry, which binds each value to its numbered box rather than its location, so a skewed or rotated scan still routes Box 1 wages and Box 2 withholding correctly. Box 12 entries are captured as code-and-amount pairs (Code D, Code DD, Code W), so a one- or two-letter code never detaches from its dollar figure. Multiple state lines in Boxes 15 through 17 are kept as separate state records. The Box b employer identification number is validated against its two-digit, hyphen, seven-digit format. Every value returns with a confidence score and pixel-region provenance, in conformity with DIN SPEC 91491, so a preparer can confirm a box against the source W-2 before filing.
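
The Box b format check described above can be illustrated with a short Python sketch. This is not Talonic's implementation; the function name `is_valid_ein_format` is hypothetical, and the check covers only the two-digit, hyphen, seven-digit layout, not whether the IRS ever issued the number.

```python
import re

# Two digits, a hyphen, then seven digits, e.g. "12-3456789".
EIN_PATTERN = re.compile(r"[0-9]{2}-[0-9]{7}")


def is_valid_ein_format(value: str) -> bool:
    """Return True when value matches the NN-NNNNNNN layout used in Box b.

    Hypothetical helper: it checks the format only and says nothing about
    whether the number belongs to a real employer.
    """
    return EIN_PATTERN.fullmatch(value.strip()) is not None
```

A value such as "12-3456789" passes, while a misread such as "123-456789" fails and can be routed to manual review.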

Sample extraction

A scanned IRS Form W-2 with one state line

{
  "tax_year": 2025,
  "employee_name": "Sarah Mitchell",
  "employee_ssn_last4": "1042",
  "employer_name": "Northgate Logistics Inc.",
  "employer_ein": "12-3456789",
  "box_1_wages": 78400,
  "box_2_federal_withholding": 11260,
  "box_3_ss_wages": 78400,
  "box_4_ss_tax": 4860.8,
  "box_5_medicare_wages": 78400,
  "box_6_medicare_tax": 1136.8,
  "box_12": [
    {
      "code": "D",
      "amount": 6000
    },
    {
      "code": "DD",
      "amount": 14200
    }
  ],
  "state_lines": [
    {
      "state": "CA",
      "box_16_wages": 78400,
      "box_17_tax": 4120
    }
  ]
}
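
One way a consumer of this record might sanity-check it is to compare Boxes 4 and 6 against the wages in Boxes 3 and 5. The sketch below is an illustration, not part of the Talonic output. It assumes the standard employee rates of 6.2% for Social Security and 1.45% for Medicare, a hypothetical one-cent tolerance, and wages below the $200,000 threshold at which Additional Medicare Tax withholding begins. Per-paycheck rounding by a payroll system may call for a looser tolerance in practice.

```python
from decimal import Decimal

SS_RATE = Decimal("0.062")         # employee Social Security rate
MEDICARE_RATE = Decimal("0.0145")  # employee Medicare rate
TOLERANCE = Decimal("0.01")        # hypothetical tolerance; tune for real data


def check_fica(record: dict) -> list[str]:
    """Return the names of tax fields that disagree with their wage base."""

    def dec(key: str) -> Decimal:
        # Go through str() so a float like 4860.8 becomes Decimal("4860.8").
        return Decimal(str(record[key]))

    problems = []
    if abs(dec("box_3_ss_wages") * SS_RATE - dec("box_4_ss_tax")) > TOLERANCE:
        problems.append("box_4_ss_tax")
    expected_medicare = dec("box_5_medicare_wages") * MEDICARE_RATE
    if abs(expected_medicare - dec("box_6_medicare_tax")) > TOLERANCE:
        problems.append("box_6_medicare_tax")
    return problems
```

For the sample above, 78400 × 0.062 is 4860.80 and 78400 × 0.0145 is 1136.80, so both checks pass and the function returns an empty list.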

Frequently asked

Does it read the W-2 by box number?

Yes. Each value is bound to its numbered box rather than its pixel position, which is what keeps Box 1 wages and Box 2 withholding correct on a skewed or rotated scan where a position-based extractor would swap them.
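
To make the idea concrete, here is a toy Python sketch of label-anchored binding. It is not Talonic's method: the OCR token format `(text, x, y)`, the function `bind_by_label`, and the nearest-neighbour rule are all assumptions chosen for illustration. Each field is bound to the amount closest to its printed box label, so the result depends on relative position only.

```python
import math
import re

# Matches amounts such as "78,400.00" or "$11,260.00".
AMOUNT = re.compile(r"\$?[\d,]+\.\d{2}")


def bind_by_label(tokens, labels):
    """Bind each field to the amount token nearest its printed box label.

    tokens: list of (text, x, y) tuples from a hypothetical OCR step.
    labels: dict mapping an output field name to the label text to find.
    """
    amounts = [t for t in tokens if AMOUNT.fullmatch(t[0])]
    bound = {}
    for field, label_text in labels.items():
        anchor = next((t for t in tokens if t[0].startswith(label_text)), None)
        if anchor is None or not amounts:
            continue  # label not found; leave the field unset
        nearest = min(amounts, key=lambda t: math.dist(t[1:], anchor[1:]))
        bound[field] = nearest[0]
    return bound


LABELS = {"box_1_wages": "1 Wages", "box_2_federal_withholding": "2 Federal"}
upright = [("1 Wages", 100, 50), ("78,400.00", 100, 70),
           ("2 Federal", 300, 50), ("11,260.00", 300, 70)]
# The same page displaced as a skewed scan might be (made-up offsets).
shifted = [(text, x + 180, y + 25) for text, x, y in upright]
```

Because a shift or rotation moves a label and its amount together, `bind_by_label` returns the same mapping for `upright` and `shifted`, whereas a lookup at fixed coordinates would not. A real form with a dense grid would need a stricter rule than plain nearest distance.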

How are Box 12 codes captured?

Box 12 returns as an array of code-and-amount pairs, so Code D for a 401(k) deferral and Code DD for employer-sponsored health coverage each keep their own amount instead of collapsing into one number.
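
As a rough illustration of keeping codes bound to amounts, the Python sketch below parses a display string in the form shown in the field list ("D: $6,000.00; DD: $14,200.00") into code-and-amount pairs. The function name `parse_box_12` and the input format are assumptions for this example, not a Talonic API.

```python
import re
from decimal import Decimal

# One or two uppercase letters, a colon, then a dollar amount.
PAIR = re.compile(r"([A-Z]{1,2}):\s*\$?([\d,]+(?:\.\d{2})?)")


def parse_box_12(text: str) -> list[dict]:
    """Split a Box 12 display string into separate code-and-amount entries."""
    entries = []
    for code, amount in PAIR.findall(text):
        entries.append({
            "code": code,
            # Drop thousands separators before converting to Decimal.
            "amount": Decimal(amount.replace(",", "")),
        })
    return entries
```

Each entry stays its own record, so a downstream consumer can look up the Code D deferral without it being added to the Code DD figure.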

What about employees with more than one state?

Each set of Boxes 15 through 17 is captured as a separate state record, so an employee who worked in two states has two state wage and tax lines rather than a merged total.
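
A consumer can preserve that separation by modelling each state line as its own object. The Python sketch below assumes the `state_lines` shape from the sample extraction. The class `StateLine` and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateLine:
    state: str           # Box 15 two-letter state code
    box_16_wages: float  # Box 16 state wages
    box_17_tax: float    # Box 17 state income tax


def read_state_lines(record: dict) -> list[StateLine]:
    """Return one StateLine per entry, preserving order and never summing."""
    return [StateLine(**line) for line in record.get("state_lines", [])]


def tax_by_line(lines: list[StateLine]) -> list[tuple[str, float]]:
    """Pair each state with its own withheld tax instead of a merged total."""
    return [(line.state, line.box_17_tax) for line in lines]
```

With the sample record, `read_state_lines` yields a single CA line. A two-state W-2 would yield two entries in the same order as the form.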

Author note

Reviewed by Talonic engineering, schema review · last reviewed 2026-06-13