Extract data from balance sheets

A balance sheet is a snapshot of what a company owns and owes on a single date, and the people who read it for a living rarely receive it in a database. A credit analyst evaluating a $5,000,000 facility, a private-equity associate screening an acquisition, or an accountant rolling up a group consolidation all start from a PDF: a statement exported from QuickBooks, a page lifted from an audited annual report, or a board pack assembled in Excel and printed. They need the same line items every time: current assets such as Cash and Accounts Receivable, non-current assets such as Property, Plant and Equipment, current liabilities such as Accounts Payable, long-term debt, and the equity section, with the fundamental identity that total assets equal total liabilities plus equity.

The reason this resists naive extraction is structure. Balance sheets are hierarchical: line items roll into subtotals (Total Current Assets), which roll into totals (Total Assets), and the indentation that signals the hierarchy is visual rather than tagged. Comparative statements show two or three periods side by side, so 2025-12-31 and 2024-12-31 sit in adjacent columns and have to stay separate. Companies reporting under US GAAP order the statement differently from those under IFRS. Parentheses denote negative or contra accounts. A line labeled in one company's chart of accounts as Trade Receivables is Accounts Receivable in another.

Talonic reads the balance sheet and returns each line item with its label, its amount, the period it belongs to, and its place in the asset, liability, or equity hierarchy. Subtotals are preserved and the accounting identity is checked, so an analyst can load a clean, period-tagged statement instead of retyping figures from a scan.

What gets extracted from balance sheets

Entity Name: Harbor Freight Components Inc.
Statement Date: 2025-12-31
Cash and Equivalents: $1,240,000
Accounts Receivable: $860,000
Total Current Assets: $2,450,000 (Subtotal)
Property, Plant and Equipment: $3,100,000
Total Assets: $5,920,000
Accounts Payable: $540,000
Long-Term Debt: $1,800,000
Total Equity: $2,980,000

How extraction works for balance sheets

Balance sheets are exported from QuickBooks, NetSuite, Xero, and Sage, pulled from audited PDF filings, and rebuilt in Excel, so the chart-of-accounts labels and the column layout vary with every source. Talonic reads the statement and maps it to the financial-statement schema in the Field Registry, which models the asset, liability, and equity hierarchy rather than a flat list of numbers. Indentation and subtotal cues are used to attach each line to its parent, so Total Current Assets keeps its child accounts. Comparative columns are split by period and tagged with their statement dates. Parenthesized values are read as negatives. The accounting identity (total assets against total liabilities plus equity) is checked, and any imbalance is flagged. Every figure is returned with a confidence score and pixel-region provenance, in conformity with DIN SPEC 91491, so an analyst can audit a captured number against the source statement.
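To make the parenthesis rule concrete, here is a minimal Python sketch of how a printed amount could be turned into a signed number. It is an illustration only, not Talonic's implementation; the function name parse_amount and the choice of Decimal are assumptions made for this example.

```python
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Convert a printed amount such as '$1,240,000' or '(540,000)' to a Decimal.

    Hypothetical helper: parentheses around the figure are treated as a
    negative sign, which is the usual balance-sheet convention.
    """
    cleaned = text.strip().replace("$", "").replace(" ", "")
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]  # drop the surrounding parentheses
    cleaned = cleaned.replace(",", "")  # drop thousands separators
    value = Decimal(cleaned)  # raises decimal.InvalidOperation on non-numeric text
    return -value if negative else value


print(parse_amount("$1,240,000"))
print(parse_amount("(540,000)"))
```

Decimal is used rather than float so that amounts with cents survive later subtotal arithmetic without rounding drift.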

Sample extraction

A single-period balance sheet exported to PDF from accounting software

{
  "entity_name": "Harbor Freight Components Inc.",
  "statement_date": "2025-12-31",
  "currency": "USD",
  "reporting_standard": "US GAAP",
  "assets": {
    "current": {
      "cash_and_equivalents": 1240000,
      "accounts_receivable": 860000,
      "inventory": 350000,
      "total": 2450000
    },
    "non_current": {
      "property_plant_equipment": 3100000,
      "intangibles": 370000,
      "total": 3470000
    },
    "total_assets": 5920000
  },
  "liabilities": {
    "current": {
      "accounts_payable": 540000,
      "accrued_expenses": 320000,
      "total": 860000
    },
    "non_current": {
      "long_term_debt": 1800000,
      "total": 2080000
    },
    "total_liabilities": 2940000
  },
  "equity": {
    "total_equity": 2980000
  }
}
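
A consumer of this output can re-run the accounting identity check on its own side. The sketch below is a hedged example in Python: the function check_identity and its tolerance parameter are hypothetical names, and only the three total fields from the sample above are used.

```python
def check_identity(statement: dict, tolerance: int = 0) -> dict:
    """Compare total assets with total liabilities plus equity.

    Hypothetical helper; `tolerance` allows for rounding in statements
    reported in thousands or millions.
    """
    assets = statement["assets"]["total_assets"]
    liabilities = statement["liabilities"]["total_liabilities"]
    equity = statement["equity"]["total_equity"]
    difference = assets - (liabilities + equity)
    return {"balanced": abs(difference) <= tolerance, "difference": difference}


# Totals copied from the sample extraction above.
sample = {
    "assets": {"total_assets": 5920000},
    "liabilities": {"total_liabilities": 2940000},
    "equity": {"total_equity": 2980000},
}

result = check_identity(sample)
print(result)  # {'balanced': True, 'difference': 0}
```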

Frequently asked

Does it preserve subtotals and the line-item hierarchy?

Yes. The statement is modeled as a tree, not a flat list. Total Current Assets retains its child accounts, and Total Assets retains the current and non-current subtotals, so a downstream model can either use the rolled-up figures or drill into the components.
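
As a rough illustration of using either the rolled-up figures or the components, the Python sketch below walks the assets branch of the sample extraction. The helper leaf_items is a hypothetical name, and the assumption that every key starting with "total" is a subtotal or total is made for this example only.

```python
def leaf_items(node: dict, path: tuple = ()) -> list:
    """Return (path, amount) pairs for component accounts, skipping total fields."""
    items = []
    for key, value in node.items():
        if isinstance(value, dict):
            items.extend(leaf_items(value, path + (key,)))  # descend into a sub-branch
        elif not key.startswith("total"):
            items.append((path + (key,), value))  # a component account
    return items


# Assets branch copied from the sample extraction.
assets = {
    "current": {
        "cash_and_equivalents": 1240000,
        "accounts_receivable": 860000,
        "inventory": 350000,
        "total": 2450000,
    },
    "non_current": {
        "property_plant_equipment": 3100000,
        "intangibles": 370000,
        "total": 3470000,
    },
    "total_assets": 5920000,
}

components = leaf_items(assets)   # drill into the components
rolled_up = assets["total_assets"]  # or use the rolled-up figure directly

for item_path, amount in components:
    print(".".join(item_path), amount)
print("components sum to total:", sum(a for _, a in components) == rolled_up)
```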

How are comparative periods handled?

When a statement shows two or three periods side by side, each column is split out and tagged with its statement date, so 2025-12-31 and 2024-12-31 never blend into a single value.
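
The idea of splitting comparative columns can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's internals: split_periods is a hypothetical function, and the 2024 amounts are invented placeholder values used only to show the output shape.

```python
def split_periods(period_dates: list, rows: list) -> list:
    """Turn side-by-side columns into one record per (line item, period)."""
    records = []
    for label, amounts in rows:
        if len(amounts) != len(period_dates):
            raise ValueError(
                f"{label}: expected {len(period_dates)} amounts, got {len(amounts)}"
            )
        for period, amount in zip(period_dates, amounts):
            records.append({"label": label, "period": period, "amount": amount})
    return records


# Column order follows the statement header: current period first.
# The second figure in each row is a made-up 2024 placeholder.
rows = [
    ("Cash and Equivalents", [1240000, 1100000]),
    ("Accounts Receivable", [860000, 790000]),
]

records = split_periods(["2025-12-31", "2024-12-31"], rows)
for record in records:
    print(record)
```

Emitting one record per line item and period keeps each amount tied to its statement date, and the length check rejects a row whose column count does not match the header.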

Does it work across US GAAP and IFRS ordering?

Yes. The schema maps line items by their accounting meaning rather than their position, so the IFRS habit of listing non-current assets first and the US GAAP habit of listing current assets first both resolve to the same structured fields.
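
A simplified way to picture mapping by meaning rather than position is an alias table keyed on normalized labels. The Python sketch below is hypothetical: the ALIASES table, the canonical_field function, and the specific aliases are assumptions for illustration, and a production mapping would need far broader coverage than a fixed dictionary.

```python
import re

# Hypothetical alias table: printed label (normalized) -> schema field name.
ALIASES = {
    "accounts receivable": "accounts_receivable",
    "trade receivables": "accounts_receivable",
    "property plant and equipment": "property_plant_equipment",
    "accounts payable": "accounts_payable",
    "trade payables": "accounts_payable",
}


def canonical_field(label: str):
    """Normalize a printed label and look it up; returns None when unknown."""
    key = re.sub(r"[^a-z ]", " ", label.lower())  # drop punctuation and digits
    key = " ".join(key.split())  # collapse repeated whitespace
    return ALIASES.get(key)


# Non-current assets first in one statement, current assets first in the other.
ifrs_style = ["Property, Plant and Equipment", "Trade Receivables"]
gaap_style = ["Accounts Receivable", "Property, Plant and Equipment"]

ifrs_fields = {canonical_field(label) for label in ifrs_style}
gaap_fields = {canonical_field(label) for label in gaap_style}
print(ifrs_fields == gaap_fields)  # True
```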

Author note

Reviewed by Talonic engineering, schema review · last reviewed 2026-06-10