
Extract data from safety data sheets

A Safety Data Sheet (SDS) is the document that tells anyone handling a chemical what it is, how it can hurt them, and what to do when something goes wrong. Under the OSHA Hazard Communication Standard, every SDS follows the same 16-section GHS format, which makes it look easy to parse until you have 4,000 of them. An environmental health and safety (EHS) team at a manufacturer maintains an SDS for every product in the building, from a drum of solvent to a tube of thread locker. Each sheet comes from a different manufacturer, such as 3M, Henkel, or Sigma-Aldrich, and each is updated on its own revision cycle.

The data the team needs sits in predictable sections: Section 1 (product identifier and supplier), Section 2 (hazard classification and signal word), Section 3 (composition with CAS numbers), Section 9 (physical properties), and Section 14 (transport information). The challenge is that the 16-section structure is consistent in order but not in formatting. Section 2 carries GHS hazard statements as H-codes (H225 for a highly flammable liquid) and precautionary P-codes, and the signal word is either Danger or Warning. Section 3 lists each hazardous ingredient with a chemical name, a CAS Registry Number such as 67-64-1 for acetone, and a concentration range. Section 14 carries the UN number, the proper shipping name, and the hazard class that a shipping department needs. A sheet can run 11 pages, and a multi-component product lists several CAS numbers that each have to be captured.

Talonic reads the SDS section by section and returns the product identity, the GHS classification with its H-codes and signal word, the composition with CAS numbers and concentration ranges, and the transport classification. An EHS team can then load a chemical inventory and a hazard register without retyping a 16-section sheet.

What gets extracted from safety data sheets

Product Name: Acetone Technical Grade
Supplier: Sigma-Aldrich
Signal Word: Danger
Hazard Statements: H225, H319, H336
Component: Acetone
CAS Number: 67-64-1
Concentration: 99.5 to 100%
Flash Point: -20 C (Section 9)
UN Number: UN1090 (Section 14)

How extraction works for safety data sheets

Safety Data Sheets follow the 16-section GHS order required by the OSHA Hazard Communication Standard, but each manufacturer formats the sections differently, so a fixed-position reader fails across suppliers. Talonic reads the SDS against the GHS section map in the Field Registry, which anchors extraction to the 16 numbered sections rather than the page layout. Section 2 hazard and precautionary statements are captured as H-codes and P-codes together with the signal word. Section 3 components are captured with their chemical names, CAS Registry Numbers such as 67-64-1, and concentration ranges. Section 14 transport data returns the UN number, proper shipping name, and hazard class. Multi-component products keep every CAS line. Every value returns with a confidence score and pixel-region provenance, in conformity with DIN SPEC 91491, so an EHS team can verify a hazard code or CAS number against the source sheet.
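To make the idea of anchoring to numbered sections concrete, here is a minimal Python sketch that splits plain SDS text on its section headings. It is not Talonic's implementation. The `split_sections` function, the `SECTION_HEADING` pattern, and the assumed heading style ("SECTION 3: ...") are all hypothetical, and real supplier sheets vary far more than one regular expression can cover.

```python
import re

# Assumed heading style: "SECTION 3: Composition ...". Real sheets vary
# ("3. Composition", "Section 3 -", localized headings), so this pattern
# is illustrative only.
SECTION_HEADING = re.compile(
    r"^[ \t]*SECTION[ \t]+(\d{1,2})[ \t]*[:.\-]?[ \t]*(.*)$",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text: str) -> dict[int, str]:
    """Return {section number: body text} for headings numbered 1 to 16."""
    headings = [
        m for m in SECTION_HEADING.finditer(text)
        if 1 <= int(m.group(1)) <= 16
    ]
    sections: dict[int, str] = {}
    for i, m in enumerate(headings):
        # A section body runs from the end of its heading line to the
        # start of the next heading (or to the end of the document).
        end = headings[i + 1].start() if i + 1 < len(headings) else len(text)
        sections[int(m.group(1))] = text[m.end():end].strip()
    return sections


# Made-up excerpt for illustration, not taken from a real sheet.
sample_text = """\
SECTION 1: Identification
Product name: Acetone Technical Grade
SECTION 2: Hazard(s) identification
Signal word: Danger
SECTION 3: Composition/information on ingredients
Acetone  67-64-1  99.5 to 100%
"""

sections = split_sections(sample_text)
print(sections[3])
```

Keying the result by section number rather than by page or line position is the point of the sketch: a lookup such as `sections[3]` does not depend on where a supplier happens to place the composition table.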

Sample extraction

An 11-page Safety Data Sheet for a single-component solvent

{
  "product_name": "Acetone Technical Grade",
  "supplier_name": "Sigma-Aldrich",
  "revision_date": "2025-11-02",
  "signal_word": "Danger",
  "hazard_statements": [
    "H225",
    "H319",
    "H336"
  ],
  "precautionary_statements": [
    "P210",
    "P233",
    "P280"
  ],
  "composition": [
    {
      "component": "Acetone",
      "cas_number": "67-64-1",
      "concentration": "99.5 to 100%"
    }
  ],
  "physical_properties": {
    "flash_point": "-20 C",
    "boiling_point": "56 C",
    "physical_state": "liquid"
  },
  "transport": {
    "un_number": "UN1090",
    "proper_shipping_name": "Acetone",
    "hazard_class": "3",
    "packing_group": "II"
  }
}
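
A consumer of this output may want an independent sanity check on each CAS number before loading it into an inventory. The sketch below uses the standard CAS check-digit rule: the last digit equals the weighted sum of the preceding digits, weighted 1, 2, 3, ... from the right, modulo 10. The function name `cas_check_digit_ok` and the trimmed `sample` record are illustrative assumptions, not part of any Talonic API.

```python
import json
import re

# CAS Registry Number layout: 2 to 7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)", re.ASCII)


def cas_check_digit_ok(cas: str) -> bool:
    """Validate the format and check digit of a CAS Registry Number."""
    m = CAS_PATTERN.fullmatch(cas)
    if m is None:
        return False
    digits = m.group(1) + m.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(m.group(3))


# Trimmed copy of the sample extraction, keeping only the fields used here.
sample = json.loads("""
{
  "product_name": "Acetone Technical Grade",
  "composition": [
    {"component": "Acetone", "cas_number": "67-64-1",
     "concentration": "99.5 to 100%"}
  ]
}
""")

# For 67-64-1: 4*1 + 6*2 + 7*3 + 6*4 = 61, and 61 % 10 == 1.
invalid = [
    c["cas_number"]
    for c in sample["composition"]
    if not cas_check_digit_ok(c["cas_number"])
]
print(invalid)  # prints []
```

A failed check digit usually points to a misread character, which makes it a cheap filter to run alongside the confidence score.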

Frequently asked

Does it follow the 16-section GHS structure?

Yes. Extraction is anchored to the 16 numbered sections of the OSHA Hazard Communication format, so Section 3 composition and Section 14 transport data are read from the correct section even when a manufacturer formats the page differently.

How are hazard codes captured?

GHS hazard statements are returned as their H-codes (such as H225 for a highly flammable liquid) and precautionary statements as P-codes, alongside the signal word Danger or Warning, so a hazard register can be built from coded data rather than free text.
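
As an illustration of what coded data allows, the following Python sketch groups H-codes and P-codes by the GHS numbering series (H2xx physical, H3xx health, H4xx environmental; P1xx general, P2xx prevention, P3xx response, P4xx storage, P5xx disposal). The names `classify` and `build_register` are hypothetical, and the sketch deliberately ignores regional additions such as EUH statements.

```python
import re

# A code is a letter (H or P) followed by three digits; the first digit
# identifies the series. Suffixes such as the "FD" in H360FD are ignored.
CODE = re.compile(r"^([HP])(\d)\d{2}")

H_GROUPS = {"2": "physical", "3": "health", "4": "environmental"}
P_GROUPS = {
    "1": "general",
    "2": "prevention",
    "3": "response",
    "4": "storage",
    "5": "disposal",
}


def classify(code: str) -> str:
    """Return the GHS series name for an H-code or P-code."""
    m = CODE.match(code.strip().upper())
    if m is None:
        return "unrecognized"
    groups = H_GROUPS if m.group(1) == "H" else P_GROUPS
    return groups.get(m.group(2), "unrecognized")


def build_register(codes: list[str]) -> dict[str, list[str]]:
    """Group codes by series, preserving their original order."""
    register: dict[str, list[str]] = {}
    for code in codes:
        register.setdefault(classify(code), []).append(code)
    return register


hazards = build_register(["H225", "H319", "H336"])
precautions = build_register(["P210", "P233", "P280"])
print(hazards)      # {'physical': ['H225'], 'health': ['H319', 'H336']}
print(precautions)  # {'prevention': ['P210', 'P233', 'P280']}
```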

Can it handle multi-component products?

Yes. Each hazardous ingredient in Section 3 is captured with its chemical name, CAS Registry Number, and concentration range, so a mixture with several components keeps every CAS line for the chemical inventory.
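
One way a downstream script might turn a multi-component result into chemical inventory rows is sketched below in Python. The product "Example Thinner Blend" and its concentration ranges are invented for illustration, and `parse_range` and `inventory_rows` are hypothetical helpers that assume concentrations arrive as strings like "30 to 60%".

```python
import re
from typing import Optional

# Matches "30 to 60%" or "30-60%"; anything else is left unparsed.
RANGE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:to|-)\s*(\d+(?:\.\d+)?)\s*%\s*$")


def parse_range(text: str) -> Optional[tuple[float, float]]:
    """Return (low, high) percentages, or None if the text is not a range."""
    m = RANGE.match(text)
    return (float(m.group(1)), float(m.group(2))) if m else None


def inventory_rows(record: dict) -> list[dict]:
    """Emit one inventory row per component so no CAS line is dropped."""
    rows = []
    for comp in record.get("composition", []):
        bounds = parse_range(comp["concentration"])
        rows.append({
            "product_name": record["product_name"],
            "component": comp["component"],
            "cas_number": comp["cas_number"],
            "min_pct": bounds[0] if bounds else None,
            "max_pct": bounds[1] if bounds else None,
        })
    return rows


# Invented mixture for illustration only.
mixture = {
    "product_name": "Example Thinner Blend",
    "composition": [
        {"component": "Acetone", "cas_number": "67-64-1",
         "concentration": "30 to 60%"},
        {"component": "Toluene", "cas_number": "108-88-3",
         "concentration": "10 to 30%"},
    ],
}

rows = inventory_rows(mixture)
for row in rows:
    print(row["cas_number"], row["min_pct"], row["max_pct"])
```

Keeping the original concentration string next to the parsed bounds would be a reasonable extension, since some sheets state ranges in forms this pattern does not recognize.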


Author note

Reviewed by Talonic engineering, schema review · last reviewed 2026-06-10