Document AI Custom Parsers

Document AI Custom Parsers and Deployment Framework

Google Cloud’s Document AI is a powerful tool for text extraction, but its form types are limited to pre-built lending and identity parsers. We’ve created a number of custom parsers and extended the capabilities of existing parsers through the generic form parser API. Drawing on our experience with Document AI, we’ve also created a hub-and-spoke Python framework to ease deployment of these models.

Extended Processors:

Google Cloud’s Document AI parses documents in near real time by extracting key-value pairs and raw text and by reading tables. To increase the versatility of existing GCP Document AI models, we’ve added logic on top of the following lending and identity parsers:

  • Form 1040 SEs
  • Form 1040 Cs
  • Generic Form 1099s
  • Bank Statements
  • Credit Statements
  • Foreign Passports
  • Generic ID Cards
  • Paystubs
  • US Passports
  • US Driver’s Licenses
  • Utility Statements
  • W-2 Forms

The BlueVector AI additions to these parsers include NLP part-of-speech analysis for proper-noun recognition; regex-based pattern recognition for dates, SSNs, and dollar amounts; and text-location analysis for generic forms such as passports and 1099s.
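As a rough illustration of the regex step, the sketch below pulls dates, SSNs, and dollar amounts out of raw extracted text. The patterns and the `find_patterns` helper are assumptions made for this example, not the rules used in the BlueVector AI parsers; they cover only US-style `MM/DD/YYYY` dates, hyphenated SSNs, and `$`-prefixed amounts.

```python
import re
from typing import Dict, List

# Illustrative patterns only (assumed for this sketch, not the production rules).
# US-style dates such as 01/15/2023 or 1-5-23.
DATE_RE = re.compile(
    r"\b(?:0?[1-9]|1[0-2])[/-](?:0?[1-9]|[12]\d|3[01])[/-](?:\d{4}|\d{2})\b"
)
# Hyphenated Social Security numbers such as 123-45-6789.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Dollar amounts with optional thousands separators and cents.
DOLLAR_RE = re.compile(r"\$\s?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?")


def find_patterns(text: str) -> Dict[str, List[str]]:
    """Return every date, SSN, and dollar amount found in the text."""
    return {
        "dates": DATE_RE.findall(text),
        "ssns": SSN_RE.findall(text),
        "dollar_amounts": DOLLAR_RE.findall(text),
    }
```

In practice, matches like these would be checked against the key-value pairs returned by the parser, so that a value the parser missed or misread can be recovered from the raw text.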

Custom Processors:

In addition to extending existing GCP pre-built parsers, we’ve created custom parsers by leveraging the generic form parser API and adding logic to reach higher accuracy scores than standard OCR. We’ve built the following custom parsers using this framework:

  • Affidavits/Letters
  • Business Plans
  • Business Licenses
  • Business Receipts
  • Employer Identification Numbers
  • Diplomas
  • Divorce Decrees
  • Earnings and Leave Statements
  • Foreign Passports (uses GCP Identity parser for photo recognition)
  • Lease Agreements
  • Marriage Certificates
  • Mortgage Statements
  • Offer Letters
  • Property Tax Statements
  • Transcripts

Deployment Framework:

The linked GitHub repository is set up in what we call the FormDocuments layout. It leverages a parent class, FormDocuments, which instantiates a Document AI processor, completes any necessary document trimming, attempts repairs if a document is partially corrupt, measures extraction accuracy, and adds robust application logging. A class is then created for each document type that inherits from the FormDocuments class, allowing any number of custom parsers to be added to or removed from a project with relative ease.
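The hub-and-spoke idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in the repository: only the `FormDocuments` name comes from the description above. The `process_fn` callable is a hypothetical stand-in for the Document AI processor call, `coverage` is a simple proxy for an accuracy measure, and the `Paystub` subclass and its field names are invented for the example. Document trimming and repair are omitted.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("form_documents")

# Hypothetical stand-in for a Document AI call: raw bytes in, key-value pairs out.
ProcessFn = Callable[[bytes], Dict[str, str]]


class FormDocuments:
    """Parent ("hub") class holding logic shared by every document type."""

    expected_fields: List[str] = []  # each subclass lists the fields it expects

    def __init__(self, process_fn: ProcessFn) -> None:
        self.process_fn = process_fn

    def post_process(self, fields: Dict[str, str]) -> Dict[str, str]:
        """Hook for document-specific logic; by default fields pass through."""
        return fields

    def coverage(self, fields: Dict[str, str]) -> float:
        """Share of expected fields that came back non-empty (a proxy metric)."""
        if not self.expected_fields:
            return 1.0
        found = sum(1 for name in self.expected_fields if fields.get(name))
        return found / len(self.expected_fields)

    def parse(self, content: bytes) -> Dict[str, str]:
        """Run the processor, apply subclass logic, and log the coverage."""
        fields = self.post_process(self.process_fn(content))
        logger.info("%s coverage: %.2f", type(self).__name__, self.coverage(fields))
        return fields


class Paystub(FormDocuments):
    """Example "spoke": only the document-specific parts live here."""

    expected_fields = ["employee_name", "net_pay"]

    def post_process(self, fields: Dict[str, str]) -> Dict[str, str]:
        cleaned = dict(fields)
        if "net_pay" in cleaned:
            # Normalize "$1,234.56" to "1234.56".
            cleaned["net_pay"] = cleaned["net_pay"].replace("$", "").replace(",", "")
        return cleaned
```

Adding a new document type under this layout means writing one more small subclass, and removing one means deleting it, while the shared processing and logging stay in the parent class.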