BlueVector Google Cloud Marketplace

Document AI Custom Parsers

Document AI Custom Parsers and Deployment Framework

Google Cloud’s Document AI is a powerful tool for text extraction, but its pre-built form types are limited to lending and identity parsers. We’ve created a number of custom parsers and extended the capabilities of existing parsers through the generic form parser API. Drawing on our experience with Document AI, we’ve also created a hub-and-spoke Python framework to ease deployment of these models.

Overview

Extended Processors:

Google Cloud’s Document AI parses documents in near real time by extracting key-value pairs, raw text and tables. To increase the versatility of the existing GCP Document AI models, we’ve added logic on top of the following lending and identity parsers:

Form 1040 SEs
Form 1040 Cs
Generic Form 1099s
Bank Statements
Credit Statements
Foreign Passports
Generic ID Cards
Paystubs
US Passports
US Driver’s Licenses
Utility Statements
W2 Forms

The BlueVector AI additions to these parsers include NLP part-of-speech analysis for proper noun recognition, regex-based pattern recognition for dates, SSNs and dollar amounts, and text location analysis for generic forms such as passports and 1099s.
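As an illustration of the regex step described above, the following is a minimal sketch and not the BlueVector implementation. The pattern names, the `extract_patterns` helper and the specific formats covered (US-style MM/DD/YYYY dates, dashed SSNs, dollar amounts with an optional cents part) are assumptions made for the example.

```python
import re
from typing import Dict, List

# Hypothetical patterns for illustration; production parsers would need to
# cover more formats (written-out months, SSNs without dashes, and so on).
DATE_RE = re.compile(
    r"\b(?:0?[1-9]|1[0-2])/(?:0?[1-9]|[12]\d|3[01])/(?:19|20)\d{2}\b"
)
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
DOLLAR_RE = re.compile(r"\$\s?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?")


def extract_patterns(text: str) -> Dict[str, List[str]]:
    """Return every date, SSN and dollar amount found in the raw text."""
    return {
        "dates": DATE_RE.findall(text),
        "ssns": SSN_RE.findall(text),
        "dollar_amounts": DOLLAR_RE.findall(text),
    }


if __name__ == "__main__":
    sample = "Paid $1,234.56 on 03/15/2021 to SSN 123-45-6789."
    print(extract_patterns(sample))
```

Running patterns like these over the raw text returned by a parser gives a second signal that can be compared with the key-value pairs the parser extracted.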

Custom Processors:

In addition to extending existing GCP prebuilt parsers, we’ve created custom parsers by leveraging the generic form parser API and adding logic to reach higher accuracy scores than standard OCR. We’ve built the following custom parsers using this approach:

Affidavits/Letters
Business Plans
Business Licenses
Business Receipts
Employer Identification Numbers
Diplomas
Divorce Decrees
Earnings and Leave Statements
Foreign Passports (uses GCP Identity parser for photo recognition)
Lease Agreements
Marriage Certificates
Mortgage Statements
Offer Letters
Property Tax Statements
Transcripts

Deployment Framework:

The linked GitHub repository is set up in what we call the FormDocuments layout. It is built around a parent class, FormDocuments, defined in FormDocuments.py, which instantiates a Document AI processor, completes necessary document trimming, attempts repairs if a document is partially corrupt, measures extraction accuracy and adds robust application logging. A class is then created for each document type that inherits from FormDocuments, so any number of custom parsers can be added to or removed from a project with relative ease.
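The hub-and-spoke layout can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from the repository: apart from the FormDocuments name, every method, attribute and field name here is hypothetical, the Document AI call is replaced by an injected callable so the example runs without cloud credentials, and the trimming and repair steps are omitted.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("form_documents")


class FormDocuments:
    """Parent ("hub") class holding the logic shared by every document type."""

    # Each subclass lists the fields it expects to extract (hypothetical).
    expected_fields: List[str] = []

    def __init__(self, processor: Callable[[bytes], Dict[str, str]]):
        # Assumption: `processor` stands in for a Document AI processor call
        # and returns a flat mapping of field name to extracted text.
        self.processor = processor

    def post_process(self, fields: Dict[str, str]) -> Dict[str, str]:
        """Hook for document-specific logic; the base class changes nothing."""
        return fields

    def coverage(self, fields: Dict[str, str]) -> float:
        """Share of expected fields with a non-empty value.

        A simple stand-in for the accuracy measurement described above.
        """
        if not self.expected_fields:
            return 0.0
        found = sum(1 for name in self.expected_fields if fields.get(name))
        return found / len(self.expected_fields)

    def parse(self, content: bytes) -> Dict[str, str]:
        fields = self.post_process(self.processor(content))
        logger.info(
            "%s: field coverage %.0f%%",
            type(self).__name__,
            100 * self.coverage(fields),
        )
        return fields


class W2Form(FormDocuments):
    """One "spoke": a document type that only adds its own logic."""

    expected_fields = ["employee_ssn", "wages"]

    def post_process(self, fields: Dict[str, str]) -> Dict[str, str]:
        cleaned = dict(fields)
        if "wages" in cleaned:
            # Normalize "$52,000.00" to "52000.00".
            cleaned["wages"] = cleaned["wages"].replace("$", "").replace(",", "").strip()
        return cleaned


if __name__ == "__main__":
    def fake_processor(content: bytes) -> Dict[str, str]:
        return {"employee_ssn": "123-45-6789", "wages": "$52,000.00"}

    logging.basicConfig(level=logging.INFO)
    print(W2Form(fake_processor).parse(b"%PDF-"))
```

In a layout like this, adding a document type means writing one small subclass, and removing one means deleting it, without touching the shared processing, scoring or logging code.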