Document AI Preprocessing

Document Preprocessing Solutions for Smooth Document AI Implementation

We've packaged several preprocessing solutions that extend Document AI to more use cases and improve overall parsing accuracy. We offer three major preprocessors: PDF conversion, corrupt file checks and repair, and antivirus scanning.


PDF Conversion:

Users upload images from any device, which means a wide range of file types and encodings, not all of which work with GCP's Document AI processors. To increase compatibility and accuracy, we convert the following file types to PDF before processing, regardless of encoding:

  • PNG
  • TIFF
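
As a rough illustration of the first step of such a conversion, the sketch below identifies a file by its leading bytes rather than its extension, then decides whether it should be routed to a PDF converter. The function names are hypothetical and not part of the packaged solution. The conversion itself, which would typically be handled by an imaging library, is not shown.

```python
from typing import Optional

# Leading "magic" bytes that identify each format, independent of file extension.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"II*\x00": "tiff",  # little-endian TIFF
    b"MM\x00*": "tiff",  # big-endian TIFF
    b"%PDF-": "pdf",
}


def sniff_format(header: bytes) -> Optional[str]:
    """Return 'png', 'tiff', 'pdf', or None for an unrecognized header."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


def needs_pdf_conversion(header: bytes) -> bool:
    """True for the image formats this sketch would send to the PDF converter."""
    return sniff_format(header) in {"png", "tiff"}
```

In use, the first eight bytes of an upload are enough for this check, for example needs_pdf_conversion(open(path, "rb").read(8)).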

Corrupt File Checks/PDF Repair:

Documents from public sources can cause problems in Document AI processing if they are corrupt in any way. By default, we check for full corruption (the file cannot be opened) and ask the user to re-upload the file before it reaches the processing step. However, even partial corruption in a file that still opens can cause issues in Document AI. During the preprocessing stage, we can detect whether a file is partially corrupt and attempt to repair it in code using the open-source QPDF package.
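
A minimal sketch of how a check-then-repair step might call the qpdf command-line tool is shown below. It assumes qpdf is installed and on the PATH. The function names, and the mapping of qpdf's "warnings" exit code to partial corruption, are assumptions made for illustration; the options the packaged solution actually passes to QPDF are not specified here.

```python
import subprocess

# Exit codes documented for the qpdf command-line tool.
QPDF_OK, QPDF_ERRORS, QPDF_WARNINGS = 0, 2, 3


def classify_check(exit_code: int) -> str:
    """Map a `qpdf --check` exit code to a coarse status (mapping is an assumption)."""
    if exit_code == QPDF_OK:
        return "clean"
    if exit_code == QPDF_WARNINGS:
        # qpdf found problems but was able to work around them.
        return "partially_corrupt"
    # Errors, or any unexpected code, are treated as unusable.
    return "corrupt"


def check_pdf(path: str) -> str:
    """Run `qpdf --check` on a file and classify the result."""
    result = subprocess.run(["qpdf", "--check", path], capture_output=True)
    return classify_check(result.returncode)


def repair_pdf(src: str, dst: str) -> bool:
    """Attempt a repair by having qpdf rewrite the file; True if output was written."""
    result = subprocess.run(["qpdf", src, dst], capture_output=True)
    return result.returncode in (QPDF_OK, QPDF_WARNINGS)
```

A caller could run check_pdf first, ask for a re-upload on "corrupt", and call repair_pdf only on "partially_corrupt" before sending the rewritten file on to processing.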

Document Anti-Virus Scanning:

Security is a priority when receiving uploads from public sources. We’ve prepackaged an implementation of ClamAV, an open-source malware detection tool, to be hosted on GCP via Cloud Run. This implementation scans for known malicious hashes as well as decompression bombs. Threat definitions are updated in real time, so threat detection continues with zero downtime. Documents entering a GCP environment are automatically sent to a staging storage bucket, where they await scanning. A document determined to be clean is transferred to a ‘clean’ bucket and begins Document AI extraction. Documents that do not pass antivirus scanning are transferred to an ‘infected’ bucket, where they can either be destroyed or kept for future reference.
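
The routing step described above can be sketched as follows. This assumes the scan result is available as a clamscan-style exit code (0 for no threat found, 1 for a threat found, 2 for a scanner error). The bucket names and the move callback are hypothetical placeholders; in a real deployment the callback would wrap a Cloud Storage copy-and-delete.

```python
from typing import Callable

CLEAN_BUCKET = "uploads-clean"        # hypothetical bucket name
INFECTED_BUCKET = "uploads-infected"  # hypothetical bucket name


def route_scanned_object(
    object_name: str,
    scan_exit_code: int,
    move: Callable[[str, str], None],
) -> str:
    """Move a staged object to the clean or infected bucket based on the scan result."""
    if scan_exit_code == 0:
        destination = CLEAN_BUCKET
    elif scan_exit_code == 1:
        destination = INFECTED_BUCKET
    else:
        # A scanner error says nothing about the file, so leave it in staging.
        raise RuntimeError(f"scan failed for {object_name} (exit code {scan_exit_code})")
    move(object_name, destination)
    return destination
```

Raising on a scanner error, rather than defaulting to either bucket, keeps an unscanned document out of the Document AI pipeline until it can be rescanned.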