Minnur Yunusov, Senior Drupal Developer
June 13, 2023

If you are a Digital Manager of an enterprise website, you likely have a large PDF problem. Over the years, your site has become a labyrinth of linked PDFs – thousands of them – accessible only via third-party applications. The search for content within these document files has been far from efficient because most search engines fall short of effectively indexing PDF content. 

You are now faced with a formidable digital challenge: How do you streamline access and searchability of these vast document repositories? 

How best to manage this sprawling document landscape remains a subject of ongoing discussion.

At Chapter Three, we believe all third-party documents should be integrated into the Drupal database as structured content. Previously no effective solutions existed, so we developed Document OCR.

Document OCR is a Drupal 9/10 module that extracts structured data from PDFs and images using Optical Character Recognition (OCR) services such as Google Document AI. Services like this convert many document types into accurately parsed, structured JSON payloads at scale. The Document OCR module offers several configuration steps to improve the import process.
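To give a sense of what a "structured JSON payload" can look like, here is a minimal Python sketch. The sample payload is hand-written and only loosely modeled on the shape of a Google Document AI response (a `text` field plus `entities` carrying `type`, `mentionText`, and `confidence`). The values and the `flatten_entities` helper are hypothetical and are not part of the Document OCR module.

```python
import json

# Hand-written sample payload (an assumption for illustration only),
# loosely modeled on the shape of a Document AI response.
SAMPLE_PAYLOAD = json.loads("""
{
  "text": "Invoice 1042\\nTotal: 118.50 USD",
  "entities": [
    {"type": "invoice_id", "mentionText": "1042", "confidence": 0.98},
    {"type": "total_amount", "mentionText": "118.50", "confidence": 0.91},
    {"type": "currency", "mentionText": "USD", "confidence": 0.42}
  ]
}
""")


def flatten_entities(payload, min_confidence=0.5):
    """Return {entity type: text} for entities at or above min_confidence."""
    fields = {}
    for entity in payload.get("entities", []):
        if entity.get("confidence", 0.0) >= min_confidence:
            fields[entity["type"]] = entity["mentionText"]
    return fields


if __name__ == "__main__":
    print(flatten_entities(SAMPLE_PAYLOAD))
```

Filtering on a confidence threshold is one possible design choice; a real import might instead keep every entity and flag low-confidence values for review.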

Document OCR Flow

Notably, our integration with OpenAI imbues your content import with 'superpowers.' For instance, OpenAI could examine all your documents during import, extract keywords, create a taxonomy, and summarize each document.

Setting up Drupal

Setting up your entity fields in Drupal is easy: configure a content type with the fields where you want your document content to appear.

Document OCR has five main configuration areas:

  1. Mapping maps the content structure to your Drupal entity types.
  2. Processors configure your third-party processor, such as Google Document AI.
  3. Transformers clean up and transform your imported content.
  4. Imports provide a list of all imports (processed and pending).
  5. One-time imports handle cases where you only need to import a few documents.

Mapping

The mapping page creates and stores mapping between the source and destination entities. The source entity is where you add files that you need to process, and the destination entity stores extracted contents of the file.

Each mapping type has its own settings. A mapping supports both real-time processing and queue processing (which runs on cron), and lets you set the number of processing attempts.
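The idea of queue processing with a capped number of attempts can be sketched in a few lines of Python. This is an illustration of the general pattern, not the module's implementation: `run_queue`, `process`, and `max_attempts` are hypothetical names, and a real cron-driven queue would spread the work across several runs rather than looping until the queue is empty.

```python
from collections import deque


def run_queue(items, process, max_attempts=3):
    """Process queued items, re-queueing failures until max_attempts is reached."""
    # Each queue entry carries the item and how many attempts it has used.
    queue = deque((item, 0) for item in items)
    done, failed = [], []
    while queue:
        item, attempts = queue.popleft()
        try:
            process(item)
            done.append(item)
        except Exception:
            attempts += 1
            if attempts < max_attempts:
                # Put the item at the back of the queue for another try.
                queue.append((item, attempts))
            else:
                # Out of attempts: record the failure and move on.
                failed.append(item)
    return done, failed
```

Capping attempts keeps a single unreadable file from blocking the rest of the queue indefinitely.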

In the Mapping form, you map the structure of your document payload to Drupal entities.
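Conceptually, a mapping is a lookup from keys in the extracted payload to fields on the destination entity. The Python sketch below illustrates that idea only; the payload keys, the field machine names such as `field_invoice_id`, and the `apply_mapping` helper are all assumptions made for the example, and the module itself stores mappings as Drupal configuration.

```python
# Hypothetical mapping: payload key -> Drupal field machine name.
FIELD_MAPPING = {
    "text": "body",
    "invoice_id": "field_invoice_id",
    "total_amount": "field_total",
}


def apply_mapping(extracted, mapping):
    """Build destination field values; keys missing from the payload are skipped."""
    return {
        field: extracted[key]
        for key, field in mapping.items()
        if key in extracted
    }


if __name__ == "__main__":
    extracted = {"text": "Invoice 1042", "invoice_id": "1042", "extra": "ignored"}
    print(apply_mapping(extracted, FIELD_MAPPING))
```

Payload keys with no mapping entry (here `extra`) are simply dropped, so only the fields you configure end up on the destination entity.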

Processors

The processor page lists all processing plugins enabled on your website. By default, the module supports the Google Document AI and PDF Parser plugins. Google Document AI extracts data from structured documents such as payment receipts, expenses, and government forms into fields. PDF Parser is a simple library that extracts PDF contents as plain text. We encourage extending this module with other document systems.

To keep setup simple, store credentials files as JSON in a private directory. Each plugin has its own configuration form.

Google Document AI processor configuration form. You must create a Google account to use Document AI.

PDF Parser configuration form

Transformers

Transformers are plugins that transform extracted data before saving it into Drupal fields. The Basic, OpenAI, and Pipeline transformers are the default plugins supported in the module.

The basic transformer includes the following configuration:

OpenAI transformer configuration form:

The pipeline transformer plugin allows the addition of multiple transformers in a chain.
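Chaining transformers amounts to feeding each transformer's output into the next one. Here is a minimal Python sketch of that pattern; `strip_whitespace`, `truncate`, and `pipeline` are hypothetical helpers chosen for illustration and do not reflect the module's actual plugins or their options.

```python
import re


def strip_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def truncate(limit):
    """Return a transformer that keeps at most `limit` characters."""
    def _truncate(text):
        return text[:limit]
    return _truncate


def pipeline(*transformers):
    """Combine transformers into one; they are applied left to right."""
    def run(value):
        for transform in transformers:
            value = transform(value)
        return value
    return run


# Example chain: clean up whitespace first, then cap the length.
clean = pipeline(strip_whitespace, truncate(20))
```

Order matters in a chain like this: truncating before cleaning whitespace would count the extra spaces toward the length limit.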

One-time imports

One-time import functionality exists for cases where you don’t need to process files automatically. It is similar to regular mapping, except you don’t need to configure the source entity. You just need to configure the destination entity and pick an optional file field to store the actual file along with its extracted data.

Whether you have one file or thousands of them, Document OCR can fix your file problem and bring all your content into a unified content management platform using Drupal.

If you have a project and would like a demo, please feel free to contact us.