Indexing PDFs for Solr Search

Carlos Espinoza, Drupal Developer and Themer
September 04, 2019

When a client needs to index PDF files for search, the best solution is to use Apache Solr with the Search API Attachments module. In this blog post, I will explain how to set up Solr on Pantheon and how to configure Solr and Search API Attachments.

What is Solr?

Apache Solr is a fast open-source Java search server.

Using Solr with Drupal 8 on Pantheon

Pantheon provides Apache Solr with most plans, including Sandbox, though it is not included in the Basic plan.

Pantheon offers complete instructions for enabling Solr with Drupal 8 on its platform. Here are some of the key steps.

To enable Apache Solr on Pantheon, go to Settings > Add Ons
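
If you prefer the command line to the dashboard, Terminus (Pantheon's CLI) can enable the add-on as well. This is a minimal sketch, assuming Terminus is installed and authenticated; `my-site` in the usage line is a placeholder site name, not a value from this post.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the Terminus command that turns on the Solr add-on.
enable_pantheon_solr() {
  local site="$1"   # Pantheon site name (placeholder: my-site)
  terminus solr:enable "$site"
}

# Usage:
#   enable_pantheon_solr my-site
```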

Add the Search API Pantheon module as a required dependency.

composer require "drupal/search_api_pantheon ~1.0" --prefer-dist

Commit the following modules to your Pantheon site and enable them:

search_api_pantheon
search_api
search_api_solr
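
Once the code is on Pantheon, the modules can be enabled from the command line instead of the Extend page. The sketch below assumes Terminus is installed and authenticated; `my-site.dev` is a placeholder for your own site and environment.

```shell
#!/usr/bin/env bash
# Sketch only: enable the three search modules on a Pantheon environment.
# Terminus forwards everything after "--" to Drush on that environment.
enable_search_modules() {
  local site_env="$1"   # <site>.<env>, e.g. the placeholder my-site.dev
  terminus drush "$site_env" -- en search_api search_api_solr search_api_pantheon -y
}

# Usage (after the code has been committed and pushed):
#   enable_search_modules my-site.dev
```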

Add the Search Server.
/admin/config/search/search-api/add-server

Create your search index.

/admin/config/search/search-api/add-index

Adding the Search API Attachments Module and Config

At this point you will have a working Solr search server on your Pantheon site. The next step is to add and configure the Search API Attachments module.

Running Search API Attachments with Solr requires Tika. Apache Tika is a Java toolkit that extracts text and metadata from documents such as PDFs; Search API Attachments then passes the extracted text to Solr for indexing.

Install Search API Attachments

composer require drupal/search_api_attachments

Go to the Search API Attachments settings page at /admin/config/search/search_api_attachments and enter the following values:

Extraction method: Tika Extractor
Path to java executable: java
Path to Tika .jar file: /srv/bin/tika-app-1.18.jar

Verify that your site is able to extract text from documents by clicking “Submit and test extraction.”

At this point we have Solr and Search API Attachments working with Tika on Pantheon.
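
The same extraction can be tried by hand on any machine where Java and the Tika jar are available, which helps when the test button reports an error. This is a rough sketch: the default jar path is the one from the settings above, and `sample.pdf` is a hypothetical file name.

```shell
#!/usr/bin/env bash
# Sketch only: print the plain text Tika extracts from one document.
# --text asks the Tika app for plain-text output on stdout.
tika_extract_text() {
  local file="$1"                                 # document to extract
  local jar="${2:-/srv/bin/tika-app-1.18.jar}"    # path to the Tika app jar
  java -jar "$jar" --text "$file"
}

# Usage:
#   tika_extract_text sample.pdf | head -n 20
```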

Go to the index you just created, open its Processors tab, and enable the File attachments processor.

Adding PDF Fields

I use Media Entity to add PDF fields. You can either add a Media field that references the “File” media type to your content type, or add a plain “File” field to the content type for the PDF.

[Screenshot: Media Entity Field]

[Screenshot: File Type Field]

On the Fields tab of your search index, add the Search API Attachments fields that match the names of the PDF fields you created. They appear in the General section of the field list.

Re-index with the new fields. You will then be able to search the text inside the attached PDFs.
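
Re-indexing can also be done with the Drush commands that Search API ships. This is a sketch under a few assumptions: Drush is available for the site, and `my_index` in the usage line is a placeholder for the machine name of your index. The short aliases are used because they work in both Drush 8 and Drush 9.

```shell
#!/usr/bin/env bash
# Sketch only: queue every item for re-indexing, index them, show status.
reindex_search_api() {
  local index_id="$1"   # machine name of the Search API index
  drush sapi-r "$index_id" &&   # reset the tracker so all items are re-queued
    drush sapi-i "$index_id" && # index the queued items
    drush sapi-s                # print indexing status for all indexes
}

# Usage:
#   reindex_search_api my_index
```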

Displaying the Search Results in a View

Now that we have Solr, indexing, and the Search API Attachments settings working, it’s time to display the results. You’ll need to create a View with content from your Solr index.

Add Fields and Filter Criteria to display search results in a View page.

The Fulltext search filter criterion lets the indexed fields be searched by keyword.

Setting up Solr with Lando Locally

We now have everything working on Pantheon, but if you need to test locally, you can set up a local configuration with Lando based on this example.

name: sitename
recipe: drupal8
drupal: true
config:
 webroot: .
 php: '7.2'
 xdebug: true
 drush: ^8
 
proxy:
 solr:
   - solr.example.lndo.site:8983
   # Provides a nice lndo url for the solr web interface
   # at http://admin.solr.lndo.site
   - admin.solr.lndo.site:8983
 
services:
 database:
   type: mysql:5.6
 
 solr:
   # Use solr version 5.5
   type: solr:5.5
   # Optionally allow access to Solr at localhost:9999
   # You will need to make sure port 9999 is open on your machine
   #
   # You can also set `portforward: true` to have Lando dynamically assign
   # a port. Unlike specifying an actual port setting this to true will give you
   # a different port every time you restart your app
   portforward: 9999
   # Optionally declare the name of the solr core.
   # This setting is only applicable for versions 5.5 and above
   core: freedom
   config:
     conf: modules/contrib/search_api_solr/solr-conf/5.x
 appserver:
   extras:
     # Apache Tika
     - apt-get update -y
     - apt-get install -y openjdk-8-jre-headless
     - apt-get install -y openjdk-8-jdk
     - mkdir -p /app/srv/bin && cd /app/srv/bin
     - cd /app/srv/bin && curl -OL http://archive.apache.org/dist/tika/tika-app-1.16.jar
     - apt-get remove openjdk-8-jdk -y
       
tooling:
 drush:
   service: appserver
   cmd:
     - "drush"
     - "/app/vendor/bin/drush --root=/app/ --uri=http://site-name.lndo.site"

 

Configure the Solr Server as you did with Pantheon.
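
The local Solr service and the Tika download only exist once Lando has built the app from the config above. A minimal sketch of that step, assuming Lando is installed and the command is run from the project root; the jar path is the one the `extras` section downloads to.

```shell
#!/usr/bin/env bash
# Sketch only: rebuild the Lando app, list its services, then confirm that
# Java and the Tika jar are present inside the appserver container.
lando_rebuild_and_check() {
  lando rebuild -y &&
    lando info &&
    lando ssh -s appserver -c "java -version && ls -l /app/srv/bin/tika-app-1.16.jar"
}

# Usage:
#   lando_rebuild_and_check
```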

Go to the Search API Attachments settings at /admin/config/search/search_api_attachments. Locally, the path to the Tika .jar file is the one the Lando config downloads to: /app/srv/bin/tika-app-1.16.jar.

Once you submit the test, you should see the same green success message and be ready to work in your local environment with Solr and Search API Attachments.
