Using Tika and the Attachments module to index PDFs, DOC files, etc.

If you would like to index the text content of file attachments on your Drupal nodes (PDFs, Word documents, spreadsheets, etc.), you can do so using either of the following modules: Search API Attachments or Apache Solr Search Attachments.

Hosted Apache Solr includes Apache Tika, a software library that extracts text from file attachments. The fastest and most customizable way to use Apache Tika is to install it on the same server as your Drupal site. If you would rather use the extraction handler running on Hosted Apache Solr's servers, configure the Solr modules according to the instructions below:

Search API Attachments

Make sure the Search API Attachments module is enabled, then add a new file search index and configure it as instructed below:

  1. Visit the Search API Attachments configuration page (admin/config/search/search_api/attachments) and set the following options:
    1. Extraction method: Solr (remote server)
    2. Solr extracting servlet path: extract/tika
  2. Create a new file search index (click 'Add index' on the admin/config/search/search_api page, and choose 'File' for 'Item type').
  3. After the index is created, add a couple of fields (like 'File name') to the 'Fields to index', and save the changes.
  4. For the file search index's filters configuration, make sure 'File attachments' is checked under 'Data Alterations', then save the changes.
  5. Go back to the index's 'Fields' tab, and you should now see a 'File content' (attachments_content) field in the list, set to be indexed. This means the file attachments will be indexed correctly for this file search index.
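
To check remote extraction outside Drupal, you can post a file straight to the extracting handler. The following is a minimal sketch, not part of either module. It assumes the 'extract/tika' servlet path from step 1 resolves relative to your Solr core URL and that the handler accepts a raw file body, as Solr's extracting request handler normally does. The function names and the example URL are hypothetical; extractOnly and wt are standard Solr Cell request parameters.

```python
import urllib.request
from urllib.parse import urlencode


def build_extract_url(solr_core_url, servlet_path="extract/tika"):
    """Build the extraction URL from a core URL and the servlet path.

    extractOnly=true asks Solr to return the extracted text without
    indexing the document; wt=json requests a JSON response.
    """
    base = solr_core_url.rstrip("/")
    path = servlet_path.strip("/")
    query = urlencode({"extractOnly": "true", "wt": "json"})
    return f"{base}/{path}?{query}"


def extract_text(solr_core_url, file_path,
                 content_type="application/pdf", timeout=30):
    """POST a local file to the extracting handler; return the raw response.

    Assumption: the server needs no extra authentication beyond the URL.
    """
    with open(file_path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        build_extract_url(solr_core_url),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (placeholder URL, replace with your own core URL):
# print(extract_text("https://solr.example.com/solr/mycore", "sample.pdf"))
```

If the response contains the text of the file, the remote handler is reachable and any remaining problem is more likely in the index configuration than on the Solr side.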

Apache Solr Search Attachments

Make sure the Apache Solr Attachments module is enabled, then follow the instructions below:

  1. On the Apache Solr Search Attachments configuration page (admin/config/search/apachesolr/attachments), select 'Solr (remote server)' for the 'Extract using' option, and save the configuration.
  2. To verify that extraction is working, click the 'Test your tika extraction' button at the bottom of the same page.

If the test fails and your Drupal site's logs show an error like java.lang.ClassNotFoundException: solr.extraction.ExtractingRequestHandler, your Solr configuration may be incorrect. In some cases an older Apache Solr Search configuration is in use, and it incorrectly looks for a file named apache-solr-cell..., when it should look for solr-cell... (when using Solr 4.10.4 or later).
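
One rough way to look for the older file name is to scan your solrconfig.xml for library directives that still mention it. The sketch below assumes the reference lives in the regex attribute of a lib element, as in the stock Solr example configurations; your configuration may be organized differently. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def find_legacy_solr_cell_libs(solrconfig_xml: str) -> List[str]:
    """Return the regex attribute of every <lib> element that still
    refers to the older 'apache-solr-cell' jar name."""
    root = ET.fromstring(solrconfig_xml)
    legacy = []
    for lib in root.iter("lib"):
        pattern = lib.get("regex", "")
        if "apache-solr-cell" in pattern:
            legacy.append(pattern)
    return legacy


# Example usage (the file path is a placeholder):
# with open("solrconfig.xml", encoding="utf-8") as handle:
#     print(find_legacy_solr_cell_libs(handle.read()))
```

An empty list means no lib directive mentions the older name. Any entries returned are candidates for updating to the solr-cell naming.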