Drupal modules for browsing and managing Fedora-based digital repositories.


islandora_text_extraction

Connects Islandora 8 to the Hypercube microservice and extracts text from PDFs.

Install the module in the usual way, then copy assets/ca.islandora.alpaca.connector.ocr.blueprint.xml to /opt/karaf/deploy on the server. Note: this config file assumes a URL of http://localhost:8000/hypercube.
If your service is found elsewhere, this must be changed. There is no need to restart.

In the usual Ansible build this will require no modification.
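For a non-Ansible install where Hypercube lives at a different address, the copy-and-edit step above could be scripted roughly as follows. This is only a sketch: deploy_blueprint is a hypothetical helper, not something the module ships, and it assumes the default URL appears literally in the blueprint file. The example hostname is also made up.

```shell
# Sketch only: deploy_blueprint is a hypothetical helper, not part of the module.
# Assumes the default Hypercube URL appears literally in the blueprint XML.
deploy_blueprint() {
  src="$1"        # path to the blueprint XML
  dest_dir="$2"   # Karaf deploy directory, e.g. /opt/karaf/deploy
  new_url="$3"    # where Hypercube actually lives
  default_url="http://localhost:8000/hypercube"

  [ -f "$src" ] || { echo "blueprint not found: $src" >&2; return 1; }
  [ -d "$dest_dir" ] || { echo "deploy directory not found: $dest_dir" >&2; return 1; }

  # '|' is the sed delimiter because URLs contain '/'.
  # A new_url containing '|' or '&' would need escaping first.
  sed "s|$default_url|$new_url|g" "$src" > "$dest_dir/$(basename "$src")"
}

# Example call (hypothetical host; writing to /opt/karaf/deploy may need elevated privileges):
#   deploy_blueprint assets/ca.islandora.alpaca.connector.ocr.blueprint.xml \
#     /opt/karaf/deploy http://hypercube.example.org:8000/hypercube
```

The function writes a modified copy into the deploy directory and leaves the file under assets/ untouched, so the original default is still available for comparison.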

If a parent node is tagged as Digital Document, text will be extracted from any media tagged as Image at the time of ingestion.
The content type of the parent node should be configured to allow multiple tags.

Note: Media are linked to their parent nodes with the Media Of entity reference field. If you wish to attach the PDF (or any other) media type to a parent node whose content type is anything other than Repository Item (islandora_object), the parent content type will have to be added to the Media Of field in the media type description.

Prepare module for PDF text extraction

Install pdftotext on your server if it is not already present. On an Ubuntu/Debian machine, like the one built by the default claw playbook, run sudo apt-get install poppler-utils

Test that it has been properly installed with which pdftotext
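If the check needs to go into a provisioning script, something like the following would do. It is a minimal sketch: have_binary is a hypothetical helper name, and it uses the POSIX command -v rather than which.

```shell
# Sketch only: have_binary is a hypothetical helper name.
# Reports where a command resolves on PATH, or returns non-zero if it is missing.
have_binary() {
  if path=$(command -v "$1" 2>/dev/null); then
    echo "$1 found at $path"
  else
    echo "$1 not found on PATH" >&2
    return 1
  fi
}

# Example call:
#   have_binary pdftotext || sudo apt-get install poppler-utils
```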

Install the PHP library with composer require spatie/pdf-to-text

In the unlikely event that your pdftotext binary exists on your server outside of the system path, the path to the binary can be set at /admin/config/islandora/text_extraction.
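Before entering a non-standard path in that form, it may be worth confirming that the binary at that location actually runs. The sketch below assumes a hypothetical helper called extract_with. It relies on pdftotext's documented behaviour of writing to standard output when the output file is given as "-".

```shell
# Sketch only: extract_with is a hypothetical helper name.
# Runs a pdftotext binary from an explicit path and prints the extracted text.
extract_with() {
  bin="$1"   # full path to the pdftotext binary
  pdf="$2"   # PDF file to extract text from

  [ -x "$bin" ] || { echo "not executable: $bin" >&2; return 1; }
  [ -f "$pdf" ] || { echo "file not found: $pdf" >&2; return 1; }

  # "-" as the output file tells pdftotext to write to stdout.
  "$bin" "$pdf" -
}

# Example call (both paths are placeholders):
#   extract_with /usr/local/bin/pdftotext sample.pdf | head
```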

Using text extraction

The containing document must be tagged as Digital Document, and the media must be tagged as Original File. A new editable Extracted Text media will be created and attached when PDF or Image media types are added to a node.