islandora/modules/islandora_text_extraction/README.md

# islandora_text_extraction
### Connects Islandora 8 to Hypercube microservice and extracts text from PDFs

Install module in the usual way, 
then copy `assets/ca.islandora.alpaca.connector.ocr.blueprint.xml` 
to `/opt/karaf/deploy` on the server. 
 _note:_ This config file assumes a URL of `http://localhost:8000/hypercube`.  
If your service is found elsewhere this must be changed.
There is no need to restart.
  
In the usual Ansible build this will require no modification.

If a parent node is tagged as `Digital Document` an `Image` tagged media
will extract text from that image at the time of ingestion.  
The content type of the parent node should be configured to allow multiple tags.

_note:_ Media are linked to their parent nodes with the `Media Of` 
entity reference field.  If you wish to attach the PDF (or any other ) media type
to a parent node which has any content type other than Repository Item 
(islandora_object) the parent content type will have to be added to the `Media Of`
field in the media type description.

## Prepare module for PDF text extraction
Install `texttopdf` on your server if not already present.
On an Ubuntu/Debian machine like the default claw playbook run 
`sudo apt-get install poppler-utils`

test to see its been properly installed with `which pdftotext`

Install php libraries with  `composer require spatie/pdf-to-text`

In the unlikely event that your `pdftotext` binary exists on your server 
outside of the system path, the path to the binary can be set at 
`/admin/config/islandora/text_extraction`.

## Using text extraction ##
The containing document must be tagged as `Digital Document`, 
and the media must be tagged as `Original File`.
A new editable `Extracted Text` media will be created and attached when `PDF` or 
`Image` media types are added to a node.
Text extraction (#163) 5 years ago			`# islandora_text_extraction`
			`### Connects Islandora 8 to Hypercube microservice and extracts text from PDFs`

			`Install module in the usual way,`
			then copy `assets/ca.islandora.alpaca.connector.ocr.blueprint.xml`
			to `/opt/karaf/deploy` on the server.
Ran Aspell CLI spell checker (#178) I used an Aspell "word list" based on the "word list" from the ISLE repo found here... https://raw.githubusercontent.com/Islandora-Collaboration-Group/ISLE/master/.aspell.en.pws 5 years ago			_note:_ This config file assumes a URL of `http://localhost:8000/hypercube`.
Text extraction (#163) 5 years ago			`If your service is found elsewhere this must be changed.`
			`There is no need to restart.`

			`In the usual Ansible build this will require no modification.`

			If a parent node is tagged as `Digital Document` an `Image` tagged media
			`will extract text from that image at the time of ingestion.`
			`The content type of the parent node should be configured to allow multiple tags.`

			_note:_ Media are linked to their parent nodes with the `Media Of`
			`entity reference field. If you wish to attach the PDF (or any other ) media type`
			`to a parent node which has any content type other than Repository Item`
			(islandora_object) the parent content type will have to be added to the `Media Of`
			`field in the media type description.`

			`## Prepare module for PDF text extraction`
			Install `texttopdf` on your server if not already present.
Ran Aspell CLI spell checker (#178) I used an Aspell "word list" based on the "word list" from the ISLE repo found here... https://raw.githubusercontent.com/Islandora-Collaboration-Group/ISLE/master/.aspell.en.pws 5 years ago			`On an Ubuntu/Debian machine like the default claw playbook run`
Text extraction (#163) 5 years ago			`sudo apt-get install poppler-utils`

			test to see its been properly installed with `which pdftotext`

			Install php libraries with `composer require spatie/pdf-to-text`

			In the unlikely event that your `pdftotext` binary exists on your server
			`outside of the system path, the path to the binary can be set at`
			`/admin/config/islandora/text_extraction`.

			`## Using text extraction ##`
			The containing document must be tagged as `Digital Document`,
			and the media must be tagged as `Original File`.
			A new editable `Extracted Text` media will be created and attached when `PDF` or
			`Image` media types are added to a node.