More CLAWditing

pull/739/head
dannylamb 5 years ago
parent
commit
a9497941a8
  1. CONTRIBUTING.md (2 changes)
  2. modules/islandora_text_extraction/README.md (67 changes)

2
CONTRIBUTING.md

@@ -8,7 +8,7 @@ Please note that this project operates under the [Islandora Community Code of Co
 ## Workflows
-The group meets each Wednesday at 1:00 PM Eastern. Meeting notes and announcements are posted to the [Islandora community list](https://groups.google.com/forum/#!forum/islandora) and the [Islandora developers list](https://groups.google.com/forum/#!forum/islandora-dev). You can view meeting agendas, notes, and call-in information [here](https://github.com/Islandora/documentation/wiki#islandora-claw-tech-calls). Anybody is welcome to join the calls, and add items to the agenda.
+The group meets each Wednesday at 1:00 PM Eastern. Meeting notes and announcements are posted to the [Islandora community list](https://groups.google.com/forum/#!forum/islandora) and the [Islandora developers list](https://groups.google.com/forum/#!forum/islandora-dev). You can view meeting agendas, notes, and call-in information [here](https://github.com/Islandora/documentation/wiki#islandora-8-tech-calls). Anybody is welcome to join the calls, and add items to the agenda.
 ### Use cases

67
modules/islandora_text_extraction/README.md

@@ -1,44 +1,49 @@
-# islandora_text_extraction
-### Connects Islandora 8 to Hypercube microservice and extracts text from PDFs
-Install module in the usual way,
-then copy `assets/ca.islandora.alpaca.connector.ocr.blueprint.xml`
-to `/opt/karaf/deploy` on the server.
-_note:_ This config file assumes a URL of `http://localhost:8000/hypercube`.
-If your service is found elsewhere this must be changed.
-There is no need to restart.
-In the usual Ansible build this will require no modification.
-If a parent node is tagged as `Digital Document` an `Image` tagged media
-will extract text from that image at the time of ingestion.
-The content type of the parent node should be configured to allow multiple tags.
-_note:_ Media are linked to their parent nodes with the `Media Of`
-entity reference field. If you wish to attach the PDF (or any other ) media type
-to a parent node which has any content type other than Repository Item
-(islandora_object) the parent content type will have to be added to the `Media Of`
-field in the media type description.
-## Prepare module for PDF text extraction
-Install `texttopdf` on your server if not already present.
-On an Ubuntu/Debian machine like the default claw playbook run
-`sudo apt-get install poppler-utils`
-test to see its been properly installed with `which pdftotext`
-Install php libraries with `composer require spatie/pdf-to-text`
-In the unlikely event that your `pdftotext` binary exists on your server
-outside of the system path, the path to the binary can be set at
-`/admin/config/islandora/text_extraction`.
-## Using text extraction ##
-The containing document must be tagged as `Digital Document`,
-and the media must be tagged as `Original File`.
-A new editable `Extracted Text` media will be created and attached when `PDF` or
-`Image` media types are added to a node.
+# Islandora Text Extraction `
+[![Minimum PHP Version](https://img.shields.io/badge/php-%3E%3D%207.2-8892BF.svg?style=flat-square)](https://php.net/)
+[![Contribution Guidelines](http://img.shields.io/badge/CONTRIBUTING-Guidelines-blue.svg)](./CONTRIBUTING.md)
+[![LICENSE](https://img.shields.io/badge/license-GPLv2-blue.svg?style=flat-square)](./LICENSE)
+## Introduction
+Provides actions to extract text with a [Hypercube](https://github.com/Islandora/Crayfish/tree/dev/Hypercube) (`tessseract` and `pdftotext`) server, as well as a Media type to hold the extracted text.
+## Requirements
+- `islandora` and `islandora_core_feature`
+- A Hypercube microservice
+- A message broker (e.g. Activemq) for Islandora 8
+- An instance of `islandora-connector-derivative` configured for Hypercube
+## Installation
+For a full digital repository solution (including a Hypercube microservice), see our [installation documentation](https://islandora.github.io/documentation/installation/).
+To download/enable just this module, use the following from the command line:
+```bash
+$ composer require islandora/islandora
+$ drush en islandora_core_feature
+$ drush mim islandora_tags
+$ drush en islandora_text_extraction
+```
+## Documentation
+Official documentation is available on the [Islandora 8 documentation site](https://islandora.github.io/documentation/).
+## Sponsors
+Original work for this module was done by @ajstanley for @roblib at University of Prince Edward Island.
+## Development
+If you would like to contribute, please get involved by attending our weekly [Tech Call](https://github.com/Islandora/documentation/wiki). We love to hear from you!
+If you would like to contribute code to the project, you need to be covered by an Islandora Foundation [Contributor License Agreement](http://islandora.ca/sites/default/files/islandora_cla.pdf) or [Corporate Contributor License Agreement](http://islandora.ca/sites/default/files/islandora_ccla.pdf). Please see the [Contributors](http://islandora.ca/resources/contributors) pages on Islandora.ca for more information.
+We recommend using the [islandora-playbook](https://github.com/Islandora-Devops/islandora-playbook) to get started.
+## License
+[GPLv2](http://www.gnu.org/licenses/gpl-2.0.txt)
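
The README text removed by this commit suggested confirming the `pdftotext` install with `which pdftotext`. For anyone who still wants that check outside the README, here is a minimal sketch as a shell function, assuming a POSIX-compatible shell. The function name `require_binary` is hypothetical and is not part of the module or of Islandora.

```bash
#!/bin/sh
# Hypothetical helper (not part of islandora_text_extraction):
# report whether a binary is available on the current PATH.
require_binary() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# On Ubuntu/Debian, pdftotext is provided by the poppler-utils package,
# as the removed README text notes.
require_binary pdftotext || echo "try: sudo apt-get install poppler-utils"
```

`command -v` is used here rather than `which` because it is specified by POSIX and is built into the shell, so the check does not depend on an extra binary being installed.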
