
More CLAWditing

pull/739/head
dannylamb 5 years ago
parent
commit
a9497941a8
  1. 2
      CONTRIBUTING.md
  2. 67
      modules/islandora_text_extraction/README.md

2
CONTRIBUTING.md

@@ -8,7 +8,7 @@ Please note that this project operates under the [Islandora Community Code of Co
## Workflows
The group meets each Wednesday at 1:00 PM Eastern. Meeting notes and announcements are posted to the [Islandora community list](https://groups.google.com/forum/#!forum/islandora) and the [Islandora developers list](https://groups.google.com/forum/#!forum/islandora-dev). You can view meeting agendas, notes, and call-in information [here](https://github.com/Islandora/documentation/wiki#islandora-claw-tech-calls). Anybody is welcome to join the calls, and add items to the agenda.
The group meets each Wednesday at 1:00 PM Eastern. Meeting notes and announcements are posted to the [Islandora community list](https://groups.google.com/forum/#!forum/islandora) and the [Islandora developers list](https://groups.google.com/forum/#!forum/islandora-dev). You can view meeting agendas, notes, and call-in information [here](https://github.com/Islandora/documentation/wiki#islandora-8-tech-calls). Anybody is welcome to join the calls, and add items to the agenda.
### Use cases

67
modules/islandora_text_extraction/README.md

@@ -1,44 +1,49 @@
# islandora_text_extraction
### Connects Islandora 8 to Hypercube microservice and extracts text from PDFs
# Islandora Text Extraction
Install module in the usual way,
then copy `assets/ca.islandora.alpaca.connector.ocr.blueprint.xml`
to `/opt/karaf/deploy` on the server.
_note:_ This config file assumes a URL of `http://localhost:8000/hypercube`.
If your service is found elsewhere this must be changed.
There is no need to restart.
[![Minimum PHP Version](https://img.shields.io/badge/php-%3E%3D%207.2-8892BF.svg?style=flat-square)](https://php.net/)
[![Contribution Guidelines](http://img.shields.io/badge/CONTRIBUTING-Guidelines-blue.svg)](./CONTRIBUTING.md)
[![LICENSE](https://img.shields.io/badge/license-GPLv2-blue.svg?style=flat-square)](./LICENSE)
In the usual Ansible build this will require no modification.
## Introduction
If a parent node is tagged as `Digital Document`, text is extracted from
its `Image`-tagged media at the time of ingestion.
The content type of the parent node should be configured to allow multiple tags.
Provides actions to extract text with a [Hypercube](https://github.com/Islandora/Crayfish/tree/dev/Hypercube) (`tesseract` and `pdftotext`) server, as well as a Media type to hold the extracted text.
_note:_ Media are linked to their parent nodes with the `Media Of`
entity reference field. If you wish to attach the PDF (or any other) media type
to a parent node which has any content type other than Repository Item
(islandora_object) the parent content type will have to be added to the `Media Of`
field in the media type description.
## Requirements
## Prepare module for PDF text extraction
Install `pdftotext` on your server if it is not already present.
On an Ubuntu/Debian machine, such as one built by the default claw playbook, run
`sudo apt-get install poppler-utils`
- `islandora` and `islandora_core_feature`
- A Hypercube microservice
- A message broker (e.g. ActiveMQ) for Islandora 8
- An instance of `islandora-connector-derivative` configured for Hypercube
Test that it has been properly installed with `which pdftotext`.
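The check above can be wrapped in a small script. This is a minimal sketch, not part of the module: `check_tool` is a hypothetical helper name, and the install hint assumes an Ubuntu/Debian server as described above.

```bash
# Hypothetical helper (not part of islandora_text_extraction):
# reports whether a command is available on the system path.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# pdftotext is provided by the poppler-utils package on Ubuntu/Debian.
check_tool pdftotext || echo "try: sudo apt-get install poppler-utils"
```

If the binary lives outside the system path, this check reports it as missing even though the module can still be pointed at it through `/admin/config/islandora/text_extraction`.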
## Installation
Install the PHP libraries with `composer require spatie/pdf-to-text`.
For a full digital repository solution (including a Hypercube microservice), see our [installation documentation](https://islandora.github.io/documentation/installation/).
In the unlikely event that your `pdftotext` binary exists on your server
outside of the system path, the path to the binary can be set at
`/admin/config/islandora/text_extraction`.
To download/enable just this module, use the following from the command line:
## Using text extraction ##
The containing document must be tagged as `Digital Document`,
and the media must be tagged as `Original File`.
A new editable `Extracted Text` media will be created and attached when `PDF` or
`Image` media types are added to a node.
```bash
$ composer require islandora/islandora
$ drush en islandora_core_feature
$ drush mim islandora_tags
$ drush en islandora_text_extraction
```
## Documentation
Official documentation is available on the [Islandora 8 documentation site](https://islandora.github.io/documentation/).
## Sponsors
Original work for this module was done by @ajstanley for @roblib at University of Prince Edward Island.
## Development
If you would like to contribute, please get involved by attending our weekly [Tech Call](https://github.com/Islandora/documentation/wiki). We love to hear from you!
If you would like to contribute code to the project, you need to be covered by an Islandora Foundation [Contributor License Agreement](http://islandora.ca/sites/default/files/islandora_cla.pdf) or [Corporate Contributor License Agreement](http://islandora.ca/sites/default/files/islandora_ccla.pdf). Please see the [Contributors](http://islandora.ca/resources/contributors) pages on Islandora.ca for more information.
We recommend using the [islandora-playbook](https://github.com/Islandora-Devops/islandora-playbook) to get started.
## License
[GPLv2](http://www.gnu.org/licenses/gpl-2.0.txt)
