Changeset 35783


Timestamp: 2021-12-09T15:05:51+13:00
Author: cstephen
Message: Update README
Location: gs3-extensions/atea-nlp-tools/trunk/src/ocr
Files: 2 edited

  • gs3-extensions/atea-nlp-tools/trunk/src/ocr/README.md

    r35733 r35783  
    @@ -1,30 +1,121 @@
    -# Installation
    +# OCR Servlet Setup
     
    -Find the path to your tesseract binaries directory, then run
    +1. Download the latest Tesseract data models for English, Maori and OSD and place them into a directory of your choosing.
     
    -```
    -./setup.sh
    +    - [English](https://github.com/tesseract-ocr/tessdata_fast/raw/main/eng.traineddata)
    +    - [Maori](https://github.com/tesseract-ocr/tessdata_fast/raw/main/mri.traineddata)
    +    - [OSD](https://github.com/tesseract-ocr/tessdata_fast/raw/main/osd.traineddata)
    +
    +    Then, run the setup script, providing the path to this directory. You MUST NOT move or delete this directory without re-running the setup script and re-compiling the project.
    +
    +    ```sh
    +    > ./setup.sh
    +    ```
    +
    +2. Edit the HTML content of the `unauthorised` page so that the button redirects to your preferred location.
    +
    +    ```sh
    +    nano src/main/webapp/webContent/unauthorised.html
    +    ```
    +
    +3. Compile and install the WAR file.
    +
    +    ```sh
    +    > ant install
    +    ```
    +
    +4. Update the apache2 config with the relevant `ProxyPass` rules
    +
    +    ```sh
    +    > sudo nano /etc/apache2/sites-enabled/000-default-le-ssl.conf
    +    > sudo nano /etc/apache2/sites-enabled/000-default.conf
    +
    +    ProxyPass /gs3-koreromaori http://localhost:8383/gs3-macroniser
    +    ProxyPassReverse /gs3-koreromaori http://localhost:8383/gs3-macroniser
    +    ```
    +
    +5. If `403 Forbidden` errors are observed when consuming the API, update the CORS filter in `web.xml` to include your root domains, and re-install.
    +
    +    ```sh
    +    > nano src/main/webapp/WEB-INF/web.xml
    +
    +    <filter>
    +        <filter-name>CorsFilter</filter-name>
    +        <filter-class>org.apache.catalina.filters.CorsFilter</filter-class>
    +        <init-param>
    +            <param-name>cors.allowed.origins</param-name>
    +            -- <param-value>http://localhost:8080</param-value> <!-- Separate values by a comma -->
    +            ++ <param-value>http://localhost:8080,http://atea.space,https://atea.space</param-value>
    +        </init-param>
    +    </filter>
    +    ```
    +
    +# Consuming the API
    +
    +## OCR Endpoint
    +
    +- Endpoint: `/tesseract`
    +- Method: `POST`
    +- Request Content Type: `multipart/form-data`
    +- Response Content Type: `application/json`
    +
    +The `tesseract` endpoint runs the Tesseract OCR engine on the provided images, and returns the results.
    +
    +### Expected Form Parts
    +
    +Name | Type | Optional | Description
    +--|--|--|--
    +`options` | `OptionMap` | Yes | The options to use when macronising each file.
    +Image file parts | `blob` | At least one. | The images to perform OCR on.
    +
    +#### `OptionMap` Object
    +
    +A map of options for each submitted file. The option key should match the name of the corresponding file part in the request.
    +
    +- `layoutDetection`: A value indicating whether or not Tesseract should attempt to automatically detect the layout of the image.
    +
    +```json
    +{
    +    "key1": {
    +        "layoutDetection": true
    +    },
    +    ...
    +}
     ```
     
    -# Now edit:
    +### Response Fields
     
    -  src/main/webapp/webContent/unauthorizsed.html
    +Name | Type | Optional | Description
    +--|--|--|--
    +`key` | `string` | No | The unique key of the image that this result was produced from. Matches the name of the file part in the request.
    +`fileName` | `string` | No | The name of the file that this result was produced from.
    +`text` | `string` | No | The extracted text.
    +`thresholdedImageKey` | `string` | No | A key that can be used with the `image` endpoint to retrieve the 'thresholded image', which is the final stage of Tesseract's internal image processing before it runs the OCR algorithm.
     
    -so the button displays the name Atea, and points to the https://atea.space website
    +#### Example Response
     
    +```json
    +[
    +    {
    +       "key": "0test.png",
    +       "fileName": "test.png",
    +       "text": "Te Kāwanatanga o Aotearoa\n",
    +       "thresholdedImageKey": "7e383d85-4a4c-481d-83bf-c5e384512399.webp"
    +    },
    +    ...
    +]
    +```
     
    -# Compile up and install the war file
    +## Image Retrieval
     
    -  ant install
    +- Endpoint: `/image`
    +- Method: `GET`
    +- Response Content Type: `image/webp`
     
    +The `image` endpoint can be used to retrieve images associated with the OCR process that occurs when calling other endpoints. It returns images in the `webp` format.
     
    -# Now update apache2 config with ProxyPass rules
    +### Expected Request Parameters
     
    - sudo emacs /etc/apache2/sites-enabled/000-default-le-ssl.conf
    -
    -   ProxyPass /gs3-atea-ocr http://localhost:8383/gs3-atea-ocr
    -   ProxyPassReverse /gs3-atea-ocr http://localhost:8383/gs3-atea-ocr
    -
    -#Same for the http version:
    -
    - sudo emacs /etc/apache2/sites-enabled/000-default.conf
    +Name | Type | Optional | Description
    +--|--|--|--
    +`key` | `string` | No | The key of the image to retrieve.
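
The request shapes documented in the new README can be sketched from the client side. The following is a minimal Python sketch, not code from this changeset. Several things are assumptions: the base URL and both helper names are hypothetical, the `options` part is assumed to be sent as JSON text, and `key` is assumed to travel as a query-string parameter on the `image` endpoint.

```python
import json
from urllib.parse import urlencode

# Hypothetical deployment URL; substitute the path your proxy exposes.
BASE_URL = "https://example.org/gs3-atea-ocr"


def build_option_map(part_names, layout_detection=True):
    """Build the `options` form part: one entry per image part name."""
    options = {name: {"layoutDetection": layout_detection} for name in part_names}
    # Assumption: the servlet accepts the OptionMap serialised as JSON text.
    return json.dumps(options)


def image_url(key, base_url=BASE_URL):
    """Build a GET URL for the `image` endpoint.

    Assumption: `key` is passed as a query-string parameter.
    """
    return f"{base_url}/image?{urlencode({'key': key})}"
```

With the example response in the README, passing the `thresholdedImageKey` value to `image_url` would give the address to request the thresholded image from, provided the assumptions above hold for your deployment.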
  • gs3-extensions/atea-nlp-tools/trunk/src/ocr/src/main/java/org/atea/nlptools/ocr/services/TesseractOcrService.java

    r35780 r35783  
    @@ -71,5 +71,4 @@
                 File thresholdOutput = new File(this.thresholdOutputPath, fileName);
                 thresholdedImage = api.GetThresholdedImage();
    -            // lept.pixWriteWebP("temp", thresholdedImage, 75, 100);
                 lept.pixWrite(thresholdOutput.getAbsolutePath(), thresholdedImage, lept.IFF_WEBP);
     
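
The service writes the thresholded image with `lept.IFF_WEBP`, and the README lists `image/webp` as the response type of the `image` endpoint. A client that wants to sanity-check the bytes it receives could test for the WebP container signature, which is a RIFF header whose form type is `WEBP`. This is a small Python sketch; the helper name is hypothetical and it is not part of the changeset.

```python
def looks_like_webp(data: bytes) -> bool:
    """Return True if `data` starts with a WebP (RIFF/WEBP) container header."""
    # Bytes 0-3 hold the ASCII tag "RIFF", bytes 4-7 a little-endian size,
    # and bytes 8-11 the form type "WEBP".
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP"
```

The check only inspects the first 12 bytes, so it confirms the container type without decoding the image.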