Changeset 35783
- Timestamp: 2021-12-09T15:05:51+13:00
- Location: gs3-extensions/atea-nlp-tools/trunk/src/ocr
- Files: 2 edited
gs3-extensions/atea-nlp-tools/trunk/src/ocr/README.md
r35733 r35783
@@ -1,30 +1,121 @@
-# Installation
+# OCR Servlet Setup
 
-Find the path to your tesseract binaries directory, then run
+1. Download the latest Tesseract data models for English, Maori and OSD and place them into a directory of your choosing.
 
-```
-./setup.sh
+- [English](https://github.com/tesseract-ocr/tessdata_fast/raw/main/eng.traineddata)
+- [Maori](https://github.com/tesseract-ocr/tessdata_fast/raw/main/mri.traineddata)
+- [OSD](https://github.com/tesseract-ocr/tessdata_fast/raw/main/osd.traineddata)
+
+Then, run the setup script, providing the path to this directory. You MUST NOT move or delete this directory without re-running the setup script and re-compiling the project.
+
+```sh
+> ./setup.sh
+```
+
+2. Edit the HTML content of the `unauthorised` page so that the button redirects to your preferred location.
+
+```sh
+nano src/main/webapp/webContent/unauthorised.html
+```
+
+3. Compile and install the WAR file.
+
+```sh
+> ant install
+```
+
+4. Update the apache2 config with the relevant `ProxyPass` rules
+
+```sh
+> sudo nano /etc/apache2/sites-enabled/000-default-le-ssl.conf
+> sudo nano /etc/apache2/sites-enabled/000-default.conf
+
+ProxyPass /gs3-koreromaori http://localhost:8383/gs3-macroniser
+ProxyPassReverse /gs3-koreromaori http://localhost:8383/gs3-macroniser
+```
+
+5. If `403 Forbidden` errors are observed when consuming the API, update the CORS filter in `web.xml` to include your root domains, and re-install.
+
+```sh
+> nano src/main/webapp/WEB-INF/web.xml
+
+<filter>
+<filter-name>CorsFilter</filter-name>
+<filter-class>org.apache.catalina.filters.CorsFilter</filter-class>
+<init-param>
+<param-name>cors.allowed.origins</param-name>
+-- <param-value>http://localhost:8080</param-value> <!-- Separate values by a comma -->
+++ <param-value>http://localhost:8080,http://atea.space,https://atea.space</param-value>
+</init-param>
+</filter>
+```
+
+# Consuming the API
+
+## OCR Endpoint
+
+- Endpoint: `/tesseract`
+- Method: `POST`
+- Request Content Type: `multipart/form-data`
+- Response Content Type: `application/json`
+
+The `tesseract` endpoint runs the Tesseract OCR engine on the provided images, and returns the results.
+
+### Expected Form Parts
+
+Name | Type | Optional | Description
+--|--|--|--
+`options` | `OptionMap` | Yes | The options to use when macronising each file.
+Image file parts | `blob` | At least one. | The images to perform OCR on.
+
+#### `OptionMap` Object
+
+A map of options for each submitted file. The option key should match the name of the corresponding file part in the request.
+
+- `layoutDetection`: A value indicating whether or not Tesseract should attempt to automatically detect the layout of the image.
+
+```json
+{
+"key1": {
+"layoutDetection": true
+},
+...
+}
 ```
 
-# Now edit:
+### Response Fields
 
-src/main/webapp/webContent/unauthorizsed.html
+Name | Type | Optional | Description
+--|--|--|--
+`key` | `string` | No | The unique key of the image that this result was produced from. Matches the name of the file part in the request.
+`fileName` | `string` | No | The name of the file that this result was produced from.
+`text` | `string` | No | The extracted text.
+`thresholdedImageKey` | `string` | No | A key that can be used with the `image` endpoint to retrieve the 'thresholded image', which is the final stage of Tesseract's internal image processing before it runs the OCR algorithm.
 
-so the button displays the name Atea, and points to the https://atea.space website
+#### Example Response
 
+```json
+[
+{
+"key": "0test.png",
+"fileName": "test.png",
+"text": "Te Kāwanatanga o Aotearoa\n",
+"thresholdedImageKey": "7e383d85-4a4c-481d-83bf-c5e384512399.webp"
+},
+...
+]
+```
 
-# Compile up and install the war file
+## Image Retrieval
 
-ant install
+- Endpoint: `/image`
+- Method: `GET`
+- Response Content Type: `image/webp`
 
+The `image` endpoint can be used to retrieve images associated with the OCR process that occurs when calling other endpoints. It returns images in the `webp` format.
 
-# Now update apache2 config with ProxyPass rules
+### Expected Request Parameters
 
-sudo emacs /etc/apache2/sites-enabled/000-default-le-ssl.conf
-
-ProxyPass /gs3-atea-ocr http://localhost:8383/gs3-atea-ocr
-ProxyPassReverse /gs3-atea-ocr http://localhost:8383/gs3-atea-ocr
-
-#Same for the http version:
-
-sudo emacs /etc/apache2/sites-enabled/000-default.conf
+Name | Type | Optional | Description
+--|--|--|--
+`key` | `string` | No | The key of the image to retrieve.
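Reviewer sketch (not part of the changeset): the README added above describes an `options` form part keyed by file part name and an `image` endpoint that takes a `key` parameter. The Java below is a minimal illustration of how a client might build those two pieces. The class and method names are hypothetical, the base URL is borrowed from the `ProxyPass` target in the removed README lines, and passing `key` as a query-string parameter is an assumption, since the README only lists it under "Expected Request Parameters".

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper class; nothing here exists in the gs3-atea-ocr sources.
class OcrClientSketch {

    // Builds the JSON text for the `options` form part: one entry per file
    // part name, each carrying the `layoutDetection` flag from the README.
    static String buildOptionMap(Map<String, Boolean> layoutDetectionByPartName) {
        StringJoiner entries = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, Boolean> e : layoutDetectionByPartName.entrySet()) {
            entries.add("\"" + escapeJson(e.getKey()) + "\":{\"layoutDetection\":" + e.getValue() + "}");
        }
        return entries.toString();
    }

    // Escapes only backslashes and double quotes, which is enough for
    // ordinary file part names; a real client would use a JSON library.
    static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds the URL for the `image` endpoint. Assumes `key` is sent as a
    // query-string parameter and that baseUrl has no trailing slash.
    static URI imageUri(String baseUrl, String thresholdedImageKey) {
        String encoded = URLEncoder.encode(thresholdedImageKey, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/image?key=" + encoded);
    }

    public static void main(String[] args) {
        Map<String, Boolean> options = new LinkedHashMap<>();
        options.put("0test.png", true); // key must match the file part name
        System.out.println(buildOptionMap(options));

        // Assumed base URL; adjust to wherever the servlet is actually deployed.
        System.out.println(imageUri("http://localhost:8383/gs3-atea-ocr",
                "7e383d85-4a4c-481d-83bf-c5e384512399.webp"));
    }
}
```

The option map is keyed by part name because the README states that each option key should match the name of the corresponding file part; the `thresholdedImageKey` returned by `/tesseract` is what would be passed to `imageUri`.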
gs3-extensions/atea-nlp-tools/trunk/src/ocr/src/main/java/org/atea/nlptools/ocr/services/TesseractOcrService.java
r35780 r35783
@@ -71,5 +71,4 @@
 File thresholdOutput = new File(this.thresholdOutputPath, fileName);
 thresholdedImage = api.GetThresholdedImage();
-// lept.pixWriteWebP("temp", thresholdedImage, 75, 100);
 lept.pixWrite(thresholdOutput.getAbsolutePath(), thresholdedImage, lept.IFF_WEBP);
 