source: gs3-extensions/maori-lang-detection/conf/config.properties@ 33480

Last change on this file since 33480 was 33480, checked in by ak19, 5 years ago

It is much harder to remove pages where words are fused together: some fused words are shorter than the maximum valid word length of 15 chars and some are longer, yet the number of valid words on the page can still come to more than the required minimum of 20. The next solution was to ignore pages that had more than 2 instances of camelcase, but valid pages (actual Maori language pages) may end up with a few more camelcased words if navigation items get fused together. Not sure what to do.
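The heuristic described in this commit message could be sketched roughly as follows. This is a Python illustration, not the project's WETProcessor.java. The function names `is_camelcase` and `has_sensible_content` are hypothetical, and so are the exact comparison rules (at least the minimum number of valid words, at most the maximum number of camelcased words). The cutoffs mirror the `WETprocessor.*` keys in the file listed further down, which sets the camelcase limit to 10 rather than the 2 mentioned in the message.

```python
# Assumed cutoffs, copied from the WETprocessor.* keys in config.properties.
MAX_WORD_LENGTH = 15      # WETprocessor.max.word.length
MIN_NUM_WORDS = 20        # WETprocessor.min.num.words
MAX_WORDS_CAMELCASE = 10  # WETprocessor.max.words.camelcase


def is_camelcase(word: str) -> bool:
    """True if a lowercase letter is directly followed by an uppercase one,
    as happens when navigation items fuse together ("HomeAbout")."""
    return any(a.islower() and b.isupper() for a, b in zip(word, word[1:]))


def has_sensible_content(text: str) -> bool:
    """Hypothetical filter: keep a page only if it has enough words of
    valid length and not too many camelcased words among them."""
    words = text.split()
    valid_words = [w for w in words if len(w) <= MAX_WORD_LENGTH]
    camelcased = [w for w in valid_words if is_camelcase(w)]
    return (len(valid_words) >= MIN_NUM_WORDS
            and len(camelcased) <= MAX_WORDS_CAMELCASE)
```

The sketch also shows the weakness the message describes: a fused word such as "HomeAbout" is only 9 characters long, so it still counts as a valid word, and only the camelcase count can catch it.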

File size: 1.0 KB
# https://www.linuxjournal.com/content/downloading-entire-web-site-wget
# https://linuxreviews.org/Wget:_download_whole_or_parts_of_websites_with_ease
# https://www.webhostface.com/kb/knowledgebase/examples-using-wget/
# "You can replicate the HTML content of a website with the --mirror option (or -m for short)
# wget -m http://domain.com"
# https://www.linuxquestions.org/questions/linux-server-73/wget-how-to-download-more-than-one-file-at-once-instead-of-file-after-file-704693/
wget.mirror.cmd=wget -Q10m -m %%BASE_URL%%

# for downloading a single file
wget.file.cmd=wget %%FILE_URL%%

# Arbitrary cutoff values for WETProcessor.java
WETprocessor.min.content.length=100
WETprocessor.min.line.count=2
WETprocessor.min.content.length.wrapped.line=500
WETprocessor.min.spaces.per.wrapped.line=10

# Arbitrary cutoff values for WETProcessor.java
# for determining whether a WET record has sufficient and sensible content
WETprocessor.max.word.length=15
WETprocessor.min.num.words=20
WETprocessor.max.words.camelcase=10
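
A consumer of this file might read the keys and fill in the `%%...%%` placeholders along the following lines. This is a minimal Python sketch, not the project's Java code. `parse_properties` and `build_command` are hypothetical helpers, and the parser only handles the simple `key=value` and `#` comment lines used here (Java's own properties format also allows escapes, continuation lines and other separators). The example URL is the one quoted in the comments above.

```python
import shlex


def parse_properties(text: str) -> dict:
    """Simplified parser: skips blank lines and '#' comments and splits
    each remaining line on its first '='."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def build_command(template: str, placeholder: str, value: str) -> list:
    """Split the template into arguments first, then substitute, so the
    substituted URL always stays a single argument."""
    return [tok.replace(placeholder, value) for tok in shlex.split(template)]


SAMPLE = """\
wget.mirror.cmd=wget -Q10m -m %%BASE_URL%%
# for downloading a single file
wget.file.cmd=wget %%FILE_URL%%
WETprocessor.min.num.words=20
"""

props = parse_properties(SAMPLE)
mirror_cmd = build_command(props["wget.mirror.cmd"], "%%BASE_URL%%",
                           "http://domain.com")
min_words = int(props["WETprocessor.min.num.words"])
```

Building an argument list rather than a single shell string means the result can be passed to something like `subprocess.run` without a shell, so characters such as `&` or `?` in a URL are not interpreted by the shell.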