source: gs3-extensions/maori-lang-detection/conf/config.properties@ 33623

Last change on this file since 33623 was 33623, checked in by ak19, 4 years ago
  1. Incorporated Dr Nichols' earlier suggestion of storing page modified time and char-encoding metadata when present in the crawl dump output. However, neither the modifiedTime nor the fetchTime metadata in the dump file appears to be a webpage's actual modified time: both are from 2019 and fall within the period we've been crawling. 2. Moved the getDomainFromURL() function from CCWETProcessor.java to Utility.java, since it is now reused. 3. The MongoDBAccess class successfully connects (at least, it throws no exceptions) and uses the newly added properties in config.properties to make the connection.
File size: 1.3 KB
# https://www.linuxjournal.com/content/downloading-entire-web-site-wget
# https://linuxreviews.org/Wget:_download_whole_or_parts_of_websites_with_ease
# https://www.webhostface.com/kb/knowledgebase/examples-using-wget/
# "You can replicate the HTML content of a website with the --mirror option (or -m for short)
# wget -m http://domain.com"
# https://www.linuxquestions.org/questions/linux-server-73/wget-how-to-download-more-than-one-file-at-once-instead-of-file-after-file-704693/
wget.mirror.cmd=wget -Q10m -m %%BASE_URL%%

# for downloading a single file
wget.file.cmd=wget %%FILE_URL%%

# Arbitrary cutoff values for WETProcessor.java
WETprocessor.min.content.length=100
WETprocessor.min.line.count=2
WETprocessor.min.content.length.wrapped.line=500
WETprocessor.min.spaces.per.wrapped.line=10

# Arbitrary cutoff values for WETProcessor.java
# for determining whether a WET record has sufficient and sensible content
WETprocessor.max.word.length=15
WETprocessor.min.num.words=20
WETprocessor.max.words.camelcase=10


mongodb.user=anupama
mongodb.pwd=chang3m3
# default mongodb port is 27017. Don't change the port unless you really have configured
# your mongodb server to listen at some other port
mongodb.port=27017
mongodb.host=mongodb.cms.waikato.ac.nz
mongodb.dbname=ateacrawldata
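
The following is a minimal sketch, not the project's actual code, of how a Java caller might consume these properties: loading them with java.util.Properties, filling the %%BASE_URL%% and %%FILE_URL%% placeholders in the wget command templates, reading a numeric cutoff, and assembling a MongoDB connection string. The class name ConfigSketch and all of its methods are hypothetical, and the mongodb:// URI layout is an assumption about how MongoDBAccess might use the mongodb.* keys.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper for illustration; not part of maori-lang-detection. */
final class ConfigSketch {

    /** Parses properties text. A real caller would read conf/config.properties instead. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Replaces every %%NAME%% placeholder in a command template with a literal value. */
    static String fillTemplate(String template, String name, String value) {
        return template.replace("%%" + name + "%%", value);
    }

    /** Reads an integer property such as WETprocessor.min.num.words. */
    static int getInt(Properties props, String key) {
        return Integer.parseInt(props.getProperty(key).trim());
    }

    /**
     * Builds a mongodb:// connection string from the mongodb.* keys.
     * Assumption: user and password contain no characters that would need
     * percent-encoding in a URI.
     */
    static String mongoUri(Properties props) {
        return "mongodb://" + props.getProperty("mongodb.user")
                + ":" + props.getProperty("mongodb.pwd")
                + "@" + props.getProperty("mongodb.host")
                + ":" + props.getProperty("mongodb.port")
                + "/" + props.getProperty("mongodb.dbname");
    }

    public static void main(String[] args) {
        // Placeholder values only; the real file holds the site-specific settings.
        Properties props = parse(String.join("\n",
                "wget.mirror.cmd=wget -Q10m -m %%BASE_URL%%",
                "WETprocessor.min.num.words=20",
                "mongodb.user=someuser",
                "mongodb.pwd=somepassword",
                "mongodb.host=localhost",
                "mongodb.port=27017",
                "mongodb.dbname=somedb"));

        System.out.println(fillTemplate(
                props.getProperty("wget.mirror.cmd"), "BASE_URL", "https://example.org"));
        System.out.println(getInt(props, "WETprocessor.min.num.words"));
        System.out.println(mongoUri(props));
    }
}
```

Keeping the command templates in the properties file means the wget options (for example the -Q10m download quota) can be tuned without recompiling; only the placeholder substitution lives in code.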