Last change on this file since 33623 was 33623, checked in by ak19, 4 years ago
- 1. Incorporated Dr Nichols' earlier suggestion of storing page modified time and char-encoding metadata if present in the crawl dump output. This is done, but neither the modifiedTime nor the fetchTime metadata in the dump file appears to be a webpage's actual modified time, as both are from 2019 and fall around the period we've been crawling. 2. Moved the getDomainFromURL() function from CCWETProcessor.java to Utility.java, since it's been reused. 3. The MongoDBAccess class successfully connects (at least, it throws no exceptions) and uses the newly added properties in config.properties to make the connection.
File size: 1.3 KB
# https://www.linuxjournal.com/content/downloading-entire-web-site-wget
# https://linuxreviews.org/Wget:_download_whole_or_parts_of_websites_with_ease
# https://www.webhostface.com/kb/knowledgebase/examples-using-wget/
# "You can replicate the HTML content of a website with the --mirror option (or -m for short)
# wget -m http://domain.com"
# https://www.linuxquestions.org/questions/linux-server-73/wget-how-to-download-more-than-one-file-at-once-instead-of-file-after-file-704693/
wget.mirror.cmd=wget -Q10m -m %%BASE_URL%%

# for downloading a single file
wget.file.cmd=wget %%FILE_URL%%

# Arbitrary cutoff values for WETProcessor.java
WETprocessor.min.content.length=100
WETprocessor.min.line.count=2
WETprocessor.min.content.length.wrapped.line=500
WETprocessor.min.spaces.per.wrapped.line=10

# Arbitrary cutoff values for WETProcessor.java
# for determining whether a WET record has sufficient and sensible content
WETprocessor.max.word.length=15
WETprocessor.min.num.words=20
WETprocessor.max.words.camelcase=10


mongodb.user=anupama
mongodb.pwd=chang3m3
# default mongodb port is 27017. Don't change the port unless you really have configured
# your mongodb server to listen at some other port
mongodb.port=27017
mongodb.host=mongodb.cms.waikato.ac.nz
mongodb.dbname=ateacrawldata
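
The two wget entries are command templates: `%%BASE_URL%%` and `%%FILE_URL%%` are placeholders to be filled in before the command runs (`-m` mirrors a site and `-Q10m` sets a 10 MB download quota). How the project's own code performs the substitution is not shown in this file, so the following is only a minimal sketch. The class and method names (`WgetCommandSketch`, `parse`, `fillPlaceholder`) are hypothetical, and a plain string replacement is assumed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of the repository.
class WgetCommandSketch {

    // Parse properties text; a real caller would read config.properties from disk.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Replace every %%NAME%% occurrence in the template with the given value.
    // Assumption: placeholders are substituted literally, with no escaping.
    static String fillPlaceholder(String template, String name, String value) {
        return template.replace("%%" + name + "%%", value);
    }

    public static void main(String[] args) {
        Properties props = parse(
            "wget.mirror.cmd=wget -Q10m -m %%BASE_URL%%\n"
          + "wget.file.cmd=wget %%FILE_URL%%\n");

        String mirrorCmd = fillPlaceholder(
            props.getProperty("wget.mirror.cmd"), "BASE_URL", "http://domain.com");
        System.out.println(mirrorCmd); // prints "wget -Q10m -m http://domain.com"
    }
}
```

If the resulting string is later passed to a shell, the URL would need quoting or validation first; the sketch leaves that out.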
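
The `WETprocessor.*` entries are described only as arbitrary cutoffs for deciding whether a WET record has sufficient and sensible content; the exact rule in WETProcessor.java is not shown here. As an illustration of how two of them might be consumed, the sketch below assumes one simple rule: a record passes if it has at least `WETprocessor.min.num.words` words that are no longer than `WETprocessor.max.word.length`. The class name, method names, and the rule itself are assumptions.

```java
import java.util.Properties;

// Hypothetical illustration; the real WETProcessor.java logic may differ.
class WetCutoffSketch {

    // Read an integer property, falling back to a default when the key is absent.
    static int intProp(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    // Assumed rule: count whitespace-separated words no longer than maxWordLength
    // and require at least minNumWords of them.
    static boolean hasEnoughWords(String text, int maxWordLength, int minNumWords) {
        int validWords = 0;
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty() && word.length() <= maxWordLength) {
                validWords++;
            }
        }
        return validWords >= minNumWords;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("WETprocessor.max.word.length", "15");
        props.setProperty("WETprocessor.min.num.words", "20");

        int maxLen = intProp(props, "WETprocessor.max.word.length", 15);
        int minWords = intProp(props, "WETprocessor.min.num.words", 20);

        // A three-word record falls well short of the 20-word minimum.
        System.out.println(hasEnoughWords("far too short", maxLen, minWords)); // prints "false"
    }
}
```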
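
The `mongodb.*` properties carry everything needed for a standard MongoDB connection string of the form `mongodb://user:password@host:port/dbname`. How the MongoDBAccess class actually uses them is not shown in this file, so the sketch below only assembles such a string and does not depend on any MongoDB driver. `MongoUriSketch` and its methods are hypothetical names, and the values in `main` are placeholders rather than the ones from the file.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper; it only assembles a connection string.
class MongoUriSketch {

    // Percent-encode a credential. URLEncoder writes spaces as '+',
    // so those are converted to %20 afterwards.
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Fetch a property that must be present.
    static String required(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null) {
            throw new IllegalStateException("Missing property: " + key);
        }
        return value.trim();
    }

    static String buildUri(Properties props) {
        // 27017 is the default port named in the properties file's own comment.
        int port = Integer.parseInt(props.getProperty("mongodb.port", "27017").trim());
        return "mongodb://"
            + encode(required(props, "mongodb.user")) + ":"
            + encode(required(props, "mongodb.pwd")) + "@"
            + required(props, "mongodb.host") + ":" + port + "/"
            + required(props, "mongodb.dbname");
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("mongodb.user", "someuser");       // placeholder value
        props.setProperty("mongodb.pwd", "p@ss word");       // placeholder value
        props.setProperty("mongodb.host", "db.example.org"); // placeholder value
        props.setProperty("mongodb.dbname", "exampledb");    // placeholder value
        System.out.println(buildUri(props));
    }
}
```

Encoding the user name and password matters because characters such as `@` or `:` in a credential would otherwise be read as URI delimiters.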