source: gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

Change Rev Age Author Log Message
(edit) @33615   5 years ak19 1. Worked out how to configure log4j to log both to console and …
(edit) @33604   5 years ak19 1. Better output into possible-product-sites.txt including the …
(edit) @33603   5 years ak19 Incorporating Dr Nichols' suggestion to help weed out product sites: if …
(edit) @33582   5 years ak19 NutchTextDumpProcessor prints each crawled site's stats: number of …
(edit) @33575   5 years ak19 Correcting usage string for CCWETProcessor before committing new java …
(edit) @33573   5 years ak19 Forgot to document that spaces were also allowed as separator in the …
(edit) @33569   5 years ak19 1. batchcrawl.sh now does what it should have from the start, which is …
(edit) @33568   5 years ak19 1. More sites greylisted and blacklisted, discovered as I attempted to …
(edit) @33565   5 years ak19 CCWETProcessor: domain url now goes in as a seedURL after the …
(edit) @33562   5 years ak19 1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
(edit) @33561   5 years ak19 1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
(edit) @33560   5 years ak19 1. Incorporated Dr Bainbridge's suggested improvements: only when …
(edit) @33557   5 years ak19 Implemented the topSitesMap of topsite domain to url pattern in the …
(edit) @33552   5 years ak19 1. Code now processes ccrawldata folder, containing each individual …
(edit) @33519   5 years ak19 Code still writes out the global seedURLs.txt and regex-urlfilter.txt …
(edit) @33518   5 years ak19 Intermediate commit: got the seed urls file temporarily written out as …
(edit) @33517   5 years ak19 1. Blacklists were introduced so that too many instances of camelcased …
(edit) @33515   5 years ak19 Removed an unused function
(edit) @33503   5 years ak19 More efficient blacklisting/greylisting/whitelisting now by reading in …
(add) @33501   5 years ak19 Refactored code into 2 classes: The existing WETProcessor, which …