|
|
@33624
|
5 years |
ak19 |
Some cleanup surrounding the now-renamed function createSeedURLsFile, …
|
|
|
@33623
|
5 years |
ak19 |
1. Incorporated Dr Nichols' earlier suggestion of storing page modified …
|
|
|
@33615
|
5 years |
ak19 |
1. Worked out how to configure log4j to log both to console and …
|
|
|
@33604
|
5 years |
ak19 |
1. Better output into possible-product-sites.txt including the …
|
|
|
@33603
|
5 years |
ak19 |
Incorporating Dr Nichols' suggestion to help weed out product sites: if …
|
|
|
@33582
|
5 years |
ak19 |
NutchTextDumpProcessor prints each crawled site's stats: number of …
|
|
|
@33575
|
5 years |
ak19 |
Correcting usage string for CCWETProcessor before committing new java …
|
|
|
@33573
|
5 years |
ak19 |
Forgot to document that spaces were also allowed as separator in the …
|
|
|
@33569
|
5 years |
ak19 |
1. batchcrawl.sh now does what it should have from the start, which is …
|
|
|
@33568
|
5 years |
ak19 |
1. More sites greylisted and blacklisted, discovered as I attempted to …
|
|
|
@33565
|
5 years |
ak19 |
CCWETProcessor: domain url now goes in as a seedURL after the …
|
|
|
@33562
|
5 years |
ak19 |
1. The sites-too-big-to-exhaustively-crawl.txt is now a CSV file of a …
|
|
|
@33561
|
5 years |
ak19 |
1. sites-too-big-to-exhaustively-crawl.txt is now a comma-separated …
|
|
|
@33560
|
5 years |
ak19 |
1. Incorporated Dr Bainbridge's suggested improvements: only when …
|
|
|
@33557
|
5 years |
ak19 |
Implemented the topSitesMap of topsite domain to url pattern in the …
|
|
|
@33552
|
5 years |
ak19 |
1. Code now processes ccrawldata folder, containing each individual …
|
|
|
@33519
|
5 years |
ak19 |
Code still writes out the global seedURLs.txt and regex-urlfilter.txt …
|
|
|
@33518
|
5 years |
ak19 |
Intermediate commit: got the seed urls file temporarily written out as …
|
|
|
@33517
|
5 years |
ak19 |
1. Blacklists were introduced so that too many instances of camelcased …
|
|
|
@33515
|
5 years |
ak19 |
Removed an unused function
|
|
|
@33503
|
5 years |
ak19 |
More efficient blacklisting/greylisting/whitelisting now by reading in …
|
|
|
@33501
|
5 years |
ak19 |
Refactored code into 2 classes: The existing WETProcessor, which …
|