|
|
@33623
|
5 years |
ak19 |
1. Incorporated Dr Nichols' earlier suggestion of storing page modified …
|
|
|
@33615
|
5 years |
ak19 |
1. Worked out how to configure log4j to log both to console and …
|
|
|
@33604
|
5 years |
ak19 |
1. Better output into possible-product-sites.txt including the …
|
|
|
@33603
|
5 years |
ak19 |
Incorporating Dr Nichols' suggestion to help weed out product sites: if …
|
|
|
@33569
|
5 years |
ak19 |
1. batchcrawl.sh now does what it should have from the start, which is …
|
|
|
@33568
|
5 years |
ak19 |
1. More sites greylisted and blacklisted, discovered as I attempted to …
|
|
|
@33565
|
5 years |
ak19 |
CCWETProcessor: domain URL now goes in as a seedURL after the …
|
|
|
@33562
|
5 years |
ak19 |
1. The sites-too-big-to-exhaustively-crawl.txt is now a CSV file of a …
|
|
|
@33561
|
5 years |
ak19 |
1. sites-too-big-to-exhaustively-crawl.txt is now a comma-separated …
|
|
|
@33559
|
5 years |
ak19 |
1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
|
|
|
@33556
|
5 years |
ak19 |
Blacklisted Wikipedia pages that are actually in other languages which …
|
|
|
@33555
|
5 years |
ak19 |
Modified top sites list as Dr Bainbridge described: suffixes for the …
|
|
|
@33554
|
5 years |
ak19 |
Added more to blacklist and greylist, and removed remaining duplicates …
|
|
|
@33553
|
5 years |
ak19 |
Comments
|
|
|
@33551
|
5 years |
ak19 |
Added the top 500 URLs from moz.com/top500 and removed duplicates, and …
|
|
|
@33550
|
5 years |
ak19 |
First stage of introducing sites-too-big-to-exhaustively-crawl.txt: …
|
|
|
@33532
|
5 years |
ak19 |
Finally found the other top 500 sites link again, which Dr Bainbridge …
|
|
|
@33531
|
5 years |
ak19 |
Added whitelist for mi.wikipedia.org, and updates to blacklist and …
|
|
|
@33502
|
5 years |
ak19 |
Current URL pattern blacklist and greylist filter files. Used by …
|
|
|
@33480
|
5 years |
ak19 |
Much harder to remove pages where words are fused together as some are …
|
|
|
@33467
|
5 years |
ak19 |
Improved the code to use a static block to load the needed properties …
|
|
|
@33412
|
5 years |
ak19 |
Config command for fetching a single file with wget
|
|
|
@33400
|
5 years |
ak19 |
1. Setting up log4j.properties based on the macronizer's basic one …
|
|
|
@33399
|
5 years |
ak19 |
Putting properties files into the conf folder and keeping the lib …
|