|
|
@33573
|
5 years |
ak19 |
Forgot to document that spaces were also allowed as a separator in the …
|
|
|
@33569
|
5 years |
ak19 |
1. batchcrawl.sh now does what it should have from the start, which is …
|
|
|
@33568
|
5 years |
ak19 |
1. More sites greylisted and blacklisted, discovered as I attempted to …
|
|
|
@33565
|
5 years |
ak19 |
CCWETProcessor: domain url now goes in as a seedURL after the …
|
|
|
@33562
|
5 years |
ak19 |
1. The sites-too-big-to-exhaustively-crawl.txt is now a CSV file of a …
|
|
|
@33561
|
5 years |
ak19 |
1. sites-too-big-to-exhaustively-crawl.txt is now a comma-separated …
|
|
|
@33560
|
5 years |
ak19 |
1. Incorporated Dr Bainbridge's suggested improvements: only when …
|
|
|
@33557
|
5 years |
ak19 |
Implemented the topSitesMap of topsite domain to url pattern in the …
|
|
|
@33552
|
5 years |
ak19 |
1. Code now processes ccrawldata folder, containing each individual …
|
|
|
@33519
|
5 years |
ak19 |
Code still writes out the global seedURLs.txt and regex-urlfilter.txt …
|
|
|
@33518
|
5 years |
ak19 |
Intermediate commit: got the seed urls file temporarily written out as …
|
|
|
@33517
|
5 years |
ak19 |
1. Blacklists were introduced so that too many instances of camelcased …
|
|
|
@33515
|
5 years |
ak19 |
Removed an unused function
|
|
|
@33503
|
5 years |
ak19 |
More efficient blacklisting/greylisting/whitelisting now by reading in …
|
|
|
@33501
|
5 years |
ak19 |
Refactored code into 2 classes: The existing WETProcessor, which …
|
|
|
@33497
|
5 years |
ak19 |
First version of discard url filter file. Inefficient implementation. …
|
|
|
@33488
|
5 years |
ak19 |
New function createSeedURLsFiles() in WETProcessor that replaces the …
|
|
|
@33480
|
5 years |
ak19 |
Much harder to remove pages where words are fused together as some are …
|
|
|
@33471
|
5 years |
ak19 |
Very minor changes.
|
|
|
@33469
|
5 years |
ak19 |
Don't want URLs with the word product(s) in them (but production …
|
|
|
@33468
|
5 years |
ak19 |
More meaningful to (also) write out the keep vs discard URLs into keep …
|
|
|
@33467
|
5 years |
ak19 |
Improved the code to use a static block to load the needed properties …
|
|
|
@33466
|
5 years |
ak19 |
1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) …
|
|
|
@33465
|
5 years |
ak19 |
Committing first version of the WETProcessor.java which takes a …
|
|
|
@33411
|
5 years |
ak19 |
Newer version now doesn't mirror sites with wget but gets WET files …
|
|
|
@33410
|
5 years |
ak19 |
Committing some variable name changes before I replace this file with …
|
|
|
@33405
|
5 years |
ak19 |
Even though we're probably not going to use this code after all, will …
|
|
|
@33402
|
5 years |
ak19 |
Beginnings of the Java class to wget sites and process their pages to …
|
|
|
@33401
|
5 years |
ak19 |
MaoriTextDetector.class file now generated inside its package folder …
|
|
|
@33398
|
5 years |
ak19 |
Committing the actual package structure and the updated README after …
|
|
|
@33397
|
5 years |
ak19 |
1. Changing package structure and instructions on compiling/running as …
|
|
|
@33355
|
5 years |
ak19 |
Changes for adding in the new gen_SentenceDetection_model.sh script, …
|
|
|
@33350
|
5 years |
ak19 |
Better comments. Tested macronised vs unmacronised Māori language test …
|
|
|
@33338
|
5 years |
ak19 |
1. After renaming the Java class, changed all occurrences of the old …
|
|
|
@33337
|
5 years |
ak19 |
Renaming the class to MaoriTextDetector, since it doesn't detect audio …
|
|
|
@33336
|
5 years |
ak19 |
Major rewrite to make this class more useful to callers. …
|
|
|
@33335
|
5 years |
ak19 |
First Java file for Māori language detection using OpenNLP with the …
|