source: gs3-extensions/maori-lang-detection/src

Diff Rev Age Author Log Message
(edit) @33573   5 years ak19 Forgot to document that spaces were also allowed as separator in the …
(edit) @33569   5 years ak19 1. batchcrawl.sh now does what it should have from the start, which is …
(edit) @33568   5 years ak19 1. More sites greylisted and blacklisted, discovered as I attempted to …
(edit) @33565   5 years ak19 CCWETProcessor: domain url now goes in as a seedURL after the …
(edit) @33562   5 years ak19 1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
(edit) @33561   5 years ak19 1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
(edit) @33560   5 years ak19 1. Incorporated Dr Bainbridge's suggested improvements: only when …
(edit) @33557   5 years ak19 Implemented the topSitesMap of topsite domain to url pattern in the …
(edit) @33552   5 years ak19 1. Code now processes ccrawldata folder, containing each individual …
(edit) @33519   5 years ak19 Code still writes out the global seedURLs.txt and regex-urlfilter.txt …
(edit) @33518   5 years ak19 Intermediate commit: got the seed urls file temporarily written out as …
(edit) @33517   5 years ak19 1. Blacklists were introduced so that too many instances of camelcased …
(edit) @33515   5 years ak19 Removed an unused function
(edit) @33503   5 years ak19 More efficient blacklisting/greylisting/whitelisting now by reading in …
(edit) @33501   5 years ak19 Refactored code into 2 classes: The existing WETProcessor, which …
(edit) @33497   5 years ak19 First version of discard url filter file. Inefficient implementation. …
(edit) @33488   5 years ak19 new function createSeedURLsFiles() in WETProcessor that replaces the …
(edit) @33480   5 years ak19 Much harder to remove pages where words are fused together as some are …
(edit) @33471   5 years ak19 Very minor changes.
(edit) @33469   5 years ak19 Don't want URLs with the word product(s) in them (but production …
(edit) @33468   5 years ak19 More meaningful to (also) write out the keep vs discard URLs into keep …
(edit) @33467   5 years ak19 Improved the code to use a static block to load the needed properties …
(edit) @33466   5 years ak19 1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) …
(edit) @33465   5 years ak19 Committing first version of the WETProcessor.java which takes a …
(edit) @33411   5 years ak19 Newer version now doesn't mirror sites with wget but gets WET files …
(edit) @33410   5 years ak19 Committing some variable name changes before I replace this file with …
(edit) @33405   5 years ak19 Even though we're probably not going to use this code after all, will …
(edit) @33402   5 years ak19 Beginnings of the Java class to wget sites and process its pages to …
(edit) @33401   5 years ak19 MaoriTextDetector.class file now generated inside its package folder …
(edit) @33398   5 years ak19 Committing the actual package structure and the updated README after …
(edit) @33397   5 years ak19 1. Changing package structure and instructions on compiling/running as …
(edit) @33355   5 years ak19 Changes for adding in the new gen_SentenceDetection_model.sh script, …
(edit) @33350   5 years ak19 Better comments. Tested macronised vs unmacronised Māori language test …
(edit) @33338   5 years ak19 1. After renaming the Java class, changed all occurrences of the old …
(edit) @33337   5 years ak19 Renaming the class to MaoriTextDetector, since it doesn't detect audio …
(edit) @33336   5 years ak19 Major rewrite to make this class more useful to callers. …
(add) @33335   5 years ak19 First Java file for Māori language detection using openNLP with the …