source: gs3-extensions

Rev      Age      Author   Log Message
@33562   5 years  ak19     1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
@33561   5 years  ak19     1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
@33560   5 years  ak19     1. Incorporated Dr Bainbridge's suggested improvements: only when …
@33559   5 years  ak19     1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
@33558   5 years  ak19     Committing cumulative changes since last commit.
@33557   5 years  ak19     Implemented the topSitesMap of topsite domain to url pattern in the …
@33556   5 years  ak19     Blacklisted wikipedia pages that are actually in other languages which …
@33555   5 years  ak19     Modified top sites list as Dr Bainbridge described: suffixes for the …
@33554   5 years  ak19     Added more to blacklist and greylist. And removed remaining duplicates …
@33553   5 years  ak19     Comments
@33552   5 years  ak19     1. Code now processes ccrawldata folder, containing each individual …
@33551   5 years  ak19     Added in top 500 urls from moz.com/top500 and removed duplicates, and …
@33550   5 years  ak19     First stage of introducing sites-too-big-to-exhaustively-crawl.txt: …
@33549   5 years  ak19     All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 …
@33548   5 years  davidb   Include new wavesurfer sub-project to install
@33546   5 years  davidb   Initial cut at wave-surfer based JS audio player extension for Greenstone
@33545   5 years  ak19     Mainly changes to crawling-Nutch.txt and some minor changes to other …
@33543   5 years  ak19     Filled in some missing instructions
@33541   5 years  ak19     1. hdfs-cc-work/GS_README.txt now contains the complete instructions …
@33540   5 years  ak19     Since I wasn't getting further with nutch 2 to grab an entire site, I …
@33539   5 years  ak19     File rename
@33538   5 years  ak19     Some additions to the setup.sh script to query commoncrawl for MRI …
@33537   5 years  ak19     More nutch and general site mirroring related links
@33536   5 years  ak19     Changes required to the commoncrawl related Vagrant github project to …
@33535   5 years  ak19     1. New setup.sh script for on a hadoop system to setup the git …
@33534   5 years  ak19     Correction: toplevel script has to be placed inside cc-index-table not …
@33532   5 years  ak19     Found the other top 500 sites link again at last which Dr Bainbridge …
@33531   5 years  ak19     Added whitelist for mi.wikipedia.org, and updates to blacklist and …
@33530   5 years  ak19     Completed sentence that was left hanging.
@33529   5 years  ak19     Forgot to add most basic nutch links
@33528   5 years  ak19     Adding in Nutch links
@33527   5 years  ak19     Name change for folder
@33526   5 years  ak19     Moved hadoop related scripts from bin/script into hdfs-instructions
@33525   5 years  ak19     Rename before latest version
@33524   5 years  ak19     1. Further adjustments to documenting what we did to get things to run …
@33523   5 years  ak19     Instructional comment
@33522   5 years  ak19     Some comments and an improvement
@33519   5 years  ak19     Code still writes out the global seedURLs.txt and regex-urlfilter.txt …
@33518   5 years  ak19     Intermediate commit: got the seed urls file temporarily written out as …
@33517   5 years  ak19     1. Blacklists were introduced so that too many instances of camelcased …
@33516   5 years  ak19     Before I accidentally lose it, committing the script Dr Bainbridge …
@33515   5 years  ak19     Removed an unused function
@33514   5 years  ak19     Committing README on starting off with the vagrant VM for hadoop-spark …
@33513   5 years  ak19     Higher level script that runs against each named crawl since Sep 2018 …
@33503   5 years  ak19     More efficient blacklisting/greylisting/whitelisting now by reading in …
@33502   5 years  ak19     Current url pattern blacklist and greylist filter files. Used by …
@33501   5 years  ak19     Refactored code into 2 classes: The existing WETProcessor, which …
@33499   5 years  ak19     Explicitly adding in IAM policy configuration details instead of just …
@33498   5 years  ak19     Corrections to script. Modified the tests checking for file/dir …
@33497   5 years  ak19     First version of discard url filter file. Inefficient implementation. …
@33496   5 years  ak19     Minor changes to reading list file
@33495   5 years  ak19     Pruned out unused commands, added comments, marked unused variables to …
@33494   5 years  ak19     All in one script that takes as parameter a common crawl identifier of …
@33489   5 years  ak19     Handy file to not have to keep manually repeating commands when …
@33488   5 years  ak19     New function createSeedURLsFiles() in WETProcessor that replaces the …
@33480   5 years  ak19     Much harder to remove pages where words are fused together as some are …
@33471   5 years  ak19     Very minor changes.
@33470   5 years  ak19     A new script to reduce keepURLs.txt to unique URLs, 1 from each unique …
@33469   5 years  ak19     Don't want URLs with the word product(s) in them (but production …
@33468   5 years  ak19     More meaningful to (also) write out the keep vs discard URLs into keep …
@33467   5 years  ak19     Improved the code to use a static block to load the needed properties …
@33466   5 years  ak19     1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) …
@33465   5 years  ak19     Committing first version of the WETProcessor.java which takes a …
@33457   5 years  ak19     Got stage 1, the WARC to WET conversion, working, after necessary …
@33456   5 years  ak19     Link to discussion on how to convert WARC to WET
@33448   5 years  ak19     Minor clarification and inclusion of helpful command
@33446   5 years  ak19     1. Committing working version of export_maori_subset.sh which takes …
@33445   5 years  ak19     The first working hadoop spark script for processing common crawl …
@33443   5 years  ak19     More notes
@33442   5 years  ak19     Updated gutil.jar file (with SafeProcess debugging)
@33441   5 years  ak19     Adding further notes to do with running the CC-index examples on spark.
@33440   5 years  ak19     Split file to move vagrant-spark-hadoop notes into own file.
@33428   5 years  ak19     Working commoncrawl cc-warc-examples' WET wordcount example using …
@33425   5 years  ak19     A few more links now that I got past getting the vagrant VM with spark …
@33423   5 years  ak19     Adding in the link to the vagrant VM with Hadoop, Spark for cluster …
@33422   5 years  ak19     Some more links.
@33419   5 years  ak19     Last evening, I had found some links about how language-detection is …
@33414   5 years  ak19     Adding important links
@33413   5 years  ak19     Splitting the get_commoncrawl_nz_urls.sh script back into 2 scripts, …
@33412   5 years  ak19     Config command for wgetting a single file
@33411   5 years  ak19     Newer version now doesn't mirror sites with wget but gets WET files …
@33410   5 years  ak19     Committing some variable name changes before I replace this file with …
@33409   5 years  ak19     Forgot to commit 2 files with links and shuffling some links around …
@33408   5 years  ak19     Some rough notes. Will move into appropriate file later.
@33407   5 years  ak19     gutil.jar was rebuilt yesterday in GS3 after a bugfix. Recommitting …
@33405   5 years  ak19     Even though we're probably not going to use this code after all, will …
@33404   5 years  ak19     1. Links to other Java ways of extracting text from web content. 2. …
@33402   5 years  ak19     Beginnings of the Java class to wget sites and process its pages to …
@33401   5 years  ak19     MaoriTextDetector.class file now generated inside its package folder …
@33400   5 years  ak19     1. Setting up log4j.properties based on the macronizer's basic one …
@33399   5 years  ak19     Putting properties files into the conf folder and keeping the lib …
@33398   5 years  ak19     Committing the actual package structure and the updated README after …
@33397   5 years  ak19     1. Changing package structure and instructions on compiling/running as …
@33396   5 years  ak19     Georgian language gs3colcfg module of GS interface. Many thanks to …
@33394   5 years  ak19     1. Started a file on feasibility with the data now available and some …
@33393   5 years  ak19     Modified the get_commoncrawl_nz_urls.sh to also create a reduced urls …
@33392   5 years  ak19     Kathy found a problem whereby she wanted to run consecutive buildcols …
@33391   5 years  ak19     Some rough bash scripting lines that work but aren't complete.
@33390   5 years  ak19     Minor message telling the user to wait for a task that takes some time.
@33388   5 years  kjdon    Tidied up some debug statements