Revision Log: root/gs3-extensions

Rev     Date      Author  Log Message
@33560  7 days    ak19    1. Incorporated Dr Bainbridge's suggested improvements: only when there is …
@33559  7 days    ak19    1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
@33558  7 days    ak19    Committing cumulative changes since last commit.
@33557  8 days    ak19    Implemented the topSitesMap of topsite domain to url pattern in the only …
@33556  8 days    ak19    Blacklisted wikipedia pages that are actually in other languages which had …
@33555  8 days    ak19    Modified top sites list as Dr Bainbridge described: suffixes for the same …
@33554  8 days    ak19    Added more to blacklist and greylist. And removed remaining duplicates …
@33553  13 days   ak19    Comments
@33552  13 days   ak19    1. Code now processes ccrawldata folder, containing each individual common …
@33551  2 weeks   ak19    Added in top 500 urls from moz.com/top500 and removed duplicates, and …
@33550  2 weeks   ak19    First stage of introducing sites-too-big-to-exhaustively-crawl.txt: split …
@33549  2 weeks   ak19    All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 (when …
@33548  2 weeks   davidb  Include new wavesurfer sub-project to install
@33546  2 weeks   davidb  Initial cut at wave-surfer based JS audio player extension for Greenstone
@33545  2 weeks   ak19    Mainly changes to crawling-Nutch.txt and some minor changes to other txt …
@33543  2 weeks   ak19    Filled in some missing instructions
@33541  2 weeks   ak19    1. hdfs-cc-work/GS_README.txt now contains the complete instructions to …
@33540  2 weeks   ak19    Since I wasn't getting further with nutch 2 to grab an entire site, I am …
@33539  2 weeks   ak19    File rename
@33538  2 weeks   ak19    Some additions to the setup.sh script to query commoncrawl for MRI data on …
@33537  2 weeks   ak19    More nutch and general site mirroring related links
@33536  2 weeks   ak19    Changes required to the commoncrawl related Vagrant github project to get …
@33535  2 weeks   ak19    1. New setup.sh script for on a hadoop system to setup the git projects we …
@33534  3 weeks   ak19    Correction: toplevel script has to be placed inside cc-index-table not its …
@33532  3 weeks   ak19    Found the other top 500 sites link again at last which Dr Bainbridge had …
@33531  3 weeks   ak19    Added whitelist for mi.wikipedia.org, and updates to blacklist and …
@33530  3 weeks   ak19    Completed sentence that was left hanging.
@33529  3 weeks   ak19    Forgot to add most basic nutch links
@33528  3 weeks   ak19    Adding in Nutch links
@33527  3 weeks   ak19    Name change for folder
@33526  3 weeks   ak19    Moved hadoop related scripts from bin/script into hdfs-instructions
@33525  3 weeks   ak19    Rename before latest version
@33524  3 weeks   ak19    1. Further adjustments to documenting what we did to get things to run on …
@33523  3 weeks   ak19    Instructional comment
@33522  3 weeks   ak19    Some comments and an improvement
@33519  3 weeks   ak19    Code still writes out the global seedURLs.txt and regex-urlfilter.txt (in …
@33518  3 weeks   ak19    Intermediate commit: got the seed urls file temporarily written out as …
@33517  3 weeks   ak19    1. Blacklists were introduced so that too many instances of camelcased …
@33516  3 weeks   ak19    Before I accidentally lose it, committing the script Dr Bainbridge wrote …
@33515  3 weeks   ak19    Removed an unused function
@33514  3 weeks   ak19    Committing README on starting off with the vagrant VM for hadoop-spark to …
@33513  3 weeks   ak19    Higher level script that runs against each named crawl since Sep 2018 …
@33503  3 weeks   ak19    More efficient blacklisting/greylisting/whitelisting now by reading in the …
@33502  3 weeks   ak19    Current url pattern blacklist and greylist filter files. Used by …
@33501  3 weeks   ak19    Refactored code into 2 classes: The existing WETProcessor, which processes …
@33499  3 weeks   ak19    Explicitly adding in IAM policy configuration details instead of just …
@33498  3 weeks   ak19    Corrections to script. Modified the tests checking for file/dir existence …
@33497  4 weeks   ak19    First version of discard url filter file. Inefficient implementation. …
@33496  4 weeks   ak19    Minor changes to reading list file
@33495  4 weeks   ak19    Pruned out unused commands, added comments, marked unused variables to be …
@33494  4 weeks   ak19    All in one script that takes as parameter a common crawl identifier of the …
@33489  4 weeks   ak19    Handy file to not have to keep manually repeating commands when deleting …
@33488  4 weeks   ak19    new function createSeedURLsFiles() in WETProcessor that replaces the bash …
@33480  4 weeks   ak19    Much harder to remove pages where words are fused together as some are …
@33471  5 weeks   ak19    Very minor changes.
@33470  5 weeks   ak19    A new script to reduce keepURLs.txt to unique URLs, 1 from each unique …
@33469  5 weeks   ak19    Don't want URLs with the word product(s) in them (but production should be …
@33468  5 weeks   ak19    More meaningful to (also) write out the keep vs discard URLs into keep and …
@33467  5 weeks   ak19    Improved the code to use a static block to load the needed properties from …
@33466  5 weeks   ak19    1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) files. …
@33465  5 weeks   ak19    Committing first version of the WETProcessor.java which takes a .warc.wet …
@33457  6 weeks   ak19    Got stage 1, the WARC to WET conversion, working, after necessary …
@33456  6 weeks   ak19    Link to discussion on how to convert WARC to WET
@33448  7 weeks   ak19    Minor clarification and inclusion of helpful command
@33446  7 weeks   ak19    1. Committing working version of export_maori_subset.sh which takes the …
@33445  7 weeks   ak19    The first working hadoop spark script for processing common crawl data. …
@33443  7 weeks   ak19    More notes
@33442  7 weeks   ak19    Updated gutil.jar file (with SafeProcess debugging)
@33441  7 weeks   ak19    Adding further notes to do with running the CC-index examples on spark.
@33440  7 weeks   ak19    Split file to move vagrant-spark-hadoop notes into own file.
@33428  2 months  ak19    Working commoncrawl cc-warc-examples' WET wordcount example using Hadoop. …
@33425  2 months  ak19    A few more links now that I got past getting the vagrant VM with spark and …
@33423  2 months  ak19    Adding in the link to the vagrant VM with Hadoop, Spark for cluster …
@33422  2 months  ak19    Some more links.
@33419  2 months  ak19    Last evening, I had found some links about how language-detection is done …
@33414  2 months  ak19    Adding important links
@33413  2 months  ak19    Splitting the get_commoncrawl_nz_urls.sh script back into 2 scripts, …
@33412  2 months  ak19    config command for wgetting a single file
@33411  2 months  ak19    Newer version now doesn't mirror sites with wget but gets WET files and …
@33410  2 months  ak19    Committing some variable name changes before I replace this file with the …
@33409  2 months  ak19    Forgot to commit 2 files with links and shuffling some links around into …
@33408  2 months  ak19    Some rough notes. Will move into appropriate file later.
@33407  2 months  ak19    gutil.jar was rebuilt yesterday in GS3 after a bugfix. Recommitting for …
@33405  2 months  ak19    Even though we're probably not going to use this code after all, will …
@33404  2 months  ak19    1. Links to other Java ways of extracting text from web content. 2. …
@33402  2 months  ak19    Beginnings of the Java class to wget sites and process their pages to detect …
@33401  2 months  ak19    MaoriTextDetector.class file now generated inside its package folder (for …
@33400  2 months  ak19    1. Setting up log4j.properties based on the macronizer's basic one that I …
@33399  2 months  ak19    Putting properties files into the conf folder and keeping the lib folder …
@33398  2 months  ak19    Committing the actual package structure and the updated README after …
@33397  2 months  ak19    1. Changing package structure and instructions on compiling/running as …
@33396  2 months  ak19    Georgian language gs3colcfg module of GS interface. Many thanks to Vano …
@33394  2 months  ak19    1. Started a file on feasibility with the data now available and some …
@33393  2 months  ak19    Modified the get_commoncrawl_nz_urls.sh to also create a reduced urls file …
@33392  2 months  ak19    Kathy found a problem whereby she wanted to run consecutive buildcols …
@33391  2 months  ak19    Some rough bash scripting lines that work but aren't complete.
@33390  2 months  ak19    Minor message telling the user to wait for a task that takes some time.
@33388  2 months  kjdon   tidied up some debug statements
@33379  3 months  ak19    New script to automate getting a file listing of the common crawl URL data …
@33378  3 months  ak19    New bin/script folder and relocating gen_SentenceDetection_model.sh to …