Change Rev Age Author Log Message
(edit) @33563   5 years ak19 Committing inactive testing batch scripts (only creates the …
(edit) @33562   5 years ak19 1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
(edit) @33561   5 years ak19 1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
(edit) @33560   5 years ak19 1. Incorporated Dr Bainbridge's suggested improvements: only when …
(edit) @33559   5 years ak19 1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
(edit) @33558   5 years ak19 Committing cumulative changes since last commit.
(edit) @33557   5 years ak19 Implemented the topSitesMap of topsite domain to url pattern in the …
(edit) @33556   5 years ak19 Blacklisted wikipedia pages that are actually in other languages which …
(edit) @33555   5 years ak19 Modified top sites list as Dr Bainbridge described: suffixes for the …
(edit) @33554   5 years ak19 Added more to blacklist and greylist. And removed remaining duplicates …
(edit) @33553   5 years ak19 Comments
(edit) @33552   5 years ak19 1. Code now processes ccrawldata folder, containing each individual …
(edit) @33551   5 years ak19 Added in top 500 urls from moz.com/top500 and removed duplicates, and …
(edit) @33550   5 years ak19 First stage of introducing sites-too-big-to-exhaustively-crawl.tx: …
(edit) @33549   5 years ak19 All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 …
(edit) @33548   5 years davidb Include new wavesurfer sub-project to install
(edit) @33547   5 years davidb Initial cut at wavesurfer JS audio player version of AMC music content …
(edit) @33546   5 years davidb Initial cut at wave-surfer based JS audio player extension for Greenstone
(edit) @33545   5 years ak19 Mainly changes to crawling-Nutch.txt and some minor changes to other …
(edit) @33544   5 years ak19 1. Dr Bainbridge had the correct fix for solr dealing with phrase …
(edit) @33543   5 years ak19 Filled in some missing instructions
(edit) @33542   5 years kjdon use_hlist_for option is no longer valid
(edit) @33541   5 years ak19 1. hdfs-cc-work/GS_README.txt now contains the complete instructions …
(edit) @33540   5 years ak19 Since I wasn't getting further with nutch 2 to grab an entire site, I …
(edit) @33539   5 years ak19 File rename
(edit) @33538   5 years ak19 Some additions to the setup.sh script to query commoncrawl for MRI …
(edit) @33537   5 years ak19 More nutch and general site mirroring related links
(edit) @33536   5 years ak19 Changes required to the commoncrawl related Vagrant github project to …
(edit) @33535   5 years ak19 1. New setup.sh script for on a hadoop system to setup the git …
(edit) @33534   5 years ak19 Correction: toplevel script has to be placed inside cc-index-table not …
(edit) @33533   5 years kjdon some collections might not have Title or root_Title metadata, so check …
(edit) @33532   5 years ak19 Found the other top 500 sites link again at last which Dr Bainbridge …
(edit) @33531   5 years ak19 Added whitelist for mi.wikipedia.org, and updates to blacklist and …
(edit) @33530   5 years ak19 Completed sentence that was left hanging.
(edit) @33529   5 years ak19 Forgot to add most basic nutch links
(edit) @33528   5 years ak19 Adding in Nutch links
(edit) @33527   5 years ak19 Name change for folder
(edit) @33526   5 years ak19 Moved hadoop related scripts from bin/script into hdfs-instructions
(edit) @33525   5 years ak19 Rename before latest version
(edit) @33524   5 years ak19 1. Further adjustments to documenting what we did to get things to run …
(edit) @33523   5 years ak19 Instructional comment
(edit) @33522   5 years ak19 Some comments and an improvement
(edit) @33521   5 years ak19 AUTOCOMMIT by gen-model-colls.sh script. Message: Redoing the CDS-ISIS …
(edit) @33520   5 years ak19 AUTOCOMMIT by gen-model-colls.sh script. Message: Redoing the CDS-ISIS …
(edit) @33519   5 years ak19 Code still writes out the global seedURLs.txt and regex-urlfilter.txt …
(edit) @33518   5 years ak19 Intermediate commit: got the seed urls file temporarily written out as …
(edit) @33517   5 years ak19 1. Blacklists were introduced so that too many instances of camelcased …
(edit) @33516   5 years ak19 Before I accidentally lose it, committing the script Dr Bainbridge …
(edit) @33515   5 years ak19 Removed an unused function
(edit) @33514   5 years ak19 Committing README on starting off with the vagrant VM for hadoop-spark …
(edit) @33513   5 years ak19 Higher level script that runs against each named crawl since Sep 2018 …
(edit) @33512   5 years ak19 AUTOCOMMIT by gen-model-colls.sh script. Message: Rebuilding all the …
(edit) @33511   5 years ak19 AUTOCOMMIT by gen-model-colls.sh script. Message: Rebuilding all the …
(edit) @33510   5 years kjdon isEditingTurnedOn renamed to isEditingAllowed, and added …
(edit) @33509   5 years kjdon only display Map GPS editing stuff if its allowed in config file
(edit) @33508   5 years kjdon pass a param into readyPageForEditing - indicates whether to add the …
(edit) @33507   5 years kjdon moved canDoEditing variable code to top, so can be used everywhere in …
(edit) @33506   5 years kjdon need to check whether document editing is turned on, not just if the …
(edit) @33505   5 years kjdon allowUserComments option changed to start with lower case a, to match …
(edit) @33504   5 years kjdon allowDocumentEditing option changed to start with lower case a, to …
(edit) @33503   5 years ak19 More efficient blacklisting/greylisting/whitelisting now by reading in …
(edit) @33502   5 years ak19 Current url pattern blacklist and greylist filter files. Used by …
(edit) @33501   5 years ak19 Refactored code into 2 classes: The existing WETProcessor, which …
(edit) @33500   5 years ak19 ThemeRoller download functionality currently offline. So uploading the …
(edit) @33499   5 years ak19 Explicitly adding in IAM policy configuration details instead of just …
(edit) @33498   5 years ak19 Corrections to script. Modified the tests checking for file/dir …
(edit) @33497   5 years ak19 First version of discard url filter file. Inefficient implementation. …
(edit) @33496   5 years ak19 Minor changes to reading list file
(edit) @33495   5 years ak19 Pruned out unused commands, added comments, marked unused variables to …
(edit) @33494   5 years ak19 All in one script that takes as parameter a common crawl identifier of …
(edit) @33493   5 years kjdon if we are on a cross collection search page, the collection for each …
(edit) @33492   5 years kjdon not all ccs pages has hierarchy element, so just test on s1.collection
(edit) @33491   5 years kjdon need to add optional args for doc links into the CCS format links. …
(edit) @33490   5 years kjdon changed default partition sizes back to 20, to match what was there …
(edit) @33489   5 years ak19 Handy file to not have to keep manually repeating commands when …
(edit) @33488   5 years ak19 new function createSeedURLsFiles() in WETProcessor that replaces the …
(edit) @33487   5 years kjdon added code to display any error messages
(edit) @33486   5 years kjdon reindented the page, added some extra links, and organised the items …
(edit) @33485   5 years kjdon removed an erroneous space
(edit) @33484   5 years kjdon some changes and additions to the debuginfo page texts
(edit) @33483   5 years kjdon added an explicit space after Error:
(edit) @33482   5 years kjdon changed standardize_capitalization to …
(edit) @33481   5 years kjdon a few more refinements to List strings
(edit) @33480   5 years ak19 Much harder to remove pages where words are fused together as some are …
(edit) @33479   5 years kjdon changed numeric option order to match letter options
(edit) @33478   5 years kjdon some refining of list option descriptions
(edit) @33477   5 years kjdon need to call setup_custom_sort to allow for collection's customsorttools.pm
(edit) @33476   5 years kjdon enabled having customsorttools in collection's perllib folder. you can …
(edit) @33475   5 years kjdon added numeric partition defaults to match partition type
(edit) @33474   5 years kjdon it turns out that childtype is not set in all cases, so put in the …
(edit) @33473   5 years kjdon still didn't get it quite right…
(edit) @33472   5 years kjdon forgot the -> to access member of a hash ref
(edit) @33471   5 years ak19 Very minor changes.
(edit) @33470   5 years ak19 A new script to reduce keepURLs.txt to unique URLs, 1 from each unique …
(edit) @33469   5 years ak19 Don't want URLs with the word product(s) in them (but production …
(edit) @33468   5 years ak19 More meaningful to (also) write out the keep vs discard URLs into keep …
(edit) @33467   5 years ak19 Improved the code to use a static block to load the needed properties …
(edit) @33466   5 years ak19 1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) …
(edit) @33465   5 years ak19 Committing first version of the WETProcessor.java which takes a …
(edit) @33464   5 years kjdon I committed the last changes by mistake, using the previous revision …