Timeline
2019-10-17:
- 23:12 Changeset [33582] by
- NutchTextDumpProcessor prints each crawled site's stats: number of …
- 21:53 Changeset [33581] by
- Minor fix. Noticed when looking for work I did on MRI sentence detection
- 21:44 Changeset [33580] by
- Finally fixed the thus-far identified bugs when parsing dump.txt.
- 21:05 Changeset [33579] by
- Debugging. Solved one problem.
- 19:31 Changeset [33578] by
- Corrections for compiling the 2 new classes.
- 19:12 Changeset [33577] by
- Forgot to adjust usage statement to say that silent mode was already …
2019-10-16:
- 23:37 Changeset [33576] by
- Introducing 2 new Java files still being written and untested. …
- 23:36 Changeset [33575] by
- Correcting usage string for CCWETProcessor before committing new java …
- 23:35 Changeset [33574] by
- If nutch stores a crawled site in more than 1 file, then cat all of …
- 21:39 Changeset [33573] by
- Forgot to document that spaces were also allowed as separator in the …
- 21:18 Changeset [33572] by
- Only meant to store the wet.gz versions of these files, not also the …
- 21:11 Changeset [33571] by
- Adding Dr Bainbridge's suggestion of appending the crawlId of each …
- 20:04 Changeset [33570] by
- Need to check if UNFINISHED file actually exists before moving it …
- 20:00 Changeset [33569] by
- 1. batchcrawl.sh now does what it should have from the start, which is …
2019-10-14:
- 23:36 Changeset [33568] by
- 1. More sites greylisted and blacklisted, discovered as I attempted to …
- 22:40 Changeset [33567] by
- batchcrawl.sh now supports -all flag (and prints usage on 0 args). The …
- 22:07 Changeset [33566] by
- batchcrawl.sh script now supports taking a comma or space separated …
- 21:04 Changeset [33565] by
- CCWETProcessor: domain url now goes in as a seedURL after the …
- 21:01 Changeset [33564] by
- batchcrawl.sh now does the crawl and logs output of the crawl, dumps …
2019-10-11:
- 23:29 Changeset [33563] by
- Committing inactive testing batch scripts (only creates the …
- 21:52 Changeset [33562] by
- 1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
- 20:49 Changeset [33561] by
- 1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
2019-10-10:
- 23:49 Changeset [33560] by
- 1. Incorporated Dr Bainbridge's suggested improvements: only when …
- 23:44 Changeset [33559] by
- 1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
- 23:41 Changeset [33558] by
- Committing cumulative changes since last commit.
2019-10-09:
- 23:10 Changeset [33557] by
- Implemented the topSitesMap of topsite domain to url pattern in the …
- 18:58 Changeset [33556] by
- Blacklisted wikipedia pages that are actually in other languages which …
- 18:43 Changeset [33555] by
- Modified top sites list as Dr Bainbridge described: suffixes for the …
- 18:11 Changeset [33554] by
- Added more to blacklist and greylist. And removed remaining duplicates …
2019-10-04:
- 22:19 Changeset [33553] by
- Comments
- 22:00 Changeset [33552] by
- 1. Code now processes ccrawldata folder, containing each individual …
- 19:35 Changeset [33551] by
- Added in top 500 urls from moz.com/top500 and removed duplicates, and …
- 19:06 Changeset [33550] by
- First stage of introducing sites-too-big-to-exhaustively-crawl.txt: …
- 18:29 Changeset [33549] by
- All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 …
- 14:36 Changeset [33548] by
- Include new wavesurfer sub-project to install
- 14:30 Changeset [33547] by
- Initial cut at wavesurfer JS audio player version of AMC music content …
- 14:19 Changeset [33546] by
- Initial cut at wave-surfer based JS audio player extension for Greenstone
2019-10-03:
- 22:38 Changeset [33545] by
- Mainly changes to crawling-Nutch.txt and some minor changes to other …
- 18:56 Changeset [33544] by
- 1. Dr Bainbridge had the correct fix for solr dealing with phrase …
2019-10-02:
- 17:01 Changeset [33543] by
- Filled in some missing instructions
- 15:25 Changeset [33542] by
- use_hlist_for option is no longer valid
2019-10-01:
- 22:27 Changeset [33541] by
- 1. hdfs-cc-work/GS_README.txt now contains the complete instructions …
- 21:40 Changeset [33540] by
- Since I wasn't getting further with nutch 2 to grab an entire site, I …
- 21:36 Changeset [33539] by
- File rename
- 21:36 Changeset [33538] by
- Some additions to the setup.sh script to query commoncrawl for MRI …
2019-09-30:
- 22:51 Changeset [33537] by
- More nutch and general site mirroring related links
- 21:28 Changeset [33536] by
- Changes required to the commoncrawl related Vagrant github project to …
- 19:20 Ticket #956 (Minor changes to config for Images GPS tutorial after 3.09) created by
- As at 30 Sep 2019, the Images GPS collection works the same with a …
- 16:49 Changeset [33535] by
- 1. New setup.sh script for use on a hadoop system to set up the git …
2019-09-27:
- 17:05 Changeset [33534] by
- Correction: toplevel script has to be placed inside cc-index-table not …
- 11:02 Changeset [33533] by
- some collections might not have Title or root_Title metadata, so check …
2019-09-26:
- 23:06 Changeset [33532] by
- Found the other top 500 sites link again at last which Dr Bainbridge …
- 23:03 Changeset [33531] by
- Added whitelist for mi.wikipedia.org, and updates to blacklist and …
- 22:41 Changeset [33530] by
- Completed sentence that was left hanging.
- 22:22 Changeset [33529] by
- Forgot to add most basic nutch links
- 21:47 Changeset [33528] by
- Adding in Nutch links
- 20:39 Changeset [33527] by
- Name change for folder
- 20:38 Changeset [33526] by
- Moved hadoop related scripts from bin/script into hdfs-instructions
- 20:35 Changeset [33525] by
- Rename before latest version
- 20:34 Changeset [33524] by
- 1. Further adjustments to documenting what we did to get things to run …
- 19:00 Changeset [33523] by
- Instructional comment
- 19:00 Changeset [33522] by
- Some comments and an improvement
- 17:49 Changeset [33521] by
- AUTOCOMMIT by gen-model-colls.sh script. Message: Redoing the CDS-ISIS …
- 17:49 Changeset [33520] by
- AUTOCOMMIT by gen-model-colls.sh script. Message: Redoing the CDS-ISIS …
2019-09-24:
- 21:40 Changeset [33519] by
- Code still writes out the global seedURLs.txt and regex-urlfilter.txt …
- 21:13 Changeset [33518] by
- Intermediate commit: got the seed urls file temporarily written out as …
- 20:30 Changeset [33517] by
- 1. Blacklists were introduced so that too many instances of camelcased …
- 20:14 Changeset [33516] by
- Before I accidentally lose it, committing the script Dr Bainbridge …
- 19:50 Changeset [33515] by
- Removed an unused function
- 19:44 Changeset [33514] by
- Committing README on starting off with the vagrant VM for hadoop-spark …
- 19:15 Changeset [33513] by
- Higher level script that runs against each named crawl since Sep 2018 …
- 15:17 Changeset [33512] by
- AUTOCOMMIT by gen-model-colls.sh script. Message: Rebuilding all the …
- 15:16 Changeset [33511] by
- AUTOCOMMIT by gen-model-colls.sh script. Message: Rebuilding all the …
- 14:13 Changeset [33510] by
- isEditingTurnedOn renamed to isEditingAllowed, and added …
- 14:12 Changeset [33509] by
- only display Map GPS editing stuff if it's allowed in config file
- 14:07 Changeset [33508] by
- pass a param into readyPageForEditing - indicates whether to add the …
- 13:24 Changeset [33507] by
- moved canDoEditing variable code to top, so can be used everywhere in …
- 13:04 Changeset [33506] by
- need to check whether document editing is turned on, not just if the …
- 12:55 Changeset [33505] by
- allowUserComments option changed to start with lower case a, to match …
- 12:53 Changeset [33504] by
- allowDocumentEditing option changed to start with lower case a, to …
- 10:23 Ticket #955 (Use of GreenStone/Koha with Multimedia Production Management - e.g. Lumiera) created by
- This was a feature request added to sourceforge greenstone 3 project …
2019-09-23:
- 23:16 Changeset [33503] by
- More efficient blacklisting/greylisting/whitelisting now by reading in …
- 23:11 Changeset [33502] by
- Current url pattern blacklist and greylist filter files. Used by …
- 21:28 Changeset [33501] by
- Refactored code into 2 classes: The existing WETProcessor, which …
- 19:05 Changeset [33500] by
- ThemeRoller download functionality currently offline. So uploading the …
- 17:59 Changeset [33499] by
- Explicitly adding in IAM policy configuration details instead of just …
- 16:43 Changeset [33498] by
- Corrections to script. Modified the tests checking for file/dir …
2019-09-22:
- 21:17 Changeset [33497] by
- First version of discard url filter file. Inefficient implementation. …
- 19:23 Changeset [33496] by
- Minor changes to reading list file
- 19:19 Changeset [33495] by
- Pruned out unused commands, added comments, marked unused variables to …
2019-09-21:
- 22:49 Changeset [33494] by
- All in one script that takes as parameter a common crawl identifier of …
2019-09-19:
- 14:24 Changeset [33493] by
- if we are on a cross collection search page, the collection for each …
- 13:43 Changeset [33492] by
- not all ccs pages have a hierarchy element, so just test on s1.collection
- 13:23 Changeset [33491] by
- need to add optional args for doc links into the CCS format links. …
- 12:34 Changeset [33490] by
- changed default partition sizes back to 20, to match what was there …
2019-09-18:
- 20:20 Changeset [33489] by
- Handy file to not have to keep manually repeating commands when …
2019-09-17:
- 14:48 Changeset [33488] by
- new function createSeedURLsFiles() in WETProcessor that replaces the …
- 14:24 Changeset [33487] by
- added code to display any error messages
- 14:23 Changeset [33486] by
- reindented the page, added some extra links, and organised the items …
- 14:22 Changeset [33485] by
- removed an erroneous space
- 14:21 Changeset [33484] by
- some changes and additions to the debuginfo page texts
- 14:20 Changeset [33483] by
- added an explicit space after Error:
- 10:55 Changeset [33482] by
- changed standardize_capitalization to …
- 10:41 Changeset [33481] by
- a few more refinements to List strings