Timeline



2019-10-24:

23:22 Changeset [33604] by ak19
1. Better output into possible-product-sites.txt including the …
22:04 Changeset [33603] by ak19
Incorporating Dr Nichols' suggestion to help weed out product sites: if …

2019-10-23:

23:49 Changeset [33602] by ak19
1. The final csv file, mri-sentences.csv, is now written out. 2. Only …
23:22 Changeset [33601] by ak19
Creates the 2nd csv file, with info about webpages. At present stores …
23:05 Changeset [33600] by ak19
Work in progress of writing out CSV files. In future, may write the …

2019-10-22:

20:49 Changeset [33599] by ak19
First one-third of the sites crawled. Committing to SVN despite the tarred …
20:19 Changeset [33598] by ak19
More instructions on setting up Nutch now that I've remembered to …
20:05 Changeset [33597] by ak19
Committing active version of template file which has a newline at end …
18:44 Changeset [33596] by ak19
Adding in the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template …
14:05 Changeset [33595] by kjdon
new displayBaskets template - to avoid replicating code in query and …
14:00 Changeset [33594] by kjdon
call gslib:displayBasket instead of replicating the code here
13:59 Changeset [33593] by kjdon
the test for facets should be facetList/facet/count, as the facets get …
13:51 Changeset [33592] by kjdon
reindented the file
11:51 Changeset [33591] by kjdon
added in some strings for 'this collection contains x documents and …
11:12 Changeset [33590] by kjdon
added 'this collection contains X documents and was last built Y days …

2019-10-21:

21:45 Changeset [33589] by cpb16
final01. Need Map results still

2019-10-18:

23:20 Changeset [33588] by ak19
Committing the MRI sentence model that I'm actually using, the one in …
23:16 Changeset [33587] by ak19
1. Better stats reporting on crawled sites: not just if a page was in …
22:20 Changeset [33586] by ak19
Refactored MaoriTextDetector.java class into more general …
21:41 Changeset [33585] by ak19
Much simpler way of using sentence and language detection model to …
21:20 Changeset [33584] by ak19
Committing experimental version 2 using the sentence detector model, …
21:20 Changeset [33583] by ak19
Committing experimental version 1 using the sentence detector model, …

2019-10-17:

23:12 Changeset [33582] by ak19
NutchTextDumpProcessor prints each crawled site's stats: number of …
21:53 Changeset [33581] by ak19
Minor fix. Noticed when looking for work I did on MRI sentence detection
21:44 Changeset [33580] by ak19
Finally fixed the thus-far identified bugs when parsing dump.txt.
21:05 Changeset [33579] by ak19
Debugging. Solved one problem.
19:31 Changeset [33578] by ak19
Corrections for compiling the 2 new classes.
19:12 Changeset [33577] by ak19
Forgot to adjust usage statement to say that silent mode was already …

2019-10-16:

23:37 Changeset [33576] by ak19
Introducing 2 new Java files still being written and untested. …
23:36 Changeset [33575] by ak19
Correcting usage string for CCWETProcessor before committing new java …
23:35 Changeset [33574] by ak19
If nutch stores a crawled site in more than 1 file, then cat all of …
21:39 Changeset [33573] by ak19
Forgot to document that spaces were also allowed as separator in the …
21:18 Changeset [33572] by ak19
Only meant to store the wet.gz versions of these files, not also the …
21:11 Changeset [33571] by ak19
Adding Dr Bainbridge's suggestion of appending the crawlId of each …
20:04 Changeset [33570] by ak19
Need to check if UNFINISHED file actually exists before moving it …
20:00 Changeset [33569] by ak19
1. batchcrawl.sh now does what it should have from the start, which is …

2019-10-14:

23:36 Changeset [33568] by ak19
1. More sites greylisted and blacklisted, discovered as I attempted to …
22:40 Changeset [33567] by ak19
batchcrawl.sh now supports -all flag (and prints usage on 0 args). The …
22:07 Changeset [33566] by ak19
batchcrawl.sh script now supports taking a comma or space separated …
21:04 Changeset [33565] by ak19
CCWETProcessor: domain url now goes in as a seedURL after the …
21:01 Changeset [33564] by ak19
batchcrawl.sh now does the crawl and logs output of the crawl, dumps …

2019-10-11:

23:29 Changeset [33563] by ak19
Committing inactive testing batch scripts (only creates the …
21:52 Changeset [33562] by ak19
1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
20:49 Changeset [33561] by ak19
1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …

2019-10-10:

23:49 Changeset [33560] by ak19
1. Incorporated Dr Bainbridge's suggested improvements: only when …
23:44 Changeset [33559] by ak19
1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
23:41 Changeset [33558] by ak19
Committing cumulative changes since last commit.

2019-10-09:

23:10 Changeset [33557] by ak19
Implemented the topSitesMap of topsite domain to url pattern in the …
18:58 Changeset [33556] by ak19
Blacklisted wikipedia pages that are actually in other languages which …
18:43 Changeset [33555] by ak19
Modified top sites list as Dr Bainbridge described: suffixes for the …
18:11 Changeset [33554] by ak19
Added more to blacklist and greylist. And removed remaining duplicates …

2019-10-04:

22:19 Changeset [33553] by ak19
Comments
22:00 Changeset [33552] by ak19
1. Code now processes ccrawldata folder, containing each individual …
19:35 Changeset [33551] by ak19
Added in top 500 urls from moz.com/top500 and removed duplicates, and …
19:06 Changeset [33550] by ak19
First stage of introducing sites-too-big-to-exhaustively-crawl.txt: …
18:29 Changeset [33549] by ak19
All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 …
14:36 Changeset [33548] by davidb
Include new wavesurfer sub-project to install
14:30 Changeset [33547] by davidb
Initial cut at wavesurfer JS audio player version of AMC music content …
14:19 Changeset [33546] by davidb
Initial cut at wave-surfer based JS audio player extension for Greenstone

2019-10-03:

22:38 Changeset [33545] by ak19
Mainly changes to crawling-Nutch.txt and some minor changes to other …
18:56 Changeset [33544] by ak19
1. Dr Bainbridge had the correct fix for solr dealing with phrase …

2019-10-02:

17:01 Changeset [33543] by ak19
Filled in some missing instructions
15:25 Changeset [33542] by kjdon
use_hlist_for option is no longer valid

2019-10-01:

22:27 Changeset [33541] by ak19
1. hdfs-cc-work/GS_README.txt now contains the complete instructions …
21:40 Changeset [33540] by ak19
Since I wasn't getting further with nutch 2 to grab an entire site, I …
21:36 Changeset [33539] by ak19
File rename
21:36 Changeset [33538] by ak19
Some additions to the setup.sh script to query commoncrawl for MRI …

2019-09-30:

22:51 Changeset [33537] by ak19
More nutch and general site mirroring related links
21:28 Changeset [33536] by ak19
Changes required to the commoncrawl related Vagrant github project to …
19:20 Ticket #956 (Minor changes to config for Images GPS tutorial after 3.09) created by ak19
As at 30 Sep 2019, the Images GPS collection works the same with a …
16:49 Changeset [33535] by ak19
1. New setup.sh script for use on a hadoop system to set up the git …

2019-09-27:

17:05 Changeset [33534] by ak19
Correction: toplevel script has to be placed inside cc-index-table not …
11:02 Changeset [33533] by kjdon
some collections might not have Title or root_Title metadata, so check …

2019-09-26:

23:06 Changeset [33532] by ak19
Found the other top 500 sites link again at last which Dr Bainbridge …
23:03 Changeset [33531] by ak19
Added whitelist for mi.wikipedia.org, and updates to blacklist and …
22:41 Changeset [33530] by ak19
Completed sentence that was left hanging.
22:22 Changeset [33529] by ak19
Forgot to add most basic nutch links
21:47 Changeset [33528] by ak19
Adding in Nutch links
20:39 Changeset [33527] by ak19
Name change for folder
20:38 Changeset [33526] by ak19
Moved hadoop related scripts from bin/script into hdfs-instructions
20:35 Changeset [33525] by ak19
Rename before latest version
20:34 Changeset [33524] by ak19
1. Further adjustments to documenting what we did to get things to run …
19:00 Changeset [33523] by ak19
Instructional comment
19:00 Changeset [33522] by ak19
Some comments and an improvement
17:49 Changeset [33521] by ak19
AUTOCOMMIT by gen-model-colls.sh script. Message: Redoing the CDS-ISIS …
17:49 Changeset [33520] by ak19
AUTOCOMMIT by gen-model-colls.sh script. Message: Redoing the CDS-ISIS …

2019-09-24:

21:40 Changeset [33519] by ak19
Code still writes out the global seedURLs.txt and regex-urlfilter.txt …
21:13 Changeset [33518] by ak19
Intermediate commit: got the seed urls file temporarily written out as …
20:30 Changeset [33517] by ak19
1. Blacklists were introduced so that too many instances of camelcased …
20:14 Changeset [33516] by ak19
Before I accidentally lose it, committing the script Dr Bainbridge …
19:50 Changeset [33515] by ak19
Removed an unused function
19:44 Changeset [33514] by ak19
Committing README on starting off with the vagrant VM for hadoop-spark …
19:15 Changeset [33513] by ak19
Higher level script that runs against each named crawl since Sep 2018 …
15:17 Changeset [33512] by ak19
AUTOCOMMIT by gen-model-colls.sh script. Message: Rebuilding all the …
15:16 Changeset [33511] by ak19
AUTOCOMMIT by gen-model-colls.sh script. Message: Rebuilding all the …
14:13 Changeset [33510] by kjdon
isEditingTurnedOn renamed to isEditingAllowed, and added …
14:12 Changeset [33509] by kjdon
only display Map GPS editing stuff if it's allowed in config file
14:07 Changeset [33508] by kjdon
pass a param into readyPageForEditing - indicates whether to add the …
13:24 Changeset [33507] by kjdon
moved canDoEditing variable code to top, so can be used everywhere in …
13:04 Changeset [33506] by kjdon
need to check whether document editing is turned on, not just if the …
12:55 Changeset [33505] by kjdon
allowUserComments option changed to start with lower case a, to match …
12:53 Changeset [33504] by kjdon
allowDocumentEditing option changed to start with lower case a, to …
10:23 Ticket #955 (Use of GreenStone/Koha with Multimedia Production Management - e.g. Lumiera) created by kjdon
This was a feature request added to sourceforge greenstone 3 project …