Diff Rev Age Author Log Message
(edit) @33645   5 years ak19 Fix to 2 bugs when sending data to MongoDB: 1. overlappingSentences …
(edit) @33644   5 years ak19 Just committing the growing mongodb.txt file with links and …
(edit) @33643   5 years ak19 Brought the template log4j.properties.in back up to speed. I forgot it …
(edit) @33642   5 years ak19 Forgot to commit the java driver for mongodb when I committed the Java …
(edit) @33641   5 years kjdon commented out some debug statements
(edit) @33640   5 years kjdon oops, I must have 'tidied' up the file and then not compiled it to …
(edit) @33639   5 years kjdon need to select child nodes, otherwise the gsf:default node ends up in …
(edit) @33638   5 years kjdon gslib doesn't use xml-to-string.xsl. It's only used by formatmanager, …
(edit) @33637   5 years kjdon we can now use gsf and gslib in layout files.
(edit) @33636   5 years kjdon include means the stylesheet gets added inline, import means it gets …
(edit) @33635   5 years ak19 Maori-language-detection doesn't use Greenstone 3 at present, it's not …
(edit) @33634   5 years ak19 Rewrote NutchTextDumpProcessor as NutchTextDumpToMongoDB.java, which …
(edit) @33633   5 years ak19 1. TextLanguageDetector now has methods for collecting all sentences …
(edit) @33632   5 years kjdon overhaul of TransformingReceptionist. changed the order of inlining …
(edit) @33631   5 years kjdon added a bit more error reporting
(edit) @33630   5 years kjdon minor comment changes
(edit) @33629   5 years kjdon added methods using Parameter2 - for params with text node values
(edit) @33628   5 years kjdon not sure why documentNode was a gsf:template here. Can't be like that …
(edit) @33627   5 years kjdon removed unnecessary comments
(edit) @33626   5 years ak19 TODOs
(edit) @33625   5 years ak19 A file listing domains with seedurls containing /mi(/) that are …
(edit) @33624   5 years ak19 Some cleanup surrounding the now renamed function createSeedURLsFile, …
(edit) @33623   5 years ak19 1. Incorporated Dr Nichols' earlier suggestion of storing page modified …
(edit) @33622   5 years ak19 File rename
(edit) @33621   5 years ak19 Committing jotted down mongodb related instructions from what Dr …
(edit) @33620   5 years ak19 Final crawl, done on vagrant VM node6. Crawl site IDs 01407-01462.
(edit) @33619   5 years kjdon need to handle the case where a collection file (eg image) gets …
(edit) @33618   5 years ak19 Adding in the download URL
(edit) @33617   5 years ak19 Node5 is now full and here is the finished crawl (up to and including …
(edit) @33616   5 years ak19 Beginnings of Java class that is to interact with MongoDB. I don't yet …
(edit) @33615   5 years ak19 1. Worked out how to configure log4j to log both to console and …
(edit) @33614   5 years kjdon added a new line
(edit) @33613   5 years kjdon added allowdocumentediting and allowmapgpsediting options, plus also …
(edit) @33612   5 years kjdon work to do with params. add in default values to params if they are …
(edit) @33611   5 years kjdon added global setting to params - these are for params that are valid …
(edit) @33610   5 years kjdon USER_SESSION_CACHE_ATT moved to GSParams, as it is stored in session …
(edit) @33609   5 years ak19 The tar files containing the crawled sites data shouldn't be called …
(edit) @33608   5 years ak19 1. New script to export from HBase so that we could in theory reimport …
(edit) @33607   5 years ak19 Updated with the remaining successfully crawled sites on node4 before …
(edit) @33606   5 years ak19 1. Committing crawl data from node3 (2nd VM for nutch crawling). 2. …
(edit) @33605   5 years ak19 Node 4 VM still works, but committing first set of crawled sites on there
(edit) @33604   5 years ak19 1. Better output into possible-product-sites.txt including the …
(edit) @33603   5 years ak19 Incorporating Dr Nichols' suggestion to help weed out product sites: if …
(edit) @33602   5 years ak19 1. The final csv file, mri-sentences.csv, is now written out. 2. Only …
(edit) @33601   5 years ak19 Creates the 2nd csv file, with info about webpages. At present stores …
(edit) @33600   5 years ak19 Work in progress of writing out CSV files. In future, may write the …
(edit) @33599   5 years ak19 First one-third sites crawled. Committing to SVN despite the tarred …
(edit) @33598   5 years ak19 More instructions on setting up Nutch now that I've remembered to …
(edit) @33597   5 years ak19 Committing active version of template file which has a newline at end …
(edit) @33596   5 years ak19 Adding in the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template …
(edit) @33595   5 years kjdon new displayBaskets template - to avoid replicating code in query and …
(edit) @33594   5 years kjdon call gslib:displayBasket instead of replicating the code here
(edit) @33593   5 years kjdon the test for facets should be facetList/facet/count, as the facets get …
(edit) @33592   5 years kjdon reindented the file
(edit) @33591   5 years kjdon added in some strings for 'this collection contains x documents and …
(edit) @33590   5 years kjdon added 'this collection contains X documents and was last built Y days …
(edit) @33589   5 years cpb16 final01. Need Map results still
(edit) @33588   5 years ak19 Committing the MRI sentence model that I'm actually using, the one in …
(edit) @33587   5 years ak19 1. Better stats reporting on crawled sites: not just if a page was in …
(edit) @33586   5 years ak19 Refactored MaoriTextDetector.java class into more general …
(edit) @33585   5 years ak19 Much simpler way of using sentence and language detection model to …
(edit) @33584   5 years ak19 Committing experimental version 2 using the sentence detector model, …
(edit) @33583   5 years ak19 Committing experimental version 1 using the sentence detector model, …
(edit) @33582   5 years ak19 NutchTextDumpProcessor prints each crawled site's stats: number of …
(edit) @33581   5 years ak19 Minor fix. Noticed when looking for work I did on MRI sentence detection
(edit) @33580   5 years ak19 Finally fixed the thus-far identified bugs when parsing dump.txt.
(edit) @33579   5 years ak19 Debugging. Solved one problem.
(edit) @33578   5 years ak19 Corrections for compiling the 2 new classes.
(edit) @33577   5 years ak19 Forgot to adjust usage statement to say that silent mode was already …
(edit) @33576   5 years ak19 Introducing 2 new Java files still being written and untested. …
(edit) @33575   5 years ak19 Correcting usage string for CCWETProcessor before committing new java …
(edit) @33574   5 years ak19 If nutch stores a crawled site in more than 1 file, then cat all of …
(edit) @33573   5 years ak19 Forgot to document that spaces were also allowed as separator in the …
(edit) @33572   5 years ak19 Only meant to store the wet.gz versions of these files, not also the …
(edit) @33571   5 years ak19 Adding Dr Bainbridge's suggestion of appending the crawlId of each …
(edit) @33570   5 years ak19 Need to check if UNFINISHED file actually exists before moving it …
(edit) @33569   5 years ak19 1. batchcrawl.sh now does what it should have from the start, which is …
(edit) @33568   5 years ak19 1. More sites greylisted and blacklisted, discovered as I attempted to …
(edit) @33567   5 years ak19 batchcrawl.sh now supports -all flag (and prints usage on 0 args). The …
(edit) @33566   5 years ak19 batchcrawl.sh script now supports taking a comma or space separated …
(edit) @33565   5 years ak19 CCWETProcessor: domain url now goes in as a seedURL after the …
(edit) @33564   5 years ak19 batchcrawl.sh now does the crawl and logs output of the crawl, dumps …
(edit) @33563   5 years ak19 Committing inactive testing batch scripts (only creates the …
(edit) @33562   5 years ak19 1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
(edit) @33561   5 years ak19 1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
(edit) @33560   5 years ak19 1. Incorporated Dr Bainbridge's suggested improvements: only when …
(edit) @33559   5 years ak19 1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
(edit) @33558   5 years ak19 Committing cumulative changes since last commit.
(edit) @33557   5 years ak19 Implemented the topSitesMap of topsite domain to url pattern in the …
(edit) @33556   5 years ak19 Blacklisted wikipedia pages that are actually in other languages which …
(edit) @33555   5 years ak19 Modified top sites list as Dr Bainbridge described: suffixes for the …
(edit) @33554   5 years ak19 Added more to blacklist and greylist. And removed remaining duplicates …
(edit) @33553   5 years ak19 Comments
(edit) @33552   5 years ak19 1. Code now processes ccrawldata folder, containing each individual …
(edit) @33551   5 years ak19 Added in top 500 urls from moz.com/top500 and removed duplicates, and …
(edit) @33550   5 years ak19 First stage of introducing sites-too-big-to-exhaustively-crawl.txt: …
(edit) @33549   5 years ak19 All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 …
(edit) @33548   5 years davidb Include new wavesurfer sub-project to install
(edit) @33547   5 years davidb Initial cut at wavesurfer JS audio player version of AMC music content …
(edit) @33546   5 years davidb Initial cut at wave-surfer based JS audio player extension for Greenstone