Diff Rev Age Author Log Message
(edit) @33683   5 years davidb Updated to process latest version of spreadsheet
(edit) @33682   5 years davidb Changes made around the time of the launch
(edit) @33681   5 years davidb Added in flock technique to avoid multiple people running the same script
(edit) @33680   5 years davidb Greenstone3 is fixed, so don't need to print out message about running …
(edit) @33679   5 years davidb Folder for working on updates (PDFs to del, PDFs to add) from Kiri
(edit) @33678   5 years davidb setup for greenstone ext
(edit) @33677   5 years davidb Intro text
(edit) @33676   5 years davidb Some initial work getting a plugin going that calls Alex's VirusTotal
(edit) @33675   5 years ak19 Committing the newer query results (but from before today's …
(edit) @33674   5 years ak19 Changes to support the top 5 predicted langcodes and their confidence …
(edit) @33673   5 years ak19 Waikato Education Department's Science Activities and Maths Activities …
(edit) @33672   5 years kjdon modified slightly so that the error messages come from the dictionary …
(edit) @33671   5 years kjdon added a static getTextString method - currently this is in Action.java …
(edit) @33670   5 years kjdon added editEnabled att string
(edit) @33669   5 years kjdon removed an annoying debug message
(edit) @33668   5 years kjdon a few changes to debuginfo texts
(edit) @33667   5 years kjdon preProcess.xsl renamed to expand-gslib.xsl to better indicate what it does
(edit) @33666   5 years ak19 Having finished sending all the crawl data to mongodb 1. Recrawled the …
(edit) @33665   5 years davidb Fixed jar name
(edit) @33664   5 years davidb Initial version code for running VirusTotal API against files, CLI scripts
(edit) @33663   5 years davidb Changes after testing the scripts
(edit) @33662   5 years davidb Scripts to compile and run java code
(edit) @33661   5 years davidb Compiling needs to use Maven
(edit) @33660   5 years davidb For Java source code
(edit) @33659   5 years davidb Top-level folder for new extension based on VirusTotal API which scans …
(edit) @33658   5 years davidb Top-level folder for new extension based on VirusTotal API which scans …
(edit) @33657   5 years ak19 Some fixes after brief testing against 1/3 of the crawl. Restarted …
(edit) @33656   5 years ak19 Final minor changes before I start processing the crawls of node2.
(edit) @33655   5 years ak19 Minor change to print statement
(edit) @33654   5 years ak19 Removing jar file that wasn't used after all.
(edit) @33653   5 years ak19 1. As suggested by Dr Bainbridge, made the code changes to use Morphia …
(edit) @33652   5 years ak19 Introducing morphia subpackage
(edit) @33651   5 years ak19 1. Bugfix: overlappingSentences works. 2. storing numSentencesInMaor
(edit) @33650   5 years kjdon updated to match the new xsl file names; lots of variable renames to …
(edit) @33649   5 years kjdon renamed config_format and text_fragment_format to better represent …
(edit) @33648   5 years kjdon changed the debuginfo xsl and strings to match the new o=xxx debug options
(edit) @33647   5 years kjdon added/changed a few of the output values for debugging the transform
(edit) @33646   5 years ak19 Saving the mongodb queries and learning links that Dr Bainbridge found …
(edit) @33645   5 years ak19 Fix to 2 bugs when sending data to MongoDB: 1. overlappingSentences …
(edit) @33644   5 years ak19 Just committing the growing mongodb.txt file with links and …
(edit) @33643   5 years ak19 Brought the template log4j.properties.in back up to speed. I forgot it …
(edit) @33642   5 years ak19 Forgot to commit the java driver for mongodb when I committed the Java …
(edit) @33641   5 years kjdon commented out some debug statements
(edit) @33640   5 years kjdon oops, I must have 'tidied' up the file and then not compiled it to …
(edit) @33639   5 years kjdon need to select child nodes, otherwise the gsf:default node ends up in …
(edit) @33638   5 years kjdon gslib doesn't use xml-to-string.xsl. it's only used by formatmanager, …
(edit) @33637   5 years kjdon we can now use gsf and gslib in layout files.
(edit) @33636   5 years kjdon include means the stylesheet gets added inline, import means it gets …
(edit) @33635   5 years ak19 Maori-language-detection doesn't use Greenstone 3 at present, it's not …
(edit) @33634   5 years ak19 Rewrote NutchTextDumpProcessor as NutchTextDumpToMongoDB.java, which …
(edit) @33633   5 years ak19 1. TextLanguageDetector now has methods for collecting all sentences …
(edit) @33632   5 years kjdon overhaul of TransformingReceptionist. changed the order of inlining …
(edit) @33631   5 years kjdon added a bit more error reporting
(edit) @33630   5 years kjdon minor comment changes
(edit) @33629   5 years kjdon added methods using Parameter2 - for params with text node values
(edit) @33628   5 years kjdon not sure why documentNode was a gsf:template here. Can't be like that …
(edit) @33627   5 years kjdon removed unnecessary comments
(edit) @33626   5 years ak19 TODOs
(edit) @33625   5 years ak19 A file listing domains with seedurls containing /mi(/) that are …
(edit) @33624   5 years ak19 Some cleanup surrounding the now renamed function createSeedURLsFile, …
(edit) @33623   5 years ak19 1. Incorporated Dr Nichols' earlier suggestion of storing page modified …
(edit) @33622   5 years ak19 File rename
(edit) @33621   5 years ak19 Committing jotted down mongodb related instructions from what Dr …
(edit) @33620   5 years ak19 Final crawl, done on vagrant VM node6. Crawl site IDs 01407-01462.
(edit) @33619   5 years kjdon need to handle the case where a collection file (eg image) gets …
(edit) @33618   5 years ak19 Adding in the download URL
(edit) @33617   5 years ak19 Node5 is now full and here is the finished crawl (up to and including …
(edit) @33616   5 years ak19 Beginnings of Java class that is to interact with MongoDB. I don't yet …
(edit) @33615   5 years ak19 1. Worked out how to configure log4j to log both to console and …
(edit) @33614   5 years kjdon added a new line
(edit) @33613   5 years kjdon added allowdocumentediting and allowmapgpsediting options, plus also …
(edit) @33612   5 years kjdon work to do with params. add in default values to params if they are …
(edit) @33611   5 years kjdon added global setting to params - these are for params that are valid …
(edit) @33610   5 years kjdon USER_SESSION_CACHE_ATT moved to GSParams, as it is stored in session …
(edit) @33609   5 years ak19 The tar files containing the crawled sites data shouldn't be called …
(edit) @33608   5 years ak19 1. New script to export from HBase so that we could in theory reimport …
(edit) @33607   5 years ak19 Updated with the remaining successfully crawled sites on node4 before …
(edit) @33606   5 years ak19 1. Committing crawl data from node3 (2nd VM for nutch crawling). 2. …
(edit) @33605   5 years ak19 Node 4 VM still works, but committing first set of crawled sites on there
(edit) @33604   5 years ak19 1. Better output into possible-product-sites.txt including the …
(edit) @33603   5 years ak19 Incorporating Dr Nichols' suggestion to help weed out product sites: if …
(edit) @33602   5 years ak19 1. The final csv file, mri-sentences.csv, is now written out. 2. Only …
(edit) @33601   5 years ak19 Creates the 2nd csv file, with info about webpages. At present stores …
(edit) @33600   5 years ak19 Work in progress of writing out CSV files. In future, may write the …
(edit) @33599   5 years ak19 First one-third sites crawled. Committing to SVN despite the tarred …
(edit) @33598   5 years ak19 More instructions on setting up Nutch now that I've remembered to …
(edit) @33597   5 years ak19 Committing active version of template file which has a newline at end …
(edit) @33596   5 years ak19 Adding in the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template …
(edit) @33595   5 years kjdon new displayBaskets template - to avoid replicating code in query and …
(edit) @33594   5 years kjdon call gslib:displayBasket instead of replicating the code here
(edit) @33593   5 years kjdon the test for facets should be facetList/facet/count, as the facets get …
(edit) @33592   5 years kjdon reindented the file
(edit) @33591   5 years kjdon added in some strings for 'this collection contains x documents and …
(edit) @33590   5 years kjdon added 'this collection contains X documents and was last built Y days …
(edit) @33589   5 years cpb16 final01. Need Map results still
(edit) @33588   5 years ak19 Committing the MRI sentence model that I'm actually using, the one in …
(edit) @33587   5 years ak19 1. Better stats reporting on crawled sites: not just if a page was in …
(edit) @33586   5 years ak19 Refactored MaoriTextDetector.java class into more general …
(edit) @33585   5 years ak19 Much simpler way of using sentence and language detection model to …
(edit) @33584   5 years ak19 Committing experimental version 2 using the sentence detector model, …