source: gs3-extensions

Change Rev Age Author Log Message
(edit) @34348   4 years davidb Adding in Essential source code to go along with compile scripts
(edit) @34347   4 years davidb Adding in Essential compile scripts
(edit) @34346   4 years davidb Further dir that needs to be installed as a header file area
(edit) @34345   4 years davidb Already done in setup.bash
(edit) @34344   4 years davidb Extended to now setup/install Eigen3
(edit) @34343   4 years davidb Tweak to sourcing file
(edit) @34342   4 years davidb Added block to set GSDLOS
(edit) @34341   4 years davidb Shift to using cascade-make
(edit) @34340   4 years davidb Added in cascade-make as an external property
(edit) @34339   4 years davidb Some initial files to compile up essentia, used in the Mars extension …
(edit) @34166   4 years ak19 Adding Italian language translations of the gs3colcfg module. Many …
(edit) @33997   4 years davidb Top-level folder for MARS related Greenstone3 code
(edit) @33736   4 years kjdon fixed a spelling mistake
(edit) @33635   4 years ak19 Maori-language-detection doesn't use Greenstone 3 at present, it's not …
(edit) @33634   4 years ak19 Rewrote NutchTextDumpProcessor as NutchTextDumpToMongoDB.java, which …
(edit) @33633   4 years ak19 1. TextLanguageDetector now has methods for collecting all sentences …
(edit) @33626   4 years ak19 TODOs
(edit) @33625   4 years ak19 A file listing domains with seedurls containing /mi(/) that are …
(edit) @33624   4 years ak19 Some cleanup surrounding the now renamed function createSeedURLsFile, …
(edit) @33623   4 years ak19 1. Incorporated Dr Nichols' earlier suggestion of storing page modified …
(edit) @33622   4 years ak19 File rename
(edit) @33621   4 years ak19 Committing jotted down mongodb related instructions from what Dr …
(edit) @33620   4 years ak19 Final crawl, done on vagrant VM node6. Crawl site IDs 01407-01462.
(edit) @33618   4 years ak19 Adding in the download URL
(edit) @33617   4 years ak19 Node5 is now full and here is the finished crawl (up to and including …
(edit) @33616   4 years ak19 Beginnings of Java class that is to interact with MongoDB. I don't yet …
(edit) @33615   4 years ak19 1. Worked out how to configure log4j to log both to console and …
(edit) @33609   4 years ak19 The tar files containing the crawled sites data shouldn't be called …
(edit) @33608   4 years ak19 1. New script to export from HBase so that we could in theory reimport …
(edit) @33607   4 years ak19 Updated with the remaining successfully crawled sites on node4 before …
(edit) @33606   4 years ak19 1. Committing crawl data from node3 (2nd VM for nutch crawling). 2. …
(edit) @33605   4 years ak19 Node 4 VM still works, but committing first set of crawled sites on there
(edit) @33604   4 years ak19 1. Better output into possible-product-sites.txt including the …
(edit) @33603   4 years ak19 Incorporating Dr Nichols' suggestion to help weed out product sites: if …
(edit) @33602   4 years ak19 1. The final csv file, mri-sentences.csv, is now written out. 2. Only …
(edit) @33601   4 years ak19 Creates the 2nd csv file, with info about webpages. At present stores …
(edit) @33600   4 years ak19 Work in progress of writing out CSV files. In future, may write the …
(edit) @33599   4 years ak19 First one-third sites crawled. Committing to SVN despite the tarred …
(edit) @33598   4 years ak19 More instructions on setting up Nutch now that I've remembered to …
(edit) @33597   4 years ak19 Committing active version of template file which has a newline at end …
(edit) @33596   4 years ak19 Adding in the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template …
(edit) @33588   5 years ak19 Committing the MRI sentence model that I'm actually using, the one in …
(edit) @33587   5 years ak19 1. Better stats reporting on crawled sites: not just if a page was in …
(edit) @33586   5 years ak19 Refactored MaoriTextDetector.java class into more general …
(edit) @33585   5 years ak19 Much simpler way of using sentence and language detection model to …
(edit) @33584   5 years ak19 Committing experimental version 2 using the sentence detector model, …
(edit) @33583   5 years ak19 Committing experimental version 1 using the sentence detector model, …
(edit) @33582   5 years ak19 NutchTextDumpProcessor prints each crawled site's stats: number of …
(edit) @33581   5 years ak19 Minor fix. Noticed when looking for work I did on MRI sentence detection
(edit) @33580   5 years ak19 Finally fixed the thus-far identified bugs when parsing dump.txt.
(edit) @33579   5 years ak19 Debugging. Solved one problem.
(edit) @33578   5 years ak19 Corrections for compiling the 2 new classes.
(edit) @33577   5 years ak19 Forgot to adjust usage statement to say that silent mode was already …
(edit) @33576   5 years ak19 Introducing 2 new Java files still being written and untested. …
(edit) @33575   5 years ak19 Correcting usage string for CCWETProcessor before committing new java …
(edit) @33574   5 years ak19 If nutch stores a crawled site in more than 1 file, then cat all of …
(edit) @33573   5 years ak19 Forgot to document that spaces were also allowed as separator in the …
(edit) @33572   5 years ak19 Only meant to store the wet.gz versions of these files, not also the …
(edit) @33571   5 years ak19 Adding Dr Bainbridge's suggestion of appending the crawlId of each …
(edit) @33570   5 years ak19 Need to check if UNFINISHED file actually exists before moving it …
(edit) @33569   5 years ak19 1. batchcrawl.sh now does what it should have from the start, which is …
(edit) @33568   5 years ak19 1. More sites greylisted and blacklisted, discovered as I attempted to …
(edit) @33567   5 years ak19 batchcrawl.sh now supports -all flag (and prints usage on 0 args). The …
(edit) @33566   5 years ak19 batchcrawl.sh script now supports taking a comma or space separated …
(edit) @33565   5 years ak19 CCWETProcessor: domain url now goes in as a seedURL after the …
(edit) @33564   5 years ak19 batchcrawl.sh now does the crawl and logs output of the crawl, dumps …
(edit) @33563   5 years ak19 Committing inactive testing batch scripts (only creates the …
(edit) @33562   5 years ak19 1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
(edit) @33561   5 years ak19 1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated …
(edit) @33560   5 years ak19 1. Incorporated Dr Bainbridge's suggested improvements: only when …
(edit) @33559   5 years ak19 1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
(edit) @33558   5 years ak19 Committing cumulative changes since last commit.
(edit) @33557   5 years ak19 Implemented the topSitesMap of topsite domain to url pattern in the …
(edit) @33556   5 years ak19 Blacklisted wikipedia pages that are actually in other languages which …
(edit) @33555   5 years ak19 Modified top sites list as Dr Bainbridge described: suffixes for the …
(edit) @33554   5 years ak19 Added more to blacklist and greylist. And removed remaining duplicates …
(edit) @33553   5 years ak19 Comments
(edit) @33552   5 years ak19 1. Code now processes ccrawldata folder, containing each individual …
(edit) @33551   5 years ak19 Added in top 500 urls from moz.com/top500 and removed duplicates, and …
(edit) @33550   5 years ak19 First stage of introducing sites-too-big-to-exhaustively-crawl.txt: …
(edit) @33549   5 years ak19 All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 …
(edit) @33548   5 years davidb Include new wavesurfer sub-project to install
(edit) @33546   5 years davidb Initial cut at wave-surfer based JS audio player extension for Greenstone
(edit) @33545   5 years ak19 Mainly changes to crawling-Nutch.txt and some minor changes to other …
(edit) @33543   5 years ak19 Filled in some missing instructions
(edit) @33541   5 years ak19 1. hdfs-cc-work/GS_README.txt now contains the complete instructions …
(edit) @33540   5 years ak19 Since I wasn't getting further with nutch 2 to grab an entire site, I …
(edit) @33539   5 years ak19 File rename
(edit) @33538   5 years ak19 Some additions to the setup.sh script to query commoncrawl for MRI …
(edit) @33537   5 years ak19 More nutch and general site mirroring related links
(edit) @33536   5 years ak19 Changes required to the commoncrawl related Vagrant github project to …
(edit) @33535   5 years ak19 1. New setup.sh script for on a hadoop system to setup the git …
(edit) @33534   5 years ak19 Correction: toplevel script has to be placed inside cc-index-table not …
(edit) @33532   5 years ak19 Found the other top 500 sites link again at last which Dr Bainbridge …
(edit) @33531   5 years ak19 Added whitelist for mi.wikipedia.org, and updates to blacklist and …
(edit) @33530   5 years ak19 Completed sentence that was left hanging.
(edit) @33529   5 years ak19 Forgot to add most basic nutch links
(edit) @33528   5 years ak19 Adding in Nutch links
(edit) @33527   5 years ak19 Name change for folder
(edit) @33526   5 years ak19 Moved hadoop related scripts from bin/script into hdfs-instructions