Revision Log: root/gs3-extensions

Rev     Date     Author  Log Message
@33634  6 days   ak19    Rewrote NutchTextDumpProcessor as NutchTextDumpToMongoDB.java, which uses …
@33633  6 days   ak19    1. TextLanguageDetector now has methods for collecting all sentences and …
@33626  9 days   ak19    TODOs
@33625  9 days   ak19    A file listing domains with seedurls containing /mi(/) that are located …
@33624  9 days   ak19    Some cleanup surrounding the now renamed function createSeedURLsFile, now …
@33623  9 days   ak19    1. Incorporated Dr Nichols' earlier suggestion of storing page modified …
@33622  9 days   ak19    File rename
@33621  10 days  ak19    Committing jotted down mongodb related instructions from what Dr Bainbridge …
@33620  10 days  ak19    Final crawl, done on vagrant VM node6. Crawl site IDs 01407-01462.
@33618  13 days  ak19    Adding in the download URL
@33617  13 days  ak19    Node5 is now full and here is the finished crawl (up to and including site …
@33616  2 weeks  ak19    Beginnings of Java class that is to interact with MongoDB. I don't yet …
@33615  2 weeks  ak19    1. Worked out how to configure log4j to log both to console and logfile, …
@33609  2 weeks  ak19    The tar files containing the crawled sites data shouldn't be called tar.gz …
@33608  2 weeks  ak19    1. New script to export from HBase so that we could in theory reimport …
@33607  2 weeks  ak19    Updated with the remaining successfully crawled sites on node4 before …
@33606  2 weeks  ak19    1. Committing crawl data from node3 (2nd VM for nutch crawling). 2. …
@33605  2 weeks  ak19    Node 4 VM still works, but committing first set of crawled sites on there
@33604  3 weeks  ak19    1. Better output into possible-product-sites.txt including the overseas …
@33603  3 weeks  ak19    Incorporating Dr Nichols' suggestion to help weed out product sites: if tld …
@33602  3 weeks  ak19    1. The final csv file, mri-sentences.csv, is now written out. 2. Only …
@33601  3 weeks  ak19    Creates the 2nd csv file, with info about webpages. At present stores …
@33600  3 weeks  ak19    Work in progress of writing out CSV files. In future, may write the same …
@33599  3 weeks  ak19    First one-third sites crawled. Committing to SVN despite the tarred …
@33598  3 weeks  ak19    More instructions on setting up Nutch now that I've remembered to commit …
@33597  3 weeks  ak19    Committing active version of template file which has a newline at end of …
@33596  3 weeks  ak19    Adding in the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template file …
@33588  4 weeks  ak19    Committing the MRI sentence model that I'm actually using, the one in my …
@33587  4 weeks  ak19    1. Better stats reporting on crawled sites: not just if a page was in MRI …
@33586  4 weeks  ak19    Refactored MaoriTextDetector.java class into more general …
@33585  4 weeks  ak19    Much simpler way of using sentence and language detection model to work on …
@33584  4 weeks  ak19    Committing experimental version 2 using the sentence detector model, …
@33583  4 weeks  ak19    Committing experimental version 1 using the sentence detector model, …
@33582  4 weeks  ak19    NutchTextDumpProcessor prints each crawled site's stats: number of …
@33581  4 weeks  ak19    Minor fix. Noticed when looking for work I did on MRI sentence detection
@33580  4 weeks  ak19    Finally fixed the thus-far identified bugs when parsing dump.txt.
@33579  4 weeks  ak19    Debugging. Solved one problem.
@33578  4 weeks  ak19    Corrections for compiling the 2 new classes.
@33577  4 weeks  ak19    Forgot to adjust usage statement to say that silent mode was already …
@33576  4 weeks  ak19    Introducing 2 new Java files still being written and untested. …
@33575  4 weeks  ak19    Correcting usage string for CCWETProcessor before committing new java …
@33574  4 weeks  ak19    If nutch stores a crawled site in more than 1 file, then cat all of them …
@33573  4 weeks  ak19    Forgot to document that spaces were also allowed as separator in the input …
@33572  4 weeks  ak19    Only meant to store the wet.gz versions of these files, not also the …
@33571  4 weeks  ak19    Adding Dr Bainbridge's suggestion of appending the crawlId of each site to …
@33570  4 weeks  ak19    Need to check if UNFINISHED file actually exists before moving it across …
@33569  4 weeks  ak19    1. batchcrawl.sh now does what it should have from the start, which is to …
@33568  4 weeks  ak19    1. More sites greylisted and blacklisted, discovered as I attempted to …
@33567  4 weeks  ak19    batchcrawl.sh now supports -all flag (and prints usage on 0 args). The …
@33566  4 weeks  ak19    batchcrawl.sh script now supports taking a comma or space separated list …
@33565  4 weeks  ak19    CCWETProcessor: domain url now goes in as a seedURL after the individual …
@33564  4 weeks  ak19    batchcrawl.sh now does the crawl and logs output of the crawl, dumps text …
@33563  5 weeks  ak19    Committing inactive testing batch scripts (only creates the …
@33562  5 weeks  ak19    1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a …
@33561  5 weeks  ak19    1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated list. …
@33560  5 weeks  ak19    1. Incorporated Dr Bainbridge's suggested improvements: only when there is …
@33559  5 weeks  ak19    1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge …
@33558  5 weeks  ak19    Committing cumulative changes since last commit.
@33557  5 weeks  ak19    Implemented the topSitesMap of topsite domain to url pattern in the only …
@33556  5 weeks  ak19    Blacklisted wikipedia pages that are actually in other languages which had …
@33555  5 weeks  ak19    Modified top sites list as Dr Bainbridge described: suffixes for the same …
@33554  5 weeks  ak19    Added more to blacklist and greylist. And removed remaining duplicates …
@33553  6 weeks  ak19    Comments
@33552  6 weeks  ak19    1. Code now processes ccrawldata folder, containing each individual common …
@33551  6 weeks  ak19    Added in top 500 urls from moz.com/top500 and removed duplicates, and …
@33550  6 weeks  ak19    First stage of introducing sites-too-big-to-exhaustively-crawl.txt: split …
@33549  6 weeks  ak19    All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 (when …
@33548  6 weeks  davidb  Include new wavesurfer sub-project to install
@33546  6 weeks  davidb  Initial cut at wave-surfer based JS audio player extension for Greenstone
@33545  6 weeks  ak19    Mainly changes to crawling-Nutch.txt and some minor changes to other txt …
@33543  6 weeks  ak19    Filled in some missing instructions
@33541  6 weeks  ak19    1. hdfs-cc-work/GS_README.txt now contains the complete instructions to …
@33540  6 weeks  ak19    Since I wasn't getting further with nutch 2 to grab an entire site, I am …
@33539  6 weeks  ak19    File rename
@33538  6 weeks  ak19    Some additions to the setup.sh script to query commoncrawl for MRI data on …
@33537  6 weeks  ak19    More nutch and general site mirroring related links
@33536  6 weeks  ak19    Changes required to the commoncrawl related Vagrant github project to get …
@33535  6 weeks  ak19    1. New setup.sh script for on a hadoop system to setup the git projects we …
@33534  7 weeks  ak19    Correction: toplevel script has to be placed inside cc-index-table not its …
@33532  7 weeks  ak19    Found the other top 500 sites link again at last which Dr Bainbridge had …
@33531  7 weeks  ak19    Added whitelist for mi.wikipedia.org, and updates to blacklist and …
@33530  7 weeks  ak19    Completed sentence that was left hanging.
@33529  7 weeks  ak19    Forgot to add most basic nutch links
@33528  7 weeks  ak19    Adding in Nutch links
@33527  7 weeks  ak19    Name change for folder
@33526  7 weeks  ak19    Moved hadoop related scripts from bin/script into hdfs-instructions
@33525  7 weeks  ak19    Rename before latest version
@33524  7 weeks  ak19    1. Further adjustments to documenting what we did to get things to run on …
@33523  7 weeks  ak19    Instructional comment
@33522  7 weeks  ak19    Some comments and an improvement
@33519  7 weeks  ak19    Code still writes out the global seedURLs.txt and regex-urlfilter.txt (in …
@33518  7 weeks  ak19    Intermediate commit: got the seed urls file temporarily written out as …
@33517  7 weeks  ak19    1. Blacklists were introduced so that too many instances of camelcased …
@33516  7 weeks  ak19    Before I accidentally lose it, committing the script Dr Bainbridge wrote …
@33515  7 weeks  ak19    Removed an unused function
@33514  7 weeks  ak19    Committing README on starting off with the vagrant VM for hadoop-spark to …
@33513  7 weeks  ak19    Higher level script that runs against each named crawl since Sep 2018 …
@33503  7 weeks  ak19    More efficient blacklisting/greylisting/whitelisting now by reading in the …
@33502  7 weeks  ak19    Current url pattern blacklist and greylist filter files. Used by …
@33501  7 weeks  ak19    Refactored code into 2 classes: The existing WETProcessor, which processes …