source: other-projects/maori-lang-detection/src

Diff Rev Age Author Log Message
(edit) @34005   4 years ak19 InfoOnEmptyPagesNotInMongoDB.txt is now written out to a file, instead …
(edit) @34000   4 years ak19 Some debugging and other minor changes
(edit) @33988   4 years ak19 1. Print out which web pages of which web site's dump.txt were empty. …
(edit) @33984   4 years ak19 Simple class to summarise some basic counts of the input common crawl data
(edit) @33983   4 years ak19 More sensible name for method which had too long kept its old name …
(edit) @33982   4 years ak19 SummaryTool.java now processes the handcrafted UNIQUE domains counts …
(edit) @33981   4 years ak19 As Dr Bainbridge suggested, code now opens a new firefox tab with a …
(edit) @33978   4 years ak19 Opens all geoJSON maps in new tabs instead of waiting for user to have …
(edit) @33965   4 years ak19 1. Adding a basicDomain column (stripped of http/https and www prefix) …
(edit) @33963   4 years ak19 Added a new helper method to MongoDBQueryer.java to add numPagesInMRI …
(edit) @33961   4 years ak19 New category, LINK_TEXT, introduced for the random web page URL samples.
(edit) @33959   4 years ak19 URIEncoding the mapData makes it unparseable by geojson.io
(edit) @33952   4 years ak19 Minor changes for processing
(edit) @33948   4 years ak19 Reviewed the random sampled web page URLs marked as …
(edit) @33946   4 years ak19 1. New function to handle user input assigning the newly introduced …
(edit) @33941   4 years ak19 1. Uppercase 3rd field (Y/N/? field) read back in from file before …
(edit) @33940   4 years ak19 1. In order to make it easier to do the manual work of inspecting 260 …
(edit) @33938   4 years ak19 1. Don't regenerate random sample of web page urls and full web page …
(edit) @33926   4 years ak19 Investigated some other options for screen capturing and Google chrome …
(edit) @33925   4 years ak19 1. Bugfix: oversight, should return uri encoded URL for mapData, …
(edit) @33924   4 years ak19 Adding in Dr Bainbridge's command to check the JSON generated is …
(edit) @33919   4 years ak19 SummaryTool now uses the CountryCodeCountsMapData.java class to …
(edit) @33917   4 years ak19 Added some better reporting when confirming sample size was correct
(edit) @33913   4 years ak19 1. Adjusted table mongodb query statements to be more exact, but same …
(edit) @33912   4 years ak19 Forgot to svn add the new MongoDBQueryer.java class with commit 33909. …
(edit) @33911   4 years ak19 Correct commit message for previous and current commit: 1. After …
(edit) @33910   4 years ak19 1. Implementing tables 3 to 5. 2. Rolled back the introduction of the …
(edit) @33909   4 years ak19 1. Implementing tables 3 to 5. 2. Rolled back the introduction of the …
(edit) @33906   4 years ak19 Code is intermediate state. 1. Introduced basicDomain field to MongoDB …
(edit) @33887   4 years ak19 1. Added support for writing out tables in csv format too. 2. Second …
(edit) @33885   4 years ak19 Attempting to write the tables. csv not yet supported. Table 1 done.
(edit) @33884   4 years ak19 0. Previous commit had lots of modifications, and only 2 files matched …
(edit) @33883   4 years ak19 Clarifications
(edit) @33882   4 years ak19 Code now writes both a listing of all non-autotranslated websites and …
(edit) @33881   4 years ak19 Uses lambda expression to process each doc in a mongodb aggregate …
(edit) @33880   4 years ak19 Write out the 5counts_tentativeNonAutotranslatedSites.json file with …
(edit) @33879   4 years ak19 Have the 2 mongodb aggregate() calls working that
(edit) @33876   4 years ak19 Some missteps, but have got complex collection.aggregate() working at last.
(edit) @33873   4 years ak19 Beginnings of WebPageURLsListing program whose purpose Dr Bainbridge …
(edit) @33871   4 years ak19 Removed mostly duplicated older version of method but left the …
(edit) @33870   4 years ak19 Got the mongodb query working in Java in 2 different ways: the fully …
(edit) @33869   4 years ak19 First cut at the RandomURLsForDomainGenerator.java class and the …
(edit) @33867   4 years ak19 Moved the code handling of special case large rectangles and those …
(edit) @33858   4 years ak19 Fixes to the code committed yesterday: correct calculation of the …
(edit) @33853   4 years ak19 Handling map coordinates that are horizontally excessive (beyond …
(edit) @33812   4 years ak19 Better handling of multi-line comment symbols, so I can now include …
(edit) @33811   4 years ak19 Returning to using a single variable, urlContainsLangCodeInPath, to …
(edit) @33810   4 years ak19 Bugfix: mi in url path should be checked for for each page of site, …
(edit) @33808   4 years ak19 Storing not just whether /mi(/) suffix is in path, but also whether …
(edit) @33805   4 years ak19 1. Moving the static countrycodes.json file to conf folder and updated …
(edit) @33801   4 years ak19 1. NutchTextDumpToMongoDB Added an extra field to each document in …
(edit) @33800   4 years ak19 Removed an adult site from crawled contents and added its url to …
(edit) @33799   4 years ak19 1. Adding breadcrumb for next step at end of running …
(edit) @33796   4 years ak19 Instead of a hack for US' count being too great that its histogram …
(edit) @33794   4 years ak19 Wrote the geojson map data created from the site counts per …
(edit) @33790   4 years ak19 Got the MultiPoint geojson mapdata of the country code counts working: …
(edit) @33778   4 years ak19 Made a beginning on getting the geojson map data automated. Couldn't …
(edit) @33698   4 years ak19 Links to more reading
(edit) @33674   4 years ak19 Changes to support the top 5 predicted langcodes and their confidence …
(edit) @33666   4 years ak19 Having finished sending all the crawl data to mongodb 1. Recrawled the …
(edit) @33657   4 years ak19 Some fixes after brief testing against 1/3 of the crawl. Restarted …
(edit) @33656   4 years ak19 Final minor changes before I start processing the crawls of node2.
(edit) @33655   4 years ak19 Minor change to print statement
(edit) @33653   4 years ak19 1. As suggested by Dr Bainbridge, made the code changes to use Morphia …
(edit) @33652   4 years ak19 Introducing morphia subpackage
(edit) @33651   4 years ak19 1. Bugfix: overlappingSentences works. 2. storing numSentencesInMaor
(edit) @33645   4 years ak19 Fix to 2 bugs when sending data to MongoDB: 1. overlappingSentences …
(copy) @33635   4 years ak19 Maori-language-detection doesn't use Greenstone 3 at present, it's not …
copied from gs3-extensions/maori-lang-detection/src
(edit) @33634   4 years ak19 Rewrote NutchTextDumpProcessor as NutchTextDumpToMongoDB.java, which …