|
|
@34005
|
4 years |
ak19 |
InfoOnEmptyPagesNotInMongoDB.txt is now written out to a file, instead …
|
|
|
@34000
|
4 years |
ak19 |
Some debugging and other minor changes
|
|
|
@33988
|
4 years |
ak19 |
1. Print out which web pages of which web site's dump.txt were empty. …
|
|
|
@33984
|
4 years |
ak19 |
Simple class to summarise some basic counts of the input common crawl data
|
|
|
@33983
|
4 years |
ak19 |
More sensible name for method which had too long kept its old name …
|
|
|
@33982
|
4 years |
ak19 |
SummaryTool.java now processes the handcrafted UNIQUE domains counts …
|
|
|
@33981
|
4 years |
ak19 |
As Dr Bainbridge suggested, code now opens a new firefox tab with a …
|
|
|
@33978
|
4 years |
ak19 |
Opens all geoJSON maps in new tabs instead of waiting for user to have …
|
|
|
@33965
|
4 years |
ak19 |
1. Adding a basicDomain column (stripped of http/https and www prefix) …
|
|
|
@33963
|
4 years |
ak19 |
Added a new helper method to MongoDBQueryer.java to add numPagesInMRI …
|
|
|
@33961
|
4 years |
ak19 |
New category, LINK_TEXT, introduced for the random web page URL samples.
|
|
|
@33959
|
4 years |
ak19 |
URIEncoding the mapData makes it unparseable by geojson.io
|
|
|
@33952
|
4 years |
ak19 |
Minor changes for processing
|
|
|
@33948
|
4 years |
ak19 |
Reviewed the random sampled web page URLs marked as …
|
|
|
@33946
|
4 years |
ak19 |
1. New function to handle user input assigning the newly introduced …
|
|
|
@33941
|
4 years |
ak19 |
1. Uppercase 3rd field (Y/N/? field) read back in from file before …
|
|
|
@33940
|
4 years |
ak19 |
1. In order to make it easier to do the manual work of inspecting 260 …
|
|
|
@33938
|
4 years |
ak19 |
1. Don't regenerate random sample of web page urls and full web page …
|
|
|
@33926
|
4 years |
ak19 |
Investigated some other options for screen capturing and Google chrome …
|
|
|
@33925
|
4 years |
ak19 |
1. Bugfix: oversight, should return uri encoded URL for mapData, …
|
|
|
@33924
|
4 years |
ak19 |
Adding in Dr Bainbridge's command to check the JSON generated is …
|
|
|
@33919
|
4 years |
ak19 |
SummaryTool now uses the CountryCodeCountsMapData.java class to …
|
|
|
@33917
|
4 years |
ak19 |
Added some better reporting when confirming sample size was correct
|
|
|
@33913
|
4 years |
ak19 |
1. Adjusted table mongodb query statements to be more exact, but same …
|
|
|
@33912
|
4 years |
ak19 |
Forgot to svn add the new MongoDBQueryer.java class with commit 33909. …
|
|
|
@33911
|
4 years |
ak19 |
Correct commit message for previous and current commit: 1. After …
|
|
|
@33910
|
4 years |
ak19 |
1. Implementing tables 3 to 5. 2. Rolled back the introduction of the …
|
|
|
@33909
|
4 years |
ak19 |
1. Implementing tables 3 to 5. 2. Rolled back the introduction of the …
|
|
|
@33906
|
4 years |
ak19 |
Code is intermediate state. 1. Introduced basicDomain field to MongoDB …
|
|
|
@33887
|
4 years |
ak19 |
1. Added support for writing out tables in csv format too. 2. Second …
|
|
|
@33885
|
4 years |
ak19 |
Attempting to write the tables. csv not yet supported. Table 1 done.
|
|
|
@33884
|
4 years |
ak19 |
0. Previous commit had lots of modifications, and only 2 files matched …
|
|
|
@33883
|
4 years |
ak19 |
Clarifications
|
|
|
@33882
|
4 years |
ak19 |
Code now writes both a listing of all non-autotranslated websites and …
|
|
|
@33881
|
4 years |
ak19 |
Uses lambda expression to process each doc in a mongodb aggregate …
|
|
|
@33880
|
4 years |
ak19 |
Write out the 5counts_tentativeNonAutotranslatedSites.json file with …
|
|
|
@33879
|
4 years |
ak19 |
Have the 2 mongodb aggregate() calls working that
|
|
|
@33876
|
4 years |
ak19 |
Some missteps, but have got complex collection.aggregate() working at last.
|
|
|
@33873
|
4 years |
ak19 |
Beginnings of WebPageURLsListing program whose purpose Dr Bainbridge …
|
|
|
@33871
|
4 years |
ak19 |
Removed mostly duplicated older version of method but left the …
|
|
|
@33870
|
4 years |
ak19 |
Got the mongodb query working in Java in 2 different ways: the fully …
|
|
|
@33869
|
4 years |
ak19 |
First cut at the RandomURLsForDomainGenerator.java class and the …
|
|
|
@33867
|
4 years |
ak19 |
Moved the code handling of special case large rectangles and those …
|
|
|
@33858
|
4 years |
ak19 |
Fixes to the code committed yesterday: correct calculation of the …
|
|
|
@33853
|
4 years |
ak19 |
Handling map coordinates that are horizontally excessive (beyond …
|
|
|
@33812
|
4 years |
ak19 |
Better handling of multi-line comment symbols, so I can now include …
|
|
|
@33811
|
4 years |
ak19 |
Returning to using a single variable, urlContainsLangCodeInPath, to …
|
|
|
@33810
|
4 years |
ak19 |
Bugfix: mi in url path should be checked for on each page of site, …
|
|
|
@33808
|
4 years |
ak19 |
Storing not just whether /mi(/) suffix is in path, but also whether …
|
|
|
@33805
|
4 years |
ak19 |
1. Moving the static countrycodes.json file to conf folder and updated …
|
|
|
@33801
|
4 years |
ak19 |
1. NutchTextDumpToMongoDB Added an extra field to each document in …
|
|
|
@33800
|
4 years |
ak19 |
Removed an adult site from crawled contents and added its url to …
|
|
|
@33799
|
4 years |
ak19 |
1. Adding breadcrumb for next step at end of running …
|
|
|
@33796
|
4 years |
ak19 |
Instead of a hack for US' count being so great that its histogram …
|
|
|
@33794
|
4 years |
ak19 |
Wrote the geojson map data created from the site counts per …
|
|
|
@33790
|
4 years |
ak19 |
Got the MultiPoint geojson mapdata of the country code counts working: …
|
|
|
@33778
|
4 years |
ak19 |
Made a beginning on getting the geojson map data automated. Couldn't …
|
|
|
@33698
|
4 years |
ak19 |
Links to more reading
|
|
|
@33674
|
4 years |
ak19 |
Changes to support the top 5 predicted langcodes and their confidence …
|
|
|
@33666
|
4 years |
ak19 |
Having finished sending all the crawl data to mongodb 1. Recrawled the …
|
|
|
@33657
|
4 years |
ak19 |
Some fixes after brief testing against 1/3 of the crawl. Restarted …
|
|
|
@33656
|
4 years |
ak19 |
Final minor changes before I start processing the crawls of node2.
|
|
|
@33655
|
4 years |
ak19 |
Minor change to print statement
|
|
|
@33653
|
4 years |
ak19 |
1. As suggested by Dr Bainbridge, made the code changes to use Morphia …
|
|
|
@33652
|
4 years |
ak19 |
Introducing morphia subpackage
|
|
|
@33651
|
4 years |
ak19 |
1. Bugfix: overlappingSentences works. 2. storing numSentencesInMaor
|
|
|
@33645
|
4 years |
ak19 |
Fix to 2 bugs when sending data to MongoDB: 1. overlappingSentences …
|
|
|
@33635
|
4 years |
ak19 |
Maori-language-detection doesn't use Greenstone 3 at present, it's not …
|
|
copied from gs3-extensions/maori-lang-detection/src
|
|
|
@33634
|
4 years |
ak19 |
Rewrote NutchTextDumpProcessor as NutchTextDumpToMongoDB.java, which …
|