Changeset 33806 for other-projects
- Timestamp: 2019-12-13T21:31:11+13:00
- Location: other-projects/maori-lang-detection
- Files: 4 added, 1 edited
other-projects/maori-lang-detection/MoreReading/mongodb.txt
r33804 r33806

(Related work for other languages to quantifiably answer that)

data-preparation
docs

------------------------------------------

BUILDING TOWARDS NEW MONGODB QUERY: Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
---

# https://stackoverflow.com/questions/16902930/mongodb-aggregation-framework-match-or
# https://docs.mongodb.com/manual/reference/operator/query/and/

# 1. all the websites which are from NZ:
db.getCollection('Websites').find({geoLocationCountryCode: "NZ"}).count()
128

# 2. all the websites that have /mi in URL path which are from NZ:
db.getCollection('Websites').find({$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}).count()
6

# 3. all the websites that don't have /mi in URL path
db.getCollection('Websites').find({urlContainsLangCodeInPath: false}).count()
1292

# 4. all the websites that don't have /mi, or if they do are from NZ
# (should be the sum of points 2 and 3 above: 6 + 1292 = 1298)
db.getCollection('Websites').find({$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}).count()
1298

# 5. All the websites that have at least 1 page detected as MRI AND either don't have /mi in URL path or if they do are from NZ
# These are the TENTATIVE NON-PRODUCT SITES
# Should be less than point 4 (it is also more than points 1 and 2, though not point 3)
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}).count()
859

# 6. Now do the counts by country code of the above, by pasting the query of point 5 as the $match clause (i.e. without the .count() suffix)
# Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
db.Websites.aggregate([
    {
        $match: {$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 }
        }
    },
    { $sort : { count : -1} }
]);

The result is very close to the same aggregate on just numPagesContainingMRI.

That's because the websites that both contain /mi/ in the URL path AND have numPagesContainingMRI > 0 are very few:

db.Websites.aggregate([
    {
        $match: {
            $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 }
        }
    },
    { $sort : { count : -1} }
]);


_id    count
us     4.0
nz     4.0
au     3.0
ru     1.0
de     1.0

Total: 13 sites that have /mi/ and are detected as having MRI content:
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
13

Of these 13, the 4 from NZ were already included in steps 5 and 6. So the difference is only 9 sites (13 - 4) with /mi/ in the URL path.


Let's get a listing of the sites' domains - 3 whose country codes are NOT NZ have NZ TLD!

/* 1 */
{
    "_id" : "nz",
    "count" : 4.0,
    "domain" : [
        "http://firstworldwar.tki.org.nz",
        "http://www.firstworldwar.tki.org.nz",
        "https://admin.teara.govt.nz",
        "http://community.nzdl.org"
    ]
}

/* 2 */
{
    "_id" : "us",
    "count" : 4.0,
    "domain" : [
        "https://sexualviolence.victimsinfo.govt.nz",
        "https://follow3rs.com",
        "http://www.church-of-christ.org",
        "http://www.mytrickstips.com"
    ]
}

/* 3 */
{
    "_id" : "au",
    "count" : 3.0,
    "domain" : [
        "https://rapuatearatika.education.govt.nz",
        "https://www.kiwiproperty.com",
        "https://curriculumtool.education.govt.nz"
    ]
}

/* 4 */
{
    "_id" : "ru",
    "count" : 1.0,
    "domain" : [
        "http://www.treningmozga.com"
    ]
}

/* 5 */
{
    "_id" : "de",
    "count" : 1.0,
    "domain" : [
        "http://www.almancax.com" # Website to learn German, autotranslated
    ]
}


But we're not catching a potentially large number of auto-translated sites, like
- https://www.gigalight.com/all-languages.html
- http://www.hzhinew.com/


--------------
GETTING TABLE DATA OUT OF MONGODB:

https://stackoverflow.com/questions/28733692/how-to-export-json-from-mongodb-using-robomongo
"export to file" as in a spreadsheet like to a .csv?

IMO this is the EASIEST way to do this in Robo 3T (formerly robomongo):

1. In the top right of the Robo 3T GUI there is a "View Results in text mode" button, click it and copy everything

2. paste everything into this website: https://json-csv.com/

3. click the download button and now you have it in a spreadsheet.


---------------------

/* 1 */
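
The notes above show the per-country domain listing but not the query that produced it, and the CSV export goes through a third-party website. The sketch below is one possible way to cover both, not the query actually used: it wraps the two $match filters from the notes in a reusable pipeline builder, optionally collects domains with $push, and flattens the country counts to CSV text. The field name `domain` is an assumption inferred from the output listing, and the helper names (`countsByCountry`, `toCsv`, `tentativeNonProductSites`, `miPathSitesWithMRI`) are hypothetical.

```javascript
// Sketch only. Field names other than `domain` come from the queries in
// these notes; `domain` is ASSUMED from the output listing.

// Filter from point 5: at least one page detected as MRI, and either no /mi
// in the URL path or, if there is one, the site is geolocated in NZ.
const tentativeNonProductSites = {
  $and: [
    { numPagesContainingMRI: { $gt: 0 } },
    { $or: [
        { urlContainsLangCodeInPath: false },
        { $and: [
            { urlContainsLangCodeInPath: true },
            { geoLocationCountryCode: "NZ" }
        ] }
    ] }
  ]
};

// Filter for sites with /mi in the URL path that also have MRI content.
const miPathSitesWithMRI = {
  $and: [
    { numPagesContainingMRI: { $gt: 0 } },
    { urlContainsLangCodeInPath: true }
  ]
};

// Build the counts-by-country pipeline around any $match filter.
// With listDomains = true, each group also collects its sites' domains.
function countsByCountry(match, listDomains) {
  const group = {
    _id: { $toLower: "$geoLocationCountryCode" },
    count: { $sum: 1 }
  };
  if (listDomains) {
    group.domain = { $push: "$domain" }; // assumed field name
  }
  return [
    { $match: match },
    { $unwind: "$geoLocationCountryCode" },
    { $group: group },
    { $sort: { count: -1 } }
  ];
}

// Flatten aggregate results ({_id, count, ...}) into CSV text.
function toCsv(rows) {
  const lines = rows.map(function (r) { return r._id + "," + r.count; });
  return ["countryCode,count"].concat(lines).join("\n");
}

// In the mongo shell (not run here):
//   db.Websites.aggregate(countsByCountry(miPathSitesWithMRI, true))
//   print(toCsv(db.Websites.aggregate(
//       countsByCountry(tentativeNonProductSites, false)).toArray()))
```

Keeping the filter separate from the pipeline means the point 5 query only has to be written once and can be reused for both the plain counts and the domain listing. The CSV helper only handles the two-column counts table; domain arrays would need their own flattening.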