Changeset 33391
- Timestamp: 2019-08-07T19:11:12+12:00
- File: 1 edited
gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt
NEXT PROBLEMS: prefixes to the basic domain should not be counted.
e.g. cs.waikato.ac.nz and waikato.ac.nz should count as one?

https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column

It's not enough to cut off http:// and then anything before the first ., since some urls won't have a prefix to the domain. How to detect which ones do and which don't, and only attempt to remove the prefix from those urls that have one?

* With -r, can leave out the backslashes: \( becomes ( and \? becomes ?:

tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less

* Also want to get rid of the starting " and ending ",

FINAL ONE:
tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less

uniq requires duplicates to be on consecutive lines in order to detect them.
So if there are 2 different lines followed by a 3rd line that's a duplicate of the first, then uniq won't detect that.
And this happens in our case because some urls are http and some are https, some have www and some don't. And Massey University's domain URL strangely ends with a . sometimes, though usually not.
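The consecutive-duplicates limitation of uniq noted above can be worked around by sorting first, so that all duplicates become adjacent (or by using sort -u directly). A minimal sketch; the input lines here are made up for illustration:

```shell
# uniq only collapses *adjacent* duplicates, so the repeat of a.nz survives:
printf 'http://a.nz\nhttp://b.nz\nhttp://a.nz\n' | uniq      # prints 3 lines

# sorting first makes duplicates adjacent; sort -u dedupes in one step:
printf 'http://a.nz\nhttp://b.nz\nhttp://a.nz\n' | sort -u   # prints 2 lines
```

So the pipelines below could end in `sort -u | less` instead of `uniq | less` to catch non-adjacent duplicates.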
tikauka:[199]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | uniq | less

tikauka:[194]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | uniq | less

tikauka:[182]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\).*@\1@'
http://100health.nz

tikauka:[178]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\)@boo@'
boo/ProdList.asp?p=1&ClassID=196,

maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | less
where
cut -d ' ' -f4
gets the 4th field (the urls), with each field separated by a space instead of the default tab delimiter.

http://webdatacommons.org/
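For the subdomain problem at the top of these notes (cs.waikato.ac.nz and waikato.ac.nz should count as one), a proper solution needs the Public Suffix List to know how many labels form the registrable domain. As a rough sketch only: keep the last three dot-separated labels when the url ends in a known second-level .nz suffix (ac.nz, co.nz, ...), and the last two labels otherwise. The suffix list below is an assumption, not exhaustive, and the function name is made up:

```shell
# Hypothetical helper: collapse a hostname to its (guessed) registrable domain.
# Heuristic only -- a real solution should consult the Public Suffix List.
collapse_domain() {
  awk -F. '{
    # second-level .nz suffixes (assumed, incomplete list):
    if (NF >= 3 && $NF == "nz" && $(NF-1) ~ /^(ac|co|org|net|govt|school|gen|geek|iwi|maori|mil)$/)
      print $(NF-2) "." $(NF-1) "." $NF   # e.g. cs.waikato.ac.nz -> waikato.ac.nz
    else if (NF >= 2)
      print $(NF-1) "." $NF               # e.g. www.100health.nz -> 100health.nz
    else
      print $0
  }'
}

echo "cs.waikato.ac.nz" | collapse_domain   # waikato.ac.nz
echo "waikato.ac.nz"    | collapse_domain   # waikato.ac.nz
echo "100health.nz"     | collapse_domain   # 100health.nz
```

This could be appended to the pipeline after stripping http:// and the path, before sort/uniq, so prefixed and unprefixed forms of the same domain collapse to one line.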