Timestamp: 2019-08-09T18:57:12+12:00
Files: 1 edited
gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt
NEXT PROBLEMS: prefixes to the basic domain should not be counted,
e.g. cs.waikato.ac.nz and waikato.ac.nz should count as one?

https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column

It's not enough to cut off http:// and then anything before the first ".", since some URLs won't have a prefix before the domain. How to detect which ones do and don't, and only attempt to remove the prefix from those URLs that have one?

* With sed -r (extended regex) the backslashes can be left out: ( instead of \( and ? instead of \? for the optional-match wildcard.

* Also want to get rid of the starting " and the ending ",

FINAL ONE:
tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less

uniq only detects duplicates when they appear on consecutive lines.
So if there are 2 different lines followed by a 3rd line that's a duplicate of the first, uniq won't detect that.
And this happens in our case because some URLs are http and some https, some have www and some don't, and Massey University's domain URL strangely ends with "." sometimes, though usually not.

Earlier attempts, without the quote/comma and trailing-dot cleanup:

tikauka:[199]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | uniq | less

tikauka:[194]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | uniq | less

Testing the sed on a single record:

tikauka:[182]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\).*@\1@'
http://100health.nz

tikauka:[178]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\)@boo@'
boo/ProdList.asp?p=1&ClassID=196,

maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | less
where
cut -d ' ' -f4
gets the 4th field (the URLs), with each field separated by a space instead of the default tab delimiter.

http://webdatacommons.org/
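The cleanup stages of the "FINAL ONE" pipeline can be checked on a single made-up record (the massey.ac.nz URL below is illustrative, not from the real data); this is the same sed chain, minus the cut and uniq:

```shell
# Normalise one quoted JSON-ish record: keep only the scheme+host,
# then strip a trailing ".", a leading double quote, and a trailing ",
echo '"https://www.massey.ac.nz./page?x=1",' \
  | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' \
  | sed 's@\.$@@' \
  | sed 's@^"@@' \
  | sed 's@",$@@'
# prints: http://massey.ac.nz
```

Note the leading `"` survives the first sed because the regex is unanchored and only the matched portion (from `https://` onward) is replaced; that is why the separate `s@^"@@` stage is still needed.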
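The uniq adjacency problem described above has a standard fix: sort first so that duplicates become consecutive, or use `sort -u` to do both steps at once. A minimal sketch with invented sample hosts:

```shell
# sort brings non-adjacent duplicates together; -u keeps one copy of each
# (equivalent to `sort | uniq`). The duplicate waikato line is dropped
# even though another host sits between the two copies.
printf '%s\n' \
  'http://waikato.ac.nz' \
  'http://massey.ac.nz' \
  'http://waikato.ac.nz' | sort -u
# prints:
# http://massey.ac.nz
# http://waikato.ac.nz
```

So appending `| sort -u` (or `| sort | uniq`) instead of plain `| uniq` to the pipelines above would catch the http/https and www/no-www duplicates once they are normalised.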
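For the "prefixes to basic domain" problem at the top (cs.waikato.ac.nz vs waikato.ac.nz), one rough sketch is to keep only the last three dot-separated labels of each host. This is an assumption-laden heuristic: it only suits three-label registered domains like *.ac.nz or *.co.nz, and would mangle two-label .nz domains such as 100health.nz; a robust answer needs the Public Suffix List.

```shell
# Heuristic: collapse hosts to their last three labels, then dedupe.
# ASSUMES *.xx.nz style registered domains; sample hosts are illustrative.
printf '%s\n' cs.waikato.ac.nz waikato.ac.nz www.massey.ac.nz \
  | awk -F. '{ n = NF; if (n > 3) print $(n-2) "." $(n-1) "." $n; else print $0 }' \
  | sort -u
# prints:
# massey.ac.nz
# waikato.ac.nz
```

Here cs.waikato.ac.nz and waikato.ac.nz collapse to the same key, which answers the "should count as one?" question for this class of domain.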