Changeset 33391

Timestamp: 07.08.2019 19:11:12
Author: ak19
Message: Some rough bash scripting lines that work but aren't complete.

Files: 1 modified
  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33376 r33391
NEXT PROBLEMS: prefixes to the basic domain should not be counted separately.
e.g. cs.waikato.ac.nz
and waikato.ac.nz should count as one?

    https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column
It's not enough to cut off http:// and then anything before the first ., since some URLs won't have a prefix to the domain. How to detect which ones do and don't, and only attempt to remove the prefix from those URLs that have a prefix?
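One rough sketch (not the final solution): for second-level .nz registrations like .ac.nz and .co.nz, keep only the last three dot-separated labels, so cs.waikato.ac.nz collapses to waikato.ac.nz. This assumes every domain of interest is registered at the second level under .nz, which isn't true in general; a proper solution would need something like the Public Suffix List.

```shell
# Sketch only: collapse hostnames to their last three dot-separated labels,
# which maps cs.waikato.ac.nz -> waikato.ac.nz. Assumes second-level .nz
# registrations (ac.nz, co.nz, ...); wrong for plain .nz or other TLDs.
echo "cs.waikato.ac.nz" \
  | awk -F. '{ if (NF > 3) print $(NF-2)"."$(NF-1)"."$NF; else print $0 }'
```

A hostname that already has only three labels (waikato.ac.nz) passes through unchanged.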

* With sed -r (extended regular expressions), the backslash escapes can be left out: ( instead of \( and ? instead of \? :

    tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less
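For instance, the main substitution rewritten with -r (GNU sed; BSD sed spells it -E). The input line here is made up, shaped like the listing's quoted URL field:

```shell
# Same substitution with sed -r: no backslashes on the grouping ( ) or the ?
echo '"https://www.waikato.ac.nz/about/page.html",' \
  | sed -r 's@(https?://)(www\.)?([^/]*).*@http://\3@' \
  | sed 's@^"@@'
```

The .* consumes the path and the trailing ",  so only the leading " still needs stripping.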


* Also want to get rid of starting " and ending ","
FINAL ONE:
    tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less

uniq requires duplicate lines to be consecutive in order to detect them.
So if there are 2 different lines followed by a 3rd line that's a duplicate of the first, then uniq won't detect that.
And this happens in our case because some URLs are http and some are https, some have www and some don't. And Massey University's domain URL strangely ends with . sometimes, though usually not.
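Since uniq only collapses adjacent duplicates, sorting first makes all duplicates adjacent (sort -u does both in one step). A minimal illustration with made-up URLs:

```shell
# uniq alone misses the non-adjacent duplicate of http://a.nz;
# sorting first groups the duplicates so uniq can remove them
printf 'http://a.nz\nhttp://b.nz\nhttp://a.nz\n' | sort | uniq
```

Piping through sort | uniq (or sort -u) before counting domains would sidestep the adjacency problem entirely.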

    tikauka:[199]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | uniq | less

    tikauka:[194]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | uniq | less


tikauka:[182]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\).*@\1@'
http://100health.nz


tikauka:[178]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\)@boo@'
boo/ProdList.asp?p=1&ClassID=196,

maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | less
    where
        cut -d ' ' -f4
    gets the 4th field (the URLs), where each field is separated by a space instead of the default tab delimiter.
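A quick illustration of that cut; the input line is hypothetical, shaped like the listing's space-separated fields with the URL in field 4:

```shell
# -d ' ' sets space as the field delimiter, -f4 selects the 4th field
echo 'crawl 2019-08 nz "http://example.nz/page",' | cut -d ' ' -f4
```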

http://webdatacommons.org/