Timestamp:
2019-08-09T18:57:12+12:00
Author:
ak19
Message:

Modified get_commoncrawl_nz_urls.sh to also create a reduced URLs file of just the unique top-level sites

File:
1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33391 → r33393
NEXT PROBLEMS: prefixes to the basic domain should not be counted separately.
e.g. cs.waikato.ac.nz
and waikato.ac.nz should count as one?

    https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column
It's not enough to cut off http:// and then anything before the first ., since some URLs won't have a prefix before the domain. How to detect which ones do and don't, and only attempt to remove the prefix from those URLs that have one? One possible approach is sketched below.
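One way (an untested sketch) is to decide based on a list of known NZ second-level suffixes rather than trying to guess which hostnames carry a prefix. The suffix list below is incomplete, and hostnames.txt is a hypothetical file of bare hostnames, i.e. after the http:// prefixes and paths have already been stripped as in the pipelines below:

    # keep 3 labels for hosts under a known second-level suffix
    # (cs.waikato.ac.nz -> waikato.ac.nz), 2 labels otherwise
    # (www.100health.nz -> 100health.nz)
    cat hostnames.txt | awk -F. '{
        n = NF
        if (n >= 3 && $(n-1) ~ /^(co|ac|govt|org|net|school|geek|gen|maori|iwi|mil|cri|health|parliament)$/)
            print $(n-2) "." $(n-1) "." $n
        else if (n >= 2)
            print $(n-1) "." $n
        else
            print
    }' | sort -u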

* With sed -r, the backslashes can be left out: ( instead of \( and ? instead of \? for the optional-match wildcard:

    tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less
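For instance, the first sed above could be rewritten with -r like this (assuming GNU sed, where -r turns on extended regular expressions; the literal dot in www\. still needs its backslash):

    cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed -r 's@(https?://)(www\.)?([^/]*).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less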


* Also want to get rid of the starting " and the ending ",
FINAL ONE:
    tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less
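The four chained sed calls can presumably also be collapsed into a single invocation with ;-separated commands, which should behave the same since sed applies the commands in order to each line:

    cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@; s@\.$@@; s@^"@@; s@",$@@' | uniq | less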

uniq requires duplicates to be on 2 consecutive lines in order to detect them.
So if there are 2 different lines followed by a 3rd line that's a duplicate of the first, then uniq won't detect that.
And this happens in our case because some URLs are http and some https, some have www. and some don't. And Massey University's domain URL strangely ends with a . sometimes, though usually not.
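Since uniq only collapses adjacent duplicates, sorting first should catch the non-adjacent ones as well; sort -u does both in one step:

    cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | sort -u | less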

    tikauka:[199]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | uniq | less

    tikauka:[194]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | uniq | less


tikauka:[182]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\).*@\1@'
http://100health.nz


tikauka:[178]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\)@boo@'
boo/ProdList.asp?p=1&ClassID=196,
i.e. without the trailing .* in the pattern, sed replaces only the matched portion and leaves the rest of the line untouched.

maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | less
    where
        cut -d ' ' -f4
    gets the 4th field (the URLs), with each field separated by a space instead of the default tab delimiter.
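For example, on a made-up line in that format (the quoting and trailing comma match what the later seds strip off; the field values here are hypothetical):

    echo 'field1 field2 field3 "http://waikato.ac.nz/page", field5' | cut -d ' ' -f4
    "http://waikato.ac.nz/page",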


http://webdatacommons.org/