Changeset 33393

Timestamp: 09.08.2019 18:57:12 (13 days ago)
Author: ak19
Message:

Modified get_commoncrawl_nz_urls.sh to also create a reduced URLs file of just the unique top-level sites

Location: gs3-extensions/maori-lang-detection
Files: 2 modified

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33391 r33393  
NEXT PROBLEMS: prefixes to the basic domain should not be counted separately.
e.g. cs.waikato.ac.nz and waikato.ac.nz should count as one?

    https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column
It's not enough to cut off http:// and then everything before the first ., since some urls won't have a prefix to the domain. How to detect which ones do and don't, and only attempt to remove the prefix from those urls that have one?
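One possible approach, sketched below (an assumption, not something these notes settled on): keep the last two labels of each hostname, or the last three when the second-to-last label is a known .nz second-level suffix. The suffix list here is partial and hostnames.txt is a hypothetical input file (one hostname per line); the complete, maintained list of such suffixes is the Public Suffix List at https://publicsuffix.org/.

    awk -F'.' '{
        n = NF
        if (n >= 3 && $(n-1) ~ /^(ac|co|geek|gen|govt|health|iwi|kiwi|maori|mil|net|org|parliament|school)$/) {
            # known second-level suffix: keep 3 labels, so cs.waikato.ac.nz -> waikato.ac.nz
            print $(n-2) "." $(n-1) "." $n
        } else if (n >= 2) {
            # plain .nz registration: keep 2 labels, so 100health.nz stays 100health.nz
            print $(n-1) "." $n
        }
    }' hostnames.txt | sort | uniq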
* With -r (extended regexps), the backslashes can be left out: ( instead of \( and ? instead of \?:

    tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less
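For instance, the domain-extracting sed above could equivalently be written as:

    sed -r 's@(https?://)(www\.)?([^/]*).*@http://\3@'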

* Also want to get rid of the starting " and ending "," on each url.
FINAL ONE:
    tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less
uniq requires duplicates to be on consecutive lines in order to detect them.
So if there are 2 different lines followed by a 3rd line that's a duplicate of the first, then uniq won't detect that.
And this happens in our case because some urls are http and some https, some have www and some don't, and Massey University's domain URL strangely ends with a . sometimes, though usually not. Piping through sort before uniq avoids the problem.
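A quick illustration of the consecutive-lines behaviour:

    printf 'a.nz\nb.nz\na.nz\n' | uniq           # prints all 3 lines
    printf 'a.nz\nb.nz\na.nz\n' | sort | uniq    # prints just a.nz and b.nz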

    tikauka:[199]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | uniq | less

    tikauka:[194]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | uniq | less

tikauka:[182]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\).*@\1@'
http://100health.nz

tikauka:[178]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\)@boo@'
boo/ProdList.asp?p=1&ClassID=196,

maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | less
    where
        cut -d ' ' -f4
    gets the 4th field (the urls), with fields separated by a space instead of cut's default tab delimiter.
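For illustration, with a made-up line in the same space-separated shape as the index output:

    echo 'nz,health100)/ 20190807 {"url": "http://100health.nz/",' | cut -d ' ' -f4
    "http://100health.nz/",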
http://webdatacommons.org/

  • gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh

    r33390 r33393  
 echo ""
 echo "The file $outfile has now been created, containing all the .nz domains for the crawl of $1"
-echo "Remember to delete the products in the tmp folder or the folder itself after inspecting its contents"
+echo ""
+
+uniq_urls_file=uniq-tld-nz-urls-`date +%F`.txt
+echo "Creating file $uniq_urls_file containing just the unique domains and subdomains..."
+cat $outfile | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | sort | uniq > $uniq_urls_file
+
+# The first cut grabs the url field of the json.
+# The second cut grabs the domain name from the url (located between the first // and the immediately following /).
+# The first sed then removes any trailing . followed by ", and optional spaces at the end of the line
+# (e.g. "massey.ac.nz." becomes massey.ac.nz), and the second sed removes any "www." prefix.
+# Finally sort and uniq the result; the sort matters because uniq only removes adjacent duplicates.
+
+echo "File $uniq_urls_file containing just the unique .nz sites (domains and subdomains) to be used as seed urls has now been created."
+num_uniq_urls=`wc -l < $uniq_urls_file`
+total_urls=`wc -l < $outfile`
+echo ""
+echo ""
+echo "Summary:"
+echo "There were $num_uniq_urls unique sites"
+echo "out of a total of $total_urls urls in $outfile."
+echo ""
+echo ""
+
+echo ""
+echo "NEXT:"
+echo "Remember to delete the products in the tmp folder or the folder itself after inspecting its contents."
 echo ""
 exit 0
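For reference, a sketch of what the new pipeline does to a single line, using a hypothetical input line whose 4th space-separated field is the quoted url from the json (the exact line format is an assumption):

    echo 'nz,ac,massey)/ 20190807 {"url": "https://www.massey.ac.nz",' | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@'
    massey.ac.nz

And a usage sketch for the script itself; the exact argument value is an assumption, but $1 is clearly a Common Crawl crawl identifier:

    ./get_commoncrawl_nz_urls.sh CC-MAIN-2019-30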