Changeset 33413

Timestamp: 13.08.2019 21:57:42
Author: ak19
Message:

Splitting the get_commoncrawl_nz_urls.sh script back into two scripts: itself and create-uniq-nz-urls-file.sh. Added a new script, create-uniq-WET-urls-file.sh, to get the WET URLs once we have all the CC URLs for the .nz TLD.
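The added create-uniq-WET-urls-file.sh is not rendered in this view (only the modified script's diff appears below), so the following is only a minimal sketch of the WARC-to-WET step it performs. It leans on a documented Common Crawl convention: a WET file lives at the same path as its WARC file, with /warc/ swapped for /wet/ and the .warc.gz suffix changed to .warc.wet.gz. The filenames, the $1 argument, and the reliance on the CDX "filename" field are illustrative assumptions, not the script's actual interface.

#!/bin/bash
# Hypothetical sketch: derive unique WET file URLs from a file of Common Crawl
# index (CDX) JSON results for the .nz TLD. Not the actual r33413 script.

cc_index_results=$1                          # assumed: CDX JSON records, one per line
wet_urls_file=uniq-wet-urls-`date +%F`.txt   # assumed output name

# Pull the "filename" field (the WARC path inside the crawl bucket) out of
# each JSON record, rewrite it to its WET counterpart, and de-duplicate:
# many index records point into the same WARC file.
# Download-host prefix is assumed; adjust to the current Common Crawl host.
<$cc_index_results grep -o '"filename": *"[^"]*"' \
    | cut -d'"' -f4 \
    | sed -e 's@/warc/@/wet/@' -e 's@\.warc\.gz$@.warc.wet.gz@' \
    | sed -e 's@^@https://commoncrawl.s3.amazonaws.com/@' \
    | sort -u > $wet_urls_file

echo "Wrote `wc -l < $wet_urls_file` unique WET urls to $wet_urls_file"

Note the sort -u rather than a plain uniq: records in a merged results file need not arrive grouped by WARC file, so adjacent-only de-duplication would not be enough here.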

Location: gs3-extensions/maori-lang-detection/bin/script
Files: 2 added, 1 modified

  • gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh

--- get_commoncrawl_nz_urls.sh (r33394)
+++ get_commoncrawl_nz_urls.sh (r33413)
@@ -156,28 +156,28 @@
 echo ""
 
-uniq_urls_file=uniq-tld-nz-urls-`date +%F`.txt
-echo "Creating file $uniq_urls_file containing just the unique domains and subdomains..."
-#cat $outfile | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
-<$outfile cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
+# uniq_urls_file=uniq-tld-nz-urls-`date +%F`.txt
+# echo "Creating file $uniq_urls_file containing just the unique domains and subdomains..."
+# #cat $outfile | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
+# <$outfile cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
 
-# cat is unnecessary: https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column/1915750
-#    "You can dump the cat! Rather than piping into <the subsequent process>, just let <the subsequent process> read the file using <. Piping through cat is a common unnecessary complication used by novices. For large amounts of data there's a performance effect to be had."
+# # cat is unnecessary: https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column/1915750
+# #    "You can dump the cat! Rather than piping into <the subsequent process>, just let <the subsequent process> read the file using <. Piping through cat is a common unnecessary complication used by novices. For large amounts of data there's a performance effect to be had."
 
-# The first cut grabs the url field of the json.
-# The second cut grabs the domain name from the url (located between first // and immediately subsequent /).
-# The first sed process then removes any trailing . (e.g. "massey.ac.nz." becomes "massey.ac.nz") followed by ", and optional spaces before the end,
-# and the final sed removes any "www." prefix.
-# Then we get the uniq urls out of all that.
+# # The first cut grabs the url field of the json.
+# # The second cut grabs the domain name from the url (located between first // and immediately subsequent /).
+# # The first sed process then removes any trailing . (e.g. "massey.ac.nz." becomes "massey.ac.nz") followed by ", and optional spaces before the end,
+# # and the final sed removes any "www." prefix.
+# # Then we get the uniq urls out of all that.
 
-echo "File $uniq_urls_file containing just the unique .nz sites (domains and subdomains) to be used as seed urls has now been created."
-num_uniq_urls=`wc -l $uniq_urls_file`
-total_urls=`wc -l $outfile`
-echo ""
-echo ""
-echo "Summary:"
-echo "There were $num_uniq_urls unique sites"
-echo "out of a total of $total_urls urls in $outfile."
-echo ""
-echo ""
+# echo "File $uniq_urls_file containing just the unique .nz sites (domains and subdomains) to be used as seed urls has now been created."
+# num_uniq_urls=`wc -l $uniq_urls_file`
+# total_urls=`wc -l $outfile`
+# echo ""
+# echo ""
+# echo "Summary:"
+# echo "There were $num_uniq_urls unique sites"
+# echo "out of a total of $total_urls urls in $outfile."
+# echo ""
+# echo ""
 
 echo ""
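For reference, the block commented out above is presumably what moved into the new create-uniq-nz-urls-file.sh. The following is a minimal reconstruction from those commented-out lines; the $1 argument is an assumption, since the new script's actual interface is not shown in this changeset (in the combined script, $outfile was set by the earlier download steps).

#!/bin/bash
# Sketch of create-uniq-nz-urls-file.sh, reconstructed from the block
# commented out of get_commoncrawl_nz_urls.sh in r33413.

outfile=$1    # assumed: the file of CC index results for the .nz TLD

uniq_urls_file=uniq-tld-nz-urls-`date +%F`.txt
echo "Creating file $uniq_urls_file containing just the unique domains and subdomains..."

# First cut grabs the url field of the json; second cut grabs the domain name
# between the first // and the next /. The first sed strips any trailing dot
# plus ", (and optional spaces) at end of line; the second strips any leading
# "www.". uniq then collapses adjacent duplicates, which suffices if the index
# results arrive sorted (and therefore grouped by host).
<$outfile cut -d ' ' -f4 | cut -d/ -f3 \
    | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file

echo "File $uniq_urls_file containing just the unique .nz sites (domains and subdomains) to be used as seed urls has now been created."

num_uniq_urls=`wc -l $uniq_urls_file`
total_urls=`wc -l $outfile`
echo ""
echo "Summary:"
echo "There were $num_uniq_urls unique sites"
echo "out of a total of $total_urls urls in $outfile."

Note the <$outfile redirection at the head of the pipeline: as the quoted Stack Overflow answer in the diff explains, it lets cut read the file directly instead of piping it through an unnecessary cat.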