Changeset 33413 for gs3-extensions


Timestamp: 2019-08-13T21:57:42+12:00
Author: ak19
Message:

Splitting the get_commoncrawl_nz_urls.sh script back into two scripts: itself and create-uniq-nz-urls-file.sh. Also added a new script, create-uniq-WET-urls-file.sh, to get the WET URLs once we have all the CC URLs for the .nz TLD.

Location: gs3-extensions/maori-lang-detection/bin/script
Files: 2 added, 1 edited

  • gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh

    --- get_commoncrawl_nz_urls.sh (r33394)
    +++ get_commoncrawl_nz_urls.sh (r33413)
    @@ -156,28 +156,28 @@
     echo ""

    -uniq_urls_file=uniq-tld-nz-urls-`date +%F`.txt
    -echo "Creating file $uniq_urls_file containing just the unique domains and subdomains..."
    -#cat $outfile | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
    -<$outfile cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
    +# uniq_urls_file=uniq-tld-nz-urls-`date +%F`.txt
    +# echo "Creating file $uniq_urls_file containing just the unique domains and subdomains..."
    +# #cat $outfile | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file
    +# <$outfile cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@' | uniq > $uniq_urls_file

    -# cat is unnecessary: https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column/1915750
    -#    "You can dump the cat! Rather than piping into <the subsequent process>, just let <the subsequent process> read the file using <. Piping through cat is a common unnecessary complication used by novices. For large amounts of data there's a performance effect to be had."
    +# # cat is unnecessary: https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column/1915750
    +# #    "You can dump the cat! Rather than piping into <the subsequent process>, just let <the subsequent process> read the file using <. Piping through cat is a common unnecessary complication used by novices. For large amounts of data there's a performance effect to be had."

    -# The first cut grabs the url field of the json.
    -# The second cut grabs the domain name from the url (located between first // and immediately subsequent /).
    -# The first sed process then removes any trailing . (e.g. "massey.ac.nz." becomes "massey.ac.nz") followed by ", and optional spaces before the end,
    -# and the final sed removes any "www." prefix.
    -# Then we get the uniq urls out of all that.
    +# # The first cut grabs the url field of the json.
    +# # The second cut grabs the domain name from the url (located between first // and immediately subsequent /).
    +# # The first sed process then removes any trailing . (e.g. "massey.ac.nz." becomes "massey.ac.nz") followed by ", and optional spaces before the end,
    +# # and the final sed removes any "www." prefix.
    +# # Then we get the uniq urls out of all that.

    -echo "File $uniq_urls_file containing just the unique .nz sites (domains and subdomains) to be used as seed urls has now been created."
    -num_uniq_urls=`wc -l $uniq_urls_file`
    -total_urls=`wc -l $outfile`
    -echo ""
    -echo ""
    -echo "Summary:"
    -echo "There were $num_uniq_urls unique sites"
    -echo "out of a total of $total_urls urls in $outfile."
    -echo ""
    -echo ""
    +# echo "File $uniq_urls_file containing just the unique .nz sites (domains and subdomains) to be used as seed urls has now been created."
    +# num_uniq_urls=`wc -l $uniq_urls_file`
    +# total_urls=`wc -l $outfile`
    +# echo ""
    +# echo ""
    +# echo "Summary:"
    +# echo "There were $num_uniq_urls unique sites"
    +# echo "out of a total of $total_urls urls in $outfile."
    +# echo ""
    +# echo ""

     echo ""
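
The pipeline that this changeset comments out (and that moves into create-uniq-nz-urls-file.sh) is worth unpacking. Below is a small, self-contained illustration of what each stage does to one index line. The sample line is hypothetical; the real input is the CC index query output accumulated in $outfile. The cut and sed stages are copied verbatim from the script.

    #!/bin/bash
    # One made-up CC index result line of the shape the pipeline expects:
    # SURT key, timestamp, then a JSON blob whose 4th space-separated
    # field is the quoted url value.
    line='nz,ac,massey)/ 20190101000000 {"url": "https://www.massey.ac.nz.", "mime": "text/html"}'

    # cut -d ' ' -f4  ->  "https://www.massey.ac.nz.",  (url field of the json)
    # cut -d/ -f3     ->  www.massey.ac.nz.",            (host between // and the next /)
    # first sed       ->  www.massey.ac.nz               (trailing . and ", stripped)
    # second sed      ->  massey.ac.nz                   (www. prefix stripped)
    echo "$line" | cut -d ' ' -f4 | cut -d/ -f3 | sed -r 's@\.?",\s*$@@' | sed -r 's@^www\.@@'

One caveat: uniq in the original pipeline only collapses adjacent duplicates, so it relies on entries for the same host arriving grouped together in $outfile; if that ordering is not guaranteed, sort -u would be the safer final stage.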
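The changeset message also introduces create-uniq-WET-urls-file.sh, which maps the collected CC URLs for the .nz TLD to WET file locations, but that script's contents are not part of this diff. The sketch below is only a guess at the general approach, assuming each index JSON line carries a "filename" field naming the WARC file, that the matching WET path follows Common Crawl's usual /warc/ to /wet/ and .warc.gz to .warc.wet.gz convention, and that the S3 host shown is the right download endpoint; the input and output file names are illustrative.

    #!/bin/bash
    # Sketch only, not the committed script. $1 (or the hypothetical
    # default shown) is the file of CC index JSON lines gathered for
    # the .nz TLD.
    outfile=${1:-tld-nz-urls.txt}
    wet_urls_file=uniq-tld-nz-WET-urls-`date +%F`.txt

    # Pull the WARC path out of the "filename" field, rewrite it to the
    # matching WET path, de-duplicate, and prefix a download host
    # (assumed here to be Common Crawl's public S3 endpoint).
    <$outfile grep -o '"filename": "[^"]*"' \
        | cut -d'"' -f4 \
        | sed -e 's@/warc/@/wet/@' -e 's@\.warc\.gz$@.warc.wet.gz@' \
        | sort -u \
        | sed 's@^@https://commoncrawl.s3.amazonaws.com/@' > $wet_urls_file

    echo "Wrote the unique WET file urls to $wet_urls_file"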