Last change on this file since 34158 was 34132, checked in by ak19, 4 years ago:

Committing the commoncrawl site of Nutch recrawls of our CC data where content-language = MRI.
1. Contains the collection configuration files, but also the keep-urls *.txt files in the etc folder, used by NutchTextDumpPlugin to filter URLs of interest.
2. The import_nutchDumpTxtsOfcrawledMRICC.tar.gz file needs to be decompressed into any of the collections that need to be rebuilt. It contains just the Nutch dump.txt files (in their siteID folders), as I've removed the binary files.
3. The script moveDumpTxtFilesIntoImport.sh can be used to generate such cut-down versions of the Nutch crawled folders, containing only the dump.txt files within their siteID folders.
4. In the next commit, I'll try to add svn externals to get import_nutchDumpTxtsOfcrawledMRICC.tar.gz from the site level into the collection folders for the 2 current collections in this site.

Property svn:executable set to *
File size: 1.4 KB
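A minimal sketch of step 2 above: decompressing the archive of dump.txt files (in their siteID folders) into a collection folder. The archive, siteID and collection names here are stand-ins built for demonstration, not the real site-level paths:

```shell
# Demo setup (hypothetical data): build a small archive shaped like the real one,
# with dump.txt files inside siteID folders.
mkdir -p demo/00001 demo/00002 collection
echo "site one dump" > demo/00001/dump.txt
echo "site two dump" > demo/00002/dump.txt
tar -czf import_demo.tar.gz -C demo 00001 00002

# Step 2: decompress the archive into the collection folder that needs rebuilding.
# For the real data, the archive would be import_nutchDumpTxtsOfcrawledMRICC.tar.gz
# and the target would be the collection's folder.
tar -xzf import_demo.tar.gz -C collection
```

After extraction, the collection folder holds one siteID folder per site, each containing only its dump.txt.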
#!/bin/bash

# EDIT to set the crawleddir value to the folder where crawls are extracted, then run.
# This will also take care of complete Nutch crawled directories
# that include more than dump.txt within siteID folders
# (like a folder consisting of the combination of all crawledNode#.tar.gz files
# at http://trac.greenstone.org/browser/other-projects/maori-lang-detection).
# In such cases, this script will copy over just the siteID folder along with dump.txt.

# The commented-out version below will copy across siteID/dump.txt as siteID.txt
# instead (without the enclosing folder).


# https://superuser.com/questions/44787/looping-through-subdirectories-and-running-a-command-in-each
# https://stackoverflow.com/questions/15148796/get-string-after-character

## Producing files called siteID.txt instead of siteID/dump.txt files:
# for dir in txtdumps/*; do
#     # ${dir#*/} strips the leading "txtdumps/" prefix, leaving the siteID
#     filename=${dir#*/}
#     echo "$filename"
#     mv "$dir/dump.txt" "import/$filename.txt"
# done


# https://stackoverflow.com/questions/23162299/how-to-get-the-last-part-of-dirname-in-bash
crawleddir=
for dir in "$crawleddir"/*; do
    if [[ -d "$dir" ]]; then
        # ${dir##*/} strips everything up to the last /, leaving the siteID folder name
        foldername="${dir##*/}"

        mkdir -p "import/$foldername"
        if [[ ! -f "$dir/dump.txt" ]]; then
            echo "There was no dump.txt in $foldername"
        else
            #echo "$foldername"
            cp "$dir/dump.txt" "import/$foldername/."
        fi
    fi
done
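A self-contained sketch of what the loop above does, run against a throwaway crawled tree in which one siteID folder also holds a non-dump file. All names here (the crawled tree, siteIDs, and the extra segments.bin file) are invented for the demo:

```shell
# Build a hypothetical crawled tree: siteID folders with dump.txt plus other files.
mkdir -p crawled/00001 crawled/00002 import
echo "dump one" > crawled/00001/dump.txt
echo "binary"   > crawled/00001/segments.bin   # extra file the loop leaves behind
echo "dump two" > crawled/00002/dump.txt

# Same structure as the script's main loop, with crawleddir set for the demo.
crawleddir=crawled
for dir in "$crawleddir"/*; do
    if [ -d "$dir" ]; then
        foldername="${dir##*/}"        # strip everything up to the last /
        mkdir -p "import/$foldername"
        if [ -f "$dir/dump.txt" ]; then
            cp "$dir/dump.txt" "import/$foldername/."
        fi
    fi
done

# import/ now mirrors the siteID folders, but each contains only dump.txt.
ls import/00001
```

This shows the cut-down result described in point 3 of the commit message: the siteID folder structure is preserved, while everything other than dump.txt stays behind in the crawled tree.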