source: other-projects/is-sheet-music-encore/trunk/gen-corpus-ids/HATHI-EXTRACT-FORMAT.sh

Last change on this file was 32965, checked in by davidb, 5 years ago

Further changes after test-run

  • Property svn:executable set to *
File size: 674 bytes
#!/bin/bash

# Input and output paths; override with the first and second arguments
input=${1:-'hathi_full_20190301.txt.gz'}
output=${2:-'hathi_brief_20190301.txt'}

echo ""
echo "===="
echo " Script to extract Format (and related fields, such as copyright)"
echo " from HathiTrust tab-delimited metadata dump"
echo "===="

echo ""
echo "Reading in : $input"
echo "Writing out : $output"
echo ""

echo "Processing ..."
# Decompress the dump and keep only tab-separated columns 1, 3, 20 and 24
zcat "$input" \
  | awk -F '\t' '{print $1 "\t" $3 "\t" $20 "\t" $24}' \
  > "$output"

echo "... Done"
echo ""

echo "===="
echo " Next, extract entries that are Music Format, Public Domain and"
echo " NOT scanned by Google (so-called 'open-open' files):"
echo " ./HATHI-EXTRACT-PD-NON-GOOGLE.sh"
echo "===="
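
The column projection at the heart of the script can be tried without downloading the full dump. The sketch below is illustrative only: `extract_brief` is a hypothetical helper name, the 26-column record is made-up placeholder data, and `gzip -dc` is swapped in for `zcat` on the assumption that it behaves the same on gzip input while being more portable. From the script's own description, columns 3, 20 and 24 appear to hold the rights code, the bibliographic format and the digitization agent, with column 1 as the volume identifier; treat that mapping as an assumption to check against the HathiTrust file description.

```shell
#!/bin/bash

# Hypothetical helper: keep tab-separated columns 1, 3, 20 and 24 of stdin.
# Setting FS and OFS together avoids repeating "\t" between every field.
extract_brief() {
  awk 'BEGIN { FS = OFS = "\t" } { print $1, $3, $20, $24 }'
}

# One placeholder record with 26 tab-separated fields: c1, c2, ..., c26.
# (paste -s joins the lines with its default delimiter, a tab.)
sample=$(seq 1 26 | sed 's/^/c/' | paste -s -)

# Round-trip through gzip to mimic the compressed dump, then project.
result=$(printf '%s\n' "$sample" | gzip -c | gzip -dc | extract_brief)

# Prints the four kept fields separated by tabs: c1, c3, c20, c24
printf '%s\n' "$result"
```

Rows with fewer than 24 fields are not rejected: awk emits empty strings for the missing columns, so a later filtering step may want to check the field count first.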