source: other-projects/is-sheet-music-encore/trunk/gen-corpus-ids/HATHI-EXTRACT-LANG.sh@ 33340

Last change on this file since 33340 was 33138, checked in by davidb, 5 years ago

Scripts that focus on language (for non-music related work)

  • Property svn:executable set to *
File size: 791 bytes
#!/bin/bash

# Sourced for $latest_date, which selects the metadata dump to read
. ./latest-dump.sh

#input=${1:-'hathi_full_20190301.txt.gz'}
#output=${2:-'hathi_brief_20190301.txt'}

# Optional arguments: $1 = gzipped input dump, $2 = output file
input=${1:-"hathi_full_$latest_date.txt.gz"}
output=${2:-"hathi_brief_lang_$latest_date.txt"}

echo ""
echo "===="
echo " Script to extract Language (and related fields, such as copyright)"
echo " from HathiTrust tab-delimited metadata dump"
echo "===="

echo ""
echo "Reading in : $input"
echo "Writing out : $output"
echo ""

echo "Processing ..."
# Keep fields 1, 3, 19 and 24 of each tab-delimited record
zcat "$input" \
 | awk -F '\t' '{print $1 "\t" $3 "\t" $19 "\t" $24} ' \
 > "$output"

echo "... Done"
echo ""

echo "===="
echo " Next, extract entries that are Music Format, Public Domain and"
echo " NOT scanned by Google (so called 'open-open' files):"
echo " ./HATHI-EXTRACT-PD-NON-GOOGLE.sh"
echo "===="

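The field selection in the pipeline above can be tried without downloading a dump. The following is a minimal sketch, not part of the checked-in script: the helper name `extract_fields` and the synthetic 26-field record are assumptions made for illustration, and the sketch sets `OFS` instead of concatenating `"\t"` by hand, which should produce the same output.

```shell
#!/bin/bash

# Hypothetical helper (not in the original script): keep fields 1, 3, 19
# and 24 of tab-delimited records read from stdin.
extract_fields() {
  awk 'BEGIN { FS = OFS = "\t" } { print $1, $3, $19, $24 }'
}

# Build one synthetic record with fields f1..f26 (made-up data, chosen so
# that each output value shows which column it came from).
sample=$(awk 'BEGIN {
  for (i = 1; i <= 26; i++) printf "%s%s", "f" i, (i < 26 ? "\t" : "\n")
}')

# Prints the four selected fields, tab-separated.
printf '%s\n' "$sample" | extract_fields
```

Because each synthetic field carries its own column number, a wrong index in the `print` statement shows up immediately in the output.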