source: other-projects/is-sheet-music-encore/trunk/gen-corpus-ids/README.txt@ 32963

Last change on this file since 32963 was 32963, checked in by davidb, 5 years ago

Added text and some refinement of scripts to make things easier to run

Identifying HathiTrust volume IDs suitable for inclusion in a corpus
of images for identifying sheet music.

Order to run the scripts:

1. To get a fresh HathiTrust tab-delimited metadata dump
   (at the time of writing: March 2019)

./HATHI-GET-TAB-DELIM-DUMP.sh

2. To winnow the file down to a more manageable size
   (just the columns we're interested in)

./HATHI-EXTRACT-FORMAT.sh

3. To generate a list of Music Notation entries that are in the public
   domain and were not scanned by Google (so-called 'open-open'):

./HATHI-EXTRACT-PD-NON-GOOGLE.sh
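
The three steps above could be chained with a small wrapper along the
following lines. This is only a sketch: the function name run_pipeline
and its directory argument are hypothetical and not part of the checked-in
scripts, and it assumes each script exits with a non-zero status on failure.

```shell
#!/bin/sh
# Sketch only: run the three README steps in order and stop at the
# first step that fails. run_pipeline is a hypothetical helper; the
# script names are the ones listed in this README.

run_pipeline() {
    # Directory holding the scripts; defaults to the current directory.
    script_dir="${1:-.}"

    for script in \
        HATHI-GET-TAB-DELIM-DUMP.sh \
        HATHI-EXTRACT-FORMAT.sh \
        HATHI-EXTRACT-PD-NON-GOOGLE.sh
    do
        echo "Running $script" >&2
        # Run each script from inside its own directory, in a subshell,
        # so the caller's working directory is left unchanged.
        ( cd "$script_dir" && ./"$script" ) || {
            echo "Step failed: $script" >&2
            return 1
        }
    done
}
```

After sourcing the file that defines the function, calling
run_pipeline with the path to this directory would run steps 1 to 3 in
sequence. Stopping at the first failure avoids running the later
extraction steps against a missing or partial metadata dump.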