----
Introduction
----

Java code for processing HTRC Extracted Feature JSON files, suitable for
ingesting into Solr. Designed to be used on a Spark cluster with HDFS.

----
Setup Procedure
----

This is Step 2 of a two-step setup procedure.

For Step 1, see:

  http://svn.greenstone.org/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt

*Assumptions*

* You have 'svn' and 'mvn' on your PATH

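The assumptions above can be verified with a short shell snippet. The
'check_path' helper below is purely a hypothetical convenience, not part of
this project:

```shell
# Hypothetical helper (not part of this project): report whether each
# named tool can be found on the PATH.
check_path() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "MISSING: $tool"
    fi
  done
}

# On a correctly provisioned machine, both lines should read "ok: ...".
check_path svn mvn
```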
----
Step 2
----

1. Check that the HDFS and Spark Java daemon processes are running:

     jps

   Example output:

     19212 NameNode
     19468 SecondaryNameNode
     19604 Master
     19676 Jps

   [[
   Starting these processes was previously covered in Step 1, but in brief,
   after formatting the disk with:

     hdfs namenode -format

   the daemons are started with:

     start-dfs.sh
     spark-start-all.sh

   The latter is an alias defined by the Step 1 provisioning (created to
   avoid the conflict over 'start-all.sh', which both Hadoop and
   Spark define).
   ]]

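If you prefer to script this check, the sketch below greps jps-style output
for the daemon names listed above. 'check_daemons' is a hypothetical helper,
not part of this project; on the cluster you would pass it "$(jps)":

```shell
# Hypothetical helper (not part of this project): verify that each
# expected daemon name appears in the supplied jps output.
check_daemons() {
  local output="$1"
  local daemon
  for daemon in NameNode SecondaryNameNode Master; do
    if ! printf '%s\n' "$output" | grep -q "$daemon"; then
      echo "MISSING: $daemon"
      return 1
    fi
  done
  echo "all daemons running"
}

# On the cluster:  check_daemons "$(jps)"
```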
2. Acquire some JSON files to process, if you have not already done so.
   For example:

     ./scripts/PD-GET-FULL-FILE-LIST.sh
     ./scripts/PD-SELECT-EVERY-10000.sh
     ./scripts/PD-DOWNLOAD-EVERY-10000.sh

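For reference, selecting every Nth entry from a line-oriented listing is a
one-line awk operation. The demo below uses a step of 3 on a generated file,
purely as a stand-in for the kind of thing a script like
PD-SELECT-EVERY-10000.sh might do internally (the actual script may be
implemented quite differently):

```shell
# Demo only: generate a 10-line stand-in listing, then keep every 3rd
# line starting from the first. The real script presumably uses a step
# of 10000 over the full file listing.
seq 1 10 > /tmp/demo-listing.txt
awk 'NR % 3 == 1' /tmp/demo-listing.txt
# Prints 1, 4, 7, 10 (one per line)
```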
3. Push these files over to HDFS:

     hdfs dfs -mkdir /user
     hdfs dfs -mkdir /user/htrc

     hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
     hdfs dfs -put pd-ef-json-files /user/htrc/.

4. Compile the code:

     ./COMPILE.bash

   The first time this is run, a variety of Maven/Java dependencies will be
   downloaded.

5. Run the code on the cluster:

     ./RUN.bash pd-ef-json-filelist-10000.txt