----
Introduction
----

Java code for processing HTRC Extracted Feature JSON files, suitable for
ingesting into Solr. Designed to be used on a Spark cluster with HDFS.

----
Setup Procedure
----

This is Step 2 of a two-step setup procedure.

For Step 1, see:

  http://svn.greenstone.org/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt

*Assumptions*

* You have 'svn' and 'mvn' on your PATH

----
Step 2
----

1. Check HDFS and Spark Java daemon processes are running:

     jps

   Example output:

     19212 NameNode
     19468 SecondaryNameNode
     19604 Master
     19676 Jps

   [[
   Starting these processes was previously covered in Step 1, but in
   brief: after formatting the disk with

     hdfs namenode -format

   the daemons are started with

     start-dfs.sh
     spark-start-all.sh

   The latter is an alias defined by the Step 1 provisioning, created to
   avoid the conflict over 'start-all.sh', a script name that both Hadoop
   and Spark define.
   ]]
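The daemon check in step 1 can be scripted rather than eyeballed. Below is a minimal sketch; the helper name `check_daemons` is hypothetical, and the daemon list is taken from the example output above:

```shell
# Scan jps-style output for the daemons expected on this cluster's
# head node (names taken from the example output above; the function
# name itself is hypothetical, not part of the provisioning).
check_daemons() {
  _jps_output="$1"
  _missing=0
  for _daemon in NameNode SecondaryNameNode Master; do
    # grep -w avoids 'NameNode' matching inside 'SecondaryNameNode'
    if ! printf '%s\n' "$_jps_output" | grep -qw "$_daemon"; then
      echo "Missing daemon: $_daemon"
      _missing=1
    fi
  done
  return $_missing
}

# Typical usage on the head node:
#   check_daemons "$(jps)" && echo "HDFS and Spark daemons are up"
```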

2. Acquire some JSON files to process, if you have not already done so.
   For example:

     ./scripts/PD-GET-FULL-FILE-LIST.sh
     ./scripts/PD-SELECT-EVERY-10000.sh
     ./scripts/PD-DOWNLOAD-EVERY-10000.sh

3. Push these files over to HDFS:

     hdfs dfs -mkdir /user
     hdfs dfs -mkdir /user/htrc

     hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
     hdfs dfs -put pd-ef-json-files /user/htrc/.

4. Compile the code:

     ./COMPILE.bash

   The first time this is run, a variety of Maven/Java dependencies will
   be downloaded.

5. Run the code on the cluster:

     ./RUN.bash pd-ef-json-filelist-10000.txt

   If running subsequently, remove the saved RDD on HDFS first:

     hdfs dfs -rm -r rdd-solr-json-page-files
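For repeat runs, the cleanup and run commands in step 5 can be combined behind a single guard. A minimal sketch, assuming the file and directory names used in the steps above (the `rerun_job` wrapper itself is hypothetical):

```shell
# Hypothetical wrapper around the two step-5 commands: 'hdfs dfs -test -d'
# exits 0 when the HDFS directory exists, so the saved RDD is only removed
# when a previous run actually left one behind.
rerun_job() {
  if hdfs dfs -test -d rdd-solr-json-page-files; then
    hdfs dfs -rm -r rdd-solr-json-page-files
  fi
  ./RUN.bash pd-ef-json-filelist-10000.txt
}
```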