source: other-projects/hathitrust

	Rev	Age	Author	Log Message
(edit)	@31004	7 years	davidb	added debug
(edit)	@31003	7 years	davidb	Explicit default constructors added
(edit)	@31002	7 years	davidb	Need to separate flatMap and foreach calls in PagedJSON
(edit)	@31001	7 years	davidb	Code to work per-volume and per-page
(edit)	@31000	7 years	davidb	Class name refactoring
(edit)	@30999	7 years	davidb	Class name refactoring
(edit)	@30998	7 years	davidb	Class name refactoring
(edit)	@30997	7 years	davidb	Verbosity control over printing
(edit)	@30996	7 years	davidb	Code refactoring
(edit)	@30995	7 years	davidb	Adjustment of NUM_PARTITIONS to be based on Spark recommended calculation
(edit)	@30994	7 years	davidb	Additional useful links. Links open in new tab
(edit)	@30993	7 years	davidb	Placeholder page to provide useful links to hadoop and solr cluster …
(edit)	@30992	7 years	davidb	Additional adjustments after test run on cluster
(edit)	@30991	7 years	davidb	Initial cut at README notes, and supporting links
(edit)	@30990	7 years	davidb	opt name change
(edit)	@30989	7 years	davidb	Changes to better suit EF set used with solr
(edit)	@30988	7 years	davidb	Changed flag to 'read-only' and changed the field name full text saved …
(edit)	@30986	7 years	davidb	Debugging for double accumulator added
(edit)	@30985	7 years	davidb	Changed to run main processing method as action rather than transform. …
(edit)	@30984	7 years	davidb	Introduction of Spark accumulator to measure progress. Output of POST …
(edit)	@30983	7 years	davidb	Useful helper script
(edit)	@30982	7 years	davidb	Fixed to host_name for solr2 and solr3
(edit)	@30981	7 years	davidb	Useful folder for 'on-the-side' packages
(edit)	@30980	7 years	davidb	Code added to read response
(edit)	@30979	7 years	davidb	_solr_url needs to be stored in class!
(edit)	@30978	7 years	davidb	Additional debug statements
(edit)	@30977	7 years	davidb	Only have RDD if an output directory was specified on the command-line …
(edit)	@30976	7 years	davidb	Change to reflect changed order of command-line arguments
(edit)	@30975	7 years	davidb	Introduction of new solr-url command line argument, leading to some …
(edit)	@30974	7 years	davidb	update/add/doc JSON structure needed
(edit)	@30973	7 years	davidb	Changed to saving Solr JSON file for debugging purposes
(edit)	@30972	7 years	davidb	addition of useful command needed before re-running
(edit)	@30971	7 years	davidb	Adding in post to Solr cloud. Changed text_t to _text_
(edit)	@30970	7 years	davidb	Added in mapping of EF-JSON to Solr 'add' JSON format
(edit)	@30969	7 years	davidb	Fine tuning resulting from testing the cloud/cluster
(edit)	@30962	7 years	davidb	Corrections and improvements made after initial testing between …
(edit)	@30960	7 years	davidb	Switch to using Puppet to provision machine. Strongly based on files …
(edit)	@30957	8 years	davidb	No longer needed. (Local copy taken on Windows laptop.)
(edit)	@30956	8 years	davidb	Initial commit of files for setting up with Vagrant a Solr cloud
(edit)	@30953	8 years	davidb	Need to specify _output_dir as part of output JSON filename
(edit)	@30952	8 years	davidb	Further text tidy up
(edit)	@30951	8 years	davidb	Save a JSONObject as a file in the output directory
(edit)	@30950	8 years	davidb	Tweak to text
(edit)	@30949	8 years	davidb	Use better name than 'foo'. Further fix to JSON name generated
(edit)	@30947	8 years	davidb	Correction to 'pages-' part of JSON.bz2 output filename used
(edit)	@30946	8 years	davidb	Correction to output JSON.bz2 name generated
(edit)	@30945	8 years	davidb	Getting closer to writing out JSON files
(edit)	@30944	8 years	davidb	Force higher partition (6) than default, which seems to be 2
(edit)	@30943	8 years	davidb	Extra debug info
(edit)	@30942	8 years	davidb	Improved output printing for slave node
(edit)	@30941	8 years	davidb	Moved to getFileSystemInstance() method to play nice on cluster
(edit)	@30940	8 years	davidb	Change to using URI not fileIn directly
(edit)	@30939	8 years	davidb	Minor tweaks
(edit)	@30938	8 years	davidb	Experiment with using Hadoop's FileSystem class for local file:// access
(edit)	@30937	8 years	davidb	Expanded set of ClusterFileIO methods
(edit)	@30936	8 years	davidb	Refinement of Spark Monitor echo statements
(edit)	@30935	8 years	davidb	Fixed variable name typo, plus added a couple of 'sleep' pauses of 1 sec
(edit)	@30934	8 years	davidb	Providing json-filelist now a compulsory argument, rather than an option
(edit)	@30933	8 years	davidb	More careful parsing of file prefix
(edit)	@30932	8 years	davidb	Support both file:// and hdfs://
(edit)	@30931	8 years	davidb	Version that runs using file:// tested
(edit)	@30930	8 years	davidb	Expansion of useful alias commands for Hadoop and Spark
(edit)	@30929	8 years	davidb	Tweaks made while testing the script
(edit)	@30928	8 years	davidb	Forgot to set json_filelist
(edit)	@30927	8 years	davidb	Fixed silly typo in stdout redirect
(edit)	@30926	8 years	davidb	Restructuring of RUN scripts to be more flexible
(edit)	@30925	8 years	davidb	Improved instructions
(edit)	@30924	8 years	davidb	Tidy up of code. Removed commented out code
(edit)	@30923	8 years	davidb	Rough cut version that reads in each JSON file over HDFS
(edit)	@30922	8 years	davidb	Additional rough-cut notes
(edit)	@30921	8 years	davidb	Code change to read in JSON file over HDFS
(edit)	@30919	8 years	davidb	More consistent naming of folders used
(edit)	@30918	8 years	davidb	More flexible command-line args
(edit)	@30917	8 years	davidb	Changes resulting from a fresh run at provisioning, which yielded the …
(edit)	@30916	8 years	davidb	Some additional details -- note form
(edit)	@30915	8 years	davidb	Initial cut at instructions to follow to get code set up and running
(edit)	@30914	8 years	davidb	Tidy up of setup description
(edit)	@30913	8 years	davidb	Renaming to better represent what the cluster is designed for
(edit)	@30912	8 years	davidb	Changed to Unix style line-endings
(edit)	@30911	8 years	davidb	Changed name of input directory
(edit)	@30910	8 years	davidb	Additional finesse added in as a result of further testing on Vagrant …
(edit)	@30909	8 years	davidb	Additional finesse added in as a result of further testing on Vagrant …
(edit)	@30908	8 years	davidb	Additional finesse added in as a result of further testing on Vagrant …
(edit)	@30907	8 years	davidb	Name change to reflect need for 'bash' not 'sh'
(edit)	@30906	8 years	davidb	Bash version of BAT script
(edit)	@30905	8 years	davidb	Additional resources
(edit)	@30904	8 years	davidb	Extra resource/links added
(edit)	@30903	8 years	davidb	Vagrant provisioning files for a 4-node Hadoop cluster. See …
(edit)	@30902	8 years	davidb	Details of what packages are needed
(edit)	@30901	8 years	davidb	Template setup file
(edit)	@30900	8 years	davidb	For support Java packages
(edit)	@30899	8 years	davidb	Files for compilation using Eclipse
(edit)	@30898	8 years	davidb	Scripts for downloading sample JSON data from public domain extracted …
(edit)	@30897	8 years	davidb	Sub-project for converting HTRC Extract Feature dataset into a form …
(add)	@30890	8 years	davidb	folder to group together hathitrust related projects
