source: other-projects/hathitrust/wcsa/extracted-features-solr/trunk

Diff Rev Age Author Log Message
(edit) @31278   7 years davidb To avoid null pointer on ids.iterator()
(edit) @31277   7 years davidb Tweak to minimum value
(edit) @31276   7 years davidb Min num partition guard put in
(edit) @31275   7 years davidb Changes to allow gc slave nodes to work with local disk versions of …
(edit) @31274   7 years davidb Need to use JSONArray not JSONObject for a multifield item
(edit) @31273   7 years davidb Code moved to store fields for multilingual use using dynamic Solr …
(edit) @31272   7 years davidb Use disk and memory to store main language RDD
(edit) @31271   7 years davidb Updating of POS code to new files-per-partition parameter, plus some …
(edit) @31270   7 years davidb Changed over to repartition approach
(edit) @31269   7 years davidb Some variable name changes, and printing tidy up
(edit) @31268   7 years davidb Adjustments to memory allocation in response to test runs on 10% of dataset
(edit) @31267   7 years davidb Values trialed on gsliscluster1. Rekindling idea of per-vol processing
(edit) @31266   7 years davidb Rekindling of per-volume approach. Also some tweaking to verbosity …
(edit) @31264   7 years davidb Switching to 'long' in counts to allow higher number representation
(edit) @31263   7 years davidb Change to using long for higher word counts
(edit) @31261   7 years davidb Overlooked changes from POS to lang
(edit) @31260   7 years davidb Language counting
(edit) @31259   7 years davidb Lambda sort had wrong boolean arg to sort descending. Now fixed
(edit) @31258   7 years davidb POS Label count, similar to Whitelist word count
(edit) @31257   7 years davidb Fixed typo
(edit) @31256   7 years davidb Earlier check of output directory to prevent large scale processing, …
(edit) @31255   7 years davidb Changed to using lambda functions
(edit) @31254   7 years davidb Experimenting with Lucene lowercase filter
(edit) @31253   7 years davidb Identified a typo, and changed to being true anyway
(edit) @31252   7 years davidb Support for icu-tokenize property added, plus relevant refactoring.
(edit) @31251   7 years davidb Code tidy up. Timed experiment showed sorting by key with …
(edit) @31250   7 years davidb Minor mods
(edit) @31247   7 years davidb Change sort order. Pick better output directory name
(edit) @31246   7 years davidb Experimenting with sorting
(edit) @31245   7 years davidb Refactored so processing of words from TokenPosCount now done by the …
(edit) @31244   7 years davidb Tidy up
(edit) @31243   7 years davidb Experimenting with Lucene/Solr's ICU tokenizer
(edit) @31242   7 years davidb Method name refactor
(edit) @31235   7 years davidb More fine-grained testing to help nema setup
(edit) @31234   7 years davidb More selective control of what to source/setup depending on hostname
(edit) @31233   7 years davidb Changes to operate on nema as well as gsliscluster1 and gc0-9
(edit) @31232   7 years davidb Hand edited version of state.json from gsliscluster1 suitable for …
(edit) @31231   7 years davidb Changes to allow SOLR to run on nodes in /hdfsd05/dbbridge/solr-ef
(edit) @31228   7 years davidb Change to see if code can be made more unified. If so, then …
(edit) @31227   7 years davidb Code tidy up
(edit) @31226   7 years davidb Fixed bloom test for init
(edit) @31225   7 years davidb Relocated bloomfilter creation to within call() method, so done on the …
(edit) @31224   7 years davidb Debug added
(edit) @31223   7 years davidb Exception printStackTrace
(edit) @31222   7 years davidb Changed to using ClusterFileIO supporting methods
(edit) @31221   7 years davidb Missing argument added in
(edit) @31220   7 years davidb Use of whitelist Bloom filter added to words going into Solr index
(edit) @31215   7 years davidb Changed back to Guava 20 API, now mvn shading allows me to have this …
(edit) @31214   7 years davidb Not needed now using mvn shading
(edit) @31213   7 years davidb Tidy up
(edit) @31212   7 years davidb Changed from mvn assembly to shadowing, which has more control
(edit) @31211   7 years davidb Changing back to regular Guava classes. Looking to use maven shading …
(edit) @31209   7 years davidb checkArgument added in
(edit) @31207   7 years davidb And some more tweaking
(edit) @31206   7 years davidb More tweaking of Guava cloned code
(edit) @31205   7 years davidb Next added in part of new Guava code
(edit) @31204   7 years davidb Splicing in Guava version 20 of BloomFilter into code as own class (now …
(edit) @31203   7 years davidb Use class provided stringFunnel
(edit) @31202   7 years davidb Turns out Spark uses Guava 14.0 not 20.0. Additional code to fill in …
(edit) @31201   7 years davidb Trigger serialization of whitelist in main program
(edit) @31200   7 years davidb Better output statement
(edit) @31199   7 years davidb Renaming of classname to reflect filename rename
(edit) @31198   7 years davidb File renaming to make way for newer version of classes needed in the …
(edit) @31197   7 years davidb File renaming to make way for newer version of classes needed in the …
(edit) @31196   7 years davidb File renaming to make way for newer version of classes needed in the …
(edit) @31195   7 years davidb File renaming to make way for newer version of classes needed in the …
(edit) @31194   7 years davidb Serialize in and out methods added
(edit) @31193   7 years davidb Peter's white-list file
(edit) @31184   7 years davidb New provision to run different main classes in _RUN.sh; New top-level …
(edit) @31183   7 years davidb Bump up to project using Java 1.8
(edit) @31177   7 years davidb Adding in Google jar that supports Bloom filters
(edit) @31176   7 years davidb Support added for producing whitelist word count
(edit) @31175   7 years davidb Trial to find memory difference between Hashmap and Bloom filters
(edit) @31174   7 years davidb One of the last scripts developed for getting ef dataset into HDFS
(edit) @31173   7 years davidb individual file sizes per top-level folder
(edit) @31172   7 years davidb to help track down missing files in HDFS copy
(edit) @31171   7 years davidb Util to help find where missing files are
(edit) @31170   7 years davidb Targeted sub-dir copy
(edit) @31169   7 years davidb Improved logic
(edit) @31161   7 years davidb Comparison of local disk version with HDFS version
(edit) @31152   7 years davidb Development of script
(edit) @31151   7 years davidb More nuanced version to help finish off the 'big put'
(edit) @31128   7 years davidb Some scripts to help with pushing and monitoring the progress of the …
(edit) @31112   7 years davidb To move out shards saved in /tmp on gsliscluster1 nodes to nema
(edit) @31106   7 years davidb Scripts to help run an rsync'd copy of gsliscluster1 …
(edit) @31105   7 years davidb Additional scripts to help with running solr locally out of /tmp area
(edit) @31104   7 years davidb now configurable to be run from local disk (/tmp)
(edit) @31103   7 years davidb Changes made after testing with 20 solr nodes
(edit) @31102   7 years davidb Command line way of running a Solr test query
(edit) @31101   7 years davidb Correction to collection name
(edit) @31100   7 years davidb Change to using solr-cloud-nodes that include port number
(edit) @31099   7 years davidb Changes resulting from test runs to get Zookeeper and Solr running on …
(edit) @31098   7 years davidb Changes resulting from test runs to get Zookeeper and Solr running on …
(edit) @31097   7 years davidb Changed to .in style name
(edit) @31096   7 years davidb Only need to create a volume's pages output directory if _output_dir …
(edit) @31095   7 years davidb Introduced num-partitions property
(edit) @31094   7 years davidb Changes triggered by running on gsliscluster1
(edit) @31093   7 years davidb Changes triggered by running on gsliscluster1
(edit) @31092   7 years davidb Minor tweak to spark/hadoop combination downloaded
(edit) @31091   7 years davidb Change of number of cores for 'gsliscluster1' machine; commented out …