Opened 10 years ago

Closed 9 years ago

#885 closed defect (fixed)

Run Solr 4.7.2 as a servlet inside tomcat, instead of building solr col launching jetty server

Reported by: ak19
Owned by: ak19
Priority: moderate
Milestone: 3.07 Release
Component: Greenstone3 Runtime
Severity: major
Keywords:
Cc:

Description

Running solr with GS3 server removes cores at ant stop, preventing successful Jetty server start

Running the Jetty server runs the local solr web interface. We use this to test new things we can try to do with solr. I think we can run the jetty server only when the GS3 server is stopped.

The buildcol process also runs the jetty server to ingest documents.

When we run the GS3 server and do a search in a solr collection, it adds cores to the file web/ext/solr/solr.xml. If we then stop the GS3 server, the cores are removed again from solr.xml. We can't successfully run the jetty server at this point, since it can't find the cores in solr.xml and we have to re-run the buildcol process on the collection to get them back.

However, if we run the GS3 server and stop it again immediately, the cores remain in the solr.xml file. At this point, we can run the jetty server.

What seems to be happening is a symmetric loading of solr cores on startup, and unloading of cores on stopping the GS3 server. See SolrSearch.java, configure() and cleanup() methods. The symmetry makes sense, but we need to investigate whether it is really necessary to remove the solr cores on shutdown, since this is what is preventing us from running the jetty server whenever we want. Does any part of the solr-related code, including activate.pl, actually require that the solr cores be removed? Start by looking at activate.pl, ext/solr/perllib/solrserver.pm, GS3 src code's gsdl3/service/SolrSearch.java

Change History (7)

comment:1 by ak19, 10 years ago

Summary: Running solr with GS3 server removes cores at ant stop preventing successful Jetty server start → Run Solr 4.7.2 as a servlet inside tomcat, instead of building solr col launching jetty server

The original title of this ticket was "Running solr with GS3 server removes cores at ant stop preventing successful Jetty server start"

The problem is actually different. In Solr 4.7.2, the jetty server launched by buildcol to build the solr collection interferes with the GS3 tomcat server if this is already running: an "index locked" SOLR exception is thrown. This wasn't a problem in Solr for Lucene 3.3, because the jetty server could run independently of a running GS3 server and access the same Solr index.

The problem is described in detail in the email to Dr Bainbridge "Testing solrbuilder URL commands against a running Jetty Solr server" of 19/08/14 21:32

Dr Bainbridge's suggested solution is in the email "Re: Testing solrbuilder URL commands against a running Jetty Solr server" of 19/08/14 23:03, as follows:

"Have only made a very quick scan through the details, but there are some very useful insights in this. My first response is that I think we should think fairly seriously about merging (in some way) what we currently run as the 'jetty' servlet for Solr in with the Tomcat one. That way we would only ever be running one server. I'm optimistic that the merging could be mostly achieved by putting the solr.jar file (or what ever it is that jetty likes) into the tomcat servlet area, so things like 'ant start' and 'ant stop' make both a localhost:8383/greenstone/... URL valid, but also something like localhost:8383/solr/...

Then we could do something like:

  1. Refactor the solrserver.pm and related Perl code to start and stop Greenstone 3 as a result of testing if a server is running or not (where it used to start and stop jetty).
  2. Rewrite the Greenstone 3 Java Solr code so that instead of working with SolrCore, EmbeddedSolrServer etc, it opens URL connections (to localhost:8383/solr), reads back the XML syntax returned, which it then turns into what Greenstone 3 is looking for.

Step 2 is less clear in my mind, so we might want to read around the subject a bit in terms of what exactly Solr's Java classes can do, and so what is the simplest way to code up what we need."

For step 1, see https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+Tomcat

Step 2 would require changing over from using an EmbeddedSolrServer in the Java code to HttpSolrServer (GS2SolrSearch.java in GS3/src/java and SolrQueryWrapper.java in ext/solr/src).

The solr class EmbeddedSolrServer has a function that returns the CoreContainer, from which the solr cores can be obtained. (The class SolrDispatchFilter also provides access to the CoreContainer of a running solr server). However, the HttpSolrServer java class does not give access to the CoreContainer, so we need to code things differently. Our own SolrQueryWrapper.getTerms() Greenstone function obtains the EmbeddedSolrServer to work out the term frequency of search terms in the index and documents returned. So we can't go this route with the HttpSolrServer.

Dr Bainbridge suggests that on the solr side, the java code (in one of the jars) must be using an EmbeddedSolrServer. And if on the solr server side we were to modify the code to work out the term frequencies, then on our SolrQueryWrapper side, we can send off a request over http to the running solr servlet and request the term frequencies.
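A minimal sketch of the "open a URL connection, read back the XML" route described above, using only the JDK rather than SolrJ. The class and method names, the core name `localsite-demo-didx` and the `id` field are hypothetical placeholders, and the sample response only imitates the general shape of Solr's XML output; none of this is taken from the Greenstone code.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Greenstone or Solr.
class SolrHttpSketch {

    // Builds a select URL for one core, asking Solr for an XML response.
    static String buildSelectUrl(String baseUrl, String core, String query) {
        try {
            return baseUrl + "/" + core + "/select?q="
                    + URLEncoder.encode(query, "UTF-8") + "&wt=xml";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Collects the value of each <str name="id"> found inside a <doc> element.
    // Assumption: the unique key field is called "id".
    static List<String> parseDocIds(InputStream xml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(xml);
            List<String> ids = new ArrayList<>();
            NodeList docs = dom.getElementsByTagName("doc");
            for (int i = 0; i < docs.getLength(); i++) {
                NodeList fields = ((Element) docs.item(i)).getElementsByTagName("str");
                for (int j = 0; j < fields.getLength(); j++) {
                    Element field = (Element) fields.item(j);
                    if ("id".equals(field.getAttribute("name"))) {
                        ids.add(field.getTextContent());
                    }
                }
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse Solr response", e);
        }
    }

    public static void main(String[] args) {
        // Host and port as mentioned in the ticket; the core name is made up.
        System.out.println(buildSelectUrl(
                "http://localhost:8383/solr", "localsite-demo-didx", "mouse"));

        // Canned response so the sketch runs without a server.
        String sample = "<response><result name=\"response\" numFound=\"1\" start=\"0\">"
                + "<doc><str name=\"id\">HASH0001</str></doc></result></response>";
        System.out.println(parseDocIds(
                new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Against a running servlet, the stream passed to `parseDocIds` would come from `new java.net.URL(url).openStream()` instead of the canned string. Term frequencies could travel back the same way once the server side returns them.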

The sequence is:

  1. Get solr running as a servlet in tomcat instead of the jetty server being used to run solr. solrbuilder.pm and solrserver.pm need to be modified to stop and start the tomcat server (only if the GS3 server is not already running in the background, else we leave it running) rather than stopping and starting the jetty server.
  2. See if our java code's use of solr's EmbeddedSolrServer class now works. Dr Bainbridge thinks it still may be a problem and the solr indexes may still be locked since there may be separate solr instances trying to access the index. If there is still a locking problem, the next step is to switch over to using HttpSolrServer.

Test that the solr collections rebuild with -activate, with both the GS3 server already running and not running. Test that after rebuild searching still works. In particular, when the GS3 server is already running, test that after rebuilding with the word "mouse" in a document that never contained that word, the modified document now shows up in the search results when searching at document (not section) level for "mouse".

  3. Modify the solr side (some jar files perhaps) to return term frequency information, so that our SolrQueryWrapper.java's getTerms() and runQuery() functions can use this instead.
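The "start the server only if it is not already running, and stop it only if we started it" rule from the first item of the sequence can be sketched as follows. The real logic belongs in the Perl modules solrbuilder.pm and solrserver.pm; this Java version is only an illustration, and the class name, method names and the socket-probe approach are assumptions rather than what those modules do.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical illustration; the actual code lives in Perl (solrserver.pm).
class ServerGuardSketch {

    // Returns true if something accepts TCP connections on host:port.
    static boolean isListening(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Runs the build. The server is started first only when it was not
    // already running, and stopped afterwards only if this call started it.
    // Returns true if the server was started here.
    static boolean buildWithServer(boolean alreadyRunning,
                                   Runnable start, Runnable build, Runnable stop) {
        boolean startedHere = !alreadyRunning;
        if (startedHere) {
            start.run();
        }
        try {
            build.run();
        } finally {
            if (startedHere) {
                stop.run();
            }
        }
        return startedHere;
    }

    public static void main(String[] args) {
        // Port 8383 is the one mentioned in the ticket discussion.
        boolean running = isListening("localhost", 8383, 1000);
        buildWithServer(running,
                () -> System.out.println("starting tomcat"),
                () -> System.out.println("building solr collection"),
                () -> System.out.println("stopping tomcat"));
    }
}
```

Keeping the running check separate from the start/build/stop decision means a server that was already up before the build is left up afterwards, which is the behaviour the first item asks for.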

comment:2 by ak19, 10 years ago

For the upcoming GS3 workshop, some fixes have been made to GLI to deal with solr collections. Among them is a temporary fix whereby GLI now stops the GS3 server before building a SOLR collection and then restarts the GS3 server after the solr collection has been built. The other fix to GLI is permanent, and concerns the preservation of solr-specific elements in the collectionConfig.xml. In the past, if you had opened a solr collection in GLI, GLI would have clobbered all the solr-specific elements in the collectionConfig.xml, such as <facet>, <solr> and the newly-introduced <option> subelement of <index>.

  1. http://trac.greenstone.org/changeset/29221

Added code to allow GLI to preserve any solr-specific <sort> and <facet> subelements of <search> if these were manually-added to a GS3 collectionConfig.xml

Also see http://trac.greenstone.org/changeset/29177

GLI now preserves the newly added (optional) option subelements of collectionConfig.xml's index element. This is only used for solr collections at present when the user hand-edits collectionConfig.xml and specifies the solr field type (option-name solrfieldtype) for an index other than the default text_en_splitting. E.g. type text_es for index allfields.

(Default is text_en_splitting)

  2. TEMPORARY: http://trac.greenstone.org/changeset/29222

SOLR related. TEMPORARY changes for the GS3 workshop. Owing to the change to Solr 4.7.2, solr collections can't be rebuilt with activate while the GS3 server is running, because the jetty server launched by buildcol conflicts with it and finds a lock on the index. The result is that one can't search the solr index after such a rebuild. Dr Bainbridge suggested a temporary measure: instead of commandline building solr collections, we will now build them in GLI. GLI will build solr collections with activate on but, for solr collections alone, it will stop the GS3 server before a build and start it again upon completion. In future, we will get rid of the solr jetty server and just have solr running over HTTP from tomcat. The Java GS3 runtime code will have to access Solr as an HttpSolrServer rather than as an EmbeddedSolrServer at that point.

comment:3 by ak19, 10 years ago

Lucene/Solr upgrade from version 3.3 to 4.7.2 involved commit revisions between 29133 of 16.07.2014 and 29228 of 21.08.2014

comment:5 by ak19, 9 years ago

Making Solr run off the tomcat server, rather than using the jetty server included with solr:

(http://trac.greenstone.org/changeset/29708, http://trac.greenstone.org/changeset/29709, http://trac.greenstone.org/changeset/29710)

MAJOR CHANGES:

http://trac.greenstone.org/changeset/29711

A bug still remains and is visible after rebuilding a solr or even lucene collection, where the 2nd page of search results is empty unless the server is restarted.

http://trac.greenstone.org/changeset/29714

http://trac.greenstone.org/changeset/29722

http://trac.greenstone.org/changeset/29749

http://trac.greenstone.org/changeset/29751

Limiting access to the /solr servlet, changes for Linux then Windows:

http://trac.greenstone.org/changeset/29723

http://trac.greenstone.org/changeset/29754

To work with commit 29687 where web.xml was split into web.xml and servlets.xml and the latter's contents were being included into web.xml as an entity, which broke xml parsing in gs3-server when viewing File > Settings and on GLI startup:

http://trac.greenstone.org/changeset/29722

http://trac.greenstone.org/changeset/29728

http://trac.greenstone.org/changeset/29729

http://trac.greenstone.org/changeset/29730

http://trac.greenstone.org/changeset/29752

http://trac.greenstone.org/changeset/29753

comment:6 by ak19, 9 years ago

Successfully moved the ext/solr SolrQueryWrapper.getTerms() methods to the solr server side.

http://trac.greenstone.org/changeset/29986

The getTerms() functionality previously used by the EmbeddedSolrServer has now been re-implemented for HttpSolrServer with the new custom Greenstone Solr RequestHandler class Greenstone3SearchHandler, which lives on the solr server side, in tomcat's solr webapp. The functionality has been improved, such as being able to search for: econom* cat, by recursively calling setRewriteMethods on any PrefixQuery and WildcardQuery MultiQueries within an overall BooleanQuery, and by handling BooleanQuery.TooManyClauses exceptions when the number of expanded terms is too large, such as for a search of a*.
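To illustrate the idea of expanding a prefix term and coping with too many expansions, here is a JDK-only simulation. It does not use Lucene and is not the Greenstone3SearchHandler code: the class, the exception type, the clause limit and the fallback (returning the unexpanded pattern) are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical simulation; stands in for Lucene's query rewriting.
class PrefixExpansionSketch {

    // Plays the role that BooleanQuery.TooManyClauses plays in Lucene.
    static class TooManyClausesException extends RuntimeException {
        TooManyClausesException(String prefix, int maxClauses) {
            super("'" + prefix + "*' expands to more than " + maxClauses + " terms");
        }
    }

    // Expands "prefix*" against the sorted index terms, failing once the
    // number of matching terms exceeds maxClauses.
    static List<String> expandPrefix(NavigableSet<String> indexTerms,
                                     String prefix, int maxClauses) {
        List<String> expanded = new ArrayList<>();
        // Terms sharing a prefix are contiguous in sorted order.
        for (String term : indexTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (expanded.size() == maxClauses) {
                throw new TooManyClausesException(prefix, maxClauses);
            }
            expanded.add(term);
        }
        return expanded;
    }

    // Assumed fallback: report the unexpanded pattern instead of failing.
    static List<String> expandOrFallBack(NavigableSet<String> indexTerms,
                                         String prefix, int maxClauses) {
        try {
            return expandPrefix(indexTerms, prefix, maxClauses);
        } catch (TooManyClausesException e) {
            return Collections.singletonList(prefix + "*");
        }
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(Arrays.asList(
                "cat", "econometrics", "economic", "economy", "zebra"));
        System.out.println(expandOrFallBack(terms, "econom", 10));
        System.out.println(expandOrFallBack(terms, "e", 2));
    }
}
```

With a small limit, a broad pattern such as `e*` takes the fallback path, while `econom*` is expanded into its individual terms so that per-term frequencies could be reported.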

comment:7 by ak19, 9 years ago

Resolution: fixed
Status: new → closed