Ticket #934 (closed defect: fixed)

Opened 3 months ago

Last modified 2 months ago

GLI Downloading panel HTTPS/SSL support and other wget changes

Reported by: ak19 Owned by: ak19
Priority: moderate Milestone:
Component: GLI Severity: enhancement
Keywords: Cc:

Description

GLI wasn't able to download from https URLs because there was no SSL support. Some other surrounding issues were identified and also needed fixing.

I didn't at first know there was no https support and that this was because wget hadn't been compiled with SSL. I only discovered this after the following two fixes, which allowed better debugging:

1. Adding shutdownhook for GS2 to ensure the GS2 server is stopped on Ctrl-C. Tested on Linux.

http://trac.greenstone.org/changeset/31813 - http://trac.greenstone.org/changeset/31815

2. http://trac.greenstone.org/changeset/31816

"Fix to getRedirectURL() being stuck forever if the proxy details are wrong. I still can't work out what my proxy settings should be to get downloading to work from a TSG-adminsitered machine, but at least now GLI doesn't hang requiring a ctrl-C to terminate it."

3. Compiling up wget on Unix with open SSL to support https URLs required upgrading our wget to a newer version (from 1.15 to 1.17.1) and also compiling up OpenSSL now, which we didn't do before. We also got a precompiled windows wget binary of 1.17.1, compiled with SSL support, upgrading from shipping a windows wget 1.11.x binary that didn't include SSL.

- GLI changes:

http://trac.greenstone.org/changeset/31820 - http://trac.greenstone.org/changeset/31823

- OpenSSL:

http://trac.greenstone.org/changeset/31824

http://trac.greenstone.org/changeset/31835 (for Mac)

- Wget upgrade and setting no-check-certificate using Wgetrc conf file, needed in newer wget versions compiled on Linux to be able to download from https urls that don't have valid site certificates:

http://trac.greenstone.org/changeset/31825 - http://trac.greenstone.org/changeset/31842 and http://trac.greenstone.org/changeset/31854

4. Fixes to get proxying to work and surrounding issues

Further modifications to GLI getRedirectURL()

- http://trac.greenstone.org/changeset/31843

New class URLConnectionManager to enable accessing web pages in "no-check-certificate" mode from Java code. Needed by getRedirectURL()

- getRedirectURL() phased out: http://trac.greenstone.org/changeset/31852

To get proxying to work on Windows and other issues:

http://trac.greenstone.org/changeset/31851

http://trac.greenstone.org/changeset/31853

http://trac.greenstone.org/changeset/31856, http://trac.greenstone.org/changeset/31857

5. no-check-certificate supported through GLI Preferences interface rather than wgetrc conf file:

http://trac.greenstone.org/changeset/31860

6. Changes to get helpful messages on connection failures to display.

Only for windows, proxy_on, proxy_port and proxy_host would be set (along with username and pwd), while on Linux none of these would be set as proxy settings were obtained by wget from env vars that were set up.

To get helpful messages to display, proxy_on, proxy_port and proxy_host needed to be setup regardless of the OS.

http://trac.greenstone.org/changeset/31861 - http://trac.greenstone.org/changeset/31865, http://trac.greenstone.org/changeset/31874 - http://trac.greenstone.org/changeset/31876

7. Shifting to Windows using proxy settings set up in the environment (as linux does) rather than passing proxy flags to wget.

http://trac.greenstone.org/changeset/31877, http://trac.greenstone.org/changeset/31878

8. Shifting from single to support for multiple proxy URLs: one each for HTTP, HTTPS and FTP.

http://trac.greenstone.org/changeset/31879 - http://trac.greenstone.org/changeset/31881

9. GLI informs user when existing proxy settings in environment (from before launching GLI) are overridden with custom proxy settings turned on and set up within GLI.

http://trac.greenstone.org/changeset/31882 - http://trac.greenstone.org/changeset/31884

Change History

Changed 3 months ago by ak19

  • status changed from new to closed
  • resolution set to fixed

There was still the outstanding problem that if the proxy is wrong, if the URL does not timeout (like  http://seaturtles.com), or if only the http proxy was set and the requested URL only mentions http as protocol but it turns out to be an https site and can't be retrieved because no https proxy was set, then perl could block forever waiting to read from wget's child processes. Nothing is sent back, because the site is inaccessible for some reason or nothing can be read. While setting the read-timeout flag on wget did resolve for a site like seaturtles.com, it wasn't the right solution and won't work in the other cases mentioned.

(1) The actual problem lay with how we were looping trying to process the iostreams of the child wget process launched with open3(). Blocking could always have happened since we weren't processing the streams with IO::Select's can_read() method, which would return the stream handles that were ready for processing, if any were ready.

In our case, blocking would always happen in the scenarios mentioned, since nothing could ever be read, so we needed a read timeout in perl. IO::Select's can_read(timeout) was the way to go about this: the code loops forever until it times out n times trying to read. And in that case, it terminates wget, assuming the URL couldn't be read or the proxy was wrong or something.

CHANGESET: http://trac.greenstone.org/changeset/31920

Note that using IO::Select with its can_read() is the way programmers are meant to go about working with open3(): you're supposed to use can_read() to check when the streams to the child process open3() launches are ready to read.

READING:

Switching to use IO::Select, which allows timeouts, instead of doing the potentially blocking operation "if defined(my $strLine=<$chld_out>)"

Google: perl open3 read timeout

Google: perl open3 select() example

 https://stackoverflow.com/questions/10029406/why-does-ipcopen3-get-deadlocked

 https://codereview.stackexchange.com/questions/84496/the-right-way-to-use-ipcopen3-in-perl

 https://gist.github.com/shalk/6988937

 https://stackoverflow.com/questions/18373500/how-to-check-if-command-executed-with-ipcopen3-is-hung

 http://perldoc.perl.org/IO/Select.html

 http://perldoc.perl.org/IPC/Open3.html - explains the need for select()/IO::Select with open3

 http://www.perlmonks.org/?node_id=951554

 http://search.cpan.org/~dmuey/IPC-Open3-Utils-0.91/lib/IPC/Open3/Utils.pm

 https://stackoverflow.com/questions/3000907/wget-not-behaving-via-ipcopen3-vs-bash?rq=1

(2) There was still a major problem to the above solution, since IO::Select doesn't work on Windows except when using Sockets, which we weren't using as we were simply reading from the child process' streams with regular filehandles.

Dr Bainbridge suggested trying to emulate the read with timeout using perl's alarm() feature. This worked successfully on Linux, but still failed on Windows (even though a simple test of alarm() made it look like it works on Windows, see http://trac.greenstone.org/browser/debug-testing/readWgetUsingAlarm/testing-alarm.pl ). The problem lies with the limitation of alarm() on Windows: it is implemented with polling and any system operations end up taking priority and can "turn off the alarm" before it goes off (probably, in reality this means that system level blocking operations prevent the alarm from getting polled on Windows). And that's exactly what happens on Windows when we set the alarm and try to read and the read blocks: the blocking is never interrupted by the alarm.

The WgetDownload?.pm version of the code using alarm() has been committed to the debug-testing area of GS SVN:

CHANGESETS:

particularly http://trac.greenstone.org/changeset/31925 and http://trac.greenstone.org/changeset/31928

but also http://trac.greenstone.org/changeset/31924 and http://trac.greenstone.org/changeset/31935

READING:

-  http://www.perlmonks.org/?node_id=1127475 - sample code with simple use of alarm() that works on Windows, but only because the operation we do is not a read by perl sleep() which allows polling to continue for alarm(). Less relevant:  http://www.perlmonks.org/?node_id=1127570

-  https://stackoverflow.com/questions/4517034/perl-set-read-timeout-in-client-socket

(3) The real solution lay in going back to (1) where IO::Select will work if using Sockets even if the OS is Windows. There was code online for using Sockets to work with the iostreams of the child process when a child process is launched with open3():

READING:

Google: perl open3 windows

-  http://www.perlmonks.org/?node_id=869942

-  http://www.perlmonks.org/?node_id=811650 - code that uses Sockets with open3

Though the above end up recommending that Perl's IPC::Run (not part of Core) already takes care of all that, there are Windows specific limitations to IPC::Run, see  http://search.cpan.org/~toddr/IPC-Run-0.96/lib/IPC/Run.pm#Win32_LIMITATIONS

So instead of using IPC::Run, we went with the code the provided above on how to use Sockets with the iostreams that are obtained with open3.

CHANGESET: http://trac.greenstone.org/changeset/31929

(4) A separate problem was that using double quotes in the command to launch with open3(), such as to embed the filepath to wget in double quotes, actually launches a subshell first and then tries to run the command. This has the peculiar effect in our case that the child process launched by perl (the subshell) is not the real parent of the wget subprocess that subsequently gets launched, such that sending the termination signal to the putative parent pid does not terminate its wget "child" process.

The solution lay in launching open3() correctly: instead of taking a string representing the command, it accepts an array of command parameters that represent the command and its arguments.

CHANGESET: http://trac.greenstone.org/changeset/31920

READING: Shouldn't use double quotes around wget path after all? See final comment at

-  http://www.perlmonks.org/?node_id=394709

-  http://coldattic.info/shvedsky/pro/blogs/a-foo-walks-into-a-bar/posts/63

Therefore, compose the command as an array rather than as a string, to preserve spaces in the filepath because single/double quotes using open3 seem to launch a subshell, see also final comment at  http://www.perlmonks.org/?node_id=394709 and that ends up causing problems in terminating wget, as 2 processes got launched then which don't have parent-child pid relationship (so that terminating one doesn't terminate the other).

(5) TESTING

Finally, tested on Windows, Linux and Mac Lion. Works!

Testing involved GLI Web Download, URL  http://englishhistory.net/tudor/citizens/. This actually resolves to an https URL.

- Tested that when the http and https proxies are set correctly, and no-check-certificate is ticked, that downloading works to termination. And also that downloading can be successfully cancelled.

- Tested that when no-check-certificate is not ticked, that when the download fails, it still suggests turning it on.

- Tested that when proxy details are unticked in Preferences > Connection, attempted download does not block forever, as it would have, but that wget is terminated after predefined n consecutive attempts time out.

- Tested that when proxy details are ticked, but only http proxy is set (or only http proxy is set correctly) that the attempted download does not block forever, but that wget is similarly terminated after n consecutive download attempts time out.

Similarly tested  http://seaturtles.com for n consecutive timeouts working to end the blocking, terminating wget.

Also tested that when a download looks to be blocking, it can be successfully manually cancelled from GLI before the timeouts cancel it.

Success means that wget is not still running in the background when it shouldn't be.

Changed 3 months ago by ak19

Finally, tested on Windows, Linux and Mac Lion. Works!

Mac Mountain Lion.

Changed 3 months ago by ak19

1. "This has the peculiar effect in our case that the child process launched by perl (the subshell) is not the real parent of the wget subprocess that subsequently gets launched, such that sending the termination signal to the putative parent pid does not terminate its wget "child" process."

That is, using the parent pid as groupid in sending the SIGTERM does not terminate the "child" as it's not really a child/not part of the group.

In perl:

kill(TERM, -grouppid)

kill(-TERM, parentpid)

should terminate the process with pid=grouppid/parentpid and all the child subprocesses thereof, whereas the regular kill(TERM, pid), which does not have a minus in front of the pid or the minus in front of the signal, only terminates the process with the matching pid. Refer to  https://perldoc.perl.org/functions/kill.html

2. "sample code with simple use of alarm() that works on Windows, but only because the operation we do is not a read but a perl sleep() which allows polling to continue for alarm()."

Changed 2 months ago by ak19

* http://trac.greenstone.org/changeset/31956 and http://trac.greenstone.org/changeset/31957

Fix to special behaviour of using open3 with IO::select() with sockets to pipes on Windows: this way of doing things on Windows results in wget successfully terminating not when sysread() returns a length of 0 but an undefined length. Generally this would be an error, but not in this situation. To test between an actual error situation and an acceptable instance where length is also undefined on Windows, added special code.

* http://trac.greenstone.org/changeset/31975

OAIDownload wasn't working anymore. This was because OAIDownload.pm was passing double quoted parameters into the wget command that open3() doesn't accept. Now WgetDownload?.pm::useWget() methods ensure that double quotes are removed. The solution was to use quotewords() instead of split() on space, since quotewords() would preserve spaces in double quoted strings while splitting on space otherwise and removing quotes (and unescaping double backslashes), to give us the actual wget cmd array we want.

Note: See TracTickets for help on using tickets.