root/main/trunk/greenstone2/perllib/plugins/GreenstoneSQLPlugin.pm @ 32592

Revision 32592, 23.3 KB (checked in by ak19, 20 months ago)

Renamed gssql.pm to gsmysql.pm. Not subclassing the old gssql into gsmysql yet, as there's the complex issue of sighandlers, the static singleton method _get_connection_instance(), the singleton variable _db_instance and its use in the sighandlers and DESTROY, and how all of this can be impacted when making them part of an inheritance chain. Not sure of the best way to structure inheritance around these things. Even if rollback_on_cancel ends up unnecessary, the singleton method _get_connection_instance and singleton object _db_instance still impact decisions around inheritance.

###########################################################################
#
# GreenstoneSQLPlugin.pm -- reads into doc_obj from SQL db and docsql.xml
# Metadata and/or fulltext are stored in SQL db, the rest may be stored in
# the docsql .xml files.
# A component of the Greenstone digital library software
# from the New Zealand Digital Library Project at the
# University of Waikato, New Zealand.
#
# Copyright (C) 2001 New Zealand Digital Library Project
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
###########################################################################

package GreenstoneSQLPlugin;


use strict;
no strict 'refs'; # allow filehandles to be variables and vice versa

use DBI;
use docprint; # for new unescape_text() subroutine
use GreenstoneXMLPlugin;
use gsmysql;


# TODO:
# - Run TODOs here, in Plugout and in gsmysql.pm by Dr Bainbridge.
# - Have not yet tested writing out just meta or just fulltxt to sql db and reading just that
# back in from the sql db while the remainder is to be read back in from the docsql .xml files.

# + TODO: Add public instructions on using this plugin and its plugout: start with installing mysql binary, changing pwd, running the server (and the client against it for checking: basic cmds like create and drop). Then discuss db name, table names (per coll), db cols and col types, and how the plugout and plugin work.
# Discuss the plugin/plugout parameters.

# TODO, test on windows and mac.
# Note: if parsing fails (e.g. using wrong plugout like GS XML plugout, which chokes on args intended for SQL plugout) then SQL plugin init would have already been called and done connection, but disconnect would not have been done because SQL plugin disconnect would not have been called upon parse failure.

# DONE:
# + TODO: For on cancel, add a SIGTERM handler or so to call end()
# or to explicitly call gs_sql->close_connection if $gs_sql def
#
# + TODO: Incremental delete can't work until GSSQLPlugout has implemented build_mode = incremental
# (instead of tossing away db on every build)
# + Ask about docsql naming convention adopted to identify OID. Better way?
# collection names -> table names: it seems hyphens not allowed. Changed to underscores.
# + Startup parameters (except removeold/build_mode)
# + how do we detect we're to do removeold during plugout in import.pl phase
# + incremental building: where do we need to add code to delete rows from our sql table after
# incrementally importing a coll with fewer docs (for instance)? What about deleted/modified meta?
# + Ask if I can assume that all SQL dbs (not just MySQL) will preserve the order of inserted nodes
# (sections) which in this case had made it easy to reconstruct the doc_obj in memory in the correct order.
# YES: Otherwise for later db types (drivers), can set order by primary key column and then order by did column
# + NOTTODO: when db is not running GLI is paralyzed -> can we set timeout on DBI connection attempt?
#   NOT A PROBLEM: Tested to find DBI connection attempt fails immediately when MySQL server not
# running. The GLI "paralyzing" incident last time was not because of the gs sql connection code,
# but because my computer was freezing on-and-off.
# + "Courier" demo documents in lucene-sql collection: character (degree symbol) not preserved in title. Is this because we encode in utf8 when putting into db and reading back in?
# Test doc with meta and text like macron in Maori text.
# + TODO Q: During import, the GS SQL Plugin is called before the GS SQL Plugout with undesirable side
# effect that if the db doesn't exist, gsmysql::use_db() fails, as it won't create db.
#   This got fixed when GSSQLPlugin stopped connecting on init().
#
#
# + TODO: deal with incremental vs removeold. If docs removed from import folder, then import step
# won't delete it from archives but buildcol step will. Need to implement this with this database plugin or wherever the actual flow is.
#
# + TODO Q: is "reindex" = del from db + add to db?
# - is this okay for reindexing, or will it need to modify existing values (update table)
# - if it's okay, what does reindex need to accomplish (and how) if the OID changes because hash id produced is different?
# - delete is accomplished in GS SQL Plugin, during buildcol.pl. When should reindexing take place?
# during SQL plugout/import.pl or during plugin? If adding is done by GSSQLPlugout, does it need to
# be reimplemented in GSSQLPlugin to support the adding portion of reindexing.
#
# INCREMENTAL REBUILDING IMPLEMENTED CORRECTLY AND WORKS:
# Overriding plugins' remove_all() method covered removeold.
# Overriding plugins' remove_one() method is all I needed to do for reindex and deletion
# (incremental and non-incremental) to work.
# but doing all this needed an overhaul of gsmysql.pm and its use by the GS SQL plugin and plugout.
# - needed to correct plugin.pm::remove_some() to process all files
# - and needed to correct GreenstoneSQLPlugin::close_document() to setOID() after all
# All incremental import and buildcol worked after that:
# - deleting files and running incr-import and incr-buildcol (= "incr delete"),
# - deleting files and running incr-import and buildcol (="non-incr delete")
# - modifying meta and doing an incr rebuild
# - modifying fulltext and doing an incr rebuild
# - renaming a file forces a reindex: doc is removed from db and added back in, due to remove_one()
# - tested CSV file: adding some records, changing some records
#    + CSVPlugin test (collection csvsql)
#    + MetadataCSVPlugin test (modified collection sqltest to have metadata.csv refer to the
#      filenames of sqltest's documents)
#    + shared image test (collection shareimg): if 2 html files reference the same image, the docs
#      are indeed both reindexed if the image is modified (e.g. I replaced the image with another
#      of the same name) which in the GS SQL plugin/plugout case is that the 2 docs are deleted
#      and added in again.

########################################################################################

# GreenstoneSQLPlugin inherits from GreenstoneXMLPlugin so that if meta or fulltext
# is still written out to doc.xml (docsql .xml), that will be processed as usual,
# whereas GreenstoneSQLPlugin will process all the rest (full text and/or meta, whichever
# is written out by GreenstoneSQLPlugout into the SQL db).


sub BEGIN {
    @GreenstoneSQLPlugin::ISA = ('GreenstoneXMLPlugin');
}

# This plugin must be in the document plugins pipeline IN PLACE OF GreenstoneXMLPlugin
# So we won't have a process exp conflict here.
# The structure of docsql.xml files is identical to doc.xml and the contents are similar except:
#   - since metadata and/or fulltxt are stored in mysql db instead, just XML comments indicating
#   this are left inside docsql.xml within the <Description> (for meta) and/or <Content> (for txt)
#   - the root element Archive now has a docoid attribute: <Archive docoid="OID">
sub get_default_process_exp {
    my $self = shift (@_);

    return q^(?i)docsql(-\d+)?\.xml$^; # regex based on this method in GreenstoneXMLPlugin
}

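# Illustrative sketch only (not part of the original plugin, and left commented out so that
# loading this module behaves exactly as before): which file names the default process_exp
# above accepts. The sample file names are assumptions chosen for the demonstration.
# Paste the indented lines into a standalone script to run them.
#
#   use strict; use warnings;
#   my $process_exp = q^(?i)docsql(-\d+)?\.xml$^;
#   foreach my $file ("docsql.xml", "docsql-12.xml", "DOCSQL.XML", "doc.xml") {
#       my $verdict = ($file =~ /$process_exp/) ? "processed" : "skipped";
#       print "$file: $verdict\n";
#   }
#
# Of these four names only doc.xml is skipped: the match is case-insensitive and allows an
# optional "-<digits>" suffix, but it always requires the "docsql" stem.
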
my $process_mode_list =
    [ { 'name' => "meta_only",
        'desc' => "{GreenstoneSQLPlug.process_mode.meta_only}" },
      { 'name' => "text_only",
        'desc' => "{GreenstoneSQLPlug.process_mode.text_only}" },
      { 'name' => "all",
        'desc' => "{GreenstoneSQLPlug.process_mode.all}" } ];

my $rollback_on_cancel_list =
    [ { 'name' => "true",
        'desc' => "{GreenstoneSQLPlug.rollback_on_cancel}" },
      { 'name' => "false",
        'desc' => "{GreenstoneSQLPlug.rollback_on_cancel}" } ];

# TODO: If subclassing gsmysql for other supported databases and if they have different required
# connection parameters, we can check how WordPlugin, upon detecting Word is installed,
# dynamically loads Word specific configuration options.
my $arguments =
    [ { 'name' => "process_exp",
        'desc' => "{BaseImporter.process_exp}",
        'type' => "regexp",
        'deft' => &get_default_process_exp(),
        'reqd' => "no" },
      { 'name' => "process_mode",
        'desc' => "{GreenstoneSQLPlug.process_mode}",
        'type' => "enum",
        'list' => $process_mode_list,
        'deft' => "all",
        'reqd' => "no"},
      { 'name' => "rollback_on_cancel",
        'desc' => "{GreenstoneSQLPlug.rollback_on_cancel}",
        'type' => "enum",
        'list' => $rollback_on_cancel_list,
        'deft' => "false", # better default than true
        'reqd' => "no",
        'hiddengli' => "no"},
      { 'name' => "db_driver",
        'desc' => "{GreenstoneSQLPlug.db_driver}",
        'type' => "string",
        'deft' => "mysql",
        'reqd' => "yes"},
      { 'name' => "db_client_user",
        'desc' => "{GreenstoneSQLPlug.db_client_user}",
        'type' => "string",
        'deft' => "root",
        'reqd' => "yes"},
      { 'name' => "db_client_pwd",
        'desc' => "{GreenstoneSQLPlug.db_client_pwd}",
        'type' => "string",
        'deft' => "",
        'reqd' => "no"}, # pwd not required: can create mysql accounts without pwd
      { 'name' => "db_host",
        'desc' => "{GreenstoneSQLPlug.db_host}",
        'type' => "string",
        'deft' => "127.0.0.1",
        'reqd' => "yes"},
      { 'name' => "db_port",
        'desc' => "{GreenstoneSQLPlug.db_port}",
        'type' => "string", # NOTE: make this int? No default for port, since it's not a required connection param
        'reqd' => "no"}
    ];

my $options = { 'name'     => "GreenstoneSQLPlugin",
                'desc'     => "{GreenstoneSQLPlugin.desc}",
                'abstract' => "no",
                'inherits' => "yes",
                'args'     => $arguments };


###### Methods called during buildcol and import #######

sub new {
    my ($class) = shift (@_);
    my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
    push(@$pluginlist, $class);

    push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
    push(@{$hashArgOptLists->{"OptList"}},$options);

    my $self = new GreenstoneXMLPlugin($pluginlist, $inputargs, $hashArgOptLists);


    #return bless $self, $class;
    $self = bless $self, $class;
    if ($self->{'info_only'}) {
        # If running pluginfo, we don't need to go further.
        return $self;
    }

    # do anything else that needs to be done here when not pluginfo

    return $self;
}

# GS SQL Plugin::init() (and deinit()) is called by import.pl and also by buildcol.pl
# This means it connects and disconnects during import.pl as well. This is okay
# as removeold, which should drop the collection tables, happens during the import phase,
# calling GreenstoneSQLPlugin::remove_all(), and therefore also requires a db connection.
# + TODO: Eventually can try moving get_gssql_instance into gsmysql.pm? That way both GS SQL Plugin
# and Plugout would be using one connection during import.pl phase when both plugs exist.

# Call init() not begin() because there can be multiple plugin passes and begin() is called for
# each pass (one for doc level and another for section level indexing), whereas init() should
# be called before any and all passes.
# This way, we can connect to the SQL database once per buildcol run.
sub init {
    my ($self) = shift (@_);
    ##print STDERR "@@@@@@@@@@ INIT CALLED\n";

    $self->SUPER::init(@_); # super (GreenstoneXMLPlugin) will not yet be trying to read from doc.xml (docsql .xml) files in init().

    ####################
#    print "@@@ SITE NAME: ". $self->{'site'} . "\n" if defined $self->{'site'};
#    print "@@@ COLL NAME: ". $ENV{'GSDLCOLLECTION'} . "\n";

#    print STDERR "@@@@ db_pwd: " . $self->{'db_client_pwd'} . "\n";
#    print STDERR "@@@@ user: " . $self->{'db_client_user'} . "\n";
#    print STDERR "@@@@ db_host: " . $self->{'db_host'} . "\n";
#    print STDERR "@@@@ db_driver: " . $self->{'db_driver'} . "\n";
    ####################

    # create gsmysql object.
    # collection name will be used for naming tables (site name will be used for naming database)
    my $gs_sql = new gsmysql({
        'collection_name' => $ENV{'GSDLCOLLECTION'},
        'verbosity' => $self->{'verbosity'} || 0
    });

    # if autocommit is set, there's no rollback support
    my $autocommit = ($self->{'rollback_on_cancel'} eq "false") ? 1 : 0;

    # try connecting to the mysql db, die if that fails
    if(!$gs_sql->connect_to_db({
        'db_driver' => $self->{'db_driver'},
        'db_client_user' => $self->{'db_client_user'},
        'db_client_pwd' => $self->{'db_client_pwd'},
        'db_host' => $self->{'db_host'},
        'db_port' => $self->{'db_port'}, # undef by default, can leave as is
        'autocommit' => $autocommit
       })
    )
    {
        # This is fatal for the plugin, let's terminate here
        # PrintError would already have displayed the warning message on connection fail
        die("Could not connect to db. Can't proceed.\n");
    }

    my $db_name = $self->{'site'} || "greenstone2"; # one database per GS3 site, for GS2 the db is called greenstone2

    # Attempt to use the db, create it if it doesn't exist (but don't create the tables yet)
    # Bail if we can't use the database
    if(!$gs_sql->use_db($db_name)) {

        # This is fatal for the plugin, let's terminate here after disconnecting again
        # PrintError would already have displayed the warning message on load fail
        # And on die() perl will call gsmysql's DESTROY, which will ensure a disconnect() from db
        #$gs_sql->force_disconnect_from_db();
        die("Could not use db $db_name. Can't proceed.\n");
    }


    # store db handle now that we're connected
    $self->{'gs_sql'} = $gs_sql;
}


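# Illustrative sketch only (commented out; not part of the original plugin). gsmysql's internals
# are not shown in this file, so the snippet below talks to DBI directly to show what the
# autocommit flag computed in init() means at the DBI level. The DSN, the user name, the empty
# password and the variable names are all assumptions for the demonstration, and running it
# needs DBD::mysql plus a reachable MySQL server.
#
#   use strict; use warnings;
#   use DBI;
#   my $rollback_on_cancel = "false";    # the plugin's default for this parameter
#   my $autocommit = ($rollback_on_cancel eq "false") ? 1 : 0;
#   my $dbh = DBI->connect("dbi:mysql:host=127.0.0.1", "root", "",
#                          { AutoCommit => $autocommit, PrintError => 1, RaiseError => 0 })
#       or die("Could not connect to db: $DBI::errstr\n");
#   # With AutoCommit off, nothing is permanent until commit(); rollback() undoes pending work.
#   # With AutoCommit on, each statement is committed at once and rollback() is ineffective.
#   $dbh->commit() unless $autocommit;
#   $dbh->disconnect();
#
# This is why rollback_on_cancel = "true" has to translate into autocommit being switched off.
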
# This method also runs on import.pl if gs_sql has a value. But we just want to run it on buildcol
# Call deinit() not end() because there can be multiple plugin passes:
# one for doc level and another for section level indexing
# and deinit() should be called after all passes
# This way, we can close the SQL database once per buildcol run.
sub deinit {
    my ($self) = shift (@_);

    ##print STDERR "@@@@@@@@@@ GreenstoneSQLPlugin::DEINIT CALLED\n";

    if($self->{'gs_sql'}) {

        # Important to call finished():
        # it will disconnect from db if this is the last gsmysql instance,
        # and it will commit to db before disconnecting if rollback_on_cancel is turned on
        $self->{'gs_sql'}->finished();

        # Clear gs_sql (setting key to undef has a different meaning from deleting:
        # undef makes key still exist but its value is undef whereas delete deletes the key)
        # So all future use has to make the connection again
        delete $self->{'gs_sql'};
    }

    $self->SUPER::deinit(@_);
}



###### Methods only called during import.pl #####

# This is called once if removeold is set with import.pl. Most plugins will do
# nothing but if a plugin does any stuff outside of creating doc obj, then
# it may need to clear something.
# In the case of GreenstoneSQL plugs: this is the first time we have a chance
# to purge the tables of the current collection from the current site's database
sub remove_all {
    my $self = shift (@_);
    my ($pluginfo, $base_dir, $processor, $maxdocs) = @_;

    $self->SUPER::remove_all(@_);

    print STDERR "   Building with removeold option set, so deleting current collection's tables if they exist\n" if($self->{'verbosity'});

    # if we're in here, we'd already have run 'use database <site>;' during sub init()
    # so we can go ahead and delete the collection's tables
    my $gs_sql = $self->{'gs_sql'};
    $gs_sql->delete_collection_tables(); # will delete them if they exist

    # Recreate the tables needed for the current process_mode.
    # (Tables' existence is also ensured in GreenstoneSQLPlugout::begin().)
    my $proc_mode = $self->{'process_mode'};
    if($proc_mode ne "text_only") {
        $gs_sql->ensure_meta_table_exists();
    }
    if($proc_mode ne "meta_only") {
        $gs_sql->ensure_fulltxt_table_exists();
    }

}

# This is called during import.pl per document for docs that have been deleted from the
# collection. Most plugins will do nothing
# but if a plugin does any stuff outside of creating doc obj, then it may need
# to clear something.
# remove the doc(s) denoted by oids from GS SQL db
# This takes care of incremental deletes (docs marked D by ArchivesInfPlugin when building
# incrementally) as well as cases of "Non-incremental Delete", see ArchivesInfPlugin.pm
sub remove_one {
    my $self = shift (@_);

    my ($file, $oids, $archivedir) = @_;

    my $rv = $self->SUPER::remove_one(@_);

    print STDERR "@@@ IN SQLPLUG::REMOVE_ONE: $file\n";

    #return undef unless $self->can_process_this_file($file); # NO, DON'T DO THIS (inherited remove_one behaviour) HERE:
           # WE DON'T CARE IF IT'S AN IMAGE FILE THAT WAS DELETED.
           # WE CARE ABOUT REMOVING THE DOC_OID OF THAT IMAGE FILE FROM THE SQL DB
           # SO DON'T RETURN IF CAN'T_PROCESS_THIS_FILE


    my $gs_sql = $self->{'gs_sql'} || return 0; # couldn't make the connection or no db etc

    print STDERR "*****************************\nAsked to remove_one oid\n***********************\n";
    print STDERR "Num oids: " . scalar (@$oids) . "\n";

    my $proc_mode = $self->{'process_mode'};
    foreach my $oid (@$oids) {
        if($proc_mode eq "all" || $proc_mode eq "meta_only") {
            print STDERR "@@@@@@@@ Deleting $oid from meta table\n" if $self->{'verbosity'} > 2;
            $gs_sql->delete_recs_from_metatable_with_docid($oid);
        }
        if($proc_mode eq "all" || $proc_mode eq "text_only") {
            print STDERR "@@@@@@@@ Deleting $oid from fulltxt table\n" if $self->{'verbosity'} > 2;
            $gs_sql->delete_recs_from_texttable_with_docid($oid);
        }
    }
    return $rv;
}

##### Methods called only during buildcol #####

sub xml_start_tag {
    my $self = shift(@_);
    my ($expat, $element) = @_;

    my $outhandle = $self->{'outhandle'};

    $self->{'element'} = $element;
    if ($element eq "Archive") { # docsql.xml files contain a docoid attribute on the Archive element
        # the element's attributes are in %_ as per ReadXMLFile::xml_start_tag() (while $_
        # contains the tag)

        # Don't access %_{'docoid'} directly: keep getting a warning message to
        # use $_{'docoid'} for scalar contexts, but %_ is the element's attr hashmap
        # whereas $_ has the tag info. So we don't want to do $_{'docoid'}.
        my %attr_hash = %_; # right way, see OAIPlugin.pm
        $self->{'doc_oid'} = $attr_hash{'docoid'};
        print $outhandle "Extracted OID from docsql.xml: ".$self->{'doc_oid'}."\n"
            if $self->{'verbosity'} > 2;

    }
    else { # let superclass GreenstoneXMLPlugin continue to process <Section> and <Metadata> elements
        $self->SUPER::xml_start_tag(@_);
    }
}

# There are multiple passes processing the document (see buildcol's mode parameter description):
# - compressing the text which may be a dummy pass for lucene/solr, wherein they still want the
# docobj for different purposes,
# - the pass(es) for indexing, e.g. doc/didx and section/sidx level passes
# - and an infodb pass for processing the classifiers. This pass too needs the docobj
# Since all passes need the doc_obj, all are read in from docsql + SQL db into the docobj in memory

# We should only ever get here during the buildcol.pl phase
# At the end of superclass GreenstoneXMLPlugin.pm's close_document() method,
# the doc_obj in memory is processed (indexed) and then made undef.
# So we have to work with doc_obj before superclass close_document() is finished.
sub close_document {
    my $self = shift(@_);

    ##print STDERR "XXXXXXXXX in SQLPlugin::close_doc()\n";

    my $gs_sql = $self->{'gs_sql'};

    my $outhandle = $self->{'outhandle'};
    my $doc_obj = $self->{'doc_obj'};

    my $oid = $self->{'doc_oid'}; # we stored current doc's OID during sub xml_start_tag()
    my $proc_mode = $self->{'process_mode'};

    # For now, we have access to doc_obj (until just before super::close_document() terminates)

    # The OID parsed from the docsql.xml file does need to be set on $doc_obj, as noticed in this case:
    # when a doc in import is renamed, and you do incremental import, it is marked for reindexing
    # (reindexing is implemented by this plugin as a delete followed by add into the sql db).
    # In that case, UNLESS you set the OID at this stage, the old deleted doc id (for the old doc
    # name) continues to exist in the index at the end of incremental rebuilding if you were to
    # browse the rebuilt collection by files/titles. So unless you set the OID here, the deleted
    # doc oids will still be listed in the index.
    $self->{'doc_obj'}->set_OID($oid);

    print STDERR "   GreenstoneSQLPlugin processing doc $oid (reading into docobj from SQL db)\n"
        if $self->{'verbosity'};

    if($proc_mode eq "all" || $proc_mode eq "meta_only") {
        # read in meta for the document (i.e. select * from <col>_metadata table)

        my $records = $gs_sql->select_from_metatable_matching_docid($oid, $outhandle);

        print $outhandle "----------SQL DB contains meta-----------\n" if $self->{'verbosity'} > 2;
        # https://www.effectiveperlprogramming.com/2010/07/set-custom-dbi-error-handlers/

        foreach my $row (@$records) {
            #print $outhandle "row: @$row\n";
            my ($primary_key, $did, $sid, $metaname, $metaval) = @$row;

            # get rid of the artificial "root" introduced in section id when saving to sql db
            $sid =~ s@^root@@;
            $sid = $doc_obj->get_top_section() unless $sid;
            print $outhandle "### did: $did, sid: |$sid|, meta: $metaname, val: $metaval\n"
                if $self->{'verbosity'} > 2;

            # + TODO:  we accessed the db in utf8 mode, so, we can call doc_obj->add_utf8_meta directly:
            #$doc_obj->add_utf8_metadata($sid, $metaname, &docprint::unescape_text($metaval));

            # data stored unescaped in db: escaping only for html/xml files, not for txt files or db
            $doc_obj->add_utf8_metadata($sid, $metaname, $metaval);
        }
        print $outhandle "----------FIN READING DOC's META FROM SQL DB------------\n"
            if $self->{'verbosity'} > 2;
    }

    if($proc_mode eq "all" || $proc_mode eq "text_only") {
        # read in fulltxt for the document (i.e. select * from <col>_fulltxt table)

        my $fulltxt_table = $gs_sql->get_fulltext_table_name();


        my $records = $gs_sql->select_from_texttable_matching_docid($oid, $outhandle);


        print $outhandle "----------\nSQL DB contains txt entries for-----------\n"
            if $self->{'verbosity'} > 2;

        foreach my $row (@$records) {
            my ($primary_key, $did, $sid, $text) = @$row;

            # get rid of the artificial "root" introduced in section id when saving to sql db
            #$sid =~ s@^root@@;
            $sid = $doc_obj->get_top_section() if ($sid eq "root");
            print $outhandle "### did: $did, sid: |$sid|, fulltext: <TXT>\n"
                if $self->{'verbosity'} > 2;

            # TODO - pass by ref?
            # + TODO: we accessed the db in utf8 mode, so, we can call doc_obj->add_utf8_text directly:
            # data stored unescaped in db: escaping is only for html/xml files, not for txt files or db
            #my $textref = &docprint::unescape_textref(\$text);
            $doc_obj->add_utf8_textref($sid, \$text);
        }
        print $outhandle "----------FIN READING DOC's TXT FROM SQL DB------------\n"
            if $self->{'verbosity'} > 2;
    }

    # done reading into docobj from SQL db

    # don't forget to clean up on close() in superclass
    # It will get the doc_obj indexed then make it undef
    $self->SUPER::close_document(@_);
}

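# Illustrative sketch only (commented out; not part of the original plugin): how the metadata
# branch of close_document() maps a stored section id back to a doc_obj section id. The sample
# ids and the stand-in value for get_top_section() (assumed here to be the empty string) are
# assumptions for the demonstration.
#
#   use strict; use warnings;
#   my $top_section = "";                       # stand-in for $doc_obj->get_top_section()
#   foreach my $stored_sid ("root", "1", "1.2") {
#       (my $sid = $stored_sid) =~ s@^root@@;   # strip the artificial "root" prefix
#       $sid = $top_section unless $sid;        # an empty id means the top-level section
#       print "stored '$stored_sid' -> section '$sid'\n";
#   }
#
# Note that the fulltext branch instead tests for an exact match ($sid eq "root"), so the two
# branches only agree as long as "root" is stored on its own for the top-level section.
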

1;