source: trunk/gsdl/perllib/plugins/HTMLPlug.pm

Change Rev Age Author Log Message
(edit) @10725   19 years chi For some reason, to change the date format to "yyyymmdd" used "date" …
(edit) @10513   19 years mdewsnip Absolute image tags, like <img src="/image.gif"> were being …
(edit) @10347   19 years kjdon removed the unneeded 'use parsargv'
(edit) @10277   19 years chi tidy up the filename in add_file().
(edit) @10218   19 years kjdon Jeffrey's new parsing modifications, committed approx 6 July, 15.16
(edit) @10121   19 years mdewsnip Added the "sectionalise_using_h_tags" option to HTMLPlug, which …
(edit) @9747   19 years davidb Encountered new circumstance -- table -- for HTML tags that reference …
(edit) @9228   19 years davidb Changed setting URL metadata back to always being done (regardless of …
(edit) @9169   19 years davidb HTMLPlug was always setting URL metadata. This only makes sense if …
(edit) @9143   19 years davidb Added handling of <embed> tag in a similar fashion to <img> Also, …
(edit) @9125   19 years mdewsnip Added a substr function to unicode.pm that should work correctly on …
(edit) @9067   19 years kjdon moved smart blocking stuff in htmlplug metadata_read into basplug …
(edit) @9057   19 years kjdon tidied up previous commit
(edit) @9056   19 years kjdon added an option to not strip html tags from metadata in description …
(edit) @9053   19 years kjdon changed the description tags metadata handling again. now uses an …
(edit) @8914   19 years chi Add a smart_block option to deal with associated files of HTML document.
(edit) @8843   19 years jrm21 fix problem for -metadata_fields if tag1<Tag2> given for mapping to a …
(edit) @8794   19 years jrm21 remove trailing \n from meta tags (bug reported by Tim Finney, 13 Dec 2004)
(edit) @8767   19 years jrm21 add 'use utf8' so hopefully substr() is smart enough to cut between …
(edit) @8716   19 years kjdon added some changes made by Emanuel Dejanu (Simple Words)
(edit) @8668   19 years kjdon when processing description tags, it used to use …
(edit) @8509   19 years chi Add new methods (with a smart_block option) to store the blocked …
(edit) @8366   20 years kjdon added script to the list of tags to process as relative links, and js …
(edit) @8225   20 years jrm21 support tag<tagname> as described in the pluginfo for HTMLPlug. The …
(edit) @8121   20 years chi Add the "FileFormat" metadata to each of the Plugins.
(edit) @8071   20 years davidb When title metadata is derived from first 100 chars of text, extra =~ …
(edit) @7966   20 years mdewsnip Updated my fix from yesterday, so the collections will work correctly …
(edit) @7949   20 years mdewsnip Added a bit of a hack for the wv 0.7.1 bug under Windows that causes …
(edit) @7640   20 years mdewsnip Removed the reference to WebPlug, which no longer exists.
(edit) @7595   20 years mdewsnip Seem to have fixed the problem with anchors being added to images (for …
(edit) @7235   20 years kjdon fixed a couple of bugs and added a bit of output to do with extracting …
(edit) @7202   20 years jrm21 rewrote the <meta> tag handling to be more robust and more efficient.
(edit) @6812   20 years mdewsnip Additions for the GsdlCollageApplet: a classifier that displays a …
(edit) @6651   20 years kjdon fixed a bug I introduced last time
(edit) @6649   20 years kjdon changed the regex for getting info out of meta tags so it now works if …
(edit) @6408   20 years jmt12 Added two new attributes for script arguments. HiddenGLI controls …
(edit) @6332   20 years jmt12 When -gli argument is provided to calling script these modules will …
(edit) @5924   20 years kjdon changed the new metadata to eg WordPlug instead of Word, cos a clash …
(edit) @5919   20 years kjdon each plugin now adds a metadata field to the doc obj based on the …
(edit) @5680   21 years mdewsnip Moved plugin descriptions into the resource bundle …
(edit) @5096   21 years jmt12 Metadata fields actually has nothing to do with the metadata elements …
(edit) @5066   21 years kjdon changed HTMLPlug to extract multiple values for the same metadata name
(edit) @4873   21 years mdewsnip Further work on standardising option descriptions. Specifically, in …
(edit) @4845   21 years jrm21 use add_metadata instead of add_utf8_metadata for Source and URL …
(edit) @4821   21 years jrm21 corrected extract_first_NNNN function so that it doesn't get confused …
(edit) @4785   21 years mdewsnip Commented out print_usage functions - plugins should now call …
(edit) @4748   21 years mdewsnip Changed "metadatum" type to "metadata".
(edit) @4744   21 years mdewsnip Tidied up and structures (representing the options of the plugin) in …
(edit) @3708   21 years sjboddie Fixed a bug where HTMLPlug failed to associate files whose filenames …
(edit) @3540   21 years kjdon added John T's changes into CVS - added info to enable retrieval of …
(edit) @3539   21 years kjdon added jpe to the process and block expressions
(edit) @3369   22 years sjboddie HTMLPlug will no longer prevent metadata extraction when the …
(edit) @3349   22 years sjboddie Bug fix.
(edit) @3247   22 years jrm21 Modified automatic title extraction to also recognise utf-8 nbsp as …
(edit) @3196   22 years sjboddie Added &nbsp; to the list of entities that HTMLPlug doesn't convert to utf-8
(edit) @3181   22 years sjboddie Altered the getcharequiv() function so it now converts entities to raw …
(edit) @3148   22 years jrm21 If a document has associated files that are also given a subdirectory, …
(edit) @3135   22 years jrm21 modified process_exp to process php3 -named files too.
(edit) @3019   22 years jrm21 Fixes for when on windows - it was having a lot of trouble sorting out …
(edit) @2995   22 years sjboddie Fixed a bug preventing HTML headers from being removed correctly when …
(edit) @2975   22 years jrm21 Tidied up usage info to fit in 80 columns. Fixed title_sub stuff, so …
(edit) @2819   22 years sjboddie Altered HTMLPlug's description_tags option a bit so it should now also …
(edit) @2817   22 years sjboddie Implemented a description_tags option to HTMLPlug for splitting an …
(edit) @2735   23 years sjboddie Fixed up bugs I introduced with recent change to BasPlug
(edit) @2695   23 years jrm21 Allow spaces in img src=... tags if surrounded with dbl quotes.
(edit) @2564   23 years jrm21 Added RTFPlug. (It's the smallest one so far - 1511 bytes - yay!) …
(edit) @2453   23 years jrm21 Slightly smarter title extraction from body's text.
(edit) @2364   23 years jrm21 turn "\" into " " so that we don't lose backslashes along the way…
(edit) @2342   23 years sjboddie renamed HTMLPlug's w3mir option to file_is_url
(edit) @2219   23 years sjboddie Had another go at suppressing the "subroutine redefined" warnings as …
(edit) @2209   23 years sjboddie Suppressed some annoying perl warnings
(edit) @1929   23 years dg5 Modified: ConvertToPlug and HTMLPlug to handle files in binary mode to …
(edit) @1891   23 years paynter Named characters like &eacute; and &igrave; are translated to UTF8 …
(edit) @1844   23 years sjboddie Added an 'auto' argument to BasPlug's '-input_encoding' option ('auto' …
(edit) @1699   23 years say1 fixed the bug in HTML plug which broke images for Dave
(edit) @1686   23 years jrm21 HTMLPlug no longer blocks .pdf files. (also updated reference to this …
(edit) @1653   23 years paynter Fixed a few bugs where incorrect variable names were used.
(edit) @1609   24 years say1 fixed print_usage
(edit) @1605   24 years say1 fixed some of my earlier mistakes. sorry Stefan
(edit) @1602   24 years say1 metadata extraction work. (email addresses, generalised HTML tags, …
(edit) @1448   24 years paynter Changed regular expressions for extracting metadata from META tags …
(edit) @1435   24 years davidb Rearrangement of ConvertTo inheritance so HTMLPlug and TextPlug do not …
(edit) @1431   24 years sjboddie Made a few minor adjustments to perl building code for use with …
(edit) @1424   24 years sjboddie Added a -out option to most of the perl building scripts to allow …
(edit) @1410   24 years davidb Introduction of "ConvertTo" family of plugins. This establishes a new …
(edit) @1403   24 years say1 taught HTMLPlug about shtml, asp, cgi, php and html query files …
(edit) @1400   24 years davidb General tidying of code.
(edit) @1358   24 years nzdl Fixed bug I recently introduced into HTMLPlug (<pre> tags were being …
(edit) @1312   24 years sjboddie fixed a bug in the HTML plugin that showed up under windows
(edit) @1245   24 years sjboddie Fixed a bug that davidb found in a couple of regular expressions
(edit) @1244   24 years sjboddie Caught up most general plugins (that's the ones in …
(edit) @1243   24 years sjboddie Caught HTMLPlug up with BasPlug. A few minor changes to some …
(edit) @1231   24 years gwp Bug fix on the H1 metadata option: if the file has no <H1> tag, …
(edit) @1230   24 years gwp Added an additional H1 metadata field that extracts the text between …
(edit) @1220   24 years sjboddie Caught HTMLPlug up with the changes I made to BasPlug. HTMLPlug now …
(edit) @1190   24 years gwp The first 200 chars of body text can now be extracted as metadata by …
(edit) @1020   24 years sjboddie changed paths to collection images (again!)
(edit) @1010   24 years sjboddie renamed old html module ghtml -- it clashed with builtin html module …
(edit) @965   24 years sjboddie fixed bug - added assoc_files option
(edit) @900   24 years sjboddie tweaked the way associated files are handled at build time - some …