source: other-projects/nightly-tasks/diffcol/trunk/model-collect/Word-PDF-Formatting/archives/HASH8bbe.dir/doc.xml@ 29405

Last change on this file since 29405 was 29405, checked in by ak19, 9 years ago

Trying to rebuild the Word-PDF-Formatting collection with unique dc.Title metadata for docs that have identical final names and with the 2nd browsing classifier sorted on dc.Title, in order to produce a consistent order for browse classifiers' children (a consistent presentation order of the files under the browsing classifiers). This is necessary for perl 5.18/5.17 and later, since they randomise the order of children of unsorted classifiers and for those children with identical filenames. Changes made particularly to collect.cfg and import/metadata.xml

File size: 5.8 KB
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
  <Description>
    <Metadata name="gsdldoctype">indexed_doc</Metadata>
    <Metadata name="Language">en</Metadata>
    <Metadata name="Encoding">utf8</Metadata>
    <Metadata name="Title">Bronwyn; page: 1 of 1 1 Using language models for generic entity extraction</Metadata>
    <Metadata name="gsdlsourcefilename">import/langmodl.ps</Metadata>
    <Metadata name="gsdlconvertedfilename">tmp/1414470426/langmodl.text</Metadata>
    <Metadata name="OrigSource">langmodl.text</Metadata>
    <Metadata name="Source">langmodl.ps</Metadata>
    <Metadata name="SourceFile">langmodl.ps</Metadata>
    <Metadata name="Plugin">PostScriptPlugin</Metadata>
    <Metadata name="FileSize">16751</Metadata>
    <Metadata name="FilenameRoot">langmodl</Metadata>
    <Metadata name="FileFormat">PS</Metadata>
    <Metadata name="srcicon">_iconps_</Metadata>
    <Metadata name="srclink_file">doc.ps</Metadata>
    <Metadata name="srclinkFile">doc.ps</Metadata>
    <Metadata name="dc.Creator">Ian H. Witten</Metadata>
    <Metadata name="dc.Creator">Zane Bray</Metadata>
    <Metadata name="dc.Creator">Malika Mahoui</Metadata>
    <Metadata name="dc.Creator">W.J. Teahan</Metadata>
    <Metadata name="Identifier">HASH8bbe6da0374b413b1b355c</Metadata>
    <Metadata name="lastmodified">1414470425</Metadata>
    <Metadata name="lastmodifieddate">20141028</Metadata>
    <Metadata name="oailastmodified">1414470426</Metadata>
    <Metadata name="oailastmodifieddate">20141028</Metadata>
    <Metadata name="assocfilepath">HASH8bbe.dir</Metadata>
    <Metadata name="gsdlassocfile">doc.ps:application/postscript:</Metadata>
  </Description>
  <Content>&lt;pre&gt;
Bronwyn; page: 1 of 1 1 Using language models for generic entity extraction
Ian H. Witten, Zane Bray, Malika Mahoui, W.J. Teahan
Computer Science, University of Waikato, Hamilton, New Zealand [email protected]

Abstract
This paper describes the use of statistical language modeling techniques, such
as are commonly used for text compression, to extract meaningful, low-level
information about the location of semantic tokens, or "entities," in text. We
begin by marking up several different token types in training documents: for
example, people's names, dates and time periods, phone numbers, and sums of
money. We form a language model for each token type and examine how accurately
it identifies new tokens. We then apply a search algorithm to insert token
boundaries in a way that maximizes compression of the entire test document.
The technique can be applied to hierarchically-defined tokens, leading to a
kind of "soft parsing" that will, we believe, be able to identify structured
items such as references and tables in html or plain text, based on nothing
more than a few marked-up examples in training documents.

1. INTRODUCTION
Text mining is about looking for patterns in text, and may be defined as the
process of analyzing text to extract information that is useful for particular
purposes. Compared with the kind of data stored in databases, text is
unstructured, amorphous, and difficult to deal with. Nevertheless, in modern
Western culture, text is the most common vehicle for the formal exchange of
information. The motivation for trying to extract information from it is
compelling, even if success is only partial.

Text mining is possible because you do not have to understand text in order to
extract useful information from it. Here are four examples. First, if only
names could be identified, links could be inserted automatically to other
places that mention the same name: links that are "dynamically evaluated" by
calling upon a search engine to bind them at click time. Second, actions can be
associated with different types of data, using either explicit programming or
programming-by-demonstration techniques. A day/time specification appearing
anywhere within one's email could be associated with diary actions such as
updating a personal organizer or creating an automatic reminder, and each
mention of a day/time in the text could raise a popup menu of calendar-based
actions. Third, text could be mined for data in tabular format, allowing
databases to be created from formatted tables such as stock-market information
on Web pages. Fourth, an agent could monitor incoming newswire stories for
company names and collect documents that mention them: an automated press
clipping service.

In all these examples, the key problem is to recognize different types of
target fragments, which we will call tokens or "entities". This is really a
kind of language recognition problem: we have a text made up of different
sublanguages (for personal names, company names, dates, table entries, and so
on) and seek to determine which parts are expressed in which language.

The information extraction research community (of which we were, until
recently, unaware) has studied these tasks and reported results at annual
Message Understanding Conferences (MUC). For example, "named entities" are
defined as proper names and quantities of interest, including personal,
organization, and location names, as well as dates, times, percentages, and
monetary amounts (Chinchor, 1999).

The standard approach to this problem is manual: tokenizers and grammars are
hand-designed for the particular data being extracted. Looking at current
commercial state-of-the-art text mining software, for example, IBM's
Intelligent Miner for Text (Tkach, 1997) uses specific recognition modules
carefully programmed for the different data types, while Apple's data detectors
(Nardi et al., 1998) use language grammars. The Text Tokenization Tool of
Grover et al. (1999) is another example, and a demonstration version is
available on the Web. The challenge for machine learning is to use
&lt;/pre&gt;</Content>
</Section>
</Archive>