<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
<Description>
<Metadata name="gsdldoctype">indexed_doc</Metadata>
<Metadata name="Language">en</Metadata>
<Metadata name="Encoding">utf8</Metadata>
<Metadata name="Title">Using language models for generic entity extraction</Metadata>
<Metadata name="gsdlsourcefilename">import/langmodl.ps</Metadata>
<Metadata name="gsdlconvertedfilename">tmp/1391133444/langmodl.text</Metadata>
<Metadata name="OrigSource">langmodl.text</Metadata>
<Metadata name="Source">langmodl.ps</Metadata>
<Metadata name="SourceFile">langmodl.ps</Metadata>
<Metadata name="Plugin">PostScriptPlugin</Metadata>
<Metadata name="FileSize">16751</Metadata>
<Metadata name="FilenameRoot">langmodl</Metadata>
<Metadata name="FileFormat">PS</Metadata>
<Metadata name="srcicon">_iconps_</Metadata>
<Metadata name="srclink_file">doc.ps</Metadata>
<Metadata name="srclinkFile">doc.ps</Metadata>
<Metadata name="Identifier">HASH8bbe6da0374b413b1b355c</Metadata>
<Metadata name="lastmodified">1391133401</Metadata>
<Metadata name="lastmodifieddate">20140131</Metadata>
<Metadata name="oailastmodified">1391133444</Metadata>
<Metadata name="oailastmodifieddate">20140131</Metadata>
<Metadata name="assocfilepath">HASH8bbe.dir</Metadata>
<Metadata name="gsdlassocfile">doc.ps:application/postscript:</Metadata>
</Description>
<Content><pre>
Using language models for generic entity extraction
Ian H. Witten, Zane Bray, Malika Mahoui, W.J. Teahan
Computer Science
University of Waikato
Hamilton, New Zealand
[email protected]

Abstract

This paper describes the use of statistical language modeling techniques, such
as are commonly used for text compression, to extract meaningful, low-level
information about the location of semantic tokens, or "entities," in text. We
begin by marking up several different token types in training documents: for
example, people's names, dates and time periods, phone numbers, and sums of
money. We form a language model for each token type and examine how accurately
it identifies new tokens.
We then apply a search algorithm to insert token boundaries in a way that
maximizes compression of the entire test document. The technique can be applied
to hierarchically-defined tokens, leading to a kind of "soft parsing" that
will, we believe, be able to identify structured items such as references and
tables in HTML or plain text, based on nothing more than a few marked-up
examples in training documents.

1. INTRODUCTION

Text mining is about looking for patterns in text, and may be defined as the
process of analyzing text to extract information that is useful for particular
purposes. Compared with the kind of data stored in databases, text is
unstructured, amorphous, and difficult to deal with. Nevertheless, in modern
Western culture, text is the most common vehicle for the formal exchange of
information. The motivation for trying to extract information from it is
compelling, even if success is only partial.

Text mining is possible because you do not have to understand text in order to
extract useful information from it. Here are four examples. First, if only
names could be identified, links could be inserted automatically to other
places that mention the same name: links that are "dynamically evaluated" by
calling upon a search engine to bind them at click time. Second, actions can be
associated with different types of data, using either explicit programming or
programming-by-demonstration techniques.
A day/time specification appearing anywhere within one's email could be
associated with diary actions such as updating a personal organizer or creating
an automatic reminder, and each mention of a day/time in the text could raise a
popup menu of calendar-based actions. Third, text could be mined for data in
tabular format, allowing databases to be created from formatted tables such as
stock-market information on Web pages. Fourth, an agent could monitor incoming
newswire stories for company names and collect documents that mention them: an
automated press clipping service.

In all these examples, the key problem is to recognize different types of
target fragments, which we will call tokens or "entities". This is really a
kind of language recognition problem: we have a text made up of different
sublanguages (for personal names, company names, dates, table entries, and so
on) and seek to determine which parts are expressed in which language.

The information extraction research community (of which we were, until
recently, unaware) has studied these tasks and reported results at annual
Message Understanding Conferences (MUC). For example, "named entities" are
defined as proper names and quantities of interest, including personal,
organization, and location names, as well as dates, times, percentages, and
monetary amounts (Chinchor, 1999).

The standard approach to this problem is manual: tokenizers and grammars are
hand-designed for the particular data being extracted. Looking at current
commercial state-of-the-art text mining software, for example, IBM's
Intelligent Miner for Text (Tkach, 1997) uses specific recognition modules
carefully programmed for the different data types, while Apple's data detectors
(Nardi et al., 1998) use language grammars.
The Text Tokenization Tool of Grover et al. (1999) is another example, and a
demonstration version is available on the Web. The challenge for machine
learning is to use
</pre></Content>
</Section>
</Archive>