[29015] | 1 | <?xml version="1.0" encoding="utf-8" standalone="no"?>
|
---|
| 2 | <!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
|
---|
| 3 | <Archive>
|
---|
| 4 | <Section>
|
---|
| 5 | <Description>
|
---|
| 6 | <Metadata name="gsdldoctype">indexed_doc</Metadata>
|
---|
| 7 | <Metadata name="Language">en</Metadata>
|
---|
| 8 | <Metadata name="Encoding">utf8</Metadata>
|
---|
| 9 | <Metadata name="Title">Using language models for generic entity extraction</Metadata>
|
---|
| 10 | <Metadata name="gsdlsourcefilename">import/langmodl.ps</Metadata>
|
---|
| 11 | <Metadata name="gsdlconvertedfilename">tmp/1398926157/langmodl.text</Metadata>
|
---|
| 12 | <Metadata name="OrigSource">langmodl.text</Metadata>
|
---|
| 13 | <Metadata name="Source">langmodl.ps</Metadata>
|
---|
| 14 | <Metadata name="SourceFile">langmodl.ps</Metadata>
|
---|
| 15 | <Metadata name="Plugin">PostScriptPlugin</Metadata>
|
---|
| 16 | <Metadata name="FileSize">16751</Metadata>
|
---|
| 17 | <Metadata name="FilenameRoot">langmodl</Metadata>
|
---|
| 18 | <Metadata name="FileFormat">PS</Metadata>
|
---|
| 19 | <Metadata name="srcicon">_iconps_</Metadata>
|
---|
| 20 | <Metadata name="srclink_file">doc.ps</Metadata>
|
---|
| 21 | <Metadata name="srclinkFile">doc.ps</Metadata>
|
---|
| 22 | <Metadata name="dc.Creator">Ian H. Witten</Metadata>
|
---|
| 23 | <Metadata name="dc.Creator">Zane Bray</Metadata>
|
---|
| 24 | <Metadata name="dc.Creator">Malika Mahoui</Metadata>
|
---|
| 25 | <Metadata name="dc.Creator">W.J. Teahan</Metadata>
|
---|
| 26 | <Metadata name="Identifier">HASH8bbe6da0374b413b1b355c</Metadata>
|
---|
| 27 | <Metadata name="lastmodified">1398925758</Metadata>
|
---|
| 28 | <Metadata name="lastmodifieddate">20140501</Metadata>
|
---|
| 29 | <Metadata name="oailastmodified">1398926157</Metadata>
|
---|
| 30 | <Metadata name="oailastmodifieddate">20140501</Metadata>
|
---|
| 31 | <Metadata name="assocfilepath">HASH8bbe.dir</Metadata>
|
---|
| 32 | <Metadata name="gsdlassocfile">doc.ps:application/postscript:</Metadata>
|
---|
| 33 | </Description>
|
---|
| 34 | <Content><pre>
|
---|
| 35 | Using language models for generic entity extraction
|
---|
| 36 | Ian H. Witten, Zane Bray, Malika Mahoui, W.J. Teahan Computer Science, University
|
---|
| 37 | of Waikato, Hamilton, New [email protected] Abstract. This paper describes
|
---|
| 38 | the use of statistical language modeling techniques, such as are commonly used
|
---|
| 39 | for text compression, to extract meaningful, low-level information about
|
---|
| 40 | the location of semantic tokens, or "entities," in text. We begin by
|
---|
| 41 | marking up several different token types in training documents: for example, people's
|
---|
| 42 | names, dates and time periods, phone numbers, and sums of money. We form a language
|
---|
| 43 | model for each token type and examine how accurately it identifies new tokens.
|
---|
| 44 | We then apply a search algorithm to insert token boundaries in a way that maximizes
|
---|
| 45 | compression of the entire test document. The technique can be applied to hierarchically-defined
|
---|
| 46 | tokens, leading to a kind of "soft parsing" that will, we believe, be
|
---|
| 47 | able to identify structured items such as references and tables in HTML or
|
---|
| 48 | plain text, based on nothing more than a few marked-up examples in training
|
---|
| 49 | documents. 1. INTRODUCTION Text mining is about looking for patterns in
|
---|
| 50 | text, and may be defined as the process of analyzing text to extract information
|
---|
| 51 | that is useful for particular purposes. Compared with the kind of data stored
|
---|
| 52 | in databases, text is unstructured, amorphous, and difficult to deal with. Nevertheless,
|
---|
| 53 | in modern Western culture, text is the most common vehicle for the formal
|
---|
| 54 | exchange of information. The motivation for trying to extract information
|
---|
| 55 | from it is compelling, even if success is only partial. Text mining is possible
|
---|
| 56 | because you do not have to understand text in order to extract useful information
|
---|
| 57 | from it. Here are four examples. First, if only names could be identified,
|
---|
| 58 | links could be inserted automatically to other places that mention the same
|
---|
| 59 | name, links that are "dynamically evaluated" by calling upon a search
|
---|
| 60 | engine to bind them at click time. Second, actions can be associated with different
|
---|
| 61 | types of data, using either explicit programming or programming-by-demonstration techniques.
|
---|
| 62 | A day/time specification appearing anywhere within one's email could be
|
---|
| 63 | associated with diary actions such as updating a personal organizer or creating
|
---|
| 64 | an automatic reminder, and each mention of a day/time in the text could raise
|
---|
| 65 | a popup menu of calendar-based actions. Third, text could be mined for data
|
---|
| 66 | in tabular format, allowing databases to be created from formatted tables such
|
---|
| 67 | as stock-market information on Web pages. Fourth, an agent could monitor incoming
|
---|
| 68 | newswire stories for company names and collect documents that mention them: an
|
---|
| 69 | automated press clipping service. In all these examples, the key problem is
|
---|
| 70 | to recognize different types of target fragments, which we will call tokens
|
---|
| 71 | or "entities". This is really a kind of language recognition problem:
|
---|
| 72 | we have a text made up of different sublanguages (for personal names, company
|
---|
| 73 | names, dates, table entries, and so on) and seek to determine which parts are
|
---|
| 74 | expressed in which language. The information extraction research community
|
---|
| 75 | (of which we were, until recently, unaware) has studied these tasks and reported
|
---|
| 76 | results at annual Message Understanding Conferences (MUC). For example, "named
|
---|
| 77 | entities" are defined as proper names and quantities of interest, including
|
---|
| 78 | personal, organization, and location names, as well as dates, times, percentages,
|
---|
| 79 | and monetary amounts (Chinchor, 1999). The standard approach to this problem
|
---|
| 80 | is manual: tokenizers and grammars are hand-designed for the particular data
|
---|
| 81 | being extracted. Looking at current commercial state-of-the-art text mining
|
---|
| 82 | software, for example, IBM's Intelligent Miner for Text (Tkach, 1997) uses
|
---|
| 83 | specific recognition modules carefully programmed for the different data types,
|
---|
| 84 | while Apple's data detectors (Nardi et al., 1998) use language grammars.
|
---|
| 85 | The Text Tokenization Tool of Grover et al. (1999) is another example, and
|
---|
| 86 | a demonstration version is available on the Web. The challenge for machine
|
---|
| 87 | learning is to use
|
---|
| 88 | </pre></Content>
|
---|
| 89 | </Section>
|
---|
| 90 | </Archive>
|
---|