source: other-projects/nightly-tasks/diffcol/trunk/model-collect/Word-PDF-Basic/archives/HASH8bbe.dir/doc.xml@ 34416

Last change on this file since 34416 was 34416, checked in by ak19, 4 years ago

Committing rebuilt model collections after new doc.xml meta gsdlfullsourcepath introduced in commit r34394.

File size: 6.0 KB
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
  <Description>
    <Metadata name="gsdldoctype">indexed_doc</Metadata>
    <Metadata name="Language">en</Metadata>
    <Metadata name="Encoding">utf8</Metadata>
    <Metadata name="Title">Bronwyn; page: 1 of 1 1 Using language models for generic entity extraction</Metadata>
    <Metadata name="gsdlsourcefilename">import/langmodl.ps</Metadata>
    <Metadata name="gsdlsourcefilerenamemethod">url</Metadata>
    <Metadata name="gsdlfullsourcepath">/Scratch/ak19/gs2-diffcol-26Apr2019/collect/Word-PDF-Basic/import/langmodl.ps</Metadata>
    <Metadata name="gsdlconvertedfilename">tmp/1601256892/langmodl.text</Metadata>
    <Metadata name="OrigSource">langmodl.text</Metadata>
    <Metadata name="Source">langmodl.ps</Metadata>
    <Metadata name="SourceFile">langmodl.ps</Metadata>
    <Metadata name="Plugin">PostScriptPlugin</Metadata>
    <Metadata name="FileSize">16751</Metadata>
    <Metadata name="FilenameRoot">langmodl</Metadata>
    <Metadata name="FileFormat">PS</Metadata>
    <Metadata name="srcicon">_iconps_</Metadata>
    <Metadata name="srclink_file">doc.ps</Metadata>
    <Metadata name="srclinkFile">doc.ps</Metadata>
    <Metadata name="dc.Creator">Ian H. Witten</Metadata>
    <Metadata name="dc.Creator">Zane Bray</Metadata>
    <Metadata name="dc.Creator">Malika Mahoui</Metadata>
    <Metadata name="dc.Creator">W.J. Teahan</Metadata>
    <Metadata name="Identifier">HASH8bbe6da0374b413b1b355c</Metadata>
    <Metadata name="lastmodified">1601256681</Metadata>
    <Metadata name="lastmodifieddate">20200928</Metadata>
    <Metadata name="oailastmodified">1601256893</Metadata>
    <Metadata name="oailastmodifieddate">20200928</Metadata>
    <Metadata name="assocfilepath">HASH8bbe.dir</Metadata>
    <Metadata name="gsdlassocfile">doc.ps:application/postscript:</Metadata>
  </Description>
  <Content>&lt;pre&gt;
Using language models for generic entity extraction

Ian H. Witten, Zane Bray, Malika Mahoui, W.J. Teahan
Computer Science
University of Waikato
Hamilton, New Zealand
[email protected]

Abstract

This paper describes the use of statistical language modeling techniques,
such as are commonly used for text compression, to extract meaningful,
low-level, information about the location of semantic tokens, or "entities,"
in text. We begin by marking up several different token types in training
documents: for example, people's names, dates and time periods, phone
numbers, and sums of money. We form a language model for each token type and
examine how accurately it identifies new tokens. We then apply a search
algorithm to insert token boundaries in a way that maximizes compression of
the entire test document. The technique can be applied to
hierarchically-defined tokens, leading to a kind of "soft parsing" that will,
we believe, be able to identify structured items such as references and
tables in html or plain text, based on nothing more than a few marked-up
examples in training documents.

1. INTRODUCTION

Text mining is about looking for patterns in text, and may be defined as the
process of analyzing text to extract information that is useful for
particular purposes. Compared with the kind of data stored in databases, text
is unstructured, amorphous, and difficult to deal with. Nevertheless, in
modern Western culture, text is the most common vehicle for the formal
exchange of information. The motivation for trying to extract information
from it is compelling, even if success is only partial.

Text mining is possible because you do not have to understand text in order
to extract useful information from it. Here are four examples. First, if only
names could be identified, links could be inserted automatically to other
places that mention the same name: links that are "dynamically evaluated" by
calling upon a search engine to bind them at click time. Second, actions can
be associated with different types of data, using either explicit programming
or programming-by-demonstration techniques. A day/time specification
appearing anywhere within one's email could be associated with diary actions
such as updating a personal organizer or creating an automatic reminder, and
each mention of a day/time in the text could raise a popup menu of
calendar-based actions. Third, text could be mined for data in tabular
format, allowing databases to be created from formatted tables such as
stock-market information on Web pages. Fourth, an agent could monitor
incoming newswire stories for company names and collect documents that
mention them: an automated press clipping service.

In all these examples, the key problem is to recognize different types of
target fragments, which we will call tokens or "entities". This is really a
kind of language recognition problem: we have a text made up of different
sublanguages (for personal names, company names, dates, table entries, and so
on) and seek to determine which parts are expressed in which language.

The information extraction research community (of which we were, until
recently, unaware) has studied these tasks and reported results at annual
Message Understanding Conferences (MUC). For example, "named entities" are
defined as proper names and quantities of interest, including personal,
organization, and location names, as well as dates, times, percentages, and
monetary amounts (Chinchor, 1999).

The standard approach to this problem is manual: tokenizers and grammars are
hand-designed for the particular data being extracted. Looking at current
commercial state-of-the-art text mining software, for example, IBM's
Intelligent Miner for Text (Tkach, 1997) uses specific recognition modules
carefully programmed for the different data types, while Apple's data
detectors (Nardi et al., 1998) uses language grammars. The Text Tokenization
Tool of Grover et al. (1999) is another example, and a demonstration version
is available on the Web. The challenge for machine learning is to use
&lt;/pre&gt;</Content>
</Section>
</Archive>