root/other-projects/nightly-tasks/diffcol/trunk/model-collect/Customization/archives/HASH8bbe.dir/doc.xml @ 32163

Revision 32163, 5.6 KB (checked in by ak19, 19 months ago)

AUTOCOMMIT by gen-model-colls.sh script. Message: Regenerating model collections after <meta-file> entry in archiveinf-src.db has become compulsory

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
  <Description>
    <Metadata name="gsdldoctype">indexed_doc</Metadata>
    <Metadata name="Language">en</Metadata>
    <Metadata name="Encoding">utf8</Metadata>
    <Metadata name="Title">Bronwyn; page: 1 of 1 1 Using language models for generic entity extraction</Metadata>
    <Metadata name="gsdlsourcefilename">import/langmodl.ps</Metadata>
    <Metadata name="gsdlconvertedfilename">tmp/1522045370/langmodl.text</Metadata>
    <Metadata name="OrigSource">langmodl.text</Metadata>
    <Metadata name="Source">langmodl.ps</Metadata>
    <Metadata name="SourceFile">langmodl.ps</Metadata>
    <Metadata name="Plugin">PostScriptPlugin</Metadata>
    <Metadata name="FileSize">16751</Metadata>
    <Metadata name="FilenameRoot">langmodl</Metadata>
    <Metadata name="FileFormat">PS</Metadata>
    <Metadata name="srcicon">_iconps_</Metadata>
    <Metadata name="srclink_file">doc.ps</Metadata>
    <Metadata name="srclinkFile">doc.ps</Metadata>
    <Metadata name="Identifier">HASH8bbe6da0374b413b1b355c</Metadata>
    <Metadata name="lastmodified">1522045362</Metadata>
    <Metadata name="lastmodifieddate">20180326</Metadata>
    <Metadata name="oailastmodified">1522045370</Metadata>
    <Metadata name="oailastmodifieddate">20180326</Metadata>
    <Metadata name="assocfilepath">HASH8bbe.dir</Metadata>
    <Metadata name="gsdlassocfile">doc.ps:application/postscript:</Metadata>
  </Description>
  <Content>&lt;pre&gt;
Using language models for generic entity extraction

Ian H. Witten, Zane Bray, Malika Mahoui, W.J. Teahan
Computer Science, University of Waikato, Hamilton, New Zealand
ihw@cs.waikato.ac.nz

Abstract

This paper describes the use of statistical language modeling techniques, such
as are commonly used for text compression, to extract meaningful, low-level
information about the location of semantic tokens, or “entities,” in text. We
begin by marking up several different token types in training documents: for
example, people’s names, dates and time periods, phone numbers, and sums of
money. We form a language model for each token type and examine how accurately
it identifies new tokens. We then apply a search algorithm to insert token
boundaries in a way that maximizes compression of the entire test document. The
technique can be applied to hierarchically-defined tokens, leading to a kind of
“soft parsing” that will, we believe, be able to identify structured items such
as references and tables in html or plain text, based on nothing more than a
few marked-up examples in training documents.

1. INTRODUCTION

Text mining is about looking for patterns in
text, and may be defined as the process of analyzing text to extract information
that is useful for particular purposes. Compared with the kind of data stored
in databases, text is unstructured, amorphous, and difficult to deal with.
Nevertheless, in modern Western culture, text is the most common vehicle for the
formal exchange of information. The motivation for trying to extract information
from it is compelling, even if success is only partial.

Text mining is possible because you do not have to understand text in order to
extract useful information from it. Here are four examples. First, if only
names could be identified, links could be inserted automatically to other
places that mention the same name: links that are “dynamically evaluated” by
calling upon a search engine to bind them at click time. Second, actions can be
associated with different types of data, using either explicit programming or
programming-by-demonstration techniques.
A day/time specification appearing anywhere within one’s email could be
associated with diary actions such as updating a personal organizer or creating
an automatic reminder, and each mention of a day/time in the text could raise
a popup menu of calendar-based actions. Third, text could be mined for data
in tabular format, allowing databases to be created from formatted tables such
as stock-market information on Web pages. Fourth, an agent could monitor
incoming newswire stories for company names and collect documents that mention
them: an automated press clipping service.

In all these examples, the key problem is to recognize different types of
target fragments, which we will call tokens or “entities”. This is really a
kind of language recognition problem: we have a text made up of different
sublanguages (for personal names, company names, dates, table entries, and so
on) and seek to determine which parts are expressed in which language.

The information extraction research community
(of which we were, until recently, unaware) has studied these tasks and reported
results at annual Message Understanding Conferences (MUC). For example, “named
entities” are defined as proper names and quantities of interest, including
personal, organization, and location names, as well as dates, times,
percentages, and monetary amounts (Chinchor, 1999).

The standard approach to this problem is manual: tokenizers and grammars are
hand-designed for the particular data being extracted. Looking at current
commercial state-of-the-art text mining software, for example, IBM’s Intelligent
Miner for Text (Tkach, 1997) uses specific recognition modules carefully
programmed for the different data types, while Apple’s data detectors (Nardi et
al., 1998) use language grammars. The Text Tokenization Tool of Grover et al.
(1999) is another example, and a demonstration version is available on the Web.
The challenge for machine learning is to use
&lt;/pre&gt;</Content>
</Section>
</Archive>