source: other-projects/nightly-tasks/diffcol/trunk/gs3-model-collect/Word-PDF-Formatting/archives/HASH8bbe.dir/doc.xml@ 38996

Last change on this file since 38996 was 38996, checked in by anupama, 4 weeks ago

SourceDirectory seems to be new metadata in doc.xml that is breaking diffcol (when diffcol attempted on Win VM)

File size: 6.1 KB
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "https://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
 <Description>
 <Metadata name="gsdldoctype">indexed_doc</Metadata>
 <Metadata name="SourceDirectory">/Scratch/ak19/gs3-svn-02May2024/web/sites/localsite/collect/Word-PDF-Formatting/tmp/1714976079</Metadata>
 <Metadata name="Language">en</Metadata>
 <Metadata name="Encoding">utf8</Metadata>
 <Metadata name="Title">Bronwyn; page: 1 of 1 1 Using language models for generic entity extraction</Metadata>
 <Metadata name="gsdlsourcefilename">import/langmodl.ps</Metadata>
 <Metadata name="gsdlsourcefilerenamemethod">url</Metadata>
 <Metadata name="gsdlconvertedfilename">tmp/1714976079/langmodl.text</Metadata>
 <Metadata name="OrigSource">langmodl.text</Metadata>
 <Metadata name="Source">langmodl.ps</Metadata>
 <Metadata name="SourceFile">langmodl.ps</Metadata>
 <Metadata name="Plugin">PostScriptPlugin</Metadata>
 <Metadata name="FileSize">16751</Metadata>
 <Metadata name="SourceDirectory">.</Metadata>
 <Metadata name="FilenameRoot">langmodl</Metadata>
 <Metadata name="FileFormat">PS</Metadata>
 <Metadata name="srcicon">_iconps_</Metadata>
 <Metadata name="srclink_file">doc.ps</Metadata>
 <Metadata name="srclinkFile">doc.ps</Metadata>
 <Metadata name="dc.Creator">Ian H. Witten</Metadata>
 <Metadata name="dc.Creator">Zane Bray</Metadata>
 <Metadata name="dc.Creator">Malika Mahoui</Metadata>
 <Metadata name="dc.Creator">W.J. Teahan</Metadata>
 <Metadata name="dc.Title">Using language models for generic entity extraction</Metadata>
 <Metadata name="Identifier">HASH8bbe6da0374b413b1b355c</Metadata>
 <Metadata name="lastmodified">1714975834</Metadata>
 <Metadata name="lastmodifieddate">20240506</Metadata>
 <Metadata name="oailastmodified">1714976079</Metadata>
 <Metadata name="oailastmodifieddate">20240506</Metadata>
 <Metadata name="assocfilepath">HASH8bbe.dir</Metadata>
 <Metadata name="gsdlassocfile">doc.ps:application/postscript:</Metadata>
 </Description>
 <Content>&lt;pre&gt;
Bronwyn; page: 1 of 1 1 Using language models for generic entity extraction
Ian H. Witten, Zane Bray, Malika Mahoui, W.J. Teahan Computer Science University
of Waikato Hamilton, New [email protected] Abstract This paper describes
the use of statistical language modeling techniques, such as are commonly used
for text compression, to extract meaningful, low-level, information about
the location of semantic tokens, or "entities," in text. We begin by
marking up several different token types in training documents--for example, people's
names, dates and time periods, phone numbers, and sums of money. We form a language
model for each token type and examine how accurately it identifies new tokens.
We then apply a search algorithm to insert token boundaries in a way that maximizes
compression of the entire test document. The technique can be applied to hierarchically-defined
tokens, leading to a kind of "soft parsing" that will, we believe, be
able to identify structured items such as references and tables in html or
plain text, based on nothing more than a few marked-up examples in training
documents. 1. INTRODUCTION Text mining is about looking for patterns in
text, and may be defined as the process of analyzing text to extract information
that is useful for particular purposes. Compared with the kind of data stored
in databases, text is unstructured, amorphous, and difficult to deal with. Nevertheless,
in modern Western culture, text is the most common vehicle for the formal
exchange of information. The motivation for trying to extract information
from it is compelling--even if success is only partial. Text mining is possible
because you do not have to understand text in order to extract useful information
from it. Here are four examples. First, if only names could be identified,
links could be inserted automatically to other places that mention the same
name--links that are "dynamically evaluated" by calling upon a search
engine to bind them at click time. Second, actions can be associated with different
types of data, using either explicit programming or programming-by-demonstration techniques.
A day/time specification appearing anywhere within one's email could be
associated with diary actions such as updating a personal organizer or creating
an automatic reminder, and each mention of a day/time in the text could raise
a popup menu of calendar-based actions. Third, text could be mined for data
in tabular format, allowing databases to be created from formatted tables such
as stock-market information on Web pages. Fourth, an agent could monitor incoming
newswire stories for company names and collect documents that mention them--an
automated press clipping service. In all these examples, the key problem is
to recognize different types of target fragments, which we will call tokens
or "entities". This is really a kind of language recognition problem:
we have a text made up of different sublanguages (for personal names, company
names, dates, table entries, and so on) and seek to determine which parts are
expressed in which language. The information extraction research community
(of which we were, until recently, unaware) has studied these tasks and reported
results at annual Message Understanding Conferences (MUC). For example, "named
entities" are defined as proper names and quantities of interest, including
personal, organization, and location names, as well as dates, times, percentages,
and monetary amounts (Chinchor, 1999). The standard approach to this problem
is manual: tokenizers and grammars are hand-designed for the particular data
being extracted. Looking at current commercial state-of-the-art text mining
software, for example, IBM's Intelligent Miner for Text (Tkach, 1997) uses
specific recognition modules carefully programmed for the different data types,
while Apple's data detectors (Nardi et al., 1998) uses language grammars.
The Text Tokenization Tool of Grover et al. (1999) is another example, and
a demonstration version is available on the Web. The challenge for machine
learning is to use
&lt;/pre&gt;</Content>
</Section>
</Archive>