source: other-projects/nightly-tasks/diffcol/trunk/model-collect/Word-PDF-Formatting/archives/HASH8bbe.dir/doc.xml@ 29015

Last change on this file since 29015 was 29015, checked in by ak19, 10 years ago
AUTOCOMMIT by gen-model-colls.sh script. Message: Clean rebuild of model collections 1/2. Clearing out deprecated archives and index.
File size: 5.8 KB

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
<Description>
<Metadata name="gsdldoctype">indexed_doc</Metadata>
<Metadata name="Language">en</Metadata>
<Metadata name="Encoding">utf8</Metadata>
<Metadata name="Title">Bronwyn; page: 1 of 1 1 Using language models for generic entity extraction</Metadata>
<Metadata name="gsdlsourcefilename">import/langmodl.ps</Metadata>
<Metadata name="gsdlconvertedfilename">tmp/1398926157/langmodl.text</Metadata>
<Metadata name="OrigSource">langmodl.text</Metadata>
<Metadata name="Source">langmodl.ps</Metadata>
<Metadata name="SourceFile">langmodl.ps</Metadata>
<Metadata name="Plugin">PostScriptPlugin</Metadata>
<Metadata name="FileSize">16751</Metadata>
<Metadata name="FilenameRoot">langmodl</Metadata>
<Metadata name="FileFormat">PS</Metadata>
<Metadata name="srcicon">_iconps_</Metadata>
<Metadata name="srclink_file">doc.ps</Metadata>
<Metadata name="srclinkFile">doc.ps</Metadata>
<Metadata name="dc.Creator">Ian H. Witten</Metadata>
<Metadata name="dc.Creator">Zane Bray</Metadata>
<Metadata name="dc.Creator">Malika Mahoui</Metadata>
<Metadata name="dc.Creator">W.J. Teahan</Metadata>
<Metadata name="Identifier">HASH8bbe6da0374b413b1b355c</Metadata>
<Metadata name="lastmodified">1398925758</Metadata>
<Metadata name="lastmodifieddate">20140501</Metadata>
<Metadata name="oailastmodified">1398926157</Metadata>
<Metadata name="oailastmodifieddate">20140501</Metadata>
<Metadata name="assocfilepath">HASH8bbe.dir</Metadata>
<Metadata name="gsdlassocfile">doc.ps:application/postscript:</Metadata>
</Description>
<Content><pre>
Using language models for generic entity extraction

Ian H. Witten, Zane Bray, Malika Mahoui, W.J. Teahan
Computer Science, University of Waikato, Hamilton, New Zealand
[email protected]

Abstract

This paper describes the use of statistical language modeling techniques,
such as are commonly used for text compression, to extract meaningful,
low-level information about the location of semantic tokens, or "entities,"
in text. We begin by marking up several different token types in training
documents: for example, people's names, dates and time periods, phone
numbers, and sums of money. We form a language model for each token type
and examine how accurately it identifies new tokens.
We then apply a search algorithm to insert token boundaries in a way that
maximizes compression of the entire test document. The technique can be
applied to hierarchically-defined tokens, leading to a kind of "soft
parsing" that will, we believe, be able to identify structured items such
as references and tables in HTML or plain text, based on nothing more than
a few marked-up examples in training documents.

1. INTRODUCTION

Text mining is about looking for patterns in text, and may be defined as
the process of analyzing text to extract information that is useful for
particular purposes. Compared with the kind of data stored in databases,
text is unstructured, amorphous, and difficult to deal with. Nevertheless,
in modern Western culture, text is the most common vehicle for the formal
exchange of information. The motivation for trying to extract information
from it is compelling, even if success is only partial.

Text mining is possible because you do not have to understand text in
order to extract useful information from it. Here are four examples.
First, if only names could be identified, links could be inserted
automatically to other places that mention the same name: links that are
"dynamically evaluated" by calling upon a search engine to bind them at
click time. Second, actions can be associated with different types of
data, using either explicit programming or programming-by-demonstration
techniques.
A day/time specification appearing anywhere within one's email could be
associated with diary actions such as updating a personal organizer or
creating an automatic reminder, and each mention of a day/time in the text
could raise a popup menu of calendar-based actions. Third, text could be
mined for data in tabular format, allowing databases to be created from
formatted tables such as stock-market information on Web pages. Fourth, an
agent could monitor incoming newswire stories for company names and
collect documents that mention them: an automated press clipping service.

In all these examples, the key problem is to recognize different types of
target fragments, which we will call tokens or "entities". This is really
a kind of language recognition problem: we have a text made up of
different sublanguages (for personal names, company names, dates, table
entries, and so on) and seek to determine which parts are expressed in
which language.

The information extraction research community (of which we were, until
recently, unaware) has studied these tasks and reported results at annual
Message Understanding Conferences (MUC). For example, "named entities" are
defined as proper names and quantities of interest, including personal,
organization, and location names, as well as dates, times, percentages,
and monetary amounts (Chinchor, 1999).

The standard approach to this problem is manual: tokenizers and grammars
are hand-designed for the particular data being extracted. Looking at
current commercial state-of-the-art text mining software, for example,
IBM's Intelligent Miner for Text (Tkach, 1997) uses specific recognition
modules carefully programmed for the different data types, while Apple's
data detectors (Nardi et al., 1998) use language grammars.
The Text Tokenization Tool of Grover et al. (1999) is another example, and
a demonstration version is available on the Web. The challenge for machine
learning is to use
</pre></Content>
</Section>
</Archive>
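
The abstract in the archived text describes forming a language model for each token type and checking how accurately it identifies new tokens. The following Python sketch illustrates that idea only; it is not the authors' implementation. It assumes a simple add-one-smoothed character bigram model as a stand-in for the compression-style models the paper refers to, and the class name `CharBigramModel`, the function `classify`, the start marker `^`, and the training strings are all hypothetical.

```python
import math
from collections import defaultdict


class CharBigramModel:
    """Order-1 character model with add-one smoothing.

    Assumption: a simplified stand-in for a text-compression model.
    """

    START = "^"  # hypothetical start-of-token marker

    def __init__(self, examples, alphabet_size=256):
        self.alphabet_size = alphabet_size
        self.pair_counts = defaultdict(int)     # (previous char, char) -> count
        self.context_counts = defaultdict(int)  # previous char -> count
        for text in examples:
            prev = self.START
            for ch in text:
                self.pair_counts[(prev, ch)] += 1
                self.context_counts[prev] += 1
                prev = ch

    def bits(self, text):
        """Return the code length of text, in bits, under this model."""
        total = 0.0
        prev = self.START
        for ch in text:
            seen = self.pair_counts.get((prev, ch), 0)
            context = self.context_counts.get(prev, 0)
            probability = (seen + 1) / (context + self.alphabet_size)
            total -= math.log2(probability)
            prev = ch
        return total


def classify(token, models):
    """Pick the token type whose model encodes the token in the fewest bits."""
    return min(models, key=lambda name: models[name].bits(token))


# Hypothetical marked-up training tokens, one list per token type.
models = {
    "date": CharBigramModel(["12 March 1999", "3 June 1998", "21 May 1997", "7 July 1996"]),
    "phone": CharBigramModel(["555-0134", "555-0199", "555-0172", "555-0101"]),
}

print(classify("555-0188", models))
print(classify("14 April 1995", models))
```

With these toy examples, the phone-like string is cheaper to encode under the phone model and the date-like string under the date model. The boundary search over a whole document that the abstract mentions is not covered by this sketch.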
