source: other-projects/maori-lang-detection/mongodb-data/googlescholar.txt@ 34089

Last change on this file since 34089 was 34089, checked in by ak19, 4 years ago

So far accumulated URLs to docs on Google scholar about or somewhat related to finding low-resource languages on the web, as Dr Bainbridge had suggested.

File size: 10.3 KB
1. Google: "low-resource languages" "common crawl"

a. TANGENTIAL: https://www.isca-speech.org/archive/SLTU_2018/pdfs/Manasa.pdf

Mining Training Data for Language Modeling Across the World's Languages.
M Prasad, T Breiner, D van Esch - SLTU, 2018 - isca-speech.org

[15] Z. Agic, D. Hovy, and A. Søgaard, “If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages,” in ACL. Center for Language Technology, University of Copenhagen, Denmark, 2015. [16] Common crawl. Common Crawl Foundation …

Cited by 7  Related articles  All 3 versions


Mining Training Data for Language Modeling Across the World’s Languages
Manasa Prasad, Theresa Breiner, Daan van Esch

Abstract
Building smart keyboards and speech recognition systems for new languages requires a large, clean text corpus to train n-gram language models on. We report our findings on how much text data can realistically be found on the web across thousands of languages. In addition, we describe an innovative, scalable approach to normalizing this data: all data sources are noisy to some extent, but this situation is even more severe for low-resource languages. To help clean the data we find across all languages in a scalable way, we built a pipeline to automatically derive the configuration for language-specific text normalization systems, which we describe here as well.
Index Terms: speech recognition, keyboard input, low-resource languages, data mining, language modeling, text normalization
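
The pipeline itself isn't spelled out in the abstract; as a rough sketch of the idea (my own Python illustration, not their system: the function names, the frequency threshold and the character-allowlist approach are all assumptions), a per-language normalization config could be derived from sample web text like this:

    # Illustrative only: derive a crude per-language "normalization config"
    # (here just a character allowlist) from sample text, then apply it.
    import unicodedata
    from collections import Counter

    def derive_charset(sample_texts, min_share=1e-4):
        """Keep characters that make up at least min_share of the sample."""
        counts = Counter()
        for text in sample_texts:
            counts.update(unicodedata.normalize("NFC", text))
        total = sum(counts.values())
        return {ch for ch, n in counts.items() if n / total >= min_share}

    def normalize(text, charset):
        """Drop characters outside the derived per-language character set."""
        text = unicodedata.normalize("NFC", text)
        return "".join(ch for ch in text if ch in charset or ch.isspace())

    # A charset derived from Maori sample text keeps the macronised vowels:
    charset = derive_charset(["He aha te mea nui o te ao? He tāngata, he tāngata, he tāngata."])
    print(normalize("He tāngata!", charset))  # prints "He tāngata"; unseen '!' is dropped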

b. TANGENTIAL: https://arxiv.org/abs/1911.00359

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave
(Submitted on 1 Nov 2019 (v1), last revised 15 Nov 2019 (this version, v2))

Pre-training text representations have led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora as long as its quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), that deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high quality corpora like Wikipedia.

Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as: arXiv:1911.00359 [cs.CL]
(or arXiv:1911.00359v2 [cs.CL] for this version)
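
For reference, a minimal sketch of the two pipeline steps the abstract names (deduplication plus fastText language identification). This is my own rendering, not the authors' code; the perplexity-based Wikipedia filter from the paper is omitted, and the confidence threshold is an assumption (lid.176.bin is the publicly released fastText language-ID model):

    # Sketch only: hash-based paragraph dedup, then fastText language ID.
    import hashlib
    import fasttext  # pip install fasttext; lid.176.bin from fasttext.cc

    lid_model = fasttext.load_model("lid.176.bin")  # pretrained language-ID model

    def dedup_paragraphs(paragraphs, seen_hashes):
        """Keep only paragraphs whose normalized hash hasn't been seen before."""
        kept = []
        for p in paragraphs:
            h = hashlib.sha1(p.strip().lower().encode("utf-8")).hexdigest()
            if h not in seen_hashes:
                seen_hashes.add(h)
                kept.append(p)
        return kept

    def identify_language(text, min_confidence=0.5):
        """Return (language code, confidence), or (None, confidence) if unsure."""
        labels, probs = lid_model.predict(text.replace("\n", " "))
        lang = labels[0].replace("__label__", "")
        return (lang, probs[0]) if probs[0] >= min_confidence else (None, probs[0])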

2. Google: locating "low-resource languages" on the web

a. https://halshs.archives-ouvertes.fr/halshs-00986144/
Finding viable seed URLs for web corpora: A scouting approach and comparative study of available sources
Adrien Barbaresi 1
1 ICAR - Interactions, Corpus, Apprentissages, Représentations
Abstract: The conventional tools of the "web as corpus" framework rely heavily on URLs obtained from search engines. Recently, the corresponding querying process became much slower or impossible to perform on a low budget. I try to find acceptable substitutes, i.e. viable link sources for web corpus construction. To this end, I perform a study of possible alternatives, including social networks as well as the Open Directory Project and Wikipedia. Four different languages (Dutch, French, Indonesian and Swedish) taken as examples show that complementary approaches are needed. My scouting approach using open-source software leads to a URL directory enriched with metadata which may be used to start a web crawl. This is more than a drop-in replacement for existing tools, since said metadata enables researchers to filter and select URLs that fit particular needs, as they are classified according to their language, their length and a few other indicators such as host- and markup-based data.
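
A bare-bones version of the scouting step (my own Python sketch; the library choices and the metadata fields are assumptions, not Barbaresi's actual toolchain) would fetch each candidate URL from the link sources and record the indicators his directory keeps, i.e. language, length, host and a rough markup share:

    # Sketch only: build a URL directory enriched with simple metadata
    # that can later be filtered before starting a web crawl.
    import re
    from urllib.parse import urlparse
    import requests                # pip install requests
    from langdetect import detect  # pip install langdetect

    def scout(candidate_urls):
        directory = []
        for url in candidate_urls:
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue  # dead or unreachable candidate
            text = " ".join(re.sub(r"<[^>]+>", " ", html).split())  # crude markup stripping
            try:
                lang = detect(text[:2000]) if text else None
            except Exception:
                lang = None
            directory.append({
                "url": url,
                "host": urlparse(url).netloc,
                "length": len(text),
                "markup_share": 1 - len(text) / max(len(html), 1),
                "language": lang,
            })
        return directory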


3. Google: finding low-resource language resources
4. Google: finding minority language internet

a. https://dl.acm.org/doi/abs/10.1145/502585.502633

Mining the web to create minority language corpora
Authors: Rayid Ghani, Rosie Jones, Dunja Mladenić
Publication: CIKM '01: Proceedings of the tenth international conference on Information and knowledge management
October 2001, Pages 279–286, https://doi.org/10.1145/502585.502633
13 citations, 479 downloads

ABSTRACT
The Web is a valuable source of language specific resources but the process of collecting, organizing and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries for collecting documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled by an automatic language classifier as relevant or irrelevant, and this feedback is used to generate new queries. We experiment with various query-generation methods and query-lengths to find inclusion/exclusion terms that are helpful for retrieving documents in the target language and find that using odds-ratio scores calculated over the documents acquired so far was one of the most consistently accurate query-generation methods. We also describe experiments using a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Experiments applying the same approach to multiple languages are also presented showing that our approach generalizes to a variety of languages.
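
The odds-ratio query generation is the concrete part here; a rough Python rendering of it follows (my own sketch; the smoothing constant, tokenisation and query syntax are guesses, not CorpusBuilder's actual settings):

    # Sketch only: score terms by how strongly they separate documents the
    # language classifier labelled relevant (target language) from irrelevant
    # ones, then build a search query from inclusion and exclusion terms.
    import math
    from collections import Counter

    def term_doc_freq(docs):
        df = Counter()
        for doc in docs:
            df.update(set(doc.lower().split()))
        return df

    def odds_ratio_scores(relevant_docs, irrelevant_docs, smoothing=0.5):
        df_rel, df_irr = term_doc_freq(relevant_docs), term_doc_freq(irrelevant_docs)
        n_rel, n_irr = len(relevant_docs), len(irrelevant_docs)
        scores = {}
        for term in set(df_rel) | set(df_irr):
            p_rel = (df_rel[term] + smoothing) / (n_rel + 2 * smoothing)
            p_irr = (df_irr[term] + smoothing) / (n_irr + 2 * smoothing)
            scores[term] = math.log((p_rel * (1 - p_irr)) / ((1 - p_rel) * p_irr))
        return scores

    def build_query(scores, n_include=3, n_exclude=2):
        ranked = sorted(scores, key=scores.get, reverse=True)
        include, exclude = ranked[:n_include], ranked[-n_exclude:]
        return " ".join(include + ["-" + t for t in exclude])

    # e.g. build_query(odds_ratio_scores(target_language_docs, other_docs))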


b. https://link.springer.com/article/10.1007/s10115-003-0121-x

Building Minority Language Corpora by Learning to Generate Web Search Queries
Rayid Ghani, Rosie Jones & Dunja Mladenic
Knowledge and Information Systems, volume 7, pages 56–83 (2005)
Published: 01 January 2005
101 Accesses, 9 Citations

Abstract
The Web is a source of valuable information, but the process of collecting, organizing, and effectively utilizing the resources it contains is difficult. We describe CorpusBuilder, an approach for automatically generating Web search queries for collecting documents matching a minority concept. The concept used for this paper is that of text documents belonging to a minority natural language on the Web. Individual documents are automatically labeled as relevant or nonrelevant using a language filter, and the feedback is used to learn what query lengths and inclusion/exclusion term-selection methods are helpful for finding previously unseen documents in the target language. Our system learns to select good query terms using a variety of term scoring methods. Using odds ratio scores calculated over the documents acquired was one of the most consistently accurate query-generation methods. To reduce the number of estimated parameters, we parameterize the query length using a Gamma distribution and present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether we initialize it with a whole document or with a handful of words elicited from a user. Experiments applying the same approach to multiple languages are also presented showing that our approach generalizes well across several languages regardless of the initial conditions.
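
The extra twist in this journal version is parameterising query length with a Gamma distribution instead of learning each length separately. Roughly (my own simplification, using a method-of-moments fit that the paper doesn't necessarily use):

    # Sketch only: fit a Gamma distribution to the lengths of queries that
    # retrieved in-language documents so far, then sample the next length.
    import random
    import statistics

    def fit_gamma(lengths):
        """Method-of-moments estimate of (shape k, scale theta)."""
        mean = statistics.mean(lengths)
        var = statistics.pvariance(lengths) or 1.0  # avoid zero variance
        theta = var / mean
        return mean / theta, theta                  # k = mean^2 / var

    def next_query_length(successful_lengths, min_len=1, max_len=10):
        k, theta = fit_gamma(successful_lengths)
        return max(min_len, min(max_len, round(random.gammavariate(k, theta))))

    # e.g. next_query_length([2, 3, 3, 4, 5, 3]) -> a plausible next query length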


c. https://minerva-access.unimelb.edu.au/handle/11343/34901
Towards a Web search service for minority language communities

Author: HUGHES, BADEN
Date: 2006
Source Title: Proceedings, OpenRoad 2006: Exploring Diversity on the Web
Publisher: State Library of Victoria
Affiliation: Arts: Department of Linguistics and Applied Linguistics; Engineering: Department of Computer Science and Software Engineering
Document Type: Conference Paper
Citation: Hughes, B. (2006). Towards a Web search service for minority language communities. In Proceedings, OpenRoad 2006: Exploring Diversity on the Web, Melbourne.
Access Status: Open Access
URI: http://hdl.handle.net/11343/34901

Abstract
Locating resources of interest on the web in the general case is at best a low-precision activity owing to the large number of pages on the web (for example, Google covers more than 8 billion web pages). As language communities (at all points on the spectrum) increasingly self-publish materials on the web, interested users are beginning to search for them in the same way that they search for general internet resources, using broad-coverage search engines with typically simple queries. Given that language resources are a minority case on the web in general, finding relevant materials for low-density or lesser-used languages is an increasingly inefficient exercise, even for experienced searchers. Furthermore, the inconsistent coverage of web content between search engines serves to complicate matters even more.

A number of previous research efforts have focused on using web data to create language corpora, mine linguistic data, build language ontologies, create thesauri, etc. The work reported in this paper contrasts with previous research in that it is not specifically oriented towards creating language resources from web data directly, but rather towards increasing the likelihood that end users searching for resources in minority languages will actually find useful results from web searches. Similarly, it differs from earlier work by virtue of its focus on search optimization directly, rather than as a component of a larger process (other researchers use the seed URIs discovered via the mechanism described in this paper in their own varied work). The work here can be seen to contribute to a user-centric agenda for locating language resources for lesser-used languages on the web. (From Introduction)