source: other-projects/maori-lang-detection/mongodb-data/googlescholar.txt@ 34089

Last change on this file since 34089 was 34089, checked in by ak19, 4 years ago

So far accumulated URLs to docs on Google scholar about or somewhat related to finding low-resource languages on the web, as Dr Bainbridge had suggested.

File size: 10.3 KB
1. Google: "low-resource languages" "common crawl"

a. TANGENTIAL: https://www.isca-speech.org/archive/SLTU_2018/pdfs/Manasa.pdf

Mining Training Data for Language Modeling Across the World's Languages.
M Prasad, T Breiner, D van Esch - SLTU, 2018 - isca-speech.org

[15] Z. Agic, D. Hovy, and A. Søgaard, “If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages,” in ACL. Center for Language Technology, University of Copenhagen, Denmark, 2015. [16] Common crawl. Common Crawl Foundation …

Cited by 7  Related articles  All 3 versions


Mining Training Data for Language Modeling Across the World’s Languages
Manasa Prasad, Theresa Breiner, Daan van Esch

Abstract
Building smart keyboards and speech recognition systems for new languages requires a large, clean text corpus to train n-gram language models on. We report our findings on how much text data can realistically be found on the web across thousands of languages. In addition, we describe an innovative, scalable approach to normalizing this data: all data sources are noisy to some extent, but this situation is even more severe for low-resource languages. To help clean the data we find across all languages in a scalable way, we built a pipeline to automatically derive the configuration for language-specific text normalization systems, which we describe here as well.
Index Terms: speech recognition, keyboard input, low-resource languages, data mining, language modeling, text normalization
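
The pipeline itself isn't spelled out in the abstract; as a rough sketch of the idea (my own Python illustration, not their system: the function names, the frequency threshold and the character-allowlist approach are all assumptions), a per-language normalization config could be derived from sample web text like this:

    # Illustrative only: derive a crude per-language "normalization config"
    # (here just a character allowlist) from sample text, then apply it.
    import unicodedata
    from collections import Counter

    def derive_charset(sample_texts, min_share=1e-4):
        """Keep characters that make up at least min_share of the sample."""
        counts = Counter()
        for text in sample_texts:
            counts.update(unicodedata.normalize("NFC", text))
        total = sum(counts.values())
        return {ch for ch, n in counts.items() if n / total >= min_share}

    def normalize(text, charset):
        """Drop characters outside the derived per-language character set."""
        text = unicodedata.normalize("NFC", text)
        return "".join(ch for ch in text if ch in charset or ch.isspace())

    # A charset derived from Maori sample text keeps the macronised vowels:
    charset = derive_charset(["He aha te mea nui o te ao? He tāngata, he tāngata, he tāngata."])
    print(normalize("He tāngata!", charset))  # prints "He tāngata"; unseen '!' is dropped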

b. TANGENTIAL: https://arxiv.org/abs/1911.00359

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave
(Submitted on 1 Nov 2019 (v1), last revised 15 Nov 2019 (this version, v2))

Pre-training text representations have led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora as long as its quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), that deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high quality corpora like Wikipedia.

Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as: arXiv:1911.00359 [cs.CL]
(or arXiv:1911.00359v2 [cs.CL] for this version)
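
For reference, a minimal sketch of the two pipeline steps the abstract names (deduplication plus fastText language identification). This is my own rendering, not the authors' code; the perplexity-based Wikipedia filter from the paper is omitted, and the confidence threshold is an assumption (lid.176.bin is the publicly released fastText language-ID model):

    # Sketch only: hash-based paragraph dedup, then fastText language ID.
    import hashlib
    import fasttext  # pip install fasttext; lid.176.bin from fasttext.cc

    lid_model = fasttext.load_model("lid.176.bin")  # pretrained language-ID model

    def dedup_paragraphs(paragraphs, seen_hashes):
        """Keep only paragraphs whose normalized hash hasn't been seen before."""
        kept = []
        for p in paragraphs:
            h = hashlib.sha1(p.strip().lower().encode("utf-8")).hexdigest()
            if h not in seen_hashes:
                seen_hashes.add(h)
                kept.append(p)
        return kept

    def identify_language(text, min_confidence=0.5):
        """Return (language code, confidence), or (None, confidence) if unsure."""
        labels, probs = lid_model.predict(text.replace("\n", " "))
        lang = labels[0].replace("__label__", "")
        return (lang, probs[0]) if probs[0] >= min_confidence else (None, probs[0])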

2. Google: locating "low-resource languages" on the web

a. https://halshs.archives-ouvertes.fr/halshs-00986144/
Finding viable seed URLs for web corpora: A scouting approach and comparative study of available sources
Adrien Barbaresi 1
1 ICAR - Interactions, Corpus, Apprentissages, Représentations
Abstract: The conventional tools of the "web as corpus" framework rely heavily on URLs obtained from search engines. Recently, the corresponding querying process became much slower or impossible to perform on a low budget. I try to find acceptable substitutes, i.e. viable link sources for web corpus construction. To this end, I perform a study of possible alternatives, including social networks as well as the Open Directory Project and Wikipedia. Four different languages (Dutch, French, Indonesian and Swedish) taken as examples show that complementary approaches are needed. My scouting approach using open-source software leads to a URL directory enriched with metadata which may be used to start a web crawl. This is more than a drop-in replacement for existing tools, since said metadata enables researchers to filter and select URLs that fit particular needs, as they are classified according to their language, their length and a few other indicators such as host- and markup-based data.
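
A bare-bones version of the scouting step (my own Python sketch; the library choices and the metadata fields are assumptions, not Barbaresi's actual toolchain) would fetch each candidate URL from the link sources and record the indicators his directory keeps, i.e. language, length, host and a rough markup share:

    # Sketch only: build a URL directory enriched with simple metadata
    # that can later be filtered before starting a web crawl.
    import re
    from urllib.parse import urlparse
    import requests                # pip install requests
    from langdetect import detect  # pip install langdetect

    def scout(candidate_urls):
        directory = []
        for url in candidate_urls:
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue  # dead or unreachable candidate
            text = " ".join(re.sub(r"<[^>]+>", " ", html).split())  # crude markup stripping
            try:
                lang = detect(text[:2000]) if text else None
            except Exception:
                lang = None
            directory.append({
                "url": url,
                "host": urlparse(url).netloc,
                "length": len(text),
                "markup_share": 1 - len(text) / max(len(html), 1),
                "language": lang,
            })
        return directory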


3. Google: finding low-resource language resources
4. Google: finding minority language internet

a. https://dl.acm.org/doi/abs/10.1145/502585.502633

Mining the web to create minority language corpora
Authors: Rayid Ghani, Rosie Jones, Dunja Mladenić
Publication: CIKM '01: Proceedings of the tenth international conference on Information and knowledge management
October 2001, Pages 279–286, https://doi.org/10.1145/502585.502633
13 citations, 479 downloads

ABSTRACT
The Web is a valuable source of language specific resources but the process of collecting, organizing and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries for collecting documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled by an automatic language classifier as relevant or irrelevant, and this feedback is used to generate new queries. We experiment with various query-generation methods and query-lengths to find inclusion/exclusion terms that are helpful for retrieving documents in the target language and find that using odds-ratio scores calculated over the documents acquired so far was one of the most consistently accurate query-generation methods. We also describe experiments using a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Experiments applying the same approach to multiple languages are also presented showing that our approach generalizes to a variety of languages.
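
The odds-ratio query generation is the concrete part here; a rough Python rendering of it follows (my own sketch; the smoothing constant, tokenisation and query syntax are guesses, not CorpusBuilder's actual settings):

    # Sketch only: score terms by how strongly they separate documents the
    # language classifier labelled relevant (target language) from irrelevant
    # ones, then build a search query from inclusion and exclusion terms.
    import math
    from collections import Counter

    def term_doc_freq(docs):
        df = Counter()
        for doc in docs:
            df.update(set(doc.lower().split()))
        return df

    def odds_ratio_scores(relevant_docs, irrelevant_docs, smoothing=0.5):
        df_rel, df_irr = term_doc_freq(relevant_docs), term_doc_freq(irrelevant_docs)
        n_rel, n_irr = len(relevant_docs), len(irrelevant_docs)
        scores = {}
        for term in set(df_rel) | set(df_irr):
            p_rel = (df_rel[term] + smoothing) / (n_rel + 2 * smoothing)
            p_irr = (df_irr[term] + smoothing) / (n_irr + 2 * smoothing)
            scores[term] = math.log((p_rel * (1 - p_irr)) / ((1 - p_rel) * p_irr))
        return scores

    def build_query(scores, n_include=3, n_exclude=2):
        ranked = sorted(scores, key=scores.get, reverse=True)
        include, exclude = ranked[:n_include], ranked[-n_exclude:]
        return " ".join(include + ["-" + t for t in exclude])

    # e.g. build_query(odds_ratio_scores(target_language_docs, other_docs))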


b. https://link.springer.com/article/10.1007/s10115-003-0121-x

Building Minority Language Corpora by Learning to Generate Web Search Queries
Rayid Ghani, Rosie Jones & Dunja Mladenic
Knowledge and Information Systems, volume 7, pages 56–83 (2005)
Published: 01 January 2005
101 Accesses, 9 Citations

Abstract
The Web is a source of valuable information, but the process of collecting, organizing, and effectively utilizing the resources it contains is difficult. We describe CorpusBuilder, an approach for automatically generating Web search queries for collecting documents matching a minority concept. The concept used for this paper is that of text documents belonging to a minority natural language on the Web. Individual documents are automatically labeled as relevant or nonrelevant using a language filter, and the feedback is used to learn what query lengths and inclusion/exclusion term-selection methods are helpful for finding previously unseen documents in the target language. Our system learns to select good query terms using a variety of term scoring methods. Using odds ratio scores calculated over the documents acquired was one of the most consistently accurate query-generation methods. To reduce the number of estimated parameters, we parameterize the query length using a Gamma distribution and present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether we initialize it with a whole document or with a handful of words elicited from a user. Experiments applying the same approach to multiple languages are also presented showing that our approach generalizes well across several languages regardless of the initial conditions.
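
The extra twist in this journal version is parameterising query length with a Gamma distribution instead of learning each length separately. Roughly (my own simplification, using a method-of-moments fit that the paper doesn't necessarily use):

    # Sketch only: fit a Gamma distribution to the lengths of queries that
    # retrieved in-language documents so far, then sample the next length.
    import random
    import statistics

    def fit_gamma(lengths):
        """Method-of-moments estimate of (shape k, scale theta)."""
        mean = statistics.mean(lengths)
        var = statistics.pvariance(lengths) or 1.0  # avoid zero variance
        theta = var / mean
        return mean / theta, theta                  # k = mean^2 / var

    def next_query_length(successful_lengths, min_len=1, max_len=10):
        k, theta = fit_gamma(successful_lengths)
        return max(min_len, min(max_len, round(random.gammavariate(k, theta))))

    # e.g. next_query_length([2, 3, 3, 4, 5, 3]) -> a plausible next query length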


c. https://minerva-access.unimelb.edu.au/handle/11343/34901
Towards a Web search service for minority language communities

Author: HUGHES, BADEN
Date: 2006
Source Title: Proceedings, OpenRoad 2006: Exploring Diversity on the Web
Publisher: State Library of Victoria
Affiliation: Arts: Department of Linguistics and Applied Linguistics; Engineering: Department of Computer Science and Software Engineering
Document Type: Conference Paper
Citation: Hughes, B. (2006). Towards a Web search service for minority language communities. In Proceedings, OpenRoad 2006: Exploring Diversity on the Web, Melbourne.
Access Status: Open Access
URI: http://hdl.handle.net/11343/34901

Abstract
Locating resources of interest on the web in the general case is at best a low-precision activity owing to the large number of pages on the web (for example, Google covers more than 8 billion web pages). As language communities (at all points on the spectrum) increasingly self-publish materials on the web, interested users are beginning to search for them in the same way that they search for general internet resources, using broad-coverage search engines with typically simple queries. Given that language resources are a minority case on the web in general, finding relevant materials for low-density or lesser-used languages is an increasingly inefficient exercise, even for experienced searchers. Furthermore, the inconsistent coverage of web content between search engines serves to complicate matters even more.

A number of previous research efforts have focused on using web data to create language corpora, mine linguistic data, build language ontologies, create thesauri, etc. The work reported in this paper contrasts with previous research in that it is not specifically oriented towards creating language resources from web data directly, but rather towards increasing the likelihood that end users searching for resources in minority languages will actually find useful results from web searches. Similarly, it differs from earlier work by virtue of its focus on search optimization directly, rather than as a component of a larger process (other researchers use the seed URIs discovered via the mechanism described in this paper in their own varied work). The work here can be seen to contribute to a user-centric agenda for locating language resources for lesser-used languages on the web. (From Introduction)