RUN:
export HERITRIX_HOME=/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT
$HERITRIX_HOME/bin/heritrix -a admin:admin
Visit: https://localhost:8443/engine

MORE READING:
http://open-s.com/en/content/heritrix-configuration
http://open-s.com/en/content/wget
    https://superuser.com/questions/655346/wget-execute-script-after-download
https://www.gnu.org/software/wget/manual/wget.html
    ‘--execute command

    Execute command as if it were a part of .wgetrc (see Startup File). A command thus invoked will be executed after the commands in .wgetrc, thus taking precedence over them.
    If you need to specify more than one wgetrc command, use multiple instances of ‘-e’.

"crawler TRAPS": https://www.contentkingapp.com/academy/crawler-traps/
    https://www.billhartzer.com/internet-marketing/crawl-thousands-urls/


-----------------------------------------
Scope DecideReject Configuration rules
-----------------------------------------
Issues in H3.docx
https://sbforge.org/download/attachments/.../Issues%20in%20H3.docx?version=1...

By using maxTransHops and maxSpeculativeHops we thought that we could manage how long our discovery path 'X' should be, but we see different results and ...
DOCX: https://sbforge.org/download/attachments/21856421/Issues%20in%20H3.docx?version=1&modificationDate=1465912557659&api=v2&usg=AOvVaw03EWO6Xy0XiMRITvFwR2v0
[ https://webcache.googleusercontent.com/search?q=cache:bEM1AdjQR2cJ:https://sbforge.org/download/attachments/21856421/Issues%2520in%2520H3.docx%3Fversion%3D1%26modificationDate%3D1465912557659%26api%3Dv2+&cd=10&hl=en&ct=clnk&gl=nz&client=ubuntu ]


    Speculative Hops

    No real effect from using maxTransHops or maxSpeculativeHops

    By using maxTransHops and maxSpeculativeHops we thought that we could manage how long our discovery path ‘X’ should be, but we see different results and we still harvest several ‘XX’ or more.
    To accomplish this we use HopsPathMatchesDecideRule but we haven’t found a specific path that we certainly can say we want excluded from our harvest. We tried to make a regex with R, E and X. Does anyone have experience with this?
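
For context, a sketch of the rules being discussed, as they might appear in the scope DecideRuleSequence of crawler-beans.cxml (the class and property names are my assumptions from the Heritrix 3 deciderules package; the hop counts and the hop-path regex are made-up examples, not recommendations):

    <!-- limit how far transitive/speculative hops can carry the crawl -->
    <bean class="org.archive.modules.deciderules.TransclusionDecideRule">
      <property name="maxTransHops" value="2"/>
      <property name="maxSpeculativeHops" value="1"/>
    </bean>
    <!-- reject any URI whose discovery path contains two or more speculative ('X') hops -->
    <bean class="org.archive.modules.deciderules.HopsPathMatchesDecideRule">
      <property name="decision" value="REJECT"/>
      <property name="regex" value=".*X.*X.*"/>
    </bean>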



    Path seeds

    Harvesting specific URI paths that don't end with slashes

    When adding path seeds to only harvest from a specific place on a domain we sometimes have problems with redirections.
    We always end path seeds with a slash, but sometimes we are redirected, e.g. HTTP 301 http://www.bt.dk/plus/ → http://www.bt.dk/plus.
    If we have the following as a seed, http://www.bt.dk/plus, will we harvest the whole site, because H3 harvests from the last slash in the seed?
    Any experience with path seeds?


--------------------------------------------------------------------------------

TODO:

https://www.stat.auckland.ac.nz/~paul/Reports/maori/maori.html
    The macron is the only accent required for written Māori and the accent can only be applied to vowels, so the full set of accented characters is:
    lower case a, with macron ā    upper case A, with macron Ā
    lower case e, with macron ē    upper case E, with macron Ē
    lower case i, with macron ī    upper case I, with macron Ī
    lower case o, with macron ō    upper case O, with macron Ō
    lower case u, with macron ū    upper case U, with macron Ū

http://emacs.1067599.n8.nabble.com/Entering-vowels-with-macrons-td72136.html
    Ctrl+\
    type rfc1345 (and enter)
    type &a- to get a-macron
    Then Ctrl+\ to toggle back to the default input
    (Can thereafter toggle with Ctrl+\ to get back to the rfc1345 input method)



https://sachachua.com/blog/2011/04/writing-macrons-linux-latin-pronunciation/
To add macrons: Ctrl-\ "latin-alt-postfix". But it doesn't have all the macronised vowels used in te reo.
Then Ctrl-\ to get the default input method back.
https://www.gnu.org/software/emacs/manual/html_node/emacs/Select-Input-Method.html


DOWNLOAD HERITRIX:
BINARY: http://builds.archive.org/maven2/org/archive/heritrix/heritrix/3.4.0-SNAPSHOT/
CODE, STATIC: https://github.com/internetarchive/heritrix3
https://github.com/internetarchive/heritrix3/wiki/How%20To%20Crawl

WEB CURATOR TOOL:
http://dia-nz.github.io/webcurator/
https://webcuratortool.readthedocs.io/en/latest/guides/quick-start-guide.html
https://webcuratortool.readthedocs.io/en/latest/guides/overview-history.html


Crawl as much of the .nz domain as we can, run the language detection for Maori on the pages that come through, and save only those pages. Also configure the crawler (somehow) so that it knows it doesn't need to re-download pages it has already *inspected* for language, not just pages it has already *stored*, since we only store mri-language pages and not every page inspected.

Then break the pages up into sentences using our SentenceDetector model.


SURT urls:
http://crawler.archive.org/apidocs/org/archive/util/SURT.html

WARC
https://en.wikipedia.org/wiki/Web_ARChive

Q:
"The harvested material is captured in ARC/WARC format which has strong storage and archiving characteristics." at https://webcuratortool.readthedocs.io/en/latest/guides/user-manual.html

https://blogs.loc.gov/thesignal/2013/11/anatomy-of-a-web-archive/

    Nicholas Taylor
    November 13, 2013 at 2:49 pm

    Hi Ross, thanks for the comment. The tools for personal archiving of web pages and websites to WARC format are getting better, with the capture side further along than the replay side. Archive Ready (http://archiveready.com/) and WARCreate (http://warcreate.com/) can both be used to create a WARC containing all of the objects that make up an individual web page. GNU Wget 1.14+ (http://www.archiveteam.org/index.php?title=Wget_with_WARC_output) and WAIL (http://matkelly.com/wail/) can both be used to capture entire websites to WARC. WAIL also bundles a standalone Wayback Machine that runs locally, which is the easiest way I know of for users to view the content they’ve collected in WARC format.

https://webcuratortool.readthedocs.io/en/latest/guides/user-manual.html
Doesn't mention https

"How targets work

Targets consist of several important elements, including a name and description for internal use; a set of Seed URLs, ******a web harvester profile that controls the behaviour of the web crawler during the harvest******, one or more schedules that specify when the Target will be harvested, and (optionally) a set of descriptive metadata for the Target."

Harvester Configuration section:
"The remaining tabs Pre-fetchers, Fetchers, Extractors, Writers, and Post-Processors are a series of processors that a URI passes through when it is crawled."


http://crawler.archive.org/articles/user_manual/config.html
Look for post-process*. Found under:
6.1.3. Processing Chains

6.1.2. Frontier

The Frontier is a pluggable module that maintains the internal state of the crawl: what URIs have been discovered, crawled, etc. As such its selection greatly affects, for instance, the order in which discovered URIs are crawled.

There is only one Frontier per crawl job.

Multiple Frontiers are provided with Heritrix, each of a particular character.

6.1.2.1. BdbFrontier

The default Frontier in Heritrix as of 1.4.0 and later is the BdbFrontier (previously, the default was the Section 6.1.2.2, “HostQueuesFrontier”). The BdbFrontier visits URIs and sites discovered in a generally breadth-first manner, it offers configuration options controlling how it throttles its activity against particular hosts, and whether it has a bias towards finishing hosts in progress ('site-first' crawling) or cycling among all hosts with pending URIs.

Discovered URIs are only crawled once, except that robots.txt and DNS information can be configured so that they are refreshed at specified intervals for each host.

The main difference between the BdbFrontier and its precursor, Section 6.1.2.2, “HostQueuesFrontier”, is that the BdbFrontier uses BerkeleyDB Java Edition to shift more running Frontier state to disk.

6.1.2.2. HostQueuesFrontier

The forerunner of the Section 6.1.2.1, “BdbFrontier”. Now deprecated, mostly because its custom disk-based data structures could not move as much Frontier state out of main memory as the BerkeleyDB Java Edition approach. Has the same general characteristics as the Section 6.1.2.1, “BdbFrontier”.


https://webcuratortool.readthedocs.io/en/latest/guides/quick-start-guide.html
You can use OpenWayback to view harvests from within WCT, see the wiki on the WCT Github page: https://github.com/DIA-NZ/webcurator/wiki/Wayback-Integration

https://webarchive.jira.com/wiki/spaces/Heritrix/overview
https://github.com/internetarchive/heritrix3/wiki/Heritrix3

    "Unlike with previous releases, the web control interface is only made available via secure-socket HTTPS, and corresponding to this change the default port has changed to 8443. Additionally, unless you supply a compatible keystore via the new optional '-s' command-line switch, an 'ad-hoc' keystore with a new locally-generated SSL-capable certificate will be created (and then reused on future launches).

    To then contact the web interface from a browser running on the same machine, visit the URL:

    https://localhost:8443/
    "

    LOCALHOST WITH HTTPS is possible??? vs https://letsencrypt.org/docs/certificates-for-localhost/

    "For local development

    If you’re developing a web app, it’s useful to run a local web server like Apache or Nginx, and access it via http://localhost:8000/ in your web browser. However, web browsers behave in subtly different ways on HTTP vs HTTPS pages. The main difference: On an HTTPS page, any requests to load JavaScript from an HTTP URL will be blocked. So if you’re developing locally using HTTP, you might add a script tag that works fine on your development machine, but breaks when you deploy to your HTTPS production site. To catch this kind of problem, it’s useful to set up HTTPS on your local web server. However, you don’t want to see certificate warnings all the time. How do you get the green lock locally?

    The best option: Generate your own certificate, either self-signed or signed by a local root, and trust it in your operating system’s trust store. Then use that certificate in your local web server. See below for details."

    [Googled: https certificate localhost
    https://www.freecodecamp.org/news/how-to-get-https-working-on-your-local-development-environment-in-5-minutes-7af615770eec/
    ]


https://github.com/internetarchive/heritrix3/wiki/Heritrix%20Output#HeritrixOutput-WARCfiles
* source-report.txt

This report contains a line item for each host, which includes the seed from which the host was reached.
Note

    The sourceTagSeeds property of the TextSeedModule bean must be set to true for this report to be generated.

* WARC files

Assuming you are using the WARC writer that comes with Heritrix, a number of WARC files will be generated containing crawled content.

You can specify the storage location of WARC files by setting the directory value of the WARCWriterProcessor bean.
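
A sketch of both settings as they would appear in crawler-beans.cxml (the bean ids 'seeds' and 'warcWriter' and the 'warcs' directory value are assumptions based on the stock profile; only the properties of interest are shown):

    <!-- enable per-seed source tagging so source-report.txt gets generated -->
    <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
      <property name="sourceTagSeeds" value="true"/>
      <!-- (the textSource property holding the actual seed list is omitted here) -->
    </bean>

    <!-- choose where the generated WARC files are written -->
    <bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
      <property name="directory" value="warcs"/>
    </bean>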


https://github.com/internetarchive/heritrix3/wiki/Archiving%20Rich-Media%20Content
Large File Sizes

Rich-media content, such as Flash and video, is usually much larger than standard text/html pages. Crawling such content requires large investments in storage and bandwidth. To mitigate these issues, deduplication is recommended for rich-media crawls. Deduplication detects previously collected content that is redundant and skips the download of such content. Pointers to the duplicate content allow it to appear in subsequent crawls. For details see Configuring Heritrix for Deduplication.


Excessive Memory and CPU Usage

Downloading rich-media content can often cause excessive load to be placed on the crawling computer's memory and CPU. For example, extracting links from Flash and other rich-media resources requires extensive data parsing, which is CPU intensive. Atypical input patterns can also cause excessive CPU usage when regular expressions used by Heritrix are run. It is therefore recommended that rich-media crawls be allocated more memory and CPU than "normal" crawls. The memory allocated to Heritrix is set from the command line. The following example shows the command line option to allocate 1 GB of memory to Heritrix, which should be sufficient for most rich-media crawls.

export JAVA_OPTS=-Xmx1024M

Multi-core processors are also recommended for rich-media crawls.


Streaming media
and
Social Networking Sites
Many social networking sites make use of rich-media to enhance their user-experience. For specific guidelines on archiving social media sites see Archiving Social Networking Sites with Archive-It. These instructions apply to the Archive-It application, which is built on top of Heritrix.


Q: https://github.com/internetarchive/heritrix3/wiki/Avoiding%20Too%20Much%20Dynamic%20Content
"To allow both foo.org and www.foo.org to be captured, you could add two seeds: http://www.foo.org/ and http://foo.org/. To allow every subdomain of foo.org to be crawled, you can add the seed http://foo.org. Note the absence of a trailing slash."
(Does the latter encompass both of the former?)
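
For reference, a sketch of how the two seed styles would be listed in the seeds bean (assuming the stock TextSeedModule + ConfigString arrangement from crawler-beans.cxml; foo.org is just the wiki's example domain):

    <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
      <property name="textSource">
        <bean class="org.archive.spring.ConfigString">
          <property name="value">
            <value>
# with trailing slashes: capture foo.org and www.foo.org only
http://www.foo.org/
http://foo.org/
# without a trailing slash: capture every subdomain of foo.org
# http://foo.org
            </value>
          </property>
        </bean>
      </property>
    </bean>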


Delete the TransclusionDecideRule, since this rule has the potential to lead Heritrix onto another host. For example, if a URI returns a 301 (moved permanently) or 302 (found) response code as well as a URI that contains a different host name than the seeds, Heritrix would accept this URI using the TransclusionDecideRule. Removing this rule will keep Heritrix from straying off of our www.foo.org host.
...
Alternately, you can add the MatchesFilePatternDecideRule. Set usePresetPattern to CUSTOM and set the regexp to something like: .*foo\.org(?!/calendar).*|.*foo\.org/calendar?year=200[56].*
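
A sketch of the alternative the wiki describes, with the TransclusionDecideRule entry dropped from the scope's rules list and MatchesFilePatternDecideRule added instead (the class is assumed to live in org.archive.modules.deciderules, and the property names are taken from the wiki wording above; I haven't verified them against the 3.4.0 beans):

    <!-- in the scope DecideRuleSequence, with the TransclusionDecideRule bean removed -->
    <bean class="org.archive.modules.deciderules.MatchesFilePatternDecideRule">
      <property name="decision" value="ACCEPT"/>
      <property name="usePresetPattern" value="CUSTOM"/>
      <!-- regexp copied from the wiki's calendar example above -->
      <property name="regexp" value=".*foo\.org(?!/calendar).*|.*foo\.org/calendar?year=200[56].*"/>
    </bean>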


https://github.com/internetarchive/heritrix3/wiki/Mirroring%20HTML%20Files%20Only
Mirroring HTML Files Only

Suppose you only want to crawl URIs that match http://foo.org/bar/*.html. Also, you would like to save the crawled files in a file/directory format instead of saving them in WARC files. Also, assume the web server is case-sensitive. For example, http://foo.org/bar/abc.html and http://foo.org/bar/ABC.HTML are pointing to two different resources.


!! Heritrix needs to be configured to differentiate between abc.html and ABC.HTML. Do this by removing the LowercaseRule from the canonicalizationPolicy bean.
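
What I think that change looks like (a sketch assuming the stock canonicalizationPolicy bean, org.archive.modules.canonicalize.RulesCanonicalizationPolicy, whose rules list normally includes a LowercaseRule; the other rules shown are just examples of entries that stay):

    <bean id="canonicalizationPolicy"
          class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy">
      <property name="rules">
        <list>
          <!-- LowercaseRule removed so abc.html and ABC.HTML remain distinct URIs -->
          <bean class="org.archive.modules.canonicalize.StripUserinfoRule"/>
          <bean class="org.archive.modules.canonicalize.StripWWWNRule"/>
          <bean class="org.archive.modules.canonicalize.FixupQueryString"/>
        </list>
      </property>
    </bean>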

https://github.com/internetarchive/heritrix3/wiki/Only%20Store%20Successful%20HTML%20Pages


https://github.com/internetarchive/heritrix3/wiki/Jobs
[multiple URLs to crawl can be specified. But what is the separator]

Look up: Spring framework, Spring beans


https://localhost:8443/engine/job/pinky/jobdir/crawler-beans.cxml?format=textedit
 <!--
   PROCESSING CHAINS
    Much of the crawler's work is specified by the sequential
    application of swappable Processor modules. These Processors
    are collected into three 'chains'. The CandidateChain is applied
    to URIs being considered for inclusion, before a URI is enqueued
    for collection. The FetchChain is applied to URIs when their
    turn for collection comes up. The DispositionChain is applied
    after a URI is fetched and analyzed/link-extracted.
  -->


https://github.com/internetarchive/heritrix3/wiki/Fetch%20Chain%20Processors
fetchHttp

This processor fetches HTTP URIs. As of Heritrix 3.1, the crawler will now properly decode 'chunked' Transfer-Encoding -- even if encountered when it should not be used, as in a response to an HTTP/1.0 request. Additionally, the fetchHttp processor now includes the parameter 'useHTTP11', which if true, will cause Heritrix to report its requests as 'HTTP/1.1'. This allows sites to use the 'chunked' Transfer-Encoding. (The default for this parameter is false for now, and Heritrix still does not reuse a persistent connection for more than one request to a site.)
fetchHttp also includes the parameter 'acceptCompression', which if true, will cause Heritrix requests to include an "Accept-Encoding: gzip,deflate" header, which offers to receive compressed responses. (The default for this parameter is false for now.)
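
A sketch of switching both of those on (assuming the stock fetchHttp bean id and the org.archive.modules.fetcher.FetchHTTP class; per the text above both default to false):

    <bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
      <!-- report requests as HTTP/1.1 so sites may use chunked Transfer-Encoding -->
      <property name="useHTTP11" value="true"/>
      <!-- offer to accept gzip/deflate-compressed responses -->
      <property name="acceptCompression" value="true"/>
    </bean>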

extractorHttp

This processor extracts outlinks from HTTP headers. As of Heritrix 3.1, the extractorHttp processor now considers any URI on a hostname to imply that the '/favicon.ico' from the same host should be fetched. Also, as of Heritrix 3.1, the "inferRootPage" property has been added to the extractorHttp bean. If this property is "true", Heritrix infers the '/' root page from any other URI on the same hostname. The default for this setting is "false", which means the pre-3.1 behavior of only fetching the root page if it is a seed or otherwise discovered and in-scope remains in effect. Discovery via these new heuristics is considered to be a new 'I' (inferred) hop-type, and is treated the same in scoping/transclusion decisions as an 'E' (embed).


https://github.com/internetarchive/heritrix3/wiki/Processor%20Settings

fetchHttp:

timeoutSeconds - This setting determines how long an HTTP request will wait for a resource to respond. This setting should be set to a high value.
defaultEncoding - The character encoding to use for files that do not have one specified in the HTTP response headers. The default is ISO-8859-1.

soTimeoutMs - If the socket is unresponsive for this number of milliseconds, the request is cancelled. Setting the value to zero (no timeout) is not recommended as it could hang a thread on an unresponsive server. This timeout is used to time out socket opens and socket reads. Make sure this value is less than timeoutSeconds for optimal configuration. This ensures at least one retry read.

sendIfModifiedSince - Send the If-Modified-Since header, if previous Last-Modified fetch history information is available in the URI history.

sendIfNoneMatch - Send the If-None-Match header, if previous Etag fetch history information is available in the URI history.

sendConnectionClose - Send a Connection: close header with every request.
    w3.org connection header documentation

sendRange - Send the Range header when there is a limit on the retrieved document size. This is for politeness purposes. The Range header states that only the first n bytes are of interest. It is only pertinent if maxLengthBytes is greater than zero. Sending the Range header results in a 206 Partial Content status response, which is better than cutting the response mid-download. On rare occasion, sending the Range header will generate a 416 Request Range Not Satisfiable response.


acceptHeaders - Accept headers to include in each request. Each must be the complete header, e.g., Accept-Language: en.
(mi is the 2-letter code for Maori, see https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
    https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept
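
The acceptHeaders setting looks like the one that matters most for us. A sketch (same assumed fetchHttp bean as above; the timeout values and the "mi, en" Accept-Language header are my guesses for this project, not defaults):

    <bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
      <property name="timeoutSeconds" value="1200"/>
      <!-- soTimeoutMs should stay well below timeoutSeconds (note the units: ms vs s) -->
      <property name="soTimeoutMs" value="20000"/>
      <!-- prefer UTF-8 over the ISO-8859-1 default for pages without a declared charset -->
      <property name="defaultEncoding" value="UTF-8"/>
      <property name="acceptHeaders">
        <list>
          <value>Accept-Language: mi, en</value>
        </list>
      </property>
    </bean>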



ExtractorHtml:

extractJavascript - If true, in-page Javascript is scanned for strings that appear to be URIs. This typically finds both valid and invalid URIs. Attempts to fetch the invalid URIs can generate webmaster concern over odd crawler behavior. Default is true.

extractValueAttributes - If true, strings that look like URIs found in unusual places (such as form VALUE attributes) will be extracted. This typically finds both valid and invalid URIs. Attempts to fetch the invalid URIs may generate webmaster concerns over odd crawler behavior. Default is true.

ignoreFormActionUrls - If true, URIs appearing as the ACTION attribute in HTML FORMs are ignored. Default is false.
extractOnlyFormGets - If true, only ACTION URIs with a METHOD of GET (explicit or implied) are extracted. Default is true.
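
If the invalid-URI noise from Javascript and VALUE-attribute extraction becomes a problem, a sketch of turning those off (assuming the stock extractorHtml bean id and the org.archive.modules.extractor.ExtractorHTML class; property names as listed above):

    <bean id="extractorHtml" class="org.archive.modules.extractor.ExtractorHTML">
      <property name="extractJavascript" value="false"/>
      <property name="extractValueAttributes" value="false"/>
      <property name="ignoreFormActionUrls" value="true"/>
    </bean>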



candidates

seedsRedirectNewSeeds - If enabled, any URI found because a seed redirected to it (original seed returned 301 or 302), will also be treated as a seed.
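
Sketch of where that setting lives (assuming the candidates processor is the stock org.archive.crawler.postprocessor.CandidatesProcessor bean):

    <bean id="candidates" class="org.archive.crawler.postprocessor.CandidatesProcessor">
      <!-- treat URIs that a seed 301/302-redirects to as seeds themselves -->
      <property name="seedsRedirectNewSeeds" value="true"/>
    </bean>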


https://github.com/internetarchive/heritrix3/wiki/Statistics%20Tracking
Statistics Tracking

Any number of statistics tracking modules can be attached to a crawl. Currently only one is provided with Heritrix. The statisticsTracker Spring bean that comes with Heritrix creates the progress-statistics.log file and provides the WUI with data to display progress information about the crawl. It is strongly recommended that any crawl run through the WUI use this bean.



https://github.com/internetarchive/heritrix3/wiki/Configuring-Crawl-Scope-Using-DecideRules

------------
REST API: https://heritrix.readthedocs.io/en/latest/api.html
    Execute Script in Job

    POST https://(heritrixhost):8443/engine/job/(jobname)/script

    Executes a script. The script can be written as Beanshell, ECMAScript, Groovy, or AppleScript.


https://github.com/beanshell/beanshell
https://github.com/internetarchive/heritrix3/wiki/BeanShell%20Script%20For%20Downloading%20Video
https://github.com/internetarchive/heritrix3/wiki/Heritrix3-Useful-Scripts

-----------

LOGGING

https://github.com/internetarchive/heritrix3/wiki/Configuring%20Crawl%20Scope%20Using%20DecideRules

"DecideRuleSequence Logging

Enable FINEST logging on the class org.archive.crawler.deciderules.DecideRuleSequence to watch each DecideRule's evaluation of the processed URI. This can be done in the logging.properties file

logging.properties

    org.archive.modules.deciderules.DecideRuleSequence.level = FINEST

in conjunction with the -Dsysprop VM argument
    -Djava.util.logging.config.file=/path/to/heritrix3/dist/src/main/conf/logging.properties

"

I couldn't get the above logging instructions to work as given, but here's what I did.

a. I modified conf/logging.properties by adding:

# PINKY
# DecideRuleSequence Logging
# https://github.com/internetarchive/heritrix3/wiki/Configuring-Crawl-Scope-Using-DecideRules
org.archive.modules.deciderules.DecideRuleSequence.level = FINEST

b. I opened the bin/heritrix script and, in the two invocations that use $JAVACMD, added the -Djava.util.logging.config.file option:

    CLASSPATH=${CP} nohup $JAVACMD -Djava.util.logging.config.file=/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT/conf/logging.properties -Dheritrix.home=${HERITRIX_HOME}