RUN:
export HERITRIX_HOME=/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT
$HERITRIX_HOME/bin/heritrix -a admin:admin
Visit: https://localhost:8443/engine

MORE READING:
http://open-s.com/en/content/heritrix-configuration
http://open-s.com/en/content/wget
    https://superuser.com/questions/655346/wget-execute-script-after-download
https://www.gnu.org/software/wget/manual/wget.html
    ‘--execute command

    Execute command as if it were a part of .wgetrc (see Startup File). A command thus invoked will be executed after the commands in .wgetrc, thus taking precedence over them.
    If you need to specify more than one wgetrc command, use multiple instances of ‘-e’.

"crawler TRAPS": https://www.contentkingapp.com/academy/crawler-traps/
    https://www.billhartzer.com/internet-marketing/crawl-thousands-urls/


-----------------------------------------
Scope DecideReject Configuration rules
-----------------------------------------
Issues in H3.docx
https://sbforge.org/download/attachments/.../Issues%20in%20H3.docx?version=1...

By using maxTransHops and maxSpeculativeHops we thought that we could manage how long our discovery path 'X' should be, but we see different results and ...
DOCX: https://sbforge.org/download/attachments/21856421/Issues%20in%20H3.docx?version=1&modificationDate=1465912557659&api=v2&usg=AOvVaw03EWO6Xy0XiMRITvFwR2v0
[ https://webcache.googleusercontent.com/search?q=cache:bEM1AdjQR2cJ:https://sbforge.org/download/attachments/21856421/Issues%2520in%2520H3.docx%3Fversion%3D1%26modificationDate%3D1465912557659%26api%3Dv2+&cd=10&hl=en&ct=clnk&gl=nz&client=ubuntu ]


    Speculative Hops

    No real effect from using maxTransHops or maxSpeculativeHops

    By using maxTransHops and maxSpeculativeHops we thought that we could manage how long our discovery path ‘X’ should be, but we see different results and we still harvest several ‘XX’ or more.
    To accomplish this we use HopsPathMatchesDecideRule but we haven’t found a specific path that we certainly can say we want excluded from our harvest. We tried to make a regex with R, E and X. Does anyone have experience with this?
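
For context, a sketch of the rules being discussed, as they might appear in the scope DecideRuleSequence of crawler-beans.cxml (the class and property names are my assumptions from the Heritrix 3 deciderules package; the hop counts and the hop-path regex are made-up examples, not recommendations):

    <!-- limit how far transitive/speculative hops can carry the crawl -->
    <bean class="org.archive.modules.deciderules.TransclusionDecideRule">
      <property name="maxTransHops" value="2"/>
      <property name="maxSpeculativeHops" value="1"/>
    </bean>
    <!-- reject any URI whose discovery path contains two or more speculative ('X') hops -->
    <bean class="org.archive.modules.deciderules.HopsPathMatchesDecideRule">
      <property name="decision" value="REJECT"/>
      <property name="regex" value=".*X.*X.*"/>
    </bean>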



    Path seeds

    Harvesting specific URI paths that don't end with slashes

    When adding path seeds to only harvest from a specific place on a domain we sometimes have problems with redirections.
    We always end path seeds with a slash, but sometimes we are redirected, e.g. HTTP 301 http://www.bt.dk/plus/ → http://www.bt.dk/plus.
    If we have the following as a seed, http://www.bt.dk/plus, will we harvest the whole site, because H3 harvests from the last slash in the seed?
    Any experience with path seeds?


--------------------------------------------------------------------------------

TODO:

https://www.stat.auckland.ac.nz/~paul/Reports/maori/maori.html
    The macron is the only accent required for written Māori and the accent can only be applied to vowels, so the full set of accented characters is:
    lower case a, with macron ā    upper case A, with macron Ā
    lower case e, with macron ē    upper case E, with macron Ē
    lower case i, with macron ī    upper case I, with macron Ī
    lower case o, with macron ō    upper case O, with macron Ō
    lower case u, with macron ū    upper case U, with macron Ū

http://emacs.1067599.n8.nabble.com/Entering-vowels-with-macrons-td72136.html
    Ctrl+\
    type rfc1345 (and enter)
    type &a- to get a-macron
    Then Ctrl+\ to toggle back to the default input
    (Can thereafter toggle with Ctrl+\ to get back to the rfc1345 input method)



https://sachachua.com/blog/2011/04/writing-macrons-linux-latin-pronunciation/
To add macrons: Ctrl-\ "latin-alt-postfix". But it doesn't have all the macronised vowels used in te reo.
Then Ctrl-\ to get the default input method back.
https://www.gnu.org/software/emacs/manual/html_node/emacs/Select-Input-Method.html


DOWNLOAD HERITRIX:
BINARY: http://builds.archive.org/maven2/org/archive/heritrix/heritrix/3.4.0-SNAPSHOT/
CODE, STATIC: https://github.com/internetarchive/heritrix3
https://github.com/internetarchive/heritrix3/wiki/How%20To%20Crawl

WEB CURATOR TOOL:
http://dia-nz.github.io/webcurator/
https://webcuratortool.readthedocs.io/en/latest/guides/quick-start-guide.html
https://webcuratortool.readthedocs.io/en/latest/guides/overview-history.html


Crawl as much of the .nz domain as we can, run the language detection for Maori on the pages that come through, and save only those pages. Also configure the crawler (somehow) so that it knows it doesn't need to re-download pages it has already *inspected* for language, not just pages it has already *stored*, since we only store mri-language pages and not every page inspected.

Then break the pages up into sentences using our SentenceDetector model.


SURT urls:
http://crawler.archive.org/apidocs/org/archive/util/SURT.html

WARC
https://en.wikipedia.org/wiki/Web_ARChive

Q:
"The harvested material is captured in ARC/WARC format which has strong storage and archiving characteristics." at https://webcuratortool.readthedocs.io/en/latest/guides/user-manual.html

https://blogs.loc.gov/thesignal/2013/11/anatomy-of-a-web-archive/

    Nicholas Taylor
    November 13, 2013 at 2:49 pm

    Hi Ross, thanks for the comment. The tools for personal archiving of web pages and websites to WARC format are getting better, with the capture side further along than the replay side. Archive Ready (http://archiveready.com/) and WARCreate (http://warcreate.com/) can both be used to create a WARC containing all of the objects that make up an individual web page. GNU Wget 1.14+ (http://www.archiveteam.org/index.php?title=Wget_with_WARC_output) and WAIL (http://matkelly.com/wail/) can both be used to capture entire websites to WARC. WAIL also bundles a standalone Wayback Machine that runs locally, which is the easiest way I know of for users to view the content they’ve collected in WARC format.

https://webcuratortool.readthedocs.io/en/latest/guides/user-manual.html
Doesn't mention https

"How targets work

Targets consist of several important elements, including a name and description for internal use; a set of Seed URLs, ******a web harvester profile that controls the behaviour of the web crawler during the harvest******, one or more schedules that specify when the Target will be harvested, and (optionally) a set of descriptive metadata for the Target."

Harvester Configuration section:
"The remaining tabs Pre-fetchers, Fetchers, Extractors, Writers, and Post-Processors are a series of processors that a URI passes through when it is crawled."


http://crawler.archive.org/articles/user_manual/config.html
Look for post-process*. Found under:
6.1.3. Processing Chains

6.1.2. Frontier

The Frontier is a pluggable module that maintains the internal state of the crawl: what URIs have been discovered, crawled, etc. As such its selection greatly affects, for instance, the order in which discovered URIs are crawled.

There is only one Frontier per crawl job.

Multiple Frontiers are provided with Heritrix, each of a particular character.

6.1.2.1. BdbFrontier

The default Frontier in Heritrix as of 1.4.0 and later is the BdbFrontier (previously, the default was the Section 6.1.2.2, “HostQueuesFrontier”). The BdbFrontier visits URIs and sites discovered in a generally breadth-first manner, it offers configuration options controlling how it throttles its activity against particular hosts, and whether it has a bias towards finishing hosts in progress ('site-first' crawling) or cycling among all hosts with pending URIs.

Discovered URIs are only crawled once, except that robots.txt and DNS information can be configured so that they are refreshed at specified intervals for each host.

The main difference between the BdbFrontier and its precursor, Section 6.1.2.2, “HostQueuesFrontier”, is that the BdbFrontier uses BerkeleyDB Java Edition to shift more running Frontier state to disk.

6.1.2.2. HostQueuesFrontier

The forerunner of the Section 6.1.2.1, “BdbFrontier”. Now deprecated, mostly because its custom disk-based data structures could not move as much Frontier state out of main memory as the BerkeleyDB Java Edition approach. Has the same general characteristics as the Section 6.1.2.1, “BdbFrontier”.


https://webcuratortool.readthedocs.io/en/latest/guides/quick-start-guide.html
You can use OpenWayback to view harvests from within WCT, see the wiki on the WCT Github page: https://github.com/DIA-NZ/webcurator/wiki/Wayback-Integration

https://webarchive.jira.com/wiki/spaces/Heritrix/overview
https://github.com/internetarchive/heritrix3/wiki/Heritrix3

    "Unlike with previous releases, the web control interface is only made available via secure-socket HTTPS, and corresponding to this change the default port has changed to 8443. Additionally, unless you supply a compatible keystore via the new optional '-s' command-line switch, an 'ad-hoc' keystore with a new locally-generated SSL-capable certificate will be created (and then reused on future launches).

    To then contact the web interface from a browser running on the same machine, visit the URL:

    https://localhost:8443/
    "

    LOCALHOST WITH HTTPS is possible??? vs https://letsencrypt.org/docs/certificates-for-localhost/

    "For local development

    If you’re developing a web app, it’s useful to run a local web server like Apache or Nginx, and access it via http://localhost:8000/ in your web browser. However, web browsers behave in subtly different ways on HTTP vs HTTPS pages. The main difference: On an HTTPS page, any requests to load JavaScript from an HTTP URL will be blocked. So if you’re developing locally using HTTP, you might add a script tag that works fine on your development machine, but breaks when you deploy to your HTTPS production site. To catch this kind of problem, it’s useful to set up HTTPS on your local web server. However, you don’t want to see certificate warnings all the time. How do you get the green lock locally?

    The best option: Generate your own certificate, either self-signed or signed by a local root, and trust it in your operating system’s trust store. Then use that certificate in your local web server. See below for details."

    [Googled: https certificate localhost
    https://www.freecodecamp.org/news/how-to-get-https-working-on-your-local-development-environment-in-5-minutes-7af615770eec/
    ]


https://github.com/internetarchive/heritrix3/wiki/Heritrix%20Output#HeritrixOutput-WARCfiles
* source-report.txt

This report contains a line item for each host, which includes the seed from which the host was reached.
Note

    The sourceTagSeeds property of the TextSeedModule bean must be set to true for this report to be generated.

* WARC files

Assuming you are using the WARC writer that comes with Heritrix, a number of WARC files will be generated containing crawled content.

You can specify the storage location of WARC files by setting the directory value of the WARCWriterProcessor bean.
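
A sketch of both settings as they would appear in crawler-beans.cxml (the bean ids 'seeds' and 'warcWriter' and the 'warcs' directory value are assumptions based on the stock profile; only the properties of interest are shown):

    <!-- enable per-seed source tagging so source-report.txt gets generated -->
    <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
      <property name="sourceTagSeeds" value="true"/>
      <!-- (the textSource property holding the actual seed list is omitted here) -->
    </bean>

    <!-- choose where the generated WARC files are written -->
    <bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
      <property name="directory" value="warcs"/>
    </bean>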


https://github.com/internetarchive/heritrix3/wiki/Archiving%20Rich-Media%20Content
Large File Sizes

Rich-media content, such as Flash and video, is usually much larger than standard text/html pages. Crawling such content requires large investments in storage and bandwidth. To mitigate these issues, deduplication is recommended for rich-media crawls. Deduplication detects previously collected content that is redundant and skips the download of such content. Pointers to the duplicate content allow it to appear in subsequent crawls. For details see Configuring Heritrix for Deduplication.


Excessive Memory and CPU Usage

Downloading rich-media content can often cause excessive load to be placed on the crawling computer's memory and CPU. For example, extracting links from Flash and other rich-media resources requires extensive data parsing, which is CPU intensive. Atypical input patterns can also cause excessive CPU usage when regular expressions used by Heritrix are run. It is therefore recommended that rich-media crawls be allocated more memory and CPU than "normal" crawls. The memory allocated to Heritrix is set from the command line. The following example shows the command line option to allocate 1 GB of memory to Heritrix, which should be sufficient for most rich-media crawls.

export JAVA_OPTS=-Xmx1024M

Multi-core processors are also recommended for rich-media crawls.


Streaming media
and
Social Networking Sites
Many social networking sites make use of rich-media to enhance their user-experience. For specific guidelines on archiving social media sites see Archiving Social Networking Sites with Archive-It. These instructions apply to the Archive-It application, which is built on top of Heritrix.


Q: https://github.com/internetarchive/heritrix3/wiki/Avoiding%20Too%20Much%20Dynamic%20Content
"To allow both foo.org and www.foo.org to be captured, you could add two seeds: http://www.foo.org/ and http://foo.org/. To allow every subdomain of foo.org to be crawled, you can add the seed http://foo.org. Note the absence of a trailing slash."
(Does the latter encompass both of the former?)
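
For reference, a sketch of how the two seed styles would be listed in the seeds bean (assuming the stock TextSeedModule + ConfigString arrangement from crawler-beans.cxml; foo.org is just the wiki's example domain):

    <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
      <property name="textSource">
        <bean class="org.archive.spring.ConfigString">
          <property name="value">
            <value>
# with trailing slashes: capture foo.org and www.foo.org only
http://www.foo.org/
http://foo.org/
# without a trailing slash: capture every subdomain of foo.org
# http://foo.org
            </value>
          </property>
        </bean>
      </property>
    </bean>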


Delete the TransclusionDecideRule, since this rule has the potential to lead Heritrix onto another host. For example, if a URI returns a 301 (moved permanently) or 302 (found) response code as well as a URI that contains a different host name than the seeds, Heritrix would accept this URI using the TransclusionDecideRule. Removing this rule will keep Heritrix from straying off of our www.foo.org host.
...
Alternately, you can add the MatchesFilePatternDecideRule. Set usePresetPattern to CUSTOM and set the regexp to something like: .*foo\.org(?!/calendar).*|.*foo\.org/calendar?year=200[56].*
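
A sketch of the alternative the wiki describes, with the TransclusionDecideRule entry dropped from the scope's rules list and MatchesFilePatternDecideRule added instead (the class is assumed to live in org.archive.modules.deciderules, and the property names are taken from the wiki wording above; I haven't verified them against the 3.4.0 beans):

    <!-- in the scope DecideRuleSequence, with the TransclusionDecideRule bean removed -->
    <bean class="org.archive.modules.deciderules.MatchesFilePatternDecideRule">
      <property name="decision" value="ACCEPT"/>
      <property name="usePresetPattern" value="CUSTOM"/>
      <!-- regexp copied from the wiki's calendar example above -->
      <property name="regexp" value=".*foo\.org(?!/calendar).*|.*foo\.org/calendar?year=200[56].*"/>
    </bean>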


https://github.com/internetarchive/heritrix3/wiki/Mirroring%20HTML%20Files%20Only
Mirroring HTML Files Only

Suppose you only want to crawl URIs that match http://foo.org/bar/*.html. Also, you would like to save the crawled files in a file/directory format instead of saving them in WARC files. Also, assume the web server is case-sensitive. For example, http://foo.org/bar/abc.html and http://foo.org/bar/ABC.HTML are pointing to two different resources.


!! Heritrix needs to be configured to differentiate between abc.html and ABC.HTML. Do this by removing the LowercaseRule from the canonicalizationPolicy bean.
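
What I think that change looks like (a sketch assuming the stock canonicalizationPolicy bean, org.archive.modules.canonicalize.RulesCanonicalizationPolicy, whose rules list normally includes a LowercaseRule; the other rules shown are just examples of entries that stay):

    <bean id="canonicalizationPolicy"
          class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy">
      <property name="rules">
        <list>
          <!-- LowercaseRule removed so abc.html and ABC.HTML remain distinct URIs -->
          <bean class="org.archive.modules.canonicalize.StripUserinfoRule"/>
          <bean class="org.archive.modules.canonicalize.StripWWWNRule"/>
          <bean class="org.archive.modules.canonicalize.FixupQueryString"/>
        </list>
      </property>
    </bean>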

https://github.com/internetarchive/heritrix3/wiki/Only%20Store%20Successful%20HTML%20Pages


https://github.com/internetarchive/heritrix3/wiki/Jobs
[multiple URLs to crawl can be specified. But what is the separator]

Look up: Spring framework, Spring beans


https://localhost:8443/engine/job/pinky/jobdir/crawler-beans.cxml?format=textedit
 <!--
   PROCESSING CHAINS
    Much of the crawler's work is specified by the sequential
    application of swappable Processor modules. These Processors
    are collected into three 'chains'. The CandidateChain is applied
    to URIs being considered for inclusion, before a URI is enqueued
    for collection. The FetchChain is applied to URIs when their
    turn for collection comes up. The DispositionChain is applied
    after a URI is fetched and analyzed/link-extracted.
  -->


https://github.com/internetarchive/heritrix3/wiki/Fetch%20Chain%20Processors
fetchHttp

This processor fetches HTTP URIs. As of Heritrix 3.1, the crawler will now properly decode 'chunked' Transfer-Encoding -- even if encountered when it should not be used, as in a response to an HTTP/1.0 request. Additionally, the fetchHttp processor now includes the parameter 'useHTTP11', which if true, will cause Heritrix to report its requests as 'HTTP/1.1'. This allows sites to use the 'chunked' Transfer-Encoding. (The default for this parameter is false for now, and Heritrix still does not reuse a persistent connection for more than one request to a site.)
fetchHttp also includes the parameter 'acceptCompression', which if true, will cause Heritrix requests to include an "Accept-Encoding: gzip,deflate" header, which offers to receive compressed responses. (The default for this parameter is false for now.)
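
A sketch of switching both of those on (assuming the stock fetchHttp bean id and the org.archive.modules.fetcher.FetchHTTP class; per the text above both default to false):

    <bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
      <!-- report requests as HTTP/1.1 so sites may use chunked Transfer-Encoding -->
      <property name="useHTTP11" value="true"/>
      <!-- offer to accept gzip/deflate-compressed responses -->
      <property name="acceptCompression" value="true"/>
    </bean>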

extractorHttp

This processor extracts outlinks from HTTP headers. As of Heritrix 3.1, the extractorHttp processor now considers any URI on a hostname to imply that the '/favicon.ico' from the same host should be fetched. Also, as of Heritrix 3.1, the "inferRootPage" property has been added to the extractorHttp bean. If this property is "true", Heritrix infers the '/' root page from any other URI on the same hostname. The default for this setting is "false", which means the pre-3.1 behavior of only fetching the root page if it is a seed or otherwise discovered and in-scope remains in effect. Discovery via these new heuristics is considered to be a new 'I' (inferred) hop-type, and is treated the same in scoping/transclusion decisions as an 'E' (embed).


https://github.com/internetarchive/heritrix3/wiki/Processor%20Settings

fetchHttp:

timeoutSeconds - This setting determines how long an HTTP request will wait for a resource to respond. This setting should be set to a high value.
defaultEncoding - The character encoding to use for files that do not have one specified in the HTTP response headers. The default is ISO-8859-1.

soTimeoutMs - If the socket is unresponsive for this number of milliseconds, the request is cancelled. Setting the value to zero (no timeout) is not recommended as it could hang a thread on an unresponsive server. This timeout is used to time out socket opens and socket reads. Make sure this value is less than timeoutSeconds for optimal configuration. This ensures at least one retry read.

sendIfModifiedSince - Send the If-Modified-Since header, if previous Last-Modified fetch history information is available in the URI history.

sendIfNoneMatch - Send the If-None-Match header, if previous Etag fetch history information is available in the URI history.

sendConnectionClose - Send a Connection: close header with every request.
    w3.org connection header documentation

sendRange - Send the Range header when there is a limit on the retrieved document size. This is for politeness purposes. The Range header states that only the first n bytes are of interest. It is only pertinent if maxLengthBytes is greater than zero. Sending the Range header results in a 206 Partial Content status response, which is better than cutting the response mid-download. On rare occasion, sending the Range header will generate a 416 Request Range Not Satisfiable response.


acceptHeaders - Accept headers to include in each request. Each must be the complete header, e.g., Accept-Language: en.
(mi is the 2-letter code for Maori, see https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
    https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept
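
The acceptHeaders setting looks like the one that matters most for us. A sketch (same assumed fetchHttp bean as above; the timeout values and the "mi, en" Accept-Language header are my guesses for this project, not defaults):

    <bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
      <property name="timeoutSeconds" value="1200"/>
      <!-- soTimeoutMs should stay well below timeoutSeconds (note the units: ms vs s) -->
      <property name="soTimeoutMs" value="20000"/>
      <!-- prefer UTF-8 over the ISO-8859-1 default for pages without a declared charset -->
      <property name="defaultEncoding" value="UTF-8"/>
      <property name="acceptHeaders">
        <list>
          <value>Accept-Language: mi, en</value>
        </list>
      </property>
    </bean>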



ExtractorHtml:

extractJavascript - If true, in-page Javascript is scanned for strings that appear to be URIs. This typically finds both valid and invalid URIs. Attempts to fetch the invalid URIs can generate webmaster concern over odd crawler behavior. Default is true.

extractValueAttributes - If true, strings that look like URIs found in unusual places (such as form VALUE attributes) will be extracted. This typically finds both valid and invalid URIs. Attempts to fetch the invalid URIs may generate webmaster concerns over odd crawler behavior. Default is true.

ignoreFormActionUrls - If true, URIs appearing as the ACTION attribute in HTML FORMs are ignored. Default is false.
extractOnlyFormGets - If true, only ACTION URIs with a METHOD of GET (explicit or implied) are extracted. Default is true.
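
If the invalid-URI noise from Javascript and VALUE-attribute extraction becomes a problem, a sketch of turning those off (assuming the stock extractorHtml bean id and the org.archive.modules.extractor.ExtractorHTML class; property names as listed above):

    <bean id="extractorHtml" class="org.archive.modules.extractor.ExtractorHTML">
      <property name="extractJavascript" value="false"/>
      <property name="extractValueAttributes" value="false"/>
      <property name="ignoreFormActionUrls" value="true"/>
    </bean>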



candidates

seedsRedirectNewSeeds - If enabled, any URI found because a seed redirected to it (original seed returned 301 or 302), will also be treated as a seed.
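
Sketch of where that setting lives (assuming the candidates processor is the stock org.archive.crawler.postprocessor.CandidatesProcessor bean):

    <bean id="candidates" class="org.archive.crawler.postprocessor.CandidatesProcessor">
      <!-- treat URIs that a seed 301/302-redirects to as seeds themselves -->
      <property name="seedsRedirectNewSeeds" value="true"/>
    </bean>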


https://github.com/internetarchive/heritrix3/wiki/Statistics%20Tracking
Statistics Tracking

Any number of statistics tracking modules can be attached to a crawl. Currently only one is provided with Heritrix. The statisticsTracker Spring bean that comes with Heritrix creates the progress-statistics.log file and provides the WUI with data to display progress information about the crawl. It is strongly recommended that any crawl run through the WUI use this bean.



https://github.com/internetarchive/heritrix3/wiki/Configuring-Crawl-Scope-Using-DecideRules

------------
REST API: https://heritrix.readthedocs.io/en/latest/api.html
    Execute Script in Job

    POST https://(heritrixhost):8443/engine/job/(jobname)/script

    Executes a script. The script can be written as Beanshell, ECMAScript, Groovy, or AppleScript.


https://github.com/beanshell/beanshell
https://github.com/internetarchive/heritrix3/wiki/BeanShell%20Script%20For%20Downloading%20Video
https://github.com/internetarchive/heritrix3/wiki/Heritrix3-Useful-Scripts

-----------

LOGGING

https://github.com/internetarchive/heritrix3/wiki/Configuring%20Crawl%20Scope%20Using%20DecideRules

"DecideRuleSequence Logging

Enable FINEST logging on the class org.archive.crawler.deciderules.DecideRuleSequence to watch each DecideRule's evaluation of the processed URI. This can be done in the logging.properties file

logging.properties

    org.archive.modules.deciderules.DecideRuleSequence.level = FINEST

in conjunction with the -Dsysprop VM argument
    -Djava.util.logging.config.file=/path/to/heritrix3/dist/src/main/conf/logging.properties

"

I couldn't get the above logging instructions to work as given, but here's what I did.

a. I modified conf/logging.properties by adding:

# PINKY
# DecideRuleSequence Logging
# https://github.com/internetarchive/heritrix3/wiki/Configuring-Crawl-Scope-Using-DecideRules
org.archive.modules.deciderules.DecideRuleSequence.level = FINEST

b. I opened the bin/heritrix script and, in the two invocations that use $JAVACMD, added the -Djava.util.logging.config.file option:

    CLASSPATH=${CP} nohup $JAVACMD -Djava.util.logging.config.file=/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT/conf/logging.properties -Dheritrix.home=${HERITRIX_HOME}