Changeset 8407 for trunk


Timestamp: 2004-10-22T11:25:35+13:00
Author: kjdon
Message: added an entry about collection size limits - copied from an email Ian wrote in 2001

Location: trunk/greenorg/macros
Files: 2 edited

  • trunk/greenorg/macros/english.dm

    r8401 → r8407

    1016 1016
    1017 1017  _tfaqbuildexpattitle_ {How do I fix XML::Parser errors during import.pl?}
         1018
         1019  _tfaqbuildsizelimittitle_ {Are there any limits to the size of collections?}
    1018 1020
    1019 1021  _headingplugins_ {More About Plugins}
     
    …

    1475 1477  <p>
    1476 1478  You may also need to get Expat, available from <a href="http://sourceforge.net/projects/expat/">http://sourceforge.net/projects/expat/</a>.
         1479
         1480  }
         1481
         1482  _tfaqbuildsizelimitbody_ {
         1483  The largest collections we have built have been 7 Gb of text, and 11 million short documents (about 3 Gb of text). These were built with no problems. We haven't tried larger amounts of text because we don't have larger amounts
         1484  lying around. It's no good using the same 7 Gb twice over to make 14 Gb, because the vocabulary doesn't grow accordingly, as it would
         1485  with a real collection.
         1486  <p>
         1487  There are three main limitations:
         1488  <ol>
         1489  <li>There is a file size limit of 2 Gb on Linux (soon to be increased to
         1490      infinity, the Linux people say). I don't know the corresponding
         1491      figure for Windows; we use Linux for development. There are systems
         1492      that go higher, but we don't have access to them.<br>
         1493
         1494      The compressed text will hit the limit first. MG stores the compressed
         1495      text in a single file. 7 Gb of text compresses to just under 2 Gb, so you
         1496      can't go much higher without splitting the compressed-text file (hacky,
         1497      but probably easy).
         1498  </li>
         1499  <li>Technical. There is a Huffman coding limitation which we would expect
         1500      to run into at collections of around 16 Gb. However, the solution is
         1501      very easy; we just haven't bothered to implement it because we haven't
         1502      yet encountered the problem.
         1503  </li>
         1504  <li>
         1505  Build time. For building a single index on an already-imported
         1506      collection, extrapolations indicate that on a modern machine with 1 Gb
         1507      of main memory you should be able to build a 60 Gb collection in about
         1508      3 days. However, there are often large gaps
         1509      between theory and practice in this area! The more indexes you have,
         1510      the longer things take to build.
         1511  </li>
         1512  </ol>
         1513  In practice, the solution for very large amounts of data is not to treat the collection
         1514  as one huge monolith, but to partition it into subcollections and arrange for
         1515  the search engine to search them all together behind the scenes. However, while
         1516  you can amalgamate the results of searching subcollections fairly easily, it's
         1517  much harder to do so for browsing. Of course, A-Z lists and datelists and the like
         1518  aren't really much use with very large collections.
         1519  This is where new techniques of hierarchical phrase browsing come into their
         1520  own. And the really good news is that you can partition a collection into
         1521  subcollections, each with individual phrase browsers, and arrange to view them
         1522  all together in a single hierarchical browsing structure, as one coordinated
         1523  whole. We haven't actually demonstrated this yet, but it seems quite feasible.
    1477 1524
    1478 1525  }
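
Notes on the new FAQ entry:

The workaround named in limitation 1, splitting the compressed-text file so that each piece stays under the file-size limit, is easy to sketch. What follows is an illustration only, not Greenstone/MG code: the ".NNN" chunk naming and the just-under-2-Gb cap are assumptions made for the example (Python).

    import os

    # Sketch of "splitting the compressed-text file" from limitation 1.
    # Not Greenstone/MG code; the chunk naming and the cap are assumed.
    CHUNK_LIMIT = 2 * 1024**3 - 1    # stay just below the old 2 Gb limit
    BUFFER_SIZE = 64 * 1024 * 1024   # copy 64 Mb at a time

    def split_file(path, limit=CHUNK_LIMIT):
        """Copy `path` into chunks path.000, path.001, ... of at most `limit` bytes."""
        chunks = []
        part = 0
        with open(path, "rb") as src:
            while True:
                chunk_path = "%s.%03d" % (path, part)
                remaining = limit
                with open(chunk_path, "wb") as dst:
                    while remaining > 0:
                        block = src.read(min(BUFFER_SIZE, remaining))
                        if not block:
                            break
                        dst.write(block)
                        remaining -= len(block)
                if remaining == limit and part > 0:
                    os.remove(chunk_path)   # source ended exactly on a chunk boundary
                    break
                chunks.append(chunk_path)
                if remaining > 0:           # short chunk means end of source
                    break
                part += 1
        return chunks

A corresponding reader would present the chunks as one logical file: byte offset // limit names the chunk and offset % limit the position within it, since every chunk except the last is exactly limit bytes long.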
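The extrapolation in limitation 3 is easy to sanity-check as arithmetic; these figures simply restate the ones quoted in the entry.

    # Back-of-envelope check of "60 Gb in about 3 days":
    collection_gb = 60.0
    build_hours = 3 * 24
    rate = collection_gb / build_hours
    print("implied rate: %.2f Gb/hour, or about %.0f Mb/minute"
          % (rate, rate * 1024 / 60))
    # -> implied rate: 0.83 Gb/hour, or about 14 Mb/minute

As the entry itself cautions, real rates vary with hardware and with the number of indexes built.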
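The closing paragraph's "search them all together behind the scenes" amounts to merging several independently ranked result lists. Here is a minimal sketch of that idea, assuming each subcollection can be queried on its own and returns (score, document) pairs sorted best-first; search_subcollection and its canned results are hypothetical stand-ins, not Greenstone's actual API.

    import heapq

    def search_subcollection(name, query):
        # Hypothetical stand-in: a real version would query one
        # subcollection's index; results come back ranked best-first.
        canned = {
            "sub-a": [(0.91, "a/doc3"), (0.40, "a/doc7")],
            "sub-b": [(0.77, "b/doc1"), (0.52, "b/doc9"), (0.10, "b/doc2")],
        }
        return canned.get(name, [])

    def federated_search(subcollections, query, top_k=10):
        """Merge per-subcollection ranked lists into one ranked list."""
        ranked = [search_subcollection(s, query) for s in subcollections]
        merged = heapq.merge(*ranked, key=lambda hit: hit[0], reverse=True)
        return [hit for _, hit in zip(range(top_k), merged)]

    print(federated_search(["sub-a", "sub-b"], "snail farming"))
    # -> [(0.91, 'a/doc3'), (0.77, 'b/doc1'), (0.52, 'b/doc9'),
    #     (0.4, 'a/doc7'), (0.1, 'b/doc2')]

The catch is the one the entry hints at: merging by raw score only works if scores are comparable across subcollections, which for tf-idf-style ranking means sharing global term statistics. Browsing structures, as the paragraph says, are much harder to amalgamate.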