Changeset 8407 for trunk


Timestamp: 2004-10-22T11:25:35+13:00
Author: kjdon
Message: added an entry about collection size limits - copied from an email Ian wrote in 2001

Location: trunk/greenorg/macros
Files: 2 edited

  • trunk/greenorg/macros/english.dm

    r8401 → r8407

    1016 1016
    1017 1017  _tfaqbuildexpattitle_ {How do I fix XML::Parser errors during import.pl?}
         1018
         1019  _tfaqbuildsizelimittitle_ {Are there any limits to the size of collections?}
    1018 1020
    1019 1021  _headingplugins_ {More About Plugins}
     
    …

    1475 1477  <p>
    1476 1478  You may also need to get Expat, available from <a href="http://sourceforge.net/projects/expat/">http://sourceforge.net/projects/expat/</a>.
         1479
         1480  }
         1481
         1482  _tfaqbuildsizelimitbody_ {
         1483  The largest collections we have built have been 7 Gb of text, and 11 million short documents (about 3 Gb of text). These were built with no problems. We haven't tried larger amounts of text because we don't have larger amounts
         1484  lying around. It's no good using the same 7 Gb twice over to make 14 Gb, because the vocabulary doesn't grow accordingly, as it would
         1485  with a real collection.
         1486  <p>
         1487  There are three main limitations:
         1488  <ol>
         1489  <li>There is a file size limit of 2 Gb on Linux (soon to be increased to
         1490      infinity, the Linux people say). I don't know the corresponding
         1491      figure for Windows; we use Linux for development. There are systems
         1492      that go higher, but we don't have access to them.<br>
         1493
         1494      The compressed text will hit the limit first. MG stores the compressed
         1495      text in a single file. 7 Gb of text compresses to just under 2 Gb, so you
         1496      can't go much higher without splitting the compressed-text file (hacky,
         1497      but probably easy).
         1498  </li>
         1499  <li>Technical. There is a Huffman coding limitation which we would expect
         1500      to run into at collections of around 16 Gb. However, the solution is
         1501      very easy; we just haven't bothered to implement it because we haven't
         1502      yet encountered the problem.
         1503  </li>
         1504  <li>
         1505  Build time. For building a single index on an already-imported
         1506      collection, extrapolations indicate that on a modern machine with 1 Gb
         1507      of main memory you should be able to build a 60 Gb collection in about
         1508      3 days. However, there are often large gaps
         1509      between theory and practice in this area! The more indexes you have,
         1510      the longer things take to build.
         1511  </li>
         1512  </ol>
         1513  In practice, the solution for very large amounts of data is not to treat the collection
         1514  as one huge monolith, but to partition it into subcollections and arrange for
         1515  the search engine to search them all together behind the scenes. However, while
         1516  you can amalgamate the results of searching subcollections fairly easily, it's
         1517  much harder to do so for browsing. Of course, A-Z lists and datelists and the like
         1518  aren't really much use with very large collections.
         1519  This is where new techniques of hierarchical phrase browsing come into their
         1520  own. And the really good news is that you can partition a collection into
         1521  subcollections, each with individual phrase browsers, and arrange to view them
         1522  all together in a single hierarchical browsing structure, as one coordinated
         1523  whole. We haven't actually demonstrated this yet, but it seems quite feasible.
    1477 1524
    1478 1525  }
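
Notes on the new FAQ entry:

The workaround named in limitation 1, splitting the compressed-text file so that each piece stays under the file-size limit, is easy to sketch. What follows is an illustration only, not Greenstone/MG code: the ".NNN" chunk naming and the just-under-2-Gb cap are assumptions made for the example (Python).

    import os

    # Sketch of "splitting the compressed-text file" from limitation 1.
    # Not Greenstone/MG code; the chunk naming and the cap are assumed.
    CHUNK_LIMIT = 2 * 1024**3 - 1    # stay just below the old 2 Gb limit
    BUFFER_SIZE = 64 * 1024 * 1024   # copy 64 Mb at a time

    def split_file(path, limit=CHUNK_LIMIT):
        """Copy `path` into chunks path.000, path.001, ... of at most `limit` bytes."""
        chunks = []
        part = 0
        with open(path, "rb") as src:
            while True:
                chunk_path = "%s.%03d" % (path, part)
                remaining = limit
                with open(chunk_path, "wb") as dst:
                    while remaining > 0:
                        block = src.read(min(BUFFER_SIZE, remaining))
                        if not block:
                            break
                        dst.write(block)
                        remaining -= len(block)
                if remaining == limit and part > 0:
                    os.remove(chunk_path)   # source ended exactly on a chunk boundary
                    break
                chunks.append(chunk_path)
                if remaining > 0:           # short chunk means end of source
                    break
                part += 1
        return chunks

A corresponding reader would present the chunks as one logical file: byte offset // limit names the chunk and offset % limit the position within it, since every chunk except the last is exactly limit bytes long.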
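The extrapolation in limitation 3 is easy to sanity-check as arithmetic; these figures simply restate the ones quoted in the entry.

    # Back-of-envelope check of "60 Gb in about 3 days":
    collection_gb = 60.0
    build_hours = 3 * 24
    rate = collection_gb / build_hours
    print("implied rate: %.2f Gb/hour, or about %.0f Mb/minute"
          % (rate, rate * 1024 / 60))
    # -> implied rate: 0.83 Gb/hour, or about 14 Mb/minute

As the entry itself cautions, real rates vary with hardware and with the number of indexes built.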
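The closing paragraph's "search them all together behind the scenes" amounts to merging several independently ranked result lists. Here is a minimal sketch of that idea, assuming each subcollection can be queried on its own and returns (score, document) pairs sorted best-first; search_subcollection and its canned results are hypothetical stand-ins, not Greenstone's actual API.

    import heapq

    def search_subcollection(name, query):
        # Hypothetical stand-in: a real version would query one
        # subcollection's index; results come back ranked best-first.
        canned = {
            "sub-a": [(0.91, "a/doc3"), (0.40, "a/doc7")],
            "sub-b": [(0.77, "b/doc1"), (0.52, "b/doc9"), (0.10, "b/doc2")],
        }
        return canned.get(name, [])

    def federated_search(subcollections, query, top_k=10):
        """Merge per-subcollection ranked lists into one ranked list."""
        ranked = [search_subcollection(s, query) for s in subcollections]
        merged = heapq.merge(*ranked, key=lambda hit: hit[0], reverse=True)
        return [hit for _, hit in zip(range(top_k), merged)]

    print(federated_search(["sub-a", "sub-b"], "snail farming"))
    # -> [(0.91, 'a/doc3'), (0.77, 'b/doc1'), (0.52, 'b/doc9'),
    #     (0.4, 'a/doc7'), (0.1, 'b/doc2')]

The catch is the one the entry hints at: merging by raw score only works if scores are comparable across subcollections, which for tf-idf-style ranking means sharing global term statistics. Browsing structures, as the paragraph says, are much harder to amalgamate.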