- Timestamp: 2004-10-22T11:25:35+13:00
- Location: trunk/greenorg/macros
- Files: 2 edited
trunk/greenorg/macros/english.dm (r8401 → r8407)

Added title, after _tfaqbuildexpattitle_:

_tfaqbuildsizelimittitle_ {Are there any limits to the size of collections?}

Added body:

_tfaqbuildsizelimitbody_ {
The largest collections we have built are 7 Gb of text, and 11 million short documents (about 3 Gb of text). Both built without problems. We haven't tried larger amounts of text because we don't have larger amounts of text lying around. It's no good using the same 7 Gb twice over to make 14 Gb, because the vocabulary wouldn't grow accordingly, as it would with a real collection.
<p>
There are three main limitations:
<ol>
<li>File size. There is a 2 Gb file size limit on Linux (soon to be increased to infinity, the Linux people say). We don't know the corresponding figures for Windows; we use Linux for development. There are systems that go higher, but we don't have access to them.<br>
The compressed text will hit the limit first: MG stores the compressed text in a single file, and 7 Gb of text compresses to just under 2 Gb, so you can't go much higher without splitting the compressed-text file (hacky, but probably easy).
</li>
<li>Huffman coding. There is a Huffman coding limitation that we would expect to run into at collections of around 16 Gb. However, the solution is very easy; we just haven't bothered to implement it because we haven't yet encountered the problem.
</li>
<li>Build time. For building a single index on an already-imported collection, extrapolation suggests that on a modern machine with 1 Gb of main memory you should be able to build a 60 Gb collection in about 3 days. However, there are often large gaps between theory and practice in this area! The more indexes you have, the longer things take to build.
</li>
</ol>
In practice, the solution for very large amounts of data is not to treat the collection as one huge monolith, but to partition it into subcollections and arrange for the search engine to search them all together behind the scenes. However, while you can amalgamate the results of searching subcollections fairly easily, it's much harder with browsing. Of course, A-Z lists, date lists, and the like aren't really much use with very large collections anyway. This is where new techniques of hierarchical phrase browsing come into their own. And the really good news is that you can partition a collection into subcollections, each with an individual phrase browser, and arrange to view them all together in a single hierarchical browsing structure, as one coordinated whole. We haven't actually demonstrated this yet, but it seems quite feasible.

}
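The partition-and-search-together idea above can be sketched in a few lines of Python. This is only an illustrative sketch, not Greenstone/MG code: the dictionary-based subcollections, the `search_subcollection` scorer, and `federated_search` are all hypothetical stand-ins for real per-partition indexes, showing only how already-ranked result lists from each partition can be amalgamated into one ranking behind the scenes.

```python
import heapq

def search_subcollection(index, query):
    """Search one (hypothetical) subcollection index, mapping document
    titles to relevance scores; return (score, doc) pairs, best first."""
    return sorted(
        ((score, doc) for doc, score in index.items() if query in doc),
        reverse=True,
    )

def federated_search(subcollections, query, k=10):
    """Search every subcollection separately, then merge the ranked
    lists into a single top-k ranking across all partitions."""
    ranked_lists = [search_subcollection(ix, query) for ix in subcollections]
    # heapq.merge efficiently combines lists that are already sorted,
    # so each partition's ranking is computed independently.
    merged = heapq.merge(*ranked_lists, reverse=True)
    return [doc for score, doc in list(merged)[:k]]

parts = [
    {"apple pie": 0.9, "banana": 0.2},  # subcollection 1
    {"apple tart": 0.7},                # subcollection 2
]
print(federated_search(parts, "apple"))  # -> ['apple pie', 'apple tart']
```

Merging browsers is harder precisely because there is no single score to merge on; the hierarchical phrase-browsing structure mentioned above would need its own amalgamation step.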