Changeset 9230 for trunk/greenorg


Timestamp:
2005-03-01T15:25:00+13:00
Author:
kjdon
Message:

Added some statistics about a collection done by Diego. It appears that I have also changed a couple of lines with funny characters in them: have I really, or does it just think so?

File:
1 edited

  • trunk/greenorg/macros/english.dm

    r9164 r9230
    572   572
    573   573   _ex9d_ {
    574         Ulukau makes available resources for the use, teaching, and enhancement of the Hawaiian language. It has five collections: "Ka HoÊ»oilina: Puke Pai ʻŌlelo HawaiÊ»i" (The Legacy: Journal of Hawaiian Language Resources), Hawaiian Newspapers, Baibala Hemolele (The Hawaiian Bible), Hawaiian Dictionaries, and Hawaiian Books.
          574   Ulukau makes available resources for the use, teaching, and enhancement of the Hawaiian language. It has five collections: "Ka HoÊ»oilina: Puke Pai Ê»ÅŒlelo HawaiÊ»i" (The Legacy: Journal of Hawaiian Language Resources), Hawaiian Newspapers, Baibala Hemolele (The Hawaiian Bible), Hawaiian Dictionaries, and Hawaiian Books.
    575   575   }
    576   576

    1514  1514  whole.  We haven't actually demonstrated this yet, but it seems quite feasible.
    1515  1515
          1516  <p>
          1517  A test collection was built by "Archivo Digital", an office
          1518  that depends on the "Archivo Nacional de la Memoria" (National Memory
          1519  Archive in English), in Argentina. It contained sequences of page images with
          1520  associated OCR text.
          1521  <p/><i>Setup details</i>
          1522  <ul>
          1523  <li>Greenstone version: 2.52</li>
          1524  <li>Server: Pentium IV 1.8 GHz, 512 Mb RAM, Windows XP Prof.</li>
          1525  <li>Number of indexed documents: 17,655</li>
          1526  <li>Number of images (tiff format): 980,000</li>
          1527  <li>Total size of text files: 3.2 Gb</li>
          1528  <li>Built indexes: section:text document:Title</li>
          1529  <li>Used Plugin: PagedImgPlug</li>
          1530  <li>5 classifiers</li>
          1531  </ul>
          1532  <p/><i>Statistics</i>
          1533
          1534  <ul>
          1535  <li>Time to import the collection: Almost a week was spent collecting documents and importing them. No image conversion was done.</li>
          1536  <li>Time to build the collection (excluding import): almost 24 hours. The archives and the indexes were on  separate hard disks, to reduce the overhead that reading and writing from the same disk would cause.</li>
          1537  <li>Time to open a hierarchy node that contains 908 objects: 23 seconds</li>
          1538  <li>Average Time to search only one word in text index: 2 to 5 seconds</li>
          1539  <li>Average Time to search 3 words in text index: 2 to 5 seconds</li>
          1540  <li>Average Time to search exact phrases (includes 4, 5 and 6 words): 30 seconds</li></ul>
          1541
    1516  1542  }
    1517  1543  #######################################################################

    1734  1760  with an umlaut accent, LaTeX draws a "u" and then draws an umlaut accent over
    1735  1761  it. This means that <tt>pdftohtml</tt> will extract two separate characters
    1736        ('š' and 'u') rather than a single accented character (ÃŒ).</li>
          1762  ('Âš' and 'u') rather than a single accented character (ÃŒ).</li>
    1737  1763
    1738  1764  <li>PDF contains pieces of text, and coordinates for where that text
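
On the question in the commit message: the changed fragment on line 574 looks like what you get when UTF-8 bytes are read as Windows-1252 and saved again. That is an assumption drawn only from the characters shown on this page, not from the file's bytes, so it cannot confirm whether the file on disk really changed. A minimal Python sketch of the check follows; `repair_mojibake` is a hypothetical helper written for illustration and is not part of Greenstone or Trac.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes decoded as cp1252', if that is what happened."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside cp1252, or the resulting
        # bytes are not valid UTF-8: assume it was not damaged in this way.
        return text


# Fragments copied from line 574 of the diff above.
old_fragment = "ʻŌlelo"      # r9164 side
new_fragment = "Ê»ÅŒlelo"    # r9230 side

# Does re-reading the old text's UTF-8 bytes as cp1252 reproduce the new text?
print(old_fragment.encode("utf-8").decode("cp1252") == new_fragment)

# Does reversing that step on the new text recover the old text?
print(repair_mojibake(new_fragment) == old_fragment)
```

If both checks print `True` for the real file contents as well, the likely cause is an editor that opened the macro file with the wrong encoding and wrote it back, in which case the bytes did change. The same pattern ("Ê»") already appears on the r9164 side in "HoÊ»oilina", which suggests parts of the file had been through this conversion before.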