Changeset 11859


Timestamp:
2006-05-25T17:34:40+12:00
Author:
kjdon
Message:

some changes for workshop

File:
1 edited

  • trunk/gsdl-documentation/tutorials/xml-source/tutorial_en.xml

    r11848 r11859  
    750750</Comment>
    751751<NumberedItem>
    752 <Text id="0281">Start a new collection called <b>reports</b>, fill out appropriate fields for it, and choose Dublin Core as the metadata set.</Text>
     752<Text id="0281">Start a new collection called <b>reports</b> (<AutoText key="glidict::Menu.File"/> &rarr; <AutoText key="glidict::Menu.File_New"/>), base it on <AutoText key="glidict::NewCollectionPrompt.NewCollection"/>, and choose Dublin Core as the metadata set.</Text>
    753753</NumberedItem>
    754754<NumberedItem>
     
    759759</NumberedItem>
    760760<Comment>
    761 <Text id="0287a">Some of the documents don't look very nice in Greenstone. One of them, <Path>pdf05-notext.pdf</Path>, could not be processed using the default configuration. Another, <Path>pdf06-weirdchars.pdf</Path>, was processed but looks very strange. Exercise <TutorialRef>XXX</TutorialRef> looks at how to configure PDFPlug to handle these files better.</Text>
     761<Text id="0287a">Some of the documents don't look very nice in Greenstone. One of them, <Path>pdf05-notext.pdf</Path>, could not be processed using the default configuration. Another, <Path>pdf06-weirdchars.pdf</Path>, was processed but looks very strange. Exercise <TutorialRef id="enhanced_pdf"/> looks at how to configure PDFPlug to handle these files better.</Text>
    762762</Comment>
    763763<Heading>
     
    771771</NumberedItem>
    772772<NumberedItem>
    773 <Text id="0289b">Check whether the Title metadata is correct for each document by opening it. You can open a document from the Librarian Interface by double clicking on it.</Text>
    774 </NumberedItem>
    775 <NumberedItem>
    776 <Text id="0289c">The extracted Title metadata for some documents is incorrect. For example, the Titles for <Path>pdf01.pdf</Path> and <Path>word03.doc</Path> (the same document in different formats) have missed out the second line. The Title for <Path>pdf03.pdf</Path> has the wornf text altogether. The PostScript documents (<Path>cluster.ps</Path> and <Path>langmodl.ps</Path> do not have extracted titles: what appears in the <AutoText key="coredm::_Global:labelTitle_" type="italics"/> list is just the first few characters of the document).</Text>
     773<Text id="0289b">Check whether the <AutoText text="ex.Title"/> metadata is correct for some of the documents by opening them. You can open a document from the Librarian Interface by double clicking on it.</Text>
     774</NumberedItem>
     775<NumberedItem>
     776<Text id="0289c">The extracted Title metadata for some documents is incorrect. For example, the Titles for <Path>pdf01.pdf</Path> and <Path>word03.doc</Path> (the same document in different formats) have missed out the second line. The Title for <Path>pdf03.pdf</Path> has the wrong text altogether. The PostScript documents (<Path>cluster.ps</Path> and <Path>langmodl.ps</Path> do not have extracted titles: what appears in the <AutoText key="coredm::_Global:labelTitle_" type="italics"/> list is just the first few characters of the document).</Text>
    777777</NumberedItem>
    778778<Heading>
     
    11161116</Title>
    11171117<Content>
    1118 <Text id="ew-1">The standard way Greenstone processes Word documents is to convert them to HTML format using a third-party program, wvWare. This sometimes doesn't do a very good job of conversion. If you are using Windows, you can take advantage of Windows native scripting to do a better job of conversion. If the original document was hierarchically structured using Word styles, these can be used to structure the resulting HTML. Word document properties can also be extracted as metadata.</Text>
    1119 <NumberedItem>
    1120 <Text id="ew-2">In your digital library, preview the reports collection. Look at the Word documents and notice how they have no structure-they have been converted to flat documents.</Text>
     1118<Text id="ew-1">The standard way Greenstone processes Word documents is to convert them to HTML format using a third-party program, wvWare. This sometimes doesn't do a very good job of conversion. If you are using Windows, and have Microsoft Word installed, you can take advantage of Windows native scripting to do a better job of conversion. If the original document was hierarchically structured using Word styles, these can be used to structure the resulting HTML. Word document properties can also be extracted as metadata.</Text>
     1119<NumberedItem>
     1120<Text id="ew-2">In your digital library, preview the <b>reports</b> collection. Look at the HTML versions of the Word documents and notice how they have no structure-they have been converted to flat documents.</Text>
    11211121</NumberedItem>
    11221122<Heading>
     
    11241124</Heading>
    11251125<NumberedItem>
    1126 <Text id="ew-4">In the Librarian Interface, open up the reports collection. Switch to the <AutoText key="glidict::GUI.Design"/> panel and select the <AutoText key="glidict::CDM.GUI.Plugins"/> section on the left-hand side. Double click the <AutoText text="WordPlug"/> plugin and switch on the <AutoText text="windows_scripting"/> option.</Text>
    1127 </NumberedItem>
    1128 <NumberedItem>
    1129 <Text id="ew-5">Build and preview the collection. Have a look at <Path>word03.doc</Path> and <Path>word06.doc</Path>. These now appear with hierarchical structure. But these two are the only ones.</Text>
     1126<Text id="ew-4">In the Librarian Interface, open up the <b>reports</b> collection. Switch to the <AutoText key="glidict::GUI.Design"/> panel and select the <AutoText key="glidict::CDM.GUI.Plugins"/> section on the left-hand side. Double click the <AutoText text="WordPlug"/> plugin and switch on the <AutoText text="windows_scripting"/> option.</Text>
     1127</NumberedItem>
     1128<NumberedItem>
     1129<Text id="ew-5"><b>Build</b> the collection. You will notice that the Microsoft Word program is started up for each Word document&mdash;the document is saved as HTML from Word itself, to get a better conversion. <b>Preview</b> the collection. In the <AutoText key="coredm::_Global:labelTitle_"/> list, notice that <Path>word03.doc</Path> and <Path>word06.doc</Path> now have a book icon, rather than a page icon. These now appear with hierarchical structure. But these two are the only ones.</Text>
    11301130<Text id="ew-6">The default behaviour for <AutoText text="WordPlug"/> with <AutoText text="windows_scripting"/> is to section the document based on <AutoText text="Heading 1" type="quoted"/>, <AutoText text="Heading 2" type="quoted"/>, <AutoText text="Heading 3" type="quoted"/> styles. If you open up the <Path>word03.doc</Path> or <Path>word06.doc</Path> documents in Word, you will see that the sections use these Heading styles.</Text>
    11311131<Text id="ew-7">Note, to view style information in Word, you can select <Menu>Format &rarr; Styles and Formatting</Menu> from the menu, and a side bar will appear on the right hand side. Click on a section heading and the formatting information will be displayed in this side bar.</Text>
    11321132</NumberedItem>
    11331133<NumberedItem>
    1134 <Text id="ew-8">Some of the documents do not use styles (e.g. <Path>word01.doc</Path>) and no structure can be extracted from them. Some documents use user-defined styles. <AutoText text="WordPlug"/> can be configured to use these styles instead of <AutoText text="Heading 1" type="plain"/>, <AutoText text="Heading 2" type="plain"/> etc. Next we will configure WordPlug to use the styles found in <Path>word05.doc</Path>.</Text>
     1134<Text id="ew-8">Some of the documents do not use styles (e.g. <Path>word01.doc</Path>) and no structure can be extracted from them. Some documents use user-defined styles. <AutoText text="WordPlug"/> can be configured to use these styles instead of <AutoText text="Heading 1" type="plain"/>, <AutoText text="Heading 2" type="plain"/> etc. Next we will configure <AutoText text="WordPlug"/> to use the styles found in <Path>word05.doc</Path>.</Text>
     1135</NumberedItem>
     1136<Heading>
     1137<Text id="ew-8a">Modes in the Librarian Interface</Text>
     1138</Heading>
     1139<NumberedItem>
     1140<Text id="ew-8b">The Librarian Interface can operate in four modes. Go to <Menu><AutoText key="glidict::Menu.File"/> &rarr; <AutoText key="glidict::Menu.File_Options"/> &rarr; <AutoText key="glidict::Preferences.Mode"/></Menu> and see the four modes and what functionality they provide access to. <AutoText key="glidict::Preferences.Mode.Librarian"/> is the default mode.</Text>
     1141</NumberedItem>
     1142<NumberedItem>
     1143<Text id="ew-10">Change the mode to <AutoText key="glidict::Preferences.Mode.Systems"/> because you will need to use regular expressions to set up the style options in the next part of the exercise.</Text>
    11351144</NumberedItem>
    11361145<Heading>
     
    11381147</Heading>
    11391148<NumberedItem>
    1140 <Text id="ew-10">Change the mode in the Librarian Interface to <AutoText key="glidict::Preferences.Mode.Systems"/>  (<Menu><AutoText key="glidict::Menu.File"/> &rarr; <AutoText key="glidict::Menu.File_Options"/> &rarr; <AutoText key="glidict::Preferences.Mode"/></Menu>). This is because you will need to use regular expressions to set up the style options.</Text>
     1149<Text id="ew-9a">Open up <Path>word05.doc</Path> in Word (by double-clicking on it in the <AutoText key="glidict::GUI.Gather"/> pane), and examine the title and section heading styles. You will see that various user-defined header styles are set such as:</Text>
     1150<BulletList>
     1151<Bullet>
     1152<Text id="ew-13"><AutoText text="PaperTitle" type="italics"/>: Title of the paper</Text>
     1153</Bullet>
     1154<Bullet>
     1155<Text id="ew-14"><AutoText text="SammaryHeader" type="italics"/> (probably mistyped): Summary section</Text>
     1156</Bullet>
     1157<Bullet>
     1158<Text id="ew-15"><AutoText text="ChapterTitle" type="italics"/>: Level 1 section heading</Text>
     1159</Bullet>
     1160<Bullet>
     1161<Text id="ew-16"><AutoText text="SectionHeading" type="italics"/>: Level 2 section heading</Text>
     1162</Bullet>
     1163<Bullet>
     1164<Text id="ew-17"><AutoText text="ReferenceHeading" type="italics"/>: Reference section</Text>
     1165</Bullet>
     1166</BulletList>
    11411167</NumberedItem>
    11421168<NumberedItem>
     
    11441170<Format>
    11451171<BulletList>
    1146 <Bullet>title_header (titleHeader1|titleHeader2|...)</Bullet>
    11471172<Bullet>level1_header (level1Header1|level1Header2|...)</Bullet>
    11481173<Bullet>level2_header (level2Header1|level2Header2|...)</Bullet>
    11491174<Bullet>level3_header (level3Header1|level3Header2|...)</Bullet>
     1175<Bullet>title_header (titleHeader1|titleHeader2|...)</Bullet>
    11501176</BulletList>
    11511177</Format>
    1152 <Text id="ew-12">These header options define which styles should be considered as title, level 1, level 2 and level 3 styles. Open up the <Path>word05.doc</Path> in Word (by double-clicking on it in the <AutoText key="glidict::GUI.Gather"/> pane), and examine the title and section heading styles. You will see that various user-defined header styles are set such as:</Text>
    1153 <BulletList>
    1154 <Bullet>
    1155 <Text id="ew-13"><AutoText text="PaperTitle" type="italics"/>: Title of the paper</Text>
    1156 </Bullet>
    1157 <Bullet>
    1158 <Text id="ew-14"><AutoText text="SammaryHeader" type="italics"/> (probably mistyped): Summary section</Text>
    1159 </Bullet>
    1160 <Bullet>
    1161 <Text id="ew-15"><AutoText text="ChapterTitle" type="italics"/>: Level 1 section heading</Text>
    1162 </Bullet>
    1163 <Bullet>
    1164 <Text id="ew-16"><AutoText text="SectionHeading" type="italics"/>: Level 2 section heading</Text>
    1165 </Bullet>
    1166 <Bullet>
    1167 <Text id="ew-17"><AutoText text="ReferenceHeading" type="italics"/>: Reference section</Text>
    1168 </Bullet>
    1169 </BulletList>
    1170 <Text id="ew-18">Set the options in <AutoText text="WordPlug"/> as follows:</Text>
    1171 <Format>
    1172 title_header: PaperTitle<br/>
     1178<Text id="ew-12">These header options define which styles should be considered as title, level 1, level 2 and level 3 styles. </Text>
     1179<Text id="ew-12a">Set the options as follows:</Text>
     1180<Format>
    11731181level1_header:(SammaryHeader|ChapterTitle|ReferenceHeading|Reference_heading)<br/>
    1174 level2_header: SectionHeading
    1175 </Format>
    1176 </NumberedItem>
    1177 <NumberedItem>
    1178 <Text id="ew-19">Build the collection and preview it. Look in particular at <Path>word05.doc</Path>. You will see that this document is now also hierarchically structured.</Text>
     1182level2_header: SectionHeading<br/>
     1183title_header: PaperTitle
     1184</Format>
     1185<Text id="ew-23">Once these are set, click <AutoText key="glidict::General.OK" type="button"/>.</Text>
     1186</NumberedItem>
     1187<NumberedItem>
     1188<Text id="ew-23a">Close any documents that are still open in Word, as this can prevent the build process from completing correctly.</Text>
     1189</NumberedItem>
     1190<NumberedItem>
     1191<Text id="ew-19"><b>Build</b> the collection and <b>preview</b> it. Look in particular at <Path>word05.doc</Path>. You will see that this document is now also hierarchically structured.</Text>
     1192<Text id="ew-19a">If you have documents with different formatting styles, you can use <Format>(...|...)</Format> to specify all of the different styles.</Text>
    11791193</NumberedItem>
    11801194<Heading>
     
    11901204tof_header: MsoTof
    11911205</Format>
    1192 <Text id="ew-23">Once these are set, click <AutoText key="glidict::General.OK" type="button"/>.</Text>
    1193 </NumberedItem>
    1194 <NumberedItem>
    1195 <Text id="ew-24">Build and preview the collection. <Path>word06.doc</Path> should now only have one table of contents.</Text>
     1206</NumberedItem>
     1207<NumberedItem>
     1208<Text id="ew-24">Build and preview the collection. <Path>word06.doc</Path> should now have only one table of contents.</Text>
    11961209</NumberedItem>
    11971210<Heading>
     
    11991212</Heading>
    12001213<NumberedItem>
    1201 <Text id="ew-26">Word document properties can be extracted as metadata. By default, only the Title will be extracted. Other properties can be extracted using the <AutoText text="extracted_word_metadata_fields"/>  option.</Text>
    1202 </NumberedItem>
    1203 <NumberedItem>
    1204 <Text id="ew-27">In the Enrich panel, look at the metadata that has been extracted for word05.doc and word06.doc. Now open the documents in Word and look at what properties they have set. (<Menu>File &rarr; Properties</Menu>). They have Title, Author, Subject, and Keywords properties. WordPlug can be configured to look for these properties and extract them.</Text>
     1214<Text id="ew-26">Word document properties can be extracted as metadata. By default, only the Title will be extracted. Other properties can be extracted using the <AutoText text="metadata_fields"/>  option.</Text>
     1215</NumberedItem>
     1216<NumberedItem>
     1217<Text id="ew-27">In the <AutoText key="glidict::GUI.Enrich"/> panel, look at the metadata that has been extracted for <Path>word05.doc</Path> and <Path>word06.doc</Path>. Now open the documents in Word and look at what properties they have set. (<Menu>File &rarr; Properties</Menu>). They have Title, Author, Subject, and Keywords properties. WordPlug can be configured to look for these properties and extract them.</Text>
    12051218</NumberedItem>
    12061219<NumberedItem>
     
    15591572</NumberedItem>
    15601573<NumberedItem>
    1561 <Text id="0464">Preview the newly rebuilt collection's <AutoText key="coredm::_Global:labelTitle_"/> page. Previously this listed more than a dozen pages per letter of the alphabet, but now there are just three&mdash;the first three files encountered by the building process.</Text>
     1574<Text id="0464">Preview the newly rebuilt collection's <AutoText key="coredm::_Global:labelTitle_"/>
     1575 page. Previously this listed more than a dozen pages per letter of the alphabet, but now there are just three&mdash;the first three files encountered by the building process.</Text>
    15621576</NumberedItem>
    15631577<NumberedItem>
     
    15841598<Text id="0469">This displays something that looks like this: </Text>
    15851599<Indent>
    1586 <table><tr><td><img width='15' height='20' src="../tutorial_files/itext.gif"/></td><td width='408' valign='top'>A discussion of question five from Tudor Quiz: Henry VIII <br/><i>(quizstuff.html)</i></td></tr></table>
     1600<table><tr><td><img width='15' height='20' src="tutorial_files/itext.gif"/></td><td width='408' valign='top'>A discussion of question five from Tudor Quiz: Henry VIII <br/><i>(quizstuff.html)</i></td></tr></table>
    15871601</Indent>
    15881602<Text id="0472">for a particular document whose <i>Title</i> metadata is <AutoText text="A discussion of question five from Tudor Quiz: Henry VIII"/> and whose <i>Source</i> metadata is <AutoText text="quizstuff.html"/>.</Text>
     
    16371651<NumberedItem>
    16381652<Text id="0490">Now go to the <AutoText key="glidict::GUI.Create"/> panel and click <AutoText key="glidict::CreatePane.Preview_Collection" type="button"/>. Documents in the search results list will be displayed like this:</Text>
    1639 <table><tr><td><img width='15' height='20' src="../tutorial_files/itext.gif" /></td><td width='408' valign='top'>A discussion of question five from Tudor Quiz: Henry VIII <br/>
     1653<table><tr><td><img width='15' height='20' src="tutorial_files/itext.gif" /></td><td width='408' valign='top'>A discussion of question five from Tudor Quiz: Henry VIII <br/>
    16401654Tudor period|Others</td></tr></table>
    16411655<Text id="0493">(The vertical bar appears because this <i>dc.Subject and Keywords</i> metadata is hierarchical metadata. Unfortunately there is no way to get at individual components of the hierarchy. For most metadata, such as title and author, this isn't a problem.)</Text>
     
    19551969<NumberedItem>
    19561970<Text id="mf-18">First, we'll change the colour of the navigation bar and green divider bars. These use an image as a background, specified in the same macro as the page background.</Text>
    1957 <Text id="mf-19">Open <Path>Greenstone &rarr; macros &rarr; style.dm</Path> in a text editor, and find the <Format>_cssheader_</Format> macro that you modified previously. Change the div.navbar and div.divbar parts to use divb-blue.gif instead of bg_green.png:</Text>
     1971<Text id="mf-19">Open <Path>Greenstone &rarr; macros &rarr; style.dm</Path> in a text editor, and find the <Format>_cssheader_</Format> macro that you modified previously. Change the <Format>div.navbar</Format> and <Format>div.divbar</Format> parts to use <Format>divb-blue.gif</Format> instead of <Format>bg_green.png</Format>:</Text>
    19581972<Format>
    19591973#div.navbar \{ background-image: url("_httpimg_/bg_green.png"); \}<br/>
     
    19902004}
    19912005</Format>
    1992 <Text id="mf-27a">Change <Format>color</Format> to <Format>teal</Format>.</Text>
    1993 </Bullet>
    1994 <Bullet>
    1995 <Text id="mf-25">For <Format>a.collectiontitle</Format>, change <Format>color</Format> to <Format>blue</Format>.</Text>
     2006<Text id="mf-27a">Set <Format>color</Format> to <Format>teal</Format>.</Text>
     2007</Bullet>
     2008<Bullet>
     2009<Text id="mf-25">For <Format>a.collectiontitle</Format>, set <Format>color</Format> to <Format>blue</Format>.</Text>
    19962010</Bullet>
    19972011<Bullet>
     
    20042018<BulletList>
    20052019<Bullet>
    2006 <Text id="mf-29">For <Format>div.pageinfo</Format>, change both <Format>float</Format> and <Format>text-align</Format> to <Format>left</Format>.</Text>
    2007 </Bullet>
    2008 <Bullet>
    2009 <Text id="mf-30">For <Format>div.collectimage</Format>, change <Format>float</Format> and <Format>text-align</Format> to <Format>right</Format>.</Text>
     2020<Text id="mf-29">For <Format>div.pageinfo</Format>, set both <Format>float</Format> and <Format>text-align</Format> to <Format>left</Format>.</Text>
     2021</Bullet>
     2022<Bullet>
     2023<Text id="mf-30">For <Format>div.collectimage</Format>, set <Format>float</Format> and <Format>text-align</Format> to <Format>right</Format>.</Text>
    20102024</Bullet>
    20112025</BulletList>
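
Reviewer note on the `(...|...)` header options touched in this changeset (ew-12, ew-12a, ew-19a): the values are regular-expression alternations over Word style names. The following is a minimal sketch of how such patterns could classify a style, not the WordPlug implementation. It assumes the pattern must match the whole style name, which the changeset does not confirm; the helper `classify_style` and the `HEADER_OPTIONS` table are hypothetical names introduced only for illustration.

```python
import re

# Option values copied from the tutorial text in this changeset.
# How WordPlug applies them internally is an assumption of this sketch.
HEADER_OPTIONS = {
    "title_header": "PaperTitle",
    "level1_header": "(SammaryHeader|ChapterTitle|ReferenceHeading|Reference_heading)",
    "level2_header": "SectionHeading",
}


def classify_style(style_name):
    """Return the first option whose pattern matches the whole style name.

    Returns None when no pattern matches, i.e. the style would not be
    treated as a title or section heading in this sketch.
    """
    for option, pattern in HEADER_OPTIONS.items():
        if re.fullmatch(pattern, style_name):
            return option
    return None


if __name__ == "__main__":
    for style in ["PaperTitle", "SammaryHeader", "SectionHeading", "Normal"]:
        print(style, "->", classify_style(style))
```

Under this assumption, adding a further style to a level is a matter of extending the alternation, for example `(ChapterTitle|AnotherStyle)`, where `AnotherStyle` is a placeholder name.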