Changeset 11859
- Timestamp:
- 2006-05-25T17:34:40+12:00
- File:
-
- 1 edited
trunk/gsdl-documentation/tutorials/xml-source/tutorial_en.xml
r11848 r11859 750 750 </Comment> 751 751 <NumberedItem> 752 <Text id="0281">Start a new collection called <b>reports</b> , fill out appropriate fields for it, and choose Dublin Core as the metadata set.</Text>752 <Text id="0281">Start a new collection called <b>reports</b> (<AutoText key="glidict::Menu.File"/> → <AutoText key="glidict::Menu.File_New"/>), base it on <AutoText key="glidict::NewCollectionPrompt.NewCollection"/>, and choose Dublin Core as the metadata set.</Text> 753 753 </NumberedItem> 754 754 <NumberedItem> … … 759 759 </NumberedItem> 760 760 <Comment> 761 <Text id="0287a">Some of the documents don't look very nice in Greenstone. One of them, <Path>pdf05-notext.pdf</Path>, could not be processed using the default configuration. Another, <Path>pdf06-weirdchars.pdf</Path>, was processed but looks very strange. Exercise <TutorialRef >XXX</TutorialRef> looks at how to configure PDFPlug to handle these files better.</Text>761 <Text id="0287a">Some of the documents don't look very nice in Greenstone. One of them, <Path>pdf05-notext.pdf</Path>, could not be processed using the default configuration. Another, <Path>pdf06-weirdchars.pdf</Path>, was processed but looks very strange. Exercise <TutorialRef id="enhanced_pdf"/> looks at how to configure PDFPlug to handle these files better.</Text> 762 762 </Comment> 763 763 <Heading> … … 771 771 </NumberedItem> 772 772 <NumberedItem> 773 <Text id="0289b">Check whether the Title metadata is correct for each document by opening it. You can open a document from the Librarian Interface by double clicking on it.</Text>774 </NumberedItem> 775 <NumberedItem> 776 <Text id="0289c">The extracted Title metadata for some documents is incorrect. For example, the Titles for <Path>pdf01.pdf</Path> and <Path>word03.doc</Path> (the same document in different formats) have missed out the second line. The Title for <Path>pdf03.pdf</Path> has the w ornftext altogether. 
The PostScript documents (<Path>cluster.ps</Path> and <Path>langmodl.ps</Path> do not have extracted titles: what appears in the <AutoText key="coredm::_Global:labelTitle_" type="italics"/> list is just the first few characters of the document).</Text>773 <Text id="0289b">Check whether the <AutoText text="ex.Title"/> metadata is correct for some of the documents by opening them. You can open a document from the Librarian Interface by double clicking on it.</Text> 774 </NumberedItem> 775 <NumberedItem> 776 <Text id="0289c">The extracted Title metadata for some documents is incorrect. For example, the Titles for <Path>pdf01.pdf</Path> and <Path>word03.doc</Path> (the same document in different formats) have missed out the second line. The Title for <Path>pdf03.pdf</Path> has the wrong text altogether. The PostScript documents (<Path>cluster.ps</Path> and <Path>langmodl.ps</Path> do not have extracted titles: what appears in the <AutoText key="coredm::_Global:labelTitle_" type="italics"/> list is just the first few characters of the document).</Text> 777 777 </NumberedItem> 778 778 <Heading> … … 1116 1116 </Title> 1117 1117 <Content> 1118 <Text id="ew-1">The standard way Greenstone processes Word documents is to convert them to HTML format using a third-party program, wvWare. This sometimes doesn't do a very good job of conversion. If you are using Windows, you can take advantage of Windows native scripting to do a better job of conversion. If the original document was hierarchically structured using Word styles, these can be used to structure the resulting HTML. Word document properties can also be extracted as metadata.</Text>1119 <NumberedItem> 1120 <Text id="ew-2">In your digital library, preview the reports collection. 
Look at the Word documents and notice how they have no structure-they have been converted to flat documents.</Text> 1118 <Text id="ew-1">The standard way Greenstone processes Word documents is to convert them to HTML format using a third-party program, wvWare. This sometimes doesn't do a very good job of conversion. If you are using Windows, and have Microsoft Word installed, you can take advantage of Windows native scripting to do a better job of conversion. If the original document was hierarchically structured using Word styles, these can be used to structure the resulting HTML. Word document properties can also be extracted as metadata.</Text> 1119 <NumberedItem> 1120 <Text id="ew-2">In your digital library, preview the <b>reports</b> collection. Look at the HTML versions of the Word documents and notice how they have no structure-they have been converted to flat documents.</Text> 1121 1121 </NumberedItem> 1122 1122 <Heading> … … 1124 1124 </Heading> 1125 1125 <NumberedItem> 1126 <Text id="ew-4">In the Librarian Interface, open up the reports collection. Switch to the <AutoText key="glidict::GUI.Design"/> panel and select the <AutoText key="glidict::CDM.GUI.Plugins"/> section on the left-hand side. Double click the <AutoText text="WordPlug"/> plugin and switch on the <AutoText text="windows_scripting"/> option.</Text> 1127 </NumberedItem> 1128 <NumberedItem> 1129 <Text id="ew-5"> Build and preview the collection. Have a look at <Path>word03.doc</Path> and <Path>word06.doc</Path>. These now appear with hierarchical structure. But these two are the only ones.</Text> 1126 <Text id="ew-4">In the Librarian Interface, open up the <b>reports</b> collection. Switch to the <AutoText key="glidict::GUI.Design"/> panel and select the <AutoText key="glidict::CDM.GUI.Plugins"/> section on the left-hand side. 
Double click the <AutoText text="WordPlug"/> plugin and switch on the <AutoText text="windows_scripting"/> option.</Text> 1127 </NumberedItem> 1128 <NumberedItem> 1129 <Text id="ew-5"><b>Build</b> the collection. You will notice that the Microsoft Word program is started up for each Word document—the document is saved as HTML from Word itself, to get a better conversion. <b>Preview</b> the collection. In the <AutoText key="coredm::_Global:labelTitle_"/> list, notice that <Path>word03.doc</Path> and <Path>word06.doc</Path> now have a book icon, rather than a page icon. These now appear with hierarchical structure. But these two are the only ones.</Text> 1130 1130 <Text id="ew-6">The default behaviour for <AutoText text="WordPlug"/> with <AutoText text="windows_scripting"/> is to section the document based on <AutoText text="Heading 1" type="quoted"/>, <AutoText text="Heading 2" type="quoted"/>, <AutoText text="Heading 3" type="quoted"/> styles. If you open up the <Path>word03.doc</Path> or <Path>word06.doc</Path> documents in Word, you will see that the sections use these Heading styles.</Text> 1131 1131 <Text id="ew-7">Note, to view style information in Word, you can select <Menu>Format → Styles and Formatting</Menu> from the menu, and a side bar will appear on the right hand side. Click on a section heading and the formatting information will be displayed in this side bar.</Text> 1132 1132 </NumberedItem> 1133 1133 <NumberedItem> 1134 <Text id="ew-8">Some of the documents do not use styles (e.g. <Path>word01.doc</Path>) and no structure can be extracted from them. Some documents use user-defined styles. <AutoText text="WordPlug"/> can be configured to use these styles instead of <AutoText text="Heading 1" type="plain"/>, <AutoText text="Heading 2" type="plain"/> etc. Next we will configure WordPlug to use the styles found in <Path>word05.doc</Path>.</Text> 1134 <Text id="ew-8">Some of the documents do not use styles (e.g. 
<Path>word01.doc</Path>) and no structure can be extracted from them. Some documents use user-defined styles. <AutoText text="WordPlug"/> can be configured to use these styles instead of <AutoText text="Heading 1" type="plain"/>, <AutoText text="Heading 2" type="plain"/> etc. Next we will configure <AutoText text="WordPlug"/> to use the styles found in <Path>word05.doc</Path>.</Text> 1135 </NumberedItem> 1136 <Heading> 1137 <Text id="ew-8a">Modes in the Librarian Interface</Text> 1138 </Heading> 1139 <NumberedItem> 1140 <Text id="ew-8b">The Librarian Interface can operate in four modes. Go to <Menu><AutoText key="glidict::Menu.File"/> → <AutoText key="glidict::Menu.File_Options"/> → <AutoText key="glidict::Preferences.Mode"/></Menu> and see the four modes and what functionality they provide access to. <AutoText key="glidict::Preferences.Mode.Librarian"/> is the default mode.</Text> 1141 </NumberedItem> 1142 <NumberedItem> 1143 <Text id="ew-10">Change the mode to <AutoText key="glidict::Preferences.Mode.Systems"/> because you will need to use regular expressions to set up the style options in the next part of the exercise.</Text> 1135 1144 </NumberedItem> 1136 1145 <Heading> … … 1138 1147 </Heading> 1139 1148 <NumberedItem> 1140 <Text id="ew-10">Change the mode in the Librarian Interface to <AutoText key="glidict::Preferences.Mode.Systems"/> (<Menu><AutoText key="glidict::Menu.File"/> → <AutoText key="glidict::Menu.File_Options"/> → <AutoText key="glidict::Preferences.Mode"/></Menu>). This is because you will need to use regular expressions to set up the style options.</Text> 1149 <Text id="ew-9a">Open up <Path>word05.doc</Path> in Word (by double-clicking on it in the <AutoText key="glidict::GUI.Gather"/> pane), and examine the title and section heading styles. 
You will see that various user-defined header styles are set such as:</Text> 1150 <BulletList> 1151 <Bullet> 1152 <Text id="ew-13"><AutoText text="PaperTitle" type="italics"/>: Title of the paper</Text> 1153 </Bullet> 1154 <Bullet> 1155 <Text id="ew-14"><AutoText text="SammaryHeader" type="italics"/> (probably mistyped): Summary section</Text> 1156 </Bullet> 1157 <Bullet> 1158 <Text id="ew-15"><AutoText text="ChapterTitle" type="italics"/>: Level 1 section heading</Text> 1159 </Bullet> 1160 <Bullet> 1161 <Text id="ew-16"><AutoText text="SectionHeading" type="italics"/>: Level 2 section heading</Text> 1162 </Bullet> 1163 <Bullet> 1164 <Text id="ew-17"><AutoText text="ReferenceHeading" type="italics"/>: Reference section</Text> 1165 </Bullet> 1166 </BulletList> 1141 1167 </NumberedItem> 1142 1168 <NumberedItem> … … 1144 1170 <Format> 1145 1171 <BulletList> 1146 <Bullet>title_header (titleHeader1|titleHeader2|...)</Bullet>1147 1172 <Bullet>level1_header (level1Header1|level1Header2|...)</Bullet> 1148 1173 <Bullet>level2_header (level2Header1|level2Header2|...)</Bullet> 1149 1174 <Bullet>level3_header (level3Header1|level3Header2|...)</Bullet> 1175 <Bullet>title_header (titleHeader1|titleHeader2|...)</Bullet> 1150 1176 </BulletList> 1151 1177 </Format> 1152 <Text id="ew-12">These header options define which styles should be considered as title, level 1, level 2 and level 3 styles. Open up the <Path>word05.doc</Path> in Word (by double-clicking on it in the <AutoText key="glidict::GUI.Gather"/> pane), and examine the title and section heading styles. 
You will see that various user-defined header styles are set such as:</Text> 1153 <BulletList> 1154 <Bullet> 1155 <Text id="ew-13"><AutoText text="PaperTitle" type="italics"/>: Title of the paper</Text> 1156 </Bullet> 1157 <Bullet> 1158 <Text id="ew-14"><AutoText text="SammaryHeader" type="italics"/> (probably mistyped): Summary section</Text> 1159 </Bullet> 1160 <Bullet> 1161 <Text id="ew-15"><AutoText text="ChapterTitle" type="italics"/>: Level 1 section heading</Text> 1162 </Bullet> 1163 <Bullet> 1164 <Text id="ew-16"><AutoText text="SectionHeading" type="italics"/>: Level 2 section heading</Text> 1165 </Bullet> 1166 <Bullet> 1167 <Text id="ew-17"><AutoText text="ReferenceHeading" type="italics"/>: Reference section</Text> 1168 </Bullet> 1169 </BulletList> 1170 <Text id="ew-18">Set the options in <AutoText text="WordPlug"/> as follows:</Text> 1171 <Format> 1172 title_header: PaperTitle<br/> 1178 <Text id="ew-12">These header options define which styles should be considered as title, level 1, level 2 and level 3 styles. </Text> 1179 <Text id="ew-12a">Set the options as follows:</Text> 1180 <Format> 1173 1181 level1_header:(SammaryHeader|ChapterTitle|ReferenceHeading|Reference_heading)<br/> 1174 level2_header: SectionHeading 1175 </Format> 1176 </NumberedItem> 1177 <NumberedItem> 1178 <Text id="ew-19">Build the collection and preview it. Look in particular at <Path>word05.doc</Path>. You will see that this document is now also hierarchically structured.</Text> 1182 level2_header: SectionHeading<br/> 1183 title_header: PaperTitle 1184 </Format> 1185 <Text id="ew-23">Once these are set, click <AutoText key="glidict::General.OK" type="button"/>.</Text> 1186 </NumberedItem> 1187 <NumberedItem> 1188 <Text id="ew-23a">Close any documents that are still open in Word, as this can prevent the build process from completing correctly.</Text> 1189 </NumberedItem> 1190 <NumberedItem> 1191 <Text id="ew-19"><b>Build</b> the collection and <b>preview</b> it. 
Look in particular at <Path>word05.doc</Path>. You will see that this document is now also hierarchically structured.</Text> 1192 <Text id="ew-19a">If you have documents with different formatting styles, you can use <Format>(...|...)</Format> to specify all of the different styles.</Text> 1179 1193 </NumberedItem> 1180 1194 <Heading> … … 1190 1204 tof_header: MsoTof 1191 1205 </Format> 1192 <Text id="ew-23">Once these are set, click <AutoText key="glidict::General.OK" type="button"/>.</Text> 1193 </NumberedItem> 1194 <NumberedItem> 1195 <Text id="ew-24">Build and preview the collection. <Path>word06.doc</Path> should now only have one table of contents.</Text> 1206 </NumberedItem> 1207 <NumberedItem> 1208 <Text id="ew-24">Build and preview the collection. <Path>word06.doc</Path> should now have only one table of contents.</Text> 1196 1209 </NumberedItem> 1197 1210 <Heading> … … 1199 1212 </Heading> 1200 1213 <NumberedItem> 1201 <Text id="ew-26">Word document properties can be extracted as metadata. By default, only the Title will be extracted. Other properties can be extracted using the <AutoText text=" extracted_word_metadata_fields"/> option.</Text>1202 </NumberedItem> 1203 <NumberedItem> 1204 <Text id="ew-27">In the Enrich panel, look at the metadata that has been extracted for word05.doc and word06.doc. Now open the documents in Word and look at what properties they have set. (<Menu>File → Properties</Menu>). They have Title, Author, Subject, and Keywords properties. WordPlug can be configured to look for these properties and extract them.</Text>1214 <Text id="ew-26">Word document properties can be extracted as metadata. By default, only the Title will be extracted. 
Other properties can be extracted using the <AutoText text="metadata_fields"/> option.</Text> 1215 </NumberedItem> 1216 <NumberedItem> 1217 <Text id="ew-27">In the <AutoText key="glidict::GUI.Enrich"/> panel, look at the metadata that has been extracted for <Path>word05.doc</Path> and <Path>word06.doc</Path>. Now open the documents in Word and look at what properties they have set. (<Menu>File → Properties</Menu>). They have Title, Author, Subject, and Keywords properties. WordPlug can be configured to look for these properties and extract them.</Text> 1205 1218 </NumberedItem> 1206 1219 <NumberedItem> … … 1559 1572 </NumberedItem> 1560 1573 <NumberedItem> 1561 <Text id="0464">Preview the newly rebuilt collection's <AutoText key="coredm::_Global:labelTitle_"/> page. Previously this listed more than a dozen pages per letter of the alphabet, but now there are just three—the first three files encountered by the building process.</Text> 1574 <Text id="0464">Preview the newly rebuilt collection's <AutoText key="coredm::_Global:labelTitle_"/> 1575 page. 
Previously this listed more than a dozen pages per letter of the alphabet, but now there are just three—the first three files encountered by the building process.</Text> 1562 1576 </NumberedItem> 1563 1577 <NumberedItem> … … 1584 1598 <Text id="0469">This displays something that looks like this: </Text> 1585 1599 <Indent> 1586 <table><tr><td><img width='15' height='20' src=" ../tutorial_files/itext.gif"/></td><td width='408' valign='top'>A discussion of question five from Tudor Quiz: Henry VIII <br/><i>(quizstuff.html)</i></td></tr></table>1600 <table><tr><td><img width='15' height='20' src="tutorial_files/itext.gif"/></td><td width='408' valign='top'>A discussion of question five from Tudor Quiz: Henry VIII <br/><i>(quizstuff.html)</i></td></tr></table> 1587 1601 </Indent> 1588 1602 <Text id="0472">for a particular document whose <i>Title</i> metadata is <AutoText text="A discussion of question five from Tudor Quiz: Henry VIII"/> and whose <i>Source</i> metadata is <AutoText text="quizstuff.html"/>.</Text> … … 1637 1651 <NumberedItem> 1638 1652 <Text id="0490">Now go to the <AutoText key="glidict::GUI.Create"/> panel and click <AutoText key="glidict::CreatePane.Preview_Collection" type="button"/>. Documents in the search results list will be displayed like this:</Text> 1639 <table><tr><td><img width='15' height='20' src=" ../tutorial_files/itext.gif" /></td><td width='408' valign='top'>A discussion of question five from Tudor Quiz: Henry VIII <br/>1653 <table><tr><td><img width='15' height='20' src="tutorial_files/itext.gif" /></td><td width='408' valign='top'>A discussion of question five from Tudor Quiz: Henry VIII <br/> 1640 1654 Tudor period|Others</td></tr></table> 1641 1655 <Text id="0493">(The vertical bar appears because this <i>dc.Subject and Keywords</i> metadata is hierarchical metadata. Unfortunately there is no way to get at individual components of the hierarchy. 
For most metadata, such as title and author, this isn't a problem.)</Text> … … 1955 1969 <NumberedItem> 1956 1970 <Text id="mf-18">First, we'll change the colour of the navigation bar and green divider bars. These use an image as a background, specified in the same macro as the page background.</Text> 1957 <Text id="mf-19">Open <Path>Greenstone → macros → style.dm</Path> in a text editor, and find the <Format>_cssheader_</Format> macro that you modified previously. Change the div.navbar and div.divbar parts to use divb-blue.gif instead of bg_green.png:</Text> 1971 <Text id="mf-19">Open <Path>Greenstone → macros → style.dm</Path> in a text editor, and find the <Format>_cssheader_</Format> macro that you modified previously. Change the <Format>div.navbar</Format> and <Format>div.divbar</Format> parts to use <Format>divb-blue.gif</Format> instead of <Format>bg_green.png</Format>:</Text> 1958 1972 <Format> 1959 1973 #div.navbar \{ background-image: url("_httpimg_/bg_green.png"); \}<br/> … … 1990 2004 } 1991 2005 </Format> 1992 <Text id="mf-27a">Change <Format>color</Format> to <Format>teal</Format>.</Text> 1993 </Bullet> 1994 <Bullet> 1995 <Text id="mf-25">For <Format>a.collectiontitle</Format>, change <Format>color</Format> to <Format>blue</Format>.</Text> 2006 <Text id="mf-27a">Set <Format>color</Format> to <Format>teal</Format>.</Text> 2007 </Bullet> 2008 <Bullet> 2009 <Text id="mf-25">For <Format>a.collectiontitle</Format>, set <Format>color</Format> to <Format>blue</Format>.</Text> 1996 2010 </Bullet> 1997 2011 <Bullet> … … 2004 2018 <BulletList> 2005 2019 <Bullet> 2006 <Text id="mf-29">For <Format>div.pageinfo</Format>, change both <Format>float</Format> and <Format>text-align</Format> to <Format>left</Format>.</Text> 2007 </Bullet> 2008 <Bullet> 2009 <Text id="mf-30">For <Format>div.collectimage</Format>, change <Format>float</Format> and <Format>text-align</Format> to <Format>right</Format>.</Text> 2020 <Text id="mf-29">For <Format>div.pageinfo</Format>, set both 
<Format>float</Format> and <Format>text-align</Format> to <Format>left</Format>.</Text> 2021 </Bullet> 2022 <Bullet> 2023 <Text id="mf-30">For <Format>div.collectimage</Format>, set <Format>float</Format> and <Format>text-align</Format> to <Format>right</Format>.</Text> 2010 2024 </Bullet> 2011 2025 </BulletList>
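
The header options in the WordPlug step of this changeset (title_header, level1_header, level2_header) take regular-expression alternations of Word style names. The following is a minimal sketch, not taken from Greenstone itself, of how such alternations can sort style names into section levels. The HEADER_PATTERNS table and the classify_style function are hypothetical names used only for illustration, and matching against the whole style name is an assumption; WordPlug's actual matching rules may differ.

```python
import re

# Hypothetical table (not part of Greenstone): the option values shown in
# the tutorial text, keyed by the section level they are meant to select.
HEADER_PATTERNS = {
    "title": r"PaperTitle",
    "level1": r"(SammaryHeader|ChapterTitle|ReferenceHeading|Reference_heading)",
    "level2": r"SectionHeading",
}


def classify_style(style_name):
    """Return the level whose pattern matches the whole style name, or None.

    Assumption: the entire style name must match. This is a choice made for
    the sketch, not documented WordPlug behaviour.
    """
    for level, pattern in HEADER_PATTERNS.items():
        if re.fullmatch(pattern, style_name):
            return level
    return None


if __name__ == "__main__":
    for name in ("PaperTitle", "SammaryHeader", "SectionHeading", "Normal"):
        print(name, "->", classify_style(name))
```

Under these assumptions, a style absent from every alternation (such as Normal) yields None, which corresponds to the tutorial's remark that documents without recognised heading styles come out unstructured. Adding another style name to an alternation is how a collection with mixed formatting styles would be covered.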