Changeset 35093 for main/trunk/model-sites-dev/eurovision-lod/collect/eurovision/transform/pages/about.xsl
Timestamp: 2021-04-22T14:50:00+12:00 (3 years ago)
File: 1 edited
main/trunk/model-sites-dev/eurovision-lod/collect/eurovision/transform/pages/about.xsl
r35066 r35093
   18  18  <gsf:script src="sites/{$site_name}/collect/{$collName}/js/jquery.show-more.js"/>
   19  19
+      20
   20  21  <div id="about-desc">
   21  22  <h2>Introduction</h2>
… …
   23  24  <p style="padding-bottom: 10px;">
   24  25  The <a href="https://eurovision.tv">Eurovision Song
-  25      Conte nt</a> is a live-broadcast televised event that
+      26  Contest</a> is a live-broadcast televised event that
   26  27  was first held in 1956 featuring artists singing original songs from
   27  28  7 countries. Since then it has grown into an event involving
… …
   43  44  The contest has grown significantly from
   44  45  that modest start with 7 countries (and one cameraman),
-  45      with over 40 countries competing these days– even
-  46      Australiatakes part now, through a specially
+      46  with over 40 countries competing these days–Australia
+      47  even takes part now, through a specially
   47  48  arranged invitation. It's an annual celebration of
   48  49  European culture and the highlight of many people's
… …
  507 508
  508 509  <p>
- 509      Access to and the analysis of how countries have voted over the years
+     510  To fulfill our vision of developing this DL collection
+     511  as a rich resource through which people can explore the
+     512  phenomenon we went looking for voting data that was
+     513  available in a machine-readable format.
+     514  We found data compiled through a manual curation process
+     515  about how countries have voted going back to 1975 is available through the
+     516  <a href="https://www.kaggle.com/datagraver/eurovision-song-contest-scores-19752019">Kaggle website as an Excel spreadsheet</a>.
+     517  </p>
+     518  <p>
+     519  To incorporate this as metadata into the DL, we wrote
+     520  some Python code to transform the data into the internal
+     521  serialized metadata format used by Greenstone. Prior to
+     522  this project, the only serialized form for this was XML,
+     523  which is processed by the MetadataXML plugin. As it was
+     524  more convenient to generate JSON from our Python code,
+     525  we took the step of adding in a new plugin to
+     526  Greenstone3: MetadataJSON.
+     527  </p>
+     528
+     529  <h3>Page Scraping</h3>
+     530
+     531  <p>
+     532  Despite our best intentions work soley with
+     533  machine-readable data–primarily as you have seen in the
+     534  form of Linked Open Data, but also utilizing a
+     535  spreadsheet of voting data–to form the Eurovision DL,
+     536  in looking to expand the metadata in the DL to cover
+     537  details concerning the draw position of acts, and their
+     538  overall placing, we have resorted to page-scraping
+     539  content from Wikipedia itself. This was because such
+     540  information was not part of the entity extraction
+     541  process that occurs when Wikipedia is mapped to DBpedia.
+     542  </p>
+     543
+     544  <p>
+     545  A review of Wikipedia article pages about the event in
+     546  any given year showed these pages to be especially well
+     547  curated, and included a table in each that listed the
+     548  information we sought. While there was some variation
+     549  in how this table was expressed in HTML, with a
+     550  considerably portion of the heavy lifting being done by
+     551  the Python library BeautifulSoup4, it was not too
+     552  complex a task to develop a program that extracted this
+     553  information and turned it into the newly developed
+     554  Greenstone JSON metadata format.
+     555  </p>
  510 556
- 511      To fulfill our vision of developing this DL collection as a rich resource to
- 512      through which people can explore the phenomenon.
- 513
- 514      </p>
+     557  <h3>Patching in Missing Data</h3>
+     558
  515 559
- 516      <h3>Patching in Missing Data: Page Scraping</h3>
- 517
- 518
- 519      <p>
- 520      Despite our best intentions to work solely with ....
- 521      .. missing categories ...
- 522
- 523      totting up how many entrie per year ...
- 524      thousands of entries
- 525
+     560  <p>
+     561  Another difficulty we have encountered is that
+     562  not every country who had an entry in Eurovision
+     563  in a given year has its own standalone article page.
+     564  This leads to missing entries in the category
+     565  page for the contest in a given year, which is
+     566  problematic to us, because it is this category
+     567  information that we draw upon in our SPARQL query
+     568  to populate the DL with all the acts.
+     569  </p>
+     570  <p>
+     571  The information about all the countries competing
+     572  in a given year does, however, appear in the
+     573  article page for the contest in that year. In fact
+     574  it's in the same table we targetted to extract out
+     575  draw position and placement. We therefore
+     576  wrote a further page-scraping program to compare
+     577  the countries in that table with the countries
+     578  listed on the category page for the contest in
+     579  that year. For any entries we find in the
+     580  table, but not in the Category page, we
+     581  produce a metadata record for the DL
+     582  with basic information about the entry:
+     583  country, year, song title, artist,
+     584  draw-position, placement, and (where available)
+     585  their total score.
+     586  </p>
+     587  <p>
+     588  Comparable with the problem titles and artist/entrants,
+     589  we have formulated a SPARQL query that enumerates
+     590  these missing category entrants:
+     591  <!--
  526 592  We took the opportunity to add in further fields: Performing Position, Placement, Voting Total, thumbnail flag image.
+     593
+     594
+     595  An unintended side-affect of this is that we have also been able to expand
+     596  -->
+     597
  527 598
  528 599  <ul>
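The "Patching in Missing Data" text added in this changeset describes comparing the countries in a contest-year table against those on the category page, and emitting a record for each entry found only in the table. The changeset does not include the program that does this, so the following is only a minimal sketch of that comparison step. The function name, the record field names, and the sample rows are all hypothetical placeholders, and the JSON printed here is not claimed to match the Greenstone MetadataJSON format.

```python
import json


def find_missing_entries(table_rows, category_countries):
    """Return the table rows whose country is absent from the category page.

    table_rows: list of dicts, one per entry scraped from the contest table
                (field names here are assumptions, not Greenstone's).
    category_countries: iterable of country names found on the category page.
    """
    # Normalise names so that case and stray whitespace do not cause false misses.
    listed = {name.strip().lower() for name in category_countries}
    return [dict(row) for row in table_rows
            if row["country"].strip().lower() not in listed]


# Hypothetical placeholder data; real rows would come from page scraping.
table_rows = [
    {"country": "Country A", "year": 2000, "song": "Song A",
     "artist": "Artist A", "draw": 1, "place": 3, "points": 100},
    {"country": "Country B", "year": 2000, "song": "Song B",
     "artist": "Artist B", "draw": 2, "place": 7, "points": None},
]
category_countries = ["country a "]

missing = find_missing_entries(table_rows, category_countries)
print(json.dumps(missing, indent=2))
```

Matching on a normalised country name is a design assumption. Real pages may name the same country differently in the two places, in which case an explicit alias table would be needed.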