At no other time in human history has so much content been available so conveniently to so many of us. Information is truly at our fingertips. You like cats? Find images of kittens, videos of cats, scholarly articles about felines, and much more, online. To ‘go online’ is quickly becoming an old expression: we are now always online. What distinguishes our situation from the beginning of the Information Age is how we are connecting with one another. Finding groups of like-minded cat lovers on Facebook or Meetup.com is only a click away.
We are in the middle of a new shift: our data is being connected. Take, for example, I Know Where Your Cat Lives. The website is made up of cat photos, drawn from sites with public Application Programming Interfaces (APIs), that are positioned on a Google map using GPS data embedded in each image. Similarly, search engines tap into data openly published on the web to offer us an array of options to sort, organize and discover information. The search results are frequently mashed up with information from a variety of trusted sources. During a recent Google search for La Belle et la Bête (Figure 1), I was impressed by what the search engine displayed.
Figure 1: Google search results for “La Belle et la Bête”
On the right side of the screen were a synopsis, ratings, cast and crew information, and release dates. This information was derived from separate sources but presented as one record. The results in the left column included both the English version of the title (Beauty and the Beast) and the original French title.
Unsurprisingly, most audiovisual material remains difficult to access online in spite of the high user demand for this type of content (Simou et al., 2012). This is due not only to the backlog of items that require cataloging but also to the fact that much of the existing metadata is unavailable for online search engines to index. Ceton (2013) argues that audiovisual archives must change their business models to meet the changing expectations of their constituency. Daunting as this change may be, archives now have the opportunity to connect their holdings to the world. But how do archives expose their collections to new types of usage similar to those that search engines are beginning to roll out? In this and my next blog post, I explore three audiovisual projects that have brought their collections to users by way of the Semantic Web. My focus will be on describing the means by which metadata, or descriptive information about the source material, is structured and published to the Internet.
What is the Semantic Web?
The W3C defines the Semantic Web as a web of data that can be “shared and reused across application, enterprise, and community boundaries, to be processed automatically by tools as well as manually, including revealing possible new relationships among pieces of data” (“W3C Semantic Web FAQ,” 2009, sec. What is the Semantic Web). Based on the principles of Linked Open Data (see below), semantic markup makes these new uses possible. Producing and distributing this markup is more cost-effective than traditional methods like publishing physical indexes, and much cheaper to maintain (Hussain, 2013). Ultimately users may be connected to multiple, federated information sources from across the Semantic Web (Ibid.). However, this Web of Things requires a modern content architecture that is open, service oriented, and loosely coupled (Ibid.). Many organizations wishing to take full advantage of the Semantic Web will have to shift from quarantined information silos to an open data model.
Tim Berners-Lee (2006) published a set of best practices, known as the four principles, for publishing and connecting structured data on the web:
- Use URIs as names for things.
- Use HTTP URIs so that people can look up those names.
- When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
- Include links to other URIs[,] so that they can discover more things.
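The second and third principles can be illustrated with a minimal sketch of an HTTP lookup. It prepares, but does not send, a request that asks a server for an RDF description of a thing rather than an HTML page. The DBpedia URI and the choice of the Turtle media type are assumptions made for illustration, and the helper name build_lookup_request is hypothetical.

```python
from urllib.request import Request


def build_lookup_request(uri: str) -> Request:
    """Prepare (without sending) a GET request asking for RDF about `uri`."""
    # The Accept header asks the server for Turtle, one RDF serialization.
    # Whether a given server honours this is an assumption, not a guarantee.
    return Request(uri, headers={"Accept": "text/turtle"})


# Assumed example URI that names a person (principles 1 and 2).
request = build_lookup_request("http://dbpedia.org/resource/Alfred_Hitchcock")

print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Passing the prepared request to urllib.request.urlopen would perform the actual lookup. The links found in the returned description are what the fourth principle refers to.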
These guidelines establish typed relationships between data published on the web from a variety of sources. There are many sources of Linked Open Data. As we’ll see, they are being used as virtual authority lists. The growing list includes DBpedia, MusicBrainz, GeoNames, NYTimes, Freebase, and Eurostat.
Different groups of people associate different meanings with the same words. For example, in metaphysics ontology is defined as the study of existence, yet in information science it is the representation of knowledge concepts, commonly displayed in the form of a graph. How can we disambiguate terms? One way is to create domain-specific ontologies to represent these meanings. This structure can then be mapped to the Linked Data community by using the Simple Knowledge Organization System (SKOS) (Shiri, 2012).
The data must be described and structured in a consistent way before it can connect to the Web of Things. The most popular way is to use a serialized form of the Resource Description Framework (RDF). Data are described with HTTP Uniform Resource Identifiers (URIs) that are combined to form triples: statements composed of a subject, a predicate, and an object. For example:
Figure 2: Basic RDF graph showing that Alfred Hitchcock (subject) was the creator (predicate) of the film Psycho (object).
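The same statement can be written out as a small sketch that builds one triple and prints it in N-Triples, a line-based RDF syntax. The DBpedia URIs are assumed identifiers for the person and the film, and the predicate http://example.org/vocab/creatorOf is a hypothetical placeholder chosen to match the direction shown in Figure 2. With Dublin Core's creator property the film would be the subject and the person the object.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement whose three parts are all URIs."""
    subject: str
    predicate: str
    object: str

    def to_ntriples(self) -> str:
        # N-Triples wraps each URI in angle brackets and ends the line with " ."
        return f"<{self.subject}> <{self.predicate}> <{self.object}> ."


hitchcock_created_psycho = Triple(
    subject="http://dbpedia.org/resource/Alfred_Hitchcock",   # assumed URI
    predicate="http://example.org/vocab/creatorOf",           # hypothetical predicate
    object="http://dbpedia.org/resource/Psycho_(1960_film)",  # assumed URI
)

print(hitchcock_created_psycho.to_ntriples())
```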
Elements from popular metadata schemas such as Dublin Core can be combined to form RDF expressions and then encoded in any of the three syntaxes currently defined by the framework. Alternatively, archives may want to map their pre-existing schema to Schema.org, a common set of metadata fields for structured data markup on web pages. These include groups of fields to describe creative works such as movies, television series, and music recordings. While not intended to be the only element set available, Schema.org aims to create one vocabulary that is understood by all the popular search engines. Adopting this standard improves search results, underpins certain functionality (e.g., faceted browsing), and disambiguates entities from one another. For example, Schema.org allows users to differentiate the video of a movie from the book about a movie. The standard recently released a set of TV and radio markup, which offers specific tags for series, season and episode (Raimond, 2013). Consequently, users will be able to search online for “When is the next Orange Is the New Black episode on?”
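As a rough sketch of how typed markup separates a film from a book with the same title, the snippet below builds two Schema.org descriptions and prints one as JSON-LD, one of several syntaxes that can carry Schema.org terms. Movie, Book, Person, name, director and author are Schema.org terms. The helper schema_org_entity is a hypothetical name, and the choice of Psycho as the example is an assumption carried over from Figure 2.

```python
import json


def schema_org_entity(schema_type: str, name: str, **properties) -> dict:
    """Return a dictionary that can be serialized as a JSON-LD description."""
    entity = {"@context": "https://schema.org", "@type": schema_type, "name": name}
    entity.update(properties)
    return entity


# Same title, different types: the @type value is what tells them apart.
film = schema_org_entity(
    "Movie", "Psycho",
    director={"@type": "Person", "name": "Alfred Hitchcock"},
)
novel = schema_org_entity(
    "Book", "Psycho",
    author={"@type": "Person", "name": "Robert Bloch"},
)

print(json.dumps(film, indent=2))
print(json.dumps(novel, indent=2))
```

On a web page, output like this would typically sit inside a script element of type application/ld+json.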
In spite of the above-mentioned benefits, there exists one main hurdle for Schema.org: many webmasters are waiting for its wider adoption while some search engines stand by for webmasters to implement the standard. While conceding the hazards of this “chicken and egg” scenario, the standard’s organizers have shown that their product is popular, having been adopted by over 5 million websites (Barker & Campbell, 2014). Another aspect of Schema.org that would interest the audiovisual archive community is the transparency that is baked into it. Being trustworthy is a crucial aspect of any archive’s mission. If the data contained on a web page is marked up using Schema.org’s element set, the visible data may be compared to the markup. In the past, many dishonest companies have crammed misleading terms into <meta> elements inside the head of HTML pages to increase their online profile. By attributing metadata to the individual parts of the HTML body, humans and machines now have a way to expose dishonest markup.
There are many ways to make use of the Semantic Web. Broadcast your metadata or use named entities to describe people, places, or things. Alternatively, programmers are using Linked Open Data to help us research and find information. How do you use Linked Open Data?
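To make the point about transparency concrete, here is a minimal sketch that reads the itemprop attributes used by the microdata syntax and collects the visible text each one wraps. The sample page is invented for illustration and the class name ItempropCollector is hypothetical. The parser is deliberately simplified: it assumes flat markup and does not handle nested items or values carried in attributes.

```python
from html.parser import HTMLParser

# Invented sample page: the marked-up values are the text a visitor sees.
SAMPLE_PAGE = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Psycho</h1>
  <p>Directed by <span itemprop="director">Alfred Hitchcock</span></p>
</div>
"""


class ItempropCollector(HTMLParser):
    """Collect the visible text wrapped by each element carrying itemprop."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._open = []  # (tag, property name) pairs for elements still open

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("itemprop")
        if name:
            self._open.append((tag, name))
            self.properties.setdefault(name, "")

    def handle_data(self, data):
        # Text counts toward every property element that is currently open.
        for _, name in self._open:
            self.properties[name] += data

    def handle_endtag(self, tag):
        # Simplification: close the innermost open property if the tag matches.
        if self._open and self._open[-1][0] == tag:
            self._open.pop()


collector = ItempropCollector()
collector.feed(SAMPLE_PAGE)
print(collector.properties)
```

Because each value is taken from the body text itself, a person or a program can set the machine-readable properties side by side with what the page displays.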
In my next post I will explore how three audiovisual archives have embraced Linked Data and published their collections online.
Barker, P., & Campbell, L. M. (2014, June 5). What is Schema.org? Retrieved from http://publications.cetis.ac.uk/2014/960
Berners-Lee, T. (2006, July 27). Linked Data – Design Issues. Retrieved August 6, 2014, from http://www.w3.org/DesignIssues/LinkedData.html
Ceton, N. (2013). The Networked Society. In Metadata as the Cornerstone of Digital Archiving (pp. 10–19). Beeld En Geluid at Mullervisual Communication.
Raimond, Y. (2013, December 3). schema blog: Schema.org for TV and Radio markup. Retrieved from http://blog.schema.org/2013/12/schemaorg-for-tv-and-radio-markup.html
Shiri, A. (2012). Powering search: the role of thesauri in new information environments. Medford, New Jersey: Published on behalf of the American Society for Information Science and Technology by Information Today, Inc.
Simou, N., Evain, J. P., Tzouvaras, V., Rendina, M., Drosopoulos, N., & Oomen, J. (2012). Linking Europe’s Television Heritage. In Linked European Television Heritage. San Diego, CA, USA. Retrieved from http://www.museumsandtheweb.com/mw2012/papers/linking_europe_s_television_heritage.html
W3C Semantic Web FAQ. (2009). Retrieved August 6, 2014, from http://www.w3.org/RDF/FAQ