Chapter 1: Introduction and Application Scenarios
Introduction
The last decade has seen a growing interest in the Semantic Web, which extends the web of documents to a web of data. This technology applies web-based standards for encoding datasets and linking them to other published datasets, so that applications can exploit data from many different sources. It also provides standards for encoding general knowledge in ontologies, so allowing enhancements based on automatic reasoning (improved querying, for example).
This chapter introduces Linked Data and related semantic technologies, and shows how they can be deployed in web applications. As an example, we target the development of a music portal (based on the MusicBrainz dataset) which facilitates access to a wide range of information and multimedia resources relating to music.
Movie 1: Developing a music portal. Dr Barry Norton introduces the target application for this chapter, a music portal based on Linked Data.
Learning outcomes
- the enabling technologies for Linked Data, in particular the RDF data model
- the current status of Linked Data technology
- how Linked Data can be applied in a particular scenario related to a music portal
Part I: Semantic Technologies and Linked Data Foundations
We will describe a set of technologies that allows datasets to be published over the web, and queried effectively by applications. Compared with search engines such as Google and Yahoo, which are based on text-string matching, these technologies are "semantic". This means that information is represented not in a natural language like English or Spanish, but in a graph-based data model that facilitates extension, integration, inference and uniform querying. As a realistic application of semantic technologies, we consider the provision of a portal through which users can retrieve resources and information in the world of music. Consider for example the following tasks:
- Retrieve a performance of the Beethoven violin concerto by a Chinese orchestra
- Retrieve a photograph of the conductor of this performance
- List male British rock musicians married to Scandinavians
Attempts to answer such queries through text-based search are unreliable: we might equally retrieve a performance in which the soloist was Chinese, or a rock musician who plays Scandinavian music. Using semantic technologies, resources such as the audio file of the performance, or the photograph of the conductor, can be annotated using the Resource Description Framework (RDF). In this framework, formal names can be assigned to what are called resources, which would include Beethoven, his violin concerto, the orchestra, and the conductor. Names can also be assigned to types (or classes) of resource (composers, concertos, etc.), and to relationships (or properties) that link resources (e.g., the "composed-by" relationship between composition and composer). By reasoning over facts encoded in this way, a query system can confirm that a performance was given by the Beijing Symphony Orchestra, that this orchestra is based in Beijing, that Beijing is located in China, and so forth -- thus combining geographical and musical knowledge in order to retrieve an answer.
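The chain of reasoning just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the facts are hand-written, and the resource and property names (such as "ex:performedBy") are hypothetical stand-ins rather than terms from a real vocabulary.

```python
# Facts as (subject, predicate, object) triples; all "ex:" names are hypothetical.
FACTS = {
    ("ex:performance1", "ex:performanceOf", "ex:BeethovenViolinConcerto"),
    ("ex:performance1", "ex:performedBy", "ex:BeijingSymphonyOrchestra"),
    ("ex:BeijingSymphonyOrchestra", "ex:basedIn", "ex:Beijing"),
    ("ex:Beijing", "ex:locatedIn", "ex:China"),
    ("ex:performance2", "ex:performanceOf", "ex:BeethovenViolinConcerto"),
    ("ex:performance2", "ex:performedBy", "ex:BerlinPhilharmonic"),
    ("ex:BerlinPhilharmonic", "ex:basedIn", "ex:Berlin"),
    ("ex:Berlin", "ex:locatedIn", "ex:Germany"),
}


def objects(subject, predicate):
    """Return every o such that (subject, predicate, o) is a known fact."""
    return {o for (s, p, o) in FACTS if s == subject and p == predicate}


def performances_by_orchestra_from(work, country):
    """Find performances of a work given by an orchestra based in the country."""
    results = []
    for (s, p, o) in FACTS:
        if p == "ex:performanceOf" and o == work:
            for orchestra in objects(s, "ex:performedBy"):
                for city in objects(orchestra, "ex:basedIn"):
                    # Geographical knowledge: the city must lie in the country.
                    if country in objects(city, "ex:locatedIn"):
                        results.append(s)
    return sorted(results)


print(performances_by_orchestra_from("ex:BeethovenViolinConcerto", "ex:China"))
```

The point of the sketch is that the answer depends on combining musical facts (who performed what) with geographical facts (where a city is located), which a text-string match cannot do reliably.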
In designing these semantic technologies, a key design decision was to leave open the naming of resources and properties, provided that names conform to the format for web resource names -- that is, provided they are Uniform Resource Identifiers or URIs.
http://rdf.freebase.com/ns/en.ludwig_van_beethoven
http://dbpedia.org/resource/Ludwig_van_Beethoven
http://musicbrainz.org/artist/1f9df192-a621-4f54-8850-2c5373b7eac9#_
http://data.nytimes.com/N30866506154608358173
All four of the above could be names for Beethoven, illustrating that the URI need not be human-readable (e.g., it might be an arbitrary string of letters and numbers), although identifiers should be resolvable to RDF representations that include human-readable labels, as explained below. If data from different sources are to be combined, it is therefore important to establish links, for instance through statements indicating that the above four URIs are synonymous. These statements, which can also be expressed in RDF, provide a means by which data published by many people or organisations can be combined into linked data.
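As a rough illustration of what such linking statements achieve, the following Python sketch groups URIs into sets of synonyms, given pairs that have been declared to name the same thing. The particular links listed are assumptions made up for the example; they are not taken from the published datasets.

```python
BEETHOVEN_URIS = [
    "http://rdf.freebase.com/ns/en.ludwig_van_beethoven",
    "http://dbpedia.org/resource/Ludwig_van_Beethoven",
    "http://musicbrainz.org/artist/1f9df192-a621-4f54-8850-2c5373b7eac9#_",
    "http://data.nytimes.com/N30866506154608358173",
]

# Hypothetical "same as" links, invented for this sketch.
LINKS = [
    (BEETHOVEN_URIS[1], BEETHOVEN_URIS[0]),
    (BEETHOVEN_URIS[1], BEETHOVEN_URIS[2]),
    (BEETHOVEN_URIS[3], BEETHOVEN_URIS[0]),
]


def synonym_groups(links):
    """Group URIs into sets of synonyms, given pairs declared to be the same."""
    groups = []
    for a, b in links:
        merged = {a, b}
        remaining = []
        for group in groups:
            if group & merged:
                merged |= group          # the new link joins this group
            else:
                remaining.append(group)  # untouched by the new link
        groups = remaining + [merged]
    return groups


groups = synonym_groups(LINKS)
print(len(groups), "group containing", len(groups[0]), "URIs")
```

Three pairwise links are enough here to connect all four names, because being "the same as" carries over along a chain of links.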
In the following chapters, we will show through practical examples how to describe resources in RDF, how to convert data from other formats to RDF, how to publish RDF data, and how to link published RDF to other datasets. We will also consider how to utilize existing linked data in applications for querying, analysis, mining, and visualisation. All these topics will be illustrated by the case scenario of a music portal.
For examples of existing music portals, you can look at the BBC music reviews site, and the etree linked music site. These applications make use of a music ontology and a large dataset of musical information called MusicBrainz, which we will also exploit in our training material.
http://www.bbc.co.uk/music/reviews
http://etree.linkedmusic.org/
http://musicbrainz.org/
Background technologies
Linked data results from a confluence of earlier ideas and technologies, including hypertext, databases, ontologies, markup languages, the Internet, and the World Wide Web. In this section we provide background information on these technologies.
Movie 2: Towards a web of data. Dr Barry Norton describes some technologies that underpin the current interest in Linked Data.
Internet
The Internet is an extension of the technology of computer networks. The earliest computers operated independently. In the 1960s and 1970s, it became common for computers in an organisation (e.g., university, government, company) to be linked together in a network. At the same time, there were early experiments in linking whole networks together, including the ARPANET in the United States. In the early 1980s, the Internet Protocol Suite (TCP/IP) for the ARPANET was standardised, to provide the basis for a network of networks that could embrace the whole world. The Internet spread mostly to Europe and Australia during the 1980s, and to the rest of the world during the 1990s.
The technology supporting the Internet includes the IP (Internet Protocol) system for addressing computers, so that messages can be routed from one computer to another. Each computer on the Internet is assigned an IP address, which can be written as four integers in the range 0--255 separated by dots, e.g. 185.56.200.4. (To be precise, this convention holds for version 4 of the IP, but not for the more recent version 6.) The structure of messages is governed by application protocols that vary according to the service required (e.g., email, telephony, file transfer, hypertext). Examples of such protocols are FTP (File Transfer Protocol), NNTP (Network News Transfer Protocol, used for USENET news), and HTTP (HyperText Transfer Protocol).
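The dotted notation can be explored with Python's standard "ipaddress" module, as in the minimal sketch below. The helper name "describe" is our own; the version 6 address is borrowed from the URI examples later in this chapter.

```python
import ipaddress


def describe(text):
    """Return (IP version, list of byte values) for a textual IP address."""
    address = ipaddress.ip_address(text)  # raises ValueError if malformed
    return address.version, list(address.packed)


print(describe("185.56.200.4"))    # a version 4 address: four byte values
print(describe("2001:db8::7")[0])  # a version 6 address uses another notation

try:
    describe("185.56.200.256")     # 256 is outside the range 0-255
except ValueError as error:
    print("rejected:", error)
```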
Hypertext
The concept of hypertext is normally dated from Vannevar Bush's 1945 article "As we may think" [1], which proposed an organisation of external records (books, papers, photographs) corresponding to the association of ideas in human memory. By the 1960s, with more advanced computer technology, this concept was implemented by pioneers such as Douglas Engelbart and Ted Nelson in programs that allowed texts (or other media) to be viewed with some spans marked as hyperlinks, through which the reader could jump to another document.
World Wide Web
Informally people often use the terms "Internet" and "World Wide Web" (WWW) interchangeably, but this is inaccurate: the WWW is in fact just one of many services delivered over the Internet. The distinctive feature of the WWW is that it is a hypertext application, which exploits the Internet to allow cross-linking of documents all over the world.
The formal proposal for the WWW, and prototype software, were produced in 1990 by Tim Berners-Lee [2], and elaborated over the next few years. The basic idea is that a client application called a web browser obtains access to a document stored on another computer by sending a message, over the Internet, to a web server application, which sends back the source code for the document. Documents (or web pages) are written in the Hypertext Markup Language (HTML), which allows some spans to be marked as hyperlinks to a document at a specified location in the web, named using a Uniform Resource Locator (URL). When the user clicks on a hyperlink, the browser finds the IP address associated with the URL, and sends a message to this IP address requesting the HTML file at the given location in the server's file system; on receipt, this file is displayed in the browser.
Figure 1: Development of the WWW
Source: Radar Networks & Nova Spivack, 2007. http://www.radarnetworks.com
Citation: Nova Spivack's illustration of the evolution of the WWW.
License: CC (Some Rights Reserved)
Web 1.0 (static)
In 1993 came a turning point for the WWW with the introduction of the Mosaic web browser, which could display graphics as well as text. From that date, usage of the web grew rapidly, although most users operated only as consumers of content, not producers. During this early phase of web development, sometimes called Web 1.0, web pages were mostly static documents read from a server and displayed on a client, with no options for users to contribute content, or for content to be tailored to a user's specific demands.
Web 2.0 (dynamic)
Around 2000 a second phase of web development began with the increasing use of technologies allowing the user of a browser to interact with web pages and shape their content. There are basically two ways in which this can be done, known as client-side scripting, and server-side scripting.
Client-side scripting is achieved through program code incorporated into the HTML source, typically written in JavaScript. This code can be run on the user's computer, without any need for further messages to be sent to the server: hence "client-side".
Server-side scripting is achieved through messages to the server which invoke applications capable of creating the HTML source dynamically: the document eventually displayed to the user is therefore tailored in response to a specific request rather than retrieved from a previously stored file.
Social web
These Web 2.0 technologies have made possible a wide range of social web sites now familiar to everyone, including chat rooms, blogs, wikis, product reviews, e-markets, and crowdsourcing. Previously a consumer of content provided by others, the web user has now become a prosumer, capable of adding information to a web page, and in this way communicating not only with the server, but through the server with other clients as well.
Web 3.0 (semantic)
During the 1990s, Berners-Lee and collaborators developed proposals for a further stage of web development known as the Semantic Web. This far-reaching concept, first publicised in a 2001 article in the Scientific American [3], is partly implemented in the current stage of web development sometimes called Web 3.0. At present we cannot see clearly what lies beyond Web 3.0, but in Figure 1 we allow for future stages in Semantic Web development by including a loosely defined further stage "Web 4.0".
In their 2001 article, Berners-Lee and co-authors pointed out that existing web content was usable by people but not by computer applications. There were many computer applications available for tasks like planning, or scheduling, or analysis, but they worked only on data files in some standard logical format, not on information presented in natural language text. A person could plan an itinerary by looking at web pages giving flight schedules, hotel locations, and so forth, but it was not possible then, and is still not possible now, for programs to extract such information reliably from text-based web pages. The initial aim of the Semantic Web is to provide standards through which people can publish documents that consist of data, or perhaps a mixture of data and text, so allowing programs to combine data from many datasets, just as a person can combine information from many text documents in order to solve a problem or perform a task.
Figure 2: From documents to data
Source: Own source.
Ontologies
Datasets usually encode facts about individual objects and events, such as the following two facts about the Beatles (shown here in English rather than a database format):
The Beatles are a music group
The Beatles are a group
There is something odd about this pair of facts: having said that the Beatles are a music group, why must we add the more generic fact that they are a group? Must we list these two facts for all music groups -- not to mention all groups of acrobats or actors etc.? Must we also add all other consequences of being a music group, such as performing music and playing musical instruments?
Ontologies allow more efficient storage and use of data by encoding generic facts about classes (or types of object), such as the following:
Every music group is a group
Every theatre group is a group
It is now sufficient to state that the Beatles (and the Rolling Stones, etc.) are music groups, and the more general fact that they are groups can be derived through inference. Ontologies thus enhance the value of data by allowing a computer application to infer, automatically, many essential facts that may be obvious to a person but not to a program.
To allow automatic inference, ontologies may be encoded in some version of mathematical logic. There are many formal logics, which vary in expressivity (the meanings that can be expressed) and tractability (the speed with which inferences can be drawn). To be useful in practical applications it is necessary to trade expressivity for tractability, and description logic, which is implemented in the Web Ontology Language OWL, does precisely this. However, despite these restrictions on expressivity, OWL cannot yet be used efficiently for inference over very large datasets, as required by Linked Data applications. For this reason, most reasoning for Linked Data relies on the far simpler logical resources of RDF-Schema, with OWL used sparingly if at all.
Background standards
The technologies described in the previous section are implemented through a number of standard protocols and languages, whose acronyms are probably familiar: HTTP, URI, XML, RDF, RDFS, OWL and SPARQL. You can look up details of these standards as needed, but as background it is useful to know a little about each one, and in particular what they are for. The later standards in this list build on the earlier ones, so they are often described as a stack of languages, as shown in Figure 3.
HTTP
From using the World Wide Web, most people are familiar with the HTTP prefix in front of web addresses such as http://musicbrainz.org/. The meaning of this acronym is HyperText Transfer Protocol, and it refers to a set of conventions governing communication between a client and a server. More precisely, these conventions define the structure of request messages from client to server, and response messages from server to client. Message structure varies from one protocol to another: thus a different protocol such as FTP (File Transfer Protocol) will define a different message structure. A request message in HTTP consists essentially of a method to be applied to a resource. The fundamental method is GET, which requests the server to send back a representation of the resource, typically an HTML file that can be displayed in a browser pane. However, there are several other methods including DELETE, which deletes the resource, and POST, which submits data to be processed with respect to the resource. The resource, specified through a relative document ID (often a filename/path on the server), may be a document, or picture, or an executable that will generate data for the response.
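To give a feel for the structure of a request message, the sketch below assembles the text of a minimal HTTP/1.1 GET request in Python. It only builds the message and does not send it; the function name "build_request" is our own, and the path is taken from the MusicBrainz FAQ address used as an example elsewhere in this chapter.

```python
def build_request(method, path, host):
    """Assemble the text of a minimal HTTP/1.1 request message."""
    lines = [
        f"{method} {path} HTTP/1.1",  # request line: method, resource, version
        f"Host: {host}",              # header naming the server being addressed
        "Connection: close",          # ask the server to close after replying
        "",                           # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines)


request = build_request("GET", "/doc/Frequently_Asked_Questions", "musicbrainz.org")
print(request)
```

The first line carries the method and the relative identifier of the resource, exactly the two ingredients described above; the remaining lines are headers giving further details of the request.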
URI
A Uniform Resource Identifier (URI) is defined in the standard [4] as "a compact sequence of characters that identifies an abstract or physical resource". The word "compact" here means that the string must contain no space characters (or other white-space padding). "Abstract or physical" means that the URI may refer to an abstract resource such as the concepts "Beethoven" and "symphony", as well as to a document or other file that can be retrieved from the WWW.
A URI that is linked to a retrievable resource is known also as a Uniform Resource Locator, or URL. For instance, the following URI for the MusicBrainz FAQ page is a URL:
http://musicbrainz.org/doc/Frequently_Asked_Questions
The definition of a correctly formed URI is quite complicated, with constituents that vary according to the scheme (the initial constituent before the colon), which specifies the relevant internet protocol, such as HTTP. For an HTTP URI, the other constituents most relevant for our purposes are the authority, and the path, which occur in that order. The authority specifies the server where the resource (if it really exists) is located. Finally, the path locates the resource precisely within the server's directory structure.
Thus for the URL given above, "http" is the scheme, "musicbrainz.org" is the authority, and "/doc/Frequently_Asked_Questions" is the path; the other characters such as the colon are punctuation separating these constituents.
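The same decomposition can be reproduced with Python's standard library, as in this small sketch. The function "urlsplit" belongs to the "urllib.parse" module; note that it calls the authority "netloc".

```python
from urllib.parse import urlsplit

parts = urlsplit("http://musicbrainz.org/doc/Frequently_Asked_Questions")

print(parts.scheme)  # the scheme, before the colon
print(parts.netloc)  # the authority (the server name)
print(parts.path)    # the path within the server
```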
Note that the constituents following the scheme will be different for different schemes: thus the "tel" scheme, for example, is followed simply by a telephone number. Here are some examples indicating this variety:
ldap://[2001:db8::7]/c=GB?objectClass?one
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
http://dbpedia.org/resource/Karlsruhe
Since URIs are typically long, and hence difficult to read and write, it is convenient to make use of abbreviated forms known as "compact URIs" or "CURIEs". A compact URI consists simply of a namespace and a local name, separated by a colon. Typically, the namespace includes the scheme, the authority, and perhaps the early part of the path; the local name contains the remainder of the URI, chosen so as to convey intuitively what the URI means, while observing some syntactic restrictions (e.g., there should be no further use of the characters "/" and "#"). Thus in the example just given, one could introduce a namespace "dbp" for http://dbpedia.org/resource/, so reducing the URI to "dbp:Karlsruhe", where the local name preserves the substring that is significant to human readers. We will use this convenient method of abbreviation often in the rest of this book.
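The abbreviation mechanism amounts to simple string manipulation, as the following Python sketch suggests. The helper names "expand" and "compact" and the prefix table are our own illustration, not part of any standard library.

```python
# Table of namespace abbreviations (only the one used in the text).
PREFIXES = {"dbp": "http://dbpedia.org/resource/"}


def expand(curie, prefixes=PREFIXES):
    """Turn an abbreviated name such as 'dbp:Karlsruhe' into a full URI."""
    prefix, local_name = curie.split(":", 1)
    return prefixes[prefix] + local_name


def compact(uri, prefixes=PREFIXES):
    """Abbreviate a URI if a known namespace matches; otherwise return it as is."""
    for prefix, namespace in prefixes.items():
        local_name = uri[len(namespace):]
        # The local name should contain no further "/" or "#" characters.
        if uri.startswith(namespace) and "/" not in local_name and "#" not in local_name:
            return f"{prefix}:{local_name}"
    return uri


print(expand("dbp:Karlsruhe"))
print(compact("http://dbpedia.org/resource/Karlsruhe"))
```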
XML
Extensible Markup Language (XML) is a refinement of Standard Generalised Markup Language (SGML), which was introduced in the 1980s as a meta-language suitable for defining particular mark-up languages -- for instance, languages for adding formatting information to documents. The basic concept, now well known from widespread use of HTML, is that labelled tags are placed around spans of text, thus indicating perhaps that the span should be formatted in italics:
<i>text in italics</i>
The italic tag "i" is part of HTML, not SGML, but the convention of placing tags within angle brackets, and distinguishing the closing tag by a forward slash character, comes from SGML, as does the syntax for adding attributes to the opening tag, as in this example yielding blue text:
<font color="blue">blue text</font>
SGML is versatile because it can be used simply for encoding data, as well as for adding structure to text.
In the mid-1990s, the newly formed World Wide Web Consortium (abbreviated W3C) set up a working group to simplify and rework SGML to meet the requirements of the WWW. The result was the first XML specification, which became a W3C recommendation in 1998, and has become the standard convention for data exchange over the web. The essential advance on SGML is that XML is simpler and stricter: to give just one example, it is permissible in SGML (but not in XML) to omit closing tags, as in the common practice of inserting <p> without a closing </p> when writing HTML.
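The strictness of XML is easy to observe with a standard XML parser. The sketch below uses Python's "xml.etree.ElementTree" module to test whether a text is well-formed; the function name "is_well_formed" and the two sample documents are our own.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text can be parsed as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<doc><p>one</p><p>two</p></doc>"))  # every tag is closed
print(is_well_formed("<doc><p>one<p>two</doc>"))          # closing </p> tags omitted
```

The second document follows the relaxed habit common in hand-written HTML, and an XML parser rejects it.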
RDF
Figure 3: Stack of Semantic Web Languages
Source: http://w3.org/DesignIssues/diagrams/sweb-stack/2006a.png
Citation: Semantic Web Language stack - architectural layers.
License: Copyright (c) 2006 World Wide Web Consortium, (Massachusetts Institute of Technology, European Research Consortium for Informatics and Mathematics, Keio University). All rights reserved. http://www.w3.org/Consortium/Legal/2002/copyright-documents-20021231
The Resource Description Framework (RDF) was introduced originally as a data model for metadata, which are attributes of a document, or image, or program, etc. such as its author, date, location, and coding standards. First published as a W3C recommendation in 1999 [5], the framework has since been updated, and generalised in its purpose to cover not only metadata (strictly interpreted) but knowledge of all kinds.
The basic idea of RDF is a very simple one: namely, that statements are represented as triples of the form subject--predicate--object, each triple expressing a relation (represented by the predicate resource) between the subject and object resources. Formally, the subject is expressed by a URI or a blank node, the predicate by a URI, and the object by a URI, a blank node, or a literal such as a number or string.
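These constraints on the three positions of a triple can be written down directly. The following Python sketch is an illustration only: the class names are our own and are not taken from any RDF library.

```python
from collections import namedtuple

# Three kinds of term, and the triple itself (names chosen for this sketch).
URI = namedtuple("URI", "value")
BlankNode = namedtuple("BlankNode", "label")
Literal = namedtuple("Literal", "value")
Triple = namedtuple("Triple", "subject predicate object")


def is_valid(triple):
    """Check that each position of the triple holds a permitted kind of term."""
    return (
        isinstance(triple.subject, (URI, BlankNode))       # no literal subjects
        and isinstance(triple.predicate, URI)              # predicates are URIs
        # In the RDF data model the object may also be a blank node.
        and isinstance(triple.object, (URI, BlankNode, Literal))
    )


beatles = URI("http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d#_")
label = URI("http://www.w3.org/2000/01/rdf-schema#label")

good = Triple(beatles, label, Literal("The Beatles"))
bad = Triple(Literal("The Beatles"), label, beatles)  # a literal cannot be a subject

print(is_valid(good), is_valid(bad))
```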
The original W3C recommendation for exposing RDF data was that it should be encoded in XML syntax, sometimes called RDF/XML. It is for this reason that the semantic web "stack" of languages has RDF implemented on top of XML. However, notations have also been proposed which are easier for people to read and write, such as Turtle, in which statements are formed simply by listing the elements of the triple on a line, in the order subject-predicate-object, followed by a full stop, with URIs possibly shortened through the use of namespace abbreviations defined by "prefix" and "base" statements, as in the following example:
@base <http://musicbrainz.org/>.
@prefix mo:<http://purl.org/ontology/mo/>.
<artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d#_> a mo:MusicGroup.
Here the subject is abbreviated using the "base" statement, and the object is abbreviated using the "prefix" statement. The very simple predicate "a" relies on a further Turtle shorthand for very commonly used predicates, and refers to the "type" relation between a resource and its class. This can be seen from the following equivalent Turtle statement, in which all URIs are shown in their cumbersome unabbreviated form. Note that this statement should occupy a single line, although it is shown here with wrapping so that it fits on the page. The format in which every URI in a Turtle statement is fully expanded is also known as N-Triples.
<http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d#_> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/mo/MusicGroup>.
Where multiple statements apply to the same subject, they can be abbreviated by placing a semi-colon after the first object, and then giving further predicate-object pairs separated by semi-colons, with a full stop after the final pair. For statements having the same subject and predicate, objects can be listed in a similar way separated by commas. These conventions are illustrated by the following statements:
@base <http://musicbrainz.org/>.
@prefix mo:<http://purl.org/ontology/mo/>.
@prefix rdfs:<http://www.w3.org/2000/01/rdf-schema#>.
@prefix owl:<http://www.w3.org/2002/07/owl#>.
@prefix dbpedia:<http://dbpedia.org/resource/>.
@prefix bbc:<http://www.bbc.co.uk/music/artists/>.
<artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d#_> rdfs:label "The Beatles";
    owl:sameAs dbpedia:The_Beatles,
        <http://www.bbc.co.uk/music/artists/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d#artist>.
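The effect of the semi-colon and comma abbreviations can be made explicit by expanding them back into separate triples. The Python sketch below does this for the statements about the Beatles; the function "expand_statements" and the nested-list representation of the input are assumptions made for the illustration, not a Turtle parser.

```python
def expand_statements(subject, predicate_objects):
    """Expand one subject with grouped predicates and objects into full triples."""
    return [
        (subject, predicate, obj)
        for predicate, objs in predicate_objects  # groups separated by ";"
        for obj in objs                           # objects separated by ","
    ]


triples = expand_statements(
    "<artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d#_>",
    [
        ("rdfs:label", ['"The Beatles"']),
        ("owl:sameAs", [
            "dbpedia:The_Beatles",
            "<http://www.bbc.co.uk/music/artists/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d#artist>",
        ]),
    ],
)

for triple in triples:
    print(" ".join(triple) + " .")
```

One abbreviated statement thus stands for three triples: one giving the label, and two giving synonyms.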
RDFS
RDF Schema (RDFS) is an extension of RDF which allows resources to be classified explicitly as classes or properties; it also supports some further statements that depend on this classification, such as class-subclass or property-subproperty relationships, and domain and range of a property. Some important resources in RDFS are as follows (for brevity we use the "rdfs" prefix defined above):
- rdfs:Class: A resource representing the class of all classes.
- rdfs:subClassOf: Used as a predicate to mean that the subject is a subclass of the object.
- rdfs:subPropertyOf: Used as a predicate to mean that the subject is a sub-property of the object.
- rdfs:domain: Used as a predicate when the subject is a property and the object is the class that is domain of this property.
- rdfs:range: Used as a predicate when the subject is a property and the object is the class that is range of this property.
The following statements in Turtle serve to illustrate these RDFS resources. Note that they use abbreviated URIs: the prefixes "mo" and "rdfs" are defined above, while "rdf" and "foaf" stand for http://www.w3.org/1999/02/22-rdf-syntax-ns# and http://xmlns.com/foaf/0.1/ respectively.
mo:member rdf:type rdf:Property.
mo:member rdfs:domain mo:MusicGroup.
mo:member rdfs:range foaf:Agent.
mo:MusicGroup rdfs:subClassOf foaf:Group.
In these statements, the resource "mo:member" denotes the property that relates a music group to each of its members -- for instance, the Beatles to John, Paul, George and Ringo, as in the following triple:
dbpedia:The_Beatles mo:member dbpedia:Ringo_Starr.
The second and third statements above give the domain and range of the property "mo:member". Intuitively, their meaning is that if "mo:member" is employed as predicate in a triple, its subject will belong to the class "mo:MusicGroup", and its object to the class "foaf:Agent". The fourth statement means that any resource belonging to the class "mo:MusicGroup" will also belong to the (more general) class "foaf:Group".
An important gain in adding such statements is that they allow new facts to be inferred from existing ones. Consider for instance how they may be combined with the statement (just given) that Ringo is a member of the Beatles. Using the domain and range statements for the property "mo:member", it follows directly that the Beatles are a music group, and that Ringo is an agent; using the subClassOf statement, it follows further that the Beatles are a group. Encoded in Turtle, these inferred facts are as follows:
dbpedia:The_Beatles rdf:type mo:MusicGroup.
dbpedia:Ringo_Starr rdf:type foaf:Agent.
dbpedia:The_Beatles rdf:type foaf:Group.
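The inference just described can be imitated by a few rules applied repeatedly until nothing new is produced. The Python sketch below covers only the domain, range and subClassOf rules, with triples represented as tuples of abbreviated names; it is a simplified illustration, not a complete RDFS reasoner.

```python
RDF_TYPE = "rdf:type"

STATEMENTS = {
    ("mo:member", "rdfs:domain", "mo:MusicGroup"),
    ("mo:member", "rdfs:range", "foaf:Agent"),
    ("mo:MusicGroup", "rdfs:subClassOf", "foaf:Group"),
    ("dbpedia:The_Beatles", "mo:member", "dbpedia:Ringo_Starr"),
}


def infer(triples):
    """Return the triples newly derivable by the three rules sketched here."""
    known = set(triples)
    while True:
        new = set()
        for (s, p, o) in known:
            for (s2, p2, o2) in known:
                if p2 == "rdfs:domain" and s2 == p:
                    new.add((s, RDF_TYPE, o2))  # subject belongs to the domain
                elif p2 == "rdfs:range" and s2 == p:
                    new.add((o, RDF_TYPE, o2))  # object belongs to the range
                elif p2 == "rdfs:subClassOf" and p == RDF_TYPE and s2 == o:
                    new.add((s, RDF_TYPE, o2))  # membership passes to superclass
        if new <= known:                        # nothing new: stop
            return known - set(triples)
        known |= new


for triple in sorted(infer(STATEMENTS)):
    print(" ".join(triple) + ".")
```

Two passes are needed: the first derives that the Beatles are a music group, and only then can the subclass rule derive that they are a group.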
RDFS also contains some predicates for linking a resource to information useful in presentation and navigation, but not for inference. These include the following:
- rdfs:comment: Associates a resource with a human-readable description of it.
- rdfs:label: Associates a resource with a human-readable label for it.
- rdfs:seeAlso: Associates a resource with another resource that might provide additional information about it.
- rdfs:isDefinedBy: A sub-property of "rdfs:seeAlso", indicating a resource that contains a definition of the subject resource.
OWL
The Web Ontology Language (OWL) extends RDFS to provide an implementation of a description logic, capable of expressing more complex general statements about individuals, classes and properties.
OWL was developed in the early 2000s and became a W3C standard (along with RDFS) in 2004. The acronym OWL was preferred to the more logical WOL because it is easier to pronounce, provides a handy logo, and is suggestive of wisdom. Of course the name also reminds us of the character in "Winnie the Pooh" who misspells his name "Wol".
The reason for choosing description logic, rather than a more expressive kind of mathematical logic, has already been mentioned: the aim was to achieve fast scalable reasoning services, and hence to use a logic for which efficient reasoning algorithms were already available. In fact description logics are more a family of languages than a single language. They can be thought of as a palette of operators for constructing classes, properties and statements, from which the user can make different selections, so obtaining fragments with different profiles of expressivity and tractability.
Figure 4: Fragments of OWL 2
Source: http://techwiki.openstructs.org/index.php/File:OWL1vOWL2.png
Citation: OWL 2 Fragments
License: CC Attribution 3.0
The OWL standard is under constant development, and the current version OWL 2.0 provides for the fragments shown in Figure 4; their meanings are as follows:
- OWL 2 Full: Used informally to refer to RDF graphs considered as OWL 2 ontologies and interpreted using the RDF-Based Semantics.
- OWL 2 DL: Used informally to refer to OWL 2 ontologies interpreted using the formal semantics of Description Logic ("Direct Semantics").
- OWL 2 EL: A simple fragment limited to basic classification, allowing reasoning in polynomial time.
- OWL 2 QL: A fragment designed to be translatable to querying in relational databases.
- OWL 2 RL: A fragment designed to be efficiently implementable using rule-based reasoners.
As already explained, a detailed understanding of OWL is not necessary for working with Linked Data. When reasoning over huge amounts of data, only the simplest reasoning processes are computationally efficient, and these can for the most part be implemented using only the resources of RDFS. Very briefly, the additional resources in OWL are terms providing mainly for the following:
- Class construction: forming new classes from existing classes, properties and individuals (e.g., ObjectIntersectionOf);
- Property construction: distinguishing object properties (resources as values) from data properties (literals as values);
- Class axioms: statements about classes, describing sub-class, equivalence and disjointness relationships;
- Property axioms: statements about properties, including relationships such as equivalence and sub-property, and also attributes such as whether a property is functional, transitive, and so forth;
- Individual axioms: statements about individuals, including class membership, and whether two resources represent the same individual or different individuals.
SPARQL
The SPARQL Protocol and RDF Query Language (a recursive acronym, since it contains itself) is a language for formulating queries over RDF data. It is the Semantic Web's counterpart to SQL (Structured Query Language), which has been a standard language for querying relational databases since the 1980s. SPARQL is a recent addition to the Semantic Web stack of languages, having been recommended as a W3C standard in 2008 [6].
Movie 3: Formulating a SPARQL query. Dr Barry Norton explains some key concepts of SPARQL and shows how to construct a simple query.
Since Chapter 3 of this book is dedicated to SPARQL, we limit ourselves here to an example that illustrates its purpose. Comparing SPARQL with SQL, the key difference is that it is designed for retrieving information from sets of triples, rather than from data organised into relations (i.e., tables). Queries are therefore formulated using lists of RDF triples in which some URIs or literals are replaced by variables, as in the following:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX music-ont: <http://purl.org/ontology/mo/>
SELECT ?album_name ?track_title
WHERE {
  dbpedia:The_Beatles foaf:made ?album .
  ?album dc:title ?album_name .
  ?album music-ont:track ?track .
  ?track dc:title ?track_title
}
Translated into English, the meaning of this query is as follows:
Retrieve a list of all album names AN and track titles TT in the data for which the following conditions hold:
- There is an album A made by the Beatles.
- Album A has the title AN.
- There is a track T on album A.
- Track T has the title TT.
Or more colloquially: retrieve the titles of all tracks on albums by the Beatles, along with the corresponding album titles. The response should be a list of pairs, each containing an album name and a track title.
This example shows the simplest kind of query, in which the WHERE statement is simply a list of triples (containing variables). SPARQL also provides some more sophisticated constructs: these include FILTER, which allows conditions on the values of variables (e.g., that a number should be between 1990 and 2000); also OPTIONAL, which specifies data that should be retrieved if available, while allowing the query to succeed even when they are unavailable. For more information on these more complex constructs, see Chapter 3.
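To make the idea of matching triple patterns concrete, here is a minimal Python sketch of how a query like the SELECT example above could be evaluated over a small set of triples. The data are hand-written sample triples with hypothetical album and track identifiers (the "ex:" names), and the matcher handles nothing beyond a plain list of triple patterns; real query engines are far more sophisticated.

```python
# Sample data; the "ex:" identifiers are hypothetical.
DATA = [
    ("dbpedia:The_Beatles", "foaf:made", "ex:album1"),
    ("ex:album1", "dc:title", "Abbey Road"),
    ("ex:album1", "music-ont:track", "ex:track1"),
    ("ex:album1", "music-ont:track", "ex:track2"),
    ("ex:track1", "dc:title", "Come Together"),
    ("ex:track2", "dc:title", "Something"),
    ("dbpedia:The_Beatles", "foaf:made", "ex:album2"),
    ("ex:album2", "dc:title", "Revolver"),
    ("ex:album2", "music-ont:track", "ex:track3"),
    ("ex:track3", "dc:title", "Taxman"),
    ("dbpedia:The_Rolling_Stones", "foaf:made", "ex:album3"),
    ("ex:album3", "dc:title", "Let It Bleed"),
    ("ex:album3", "music-ont:track", "ex:track4"),
    ("ex:track4", "dc:title", "Gimme Shelter"),
]

PATTERNS = [
    ("dbpedia:The_Beatles", "foaf:made", "?album"),
    ("?album", "dc:title", "?album_name"),
    ("?album", "music-ont:track", "?track"),
    ("?track", "dc:title", "?track_title"),
]


def match(pattern, triple, binding):
    """Extend a variable binding so the pattern equals the triple, or return None."""
    binding = dict(binding)  # work on a copy
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if binding.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None      # constant term does not match
    return binding


def select(variables, patterns, data):
    """Return the sorted values of the variables for every way of matching all patterns."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in data:
                result = match(pattern, triple, binding)
                if result is not None:
                    extended.append(result)
        bindings = extended
    return sorted(tuple(b[v] for v in variables) for b in bindings)


for row in select(["?album_name", "?track_title"], PATTERNS, DATA):
    print(row)
```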
Practically, to pose a query to a dataset you need to use a program or website that serves as a SPARQL endpoint. For a list of endpoints see the W3C site at http://www.w3.org/wiki/SparqlEndpoints. Typically, an endpoint interface provides text fields where you can type the URL of the dataset you wish to query, and the query itself (e.g., the SELECT query in the example above). On hitting the "Submit" button, you obtain a dynamically generated webpage listing the values of the query variables in a table. There are also libraries allowing you to incorporate SPARQL queries into your programs, such as the Java library Jena at http://jena.apache.org/.
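As a rough sketch of what happens behind such an interface, the Python fragment below encodes a query as the "query" parameter of an HTTP GET request, as described by the SPARQL Protocol, and extracts values from a response in the SPARQL JSON results format. The endpoint address http://example.org/sparql and the sample response are assumptions made for illustration; no network request is sent, and a real endpoint may require additional parameters or headers.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint address

QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
SELECT ?album WHERE { dbpedia:The_Beatles foaf:made ?album }
"""


def build_request_url(endpoint, query):
    """Encode the query text as the 'query' parameter of an HTTP GET request."""
    return endpoint + "?" + urlencode({"query": query})


def rows_from_json(document, variables):
    """Extract plain values from a SPARQL JSON results document (None if unbound)."""
    bindings = json.loads(document)["results"]["bindings"]
    return [
        tuple(b[v]["value"] if v in b else None for v in variables)
        for b in bindings
    ]


# Hand-written sample response, standing in for what an endpoint might return.
SAMPLE_RESPONSE = """
{"head": {"vars": ["album"]},
 "results": {"bindings": [
   {"album": {"type": "uri", "value": "http://example.org/album/1"}}
 ]}}
"""

url = build_request_url(ENDPOINT, QUERY)
rows = rows_from_json(SAMPLE_RESPONSE, ["album"])
print(url)
print(rows)
```

Libraries such as Jena hide these details, but the underlying exchange is the same: a query string goes out over HTTP and a table of variable bindings comes back.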
Part II: Introduction to Linked Data
In 2006 Berners-Lee wrote an influential note suggesting principles for the publication of data on the semantic web. The original text can be found at this web address:
http://www.w3.org/DesignIssues/LinkedData
Since then the volume of data has grown from around 2 billion triples in 2007 to over 30 billion in 2011, interconnected by over 500 million RDF links, the main purpose of which is to establish chains of URIs that refer to the same individuals. Through such links, published datasets are combined into a vast body of data known as a "cloud".
Figure 5: Linked data cloud (2007)
Source: http://lod-cloud.net
Citation: Linking Open Data cloud diagram (2007), by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net
License: CC-BY-SA
Figure 5 shows a diagram of the linked data cloud for 2007, in which nodes represent published datasets, and links represent sets of RDF triples through which the URIs in one dataset are paired with their counterparts in another dataset. Thus the link from DBpedia to MusicBrainz means that DBpedia includes not only RDF triples that give information about the world, but also triples that link some DBpedia names to their synonyms in MusicBrainz. We have seen examples of such statements in the last section, including the following triple which links the two names for the Beatles.
<http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/The_Beatles>.
Note that since the "sameAs" relation is transitive and symmetric, two statements of the form "X sameAs Y" and "Y sameAs Z" (or equivalently "Z sameAs Y") can be combined to infer "X sameAs Z"; in this way, lists of synonymous names can be derived from the cloud.
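One way to derive such lists of synonyms is to treat each sameAs statement as an undirected link and group together all URIs that are connected by a chain of links. The sketch below does this with a simple union-find structure. It is an illustration under stated assumptions: the function name synonym_sets is invented, and the example.org URIs are hypothetical names added alongside the MusicBrainz and DBpedia identifiers for the Beatles given above.

```python
def synonym_sets(same_as_pairs):
    """Group URIs into sets of synonyms, given pairs of URIs linked by sameAs."""
    parent = {}

    def find(uri):
        """Return the representative URI of the group containing `uri`."""
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # shorten the chain as we go
            uri = parent[uri]
        return uri

    for left, right in same_as_pairs:
        parent[find(left)] = find(right)  # merge the two groups

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return list(groups.values())


MB_BEATLES = "http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d"
DBPEDIA_BEATLES = "http://dbpedia.org/resource/The_Beatles"

PAIRS = [
    (MB_BEATLES, DBPEDIA_BEATLES),
    ("http://example.org/id/beatles", DBPEDIA_BEATLES),          # hypothetical third name
    ("http://example.org/id/x", "http://example.org/id/y"),      # hypothetical unrelated pair
]

groups = synonym_sets(PAIRS)
for group in groups:
    print(sorted(group))
```

Although no statement links the MusicBrainz URI directly to the hypothetical example.org name, both end up in the same group because each is linked to the DBpedia URI.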
Principles
In his 2006 note, Berners-Lee set out four simple principles for publishing data on the web. These are best seen as rules of best practice rather than rules that must be obeyed: the idea is that the more people follow these principles, the more their data will be usable by others.
In brief, the principles are as follows:
- Use URIs to identify things.
- Use HTTP URIs so that people can look up those names.
- When someone looks up a URI, provide useful information, using the standards (RDF, RDFS, SPARQL).
- Include links to other URIs, so that they can discover more things.
The rationale for these principles is probably obvious. By using URIs to identify individuals, classes, and properties, we obtain names that perform a double duty: as well as referring to the relevant thing, they give us a location on the web where we may look for information about that thing. Other naming schemes accomplish only the first of these duties. However, to obtain benefit from a name that also serves as a web address, the URI should not be a broken link. It should point to relevant information, encoded in one of the expected formats. This benefit will be enhanced further if the information includes URIs that point to other locations on the web from which additional relevant information might be recovered.
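Looking up a URI in this sense is an ordinary HTTP request. The Python sketch below prepares such a request and uses the Accept header to ask for an RDF serialisation (Turtle) rather than an HTML page. This is one common approach and not something the principles prescribe in detail: the function name rdf_request is invented, and whether a given server honours the header is an assumption. The request is only constructed here; the commented lines show how it could be sent when network access is available.

```python
from urllib.request import Request


def rdf_request(uri, media_type="text/turtle"):
    """Prepare an HTTP GET request asking for an RDF representation of the URI."""
    return Request(uri, headers={"Accept": media_type})


request = rdf_request("http://dbpedia.org/resource/The_Beatles")
print(request.get_method(), request.full_url)
print(request.get_header("Accept"))

# To perform the lookup (requires network access; the response format
# depends on the server):
# from urllib.request import urlopen
# with urlopen(request) as response:
#     print(response.read().decode("utf-8")[:500])
```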
Rating published datasets
In 2010 Berners-Lee extended the note referenced above to propose a system for rating datasets, based on the five-star rating system used for hotels. Closely related to the principles just listed, the system is as follows:
- One-star (*): The data is available on the web with an open license.
- Two-star (**): The data is structured and machine-readable.
- Three-star (***): The data does not use a proprietary format.
- Four-star (****): The data uses only open standards from W3C (RDF, SPARQL).
- Five-star (*****): The data is linked to that of other data providers.
Note that every level here includes the previous levels: thus for instance three-star data must also be available on the web in machine-readable form.
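The cumulative nature of the scheme can be expressed as a short Python sketch: a dataset earns stars one at a time, in order, and stops at the first criterion it fails. The Dataset class, its field names and the star_rating function are invented for this illustration; they simply restate the five criteria listed above.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published dataset."""
    open_license: bool            # available on the web with an open license
    machine_readable: bool        # structured and machine-readable
    non_proprietary_format: bool  # does not use a proprietary format
    w3c_standards: bool           # uses open W3C standards (RDF, SPARQL)
    linked_to_other_data: bool    # linked to data from other providers


def star_rating(dataset):
    """Count stars in order, stopping at the first criterion that is not met."""
    criteria = [
        dataset.open_license,
        dataset.machine_readable,
        dataset.non_proprietary_format,
        dataset.w3c_standards,
        dataset.linked_to_other_data,
    ]
    stars = 0
    for satisfied in criteria:
        if not satisfied:
            break
        stars += 1
    return stars


# Linked to other data, but not published using RDF: still only three stars.
print(star_rating(Dataset(True, True, True, False, True)))
```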
Movie 4: Towards five-star data. Dr Barry Norton explains the relationship between the four principles proposed by Berners-Lee and the five-star rating system. He also gives examples of web sites with ratings from one to five stars.
Growth of linked data on the web
We have shown above a diagram of the linked data cloud for 2007 (Figure 5). For comparison, Figure 6 shows the corresponding diagram for 2011 (the last year for which we have data), showing the expansion that has taken place during these years.
Figure 6: Linked data cloud (2011)
Source: http://lod-cloud.net
Citation: Linking Open Data cloud diagram (2011), by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net
License: CC-BY-SA
The colours on this diagram provide a broad categorisation of the domains of the various datasets. Figure 7 explains this colour code, and gives statistics on the overall size of the linked data cloud in 2011.
Figure 7: Composition of linked data cloud (2011)
Source: http://www4.wiwiss.fu-berlin.de/lodcloud/state
Citation: (Not specified)
License: (Not specified)
Case scenario: a music portal
As a motivating scenario for Euclid we consider the provision of a music-based portal, and the challenges and benefits of using Linked Data in creating it. In fact there are already services and mash-ups that draw benefit from existing Linked Open Data and Semantics-based technologies, including for example the following:
http://seevl.net/
http://www.bbc.co.uk/music/reviews
http://etree.linkedmusic.org/about/
These services (along with others in the music domain) exploit two major artifacts, the Music Ontology and the MusicBrainz dataset, which will be used as examples throughout the Euclid training material.
http://musicontology.com/
http://musicbrainz.org/
In order to provide a useful portal, the developer in this scenario would like to bring together a number of disparate components of data-oriented content:
- 1. Musical content
- Content exists in the third-party commercial setting (links into download and streaming providers, e.g. Amazon/iTunes and Spotify/Last.fm), the license-free setting (e.g. the Live Music Archive 'etree'), and the grey market setting (e.g. YouTube).
- 2. Music and artist metadata
- The MusicBrainz dataset is the primary source. Its primary artifact is a relational database, and the Euclid material will demonstrate how this is transformed into RDF via D2R. However, it is weak on biographical and genre information, for which alternative sources will be discussed (including Freebase and DBpedia).
- 3. Review content
- Reviews exist that are already Linked-Data-oriented (e.g. BBC Music Reviews), that are semi-structured but unlinked (e.g. Pitchfork) and that are largely unstructured (the Web in general).
- 4. Visual content
- Photographic depictions, album covers and videos exist on the Web, but are loosely coupled in terms of semantic interlinking.
The portal developer will use common identifiers to bring together this disparate content, and will also offer interesting mash-ups that exploit the links to further data in the Linking Open Data Cloud, e.g. geographical and biographical exploration, with the possibility of providing engaging visualisations over these data.
Developers will also seek to improve the quality of the semantic interlinking of the content they aggregate and contribute back to the Linking Open Data Cloud. In particular they will improve the linking of artists and works to visual content and to reviews, in the latter case crawling review content and publishing external annotations. They will also seek to improve classification within the metadata, encoding genre information -- at least with respect to the emphasis of their portal -- and along the way demonstrating the use of the Google Refine technology.
Finally, prototypical examples from the portal will demonstrate the use of RDFa annotation of human-readable content, and demonstrate the link to emerging Web technologies that inherit from semantics, such as Google Rich Snippets, Facebook Open Graph and schema.org annotation.
Figure 8: MusicBrainz site
Source: http://musicbrainz.org
Citation: (Not specified)
License: Not specified, but the core data of the database is licensed under the CC0 and the documentation (published in the portal) under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0.
Figure 9: Architecture of music portal
Source: Own source, but the icons are imported from different websites.
Movie 5: Screencast of the MusicBrainz site.
Movie 6: Screencast of the Seevl site.
Examples
To explore the possibilities of linked data browsers and mashups (which combine data from many sources), look at these examples of working websites based on semantic web technology.
http://marbles.sourceforge.net/
The Marbles site, created at the Freie Universität Berlin, allows you to view presentations based on RDF data from multiple sources that are distinguished visually by marbles of different colours. At the bottom of the page, these colours are indexed to the URIs of the sources. The value of such presentations is that they show at a glance how strongly the information is attested among the various datasets, thus providing some indication of its reliability. Applications like Marbles that exploit multiple datasets are sometimes called mash-ups.
Figure 10: Marbles Linked Data Browser
Source: http://dbpedia.org/Marbles#h53-2
Citation: Christian Becker, Christian Bizer. DBpedia Mobile: A Location-Enabled Linked Data Browser. 1st Workshop about Linked Data on the Web (LDOW2008), Beijing, China, April 2008.
License: Creative Commons Attribution-ShareAlike 3.0 and GNU Free Documentation License
http://sig.ma/
The Sigma site is a mash-up demonstration created by researchers at the Digital Enterprise Research Institute in Ireland. Among other things it provides a keyword search engine through which you can recover images and texts accessed through RDF annotations, as well as a list of synonymous URIs matching the search key, and links to web sources containing relevant RDF data.
Figure 11: Sigma site
Source: http://sig.ma
Citation: Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Renaud Delbru, Stefan Decker. Sig.ma: Live views on the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web - Volume 8, Issue 4. November 2010, Pages 355-364.
License: Sigma (c) Copyright
Movie 7: Screencast of the Sig.ma site.
http://wiki.dbpedia.org/DBPediaMobile
DBpedia Mobile is an application for mobile devices (phones, pads) which uses location detection to offer information from DBpedia on the user's current neighbourhood. The user sees a map of the neighbourhood on which various features of potential interest are labelled; clicking on a label opens a pane giving information generated from DBpedia and linked datasets. The application uses the Marbles Linked Data Browser (see above).
Figure 12: DBpedia Mobile site
Source: http://revyu.com
Citation: Christian Becker, Christian Bizer. DBpedia Mobile: A Location-Enabled Linked Data Browser. 1st Workshop about Linked Data on the Web (LDOW2008), Beijing, China, April 2008.
License: Creative Commons ShareAlike
Some other sites
For further examples of sites using Linked Data, see the following.
- New York Times
- The NY Times maintains a Linked Open Data site at http://data.nytimes.com/schools/schools.html.
- BBC Music
- The BBC has launched a music portal based on Linked Data at http://www.bbc.co.uk/music.
- LinkedGeoData
- The University of Leipzig has a community project providing street map information based on Linked Data, at http://linkedgeodata.org/.
- US government data
- In 2009 the US and UK governments made commitments to open data. The US government data site is at http://www.data.gov/.
- UK government data
- Available at http://data.gov.uk/ with over 8000 datasets published at the time of writing.
Movie 8: Screencast of the BBC Music site.
Movie 9: Screencast of the Data.gov.uk site.
Further reading
[1] V. Bush (1945) "As we may think". Atlantic Monthly vol. 176, pp 101-108. Available on-line at http://dl.acm.org/citation.cfm?id=227186.
[2] T. Berners-Lee and R. Cailliau (1990) "WorldWideWeb: Proposal for a HyperText Project". Published on-line at http://www.w3.org/Proposal.
[3] T. Berners-Lee, J. Hendler and O. Lassila (2001) "The Semantic Web". Scientific American vol. 284 number 5, pp 34-43. Available on-line at http://www.scientificamerican.com/article.cfm?id=the-semantic-web.
[4] T. Berners-Lee, R. Fielding and L. Masinter (2005) "Uniform Resource Identifier (URI): Generic Syntax". Published on-line at http://tools.ietf.org/html/rfc3986.
[5] O. Lassila and R. Swick (1999) "Resource Description Framework (RDF) Model and Syntax Specification". Published on-line at http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/.
[6] E. Prud'hommeaux and A. Seaborne (2008) "SPARQL Query Language for RDF". Published on-line at http://www.w3.org/TR/rdf-sparql-query/.
Summary
After studying this chapter you should achieve the following outcomes:
- Understanding of the purpose and potential applications of Linked Data.
- Some familiarity with background technologies and standards such as HTTP, RDF, Turtle, OWL, and SPARQL.
- Ability to formulate correct RDF statements in Turtle, using either full URIs or abbreviations.
- Ability to formulate correct SPARQL queries using SELECT, WHERE, and FROM, and to execute them in the Euclid SPARQL endpoint.
- Ability to use some basic resources of RDFS, such as "rdfs:label" and "rdfs:subClassOf".
- Understanding of the basic principles for exposing Linked Data.
- Some familiarity with web sites already available which provide services based on Linked Data.