Chapter 2: Querying Linked Data

Introduction

Having surveyed the main concepts and standards for Linked Data, we now look in detail at SPARQL (SPARQL Protocol And RDF Query Language). As its name suggests, SPARQL is fundamentally a language for formulating queries, through which information can be retrieved from datasets. However, since it is targeted at datasets published on the World Wide Web, it also defines a protocol for transmitting SPARQL commands between clients and servers. Moreover, in the latest version, SPARQL 1.1, it has been extended to allow updating as well as querying, so that a SPARQL command can require data to be added, revised or deleted.

We will cover all these features of SPARQL in this chapter, using examples from the music domain, which you can run over the MusicBrainz dataset using the Euclid query editor at http://euclid.sti2.org/Exercises/Exercise2.

To log into this site, give both username and password as "exercise2".

We also look in more detail at RDFS and OWL, which allow developers to formulate conceptual knowledge that can be exploited by automatic reasoning services in order to enhance the semantics of queries.

Learning outcomes

On completing this chapter you should understand the following:

  • How to formulate a range of queries in SPARQL, and understand the responses you get back.
  • How in SPARQL 1.1 you can formulate queries that update datasets rather than just retrieving information.
  • How SPARQL queries from a client can be sent over the World Wide Web to a server containing a dataset, using the SPARQL protocol.
  • How reasoning and data integration can be achieved by utilising domain knowledge encoded in RDFS and OWL.

 

Part I: Introduction to SPARQL

SPARQL was proposed as a standard by the World Wide Web Consortium (W3C) in November 2008. It is maintained and developed by the W3C SPARQL Working Group, who in November 2012 recommended an upgraded version SPARQL 1.1 with new features including an update language (allowing users to change as well as consult RDF datasets). The latest recommendation can be found at these two sites, one for the query language and one for update:

  http://www.w3.org/TR/sparql11-query
  http://www.w3.org/TR/sparql11-update

Along with RDF and OWL, SPARQL is one of the three core standards of the Semantic Web. Its location in the Semantic Web "stack of languages" is shown in Figure 1. One point to note in the figure is that SPARQL does not depend on RDFS and OWL. However, as will be shown later in the chapter, knowledge encoded in RDFS and OWL may enhance the power of querying.


Figure 1: SPARQL in Semantic Web stack

SPARQL, as a database query language, resembles the well-known Structured Query Language (SQL). The syntax of SPARQL is shaped by the fact that it operates over graph data represented as RDF triples, as opposed to SQL's tabular data organised in a relational database.

Movie 1: Introduction to SPARQL. In this clip from the first Euclid webinar for chapter 2, Dr Barry Norton gives an overview of the SPARQL query language.

The essence of querying is shown by the following illustration, using for the time being English rather than RDF. Imagine an RDF dataset with statements containing the following information:

The Beatles made the album "Help".
The Beatles made the album "Abbey Road".
The Beatles made the album "Let it be".
The Beatles includes band-member Paul McCartney.
Wings made the album "Band on the run".
Wings made the album "London Town".
Wings includes band-member Paul McCartney.
The Rolling Stones made the album "Hot Rocks".

One can imagine various queries that a music portal might need to run over such a dataset. For instance, the portal might construct web pages on demand for any album or group nominated by the user. This would require retrieval of information from the dataset for questions such as the following:

Who made the album "Help"?
Which albums did the Beatles make?

These are so-called WH-questions ("who", "what", "where", etc.), of which the first would receive a single answer ("The Beatles"), and the second a list of three answers ("Help", "Abbey Road", "Let it be"). The SPARQL counterparts to these questions use RDF triples that contain variables; these correspond to the WH-words in the English queries. The general form for such questions (still working in English) is as follows:

Give me all values of X such that X made the album "Help".
Give me all values of X such that the Beatles made X.

We can go further than this by introducing more than one variable, thus generalising the query:

Give me all values of X and Y such that X made Y.

This is like asking a question with two WH-words, such as "Which bands made which albums?". The answer is not a list of values, as before, but a list of X-Y pairs that could be conveniently presented in a table:

X                   Y
The Beatles         "Help"
The Beatles         "Abbey Road"
The Beatles         "Let it be"
Wings               "Band on the run"
Wings               "London Town"
The Rolling Stones  "Hot Rocks"

In all these examples, the question is represented by a single statement with one or more variables; however, we can also construct more complex queries containing several statements:

Give me all values of X and Y such that: (a) X made Y, and (b) X includes band member Paul McCartney.

The answer would be the first five pairs from the previous answer, excluding "Hot Rocks" since the dataset does not list Paul McCartney as a band member of the Rolling Stones.
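Anticipating the SPARQL syntax introduced below, this two-condition query could be sketched as follows. This is a sketch only, using the foaf:made and mo:member predicates that appear in later examples in this chapter:

```sparql
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>

# Give me all X and Y such that (a) X made Y, and
# (b) X includes band member Paul McCartney.
SELECT ?band ?album
WHERE { ?band foaf:made ?album .
        ?band mo:member dbpedia:Paul_McCartney
      }
```

Here ?band plays the role of X and ?album the role of Y; the shared variable ?band is what enforces condition (b) on the maker retrieved in condition (a).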

Moving now from English to SPARQL, here is the encoding for the simple query "Which albums did the Beatles make?" for the MusicBrainz dataset. For now don't worry about learning the exact syntax; the important thing is to understand what the various bits and pieces are doing.

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>

SELECT ?album
WHERE { dbpedia:The_Beatles foaf:made ?album .
        ?album a mo:SignalGroup
      } 

The query begins with PREFIX statements that define abbreviations for namespaces. The query proper begins on the line starting SELECT, which also contains a variable (corresponding to X and Y in our English examples) starting with the question mark character '?'. You can choose any word you like for the rest of the variable name, provided that you use it consistently. The remainder of the query, starting with WHERE, contains a list of RDF triple patterns. These are like RDF triples except that they may include variables. They are expressed in Turtle, which we introduced in Chapter 1.

The WHERE clause in the example has two RDF triple patterns, separated by a full stop. The first pattern matches resources made by the Beatles; the second requires that these resources belong to a class mo:SignalGroup (this rather weird name distinguishes albums, which are "signal groups", from their constituent tracks, which are also encoded as resources made by the Beatles).

The response to a query is computed by a process known as graph matching, shown diagrammatically in Figure 2, where both query and dataset are shown as RDF graphs specified in Turtle (to simplify, only part of the above dataset is included).


Figure 2: Answering a query by graph matching

SPARQL terminology

Before proceeding to the detailed structure of queries, it is worth pausing to review the concepts introduced so far:

RDF triple
An RDF triple is a statement of the form subject-predicate-object expressed in one of the RDF formalisms.
RDF triple pattern
An RDF triple pattern is the same as an RDF triple except that any or all of its three constituents may be replaced by a variable.
RDF graph
An RDF graph is a set of RDF triples. You probably know that "graph" in mathematics has two distinct meanings: (1) a diagram showing points arranged by their relationship to an X axis and a Y axis; (2) a set of vertices (or nodes) linked by edges (or arcs). In the case of RDF the second meaning applies, where the subject and object in a triple are vertices, and the predicate is an edge that links them by pointing from subject to object. Formally, an RDF graph can be described as a directed labelled multigraph, which means (a) that edges are directional (you cannot switch subject and object without changing the statement), (b) that edges are named (by the predicate identifier), and (c) that there can be multiple edges linking two vertices (resources may be related in different ways).
RDF dataset
An RDF dataset is a set of RDF triples comprising a default RDF graph, which by definition is unnamed, and zero or more named RDF graphs. The idea behind this segmentation is that SPARQL queries can be explicitly confined to a named subset rather than running over the whole dataset.
Graph pattern
We use this term to refer to a conjunction of RDF triple patterns. It is therefore the same as an RDF graph, except that its constituents are RDF triple patterns (which contain variables) as opposed to normal RDF triples (which don't). Note that in a query, the expression following the keyword WHERE is a graph pattern; this is why graph patterns are important.
SPARQL Protocol client
A SPARQL Protocol client is an HTTP client that sends requests for SPARQL Protocol operations. As you probably know, "client" here refers to a program that sends a request to another program, possibly running on another computer, over a network; the other computer is known as the "server".
SPARQL Protocol service
A SPARQL Protocol service is an HTTP server that services requests for SPARQL Protocol operations.
SPARQL endpoint
A SPARQL endpoint is a SPARQL Protocol service, identified by a given URL, which listens for requests from SPARQL clients.
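To make the idea of named graphs concrete, the following sketch confines a query to a single named graph using the GRAPH keyword. The graph IRI here is hypothetical, not one offered by the Euclid endpoint:

```sparql
PREFIX mo: <http://purl.org/ontology/mo/>

SELECT ?album
WHERE { GRAPH <http://example.org/graphs/beatles> {   # hypothetical named graph
          ?album a mo:SignalGroup
        }
      }
```

Without the GRAPH clause, the pattern would be matched against the default graph of the dataset.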

Querying with SPARQL

Submitting a query

In developing an application like a music portal, you will need to build queries into your application code. There are APIs that help you to do this, like the Jena library for Java. However, before learning to use APIs, you can learn the syntax for queries (and their responses) by using a SPARQL endpoint to enter the query by hand.

As an example of this procedure Figure 3 shows a snapshot of the Euclid SPARQL endpoint at http://euclid.sti2.org/Exercises/Exercise2 with a query about the albums made by the Beatles. If you type in this query and hit the "Run Query" button you will obtain the page shown in Figure 4 containing a table giving values for the variables ?album and ?title.


Figure 3: Typing a query into a SPARQL endpoint


Figure 4: Viewing the response to a query

Types of query

SPARQL defines the following query types for data retrieval:

ASK
An ASK query is a test of whether there are any resources in the dataset matching the search pattern; the response is either true or false. Intuitively, it poses the question: "Are there any X, Y, etc. satisfying the following conditions ...?"
SELECT
A SELECT query returns a table in which columns represent variables and rows represent variable bindings matching the search pattern. Intuitively: "Return a table of all X, Y, etc. satisfying the following conditions ...".
CONSTRUCT
A CONSTRUCT query returns an RDF graph (i.e., set of triples) matching a template, using variable bindings obtained from the dataset using a search pattern. Intuitively: "Find all X, Y, etc. satisfying the following conditions ... and substitute them into the following template in order to generate (possibly new) RDF statements ...".
DESCRIBE
A DESCRIBE query returns an RDF graph, extracted from the dataset, which provides all available information about a resource (or resources). The resource may be identified by name, or by a variable accompanied by a graph pattern. Intuitively: "Find all statements in the dataset that provide information about the following resource(s) ... (identified by name or description)".
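To illustrate the last of these, a DESCRIBE query can name a resource directly, or pick out resources via a graph pattern. Both forms below are sketches in the spirit of the MusicBrainz examples used in this chapter (the second is shown commented out, since only one query can be submitted at a time):

```sparql
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Describe a resource identified by name ...
DESCRIBE dbpedia:The_Beatles

# ... or describe resources identified by a graph pattern:
# DESCRIBE ?album
# WHERE { dbpedia:The_Beatles foaf:made ?album }
```

Note that the exact set of triples returned by DESCRIBE is left to the query service, which decides what counts as "all available information" about a resource.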

Queries using ASK

An ASK query corresponds intuitively to a Yes/No question in conversational language. For example, the following query corresponds to the Yes/No question "Is Paul McCartney a member of the Beatles?":

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>

ASK
WHERE { dbpedia:The_Beatles mo:member dbpedia:Paul_McCartney } 

If this query is submitted as described above, the answer given will be "true".

Dissecting the syntax of this query, we note the following:

  • The PREFIX statements are not an essential part of a query, but here as elsewhere they are useful as a means of abbreviating RDF triples or patterns. Be careful not to put a full stop at the end of a PREFIX statement, since this will cause a syntax error.
  • The query proper begins with ASK, which specifies the relevant query type.
  • WHERE introduces a graph pattern, which at its simplest is a conjunction of RDF triples or patterns, presented in curly brackets and separated by full stops (or by commas or semicolons, as mentioned in the section on Turtle in Chapter 1). The patterns may use abbreviations defined in the PREFIX statements, and may include one or more variables. (More complex graph patterns will be described later on.)
  • Layout is free provided that terms are separated by white space. For instance, if you wished you could type the whole query on one line, or at the other extreme type a new-line character after every term. The layout given above with new lines for the key words PREFIX, ASK, WHERE, is adopted only for human readability.
  • The keywords of SPARQL syntax – PREFIX, ASK, WHERE, etc. – are not case-sensitive, so if you prefer you can use prefix, ask, where, and so on. In the examples we consistently capitalise these words for reasons of readability, but this has no effect on how the query engine interprets the query.

If you want to ask whether there are any X, Y, etc. such that certain conditions hold – e.g., "Are there any X such that X is a member of the Beatles" – you need to use RDF patterns, which are like triples except that they contain variables. These are represented by names beginning with a question mark '?', as in this example:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>

ASK
WHERE { dbpedia:The_Beatles mo:member ?person } 

Queries using SELECT

To show the basic syntax of a SELECT query, let us return to the example given in section 2.3:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>

SELECT ?album
WHERE { dbpedia:The_Beatles foaf:made ?album .
        ?album a mo:SignalGroup
      } 

Note the following:

  • After the (optional) PREFIX statements, the query proper begins with SELECT, which specifies the query type.
  • After SELECT you list all the variables that you would like to see tabulated in the response, separated by spaces if there is more than one. Alternatively, you can simply put an asterisk after SELECT, meaning that all variables should be tabulated; in the example, this would yield the same result.
  • As before, WHERE introduces a graph pattern including one or more variables.
  • As before, layout is free provided that terms are separated by white space.

 

Movie 2: Understanding a WHERE clause. In this clip from the first Euclid webinar for chapter 2, Dr Barry Norton explains the structure of the WHERE clause in a SELECT query.

Try running the query in the Euclid Exercise 2 endpoint. You should obtain in response a list of MusicBrainz URIs that use arbitrary codes rather than recognisable words. However, we can exploit some further triples associating albums with their titles (encoded as string literals) in order to obtain more understandable output, as follows:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?album ?title
WHERE { dbpedia:The_Beatles foaf:made ?album .
        ?album a mo:SignalGroup .
        ?album dc:title ?title
      } 

Ordering the rows in the query result

For some queries you might want the results to be presented in a particular order. For instance, if your music portal retrieves albums made by the Beatles, using the query given above, you might want to present these in alphabetical order of title. This can be done using the keywords ORDER BY, as in the following example:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT *
WHERE { dbpedia:The_Beatles foaf:made ?album .
        ?album a mo:SignalGroup .
        ?album dc:title ?title
      }
ORDER BY ?title

The key element in this query is the ORDER BY component at the end, which stipulates that rows should be presented in alphabetical order of the values in the ?title column. Note also one other change from the previous example, the use of the asterisk shorthand in the SELECT clause, which asks for all variables in the WHERE clause to be tabulated. This shorthand is not always used, since often the selected variables are a strict subset of those mentioned in the WHERE clause.

Try replacing ORDER BY ?title by ORDER BY DESC(?title), which will present the rows in descending rather than ascending order of title (i.e., ascending is the default).

What exactly do these tables show? The columns, as we have seen, correspond to variables in the query for which tabulation is requested after SELECT, either explicitly by name, or implicitly by the '*' option. The rows give all variable bindings that match the graph pattern in the WHERE clause. A binding is an assignment of identifiers or literals to the variables which, when instantiated in these patterns, will yield a subgraph of the dataset. Note that when computing these bindings, the query engine makes the key assumption that variables occurring in more than one pattern are bound to the same resource. In this way it avoids returning a row in which an album is paired with the title of a different album.

Returning results page by page

For some queries, including our example, the output table returns too many results at once. In such cases it would be useful if the music portal included a paging facility allowing users to view the information in manageable portions – perhaps ten rows at a time. This can be done through the query engine using the keywords LIMIT and OFFSET. To see how this works, try extending our previous query as follows:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT *
WHERE { dbpedia:The_Beatles foaf:made ?album .
        ?album a mo:SignalGroup .
        ?album dc:title ?title
      }
ORDER BY ?title
LIMIT 10 OFFSET 0

You should get back the same table, but cut off after the first ten rows; the result will be the same if you just put LIMIT 10 leaving the offset unspecified. Now try raising the number after OFFSET to 10. You should get back the next ten-row segment of the table, covering rows 11-20. In general, if LIMIT is L and OFFSET is S, the query will return L rows starting at S+1 and continuing up to S+L.
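Applying this arithmetic, each page of the portal's album list corresponds to one setting of the final clause. For instance, the third page of ten rows (rows 21-30, i.e. L = 10 and S = 20) would be requested by replacing the last line of the query with:

```sparql
LIMIT 10 OFFSET 20
```

In application code, a page number p with page size L translates into OFFSET (p - 1) * L.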

Using tests to filter the results

We have seen queries in which the WHERE clause contains specific resources (the Beatles) or variables (?album). But what if we want to obtain results for any variable that satisfies a certain condition – e.g., albums beginning with the letter 'B', or band members born before 1960?

Such conditions are called "filters", and to illustrate them, let us switch to an example in which our aim is to retrieve tracks (not albums) with a duration between 300 and 400 seconds. Since in MusicBrainz durations are encoded in milliseconds, the relevant filter condition can be stated as follows:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?title ?duration
WHERE { dbpedia:The_Beatles foaf:made ?track .
        ?track a mo:Track .
        ?track dc:title ?title .
        ?track mo:duration ?duration .
        FILTER (?duration>=300000 && ?duration<=400000)
      }
ORDER BY ?duration

The tables in Figures 5 and 6 show some other operators from which filter conditions can be constructed. Both tables are taken from the W3C specification at http://www.w3.org/TR/rdf-sparql-query/, which can be consulted for further details.

Conceptually, filters define a boolean condition on a graph pattern binding. We have already seen that the graph pattern in a WHERE clause has a set of solutions, each corresponding to a binding of the variables mentioned in the graph pattern. The filter submits these solutions to a boolean condition, letting through only those variable bindings for which the condition is met. In a naive implementation, the result would be computed in just this way: first find the set of solutions matching the graph pattern; then select from this set the solutions that satisfy the filter condition, for inclusion in the final result.
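Filter conditions are not limited to numeric comparisons. For example, the earlier question about albums beginning with the letter 'B' could be sketched using the standard regex function over string literals:

```sparql
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?title
WHERE { dbpedia:The_Beatles foaf:made ?album .
        ?album a mo:SignalGroup .
        ?album dc:title ?title .
        FILTER regex(?title, "^B")   # titles beginning with 'B'
      }
```

Here "^B" is an ordinary regular expression anchored to the start of the title string.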


Figure 5: Unary operators for filter expressions


Figure 6: Binary operators for filter expressions

Avoiding duplicate rows in the output table

For some queries you may find that the output table has duplicated rows. The reason for this is usually that the selected (tabulated) variables are a strict subset of the variables in the graph pattern. Consider for instance the table in Figure 7, which you will see if you submit the previous query and scroll down a little:


Figure 7: Response table with duplicate rows

In the centre of this figure we find two rows with track title "Within You Without You" and duration 305000; and there are many more examples further down the table. This happens because there might be multiple resources instantiating ?track having the same values for ?title and ?duration. Here, for example, the track "Within You Without You" is present in two different albums, so it shows up twice. If all three variables were tabulated, the rows would differ in the ?track column, but since this column is not requested, we obtain rows that appear identical. To avoid this you can include the keyword DISTINCT in the SELECT clause, as follows:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT DISTINCT ?title ?duration
WHERE { dbpedia:The_Beatles foaf:made ?track .
        ?track a mo:Track .
        ?track dc:title ?title .
        ?track mo:duration ?duration .
        FILTER (?duration>300000 && ?duration<400000)
      }
ORDER BY ?duration

Scrolling through the output table, you should now find only one row pairing "Within You Without You" with 305000.

Since DISTINCT is computationally expensive, SPARQL also provides a cheaper alternative, REDUCED, which eliminates some duplicates but not necessarily all (e.g., it fails to eliminate the duplication of "Within You Without You" mentioned above). In practice DISTINCT is more widely used, and it need not be expensive when the results are ordered, since duplicate rows then appear next to one another.
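If you wish to experiment, the REDUCED variant differs from the query above only in its SELECT clause; the WHERE, FILTER and ORDER BY parts are unchanged:

```sparql
SELECT REDUCED ?title ?duration
```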

Retrieving aggregate data for groups of bindings

Suppose that the dataset contains triples specifying the tracks on each album, and associating a duration in milliseconds with each track, represented by an integer literal. (In fact the MusicBrainz dataset associates records with albums, and tracks with records, which complicates the query slightly – see below.) Fully listed this is a lot of data, and you might wish instead to report, for each album, the total duration obtained by summing the durations of the tracks. This is an example of aggregate data, and it requires two operations: first, we must segment the variable bindings into groups, corresponding in this case to all bindings relating to a given album; second, for each group, we must submit the values of a specified variable to an aggregation function – in this case, sum the track durations.

For instance, suppose the dataset has just two albums matching the query, namely Revolver and Abbey Road; and suppose that for Revolver just three tracks are included, namely "Eleanor Rigby", "I'm only sleeping", and "Doctor Robert", with durations respectively of 200000, 240000, and 160000 milliseconds. (These are just round numbers made up for the example.) This will mean that for Revolver we have three bindings of the variables ?album, ?track, and ?track_duration; and if Abbey Road has four specified tracks, it will correspondingly have four bindings. We now separate the bindings into groups – three for Revolver, four for Abbey Road – and within each group we want to sum the track durations, and return only a table that gives albums along with their total durations. This will mean that the first row of the table will specify Revolver in the ?album column, and 600000 in the second column for which we need a new name – perhaps ?album_duration.

Here is a query that achieves this result. In addition it imposes a condition on the total duration of the album, reporting only albums with duration exceeding 3600000 milliseconds (i.e., one hour), selecting (and hence grouping by) album title rather than album.

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX mo: <http://purl.org/ontology/mo/>

SELECT ?album_title (SUM(?track_duration) AS ?album_duration)
WHERE { ?album mo:record ?record .
        ?album dc:title ?album_title .
        ?record mo:track ?track .
        dbpedia:The_Beatles foaf:made ?track .
        ?track mo:duration ?track_duration .
      } 
GROUP BY ?album_title
HAVING (?album_duration > 3600000)

Note two new keywords here: AS introducing a variable name for the sum of durations (this becomes the heading of the second column of the output table); and HAVING introducing a filter over the group of variable bindings that has just been specified by the GROUP BY component.
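SUM is only one of the aggregation functions defined in SPARQL 1.1; others include COUNT, AVG, MIN and MAX. As a further sketch using the same graph pattern, the following query would count the tracks on each album instead of summing their durations:

```sparql
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX mo: <http://purl.org/ontology/mo/>

SELECT ?album_title (COUNT(?track) AS ?track_count)
WHERE { ?album mo:record ?record .
        ?album dc:title ?album_title .
        ?record mo:track ?track .
        dbpedia:The_Beatles foaf:made ?track
      }
GROUP BY ?album_title
```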

Queries using CONSTRUCT

When building an application like the music portal, you might need to retrieve some information from a queried dataset and re-express it in new RDF triples, perhaps using new names for resources. This might, for example, allow more efficient integration with triples from another dataset.

To meet this need, SPARQL provides a CONSTRUCT query which uses information retrieved from a dataset in order to build new RDF statements. Note that the query does not update the dataset. The new RDF triples are returned to the user as output, to be used in any way desired, the dataset itself remaining unchanged. (In SPARQL 1.1 there are query types for updating a dataset, as described in a later section of this chapter.)

To show the basic syntax of a CONSTRUCT query, consider the following example where the user wishes to assert "creator" relationships between artists and their products – of any kind; perhaps the aim is to construct a dataset in which this predicate is used consistently, replacing more specific predicates in MusicBrainz.

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX mo: <http://purl.org/ontology/mo/>

CONSTRUCT { ?album dc:creator dbpedia:The_Beatles .
            ?track dc:creator dbpedia:The_Beatles .
          }
WHERE { dbpedia:The_Beatles foaf:made ?album .
       ?album mo:record ?record .
       ?record mo:track ?track . 
      }

The key to understanding this query is that the variables employed in the CONSTRUCT list must occur also in the WHERE list. When the query is run, the query engine begins by retrieving the variable bindings satisfying the description in the WHERE list – just as it would for a SELECT query. For each variable binding, it then instantiates the triple patterns in the CONSTRUCT list and so creates (in this case two) new RDF triples. The result of the query is a merged graph including all the created triples.

Figure 8 shows this outcome diagrammatically for an even simpler CONSTRUCT query, with the relevant part of the dataset and the constructed triples both shown as graphs.


Figure 8: Result of a simple CONSTRUCT query

Ordering and limiting the constructed triples

When building new RDF triples with CONSTRUCT, you can use the operators described in section 2.4.4 in order to organise and delimit the variable bindings retrieved from the dataset. The following query will build "creator" relationships only for the first 10 albums recorded by the Beatles, following alphabetical order of the titles, along with their tracks:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX mo: <http://purl.org/ontology/mo/>

CONSTRUCT { ?album dc:creator dbpedia:The_Beatles .
            ?track dc:creator dbpedia:The_Beatles .
          }
WHERE { dbpedia:The_Beatles foaf:made ?album .
        ?album mo:record ?record .
        ?album dc:title ?album_title .
        ?record mo:track ?track . 
      }
ORDER BY ?album_title
LIMIT 10

Query patterns containing disjunction

Suppose that for some reason you want to construct triples for albums made either by the Beatles or by the Smashing Pumpkins (or both). Including both of these constraints in the WHERE list will not work, because implicitly the list represents a conjunction of statements, each of which must be satisfied. To allow disjunctions, SPARQL contains a UNION pattern; this is formed by placing the keyword UNION between two subsets of statements, each subset delimited by curly brackets. The meaning is that variable bindings should be retrieved if they satisfy either the statements on the left or the statements on the right (or both). Thus our target query is formed as follows:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX mo: <http://purl.org/ontology/mo/>

CONSTRUCT { ?album dc:creator ?band .
            ?track dc:creator ?band .
          }
WHERE { ?band foaf:made ?album .
        ?album mo:record ?record .
        ?record mo:track ?track .
        { ?band foaf:name "The Beatles" }
          UNION
        { ?band foaf:name "The Smashing Pumpkins" }
      }

Note that the first three statements of the WHERE list lie outside the scope of the UNION operator.

Retrieving resources for which information is MISSING from the dataset

In the examples we have seen so far, variable bindings must be retrieved for all patterns listed after WHERE. This means that if we retrieve several facts about an album (say), the album will only be included in the output if all these facts are present in the dataset: if just one is missing, the others will be ignored. SPARQL deals with this problem by allowing any graph pattern in the list to be preceded by the keyword OPTIONAL. This means that when computing variable bindings, the query engine should accept incomplete bindings, provided that the unbound variables occur only in optional patterns.

In the following query, optional patterns are used ingeniously to select only variable bindings for which a particular variable is not bound. The variable in question records an artist's place of death, and it is assumed that if this information is missing from the dataset, the artist will still be alive. If variables in the CONSTRUCT clause are not bound in the OPTIONAL clause, the triple patterns with these variables are not generated. As a result, "creator" relationships are constructed only for artists who are alive (or more precisely, artists for whom there is no death place recorded in the dataset).

PREFIX dbont: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

CONSTRUCT { ?album dc:creator ?artist . }
WHERE { ?artist foaf:made ?album .
        OPTIONAL { ?artist dbont:deathPlace ?place_of_death }
        FILTER (!BOUND(?place_of_death))
      }

Note that in the filter expression '!' denotes negation, so that the whole expression means that the variable is not bound.

You should take care using this kind of query, since it depends on a risky inference sometimes called the closed-world assumption – namely, that any relevant statement not found in the dataset must be false. Thus if the dataset contains information about places of death, but no statement giving the place of death of Paul McCartney, we infer by this assumption that Paul McCartney must still be alive, since otherwise his place of death would have been recorded.
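SPARQL 1.1 also offers a more direct idiom for this kind of test: a FILTER NOT EXISTS expression succeeds exactly when the enclosed graph pattern has no match in the dataset. The query above can therefore be sketched equivalently as follows (the same closed-world caution applies):

PREFIX dbont: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

CONSTRUCT { ?album dc:creator ?artist . }
WHERE { ?artist foaf:made ?album .
        FILTER NOT EXISTS { ?artist dbont:deathPlace ?place_of_death }
      }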

Assigning variables

If you want to construct RDF triples using a variable that is derived from retrieved data, e.g., through an arithmetical operation, you can add a BIND statement to the WHERE clause as follows:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>

CONSTRUCT { ?track mo:runtime ?secs } 
WHERE { dbpedia:The_Beatles foaf:made ?album .
        ?album mo:record ?record .
        ?record mo:track ?track .
        ?track mo:duration ?duration .
        BIND ((?duration/1000) AS ?secs) .
      }

In this way the object of mo:runtime will be given in seconds rather than milliseconds.

Constructing new triples using aggregate data

We have already discussed a SELECT query that returns aggregate data by summing the durations of tracks in each album. You may recall that such a query uses the AS keyword in the expression following SELECT, to introduce a variable name for the aggregate value – in this case, the album duration. In the context of a CONSTRUCT query we therefore have a problem: how to introduce this new variable for the aggregate?

The solution used in SPARQL is to allow a sub-query after the keyword WHERE, in place of the usual graph pattern. This is achieved by the following rather convoluted syntax, in which the HAVING clause further restricts the output to albums whose total duration exceeds an hour (3,600,000 milliseconds):

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>

CONSTRUCT { ?album mo:duration ?album_duration }
WHERE {
    SELECT ?album (SUM(?track_duration) AS ?album_duration)
    { dbpedia:The_Beatles foaf:made ?album .
      ?album mo:record ?record .
      ?record mo:track ?track .
      ?track mo:duration ?track_duration . 
    } GROUP BY ?album 
      HAVING (?album_duration > 3600000) }

Queries using DESCRIBE

Like CONSTRUCT, DESCRIBE delivers as output an RDF graph – i.e., a set of RDF triples. It differs from CONSTRUCT in that these triples are not constructed according to a template, but returned as found in the dataset. The reasons for doing this are similar to those for CONSTRUCT – you might, for instance, want to add these triples to another dataset – but you would prefer DESCRIBE if you were satisfied with the original encoding and had no reason to re-express the information using different resource names.

To specify the desired information, the simplest method is to name a resource; the query engine will then return all triples in which this resource is employed either as subject, predicate or object. Thus the following query will retrieve all statements mentioning Paul McCartney.

PREFIX dbpedia: <http://dbpedia.org/resource/>

DESCRIBE dbpedia:Paul_McCartney

Alternatively, resources can be specified more generically as bindings to a variable. Thus the following query requests all triples that mention a member of the Beatles.

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX mo: <http://purl.org/ontology/mo/>

DESCRIBE ?member
WHERE { dbpedia:The_Beatles mo:member ?member }

 

Part II: Updating Linked Data with SPARQL 1.1

We now turn to queries that modify datasets by adding or removing data; these were not available in the original 2008 standard, but are provided for in the SPARQL 1.1 recommendation. To understand how they work, it is useful first to review how datasets are organised.

Every SPARQL query engine has an associated RDF dataset on which queries are normally run. The dataset is sometimes called a graph store, because it is organised as a collection of RDF graphs rather than a single graph. One of these graphs is called the default graph of the dataset, and has no name. The other graphs, if present, are called named graphs, each of which is identified by an IRI. Thus a dataset must have a default graph, and in addition may have any number of named graphs – including none.

The main update operations provided are as follows:

  • Deleting data from one of the graphs in the graph store.
  • Inserting data into one of the graphs in the store.
  • Loading the content of another graph into a graph in the store.
  • Clearing all triples from a graph in the store.

Unfortunately, to protect the datasets in the Euclid Exercise 2 endpoint, the examples cannot be run – you will obtain in response only a message explaining that access to the requested service is restricted.

Movie 3: Introduction to updating. In this clip from the second Euclid webinar for chapter 2, Dr Barry Norton gives an overview of updating commands in SPARQL 1.1.

Adding data to a graph

Suppose you have a graph containing data about the Beatles, in which Peter Best is included as a member of the band. If we assume that the graph is meant to describe the band as it was in the mid 1960s, when the Beatles were at their most active and famous, then this information would be out-of-date; we might want to describe Peter Best as a former band member, but not a current one. Here is a query that will achieve the first part of this objective, by describing Peter Best as a former band member:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX db-ont: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

INSERT { dbpedia:The_Beatles db-ont:formerMember ?x }
WHERE  { dbpedia:The_Beatles db-ont:currentMember ?x .
        ?x foaf:name "Peter Best"
       } 

This query is very similar in syntax to the queries we examined in section 2.4, especially CONSTRUCT and DESCRIBE queries. The query form in this case is INSERT, and like CONSTRUCT it is followed by a graph pattern that specifies a set of RDF triples, given bindings for any variables that are included in the pattern. These bindings, as before, are obtained through a WHERE clause. In this simple example, we assume that the person formulating the query does not know the IRI naming the resource Peter Best, and accordingly retrieves it from the dataset using the foaf:name relation to the literal "Peter Best". Once the relevant IRI is plugged into the single triple following INSERT, we obtain an exact specification of an RDF triple (with no variables). The query engine then checks whether this triple is present in the default graph, and if not, adds it. If the triple was already present in the graph, there would be no change.

If you knew the IRI for Peter Best, you could use an alternative form of the insert query with the keywords INSERT DATA. This is followed not by a graph pattern (including variables) but a set of fully specified triples; accordingly, a WHERE clause is not needed.
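For instance, assuming (hypothetically) that dbpedia:Pete_Best is the IRI in question, the whole update reduces to:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX db-ont: <http://dbpedia.org/ontology/>

INSERT DATA { dbpedia:The_Beatles db-ont:formerMember dbpedia:Pete_Best }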

If you wish the triples to be added to a named graph rather than the default graph, you can name the relevant graph at the beginning of the pattern after INSERT, as in the following slightly modified example:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX db-ont: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

INSERT { GRAPH <http://myFavGroups/The_Beatles> 
         { dbpedia:The_Beatles db-ont:formerMember ?x }
       }
WHERE  { dbpedia:The_Beatles db-ont:currentMember ?x .
        ?x foaf:name "Peter Best"
       } 

Deleting data from a graph

As in the case of inserting, you have two options: either use DELETE DATA followed by a set of RDF triples (without variables), or use DELETE followed by a graph pattern and a WHERE clause. In either case, you again have the option of indicating, within the DELETE clause, the graph from which the specified triples will be removed. If no graph is named, the triples will be removed from the default graph.
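Thus, if the IRI for Peter Best were known (dbpedia:Pete_Best is assumed here for illustration), the fully specified form would be:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX db-ont: <http://dbpedia.org/ontology/>

DELETE DATA { dbpedia:The_Beatles db-ont:currentMember dbpedia:Pete_Best }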

In the following example, which would be a natural sequel to the query in 2.6.1, Peter Best is removed as a current band member from the default graph.

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX db-ont: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

DELETE { dbpedia:The_Beatles db-ont:currentMember ?x }
WHERE  { dbpedia:The_Beatles db-ont:currentMember ?x .
        ?x foaf:name "Peter Best"
       } 

Combining insertion and deletion

In the examples for INSERT and DELETE the aim was to replace an outdated statement (Peter Best as current member) by an up-to-date one (Peter Best as former member). When the INSERT and DELETE operations use the same variable bindings, they can be combined in a single query, sharing the same WHERE clause, as follows:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX db-ont: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

DELETE { dbpedia:The_Beatles db-ont:currentMember ?x }
INSERT { dbpedia:The_Beatles db-ont:formerMember ?x }
WHERE  { dbpedia:The_Beatles db-ont:currentMember ?x .
        ?x foaf:name "Peter Best"
       } 

Loading an RDF graph

You might wish to extend your dataset by loading triples from an external dataset for which you have the URL. This can be done using the LOAD query, which has the following simple basic syntax:

LOAD <http://xmlns.com/foaf/spec/20100809.rdf>

This will retrieve all triples from the specified external web source – if it exists – and copy them into the default graph of your query engine. If you want instead to copy them into a named graph you can use the keyword INTO to specify the graph name:

LOAD <http://xmlns.com/foaf/spec/20100809.rdf>
INTO <http://myFavGroups/The_Beatles>

Clearing an RDF graph

If you want to retain a graph in the store as a location into which you can load and insert data, but to repopulate it from scratch by clearing out its current triples, you can use a CLEAR query with the following simple syntax:

CLEAR GRAPH <http://myFavGroups/The_Beatles>

The keyword GRAPH is employed when you identify the graph to be cleared using its IRI. Alternatives are CLEAR DEFAULT, which clears the default graph, CLEAR NAMED, which clears all of the named graphs in the store, and CLEAR ALL, which clears every graph in the store (default and named). Thus the following would be a complete query removing all triples from the query engine's default graph:

CLEAR DEFAULT

Adding and removing graphs

SPARQL 1.1 allows you to manage the graphs in the query engine's store by the query forms CREATE and DROP. It is important to understand that these operate on graphs, not on triples. CLEAR removes all triples from a graph, but does not remove the graph itself; it remains in place as a location into which triples may be inserted or loaded. DROP, by contrast, removes the graph altogether from the store, so that any attempt to load or insert into it will yield an error – unless you first reinstate it using CREATE. Thus CLEAR is like removing all text from a Word document while leaving the document in place, whereas DROP is like trashing the document altogether.

To create a new graph, just specify its IRI as follows:

CREATE GRAPH <http://myFavGroups/The_Beatles>

To drop an existing graph, use the same syntax with DROP instead of CREATE:

DROP GRAPH <http://myFavGroups/The_Beatles>
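Dropping a graph that does not exist normally yields an error. SPARQL 1.1 provides the keyword SILENT, placed immediately after the operation name, to suppress such failures, so that the operation succeeds as a no-op instead:

DROP SILENT GRAPH <http://myFavGroups/The_Beatles>

The same keyword can be used with CREATE, CLEAR, LOAD and the graph management operations described below.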

Other graph management operations

If you are working with several graphs in a store, you might wish to copy or move information from one to another. For instance, when starting a new version of a named graph, you might first create a new (empty) graph for version 2, and then wish to copy across the entire contents of version 1. This is done as follows:

COPY GRAPH <http://myFavGroups/The_Beatles/v1>
TO GRAPH <http://myFavGroups/The_Beatles/v2>

Note that version 1 remains unchanged by this operation, which is like copying text from one document and pasting it into another document, overwriting any text already in the second document. This means that after the COPY operation is performed, the two graphs will have exactly the same triples.

If the source graph is a draft that you have no more use for, you can use the MOVE command, which first copies all triples from the first graph to the second, and then removes the first graph altogether – equivalent to performing COPY followed by DROP:

MOVE GRAPH <http://myFavGroups/The_Beatles/draft>
TO GRAPH <http://myFavGroups/The_Beatles/v1>

This is like copying text from one document to another, overwriting any text in the second document, and then trashing the first document.

If you want to retain triples already present in the destination graph, you can use the ADD command, which is the same as COPY except that triples from the source graph are added to the destination graph without overwriting. You might do this, for example, if you have been constructing separate graphs for different rock groups (The Beatles, The Rolling Stones, etc.), and now wish to add your Beatles data to a larger graph covering all your favourite rock groups:

ADD GRAPH <http://myFavGroups/The_Beatles>
TO GRAPH <http://myRockGroups>

Like COPY, this operation leaves the source graph unchanged. It is like copying text from a document and pasting it into another document without overwriting the text already present there.

For all these operations, a destination graph will be created automatically if it does not exist already.

SPARQL protocol

The SPARQL 1.1 Protocol comprises two operations, one for querying, one for updating. An operation specifies the following:

  • An HTTP method (either GET or POST)
  • HTTP query parameters
  • Message content in the HTTP request body
  • Message content in the HTTP response body

A request is sent from the SPARQL client to the SPARQL server (or endpoint), which sends back a status code, possibly accompanied by other data (e.g., tables with the results of the query). For a query operation, the response is a table of data in one of various formats (XML, CSV, etc.) for a SELECT query, and triples formatted in RDF/XML or Turtle for a CONSTRUCT query.

Status codes are three-digit integers whose meanings are defined by the HTTP specification; for instance, the standard response to a successful request has the code "200".
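For a concrete sense of the query operation (the endpoint path and host below are assumed for illustration), a SELECT query is typically passed, URL-encoded, as the value of the query parameter of a GET request:

GET /sparql?query=SELECT%20%2A%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D HTTP/1.1
Host: example.org
Accept: application/sparql-results+json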

Figure 9 shows a possible encoding of the HTTP GET request for a simple query. The details are not important for this chapter, but it is worth noting the final line (starting with "Accept"), which lists the media types that the sender can understand, with preference ratings expressed as numbers between 0.0 and 1.0. This is called content negotiation, and is also part of the HTTP specification.


Figure 9: Encoding of a GET request

Movie 4: SPARQL 1.1 Protocol. In this clip from the second Euclid webinar for chapter 2, Dr Barry Norton outlines the SPARQL 1.1 protocol.

Reasoning over linked data

Reasoning enhances the information contained in a dataset by including results obtained by inference from the triples already present. As a simple example, suppose that the dataset includes the following triples, shown here in Turtle (assume the usual prefixes):

dbpedia:The_Beatles a mo:MusicGroup .
mo:MusicGroup rdfs:subClassOf mo:MusicArtist .

Recall that the predicate a in the first triple is Turtle shorthand for rdf:type, denoting class membership. Now, suppose that we submit the following SELECT query designed to retrieve all triples in which the Beatles occur as subject:

PREFIX dbpedia: <http://dbpedia.org/resource/>

SELECT *
WHERE { dbpedia:The_Beatles ?predicate ?object }

Assuming there are no other relevant triples, the response to this query under a regime with no entailment will be the following single-row table:

?predicate ?object
a mo:MusicGroup

An obvious inference has been missed here, since if every music group is a music artist (as asserted by the second triple), the Beatles will also be a music artist. (It might sound odd to call a group an artist, but this is how the information is encoded in the Music Ontology.) If we execute the query on an engine that implements the RDFS entailment regime (a set of rules governing inference based on RDFS resources like rdfs:subClassOf), the output table will be enriched by a second variable binding inferred with the aid of the second triple.

?predicate ?object
a mo:MusicGroup
a mo:MusicArtist

This is a very simple case, but in general the formulation of workable entailment regimes is a complex task, still under investigation and discussion. One hard problem is what to do if the dataset is inconsistent, since a well-known result of logic states that from a contradiction, anything can be inferred. (This follows from the definition of material implication, for which the truth-conditions state that if the antecedent is false, the implication holds whatever the consequent. Hence a contradiction, being false by definition, implies any statement whatever.)

Another problem is that some queries might return an infinite set of solutions. For instance, there is a convention in RDF that allows for a family of predicates rdf:_1, rdf:_2, etc. ending in any positive integer; such predicates are used when constructing lists. This means that if all possible inferences are performed, a query of the form ?x rdf:type rdf:Property (intuitively, return all properties) would yield an infinite set.

For practical purposes we can ignore such cases, but they illustrate that query engines may not always return all the entailments that one might expect.

In the rest of this section we look first at typical entailments that arise from RDFS, and then at the much richer entailments that can result from an ontology in OWL.

Movie 5: Integration and inference. In this clip from the second Euclid webinar for chapter 2, Dr Barry Norton shows how data integration can be achieved through inferences based on RDFS and OWL.

Reasoning using RDFS

RDF Schema introduces a very limited range of logical terms based mostly on the concept of class, or rdfs:Class, a concept absent from RDF. It is possible to state directly that a resource denotes a class, using a triple such as the following (to save space we omit prefixes):

mo:MusicGroup rdf:type rdfs:Class .

We have already seen an example of the predicate rdfs:subClassOf, through which we can assert that the resource in subject position is a subclass of the resource in object position – a relationship often expressed in English by sentences of the form All X are Y (e.g., "All pianists are musicians"). A similar predicate is provided for properties rather than classes, illustrated by the following triple in which subject, predicate and object are all property resources:

ex:hasSinger rdfs:subPropertyOf ex:hasPerformer .

This means that if a subject and object are related by the ex:hasSinger predicate, they must also be related by the ex:hasPerformer predicate. Note that even though these are properties, they may occur in subject and object position – in other words, we can make statements about properties, as well as using them to express relationships among other resources. Two important facts about any property are its domain and range, which constrain the resources that can be employed in subject and object position when the property is used as a predicate; these can also be defined using RDFS resources, as follows:

ex:hasSinger rdfs:domain mo:Performance .
ex:hasSinger rdfs:range foaf:Agent .

This means that any resource occurring as subject of ex:hasSinger must belong to the class mo:Performance (only performances have singers), while any resource occurring as object of ex:hasSinger must belong to the class foaf:Agent (only agents are singers).

RDFS entailment rules

The RDFS entailment regime is defined by thirteen rules, which are listed here. In brief, their import is as follows:

rdfs1
Allocates a blank node to a literal, and classifies this node as a member of the class rdfs:Literal.
rdfs2
Uses a rdfs:domain statement to classify a resource found in subject position (in a triple containing the relevant property as predicate).
rdfs3
Uses a rdfs:range statement to classify a resource found in object position (in a triple containing the relevant property as predicate).
rdfs4
Classifies a resource found in subject or object position in a triple as belonging to the class rdfs:Resource.
rdfs5
Inference based on the transitivity of rdfs:subPropertyOf.
rdfs6
Infers that any resource classified as a property is a sub-property of itself.
rdfs7
Uses a rdfs:subPropertyOf statement to infer that two resources are related by a property P if they are related by a subproperty of P.
rdfs8
Infers that any resource classified as a class is a subclass of rdfs:Resource.
rdfs9
Infers that a resource belongs to a class if it belongs to its subclass.
rdfs10
Infers that any class is a subclass of itself.
rdfs11
Uses the transitivity of rdfs:subClassOf to infer that the subclass of a class C is a subclass of the superclass of C.
rdfs12
Infers that any resource belonging to rdfs:ContainerMembershipProperty is a subproperty of rdfs:member.
rdfs13
Infers that any resource belonging to rdfs:Datatype is a subclass of rdfs:Literal.

In what follows, we look more closely at how these RDFS resources and their associated rules can support inferences when information is retrieved from a dataset.

Inferring subclass and class membership relationships

We have already seen an example of an inference based on rdfs:subClassOf, which we repeat below omitting Prefix statements:

dbpedia:The_Beatles a mo:MusicGroup .
mo:MusicGroup rdfs:subClassOf mo:MusicArtist .

The general form of this inference is that if a resource belongs to a class, which in turn belongs to a broader class, then the resource must also belong to this broader class (see rdfs9). This pattern will be familiar as a variant of the Aristotelian syllogism:

Socrates is a man.
All men are mortal.
Therefore, Socrates is mortal.

A similar inference can be drawn from two subclass statements in which the object of one statement is the subject of the other (see rdfs11):

mo:MusicGroup rdfs:subClassOf mo:MusicArtist .
mo:MusicArtist rdfs:subClassOf foaf:Agent .

If music groups are a subclass of music artists, who are in turn a subclass of agents, then music groups must be a subclass of agents. This corresponds to the so-called Barbara pattern of the syllogism:

All Greeks are men.
All men are mortal.
Therefore, all Greeks are mortal.
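Where no entailment regime is available, SPARQL 1.1 property paths can emulate this transitive closure directly in the query. The following sketch returns every class to which the Beatles belong, however many subclass links away; the path rdf:type/rdfs:subClassOf* means one rdf:type step followed by zero or more rdfs:subClassOf steps:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?class
WHERE { dbpedia:The_Beatles rdf:type/rdfs:subClassOf* ?class }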

Classifying a resource from its relationships to other resources

Suppose a dataset contains the following triple asserting that Paul McCartney belongs to the Beatles:

dbpedia:The_Beatles mo:bandMember dbpedia:Paul_McCartney .

Can we infer anything further about the resources in subject and object position? To a human, of course, it will be obvious from the resource names that the Beatles is a group, and that Paul McCartney is an agent, or more specifically a person. However, these names are meaningless to the query engine, which can only utilise information encoded in RDF triples. To put ourselves in this position, we could try using arbitrary IRIs, as in this version of the triple:

ex:X mo:bandMember ex:Y .

Even with subject and object anonymised in this way, a person could infer from the predicate that X will be a band, and Y a person. The query engine can perform this inference too if the dataset includes triples defining the domain and range of the property (see rdfs2 and rdfs3):

mo:bandMember rdfs:domain mo:MusicGroup ;
              rdfs:range foaf:Agent .

(Remember the use of the semi-colon here to abbreviate two statements with the same subject.) If these triples are present in the dataset, and the query engine is deploying the RDFS entailment regime, then two further triples can be directly inferred:

ex:X a mo:MusicGroup .
ex:Y a foaf:Agent .

A special case of this inference arises when a triple uses an RDFS predicate for which domain and range statements are axiomatic, meaning that they are defined in RDFS itself and need not be included in the dataset over which the query is run. Thus for rdfs:subClassOf, the following triples are axiomatic and need not be defined by developers:

rdfs:subClassOf rdfs:domain rdfs:Class .
rdfs:subClassOf rdfs:range  rdfs:Class .

It follows that on encountering any subclass statement, the query engine can infer that its subject and object must be classes. Thus from –

mo:MusicGroup rdfs:subClassOf mo:MusicArtist .

– it can infer:

mo:MusicGroup a rdfs:Class .
mo:MusicArtist a rdfs:Class .

Inferences based on sub-property relationships

Finally, whenever SP has been defined as a sub-property of P, it can be inferred that any subject and object having an SP relationship must also have the (broader) P relationship (see rdfs7):

ex:hasSinger rdfs:subPropertyOf ex:hasPerformer .
ex:Yesterday ex:hasSinger dbpedia:PaulMcCartney .

From these two statements, the query engine may infer:

ex:Yesterday ex:hasPerformer dbpedia:PaulMcCartney .

Such inferences are common when a dataset uses RDFS to organise classes and properties into hierarchies. Thus classes like Corgi, Dog, Canine, Mammal, Animal, Living thing, comprise an obvious hierarchy in which membership of any class implies membership of all its superclasses; as we have just shown, a similar hierarchical organisation can be defined for properties.

More advanced inferences

Compared with OWL, RDFS has two main limitations: (a) it provides no operators for constructing complex classes or properties out of simpler ones (e.g., "artist that belongs to at least two bands" from artist, band, and member-of); (b) it lacks some important resources for describing the logical properties of classes and properties, such as disjointness (for classes) and inverse (for properties). Limitations of the second kind are particularly common when working with linked data, and it is worth illustrating some of them now, before we look in detail at OWL.

Suppose that a dataset contains resources named ex:Man and ex:Woman, both classes, which can be used for classifying resources that represent individual people:

dbpedia:Paul_McCartney a ex:Man .
dbpedia:Cilla_Black a ex:Woman .

Now, suppose that in addition to these two triples, a third triple is either present in the dataset, or can be inferred:

dbpedia:Paul_McCartney a ex:Woman .

Plainly something has gone wrong here, and we would like a query engine capable of even elementary reasoning to signal an inconsistency. It might surprise you to learn that there is no way of doing this, because RDFS provides no predicate allowing you to state that men and women are disjoint (i.e., that no man is a woman, or nothing is both a man and a woman). The necessary predicate, owl:disjointWith, is found in OWL only, and allows statements such as the following:

ex:Man owl:disjointWith ex:Woman .

To give one more example, suppose that we introduce a property resource allowing us to state that one person is married to another. We might for instance apply it to the McCartneys:

dbpedia:Paul_McCartney ex:marriedTo dbpedia:Linda_McCartney .

Now, suppose that someone formulates the following query, corresponding to the question "Who is married to Paul McCartney?" (we omit PREFIX clauses as before):

SELECT ?person
WHERE { ?person ex:marriedTo dbpedia:Paul_McCartney }

Obviously the answer should be "Linda McCartney", but this binding will not be found because the following triple is missing from the dataset:

dbpedia:Linda_McCartney ex:marriedTo dbpedia:Paul_McCartney .

To infer this we need a statement to the effect that if X is married to Y, Y must be married to X. In mathematics, this fact is expressed by saying that the property ex:marriedTo is symmetric. This can be stated directly using OWL, as follows:

ex:marriedTo a owl:SymmetricProperty .

No such resource exists in RDFS, so this kind of inference cannot be performed under an RDFS entailment regime.

Reasoning using OWL

As a reminder, Figure 10 shows the location of the Web Ontology Language (OWL) in the Semantic Web stack. As can be seen, OWL depends on RDF and RDFS, from which it draws crucial resources such as rdf:type and rdfs:Class. Like RDFS, it can enhance the information in RDF datasets, accessible to SPARQL queries.


Figure 10: OWL in Semantic Web stack

The current OWL standard OWL-2 is complex, providing for a number of fragments with different computational properties (see chapter 1). Most of these are subsets of OWL2-DL, the OWL description logic, but there is also a more expressive variant called OWL2-Full which ventures outside description logic. We will not try to cover all this material in this section. Instead, we look at some of the main logical resources in OWL2-DL from the viewpoint of their role in inference.

Inferences based on characteristics of properties

We have illustrated in the last section an inference based on one of the property characteristics that can be represented in OWL, namely symmetry (if X p Y, then Y p X). A number of others are provided, of which the following are used most often:

  • Transitive properties support the inference: if X p Y, and Y p Z, then X p Z. An example is the property "longer than" applied to track durations: if "Hey Jude" is longer than "Help", and "Help" is longer than "Yesterday", then we can infer that "Hey Jude" is longer than "Yesterday".
  • Functional properties support the inference that if X p Y and X p Z, Y and Z must be identical: in other words, X can bear the relation p to only one thing. An example is the property "has mother", assuming that this means biological mother. To say that this property is functional means that a person can have at most one mother. Thus if Charles has Elizabeth as mother, and Charles also has Mrs Windsor as mother, then Elizabeth and Mrs Windsor are two names of the same person.
  • Inverse-functional properties support the inference that if Y p X and Z p X, then Y and Z must be identical. An example would be the property "is mother of". Thus if Elizabeth is mother of Charles, and so is Mrs Windsor, then Elizabeth and Mrs Windsor are the same person.
  • Two properties p1 and p2 are inverse if X p1 Y means exactly the same as Y p2 X (i.e., they are equivalent, and each can be inferred from the other). Thus in our last two examples, "has mother" and "is mother of" are inverse properties, since if Charles has mother Elizabeth, then Elizabeth is mother of Charles, and vice-versa.

Let us see how these characteristics are used in practice, starting with a transitive property.

ex:locatedIn a owl:TransitiveProperty .
ex:AbbeyRoadStudios ex:locatedIn ex:London .
ex:London ex:locatedIn ex:UnitedKingdom .

Here the classification of ex:locatedIn as a transitive property allows a query engine with an OWL2-DL regime to draw the following inference in response to a query concerning the location of the Abbey Road Studios (or concerning what is located in the United Kingdom):

ex:AbbeyRoadStudios ex:locatedIn ex:UnitedKingdom .

Moving on to functional properties, consider this set of triples:

ex:hasFather a owl:FunctionalProperty .
dbpedia:Julian_Lennon ex:hasFather dbpedia:John_Lennon .
dbpedia:Julian_Lennon ex:hasFather ex:J_Lennon .

The classification of ex:hasFather as functional permits the inference that Julian Lennon can have only one father, and hence that the objects of the second and third triples must be co-referential. In OWL this is expressed using owl:sameAs:

dbpedia:John_Lennon owl:sameAs ex:J_Lennon .

Exactly the same inference can be drawn if the same information is presented using the inverse-functional property ex:isFatherOf:

ex:isFatherOf a owl:InverseFunctionalProperty .
dbpedia:John_Lennon ex:isFatherOf dbpedia:Julian_Lennon .
ex:J_Lennon ex:isFatherOf dbpedia:Julian_Lennon .

Note that owl:sameAs is symmetric, and also transitive; these characteristics are axiomatic in OWL, so they need not be specified by the ontology developer.
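The functional-property rule can likewise be sketched in Python. All names are illustrative, and a real engine would derive these owl:sameAs triples as part of general OWL entailment:

```python
# Illustrative sketch: if p is functional and X p Y and X p Z, then
# Y owl:sameAs Z. Triples are (subject, predicate, object) tuples.
def same_as_from_functional(triples, prop):
    # Collect, for each subject, every object it bears prop to.
    values = {}
    for (s, p, o) in triples:
        if p == prop:
            values.setdefault(s, set()).add(o)
    same = set()
    for objs in values.values():
        objs = sorted(objs)
        # Any two objects for the same subject must be co-referential.
        for i in range(len(objs)):
            for j in range(i + 1, len(objs)):
                same.add((objs[i], "owl:sameAs", objs[j]))
    return same

triples = {
    ("dbpedia:Julian_Lennon", "ex:hasFather", "dbpedia:John_Lennon"),
    ("dbpedia:Julian_Lennon", "ex:hasFather", "ex:J_Lennon"),
}
inferred = same_as_from_functional(triples, "ex:hasFather")
```

The inverse-functional case is the mirror image: group by object instead of subject.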

Finally, if we include a triple stating that ex:hasFather and ex:isFatherOf are inverse, we obtain two ways of expressing fatherhood, with subject and object switched. Imagine for instance a dataset with just two triples –

ex:hasFather owl:inverseOf ex:isFatherOf .
dbpedia:Julian_Lennon ex:hasFather dbpedia:John_Lennon .

– and suppose we run the following query (prefixes omitted) on who is the father of Julian Lennon:

SELECT ?person
WHERE { ?person ex:isFatherOf dbpedia:Julian_Lennon }

Without OWL-based reasoning the query engine has no way of returning John Lennon as an answer to this query, but using the inverse property statement it can derive the following entailment:

dbpedia:John_Lennon ex:isFatherOf dbpedia:Julian_Lennon .
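A naive way to obtain this behaviour is to materialise the inverse triples before matching the query pattern. The following Python sketch (with illustrative names, standing in for what a reasoning-aware engine does internally) takes that approach:

```python
# Illustrative sketch: materialise inverse-property triples, then answer
# the query by pattern matching over the enlarged dataset.
def add_inverses(triples, prop, inverse_prop):
    out = set(triples)
    for (s, p, o) in triples:
        # Rule: X p1 Y entails Y p2 X, and vice versa.
        if p == prop:
            out.add((o, inverse_prop, s))
        elif p == inverse_prop:
            out.add((o, prop, s))
    return out

data = {("dbpedia:Julian_Lennon", "ex:hasFather", "dbpedia:John_Lennon")}
data = add_inverses(data, "ex:hasFather", "ex:isFatherOf")

# Match the pattern ?person ex:isFatherOf dbpedia:Julian_Lennon
answers = [s for (s, p, o) in data
           if p == "ex:isFatherOf" and o == "dbpedia:Julian_Lennon"]
```

Without the materialisation step, the list of answers would be empty.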

Inferences based on equivalence among terms

We have already seen the OWL property owl:sameAs, which allows us to state that two IRIs name the same individual:

dbpedia:John_Lennon owl:sameAs ex:J_Lennon .

This property is very important in linked data, since it provides a means by which datasets with different naming conventions can be connected. OWL also provides predicates for stating that two classes, or two properties, have the same meaning, as follows:

mo:MusicArtist owl:equivalentClass ex:musician .
foaf:made owl:equivalentProperty ex:creatorOf .

In essence we have three ways here of stating that two terms mean the same thing, and in each case inferences can be drawn from the fact that equivalence of meaning is symmetric and transitive. The practical consequence is that if a dataset already contains two names for John Lennon (say), connected by the owl:sameAs statement above, we can add a third name by relating it to only one of these names –

dbpedia:John_Lennon owl:sameAs new:JL666 .

– whereupon by symmetry and transitivity, a similar statement can be inferred for any subject-object pair drawn from these three names:

ex:J_Lennon owl:sameAs new:JL666 .
new:JL666 owl:sameAs ex:J_Lennon .
new:JL666 owl:sameAs dbpedia:John_Lennon .

Similarly, a single owl:sameAs statement introducing a fourth name will suffice to make it equivalent in meaning to any of the other three – and so on.
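This behaviour amounts to grouping names into equivalence classes. The sketch below (illustrative Python, not a production reasoner) computes the symmetric-transitive closure of a set of owl:sameAs pairs in exactly this way:

```python
# Illustrative sketch: owl:sameAs closure via equivalence classes.
from itertools import combinations

def same_as_closure(pairs):
    # Merge names into equivalence classes (a simple union of sets).
    groups = []
    for (a, b) in pairs:
        merged = {a, b}
        rest = []
        for g in groups:
            if g & merged:
                merged |= g
            else:
                rest.append(g)
        groups = rest + [merged]
    # Emit owl:sameAs between every ordered pair within each class.
    closure = set()
    for g in groups:
        for (x, y) in combinations(sorted(g), 2):
            closure.add((x, "owl:sameAs", y))
            closure.add((y, "owl:sameAs", x))
    return closure

pairs = [("dbpedia:John_Lennon", "ex:J_Lennon"),
         ("dbpedia:John_Lennon", "new:JL666")]
closure = same_as_closure(pairs)
```

With the two stated pairs, all three names fall into one class, so six owl:sameAs triples are entailed (three unordered pairs, each in both directions).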

Further Reading

  1. E. Prud'hommeaux and A. Seaborne (2008). "SPARQL Query Language for RDF". W3C Recommendation, available at http://www.w3.org/TR/rdf-sparql-query/
  2. P. Hitzler, M. Krötzsch, and S. Rudolph (2010). "Query Languages". In Foundations of Semantic Web Technologies. CRC Press.
  3. V. Bush (1945). "As We May Think". Atlantic Monthly, vol. 176, pp. 101-108. Available online at http://dl.acm.org/citation.cfm?id=227186
  4. O. Hartig (2012). "An Introduction to SPARQL and Queries over Linked Data". In M. Brambilla, T. Tokuda, and R. Tolksdorf (eds.), ICWE, Springer, pp. 506-507. Materials available at http://www2.informatik.hu-berlin.de/~hartig/tmp/HartigTutorialICWE12_2.pdf, exercises at http://www2.informatik.hu-berlin.de/~hartig/tmp/HartigTutorialICWE12_HandsOn.pdf
  5. A. Hogan, J. Z. Pan, A. Polleres, and Y. Ren (2011). "Scalable OWL 2 Reasoning for Linked Data". In A. Polleres, C. d'Amato, M. Arenas, S. Handschuh, P. Kroner, S. Ossowski, and P. F. Patel-Schneider (eds.), Reasoning Web, Springer, pp. 250-325. ISBN 978-3-642-23031-8. Materials available at http://homepages.abdn.ac.uk/jeff.z.pan/pages/tutorial/eswc2010/, slides at http://homepages.abdn.ac.uk/jeff.z.pan/pages/tutorial/eswc2010/ESWC2010TutorialOnScalableOWLReasoningForLinkedData.pdf

Summary

After studying this chapter you should have achieved the following outcomes:

  1. Conceptual understanding of querying and updating datasets using SPARQL.
  2. Practical understanding of how to formulate and submit SPARQL queries at an endpoint.
  3. Understanding the purposes of the different query forms (ASK, SELECT, CONSTRUCT, DESCRIBE).
  4. Enough knowledge of SPARQL syntax to formulate basic queries, including patterns and modifiers like ORDER BY, LIMIT, OFFSET, OPTIONAL, FILTER, DISTINCT.
  5. Understanding of basic data and graph management operations in SPARQL 1.1 (e.g., INSERT, DELETE, LOAD, CLEAR).
  6. Outline understanding of the SPARQL Protocol (details not needed at this point).
  7. Outline understanding of the entailment regimes in reasoning (RDFS, OWL), and the ways in which they can enhance SPARQL query semantics.