iRap: Interest-based RDF update propagation framework

iRap is an RDF update propagation framework that propagates only the interesting parts of updates from a source dataset to a target dataset. It filters changesets from the source dataset based on graph-pattern-based interest expressions registered by the target dataset user.


Many LOD datasets, such as DBpedia and LinkedGeoData, are voluminous and serve a large number of requests from diverse applications. Replication of Linked Data datasets enhances the flexibility of information sharing and integration infrastructures. Since hosting a replica of large datasets such as DBpedia and LinkedGeoData is costly, organizations might want to host only a relevant subset of the data. However, due to the evolving nature of these datasets in terms of content and ontology, maintaining a consistent and up-to-date replica of the relevant data is a challenge. We present an approach, and its implementation, for interest-based RDF update propagation, which propagates only the interesting parts of updates from the source to the target dataset. Our approach is based on a formal definition of graph-pattern-based interest expressions that are used to filter the interesting parts of updates from the source dataset.

Architecture

We implement the approach in the iRap framework and perform a comprehensive evaluation based on DBpedia Live updates.



The Interest-based RDF update propagation (iRap) framework was implemented using Jena-ARQ. It is provided as open source and consists of three modules: the Interest Manager (IM), the Changeset Manager (CM), and the Interest Evaluator (IE), each of which can be extended to accommodate new or improved functionality.
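As a rough illustration of the filtering step performed by the Interest Evaluator, the added triples of a changeset can be matched against a registered interest pattern with plain Jena ARQ. This is only a hypothetical sketch, not the actual iRap API: the class and file names are illustrative, and the usual DBpedia and FOAF prefix bindings are assumed.

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;

    // Hypothetical sketch: extract the "interesting" part of a changeset's added
    // triples by running the registered interest pattern over them with Jena ARQ.
    public class InterestFilterSketch {
        public static void main(String[] args) {
            // Added triples of one changeset (file name is illustrative).
            Model added = RDFDataMgr.loadModel("added.nt");

            // A registered interest pattern (same shape as the demo query below).
            String interest =
                "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>\n" +
                "PREFIX dbpedia-prop: <http://dbpedia.org/property/>\n" +
                "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n" +
                "CONSTRUCT { ?person a dbpedia-owl:Person .\n" +
                "            ?person dbpedia-prop:name ?name .\n" +
                "            ?person foaf:homepage ?page . }\n" +
                "WHERE { ?person a dbpedia-owl:Person .\n" +
                "        ?person dbpedia-prop:name ?name .\n" +
                "        OPTIONAL { ?person foaf:homepage ?page . } }";

            // Matching triples are the interesting part of this update.
            try (QueryExecution qe = QueryExecutionFactory.create(interest, added)) {
                Model interesting = qe.execConstruct();
                interesting.write(System.out, "N-TRIPLES");
            }
        }
    }

Triples that match only parts of the pattern are what iRap tracks separately as potentially interesting triples (see the demo output below).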

Evaluation

To evaluate the proposed approach, we performed experiments on the iRap framework using changesets published by DBpedia and compared the results with the DBpedia Live Mirror tool. The comparison considers two cases: using iRap to update a previously established local replica of i) an entire remote dataset (the entire DBpedia, for a location-based app) and ii) a subset of a remote dataset (a slice of DBpedia about soccer players and teams, for a soccer app). These two cases simulate the two ways in which iRap can be used: i) applying interest-based changeset propagation to future updates of a local copy of a large dataset, or ii) starting with a new subset of the large dataset.

We tested our approach using the DBpedia dump for the initial setup of the target datasets for two different application domains, namely the Location and Football datasets. The following figures show the evaluation results, comparing the growth of the mirror-based replica with that of the iRap-based replica:


Fig 1. Soccer app dataset growth

Fig 2. Location app dataset growth



Further evaluation material can be found at iRap's GitHub repository.

Demo

Download: demo.zip.

The download file contains a sample interest expression file (interest.ttl), a last-downloaded-date file (lastDownloadedDate.dat), and a Java archive (irap.jar).

After extracting the compressed file, run the following command:

         $ java -jar irap.jar interest.ttl [numOfChangesets]
where:
  • interest.ttl - an interest expression file that contains the following query pattern:

    CONSTRUCT
    WHERE {
      ?person  a  dbpedia-owl:Person .
      ?person  dbpedia-prop:name  ?name .
      OPTIONAL { ?person foaf:homepage ?page . }
    }
  • numOfChangesets - the number of changesets you want to propagate, starting from May 01, 2015 00:00:00, e.g., 10, 100, 1000, etc.; an example run is shown after this list.
    (Note that the lastDownloadedDate.dat file is updated after each download. Therefore, once you have run iRap, subsequent runs will not start from May 01 but from the last downloaded date.)
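For example, assuming the extracted files are in the current directory, the following run propagates the first 100 changesets:

         $ java -jar irap.jar interest.ttl 100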

After iRap finishes evaluation, you will find the following (the names of the TDB folders depend on the interest expression; a short sketch for inspecting them follows the list):

  1. names-tdb - a TDB dataset that contains only the interesting triples that match the above query,
  2. names-pi-tdb - a TDB dataset that contains the potentially interesting triples that match parts of the above query, and
  3. changesets - a folder that contains the last publication date file from the source dataset.
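The TDB folders are plain Jena TDB datasets, so they can be inspected with standard Jena tooling. A minimal sketch, assuming the names-tdb folder produced by the sample interest expression:

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSetFormatter;
    import org.apache.jena.tdb.TDBFactory;

    // Minimal sketch: open the TDB folder written by the demo and count its triples.
    public class InspectTargetSketch {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("names-tdb");
            String query = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }";
            try (QueryExecution qe = QueryExecutionFactory.create(query, dataset)) {
                ResultSetFormatter.out(System.out, qe.execSelect());
            }
            dataset.close();
        }
    }

The same pattern applies to names-pi-tdb if you want to examine the potentially interesting triples.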