The goal of the new Linked Data track is to investigate retrieval techniques over a combination of textual and highly structured data, where RDF properties carry additional key information about semantic relations among data objects that cannot be captured by keywords alone. We intend to investigate whether and how structural information can be exploited to improve ad-hoc retrieval performance, and how it can be combined with structured queries to help users navigate or explore large sets of results (as known from faceted search systems), or to address Jeopardy-style natural-language queries (as known from question answering) that are translated into a semi-structured query format.
The new Linked Data track at INEX 2012 thus aims to close the gap between IR-style keyword search and Semantic-Web-style reasoning techniques. Our goal is to bring together different communities and to foster research at the intersection of Information Retrieval, Databases, and the Semantic Web.
As its core collection, the Linked Data track will use a fusion of XML-ified Wikipedia articles with those RDF properties from DBpedia and YAGO2 that contain the article entity as either their first or second argument. The core collection (Wikipedia-LOD v1.1, see below) is based on the popular MediaWiki format (see http://dumps.wikimedia.org/enwiki/20110722/), where all Wiki-markup has been replaced by proper XML tags and CDATA sections. In addition, all internal Wikipedia links (including the article entity itself) have been enriched with links to their corresponding DBpedia and YAGO2 entities (as far as available). Participants are explicitly encouraged to make use of further RDF facts available from DBpedia and YAGO2, in particular for processing the reasoning-related Faceted Search and Jeopardy topics.
For INEX 2012, we will explore three different retrieval tasks:
The new Wikipedia-LOD collection is available from the following link:
In addition to the new core collection, which is based on XML-ified Wikipedia articles, the Linked Data track explicitly encourages (but does not require) the use of current Linked Data dumps for DBpedia (v3.7) and YAGO2, which are available from the following URLs:
Each Wikipedia-LOD article consists of a mixture of XML tags and CDATA sections, containing infobox attributes, free-text contents describing the entity or category that the article captures, and a section with both DBpedia and YAGO2 properties that are related to the article's entity. All sections contain links to other Wikipedia articles (including links to the corresponding DBpedia and YAGO2 resources), Wikipedia categories, and external Web pages.
DBpedia and YAGO2 are two comprehensive, common-sense knowledge bases that provide structured information extracted semi-automatically, mostly from Wikipedia infoboxes and categories. Both knowledge bases focus on extracting attribute-value pairs from Wikipedia infoboxes and category lists, which serve as a basis for applying various information extraction techniques. They also contain geo-coordinates, links between Wikipedia pages, redirection and disambiguation pages, external links, and so on. Each Wikipedia page corresponds to a resource in DBpedia and YAGO2. The connection between the data sets is given in the "wikipedia_links_en" file from DBpedia. See, for example:
<http://dbpedia.org/resource/AccessibleComputing> <http://xmlns.com/foaf/0.1/page> <http://en.wikipedia.org/wiki/AccessibleComputing>
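The link triples can be turned into a lookup table from DBpedia resources to Wikipedia pages. The following Python lines are a minimal sketch, assuming the file holds one IRI-only triple per line as in the example above; the names `parse_link` and `build_mapping` are hypothetical helpers and not part of any track tooling.

```python
import re

# Matches a triple whose subject, predicate and object are all IRIs,
# tolerating stray whitespace inside the angle brackets and an optional final dot.
TRIPLE_RE = re.compile(
    r"<\s*([^<>\s]+)\s*>\s*<\s*([^<>\s]+)\s*>\s*<\s*([^<>\s]+)\s*>\s*\.?\s*$"
)

FOAF_PAGE = "http://xmlns.com/foaf/0.1/page"


def parse_link(line):
    """Return (dbpedia_resource, wikipedia_url), or None if the line is not a foaf:page triple."""
    match = TRIPLE_RE.match(line.strip())
    if match is None:
        return None
    subject, predicate, obj = match.groups()
    if predicate != FOAF_PAGE:
        return None
    return subject, obj


def build_mapping(lines):
    """Collect all foaf:page links into a dict keyed by DBpedia resource."""
    mapping = {}
    for line in lines:
        parsed = parse_link(line)
        if parsed is not None:
            resource, page = parsed
            mapping[resource] = page
    return mapping
```

Lines that do not match (comments, blank lines, triples with other predicates) are skipped silently, which keeps the sketch usable on a raw dump file opened with `open(path, encoding="utf-8")`.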
The Linked Data track is intended as an open track and thus invites participants to include more Linked Data (see, for example, linkeddata.org) or other sources that go beyond "just" DBpedia and YAGO2. Any inclusion of further data sources is welcome; however, research papers and workshop submissions should explicitly mention these sources when describing their approaches.
<qid> Q0 <file> <rank> <rsv> <run_id>
An example submission is:
2012001 Q0 12 1 0.9999 2012UniXRun1
2012001 Q0 997 2 0.9998 2012UniXRun1
2012001 Q0 9989 3 0.9997 2012UniXRun1
Here are three results for topic "2012001". The first result is the Wikipedia page with ID "12". The second result is the page with ID "997", and the third result is the page with ID "9989".
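A run file in this format can be produced from a scored result list with a few lines of Python. This is only a sketch: the function name `format_run` and the choice of four decimal places for the score are assumptions made for illustration.

```python
def format_run(topic_id, scored_pages, run_id):
    """Format (page_id, score) pairs as TREC-style run lines.

    Results are sorted by decreasing score, and ranks start at 1.
    """
    ordered = sorted(scored_pages, key=lambda pair: pair[1], reverse=True)
    lines = []
    for rank, (page_id, score) in enumerate(ordered, start=1):
        # Fields: <qid> Q0 <file> <rank> <rsv> <run_id>
        lines.append(f"{topic_id} Q0 {page_id} {rank} {score:.4f} {run_id}")
    return lines


if __name__ == "__main__":
    results = [("997", 0.9998), ("12", 0.9999), ("9989", 0.9997)]
    print("\n".join(format_run("2012001", results, "2012UniXRun1")))
```

Run on the three pages above, the sketch reproduces the example submission line by line.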
An example submission is:
<fv f="dbpedia-owl:date" v="1955-11-01">
  <fv f="dbpedia-owl:place" v="dbpedia:South_Vietnam">
    <fv f="rdf:type" v="dbpedia-owl:MilitaryConflict"/>
    <fv f="rdf:type" v="dbpedia-owl:Country"/>
  </fv>
  <fv f="dbpedia-owl:place" v="dbpedia:North_Vietnam">
    <fv f="rdbpprob:capital" v="dbpedia:Ho_Chi_Minh_City"/>
  </fv>
</fv>
Here, for topic "2012001", the search system first recommends the facet-value condition "dbpedia-owl:date=1955-11-01" alongside its sibling facet-value conditions. If the user selects this condition to refine the query, the system recommends a new list of facet-value conditions, namely "dbpedia-owl:place=dbpedia:South_Vietnam" and "dbpedia-owl:place=dbpedia:North_Vietnam". If the user then selects "dbpedia-owl:place=dbpedia:North_Vietnam", the system recommends the facet-value condition "rdbpprob:capital=dbpedia:Ho_Chi_Minh_City". Note that no facet-value condition occurs twice on any path in the hierarchy.
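The constraint that no facet-value condition repeats on a path can be checked mechanically before submitting. The Python sketch below assumes the hierarchy is serialized as nested fv elements, as the description above implies; the enclosing topic element with its tid attribute is a hypothetical wrapper added only to make the snippet well-formed, and the helper names are likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed serialization of the example hierarchy.
EXAMPLE = """
<topic tid="2012001">
  <fv f="dbpedia-owl:date" v="1955-11-01">
    <fv f="dbpedia-owl:place" v="dbpedia:South_Vietnam">
      <fv f="rdf:type" v="dbpedia-owl:MilitaryConflict"/>
      <fv f="rdf:type" v="dbpedia-owl:Country"/>
    </fv>
    <fv f="dbpedia-owl:place" v="dbpedia:North_Vietnam">
      <fv f="rdbpprob:capital" v="dbpedia:Ho_Chi_Minh_City"/>
    </fv>
  </fv>
</topic>
"""


def paths(node, prefix=()):
    """Yield every root-to-leaf path of (facet, value) conditions below node."""
    for child in node.findall("fv"):
        path = prefix + ((child.get("f"), child.get("v")),)
        if child.find("fv") is None:
            yield path  # leaf condition: the path ends here
        else:
            yield from paths(child, path)


def has_repeated_condition(path):
    """True if some facet-value condition occurs more than once on the path."""
    return len(set(path)) != len(path)


if __name__ == "__main__":
    root = ET.fromstring(EXAMPLE.strip())
    for path in paths(root):
        status = "INVALID" if has_repeated_condition(path) else "ok"
        print(status, " -> ".join(f"{f}={v}" for f, v in path))
```

Each printed path corresponds to one possible sequence of user selections, so the check covers exactly the navigation sessions the hierarchy allows.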
The new Jeopardy task investigates retrieval techniques over a set of natural-language Jeopardy clues, which have been manually translated into SPARQL query patterns and enhanced with keyword-based filter conditions. Specifically, we investigate a data model in which every entity (in DBpedia or YAGO2) is associated with the Wikipedia article (contained in the Wikipedia-LOD v1.1 collection) that describes this entity.
For example, topic no. 2012301 from the current set of Jeopardy topics looks as follows:
<topic id="2012301" category="LAKES">
The <jeopardy_clue> element contains the original Jeopardy clue as a natural-language sentence; the <keyword_title> element contains a set of keywords that has been manually extracted from this clue and will be reused as part of the ad-hoc task; and the <sparql_ft> element contains a translation of the natural-language sentence into a corresponding SPARQL pattern. The category attribute of the <topic> element may be used as an additional hint for disambiguating the query.
In the above query, the DBpedia entity http://dbpedia.org/resource/Niagara_Falls has been marked as the subject of the first triple pattern, while the object of the first triple pattern and both the subject and object of the second triple pattern are unknown. The two FTContains filter conditions, however, restrict these subjects and objects to entities that should be associated with the keywords "river water course niagara" and "lake origin", respectively, via their corresponding Wikipedia articles.
Since this particular variant of SPARQL with full-text filter conditions cannot be run against a standard RDF collection (such as DBpedia or YAGO2) alone, participants are encouraged to develop individual solutions that index both the RDF and textual contents of the Wikipedia-LOD collection in order to process these queries.
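To make the idea of combining triple patterns with full-text conditions concrete, here is a toy Python sketch over an in-memory data set. Everything in it is an assumption made for illustration: the `ex:` entities, predicates and article texts are invented, the two patterns are assumed to join on the unknown object of the first pattern, and the keyword semantics (fraction of keywords found in the article) is just one possible reading of FTContains, not the track's definition.

```python
# Hypothetical toy data: RDF-like triples plus one article text per entity.
TRIPLES = [
    ("ex:Falls", "ex:p1", "ex:River"),
    ("ex:Falls", "ex:p1", "ex:Town"),
    ("ex:River", "ex:p2", "ex:Lake"),
    ("ex:Town", "ex:p2", "ex:Region"),
]
ARTICLES = {
    "ex:River": "a river whose water course passes the falls",
    "ex:Town": "a town near the falls",
    "ex:Lake": "the lake that is the origin of the river",
    "ex:Region": "an administrative region",
}


def keyword_score(entity, keywords, articles):
    """Fraction of the keywords that occur in the entity's article (assumed semantics)."""
    words = set(articles.get(entity, "").lower().split())
    terms = keywords.lower().split()
    return sum(1 for term in terms if term in words) / len(terms)


def answer(triples, articles, start, kw_mid, kw_target):
    """Evaluate (start ?p ?mid) . (?mid ?q ?target) with a keyword filter on ?mid and ?target."""
    results = {}
    for s1, _, mid in triples:
        if s1 != start:
            continue
        mid_score = keyword_score(mid, kw_mid, articles)
        if mid_score == 0:
            continue  # filter on ?mid fails
        for s2, _, target in triples:
            if s2 != mid:
                continue
            target_score = keyword_score(target, kw_target, articles)
            if target_score == 0:
                continue  # filter on ?target fails
            score = mid_score * target_score
            results[target] = max(score, results.get(target, 0.0))
    # Rank candidate answers by decreasing combined score.
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(answer(TRIPLES, ARTICLES, "ex:Falls",
                 "river water course niagara", "lake origin"))
```

A real system would replace the nested loops with an RDF index and the word-set lookup with a full-text index over the Wikipedia-LOD articles; the sketch only shows how structural joins and textual scores can feed one ranked list.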
An XML file with 90 Jeopardy-style topics is available here:
Again, each participating group may submit up to 3 runs. Each run can contain a maximum of 1000 results per topic, ordered by decreasing relevance (although we expect most topics to have just one or a few entities or sets of entities as targets). The results of one run must be contained in one submission file (i.e., up to 3 files can be submitted in total). For relevance assessment and evaluation of the results, we require submission files to be in the familiar TREC format, with each row containing the target entities (denoted by their Wikipedia page IDs) of one query result. Each row of target entities must reflect the order of query variables as specified by the SELECT clause of the Jeopardy topic. If the SELECT clause contains more than one query variable, the row should consist of a comma- or semicolon-separated list of target entity IDs.
<qid> Q0 <file> <rank> <rsv> <run_id>
An example submission is:
2012301 Q0 12 1 0.9999 2012UniXRun1
2012301 Q0 997 2 0.9998 2012UniXRun1
2012301 Q0 9989 3 0.9997 2012UniXRun1
Here are three results for topic "2012301". The first result is the entity (i.e. Wikipedia page) with ID "12". The second result is the entity with ID "997", and the third result is the entity with ID "9989".
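For topics whose SELECT clause has more than one variable, the target field becomes a separator-joined list of page IDs. The Python sketch below shows one way to build such rows; the function name is hypothetical, and joining the IDs without spaces (so that each line still splits into six whitespace-separated fields) is an assumption rather than a stated requirement.

```python
def format_jeopardy_run(topic_id, ranked_rows, run_id, max_results=1000, separator=","):
    """Format ranked answer rows as TREC-style lines.

    ranked_rows holds (entity_ids, score) pairs, already sorted by decreasing
    score; entity_ids follows the order of the topic's SELECT variables.
    """
    lines = []
    for rank, (entity_ids, score) in enumerate(ranked_rows[:max_results], start=1):
        # Join multi-variable answers without spaces to keep six fields per line.
        target = separator.join(str(entity_id) for entity_id in entity_ids)
        lines.append(f"{topic_id} Q0 {target} {rank} {score:.4f} {run_id}")
    return lines


if __name__ == "__main__":
    rows = [((12, 997), 0.9999), ((9989,), 0.9998)]
    print("\n".join(format_jeopardy_run("2012301", rows, "2012UniXRun1")))
```

The `max_results` default mirrors the limit of 1000 results per topic; passing `separator=";"` yields the semicolon-separated variant.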
Relevance assessments for the Jeopardy task will be conducted by the groups participating in this task. All submitted results for each query will be pooled, and assessors will be asked to identify all relevant results in the pool using the INEX assessment tool.
In addition, all keyword titles from the Jeopardy topics will be added to the set of topics for the ad-hoc search task. Thus they can be assessed in the same way as the other ad-hoc search topics.
The effectiveness of the retrieval results submitted by the participants will be evaluated using classical IR metrics such as MAP, P@5, P@10, and NDCG. In addition, we will explore different measures known from entity-centric retrieval.
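For orientation, the classical metrics named above can be computed as in the following Python sketch. It assumes binary relevance judgments and the common log2 discount for NDCG; the official evaluation may use different settings, and the function names are illustrative only.

```python
import math


def precision_at_k(ranked, relevant, k):
    """P@k: share of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """AP: mean of the precision values at the ranks of the relevant results."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """MAP: average of AP over all topics in runs (topic -> ranked list)."""
    scores = [average_precision(ranked, qrels.get(topic, set()))
              for topic, ranked in runs.items()]
    return sum(scores) / len(scores) if scores else 0.0


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary gains and a log2(rank + 1) discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranked = ["12", "997", "9989"]
    relevant = {"12", "9989"}
    print(precision_at_k(ranked, relevant, 2))
    print(average_precision(ranked, relevant))
    print(ndcg_at_k(ranked, relevant, 3))
```

With graded assessments, the binary gain of 1.0 in `ndcg_at_k` would be replaced by the assessed relevance grade of each result.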