INEX Document Collections

Available Collections

Current

Books and Social Search (2011-)
Linked Data (2012-)
Question Answering / Tweet Contextualization (2008-)
Snippet Retrieval (2011-)
Wikipedia Collection (2009-)

Old

Relevance Feedback (2010-2012)
Data-Centric Track (2010-2011)
Web Services (2010)
XML Mining (2010)

Adhoc Track (2009-2010) and Wikipedia Collection (2009-)

Documents

This collection is a dump of 2,666,190 Wikipedia articles taken on 8 October 2008 and annotated with the 2008-w40-2 version of YAGO. It is 50.7GB in size and was prepared by Ralf Schenkel. For details please see (and cite): Ralf Schenkel, Fabian M. Suchanek, Gjergji Kasneci (2007): YAWN: A semantically annotated Wikipedia XML corpus, 12. GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web (BTW 2007), Aachen, Germany, March 2007.

Europe / Atlantic

Get the INEX Wikipedia 2009 collection from Max-Planck-Institut Informatik

Topics

2010 Topics 2010001-2010107 (v1.0)
2009 Topics 2009001-2009115 (v1.0)

Assessments

2010 README files for evaluation, measures, formats, sub2fol, and inex_eval
2010 Assessments (passage level and document level qrels)
2009 README files for evaluation, measures, formats, sub2fol, and inex_eval
2009 Assessments (passage level qrels only)
2009 Assessments and 2009 inex_eval evaluation tool
2009 sub2fol for all tasks [1.1GB] (you only need this if you use XPath in your results; it is 1GB in size and requires >4GB of memory to run)
2009 sub2fol for in-context tasks [28MB] (you only need this if you use XPath in your results)

Reference Runs

2010 BM25 reference run (CO)
2009 BM25 reference run (CO)

Helpful Stuff

Tags, DocumentFrequency, CollectionFrequency

Books and Social Search / Social Book Search Track (2011, 2012, 2013, 2014)

Documents

2011 Book Collection (from Amazon and LibraryThing, 7.1GiB)
A Corpus License Agreement is needed; see here for further information on how to access the collection.

Topics

2014 topics
2014 user profiles
2013 Prove It! Task guidelines
2013 Social Book Search Task user profiles (V2, May 3)
2013 Social Book Search Task official topic set
2012 Social Book Search Task official topic set
2012 Social Book Search Task user profiles
2012 Social Book Search Task guidelines (V3, updated May 10, 2012)
2011 Prove It! task official topics
Training material for 2011 Social Search for Best Books: This is a set of training topics+relevance judgements we've prepared for the Social Search for Best Books (SSBB) task. This is not the official topic set. The relevance judgements are taken unedited from the LibraryThing discussion forums, so use this set for sanity-checking your setup only.
2011 Social Search for Best Books official topic set

Evaluation

official qrels for the 2014 Suggestion task.
official qrels for the 2013 Social Book Search task.
official qrels for the 2012 Social Book Search task.
unofficial qrels over all 300 topics of the INEX 2012 Social Book Search task.
official qrels for the INEX 2012 Prove It! task.
official qrels for 2011 using the LibraryThing work ID of books suggested in the LT discussion threads. The ISBNs in the submitted runs are mapped to LT work IDs as well, with the highest ranked ISBN being mapped to the work ID and lower ranked ISBNs mapped to the same ID removed from the results list.
expanded qrels for 2011 with the work IDs expanded to all matching ISBNs. Multiple search results mapping to the same ID all contribute to the score.
qrels for 2011 derived from the Mechanical Turk relevance judgements, for 24 of the 211 topics.
Perl script to map ISBNs to IDs in a run. This script requires amazon-lt.isbn.thingID.gz for the mappings.
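
The mapping described for the official 2011 qrels (the highest-ranked ISBN keeps the work ID, lower-ranked ISBNs with the same ID are removed) can be illustrated with a minimal Python sketch. This is not the Perl script linked above. The function name, the dictionary-based mapping, and the toy identifiers are all hypothetical, and the handling of ISBNs without a mapping (they are skipped here) is an assumption rather than documented behaviour.

```python
from typing import Dict, Iterable, List


def map_run_to_work_ids(ranked_isbns: Iterable[str],
                        isbn_to_work: Dict[str, str]) -> List[str]:
    """Collapse a ranked ISBN list into a ranked list of unique work IDs."""
    seen = set()
    ranked_works = []
    for isbn in ranked_isbns:
        work_id = isbn_to_work.get(isbn)
        if work_id is None:
            continue  # assumption: ISBNs without a mapping are skipped
        if work_id in seen:
            continue  # a higher-ranked ISBN already claimed this work ID
        seen.add(work_id)
        ranked_works.append(work_id)
    return ranked_works


# Hypothetical toy data, not taken from the real mapping file.
toy_mapping = {"isbn-a": "work-1", "isbn-b": "work-1", "isbn-c": "work-2"}
toy_run = ["isbn-a", "isbn-c", "isbn-b", "isbn-x"]
result = map_run_to_work_ids(toy_run, toy_mapping)
print(result)
```

In the toy run, the third entry maps to a work ID that already appeared at rank 1, so it is dropped. The expanded 2011 qrels take the opposite approach and let every matching ISBN contribute to the score.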

Linked Data Track (2012, 2013)

Wikipedia-LOD Collection V2.0 (2013)

See the track page for the official datasets. The auxiliary Wikipedia-LOD 2.0 collection is available from here (MPI Informatik server).

Wikipedia-LOD Collection V1.1 (2012)

The Wikipedia-LOD collection is available from here (MPI Informatik server). It consists of 8 files in 7z format and contains approximately 2.7 million XML articles. The uncompressed size of the collection is 60 GB.
A DTD for the XML collection is available here.

Topics

2013 topics for the Adhoc Task
2013 topics for the Jeopardy Task (V2, as of March 17, 2013)
2012 topics for the Adhoc Task
2012 topics for the Faceted Task
2012 topics for the Jeopardy Task

Evaluation

2013 qrels (Adhoc track)
2013 qrels (Jeopardy track)
2013 evaluation results (Adhoc track)
2013 evaluation results (Jeopardy track)
2012 qrels (Adhoc and Jeopardy tracks)
2012 evaluation results

Data-Centric (2010, 2011)

IMDB Collection (2010-2011)

2010 IMDB Collection (1.4GB) cleaned for use with INEX 2010 toolset
Information courtesy of The Internet Movie Database (http://www.imdb.com). Used with permission.
This collection consists of the IMDB plain-text files from the web site, dated 2010-04-23 and converted into XML.
It is available for personal and non-commercial use. See the IMDb Licence.

Evaluation

2010 assessment tool
2011 qrels (adhoc track)
2011 article-level qrels (adhoc track)

Topics

2011 topics for the Adhoc Task (updated 03-Aug-2011)
2011 topics for the Faceted Task
2010 topics

Relevance Feedback (2010, 2011, 2012)

Software

2012 Evaluation Platform
2011 Evaluation Platform Java software

Topics

2011 topics

Qrels

2011 qrels

Useful Stuff

Focused Relevance Feedback presentation by Timothy Chappell

Snippet Retrieval (2011, 2012, 2013)

Topics

2013 Topics
2013 Reference run for the topics (V2, May 1st)
2012 Topics
2012 Reference run for the topics
2011 Training topics
2011 Reference run for the training topics
2011 Topics
2011 Reference run for the topics
Assessment Tool

Tweet Contextualization (2011, 2012, 2013, 2014)

Data

The data collection is available from http://qa.termwatch.es/data (username: inex, password: inexqa2011).

XML Mining Track (2005-2010)

2010

2010 Tags and Trees
2010 Links
2010 Entities
2010 Bag of Bi-grams
2010 Bag of Stemmed Words
2010 Bag of Stemmed Words and Bi-grams (concatenated)
2010 Training Labels (20% of the labels, for classification training)
