Ad-hoc Information Retrieval

Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection. Searches can be based on full-text or other content-based indexing. Here, ad-hoc information retrieval refers specifically to text-based retrieval in which the documents in the collection remain relatively static while new queries are submitted to the system continually (definition cited from the survey).
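The ad-hoc setting described above (a collection that is indexed once, with queries arriving afterwards) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the toy `DOCS` collection, the function names `build_index` and `search`, and the simple TF-IDF scoring, which is not the method of any model listed on this page.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy collection; real benchmarks hold millions of documents.
DOCS = {
    "d1": "neural ranking models for web search",
    "d2": "stock markets fall on news of rate rise",
    "d3": "deep relevance matching for ad-hoc retrieval",
}

def build_index(docs):
    """Index the static collection once: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def search(index, n_docs, query, k=10):
    """Score documents by the summed TF-IDF of the query terms they contain."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur anywhere in the collection
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# The index is built once; each new query reuses it.
INDEX = build_index(DOCS)
for q in ["relevance matching", "web search"]:
    print(q, search(INDEX, len(DOCS), q))
```

The point of the sketch is only the division of work: indexing happens once because the collection is static, while scoring runs for every incoming query.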

The number of queries submitted can be very large. Some benchmark datasets are listed below.

Classic Datasets

| Dataset | Genre | #Queries | #Documents |
|---|---|---|---|
| Robust04 | news | 250 | 0.5M |
| ClueWeb09-Cat-B | web | 150 | 50M |
| Gov2 | .gov pages | 150 | 25M |
| MS MARCO (Document Ranking) | web pages | 367,013 | 3.2M |
| MQ2007 | .gov pages | 1692 | 25M |
| MQ2008 | .gov pages | 784 | 25M |

  • Robust04 is a small news dataset which contains about 0.5 million documents in total. The queries are collected from the TREC Robust Track 2004. There are 250 queries in total.

  • ClueWeb09 is a large Web collection which contains about 34 million documents in total. The queries are accumulated from TREC Web Tracks 2009, 2010, and 2011. There are 150 queries in total.

  • Gov2 is a large Web collection whose pages are crawled from .gov domains. It consists of 25 million documents in total. The queries are accumulated over TREC Terabyte Tracks 2004, 2005, and 2006. There are 150 queries in total.

  • MS MARCO (Document Ranking) provides a large number of informational question-style queries from Bing’s search logs. The passages are annotated by humans with relevant/non-relevant labels. There are 8,841,822 documents in total. There are 808,731 queries, 6,980 queries and 48,598 queries for train, validation and test, respectively.

  • Million Query TREC 2007 (MQ2007) is a LETOR benchmark dataset which uses the Gov2 web collection. There are 1692 queries in MQ2007 with 65,323 labeled documents.

  • Million Query TREC 2008 (MQ2008) is another LETOR benchmark dataset which also uses the Gov2 web collection. There are 784 queries in MQ2008 with 14,384 labeled documents.

Neural Models

Robust04

Model Code MAP P@20 nDCG@20 Paper
DSSM MatchZoo 0.095 0.171 0.201 Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, CIKM 2013
CDSSM MatchZoo 0.067 0.125 0.146 Learning Semantic Representations Using Convolutional Neural Networks for Web Search, WWW 2014
ARC-I MatchZoo 0.041 0.065 0.066 Convolutional Neural Network Architectures for Matching Natural Language Sentences, NIPS 2014
ARC-II MatchZoo 0.067 0.128 0.147 Convolutional Neural Network Architectures for Matching Natural Language Sentences, NIPS 2014
DRMM official MatchZoo 0.279 0.431 0.382 A Deep Relevance Matching Model for Ad-hoc Retrieval, CIKM 2016
KNRM official MatchZoo 0.352 0.409 End-to-End Neural Ad-hoc Ranking with Kernel Pooling, SIGIR 2017
CONV-KNRM MatchZoo 0.416 Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search, WSDM 2018
BERT-MaxP official 0.469 Deeper Text Understanding for IR with Contextual Neural Language Modeling, SIGIR 2019
CEDR-DRMM official 0.459 0.526 CEDR: Contextualized Embeddings for Document Ranking, SIGIR 2019
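
The MAP, P@20 and nDCG@20 columns above are standard ranking metrics. The Python sketch below shows one common way to compute them for a single query and then average over queries. The function names are hypothetical, and the nDCG gain formula (`2**g - 1` with a log2 discount) is an assumption: the cited papers may use a different variant or an evaluation tool such as trec_eval, so this is not guaranteed to reproduce the reported numbers.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant document."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(ranked, gains, k):
    """DCG of the top-k ranking divided by the DCG of an ideal ordering.

    `gains` maps doc_id -> graded relevance label; unlabeled docs count as 0.
    """
    def dcg(gs):
        return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gs))
    actual = dcg([gains.get(d, 0) for d in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

def mean_average_precision(runs):
    """MAP: average precision averaged over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical single-query example.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
print(ndcg_at_k(ranked, {"a": 1, "c": 1}, 4))
```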

ClueWeb09-B

Model Code MAP P@20 nDCG@20 Paper
DSSM MatchZoo 0.054 0.185 0.132 Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, CIKM 2013
CDSSM MatchZoo 0.064 0.214 0.153 Learning Semantic Representations Using Convolutional Neural Networks for Web Search, WWW 2014
ARC-I MatchZoo 0.024 0.089 0.073 Convolutional Neural Network Architectures for Matching Natural Language Sentences, NIPS 2014
ARC-II MatchZoo 0.033 0.123 0.087 Convolutional Neural Network Architectures for Matching Natural Language Sentences, NIPS 2014
DRMM official MatchZoo 0.133 0.365 0.258 A Deep Relevance Matching Model for Ad-hoc Retrieval, CIKM 2016
CONV-KNRM MatchZoo 0.270 Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search, WSDM 2018
BERT-MaxP official 0.289 Deeper Text Understanding for IR with Contextual Neural Language Modeling, SIGIR 2019

MS MARCO (Document Ranking)

| Model | Code | MRR@10 | nDCG@10 | Recall@10 | Paper |
|---|---|---|---|---|---|
| MatchPyramid | official, MatchZoo | 0.286 | 0.344 | 0.531 | Text Matching as Image Recognition, AAAI 2016 |
| Duet | official, MatchZoo | 0.266 | 0.327 | 0.520 | Learning to Match using Local and Distributed Representations of Text for Web Search, WWW 2017 |
| Co-PACRR | official, MatchZoo | 0.284 | 0.345 | 0.543 | Co-PACRR: A Context-Aware Neural IR Model for Ad-hoc Retrieval, WSDM 2018 |
| KNRM | official, MatchZoo | 0.261 | 0.323 | 0.519 | End-to-End Neural Ad-hoc Ranking with Kernel Pooling, SIGIR 2017 |
| CONV-KNRM | MatchZoo | 0.283 | 0.345 | 0.542 | Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search, WSDM 2018 |
| BERT | | 0.352 | 0.417 | 0.623 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019 |
| Transformer-Kernel | | 0.316 | 0.380 | 0.586 | Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking, arXiv 2020 |
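
The MRR@10 and Recall@10 columns above can be computed along the following lines. This is a minimal Python sketch with hypothetical function names; the official MS MARCO evaluation script may handle edge cases such as queries without relevant documents differently.

```python
def reciprocal_rank_at_k(ranked, relevant, k=10):
    """1/rank of the first relevant document within the top k, else 0."""
    for rank, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked, relevant, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mean_over_queries(metric, runs, k=10):
    """Average a per-query metric over (ranked, relevant) pairs, one per query."""
    return sum(metric(r, rel, k) for r, rel in runs) / len(runs)

# Hypothetical runs for two queries.
runs = [(["x", "y", "z"], {"y"}), (["p", "q"], {"q", "r"})]
print(mean_over_queries(reciprocal_rank_at_k, runs))  # MRR@10
print(mean_over_queries(recall_at_k, runs))           # Recall@10
```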

MQ2007

| Model | Code | MAP | P@10 | nDCG@10 | Paper |
|---|---|---|---|---|---|
| DSSM | MatchZoo | 0.409 | 0.352 | 0.371 | Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, CIKM 2013 |
| CDSSM | MatchZoo | 0.364 | 0.291 | 0.325 | Learning Semantic Representations Using Convolutional Neural Networks for Web Search, WWW 2014 |
| ARC-I | MatchZoo | 0.417 | 0.364 | 0.386 | Convolutional Neural Network Architectures for Matching Natural Language Sentences, NIPS 2014 |
| ARC-II | MatchZoo | 0.421 | 0.366 | 0.390 | Convolutional Neural Network Architectures for Matching Natural Language Sentences, NIPS 2014 |
| DRMM | official, MatchZoo | 0.467 | 0.388 | 0.440 | A Deep Relevance Matching Model for Ad-hoc Retrieval, CIKM 2016 |
| MatchPyramid | official, MatchZoo | 0.434 | 0.371 | 0.409 | Text Matching as Image Recognition, AAAI 2016 |
| Duet | official, MatchZoo | 0.474 | 0.398 | 0.453 | Learning to Match using Local and Distributed Representations of Text for Web Search, WWW 2017 |
| DeepRank | official, MatchZoo | 0.497 | 0.412 | 0.482 | DeepRank: A New Deep Architecture for Relevance Ranking in Information Retrieval, CIKM 2017 |
| HiNT | official, MatchZoo | 0.502 | 0.447 | 0.490 | Modeling Diverse Relevance Patterns in Ad-hoc Retrieval, SIGIR 2018 |

MQ2008

| Model | Code | MAP | P@10 | nDCG@10 | Paper |
|---|---|---|---|---|---|
| DSSM | MatchZoo | 0.391 | 0.221 | 0.178 | Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, CIKM 2013 |
| CDSSM | MatchZoo | 0.395 | 0.222 | 0.175 | Learning Semantic Representations Using Convolutional Neural Networks for Web Search, WWW 2014 |
| ARC-I | MatchZoo | 0.424 | 0.311 | 0.187 | Convolutional Neural Network Architectures for Matching Natural Language Sentences, NIPS 2014 |
| ARC-II | MatchZoo | 0.421 | 0.229 | 0.181 | Convolutional Neural Network Architectures for Matching Natural Language Sentences, NIPS 2014 |
| DRMM | official, MatchZoo | 0.473 | 0.245 | 0.220 | A Deep Relevance Matching Model for Ad-hoc Retrieval, CIKM 2016 |
| MatchPyramid | official, MatchZoo | 0.449 | 0.239 | 0.211 | Text Matching as Image Recognition, AAAI 2016 |
| Duet | official, MatchZoo | 0.476 | 0.240 | 0.216 | Learning to Match using Local and Distributed Representations of Text for Web Search, WWW 2017 |
| DeepRank | official, MatchZoo | 0.498 | 0.252 | 0.240 | DeepRank: A New Deep Architecture for Relevance Ranking in Information Retrieval, CIKM 2017 |
| HiNT | official, MatchZoo | 0.505 | 0.255 | 0.244 | Modeling Diverse Relevance Patterns in Ad-hoc Retrieval, SIGIR 2018 |
