Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection. Searches can be based on full-text or other content-based indexing. Here, ad-hoc information retrieval refers in particular to text-based retrieval where the documents in the collection remain relatively static while new queries are continually submitted to the system (as described in the survey).
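The ad-hoc setting described above (a static collection scored against a stream of incoming queries) can be sketched with a minimal BM25 ranker. The toy collection, queries, and parameter values below are illustrative assumptions, not taken from any of the benchmarks listed here.

```python
import math
from collections import Counter

# A small, static document collection (hypothetical examples).
docs = {
    "d1": "trec robust track news retrieval",
    "d2": "web pages crawled from gov domains",
    "d3": "bing search logs with question style queries",
}

k1, b = 1.5, 0.75  # common default BM25 parameters

# Precompute statistics once, since the collection does not change.
tokenized = {d: text.split() for d, text in docs.items()}
avgdl = sum(len(t) for t in tokenized.values()) / len(tokenized)
N = len(tokenized)
df = Counter(term for toks in tokenized.values() for term in set(toks))

def bm25(query, doc_id):
    """BM25 score of one document for one query."""
    toks = tokenized[doc_id]
    tf = Counter(toks)
    score = 0.0
    for term in query.split():
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
    return score

def search(query):
    # Each new query ranks the whole static collection.
    return sorted(docs, key=lambda d: bm25(query, d), reverse=True)

print(search("news retrieval"))  # 'd1' ranks first
```

Because the collection is static, the document statistics (document frequencies, average length) are computed once up front; only the per-query scoring loop runs at query time, which is what makes the ad-hoc setting tractable at benchmark scale.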
The number of queries is huge. Some benchmark datasets are listed below:

| Dataset | Genre | #Queries | #Documents |
| --- | --- | --- | --- |
| Robust04 | news | 250 | 0.5M |
| ClueWeb09 | web pages | 150 | 34M |
| Gov2 | .gov web pages | 150 | 25M |
| MS MARCO (Document Ranking) | web pages | 367,013 | 3.2M |
| MQ2007 | .gov web pages | 1,692 | 65,323 (labeled) |
| MQ2008 | .gov web pages | 784 | 14,384 (labeled) |
Robust04 is a small news dataset containing about 0.5 million documents. The queries are collected from the TREC Robust Track 2004; there are 250 queries in total.
ClueWeb09 is a large Web collection containing about 34 million documents. The queries are collected from the TREC Web Tracks 2009, 2010, and 2011; there are 150 queries in total.
Gov2 is a large Web collection whose pages are crawled from the .gov domain. It consists of 25 million documents. The queries are collected from the TREC Terabyte Tracks 2004, 2005, and 2006; there are 150 queries in total.
MS MARCO (Document Ranking) provides a large number of informational, question-style queries from Bing’s search logs. The passages are annotated by humans with relevant/non-relevant labels. There are 8,841,822 documents in total, with 808,731 queries for training, 6,980 for validation, and 48,598 for testing.
Million Query TREC 2007 (MQ2007) is a LETOR benchmark dataset built on the Gov2 web collection. It contains 1,692 queries with 65,323 labeled documents.
Million Query TREC 2008 (MQ2008) is another LETOR benchmark dataset, also built on the Gov2 web collection. It contains 784 queries with 14,384 labeled documents.