Community Question Answering

Community Question Answering (CQA) aims to automatically search for relevant answers among the many responses provided for a given question (Answer Selection), and to search for relevant questions in order to reuse their existing answers (Question Retrieval).
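
For Answer Selection, a system scores each candidate answer against the question and returns a ranking. A minimal sketch using a non-stopword-overlap scorer (a toy lexical baseline for illustration only, not any particular published model; the stopword list is an ad-hoc assumption):

```python
import re

def rank_answers(question, candidates):
    """Rank candidate answers by non-stopword overlap with the question.

    A toy lexical baseline; the neural models benchmarked below learn
    far richer matching signals than raw word overlap.
    """
    # Ad-hoc stopword list, for illustration only.
    stopwords = {"the", "a", "an", "is", "of", "to", "in", "and", "what", "how"}

    def content_words(text):
        return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords}

    q_words = content_words(question)

    def score(answer):
        a_words = content_words(answer)
        # Fraction of the answer's content words shared with the question.
        return len(q_words & a_words) / (len(a_words) or 1)

    return sorted(candidates, key=score, reverse=True)

ranked = rank_answers(
    "What is the capital of France?",
    ["Berlin is the capital of Germany.", "Paris is the capital of France."],
)
```

This overlap-counting idea is essentially how the TRECQA candidate pool was originally filtered, though modern systems replace it with learned matching models.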

Classic Datasets

| Dataset | Domain | #Question | #Answer |
| --- | --- | --- | --- |
| TRECQA | Open-domain | 1,229 | 53,417 |
| WikiQA | Open-domain | 3,047 | 29,258 |
| InsuranceQA | Insurance | 12,889 | 21,325 |
| FiQA | Financial | 6,648 | 57,641 |
| Yahoo! Answers | Open-domain | 50,112 | 253,440 |
| SemEval-2015 Task 3 | Open-domain | 2,600 | 16,541 |
| SemEval-2016 Task 3 | Open-domain | 4,879 | 36,198 |
| SemEval-2017 Task 3 | Open-domain | 4,879 | 36,198 |
  • The TRECQA dataset was created by Wang et al. from TREC QA track 8–13 data, with candidate answers automatically selected from each question’s document pool using a combination of overlapping non-stopword counts and pattern matching. It is one of the most widely used benchmarks for answer sentence selection.

  • WikiQA is a publicly available set of question and sentence pairs, collected and annotated by Microsoft Research for research on open-domain question answering.

  • InsuranceQA is a non-factoid QA dataset from the insurance domain. Questions may have multiple correct answers, and the questions are typically much shorter than the answers: the average lengths of questions and answers are 7 and 95 tokens, respectively. Each question in the development and test sets comes with a pool of 500 candidate answers.

  • FiQA is a non-factoid QA dataset from the financial domain, released for the WWW 2018 Challenges. It was built by crawling StackExchange, Reddit, and StockTwits; part of the questions are opinionated, targeting mined opinions and their respective entities, aspects, sentiment polarity, and opinion holders.

  • Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. The corpus used here is a snapshot of Yahoo! Answers as of 10/25/2007 and serves as a benchmark for community-based question answering. Its answers are considerably longer than those in TRECQA and WikiQA.

  • SemEval-2015 Task 3 consists of two sub-tasks. In Subtask A, given a question (short title + extended description) and several community answers, classify each answer as definitely relevant (good), potentially useful (potential), or bad/irrelevant (bad, dialog, non-English, other). In Subtask B, given a YES/NO question (short title + extended description) and a list of community answers, decide whether the global answer to the question should be yes, no, or unsure.

  • SemEval-2016 Task 3 consists of two sub-tasks, namely Question–Comment Similarity and Question–Question Similarity. In the Question–Comment Similarity task, given a question from a question–comment thread, rank the comments according to their relevance to the question. In the Question–Question Similarity task, given a new question, rerank all similar questions retrieved by a search engine.

  • SemEval-2017 Task 3 contains two sub-tasks, namely Question Similarity and Relevance Classification. Given a new question and a set of related questions from the collection, the Question Similarity task is to rank the related questions according to their similarity to the original question, while the Relevance Classification task is to rank the answer posts in a question–answer thread according to their relevance to the question.
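
For the question–question similarity setting above (rerank questions retrieved by a search engine against a new question), a minimal sketch using TF-IDF cosine similarity — an illustrative lexical baseline, not the official task baseline:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank_questions(new_question, retrieved):
    """Rerank retrieved questions by TF-IDF cosine similarity to the new question."""
    docs = [new_question] + retrieved
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    # Document frequency over this tiny collection.
    df = Counter(t for doc in tokenized for t in set(doc))

    def vec(tokens):
        tf = Counter(tokens)
        # Smoothed TF-IDF weight: tf * (log(n / df) + 1).
        return {t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()}

    qv = vec(tokenized[0])
    scored = [(cosine(qv, vec(tok)), q) for q, tok in zip(retrieved, tokenized[1:])]
    return [q for _, q in sorted(scored, key=lambda p: p[0], reverse=True)]

reranked = rerank_questions(
    "how do I reset my password",
    ["what is the weather today", "how to reset a password"],
)
```

Participating systems typically replace this lexical scorer with learned semantic-matching models, but the reranking setup is the same.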

Performance
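
The leaderboards below report Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR), averaged over per-question ranked candidate lists. A minimal sketch of how the two metrics are computed from binary relevance labels:

```python
def average_precision(rels):
    """rels: binary relevance labels of one question's candidates, in ranked order."""
    hits, precisions = 0, []
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            precisions.append(hits / i)  # precision at each relevant hit
    return sum(precisions) / len(precisions) if precisions else 0.0

def reciprocal_rank(rels):
    """1 / rank of the first relevant candidate (0 if none is relevant)."""
    for i, r in enumerate(rels, start=1):
        if r:
            return 1.0 / i
    return 0.0

def map_mrr(per_question_rels):
    """MAP and MRR averaged over all questions."""
    n = len(per_question_rels)
    return (sum(average_precision(r) for r in per_question_rels) / n,
            sum(reciprocal_rank(r) for r in per_question_rels) / n)

map_val, mrr_val = map_mrr([[1, 0, 1], [0, 1, 0]])  # -> (0.666..., 0.75)
```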

TREC QA (Raw Version)

| Model | Code | MAP | MRR | Paper |
| --- | --- | --- | --- | --- |
| Punyakanok (2004) | | 0.419 | 0.494 | Mapping Dependencies Trees: An Application to Question Answering, ISAIM 2004 |
| Cui (2005) | | 0.427 | 0.526 | Question Answering Passage Retrieval Using Dependency Relations, SIGIR 2005 |
| Wang (2007) | | 0.603 | 0.685 | What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA, EMNLP 2007 |
| H&S (2010) | | 0.609 | 0.692 | Tree Edit Models for Recognizing Textual Entailments, Paraphrases, and Answers to Questions, NAACL 2010 |
| W&M (2010) | | 0.595 | 0.695 | Probabilistic Tree-Edit Models with Structured Latent Variables for Textual Entailment and Question Answering, COLING 2010 |
| Yao (2013) | | 0.631 | 0.748 | Answer Extraction as Sequence Tagging with Tree Edit Distance, NAACL 2013 |
| S&M (2013) | | 0.678 | 0.736 | Automatic Feature Engineering for Answer Selection and Extraction, EMNLP 2013 |
| Backward (Shnarch et al., 2013) | | 0.686 | 0.754 | Probabilistic Models for Lexical Inference, Ph.D. thesis 2013 |
| LCLR (Yih et al., 2013) | | 0.709 | 0.770 | Question Answering Using Enhanced Lexical Semantic Models, ACL 2013 |
| bigram+count (Yu et al., 2014) | | 0.711 | 0.785 | Deep Learning for Answer Sentence Selection, NIPS 2014 |
| BLSTM (W&N et al., 2015) | | 0.713 | 0.791 | A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering, ACL 2015 |
| Architecture-II (Feng et al., 2015) | | 0.711 | 0.800 | Applying Deep Learning to Answer Selection: A Study and an Open Task, ASRU 2015 |
| PairCNN (Severyn et al., 2015) | official | 0.746 | 0.808 | Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks, SIGIR 2015 |
| aNMM (Yang et al., 2016) | official, MatchZoo | 0.750 | 0.811 | aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model, CIKM 2016 |
| HDLA (Tay et al., 2017) | official | 0.750 | 0.815 | Learning to Rank Question Answer Pairs with Holographic Dual LSTM Architecture, SIGIR 2017 |
| PWIM (He et al., 2016) | official | 0.758 | 0.822 | Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement, NAACL 2016 |
| MP-CNN (He et al., 2015) | official | 0.762 | 0.830 | Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks, EMNLP 2015 |
| HyperQA (Tay et al., 2017) | official | 0.770 | 0.825 | Enabling Efficient Question Answer Retrieval via Hyperbolic Neural Networks, WSDM 2018 |
| MP-CNN (Rao et al., 2016) | official | 0.780 | 0.834 | Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks, CIKM 2016 |
| HCAN (Rao et al., 2019) | | 0.774 | 0.843 | Bridging the Gap between Relevance Matching and Semantic Matching for Short Text Similarity Modeling, EMNLP 2019 |
| MP-CNN (Tayyar et al., 2018) | | 0.836 | 0.863 | Integrating Question Classification and Deep Learning for Improved Answer Selection, COLING 2018 |
| Pre-Attention (Kamath et al., 2019) | | 0.852 | 0.891 | Predicting and Integrating Expected Answer Types into a Simple Recurrent Neural Network Model for Answer Sentence Selection, CICLing 2019 |
| CETE (Laskar et al., 2020) | | 0.950 | 0.980 | Contextualized Embeddings based Transformer Encoder for Sentence Similarity Modeling in Answer Selection Task, LREC 2020 |

TREC QA (Clean Version)

| Model | Code | MAP | MRR | Paper |
| --- | --- | --- | --- | --- |
| W&I (2015) | | 0.746 | 0.820 | FAQ-based Question Answering via Word Alignment, arXiv 2015 |
| LSTM (Tan et al., 2015) | official | 0.728 | 0.832 | LSTM-Based Deep Learning Models for Nonfactoid Answer Selection, arXiv 2015 |
| AP-CNN (dos Santos et al., 2016) | | 0.753 | 0.851 | Attentive Pooling Networks, arXiv 2016 |
| L.D.C Model (Wang et al., 2016) | | 0.771 | 0.845 | Sentence Similarity Learning by Lexical Decomposition and Composition, COLING 2016 |
| MP-CNN (He et al., 2015) | official | 0.777 | 0.836 | Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks, EMNLP 2015 |
| HyperQA (Tay et al., 2017) | official | 0.784 | 0.865 | Enabling Efficient Question Answer Retrieval via Hyperbolic Neural Networks, WSDM 2018 |
| MP-CNN (Rao et al., 2016) | official | 0.801 | 0.877 | Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks, CIKM 2016 |
| BiMPM (Wang et al., 2017) | official, MatchZoo | 0.802 | 0.875 | Bilateral Multi-Perspective Matching for Natural Language Sentences, arXiv 2017 |
| CA (Bian et al., 2017) | official, MatchZoo | 0.821 | 0.899 | A Compare-Aggregate Model with Dynamic-Clip Attention for Answer Selection, CIKM 2017 |
| IWAN (Shen et al., 2017) | | 0.822 | 0.889 | Inter-Weighted Alignment Network for Sentence Pair Modeling, EMNLP 2017 |
| sCARNN (Tran et al., 2018) | | 0.829 | 0.875 | The Context-dependent Additive Recurrent Neural Net, NAACL 2018 |
| MCAN (Tay et al., 2018) | | 0.838 | 0.904 | Multi-Cast Attention Networks, KDD 2018 |
| MP-CNN (Tayyar et al., 2018) | | 0.865 | 0.904 | Integrating Question Classification and Deep Learning for Improved Answer Selection, COLING 2018 |
| CA + LM + LC (Yoon et al., 2019) | | 0.868 | 0.928 | A Compare-Aggregate Model with Latent Clustering for Answer Selection, CIKM 2019 |
| GSAMN (Lai et al., 2019) | official | 0.914 | 0.957 | A Gated Self-attention Memory Network for Answer Selection, EMNLP 2019 |
| TANDA (Garg et al., 2019) | official | 0.943 | 0.974 | TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection, AAAI 2020 |
| CETE (Laskar et al., 2020) | | 0.936 | 0.978 | Contextualized Embeddings based Transformer Encoder for Sentence Similarity Modeling in Answer Selection Task, LREC 2020 |

WikiQA

| Model | Code | MAP | MRR | Paper |
| --- | --- | --- | --- | --- |
| ABCNN (Yin et al., 2016) | official | 0.6921 | 0.7108 | ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs, TACL 2016 |
| Multi-Perspective CNN (Rao et al., 2016) | official | 0.701 | 0.718 | Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks, CIKM 2016 |
| HyperQA (Tay et al., 2017) | official | 0.705 | 0.720 | Enabling Efficient Question Answer Retrieval via Hyperbolic Neural Networks, WSDM 2018 |
| KVMN (Miller et al., 2016) | official | 0.7069 | 0.7265 | Key-Value Memory Networks for Directly Reading Documents, ACL 2016 |
| BiMPM (Wang et al., 2017) | official, MatchZoo | 0.718 | 0.731 | Bilateral Multi-Perspective Matching for Natural Language Sentences, IJCAI 2017 |
| IWAN (Shen et al., 2017) | | 0.733 | 0.750 | Inter-Weighted Alignment Network for Sentence Pair Modeling, EMNLP 2017 |
| CA (Wang and Jiang, 2017) | official | 0.7433 | 0.7545 | A Compare-Aggregate Model for Matching Text Sequences, ICLR 2017 |
| HCRN (Tay et al., 2018c) | MatchZoo | 0.7430 | 0.7560 | Hermitian Co-Attention Networks for Text Matching in Asymmetrical Domains, IJCAI 2018 |
| Compare-Aggregate (Bian et al., 2017) | official, MatchZoo | 0.748 | 0.758 | A Compare-Aggregate Model with Dynamic-Clip Attention for Answer Selection, CIKM 2017 |
| RE2 (Yang et al., 2019) | official, MatchZoo | 0.7452 | 0.7618 | Simple and Effective Text Matching with Richer Alignment Features, ACL 2019 |
| GSAMN (Lai et al., 2019) | official | 0.857 | 0.872 | A Gated Self-attention Memory Network for Answer Selection, EMNLP 2019 |
| TANDA (Garg et al., 2019) | official | 0.920 | 0.933 | TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection, AAAI 2020 |

Updated: