Paraphrase identification is the task of determining whether two sentences have the same meaning, a problem considered a touchstone of natural language understanding.
For example, the MRPC dataset contains the following pair of sentences with the same meaning:
sentence1: Amrozi accused his brother, whom he called “the witness”, of deliberately distorting his evidence.
sentence2: Referring to him as only “the witness”, Amrozi accused his brother of deliberately distorting his evidence.
Some benchmark datasets are listed below.
Classic Datasets
- MRPC is short for Microsoft Research Paraphrase Corpus. It contains 5,800 pairs of sentences which have been extracted from news sources on the web, along with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship.
- SentEval encompasses semantic relatedness datasets including SICK and the STS Benchmark. The SICK dataset includes two subtasks, SICK-R and SICK-E. For STS and SICK-R, the task is to predict relatedness scores between pairs of sentences. SICK-E uses the same sentence pairs as SICK-R but is treated as a three-class classification problem (classes are ‘entailment’, ‘contradiction’, and ‘neutral’).
- Quora Question Pairs is a task released by Quora which aims to identify duplicate questions. It consists of over 400,000 pairs of questions from Quora, and each pair is annotated with a binary value indicating whether the two questions are paraphrases of each other.
Neural matching models for paraphrase identification are listed below.
MRPC

| Model | Code | Accuracy | F1 | Paper |
| --- | --- | --- | --- | --- |
| XLNet-Large (ensemble) (Yang et al., 2019) | | 93.0 | 90.7 | XLNet: Generalized Autoregressive Pretraining for Language Understanding |
| MT-DNN-ensemble (Liu et al., 2019) | | 92.7 | 90.3 | Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding |
| Snorkel MeTaL (ensemble) (Ratner et al., 2018) | | 91.5 | 88.5 | Training Complex Models with Multi-Task Weak Supervision |
| GenSen (Subramanian et al., 2018) | | 78.6 | 84.4 | Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning |
| InferSent (Conneau et al., 2017) | | 76.2 | 83.1 | Supervised Learning of Universal Sentence Representations from Natural Language Inference Data |
| TF-KLD (Ji and Eisenstein, 2013) | — | 80.4 | 85.9 | Discriminative Improvements to Distributional Sentence Similarity |
| SpanBERT (Joshi et al., 2019) | | 90.9 | 87.9 | SpanBERT: Improving Pre-training by Representing and Predicting Spans |
| MT-DNN (Liu et al., 2019) | | 91.1 | 88.2 | Multi-Task Deep Neural Networks for Natural Language Understanding |
| AugDeepParaphrase (Agarwal et al., 2017) | — | 77.7 | 84.5 | A Deep Network Model for Paraphrase Detection in Short Text Messages |
| ERNIE (Zhang et al., 2019) | | 88.2 | — | ERNIE: Enhanced Language Representation with Informative Entities |
| This work (Lan and Xu, 2018) | — | 84.0 | — | Character-based Neural Networks for Sentence Pair Modeling |
| ABCNN (Yin et al., 2018) | | 78.9 | 84.8 | ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs |
| Attentive Tree-LSTMs (Zhou et al., 2016) | | 75.8 | 83.7 | Modelling Sentence Pairs with Tree-structured Attentive Encoder |
| Bi-CNN-MI (Yin and Schütze, 2015) | — | 78.1 | 84.4 | Convolutional Neural Network for Paraphrase Identification |
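Accuracy and F1 reported above are the standard binary-classification metrics, computed on gold paraphrase labels versus model predictions. A minimal sketch (with hypothetical toy labels; real evaluations typically use a library such as scikit-learn):

```python
# Accuracy and F1 for binary paraphrase labels (1 = paraphrase, 0 = not).
def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_score(gold, pred):
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))  # true positives
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))  # false positives
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: 6 sentence pairs.
gold = [1, 0, 1, 1, 0, 1]
pred = [1, 0, 0, 1, 1, 1]
print(accuracy(gold, pred))  # 4 of 6 correct -> 0.666...
print(f1_score(gold, pred))  # precision 3/4, recall 3/4 -> 0.75
```

Note that F1 can exceed accuracy (or vice versa) depending on class balance, which is why MRPC leaderboards report both.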
SentEval
The evaluation metric for STS and SICK-R is Pearson correlation.
The evaluation metric for SICK-E is classification accuracy.
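For reference, Pearson correlation between predicted and gold relatedness scores can be computed as follows (toy scores on a hypothetical 1–5 scale, standard library only):

```python
import math

def pearson(x, y):
    # Pearson correlation: covariance divided by the product of standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical predicted vs. gold relatedness scores.
predicted = [4.8, 2.1, 3.9, 1.2, 4.4]
gold      = [5.0, 2.0, 4.0, 1.0, 4.5]
print(round(pearson(predicted, gold), 4))
```

A correlation of 1.0 means the predicted ranking and spread of scores matches the gold annotations perfectly; published systems are usually compared by this value multiplied by 100.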
Quora Question Pairs

| Model | Code | F1 | Accuracy | Paper |
| --- | --- | --- | --- | --- |
| XLNet-Large (ensemble) (Yang et al., 2019) | | 74.2 | 90.3 | XLNet: Generalized Autoregressive Pretraining for Language Understanding |
| MT-DNN-ensemble (Liu et al., 2019) | | 73.7 | 89.9 | Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding |
| Snorkel MeTaL (ensemble) (Ratner et al., 2018) | | 73.1 | 89.9 | Training Complex Models with Multi-Task Weak Supervision |
| MwAN (Tan et al., 2018) | — | — | 89.12 | Multiway Attention Networks for Modeling Sentence Pairs |
| DIIN (Gong et al., 2018) | | — | 89.06 | Natural Language Inference over Interaction Space |
| pt-DecAtt (Char) (Tomar et al., 2017) | — | — | 88.40 | Neural Paraphrase Identification of Questions with Noisy Pretraining |
| BiMPM (Wang et al., 2017) | | — | 88.17 | Bilateral Multi-Perspective Matching for Natural Language Sentences |
| GenSen (Subramanian et al., 2018) | — | — | 87.01 | Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning |
| This work (Wang et al., 2016) | — | 78.4 | 84.7 | Sentence Similarity Learning by Lexical Decomposition and Composition |
| RE2 (Yang et al., 2019) | | — | 89.2 | Simple and Effective Text Matching with Richer Alignment Features |
| MSEM (Wang et al., 2016) | — | — | 88.86 | Multi-task Sentence Encoding Model for Semantic Retrieval in Question Answering Systems |
| Bi-CAS-LSTM (Choi et al., 2019) | — | — | 88.6 | Cell-aware Stacked LSTMs for Modeling Sentences |
| DecAtt (Parikh et al., 2016) | — | — | 86.5 | A Decomposable Attention Model for Natural Language Inference |