Paraphrase Identification

Paraphrase Identification is the task of determining whether two sentences have the same meaning, a problem considered a touchstone of natural language understanding.

Take an instance from the MRPC dataset as an example; the following is a pair of sentences with the same meaning:

sentence1: Amrozi accused his brother, whom he called “the witness”, of deliberately distorting his evidence.

sentence2: Referring to him as only “the witness”, Amrozi accused his brother of deliberately distorting his evidence.
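
To make the task concrete, the snippet below scores this pair with an off-the-shelf classifier through the Hugging Face transformers API. This is a minimal sketch, not a reference implementation: the checkpoint name and its label order (index 1 = paraphrase, as in GLUE's MRPC labels) are assumptions, and any model fine-tuned on MRPC could be substituted.

```python
# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint choice: any MRPC-fine-tuned classifier works here.
model_name = "textattack/bert-base-uncased-MRPC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

s1 = ("Amrozi accused his brother, whom he called \"the witness\", "
      "of deliberately distorting his evidence.")
s2 = ("Referring to him as only \"the witness\", Amrozi accused his "
      "brother of deliberately distorting his evidence.")

# Sentence pairs are encoded together so the model sees both segments.
inputs = tokenizer(s1, s2, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()
print(f"P(paraphrase) = {probs[1].item():.3f}")  # assumes index 1 = paraphrase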

Some benchmark datasets are listed below.

Classic Datasets

| Dataset | Sentence pairs |
| --- | --- |
| MRPC | 5,800 |
| STS | 1,750 |
| SICK-R | 9,840 |
| SICK-E | 9,840 |
| Quora Question Pairs | 404,290 |
  • MRPC is short for Microsoft Research Paraphrase Corpus. It contains 5,800 sentence pairs extracted from news sources on the web, along with human annotations indicating whether each pair captures a paraphrase/semantic-equivalence relationship.
  • SentEval encompasses semantic relatedness datasets including SICK and the STS Benchmark. The SICK dataset includes two subtasks, SICK-R and SICK-E. STS and SICK-R require predicting a relatedness score for each pair of sentences; SICK-E uses the same sentence pairs as SICK-R but is treated as a three-class classification problem (the classes are ‘entailment’, ‘contradiction’, and ‘neutral’).
  • Quora Question Pairs is a task released by Quora which aims to identify duplicate questions. It consists of over 400,000 question pairs, each annotated with a binary value indicating whether the two questions are paraphrases of each other. A loading sketch for these datasets follows this list.
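
All of the datasets above are easy to obtain programmatically. A minimal sketch, assuming the Hugging Face datasets library (MRPC, QQP, and STS-B ship as GLUE configurations):

```python
# pip install datasets
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")   # fields: sentence1, sentence2, label
qqp  = load_dataset("glue", "qqp")    # fields: question1, question2, label
stsb = load_dataset("glue", "stsb")   # label is a 0-5 similarity score

example = mrpc["train"][0]
print(example["sentence1"])
print(example["sentence2"])
print(example["label"])  # 1 = paraphrase, 0 = not a paraphrase
```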

A list of neural matching models for paraphrase identification follows.
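
For orientation, here is what a basic neural matching model looks like. This is a minimal siamese-encoder sketch in PyTorch using the [u; v; |u − v|; u ∗ v] matching features popularized by InferSent; it illustrates the general architecture only and is not a reimplementation of any specific model in the tables below.

```python
import torch
import torch.nn as nn

class SiameseMatcher(nn.Module):
    """Encode both sentences with a shared BiLSTM, max-pool over time,
    then classify the combined representation [u; v; |u-v|; u*v]."""
    def __init__(self, vocab_size, embed_dim=100, hidden=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.LSTM(embed_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(8 * hidden, hidden),  # 4 features x 2H each
            nn.ReLU(),
            nn.Linear(hidden, num_classes))

    def encode(self, ids):
        out, _ = self.encoder(self.embed(ids))  # (B, T, 2H)
        return out.max(dim=1).values            # max-pool over time

    def forward(self, ids1, ids2):
        u, v = self.encode(ids1), self.encode(ids2)
        feats = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)
        return self.classifier(feats)

# Toy forward pass with random token ids.
model = SiameseMatcher(vocab_size=1000)
logits = model(torch.randint(1, 1000, (2, 12)), torch.randint(1, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 2])
```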

Performance

MRPC

| Model | Code | Accuracy | F1 | Paper |
| --- | --- | --- | --- | --- |
| XLNet-Large (ensemble) (Yang et al., 2019) | official | 93.0 | 90.7 | XLNet: Generalized Autoregressive Pretraining for Language Understanding |
| MT-DNN-ensemble (Liu et al., 2019) | official | 92.7 | 90.3 | Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding |
| Snorkel MeTaL (ensemble) (Ratner et al., 2018) | official | 91.5 | 88.5 | Training Complex Models with Multi-Task Weak Supervision |
| GenSen (Subramanian et al., 2018) | official | 78.6 | 84.4 | Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning |
| InferSent (Conneau et al., 2017) | official | 76.2 | 83.1 | Supervised Learning of Universal Sentence Representations from Natural Language Inference Data |
| TF-KLD (Ji and Eisenstein, 2013) | | 80.4 | 85.9 | Discriminative Improvements to Distributional Sentence Similarity |
| SpanBERT (Joshi et al., 2019) | official | 90.9 | 87.9 | SpanBERT: Improving Pre-training by Representing and Predicting Spans |
| MT-DNN (Liu et al., 2019) | official | 91.1 | 88.2 | Multi-Task Deep Neural Networks for Natural Language Understanding |
| AugDeepParaphrase (Agarwal et al., 2017) | | 77.7 | 84.5 | A Deep Network Model for Paraphrase Detection in Short Text Messages |
| ERNIE (Zhang et al., 2019) | official | 88.2 | | ERNIE: Enhanced Language Representation with Informative Entities |
| Character-based Neural Networks (Lan and Xu, 2018) | | 84.0 | | Character-based Neural Networks for Sentence Pair Modeling |
| ABCNN (Yin et al., 2016) | official | 78.9 | 84.8 | ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs |
| Attentive Tree-LSTMs (Zhou et al., 2016) | official | 75.8 | 83.7 | Modelling Sentence Pairs with Tree-structured Attentive Encoder |
| Bi-CNN-MI (Yin and Schütze, 2015) | | 78.1 | 84.4 | Convolutional Neural Network for Paraphrase Identification |

SentEval

The evaluation metric for STS and SICK-R is Pearson correlation.

The evaluation metric for SICK-E is classification accuracy.
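
As a quick reference, both metrics are one-liners with scipy and scikit-learn; the scores below are illustrative values, not results from any model in the tables.

```python
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

# Relatedness (STS, SICK-R): Pearson correlation between
# predicted and gold relatedness scores.
gold = [4.5, 3.2, 1.0, 2.8]
pred = [4.1, 3.0, 1.4, 3.1]
r, _ = pearsonr(gold, pred)
print(f"Pearson r = {r:.3f}")

# Classification (SICK-E, MRPC, QQP): accuracy, plus F1 for
# the binary paraphrase tasks.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(f"accuracy = {accuracy_score(y_true, y_pred):.3f}")
print(f"F1 = {f1_score(y_true, y_pred):.3f}")
```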

| Model | Code | SICK-R | SICK-E | STS | Paper |
| --- | --- | --- | --- | --- | --- |
| XLNet-Large (ensemble) (Yang et al., 2019) | official | | | 91.6 | XLNet: Generalized Autoregressive Pretraining for Language Understanding |
| MT-DNN-ensemble (Liu et al., 2019) | official | | | 91.1 | Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding |
| Snorkel MeTaL (ensemble) (Ratner et al., 2018) | official | | | 90.1 | Training Complex Models with Multi-Task Weak Supervision |
| GenSen (Subramanian et al., 2018) | official | 88.8 | 87.8 | 78.9 | Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning |
| InferSent (Conneau et al., 2017) | official | 88.4 | 86.3 | 75.8 | Supervised Learning of Universal Sentence Representations from Natural Language Inference Data |
| SpanBERT (Joshi et al., 2019) | official | | | 89.9 | SpanBERT: Improving Pre-training by Representing and Predicting Spans |
| MT-DNN (Liu et al., 2019) | official | | | 89.5 | Multi-Task Deep Neural Networks for Natural Language Understanding |
| ERNIE (Zhang et al., 2019) | official | | | 83.2 | ERNIE: Enhanced Language Representation with Informative Entities |
| PWIM (He and Lin, 2016) | official | | | 76.7 | Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement |

Quora Question Pairs

| Model | Code | F1 | Accuracy | Paper |
| --- | --- | --- | --- | --- |
| XLNet-Large (ensemble) (Yang et al., 2019) | official | 74.2 | 90.3 | XLNet: Generalized Autoregressive Pretraining for Language Understanding |
| MT-DNN-ensemble (Liu et al., 2019) | official | 73.7 | 89.9 | Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding |
| Snorkel MeTaL (ensemble) (Ratner et al., 2018) | official | 73.1 | 89.9 | Training Complex Models with Multi-Task Weak Supervision |
| MwAN (Tan et al., 2018) | | | 89.12 | Multiway Attention Networks for Modeling Sentence Pairs |
| DIIN (Gong et al., 2018) | official, MatchZoo | | 89.06 | Natural Language Inference over Interaction Space |
| pt-DecAtt (Char) (Tomar et al., 2017) | | | 88.40 | Neural Paraphrase Identification of Questions with Noisy Pretraining |
| BiMPM (Wang et al., 2017) | official, MatchZoo | | 88.17 | Bilateral Multi-Perspective Matching for Natural Language Sentences |
| GenSen (Subramanian et al., 2018) | official | | 87.01 | Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning |
| L.D.C (Wang et al., 2016) | | 78.4 | 84.7 | Sentence Similarity Learning by Lexical Decomposition and Composition |
| RE2 (Yang et al., 2019) | official, MatchZoo | | 89.2 | Simple and Effective Text Matching with Richer Alignment Features |
| MSEM (Wang et al., 2019) | | | 88.86 | Multi-task Sentence Encoding Model for Semantic Retrieval in Question Answering Systems |
| Bi-CAS-LSTM (Choi et al., 2019) | | | 88.6 | Cell-aware Stacked LSTMs for Modeling Sentences |
| DecAtt (Parikh et al., 2016) | | | 86.5 | A Decomposable Attention Model for Natural Language Inference |
