Paraphrase identification is the task of determining whether two sentences have the same meaning, a problem considered a touchstone of natural language understanding.
For example, the MRPC dataset contains the following pair of sentences with the same meaning:
sentence1: Amrozi accused his brother, whom he called “the witness”, of deliberately distorting his evidence.
sentence2: Referring to him as only “the witness”, Amrozi accused his brother of deliberately distorting his evidence.
Some benchmark datasets are listed below.
Classic Datasets
- MRPC is short for Microsoft Research Paraphrase Corpus. It contains 5,800 pairs of sentences which have been extracted from news sources on the web, along with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship.
- SentEval encompasses semantic relatedness datasets including SICK and the STS Benchmark. The SICK dataset includes two subtasks, SICK-R and SICK-E. For STS and SICK-R, the task is to predict relatedness scores between pairs of sentences. SICK-E uses the same sentence pairs as SICK-R but is treated as a three-class classification problem (classes are ‘entailment’, ‘contradiction’, and ‘neutral’).
- Quora Question Pairs is a task released by Quora which aims to identify duplicate questions. It consists of over 400,000 pairs of questions from Quora, and each pair is annotated with a binary value indicating whether the two questions are paraphrases of each other.
Neural matching models for paraphrase identification are listed below.
MRPC

| Model | Code | Accuracy | F1 | Paper |
| --- | --- | --- | --- | --- |
| XLNet-Large (ensemble) (Yang et al., 2019) | | 93.0 | 90.7 | XLNet: Generalized Autoregressive Pretraining for Language Understanding |
| MT-DNN-ensemble (Liu et al., 2019) | | 92.7 | 90.3 | Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding |
| Snorkel MeTaL (ensemble) (Ratner et al., 2018) | | 91.5 | 88.5 | Training Complex Models with Multi-Task Weak Supervision |
| GenSen (Subramanian et al., 2018) | | 78.6 | 84.4 | Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning |
| InferSent (Conneau et al., 2017) | | 76.2 | 83.1 | Supervised Learning of Universal Sentence Representations from Natural Language Inference Data |
| TF-KLD (Ji and Eisenstein, 2013) | — | 80.4 | 85.9 | Discriminative Improvements to Distributional Sentence Similarity |
| SpanBERT (Joshi et al., 2019) | | 90.9 | 87.9 | SpanBERT: Improving Pre-training by Representing and Predicting Spans |
| MT-DNN (Liu et al., 2019) | | 91.1 | 88.2 | Multi-Task Deep Neural Networks for Natural Language Understanding |
| AugDeepParaphrase (Agarwal et al., 2017) | — | 77.7 | 84.5 | A Deep Network Model for Paraphrase Detection in Short Text Messages |
| ERNIE (Zhang et al., 2019) | | 88.2 | — | ERNIE: Enhanced Language Representation with Informative Entities |
| This work (Lan and Xu, 2018) | — | 84.0 | — | Character-based Neural Networks for Sentence Pair Modeling |
| ABCNN (Yin et al., 2018) | | 78.9 | 84.8 | ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs |
| Attentive Tree-LSTMs (Zhou et al., 2016) | | 75.8 | 83.7 | Modelling Sentence Pairs with Tree-structured Attentive Encoder |
| Bi-CNN-MI (Yin and Schütze, 2015) | — | 78.1 | 84.4 | Convolutional Neural Network for Paraphrase Identification |
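Accuracy and F1 reported above are the standard binary-classification metrics, computed on gold paraphrase labels versus model predictions. A minimal sketch (with hypothetical toy labels; real evaluations typically use a library such as scikit-learn):

```python
# Accuracy and F1 for binary paraphrase labels (1 = paraphrase, 0 = not).
def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_score(gold, pred):
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))  # true positives
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))  # false positives
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: 6 sentence pairs.
gold = [1, 0, 1, 1, 0, 1]
pred = [1, 0, 0, 1, 1, 1]
print(accuracy(gold, pred))  # 4 of 6 correct -> 0.666...
print(f1_score(gold, pred))  # precision 3/4, recall 3/4 -> 0.75
```

Note that F1 can exceed accuracy (or vice versa) depending on class balance, which is why MRPC leaderboards report both.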
SentEval
The evaluation metric for STS and SICK-R is Pearson correlation.
The evaluation metric for SICK-E is classification accuracy.
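For reference, Pearson correlation between predicted and gold relatedness scores can be computed as follows (toy scores on a hypothetical 1–5 scale, standard library only):

```python
import math

def pearson(x, y):
    # Pearson correlation: covariance divided by the product of standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical predicted vs. gold relatedness scores.
predicted = [4.8, 2.1, 3.9, 1.2, 4.4]
gold      = [5.0, 2.0, 4.0, 1.0, 4.5]
print(round(pearson(predicted, gold), 4))
```

A correlation of 1.0 means the predicted ranking and spread of scores matches the gold annotations perfectly; published systems are usually compared by this value multiplied by 100.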
Quora Question Pairs

| Model | Code | F1 | Accuracy | Paper |
| --- | --- | --- | --- | --- |
| XLNet-Large (ensemble) (Yang et al., 2019) | | 74.2 | 90.3 | XLNet: Generalized Autoregressive Pretraining for Language Understanding |
| MT-DNN-ensemble (Liu et al., 2019) | | 73.7 | 89.9 | Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding |
| Snorkel MeTaL (ensemble) (Ratner et al., 2018) | | 73.1 | 89.9 | Training Complex Models with Multi-Task Weak Supervision |
| MwAN (Tan et al., 2018) | — | — | 89.12 | Multiway Attention Networks for Modeling Sentence Pairs |
| DIIN (Gong et al., 2018) | | — | 89.06 | Natural Language Inference over Interaction Space |
| pt-DecAtt (Char) (Tomar et al., 2017) | — | — | 88.40 | Neural Paraphrase Identification of Questions with Noisy Pretraining |
| BiMPM (Wang et al., 2017) | | — | 88.17 | Bilateral Multi-Perspective Matching for Natural Language Sentences |
| GenSen (Subramanian et al., 2018) | — | — | 87.01 | Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning |
| This work (Wang et al., 2016) | — | 78.4 | 84.7 | Sentence Similarity Learning by Lexical Decomposition and Composition |
| RE2 (Yang et al., 2019) | | — | 89.2 | Simple and Effective Text Matching with Richer Alignment Features |
| MSEM (Wang et al., 2016) | — | — | 88.86 | Multi-task Sentence Encoding Model for Semantic Retrieval in Question Answering Systems |
| Bi-CAS-LSTM (Choi et al., 2019) | — | — | 88.6 | Cell-aware Stacked LSTMs for Modeling Sentences |
| DecAtt (Parikh et al., 2016) | — | — | 86.5 | A Decomposable Attention Model for Natural Language Inference |