MultiNLI

The Multi-Genre NLI Corpus

Adina Williams (NYU)
Nikita Nangia (NYU)
Angeliki Lazaridou (Google DeepMind)
Sam Bowman (NYU)

Introduction

The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation. The corpus is being used as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.

Examples

Fiction
Premise: The Old One always comforted Ca'daan, except today.
Label: neutral
Hypothesis: Ca'daan knew the Old One very well.

Letters
Premise: Your gift is appreciated by each and every student who will benefit from your generosity.
Label: neutral
Hypothesis: Hundreds of students will benefit from your generosity.

Telephone Speech
Premise: yes now you know if if everybody like in August when everybody's on vacation or something we can dress a little more casual or
Label: contradiction
Hypothesis: August is a black out month for vacations in the company.

9/11 Report
Premise: At the other end of Pennsylvania Avenue, people began to line up for a White House tour.
Label: entailment
Hypothesis: People formed a line at the end of Pennsylvania Avenue.

Download

MultiNLI is distributed in a single ZIP file containing the corpus as both JSON lines (jsonl) and tab-separated text (txt).

Download: MultiNLI 1.0 (227MB, ZIP)
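For reference, the JSON lines files can be read with a few lines of Python. The sketch below assumes the field names and directory layout of the standard release (gold_label, sentence1, sentence2); see the data description paper for the full schema.

    import json

    # Minimal JSONL loader; the path and field names assume the layout of the
    # MultiNLI 1.0 release and should be checked against the actual download.
    def load_nli(path):
        examples = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                ex = json.loads(line)
                # Pairs with no annotator consensus carry the gold_label "-" and
                # are conventionally excluded from training and evaluation.
                if ex["gold_label"] == "-":
                    continue
                examples.append((ex["sentence1"], ex["sentence2"], ex["gold_label"]))
        return examples

    dev_matched = load_nli("multinli_1.0/multinli_1.0_dev_matched.jsonl")
    print(len(dev_matched), dev_matched[0])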

Previous Versions

MultiNLI 0.9 differs from MultiNLI 1.0 only in the pairID and promptID fields in the training and development sets (and the attached paper), so results achieved on version 0.9 are still valid on 1.0. Version 0.9 can be downloaded here.

The Stanford NLI Corpus (SNLI)

MultiNLI is modeled after SNLI. The two corpora are distributed in the same formats, and for many applications, it may be productive to treat them as a single, larger corpus. You can find out more about SNLI here and download it from an NYU mirror here.
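Because the formats match, pooling the two training sets is straightforward. A minimal sketch, reusing the load_nli helper above and assuming the standard multinli_1.0 and snli_1.0 file names:

    # Pool the MultiNLI and SNLI training sets into one list of labeled pairs.
    train = (load_nli("multinli_1.0/multinli_1.0_train.jsonl")
             + load_nli("snli_1.0/snli_1.0_train.jsonl"))
    print("combined training pairs:", len(train))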

Data description paper

A description of the data can be found here (PDF) or in the corpus package zip.

Baselines

The data description paper presents the following baselines:

Model  Matched Test Acc.  Mismatched Test Acc.
Most Frequent Class 36.5% 35.6%
CBOW 65.2% 64.6%
BiLSTM 67.5% 67.1%
ESIM 72.4% 71.9%

Note that ESIM relies on attention between the two sentences, so it would be ineligible for inclusion in the RepEval competition, which is restricted to sentence-vector models. All three trained models use a mix of MultiNLI and SNLI as training data and use GloVe word vectors. Code (TensorFlow/Python) is available here.
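For readers who want a feel for the sentence-vector baselines, the following is a rough sketch of the CBOW approach: each sentence is represented as the sum of its GloVe word vectors, and the pair is represented by the concatenation, difference, and product of the two sentence vectors. This is not the released implementation; the GloVe file path is an assumption, and a simple scikit-learn linear classifier stands in for the paper's neural network classifier.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def load_glove(path, dim=300):
        # Parse a GloVe text file into a {word: vector} dictionary.
        vecs = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                if len(parts) == dim + 1:
                    vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vecs

    def encode(sentence, vecs, dim=300):
        # CBOW-style sentence encoding: sum of word vectors, skipping unknown words.
        found = [vecs[w] for w in sentence.lower().split() if w in vecs]
        return np.sum(found, axis=0) if found else np.zeros(dim, dtype=np.float32)

    def pair_features(premise, hypothesis, vecs):
        # Standard sentence-vector NLI features: [p; h; p - h; p * h].
        p, h = encode(premise, vecs), encode(hypothesis, vecs)
        return np.concatenate([p, h, p - h, p * h])

    # Example use with the pooled training pairs from the snippet above:
    # glove = load_glove("glove.840B.300d.txt")
    # X = np.stack([pair_features(p, h, glove) for p, h, _ in train])
    # y = [label for _, _, label in train]
    # clf = LogisticRegression(max_iter=1000).fit(X, y)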

Supplemental annotations

We have annotated roughly 1000 development set examples with tags reflecting properties like the presence of negation, numerical reasoning, or high lexical overlap. We expect that these tags may be helpful when conducting error analysis. We include a helper Python script for this purpose. Note that these were released for use during the RepEval 2017 shared task, and are not the same as the newer tags described in the paper. These tags must be downloaded separately here:

Download: MultiNLI 1.0 supplemental annotations (15KB, ZIP)
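As an illustration of the kind of error analysis these tags support, the sketch below computes per-tag accuracy given a mapping from pairIDs to tags and to model predictions. The dictionary layout is an assumption made for this example; the actual annotation file format is documented alongside the bundled helper script.

    from collections import defaultdict

    def per_tag_accuracy(tags_by_pair, gold_by_pair, pred_by_pair):
        # tags_by_pair: {pairID: [tag, ...]} (assumed layout for this example)
        # gold_by_pair / pred_by_pair: {pairID: label}
        correct, total = defaultdict(int), defaultdict(int)
        for pair_id, tags in tags_by_pair.items():
            if pair_id not in pred_by_pair or pair_id not in gold_by_pair:
                continue
            hit = pred_by_pair[pair_id] == gold_by_pair[pair_id]
            for tag in tags:
                total[tag] += 1
                correct[tag] += int(hit)
        return {tag: correct[tag] / total[tag] for tag in sorted(total)}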

Baseline results by tag: Matched

Tag  CBOW  BiLSTM  ESIM
CONDITIONAL  100%  100%  100%
WORD_OVERLAP  38%  50%  50%
NEGATION  60%  71%  76%
ANTO  67%  67%  67%
LONG_SENTENCE  42%  50%  75%
TENSE_DIFFERENCE  64%  64%  73%
ACTIVE/PASSIVE  88%  75%  88%
PARAPHRASE  84%  78%  89%
QUANTITY/TIME_REASONING  33%  50%  33%
COREF  84%  84%  83%
QUANTIFIER  64%  64%  69%
MODAL  71%  66%  78%
BELIEF  71%  74%  65%

Baseline results by tag: Mismatched

Tag  CBOW  BiLSTM  ESIM
CONDITIONAL  80%  100%  60%
WORD_OVERLAP  57%  57%  62%
NEGATION  62%  69%  71%
ANTO  67%  58%  58%
LONG_SENTENCE  42%  55%  69%
TENSE_DIFFERENCE  68%  71%  79%
ACTIVE/PASSIVE  91%  82%  91%
PARAPHRASE  81%  81%  84%
QUANTITY/TIME_REASONING  31%  46%  54%
COREF  75%  80%  75%
QUANTIFIER  65%  70%  72%
MODAL  68%  64%  76%
BELIEF  69%  73%  67%

Test set and leaderboard

To get unlabeled test data and evaluate your system on the full test set, use the Kaggle in Class competitions (one for the matched test set and one for the mismatched test set).

These competitions will remain open indefinitely. Evaluations on a subset of the test set were previously conducted through separate leaderboards at the RepEval 2017 Workshop.
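A submission is essentially a table of pairIDs and predicted labels. A minimal sketch for writing one as CSV follows; the column names and label strings are assumptions, and the unlabeled test file is assumed to share the jsonl field names of the released splits, so confirm the required format on the competition pages before submitting.

    import csv
    import json

    def write_submission(test_path, predict_fn, out_path="submission.csv"):
        # predict_fn is a hypothetical callable mapping (premise, hypothesis)
        # to one of "entailment", "neutral", or "contradiction".
        with open(test_path, encoding="utf-8") as f, \
             open(out_path, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(["pairID", "gold_label"])  # assumed column names
            for line in f:
                ex = json.loads(line)
                writer.writerow([ex["pairID"],
                                 predict_fn(ex["sentence1"], ex["sentence2"])])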

License

See details in the data description paper.

Thanks

This work was made possible by a Google Faculty Research Award to Sam Bowman and Angeliki Lazaridou. We also thank George Dahl, the organizers of the RepEval 2016 and RepEval 2017 workshops, and our colleagues at NYU for their help and advice.