MultiNLI

The Multi-Genre NLI Corpus

Adina Williams
Nikita Nangia
Sam Bowman
NYU

Introduction

The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.

Examples

Fiction
  Premise: The Old One always comforted Ca'daan, except today.
  Label: neutral
  Hypothesis: Ca'daan knew the Old One very well.

Letters
  Premise: Your gift is appreciated by each and every student who will benefit from your generosity.
  Label: neutral
  Hypothesis: Hundreds of students will benefit from your generosity.

Telephone Speech
  Premise: yes now you know if if everybody like in August when everybody's on vacation or something we can dress a little more casual or
  Label: contradiction
  Hypothesis: August is a black out month for vacations in the company.

9/11 Report
  Premise: At the other end of Pennsylvania Avenue, people began to line up for a White House tour.
  Label: entailment
  Hypothesis: People formed a line at the end of Pennsylvania Avenue.

Download

MultiNLI is distributed in a single ZIP file containing the corpus as both JSON lines (jsonl) and tab-separated text (txt).

Download: MultiNLI 1.0 (227MB, ZIP)
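
As a rough illustration of working with the JSON lines files, the sketch below parses one JSON object per line and keeps only the fields needed for inference. The field names gold_label, sentence1, sentence2, and genre are assumptions based on the SNLI-style format (only pairID and promptID are named on this page), as is the convention that a label of "-" marks a pair without a consensus label. The pairID values in the sample are made up. Check all of these against the files in the ZIP before relying on them.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator

# The three inference labels used in the examples on this page.
LABELS = {"entailment", "neutral", "contradiction"}


def read_nli_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one simplified example per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed field name; pairs whose label is not one of the three
        # classes (for example "-") are skipped.
        label = record.get("gold_label")
        if label not in LABELS:
            continue
        yield {
            "pair_id": record.get("pairID", ""),
            "genre": record.get("genre", ""),
            "premise": record["sentence1"],      # assumed field name
            "hypothesis": record["sentence2"],   # assumed field name
            "label": label,
        }


# In-memory lines stand in for a real file; the pairID values are hypothetical.
sample_lines = [
    json.dumps({
        "pairID": "example-1",
        "genre": "fiction",
        "gold_label": "neutral",
        "sentence1": "The Old One always comforted Ca'daan, except today.",
        "sentence2": "Ca'daan knew the Old One very well.",
    }),
    json.dumps({
        "pairID": "example-2",
        "genre": "letters",
        "gold_label": "-",
        "sentence1": "Your gift is appreciated by each and every student "
                     "who will benefit from your generosity.",
        "sentence2": "Hundreds of students will benefit from your generosity.",
    }),
]

examples = list(read_nli_jsonl(sample_lines))
genre_counts = Counter(example["genre"] for example in examples)
```

To read a real split, pass an open file handle instead of the sample list, for instance by opening the extracted jsonl file with encoding="utf-8" and handing it to read_nli_jsonl.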

Previous versions

MultiNLI 0.9 differs from MultiNLI 1.0 only in the pairID and promptID fields in the training and development sets (and the attached paper), so results achieved on version 0.9 are still valid on 1.0. Version 0.9 can be downloaded here.

The Stanford NLI Corpus (SNLI)

MultiNLI is modeled after SNLI. The two corpora are distributed in the same formats, and for many applications, it may be productive to treat them as a single, larger corpus. You can find out more about SNLI here and download it from an NYU mirror here.
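
One possible way to treat the two corpora as a single training set is sketched below. It assumes both have already been loaded into lists of dictionaries with the same keys; the key names, the added source tag, and the helper name combine_corpora are hypothetical and chosen only for illustration. The SNLI-style pair in the sample is made up.

```python
import random
from typing import Dict, List


def combine_corpora(
    multinli: List[Dict[str, str]],
    snli: List[Dict[str, str]],
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Tag each example with its corpus of origin, then merge and shuffle."""
    combined = [dict(example, source="multinli") for example in multinli]
    combined += [dict(example, source="snli") for example in snli]
    # A seeded generator keeps the shuffled order reproducible.
    random.Random(seed).shuffle(combined)
    return combined


multinli_examples = [
    {
        "premise": "At the other end of Pennsylvania Avenue, people began "
                   "to line up for a White House tour.",
        "hypothesis": "People formed a line at the end of Pennsylvania Avenue.",
        "label": "entailment",
    },
]
# A made-up pair standing in for SNLI data.
snli_examples = [
    {
        "premise": "A dog runs across a field.",
        "hypothesis": "An animal is outdoors.",
        "label": "entailment",
    },
]

training_set = combine_corpora(multinli_examples, snli_examples)
```

Keeping the source tag makes it easy to report results separately for each corpus later, or to change the mixing ratio.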

Data description paper and citation

A description of the data can be found here (PDF) or in the corpus package zip. If you use the corpus in an academic paper, please cite us:

@InProceedings{N18-1101,
  author = "Williams, Adina
            and Nangia, Nikita
            and Bowman, Samuel",
  title = "A Broad-Coverage Challenge Corpus for 
           Sentence Understanding through Inference",
  booktitle = "Proceedings of the 2018 Conference of 
               the North American Chapter of the 
               Association for Computational Linguistics:
               Human Language Technologies, Volume 1 (Long
               Papers)",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  pages = "1112--1122",
  location = "New Orleans, Louisiana",
  url = "http://aclweb.org/anthology/N18-1101"
}

Baselines, code, and analysis

The data description paper presents the following baselines:

Model                 Matched Test Acc.   Mismatched Test Acc.
Most Frequent Class   36.5%               35.6%
CBOW                  65.2%               64.6%
BiLSTM                67.5%               67.1%
ESIM                  72.4%               71.9%

Note that the ESIM relies on attention between sentences and would be ineligible for inclusion in the RepEval competition. All three models are trained on a mix of MultiNLI and SNLI and use GloVe word vectors. Code (TensorFlow/Python) is available here, alongside a script to reproduce the categories used in the error analysis in the paper.
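
For reference, the Most Frequent Class row can be understood with a short sketch: predict the most common training label for every pair, then score the matched and mismatched sets separately. The toy label lists below are invented for illustration and are not drawn from the corpus, and the helper names are hypothetical. This is not the released baseline code.

```python
from collections import Counter
from typing import Sequence


def most_frequent_label(train_labels: Sequence[str]) -> str:
    """Return the label that occurs most often in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that equal the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy labels, invented for this example.
train_labels = ["entailment", "neutral", "entailment", "contradiction"]
matched_gold = ["entailment", "neutral", "contradiction", "entailment"]
mismatched_gold = ["neutral", "entailment", "contradiction", "neutral"]

majority = most_frequent_label(train_labels)
# The same constant prediction is scored on each evaluation set.
matched_acc = accuracy([majority] * len(matched_gold), matched_gold)
mismatched_acc = accuracy([majority] * len(mismatched_gold), mismatched_gold)
```

The same accuracy function applies to any model's predictions, so the two columns of the table differ only in which evaluation set is scored.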

Additional analysis-oriented datasets are available as part of GLUE and here.

Test set and leaderboard

To evaluate your system on the full test set, use the following Kaggle in Class competitions. You do not need to submit code to evaluate your model, and you may evaluate under a pseudonym, but you are expected to post a brief description of your model in the competition discussion board.

These competitions will be open indefinitely. Evaluations on a subset of the test set had previously been conducted with different leaderboards through the RepEval 2017 Workshop. Evaluations on the hard subset of the test set used in Gururangan et al. '18 are available separately (matched/mismatched).

Researchers interested in multi-task learning and general-purpose representation learning can also access the test set through a separate leaderboard on the GLUE platform.

The best result (state of the art) that we've seen written up in a paper is 82.1% matched / 81.4% mismatched test accuracy, from Radford et al. 2018.

License

See details in the data description paper.

Thanks

This work was made possible by a Google Faculty Research Award. We also thank George Dahl, the organizers of the RepEval 2016 and RepEval 2017 workshops, Andrew Drozdov, Angeliki Lazaridou, and our other NYU colleagues for help and advice.