The Multi-Genre NLI Corpus
The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.
| Premise | Label | Hypothesis |
| --- | --- | --- |
| The Old One always comforted Ca'daan, except today. | neutral | Ca'daan knew the Old One very well. |
| Your gift is appreciated by each and every student who will benefit from your generosity. | neutral | Hundreds of students will benefit from your generosity. |
| yes now you know if if everybody like in August when everybody's on vacation or something we can dress a little more casual or | contradiction | August is a black out month for vacations in the company. |
| At the other end of Pennsylvania Avenue, people began to line up for a White House tour. | entailment | People formed a line at the end of Pennsylvania Avenue. |
MultiNLI is distributed in a single ZIP file containing the corpus as both JSON lines (jsonl) and tab-separated text (txt).
Download: MultiNLI 1.0 (227MB, ZIP)
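The JSON lines distribution is straightforward to parse: one JSON object per line, with fields including sentence1 (premise), sentence2 (hypothesis), and gold_label. A minimal loading sketch, using an inline demo file in place of the real corpus files (examples whose gold_label is "-" lack an annotator consensus and are conventionally skipped):

```python
import json

def load_nli(path):
    """Load an NLI .jsonl file as (premise, hypothesis, label) tuples,
    skipping pairs with no gold label ("-")."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            if ex.get("gold_label") == "-":
                continue
            examples.append((ex["sentence1"], ex["sentence2"], ex["gold_label"]))
    return examples

# Tiny inline sample standing in for a real corpus file.
sample = [
    {"sentence1": "A man is eating.", "sentence2": "A person eats.", "gold_label": "entailment"},
    {"sentence1": "A man is eating.", "sentence2": "A man sleeps.", "gold_label": "contradiction"},
    {"sentence1": "A man is eating.", "sentence2": "He likes pasta.", "gold_label": "-"},
]
with open("demo.jsonl", "w", encoding="utf-8") as f:
    for ex in sample:
        f.write(json.dumps(ex) + "\n")

pairs = load_nli("demo.jsonl")  # the "-" example is dropped
```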
MultiNLI 0.9 differs from MultiNLI 1.0 only in the pairID and promptID fields in the training and development sets (and the attached paper), so results achieved on version 0.9 are still valid on 1.0. Version 0.9 can be downloaded here.
The Stanford NLI Corpus (SNLI)
MultiNLI is modeled after SNLI. The two corpora are distributed in the same formats, and for many applications, it may be productive to treat them as a single, larger corpus. You can find out more about SNLI here and download it from an NYU mirror here.
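Because the two corpora share the same core jsonl fields, pooling them is a matter of concatenating their example lists. A sketch with inline records standing in for the parsed corpus files:

```python
# Representative records with the shared core fields; MultiNLI examples
# additionally carry a genre tag, which SNLI examples lack.
snli_example = {"sentence1": "A dog runs.", "sentence2": "An animal moves.",
                "gold_label": "entailment"}
multinli_example = {"sentence1": "yes it was fun", "sentence2": "It was boring.",
                    "gold_label": "contradiction", "genre": "telephone"}

# Pool the two corpora into one training set.
combined = [snli_example, multinli_example]
labels = [ex["gold_label"] for ex in combined]
```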
Data description paper
A description of the data can be found here (PDF) or in the corpus package zip.
Baselines, code, and analysis
The data description paper presents the following baselines:
| Model | Matched Test Acc. | Mismatched Test Acc. |
| --- | --- | --- |
| Most Frequent Class | 36.5% | 35.6% |
Note that the ESIM relies on attention between sentences and would be ineligible for inclusion in the RepEval competition. All three models are trained on a mix of MultiNLI and SNLI and use GloVe word vectors. Code (TensorFlow/Python) is available here, alongside a script to reproduce the categories used in the error analysis in the paper.
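The Most Frequent Class baseline simply predicts whichever label is most common in the training data; its roughly 36% accuracy reflects the near-even three-way class balance. A minimal sketch with toy labels:

```python
from collections import Counter

def most_frequent_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for y in test_labels if y == majority)
    return correct / len(test_labels)

# Toy label lists; on the real corpus this scores ~36% because the
# three classes are nearly balanced.
train = ["entailment"] * 4 + ["neutral"] * 3 + ["contradiction"] * 3
test = ["entailment", "neutral", "entailment", "contradiction"]
acc = most_frequent_class_accuracy(train, test)  # 0.5
```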
Test set and leaderboard
To evaluate your system on the full test set, use the following Kaggle in Class competitions. You do not need to submit code to evaluate your model, and you may evaluate under a pseudonym, but you are expected to post a brief description of your model in the competition discussion board.
These competitions will be open indefinitely. Evaluations on a subset of the test set had previously been conducted with different leaderboards through the RepEval 2017 Workshop. Evaluations on the hard subset of the test set used in Gururangan et al. '18 are available separately (matched/mismatched).
Researchers interested in multi-task learning and general-purpose representation learning can also access the test set through a separate leaderboard on the GLUE platform.
See details in the data description paper.
Please cite the following paper:
Adina Williams, Nikita Nangia, and Samuel R. Bowman.
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference.
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).
This work was made possible by a Google Faculty Research Award. We also thank George Dahl, the organizers of the RepEval 2016 and RepEval 2017 workshops, Andrew Drozdov, Angeliki Lazaridou, and our other NYU colleagues for help and advice.