The Cross-lingual Natural Language Inference (XNLI) corpus extends the Multi-Genre NLI (MultiNLI) corpus to 15 languages. It was created by manually translating the validation and test sets of MultiNLI into each of those 15 languages, while the English training set was machine-translated into the other languages. The dataset is composed of 122k train, 2490 validation and 5010 test examples.
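A minimal loading sketch, assuming the corpus's copy on the Hugging Face hub under the `xnli` id with per-language configurations such as `el`:

```python
# Sketch: inspect XNLI examples via the Hugging Face `datasets` library
# (the "xnli" dataset id and per-language configs like "el" are assumptions).
from datasets import load_dataset

xnli_el = load_dataset("xnli", "el", split="validation")  # 2490 rows per language
example = xnli_el[0]
# Each row pairs a premise with a hypothesis and a 3-way label:
# 0 = entailment, 1 = neutral, 2 = contradiction.
print(example["premise"], example["hypothesis"], example["label"])
```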
328 PAPERS • 10 BENCHMARKS
Common Voice is an audio dataset consisting of unique MP3 recordings with corresponding text files. There are 9,283 recorded hours in the dataset, of which 7,335 validated hours span 60 languages. The dataset also includes demographic metadata such as age, sex, and accent.
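A minimal sketch of browsing a downloaded release through its TSV metadata; the local path and exact column names are assumptions and vary slightly between release versions.

```python
# Sketch: inspect a Common Voice release via its tab-separated metadata
# (directory layout and column names differ across releases; check yours).
import pandas as pd

df = pd.read_csv("cv-corpus/el/validated.tsv", sep="\t")  # hypothetical local path
print(df.columns.tolist())              # e.g. path, sentence, age, gender, accent
print(df[["path", "sentence"]].head())  # MP3 file name and its transcript
```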
314 PAPERS • 164 BENCHMARKS
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently, the dataset is entirely parallel across 11 languages.
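A minimal loading sketch, assuming the `xquad` id on the Hugging Face hub with configurations named like `xquad.el`:

```python
# Sketch: load one XQuAD language via the Hugging Face `datasets` library
# (the "xquad" id and "xquad.el"-style config names are assumptions).
from datasets import load_dataset

xquad_el = load_dataset("xquad", "xquad.el", split="validation")  # 1190 QA pairs
row = xquad_el[0]
# SQuAD v1.1 schema: a context paragraph, a question, and gold answer spans.
print(row["context"][:80], row["question"], row["answers"]["text"])
```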
170 PAPERS • 1 BENCHMARK
A corpus of parallel text in 21 European languages from the proceedings of the European Parliament.
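Since the corpus is distributed as line-aligned plain-text file pairs, a minimal reading sketch; the `europarl-v7.el-en.*` file names follow the conventional release naming and are assumptions about your local download.

```python
# Sketch: iterate over a line-aligned Europarl language pair
# (file names are assumed from the v7 release convention).
with open("europarl-v7.el-en.el", encoding="utf-8") as src, \
     open("europarl-v7.el-en.en", encoding="utf-8") as tgt:
    for greek_line, english_line in zip(src, tgt):
        pair = (greek_line.strip(), english_line.strip())  # one aligned sentence pair
        break  # inspect just the first pair
print(pair)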
125 PAPERS • NO BENCHMARKS YET
This corpus comprises monolingual data for 100+ languages and also includes data for romanized languages. It was constructed using the URLs and paragraph indices provided by the CC-Net repository, processing the January-December 2018 Common Crawl snapshots. Each file contains documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository.
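A minimal sketch of reading that layout, assuming a decompressed per-language file such as `el.txt` (the file name is hypothetical):

```python
# Sketch: split a CC-100 file into documents and paragraphs, following the
# layout described above (double newline between documents, single within).
def read_cc100(path):
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    for doc in raw.split("\n\n"):                        # document boundary
        paragraphs = [p for p in doc.split("\n") if p]   # paragraphs in a doc
        if paragraphs:
            yield paragraphs

for paragraphs in read_cc100("el.txt"):  # hypothetical per-language file name
    print(len(paragraphs), paragraphs[0][:60])
    break
```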
96 PAPERS • NO BENCHMARKS YET
The Image-Grounded Language Understanding Evaluation (IGLUE) benchmark brings together—by both aggregating pre-existing datasets and creating new ones—visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. The benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.
21 PAPERS • 13 BENCHMARKS
XGLUE is an evaluation benchmark composed of 11 tasks that span 19 languages. For each task, training data is available only in English, so succeeding at XGLUE requires strong zero-shot cross-lingual transfer: a model must learn from the English data of a specific task and transfer what it learned to other languages. Compared with its concurrent work XTREME, XGLUE has two distinguishing characteristics: first, it includes both cross-lingual NLU and cross-lingual NLG tasks; second, besides 5 existing cross-lingual tasks (i.e. NER, POS, MLQA, PAWS-X and XNLI), it adds 6 new tasks drawn from Bing scenarios, including News Classification (NC), Query-Ad Matching (QADSM), Web Page Ranking (WPR), QA Matching (QAM), Question Generation (QG) and News Title Generation (NTG). This diversity of languages, tasks and task origins makes XGLUE a comprehensive benchmark for quantifying the quality of a pre-trained model on cross-lingual natural language understanding and generation.
20 PAPERS • 2 BENCHMARKS
Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.
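A minimal sketch of turning one Belebele item into a multiple-choice prompt; the `facebook/belebele` hub id, config names such as `eng_Latn`, and the field names are assumptions based on the public release and should be checked against the copy you download.

```python
# Sketch: format one Belebele item as a 4-way multiple-choice question
# (hub id, config name, split, and field names are assumptions).
from datasets import load_dataset

bel = load_dataset("facebook/belebele", "eng_Latn", split="test")
item = bel[0]
choices = [item[f"mc_answer{i}"] for i in range(1, 5)]
prompt = (
    f"Passage: {item['flores_passage']}\n"
    f"Question: {item['question']}\n"
    + "\n".join(f"{n}. {c}" for n, c in enumerate(choices, 1))
)
gold = int(item["correct_answer_num"])  # 1-indexed correct choice
print(prompt, gold)
```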
17 PAPERS • NO BENCHMARKS YET
OpenAssistant Conversations is a human-feedback dataset containing between 100K and 1M examples, released under the Apache-2.0 license.
14 PAPERS • NO BENCHMARKS YET
MultiEURLEX is a multilingual dataset for topic classification of legal documents. It comprises 65k European Union (EU) laws, officially translated into the 23 official EU languages (spanning 7 language families) and annotated with multiple labels from the EUROVOC taxonomy.
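A minimal loading sketch, assuming the `multi_eurlex` id on the Hugging Face hub with per-language configurations:

```python
# Sketch: load the Greek portion of MultiEURLEX via Hugging Face `datasets`
# (the "multi_eurlex" id and "el" config name are assumptions).
from datasets import load_dataset

eurlex_el = load_dataset("multi_eurlex", "el", split="train")
doc = eurlex_el[0]
# Each law carries its full text plus a list of EUROVOC topic labels,
# making this a multi-label classification problem.
print(doc["labels"], doc["text"][:100])
```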
9 PAPERS • NO BENCHMARKS YET
Global Voices is a multilingual dataset for evaluating cross-lingual summarization methods. It is extracted from social-network descriptions of Global Voices news articles to cheaply collect evaluation data for into-English and from-English summarization in 15 languages.
8 PAPERS • NO BENCHMARKS YET
EUR-Lex-Sum is a dataset for cross-lingual summarization. It is based on manually curated document summaries of legal acts from the European Union law platform. Documents and their respective summaries exist as cross-lingual paragraph-aligned data in several of the 24 official European languages, enabling access to various cross-lingual and lower-resourced summarization setups. The dataset contains up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages.
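A heavily hedged loading sketch: the `dennlinger/eur-lex-sum` hub id, the `english` configuration name, and the field names below are assumptions; consult the dataset card for the exact names.

```python
# Sketch: load one language of EUR-Lex-Sum via Hugging Face `datasets`
# (hub id, config name, and field names are assumptions; verify before use).
from datasets import load_dataset

els = load_dataset("dennlinger/eur-lex-sum", "english", split="train")
pair = els[0]
# Each record pairs a full legal act with its curated summary; CELEX
# identifiers align the same act across languages.
print(pair["celex_id"], len(pair["reference"]), len(pair["summary"]))
```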
5 PAPERS • NO BENCHMARKS YET
DBP-5L is a multilingual knowledge graph (KG) dataset containing five KGs in English, French, Japanese, Greek, and Spanish. The dataset is used for knowledge graph completion and entity alignment tasks. DBP-5L (Greek) is the subset of DBP-5L with the Greek KG.
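For readers unfamiliar with the task, a toy sketch of knowledge graph completion with a TransE-style scorer; it illustrates the task only and is not tied to the DBP-5L release format.

```python
# Sketch: KG completion as link prediction with a toy TransE-style scorer
# (random embeddings and entity names are purely illustrative).
import numpy as np

rng = np.random.default_rng(0)
entity_emb = {e: rng.normal(size=16) for e in ["Athens", "Greece", "Paris", "France"]}
relation_emb = {"capital_of": rng.normal(size=16)}

def score(head, relation, tail):
    # TransE intuition: for a plausible triple, head + relation ≈ tail.
    return -np.linalg.norm(entity_emb[head] + relation_emb[relation] - entity_emb[tail])

# Completion query (Athens, capital_of, ?): rank all candidate tail entities.
ranked = sorted(entity_emb, key=lambda t: score("Athens", "capital_of", t), reverse=True)
print(ranked)
```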
4 PAPERS • 1 BENCHMARK
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags). It spans 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to one of 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.
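The graph is distributed as IDs that must be rehydrated; a minimal sketch using the companion `mumin` Python package, assuming its documented `MuminDataset` entry point and a valid Twitter API bearer token.

```python
# Sketch: compile the MuMiN graph with the companion `mumin` package
# (package usage follows its README; a Twitter bearer token is required).
from mumin import MuminDataset

dataset = MuminDataset(twitter_bearer_token="<your-token>", size="small")
dataset.compile()  # downloads IDs and rehydrates tweets into the graph
```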
4 PAPERS • 3 BENCHMARKS
GeoCoV19 is a large-scale Twitter dataset containing more than 524 million multilingual tweets, of which around 378K are geotagged and 5.4 million carry Place information. The annotations extract toponyms from the user-location field and from tweet content and resolve them to geolocations at the country, state, or city level: 297 million tweets are annotated with a geolocation using the user-location field and 452 million using tweet content.
3 PAPERS • NO BENCHMARKS YET
A manually annotated dataset containing 4,779 posts from Twitter, each labeled as offensive or not offensive.
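A minimal sketch of the intended binary classification task, assuming the posts and labels have been loaded locally; the data below is placeholder only.

```python
# Sketch: a simple offensive/not-offensive baseline with TF-IDF features
# (replace the placeholder lists with the 4,779 annotated posts).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["an offensive example post", "a harmless example post"] * 20  # placeholder
labels = [1, 0] * 20  # 1 = offensive, 0 = not offensive (placeholder)

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.2, random_state=0)
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```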
2 PAPERS • NO BENCHMARKS YET
The Athens Emotional States Inventory (AESI) targets the development of ecologically valid procedures for collecting reliable and unbiased emotional data, towards computer interfaces with social and affective intelligence for patients with mental disorders. AESI comprises the design, recording and validation of an audiovisual database for five emotional states: anger, fear, joy, sadness and neutral. Its items are sentences whose content is indicative of the corresponding emotion. Emotional content was assessed through a survey of 40 young participants using a questionnaire that followed a Latin-square design, and the sentences correctly identified by 85% of participants were recorded in a soundproof room with microphones and cameras. A preliminary validation of AESI was performed through automatic emotion recognition experiments on speech. The resulting database contains 696 recorded utterances in the Greek language.
1 PAPER • NO BENCHMARKS YET
The Food Recall Incidents dataset consists of 7,546 short texts (from 5 to 360 characters each), which are the titles of food recall announcements (hence referred to as titles), crawled from 24 public food safety authority websites by Agroknow. The texts are written in 6 languages, with English (6,644) and German (888) being the most common, followed by French (8), Greek (4), Italian (1) and Danish (1). Most of the texts were authored after 2010 and describe recalls of specific food products due to specific hazards. Experts manually classified each text into four groups of classes, describing hazards and products at two levels of granularity.
We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in one of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification, showing that the dataset poses a challenging fine-grained classification problem: the best-scoring EmbraceNet model using both visual and textual features achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text.
1 PAPER • 1 BENCHMARK
Lyra is a dataset of 1570 traditional and folk Greek music pieces that includes audio and video (timestamps and links to YouTube videos), along with annotations that describe aspects of particular interest for this dataset, including instrumentation, geographic information and labels of genre and subgenre, among others.
WEATHub is a dataset covering 24 languages. It contains words organized into groups of (target1, target2, attribute1, attribute2) to measure the association target1:target2 :: attribute1:attribute2. For example, target1 can be insects and target2 flowers, and we might measure whether insects or flowers are associated with pleasant or unpleasant attributes. The word associations are quantified using the WEAT metric in our paper, which calculates an effect size (Cohen's d) and provides a p-value (to measure the statistical significance of the results). In our paper, we use word embeddings from language models to perform these tests and understand biased associations in language models across different languages.
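A minimal sketch of the WEAT effect size described above, assuming word vectors are available as NumPy arrays; the permutation-test p-value is omitted for brevity.

```python
# Sketch: WEAT effect size (Cohen's d) for target sets X, Y and
# attribute sets A, B, each a list of word-embedding vectors.
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    # s(w, A, B): mean similarity to A minus mean similarity to B.
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    pooled = np.std(s_x + s_y, ddof=1)  # std over all target-word associations
    return (np.mean(s_x) - np.mean(s_y)) / pooled

# Toy usage: random vectors standing in for embeddings of, e.g.,
# insects/flowers (targets) and unpleasant/pleasant words (attributes).
rng = np.random.default_rng(0)
X, Y, A, B = (list(rng.normal(size=(5, 50))) for _ in range(4))
print(weat_effect_size(X, Y, A, B))
```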