The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. The first version of the dataset was released in 2015 and consisted of 10 treebanks over 10 languages. Version 2.7 released in 2020 consists of 183 treebanks over 104 languages. The annotation consists of UPOS (universal part-of-speech tags), XPOS (language-specific part-of-speech tags), Feats (universal morphological features), Lemmas, dependency heads and universal dependency labels.
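UD treebanks are distributed in the CoNLL-U format, in which each token line carries ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The following is a minimal sketch of reading one such line into the annotation layers listed above. The `Token` type, the helper names, and the sample line are illustrative assumptions, not part of any official UD tooling, and the sketch does not handle comment lines or sentence boundaries.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One token line of a CoNLL-U file (DEPS and MISC are dropped here)."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: dict
    head: str
    deprel: str


def parse_feats(field):
    """Turn 'Number=Plur|Case=Nom' into a dict; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a Token."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    tok_id, form, lemma, upos, xpos, feats, head, deprel, _deps, _misc = cols
    return Token(tok_id, form, lemma, upos, xpos, parse_feats(feats), head, deprel)


# Made-up example line, for illustration only.
example = "2\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t0\troot\t_\t_"
token = parse_token_line(example)
print(token.lemma, token.upos, token.feats, token.deprel)
```

HEAD is kept as a string here because multiword-token and empty-node lines use `_` in that column; a fuller reader would treat those line types separately.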
505 PAPERS • 12 BENCHMARKS
Common Voice is an audio dataset in which each entry consists of a unique MP3 and a corresponding text file. There are 9,283 recorded hours in the dataset, including 7,335 validated hours in 60 languages. The dataset also includes demographic metadata such as age, sex, and accent.
314 PAPERS • 164 BENCHMARKS
Manga109 has been compiled by the Aizawa Yamasaki Matsui Laboratory, Department of Information and Communication Engineering, the Graduate School of Information Science and Technology, the University of Tokyo. The compilation is intended for use in academic research on the media processing of Japanese manga. Manga109 is composed of 109 manga volumes drawn by professional manga artists in Japan. These manga were commercially made available to the public between the 1970s and 2010s, and encompass a wide range of target readerships and genres (see the table in Explore for further details.) Most of the manga in the compilation are available at the manga library “Manga Library Z” (formerly the “Zeppan Manga Toshokan” library of out-of-print manga).
249 PAPERS • 12 BENCHMARKS
PAWS-X contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.
160 PAPERS • 2 BENCHMARKS
The Microsoft Academic Graph is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study.
116 PAPERS • 1 BENCHMARK
This corpus comprises monolingual data for 100+ languages and also includes data for romanized languages. It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing the January-December 2018 Common Crawl snapshots. Each file comprises documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository.
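The file layout described above (documents separated by a blank line, paragraphs by a single newline) can be read with a few lines of Python. This is a minimal sketch that works on text already loaded into memory; the function name and the sample text are assumptions made for illustration, and a real corpus file would likely need to be streamed line by line because of its size.

```python
def iter_documents(text):
    """Yield each document as a list of its paragraphs.

    Documents are separated by a blank line (double newline);
    paragraphs inside a document are separated by a single newline.
    """
    for block in text.split("\n\n"):
        paragraphs = [p for p in block.split("\n") if p.strip()]
        if paragraphs:  # skip empty blocks, e.g. after a trailing newline
            yield paragraphs


# Made-up sample in the described layout.
sample = (
    "First paragraph of document one.\n"
    "Second paragraph of document one.\n"
    "\n"
    "Only paragraph of document two.\n"
)
docs = list(iter_documents(sample))
print(len(docs), [len(d) for d in docs])
```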
96 PAPERS • NO BENCHMARKS YET
The JAFFE dataset consists of 213 images of different facial expressions from 10 different Japanese female subjects. Each subject was asked to do 7 facial expressions (6 basic facial expressions and neutral) and the images were annotated with average semantic ratings on each facial expression by 60 annotators.
87 PAPERS • 4 BENCHMARKS
ASPEC, Asian Scientific Paper Excerpt Corpus, is constructed by the Japan Science and Technology Agency (JST) in collaboration with the National Institute of Information and Communications Technology (NICT). It consists of a Japanese-English paper abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of 680K parallel sentences (ASPEC-JC). This corpus is one of the achievements of the Japanese-Chinese machine translation project which was run in Japan from 2006 to 2010.
84 PAPERS • NO BENCHMARKS YET
Kuzushiji-MNIST is a drop-in replacement for the MNIST dataset (28x28 grayscale, 70,000 images). Since MNIST restricts us to 10 classes, the authors chose one character to represent each of the 10 rows of Hiragana when creating Kuzushiji-MNIST. Kuzushiji is a Japanese cursive writing style.
82 PAPERS • 2 BENCHMARKS
DBP15k contains four language-specific KGs that are respectively extracted from English (En), Chinese (Zh), French (Fr) and Japanese (Ja) DBpedia, each of which contains around 65k-106k entities. Three sets of 15k alignment labels are constructed to align entities between each of the other three languages and En.
61 PAPERS • 3 BENCHMARKS
WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles that have been annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format. This dataset serves as a valuable resource for training and evaluating named entity recognition models across various languages.
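In the IOB2 scheme, the first token of every entity is tagged `B-<TYPE>`, continuation tokens are tagged `I-<TYPE>`, and tokens outside any entity are tagged `O`. The sketch below converts a tag sequence into entity spans. The function name and the example sentence are illustrative assumptions and are not taken from the dataset.

```python
def iob2_to_spans(tags):
    """Convert IOB2 tags into (label, start, end) spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if label is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            # Continuation of the currently open entity.
            continue
        else:
            # "O", or an I- tag that does not continue the open entity
            # (such stray I- tags are ignored in this sketch).
            if label is not None:
                spans.append((label, start, i))
            start, label = None, None
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Made-up example sentence with PER and LOC entities.
tokens = ["Barack", "Obama", "visited", "Tokyo", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
spans = iob2_to_spans(tags)
for label, start, end in spans:
    print(label, " ".join(tokens[start:end]))
```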
57 PAPERS • 3 BENCHMARKS
OSCAR or Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The dataset used for training multilingual models such as BART incorporates 138 GB of text.
55 PAPERS • NO BENCHMARKS YET
Multilingual Document Classification Corpus (MLDoc) is a cross-lingual document classification dataset covering English, German, French, Spanish, Italian, Russian, Japanese and Chinese. It is a subset of the Reuters Corpus Volume 2 selected according to the following design choices:
51 PAPERS • 11 BENCHMARKS
The Machine Translation of Noisy Text (MTNT) dataset is a machine translation dataset that consists of noisy comments from Reddit and professionally sourced translations. The translations are between English and French and between English and Japanese, with between 7k and 37k sentences per language pair.
51 PAPERS • NO BENCHMARKS YET
WikiLingua includes ~770k article and summary pairs in 18 languages from WikiHow. Gold-standard article-summary alignments across languages are extracted by aligning the images that are used to describe each how-to step in an article.
50 PAPERS • 5 BENCHMARKS
The RWC (Real World Computing) Music Database is a copyright-cleared music database (DB) that is available to researchers as a common foundation for research. It contains around 100 complete songs with manually labeled section boundaries. For the 50 instruments, individual sounds at half-tone intervals were captured with several variations of playing styles, dynamics, instrument manufacturers and musicians.
41 PAPERS • NO BENCHMARKS YET
XL-Sum is a comprehensive and diverse dataset for abstractive summarization comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems. The same 250 problems from GSM8K are each translated by human annotators into 10 languages. GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
38 PAPERS • 1 BENCHMARK
BEAT has i) 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages, and ii) 32 million frame-level emotion and semantic relevance annotations. Our statistical analysis on BEAT demonstrates the correlation of conversational gestures with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity. Based on this observation, we propose a baseline model, Cascaded Motion Network (CaMN), which consists of the above six modalities modeled in a cascaded architecture for gesture synthesis. To evaluate the semantic relevancy, we introduce a metric, Semantic Relevance Gesture Recall (SRGR). Qualitative and quantitative experiments demonstrate the metrics' validity, ground truth data quality, and the baseline's state-of-the-art performance. To the best of our knowledge,
37 PAPERS • 1 BENCHMARK
Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering.
37 PAPERS • NO BENCHMARKS YET
AVSpeech is a large-scale audio-visual dataset comprising speech clips with no interfering background signals. The segments are of varying length, between 3 and 10 seconds long, and in each clip the only visible face in the video and audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people, languages and face poses.
35 PAPERS • NO BENCHMARKS YET
WMT 2020 is a collection of datasets used in shared tasks of the Fifth Conference on Machine Translation. The conference builds on a series of annual workshops and conferences on Statistical Machine Translation.
33 PAPERS • 1 BENCHMARK
CoVoST is a large-scale multilingual speech-to-text translation corpus. Its latest version, CoVoST 2, covers translations from 21 languages into English and from English into 15 languages. It has a total of 2,880 hours of speech and is diversified with 78K speakers and 66 accents.
32 PAPERS • NO BENCHMARKS YET
The Image-Grounded Language Understanding Evaluation (IGLUE) benchmark brings together—by both aggregating pre-existing datasets and creating new ones—visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. The benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.
21 PAPERS • 13 BENCHMARKS
XGLUE is an evaluation benchmark composed of 11 tasks that span 19 languages. For each task, the training data is only available in English. This means that to succeed at XGLUE, a model must have a strong zero-shot cross-lingual transfer capability to learn from the English data of a specific task and transfer what it learned to other languages. Compared to its concurrent work XTREME, XGLUE has two characteristics: first, it includes cross-lingual NLU and cross-lingual NLG tasks at the same time; second, besides including 5 existing cross-lingual tasks (i.e. NER, POS, MLQA, PAWS-X and XNLI), XGLUE selects 6 new tasks from Bing scenarios as well, including News Classification (NC), Query-Ad Matching (QADSM), Web Page Ranking (WPR), QA Matching (QAM), Question Generation (QG) and News Title Generation (NTG). Such diversity of languages, tasks and task origins provides a comprehensive benchmark for quantifying the quality of a pre-trained model on cross-lingual natural lan
20 PAPERS • 2 BENCHMARKS
CVSS is a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems.
18 PAPERS • 1 BENCHMARK
Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.
17 PAPERS • NO BENCHMARKS YET
Japanese-English Subtitle Corpus is a large Japanese-English parallel corpus covering the underrepresented domain of conversational dialogue. It consists of more than 3.2 million examples, making it the largest freely available dataset of its kind. The corpus was assembled by crawling and aligning subtitles found on the web.
16 PAPERS • NO BENCHMARKS YET
XFUND is a multilingual form understanding benchmark dataset that includes human-labeled forms with key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese).
15 PAPERS • NO BENCHMARKS YET
OpenAssistant Conversations. License: apache-2.0. Tags: human-feedback. Size category: 100K<n<1M.
14 PAPERS • NO BENCHMARKS YET
xSID is a new evaluation benchmark for cross-lingual (X) Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect. It covers Arabic (ar), Chinese (zh), Danish (da), Dutch (nl), English (en), German (de), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Serbian (sr), Turkish (tr) and an Austro-Bavarian German dialect, South Tyrolean (de-st).
13 PAPERS • NO BENCHMARKS YET
Synbols is a dataset generator designed for probing the behavior of learning algorithms. By defining the distribution over latent factors one can craft a dataset specifically tailored to answer specific questions about a given algorithm.
11 PAPERS • NO BENCHMARKS YET
VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
11 PAPERS • 9 BENCHMARKS
This is the dataset for the 2020 Duolingo shared task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). Sentence prompts, along with automatic translations, and high-coverage sets of translation paraphrases weighted by user response are provided in 5 language pairs. Starter code for this task can be found here: github.com/duolingo/duolingo-sharedtask-2020/. More details on the data set and task are available at: sharedtask.duolingo.com
10 PAPERS • NO BENCHMARKS YET
MINTAKA is a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models. It is composed of 20,000 question-answer pairs collected in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish for a total of 180,000 samples. Mintaka includes 8 types of complex questions, including superlative, intersection, and multi-hop questions, which were naturally elicited from crowd workers.
JVS is a Japanese multi-speaker voice corpus which contains voice data of 100 speakers in three styles (normal, whisper, and falsetto). The corpus contains 30 hours of voice data including 22 hours of parallel normal voices.
9 PAPERS • NO BENCHMARKS YET
MCoNaLa is a multilingual dataset to benchmark code generation from natural language commands extending beyond English. Modeled on the methodology of the English Code/Natural Language Challenge (CoNaLa) dataset, the authors annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines MoShed SMPLX body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enh
8 PAPERS • 2 BENCHMARKS
The Japanese-English business conversation corpus, namely Business Scene Dialogue corpus, was constructed in 3 steps:
Global Voices is a multilingual dataset for evaluating cross-lingual summarization methods. It is extracted from social-network descriptions of Global Voices news articles to cheaply collect evaluation data for into-English and from-English summarization in 15 languages.
8 PAPERS • NO BENCHMARKS YET
JGLUE, Japanese General Language Understanding Evaluation, is built to measure the general NLU ability in Japanese.
7 PAPERS • NO BENCHMARKS YET
JSUT is a free large-scale Japanese speech corpus. A corpus of this kind, one that can be shared between academic institutions and commercial companies, plays an important role, but none previously existed for Japanese speech synthesis.
Consists of faces extracted from pre-modern Japanese artwork.
Open Bandit Dataset is a public real-world logged bandit feedback dataset. The dataset is provided by ZOZO, Inc., the largest Japanese fashion e-commerce company, with a market capitalization of over 5 billion USD (as of May 2020). The company uses multi-armed bandit algorithms to recommend fashion items to users on a large-scale fashion e-commerce platform called ZOZOTOWN.
Demetr is a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories.
6 PAPERS • NO BENCHMARKS YET
XL-BEL is a benchmark for cross-lingual biomedical entity linking (XL-BEL). The benchmark spans 10 typologically diverse languages.
This resource, our Concepticon, links concept labels from different conceptlists to concept sets. Each concept set is given a unique identifier, a unique label, and a human-readable definition. Concept sets are further structured by defining different relations between the concepts, for example the relations between concept sets linked to the concept set SIBLING. The resource can be used for various purposes. Serving as a rich reference for new and existing databases in diachronic and synchronic linguistics, it allows researchers quick access to studies on semantic change, cross-linguistic polysemies, and semantic associations.
5 PAPERS • NO BENCHMARKS YET
JParaCrawl is a parallel corpus for English-Japanese, for which the amount of publicly available parallel corpora is still limited. The parallel corpus was constructed by broadly crawling the web and automatically aligning parallel sentences. The corpus amassed over 8.7 million sentence pairs.