We now introduce IndicGLUE, the Indic General Language Understanding Evaluation Benchmark, which is a collection of various NLP tasks as de- scribed below. The goal is to provide an evaluation benchmark for natural language understanding ca- pabilities of NLP models on diverse tasks and mul- tiple Indian languages.
14 PAPERS • 3 BENCHMARKS
This is a new dataset of news headlines and their frames related to the issue of gun violence in the United States. This Gun Violence Frame Corpus (GVFC) was curated and annotated by journalism and communication experts. The articles in this dataset are drawn from a sample of news articles from a list of 30 top U.S. news websites defined in terms of traffic to the websites; and collected from four time periods over the course of 2018 in order to capture a diversity of articles.
6 PAPERS • NO BENCHMARKS YET
MasakhaNEWS is a benchmark dataset for news topic classification covering 16 languages widely spoken in Africa.
5 PAPERS • NO BENCHMARKS YET
Multilabeled News Dataset (MN-DS) is a dataset for news classification. It consists of 10,917 articles in 17 first-level and 109 second-level categories from 215 media sources.
3 PAPERS • NO BENCHMARKS YET
The Headline Grouping dataset is a binary classification dataset on pairs of news headline. For each pair of headline, the binary label indicates whether the two headlines are part of the same group (and describe the same underlying event), or whether they are in distinct groups. The dataset contains a total of 20k annotated headline pairs, further split in a train, validation and test portions.
2 PAPERS • NO BENCHMARKS YET
Two news datasets (KINNEWS and KIRNEWS) for multi-class classification of news articles in Kinyarwanda and Kirundi, two low-resource African languages. The two languages are mutually intelligible.
1 PAPER • NO BENCHMARKS YET
N15News is a large-scale multimodal news dataset comprising 200K imagetext pairs and 15 categories, which exceeding the previous news dataset in both the number of categories and samples.
1 PAPER • 1 BENCHMARK
Verifee is a dataset of news articles with fine-grained trustworthiness annotations. It contains over 10, 000 unique articles from almost 60 Czech online news sources. These are categorized into one of the 4 classes across the credibility spectrum we propose, raging from entirely trustworthy articles all the way to the manipulative ones.
A Filipino multi-modal language dataset for image-conditional language generation and text-conditional image generation. Consists of 351,755 Filipino news articles gathered from Filipino news outlets.
0 PAPER • NO BENCHMARKS YET
The CIDII dataset is a binary classification, consisting of two classes of correct information and disinformation related to Islamic issues. The CIDII dataset belongs to our research (DISINFORMATION DETECTION ABOUT ISLAMIC ISSUES ON SOCIAL MEDIA USING DEEP LEARNING TECHNIQUES) published in MJCS journal in the link below: https://ejournal.um.edu.my/index.php/MJCS/article/view/41935
Eduge news classification dataset provided by Bolorsoft LLC. Used to train the Eduge.mn production news classifier 75K news articles in 9 categories: урлаг соёл, эдийн засаг, эрүүл мэнд, хууль, улс төр, спорт, технологи, боловсрол and байгал орчин