The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. The first version of the dataset was released in 2015 and consisted of 10 treebanks over 10 languages. Version 2.7, released in 2020, consists of 183 treebanks over 104 languages. The annotation consists of UPOS (universal part-of-speech tags), XPOS (language-specific part-of-speech tags), Feats (universal morphological features), lemmas, dependency heads, and universal dependency labels.
505 PAPERS • 12 BENCHMARKS
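The UD annotation layers listed above are distributed in the tab-separated CoNLL-U format, with one token per line and ten columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The following is a minimal sketch of reading one such line in Python. The `Token` class, the helper names, and the two sample lines are hypothetical and written for illustration; they are not taken from any UD treebank or official tool.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Token:
    """One word line of a CoNLL-U file (hypothetical container for illustration)."""
    id: str
    form: str
    lemma: str
    upos: str                # universal part-of-speech tag
    xpos: str                # language-specific part-of-speech tag
    feats: Dict[str, str]    # universal morphological features
    head: Optional[int]      # index of the syntactic head, 0 for the root
    deprel: str              # universal dependency label


def parse_feats(field: str) -> Dict[str, str]:
    """Turn 'Number=Plur|Person=3' into a dict; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def parse_token_line(line: str) -> Token:
    """Parse a single tab-separated token line with ten columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    id_, form, lemma, upos, xpos, feats, head, deprel, _deps, _misc = cols
    # HEAD is '_' on multiword-token range lines, so keep it optional.
    head_index = None if head == "_" else int(head)
    return Token(id_, form, lemma, upos, xpos, parse_feats(feats), head_index, deprel)


# Hand-written sample lines (an assumption for illustration, not real treebank data).
SAMPLE = [
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\tVBP\tMood=Ind|Number=Plur\t0\troot\t_\t_",
]

tokens = [parse_token_line(line) for line in SAMPLE]
for tok in tokens:
    print(tok.form, tok.upos, tok.feats, tok.head, tok.deprel)
```

This sketch skips comment lines (starting with `#`) and the blank lines that separate sentences, both of which a full reader would need to handle.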
MasakhaNER is a collection of Named Entity Recognition (NER) datasets for 10 different African languages. The languages forming this dataset are: Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian-Pidgin, Swahili, Wolof, and Yorùbá.
46 PAPERS • 2 BENCHMARKS
MasakhaNEWS is a benchmark dataset for news topic classification covering 16 languages widely spoken in Africa.
5 PAPERS • NO BENCHMARKS YET
HERDPhobia is an annotated hate speech detection dataset about Fulani herders in Nigeria, covering three languages: English, Nigerian-Pidgin, and Hausa.
2 PAPERS • NO BENCHMARKS YET