WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles that have been annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format¹². This dataset serves as a valuable resource for training and evaluating named entity recognition models across various languages.
57 PAPERS • 3 BENCHMARKS
Cherokee-English Parallel Dataset is a low-resource dataset of 14,151 pairs of sentences with around 313K English tokens and 206K Cherokee tokens. The parallel corpus is accompanied by a monolingual Cherokee dataset of 5,120 sentences. Both datasets are mostly derived from Cherokee monolingual books.
2 PAPERS • NO BENCHMARKS YET