WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format. The dataset is a resource for training and evaluating named entity recognition models across many languages.
57 PAPERS • 3 BENCHMARKS
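As an illustration of the IOB2 format used by WikiANN, the following minimal sketch groups tagged tokens into entity spans. In IOB2, every entity begins with a `B-` tag, continues with `I-` tags of the same type, and tokens outside any entity are tagged `O`. The function name `iob2_to_spans` and the example sentence are hypothetical and are not taken from the dataset.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity and is skipped.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Illustrative sentence (an assumption, not a WikiANN sample).
tokens = ["Erna", "Solberg", "besøkte", "Stortinget", "i", "Oslo"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
print(iob2_to_spans(tokens, tags))
```

The sketch treats an `I-` tag that does not continue an open entity of the same type as malformed and drops it; stricter or more lenient handling is equally possible.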
OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The version of the dataset used for training multilingual models such as BART contains 138 GB of text.
55 PAPERS • NO BENCHMARKS YET
The Norwegian Parliamentary Speech Corpus (NPSC) is a speech corpus made by the Norwegian Language Bank at the National Library of Norway in 2019-2021. The NPSC consists of recordings of speech from Stortinget, the Norwegian parliament, and corresponding orthographic transcriptions in Norwegian Bokmål and Norwegian Nynorsk. All transcriptions are done manually by trained linguists or philologists and are subsequently proofread to ensure consistency and accuracy. The dataset covers entire days of parliamentary meetings.
2 PAPERS • 1 BENCHMARK
NorDial is a first step toward creating a corpus of dialectal variation in written Norwegian. It consists of a small corpus of tweets manually annotated as Bokmål, Nynorsk, any dialect, or a mix.
1 PAPER • NO BENCHMARKS YET
Automatic language identification is a challenging problem, and discriminating between closely related languages is especially difficult. This paper presents a machine-learning approach to automatic language identification for the Nordic languages, which existing state-of-the-art tools often miscategorize. Concretely, the focus is on discriminating between six Nordic languages: Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokmål), Faroese, and Icelandic. This is the data for the tasks. Two variants are provided, 10K and 50K, containing 10,000 and 50,000 examples for each language respectively.
1 PAPER • 1 BENCHMARK