Language Identification
123 papers with code • 6 benchmarks • 19 datasets
Language identification is the task of determining the language of a text.
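As a rough illustration of the task definition above, the following is a minimal sketch of a character-trigram language identifier. It is a toy baseline, not the method of any paper listed on this page; the training sentences, the two-language setup, and the function names (`char_ngrams`, `build_profile`, `identify`) are all assumptions made for the example.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(samples, n=3):
    """Map each n-gram seen in the samples to its relative frequency."""
    counts = Counter()
    for sample in samples:
        counts.update(char_ngrams(sample, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def identify(text, profiles, n=3):
    """Pick the language whose profile gives the text's n-grams the highest total weight."""
    grams = char_ngrams(text, n)
    scores = {
        lang: sum(profile.get(gram, 0.0) for gram in grams)
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get)


# Hypothetical, hand-written training data; real systems train on large corpora.
TRAIN = {
    "en": [
        "the quick brown fox jumps over the lazy dog",
        "this is a short sentence in english",
    ],
    "de": [
        "der schnelle braune fuchs springt über den faulen hund",
        "dies ist ein kurzer satz auf deutsch",
    ],
}
PROFILES = {lang: build_profile(samples) for lang, samples in TRAIN.items()}

if __name__ == "__main__":
    print(identify("the dog is short", PROFILES))
    print(identify("der hund ist kurz", PROFILES))
```

With so little training data the sketch only separates inputs that share many trigrams with the samples. The benchmarks listed here cover far more languages and much harder inputs.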
Libraries
Use these libraries to find Language Identification models and implementations
Datasets
Most implemented papers
The WiLI benchmark dataset for written language identification
This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification.
SpeechBrain: A General-Purpose Speech Toolkit
SpeechBrain is an open-source and all-in-one speech toolkit.
Scaling Speech Technology to 1,000+ Languages
Expanding the language coverage of speech technology has the potential to improve access to information for many more people.
GlotLID: Language Identification for Low-Resource Languages
Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages.
Universal Dependency Parsing for Hindi-English Code-switching
We present a treebank of Hindi-English code-switching tweets under the Universal Dependencies scheme and propose a neural stacking model for parsing that efficiently leverages part-of-speech tag and syntactic tree annotations in the code-switching treebank and the preexisting Hindi and English treebanks.
Predicting the Type and Target of Offensive Posts in Social Media
We model the task hierarchically, identifying the type and the target of offensive messages in social media.
SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)
We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval).
Word-level Embeddings for Cross-Task Transfer Learning in Speech Processing
Recent breakthroughs in deep learning often rely on representation learning and knowledge transfer.
Common Voice: A Massively-Multilingual Speech Corpus
To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages.
VoxLingua107: a Dataset for Spoken Language Recognition
Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech.