The CoNLL dataset is a widely used resource in the field of natural language processing (NLP). The term “CoNLL” stands for Conference on Natural Language Learning. It originates from a series of shared tasks organized at the Conferences of Natural Language Learning.
176 PAPERS • 49 BENCHMARKS
JFLEG is for developing and evaluating grammatical error correction (GEC). Unlike other corpora, it represents a broad range of language proficiency levels and uses holistic fluency edits to not only correct grammatical errors but also make the original text more native sounding.
87 PAPERS • 5 BENCHMARKS
CoNLL-2014 will continue the CoNLL tradition of having a high profile shared task in natural language processing. This year's shared task will be grammatical error correction, a continuation of the CoNLL shared task in 2013. A participating system in this shared task is given short English texts written by non-native speakers of English. The system detects the grammatical errors present in the input texts, and returns the corrected essays. The shared task in 2014 will require a participating system to correct all errors present in an essay (i.e., not restricted to just five error types in 2013). Also, the evaluation metric will be changed to F0.5, weighting precision twice as much as recall.
75 PAPERS • 2 BENCHMARKS
WI-LOCNESS is part of the Building Educational Applications 2019 Shared Task for Grammatical Error Correction. It consists of two datasets:
25 PAPERS • 2 BENCHMARKS
MuCGEC is a multi-reference multi-source evaluation dataset for Chinese Grammatical Error Correction (CGEC), consisting of 7,063 sentences collected from three different Chinese-as-a-Second-Language (CSL) learner sources. Each sentence has been corrected by three annotators, and their corrections are meticulously reviewed by an expert, resulting in 2.3 references per sentence.
14 PAPERS • 1 BENCHMARK
AKCES-GEC is a new dataset on grammatical error correction for Czech.
10 PAPERS • NO BENCHMARKS YET
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
6 PAPERS • NO BENCHMARKS YET
Grammatical error correction dataset for text from Wikipedia.
3 PAPERS • NO BENCHMARKS YET
Are you the kind of person who makes a lot of typos when writing code? Or are you the one who fixes them by making "fix typo" commits? Either way, thank you—you contributed to the state-of-the-art in the NLP field.
Grammatical error correction dataset for text from Yahoo! Answers
2 PAPERS • NO BENCHMARKS YET
a fine-grained corpus to detect, identify and correct the chinese grammatical errors. collected mainly from multi-choice questions in public school Chinese examinations with multiple references Online Evaluation Site for test set: https://codalab.lisn.upsaclay.fr/competitions/8020
1 PAPER • 1 BENCHMARK
Kor-Lang8 is a Korean grammatical error correction (GEC) dataset extracted from the NAIST Lang-8 Learner Corpora by the language label. It contains more than 109K sentence pairs.
1 PAPER • NO BENCHMARKS YET
Kor-Learner is a Korean grammatical error correction (GEC) dataset made from the NIKL learner corpus containing essays written by Korean learners and their grammatical error correction annotations by their tutors in an morpheme-level XML file format. It contains more than 28K sentence pairs.
Kor-Learner is a Korean grammatical error correction (GEC) dataset collected grammatically from two sources, and the correct sentences were read using Google Text-to-Speech(TTS) system. The general public was tasked with dictating grammatically correct sentences and transcribe them. It contains more than 17K sentence pairs.
NaSGEC is a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains. Previous CGEC research primarily focuses on correcting texts from a single domain, especially learner essays.