Dataset of hate speech annotated on Internet forum posts in English at sentence-level. The source forum in Stormfront, a large online community of white nacionalists. A total of 10,568 sentence have been been extracted from Stormfront and classified as conveying hate speech or not.
161 PAPERS • 1 BENCHMARK
HSOL is a dataset for hate speech detection. The authors begun with a hate speech lexicon containing words and phrases identified by internet users as hate speech, compiled by Hatebase.org. Using the Twitter API they searched for tweets containing terms from the lexicon, resulting in a sample of tweets from 33,458 Twitter users. They extracted the time-line for each user, resulting in a set of 85.4 million tweets. From this corpus they took a random sample of 25k tweets containing terms from the lexicon and had them manually coded by CrowdFlower (CF) workers. Workers were asked to label each tweet as one of three categories: hate speech, offensive but not hate speech, or neither offensive nor hate speech.
62 PAPERS • NO BENCHMARKS YET
Torque is an English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships.
16 PAPERS • 1 BENCHMARK
TalkDown is a labelled dataset for condescension detection in context. The dataset is derived from Reddit, a set of online communities that is diverse in content and tone. The dataset is built from COMMENT and REPLY pairs in which the REPLY targets a specific quoted span (QUOTED) in the COMMENT as being condescending. The dataset contains 3,255 positive (condescend) samples and 3,255 negative ones.
7 PAPERS • NO BENCHMARKS YET
A corpus of Offensive Language and Hate Speech Detection for Danish
4 PAPERS • 1 BENCHMARK
A dataset of games played in the card game "Cards Against Humanity" (CAH), by human players, derived from the online CAH labs. Each round includes the cards presented to users - a "black" prompt with a blank or question and 10 "white" punchlines as possible responses, and which punchline was picked by a player each round, along with text and metadata.
1 PAPER • NO BENCHMARKS YET
CoRAL is a language and culturally aware Croatian Abusive dataset covering phenomena of implicitness and reliance on local and global context.
The PointDenoisingBenchmark dataset features 28 different shapes, split into 18 training shapes and 10 test shapes.
The ComMA Dataset v0.2 is a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the "context" in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also the "type" of discursive role that the comment is performing with respect to the previous comment. The initial dataset, being discussed here (and made available as part of the ComMA@ICON shared task), consists of a total 15,000 annotated comments in four languages - Meitei, Bangla, Hindi, and Indian English - collected from various social media platforms such as YouTube, Facebook, Twitter and Telegram. As is usual on social media websites, a large number of these comments are multilingual, mostly code-mixed with English.