The Reddit dataset is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. 50 large communities have been sampled to build a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.
587 PAPERS • 13 BENCHMARKS
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. Big-bench include more than 200 tasks.
222 PAPERS • 134 BENCHMARKS
This dataset was designed for contextual investigations, with related works making considerable usage of said context. The dataset was constructed by scraping Reddit comments; with sarcastic entries being self-annotated by authors through the use of the \s token, which indicates sarcastic intent on the website. Posts on Reddit are often in response to another comment; SARC incorporates this information through the addition of the parent comment and further child comments surrounding a post.
32 PAPERS • 3 BENCHMARKS
iSarcasmEval is the first shared task to target intended sarcasm detection: the data for this task was provided and labelled by the authors of the texts themselves. Such an approach minimises the downfalls of other methods to collect sarcasm data, which rely on distant supervision or third-party annotations. The shared task contains two languages, English and Arabic, and three subtasks: sarcasm detection, sarcasm category classification, and pairwise sarcasm identification given a sarcastic sentence and its non-sarcastic rephrase. The task received submissions from 60 different teams, with the sarcasm detection task being the most popular. Most of the participating teams utilised pre-trained language models. In this paper, we provide an overview of the task, data, and participating teams.
20 PAPERS • NO BENCHMARKS YET
iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic. Each sarcastic tweet is further labelled for one of the following types of ironic speech:
17 PAPERS • 1 BENCHMARK
ArSarcasm-v2 is an extension of the original ArSarcasm dataset published along with the paper From Arabic Sentiment Analysis to Sarcasm Detection: The ArSarcasm Dataset. ArSarcasm-v2 conisists of ArSarcasm along with portions of DAICT corpus and some new tweets. Each tweet was annotated for sarcasm, sentiment and dialect. The final dataset consists of 15,548 tweets divided into 12,548 training tweets and 3,000 testing tweets. ArSarcasm-v2 was used and released as a part of the shared task on sarcasm detection and sentiment analysis in Arabic.
14 PAPERS • NO BENCHMARKS YET
ArSarcasm is a new Arabic sarcasm detection dataset. The dataset was created using previously available Arabic sentiment analysis datasets (SemEval 2017 and ASTD) and adds sarcasm and dialect labels to them. The dataset contains 10,547 tweets, 1,682 (16%) of which are sarcastic.
12 PAPERS • NO BENCHMARKS YET
MUStARD++ is a multimodal sarcasm detection dataset (MUStARD) pre-annotated with 9 emotions. It can be used for the task of detecting the emotion in a sarcastic statement.
7 PAPERS • 1 BENCHMARK
The Sarcasm Corpus contains sarcastic and non-sarcastic utterances of three different types, which are balanced with half of the samples being sarcastic and half non-sarcastic. The three types are:
5 PAPERS • NO BENCHMARKS YET
The Headlines dataset for sarcasm detection is collected from two news website. TheOnion aims at producing sarcastic versions of current events. The dataset includes all the headlines from News in Brief and News in Photos categories (which are sarcastic) and real (and non-sarcastic) news headlines from HuffPost. This dataset has following advantages over the existing Twitter datasets:
4 PAPERS • NO BENCHMARKS YET
This dataset is an extension of MASAC, a multimodal, multi-party, Hindi-English code-mixed dialogue dataset compiled from the popular Indian TV show, ‘Sarabhai v/s Sarabhai’. WITS was created by augmenting MASAC with natural language explanations for each sarcastic dialogue. The dataset consists of the transcribed sarcastic dialogues from 55 episodes of the TV show, along with audio and video multimodal signals. It was designed to facilitate Sarcasm Explanation in Dialogue (SED), a novel task aimed at generating a natural language explanation for a given sarcastic dialogue, that spells out the intended irony. Each data instance in WITS is associated with a corresponding video, audio, and textual transcript where the last utterance is sarcastic in nature. All the final selected explanations contain the following attributes:
4 PAPERS • 2 BENCHMARKS
A first-of-its-kind large dataset of sarcastic/non-sarcastic tweets with high-quality labels and extra features: (1) sarcasm perspective labels (2) new contextual features. The dataset is expected to advance sarcasm detection research.
3 PAPERS • NO BENCHMARKS YET
A dataset of games played in the card game "Cards Against Humanity" (CAH), by human players, derived from the online CAH labs. Each round includes the cards presented to users - a "black" prompt with a blank or question and 10 "white" punchlines as possible responses, and which punchline was picked by a player each round, along with text and metadata.
1 PAPER • NO BENCHMARKS YET
We release the MUStARD dataset which is a multimodal video corpus for research in automated sarcasm discovery. The dataset is compiled from popular TV shows including Friends, The Golden Girls, The Big Bang Theory, and Sarcasmaholics Anonymous. MUStARD consists of audiovisual utterances annotated with sarcasm labels. Each utterance is accompanied by its context, which provides additional information on the scenario where the utterance occurs.