The Reddit dataset is a graph dataset built from Reddit posts made in September 2014. The node label is the community, or “subreddit”, that a post belongs to. 50 large communities were sampled to build a post-to-post graph, in which two posts are connected if the same user comments on both. In total, the dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.
587 PAPERS • 13 BENCHMARKS
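As an illustration of the post-to-post construction described in the Reddit entry above, the following is a minimal sketch and not the dataset authors' code. It assumes the raw data is available as (user, post) comment pairs; the function names `build_post_graph` and `average_degree` and the toy input are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_post_graph(comments):
    """Return the undirected edges linking posts that share a commenter.

    `comments` is an iterable of (user, post_id) pairs (an assumed input format).
    """
    # Group the posts each user has commented on.
    posts_by_user = defaultdict(set)
    for user, post in comments:
        posts_by_user[user].add(post)

    # Connect every pair of posts that share at least one commenter.
    edges = set()
    for posts in posts_by_user.values():
        for a, b in combinations(sorted(posts), 2):
            edges.add((a, b))
    return edges


def average_degree(edges, num_nodes):
    """Average degree of an undirected graph: 2 * |E| / |V|."""
    return 2 * len(edges) / num_nodes if num_nodes else 0.0


# Toy example with hypothetical users and posts.
toy_comments = [
    ("u1", "p1"), ("u1", "p2"),
    ("u2", "p2"), ("u2", "p3"),
    ("u1", "p1"),  # repeated comments do not create duplicate edges
]
toy_edges = build_post_graph(toy_comments)
```

Storing each edge as a sorted pair in a set keeps the graph undirected and free of duplicates, which matters here because a single active user can link many posts at once.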
FaithDial is a new benchmark for hallucination-free dialogues, created by editing hallucinated responses in the Wizard of Wikipedia (WoW) benchmark.
12 PAPERS • NO BENCHMARKS YET
This dataset was collected with the goal of assessing dialog evaluation metrics. In the paper "USR: An Unsupervised and Reference Free Evaluation Metric for Dialog" (Mehri and Eskenazi, 2020), the authors collect this data to measure the quality of several existing word-overlap and embedding-based metrics, as well as their newly proposed USR metric.
7 PAPERS • 1 BENCHMARK
The Reddit Engagement Dataset (RED) is a distant-supervision set with 80k single-turn conversations. RED is sourced from Reddit by sampling from 43 popular subreddits, and is processed from a total of 5 million posts, filtering out data that was non-conversational or toxic, as well as posts whose popularity could not be ascertained.
1 PAPER • NO BENCHMARKS YET
The primary data of the SaGA corpus consist of 25 dialogs between 50 interlocutors, who engage in a spatial communication task combining direction-giving and sight description. Six of those dialogs, with data only from the direction giver, are available, including audio (.wav) and video (.mp4) data. The secondary data consist of annotations (*.eaf) of gestures and speech-gesture referents, which have been completely and systematically annotated based on an annotation grid (cf. the SaGA documentation). The corpus comprises 9881 isolated words and 1764 isolated gestures. The stimulus is a model of a town presented in a Virtual Reality (VR) environment. Upon finishing a "bus ride" through the VR town along five landmarks, a router explained the route as well as the wayside landmarks to an unknown and naive follower. The SaGA Corpus was curated for CLARIN as part of the Curation Project "Editing and Integration of Multimodal Resources in CLARIN-D" by the CLARIN-D Working Group 6.