Introduced by Hamilton et al. in Inductive Representation Learning on Large Graphs

The Reddit dataset is a graph dataset built from Reddit posts made in September 2014. Each node is a post, and its label is the community, or “subreddit”, that the post belongs to. 50 large communities were sampled to build a post-to-post graph in which two posts are connected if the same user commented on both. In total, the dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). Node features are built from off-the-shelf 300-dimensional GloVe CommonCrawl word vectors.
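The construction described above can be illustrated with a small sketch. This is not the authors' preprocessing code: the function names `build_post_graph` and `temporal_split`, the input layout (a list of `(user, post_id)` comment pairs and a post-to-day mapping), and the toy data are all assumptions made for illustration. The sketch also leaves the held-out posts unsplit, since the description does not spell out exactly how the 30% validation portion is drawn.

```python
from collections import defaultdict
from itertools import combinations


def build_post_graph(comments):
    """Connect two posts if the same user commented on both.

    `comments` is an iterable of (user, post_id) pairs (hypothetical layout).
    Returns a set of undirected edges stored as sorted (post_a, post_b) tuples.
    """
    posts_by_user = defaultdict(set)
    for user, post in comments:
        posts_by_user[user].add(post)

    edges = set()
    for posts in posts_by_user.values():
        # Every pair of posts sharing a commenter gets one undirected edge.
        for a, b in combinations(sorted(posts), 2):
            edges.add((a, b))
    return edges


def temporal_split(post_day, train_days=20):
    """Split posts by day of the month: days 1..train_days go to training,
    later days are held out (for testing and validation)."""
    train = sorted(p for p, d in post_day.items() if d <= train_days)
    held_out = sorted(p for p, d in post_day.items() if d > train_days)
    return train, held_out


# Toy data, invented for illustration only.
comments = [("u1", "p1"), ("u1", "p2"), ("u2", "p2"), ("u2", "p3"), ("u3", "p4")]
post_day = {"p1": 3, "p2": 12, "p3": 25, "p4": 28}

edges = build_post_graph(comments)
train, held_out = temporal_split(post_day)

# Average degree of an undirected graph: 2 * |E| / |V|.
avg_degree = 2 * len(edges) / len(post_day)
```

By the same formula, 232,965 posts with an average degree of 492 correspond to roughly 232,965 × 492 / 2 ≈ 57 million undirected edges, so a pairwise construction like this one would need a more memory-efficient edge representation at full scale.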

Source: https://arxiv.org/pdf/1706.02216.pdf

Benchmarks

Task	Dataset Variant	Best Model
Node Classification	Reddit	BNS-GCN
Graph Classification	REDDIT-B	CRaWl
Question Answering	squadshifts reddit	deepset/roberta-large-squad2
Text Summarization	Reddit TIFU	PEGASUS 2B + SLiC
Conversational Response Selection	PolyAI Reddit	Multi-context ConveRT
Graph Classification	REDDIT-BINARY	CT-Layer
Graph Classification	REDDIT-MULTI-12K	GNN
Sarcasm Detection	FigLang 2020 Reddit Dataset	BERT+Aspect-based approaches
Classification	Reddit Ideology Database	SVM
Graph Classification	REDDIT-MULTI-5k	GraphSAGE
Dialogue Generation	Reddit (multi-ref)	SpaceFusion
Dynamic Link Prediction	Reddit	DyG2Vec
Sequence-to-sequence Language Modeling	Reddit	pegasus-reddit