The PubMed dataset consists of 19717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary of 500 unique words.
1,063 PAPERS • 24 BENCHMARKS
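As a rough illustration of what a TF/IDF weighted word vector over a fixed dictionary looks like, here is a minimal sketch. The exact weighting variant used to build the PubMed features is not stated above, so the formula (term frequency times log inverse document frequency), the function name `tfidf_vectors`, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF weighted vector per document, restricted to `vocabulary`.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each vocabulary word.
    df = {w: sum(1 for d in docs if w in d) for w in vocabulary}
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d) or 1  # avoid division by zero for empty documents
        vec = []
        for w in vocabulary:
            tf = counts[w] / total
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

# Toy corpus (hypothetical, not taken from PubMed).
docs = [
    ["insulin", "glucose", "insulin"],
    ["glucose", "diet"],
    ["insulin", "diet", "exercise"],
]
vocabulary = ["insulin", "glucose", "diet", "exercise"]
vectors = tfidf_vectors(docs, vocabulary)
```

In the real dataset each vector would have 500 entries, one per dictionary word, instead of the four used here.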
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
492 PAPERS • 20 BENCHMARKS
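The 0/1-valued word vectors and citation links described for Cora can be pictured with a small sketch. The helper names `binary_word_vector` and `build_adjacency`, the toy dictionary, and the choice to treat citation links as undirected are assumptions for illustration only; they are not part of the dataset's distribution format.

```python
def binary_word_vector(words, dictionary):
    """Return a 0/1 vector marking which dictionary words occur in `words`."""
    present = set(words)
    return [1 if w in present else 0 for w in dictionary]

def build_adjacency(num_nodes, links):
    """Build an adjacency map from (citing, cited) index pairs.

    Assumption: links are treated as undirected, so each link is stored
    in both directions.
    """
    adj = {i: set() for i in range(num_nodes)}
    for src, dst in links:
        adj[src].add(dst)
        adj[dst].add(src)
    return adj

# Toy data (hypothetical): three publications over a five-word dictionary.
dictionary = ["graph", "neural", "learning", "kernel", "bayesian"]
papers = [
    ["graph", "neural", "learning"],
    ["kernel", "learning"],
    ["bayesian", "graph"],
]
features = [binary_word_vector(p, dictionary) for p in papers]
adjacency = build_adjacency(len(papers), [(0, 1), (2, 0)])
```

For Cora itself the feature matrix would be 2708 rows by 1433 columns, and the link list would hold 5429 entries.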
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.
314 PAPERS • 14 BENCHMARKS
DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with an abstract, authors, year, venue, and title. The dataset can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.
205 PAPERS • 5 BENCHMARKS
SNAP is a collection of large network datasets. It includes graphs representing social networks, citation networks, web graphs, online communities, online reviews and more.
150 PAPERS • NO BENCHMARKS YET
Orkut is a social network dataset consisting of the friendship network and ground-truth communities from the Orkut.com online social network, where users form friendships with each other.
79 PAPERS • NO BENCHMARKS YET
EmailEU is a directed temporal network constructed from email exchanges in a large European research institution over an 803-day period. It contains 986 email addresses as nodes and 332,334 emails as edges with timestamps. There are 42 ground-truth departments in the dataset.
30 PAPERS • NO BENCHMARKS YET
The Yeast dataset consists of a protein-protein interaction network. Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale datasets is important to present-day biology.
17 PAPERS • NO BENCHMARKS YET
Models character profiles and gives dialogue agents the ability to learn characters' language styles through their HLAs.
4 PAPERS • NO BENCHMARKS YET
Placenta is a benchmark dataset for node classification in an underexplored domain: predicting microanatomical tissue structures from cell graphs in placenta histology whole slide images. Cell graphs are large (>1 million nodes per image), node features are varied (64 dimensions of 11 types of cells), class labels are imbalanced (9 classes ranging from 0.21% of the data to 40.0%), and cellular communities cluster into heterogeneously distributed tissues of widely varying sizes (from 11 nodes to 44,671 nodes for a single structure).
2 PAPERS • 1 BENCHMARK
This dataset is a collection of undirected and unweighted LFR benchmark graphs as proposed by Lancichinetti et al. [1]. We generated the graphs using the code provided by Santo Fortunato on his personal website [2], embedded in our evaluation framework [3], with two different parameter sets. Let N denote the number of vertices in the network, then
1 PAPER • NO BENCHMARKS YET
Twitter-HyDrug is a real-world hypergraph dataset that describes the drug trafficking communities on Twitter. We first crawl the metadata (275,884,694 posts and 40,780,721 users) through the official Twitter API from Dec 2020 to Aug 2021. Afterward, we generate a drug keyword list covering 21 drug types that may cause drug overdose or drug addiction problems, and use it to filter the tweets that contain drug-relevant information. Based on the keyword list, we obtain 266,975 filtered drug-relevant posts by 54,680 users. Moreover, we define six types of drug communities based on the drug functions: cannabis, opioid, hallucinogen, stimulant, depressant, and other communities. Six researchers spent 62 days annotating these Twitter users into the six communities based on the annotation rules discussed in the next section. With the specific criteria, the six researchers annotated the filtered metadata separately. For Twitter users with conflicting labels, we conducted further discussion among annota
1 PAPER • 1 BENCHMARK