WebQuestions is a question answering dataset that uses Freebase as its knowledge base and contains 6,642 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk. The original split uses 3,778 examples for training and 2,032 for testing. All answers are defined as Freebase entities.
201 PAPERS • 3 BENCHMARKS
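To make the example format concrete, here is a minimal sketch of inspecting a WebQuestions pair with the Hugging Face `datasets` library. The dataset id "web_questions" and its field names reflect the Hub-hosted copy, which is an assumption about the reader's tooling rather than part of the original release.

```python
# Minimal sketch: inspecting a WebQuestions example via the Hugging Face
# `datasets` library. The id "web_questions" and the field names
# (question/answers/url) follow the Hub copy, not the original release.
from datasets import load_dataset

data = load_dataset("web_questions")
print(data)                 # splits: train (3,778 examples), test (2,032 examples)

example = data["train"][0]
print(example["question"])  # a Google-Suggest-style natural language question
print(example["answers"])   # list of Freebase entity names collected via Mechanical Turk
```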
The WebNLG corpus comprises sets of triples describing facts (entities and the relations between them) paired with the corresponding facts expressed in natural language text. The corpus contains sets of up to 7 triples, each set accompanied by one or more reference texts. The test set is split into two parts: seen, containing inputs created for entities and relations belonging to DBpedia categories that were seen in the training data, and unseen, containing inputs extracted for entities and relations belonging to 5 unseen categories.
143 PAPERS • 17 BENCHMARKS
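A hypothetical WebNLG-style instance, sketched as plain Python data, illustrates the input/output structure: a set of subject-predicate-object triples paired with reference verbalisations. The specific triples and texts below are illustrative, not quoted from the corpus.

```python
# Hypothetical WebNLG-style instance: a set of up to 7 (subject, predicate,
# object) triples paired with one or more human-written reference texts.
triple_set = [
    ("Alan_Bean", "occupation", "Test_pilot"),
    ("Alan_Bean", "mission", "Apollo_12"),
]
references = [
    "Alan Bean, who served as a test pilot, was a crew member of Apollo 12.",
    "Test pilot Alan Bean flew on the Apollo 12 mission.",
]

# A graph-to-text model learns the mapping triple_set -> reference text.
# The "seen" test split reuses DBpedia categories from training, while the
# "unseen" split draws on 5 categories never observed during training.
```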
Abstract GENeration DAtaset (AGENDA) is a dataset of knowledge graphs paired with scientific abstracts. The dataset consists of 40k paper titles and abstracts from the Semantic Scholar Corpus taken from the proceedings of 12 top AI conferences.
19 PAPERS • 1 BENCHMARK
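The pairing in AGENDA can be pictured as follows: a title, typed scientific entities, relation triples over those entities, and the target abstract. The field names, entity types, and content in this sketch are illustrative assumptions, not the official schema.

```python
# Hypothetical AGENDA-style record: a scientific knowledge graph (typed
# entities plus relations) paired with the paper title and target abstract.
record = {
    "title": "A neural model for graph-to-text generation",
    "entities": {
        "graph transformer": "Method",
        "abstract generation": "Task",
        "BLEU": "Metric",
    },
    "relations": [
        ("graph transformer", "used-for", "abstract generation"),
        ("BLEU", "evaluate-for", "abstract generation"),
    ],
    "abstract": "We present a graph transformer for abstract generation ...",
}
```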
GenWiki is a large-scale dataset for knowledge graph-to-text (G2T) and text-to-knowledge graph (T2G) conversion. It is introduced in the paper "GenWiki: A Dataset of 1.3 Million Content-Sharing Text and Graphs for Unsupervised Graph-to-Text Generation" by Zhijing Jin, Qipeng Guo, Xipeng Qiu, and Zheng Zhang at COLING 2020.
7 PAPERS • 2 BENCHMARKS
The PathQuestion (PQ) and PathQuestion-Large (PQL) datasets adopt two subsets of Freebase (Bollacker et al., 2008) as knowledge bases. Paths spanning two hops (es → r1 → e1 → r2 → a, denoted -2H) or three hops (es → r1 → e1 → r2 → e2 → r3 → a, denoted -3H) are extracted between two entities, and natural language questions are then generated from them with templates. To make the generated questions analogous to real-world questions, paraphrasing templates and synonyms for relations are gathered by searching the Internet and two real-world datasets, WebQuestions (Berant et al., 2013) and WikiAnswers (Fader et al., 2013). In this way, the syntactic structure and surface wording of the generated questions are greatly enriched.
7 PAPERS • 1 BENCHMARK
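The path-to-question construction can be made concrete with a small sketch: extract a 2-hop path from the KB, then fill a paraphrase template. The entity names, relation names, and template here are hypothetical.

```python
# Hypothetical sketch of PQ-style question generation: extract a 2-hop path
# (es -> r1 -> e1 -> r2 -> a) and fill a natural-language template. The
# answer `a` is held out as the gold label.
path_2h = ("barack_obama", "spouse", "michelle_obama", "place_of_birth", "chicago")
es, r1, e1, r2, answer = path_2h

# One of several paraphrase templates gathered from the web, WebQuestions,
# and WikiAnswers to diversify wording and syntax.
template = "where was the {r1} of {es} born"
question = template.format(r1=r1.replace("_", " "), es=es.replace("_", " "))

print(question)  # -> "where was the spouse of barack obama born"
print(answer)    # -> "chicago"
```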
WikiGraphs is a dataset of Wikipedia articles, each paired with a knowledge graph, to facilitate research in conditional text generation, graph generation, and graph representation learning. Existing graph-text paired datasets typically contain small graphs and short text (one or a few sentences), thus limiting the capabilities of the models that can be learned from the data.
3 PAPERS • 1 BENCHMARK
EventNarrative is a knowledge graph-to-text dataset constructed from publicly available open-world knowledge graphs. It consists of approximately 230,000 graphs and their corresponding natural language text.
2 PAPERS • 1 BENCHMARK
This dataset is an extensive, high-quality cross-lingual fact-to-text collection covering 11 languages: Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Oriya (or), Punjabi (pa), Tamil (ta), and Telugu (te), plus a monolingual dataset in English (en). It is the Wikipedia text <--> Wikidata KG aligned corpus used to train data-to-text generation models. The train and validation splits are created using distant supervision methods, while the test data is generated through human annotation.
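A hypothetical aligned instance, sketched in Python: Wikidata-style facts about a subject entity paired with a Wikipedia sentence that verbalises them. The facts, field names, and text below are illustrative; only the English side is shown.

```python
# Hypothetical fact-to-text instance: Wikidata-style (relation, object)
# facts about a subject entity, aligned with a Wikipedia sentence in the
# target language (English shown; the corpus also covers 11 Indic languages).
instance = {
    "subject": "M. S. Subbulakshmi",
    "facts": [
        ("occupation", "Carnatic singer"),
        ("award received", "Bharat Ratna"),
    ],
    "language": "en",
    "text": "M. S. Subbulakshmi was a Carnatic singer who received the Bharat Ratna.",
}
```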
ENT-DESC involves retrieving abundant knowledge of various types about main entities from a large knowledge graph (KG), which makes current graph-to-sequence models suffer severely from information loss and parameter explosion when generating the descriptions.
1 PAPER • 1 BENCHMARK