The Hateful Memes dataset is a multimodal dataset for hateful meme detection (image + text) that contains 10,000+ new multimodal examples created by Facebook AI. Images were licensed from Getty Images so that researchers can use the dataset to support their work.
142 PAPERS • 1 BENCHMARK
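As a loading illustration, here is a minimal sketch assuming the commonly distributed release format: JSON-lines annotation files (e.g. train.jsonl) whose records pair an image path with the overlaid text and a binary label. The directory layout and field names (img, text, label) are assumptions about that release, not a guaranteed interface.

```python
import json
from pathlib import Path

from PIL import Image  # pip install pillow

# Assumed layout of the Hateful Memes release: a folder containing
# train.jsonl / dev.jsonl and an img/ subfolder referenced by the "img" field.
DATA_DIR = Path("data/hateful_memes")

def load_split(jsonl_name: str):
    """Yield (image, text, label) triples for one annotation file."""
    with open(DATA_DIR / jsonl_name, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            image = Image.open(DATA_DIR / record["img"]).convert("RGB")
            # The label may be absent in unlabeled test splits.
            yield image, record["text"], record.get("label")

if __name__ == "__main__":
    for image, text, label in load_split("train.jsonl"):
        print(image.size, label, text[:60])
        break
```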
LIAR is a publicly available dataset for fake news detection. It contains 12.8K manually labeled short statements collected over a decade in various contexts from POLITIFACT.COM, which provides a detailed analysis report and links to source documents for each case, so the dataset can also be used for fact-checking research. Notably, it is an order of magnitude larger than the previously largest public fake news datasets of a similar type. The statements were collected via POLITIFACT.COM's API, and each statement was evaluated by a POLITIFACT.COM editor for its truthfulness.
108 PAPERS • 1 BENCHMARK
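A minimal loading sketch, assuming the original tab-separated release (train.tsv / valid.tsv / test.tsv) in which the first three columns are a statement ID, the six-way truthfulness label, and the statement text. The full column list below is an assumption and should be checked against the release's README.

```python
import pandas as pd  # pip install pandas

# Assumed column order of the LIAR TSV release; verify against the dataset README.
LIAR_COLUMNS = [
    "id", "label", "statement", "subject", "speaker", "speaker_job",
    "state", "party", "barely_true_ct", "false_ct", "half_true_ct",
    "mostly_true_ct", "pants_on_fire_ct", "context",
]

def load_liar(path: str) -> pd.DataFrame:
    """Read one LIAR split (train.tsv, valid.tsv, or test.tsv) into a DataFrame."""
    return pd.read_csv(path, sep="\t", header=None, names=LIAR_COLUMNS)

if __name__ == "__main__":
    train = load_liar("liar_dataset/train.tsv")
    print(train["label"].value_counts())        # six-way truthfulness distribution
    print(train[["statement", "label"]].head())
```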
The DFDC (Deepfake Detection Challenge) is a dataset for deepfake detection consisting of more than 100,000 videos.
104 PAPERS • 1 BENCHMARK
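A hedged sketch of iterating over one training part of the public release, assuming each part folder ships .mp4 files plus a metadata.json mapping each file name to a REAL/FAKE label; that layout is an assumption about the release, and frame sampling here uses plain OpenCV.

```python
import json
from pathlib import Path

import cv2  # pip install opencv-python

# Assumed layout: a training part folder containing .mp4 files plus a
# metadata.json mapping "video.mp4" -> {"label": "REAL" | "FAKE", ...}.
PART_DIR = Path("dfdc_train_part_0")

def iter_labeled_frames(every_n: int = 30):
    """Yield (frame, label) pairs, keeping one frame out of every `every_n`."""
    metadata = json.loads((PART_DIR / "metadata.json").read_text())
    for video_name, info in metadata.items():
        cap = cv2.VideoCapture(str(PART_DIR / video_name))
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % every_n == 0:
                yield frame, info["label"]
            index += 1
        cap.release()

if __name__ == "__main__":
    frame, label = next(iter_labeled_frames())
    print(frame.shape, label)
```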
A new challenge set for multimodal classification, focusing on detecting hate speech in multimodal memes.
41 PAPERS • NO BENCHMARKS YET
CoAID includes diverse COVID-19 healthcare misinformation, including fake news on websites and social platforms, along with users' social engagement with such news. CoAID comprises 4,251 news articles, 296,000 related user engagements, 926 social platform posts about COVID-19, and ground-truth labels.
38 PAPERS • NO BENCHMARKS YET
A new publicly available dataset for verification of climate change-related claims.
30 PAPERS • 1 BENCHMARK
FakeNewsNet is collected from two fact-checking websites, GossipCop and PolitiFact, and contains news content with labels annotated by professional journalists and experts, along with social context information.
26 PAPERS • NO BENCHMARKS YET
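A hedged sketch of reading crawled FakeNewsNet articles into a DataFrame. The directory layout and file names below (one news content.json per article under a per-site, per-label folder) are assumptions about the output of the reference crawler and should be checked against the repository's documentation.

```python
import json
from pathlib import Path

import pandas as pd  # pip install pandas

# Assumed crawler output layout:
#   fakenewsnet_dataset/<site>/<label>/<news_id>/news content.json
# where <site> is "politifact" or "gossipcop" and <label> is "fake" or "real".
ROOT = Path("fakenewsnet_dataset")

def load_articles(site: str) -> pd.DataFrame:
    rows = []
    for label in ("fake", "real"):
        for content_file in (ROOT / site / label).glob("*/news content.json"):
            article = json.loads(content_file.read_text(encoding="utf-8"))
            rows.append({
                "news_id": content_file.parent.name,
                "label": label,
                "title": article.get("title", ""),
                "text": article.get("text", ""),
            })
    return pd.DataFrame(rows)

if __name__ == "__main__":
    politifact = load_articles("politifact")
    print(politifact["label"].value_counts())
```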
A large-scale curated dataset of over 152 million tweets, growing daily, related to COVID-19 chatter generated between January 1st and April 4th, 2020 at the time of writing.
10 PAPERS • 6 BENCHMARKS
MEIR is a substantially more challenging dataset than those previously available to support research into image repurposing detection. The dataset includes location, person, and organization manipulations on real-world data sourced from Flickr.
8 PAPERS • NO BENCHMARKS YET
NELA-GT-2018 is a dataset for the study of misinformation that consists of 713k articles collected between February and November 2018. These articles are collected directly from 194 news and media outlets including mainstream, hyper-partisan, and conspiracy sources. It includes ground truth ratings of the sources from 8 different assessment sites covering multiple dimensions of veracity, including reliability, bias, transparency, adherence to journalistic standards, and consumer trust.
For benchmarking, please refer to its variants UPFD-POL and UPFD-GOS.
7 PAPERS • 2 BENCHMARKS
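A hedged sketch of querying a NELA-GT release, assuming the variant of the release that ships articles as a SQLite database. The table and column names below are assumptions and should be checked against the release notes.

```python
import sqlite3

import pandas as pd  # pip install pandas

# Assumed distribution format: a SQLite database with an "articles" table.
# The table and column names ("source", "date", "title", "content") are
# assumptions; check the NELA-GT release notes for the actual schema.
DB_PATH = "nela-gt-2018.db"

def sample_articles(source: str, limit: int = 5) -> pd.DataFrame:
    """Return a few articles from one outlet for quick inspection."""
    with sqlite3.connect(DB_PATH) as conn:
        query = (
            "SELECT source, date, title, content "
            "FROM articles WHERE source = ? LIMIT ?"
        )
        return pd.read_sql_query(query, conn, params=(source, limit))

if __name__ == "__main__":
    # "infowars" is only an example outlet name; replace it with a source
    # that is actually present in the database you downloaded.
    print(sample_articles("infowars"))
```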
ArCOV-19 is an Arabic COVID-19 Twitter dataset that covers the period from 27 January until 30 April 2020. It is the first publicly available Arabic Twitter dataset covering the COVID-19 pandemic, and it includes over 1M tweets alongside the propagation networks of the most popular subset of them (i.e., the most retweeted and liked).
6 PAPERS • NO BENCHMARKS YET
CHECKED is a Chinese dataset on COVID-19 misinformation. It provides ground truth on credibility, carefully obtained by ensuring that specific sources are used. CHECKED includes microblogs related to COVID-19, identified using a specific list of keywords and covering a total of 2,120 microblogs published from December 2019 to August 2020. The dataset contains a rich set of multimedia information for each microblog, including the ground-truth label and textual, visual, response, and social network information.
5 PAPERS • NO BENCHMARKS YET
NELA-GT-2019 is an updated version of the NELA-GT-2018 dataset. NELA-GT-2019 contains 1.12M news articles from 260 sources collected between January 1st 2019 and December 31st 2019. Just as with NELA-GT-2018, these sources come from a wide range of mainstream news sources and alternative news sources. Included with the dataset are source-level ground truth labels from 7 different assessment sites covering multiple dimensions of veracity.
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events, and domains, in 41 different languages, spanning more than a decade.
4 PAPERS • 3 BENCHMARKS
NELA-GT-2020 is an updated version of the NELA-GT-2019 dataset. NELA-GT-2020 contains nearly 1.8M news articles from 519 sources collected between January 1st, 2020 and December 31st, 2020. Just as with NELA-GT-2018 and NELA-GT-2019, these sources come from a wide range of mainstream news sources and alternative news sources. Included in the dataset are source-level ground truth labels from Media Bias/Fact Check (MBFC) covering multiple dimensions of veracity. Additionally, new in the 2020 dataset are the Tweets embedded in the collected news articles, adding an extra layer of information to the data.
4 PAPERS • NO BENCHMARKS YET
Some Like it Hoax is a fake news detection dataset consisting of 15,500 Facebook posts and 909,236 users.
ArCOV19-Rumors is an Arabic COVID-19 Twitter dataset for misinformation detection composed of tweets containing claims from 27 January until the end of April 2020.
3 PAPERS • NO BENCHMARKS YET
COVID-Q consists of COVID-19 questions which have been annotated into a broad category (e.g. Transmission, Prevention) and a more specific class such that questions in the same class are all asking the same thing.
CoVERT is a fact-checked corpus of tweets focused on the domain of biomedicine and COVID-19-related (mis)information. The corpus consists of 300 tweets, each annotated with medical named entities and relations. A novel crowdsourcing methodology was employed to annotate all tweets with fact-checking labels and supporting evidence, which crowdworkers searched for online; this methodology results in moderate inter-annotator agreement.
CoVaxLies v1 includes 17 known Misinformation Targets (MisTs) found on Twitter about the COVID-19 vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each MisT. This collection is a first step in providing large-scale resources for misinformation detection and misinformation stance identification.
GeoCoV19 is a large-scale Twitter dataset containing more than 524 million multilingual tweets. The dataset contains around 378K geotagged tweets and 5.4 million tweets with Place information. For the annotations, toponyms are extracted from the user location field and the tweet content and resolved to geolocations at the country, state, or city level. In this way, 297 million tweets are annotated with a geolocation using the user location field and 452 million tweets using the tweet content.
Stanceosaurus is a corpus of 28,033 tweets in English, Hindi, and Arabic annotated with stance towards 251 misinformation claims. The claims in Stanceosaurus originate from 15 fact-checking sources that cover diverse geographical regions and cultures. Unlike existing stance datasets, it introduces a more fine-grained 5-class labeling strategy with additional subcategories to distinguish implicit stance.
The Gossipcop variant of the UPFD dataset for benchmarking.
3 PAPERS • 1 BENCHMARK
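Both UPFD variants are exposed through PyTorch Geometric's torch_geometric.datasets.UPFD loader. The sketch below assumes that package is installed and that the interface matches the documented signature (name in {"politifact", "gossipcop"}, feature in {"profile", "spacy", "bert", "content"}); "gossipcop" corresponds to UPFD-GOS and "politifact" to UPFD-POL.

```python
# pip install torch torch_geometric
from torch_geometric.datasets import UPFD
from torch_geometric.loader import DataLoader

# Assumed PyTorch Geometric interface: UPFD(root, name, feature, split).
# Each sample is one news propagation graph with a real/fake label.
train_set = UPFD(root="data/UPFD", name="gossipcop", feature="bert", split="train")
test_set = UPFD(root="data/UPFD", name="gossipcop", feature="bert", split="test")

print(train_set)     # number of propagation-graph samples
print(train_set[0])  # one graph: node features x, edge_index, label y

loader = DataLoader(train_set, batch_size=64, shuffle=True)
for batch in loader:
    print(batch.num_graphs, batch.x.shape)
    break
```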
CoVaxFrames includes 113 Vaccine Hesitancy Framings found on Twitter about the COVID-19 vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each framing.
2 PAPERS • NO BENCHMARKS YET
CoVaxLies v2 includes 47 Misinformation Targets (MisTs) found on Twitter about the COVID-19 vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each MisT. This collection is a first step in providing large-scale resources for misinformation detection and misinformation stance identification.
Covid-HeRA is a dataset for health risk assessment and severity-informed decision making in the presence of COVID-19 misinformation. It is a benchmark dataset for risk-aware health misinformation detection related to the 2019 coronavirus pandemic. Social media posts (tweets) are annotated based on the perceived likelihood of health behavioural changes and the perceived corresponding risks of following unreliable advice found online.
Detecting out-of-context media, such as "mis-captioned" images on Twitter, is a relevant problem, especially in domains of high public significance. Twitter-COMMs is a large-scale multimodal dataset with 884k tweets relevant to the topics of Climate Change, COVID-19, and Military Vehicles. This dataset can be used to develop methods to detect misinformation on social media platforms related to these three topics.
The PolitiFact variant of the UPFD dataset for benchmarking.
2 PAPERS • 1 BENCHMARK
This dataset accompanies a paper analysing two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com), which publish content in Arabic, Chinese, English, French, German, and Spanish.
1 PAPER • NO BENCHMARKS YET
HpVaxFrames includes 64 Vaccine Hesitancy Framings found on Twitter about the HPV vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each framing.
MMVax-Stance includes 113 Vaccine Hesitancy Framings found on Twitter about the COVID-19 vaccines. Language experts annotated multimodal image-text tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each framing.
This dataset of medical misinformation was collected and is published by the Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains the full texts of the articles, their original source URLs, and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of an annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance towards the claim.
NAIST COVID is a multilingual dataset of social media posts related to COVID-19, consisting of microblogs in English and Japanese from Twitter and those in Chinese from Weibo. The data cover microblogs from January 20, 2020, to March 24, 2020.
NELA-GT-2021 is the fourth installment of the NELA-GT datasets. It contains 1.8M articles from 367 outlets published between January 1st, 2021 and December 31st, 2021. Just as in past releases, NELA-GT-2021 includes outlet-level veracity labels from Media Bias/Fact Check and the tweets embedded in the collected news articles.
Combines CoVaxFrames and HpVaxFrames into a unified dataset of 113 Vaccine Hesitancy Framings found on Twitter about the COVID-19 vaccines and 64 Vaccine Hesitancy Framings found on Twitter about the HPV vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each framing.
A Natural Language Resource for Learning to Recognize Misinformation about the COVID-19 and HPV Vaccines.
The CIDII dataset is a binary classification dataset consisting of two classes, correct information and disinformation, related to Islamic issues. The dataset accompanies the research paper "Disinformation Detection about Islamic Issues on Social Media Using Deep Learning Techniques", published in the MJCS journal at the link below: https://ejournal.um.edu.my/index.php/MJCS/article/view/41935
0 PAPERS • NO BENCHMARKS YET