11 dataset results for Entity Resolution

Amazon-Google

The Amazon-Google dataset for entity resolution derives from the online retailer Amazon.com and the product search service of Google, accessible through the Google Base Data API. The dataset contains 1363 entities from Amazon.com and 3226 Google products as well as a gold standard (perfect mapping) with 1300 matching record pairs between the two data sources. The common attributes between the two data sources are: product name, product description, manufacturer and price.
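As a rough illustration of how a gold standard of matching record pairs is typically used, the sketch below scores a set of predicted matches against it with precision, recall and F1. The function name `evaluate_matches` and the record identifiers are hypothetical and are not taken from the dataset files.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (id in source A, id in source B); the id format is an assumption


def evaluate_matches(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Dict[str, float]:
    """Compare predicted matching pairs with a gold-standard mapping."""
    predicted_set: Set[Pair] = set(predicted)
    gold_set: Set[Pair] = set(gold)

    # A predicted pair counts as correct only if it appears in the gold standard.
    true_positives = len(predicted_set & gold_set)

    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example with made-up identifiers (not real dataset records).
gold = [("a1", "g1"), ("a2", "g2"), ("a3", "g3"), ("a4", "g4")]
predicted = [("a1", "g1"), ("a2", "g2"), ("a5", "g9")]
scores = evaluate_matches(predicted, gold)
print(scores)
```

In this toy case two of the three predicted pairs are correct and two of the four gold pairs are found, so precision is 2/3 and recall is 1/2.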

19 PAPERS • 2 BENCHMARKS

Abt-Buy

The Abt-Buy dataset for entity resolution derives from the online retailers Abt.com and Buy.com. The dataset contains 1081 entities from Abt.com and 1092 entities from Buy.com as well as a gold standard (perfect mapping) with 1097 matching record pairs between the two data sources. The common attributes between the two data sources are: product name, product description and product price.

18 PAPERS • 2 BENCHMARKS

WDC LSPM

Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes.

8 PAPERS • 4 BENCHMARKS

DBLP Temporal

DBLP Temporal is a dataset for temporal entity resolution, based on author profiles extracted from the Digital Bibliography and Library Project (DBLP).

6 PAPERS • 1 BENCHMARK

WDC Products

WDC Products is an entity matching benchmark that supports the systematic evaluation of matching systems along combinations of three dimensions while relying on real-world data. The three dimensions are

3 PAPERS • 3 BENCHMARKS

MusicBrainz20K

The MusicBrainz20K dataset for entity resolution and entity clustering is based on real records about songs from the MusicBrainz database. Each record is described with the following attributes: artist, title, album, year and length. The records have been modified with the DAPO [1] data generator. The generated dataset consists of five sources and approximately 20K records describing 10K unique song entities. It contains duplicates for 50% of the original records, spread over two to five sources; the duplicates are generated with a high degree of corruption to stress-test entity resolution and clustering approaches.
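To illustrate what entity clustering across several sources can look like, the sketch below groups records into clusters by taking the transitive closure of pairwise match decisions with a union-find structure. This is only one simple, commonly used strategy and an assumption here; it is not the clustering approach prescribed by the dataset. The function name `cluster_matches` and the record identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def cluster_matches(record_ids: Sequence[str],
                    matched_pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group records into entity clusters via transitive closure of matches."""
    parent: Dict[str, str] = {rid: rid for rid in record_ids}

    def find(x: str) -> str:
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters: Dict[str, List[str]] = defaultdict(list)
    for rid in record_ids:
        clusters[find(rid)].append(rid)
    # Sort for a deterministic result.
    return sorted(sorted(members) for members in clusters.values())


# Toy example: made-up ids of the form <source>_<record>.
records = ["s1_r1", "s2_r7", "s3_r4", "s1_r2", "s5_r9"]
matches = [("s1_r1", "s2_r7"), ("s2_r7", "s3_r4")]
clusters = cluster_matches(records, matches)
print(clusters)
```

Transitive closure is sensitive to false-positive matches, since a single wrong pair merges two clusters, which is one reason heavily corrupted duplicates make clustering harder.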

2 PAPERS • 1 BENCHMARK

Binette's 2022 Inventors Benchmark

Hand-disambiguation of a sample of U.S. patent inventor mentions from PatentsView.org.

1 PAPER • NO BENCHMARKS YET

CEREC

CEREC (Corpus for Entity Resolution in Email Conversations)

CEREC is a large-scale corpus for entity resolution in email conversations. The corpus consists of 6,001 email threads from the Enron Email Corpus containing 36,448 email messages and 60,383 entity coreference chains. The annotation is carried out as a two-step process with minimal manual effort.

1 PAPER • NO BENCHMARKS YET

MovieGraphBenchmark

The dataset contains entities from IMDB, TheMovieDB and TheTVDB with gold-standard matches between the sources. Due to the licensing of IMDB, we provide a script to build the IMDB part of the dataset yourself.

1 PAPER • NO BENCHMARKS YET

PIZZA

PIZZA is a dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

1 PAPER • NO BENCHMARKS YET

Weibo-Douban

Weibo-Douban (WD)

This dataset is used for user identity linkage across two online social networks in Chinese. It covers two popular Chinese social platforms: Sina Weibo (https://weibo.com) and Douban (https://www.douban.com).

1 PAPER • NO BENCHMARKS YET
