The Amazon-Google dataset for entity resolution derives from the online retailer Amazon.com and the product search service of Google, accessible through the Google Base Data API. The dataset contains 1363 entities from Amazon.com and 3226 Google products as well as a gold standard (perfect mapping) with 1300 matching record pairs between the two data sources. The common attributes between the two data sources are product name, product description, manufacturer, and price.
19 PAPERS • 2 BENCHMARKS
The Abt-Buy dataset for entity resolution derives from the online retailers Abt.com and Buy.com. The dataset contains 1081 entities from Abt.com and 1092 entities from Buy.com as well as a gold standard (perfect mapping) with 1097 matching record pairs between the two data sources. The common attributes between the two data sources are product name, product description, and product price.
18 PAPERS • 2 BENCHMARKS
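Because both datasets above ship a gold standard of matching record pairs, a matcher's output can be scored by comparing its predicted pairs against that mapping. The following is a minimal sketch, assuming predictions and gold pairs are available as (left id, right id) tuples; the function name `evaluate_matches` and the toy identifiers are hypothetical and are not taken from the datasets.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (id in source A, id in source B)


def evaluate_matches(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Dict[str, float]:
    """Score predicted matching pairs against a gold-standard mapping."""
    predicted_set: Set[Pair] = set(predicted)
    gold_set: Set[Pair] = set(gold)

    # A true positive is a predicted pair that also appears in the gold standard.
    true_positives = len(predicted_set & gold_set)

    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy identifiers, for illustration only.
gold = [("abt_1", "buy_7"), ("abt_2", "buy_9"), ("abt_3", "buy_4")]
predicted = [("abt_1", "buy_7"), ("abt_2", "buy_8"), ("abt_3", "buy_4"), ("abt_5", "buy_5")]

scores = evaluate_matches(predicted, gold)
print(scores)
```

Treating pairs as a set means duplicate predictions are counted once, which fits a gold standard expressed as a list of distinct record pairs.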
SOTAB V2 features two annotation tasks: Column Type Annotation (CTA) and Columns Property Annotation (CPA). The goal of the CTA task is to annotate the columns of a table using 82 types from the Schema.org vocabulary, such as telephone, Duration, Mass, or Organization. The goal of the CPA task is to annotate pairs of table columns with one of 108 Schema.org properties, such as gtin, startDate, priceValidUntil, or recipeIngredient. The benchmark consists of 45,834 tables annotated for CTA and 30,220 tables annotated for CPA, originating from 55,511 different websites. The tables are split into training, validation, and test sets for both tasks. The tables cover 17 popular Schema.org types, including Product, LocalBusiness, Event, and JobPosting.
5 PAPERS • 2 BENCHMARKS
The WikiTables-TURL dataset was constructed by the authors of TURL and is based on the WikiTable corpus, which is a large collection of Wikipedia tables. The dataset consists of 580,171 tables divided into fixed training, validation and testing splits. Additionally, the dataset contains metadata about each table, such as the table name, table caption and column headers.
4 PAPERS • 3 BENCHMARKS
WDC Products is an entity matching benchmark that supports the systematic evaluation of matching systems along combinations of three dimensions while relying on real-world data. The three dimensions are
3 PAPERS • 3 BENCHMARKS
WDC SOTAB is a benchmark that features two annotation tasks: Column Type Annotation (CTA) and Columns Property Annotation (CPA). The goal of the CTA task is to annotate the columns of a table with 91 Schema.org types, such as telephone, duration, Place, or Organization. The goal of the CPA task is to annotate pairs of table columns with one of 176 Schema.org properties, such as gtin13, startDate, priceValidUntil, or recipeIngredient. The benchmark consists of 59,548 tables annotated for CTA and 48,379 tables annotated for CPA, originating from 74,215 different websites. The tables are split into training, validation, and test sets for both tasks. The tables cover 17 popular Schema.org types, including Product, LocalBusiness, Event, and JobPosting. The tables originate from the Schema.org Table Corpus.
2 PAPERS • 2 BENCHMARKS
WDC Block is a benchmark for comparing the performance of blocking methods that are used as part of entity resolution pipelines.
1 PAPER • 3 BENCHMARKS
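Blocking reduces the number of record pairs that a matcher has to compare by keeping only plausible candidates. As a rough illustration of what a blocking method and its evaluation can look like, the sketch below uses simple token blocking and two commonly used measures, pair completeness and reduction ratio. This is an assumption made for illustration, not the method or the metrics prescribed by WDC Block; the function names and the toy records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def token_blocking(left: Dict[str, str], right: Dict[str, str]) -> Set[Pair]:
    """Return candidate pairs whose titles share at least one lowercase token."""
    # Inverted index: token -> ids of right-hand records containing it.
    index = defaultdict(set)
    for right_id, title in right.items():
        for token in set(title.lower().split()):
            index[token].add(right_id)

    candidates: Set[Pair] = set()
    for left_id, title in left.items():
        for token in set(title.lower().split()):
            for right_id in index.get(token, ()):
                candidates.add((left_id, right_id))
    return candidates


def blocking_quality(candidates: Set[Pair], gold: Set[Pair],
                     n_left: int, n_right: int) -> Tuple[float, float]:
    """Pair completeness (recall of gold pairs) and reduction ratio."""
    pair_completeness = len(candidates & gold) / len(gold) if gold else 0.0
    reduction_ratio = 1 - len(candidates) / (n_left * n_right)
    return pair_completeness, reduction_ratio


# Hypothetical toy records, for illustration only.
left = {"a1": "canon eos camera", "a2": "sony headphones"}
right = {"b1": "Canon EOS 80D", "b2": "bose speaker", "b3": "sony wireless headphones"}
gold = {("a1", "b1"), ("a2", "b3")}

candidates = token_blocking(left, right)
pc, rr = blocking_quality(candidates, gold, len(left), len(right))
print(sorted(candidates), pc, rr)
```

A blocker is usually judged by the trade-off between the two numbers: it should retain as many true matches as possible while discarding most of the full cross product.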