The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
10,147 PAPERS • 92 BENCHMARKS
LVIS is a dataset for long tail instance segmentation. It has annotations for over 1000 object categories in 164k images.
434 PAPERS • 14 BENCHMARKS
Few-Shot Object Detection Dataset (FSOD) is a highly diverse dataset specifically designed for few-shot object detection, built to evaluate the generality of a model on novel categories.
62 PAPERS • NO BENCHMARKS YET
The ELEVATER benchmark is a collection of resources for training, evaluating, and analyzing language-image models on image classification and object detection.
22 PAPERS • 2 BENCHMARKS
A large-scale logo image database for logo detection and brand recognition from real-world product images.
3 PAPERS • NO BENCHMARKS YET
Millions of people around the world have low or no vision. Assistive software applications have been developed for a variety of day-to-day tasks, including currency recognition. To aid with this task, we present BankNote-Net, an open dataset for assistive currency recognition. The dataset consists of a total of 24,816 embeddings of banknote images captured in a variety of assistive scenarios, spanning 17 currencies and 112 denominations. These compliant embeddings were learned using supervised contrastive learning and a MobileNetV2 architecture, and they can be used to train and test specialized downstream models for any currency, including those not covered by our dataset or for which only a few real images per denomination are available (few-shot learning). We deploy a variation of this model for public use in the latest version of the Seeing AI app developed by Microsoft, which has over 100,000 monthly active users.
1 PAPER • NO BENCHMARKS YET
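The BankNote-Net description above suggests training lightweight downstream models directly on the released embeddings. As a minimal sketch of that idea (not the authors' code), the following builds a nearest-centroid few-shot classifier over embedding vectors; the 256-dimensional embeddings and the synthetic "denomination" data are illustrative assumptions standing in for real BankNote-Net vectors.

```python
import numpy as np

def fit_centroids(support_emb, support_labels):
    """Average the support embeddings per class to get one centroid each."""
    classes = np.unique(support_labels)
    centroids = np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in classes]
    )
    return classes, centroids

def predict(query_emb, classes, centroids):
    """Assign each query embedding to its nearest class centroid (L2 distance)."""
    dists = np.linalg.norm(
        query_emb[:, None, :] - centroids[None, :, :], axis=-1
    )
    return classes[dists.argmin(axis=1)]

# Placeholder embeddings: 3 hypothetical denominations, 5 support
# examples each, mimicking a few-shot split of precomputed vectors.
rng = np.random.default_rng(0)
dim = 256  # assumed embedding dimension, for illustration only
protos = rng.normal(size=(3, dim))
support = np.repeat(protos, 5, axis=0) + 0.05 * rng.normal(size=(15, dim))
labels = np.repeat(np.arange(3), 5)

classes, cents = fit_centroids(support, labels)
queries = protos + 0.05 * rng.normal(size=(3, dim))
preds = predict(queries, classes, cents)
print(preds)
```

With embeddings this well-separated, each query lands on its own class centroid; on real few-shot data one would typically also L2-normalize the embeddings or use cosine distance.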
VizWiz-FewShot is a few-shot localization dataset originating from photographers who authentically were trying to learn about the visual content in the images they took. It includes nearly 10,000 segmentations of 100 categories in over 4,500 images that were taken by people with visual impairments.