RefCOCO is a referring expression dataset used for tasks such as referring expression generation (REG) and comprehension, i.e. understanding natural language expressions that refer to specific objects in images.
301 PAPERS • 19 BENCHMARKS
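To make concrete the kind of annotation a referring-expression dataset such as RefCOCO provides, here is a minimal sketch of a record that pairs an image region with the expressions referring to it; the field names and example values are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ReferringExpressionAnnotation:
    """Illustrative record for one referred object (field names are assumptions)."""
    image_id: int                             # image the region comes from
    ann_id: int                               # id of the referred object instance
    bbox: Tuple[float, float, float, float]   # (x, y, w, h) of the referred region
    category: str                             # object category of the referent
    expressions: List[str] = field(default_factory=list)  # phrases referring to this region

# Example usage with made-up values.
ref = ReferringExpressionAnnotation(
    image_id=42, ann_id=7,
    bbox=(10.0, 20.0, 50.0, 80.0),
    category="person",
    expressions=["the woman on the left", "left person in red"],
)
print(len(ref.expressions))
```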
DAVIS17 is a dataset for video object segmentation. It contains a total of 150 videos: 60 for training, 30 for validation, and 60 for testing.
270 PAPERS • 11 BENCHMARKS
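A minimal sketch of iterating the per-frame ground-truth masks of one DAVIS-style sequence, assuming the commonly used JPEGImages/Annotations directory layout of a local copy (the 480p folder name and file extensions are assumptions):

```python
import os
from glob import glob

import numpy as np
from PIL import Image

def load_sequence(davis_root: str, sequence: str):
    """Yield (frame, mask) pairs for one sequence, assuming the usual
    JPEGImages/480p/<seq>/*.jpg and Annotations/480p/<seq>/*.png layout."""
    frame_dir = os.path.join(davis_root, "JPEGImages", "480p", sequence)
    mask_dir = os.path.join(davis_root, "Annotations", "480p", sequence)
    for frame_path in sorted(glob(os.path.join(frame_dir, "*.jpg"))):
        name = os.path.splitext(os.path.basename(frame_path))[0]
        mask_path = os.path.join(mask_dir, name + ".png")
        frame = np.array(Image.open(frame_path))
        # In DAVIS17 the annotation is an indexed image: 0 = background,
        # 1..N = the individual object instances of the sequence.
        mask = np.array(Image.open(mask_path))
        yield frame, mask
```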
JHMDB is an action recognition dataset that consists of 960 video sequences belonging to 21 actions. It is a subset of the larger HMDB51 dataset collected from digitized movies and YouTube videos. The dataset provides, for each clip, a puppet flow per frame (approximated optical flow on the person), a puppet mask per frame, joint positions per frame, an action label per clip, and a meta label per clip (camera motion, visible body parts, camera viewpoint, number of people, video quality).
230 PAPERS • 8 BENCHMARKS
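The per-frame and per-clip annotations listed above can be grouped roughly as in the following sketch; the class and field names are illustrative assumptions rather than the official annotation format.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class JHMDBFrameAnnotation:
    """Illustrative per-frame annotation (names are assumptions)."""
    puppet_mask: np.ndarray    # binary person mask, shape (H, W)
    puppet_flow: np.ndarray    # approximated optical flow on the person, shape (H, W, 2)
    joints: np.ndarray         # 2D joint positions, shape (num_joints, 2)

@dataclass
class JHMDBClipAnnotation:
    """Illustrative per-clip annotation (names are assumptions)."""
    action_label: str                       # one of the 21 actions
    meta: dict                              # e.g. camera motion, visible body parts, viewpoint
    frames: List[JHMDBFrameAnnotation]      # per-frame annotations for the clip
```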
Our task is to localize an object and provide its pixel-level mask in all video frames, given a language referring expression obtained either by looking at the first frame only or at the full video. To validate our approach we employ two popular video object segmentation datasets, DAVIS16 [38] and DAVIS17 [42]. These datasets pose varied challenges, containing videos with single or multiple salient objects, crowded scenes, similar-looking instances, occlusions, camera view changes, fast motion, etc.
75 PAPERS • 5 BENCHMARKS
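Per-frame segmentation quality in this setting is commonly measured with region similarity, i.e. the Jaccard index between the predicted and ground-truth masks; a minimal sketch:

```python
import numpy as np

def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """Jaccard index J = |pred AND gt| / |pred OR gt| between two binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # both masks empty: define J as 1
        return 1.0
    inter = np.logical_and(pred, gt).sum()
    return float(inter) / float(union)

# A sequence-level score is then the mean of J over all annotated frames.
```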
Previous works [6, 10] constructed referring segmentation datasets for videos. Gavrilyuk et al. [6] extended the A2D [33] and J-HMDB [9] datasets with natural sentences; these datasets focus on describing the 'actors' and 'actions' appearing in videos, so the instance annotations are limited to a few object categories corresponding to the dominant 'actors' performing a salient 'action'. Khoreva et al. [10] built a dataset based on DAVIS [25], but its scale is barely sufficient to learn an end-to-end model from scratch.
34 PAPERS • 3 BENCHMARKS
A new large-scale dataset for referring expressions, based on MS-COCO.
30 PAPERS • 4 BENCHMARKS
The Actor-Action Dataset (A2D) by Xu et al. [29] serves as the largest video dataset for the general actor and action segmentation task. It contains 3,782 YouTube videos with pixel-level labels for actors and their actions. The dataset covers eight different actions, performed by a total of seven actor classes. We follow [29], who split the dataset into 3,036 training videos and 746 testing videos.
29 PAPERS • 1 BENCHMARK
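Because A2D labels each annotated pixel with both an actor class and an action class, the label space can be illustrated with a joint actor-action encoding; the sketch below assumes integer class maps and is not the dataset's actual file format.

```python
import numpy as np

NUM_ACTORS = 7    # actor classes in A2D
NUM_ACTIONS = 8   # action classes in A2D

def joint_actor_action_label(actor_map: np.ndarray, action_map: np.ndarray) -> np.ndarray:
    """Combine per-pixel actor ids (0..6) and action ids (0..7) into a single
    joint label id (0..55). Input and output shapes are (H, W).
    Note that not every actor-action pair actually occurs in the dataset."""
    return actor_map * NUM_ACTIONS + action_map

def split_joint_label(joint_map: np.ndarray):
    """Recover (actor_map, action_map) from the joint encoding."""
    return joint_map // NUM_ACTIONS, joint_map % NUM_ACTIONS
```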
PhraseCut is a dataset consisting of 77,262 images and 345,486 phrase-region pairs. The dataset is collected on top of the Visual Genome dataset and uses the existing annotations to generate a challenging set of referring phrases for which the corresponding regions are manually annotated.
23 PAPERS • 1 BENCHMARK
CLEVR-Ref+ is a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators.
16 PAPERS • 2 BENCHMARKS
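To make the modular-program idea concrete, here is a toy executor over a symbolic scene; the module names and scene schema are illustrative assumptions, not the dataset's exact specification. Executing the program module by module exposes the intermediate reasoning steps.

```python
# Toy symbolic scene: each object is a dict of attributes.
scene = [
    {"id": 0, "color": "red", "shape": "cube", "size": "large"},
    {"id": 1, "color": "blue", "shape": "sphere", "size": "small"},
    {"id": 2, "color": "red", "shape": "sphere", "size": "small"},
]

def filter_attr(objects, attr, value):
    """Keep only the objects whose attribute matches the value."""
    return [o for o in objects if o[attr] == value]

# A referring expression such as "the red sphere" corresponds to a small
# program; each step's output serves as intermediate ground truth.
program = [("filter", "color", "red"), ("filter", "shape", "sphere")]

result = scene
for op, attr, value in program:
    result = filter_attr(result, attr, value)
    print(op, attr, value, "->", [o["id"] for o in result])
# Final result: the object(s) referred to by the expression.
```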
We obtain A2Dre by selecting only the instances that were labeled as non-trivial, yielding 433 REs from 190 videos. We do not use the trivial cases, since analyzing such examples is not informative: their referents can be described by the category alone. Each annotator was presented with a RE, a video in which the target object was marked by a bounding box, and a set of questions paraphrasing our categories. A2Dre was annotated by three authors of the paper. Our final set of category annotations used for analysis was derived by majority voting: for each non-trivial RE, we kept all category labels that were assigned to it by at least two annotators.
1 PAPER • 1 BENCHMARK
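The majority-voting step described above is simple to express in code; a minimal sketch, assuming each annotator's labels for a RE are given as a set of category names:

```python
from collections import Counter
from typing import List, Set

def aggregate_by_majority(annotations: List[Set[str]], min_votes: int = 2) -> Set[str]:
    """Keep every category label assigned to a RE by at least `min_votes` annotators."""
    counts = Counter(label for labels in annotations for label in labels)
    return {label for label, n in counts.items() if n >= min_votes}

# Example: three annotators labelling one RE (hypothetical labels).
votes = [{"appearance", "location"}, {"appearance"}, {"appearance", "motion"}]
print(aggregate_by_majority(votes))   # -> {'appearance'}
```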
A2Dre is a subset of the A2D test set comprising 433 non-trivial REs. Due to its highly unbalanced distribution across the 7 semantic categories, we select the 4 major categories: appearance, location, motion, and static. These four categories have in common that, in most cases, for a given referent one RE can be provided that expresses a certain category and one that does not. We use these categories to augment A2Dre with additional REs, which vary according to the presence or absence of each of them. Specifically, based on our categorization of the original REs, for each RE re and category C we produce an additional RE re' by modifying re slightly such that it does (or does not) express C. For example, the RE "girl in yellow dress standing near the woman" could be categorized as appearance, location, no motion, and static, and we produce a new RE for each of these categories.
1 PAPER • NO BENCHMARKS YET
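The augmentation scheme can be pictured as attaching one toggled variant per category to each categorized RE; the sketch below uses the example RE from above, with the variant texts left as placeholders since the rewritten REs are authored manually (the class and field names are illustrative assumptions).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

CATEGORIES = ("appearance", "location", "motion", "static")

@dataclass
class AugmentedRE:
    original: str
    expresses: Dict[str, bool]     # which categories the original RE expresses
    variants: Dict[str, Optional[str]] = field(default_factory=dict)  # category -> toggled RE

re_entry = AugmentedRE(
    original="girl in yellow dress standing near the woman",
    expresses={"appearance": True, "location": True, "motion": False, "static": True},
    # For each category, a manually written variant flips its presence/absence;
    # the variant texts themselves are omitted here.
    variants={c: None for c in CATEGORIES},
)
```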