HICO-DET is a dataset for detecting human-object interactions (HOI) in images. It contains 47,776 images (38,118 in the train set and 9,658 in the test set) and 600 HOI categories formed from 80 object categories and 117 verb classes. HICO-DET provides more than 150k annotated human-object pairs.
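As a rough illustration of what a single annotated human-object pair carries, a minimal Python sketch follows; the class name, field names, and values are assumptions for illustration only and do not reflect HICO-DET's actual annotation file format.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class HOIAnnotation:
    human_box: Tuple[float, float, float, float]   # (x1, y1, x2, y2) person box
    object_box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) object box
    object_category: str   # one of the 80 object categories
    verb: str              # one of the 117 verb classes
    hoi_id: int            # index into the 600 valid verb-object combinations

# Hypothetical example pair; values are illustrative only.
example = HOIAnnotation(
    human_box=(34.0, 20.0, 210.0, 380.0),
    object_box=(180.0, 150.0, 320.0, 400.0),
    object_category="bicycle",
    verb="ride",
    hoi_id=28,
)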
156 PAPERS • 5 BENCHMARKS
Verbs in COCO (V-COCO) is a dataset that builds off COCO for human-object interaction detection. V-COCO provides 10,346 images (2,533 for training, 2,867 for validation and 4,946 for testing) and 16,199 person instances. Each person has annotations for 29 action categories, and there are no interaction labels involving objects.
137 PAPERS • 1 BENCHMARK
FineGym is an action recognition dataset built on top of gymnasium videos. Compared with existing action recognition datasets, FineGym is distinguished by its richness, quality, and diversity. In particular, it provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy. For example, a "balance beam" event is annotated as a sequence of elementary sub-actions drawn from five sets: "leap-jumphop", "beam-turns", "flight-salto", "flight-handspring", and "dismount", where the sub-action in each set is further annotated with finely defined class labels. This level of granularity presents significant challenges for action recognition, e.g., how to parse the temporal structure of a coherent action, and how to distinguish between subtly different action classes.
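A hedged sketch of the three-level hierarchy (event, sub-action set, fine-grained element) as nested Python structures; the dictionary layout and the level-3 element labels below are illustrative assumptions, not FineGym's actual annotation schema.

# Hypothetical, simplified view of FineGym's three-level hierarchy:
# event -> sub-action set -> finely defined element label.
finegym_hierarchy = {
    "balance_beam": {                         # level 1: event
        "beam-turns": [                       # level 2: sub-action set
            "turn_on_one_leg",                # level 3: illustrative element labels
            "turn_in_tuck_stand",
        ],
        "flight-salto": ["salto_backward_tucked"],
        "dismount": ["double_salto_backward_tucked"],
    },
}

def element_labels(hierarchy, event):
    """Flatten all fine-grained element labels under one event."""
    return [label for sub_set in hierarchy[event].values() for label in sub_set]

print(element_labels(finegym_hierarchy, "balance_beam"))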
56 PAPERS • NO BENCHMARKS YET
HICO is a benchmark for recognizing human-object interactions (HOI).
45 PAPERS • 2 BENCHMARKS
Inferring human-scene contact (HSC) is the first step toward understanding how humans interact with their surroundings. While detecting 2D human-object interaction (HOI) and reconstructing 3D human pose and shape (HPS) have enjoyed significant progress, reasoning about 3D human-scene contact from a single image is still challenging. Existing HSC detection methods consider only a few types of predefined contact, often reduce the body and scene to a small number of primitives, and even overlook image evidence. To predict human-scene contact from a single image, we address the limitations above from both data and algorithmic perspectives. We capture a new dataset called RICH for “Real scenes, Interaction, Contact and Humans.” RICH contains multiview outdoor/indoor video sequences at 4K resolution, ground-truth 3D human bodies captured using markerless motion capture, 3D body scans, and high-resolution 3D scene scans. A key feature of RICH is that it also contains accurate vertex-level contact labels on the bodies.
39 PAPERS • 1 BENCHMARK
BEHAVE is a full-body human-object interaction dataset with multi-view RGB-D frames and corresponding 3D SMPL and object fits, along with annotated contacts between them. The dataset contains ~15k frames captured at 5 locations, with 8 subjects performing a wide range of interactions with 20 common objects.
33 PAPERS • 3 BENCHMARKS
HAKE is built upon existing activity datasets and provides human body part-level atomic action labels (Part States).
14 PAPERS • NO BENCHMARKS YET
The MECCANO dataset is the first dataset of egocentric videos for studying human-object interactions in industrial-like settings. The MECCANO dataset was acquired in an industrial-like scenario in which subjects built a toy model of a motorbike. We considered 20 object classes, which include the 16 classes categorizing the 49 components, the two tools (screwdriver and wrench), the instructions booklet, and a partial_model class.
14 PAPERS • 3 BENCHMARKS
A large-scale 4D egocentric dataset with rich annotations, designed to catalyze research on category-level human-object interaction. HOI4D consists of 2.4M RGB-D egocentric video frames over 4,000 sequences, collected by 4 participants interacting with 800 different object instances from 16 categories across 610 different indoor rooms.
13 PAPERS • NO BENCHMARKS YET
The Watch-n-Patch dataset was created with a focus on modeling human activities comprising multiple actions in a completely unsupervised setting. It was collected with a Microsoft Kinect One sensor, for a total length of about 230 minutes divided into 458 videos. 7 subjects perform daily activities in 8 offices and 5 kitchens with complex backgrounds. Moreover, skeleton data are provided as ground-truth annotations.
12 PAPERS • NO BENCHMARKS YET
COUCH is a large human-chair interaction dataset with clean annotations. The dataset consists of 3 hours and over 500 sequences of motion capture (MoCap) on human-chair interactions.
7 PAPERS • NO BENCHMARKS YET
VidHOI is a video-based human-object interaction detection benchmark. VidHOI is based on VidOR, which is densely annotated with all humans and predefined objects appearing in each frame. VidOR is also more challenging because its videos are user-generated rather than recorded by volunteers, and are therefore jittery at times.
6 PAPERS • 2 BENCHMARKS
Ambiguous-HOI is a challenging dataset containing ambiguous human-object interaction images for HOI detection based on HICO-DET.
2 PAPERS • NO BENCHMARKS YET
CHAIRS is a large-scale motion-captured f-AHOI dataset, consisting of 17.3 hours of versatile interactions between 46 participants and 81 articulated and rigid sittable objects. CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process, as well as realistic and physically plausible full-body interactions.
V-HICO is a dataset for human-object interaction in videos. It contains 6,594 videos of human-object interaction: 5,297 training videos, 635 validation videos, 608 test videos, and 54 unseen test videos. To test performance on common human-object interaction classes as well as generalization to new classes, two test splits are provided: the first contains the same human-object interaction classes as the training split, while the second consists of unseen novel classes.
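The split sizes are consistent with the stated total; a quick check in Python (the split names below are shorthand, not the dataset's own identifiers):

# Sanity check: the V-HICO split sizes sum to the stated 6,594 videos.
splits = {"train": 5_297, "val": 635, "test": 608, "unseen_test": 54}
assert sum(splits.values()) == 6_594
print(sum(splits.values()))  # 6594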
Discovering Interacted Objects (DIO) is a benchmark containing 51 interactions and 1,000+ objects designed for Spatio-temporal Human-Object Interaction (ST-HOI) detection.
1 PAPER • NO BENCHMARKS YET
EgoISM-HOI is a new multimodal dataset composed of synthetic and real images of egocentric human-object interactions in an industrial environment, with rich annotations of hands and objects. EgoISM-HOI contains a total of 39,304 RGB images, 23,356 depth maps and instance segmentation masks, 59,860 hand annotations, 237,985 object instances across 19 object categories, and 35,416 egocentric human-object interactions.
The Human-to-Human-or-Object Interaction (H2O) dataset is a dataset for Human-Object Interaction (HOI) detection. The task consists of determining and locating the list of triplets <subject, verb, target> that describe all the simultaneous interactions in an image.
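Since the task is defined over <subject, verb, target> triplets, a minimal sketch of how the interactions detected in one image might be represented follows; the class, field, and verb names are illustrative assumptions, not the dataset's actual API or label set.

from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class Interaction:
    subject_box: Box            # the acting person
    verb: str                   # e.g. "hold", "talk_to" (illustrative verb names)
    target_box: Optional[Box]   # another person, an object, or None if no target

# One image can carry several simultaneous interactions, even for the same subject.
image_interactions: List[Interaction] = [
    Interaction(subject_box=(10, 15, 120, 300), verb="hold",
                target_box=(90, 200, 160, 280)),
    Interaction(subject_box=(10, 15, 120, 300), verb="talk_to",
                target_box=(200, 30, 330, 310)),
]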
MPHOI-72 is a multi-person human-object interaction dataset that can be used for a wide variety of HOI/activity recognition and pose estimation/object tracking tasks. The dataset is challenging due to many body occlusions among the humans and objects. It consists of 72 videos captured from 3 different angles at 30 fps, with a total of 26,383 frames and an average length of 12 seconds. It involves 5 humans performing in pairs, 6 object types, 3 activities and 13 sub-activities. The dataset includes color video, depth video, human skeletons, and human and object bounding boxes.
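The stated frame count, frame rate, and video count are consistent with the ~12-second average length; a quick check:

# Consistency check for the stated statistics: 26,383 frames at 30 fps over 72 videos.
total_frames = 26_383
fps = 30
num_videos = 72

avg_seconds = total_frames / fps / num_videos
print(round(avg_seconds, 1))  # ~12.2 seconds per video, matching the stated ~12 s average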
The dataset is composed of 100 video sequences densely annotated with 60K bounding boxes, 17 sequence attributes, 13 action verb attributes and 29 target object attributes.
A first-of-its-kind paired win-fail action understanding dataset with samples from the following domains: “General Stunts,” “Internet Wins-Fails,” “Trick Shots,” and “Party Games.” The task is to identify successful and failed attempts at various activities. Unlike existing action recognition datasets, intra-class variation is high, making the task challenging yet feasible.
1 PAPER • 2 BENCHMARKS
H²O is an image dataset annotated for Human-to-Human-or-Object interaction detection. H²O is composed of images from the V-COCO dataset plus additional images that mostly contain interactions between people. The dataset was introduced in: Orcesi, A., Audigier, R., Toukam, F. P., & Luvison, B. (2021, December). Detecting Human-to-Human-or-Object (H²O) Interactions with DIABOLO. In 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021) (pp. 1-8). IEEE. The annotations were made with Pixano, an open-source, smart annotation tool for computer vision applications: https://pixano.cea.fr/
0 PAPERS • NO BENCHMARKS YET