The Title-based Video Summarization (TVSum) dataset serves as a benchmark for validating video summarization techniques. It contains 50 videos from various genres (e.g., news, how-to, documentary, vlog, egocentric) and 1,000 shot-level importance-score annotations obtained via crowdsourcing (20 per video).
135 PAPERS • 4 BENCHMARKS
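Because TVSum provides 20 crowdsourced score sequences per video, a per-shot ground-truth importance curve is typically obtained by averaging across annotators. A minimal sketch, where the array layout and function name are illustrative assumptions rather than part of the official toolkit:

```python
import numpy as np

def aggregate_importance(annotations):
    """Average per-shot importance scores across annotators.

    `annotations` is a hypothetical (n_annotators, n_shots) array of
    crowdsourced scores (TVSum collects 20 annotations per video).
    Returns one mean importance score per shot.
    """
    annotations = np.asarray(annotations, dtype=float)
    return annotations.mean(axis=0)

# Toy example: 3 annotators, 4 shots
scores = aggregate_importance([[1, 3, 5, 2],
                               [2, 3, 4, 2],
                               [3, 3, 3, 2]])
print(scores)  # [2. 3. 4. 2.]
```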
The SumMe dataset is a video summarization dataset consisting of 25 videos, each annotated with at least 15 human summaries (390 in total).
124 PAPERS • 3 BENCHMARKS
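With at least 15 human summaries per video, SumMe evaluations score a predicted selection against every reference and commonly report the best match. A sketch under that assumption, using a set-of-shot-indices representation chosen here for illustration:

```python
def f1_against_references(pred, references):
    """F-measure of a predicted shot selection against multiple human
    summaries, each given as a set of selected shot indices.

    Reports the score against the best-matching reference (a common
    SumMe protocol); the representation here is an assumption.
    """
    def f1(a, b):
        overlap = len(a & b)
        if overlap == 0:
            return 0.0
        precision = overlap / len(a)
        recall = overlap / len(b)
        return 2 * precision * recall / (precision + recall)

    return max(f1(pred, ref) for ref in references)

# Prediction matches 2 of 3 shots of the first reference
print(f1_against_references({0, 1, 2}, [{1, 2, 3}, {4, 5}]))
```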
The Video2GIF dataset contains over 100,000 pairs of GIFs and their source videos. The GIFs were collected from two popular GIF websites (makeagif.com, gifsoup.com) and the corresponding source videos were collected from YouTube in Summer 2015. IDs and URLs of the GIFs and the videos are provided, along with the temporal alignment of GIF segments to their source videos. The dataset is intended for evaluating GIF creation and video highlight detection techniques.
11 PAPERS • NO BENCHMARKS YET
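Given the temporal alignment of each GIF segment to its source video, a predicted highlight can be scored by temporal intersection-over-union with the aligned segment. A minimal sketch (the segment values and function name are illustrative, not part of the dataset's tooling):

```python
def temporal_iou(seg_a, seg_b):
    """Temporal intersection-over-union between two (start, end)
    segments in seconds: overlap length divided by union length."""
    start = max(seg_a[0], seg_b[0])
    end = min(seg_a[1], seg_b[1])
    inter = max(0.0, end - start)
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

# Predicted highlight (10s-20s) vs. aligned GIF segment (15s-25s):
# 5s overlap over a 15s union
print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
```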
The HyperKvasir dataset contains 110,079 images and 374 videos capturing anatomical landmarks as well as pathological and normal findings, amounting to around 1 million images and video frames in total.
10 PAPERS • 2 BENCHMARKS
Collects dense per-video-shot concept annotations.
4 PAPERS • 1 BENCHMARK
Contains 140 videos with multiple human-created summaries, acquired in a controlled experiment.
3 PAPERS • NO BENCHMARKS YET
VideoXum is an enriched large-scale dataset for cross-modal video summarization, built on ActivityNet Captions. The dataset includes three subtasks: Video-to-Video Summarization (V2V-SUM), Video-to-Text Summarization (V2T-SUM), and Video-to-Video&Text Summarization (V2VT-SUM).
Contains about 1,000 videos from 10 queries, together with their video tags, manual annotations, and associated web images.
1 PAPER • NO BENCHMARKS YET
Multi-Ego is a new multi-view egocentric dataset recorded simultaneously by three cameras, covering a wide variety of real-life scenarios. The footage is annotated by multiple individuals under various summarization configurations, with a consensus analysis ensuring a reliable ground truth.
MultiSum is a dataset for multimodal summarization with multimodal output (MSMO). It consists of 17 categories and 170 subcategories that encapsulate a diverse array of real-world scenarios. The dataset features:
This dataset consists of 18 movies with durations ranging from 10 to 104 minutes, drawn from the OVSD dataset (Rotman et al., 2016). For these videos, the summary length limit is set to the minimum of 4 minutes and 10% of the video length.
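The length rule above (minimum of 4 minutes and 10% of the video duration) can be sketched directly; the function name is a hypothetical helper, not part of OVSD:

```python
def summary_budget_seconds(video_seconds):
    """Summary length limit: min(4 minutes, 10% of video duration)."""
    return min(4 * 60, 0.10 * video_seconds)

print(summary_budget_seconds(30 * 60))   # 180.0 -> 10% of a 30-minute video
print(summary_budget_seconds(100 * 60))  # 240  -> capped at 4 minutes
```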
The dataset is useful for query-adaptive video summarization and annotated with diversity and query-specific relevance labels.
A short video clip may contain a progression of multiple events and an interesting story line. A human viewer needs to capture the event in every shot and associate the events together to understand the story behind them.
1 PAPER • 3 BENCHMARKS