Action Segmentation
72 papers with code • 9 benchmarks • 16 datasets
Action Segmentation is a challenging problem in high-level video understanding. In its simplest form, Action Segmentation aims to temporally segment an untrimmed video and label each segment with one of a set of pre-defined action classes. The results of Action Segmentation can be further used as input to various applications, such as video-to-text and action localization.
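To make the task concrete, here is a minimal sketch (not from the source) of what action segmentation outputs look like: a model predicts one action label per frame, and collapsing runs of identical labels yields the (start, end, action) segments that downstream applications such as action localization consume. The label names are illustrative.

```python
# Minimal sketch: collapse a frame-wise label sequence into labeled segments.
from itertools import groupby

def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (start_frame, end_frame, label) segments."""
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((start, start + length - 1, label))
        start += length
    return segments

# Example: a 10-frame video labelled frame by frame.
frame_labels = ["pour", "pour", "pour", "stir", "stir", "stir", "stir",
                "background", "background", "background"]
print(frames_to_segments(frame_labels))
# [(0, 2, 'pour'), (3, 6, 'stir'), (7, 9, 'background')]
```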
Source: TricorNet: A Hybrid Temporal Convolutional and Recurrent Network for Video Action Segmentation
Libraries
Use these libraries to find Action Segmentation models and implementations
Datasets
Subtasks
Most implemented papers
Temporal Convolutional Networks for Action Segmentation and Detection
The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond.
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Annotating videos is cumbersome, expensive and not scalable.
LOGO: A Long-Form Video Dataset for Group Action Quality Assessment
Action quality assessment (AQA) has become an emerging topic since it can be extensively applied in numerous scenarios.
MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation
Temporally locating and classifying action segments in long untrimmed videos is of particular interest to many applications like surveillance and robotics.
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
However, most of the existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks.
Alleviating Over-segmentation Errors by Detecting Action Boundaries
Our model architecture consists of a long-term feature extractor and two branches: the Action Segmentation Branch (ASB) and the Boundary Regression Branch (BRB).
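The sketch below illustrates the general idea of boundary-aware refinement: detected boundaries split the timeline into candidate segments, and each segment takes the class with the highest mean frame-wise score, which suppresses isolated mis-classified frames. The function name and the mean-score rule are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative refinement: smooth frame-wise predictions between detected boundaries.
import numpy as np

def refine_with_boundaries(frame_probs, boundary_frames):
    """frame_probs: (T, num_classes) scores from a segmentation branch.
    boundary_frames: sorted frame indices where a boundary branch fires."""
    T = frame_probs.shape[0]
    cuts = [0] + [b for b in boundary_frames if 0 < b < T] + [T]
    refined = np.empty(T, dtype=int)
    for start, end in zip(cuts[:-1], cuts[1:]):
        # Assign the whole candidate segment its highest-scoring class on average.
        refined[start:end] = frame_probs[start:end].mean(axis=0).argmax()
    return refined

# Toy example: 6 frames, 2 classes, one detected boundary at frame 3.
probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2],   # noisy first segment
                  [0.2, 0.8], [0.3, 0.7], [0.1, 0.9]])
print(refine_with_boundaries(probs, [3]))  # -> [0 0 0 1 1 1]
```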
Global2Local: Efficient Structure Search for Video Action Segmentation
Our search scheme exploits both global search, to find coarse receptive field combinations, and local search, to further refine the combination patterns.
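The toy sketch below conveys the coarse-to-fine idea only: a global step samples coarse per-layer dilation patterns and a local step perturbs the best one. The `evaluate` proxy and the candidate space are placeholders, not the paper's actual genetic and expectation-guided search.

```python
# Illustrative coarse-to-fine search over per-layer dilation (receptive field) patterns.
import random

def evaluate(pattern):
    # Placeholder proxy score; in practice this would be the validation accuracy
    # of a temporal convolutional network built with these per-layer dilations.
    target = [1, 2, 4, 8]
    return -sum(abs(p - t) for p, t in zip(pattern, target))

def global_search(num_layers=4, candidates=(1, 2, 4, 8, 16), samples=50, seed=0):
    # Coarse stage: sample whole patterns and keep the best-scoring one.
    rng = random.Random(seed)
    pool = [[rng.choice(candidates) for _ in range(num_layers)] for _ in range(samples)]
    return max(pool, key=evaluate)

def local_search(pattern, steps=20, seed=0):
    # Fine stage: perturb one layer's dilation at a time, keep improvements.
    rng = random.Random(seed)
    best = list(pattern)
    for _ in range(steps):
        cand = list(best)
        i = rng.randrange(len(cand))
        cand[i] = max(1, cand[i] + rng.choice([-1, 1]) * max(1, cand[i] // 2))
        if evaluate(cand) > evaluate(best):
            best = cand
    return best

coarse = global_search()
print("coarse pattern:", coarse, "refined pattern:", local_search(coarse))
```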
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.
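As a rough illustration of the contrastive objective behind this style of pre-training (a generic InfoNCE-style loss, not VideoCLIP's specific clip sampling or hard-negative retrieval), paired video and text embeddings in a batch act as positives and all other pairings as negatives:

```python
# Generic contrastive video-text loss sketch (assumed formulation, not the paper's exact one).
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (batch, dim) L2-normalised embeddings of paired clips/captions."""
    logits = video_emb @ text_emb.t() / temperature       # (batch, batch) similarities
    targets = torch.arange(video_emb.size(0))             # matching pairs lie on the diagonal
    # Symmetric cross-entropy: video-to-text and text-to-video.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

video = F.normalize(torch.randn(8, 512), dim=-1)
text = F.normalize(torch.randn(8, 512), dim=-1)
print(contrastive_loss(video, text).item())
```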
RF-Next: Efficient Receptive Field Search for Convolutional Neural Networks
Our search scheme exploits both global search, to find coarse receptive field combinations, and local search, to further refine them.
Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation
This paper introduces a unified framework for video action segmentation via sequence to sequence (seq2seq) translation in a fully and timestamp supervised setup.
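The seq2seq view can be pictured as translating a video into a short target sequence of (action, relative duration) tokens instead of a dense per-frame labelling. The helpers below are hypothetical and only illustrate that representation, not the paper's model.

```python
# Illustrative encoding between dense frame labels and a seq2seq-style target sequence.
def to_action_sequence(frame_labels):
    """Dense frame labels -> [(action, fraction_of_video), ...]."""
    seq, prev, count = [], None, 0
    for lab in frame_labels + [None]:           # sentinel flushes the last run
        if lab != prev and prev is not None:
            seq.append((prev, count / len(frame_labels)))
            count = 0
        prev, count = lab, count + 1
    return seq

def to_frame_labels(action_seq, num_frames):
    """Expand (action, fraction) tokens back to a dense labelling of num_frames."""
    frames = []
    for action, frac in action_seq:
        frames.extend([action] * round(frac * num_frames))
    return frames[:num_frames]

labels = ["crack_egg"] * 30 + ["stir"] * 50 + ["pour"] * 20
seq = to_action_sequence(labels)
print(seq)                                   # [('crack_egg', 0.3), ('stir', 0.5), ('pour', 0.2)]
print(to_frame_labels(seq, 100) == labels)   # True
```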