3D dense captioning
9 papers with code • 0 benchmarks • 1 dataset
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding. Beyond the coarse semantic class prediction and bounding box regression of traditional 3D object detection, 3D dense captioning aims to produce a finer, instance-level natural language description of the visual appearance and spatial relations of each scene object of interest.
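For concreteness, here is a minimal sketch of the task's input-output contract. The `CaptionedBox` type and `dense_caption` function are hypothetical names for illustration only, not an API from any of the papers listed below.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CaptionedBox:
    center: Tuple[float, float, float]  # (x, y, z) box center in scene coordinates
    size: Tuple[float, float, float]    # (dx, dy, dz) box extents
    semantic_class: str                 # coarse category, e.g. "chair"
    caption: str                        # instance-level description, e.g.
                                        # "a brown chair next to the window"

def dense_caption(point_cloud) -> List[CaptionedBox]:
    """Hypothetical interface: map an (N, 3 + C) point cloud (xyz plus
    optional color/normal features) to one localized box and one
    free-form description per object of interest."""
    raise NotImplementedError  # model-specific; see the papers below
```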
Most implemented papers
X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning
Thus, a more faithful caption can be generated using only point clouds at inference time.
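A minimal sketch of the general idea, assuming a feature-level distillation objective; the paper's actual transfer scheme is transformer-based and more involved, and the function name and MSE loss here are illustrative choices, not X-Trans2Cap's exact objective.

```python
import torch.nn.functional as F

def cross_modal_transfer_loss(student_3d_feats, teacher_2d_feats):
    """Illustrative only: during training, features from the point-cloud
    (student) branch are pulled toward features from a 2D-image teacher
    branch; the teacher is discarded at inference, so captions are
    generated from point clouds alone."""
    return F.mse_loss(student_3d_feats, teacher_2d_feats.detach())
```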
MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes
3D dense captioning is a recently proposed task in which point clouds provide richer geometric information than their 2D counterparts.
Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training
Current approaches to 3D visual reasoning are task-specific and lack pre-training methods for learning generic representations that transfer across tasks.
End-to-End 3D Dense Captioning with Vote2Cap-DETR
Compared with prior work, our framework has several appealing advantages: 1) without resorting to numerous hand-crafted components, our method is built on a full transformer encoder-decoder architecture with an object decoder driven by learnable vote queries and a caption decoder that produces dense captions in a set-prediction manner.
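That sentence compresses a lot of architecture. The PyTorch sketch below shows the overall shape under stated assumptions (learnable queries decoded against encoded scene tokens, with parallel box and caption heads); all dimensions, the single-linear caption head, and the class name are illustrative simplifications, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VoteQueryCaptioner(nn.Module):
    """Illustrative sketch, not the paper's code: K learnable queries
    attend to encoded scene tokens; two parallel heads emit a box and
    caption token logits per query, set-prediction style."""
    def __init__(self, d_model=256, num_queries=256, vocab_size=3000, max_len=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.box_head = nn.Linear(d_model, 6)  # box center (3) + extents (3)
        # Placeholder caption head: real systems decode tokens autoregressively.
        self.caption_head = nn.Linear(d_model, max_len * vocab_size)
        self.max_len, self.vocab_size = max_len, vocab_size

    def forward(self, scene_tokens):  # scene_tokens: (B, T, d_model)
        B = scene_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)  # (B, K, d_model)
        h = self.decoder(q, scene_tokens)                # (B, K, d_model)
        boxes = self.box_head(h)                         # (B, K, 6)
        captions = self.caption_head(h).view(B, -1, self.max_len, self.vocab_size)
        return boxes, captions  # matched to ground truth via set prediction
```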
Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning
Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture.
An Embodied Generalist Agent in 3D World
By leveraging the massive knowledge and learning schemes of large language models (LLMs), recent machine learning models have shown notable success in building generalist agents capable of general-purpose task solving across diverse domains, including natural language processing, computer vision, and robotics.
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains challenging, especially given the need to understand permutation-invariant point cloud representations of 3D scenes.
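As a reminder of what permutation invariance means here, this is a minimal PointNet-style sketch (a standard construction, not LL3DA's encoder): a shared per-point MLP followed by a symmetric max-pool, so reordering the input points leaves the scene feature unchanged.

```python
import torch
import torch.nn as nn

class PermutationInvariantEncoder(nn.Module):
    """Standard symmetric-pooling construction, shown for illustration."""
    def __init__(self, in_dim=3, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, points):           # points: (B, N, 3)
        per_point = self.mlp(points)     # (B, N, feat_dim)
        return per_point.max(dim=1).values  # (B, feat_dim), order-invariant

# Sanity check: shuffling the points does not change the encoding.
enc = PermutationInvariantEncoder()
pts = torch.randn(1, 1024, 3)
shuffled = pts[:, torch.randperm(1024)]
assert torch.allclose(enc(pts), enc(shuffled), atol=1e-5)
```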
TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes
However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: 1) the domain gap between indoor and outdoor scenes, such as dynamics and sparse visual inputs, makes it difficult to directly adapt existing indoor methods; 2) the lack of data with comprehensive box-caption pair annotations specifically tailored for outdoor scenes.