Dense Captioning
23 papers with code • 1 benchmark • 1 dataset
Most implemented papers
3D-LLM: Injecting the 3D World into Large Language Models
Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs.
Dense-Captioning Events in Videos
We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events.
A Hierarchical Approach for Generating Descriptive Image Paragraphs
Recent progress on image captioning has made it possible to generate novel sentences describing images in natural language, but compressing an image into a single sentence can describe visual content in only coarse detail.
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language.
Dense Captioning with Joint Inference and Visual Context
The goal is to densely detect visual concepts (e.g., objects, object parts, and interactions between them) from images, labeling each with a short descriptive phrase.
Joint Event Detection and Description in Continuous Video Streams
In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context.
Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020
This technical report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020.
Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization
Prior work in this domain has shown that there is ample room for improvement in the generated image sequence in terms of visual quality, consistency and relevance.
X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning
Thus, a more faithful caption can be generated using only point clouds during inference.
MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes
3D dense captioning is a recently proposed novel task, where point clouds contain more geometric information than the 2D counterpart.