Person-centric Visual Grounding
4 papers with code • 1 benchmarks • 1 datasets
Person-centric visual grounding is the problem of linking between people named in a caption and people pictured in an image. Introduced in "Who's Waldo? Linking People Across Text and Images" (Cui et al, ICCV 2021).
Most implemented papers
Who's Waldo? Linking People Across Text and Images
We present a task and benchmark dataset for person-centric visual grounding, the problem of linking between people named in a caption and people pictured in an image.
TubeDETR: Spatio-Temporal Video Grounding with Transformers
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
To Find Waldo You Need Contextual Cues: Debiasing Who's Waldo
We find that the original Who's Waldo dataset compiled for this task contains a large number of biased samples that are solvable simply by heuristic methods; for instance, in many cases the first name in the sentence corresponds to the largest bounding box, or the sequence of names in the sentence corresponds to an exact left-to-right order in the image.