Latest Research

PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

magic-research/PLLaVA • arXiv 2024

PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks.

Ranked #1 on Zero-Shot Video Question Answer on TGIF-QA

Video-based Generative Performance Benchmarking (Consistency) • Video-based Generative Performance Benchmarking (Contextual Understanding) • +4

26 Apr 2024

Interpreting Answers to Yes-No Questions in Dialogues from Multiple Domains

wang-zijie/yn-question-multi-domains • 25 Apr 2024

People often answer yes-no questions without explicitly saying yes, no, or similar polar keywords.

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

x-plug/mplug-docowl • 25 Apr 2024

Charts are important for presenting and explaining complex data relationships.

876

OmniSearchSage: Multi-Task Multi-Entity Embeddings for Pinterest Search

pinterest/atg-research • 25 Apr 2024

In this paper, we present OmniSearchSage, a versatile and scalable system for understanding search queries, pins, and products for Pinterest search.

Vision-based robot manipulation of transparent liquid containers in a laboratory setting

danischober/labliquidvision • 25 Apr 2024

Laboratory processes involving small volumes of solutions and active ingredients are often performed manually due to challenges in automation, such as high initial costs, semi-structured environments and protocol variability.

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

opengvlab/internvl • 25 Apr 2024

Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks.

844

AAPL: Adding Attributes to Prompt Learning for Vision-Language Models

Gahyeonkim09/AAPL • 25 Apr 2024

Through our experiments, we have identified important issues in CoOp and CoCoOp: the context learned through traditional image augmentation is biased toward seen classes, negatively impacting generalization to unseen classes.

Deep learning-based blind image super-resolution with iterative kernel reconstruction and noise estimation

hfates/ikr-net • 25 Apr 2024

Yet, there is a gap in the literature to provide a well-generalized deep learning-based solution that performs well on images with unknown and highly complex degradations.

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

ailab-cvc/seed-bench • 25 Apr 2024

We hope that our work can serve as a valuable addition to existing MLLM benchmarks, providing insightful observations and inspiring further research in the area of text-rich visual comprehension with MLLMs.

236

TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

saidwivedi/TokenHMR • 25 Apr 2024

We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy.
