Search Results for author: Ying Shan

Found 185 papers, 114 papers with code

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

1 code implementation • 25 Apr 2024 • Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan

We hope that our work can serve as a valuable addition to existing MLLM benchmarks, providing insightful observations and inspiring further research in the area of text-rich visual comprehension with MLLMs.

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

1 code implementation • 22 Apr 2024 • Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan

We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.

Image Generation

InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

2 code implementations • 10 Apr 2024 • Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, Ying Shan

We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability.

Image to 3D

ST-LLM: Large Language Models Are Effective Temporal Learners

1 code implementation • 30 Mar 2024 • Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li

In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLMs?

Reading Comprehension • Video Understanding

UV Gaussians: Joint Learning of Mesh Deformation and Gaussian Textures for Human Avatar Modeling

no code implementations • 18 Mar 2024 • Yujiao Jiang, Qingmin Liao, Xiaoyu Li, Li Ma, Qi Zhang, Chaopeng Zhang, Zongqing Lu, Ying Shan

Therefore, we propose UV Gaussians, which models the 3D human body by jointly learning mesh deformations and 2D UV-space Gaussian textures.

SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model

no code implementations • 15 Mar 2024 • Tao Wu, XueWei Li, Zhongang Qi, Di Hu, Xintao Wang, Ying Shan, Xi Li

Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation. In this paper, we introduce SphereDiffusion, a novel framework that addresses these unique challenges to better generate high-quality and precisely controllable spherical panoramic images. For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship with text-object correspondence to better use the pre-trained knowledge of planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, by virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry characteristic. Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristic to ensure the boundary continuity of the generated images. With these techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation, reducing FID by around 35% on average relative to the baseline.

Denoising • Image Generation
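
For intuition on the rotation-invariance property above: on an equirectangular panorama, rotating the sphere about its vertical axis is exactly a lossless horizontal circular shift of the image, which makes it a cheap data augmentation. A minimal sketch (the function name and array shapes are illustrative assumptions, not from the paper):

```python
import numpy as np

def rotate_panorama_yaw(erp_image: np.ndarray, yaw_deg: float) -> np.ndarray:
    """Rotate an equirectangular panorama (H, W, C) about the vertical axis.

    A yaw rotation of the sphere maps to a horizontal circular shift of the
    equirectangular image, so the augmentation is exact and lossless.
    """
    width = erp_image.shape[1]
    shift = int(round(yaw_deg / 360.0 * width))
    return np.roll(erp_image, shift, axis=1)

# Example: augment a panorama with a random yaw rotation.
pano = np.random.rand(512, 1024, 3)  # placeholder panorama
augmented = rotate_panorama_yaw(pano, yaw_deg=np.random.uniform(0.0, 360.0))
```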

Texture-GS: Disentangling the Geometry and Texture for 3D Gaussian Splatting Editing

no code implementations • 15 Mar 2024 • Tian-Xing Xu, WenBo Hu, Yu-Kun Lai, Ying Shan, Song-Hai Zhang

3D Gaussian splatting, emerging as a groundbreaking approach, has drawn increasing attention for its capabilities of high-fidelity reconstruction and real-time rendering.

Disentanglement

HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback

no code implementations • 13 Mar 2024 • Ang Li, Qiugen Xiao, Peng Cao, Jian Tang, Yi Yuan, Zijie Zhao, Xiaoyuan Chen, Liang Zhang, Xiangyang Li, Kaitong Yang, Weidong Guo, Yukang Gan, Xu Yu, Daniell Wang, Ying Shan

Using ChatGPT as a labeler to provide feedback on open-domain prompts in RLAIF training, we observe an increase in human evaluators' preference win ratio for model responses, but a decrease in evaluators' satisfaction rate.

Language Modelling • Large Language Model +2

BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

2 code implementations • 11 Mar 2024 • Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, Qiang Xu

Image inpainting, the process of restoring corrupted images, has seen significant advancements with the advent of diffusion models (DMs).

Image Inpainting

DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos

no code implementations • 9 Mar 2024 • Xiuzhe Wu, Xiaoyang Lyu, Qihao Huang, Yong Liu, Yang Wu, Ying Shan, Xiaojuan Qi

Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.

Depth Estimation • Disentanglement +5

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

1 code implementation • 16 Feb 2024 • Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, YuFei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen

Diffusion models have proven to be highly effective in image and video generation; however, they still face composition challenges when generating images of varying sizes due to single-scale training data.

Video Generation

DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing

1 code implementation • 4 Feb 2024 • Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, Jian Zhang

Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years.

Image Generation

Advances in 3D Generation: A Survey

no code implementations • 31 Jan 2024 • Xiaoyu Li, Qi Zhang, Di Kang, Weihao Cheng, Yiming Gao, Jingbo Zhang, Zhihao Liang, Jing Liao, Yan-Pei Cao, Ying Shan

In this survey, we aim to introduce the fundamental methodologies of 3D generation methods and establish a structured roadmap, encompassing 3D representation, generation methods, datasets, and corresponding applications.

3D Generation • Novel View Synthesis

YOLO-World: Real-Time Open-Vocabulary Object Detection

1 code implementation • 30 Jan 2024 • Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools.

Instance Segmentation • Language Modelling +4

TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts

no code implementations • 26 Jan 2024 • Jingyu Zhuang, Di Kang, Yan-Pei Cao, Guanbin Li, Liang Lin, Ying Shan

To this end, we propose a 3D scene editing framework, TIP-Editor, that accepts both text and image prompts and a 3D bounding box to specify the editing region.

3D scene Editing

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

1 code implementation • 25 Jan 2024 • Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue

We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets.

Supervised Fine-tuning in turn Improves Visual Foundation Models

1 code implementation • 18 Jan 2024 • Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan

Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years.

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

2 code implementations • 17 Jan 2024 • Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan

Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model.

Text-to-Video Generation • Video Generation

LLaMA Pro: Progressive LLaMA with Block Expansion

1 code implementation • 4 Jan 2024 • Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, Ying Shan

Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA.

Instruction Following • Math

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

1 code implementation • 14 Dec 2023 • Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan

In combination with the existing text tokenizer and detokenizer, this framework allows for the encoding of interleaved image-text data into a multimodal sequence, which can subsequently be fed into the transformer model.

Image Captioning • In-Context Learning +4

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

1 code implementation • 11 Dec 2023 • Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying Shan

Both quantitative and qualitative results on this evaluation dataset indicate that our SmartEdit surpasses previous methods, paving the way for the practical application of complex instruction-based image editing.

EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models

1 code implementation • 11 Dec 2023 • Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, Xihui Liu

Given diverse environmental inputs, including real-time task progress, visual observations, and open-form language instructions, a proficient task planner is expected to predict feasible actions, which is a feat inherently achievable by Multimodal Large Language Models (MLLMs).

Benchmarking • Human-Object Interaction Detection

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

1 code implementation • 7 Dec 2023 • Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, Ying Shan

Recent advances in text-to-image generation have made remarkable progress in synthesizing realistic human photos conditioned on given text prompts.

Diffusion Personalization Tuning Free • Text-to-Image Generation

MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

1 code implementation • 6 Dec 2023 • Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, Ying Shan

Therefore, this paper presents MotionCtrl, a unified and flexible motion controller for video generation designed to effectively and independently control camera and object motion.

Object • Video Generation

AnimateZero: Video Diffusion Models are Zero-Shot Image Animators

1 code implementation • 6 Dec 2023 • Jiwen Yu, Xiaodong Cun, Chenyang Qi, Yong Zhang, Xintao Wang, Ying Shan, Jian Zhang

For appearance control, we borrow intermediate latents and their features from text-to-image (T2I) generation to ensure that the generated first frame matches the given generated image.

Image Animation • Video Generation

MagicStick: Controllable Video Editing via Control Handle Transformations

1 code implementation • 5 Dec 2023 • Yue Ma, Xiaodong Cun, Yingqing He, Chenyang Qi, Xintao Wang, Ying Shan, Xiu Li, Qifeng Chen

Though succinct, our method is the first to demonstrate video property editing from a pre-trained text-to-image model.

Video Editing • Video Generation

StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

2 code implementations • 1 Dec 2023 • Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen, Jinbo Xing, Xintao Wang, Yujiu Yang, Ying Shan

To address these challenges, we introduce StyleCrafter, a generic method that enhances pre-trained T2V models with a style control adapter, enabling video generation in any style by providing a reference image.

Disentanglement • Text-to-Video Generation +1

ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis

no code implementations • 28 Nov 2023 • Xiangjun Gao, Xiaoyu Li, Chaopeng Zhang, Qi Zhang, YanPei Cao, Ying Shan, Long Quan

In this work, we propose a method to address the challenge of rendering a 3D human from a single image in a free-view manner.

HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting

no code implementations • 28 Nov 2023 • Xian Liu, Xiaohang Zhan, Jiaxiang Tang, Ying Shan, Gang Zeng, Dahua Lin, Xihui Liu, Ziwei Liu

In this paper, we propose an efficient yet effective framework, HumanGaussian, that generates high-quality 3D humans with fine-grained geometry and realistic appearance.

HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion

no code implementations • 28 Nov 2023 • Jingbo Zhang, Xiaoyu Li, Qi Zhang, YanPei Cao, Ying Shan, Jing Liao

Optimization-based methods that lift text-to-image diffusion models to 3D generation often fail to preserve the texture details of the reference image, resulting in inconsistent appearances in different views.

3D Generation • Image to 3D

SEED-Bench-2: Benchmarking Multimodal Large Language Models

1 code implementation • 28 Nov 2023 • Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan

Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3).

Benchmarking • Image Generation +1

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

2 code implementations • 27 Nov 2023 • Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan

1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep.

 Ranked #1 on Object Detection on COCO 2017 (mAP metric)

Image Classification • Object Detection +3
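
As a rough illustration of the "see wide without going deep" guideline above (a sketch only; UniRepLKNet's actual blocks involve re-parameterization and further structure not shown here), a depthwise convolution with a very large kernel enlarges the receptive field in a single layer:

```python
import torch
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    """Depthwise large-kernel conv followed by 1x1 channel mixing.

    One such layer already has a 13x13 receptive field, whereas a stack of
    3x3 convs would need six layers to see as far.
    """

    def __init__(self, channels: int, kernel_size: int = 13):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        self.norm = nn.BatchNorm2d(channels)
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)  # channel mixing
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.pw(self.norm(self.dw(x))))  # residual connection

x = torch.randn(1, 64, 56, 56)
y = LargeKernelBlock(64)(x)  # same shape, much wider receptive field
```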

ViT-Lens: Towards Omni-modal Representations

1 code implementation • 27 Nov 2023 • Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou

In this paper, we present ViT-Lens-2 that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space.

EEG • Image Generation +2

GS-IR: 3D Gaussian Splatting for Inverse Rendering

1 code implementation • 26 Nov 2023 • Zhihao Liang, Qi Zhang, Ying Feng, Ying Shan, Kui Jia

We propose GS-IR, a novel inverse rendering approach based on 3D Gaussian Splatting (GS) that leverages forward mapping volume rendering to achieve photorealistic novel view synthesis and relighting results.

Inverse Rendering • Novel View Synthesis

Vision-Language Instruction Tuning: A Review and Analysis

1 code implementation • 14 Nov 2023 • Chen Li, Yixiao Ge, Dian Li, Ying Shan

Instruction tuning is a crucial supervised training phase in Large Language Models (LLMs), aiming to enhance the LLM's ability to generalize instruction execution and adapt to user preferences.

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model

1 code implementation • NeurIPS 2023 • Cheng Cheng, Lin Song, Ruoyi Xue, Hang Wang, Hongbin Sun, Yixiao Ge, Ying Shan

Without bells and whistles, our approach outperforms the state-of-the-art online few-shot learning method by an average of 3.6% on eight image classification datasets with higher inference speed.

Few-Shot Learning • Image Classification +3

SemanticBoost: Elevating Motion Generation with Augmented Textual Cues

no code implementations • 31 Oct 2023 • Xin He, Shaoli Huang, Xiaohang Zhan, Chao Weng, Ying Shan

Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

3 code implementations • 30 Oct 2023 • Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, Ying Shan

The I2V model is designed to produce videos that strictly adhere to the content of the provided reference image, preserving its content, structure, and style.

Text-to-Video Generation • Video Generation

FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling

3 code implementations • 23 Oct 2023 • Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu

With the availability of large-scale video datasets and the advances of diffusion models, text-driven video generation has achieved substantial progress.

Video Generation

TapMo: Shape-aware Motion Generation of Skeleton-free Characters

no code implementations • 19 Oct 2023 • Jiaxu Zhang, Shaoli Huang, Zhigang Tu, Xin Chen, Xiaohang Zhan, Gang Yu, Ying Shan

In this work, we present TapMo, a Text-driven Animation Pipeline for synthesizing Motion in a broad spectrum of skeleton-free 3D characters.

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

1 code implementation • 17 Oct 2023 • Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, Ying Shan

For video generation, various open-source models and publicly available services have been developed to generate high-quality videos.

Benchmarking • Language Modelling +4

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing

no code implementations • 16 Oct 2023 • Jia-Wei Liu, Yan-Pei Cao, Jay Zhangjie Wu, Weijia Mao, YuChao Gu, Rui Zhao, Jussi Keppo, Ying Shan, Mike Zheng Shou

To overcome this, we propose to introduce dynamic Neural Radiance Fields (NeRF) as an innovative video representation, where editing can be performed in 3D space and propagated to the entire video via the deformation field.

Style Transfer • Super-Resolution +1

ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

1 code implementation • 11 Oct 2023 • Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, Ying Shan

Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.

Image Generation

HiFi-123: Towards High-fidelity One Image to 3D Content Generation

no code implementations • 10 Oct 2023 • Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, WenBo Hu, Long Quan, Ying Shan, Yonghong Tian

Our contributions are twofold: First, we propose a Reference-Guided Novel View Enhancement (RGNV) technique that significantly improves the fidelity of diffusion-based zero-shot novel view synthesis methods.

3D Generation • Image to 3D +1

Making LLaMA SEE and Draw with SEED Tokenizer

1 code implementation • 2 Oct 2023 • Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan

We identify two crucial design principles: (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs.

multimodal generation
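
Design principle (1) above amounts to giving image tokens the same left-to-right causal structure as LLM text tokens. A minimal sketch of such a 1D causal attention mask (illustrative only, not SEED's implementation):

```python
import torch

def causal_mask(num_tokens: int) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True iff token i may attend to token j.

    A 1D causal dependency means token i only sees tokens 0..i, matching the
    left-to-right autoregressive prediction mechanism of LLMs.
    """
    return torch.tril(torch.ones(num_tokens, num_tokens, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```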

Anti-Aliased Neural Implicit Surfaces with Encoding Level of Detail

no code implementations • 19 Sep 2023 • Yiyu Zhuang, Qi Zhang, Ying Feng, Hao Zhu, Yao Yao, Xiaoyu Li, Yan-Pei Cao, Ying Shan, Xun Cao

Drawing inspiration from voxel-based representations with the level of detail (LoD), we introduce a multi-scale tri-plane-based scene representation that is capable of capturing the LoD of the signed distance function (SDF) and the space radiance.

Surface Reconstruction

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

1 code implementation • ICCV 2023 • Xiuzhe Wu, Pengfei Hu, Yang Wu, Xiaoyang Lyu, Yan-Pei Cao, Ying Shan, Wenming Yang, Zhongqian Sun, Xiaojuan Qi

Therefore, directly learning a mapping function from speech to the entire head image is prone to ambiguity, particularly when using a short video for training.

Image Generation

StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation

no code implementations • 4 Sep 2023 • Zhouxia Wang, Xintao Wang, Liangbin Xie, Zhongang Qi, Ying Shan, Wenping Wang, Ping Luo

StyleAdapter can generate high-quality images that match the content of the prompts and adopt the style of the references (even for unseen styles) in a single pass, which is more flexible and efficient than previous methods.

Image Generation

Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training

no code implementations • 1 Sep 2023 • Shaohuan Zhou, Xu Li, Zhiyong Wu, Ying Shan, Helen Meng

Specifically, in the pre-training step, we design a phoneme predictor to produce the frame-level phoneme probability vectors as the phonemic timing information and a speaker encoder to model the timbre variations of different singers, and directly estimate the frame-level f0 values from the audio to provide the pitch information.

Singing Voice Synthesis • Unsupervised Pre-training

Exploring Model Transferability through the Lens of Potential Energy

1 code implementation • ICCV 2023 • Xiaotong Li, Zixuan Hu, Yixiao Ge, Ying Shan, Ling-Yu Duan

The experimental results on 10 downstream tasks and 12 self-supervised models demonstrate that our approach can seamlessly integrate into existing ranking techniques and enhance their performances, revealing its effectiveness for the model selection task and its potential for understanding the mechanism in transfer learning.

Model Selection • Transfer Learning

Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views

no code implementations • 27 Aug 2023 • Zi-Xin Zou, Weihao Cheng, Yan-Pei Cao, Shi-Sheng Huang, Ying Shan, Song-Hai Zhang

While recent techniques employ image diffusion models for generating plausible images at novel viewpoints or for distilling pre-trained diffusion priors into 3D representations using score distillation sampling (SDS), these methods often struggle to simultaneously achieve high-quality, consistent, and detailed results for both novel-view synthesis (NVS) and geometry.

3D Reconstruction • Novel View Synthesis +1

Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

2 code implementations • 22 Aug 2023 • Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan

To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions.

Caption Generation • Large Language Model +3

ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights

1 code implementation • 20 Aug 2023 • Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou

A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities.

3D Classification • Question Answering +4

Guide3D: Create 3D Avatars from Text and Image Guidance

no code implementations • 18 Aug 2023 • Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong

To this end, we introduce Guide3D, a zero-shot text-and-image-guided generative model for 3D avatar generation based on diffusion models.

3D Generation • Text to 3D +1

OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution

no code implementations • ICCV 2023 • Zidong Cao, Hao Ai, Yan-Pei Cao, Ying Shan, XiaoHu Qie, Lin Wang

The M\"obius transformation is typically employed to further provide the opportunity for movement and zoom on ODIs, but applying it to the image level often results in blurry effect and aliasing problem.

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

2 code implementations • 30 Jul 2023 • Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan

Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation.

Benchmarking • Multiple-choice

GET3D--: Learning GET3D from Unconstrained Image Collections

no code implementations • 27 Jul 2023 • Fanghua Yu, Xintao Wang, Zheyuan Li, Yan-Pei Cao, Ying Shan, Chao Dong

While generative models have shown potential in creating 3D textured shapes from 2D images, their applicability in 3D industries is limited due to the lack of a well-defined camera distribution in real-world scenarios, resulting in low-quality shapes.

Planting a SEED of Vision in Large Language Model

1 code implementation • 16 Jul 2023 • Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan

Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.)

Language Modelling • Large Language Model +1

Neural Point-based Volumetric Avatar: Surface-guided Neural Points for Efficient and Photorealistic Volumetric Head Avatar

no code implementations • 11 Jul 2023 • Cong Wang, Di Kang, Yan-Pei Cao, Linchao Bao, Ying Shan, Song-Hai Zhang

Rendering photorealistic and dynamically moving human heads is crucial for ensuring a pleasant and immersive experience in AR/VR and video conferencing applications.

NOFA: NeRF-based One-shot Facial Avatar Reconstruction

no code implementations • 7 Jul 2023 • Wangbo Yu, Yanbo Fan, Yong Zhang, Xuan Wang, Fei Yin, Yunpeng Bai, Yan-Pei Cao, Ying Shan, Yang Wu, Zhongqian Sun, Baoyuan Wu

In this work, we propose a one-shot 3D facial avatar reconstruction framework that only requires a single source image to reconstruct a high-fidelity 3D facial avatar.

DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models

1 code implementation • 5 Jul 2023 • Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, Jian Zhang

Specifically, we construct classifier guidance based on the strong correspondence of intermediate features in the diffusion model.

Object

DeSRA: Detect and Delete the Artifacts of GAN-based Real-World Super-Resolution Models

1 code implementation • 5 Jul 2023 • Liangbin Xie, Xintao Wang, Xiangyu Chen, Gen Li, Ying Shan, Jiantao Zhou, Chao Dong

After detecting the artifact regions, we develop a fine-tuning procedure to improve GAN-based SR models with a few samples, so that they can deal with similar types of artifacts in more unseen real data.

Image Super-Resolution

ID-Pose: Sparse-view Camera Pose Estimation by Inverting Diffusion Models

no code implementations • 29 Jun 2023 • Weihao Cheng, Yan-Pei Cao, Ying Shan

ID-Pose adds noise to one image and predicts the noise conditioned on the other image and a hypothesis of the relative pose.

Denoising • Pose Estimation
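
The snippet above suggests a simple scoring loop: add noise to one image, ask a view-conditioned diffusion model to predict that noise given the other image and a candidate relative pose, and prefer poses with lower prediction error. A toy, runnable sketch under that reading (the model and helpers here are stand-ins, not ID-Pose's actual components):

```python
import torch

def add_noise(x: torch.Tensor, noise: torch.Tensor, t: float) -> torch.Tensor:
    # Simple variance-preserving mix at noise level t in (0, 1).
    return (1 - t) ** 0.5 * x + t ** 0.5 * noise

def dummy_model(noisy, cond_image, pose, t):
    # Stand-in for a view-conditioned diffusion model's noise prediction.
    return noisy - (1 - t) ** 0.5 * cond_image  # placeholder logic only

def pose_error(model, img_a, img_b, pose, t: float = 0.5) -> float:
    noise = torch.randn_like(img_a)
    pred = model(add_noise(img_a, noise, t), cond_image=img_b, pose=pose, t=t)
    return torch.mean((pred - noise) ** 2).item()

img_a, img_b = torch.randn(3, 64, 64), torch.randn(3, 64, 64)
candidates = [torch.eye(4) for _ in range(4)]  # hypothetical relative poses
best_pose = min(candidates, key=lambda p: pose_error(dummy_model, img_a, img_b, p))
```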

DreamDiffusion: Generating High-Quality Images from Brain EEG Signals

1 code implementation • 29 Jun 2023 • Yunpeng Bai, Xintao Wang, Yan-Pei Cao, Yixiao Ge, Chun Yuan, Ying Shan

This paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate thoughts into text.

EEG • Image Generation

On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection

1 code implementation • 27 Jun 2023 • Songyang Gao, Shihan Dou, Qi Zhang, Xuanjing Huang, Jin Ma, Ying Shan

Detecting adversarial samples that are carefully crafted to fool the model is a critical step to socially-secure applications.

text-classification • Text Classification

DSRM: Boost Textual Adversarial Training with Distribution Shift Risk Minimization

1 code implementation • 27 Jun 2023 • Songyang Gao, Shihan Dou, Yan Liu, Xiao Wang, Qi Zhang, Zhongyu Wei, Jin Ma, Ying Shan

Adversarial training is one of the best-performing methods in improving the robustness of deep language models.

PTVD: A Large-Scale Plot-Oriented Multimodal Dataset Based on Television Dramas

1 code implementation • 26 Jun 2023 • Chen Li, Xutan Peng, Teng Wang, Yixiao Ge, Mengyang Liu, Xuyuan Xu, Yexin Wang, Ying Shan

Art forms such as movies and television (TV) dramas are reflections of the real world, which have attracted much attention from the multimodal learning community recently.

Genre classification • Retrieval +1

Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation

no code implementations • 23 Jun 2023 • Qianji Di, Wenxi Ma, Zhongang Qi, Tianxiang Hou, Ying Shan, Hanzi Wang

In this work, we propose a Text-Image-joint Scene Graph Generation (TISGG) model to resolve the unseen triples and improve the generalisation capability of the SGG models.

Graph Generation • Scene Graph Generation +1

TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter

no code implementations • 22 Jun 2023 • Binjie Zhang, Yixiao Ge, Xuyuan Xu, Ying Shan, Mike Zheng Shou

In situations involving system upgrades that require updating the upstream foundation model, it becomes essential to re-train all downstream modules to adapt to the new foundation model, which is inflexible and inefficient.

Question Answering • Retrieval +5

SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation

1 code implementation • 6 Jun 2023 • XueWei Li, Tao Wu, Zhongang Qi, Gaoang Wang, Ying Shan, Xi Li

Experimental results on the Stanford2D3D Panoramic dataset show that SGAT4PASS significantly improves performance and robustness, with approximately a 2% increase in mIoU; when small 3D disturbances occur in the data, the stability of our performance is improved by an order of magnitude.

Semantic Segmentation

PanoGRF: Generalizable Spherical Radiance Fields for Wide-baseline Panoramas

no code implementations • NeurIPS 2023 • Zheng Chen, Yan-Pei Cao, Yuan-Chen Guo, Chen Wang, Ying Shan, Song-Hai Zhang

Unlike generalizable radiance fields trained on perspective images, PanoGRF avoids the information loss from panorama-to-perspective conversion and directly aggregates geometry and appearance features of 3D sample points from each panoramic view based on spherical projection.

Depth Estimation

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

no code implementations • 1 Jun 2023 • Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan, Tien-Tsin Wong

Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then promoted for video generation with the introduction of temporal modules.

Image Generation • Video Generation

Inserting Anybody in Diffusion Models via Celeb Basis

1 code implementation • NeurIPS 2023 • Ge Yuan, Xiaodong Cun, Yong Zhang, Maomao Li, Chenyang Qi, Xintao Wang, Ying Shan, Huicheng Zheng

Empowered by the proposed celeb basis, the new identity in our customized model showcases a better concept combination ability than previous personalization methods.

TaleCrafter: Interactive Story Visualization with Multiple Characters

1 code implementation • 29 May 2023 • Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Yingqing He, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, Yujiu Yang

Accurate story visualization requires several necessary elements, such as identity consistency across frames, alignment between plain text and visual content, and a reasonable layout of objects in images.

Story Visualization • Text-to-Image Generation

TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

1 code implementation • 23 May 2023 • Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan

We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers.

Representation Learning

A Confidence-based Partial Label Learning Model for Crowd-Annotated Named Entity Recognition

1 code implementation • 21 May 2023 • Limao Xiong, Jie Zhou, Qunxi Zhu, Xiao Wang, Yuanbin Wu, Qi Zhang, Tao Gui, Xuanjing Huang, Jin Ma, Ying Shan

Particularly, we propose a Confidence-based Partial Label Learning (CPLL) method to integrate the prior confidence (given by annotators) and posterior confidences (learned by models) for crowd-annotated NER.

named-entity-recognition • Named Entity Recognition +2

What Makes for Good Visual Tokenizers for Large Language Models?

1 code implementation • 20 May 2023 • Guangzhi Wang, Yixiao Ge, Xiaohan Ding, Mohan Kankanhalli, Ying Shan

In our benchmark, which is curated to evaluate MLLMs' visual semantic understanding and fine-grained perception capabilities, we discuss different visual tokenizers pre-trained with dominant methods (i.e., DeiT, CLIP, MAE, DINO), and observe that: i) fully/weakly supervised models capture more semantics than self-supervised models, but the gap is narrowed by scaling up the pre-training dataset.

Image Captioning • Object Counting +2

π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation

1 code implementation • 27 Apr 2023 • Chengyue Wu, Teng Wang, Yixiao Ge, Zeyu Lu, Ruisong Zhou, Ying Shan, Ping Luo

Foundation models have achieved great advances in multi-task learning with a unified interface of unimodal and multimodal tasks.

Multi-Task Learning

NeAI: A Pre-convoluted Representation for Plug-and-Play Neural Ambient Illumination

no code implementations • 18 Apr 2023 • Yiyu Zhuang, Qi Zhang, Xuan Wang, Hao Zhu, Ying Feng, Xiaoyu Li, Ying Shan, Xun Cao

Recent advances in implicit neural representation have demonstrated the ability to recover detailed geometry and material from multi-view images.

MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing

3 code implementations • ICCV 2023 • Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, XiaoHu Qie, Yinqiang Zheng

Despite the success in large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent generation and editing results.

Text-based Image Editing

Improved Test-Time Adaptation for Domain Generalization

1 code implementation • CVPR 2023 • Liang Chen, Yong Zhang, Yibing Song, Ying Shan, Lingqiao Liu

Generally, the performance of a TTT strategy hinges on two main factors: selecting an appropriate auxiliary TTT task for updating and identifying reliable parameters to update during the test phase.

Domain Generalization • Test-time Adaptation

TagGPT: Large Language Models are Zero-shot Multimodal Taggers

1 code implementation • 6 Apr 2023 • Chen Li, Yixiao Ge, Jiayong Mao, Dian Li, Ying Shan

Given a new entity that needs tagging for distribution, TagGPT introduces two alternative options for zero-shot tagging, i.e., a generative method with late semantic matching with the tag set, and another selective method with early matching in prompts.

Optical Character Recognition (OCR) • Prompt Engineering +5

Learning Anchor Transformations for 3D Garment Animation

no code implementations • CVPR 2023 • Fang Zhao, Zekun Li, Shaoli Huang, Junwu Weng, Tianfei Zhou, Guo-Sen Xie, Jue Wang, Ying Shan

Once the anchor transformations are found, per-vertex nonlinear displacements of the garment template can be regressed in a canonical space, which reduces the complexity of deformation space learning.

Position

DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models

1 code implementation • 3 Apr 2023 • Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong

We present DreamAvatar, a text-and-shape guided framework for generating high-quality 3D human avatars with controllable poses.

DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks

1 code implementation • CVPR 2023 • Qiangqiang Wu, Tianyu Yang, Ziquan Liu, Baoyuan Wu, Ying Shan, Antoni B. Chan

However, we find that this simple baseline heavily relies on spatial cues while ignoring temporal relations for frame reconstruction, thus leading to sub-optimal temporal matching representations for VOT and VOS.

 Ranked #1 on Visual Object Tracking on TrackingNet (AUC metric)

Semantic Segmentation • Video Object Segmentation +2

LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation

2 code implementations • CVPR 2023 • Guangcong Zheng, Xianpan Zhou, XueWei Li, Zhongang Qi, Ying Shan, Xi Li

To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form.

Layout-to-Image Generation • Object

VMesh: Hybrid Volume-Mesh Representation for Efficient View Synthesis

no code implementations • 28 Mar 2023 • Yuan-Chen Guo, Yan-Pei Cao, Chen Wang, Yu He, Ying Shan, XiaoHu Qie, Song-Hai Zhang

With the emergence of neural radiance fields (NeRFs), view synthesis quality has reached an unprecedented level.

2k

Accelerating Vision-Language Pretraining with Free Language Modeling

1 code implementation • CVPR 2023 • Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, XiaoHu Qie, Ping Luo

FLM successfully frees the prediction rate from the tie-up with the corruption rate while allowing the corruption spans to be customized for each token to be predicted.

Language Modelling • Masked Language Modeling

HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-with-Regional Depth Distributions

no code implementations • 21 Mar 2023 • Hao Ai, Zidong Cao, Yan-Pei Cao, Ying Shan, Lin Wang

Depth estimation from a monocular 360° image is a burgeoning problem owing to its holistic sensing of a scene.

Depth Estimation • ERP

BoPR: Body-aware Part Regressor for Human Shape and Pose Estimation

1 code implementation • 21 Mar 2023 • Yongkang Cheng, Shaoli Huang, Jifeng Ning, Ying Shan

This paper presents a novel approach for estimating human body shape and pose from monocular images that effectively addresses the challenges of occlusions and depth ambiguity.

3D Human Pose Estimation • Occlusion Handling

Skinned Motion Retargeting with Residual Perception of Motion Semantics & Geometry

1 code implementation • CVPR 2023 • Jiaxu Zhang, Junwu Weng, Di Kang, Fang Zhao, Shaoli Huang, Xuefei Zhe, Linchao Bao, Ying Shan, Jue Wang, Zhigang Tu

Driven by our explored distance-based losses that explicitly model the motion semantics and geometry, these two modules can learn residual motion modifications on the source motion to generate plausible retargeted motion in a single inference without post-processing.

motion retargeting

Binary Embedding-based Retrieval at Tencent

1 code implementation • 17 Feb 2023 • Yukang Gan, Yixiao Ge, Chang Zhou, Shupeng Su, Zhouchuan Xu, Xuyuan Xu, Quanchao Hui, Xiang Chen, Yexin Wang, Ying Shan

To tackle the challenge, we propose a binary embedding-based retrieval (BEBR) engine equipped with a recurrent binarization algorithm that enables customized bits per dimension.

Binarization • Retrieval
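
For intuition, the simplest binary embedding keeps one bit per dimension (the sign) and compares codes by Hamming distance; BEBR's recurrent binarization generalizes this to customized bits per dimension. A toy sketch, not the paper's algorithm:

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Sign-based binarization: one bit per dimension (a deliberate
    simplification of BEBR's recurrent, multi-bit scheme)."""
    return (embeddings > 0).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Number of differing bits; broadcasts a query against a gallery.
    return np.count_nonzero(a != b, axis=-1)

gallery = binarize(np.random.randn(1000, 256))  # 1000 gallery items, 256-d
query = binarize(np.random.randn(256))
nearest = int(np.argmin(hamming(gallery, query)))
```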

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

2 code implementations • 16 Feb 2023 • Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, XiaoHu Qie

In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly.

Image Generation • Style Transfer

RILS: Masked Visual Reconstruction in Language Semantic Space

1 code implementation • CVPR 2023 • Shusheng Yang, Yixiao Ge, Kun Yi, Dian Li, Ying Shan, XiaoHu Qie, Xinggang Wang

Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training.

Sentence

DPE: Disentanglement of Pose and Expression for General Video Portrait Editing

1 code implementation • CVPR 2023 • Youxin Pang, Yong Zhang, Weize Quan, Yanbo Fan, Xiaodong Cun, Ying Shan, Dong-Ming Yan

In this paper, we introduce a novel self-supervised disentanglement framework to decouple pose and expression without 3DMMs and paired data, which consists of a motion editing module, a pose generator, and an expression generator.

Disentanglement • Talking Face Generation +1

ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

no code implementations • CVPR 2023 • Yuxin Chen, Zongyang Ma, Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Ying Shan, Bing Li, Weiming Hu, XiaoHu Qie, Jianping Wu

ViLEM then enforces the model to discriminate the correctness of each word in the plausible negative texts and further correct the wrong words by resorting to image information.

Contrastive Learning • Retrieval +3

Mitigating Artifacts in Real-World Video Super-Resolution Models

1 code implementation • 14 Dec 2022 • Liangbin Xie, Xintao Wang, Shuwei Shi, Jinjin Gu, Chao Dong, Ying Shan

To aggregate a new hidden state that contains fewer artifacts from the hidden state pool, we devise a Selective Cross Attention (SCA) module, in which the attention between input features and each hidden state is calculated.

Video Super-Resolution

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

no code implementations • 6 Dec 2022 • YuChao Gu, Xintao Wang, Yixiao Ge, Ying Shan, XiaoHu Qie, Mike Zheng Shou

Vector-Quantized (VQ-based) generative models usually consist of two basic components, i.e., VQ tokenizers and generative transformers.

Conditional Image Generation

3D GAN Inversion with Facial Symmetry Prior

no code implementations • CVPR 2023 • Fei Yin, Yong Zhang, Xuan Wang, Tengfei Wang, Xiaoyu Li, Yuan Gong, Yanbo Fan, Xiaodong Cun, Ying Shan, Cengiz Oztireli, Yujiu Yang

It is natural to associate 3D GANs with GAN inversion methods to project a real image into the generator's latent space, allowing free-view consistent synthesis and editing, referred to as 3D GAN inversion.

Image Reconstruction • Neural Rendering

Latent Video Diffusion Models for High-Fidelity Long Video Generation

1 code implementation • 23 Nov 2022 • Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen

Diffusion models have shown remarkable results recently but require significant computational resources.

Denoising • Image Generation +3

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

1 code implementation • CVPR 2023 • Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, Fei Wang

We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render for talking head generation.

Image Animation • Talking Head Generation

Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields

no code implementations • CVPR 2023 • Yue Chen, Xingyu Chen, Xuan Wang, Qi Zhang, Yu Guo, Ying Shan, Fei Wang

Neural Radiance Fields (NeRF) have achieved photorealistic novel view synthesis; however, the requirement of accurate camera poses limits their application.

Vis2Mus: Exploring Multimodal Representation Mapping for Controllable Music Generation

1 code implementation • 10 Nov 2022 • Runbang Zhang, Yixiao Zhang, Kai Shao, Ying Shan, Gus Xia

In this study, we explore the representation mapping from the domain of visual arts to the domain of music, with which we can use visual arts as an effective handle to control music generation.

Music Generation • Representation Learning +1

Darwinian Model Upgrades: Model Evolving with Selective Compatibility

no code implementations • 13 Oct 2022 • Binjie Zhang, Shupeng Su, Yixiao Ge, Xuyuan Xu, Yexin Wang, Chun Yuan, Mike Zheng Shou, Ying Shan

The traditional model upgrading paradigm for retrieval requires recomputing all gallery embeddings before deploying the new model (dubbed as "backfilling"), which is quite expensive and time-consuming considering billions of instances in industrial applications.

Face Recognition • Retrieval

Robust Human Matting via Semantic Guidance

1 code implementation • 11 Oct 2022 • Xiangguang Chen, Ye Zhu, Yu Li, Bingtao Fu, Lei Sun, Ying Shan, Shan Liu

Unlike previous works, our framework is data efficient, requiring only a small amount of matting ground truth to learn to estimate high-quality object mattes.

Image Matting • Segmentation

MonoNeuralFusion: Online Monocular Neural 3D Reconstruction with Geometric Priors

no code implementations • 30 Sep 2022 • Zi-Xin Zou, Shi-Sheng Huang, Yan-Pei Cao, Tai-Jiang Mu, Ying Shan, Hongbo Fu

This paper introduces a novel neural implicit scene representation with volume rendering for high-fidelity online 3D scene reconstruction from monocular videos.

3D Reconstruction • 3D Scene Reconstruction

Music-driven Dance Regeneration with Controllable Key Pose Constraints

no code implementations • 8 Jul 2022 • Junfu Pu, Ying Shan

By introducing the local neighbor position embedding, the cross-modal transformer decoder gains the capability of synthesizing smooth dance motion sequences that keep consistency with key poses at corresponding positions.

Motion Synthesis

Learning Music-Dance Representations through Explicit-Implicit Rhythm Synchronization

no code implementations • 7 Jul 2022 • Jiashuo Yu, Junfu Pu, Ying Cheng, Rui Feng, Ying Shan

Although audio-visual representation has been proven applicable to many downstream tasks, the representation of dancing videos, which is more specific and always accompanied by music with complex auditory contents, remains challenging and uninvestigated.

Contrastive Learning • Representation Learning +2

Not All Models Are Equal: Predicting Model Transferability in a Self-challenging Fisher Space

1 code implementation • 7 Jul 2022 • Wenqi Shao, Xun Zhao, Yixiao Ge, Zhaoyang Zhang, Lei Yang, Xiaogang Wang, Ying Shan, Ping Luo

It is challenging because the ground-truth model ranking for each task can only be generated by fine-tuning the pre-trained models on the target dataset, which is brute-force and computationally expensive.

Transferability

A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion

no code implementations • 28 Jun 2022 • Xu Li, Shansong Liu, Ying Shan

It is suspected that a single embedding vector may only capture averaged and coarse-grained speaker characteristics, which is insufficient for the SVC task.

Speaker Recognition • Voice Conversion

AnimeSR: Learning Real-World Super-Resolution Models for Animation Videos

1 code implementation • 14 Jun 2022 • Yanze Wu, Xintao Wang, Gen Li, Ying Shan

This paper studies the problem of real-world video super-resolution (VSR) for animation videos, and reveals three key improvements for practical animation VSR.

Video Super-Resolution

Do we really need temporal convolutions in action segmentation?

1 code implementation • 26 May 2022 • Dazhao Du, Bing Su, Yu Li, Zhongang Qi, Lingyu Si, Ying Shan

Most state-of-the-art methods focus on designing temporal convolution-based models, but the inflexibility of temporal convolutions and the difficulties in modeling long-term temporal dependencies restrict the potential of these models.

Action Classification • Action Segmentation +1

Masked Image Modeling with Denoising Contrast

1 code implementation • 19 May 2022 • Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, XiaoHu Qie

Throughout the development of self-supervised visual representation learning from contrastive learning to masked image modeling (MIM), there is no significant difference in essence: both concern how to design proper pretext tasks for vision dictionary look-up.

Contrastive Learning • Denoising +6

VQFR: Blind Face Restoration with Vector-Quantized Dictionary and Parallel Decoder

1 code implementation • 13 May 2022 • YuChao Gu, Xintao Wang, Liangbin Xie, Chao Dong, Gen Li, Ying Shan, Ming-Ming Cheng

Equipped with the VQ codebook as a facial detail dictionary and the parallel decoder design, the proposed VQFR can largely enhance the restored quality of facial details while keeping the fidelity to previous methods.

Blind Face Restoration • Quantization

RepSR: Training Efficient VGG-style Super-Resolution Networks with Structural Re-Parameterization and Batch Normalization

no code implementations • 11 May 2022 • Xintao Wang, Chao Dong, Ying Shan

Extensive experiments demonstrate that our simple RepSR is capable of achieving superior performance to previous SR re-parameterization methods across different model sizes.

Super-Resolution

Accelerating the Training of Video Super-Resolution Models

no code implementations • 10 May 2022 • Lijian Lin, Xintao Wang, Zhongang Qi, Ying Shan

In this work, we show that it is possible to gradually train video models from small to large spatial/temporal sizes, i.e., in an easy-to-hard manner.

Video Super-Resolution

MM-RealSR: Metric Learning based Interactive Modulation for Real-World Super-Resolution

1 code implementation • 10 May 2022 • Chong Mou, Yanze Wu, Xintao Wang, Chao Dong, Jian Zhang, Ying Shan

Instead of using known degradation levels as explicit supervision to the interactive mechanism, we propose a metric learning strategy to map the unquantifiable degradation levels in real-world scenarios to a metric space, which is trained in an unsupervised manner.

Image Restoration • Metric Learning +1

Privacy-Preserving Model Upgrades with Bidirectional Compatible Training in Image Retrieval

1 code implementation • 29 Apr 2022 • Shupeng Su, Binjie Zhang, Yixiao Ge, Xuyuan Xu, Yexin Wang, Chun Yuan, Ying Shan

The task of privacy-preserving model upgrades in image retrieval aims to reap the benefits of rapidly evolving new models without accessing the raw gallery images.

Image Retrieval • Privacy Preserving +1

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

1 code implementation • 26 Apr 2022 • Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, XiaoHu Qie, Ping Luo

Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations, but detailed local semantics are ignored.

Action Recognition • Retrieval +6

Temporally Efficient Vision Transformer for Video Instance Segmentation

3 code implementations • CVPR 2022 • Shusheng Yang, Xinggang Wang, Yu Li, Yuxin Fang, Jiemin Fang, Wenyu Liu, Xun Zhao, Ying Shan

To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS).

Instance Segmentation • Semantic Segmentation +1

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

2 code implementations • ICCV 2023 • Yuxin Fang, Shusheng Yang, Shijie Wang, Yixiao Ge, Ying Shan, Xinggang Wang

We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, which is based on our two novel observations: (i) A MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25%-50% of the input embeddings.

Instance Segmentation • Object +2
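
Observation (i) above concerns feeding the encoder only a random subset of patch embeddings. A minimal sketch of such random sampling (names and shapes are assumptions, not the paper's code):

```python
import torch

def sample_visible_patches(patches: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Randomly keep a subset of patch embeddings of shape (batch, num_patches, dim),
    mimicking the 25%-50% partial observations fed to the MIM pre-trained encoder."""
    b, n, d = patches.shape
    num_keep = max(1, int(n * keep_ratio))
    ids = torch.rand(b, n).argsort(dim=1)[:, :num_keep]  # random patch ids per sample
    return patches.gather(1, ids.unsqueeze(-1).expand(-1, -1, d))

tokens = torch.randn(2, 196, 768)         # e.g., 14x14 patches at ViT-B width
visible = sample_visible_patches(tokens)  # -> (2, 98, 768)
```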

CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation

no code implementations • 31 Mar 2022 • Ziqi Zhang, Yuxin Chen, Zongyang Ma, Zhongang Qi, Chunfeng Yuan, Bing Li, Ying Shan, Weiming Hu

In this paper, we propose CREATE, the first large-scale Chinese shoRt vidEo retrievAl and Title gEneration benchmark, to facilitate research and application in video titling and video retrieval in Chinese.

Retrieval • Video Captioning +1

mc-BEiT: Multi-choice Discretization for Image BERT Pre-training

1 code implementation • 29 Mar 2022 • Xiaotong Li, Yixiao Ge, Kun Yi, Zixuan Hu, Ying Shan, Ling-Yu Duan

Image BERT pre-training with masked image modeling (MIM) becomes a popular practice to cope with self-supervised representation learning.

Instance Segmentation • object-detection +5

UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection

1 code implementation • CVPR 2022 • Ye Liu, Siyuan Li, Yang Wu, Chang Wen Chen, Ying Shan, XiaoHu Qie

Finding relevant moments and highlights in videos according to natural language queries is a natural and highly valuable common need in the current video content explosion era.

Highlight Detection • Moment Retrieval +3

Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval

2 code implementations • 15 Mar 2022 • Guanyu Cai, Yixiao Ge, Binjie Zhang, Alex Jinpeng Wang, Rui Yan, Xudong Lin, Ying Shan, Lianghua He, XiaoHu Qie, Jianping Wu, Mike Zheng Shou

Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval.

Question Answering • Retrieval +4

All in One: Exploring Unified Video-Language Pre-training

1 code implementation • CVPR 2023 • Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we for the first time introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture.

Ranked #6 on TGIF-Transition on TGIF-QA (using extra training data)

Language Modelling • Multiple-choice +10

Towards Universal Backward-Compatible Representation Learning

2 code implementations • 3 Mar 2022 • Binjie Zhang, Yixiao Ge, Yantao Shen, Shupeng Su, Fanzi Wu, Chun Yuan, Xuyuan Xu, Yexin Wang, Ying Shan

The task of backward-compatible representation learning is therefore introduced to support backfill-free model upgrades, where the new query features are interoperable with the old gallery features.

Face Recognition • Representation Learning

Uncertainty Modeling for Out-of-Distribution Generalization

1 code implementation • ICLR 2022 • Xiaotong Li, Yongxing Dai, Yixiao Ge, Jun Liu, Ying Shan, Ling-Yu Duan

In this paper, we improve the network generalization ability by modeling the uncertainty of domain shifts with synthesized feature statistics during training.

Image Classification • Out-of-Distribution Generalization +2

Hot-Refresh Model Upgrades with Regression-Alleviating Compatible Training in Image Retrieval

1 code implementation • 24 Jan 2022 • Binjie Zhang, Yixiao Ge, Yantao Shen, Yu Li, Chun Yuan, Xuyuan Xu, Yexin Wang, Ying Shan

In contrast, hot-refresh model upgrades deploy the new model immediately and then gradually improve the retrieval accuracy by backfilling the gallery on-the-fly.

Image Retrieval • regression +1

Bridging Video-text Retrieval with Multiple Choice Questions

2 code implementations • CVPR 2022 • Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, XiaoHu Qie, Ping Luo

As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.

Action Recognition • Multiple-choice +8

BTS: A Bi-Lingual Benchmark for Text Segmentation in the Wild

no code implementations • CVPR 2022 • Xixi Xu, Zhongang Qi, Jianqi Ma, Honglun Zhang, Ying Shan, XiaoHu Qie

Current research mainly focuses on English characters and digits, while few works study Chinese characters due to the lack of public large-scale and high-quality Chinese datasets, which limits the practical application scenarios of text segmentation.

Segmentation • Style Transfer +2

Object-aware Video-language Pre-training for Retrieval

1 code implementation • CVPR 2022 • Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we present Object-aware Transformers, an object-centric approach that extends video-language transformer to incorporate object representations.

Object Retrieval +2

ACNet: Approaching-and-Centralizing Network for Zero-Shot Sketch-Based Image Retrieval

no code implementations24 Nov 2021 Hao Ren, Ziqiang Zheng, Yang Wu, Hong Lu, Yang Yang, Ying Shan, Sai-Kit Yeung

The huge domain gap between sketches and photos and the highly abstract sketch representations pose challenges for sketch-based image retrieval (SBIR).

Retrieval Sketch-Based Image Retrieval

Hot-Refresh Model Upgrades with Regression-Free Compatible Training in Image Retrieval

no code implementations ICLR 2022 Binjie Zhang, Yixiao Ge, Yantao Shen, Yu Li, Chun Yuan, Xuyuan Xu, Yexin Wang, Ying Shan

In contrast, hot-refresh model upgrades deploy the new model immediately and then gradually improve the retrieval accuracy by backfilling the gallery on-the-fly.

Image Retrieval regression +1

Finding Discriminative Filters for Specific Degradations in Blind Super-Resolution

1 code implementation NeurIPS 2021 Liangbin Xie, Xintao Wang, Chao Dong, Zhongang Qi, Ying Shan

Unlike previous integral gradient methods, our FAIG aims at finding the most discriminative filters instead of input pixels/features for degradation removal in blind SR networks.

Blind Super-Resolution Super-Resolution
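
To make the "filters instead of pixels" point concrete, here is a rough, hypothetical sketch of parameter-space integrated gradients (not the paper's exact formulation): model weights are interpolated from a baseline network to the target blind-SR network, and gradients are accumulated against the weight difference to score each filter.

```python
import torch

def faig_style_attribution(baseline_model, target_model, loss_fn, inputs, steps=100):
    """Attribute a degradation-removal loss to model PARAMETERS rather than pixels."""
    base = [p.detach().clone() for p in baseline_model.parameters()]
    targ = [p.detach().clone() for p in target_model.parameters()]
    attributions = [torch.zeros_like(p) for p in base]
    for s in range(steps):
        alpha = s / steps
        for p, b, t in zip(target_model.parameters(), base, targ):
            p.data = b + alpha * (t - b)                    # move along the path
        loss = loss_fn(target_model(inputs))
        grads = torch.autograd.grad(loss, list(target_model.parameters()))
        for a, g, b, t in zip(attributions, grads, base, targ):
            a += g * (t - b) / steps                        # grad x weight delta
    for p, t in zip(target_model.parameters(), targ):
        p.data = t.clone()                                  # restore target weights
    # score each output filter by the magnitude of its accumulated attribution
    return [a.flatten(1).abs().sum(dim=1) if a.dim() > 1 else a.abs()
            for a in attributions]
```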

Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

2 code implementations27 Jul 2021 Fa-Ting Hong, Jia-Chang Feng, Dan Xu, Ying Shan, Wei-Shi Zheng

In this work, we argue that the features extracted from a pretrained extractor, e.g., I3D, are not WS-TAL task-specific features; thus, feature re-calibration is needed to reduce task-irrelevant information redundancy.

Weakly Supervised Action Localization Weakly-supervised Temporal Action Localization +1

Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data

8 code implementations22 Jul 2021 Xintao Wang, Liangbin Xie, Chao Dong, Ying Shan

Though many attempts have been made in blind super-resolution to restore low-resolution images with unknown and complex degradations, they are still far from addressing general real-world degraded images.

Blind Super-Resolution Video Super-Resolution
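
The "pure synthetic data" comes from a high-order degradation model: a classical blur/resize/noise/JPEG pipeline applied more than once. A much-simplified sketch with OpenCV (the real pipeline also uses sinc filters, varied blur kernels, and shuffled orders):

```python
import cv2
import numpy as np

def degrade_once(img):
    """One stage of a simplified classical degradation: blur -> resize -> noise -> JPEG."""
    img = cv2.GaussianBlur(img, (7, 7), sigmaX=np.random.uniform(0.2, 3.0))
    scale = np.random.uniform(0.5, 1.0)
    img = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
    img = np.clip(img + np.random.normal(0, np.random.uniform(1, 10), img.shape), 0, 255)
    quality = int(np.random.uniform(30, 95))
    ok, enc = cv2.imencode(".jpg", img.astype(np.uint8),
                           [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    return cv2.imdecode(enc, cv2.IMREAD_COLOR).astype(np.float32)

def high_order_degradation(hr_img, order=2):
    """Apply the classical pipeline repeatedly (second order by default) to
    cover complex real-world degradations with purely synthetic training pairs."""
    lr = hr_img.astype(np.float32)
    for _ in range(order):
        lr = degrade_once(lr)
    return lr
```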

Tracking Instances as Queries

1 code implementation22 Jun 2021 Shusheng Yang, Yuxin Fang, Xinggang Wang, Yu Li, Ying Shan, Bin Feng, Wenyu Liu

Recently, query-based deep networks have attracted much attention owing to their end-to-end pipelines and competitive results on several fundamental computer vision tasks, such as object detection, semantic segmentation, and instance segmentation.

Instance Segmentation object-detection +4

Instances as Queries

5 code implementations ICCV 2021 Yuxin Fang, Shusheng Yang, Xinggang Wang, Yu Li, Chen Fang, Ying Shan, Bin Feng, Wenyu Liu

The key insight of QueryInst is to leverage the intrinsic one-to-one correspondence in object queries across different stages, as well as one-to-one correspondence between mask RoI features and object queries in the same stage.

Ranked #13 on Object Detection on COCO-O (using extra training data)

Instance Segmentation Object +4
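
One way to picture the query/RoI one-to-one correspondence is a dynamic mask head in which each query generates the convolution parameters applied to its own RoI features. The sketch below is a toy illustration with hypothetical shapes, not the QueryInst implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMaskHead(nn.Module):
    """Toy query-driven dynamic convolution: one kernel per object query,
    applied to the matching mask RoI features (one-to-one within a stage)."""

    def __init__(self, q_dim=256, c=8, k=1):
        super().__init__()
        self.c, self.k = c, k
        self.param_gen = nn.Linear(q_dim, c * c * k * k)  # per-query conv params

    def forward(self, queries, roi_feats):
        # queries: (N, q_dim); roi_feats: (N, c, H, W), the i-th RoI for the i-th query
        kernels = self.param_gen(queries).view(-1, self.c, self.c, self.k, self.k)
        return torch.stack([
            F.conv2d(f.unsqueeze(0), w, padding=self.k // 2).squeeze(0)
            for f, w in zip(roi_feats, kernels)
        ])
```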

Distilling Audio-Visual Knowledge by Compositional Contrastive Learning

1 code implementation CVPR 2021 Yanbei Chen, Yongqin Xian, A. Sophia Koepke, Ying Shan, Zeynep Akata

Having access to multi-modal cues (e.g., vision and audio) enables some cognitive tasks to be performed faster than learning from a single modality.

Audio Tagging audio-visual learning +5

Crossover Learning for Fast Online Video Instance Segmentation

1 code implementation ICCV 2021 Shusheng Yang, Yuxin Fang, Xinggang Wang, Yu Li, Chen Fang, Ying Shan, Bin Feng, Wenyu Liu

For temporal information modeling in VIS, we present a novel crossover learning scheme that uses the instance feature in the current frame to pixel-wisely localize the same instance in other frames.

Instance Segmentation Semantic Segmentation +2
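
The pixel-wise localization can be sketched as a simple channel-wise correlation, assuming normalized embeddings (illustrative only, not the released code):

```python
import torch
import torch.nn.functional as F

def localize_instance_in_other_frame(inst_feat, other_frame_feat):
    """Crossover idea in miniature: correlate an instance embedding from the
    current frame against every pixel of another frame's feature map.
    inst_feat: (C,); other_frame_feat: (C, H, W) -> similarity map (H, W)."""
    sim = torch.einsum("c,chw->hw",
                       F.normalize(inst_feat, dim=0),
                       F.normalize(other_frame_feat, dim=0))
    return sim  # high responses mark the same instance's location in that frame
```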

Open-book Video Captioning with Retrieve-Copy-Generate Network

no code implementations CVPR 2021 Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Ying Shan, Bing Li, Ying Deng, Weiming Hu

Due to the rapid emergence of short videos and the requirement for content understanding and creation, the video captioning task has received increasing attention in recent years.

Retrieval Video Captioning

Towards Real-World Blind Face Restoration with Generative Facial Prior

1 code implementation CVPR 2021 Xintao Wang, Yu Li, Honglun Zhang, Ying Shan

Blind face restoration usually relies on facial priors, such as facial geometry prior or reference prior, to restore realistic and faithful details.

Blind Face Restoration Video Super-Resolution

Non-Inherent Feature Compatible Learning

no code implementations1 Jan 2021 Yantao Shen, Fanzi Wu, Ying Shan

In this work, we introduce an approach for feature compatible learning without inheriting the old classifier and training data, i.e., Non-Inherent Feature Compatible Learning.

Retrieval

Detecting Interactions from Neural Networks via Topological Analysis

no code implementations NeurIPS 2020 Zirui Liu, Qingquan Song, Kaixiong Zhou, Ting-Hsiang Wang, Ying Shan, Xia Hu

Motivated by the observation, in this paper, we propose to investigate the interaction detection problem from a novel topological perspective by analyzing the connectivity in neural networks.

Towards Interaction Detection Using Topological Analysis on Neural Networks

no code implementations25 Oct 2020 Zirui Liu, Qingquan Song, Kaixiong Zhou, Ting Hsiang Wang, Ying Shan, Xia Hu

Detecting statistical interactions between input features is a crucial and challenging task.

A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention

no code implementations23 Sep 2020 Binjie Zhang, Yu Li, Chun Yuan, Dejing Xu, Pin Jiang, Ying Shan

The task of language-guided video temporal grounding is to localize the particular video clip corresponding to a query sentence in an untrimmed video.

Sentence

Dual Semantic Fusion Network for Video Object Detection

no code implementations16 Sep 2020 Lijian Lin, Haosheng Chen, Honglun Zhang, Jun Liang, Yu Li, Ying Shan, Hanzi Wang

Video object detection is a challenging task due to the deteriorated quality of video sequences captured in complex environments.

Object object-detection +2

Recurrent Binary Embedding for GPU-Enabled Exhaustive Retrieval from Billion-Scale Semantic Vectors

no code implementations18 Feb 2018 Ying Shan, Jian Jiao, Jie Zhu, JC Mao

Building on top of the powerful concept of semantic learning, this paper proposes a Recurrent Binary Embedding (RBE) model that learns compact representations for real-time retrieval.

Information Retrieval Retrieval
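
The RBE model itself is out of scope for a snippet, but the retrieval primitive it enables (exhaustive XOR-and-popcount scanning over compact binary codes) is easy to sketch. NumPy is used here for clarity; the paper runs the equivalent on GPUs at billion scale.

```python
import numpy as np

def pack_codes(bits):
    """bits: (N, D) array of {0, 1} -> bit-packed uint8 codes for compact storage."""
    return np.packbits(bits.astype(np.uint8), axis=1)

def hamming_search(query_code, gallery_codes, k=10):
    """Exhaustive Hamming-distance scan over packed binary codes."""
    xor = np.bitwise_xor(gallery_codes, query_code[None, :])  # differing bits
    dists = np.unpackbits(xor, axis=1).sum(axis=1)            # popcount per item
    return np.argsort(dists)[:k]                              # top-k nearest codes
```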

Deep Embedding Forest: Forest-based Serving with Deep Embedding Features

no code implementations15 Mar 2017 Jie Zhu, Ying Shan, JC Mao, Dong Yu, Holakou Rahmanian, Yi Zhang

Built on top of a representative DNN model called Deep Crossing, and two forest/tree-based models, XGBoost and LightGBM, the two-step Deep Embedding Forest algorithm is demonstrated to achieve on-par or slightly better performance than its DNN counterpart, with only a fraction of the serving time on conventional hardware.
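
A hypothetical end-to-end sketch of the two steps (names and hyper-parameters illustrative): train a small DNN, reuse its penultimate layer as the embedding function, then fit a gradient-boosted forest on the embeddings so that serving needs only the cheap embedding pass plus fast tree evaluation.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import xgboost as xgb

def train_deep_embedding_forest(X, y, d_emb=32, epochs=50):
    """Step 1: learn an embedding with a DNN; step 2: fit a forest on the embeddings."""
    Xt = torch.tensor(X, dtype=torch.float32)
    yt = torch.tensor(y, dtype=torch.long)
    net = nn.Sequential(nn.Linear(X.shape[1], 128), nn.ReLU(), nn.Linear(128, d_emb))
    head = nn.Linear(d_emb, int(yt.max()) + 1)
    opt = torch.optim.Adam(list(net.parameters()) + list(head.parameters()))
    for _ in range(epochs):                       # step 1: train embedding + head
        opt.zero_grad()
        F.cross_entropy(head(net(Xt)), yt).backward()
        opt.step()
    with torch.no_grad():
        emb = net(Xt).numpy()                     # embedding features for serving
    forest = xgb.XGBClassifier(n_estimators=100)  # step 2: forest replaces the head
    forest.fit(emb, y)
    return net, forest
```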
