The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video.
We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
Large language models (LLMs) have fueled many intelligent agent tasks, such as web navigation -- but most existing agents perform far from satisfactorily on real-world webpages due to three factors: (1) the versatility of actions on webpages, (2) HTML text that exceeds the model's processing capacity, and (3) the complexity of decision-making stemming from the open-domain nature of the web.
We propose Magic Clothing, a latent diffusion model (LDM)-based network architecture for an unexplored garment-driven image synthesis task.
We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction".
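As a rough illustration of that coarse-to-fine loop, here is a minimal sketch; the scale schedule and the `predict_next_scale` stub are hypothetical placeholders, not VAR's actual interface.

```python
import torch

def predict_next_scale(prev_token_maps, size):
    """Stand-in for the autoregressive transformer: conditioned on all coarser
    token maps generated so far, emit a (size x size) map of discrete tokens."""
    return torch.randint(0, 4096, (size, size))  # random tokens as a placeholder

def generate(scales=(1, 2, 4, 8, 16)):
    token_maps = []
    for size in scales:                        # coarse-to-fine, not raster order
        token_maps.append(predict_next_scale(token_maps, size))
    return token_maps                          # a VQ decoder would map these to pixels

print([tuple(m.shape) for m in generate()])    # (1, 1), (2, 2), ..., (16, 16)
```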
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
We study how to apply large language models to write grounded and organized long-form articles from scratch, with breadth and depth comparable to Wikipedia pages.
The chains of nodes can be designed to explicitly enforce a naturally structured "thought process".
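A minimal sketch of that idea, with hypothetical node names and a trivial string-based runner (not any specific system's implementation): fixing the order of the nodes is what enforces the structured thought process.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Node:
    name: str                        # e.g. "decompose", "solve", "aggregate"
    run: Callable[[str], str]        # transforms the running context

def run_chain(nodes: List[Node], question: str) -> str:
    context = question
    for node in nodes:               # the fixed node order enforces the structure
        context = node.run(context)
    return context

chain = [
    Node("decompose", lambda c: c + "\n[decomposed into sub-questions]"),
    Node("solve",     lambda c: c + "\n[sub-questions answered]"),
    Node("aggregate", lambda c: c + "\n[final answer assembled]"),
]
print(run_chain(chain, "What is the population-weighted mean?"))
```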
The backbone is trained end-to-end using a novel differentiable solver for wide-baseline two-view pose.
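For context, a differentiable solver of this kind can be sketched with a weighted eight-point estimate of the essential matrix; the code below is an illustrative stand-in under that assumption, not the paper's novel solver, and it stops at the essential matrix (relative rotation and translation would be decomposed from it).

```python
import torch

def essential_from_matches(x1, x2, w):
    """x1, x2: (N, 3) normalized homogeneous points; w: (N,) match confidences."""
    # Each correspondence gives one linear constraint x2^T E x1 = 0.
    A = (x2[:, :, None] * x1[:, None, :]).reshape(-1, 9)    # (N, 9)
    A = w[:, None] * A                                       # soft inlier weights
    # E is the right singular vector with the smallest singular value.
    _, _, Vt = torch.linalg.svd(A, full_matrices=False)
    E = Vt[-1].reshape(3, 3)
    # Project onto the essential manifold: singular values (1, 1, 0).
    U, _, Vt = torch.linalg.svd(E, full_matrices=False)
    return U @ torch.diag(torch.tensor([1.0, 1.0, 0.0])) @ Vt

x1 = torch.randn(32, 3, requires_grad=True)
x2 = torch.randn(32, 3)
E = essential_from_matches(x1, x2, torch.ones(32))
loss = (E - torch.eye(3)).square().sum()
loss.backward()   # gradients reach x1, i.e. they would reach a learned backbone
```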
In this paper, we explore LLMs as copilots that assist humans in proving theorems.
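As a toy illustration of that interaction (a hypothetical scenario in plain Lean 4, not the paper's tooling): the human states the goal and reviews a proof step the model proposes.

```lean
-- Hypothetical copilot loop: the human writes the statement, the model proposes
-- a tactic, and the suggestion is kept only if Lean accepts it.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b  -- a machine-suggested closing step, accepted after review
```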