1 code implementation • 22 Apr 2024 • Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen
Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to the baseline when processing inputs of 16K tokens.
2 code implementations • 4 Apr 2024 • Sean Farhat, Deming Chen
We observe that, when distilled on a task from a pre-trained teacher model, a small model can match or surpass the performance it would attain if it were pre-trained and then fine-tuned on that task.
1 code implementation • 31 Jan 2024 • Hongpeng Guo, Haotian Gu, Xiaoyang Wang, Bo Chen, Eun Kyung Lee, Tamar Eilam, Deming Chen, Klara Nahrstedt
Federated learning (FL) is a machine learning paradigm that allows multiple clients to collaboratively train a shared model while keeping their data on-premise.
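The FL setup described here can be made concrete with a minimal federated-averaging sketch (plain FedAvg for illustration, not necessarily the exact algorithm this paper studies; the model, client loaders, and hyperparameters are hypothetical):

```python
import copy
import torch
import torch.nn as nn

def federated_round(global_model, client_loaders, lr=0.01, local_epochs=1):
    """One round of federated averaging: each client trains a copy of the
    shared model on its own data, then the server averages the weights."""
    client_states = []
    for loader in client_loaders:
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(local_epochs):
            for x, y in loader:                # raw data never leaves the client
                opt.zero_grad()
                loss_fn(local(x), y).backward()
                opt.step()
        client_states.append(local.state_dict())   # only weights are shared
    # Server-side aggregation: element-wise mean of the client weights.
    avg = {k: torch.stack([s[k].float() for s in client_states]).mean(0)
           for k in client_states[0]}
    global_model.load_state_dict(avg)
    return global_model
```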
no code implementations • 22 Jan 2024 • Hanchen Ye, David Z. Pan, Chris Leary, Deming Chen, Xiaoqing Xu
This paper proposes ISDC, a novel feedback-guided iterative system of difference constraints (SDC) scheduling algorithm for high-level synthesis (HLS).
1 code implementation • 19 Jan 2024 • Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao
We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases. Medusa-1: Medusa is fine-tuned directly on top of a frozen backbone LLM, enabling lossless inference acceleration.
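A rough sketch of the Medusa-1 setup: lightweight extra decoding heads attached to a frozen backbone, with only the heads trained (the head architecture below is a simplified stand-in, and the embedding backbone is a placeholder for a real LLM):

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """K extra decoding heads; head k predicts the token at position t+1+k
    from the backbone's hidden state at position t."""
    def __init__(self, hidden_size, vocab_size, num_heads=4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.SiLU(),
                          nn.Linear(hidden_size, vocab_size))
            for _ in range(num_heads))

    def forward(self, hidden):                     # hidden: (batch, seq, hidden)
        return [head(hidden) for head in self.heads]

backbone = nn.Embedding(32000, 512)                # placeholder for a frozen LLM
for p in backbone.parameters():
    p.requires_grad = False                        # Medusa-1: backbone stays frozen
heads = MedusaHeads(hidden_size=512, vocab_size=32000)
optimizer = torch.optim.AdamW(heads.parameters(), lr=1e-3)  # train the heads only
```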
no code implementations • ICCV 2023 • Yuhong Li, Jiajie Li, Cong Hao, Pan Li, JinJun Xiong, Deming Chen
We further propose a Discrete Proxy Search (DPS) method to find the optimized training settings for Eproxy with only a handful of benchmarked architectures on the target tasks.
1 code implementation • 17 Oct 2022 • Yuhong Li, Tianle Cai, Yi Zhang, Deming Chen, Debadeepta Dey
We focus on the structure of the convolution kernel and identify two critical but intuitive principles, enjoyed by S4, that suffice to build an effective global convolutional model: 1) the parameterization of the convolutional kernel needs to be efficient, in the sense that the number of parameters should scale sub-linearly with sequence length.
Ranked #6 on Long-range modeling on LRA
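One way to satisfy the sub-linear parameterization principle above is a multi-scale construction in the spirit of this paper: a few fixed-size sub-kernels are upsampled to exponentially larger spans with decaying magnitude, so a length-L kernel costs O(log L) parameters (the base size and decay factor below are illustrative, not the paper's exact settings):

```python
import math
import torch
import torch.nn.functional as F

def sublinear_global_kernel(seq_len, base=16, decay=0.5):
    """Build a length-seq_len kernel from O(log L) fixed-size sub-kernels:
    each scale stores `base` weights, upsampled to an exponentially larger
    span with decaying magnitude, so the parameter count grows
    logarithmically rather than linearly with sequence length."""
    num_scales = max(1, math.ceil(math.log2(seq_len / base)) + 1)
    subkernels = [torch.randn(base) for _ in range(num_scales)]  # learnable in practice
    pieces = []
    for i, sub in enumerate(subkernels):
        span = base * 2 ** i                       # exponentially larger span
        up = F.interpolate(sub.view(1, 1, -1), size=span, mode="linear",
                           align_corners=False).view(-1)
        pieces.append(up * decay ** i)             # decaying magnitude per scale
    return torch.cat(pieces)[:seq_len]

k = sublinear_global_kernel(4096)   # a 4096-tap kernel from only 16 * 9 = 144 parameters
```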
1 code implementation • 17 Oct 2022 • Yuhong Li, Jiajie Li, Cong Han, Pan Li, JinJun Xiong, Deming Chen
(2) Efficient proxies are not extensible to multi-modality downstream tasks.
no code implementations • 22 Jul 2022 • Yao Chen, Junhao Pan, Xinheng Liu, JinJun Xiong, Deming Chen
In this study, we propose HiKonv, a unified solution that maximizes the throughput of convolution on a given underlying processing unit with low-bitwidth quantized data inputs through novel bit-wise management and parallel computation.
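The underlying bit-packing trick can be illustrated in a few lines: two low-bitwidth operands placed in disjoint bit fields of one wide word yield both products with a single multiplication (a simplified unsigned 4-bit case; HiKonv's actual scheme also handles signed values, accumulation across guard bits, and arbitrary bitwidths):

```python
SHIFT = 16                       # guard distance so partial products cannot overlap
MASK = (1 << SHIFT) - 1

def packed_mul(a0, a1, b):
    """Compute a0*b and a1*b with a single wide multiplication.
    a0, a1, b are 4-bit unsigned values, so each product fits in
    8 bits, well inside its 16-bit field."""
    packed = a0 | (a1 << SHIFT)          # place operands in disjoint bit fields
    product = packed * b                 # one multiply yields both products
    return product & MASK, (product >> SHIFT) & MASK

assert packed_mul(7, 13, 9) == (7 * 9, 13 * 9)
```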
no code implementations • 18 Jul 2022 • Vibhakar Vemulapati, Deming Chen
Simultaneous Localization and Mapping (SLAM) is one of the main components of autonomous navigation systems.
no code implementations • 3 Jul 2022 • Mang Yu, Sitao Huang, Deming Chen
However, with this high flexibility comes difficulty in design and optimization.
no code implementations • 6 Jun 2022 • Xiaofan Zhang, Yao Chen, Cong Hao, Sitao Huang, Yuhong Li, Deming Chen
Deep Neural Networks (DNNs) have achieved great success in a variety of machine learning (ML) applications, delivering high-quality inference solutions in computer vision, natural language processing, virtual reality, and beyond.
no code implementations • 30 Mar 2022 • Philip Harris, Erik Katsavounidis, William Patrick McCormack, Dylan Rankin, Yongbin Feng, Abhijith Gandrakota, Christian Herwig, Burt Holzman, Kevin Pedro, Nhan Tran, Tingjun Yang, Jennifer Ngadiuba, Michael Coughlin, Scott Hauck, Shih-Chieh Hsu, Elham E Khoda, Deming Chen, Mark Neubauer, Javier Duarte, Georgia Karagiorgi, Mia Liu
Machine learning (ML) is becoming an increasingly important component of cutting-edge physics research, but its computational requirements present significant challenges.
no code implementations • 21 Jan 2022 • Xiaofan Zhang, Zongwei Zhou, Deming Chen, Yu Emma Wang
By evaluating on SQuAD, a model found by AutoDistill achieves an 88.4% F1 score with 22.8M parameters, which reduces parameters by more than 62% while maintaining higher accuracy than DistilBERT, TinyBERT, and NAS-BERT.
no code implementations • 28 Dec 2021 • Xinheng Liu, Yao Chen, Prakhar Ganesh, Junhao Pan, JinJun Xiong, Deming Chen
Quantization for Convolutional Neural Network (CNN) has shown significant progress with the intention of reducing the cost of computation and storage with low-bitwidth data inputs.
1 code implementation • 24 Nov 2021 • Qian Jiang, Xiaofan Zhang, Deming Chen, Minh N. Do, Raymond A. Yeh
In this work, we propose End-to-end Hardware-aware DNAS (EH-DNAS), a seamless integration of end-to-end hardware benchmarking and fully automated DNAS to deliver hardware-efficient deep neural networks on various platforms, including Edge GPUs, Edge TPUs, Mobile CPUs, and customized accelerators.
1 code implementation • 26 Oct 2021 • Prakhar Ganesh, Yao Chen, Yin Yang, Deming Chen, Marianne Winslett
Performance of object detection models has been growing rapidly on two major fronts: model accuracy and efficiency.
2 code implementations • NeurIPS 2021 • Yuhong Li, Cong Hao, Pan Li, JinJun Xiong, Deming Chen
Such a self-supervised regression task can effectively evaluate the intrinsic power of an architecture to capture and transform the input signal patterns, and allows fuller use of the training samples.
Ranked #1 on Neural Architecture Search on NAS-Bench-101 (Spearman Correlation metric)
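Schematically, the proxy works like the sketch below: every candidate architecture is trained for a handful of steps on the same fixed synthetic regression task, and the final regression loss ranks the candidates (the synthetic target, input shape, and budget are placeholders, not the paper's configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def regression_proxy_score(arch: nn.Module, steps=30, lr=1e-3):
    """Score an architecture by how well it learns, in a few gradient
    steps, to regress a fixed synthetic signal from random inputs."""
    torch.manual_seed(0)                            # identical task for every candidate
    x = torch.randn(64, 3, 32, 32)                  # arch must map this to shape (64, 1)
    target = torch.sin(torch.linspace(0, 8, 64)).unsqueeze(1)  # synthetic signal
    opt = torch.optim.Adam(arch.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(arch(x), target)
        loss.backward()
        opt.step()
    return -loss.item()                             # higher score = better proxy rank
```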
1 code implementation • 9 Jul 2021 • Xinheng Liu, Yao Chen, Cong Hao, Ashutosh Dhar, Deming Chen
We implement our proposed accelerator on multiple FPGAs, which outperforms the state-of-the-art designs in terms of both throughput and DSP efficiency.
no code implementations • 8 Apr 2021 • Cong Hao, Deming Chen
We formulate the MMMT model and heterogeneous hardware implementation co-design as a differentiable optimization problem, with the objective of improving the solution quality and reducing the overall power consumption and critical path latency.
no code implementations • 25 Mar 2021 • Cong Hao, Jordan Dotzel, JinJun Xiong, Luca Benini, Zhiru Zhang, Deming Chen
Artificial intelligence (AI) technologies have dramatically advanced in recent years, resulting in revolutionary changes in people's lives.
no code implementations • 8 Mar 2021 • Xiaofan Zhang, Dawei Wang, Pierce Chuang, Shugao Ma, Deming Chen, Yuecheng Li
Creating virtual avatars with realistic rendering is one of the most essential and challenging tasks to provide highly immersive virtual reality (VR) experiences.
1 code implementation • 4 Mar 2021 • Seung Won Min, Kun Wu, Sitao Huang, Mert Hidayetoğlu, JinJun Xiong, Eiman Ebrahimi, Deming Chen, Wen-mei Hwu
In this work, we propose a novel GPU-oriented data communication approach for GCN training, where GPU threads directly access sparse features in host memory through zero-copy accesses without much CPU help.
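A sketch of the zero-copy pattern in Numba CUDA (an illustration of the mechanism, not the paper's implementation; sizes are arbitrary and a CUDA-capable GPU is assumed): the feature matrix stays in pinned host memory mapped into the GPU address space, and each GPU thread gathers the rows it needs directly.

```python
import numpy as np
from numba import cuda

@cuda.jit
def gather_features(feat, idx, out):
    """Each thread copies one selected feature row; `feat` lives in host
    memory but is mapped into the GPU address space (zero-copy), so the
    read happens directly without a bulk CPU-side copy."""
    i = cuda.grid(1)
    if i < idx.size:
        for j in range(feat.shape[1]):
            out[i, j] = feat[idx[i], j]

num_nodes, dim, batch = 100_000, 128, 1024
feat = cuda.mapped_array((num_nodes, dim), dtype=np.float32)  # pinned + mapped host memory
feat[:] = np.random.rand(num_nodes, dim).astype(np.float32)
idx = cuda.to_device(np.random.randint(0, num_nodes, batch))  # sampled node IDs
out = cuda.device_array((batch, dim), dtype=np.float32)
gather_features[(batch + 127) // 128, 128](feat, idx, out)
```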
1 code implementation • 20 Jan 2021 • Seung Won Min, Kun Wu, Sitao Huang, Mert Hidayetoğlu, JinJun Xiong, Eiman Ebrahimi, Deming Chen, Wen-mei Hwu
While this process accounts for a significant portion of the training time, we find existing GNN implementations using popular deep neural network (DNN) libraries such as PyTorch are limited to a CPU-centric approach for the entire data preparation step.
no code implementations • 1 Jan 2021 • Hyunmin Jeong, Deming Chen
This is the first work to run a highly compressed DNN in parallel with the original DNN, significantly improving latency while effectively maintaining the original model's accuracy.
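Conceptually, the big/little idea looks like this confidence-gated sketch (shown sequentially for clarity, whereas the paper runs the two models in parallel; the threshold and single-input assumption are illustrative):

```python
import torch

@torch.no_grad()
def big_little_infer(little, big, x, threshold=0.9):
    """Serve the compressed model's prediction when it is confident enough;
    fall back to the original model otherwise. Most inputs take the cheap
    path, so average latency drops with little accuracy loss."""
    probs = torch.softmax(little(x), dim=-1)
    conf, pred = probs.max(dim=-1)
    if conf.item() >= threshold:         # assumes a single input (batch size 1)
        return pred
    return big(x).argmax(dim=-1)         # rare, expensive path
```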
1 code implementation • 1 Jan 2021 • Yuhong Li, Cong Hao, Xiaofan Zhang, JinJun Xiong, Wen-mei Hwu, Deming Chen
This raises the question of whether we can find an effective proxy search space (PS) that is only a small subset of GS to dramatically improve RandomNAS’s search efficiency while at the same time keeping a good correlation for the top-performing architectures.
2 code implementations • 22 Dec 2020 • Yichi Zhang, Junhao Pan, Xinheng Liu, Hongzheng Chen, Deming Chen, Zhiru Zhang
We design an efficient FPGA-based accelerator for our novel BNN model that supports the fractional activations.
no code implementations • 14 Oct 2020 • Cong Hao, Yao Chen, Xiaofan Zhang, Yuhong Li, JinJun Xiong, Wen-mei Hwu, Deming Chen
High quality AI solutions require joint optimization of AI algorithms, such as deep neural networks (DNNs), and their hardware accelerators.
no code implementations • 10 Jul 2020 • Yun Heo, Gowthami Manikandan, Anand Ramachandran, Deming Chen
We also present a compilation of sequencing datasets for Illumina, PacBio and ONT platforms that present challenging scenarios for error-correction tools.
1 code implementation • 18 May 2020 • Cheng Gong, Yao Chen, Ye Lu, Tao Li, Cong Hao, Deming Chen
Quantization has been proven to be an effective method for reducing the computing and/or storage cost of DNNs.
no code implementations • 6 May 2020 • Yuhong Li, Cong Hao, Xiaofan Zhang, Xinheng Liu, Yao Chen, JinJun Xiong, Wen-mei Hwu, Deming Chen
We formulate the co-search problem by fusing DNN search variables and hardware implementation variables into one solution space, and maximize both algorithm accuracy and hardware implementation quality.
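In pseudo-code form, the fused space might look like the toy random-search sketch below (an illustration of the formulation only; the variables, proxies, and weighting are hypothetical placeholders, not the paper's actual search space or optimizer):

```python
import random

# Hypothetical fused search space: DNN variables and hardware-implementation
# variables are sampled jointly and scored by a single objective.
SPACE = {
    "depth":      [8, 12, 16],           # DNN variables
    "width_mult": [0.5, 0.75, 1.0],
    "pe_array":   [(8, 8), (16, 16)],    # hardware variables
    "quant_bits": [4, 8],
}

def sample():
    return {k: random.choice(v) for k, v in SPACE.items()}

def score(cfg, acc_proxy, hw_proxy, alpha=0.5):
    """Single objective over the fused space: weighted accuracy proxy
    plus hardware-quality proxy (both evaluators are placeholders)."""
    return alpha * acc_proxy(cfg) + (1 - alpha) * hw_proxy(cfg)

# Toy proxies: favor wider/deeper models and cheaper hardware configs.
acc = lambda c: c["width_mult"] * c["depth"] / 16
hw = lambda c: 1.0 / (c["quant_bits"] * c["pe_array"][0])
best = max((sample() for _ in range(200)), key=lambda c: score(c, acc, hw))
```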
no code implementations • 8 Apr 2020 • Hanchen Ye, Xiaofan Zhang, Zhize Huang, Gengsheng Chen, Deming Chen
To speed up Deep Neural Network (DNN) accelerator design and enable effective implementation, we propose HybridDNN, a framework for building high-performance hybrid DNN accelerators and delivering FPGA-based hardware implementations.
no code implementations • 27 Feb 2020 • Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Hassan Sajjad, Preslav Nakov, Deming Chen, Marianne Winslett
Pre-trained Transformer-based models have achieved state-of-the-art performance for various Natural Language Processing (NLP) tasks.
1 code implementation • 6 Jan 2020 • Pengfei Xu, Xiaofan Zhang, Cong Hao, Yang Zhao, Yongan Zhang, Yue Wang, Chaojian Li, Zetong Guan, Deming Chen, Yingyan Lin
Specifically, AutoDNNchip consists of two integrated enablers: (1) a Chip Predictor, built on top of a graph-based accelerator representation, which can accurately and efficiently predict a DNN accelerator's energy, throughput, and area based on the DNN model parameters, hardware configuration, technology-based IPs, and platform constraints; and (2) a Chip Builder, which can automatically explore the design space of DNN chips (including IP selection, block configuration, resource balancing, etc.).
no code implementations • 18 Nov 2019 • Cong Hao, Yao Chen, Xinheng Liu, Atif Sarwari, Daryl Sew, Ashutosh Dhar, Bryan Wu, Dongdong Fu, JinJun Xiong, Wen-mei Hwu, Junli Gu, Deming Chen
The rapidly growing demands for powerful AI algorithms in many application domains have motivated massive investment in both high-quality deep neural network (DNN) models and high-efficiency implementations.
2 code implementations • 20 Sep 2019 • Xiaofan Zhang, Haoming Lu, Cong Hao, Jiachen Li, Bowen Cheng, Yuhong Li, Kyle Rupnow, JinJun Xiong, Thomas Huang, Honghui Shi, Wen-mei Hwu, Deming Chen
Object detection and tracking are challenging tasks for resource-constrained embedded systems.
1 code implementation • 25 Jun 2019 • Xiaofan Zhang, Cong Hao, Haoming Lu, Jiachen Li, Yuhong Li, Yuchen Fan, Kyle Rupnow, JinJun Xiong, Thomas Huang, Honghui Shi, Wen-mei Hwu, Deming Chen
Developing artificial intelligence (AI) at the edge is always challenging, since edge devices have limited computation capability and memory resources but need to meet demanding requirements, such as real-time processing, high throughput performance, and high inference accuracy.
2 code implementations • 20 May 2019 • Xiaofan Zhang, Cong Hao, Yuhong Li, Yao Chen, JinJun Xiong, Wen-mei Hwu, Deming Chen
Developing deep learning models for resource-constrained Internet-of-Things (IoT) devices is challenging, as it is difficult to achieve both good quality of results (QoR), such as DNN model inference accuracy, and quality of service (QoS), such as inference latency, throughput, and power consumption.
2 code implementations • 9 Apr 2019 • Cong Hao, Xiaofan Zhang, Yuhong Li, Sitao Huang, JinJun Xiong, Kyle Rupnow, Wen-mei Hwu, Deming Chen
While embedded FPGAs are attractive platforms for DNN acceleration on edge devices thanks to their low latency and high energy efficiency, the scarcity of resources on edge-scale FPGA devices makes DNN deployment challenging.
4 code implementations • 7 Feb 2019 • Yuhong Li, Xiaofan Zhang, Deming Chen
It combines a Convolutional Neural Network (CNN) backbone and a cross-correlation operator, and takes advantage of the features from exemplary images for more accurate object tracking.
Ranked #1 on Visual Object Tracking on OTB-50
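The cross-correlation operator named here is the familiar SiamFC-style operation: the exemplar's feature map is used as a convolution kernel over the search region's feature map, and the response peak locates the target. A minimal sketch (backbone omitted; feature shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def cross_correlation(exemplar_feat, search_feat):
    """Slide each exemplar's feature map over the matching search-region
    feature map as a convolution kernel; the peak of the resulting
    response map indicates the most likely target location."""
    b, c, h, w = exemplar_feat.shape
    x = search_feat.reshape(1, b * c, *search_feat.shape[2:])
    response = F.conv2d(x, exemplar_feat, groups=b)  # one kernel per batch item
    return response.reshape(b, 1, *response.shape[2:])

# Hypothetical feature maps from a CNN backbone:
z = torch.randn(2, 256, 6, 6)       # exemplar (template) features
s = torch.randn(2, 256, 22, 22)     # search-region features
print(cross_correlation(z, s).shape)  # torch.Size([2, 1, 17, 17])
```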
no code implementations • 5 Nov 2018 • Di He, Xuesong Yang, Boon Pang Lim, Yi Liang, Mark Hasegawa-Johnson, Deming Chen
In this paper, the convergence properties of CTC are improved by incorporating acoustic landmarks.
no code implementations • 31 Jul 2018 • Junsong Wang, Qiuwen Lou, Xiaofan Zhang, Chao Zhu, Yonghua Lin, Deming Chen
To create such accelerators, we propose a design flow for accelerating the extremely low bit-width neural network (ELB-NN) in embedded FPGAs with hybrid quantization schemes.
no code implementations • 15 May 2018 • Di He, Boon Pang Lim, Xuesong Yang, Mark Hasegawa-Johnson, Deming Chen
Furui first demonstrated that the identities of both consonants and vowels can be perceived from the consonant-vowel (C-V) transition; later, Stevens proposed that acoustic landmarks are the primary cues for speech perception and that steady-state regions are secondary or supplemental.
Automatic Speech Recognition (ASR) +2
no code implementations • 23 Mar 2018 • Chuanhao Zhuge, Xinheng Liu, Xiaofan Zhang, Sudeep Gummadi, JinJun Xiong, Deming Chen
Deep Convolutional Neural Networks have become a Swiss Army knife for solving critical artificial intelligence tasks.
13 code implementations • CVPR 2018 • Yuhong Li, Xiaofan Zhang, Deming Chen
We demonstrate CSRNet on four datasets (ShanghaiTech, UCF_CC_50, WorldExpo'10, and UCSD) and achieve state-of-the-art performance.
Ranked #3 on Crowd Counting on Venice