1 code implementation • EMNLP (ACL) 2021 • Yash Kumar Lal, Reetu Singh, Harsh Trivedi, Qingqing Cao, Aruna Balasubramanian, Niranjan Balasubramanian
IrEne is an interpretable energy prediction system that accurately predicts the inference energy consumption of a wide range of Transformer-based NLP models.
1 code implementation • 22 Apr 2024 • Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari
To this end, we release OpenELM, a state-of-the-art open language model.
no code implementations • 22 Jan 2024 • Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao
Compared to baselines, our experiments show that APT maintains up to 98% of task performance when pruning RoBERTa and T5 models to 40% of their parameters, and retains 86.4% of LLaMA models' performance with 70% of parameters remaining.
1 code implementation • 2 Oct 2023 • Qingqing Cao, Sewon Min, Yizhong Wang, Hannaneh Hajishirzi
Retrieval augmentation addresses many critical problems in large language models such as hallucination, staleness, and privacy leaks.
no code implementations • 19 Jul 2023 • Hao Peng, Qingqing Cao, Jesse Dodge, Matthew E. Peters, Jared Fernandez, Tom Sherborne, Kyle Lo, Sam Skjonsberg, Emma Strubell, Darrell Plessas, Iz Beltagy, Evan Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi
In response, we introduce Pentathlon, a benchmark for holistic and realistic evaluation of model efficiency.
1 code implementation • NeurIPS 2023 • Aniket Rege, Aditya Kusupati, Sharan Ranjit S, Alan Fan, Qingqing Cao, Sham Kakade, Prateek Jain, Ali Farhadi
Finally, we demonstrate that AdANNS can enable inference-time adaptivity for compute-aware search on ANNS indices built non-adaptively on matryoshka representations.
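A minimal NumPy sketch of the general idea of adaptive search over matryoshka representations, where a short prefix of each embedding is itself a usable coarse representation: shortlist candidates with a cheap low-dimensional prefix, then re-rank with a longer prefix. This is an illustrative assumption about the setup, not the AdANNS implementation; names such as search_adaptive are hypothetical.

```python
import numpy as np

# Hypothetical matryoshka embeddings: the first d dimensions of each vector
# form a usable coarser representation of the same item.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 256)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def search_adaptive(query, shortlist_dim=64, rerank_dim=256, shortlist=200, k=10):
    """Two-stage search: shortlist with a cheap low-dimensional prefix,
    then re-rank the shortlist with a higher-dimensional prefix."""
    q = query / np.linalg.norm(query)
    coarse = corpus[:, :shortlist_dim] @ q[:shortlist_dim]   # cheap pass
    cand = np.argpartition(-coarse, shortlist)[:shortlist]   # candidate set
    fine = corpus[cand, :rerank_dim] @ q[:rerank_dim]        # accurate pass
    return cand[np.argsort(-fine)[:k]]

query = rng.standard_normal(256).astype(np.float32)
print(search_adaptive(query, shortlist_dim=32))  # smaller prefix => less compute
```

Lowering shortlist_dim at query time trades accuracy for compute without rebuilding the index, which is the kind of inference-time adaptivity the sentence above refers to.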
1 code implementation • 27 May 2023 • Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi
Large-scale vision language (VL) models use Transformers to perform cross-modal interactions between the input text and image.
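A minimal PyTorch sketch of the cross-modal interaction pattern described above: text tokens attend over image patch embeddings inside a Transformer block. Dimensions, layer counts, and the class name are illustrative assumptions, not the architecture of any specific VL model.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Text queries attend over image patch keys/values (illustrative only)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text, image):
        # Queries come from text tokens; keys and values from image patches.
        attended, _ = self.attn(query=text, key=image, value=image)
        x = self.norm(text + attended)
        return x + self.ffn(x)

text = torch.randn(2, 32, 256)    # batch, text tokens, hidden dim
image = torch.randn(2, 196, 256)  # batch, image patches, hidden dim
print(CrossModalBlock()(text, image).shape)  # torch.Size([2, 32, 256])
```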
no code implementations • 15 Nov 2022 • Qin Zhang, Shangsi Chen, Dongkuan Xu, Qingqing Cao, Xiaojun Chen, Trevor Cohn, Meng Fang
Thus, a trade-off among accuracy, memory consumption, and processing speed is pursued.
no code implementations • 31 Aug 2022 • Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, Pedro H. Martins, André F. T. Martins, Jessica Zosa Forde, Peter Milder, Edwin Simpson, Noam Slonim, Jesse Dodge, Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, Roy Schwartz
Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows.
1 code implementation • ACL 2021 • Qingqing Cao, Yash Kumar Lal, Harsh Trivedi, Aruna Balasubramanian, Niranjan Balasubramanian
We present IrEne, an interpretable and extensible energy prediction system that accurately predicts the inference energy consumption of a wide range of Transformer-based NLP models.
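A minimal sketch in the spirit of the description above: represent the model as a tree of components, estimate leaf-node energy from resource features, and aggregate child energies so every node's contribution stays visible. The feature names and the linear predictor are illustrative assumptions, not IrEne's actual regressors.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    features: dict = field(default_factory=dict)   # e.g. flops, memory bytes
    children: list = field(default_factory=list)

def predict_energy(node, weights):
    """Leaf: regress energy from resource features; internal: sum children."""
    if not node.children:
        return sum(weights.get(k, 0.0) * v for k, v in node.features.items())
    return sum(predict_energy(child, weights) for child in node.children)

layer = Node("encoder_layer", children=[
    Node("self_attention", {"flops": 2.4e9, "mem_bytes": 3.1e7}),
    Node("feed_forward", {"flops": 4.8e9, "mem_bytes": 6.2e7}),
])
weights = {"flops": 1e-10, "mem_bytes": 5e-9}  # toy calibration coefficients
print(predict_energy(layer, weights), "J (toy estimate)")
```

Because each node carries its own estimate, the per-module breakdown is available alongside the model-level total, which is what makes this style of prediction interpretable.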
no code implementations • 10 Dec 2020 • Qingqing Cao, Oriana Riva, Aruna Balasubramanian, Niranjan Balasubramanian
We present a practical approach, called BewQA, that can answer Bew queries by mining a template of the business-related webpages and using the template to guide the search.
1 code implementation • EMNLP (sustainlp) 2020 • Qingqing Cao, Aruna Balasubramanian, Niranjan Balasubramanian
In this work, we show that existing software-based energy measurements are not accurate because they do not take into account hardware differences and how resource utilization affects energy consumption.
1 code implementation • ACL 2020 • Qingqing Cao, Harsh Trivedi, Aruna Balasubramanian, Niranjan Balasubramanian
It turns out that we can get by without input-wide self-attention at all layers, especially in the lower layers.
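A minimal PyTorch sketch of that idea, under the assumption of a question-answering setup: run the lower layers on the question and the passage separately (no input-wide self-attention), then concatenate and run the upper layers over the full input. Layer counts and dimensions are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

make_layer = lambda: nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
lower = nn.ModuleList([make_layer() for _ in range(3)])
upper = nn.ModuleList([make_layer() for _ in range(3)])

def decomposed_forward(question, passage):
    for layer in lower:                 # independent lower-layer encoding;
        question = layer(question)      # passage representations could be
        passage = layer(passage)        # precomputed and cached offline
    x = torch.cat([question, passage], dim=1)
    for layer in upper:                 # input-wide self-attention only on top
        x = layer(x)
    return x

q = torch.randn(1, 16, 256)
p = torch.randn(1, 128, 256)
print(decomposed_forward(q, p).shape)  # torch.Size([1, 144, 256])
```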
1 code implementation • 3 Jun 2017 • Qingqing Cao, Niranjan Balasubramanian, Aruna Balasubramanian
In this paper, we explore optimizations to run Recurrent Neural Network (RNN) models locally on mobile devices.