MatFormer: Nested Transformer for Elastic Inference

Devvrit; Kudugunta, Sneha; Kusupati, Aditya; Dettmers, Tim; Chen, Kaifeng; Dhillon, Inderjit; Tsvetkov, Yulia; Hajishirzi, Hannaneh; Kakade, Sham; Farhadi, Ali; Jain, Prateek


arXiv:2310.07707 (cs)

[Submitted on 11 Oct 2023 (v1), last revised 15 Dec 2024 (this version, v2)]


Abstract: Foundation models are applied in a broad spectrum of settings with different inference constraints, from massive multi-accelerator clusters to resource-constrained standalone mobile devices. However, the substantial costs associated with training these models often limit the number of unique model sizes that can be offered. Consequently, practitioners are compelled to select a model that may not be optimally aligned with their specific latency and cost requirements. We present MatFormer, a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints. MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model. During training, we optimize the parameters of multiple nested FFN blocks with varying sizes, enabling the extraction of hundreds of accurate smaller models without incurring additional computational costs. We empirically validate the efficacy of MatFormer across different model classes (decoders and encoders) and modalities (language and vision), demonstrating its potential for real-world deployment. We show that an 850M decoder-only MatFormer language model (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters, each exhibiting better validation loss and one-shot downstream evaluations than independently trained counterparts. Furthermore, we observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval. Finally, we showcase that speculative decoding with the accurate and consistent submodels extracted from MatFormer can lead to a significant reduction in inference latency. Project website: this https URL
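
The nested FFN idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`make_ffn`, `nested_ffn`, `joint_loss`), the ReLU activation, the absence of biases, the squared-error objective, and the uniform averaging over sizes are all assumptions made for illustration. The sketch shows only the structural point that a smaller FFN uses the first `m` hidden units of the full FFN, so every submodel shares its weights with the larger ones and can be extracted by slicing.

```python
import random


def make_ffn(d_model, d_ff, seed=0):
    """Create toy FFN weights (hypothetical helper).

    w_in[j]  holds the input weights of hidden unit j (length d_model).
    w_out[j] holds the output weights of hidden unit j (length d_model).
    """
    rng = random.Random(seed)
    w_in = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    w_out = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    return w_in, w_out


def nested_ffn(x, w_in, w_out, m):
    """Apply the FFN using only the first m hidden units.

    ReLU and the lack of bias terms are assumptions of this sketch.
    """
    y = [0.0] * len(x)
    for j in range(m):
        # Pre-activation of hidden unit j, followed by ReLU.
        h = max(0.0, sum(w * xi for w, xi in zip(w_in[j], x)))
        # Accumulate this hidden unit's contribution to the output.
        for k in range(len(y)):
            y[k] += h * w_out[j][k]
    return y


def joint_loss(x, target, w_in, w_out, sizes):
    """Average a squared-error loss over several nested sizes.

    Uniform weighting across sizes is an assumption of this sketch.
    """
    total = 0.0
    for m in sizes:
        y = nested_ffn(x, w_in, w_out, m)
        total += sum((a - b) ** 2 for a, b in zip(y, target)) / len(y)
    return total / len(sizes)


if __name__ == "__main__":
    w_in, w_out = make_ffn(d_model=4, d_ff=8)
    x = [1.0, -0.5, 0.25, 2.0]
    for m in (2, 4, 8):
        print(m, nested_ffn(x, w_in, w_out, m))
```

Because the smaller blocks are prefixes of the larger one, extracting a submodel in this sketch amounts to slicing: `nested_ffn(x, w_in[:m], w_out[:m], m)` gives the same output as `nested_ffn(x, w_in, w_out, m)`, with no retraining step.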

Comments:	30 pages, 11 figures, first three authors contributed equally. NeurIPS, 2024
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2310.07707 [cs.LG]
	(or arXiv:2310.07707v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2310.07707

Submission history

From: Fnu Devvrit
[v1] Wed, 11 Oct 2023 17:57:14 UTC (1,351 KB)
[v2] Sun, 15 Dec 2024 03:45:36 UTC (1,642 KB)
