Proposal

Research Paper Summary

Paper

Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. The lottery ticket hypothesis for pre-trained BERT networks. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS 2020), pages 15834–15846, 2020. [1]

Background

In NLP, enormous pre-trained models such as BERT have become the standard starting point for training across many downstream tasks, reflecting a broader trend toward scaling neural networks to extremely large parameter counts, often in the billions, with rising training cost. In parallel, work on the Lottery Ticket Hypothesis (LTH) has shown that both NLP and computer vision models can contain much smaller “matching” subnetworks that can be trained in isolation to reach full accuracy, and sometimes transfer across related tasks.

Despite these advances, it remained unclear how lottery-ticket phenomena manifest in pre-trained transformer language models—particularly BERT—where the dominant workflow is to reuse a costly pre-trained initialization and fine-tune it for each task. Chen et al. frame BERT as an especially natural testbed because two themes in the lottery-ticket literature—initialization via (dense) pre-training and transferability of discovered subnetworks—closely mirror the pretrain-and-transfer paradigm.

They therefore ask: if the pre-trained weights are treated as the initialization, do matching subnetworks exist for downstream tasks, do they transfer between tasks, and do “universal” subnetworks exist that could reduce the cost of fine-tuning while preserving performance?

Summary of Contributions

Chen et al. show that lottery-ticket phenomena extend to pre-trained Transformer language models. Using unstructured magnitude pruning on BERT, they find “matching” subnetworks—sparse masks that, when fine-tuned, reach the same downstream performance as the full model—at roughly 40%–90% sparsity across standard GLUE and SQuAD tasks.

A central empirical result is that, unlike several prior lottery-ticket studies in larger-scale NLP and vision, these matching subnetworks can be identified directly at the pre-trained initialization (i.e., without needing to rewind to an early point during downstream training). The resulting subnetworks also outperform baselines obtained by pruning randomly or by random reinitialization, underscoring the role of pre-training in producing an initialization that is amenable to sparsification.
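To make the pruning operation concrete, here is a minimal one-shot sketch of unstructured magnitude pruning on a flat weight list. It is illustrative only: Chen et al. apply iterative magnitude pruning to full BERT weight matrices, and the function name and pure-Python setting are our own.

```python
def magnitude_prune(weights, sparsity):
    """Return a 0/1 mask that zeroes the smallest-magnitude weights.

    weights: flat list of floats; sparsity: fraction in [0, 1] to remove.
    One-shot illustration; the paper prunes iteratively over several rounds.
    """
    n_prune = int(len(weights) * sparsity)
    # Rank positions by absolute value; the n_prune smallest get mask 0.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask

w = [0.9, -0.05, 0.4, 0.01, -0.7]
mask = magnitude_prune(w, 0.4)        # drops the two smallest: 0.01 and -0.05
sparse_w = [wi * mi for wi, mi in zip(w, mask)]
```

In practice the mask would be computed and applied per weight matrix with framework tooling (e.g. PyTorch's pruning utilities) rather than Python lists, but the selection rule is the same.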

The paper then analyzes transfer. For most downstream tasks, the sparsity patterns are task-specific: a mask found for one task typically does not transfer well to another. However, they identify an important exception: subnetworks at about 70% sparsity found using BERT’s masked language modeling objective can be universal and transfer across downstream tasks while maintaining accuracy. Together, these findings sharpen both the practical and scientific picture of redundancy in pre-trained NLP models.

Limitations and Discussion

The paper makes a clear case that BERT contains matching subnetworks, but several practical and scientific questions remain open. Most importantly, the work evaluates sparsity largely through downstream accuracy, without reporting end-to-end efficiency (training/inference latency, throughput, or energy).

This matters because unstructured sparsity does not automatically translate into wall-clock or energy gains on typical hardware without optimized sparse kernels. The authors themselves note this gap in the conclusion and point to future hardware studies, citing existing speedups at 70%–90% sparsity with libraries such as XNNPACK on smartphone processors.

Methodologically, the study focuses on unstructured magnitude pruning; it does not compare against structured alternatives (e.g., pruning heads, layers, or intermediate dimensions) that may yield more reliable speedups in standard deployment stacks.

Despite these limitations—and without a dedicated discussion section beyond the conclusion—the core empirical claims are convincing within scope. The gaps primarily hint at the next step: a system-level characterization of accuracy–efficiency tradeoffs for sparse subnetworks under realistic hardware constraints, which directly motivates our project.

Why This Paper?

This paper was brought to our attention during office hours with Professor Diyi Yang, when we presented an initial project idea centered on evaluating accuracy–efficiency tradeoffs in NLP models. The suggested paper proved well aligned with the questions we were trying to sharpen, and it helped clarify both the scope and the intellectual grounding of our project.

The paper resonates with our work because it starts from a shared premise: modern pre-trained language models are substantially overparameterized for many downstream tasks. Chen et al. study this overcapacity directly, showing that highly sparse subnetworks within pre-trained BERT can match the accuracy of the full model. This result provides a concrete empirical foundation for asking a complementary question: if such redundancy exists, how much computation and energy can be reduced while maintaining acceptable task performance?

The paper is also a strong fit because it operates squarely within the pretrain–fine-tune paradigm that dominates contemporary NLP practice. By treating pretraining as a fixed initialization and focusing on downstream adaptation, the study mirrors our intended experimental setup and constraints.

Finally, while the paper convincingly establishes the existence and properties of sparse subnetworks, it does not evaluate system-level efficiency metrics or compare sparsity to alternative efficiency strategies. This gap does not weaken the paper’s conclusions, but it does motivate our work. Reading the paper clarified both what is known about redundancy in pre-trained models and what remains to be understood when efficiency is treated as a first-class objective alongside accuracy.

Wider Research Context

Chen et al.’s study addresses a central question in modern NLP: how large Transformer-based models encode linguistic information during pretraining and reuse it across downstream tasks. Models such as BERT learn deep, distributed representations during large-scale pretraining, with linguistic information encoded across many parameters and layers. By showing that highly sparse subnetworks within a pre-trained BERT model can retain downstream accuracy, this work provides evidence that these representations are substantially redundant rather than tightly localized.

This observation complements prior analyses of redundancy in Transformers, including findings that individual attention heads or weights/components can often be pruned with limited impact on performance [2, 3, 4]. Taken together, these results suggest that effective language representations do not require dense parameterization at inference time, but instead emerge from robust structures formed during pretraining.

The paper also connects to broader discussions about the role of pretraining in modern NLP. Indeed, Chen et al.'s findings reinforce the now-standard view that much of the representational “heavy lifting” occurs during pretraining, while fine-tuning primarily specializes behavior for individual tasks [5].

Finally, the work highlights a growing tension in NLP evaluation. Like much of the literature, it assesses models primarily through downstream accuracy, yet its findings implicitly motivate alternative criteria—such as efficiency and capacity utilization—that have become increasingly important as models scale [6, 7]. In this sense, the paper situates lottery-ticket analyses within emerging efforts to better understand and evaluate model efficiency alongside accuracy.

Project description

Goal

The primary goal of this project is to characterize accuracy–efficiency tradeoffs in pretrained NLP models by treating computational efficiency as a first-class evaluation criterion alongside task performance. Concretely, we seek to answer the following research question: For a fixed NLP task and a fixed pretrained model, which architectural or inference-time reductions yield the best accuracy-per-joule tradeoff under realistic resource constraints?

This question is motivated by a growing mismatch between how NLP models are evaluated in research—typically by accuracy alone—and how they are deployed in practice, where energy consumption, latency, and memory footprint are often decisive. While prior work [1] has shown that pretrained models contain substantial redundancy, there is limited systematic understanding of how different efficiency strategies compare when evaluated under a unified, energy-aware framework.

Our project builds directly on the findings of Chen et al. [1], who demonstrate that highly sparse subnetworks (“winning tickets”) exist within pretrained BERT models and can match downstream accuracy. Rather than proposing new compression or pruning algorithms, we take this redundancy as a given and focus on its practical implications: how accuracy degrades as different forms of model reduction are applied, and how those choices translate into real computational cost.

The main outcome of the project is a set of accuracy–efficiency Pareto frontiers that enable direct comparison across reduction strategies and highlight regions of diminishing returns. As a secondary goal, time permitting, we will examine whether these tradeoffs are task-dependent, providing further guidance for efficient model deployment.

Task

The primary NLP task in this project is sentence-level text classification, with a focus on binary sentiment analysis. In this setting, the model receives a single sentence as input and predicts a discrete label. For example, given the input sentence “The film is engaging and well-paced,” the expected output is a positive sentiment label.

We choose sentence-level classification as our core task because it provides a controlled and well-understood experimental setting for studying accuracy–efficiency tradeoffs. Inputs are relatively short, evaluation metrics are stable and interpretable, and training dynamics are well characterized, which reduces confounding factors when measuring energy consumption, inference latency, and memory usage. This makes it possible to isolate the effects of architectural and inference-time reductions while holding the task constant.

As a secondary task, time permitting, we will consider pairwise sentence classification, in which the model predicts whether two input sentences express the same semantic meaning. This task introduces longer effective inputs and interaction between sentence representations, allowing us to test whether efficiency tradeoffs differ in a more complex classification setting.

As a stretch goal, we will explore extending the analysis to a question answering task with longer contexts. This would enable us to assess whether the accuracy–efficiency Pareto frontiers observed for classification tasks generalize to settings with increased sequence length and contextual reasoning.

Data

We will use publicly available NLP benchmarks with fixed train/validation/test splits. Our primary dataset is SST-2, a binary sentiment classification task from the GLUE benchmark, consisting of approximately 67,000 labeled sentences. Its short input lengths and stable evaluation protocol make it well suited for controlled efficiency experiments.

As a secondary dataset, time permitting, we will use Quora Question Pairs (QQP), which contains over 400,000 pairs of questions labeled for semantic equivalence. QQP introduces longer and paired inputs, providing a complementary setting for examining how efficiency tradeoffs vary with increased sequence length and input interaction.

As a stretch goal, if we extend the analysis to question answering, we will use a standard extractive QA benchmark such as SQuAD v1.1, which pairs questions with longer context passages.

For all datasets, we will use the standard tokenization associated with the chosen pretrained model. Inputs will be padded or truncated to fixed maximum sequence lengths as required. In experiments that vary input length, we will systematically adjust these limits while keeping dataset splits unchanged.
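As a concrete illustration of the length-normalization step, the sketch below pads or truncates a token-id sequence to a fixed length. It is a simplified stand-in for what the model's tokenizer performs; real tokenizers also insert special tokens and emit attention masks, and the pad id of 0 is an assumption here (it happens to match BERT's vocabulary).

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Fix a token-id sequence to exactly max_len ids.

    Truncates from the right, then pads with pad_id.
    """
    ids = token_ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

pad_or_truncate([101, 7592, 102], 5)      # -> [101, 7592, 102, 0, 0]
pad_or_truncate([1, 2, 3, 4, 5, 6], 4)    # -> [1, 2, 3, 4]
```

Varying `max_len` across runs while holding everything else fixed is how we would probe the sequence-length dimension of the efficiency experiments.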

We do not plan to collect or annotate new data; all experiments rely on existing benchmarks to ensure reproducibility and efficient use of the project timeline.

Methods

We conduct experiments using encoder-only Transformer models, with BERT-base as the primary experimental platform. BERT-base is widely used in downstream NLP tasks and provides a clean setting for measuring inference-time efficiency without the confounding effects of autoregressive decoding. Where relevant, we additionally evaluate DistilBERT to assess whether observed tradeoffs are consistent across model sizes within the same architectural family.

We treat pretrained model weights as a fixed initialization and do not perform pretraining from scratch. Rather than proposing new architectures or optimization algorithms, our methodology is deliberately comparative: we systematically apply established adaptation and inference-time reduction techniques under a unified experimental and measurement framework.

At training time, we consider standard fine-tuning regimes, including frozen encoders and partial or full fine-tuning, using off-the-shelf implementations. At inference time, we evaluate a focused set of efficiency techniques that target different sources of redundancy: (i) layer reduction and early exit, (ii) token-level pruning, (iii) quantization, and (iv) sequence length reduction.
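Of these four inference-time techniques, quantization is the easiest to illustrate in isolation. The sketch below shows symmetric per-tensor int8 quantization of a list of values; it is a toy version for intuition only, and in the experiments we would rely on library-provided dynamic quantization rather than this hand-rolled function.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: x ~= scale * q, q in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # guard against scale == 0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return [scale * qi for qi in q]

vals = [0.8, -1.0, 0.25]
q, s = quantize_int8(vals)          # q == [102, -127, 32]
approx = dequantize(q, s)           # each entry within scale/2 of the original
```

The same idea, applied to every weight matrix (and, for activations, with calibrated scales), is what shrinks memory traffic and enables faster integer kernels.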

All models and techniques are implemented using existing open-source libraries (e.g., standard Hugging Face and PyTorch tooling). Our contribution is not algorithmic novelty, but the controlled application, combination, and comparison of these methods while holding the underlying model, task, and training protocol constant. This enables direct, apples-to-apples analysis of accuracy–efficiency tradeoffs.

Baselines

We use a small set of standard baselines that reflect common practice in pretrained NLP model adaptation and inference. All baselines are implemented and evaluated by us using off-the-shelf components from open-source libraries, under the same hardware and software conditions as the experimental configurations.

Our primary baseline is a fully fine-tuned BERT-base model, with all parameters updated during training and inference performed at full numerical precision, without architectural or inference-time reductions. This represents the default approach for downstream NLP tasks and serves as a reference for both task performance and inference-time computational cost.

As a secondary baseline, we include a frozen encoder setup, where the pretrained model is kept fixed and only a lightweight task-specific head is trained. When applicable, we also evaluate DistilBERT as an auxiliary baseline to compare architectural downsizing with inference-time efficiency techniques.

Evaluation

We evaluate each configuration on task performance and measured efficiency, using fully automatic, quantitative metrics. For task performance, we report standard benchmark metrics: accuracy for sentence-level sentiment classification, accuracy/F1 for pairwise sentence classification (when included), and EM/F1 for question answering (where applicable), using the official dataset splits.

As reference points, fully fine-tuned BERT-base models typically achieve approximately 93.5% accuracy on SST-2, 71.2% F1 with 89.2% accuracy on QQP (GLUE test server), and about 80.8% Exact Match and 88.5% F1 on SQuAD v1.1. We compare our baseline results against these ranges to ensure our implementations are reasonable and that observed accuracy–efficiency tradeoffs are meaningful.

For efficiency, we report energy, latency, and peak memory during inference (and, where relevant, during training). Energy is measured by sampling device power draw over time and integrating to obtain joules (or watt-hours). Latency is measured as wall-clock milliseconds per example, and memory as peak accelerator memory.
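The power-to-energy integration can be sketched directly. The function below assumes evenly spaced power samples in watts; the sample source (e.g. NVML or RAPL counters) is an assumed tooling choice on our part, not something the paper specifies.

```python
def energy_joules(power_samples_w, interval_s):
    """Trapezoidal integration of evenly spaced power samples (W) into energy (J)."""
    energy = 0.0
    for p0, p1 in zip(power_samples_w, power_samples_w[1:]):
        energy += 0.5 * (p0 + p1) * interval_s
    return energy

# Five samples at 1 s spacing, i.e. constant 100 W over 4 s:
energy_joules([100, 100, 100, 100, 100], 1.0)   # -> 400.0 J
```

Dividing the result by 3600 converts joules to watt-hours.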

Our primary output is an accuracy–energy (and accuracy–latency) Pareto frontier. In addition, we report simple summary metrics such as energy per correct prediction and the minimum energy required to remain within a fixed accuracy drop (e.g., within 1–2% of the full fine-tuned baseline). A successful result is a clear, reproducible Pareto frontier that identifies non-dominated configurations and regions of diminishing returns.
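Given measured (energy, accuracy) pairs, the frontier computation itself is straightforward. The sketch below keeps every configuration not dominated by another (lower-or-equal energy and higher-or-equal accuracy, strict in at least one); the O(n²) scan and the function name are our own illustrative choices.

```python
def pareto_frontier(points):
    """Return non-dominated (energy_J, accuracy) pairs, sorted by energy."""
    frontier = []
    for e, a in points:
        dominated = any(
            e2 <= e and a2 >= a and (e2 < e or a2 > a)
            for e2, a2 in points
        )
        if not dominated:
            frontier.append((e, a))
    return sorted(frontier)

configs = [(10.0, 0.90), (12.0, 0.93), (15.0, 0.92), (20.0, 0.935)]
pareto_frontier(configs)   # (15.0, 0.92) is dominated by (12.0, 0.93)
```

Summary metrics such as energy per correct prediction then follow by dividing measured joules by the number of correct test predictions for each frontier point.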

Ethical implications

This project aims to improve the efficiency—and sustainability—of NLP systems, but it raises several ethical considerations. First, emphasizing efficiency may lower barriers to deployment, enabling large-scale use of language technologies in sensitive contexts (e.g., automated moderation or surveillance) without adequate oversight. To mitigate this risk, we present our results as decision-support analyses rather than deployment recommendations, explicitly reporting accuracy losses alongside efficiency gains so that tradeoffs remain transparent.

Second, efficiency-driven reductions may disproportionately degrade performance on minority language phenomena or rare linguistic constructions, even when aggregate accuracy remains high. Such effects may not be captured by standard benchmarks. As mitigation, we clearly document evaluation scope and limitations, avoid claims of universal applicability, and caution that results obtained on English-language benchmarks may not generalize without further validation.

Finally, there is a risk that energy metrics could be misconstrued as a complete measure of societal impact. We therefore position energy consumption as one dimension of responsible evaluation, alongside broader considerations of fairness, misuse, and downstream harm.

References


  1. Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. The lottery ticket hypothesis for pre-trained BERT networks. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS 2020), pages 15834–15846, 2020.

  2. Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), pages 14037–14047, 2019.

  3. Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), pages 5797–5808, 2019.

  4. Victor Sanh, Thomas Wolf, and Alexander M. Rush. Movement pruning: Adaptive sparsity by fine-tuning. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS 2020), pages 20378–20389, 2020.

  5. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), pages 4171–4186, 2019.

  6. Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), pages 3645–3650, 2019.

  7. Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green AI. Communications of the ACM, 63(12):54–63, 2020.