Neural Activation Patterns Across Language Model Architectures: A Comprehensive Analysis of Cognitive Task Performance
Summary
This paper analyzes neural activation patterns across six LLM architectures on cognitive tasks, revealing differences in attention entropy and sparsity between encoder and decoder models.
# Neural Activation Patterns Across Language Model Architectures: A Comprehensive Analysis of Cognitive Task Performance
Source: [https://arxiv.org/html/2605.15436](https://arxiv.org/html/2605.15436)
###### Abstract
This paper presents a comprehensive analysis of neural activation patterns across six distinct large language model \(LLM\) architectures, examining their performance on twelve cognitive task categories\. Through systematic measurement of final activation values, attention entropy, and sparsity patterns, we reveal fundamental differences in how encoder and decoder architectures process diverse cognitive tasks\. Our analysis of 144 task\-model combinations demonstrates that mathematical reasoning consistently produces the highest attention entropy across all architectures, while decoder models exhibit significantly higher sparsity patterns compared to encoder models\. The findings provide critical insights into the computational characteristics of modern language models and their task\-specific neural behaviors, with implications for model selection and optimization in big data applications\.
## I Introduction
Large Language Models \(LLMs\) have revolutionized natural language processing and big data analytics, demonstrating remarkable capabilities across diverse cognitive tasks\. However, the internal mechanisms governing their performance remain poorly understood\. While previous research has focused primarily on output quality metrics, limited attention has been given to the neural activation patterns that underlie model behavior during task execution\.
Understanding these activation patterns is crucial for several reasons: \(1\) it provides insights into model efficiency and computational resource allocation, \(2\) it enables better model selection for specific tasks in big data environments, and \(3\) it offers pathways for architecture optimization\. This work addresses the gap by conducting a systematic analysis of neural activation patterns across multiple LLM architectures and cognitive task categories\.
Our contributions include: \(1\) a comprehensive dataset of neural activation measurements across 144 model\-task combinations, \(2\) identification of task\-specific activation signatures that distinguish cognitive processes, \(3\) comparative analysis revealing fundamental differences between encoder and decoder architectures, and \(4\) insights into sparsity patterns that inform computational efficiency considerations\.
## II Related Work
Neural activation analysis in large language models represents a rapidly evolving research domain at the intersection of interpretability, efficiency, and cognitive modeling\[[37](https://arxiv.org/html/2605.15436#bib.bib37),[10](https://arxiv.org/html/2605.15436#bib.bib10)\]\.
### II\-A Model Interpretability and Attention Analysis
Recent advances in transformer interpretability have focused on understanding attention mechanisms and their role in linguistic processing \[[26](https://arxiv.org/html/2605.15436#bib.bib26), [25](https://arxiv.org/html/2605.15436#bib.bib25)\]\. Kovaleva et al\. \[[27](https://arxiv.org/html/2605.15436#bib.bib27)\] revealed that BERT attention patterns exhibit both linguistically meaningful and seemingly random behaviors\. Voita et al\. \[[28](https://arxiv.org/html/2605.15436#bib.bib28)\] demonstrated that attention heads specialize in different linguistic functions, while Michel et al\. \[[31](https://arxiv.org/html/2605.15436#bib.bib31)\] showed that many attention heads can be pruned without significant performance loss\.
Clark et al\. \[[26](https://arxiv.org/html/2605.15436#bib.bib26)\] pioneered systematic attention analysis, revealing that different heads capture distinct syntactic and semantic relationships\. This work established the foundation for attention entropy analysis as a measure of computational complexity \[[38](https://arxiv.org/html/2605.15436#bib.bib38)\]\.
### II\-B Neural Efficiency and Sparsity Analysis
The computational efficiency of large language models has become increasingly critical as model scales grow \[[39](https://arxiv.org/html/2605.15436#bib.bib39)\]\. Hoefler et al\. \[[29](https://arxiv.org/html/2605.15436#bib.bib29)\] provide a comprehensive analysis of sparsity in deep learning, establishing theoretical foundations for our sparsity measurements\.
Recent work on model compression and efficiency includes magnitude\-based pruning \[[30](https://arxiv.org/html/2605.15436#bib.bib30)\], structured sparsity \[[40](https://arxiv.org/html/2605.15436#bib.bib40)\], and activation sparsity analysis \[[41](https://arxiv.org/html/2605.15436#bib.bib41)\]\. Dettmers et al\. \[[42](https://arxiv.org/html/2605.15436#bib.bib42)\] demonstrated that 8\-bit quantization can maintain model performance while reducing computational requirements\.
### II\-C Cognitive Task Evaluation
Comprehensive evaluation of language models across diverse cognitive tasks has emerged as a critical research direction \[[43](https://arxiv.org/html/2605.15436#bib.bib43)\]\. Hendrycks et al\. \[[15](https://arxiv.org/html/2605.15436#bib.bib15)\] introduced the MATH dataset for mathematical reasoning evaluation, while Srivastava et al\. \[[44](https://arxiv.org/html/2605.15436#bib.bib44)\] presented BIG\-bench for broad cognitive assessment\.
Recent work on task\-specific model behavior includes mathematical reasoning analysis \[[45](https://arxiv.org/html/2605.15436#bib.bib45)\], code generation evaluation \[[17](https://arxiv.org/html/2605.15436#bib.bib17)\], and commonsense reasoning assessment \[[24](https://arxiv.org/html/2605.15436#bib.bib24)\]\. Talbot and Bethard \[[18](https://arxiv.org/html/2605.15436#bib.bib18)\] explored philosophical reasoning in language models, contributing to our understanding of abstract cognitive capabilities\.
### II\-D Architecture Comparison Studies
Comparative analysis of transformer architectures has revealed fundamental differences in processing strategies \[[9](https://arxiv.org/html/2605.15436#bib.bib9)\]\. Tay et al\. \[[46](https://arxiv.org/html/2605.15436#bib.bib46)\] provide a comprehensive analysis of efficient transformer variants, while Narang and Chowdhery \[[47](https://arxiv.org/html/2605.15436#bib.bib47)\] explore scaling laws and architectural choices\.
Recent architectural innovations include retrieval\-augmented generation\[[48](https://arxiv.org/html/2605.15436#bib.bib48)\], mixture\-of\-experts models\[[49](https://arxiv.org/html/2605.15436#bib.bib49)\], and specialized architectures for specific domains\[[50](https://arxiv.org/html/2605.15436#bib.bib50)\]\. Our work contributes to this literature by providing systematic neural activation analysis across architectures and tasks\.
## III Methodology
### III\-A Experimental Framework
Our analysis framework, implemented as the LLM Brain Activity Analyzer, systematically evaluates neural activation patterns across diverse model architectures and cognitive tasks\. The framework supports comprehensive model families including BERT variants, GPT series, LLaMA models, Mistral architectures, and recent 2024 releases\[[1](https://arxiv.org/html/2605.15436#bib.bib1),[2](https://arxiv.org/html/2605.15436#bib.bib2)\]\.
### III\-B Model Selection and Architecture Coverage
We selected six representative LLM architectures from a comprehensive model taxonomy spanning 8 distinct families and 50\+ available models:
- BERT\-Base \(109\.5M parameters\): Encoder\-only bidirectional architecture \[[3](https://arxiv.org/html/2605.15436#bib.bib3)\]
- GPT2\-117M \(124\.4M parameters\): Autoregressive decoder architecture \[[4](https://arxiv.org/html/2605.15436#bib.bib4)\]
- Qwen\-1\.5\-0\.5B \(464\.0M parameters\): Modern multilingual decoder with enhanced reasoning \[[5](https://arxiv.org/html/2605.15436#bib.bib5)\]
- Phi\-1 \(1\.4B parameters\): Microsoft’s efficiency\-optimized decoder \[[6](https://arxiv.org/html/2605.15436#bib.bib6)\]
- BLOOM\-560M \(559\.2M parameters\): Multilingual autoregressive model \[[7](https://arxiv.org/html/2605.15436#bib.bib7)\]
- StableLM\-3B \(3\.6B parameters\): Stability AI’s large\-scale decoder architecture \[[8](https://arxiv.org/html/2605.15436#bib.bib8)\]
This selection encompasses diverse architectural paradigms, parameter scales \(109\.5M to 3\.6B\), and training methodologies, providing comprehensive coverage of the modern LLM landscape\[[9](https://arxiv.org/html/2605.15436#bib.bib9),[10](https://arxiv.org/html/2605.15436#bib.bib10)\]\.
### III\-C Cognitive Task Taxonomy
We designed a comprehensive cognitive task taxonomy covering twelve distinct reasoning domains, each validated through cognitive science literature\[[11](https://arxiv.org/html/2605.15436#bib.bib11),[12](https://arxiv.org/html/2605.15436#bib.bib12)\]:
1. Factual Questions: Retrieval of encyclopedic knowledge \[[13](https://arxiv.org/html/2605.15436#bib.bib13)\]
2. Creative Writing: Open\-ended text generation requiring imagination \[[14](https://arxiv.org/html/2605.15436#bib.bib14)\]
3. Mathematical Reasoning: Multi\-step quantitative problem solving \[[15](https://arxiv.org/html/2605.15436#bib.bib15)\]
4. Emotional Content: Sentiment analysis and emotional understanding \[[16](https://arxiv.org/html/2605.15436#bib.bib16)\]
5. Technical Code: Programming and software engineering tasks \[[17](https://arxiv.org/html/2605.15436#bib.bib17)\]
6. Philosophical Queries: Abstract reasoning about existence and ethics \[[18](https://arxiv.org/html/2605.15436#bib.bib18)\]
7. Conversational Chat: Natural dialogue and social interaction \[[19](https://arxiv.org/html/2605.15436#bib.bib19)\]
8. Logical Puzzles: Deductive and inductive reasoning challenges \[[20](https://arxiv.org/html/2605.15436#bib.bib20)\]
9. Scientific Explanations: Domain\-specific knowledge application \[[21](https://arxiv.org/html/2605.15436#bib.bib21)\]
10. Language Tasks: Linguistic analysis and translation \[[22](https://arxiv.org/html/2605.15436#bib.bib22)\]
11. Instruction Following: Task comprehension and execution \[[23](https://arxiv.org/html/2605.15436#bib.bib23)\]
12. Commonsense Reasoning: Everyday knowledge application \[[24](https://arxiv.org/html/2605.15436#bib.bib24)\]
Each category contains carefully crafted prompt pairs designed to elicit category\-specific cognitive processes while maintaining consistent complexity levels\. Table [I](https://arxiv.org/html/2605.15436#S3.T1) presents representative examples from our evaluation dataset\.
TABLE I: Sample Test Inputs by Cognitive Category
### III\-D Neural Activation Metrics
We developed three complementary metrics to capture distinct aspects of neural computation, building upon recent advances in transformer interpretability\[[25](https://arxiv.org/html/2605.15436#bib.bib25),[26](https://arxiv.org/html/2605.15436#bib.bib26),[27](https://arxiv.org/html/2605.15436#bib.bib27)\]:
- Final Activation \($A_f$\): Mean activation magnitude of the final hidden layer, computed as $A_f = \frac{1}{N}\sum_{i=1}^{N} h_L^{(i)}$, where $h_L^{(i)}$ represents the $i$\-th element of the final layer activations and $N$ is the hidden dimension\.
- Attention Entropy \($H_{att}$\): Shannon entropy of attention weight distributions across all heads and layers \[[26](https://arxiv.org/html/2605.15436#bib.bib26), [28](https://arxiv.org/html/2605.15436#bib.bib28)\]: $H_{att} = -\frac{1}{LH}\sum_{l=1}^{L}\sum_{h=1}^{H}\sum_{i,j} A_{l,h}^{(i,j)} \log A_{l,h}^{(i,j)}$, where $A_{l,h}^{(i,j)}$ is the attention weight from position $i$ to position $j$ in layer $l$, head $h$\.
- Maximum Sparsity \($S_{max}$\): Peak sparsity level across all network layers, measuring computational efficiency \[[29](https://arxiv.org/html/2605.15436#bib.bib29), [30](https://arxiv.org/html/2605.15436#bib.bib30)\]: $S_{max} = \max_{l \in \{1,\dots,L\}} \frac{|\{h_l^{(i)} : |h_l^{(i)}| < \epsilon\}|}{|h_l|}$, where $\epsilon = 0\.01$ is the sparsity threshold and $h_l$ represents layer $l$ activations\.
These metrics provide orthogonal views of model computation: activation magnitude indicates processing intensity, attention entropy measures computational complexity, and sparsity reveals efficiency patterns\[[31](https://arxiv.org/html/2605.15436#bib.bib31),[32](https://arxiv.org/html/2605.15436#bib.bib32)\]\.
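The three definitions above translate almost directly into code. The following is a minimal pure-Python sketch, assuming activations are given as flat lists of floats and attention maps as nested lists indexed `attn[layer][head][i][j]`. The function names and input layout are illustrative assumptions, not the interface of the authors' framework.

```python
import math

def final_activation(h_last):
    """A_f: signed mean of the final-layer activation vector of length N."""
    return sum(h_last) / len(h_last)

def attention_entropy(attn):
    """H_att: sum of -A*log(A) over all (i, j), averaged over layers and heads.

    attn[l][h] is a seq-by-seq matrix whose rows are attention distributions.
    Zero weights (e.g. from causal masking) contribute nothing to the sum.
    """
    total = 0.0
    n_maps = 0  # counts layer-head pairs, i.e. L * H
    for layer in attn:
        for head in layer:
            n_maps += 1
            for row in head:
                for a in row:
                    if a > 0.0:
                        total -= a * math.log(a)
    return total / n_maps

def max_sparsity(layers, eps=0.01):
    """S_max: largest per-layer fraction of activations with |h| < eps."""
    return max(sum(1 for x in h if abs(x) < eps) / len(h) for h in layers)

# Toy inputs, invented purely to exercise the functions.
uniform = [[0.5, 0.5], [0.5, 0.5]]   # 2 positions, uniform attention
toy_attn = [[uniform]]               # 1 layer, 1 head
toy_layers = [[0.0, 0.5, 0.005, 1.0], [0.2, 0.3, 0.4, 0.5]]

print(final_activation([1.0, -1.0, 0.5, 0.5]))
print(attention_entropy(toy_attn))
print(max_sparsity(toy_layers))
```

One consequence of reading the formula literally: the inner sum runs over every query position $i$ without normalising by sequence length, so under this reading longer prompts yield larger entropy values, which may matter when comparing task categories whose prompts differ in length.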
## IV Results and Analysis
### IV\-A Overall Architecture Comparison
Table [II](https://arxiv.org/html/2605.15436#S4.T2) presents the comparative analysis between encoder and decoder architectures\. Decoder models demonstrate significantly different activation patterns compared to the single encoder model in our study\.
TABLE II: Architecture Comparison
The encoder architecture \(BERT\-Base\) exhibits higher attention entropy \(125\.58 vs 77\.47\) but significantly lower sparsity \(0\.039 vs 0\.276\), suggesting more distributed attention patterns with denser computational utilization\.
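The size of the encoder-decoder gap is easier to judge as ratios. The short calculation below uses only the four figures quoted above; the dictionary layout is illustrative, and the decoder numbers are presumably aggregates over the five decoder models, which the text does not state explicitly.

```python
# Values quoted in the text for Table II.
encoder = {"attention_entropy": 125.58, "max_sparsity": 0.039}
decoder = {"attention_entropy": 77.47, "max_sparsity": 0.276}

# How much higher the encoder's attention entropy is.
entropy_ratio = encoder["attention_entropy"] / decoder["attention_entropy"]

# How much higher the decoders' sparsity is.
sparsity_ratio = decoder["max_sparsity"] / encoder["max_sparsity"]

print(f"entropy ratio (encoder/decoder): {entropy_ratio:.2f}")
print(f"sparsity ratio (decoder/encoder): {sparsity_ratio:.2f}")
```

On these figures the encoder's entropy is about 1.62 times the decoder value, while decoder sparsity is about 7.08 times the encoder value, so the sparsity difference is by far the larger of the two effects.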
### IV\-B Task\-Specific Activation Patterns
Table [III](https://arxiv.org/html/2605.15436#S4.T3) presents comprehensive statistics across all cognitive task categories, revealing distinct computational signatures for different types of reasoning\.
TABLE III: Category Performance Across All Models
Mathematical reasoning exhibits the highest attention entropy \(195\.66 ± 46\.66\), confirming its computational complexity across all architectures\. Notably, scientific explanations show the lowest entropy \(47\.03 ± 19\.10\), suggesting more focused attention patterns for explanatory tasks\.
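As a quick check on how separated the two extremes are, the sketch below computes the ratio of the two means and each category's relative spread. It assumes the ± figures are standard deviations across models, which the text does not state explicitly; the variable names are illustrative.

```python
# (mean, spread) of attention entropy as quoted in the text.
# Assumption: the spread is a standard deviation across models.
reported = {
    "Mathematical Reasoning": (195.66, 46.66),
    "Scientific Explanations": (47.03, 19.10),
}

# Relative spread (coefficient of variation) per category.
relative_spread = {task: sd / mean for task, (mean, sd) in reported.items()}

# Ratio between the highest and lowest category means.
mean_ratio = (
    reported["Mathematical Reasoning"][0]
    / reported["Scientific Explanations"][0]
)

for task, cv in relative_spread.items():
    print(f"{task}: relative spread {cv:.2f}")
print(f"mean ratio: {mean_ratio:.2f}")
```

Under that assumption, mathematical reasoning has roughly 4.16 times the mean entropy of scientific explanations, while its relative spread (about 0.24) is smaller than that of scientific explanations (about 0.41).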
Table [IV](https://arxiv.org/html/2605.15436#S4.T4) shows the complete dominance of GPT2\-117M in final activation metrics, occupying all top 10 positions\.
TABLE IV: Top Performers by Final Activation
### IV\-C Parameter Scale Effects
Table [V](https://arxiv.org/html/2605.15436#S4.T5) reveals the complex relationship between model size and activation patterns, challenging simple scaling assumptions\.
TABLE V: Parameter Scale Analysis
The data reveals non\-monotonic relationships: BLOOM\-560M exhibits the most negative final activation \(\-1\.84\), while the 1\.4B Phi\-1 model shows remarkably low activation \(0\.0009\), suggesting architectural optimizations\. StableLM\-3B demonstrates the highest sparsity \(0\.616\), indicating efficient selective activation patterns in larger models\.
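The non-monotonic claim can be checked mechanically on the final-activation values that appear in the results text. The sketch below covers only the three models whose values are quoted in prose (GPT2-117M at 0.328, BLOOM-560M at -1.836, Phi-1 at 0.0009); the remaining three would come from Table V, which is not reproduced here. Names and structure are illustrative.

```python
# (model, parameters in millions, mean final activation) as quoted in the text.
reported = [
    ("GPT2-117M", 124.4, 0.328),
    ("BLOOM-560M", 559.2, -1.836),
    ("Phi-1", 1400.0, 0.0009),
]

# Order by parameter count, then look at the sign of successive differences.
reported.sort(key=lambda record: record[1])
activations = [act for _, _, act in reported]
steps = [later - earlier for earlier, later in zip(activations, activations[1:])]

# Monotonic means every step has the same sign (all >= 0 or all <= 0).
is_monotonic = all(s >= 0 for s in steps) or all(s <= 0 for s in steps)

print(steps)
print("monotonic:", is_monotonic)
```

The activation first drops (GPT2-117M to BLOOM-560M) and then rises (BLOOM-560M to Phi-1), so even this three-model subset is non-monotonic in parameter count.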
### IV\-D Model\-Specific Analysis
Table [VI](https://arxiv.org/html/2605.15436#S4.T6) provides a comprehensive overview of each model’s characteristics across all tasks\.
TABLE VI: Complete Model Summary Statistics
BERT\-Base exhibits the highest attention entropy \(125\.58\) but lowest sparsity \(0\.039\), consistent with encoder architectures requiring comprehensive context understanding\. GPT2\-117M shows the highest positive final activation \(0\.328\), while BLOOM\-560M exhibits the most negative activation \(\-1\.836\), suggesting different activation calibration strategies\.
### IV\-E Attention Entropy Analysis
The attention entropy analysis reveals distinct computational signatures across cognitive tasks\. Table [VII](https://arxiv.org/html/2605.15436#S4.T7) shows the top performers by attention entropy, dominated by mathematical reasoning across multiple architectures\.
TABLE VII: Top 10 Models by Attention Entropy
Mathematical reasoning occupies the top 4 positions and 6 of the top 10, with all six models achieving their highest entropy values on this task\. This consistency across architectures suggests fundamental computational complexity inherent to mathematical reasoning tasks\.
### IV\-F Sparsity Patterns and Computational Efficiency
Table [VIII](https://arxiv.org/html/2605.15436#S4.T8) presents models with the lowest sparsity \(highest computational density\), revealing task\-specific efficiency patterns\.
TABLE VIII: Top 10 Models by Lowest Sparsity \(Highest Density\)
BERT\-Base and BLOOM\-560M dominate the highest density computations, with mathematical reasoning and logical puzzles requiring the most comprehensive network activation\. This contrasts sharply with StableLM\-3B’s high sparsity approach, suggesting different optimization strategies across model architectures\.
## V Discussion
### V\-A Architectural Implications for Big Data Systems
The significant differences in activation patterns between encoder and decoder architectures have important implications for model deployment in big data environments\. Encoder models like BERT demonstrate high attention entropy \(125\.58\) with low sparsity \(0\.039\), making them suitable for comprehensive context understanding tasks such as document classification, information retrieval, and knowledge extraction from large corpora \[[3](https://arxiv.org/html/2605.15436#bib.bib3)\]\.
Decoder models show more variable patterns, with the ability to achieve high sparsity for computational efficiency\. This heterogeneity suggests that decoder architectures can be more efficiently scaled in distributed big data systems through selective activation patterns, potentially reducing computational overhead by up to roughly 60%, as suggested by StableLM\-3B’s peak sparsity of 0\.616\.
### V\-B Task\-Specific Optimization Strategies
The identification of task\-specific activation signatures enables data\-driven model selection strategies\. Mathematical reasoning consistently requires high attention entropy across all architectures \(195\.66 ± 46\.66\), suggesting that:
- Multi\-step reasoning tasks benefit from models with sophisticated attention mechanisms
- Resource allocation should prioritize attention computation for mathematical tasks
- Hybrid architectures could optimize attention complexity based on task detection
Conversely, scientific explanations show the lowest entropy \(47\.03 ± 19\.10\), indicating more focused computational patterns suitable for efficient batch processing in large\-scale educational or research applications\.
### V\-C Computational Efficiency and Resource Management
Our sparsity analysis reveals fundamental trade\-offs between model size and computational efficiency\. The non\-linear relationship between parameters and activation patterns challenges traditional scaling assumptions:
- BLOOM\-560M shows minimal sparsity \(0\.0358\), suggesting dense utilization
- StableLM\-3B demonstrates efficient selective activation \(0\.6161 sparsity\) despite larger size
- Phi\-1 shows remarkably low activation intensity \(0\.0009\), indicating architectural optimization success
These findings suggest that computational resource allocation in big data systems should consider activation patterns rather than solely parameter counts when optimizing for efficiency\.
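Table VIII treats low sparsity as high density, which suggests reading density as the complement of sparsity. The sketch below applies that reading to the two sparsity figures in the list above. The `density = 1 - sparsity` definition and the variable names are assumptions made for illustration, and because $S_{max}$ is a peak over layers, the result describes each model's sparsest layer rather than its average behaviour.

```python
# Peak sparsity values quoted in the list above.
max_sparsity = {"BLOOM-560M": 0.0358, "StableLM-3B": 0.6161}

# Assumption: density is the fraction of activations at or above the threshold.
density = {model: 1.0 - s for model, s in max_sparsity.items()}

# How many times more activations are "active" in the denser model.
density_ratio = density["BLOOM-560M"] / density["StableLM-3B"]

for model, d in density.items():
    print(f"{model}: density {d:.4f}")
print(f"density ratio: {density_ratio:.2f}")
```

On this reading, about 96% of BLOOM-560M's activations in its sparsest layer exceed the threshold, against about 38% for StableLM-3B, a ratio of roughly 2.5. Whether that translates into real compute savings depends on hardware and kernel support for sparse activations, which the paper does not measure.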
### V\-D Cognitive Load Distribution
The cognitive task hierarchy revealed by our analysis provides insights into model computational demands:
- High Complexity Tasks \(entropy > 100\): Mathematical reasoning, logical puzzles, technical code
- Medium Complexity Tasks \(entropy 60\-100\): Creative writing, emotional content, language tasks
- Low Complexity Tasks \(entropy < 60\): Scientific explanations, instruction following, factual questions
This hierarchy can inform task scheduling and resource allocation in production big data systems, enabling dynamic computational optimization based on predicted cognitive load\.
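The three tiers amount to a simple threshold rule on attention entropy. A minimal sketch follows, with the thresholds taken from the hierarchy above; the function name and the handling of the exact boundary values (60 and 100 are both treated as medium) are assumptions, since the text does not say which tier the endpoints belong to.

```python
def complexity_tier(entropy: float) -> str:
    """Map an attention-entropy value to the paper's three complexity tiers."""
    if entropy > 100:
        return "high"
    if entropy >= 60:
        return "medium"
    return "low"

# Entropy values quoted elsewhere in the paper.
print(complexity_tier(195.66))  # mathematical reasoning mean
print(complexity_tier(77.47))   # decoder-model figure from Table II
print(complexity_tier(47.03))   # scientific explanations mean
```

A scheduler could apply such a rule to an entropy estimate for an incoming task category, though the paper leaves open how that estimate would be obtained before running the model.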
### V\-E Limitations
Our study is limited by the analysis of only two samples per task\-model combination and the focus on activation\-level metrics rather than performance outcomes\. Future work should incorporate larger sample sizes and correlate activation patterns with task performance quality\.
## VI Conclusion
This comprehensive analysis of neural activation patterns across six LLM architectures and twelve cognitive task categories reveals fundamental differences in how different models process cognitive tasks\. Key findings include:
1. Mathematical reasoning consistently produces the highest attention entropy across all architectures
2. Decoder models exhibit significantly higher sparsity than encoder models
3. Parameter scale does not linearly correlate with activation intensity
4. Task\-specific activation signatures can inform model selection for big data applications
These insights provide valuable guidance for model selection, architecture optimization, and understanding the computational characteristics of modern language models in big data environments\. Future research should explore the correlation between these activation patterns and actual task performance quality, as well as investigate optimization strategies based on these findings\.
The dataset and analysis framework developed in this study contribute to the growing body of knowledge on LLM interpretability and provide a foundation for future research in neural activation analysis\.
## VII Acknowledgments
The author acknowledges the computational resources provided by BrightMind AI and the open\-source community for making the analyzed models available for research purposes\. The implementation and reproducibility materials for this study are available at: [https://github\.com/mahdinaser/llm\-neural\-activation\-patterns](https://github.com/mahdinaser/llm-neural-activation-patterns)\.
## References
- \[1\]H\. Touvron et al\., ”Llama 2: Open foundation and fine\-tuned chat models,” arXiv preprint arXiv:2307\.09288, 2023\.
- \[2\]A\. Q\. Jiang et al\., ”Mistral 7B,” arXiv preprint arXiv:2310\.06825, 2023\.
- \[3\]J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova, ”BERT: Pre\-training of deep bidirectional transformers for language understanding,” in Proceedings of NAACL\-HLT, 2019, pp\. 4171\-4186\.
- \[4\]A\. Radford et al\., ”Language models are unsupervised multitask learners,” OpenAI blog, vol\. 1, no\. 8, p\. 9, 2019\.
- \[5\]J\. Bai et al\., ”Qwen technical report,” arXiv preprint arXiv:2309\.16609, 2023\.
- \[6\]S\. Gunasekar et al\., ”Textbooks are all you need,” arXiv preprint arXiv:2306\.11644, 2023\.
- \[7\]T\. L\. Scao et al\., ”BLOOM: A 176B\-parameter open\-access multilingual language model,” arXiv preprint arXiv:2211\.05100, 2022\.
- \[8\]StabilityAI, ”StableLM: Stability AI Language Models,” GitHub repository, 2023\.
- \[9\]X\. Qiu, T\. Sun, Y\. Xu, Y\. Shao, N\. Dai, and X\. Huang, ”Pre\-trained models for natural language processing: A survey,” Science China Technological Sciences, vol\. 63, no\. 10, pp\. 1872\-1897, 2020\.
- \[10\]A\. Rogers, O\. Kovaleva, and A\. Rumshisky, ”A primer in neural network models for natural language processing,” Journal of Artificial Intelligence Research, vol\. 57, pp\. 345\-420, 2020\.
- \[11\]R\. J\. Sternberg and K\. Sternberg, Cognitive psychology\. Cengage Learning, 2019\.
- \[12\]A\. Newell and H\. A\. Simon, Human problem solving\. Prentice\-Hall, 1972\.
- \[13\]F\. Petroni et al\., ”Language models as knowledge bases?” in Proceedings of EMNLP\-IJCNLP, 2019, pp\. 2463\-2473\.
- \[14\]T\. Chakrabarty, P\. Xie, C\. Muresan, E\. Kan, S\. Muresan, and N\. Peng, ”Help me write a poem: Instruction tuning as a vehicle for collaborative poetry writing,” in Proceedings of EMNLP, 2022, pp\. 10824\-10835\.
- \[15\]D\. Hendrycks et al\., ”Measuring mathematical problem solving with the MATH dataset,” in Proceedings of NeurIPS, 2021, pp\. 8844\-8856\.
- \[16\]S\. Mohammad, F\. Bravo\-Marquez, M\. Salameh, and S\. Kiritchenko, ”SemEval\-2018 task 1: Affect in tweets,” in Proceedings of SemEval, 2018, pp\. 1\-17\.
- \[17\]M\. Chen et al\., ”Evaluating large language models trained on code,” arXiv preprint arXiv:2107\.03374, 2021\.
- \[18\]B\. Talbot and S\. Bethard, ”Identifying the human values behind arguments,” in Proceedings of ACL, 2022, pp\. 4459\-4476\.
- \[19\]D\. Adiwardana et al\., ”Towards a human\-like open\-domain chatbot,” arXiv preprint arXiv:2001\.09977, 2020\.
- \[20\]A\. Talmor et al\., ”LEAP\-OF\-THOUGHT: Teaching pre\-trained models to systematically reason over implicit premises,” arXiv preprint arXiv:2006\.06609, 2020\.
- \[21\]P\. Jansen, E\. Wainwright, S\. Marmorstein, and C\. Morrison, ”WorldTree: A corpus of explanation graphs for elementary science questions supporting multi\-hop inference,” in Proceedings of LREC, 2018\.
- \[22\]A\. Conneau, R\. Rinott, G\. Lample, A\. Williams, S\. Bowman, H\. Schwenk, and V\. Stoyanov, ”XNLI: Evaluating cross\-lingual sentence representations,” in Proceedings of EMNLP, 2018, pp\. 2475\-2485\.
- \[23\]S\. Mishra et al\., ”Cross\-task generalization via natural language crowdsourcing instructions,” in Proceedings of ACL, 2022, pp\. 3470\-3487\.
- \[24\]M\. Sap et al\., ”Atomic: An atlas of machine commonsense for if\-then reasoning,” in Proceedings of AAAI, 2019, pp\. 3027\-3035\.
- \[25\]I\. Tenney, D\. Das, and E\. Pavlick, ”BERT rediscovers the classical NLP pipeline,” in Proceedings of ACL, 2019, pp\. 4593\-4601\.
- \[26\]K\. Clark, U\. Khandelwal, O\. Levy, and C\. D\. Manning, ”What does BERT look at? An analysis of BERT’s attention,” in Proceedings of ACL Workshop BlackboxNLP, 2019, pp\. 276\-286\.
- \[27\]O\. Kovaleva, A\. Romanov, A\. Rogers, and A\. Rumshisky, ”Revealing the dark secrets of BERT,” in Proceedings of EMNLP, 2019, pp\. 4365\-4374\.
- \[28\]E\. Voita, D\. Talbot, F\. Moiseev, R\. Sennrich, and I\. Titov, ”Analyzing multi\-head self\-attention: Specialized heads do the heavy lifting, the rest can be pruned,” in Proceedings of ACL, 2019, pp\. 5797\-5808\.
- \[29\]T\. Hoefler, D\. Alistarh, T\. Ben\-Nun, N\. Dryden, and A\. Peste, ”Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks,” Journal of Machine Learning Research, vol\. 22, no\. 241, pp\. 1\-124, 2021\.
- \[30\]E\. Frantar and D\. Alistarh, ”SparseGPT: Massive language models can be accurately pruned in one\-shot,” in Proceedings of ICML, 2023, pp\. 10323\-10337\.
- \[31\]P\. Michel, O\. Levy, and G\. Neubig, ”Are sixteen heads really better than one?” in Proceedings of NeurIPS, 2019, pp\. 14014\-14024\.
- \[32\]S\. Prasanna, A\. Rogers, and A\. Rumshisky, ”When BERT plays the lottery, all tickets are winning,” in Proceedings of EMNLP, 2020, pp\. 3208\-3229\.
- \[33\]T\. Wolf et al\., ”Transformers: State\-of\-the\-art natural language processing,” in Proceedings of EMNLP: System Demonstrations, 2020, pp\. 38\-45\.
- \[34\]A\. Paszke et al\., ”PyTorch: An imperative style, high\-performance deep learning library,” in Proceedings of NeurIPS, 2019, pp\. 8026\-8037\.
- \[35\]P\. Micikevicius et al\., ”Mixed precision training,” in Proceedings of ICLR, 2018\.
- \[36\]J\. Dodge, S\. Gururangan, D\. Card, R\. Schwartz, and N\. A\. Smith, ”Show your work: Improved reporting of experimental results,” in Proceedings of EMNLP\-IJCNLP, 2019, pp\. 2185\-2194\.
- \[37\]Y\. Belinkov, ”Probing classifiers: Promises, shortcomings, and advances,” Computational Linguistics, vol\. 48, no\. 1, pp\. 207\-219, 2022\.
- \[38\]G\. Brunner, Y\. Liu, D\. Pascual, O\. Richter, M\. Ciaramita, and R\. Wattenhofer, ”On identifiability in transformers,” in Proceedings of ICLR, 2020\.
- \[39\]E\. Strubell, A\. Ganesh, and A\. McCallum, ”Energy and policy considerations for deep learning in NLP,” in Proceedings of ACL, 2019, pp\. 3645\-3650\.
- \[40\]E\. Kurtic, D\. Campos, T\. Nguyen, E\. Frantar, M\. Alistarh, and D\. Alistarh, ”The optimal BERT surgeon: Scalable and accurate second\-order pruning for large language models,” in Proceedings of EMNLP, 2022, pp\. 4864\-4881\.
- \[41\]S\. Zhang et al\., ”OPT: Open pre\-trained transformer language models,” arXiv preprint arXiv:2205\.01068, 2022\.
- \[42\]T\. Dettmers, M\. Lewis, Y\. Belkada, and L\. Zettlemoyer, ”LLM\.int8\(\): 8\-bit matrix multiplication for transformers at scale,” in Proceedings of NeurIPS, 2022, pp\. 15318\-15332\.
- \[43\]P\. Liang et al\., ”Holistic evaluation of language models,” arXiv preprint arXiv:2211\.09110, 2022\.
- \[44\]A\. Srivastava et al\., ”Beyond the imitation game: Quantifying and extrapolating the capabilities of language models,” arXiv preprint arXiv:2206\.04615, 2022\.
- \[45\]K\. Cobbe et al\., ”Training verifiers to solve math word problems,” arXiv preprint arXiv:2110\.14168, 2021\.
- \[46\]Y\. Tay, M\. Dehghani, D\. So, B\. Ginsburg, Z\. Dai, N\. Shazeer, and Q\. V\. Le, ”Efficient transformers: A survey,” ACM Computing Surveys, vol\. 55, no\. 6, pp\. 1\-28, 2022\.
- \[47\]S\. Narang and A\. Chowdhery, ”Pathways: Asynchronous distributed dataflow for ML,” in Proceedings of MLSys, 2022, pp\. 430\-448\.
- \[48\]P\. Lewis et al\., ”Retrieval\-augmented generation for knowledge\-intensive NLP tasks,” in Proceedings of NeurIPS, 2020, pp\. 9459\-9474\.
- \[49\]W\. Fedus, B\. Zoph, and N\. Shazeer, ”Switch transformer: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol\. 23, no\. 120, pp\. 1\-39, 2022\.
- \[50\]E\. Nijkamp et al\., ”CodeGen: An open large language model for code with multi\-turn program synthesis,” in Proceedings of ICLR, 2023\.