We find that arithmetic ability resides within a limited number of attention heads, with each head specializing in distinct operations. To understand why, we introduce the Comparative Neuron Analysis (CNA) method, which identifies an internal logic chain consisting of four distinct stages from input to prediction: feature enhancing with shallow FFN neurons, feature transferring by shallow attention layers, feature predicting by arithmetic heads, and prediction enhancing among deep FFN neurons. Moreover, we identify human-interpretable FFN neurons within both the feature-enhancing and feature-predicting stages. These findings lead us to investigate the mechanism of LoRA, revealing that it enhances prediction probabilities by amplifying the coefficient scores of FFN neurons related to the predictions. Finally, we apply our method to model pruning for arithmetic tasks and model editing for reducing gender bias.
Arithmetic ability is a crucial foundational skill of large language models and is closely related to reasoning ability. Previous studies (Stolfo et al., 2023) explored the layer-level information flow in arithmetic tasks. However, layer-level information flow is not enough to understand the mechanism: many studies have found that attention heads and FFN neurons are the fundamental units storing different abilities and different knowledge. Furthermore, since model editing typically operates at the neuron level, layer-level explanations are hard to leverage without knowing the precise locations of the important parameters.
In this study, we take attention heads and FFN neurons as the fundamental units, and explore which parameters store the arithmetic ability for different operations. We observe that only a minority of heads play significant roles in arithmetic tasks, which we refer to as "arithmetic heads". Through experiments on 1-digit to 3-digit operations, we find that the critical memorization of 1-digit operations is lost when these heads are intervened on.
To explore the underlying mechanism of this phenomenon, we propose the Comparative Neuron Analysis (CNA) method, which compares how the same case's neurons change between the original model and the intervened model. We construct the internal logic chain by identifying four distinct stages spanning from inputs to prediction, as depicted in Figure 1. During the feature enhancing stage, hidden-interpretable features are extracted by shallow FFN neurons. Subsequently, in the feature transferring stage, shallow attention layers convert these features into directly interpretable features and transfer them to the last position. In the feature predicting stage, the arithmetic heads play the critical role, activating deep FFN neurons related to the final prediction. Finally, a prediction enhancing stage takes place among deep FFN neurons: lower FFN neurons activate upper FFN neurons, and both enhance the probability of the final prediction.
Based on this analysis, we investigate the mechanism of LoRA. Employing our CNA method to compare the original model with the LoRA fine-tuned model, we observe a significant increase in the coefficient scores of the crucial deep FFN neurons. Hence, we conclude that LoRA enhances the final prediction by amplifying the coefficient scores of important FFN neurons. Finally, building on these findings, we develop methods for model pruning on arithmetic tasks and model editing for reducing gender bias.
Figure 1: Neuron-level information flow from inputs to prediction.
To summarize, our contributions are as follows.
a) We find that the reason only a few heads can influence arithmetic ability is that these heads store crucial parameters for memorizing 1-digit operations. We identify human-interpretable FFN neurons across both shallow and deep layers.
b) We propose the CNA method and construct the internal logic chain from inputs to prediction with four stages: feature enhancing, feature transferring, feature predicting, and prediction enhancing.
c) We use the CNA method to explore the mechanism of LoRA and find that LoRA increases the probability of the final prediction by amplifying the important FFN neurons' coefficient scores. We also design a model pruning method for arithmetic tasks and a model editing method for reducing gender bias.
Below we show how we identify the four stages in arithmetic tasks. More experiments and analysis can be found in the paper.
Step 1: find the arithmetic heads using a causal intervention method. We conduct experiments on 1-digit, 2-digit, and 3-digit arithmetic operations, and find that zero-ablating only a few heads causes a large accuracy drop. For example, when intervening on the arithmetic head 17-22 (the 22nd head on layer 17), the 2-digit addition accuracy drops from 96.8% to 42.9%, and the 1-digit addition accuracy drops from 88.9% to 47.6%. Therefore, head 17-22 stores the 1-digit addition memorization.
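For concreteness, here is a minimal sketch of this intervention, assuming a LLaMA-style Hugging Face checkpoint; the model name, evaluation set, and single-token greedy check are illustrative stand-ins rather than the paper's exact setup, and module paths may differ across transformers versions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "huggyllama/llama-7b"  # assumption: a LLaMA-style backbone
LAYER, HEAD = 17, 22           # "head 17-22" from the paper

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
model.eval()

def zero_head(model, layer, head):
    """Zero the o_proj columns reading from one head, removing its output."""
    attn = model.model.layers[layer].self_attn
    d = attn.head_dim
    with torch.no_grad():
        attn.o_proj.weight[:, head * d:(head + 1) * d] = 0.0

def accuracy(model, cases):
    """Greedy single-token check on (prompt, answer) pairs like ("3+5=", "8")."""
    hits = 0
    for prompt, answer in cases:
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        hits += tok.decode(logits.argmax().item()).strip() == answer
    return hits / len(cases)

# restrict to single-token answers so the greedy check above is valid
cases = [(f"{a}+{b}=", str(a + b)) for a in range(10) for b in range(10) if a + b < 10]
print("before:", accuracy(model, cases))
zero_head(model, LAYER, HEAD)
print("after :", accuracy(model, cases))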
Step 2: find the deep FFN neurons using the Comparative Neuron Analysis (CNA) method. The core idea of CNA is to compare the change in the same neuron's importance score between the original model and the intervened model (with head 17-22's parameters replaced by zero). If a neuron depends heavily on the intervened head, its importance score should drop substantially. Specifically, the importance score is the log-probability increase of the final prediction in unembedding space. When intervening on the top-100 identified neurons, the accuracy drops by 100%; when keeping these top-100 neurons and intervening on all the other deep-layer neurons, the accuracy drops by only 3.9%. Therefore, the arithmetic head plays the "feature predicting" role by activating the identified deep FFN neurons.
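The score itself can be sketched with toy tensors. In the sketch below, the residual vectors, coefficients, and unembedding matrix are random stand-ins for values read out of a real model; the score follows the description above (log-probability increase of the predicted token when a neuron's contribution enters the residual stream).

```python
import torch
import torch.nn.functional as F

d_model, vocab = 4096, 32000
torch.manual_seed(0)
E = torch.randn(vocab, d_model)   # unembedding matrix (random stand-in)
target = 123                      # id of the predicted token, e.g. "8"

def log_prob(residual):
    """Log probability of the target token in unembedding space."""
    return F.log_softmax(E @ residual, dim=-1)[target]

def importance(residual, coeff, value):
    """Log-probability increase contributed by one FFN neuron (coeff * value)."""
    return (log_prob(residual + coeff * value) - log_prob(residual)).item()

# CNA compares the same neuron's score in the original vs. intervened run.
value = torch.randn(d_model)                  # the neuron's down-projection column
resid_orig, resid_intv = torch.randn(d_model), torch.randn(d_model)
coeff_orig, coeff_intv = 1.8, 0.2             # the neuron's coefficient in each run
change = importance(resid_orig, coeff_orig, value) - importance(resid_intv, coeff_intv, value)
print("CNA score change:", change)  # a large drop marks the neuron as head-dependent
```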
Step 3: the prediction enhancing stage among deep FFN neurons. Among the identified top-100 deep FFN neurons, we find that the lower FFN neurons can activate the upper FFN neurons: when zero-intervening on the lowest neuron among the top-100, the sum of all the neurons' coefficient scores decreases substantially. Therefore, there is a prediction enhancing stage among deep FFN neurons.
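A hooks-based sketch of this check, reusing `model` and `tok` from the Step 1 sketch; the neuron list is truncated to the three example neurons reported in the table further below, while the real experiment uses all top-100 neurons.

```python
import torch

top_neurons = [(19, 5769), (25, 7164), (28, 3696)]  # (layer, index); truncated top-100
lowest = min(top_neurons)                           # the earliest-layer neuron

def collect_coeffs(model, ids, neurons, zero=None):
    """Read each neuron's coefficient (the input to down_proj) at the last position."""
    coeffs, handles = {}, []
    def make_hook(l):
        def hook(module, args):
            x = args[0]
            if zero is not None and zero[0] == l:
                x[..., zero[1]] = 0.0               # zero-intervene one neuron
            for ll, k in neurons:
                if ll == l:
                    coeffs[(ll, k)] = x[0, -1, k].item()
            return (x,)
        return hook
    for l in {l for l, _ in neurons}:
        mlp = model.model.layers[l].mlp
        handles.append(mlp.down_proj.register_forward_pre_hook(make_hook(l)))
    with torch.no_grad():
        model(ids)
    for h in handles:
        h.remove()
    return coeffs

ids = tok("3+5=", return_tensors="pt").input_ids
before = sum(collect_coeffs(model, ids, top_neurons).values())
after = sum(collect_coeffs(model, ids, top_neurons, zero=lowest).values())
print(f"coefficient sum: {before:.2f} -> {after:.2f}")  # drops when the lowest neuron is zeroed
```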
Step 4: the feature enhancing stage with shallow FFN neurons. It is hard to locate the important shallow FFN neurons, because they usually do not directly contain information related to the final predictions. We analyze the important shallow FFN neurons in the case "3+5=" -> "8" by computing the inner product between the shallow neurons and the important deep neurons, and find that the important shallow neurons are hidden-interpretable: when directly projecting these neurons into unembedding space, the top tokens are not interpretable, but if we compute the neurons' vectors as transformed by the attention layers, the transformed vectors become interpretable in unembedding space. This is because all the shallow neurons must be transformed by attention layers before being transferred to the last position. Based on this, we design a zero-shot method to locate the hidden-interpretable neurons: we compute the upper layers' transform of each neuron and project it into unembedding space; if the top tokens are related to the input numbers or operators, we add the shallow neuron to a neuron set. Finally, we mask the shallow neurons in this set and compute the accuracy decrease on all 1-digit cases. The accuracy drops substantially, which shows that the hidden-interpretable neurons are important.
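The probe behind this step can be sketched as follows, again reusing `model` and `tok` from the Step 1 sketch. The upper-layer and head indices are illustrative, and the transform keeps only the value-output circuit W_O·W_V of one head, ignoring layer norms and attention weights, so this is a simplified approximation of the full attention transform.

```python
import torch

def top_tokens(vec, k=6):
    """Top-k tokens of a residual-stream vector in unembedding space."""
    logits = model.lm_head.weight @ vec
    return [tok.decode(i.item()) for i in logits.topk(k).indices]

layer, neuron = 12, 4072   # the "3"-related neuron from the table below
value = model.model.layers[layer].mlp.down_proj.weight[:, neuron]

print("direct:", top_tokens(value))   # uninterpretable top tokens

# value-output transform of one upper attention head (indices illustrative)
attn = model.model.layers[14].self_attn
d, h = attn.head_dim, 3
W_V = attn.v_proj.weight[h * d:(h + 1) * d]      # (d_head, d_model)
W_O = attn.o_proj.weight[:, h * d:(h + 1) * d]   # (d_model, d_head)
print("transformed:", top_tokens(W_O @ (W_V @ value)))  # tokens like "three", "3"
```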
Constructing the internal logic chain of "3+5=" -> "8". First, in shallow layers there is a feature enhancing stage: the hidden-interpretable shallow FFN neurons related to the input tokens are activated. Then these enhanced features are transformed by the attention layers and transferred to the last position. The arithmetic head takes all these enhanced features as input and activates the deep FFN neurons related to the final prediction "8". Among the deep neurons, the lower neurons activate the upper neurons during the prediction enhancing stage.
Top tokens of identified hidden-interpretable shallow FFN neurons:
12_4072 (not transformed): [rd, quarters, PO, Constraint, ran, avas]
12_4072 (attention transformed): [III, three, Three, 3, triple]
11_2258 (not transformed): [enz, Trace, lis, vid, suite, HT, ung, icano]
11_2258 (attention transformed): [XV, fifth, Fif, avas, Five, five, abase, fif]
Top tokens of identified deep FFN neurons using Comparative Neuron Analysis (CNA):
28_3696: [8, eight, VIII, huit, acht, otto]
25_7164: [six, eight, acht, Four, twelve, six, four]
19_5769: [eight, VIII, 8, III, huit, acht]
a) Understanding the mechanism of LoRA.
We apply the Comparative Neuron Analysis method to explore the mechanism of LoRA. We add LoRA on the 9th attention layer and compare the important neurons between the original model and the LoRA model, and we find that LoRA improves the accuracy by enlarging the important neurons' coefficient scores. In other words, the deep FFN neurons can be regarded as learned features: LoRA learns how to amplify these features, rather than storing new features in the LoRA parameters.
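A sketch of this comparison, reusing `model`, `tok`, `collect_coeffs`, and `top_neurons` from the earlier sketches; the adapter path is a placeholder, and merging the adapter with peft's merge_and_unload keeps the module paths identical to the base model so the same hooks apply.

```python
from peft import PeftModel

ids = tok("3+5=", return_tensors="pt").input_ids
c_base = collect_coeffs(model, ids, top_neurons)               # scores before LoRA

lora = PeftModel.from_pretrained(model, "path/to/arith-lora")  # placeholder path
merged = lora.merge_and_unload()                               # same module paths as base
c_lora = collect_coeffs(merged, ids, top_neurons)

for n in top_neurons:
    print(n, f"{c_base[n]:.2f} -> {c_lora[n]:.2f}")            # coefficients grow under LoRA
```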
b) Model pruning for arithmetic tasks.
In the previous analysis, we found that accuracy is not affected much when pruning many deep neurons. Based on this, we propose a model pruning method. We first use the Comparative Neuron Analysis method to locate the important neurons by comparing the original model with the LoRA fine-tuned model, and obtain the pruned model by pruning the unimportant neurons in deep layers. We then add another LoRA adapter and fine-tune it on the pruned model. This method achieves 82.3% accuracy when pruning 95% of the deep FFN neurons; when randomly pruning the same number of neurons and adding LoRA to the randomly pruned model, the accuracy is only 17.1%.
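A minimal sketch of the pruning step under stated assumptions: `important` stands for the CNA-selected neuron set (truncated here to the three example neurons), and the layer threshold for "deep" is a placeholder rather than the paper's exact cutoff.

```python
import torch

DEEP_FROM = 16                                     # assumption: layers >= 16 are "deep"
important = {(19, 5769), (25, 7164), (28, 3696)}   # CNA-selected set (truncated)

with torch.no_grad():
    for l, layer in enumerate(model.model.layers):
        if l < DEEP_FROM:
            continue
        keep = torch.zeros(layer.mlp.down_proj.in_features, dtype=torch.bool)
        for ll, k in important:
            if ll == l:
                keep[k] = True
        layer.mlp.down_proj.weight[:, ~keep] = 0.0  # prune the unimportant neurons
# then attach a fresh LoRA adapter and fine-tune again on the pruned model
```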
c) Model editing for reducing gender bias.
We compare a neuron's importance score under different input cases (e.g., "A woman works as a" -> "nurse" vs. "A man works as a" -> "nurse"). The neurons' importance scores differ across these cases because the prediction probabilities differ. Using this method, we can locate the gender-bias neurons and reduce gender bias by zero-editing them. The average perplexity difference between genders decreases by 35.7% when only 18 neurons are edited.
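A sketch of the localization step, reusing `model` and `tok` from the first sketch. The per-neuron score below is a dot-product approximation (coefficient times the neuron value's projection onto the target token's unembedding row), an assumption standing in for the paper's exact log-probability score; the deep-layer range is also a placeholder.

```python
import torch

target = tok(" nurse", add_special_tokens=False).input_ids[-1]

def neuron_scores(prompt, layers=range(16, 32)):
    """Per-neuron contribution toward the target token, for each deep layer."""
    ids = tok(prompt, return_tensors="pt").input_ids
    scores, handles = {}, []
    def make_hook(l):
        def hook(module, args):
            coeff = args[0][0, -1]                               # (d_ff,) coefficients
            proj = model.lm_head.weight[target] @ module.weight  # (d_ff,) value projections
            scores[l] = (coeff * proj).float()
        return hook
    for l in layers:
        mlp = model.model.layers[l].mlp
        handles.append(mlp.down_proj.register_forward_pre_hook(make_hook(l)))
    with torch.no_grad():
        model(ids)
    for h in handles:
        h.remove()
    return scores

woman = neuron_scores("A woman works as a")
man = neuron_scores("A man works as a")
gap = {(l, k): (woman[l][k] - man[l][k]).item()
       for l in woman for k in range(woman[l].shape[0])}
biased = sorted(gap, key=lambda n: abs(gap[n]), reverse=True)[:18]
# zero-edit: set down_proj.weight[:, k] = 0 for each (l, k) in `biased`,
# as in the pruning sketch above
```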
@inproceedings{yu2024interpreting,
title={Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis},
author={Yu, Zeping and Ananiadou, Sophia},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
pages={3293--3306},
year={2024}
}