Neuron-Level Knowledge Attribution in Large Language Models

EMNLP 2024 (main)
University of Manchester

Abstract

Identifying important neurons for final predictions is essential for understanding the mechanisms of large language models. Due to computational constraints, current attribution techniques struggle to operate at neuron level. In this paper, we propose a static method for pinpointing significant neurons. Compared to seven other methods, our approach demonstrates superior performance across three metrics. Additionally, since most static methods typically only identify "value neurons" directly contributing to the final prediction, we propose a method for identifying "query neurons" which activate these "value neurons". Finally, we apply our methods to analyze six types of knowledge across both attention and feed-forward network (FFN) layers. Our method and analysis are helpful for understanding the mechanisms of knowledge storage and set the stage for future research in knowledge editing.

Introduction

Identifying the important neurons in LLMs is essential for understanding their internal mechanisms. Traditional attribution methods such as integrated gradients and causal mediation analysis are hard to apply at the neuron level due to their computational cost. Take Llama-7B as an example: it has 32 layers, each with 4,096 attention neurons and 11,008 FFN neurons. Therefore, a static method is needed to locate the important neurons.

In LLMs, "value neurons" contribute directly to the final prediction, as they contain important information about the predicted token. "Query neurons" contribute by activating the "value neurons", even though they may not themselves contain information about the final prediction. In this paper, we first compute the log probability increase of each neuron to locate the "value neurons" in deep FFN layers and deep attention layers. We then calculate the inner product between the attention "value neurons" and each shallow FFN neuron to locate the "query neurons" in shallow FFN layers.
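
To make the first step concrete, here is a minimal PyTorch sketch (an approximation on GPT-2 small with an arbitrary layer index, not the code used in the paper): it scores each FFN neuron in one layer by how much adding its coefficient-weighted value vector back onto the last token's final residual stream increases the log probability of the predicted token.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

layer = 10                                   # illustrative "deep" layer in GPT-2 small
block = model.transformer.h[layer]
cache = {}
# capture this layer's FFN coefficients and the residual stream entering the final layer norm
# (assumes a recent transformers version where mlp.act is an nn.Module)
block.mlp.act.register_forward_hook(lambda m, i, o: cache.__setitem__("coef", o))
model.transformer.ln_f.register_forward_hook(lambda m, i, o: cache.__setitem__("resid", i[0]))

ids = tok("Tim Duncan plays the sport of", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits
target = logits[0, -1].argmax()              # the model's predicted next token

def log_prob(vec):
    # read a residual-stream vector out through the final layer norm and the unembedding
    return torch.log_softmax(model.lm_head(model.transformer.ln_f(vec)), -1)[target]

coef = cache["coef"][0, -1]                  # (n_ffn,) neuron coefficients at the last token
values = block.mlp.c_proj.weight             # (n_ffn, d_model) value vectors written to the stream
resid = cache["resid"][0, -1]                # last token's final residual stream
with torch.no_grad():
    base = log_prob(resid)
    # score_i = log p(target | resid + coef_i * value_i) - log p(target | resid)
    scores = torch.stack([log_prob(resid + coef[i] * values[i]) - base
                          for i in range(values.shape[0])])
print("top FFN value neuron candidates in layer", layer, ":", scores.topk(10).indices.tolist())

The per-neuron loop is written for clarity; in practice the unembedding projection can be batched across neurons.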

When intervening on the top-200 "attention value neurons" and top-100 "FFN value neurons" for each sentence, MRR and probability decrease by 96.3% and 99.2% in GPT-2, and by 96.9% and 99.6% in Llama. When intervening on the top-1000 shallow neurons for each sentence, MRR and probability again drop sharply (92% and 95% in GPT-2, and 87% and 95% in Llama). These results show that our method identifies the neurons that are important for the final prediction.
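
The intervention can be approximated with zero-ablation, as in the sketch below (GPT-2 small, hypothetical neuron indices, single sentence; not the evaluation script behind the numbers above, which also aggregates MRR over a dataset): the chosen neurons' coefficients are set to zero during the forward pass and the predicted token's probability is compared before and after.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")
ids = tok("Tim Duncan plays the sport of", return_tensors="pt").input_ids

with torch.no_grad():
    probs = torch.softmax(model(ids).logits[0, -1], -1)
target = probs.argmax()

# hypothetical neurons to knock out: layer -> FFN neuron indices (e.g. taken from the scores above)
to_zero = {8: [123, 2048], 10: [456]}

def make_hook(idx):
    def hook(module, inputs, output):
        output[..., idx] = 0.0               # zero these neurons' coefficients at every position
        return output
    return hook

handles = [model.transformer.h[l].mlp.act.register_forward_hook(make_hook(idx))
           for l, idx in to_zero.items()]
with torch.no_grad():
    probs_ablated = torch.softmax(model(ids).logits[0, -1], -1)
for h in handles:
    h.remove()

print(f"p(target) before: {probs[target]:.4f}   after: {probs_ablated[target]:.4f}")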

The neuron-level information flow is shown in the figure below. The "query FFN neurons" in shallow layers are activated and added into each position's residual stream. These "query neurons" then activate the "attention value neurons" in deep attention layers, whose outputs are transferred into the last token's residual stream and in turn activate the "FFN value neurons" in deep FFN layers.
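
For the query-neuron step, one plausible reading of the inner-product criterion is sketched below (GPT-2 small, illustrative indices; the paper's exact formulation may differ): each shallow FFN neuron is scored by the inner product between the vector it writes into the residual stream and the value-projection column that drives a chosen attention value neuron.

import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
d_model = model.config.n_embd
head_dim = d_model // model.config.n_head

attn_layer, head, neuron = 10, 7, 30         # hypothetical deep attention value neuron
shallow_layer = 2                            # hypothetical shallow FFN layer

# the W_V column whose dot product with a position's residual stream gives this
# attention neuron's value activation (c_attn packs Q, K, V side by side)
c_attn = model.transformer.h[attn_layer].attn.c_attn.weight      # (d_model, 3*d_model)
w_v = c_attn[:, 2 * d_model + head * head_dim + neuron]          # (d_model,)

# each row of c_proj is the vector one shallow FFN neuron writes into the residual stream
ffn_values = model.transformer.h[shallow_layer].mlp.c_proj.weight   # (n_ffn, d_model)

scores = ffn_values @ w_v                    # inner product per shallow FFN neuron
print("candidate query neurons in layer", shallow_layer, ":", scores.topk(10).indices.tolist())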

Figure: Neuron-Level Information Flow


Mechanistic Interpretability of Identified Neurons

We analyze the interpretability of the identified important neurons by projecting them into unembedding space, a commonly used technique in mechanistic interpretability. Note that most "query neurons" are hidden-interpretable: they become interpretable only after transformation by the attention heads. For details on hidden-interpretable neurons, please refer to this EMNLP 2024 paper: Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis.
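
A minimal version of this projection for FFN neurons might look as follows (an assumed implementation on GPT-2 small; the walkthrough below uses Llama-scale indices such as FFN_5_5005, which do not exist in GPT-2 small):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

def ffn_neuron_top_tokens(layer, neuron, k=10):
    # the neuron's value vector is the row of the FFN output projection it scales
    value = model.transformer.h[layer].mlp.c_proj.weight[neuron]   # (d_model,)
    with torch.no_grad():
        logits = model.lm_head(value)                              # project into vocabulary space
    return tok.convert_ids_to_tokens(logits.topk(k).indices.tolist())

print(ffn_neuron_top_tokens(5, 3025))        # hypothetical GPT-2 neuron, not the paper's FFN_5_5005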

The case is: ['{start}', 'Tim', 'Dun', 'can', 'plays', 'the', 'sport', 'of'] => "basketball"

Firstly, the 3rd position ("can") activates the query FFN neuron FFN_5_5005 (layer 5, neuron 5005). The top tokens of FFN_5_5005 (hidden-interpretable) are ['basketball', 'Basketball', 'NBA', 'Jazz', 'asketball', 'jazz', 'Bird', 'basket', 'court', 'courts'].

It is likely that, in earlier attention layers, position 3 ("can") has already captured features from position 1 ("Tim") and position 2 ("Dun"), but we do not verify this in our paper.

Secondly, many query neurons like FFN_5_5005 are activated. These neurons then activate many attention value neurons like ATTN_15_15_112 (layer 15, head 15, neuron 112), whose top tokens in unembedding space are ['basketball', 'Basketball', 'asketball', 'NBA', 'basket', '球', 'Mount', 'hos', 'wings', 'ugby']. The identified attention value neurons act as both "value" and "query".

Lastly, these attention neurons activate FFN value neurons such as FFN_22_1674: ['Jazz', 'jazz', 'basketball', 'rock', 'Basketball', 'Rock', 'hockey', 'Hockey', 'Pop', 'rugby']. This FFN neuron is clearly polysemantic: it is also related to "music" (see "jazz" and "rock"). Even with this superposition, our method still identifies the neuron as important for the prediction.
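
The attention value neurons in this walkthrough can be inspected with the same projection; the sketch below (again GPT-2 small with illustrative indices) reads off the row of the head output projection that such a neuron writes into the residual stream.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")
head_dim = model.config.n_embd // model.config.n_head

def attn_neuron_top_tokens(layer, head, neuron, k=10):
    # the vector this attention value neuron writes into the residual stream is one
    # row of the attention output projection c_proj
    row = model.transformer.h[layer].attn.c_proj.weight[head * head_dim + neuron]   # (d_model,)
    with torch.no_grad():
        logits = model.lm_head(row)
    return tok.convert_ids_to_tokens(logits.topk(k).indices.tolist())

print(attn_neuron_top_tokens(10, 7, 30))     # hypothetical GPT-2 indices, not the paper's ATTN_15_15_112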

To test more cases, please use our code.

BibTeX

@inproceedings{yu2024neuron,
  title={Neuron-Level Knowledge Attribution in Large Language Models},
  author={Yu, Zeping and Ananiadou, Sophia},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  pages={3267--3280},
  year={2024}
}