We investigate the mechanism of in-context learning (ICL) on sentence classification tasks with semantically unrelated labels ("foo"/"bar"). We find that intervening on only 1% of heads (which we name "in-context heads") reduces ICL accuracy from 87.6% to 24.4%. To understand this phenomenon, we analyze the value-output vectors in these heads and discover that the vectors at each label position contain substantial information about the corresponding labels. Furthermore, we observe that the prediction shift from "foo" to "bar" is caused by these heads' attention scores decreasing at the "foo" positions and increasing at the "bar" positions. Therefore, we propose a hypothesis for ICL: in in-context heads, the value-output matrices extract label features, while the query-key matrices compute the similarity between the features at the last position and those at each label position. The query and key matrices can be regarded as two towers that learn the similarity metric between the last position's features and each demonstration's features at the label positions. Using this hypothesis, we explain the majority label bias and recency bias in ICL and propose two methods to reduce these biases by 22% and 17%, respectively.
In-context learning (ICL) is an emergent ability of large language models. Given a few demonstration-label pairs as a prompt, an LLM performs well on many tasks without any parameter updates. Because the mechanism of ICL remains unclear, many studies have focused on understanding how it works. Although these previous studies are important for understanding ICL, its exact mechanism remains a mystery for several reasons. First, the information flow is typically observed as an average across heads, but understanding ICL requires measuring the precise importance of each head. Second, each head has a query matrix, key matrix, value matrix, and output matrix, and it is essential to study the role of each matrix in detail. Lastly, ICL suffers from issues such as majority label bias and recency bias, and how to explain and mitigate these biases has not yet been thoroughly investigated.
Contributions: In this paper, we propose a hypothesis of in-context learning and design experiments to understand the roles of the different modules (query, key, value, and output matrices). We also explain and reduce the majority label bias and recency bias of in-context learning.
Hypothesis of In-context Learning
The proposed hypothesis is: In shallow layers, the label positions extract the demonstration features, and the last position extracts features from all positions (X% input text + Y% near demonstrations + Z% far demonstrations). In the in-context heads of deep layers, the value-output matrices extract the label features at the label positions, learning foo -> foo and bar -> bar. The query and key matrices are two towers that learn the similarity metric between the last position's features and the demonstration features at the label positions. When the similarity between the input text and a demonstration is large, the attention score at the corresponding label position is large, so more of that label's information is transferred to the last position, which raises the probability of that label's token. Take "France : bar Cat : foo Dog :" -> foo as an example. Sim(X% Dog + Y% Cat + Z% France, Cat) > Sim(X% Dog + Y% Cat + Z% France, France), so "foo" is predicted.
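Below is a toy sketch of this hypothesis (random weights and made-up dimensions, not the trained model's parameters): the query and key matrices act as two towers that score each label position against the last position, and the value-output transform copies label information in proportion to those scores.

```python
import torch

# Toy illustration of the two-tower hypothesis (random weights, hypothetical
# dimensions; not the trained model's parameters).
d_model, d_head = 16, 8
W_q = torch.randn(d_head, d_model)     # query tower
W_k = torch.randn(d_head, d_model)     # key tower
W_vo = torch.randn(d_model, d_model)   # combined value-output transform

h_last = torch.randn(d_model)          # last-position features (input text + demonstrations)
h_labels = torch.randn(2, d_model)     # features at the label positions ("bar", "foo")

# Two-tower similarity: the query tower encodes the last position and the
# key tower encodes each label position; their dot product is the similarity.
scores = (W_q @ h_last) @ (W_k @ h_labels.T)
attn = torch.softmax(scores / d_head ** 0.5, dim=-1)

# The head's output at the last position is the attention-weighted sum of the
# value-output vectors. A label position with higher similarity contributes
# more, raising the logit of the corresponding label token.
head_out = attn @ (h_labels @ W_vo.T)
```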
We list the experimental results supporting this hypothesis. Please find the details in the paper.
a) Previous studies find that label positions extract demonstrations' features.
b) We find a few "fooheads" that are important for predicting "foo" and a few "barheads" that are important for predicting "bar". When we intervene on the fooheads, the probability of "foo" drops sharply; when we intervene on the barheads, the probability of "bar" drops sharply. We name these heads "in-context heads" (see the head-ablation sketch after this list).
c) We design a logit minus score to evaluate the information stored in the weighted value-output vectors at each position (see the logit-minus sketch after this list). We find that the "foo" positions in fooheads store much information about "foo", and the "bar" positions in barheads store much information about "bar". So the mechanism of the in-context heads is a copying mechanism, similar to that of induction heads.
d) We compare the prompts "S0 : bar S1 : bar S2 : foo S3 : foo S4 :" => foo and "S0 : foo S1 : foo S2 : bar S3 : bar S4 :" => bar: the prediction changes from "foo" to "bar" when the labels are reversed. For each position in the in-context heads, we compute the logit minus of the weighted value-output vectors, the attention scores, and the logit minus of the value-output vectors. We find that the change in attention scores is the root cause of the probability shift from "foo" to "bar"; in comparison, the value-output matrices learn foo -> foo and bar -> bar in both cases (see the attention-comparison sketch after this list).
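For b), here is a minimal head-ablation sketch with HuggingFace GPT-2. The layer and head indices are placeholders for illustration; the actual in-context heads are identified empirically in the paper. One head's slice of the concatenated attention output is zeroed before the output projection, and the probability of the label token is compared before and after.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

LAYER, HEAD = 9, 6                      # placeholder indices, for illustration only
head_dim = model.config.hidden_size // model.config.num_attention_heads

def ablate_head(module, inputs):
    # c_proj receives the concatenated per-head outputs; zero one head's slice.
    hidden = inputs[0].clone()
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0.0
    return (hidden,)

ids = tok("France : bar Cat : foo Dog :", return_tensors="pt").input_ids
foo_id = tok.encode(" foo")[0]          # first token of " foo"

with torch.no_grad():
    p_before = torch.softmax(model(ids).logits[0, -1], dim=-1)[foo_id].item()
    handle = model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(ablate_head)
    p_after = torch.softmax(model(ids).logits[0, -1], dim=-1)[foo_id].item()
    handle.remove()

print(f"p(' foo') before ablation: {p_before:.4f}, after: {p_after:.4f}")
```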
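For c), here is a logit-minus sketch, following the description above: project a head's attention-weighted value-output vector at a position through the unembedding matrix and take the difference between the " foo" and " bar" logits. Capturing the per-position vectors themselves (e.g., with forward hooks) is omitted here.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

W_U = model.lm_head.weight              # (vocab_size, hidden_size) unembedding matrix
foo_id = tok.encode(" foo")[0]
bar_id = tok.encode(" bar")[0]

def logit_minus(vec: torch.Tensor) -> float:
    """logit(' foo') - logit(' bar') for a hidden-space vector.

    `vec` is assumed to be a head's attention-weighted value-output vector at
    a label position, captured beforehand with a forward hook (not shown)."""
    logits = W_U @ vec                  # (vocab_size,)
    return (logits[foo_id] - logits[bar_id]).item()
```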
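For d), here is an attention-comparison sketch for the two reversed-label prompts, again with placeholder layer/head indices. It reads the attention from the last position in the chosen head so that the mass on the "foo" versus "bar" label positions can be compared across the two prompts.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

LAYER, HEAD = 9, 6                      # placeholder indices, for illustration only

def last_position_attention(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        attn = model(ids, output_attentions=True).attentions[LAYER]  # (1, heads, seq, seq)
    return attn[0, HEAD, -1]            # attention from the last position to every position

a1 = last_position_attention("S0 : bar S1 : bar S2 : foo S3 : foo S4 :")
a2 = last_position_attention("S0 : foo S1 : foo S2 : bar S3 : bar S4 :")
# Under the hypothesis, comparing a1 and a2 at the four label-token positions
# shows attention mass moving from the "foo" positions to the "bar" positions
# when the labels are reversed, which flips the prediction.
```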
We also find mechanistic interpretability evidence for our hypothesis by projecting the vectors into the unembedding space, as shown below.
Mechanistic interpretability evidence in GPT2
@inproceedings{yu2024how,
  author    = {Yu, Zeping and Ananiadou, Sophia},
  title     = {How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2024},
}