Writing One Attention Head in NumPy
I had read about attention more times than I can count. Watched the 3Blue1Brown video, skimmed the Illustrated Transformer, half-read the paper. I could recite the slogan "queries, keys, values," but when I tried to picture what a single forward pass actually looked like in memory, I got fuzzy. Specifically: I was never sure, in a given diagram, whether each row of the attention matrix corresponded to one query or one key. That tiny ambiguity told me I did not really understand it yet. ...
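To make the row question concrete, here is a minimal sketch under one common convention: queries stacked as rows of Q, keys as rows of K, and the softmax taken over the key axis. The names Q, K, V and the sizes n_q, n_k, d are assumptions picked for illustration, and the random inputs stand in for real projections. Some diagrams draw the transpose, so treat this as one layout rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen to differ so the two axes cannot be confused.
n_q, n_k, d = 3, 5, 4

Q = rng.standard_normal((n_q, d))  # one row per query
K = rng.standard_normal((n_k, d))  # one row per key
V = rng.standard_normal((n_k, d))  # one row per key's value

# Entry [i, j] is the scaled dot product of query i with key j.
scores = Q @ K.T / np.sqrt(d)      # shape (n_q, n_k)

# Softmax over the key axis; subtracting the row max keeps exp() stable.
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)

# Each output row is a weighted average of the value rows.
out = weights @ V                  # shape (n_q, d)

print(weights.shape)               # (3, 5): rows are queries, columns are keys
print(weights.sum(axis=-1))        # every row sums to 1, up to float error
```

With n_q and n_k deliberately unequal, the shape alone settles it: in this layout each row of the attention matrix belongs to one query, and that row is a probability distribution over the keys.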