Lecture 21 - Transformers I¶

ECE364 - Programming Methods for Machine Learning¶

Nickvash Kani¶

Slides based off prior lectures by Alex Schwing, Aigou Han, Farzad Kamalabadi, Corey Snyder. All mistakes are my own!¶

In this lecture:

  • Continue discussing the basic structure of the transformer model
  • Talk about attention and how it works

Attention is all you need¶

In 2017, Ashish Vaswani and his colleagues at Google Research/Brain published their seminal paper "Attention is all you need" [2]. It introduced the transformer architecture, which has become the foundation of nearly every large language model since. In this lecture, let's walk through this architecture and try to decipher it step by step.

Step 1: Tokenization¶

No description has been provided for this image

Step 2: Positional encoding¶

No description has been provided for this image

Step 3: Multi-head Attention¶

(Specifically self-attention, other types of attention next time)

Attention is a mechanism that allows the model to focus on different parts of the input when producing each output.

In the original transformer paper [2], Scaled Dot-Product Attention is used:

Given:

  • Query matrix: $Q$ (one query vector per row)
  • Key matrix: $K$ (one key vector per row)
  • Value matrix: $V$ (one value vector per row)

the attention output is:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right) V $$

Key Steps:¶

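  1. Compute Scores:
    Measure similarity between each Query and each Key with a dot product: $QK^\top$

  2. Scale Scores:
    Divide by $\sqrt{d_k}$, where $d_k$ is the dimension of the keys. Without scaling, the dot products grow with $d_k$ and push the softmax into regions with very small gradients.

  3. Apply Softmax:
    Convert each row of scores into non-negative weights that sum to 1.

  4. Weighted Sum:
    Multiply the softmax weights by the Values ($V$), so each output is a weighted average of the value vectors.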
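
The four steps above can be sketched from scratch. This is a minimal illustration in plain Python rather than the code a real model runs (in practice you would use tensor operations in PyTorch). The helper names and the toy matrix `X` are assumptions made for this example, and setting $Q = K = V = X$ skips the learned projections introduced later.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                                # 1. compute scores QK^T
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # 2. scale by sqrt(d_k)
    weights = [softmax(row) for row in scaled]                      # 3. softmax per query row
    output = matmul(weights, V)                                     # 4. weighted sum of values
    return output, weights

# Toy self-attention: 3 tokens with 2-dimensional embeddings, Q = K = V = X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = scaled_dot_product_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is the attention distribution of one token over all three tokens, so every row sums to 1.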


Intuition:¶

  • Query asks: "What am I related to?"
  • Keys tell: "What components of the sequence are relevant to the query?"
  • Values deliver: "Let's add the relevant words together so each vector contains the original word embedding values plus information about the relevant words near it."

Example of self-attention¶

In [20]:
from transformers import AutoTokenizer, AutoModel
import torch
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

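# Load a small BERT-style encoder; eager attention lets the model return attention weights
tokenizer = AutoTokenizer.from_pretrained("nreimers/MiniLM-L6-H384-uncased")
model = AutoModel.from_pretrained("nreimers/MiniLM-L6-H384-uncased", output_attentions=True, attn_implementation="eager")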

# Tokenize input
text = "The studious student graduated with honors."
inputs = tokenizer(text, return_tensors="pt")

# Forward pass
outputs = model(**inputs, output_attentions=True)
attentions = outputs.attentions

# Pick a layer and head
layer_idx = 0
head_idx = 0
attention_matrix = attentions[layer_idx][0, head_idx].detach().cpu().numpy()

# Tokens
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

# Make DataFrame
attention_df = pd.DataFrame(attention_matrix, index=tokens, columns=tokens)

# Plot
plt.figure(figsize=(10,8))
ax = sns.heatmap(attention_df, annot=True, fmt=".2f", cmap="Blues", cbar=False, square=True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_yticklabels(ax.get_yticklabels(), rotation=0)
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')
plt.title(f"MiniLM Attention (Layer {layer_idx}, Head {head_idx})", pad=20)
plt.tight_layout()
plt.show()

Some notes about attention:¶

  • We expect the values along the diagonal to be the largest (because a word is most relevant to itself)

  • The attention formula itself has no learnable parameters so far (though that's about to change)

  • What are [CLS] and [SEP] tokens?

| Token | Stands for | Meaning |
|-------|------------|---------|
| [CLS] | Classification token | Special token added at the beginning of every input sequence. Used to summarize the sentence. |
| [SEP] | Separator token | Special token used to separate different segments (like sentence A and B). Also marks the end of a single sentence. |

Multi-Head Attention¶

Instead of applying a single attention function, Transformers use Multi-Head Attention.


How Multi-Head Attention Works:¶

  1. Project Queries, Keys, and Values into multiple subspaces:

    • Each head gets different $Q$, $K$, $V$ through learned linear transformations.
  2. Apply Attention independently in each head.

  3. Concatenate the outputs of all heads.

  4. Final Linear Layer combines the concatenated heads.


Mathematically:¶

For each head:

$$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$

Then:

$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O $$
No description has been provided for this image

[3]
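
The two formulas above can be traced in a small sketch. It uses plain Python and random matrices as stand-ins for the learned weights $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$. The sizes (`d_model = 4`, `h = 2`) and all helper names are assumptions chosen only to keep the example readable. As in self-attention, the same input `X` plays the role of $Q$, $K$, and $V$.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (random here, trained in practice)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # head_i = Attention(X W_i^Q, X W_i^K, X W_i^V)
    heads = [
        attention(matmul(X, Wq), matmul(X, Wk), matmul(X, Wv))
        for Wq, Wk, Wv in zip(W_Q, W_K, W_V)
    ]
    # Concatenate along the feature axis: each token ends up with h * d_k features.
    concat = [[x for head in heads for x in head[t]] for t in range(len(X))]
    # The final linear layer W^O mixes information across heads.
    return matmul(concat, W_O)

rng = random.Random(0)
seq_len, d_model, h = 3, 4, 2
d_k = d_model // h  # each head works in a smaller subspace

X = random_matrix(seq_len, d_model, rng)
W_Q = [random_matrix(d_model, d_k, rng) for _ in range(h)]
W_K = [random_matrix(d_model, d_k, rng) for _ in range(h)]
W_V = [random_matrix(d_model, d_k, rng) for _ in range(h)]
W_O = random_matrix(h * d_k, d_model, rng)

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(len(out), len(out[0]))  # sequence length and model dimension are preserved
```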

Example of multi-head attention¶

No description has been provided for this image

Why Multi-Head Attention?¶

  • Each head can focus on different relationships (short-term, long-term, positional, syntactic).
  • Allows the model to capture richer patterns.

Question for you¶

  • What's the point of splitting the matrices up and then recombining them? Why not just do one large self-attention?

Step 4: Normalize and more Linear layers¶

After the Multi-Head Attention block, the Transformer applies two critical operations:


1. Add & Norm (Residual Connection + Layer Normalization)¶

  • Residual Connection:

    • Adds the original input of the attention block to its output.
    $$ \text{Residual} = \text{Input} + \text{AttentionOutput} $$
  • Layer Normalization:

    • Normalizes across feature dimensions to stabilize and accelerate training.
    $$ \text{NormedOutput} = \text{LayerNorm}(\text{Residual}) $$
  • Helps prevent vanishing/exploding gradients.

  • Allows smoother gradient flow through the network.
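
A minimal sketch of Add & Norm in plain Python, written to mirror the two formulas above. The scale and shift vectors `gamma` and `beta` are the usual learnable LayerNorm parameters; here they are fixed to ones and zeros, and the input numbers are made up for illustration.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector across its feature dimension."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def add_and_norm(inputs, sublayer_output, gamma, beta):
    """Residual connection followed by layer normalization, token by token."""
    normed = []
    for x_row, y_row in zip(inputs, sublayer_output):
        residual = [x + y for x, y in zip(x_row, y_row)]  # Input + AttentionOutput
        normed.append(layer_norm(residual, gamma, beta))   # LayerNorm(Residual)
    return normed

# Two tokens with 4 features each (illustrative values).
inputs = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
attention_output = [[0.1, -0.2, 0.3, 0.0], [1.0, 1.0, -1.0, 0.5]]
gamma = [1.0] * 4  # scale, learned in a real model
beta = [0.0] * 4   # shift, learned in a real model

normed = add_and_norm(inputs, attention_output, gamma, beta)
for row in normed:
    print([round(v, 3) for v in row])
```

With `gamma` at one and `beta` at zero, every output row has approximately zero mean and unit variance, regardless of the scale of the inputs.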


2. Position-wise Feed-Forward Network (FFN)¶

  • Applies the same small MLP to each token individually.

    $$ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$
    • $W_1$, $W_2$ are weight matrices.
    • First linear layer expands the dimension (e.g., from $d_{\text{model}}$ to $4 d_{\text{model}}$).
    • ReLU non-linearity.
    • Second linear layer projects back to $d_{\text{model}}$.
  • Introduces non-linearity and richer transformations.

  • Same FFN weights are shared across all sequence positions.
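
The FFN formula can be sketched as follows. The weights are random stand-ins for learned parameters, and the sizes (`d_model = 2`, hidden size `4 * d_model`) are assumptions chosen for readability. The first two tokens are deliberately identical to show that the network treats each position the same way.

```python
import random

def linear(x, W, b):
    """Compute xW + b for one token vector x (W stored as a list of rows)."""
    return [sum(xi * w for xi, w in zip(x, col)) + bj
            for col, bj in zip(zip(*W), b)]

def ffn(x, W1, b1, W2, b2):
    hidden = [max(0.0, v) for v in linear(x, W1, b1)]  # expand, then ReLU
    return linear(hidden, W2, b2)                       # project back to d_model

rng = random.Random(0)
d_model = 2
d_ff = 4 * d_model

# Random stand-ins for the learned parameters.
W1 = [[rng.uniform(-1.0, 1.0) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [rng.uniform(-1.0, 1.0) for _ in range(d_ff)]
W2 = [[rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]

# The same weights are applied to every position independently.
sequence = [[1.0, 2.0], [1.0, 2.0], [-1.0, 0.5]]
outputs = [ffn(token, W1, b1, W2, b2) for token in sequence]
print(outputs[0] == outputs[1])  # identical tokens give identical outputs
```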

Decoder¶

We have already seen most of the components of the decoder, but there are two subtle variations:

  • Masked attention
  • Cross-attention, where $Q$ comes from a different source than $K$ and $V$

Masked Multi-Head Attention¶

Masked Multi-Head Attention is used in the Transformer decoder to prevent attending to future tokens during training.


Purpose:¶

  • During training, the model should only use known (past) tokens to predict the next token.
  • It must not "cheat" by looking ahead at future tokens.

How It Works:¶

  • Before applying the softmax to the attention scores $QK^\top$, mask out (set to $-\infty$) all connections to future tokens.
  • After masking, the softmax assigns zero probability to any future token.

Mathematically:¶

$$ \text{MaskedAttention}(Q, K, V) = \text{softmax}\left( \frac{QK^\top + M}{\sqrt{d_k}} \right)V $$

Where:

  • $M$ = Mask matrix:
    • $0$ for allowed connections (past and current tokens),
    • $-\infty$ for disallowed (future tokens).
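
To see the mask in action, here is a small sketch in plain Python that follows the formula above, adding $M$ to the scores before scaling and applying the softmax. The function names and the toy matrix `X` are assumptions for illustration. Only the attention weights are computed, since the final multiplication by $V$ is unchanged.

```python
import math

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(n):
    """M[i][j] = 0 if j <= i (past or current token), -inf if j > i (future token)."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]

def masked_attention_weights(Q, K):
    d_k = len(K[0])
    n = len(Q)
    M = causal_mask(n)
    weights = []
    for i in range(n):
        row = [
            (sum(q * k for q, k in zip(Q[i], K[j])) + M[i][j]) / math.sqrt(d_k)
            for j in range(n)
        ]
        weights.append(softmax(row))
    return weights

# Toy example: 3 tokens, Q = K = X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = masked_attention_weights(X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

The result is lower triangular: the first token can only attend to itself, the second to the first two tokens, and so on.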
In [14]:
import torch
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from transformers import AutoTokenizer, AutoModelForCausalLM

# Step 1: Load model and tokenizer (eager attention so attention weights are returned)
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2", output_attentions=True, attn_implementation="eager")

# Step 2: Tokenize input
text = "The studious student graduated with honors."
inputs = tokenizer(text, return_tensors="pt")

# Step 3: Forward pass to get attentions
outputs = model(**inputs, output_attentions=True)
attentions = outputs.attentions

# Step 4: Select layer and head
layer_idx = 0
head_idx = 0
attention_matrix = attentions[layer_idx][0, head_idx].detach().cpu().numpy()

# Step 5: Decode tokens
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

# Step 6: Build square dataframe
attention_df = pd.DataFrame(attention_matrix, index=tokens, columns=tokens)

# Step 7: Plot without offset
plt.figure(figsize=(10, 8))
ax = sns.heatmap(attention_df, annot=True, fmt=".2f", cmap="Greens", cbar=False, square=True)

# Fix the ticks so row and column labels align
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_yticklabels(ax.get_yticklabels(), rotation=0)

# Move x-axis labels to the top
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')

plt.title(f"Attention Matrix - Layer {layer_idx} Head {head_idx}", pad=20)
plt.tight_layout()
plt.show()
In [23]:
# This will be useful later: 
print(f"Number of layers: {len(model.transformer.h)}")
print(f"Number of attention heads per layer: {model.config.n_head}")
Number of layers: 6
Number of attention heads per layer: 12

Cross-Attention in Transformer Decoder¶

Cross-Attention connects the decoder to the encoder outputs.


Purpose:¶

  • Enables the decoder to attend to the full encoded input sequence.
  • Essential for sequence-to-sequence tasks (e.g., translation, summarization).
  • Allows the decoder to gather relevant information from the source input at every step.

How It Works:¶

  • Queries ($Q$) come from the decoder's previous layer.
  • Keys ($K$) and Values ($V$) come from the encoder outputs (fixed).

Thus:

$$ \text{CrossAttention}(Q_{\text{decoder}}, K_{\text{encoder}}, V_{\text{encoder}}) $$

Full Step-by-Step:¶

  1. Input: Decoder hidden states (as queries), Encoder outputs (as keys and values).
  2. Compute Attention Scores: Compare decoder queries with encoder keys.
  3. Softmax over Scores: Determine relevance of each encoder token to the current decoder token.
  4. Weighted Sum of Encoder Values: Aggregate useful information from the encoder.
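
The four steps can be sketched in plain Python. As a simplification, this example skips the learned projections and uses the decoder states directly as queries and the encoder outputs directly as keys and values. The function name and all numbers are made up for illustration. Note that the two sequences have different lengths.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_outputs):
    d_k = len(encoder_outputs[0])
    weights = []
    for q in decoder_states:
        # 2. Compare this decoder query with every encoder key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in encoder_outputs]
        # 3. Relevance of each encoder token to this decoder token.
        weights.append(softmax(scores))
    # 4. Weighted sum of encoder values for each decoder position.
    outputs = [
        [sum(w * v[c] for w, v in zip(w_row, encoder_outputs)) for c in range(d_k)]
        for w_row in weights
    ]
    return outputs, weights

# 1. Inputs: 4 encoder tokens (source) and 2 decoder tokens (target so far).
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
decoder_states = [[0.5, 0.5], [2.0, -1.0]]

outputs, weights = cross_attention(decoder_states, encoder_outputs)
print(len(weights), len(weights[0]))  # one row per decoder token, one column per encoder token
```

The attention matrix is rectangular (decoder length by encoder length), unlike the square matrices seen in self-attention.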

Training¶

Training a Transformer model involves teaching it to predict outputs from inputs using supervised learning.


Core Idea:¶

  • The Transformer learns to minimize a loss function that measures the difference between its predicted outputs and the ground-truth targets.
  • Training is done end-to-end with gradient descent.

🚀 Training Steps:¶

  1. Input Preparation:

    • Encoder receives the input sequence (e.g., source sentence).
    • Decoder receives the target sequence shifted right (teacher forcing).
  2. Forward Pass:

    • Encoder outputs hidden states.
    • Decoder predicts the next token at every target position in parallel (the causal mask hides later tokens), attending to encoder outputs via cross-attention.
  3. Loss Computation:

    • Compare decoder predictions to ground truth using a loss function, typically cross-entropy loss.
  4. Backward Pass:

    • Compute gradients of the loss with respect to all model parameters (weights).
    • Backpropagate through attention, feed-forward layers, etc.
  5. Parameter Update:

    • Use an optimizer (e.g., Adam) to update parameters based on gradients.
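
The steps above can be illustrated on a single decoder position. This is a heavily simplified sketch. The token ids, vocabulary size, and learning rate are hypothetical, and the logits are treated as if they were the trainable parameters. A real model would backpropagate through all of its layers and typically use Adam rather than the plain gradient descent shown here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct token."""
    return -math.log(softmax(logits)[label])

# 1. Input preparation (teacher forcing); token ids are hypothetical.
BOS, EOS = 0, 1
target = [5, 2, 7, EOS]
decoder_input = [BOS] + target[:-1]  # target sequence shifted right
labels = target                      # what the decoder should predict at each position

# 2-3. Forward pass and loss at the first position, using stand-in logits.
vocab_size = 8
logits = [0.0] * vocab_size          # an untrained model predicting uniformly
label = labels[0]
loss_before = cross_entropy(logits, label)

# 4. Backward pass: d(loss)/d(logits) = softmax(logits) - one_hot(label).
probs = softmax(logits)
grad = [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]

# 5. Parameter update (plain gradient descent for brevity).
lr = 1.0
logits = [z - lr * g for z, g in zip(logits, grad)]
loss_after = cross_entropy(logits, label)

print(decoder_input, labels)
print(loss_before, loss_after)
```

With uniform logits the initial loss equals $\log(\text{vocab size})$, and a single update step already lowers it.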

That's it for today¶

  • HW10 (last HW!) has been released (due Monday).
  • MT2 is a week from Thursday.

References¶

[1] Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. "Learning long-term dependencies with gradient descent is difficult." IEEE transactions on neural networks 5.2 (1994): 157-166.

[2] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

[3] Jamil, Umar. "Attention is all you need (Transformer) - Model explanation (including math), Inference and Training." YouTube. https://www.youtube.com/watch?v=bCz4OMemCcA

[4] 3Blue1Brown. "Attention in transformers, step-by-step | DL6." YouTube. https://www.youtube.com/watch?v=eMlx5fFNoYc