
Lecture 23 - Transformers II¶

ECE364 - Programming Methods for Machine Learning¶

Nickvash Kani¶

Slides based off prior lectures by Alex Schwing, Aigou Han, Farzad Kamalabadi, Corey Snyder. All mistakes are my own!¶

In this lecture:

  • Beam search
  • Special tokens
  • Bidirectional transformers
  • Multi-layer transformer networks
  • BERT
In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load a tiny GPT-2 model (small enough for CPU)
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Make sure model is in eval mode
model.eval()
/Users/nickvashkani/anaconda3/envs/ece364_oldnb_rise/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Out[1]:
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

Language generation - Beam search¶

Let's say we want to generate text. We can imagine the text generation process as choosing between the leaves of a tree:

[Figure: tree of candidate token sequences]

Last time, we generated text greedily, selecting the most likely next token at each step:

[Figure: greedy decoding path through the tree]

But is that the optimal result? It seems the greedy approach misses leaves that start with lower-probability tokens but could lead to a higher-probability sequence overall.

What if we kept track of the top-k possibilities at every iteration?

[Figure: keeping the top-k paths at each step]

This is called beam search.

In [7]:
import torch
from graphviz import Digraph

def beam_search_tree(
    model, tokenizer, input_text, beam_width=3, depth=2, device='cpu'
):
    model.eval()
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

    # Note: this helper expands every top-k continuation of every node to draw the full
    # candidate tree; unlike beam_search_generate below, it does not prune to beam_width beams.
    graph = Digraph(format='png')
    node_id = 0

    def add_node(label):
        nonlocal node_id
        curr_id = f"n{node_id}"
        graph.node(curr_id, label)
        node_id += 1
        return curr_id

    # Add root
    root_id = add_node(input_text)
    nodes = [(root_id, 0.0, input_ids, input_text)]

    for d in range(depth):
        new_nodes = []
        for parent_id, log_prob_sum, beam_ids, beam_text in nodes:
            with torch.no_grad():
                outputs = model(beam_ids)
                logits = outputs.logits
            next_token_logits = logits[:, -1, :]
            probs = torch.log_softmax(next_token_logits, dim=-1)
            topk_log_probs, topk_indices = torch.topk(probs, beam_width)

            for i in range(beam_width):
                new_token_id = topk_indices[0, i].unsqueeze(0).unsqueeze(0)
                new_log_prob = topk_log_probs[0, i].item()
                new_beam_ids = torch.cat([beam_ids, new_token_id], dim=1)
                decoded_token = tokenizer.decode(new_token_id[0])
                new_text = beam_text + decoded_token
                new_score = log_prob_sum + new_log_prob

                child_id = add_node(decoded_token)
                graph.edge(parent_id, child_id, label=f"{new_score:.2f}")
                new_nodes.append((child_id, new_score, new_beam_ids, new_text))

        nodes = new_nodes

    return graph
In [6]:
start_text = "The meaning of life is to"
g = beam_search_tree(model, tokenizer, start_text, beam_width=3, depth=3)
g.render("./img/beam_search_tree", view=True)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Out[6]:
'img/beam_search_tree.png'
In [2]:
import torch
import heapq

def beam_search_generate(
    model, tokenizer, input_text, beam_width=3, max_length=20, device='cpu', verbose=True
):
    """
    Beam search generation from input_text.
    Returns top `beam_width` sequences.
    """
    model.eval()
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

    # Each beam is a tuple: (log_prob_sum, token_ids)
    beams = [(0.0, input_ids)]

    for step in range(max_length):
        new_beams = []

        for log_prob_sum, beam_ids in beams:
            with torch.no_grad():
                outputs = model(beam_ids)
                logits = outputs.logits  # (1, seq_len, vocab_size)

            next_token_logits = logits[:, -1, :]  # Only last token
            probs = torch.log_softmax(next_token_logits, dim=-1)  # Log-probs

            # Top-k tokens
            topk_log_probs, topk_indices = torch.topk(probs, beam_width)

            for i in range(beam_width):
                new_token_id = topk_indices[0, i].unsqueeze(0).unsqueeze(0)
                new_log_prob = topk_log_probs[0, i].item()

                # Append new token and update log probability
                new_beam_ids = torch.cat([beam_ids, new_token_id], dim=1)
                new_beams.append((log_prob_sum + new_log_prob, new_beam_ids))

        # Keep top `beam_width` beams
        beams = heapq.nlargest(beam_width, new_beams, key=lambda x: x[0])

        if verbose:
            print(f"Step {step+1}")
            for i, (score, ids) in enumerate(beams):
                text = tokenizer.decode(ids[0], skip_special_tokens=True)
                print(f"Beam {i+1}: {repr(text)} | log_prob={score:.2f}")
            print("\n" + "-"*50 + "\n")

    return [
        (tokenizer.decode(ids[0], skip_special_tokens=True), score)
        for score, ids in beams
    ]

# Example usage
start_text = "The meaning of life is to"
top_beams = beam_search_generate(model, tokenizer, start_text, beam_width=3, max_length=20)

print("Final Beam Search Results:")
for i, (text, score) in enumerate(top_beams):
    print(f"Beam {i+1}: {repr(text)} | log_prob={score:.2f}")
Step 1
Beam 1: 'The meaning of life is to be' | log_prob=-1.34
Beam 2: 'The meaning of life is to live' | log_prob=-3.20
Beam 3: 'The meaning of life is to say' | log_prob=-3.96

--------------------------------------------------

Step 2
Beam 1: 'The meaning of life is to be understood' | log_prob=-4.55
Beam 2: 'The meaning of life is to live in' | log_prob=-4.59
Beam 3: 'The meaning of life is to be able' | log_prob=-4.80

--------------------------------------------------

Step 3
Beam 1: 'The meaning of life is to be able to' | log_prob=-4.82
Beam 2: 'The meaning of life is to live in a' | log_prob=-5.90
Beam 3: 'The meaning of life is to be understood as' | log_prob=-6.04

--------------------------------------------------

Step 4
Beam 1: 'The meaning of life is to be able to be' | log_prob=-6.76
Beam 2: 'The meaning of life is to be able to live' | log_prob=-7.47
Beam 3: 'The meaning of life is to be understood as a' | log_prob=-7.52

--------------------------------------------------

Step 5
Beam 1: 'The meaning of life is to be able to live in' | log_prob=-9.11
Beam 2: 'The meaning of life is to be able to live with' | log_prob=-9.84
Beam 3: 'The meaning of life is to be able to be able' | log_prob=-9.98

--------------------------------------------------

Step 6
Beam 1: 'The meaning of life is to be able to be able to' | log_prob=-10.00
Beam 2: 'The meaning of life is to be able to live in a' | log_prob=-10.36
Beam 3: 'The meaning of life is to be able to live in harmony' | log_prob=-11.04

--------------------------------------------------

Step 7
Beam 1: 'The meaning of life is to be able to live in harmony with' | log_prob=-11.68
Beam 2: 'The meaning of life is to be able to be able to be' | log_prob=-11.73
Beam 3: 'The meaning of life is to be able to be able to live' | log_prob=-12.57

--------------------------------------------------

Step 8
Beam 1: 'The meaning of life is to be able to live in harmony with the' | log_prob=-12.88
Beam 2: 'The meaning of life is to be able to be able to be able' | log_prob=-13.92
Beam 3: 'The meaning of life is to be able to live in harmony with nature' | log_prob=-13.94

--------------------------------------------------

Step 9
Beam 1: 'The meaning of life is to be able to be able to be able to' | log_prob=-13.93
Beam 2: 'The meaning of life is to be able to live in harmony with nature,' | log_prob=-15.20
Beam 3: 'The meaning of life is to be able to live in harmony with nature.' | log_prob=-15.39

--------------------------------------------------

Step 10
Beam 1: 'The meaning of life is to be able to be able to be able to be' | log_prob=-15.05
Beam 2: 'The meaning of life is to be able to be able to be able to live' | log_prob=-16.69
Beam 3: 'The meaning of life is to be able to live in harmony with nature, and' | log_prob=-16.86

--------------------------------------------------

Step 11
Beam 1: 'The meaning of life is to be able to be able to be able to be able' | log_prob=-16.15
Beam 2: 'The meaning of life is to be able to live in harmony with nature, and to' | log_prob=-17.98
Beam 3: 'The meaning of life is to be able to be able to be able to live in' | log_prob=-18.64

--------------------------------------------------

Step 12
Beam 1: 'The meaning of life is to be able to be able to be able to be able to' | log_prob=-16.16
Beam 2: 'The meaning of life is to be able to live in harmony with nature, and to be' | log_prob=-19.31
Beam 3: 'The meaning of life is to be able to be able to be able to live in a' | log_prob=-19.69

--------------------------------------------------

Step 13
Beam 1: 'The meaning of life is to be able to be able to be able to be able to be' | log_prob=-16.77
Beam 2: 'The meaning of life is to be able to be able to be able to be able to live' | log_prob=-19.40
Beam 3: 'The meaning of life is to be able to be able to be able to be able to have' | log_prob=-19.47

--------------------------------------------------

Step 14
Beam 1: 'The meaning of life is to be able to be able to be able to be able to be able' | log_prob=-17.25
Beam 2: 'The meaning of life is to be able to be able to be able to be able to have a' | log_prob=-20.80
Beam 3: 'The meaning of life is to be able to be able to be able to be able to be a' | log_prob=-21.25

--------------------------------------------------

Step 15
Beam 1: 'The meaning of life is to be able to be able to be able to be able to be able to' | log_prob=-17.26
Beam 2: 'The meaning of life is to be able to be able to be able to be able to be a person' | log_prob=-23.25
Beam 3: 'The meaning of life is to be able to be able to be able to be able to have a sense' | log_prob=-23.46

--------------------------------------------------

Step 16
Beam 1: 'The meaning of life is to be able to be able to be able to be able to be able to be' | log_prob=-17.61
Beam 2: 'The meaning of life is to be able to be able to be able to be able to be able to have' | log_prob=-20.70
Beam 3: 'The meaning of life is to be able to be able to be able to be able to be able to live' | log_prob=-21.11

--------------------------------------------------

Step 17
Beam 1: 'The meaning of life is to be able to be able to be able to be able to be able to be able' | log_prob=-17.85
Beam 2: 'The meaning of life is to be able to be able to be able to be able to be able to have a' | log_prob=-21.94
Beam 3: 'The meaning of life is to be able to be able to be able to be able to be able to be a' | log_prob=-22.61

--------------------------------------------------

Step 18
Beam 1: 'The meaning of life is to be able to be able to be able to be able to be able to be able to' | log_prob=-17.86
Beam 2: 'The meaning of life is to be able to be able to be able to be able to be able to be a person' | log_prob=-24.57
Beam 3: 'The meaning of life is to be able to be able to be able to be able to be able to have a life' | log_prob=-24.67

--------------------------------------------------

Step 19
Beam 1: 'The meaning of life is to be able to be able to be able to be able to be able to be able to be' | log_prob=-18.06
Beam 2: 'The meaning of life is to be able to be able to be able to be able to be able to be able to have' | log_prob=-21.49
Beam 3: 'The meaning of life is to be able to be able to be able to be able to be able to be able to live' | log_prob=-22.31

--------------------------------------------------

Step 20
Beam 1: 'The meaning of life is to be able to be able to be able to be able to be able to be able to be able' | log_prob=-18.19
Beam 2: 'The meaning of life is to be able to be able to be able to be able to be able to be able to have a' | log_prob=-22.67
Beam 3: 'The meaning of life is to be able to be able to be able to be able to be able to be able to be a' | log_prob=-23.58

--------------------------------------------------

Final Beam Search Results:
Beam 1: 'The meaning of life is to be able to be able to be able to be able to be able to be able to be able' | log_prob=-18.19
Beam 2: 'The meaning of life is to be able to be able to be able to be able to be able to be able to have a' | log_prob=-22.67
Beam 3: 'The meaning of life is to be able to be able to be able to be able to be able to be able to be a' | log_prob=-23.58

We can just use the transformers library and make our lives easier:¶

In [14]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

input_text = "The meaning of life is to"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate using built-in beam search
beam_output = model.generate(
    input_ids,
    max_length=10,
    num_beams=3,
    early_stopping=True,
    num_return_sequences=3   # Get top 3 beams
)


# Decode result
for i, output in enumerate(beam_output):
    text = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Beam {i+1}: {repr(text)}")
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Beam 1: 'The meaning of life is to be able to be'
Beam 2: 'The meaning of life is to be able to live'
Beam 3: 'The meaning of life is to be understood as a'

Training¶

Training a Transformer model involves teaching it to predict outputs from inputs using supervised learning.


[Figure: Transformer training setup]

Core Idea:¶

  • The Transformer learns to minimize a loss function that measures the difference between its predicted outputs and the ground-truth targets.
  • Training is done end-to-end with gradient descent.

🚀 Training Steps:¶

  1. Input Preparation:

    • Encoder receives the input sequence (e.g., source sentence).
    • Decoder receives the target sequence shifted right (teacher forcing).
  2. Forward Pass:

    • Encoder outputs hidden states.
    • Decoder generates predictions token-by-token, attending to encoder outputs (via cross-attention).
  3. Loss Computation:

    • Compare decoder predictions to ground truth using a loss function, typically cross-entropy loss.
  4. Backward Pass:

    • Compute gradients of the loss with respect to all model parameters (weights).
    • Backpropagate through attention, feed-forward layers, etc.
  5. Parameter Update:

    • Use an optimizer (e.g., Adam) to update parameters based on gradients.
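
To make these steps concrete, here is a minimal, self-contained sketch of one training step with PyTorch's nn.Transformer. It is a toy setup with random token IDs (not a model from this lecture): teacher forcing on a right-shifted target, cross-entropy loss, a backward pass, and an Adam update.

In [ ]:
import torch
import torch.nn as nn

# Toy encoder-decoder training step (made-up dimensions and random data)
vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)
lm_head = nn.Linear(d_model, vocab_size)
params = list(embed.parameters()) + list(transformer.parameters()) + list(lm_head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

src = torch.randint(0, vocab_size, (8, 12))   # source sequences (batch, src_len)
tgt = torch.randint(0, vocab_size, (8, 10))   # target sequences (batch, tgt_len)

# 1. Input preparation: decoder input is the target shifted right (teacher forcing)
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
causal_mask = transformer.generate_square_subsequent_mask(tgt_in.size(1))

# 2. Forward pass: encoder states + decoder predictions (cross-attention happens inside)
hidden = transformer(embed(src), embed(tgt_in), tgt_mask=causal_mask)
logits = lm_head(hidden)                      # (batch, tgt_len - 1, vocab_size)

# 3. Loss: cross-entropy between predictions and the ground-truth next tokens
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))

# 4. Backward pass and 5. parameter update
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(loss.item())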

Special tokens¶

Let's look back at the tokenized input from before:

In [4]:
from transformers import AutoTokenizer

# Load model
tokenizer = AutoTokenizer.from_pretrained("nreimers/MiniLM-L6-H384-uncased")

# Input text
text = "The studious student graduated with honors."

# Proper tokenization that adds [CLS] and [SEP]
inputs = tokenizer(text, add_special_tokens=True)

# Convert ids back to readable tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"])
print(tokens)

## Note: tokenizer.tokenize(text) does not add the special tokens; it only splits the text into wordpieces
['[CLS]', 'the', 'studio', '##us', 'student', 'graduated', 'with', 'honors', '.', '[SEP]']

[CLS]? [SEP]? What?¶

What are [CLS] and [SEP] tokens?

  • [CLS] (classification token): special token added at the beginning of every input sequence; used to summarize the sentence.
  • [SEP] (separator token): special token used to separate different segments (like sentence A and sentence B); also marks the end of a single sentence.

[CLS] token¶

  • Short for "CLaSsification" token.
  • Inserted at the start of every input.
  • In models like BERT, the final hidden state corresponding to [CLS] is used as the sentence representation.
  • Example usage:
    • For sentence classification (e.g., spam detection), only the [CLS] output vector is used for prediction.
    • For QA tasks, [CLS] may be used to predict if an answer exists.

You can think of [CLS] as a "summary slot" that captures the overall meaning.


[SEP] token¶

  • Short for "Separator" token.
  • Inserted between segments (for example, between two sentences).
  • Marks the end of a sentence or segment.
  • In tasks like Question-Answering, the input might be:
    [CLS] Question tokens [SEP] Paragraph tokens [SEP]
    
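As a quick sketch (reusing the MiniLM tokenizer from the earlier cell), passing a sentence pair to the tokenizer produces exactly this layout, and token_type_ids mark which segment each token belongs to. The question/paragraph text here is made up for illustration.

In [ ]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nreimers/MiniLM-L6-H384-uncased")

question = "Who graduated with honors?"
paragraph = "The studious student graduated with honors."

# Passing a pair builds: [CLS] question tokens [SEP] paragraph tokens [SEP]
enc = tokenizer(question, paragraph, add_special_tokens=True)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])   # 0 = segment A (question), 1 = segment B (paragraph)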

[CLS] always at the start?¶

If the [CLS] token always starts an input, why do we include it?

[Figure: the [CLS] token at the start of the input. Image from [1].]

Here are the reasons:

  • Training flexibility: during training, the model treats [CLS] as a normal token position; it has an embedding and hidden states like any other token. It's not "magic"; it's learned.
  • Pooling layer: the final [CLS] hidden state is fed into a classification head (MLP, linear layer, etc.) for tasks like sentence classification.
  • Attention mechanism: [CLS] participates in self-attention; it can attend to the other tokens and they can attend to it, giving the model a "collector" node inside the attention dynamics.
  • Multi-segment inputs: if you have two sentences (e.g., question + answer), the input is [CLS] SentenceA [SEP] SentenceB [SEP]; without [CLS], it would be ambiguous where to pool for classification.
  • Pretraining / finetuning separation: [CLS] allows pretrained models to be flexibly finetuned for different tasks without changing the architecture.
  • It has its own embedding.
  • It produces its own hidden state.
  • It interacts dynamically with attention.
  • It becomes a “summary vector” because the model learns to make it one.
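
A small sketch of pulling out that summary vector: run the encoder and take the final hidden state at position 0 (the [CLS] slot). The classification head here is a hypothetical untrained linear layer, just to show where the vector would go.

In [ ]:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nreimers/MiniLM-L6-H384-uncased")
encoder = AutoModel.from_pretrained("nreimers/MiniLM-L6-H384-uncased")
encoder.eval()

inputs = tokenizer("The studious student graduated with honors.", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

cls_vector = outputs.last_hidden_state[:, 0, :]   # hidden state at the [CLS] position
print(cls_vector.shape)                           # (1, hidden_size)

# A task-specific head (e.g., spam vs. not spam) would consume this vector
head = torch.nn.Linear(cls_vector.size(-1), 2)    # hypothetical, untrained
print(head(cls_vector))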

Different architectures and tokenizers have different conventions!¶

  • BERT: [CLS], [SEP], [PAD], [MASK], [UNK]
  • RoBERTa: <s>, </s>, <pad>, <mask>, <unk>
  • GPT-2: no explicit special tokens by default (but <bos>, <eos>, and <pad> can be added)
  • T5: uses <pad> and <extra_id_N> masking tokens

Why Special Tokens Matter:

  • Special tokens signal different behavior to the model.
  • They enable multitask training: classification, translation, masked prediction, etc.
  • They make batch training efficient by padding shorter sequences.
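
Each tokenizer exposes its own conventions, so they are easy to inspect. A small sketch, assuming these checkpoints (and their tokenizer dependencies) can be downloaded:

In [ ]:
from transformers import AutoTokenizer

# Print the special-token conventions of a few common checkpoints
for name in ["bert-base-uncased", "roberta-base", "distilgpt2", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {tok.special_tokens_map}")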

Lots of special tokens¶

There are lots of special tokens commonly used in LLMs.

  • [CLS] (classification token): summarizes the input for tasks like classification.
  • [SEP] (separator token): separates different segments or sentences.
  • [PAD] (padding token): fills empty space so all sequences in a batch are the same length.
  • [MASK] (masking token): used during training (especially in BERT) to hide tokens for prediction.
  • [BOS] (beginning of sequence): used in decoder models (e.g., GPT) to mark the start of generation.
  • [EOS] (end of sequence): marks the end of a sentence or sequence.
  • [UNK] (unknown token): used when the tokenizer encounters a word it doesn't recognize.
  • <s> / </s> (start / end of sequence, RoBERTa- and T5-style tokenizers): similar to [BOS] / [EOS], but with a different tokenization scheme.

Whether they are used depends on the model.

Layers in Transformers¶

Let's think back to recurrent neural networks. There were variants that had multiple layers. What did their layer structure look like?

[Figure: multi-layer RNN]

What would a layered transformer network look like?

[Figure: multi-layer transformer network]
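
A minimal sketch with PyTorch built-ins: a "layered" transformer is just the same block repeated, with each layer's output feeding the next. The widths below roughly mirror the distilgpt2 printout above (6 blocks, width 768), but the model itself is an untrained toy.

In [ ]:
import torch
import torch.nn as nn

# Stack 6 transformer blocks with the same architecture (each has its own weights)
block = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072,
                                   batch_first=True)
stack = nn.TransformerEncoder(block, num_layers=6)

x = torch.randn(2, 16, 768)   # (batch, sequence length, embedding dim)
out = stack(x)                # the output of each layer is the input to the next
print(out.shape)              # torch.Size([2, 16, 768])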

Bidirectionality¶

Remember we have bidirectional RNNs?


[Figure: bidirectional RNN. Image credit: [6].]
What does a bidirectional transformer look like?

Well, all transformers are bidirectional, right?¶

[Figure]

Bidirectional Encoder Representations from Transformers (BERT) Model [2]¶

BERT is an example of an encoder-only architecture:

[Figure: BERT, an encoder-only architecture. Image from [1].]

Encoder-Only Characteristics¶

  • Input: Takes in the entire sequence of tokens at once (e.g., a sentence or sentence pair).
  • Self-Attention: Uses bidirectional self-attention, meaning every token attends to both its left and right context.
  • Output: Produces contextualized embeddings for all input tokens simultaneously.
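
A quick sketch of what bidirectionality buys us: with the MiniLM encoder from earlier, the hidden state for "bank" changes when only the words to its right change, so information must be flowing backward from future tokens as well. The example sentences are made up.

In [ ]:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nreimers/MiniLM-L6-H384-uncased")
encoder = AutoModel.from_pretrained("nreimers/MiniLM-L6-H384-uncased")
encoder.eval()

def bank_embedding(text):
    inputs = tokenizer(text, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index("bank")                     # position of the word "bank"
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    return hidden[0, idx]

a = bank_embedding("I sat by the bank of the river.")
b = bank_embedding("I sat by the bank to open an account.")
# Same left context, same position: any difference comes from the *right* context
print(torch.cosine_similarity(a, b, dim=0))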

What BERT Doesn’t Do¶

  • It does not generate text token-by-token like GPT.
  • It does not include a decoder component (unlike GPT or encoder-decoder models like T5).

Language model training methods¶

Masked Language Modeling (MLM)¶

  • Objective: Predict masked tokens in a sentence using bidirectional context.
  • Procedure:
    • Randomly select 15% of the tokens:
      • 80% → replaced with [MASK]
      • 10% → replaced with a random token
      • 10% → left unchanged
  • Example:
    Input:  "The quick brown [MASK] jumps over the lazy dog."
    Target: "fox"
    
  • Loss: Cross-entropy loss computed only on the masked positions.
[Figure: masked language modeling. Image from [1].]
In [ ]:
from transformers import BertTokenizer, BertForMaskedLM, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# Load pretrained tokenizer and a model with an MLM head
# (BertForMaskedLM: the collator below produces MLM labels only, not NSP labels)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Load small dataset and tokenize
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")  # tiny sample for demo
def tokenize_function(examples):
    return tokenizer(examples["text"], return_special_tokens_mask=True, truncation=True, padding="max_length", max_length=128)
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Masking strategy for MLM
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Training setup
training_args = TrainingArguments(
    output_dir="./bert-pretraining-demo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    save_steps=10_000,
    save_total_limit=2,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

trainer.train()
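
We don't need to pretrain anything to try MLM, though: the released bert-base-uncased checkpoint can already fill in masks. A small sketch with the fill-mask pipeline on the example sentence above:

In [ ]:
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The quick brown [MASK] jumps over the lazy dog."):
    print(f"{pred['token_str']:>10}  score={pred['score']:.3f}")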

Next Sentence Prediction (NSP)¶

  • Objective: Predict if Sentence B follows Sentence A in the original document.
  • Training Setup:
    • 50% of the time: Sentence B is the actual next sentence → IsNext
    • 50% of the time: Sentence B is randomly sampled → NotNext
  • Example:
    Sentence A: "He went to the store."
    Sentence B: "He bought milk." → IsNext
    Sentence B: "The Eiffel Tower is in Paris." → NotNext
    
  • Loss: Binary classification loss on the [CLS] token representation.

Input Format¶

[CLS] Sentence A [SEP] Sentence B [SEP]
  • [CLS] → Classification token (used for NSP loss)
  • [SEP] → Separator token between segments
  • Segment embeddings distinguish Sentence A vs Sentence B
  • Positional embeddings help track token order
[Figure: BERT input format with token, segment, and positional embeddings. Image from [1].]
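
A small sketch of NSP at inference time, scoring the example pairs above with the pretrained NSP head of bert-base-uncased:

In [ ]:
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
nsp_model.eval()

sentence_a = "He went to the store."
for sentence_b in ["He bought milk.", "The Eiffel Tower is in Paris."]:
    # Builds [CLS] sentence A [SEP] sentence B [SEP] with segment ids
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
    with torch.no_grad():
        logits = nsp_model(**inputs).logits      # index 0 = IsNext, index 1 = NotNext
    probs = torch.softmax(logits, dim=-1)[0]
    print(f"{sentence_b!r}: P(IsNext) = {probs[0]:.3f}")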

Are all transformers bidirectional?¶

[Figure]

Generative Pre-trained Transformer (GPT)¶

A decoder-only model uses just the decoder stack from the original Transformer architecture. A prime example is GPT (e.g., GPT-2, GPT-3, GPT-4).

[Figure: decoder-only (GPT-style) Transformer. Image from [3].]

Key Characteristics¶

  • Architecture: Only the Transformer decoder is used.
  • Causal Masking: The model cannot attend to future tokens — only past tokens.
  • Auto-regressive: Generates tokens one at a time, conditioning on previously generated tokens.
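
A tiny sketch of what causal masking looks like in practice: positions above the diagonal get -inf before the softmax, so each token's attention weight on future tokens is (effectively) zero. The scores here are random placeholders.

In [ ]:
import torch

seq_len = 5
# Causal (look-ahead) mask: -inf above the diagonal, 0 on and below it
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(mask)

scores = torch.randn(seq_len, seq_len)          # stand-in attention scores
weights = torch.softmax(scores + mask, dim=-1)  # future positions get ~0 weight
print(weights)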

What It Lacks¶

  • No bidirectional context like BERT.
  • No encoder stack, so not ideal for classification or sentence-pair tasks.

Some important models¶

Model Encoder Decoder Use Case
BERT ✅ ❌ Understanding tasks (e.g., classification, QA)
GPT ❌ ✅ Generative tasks (e.g., text generation)
T5 ✅ ✅ Seq2Seq tasks (e.g., translation, summarization)
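
In the transformers library these three families map onto different Auto classes. A quick sketch, with small checkpoints chosen so it runs on CPU (assuming they can be downloaded):

In [ ]:
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

bert = AutoModel.from_pretrained("bert-base-uncased")        # encoder-only
gpt2 = AutoModelForCausalLM.from_pretrained("distilgpt2")    # decoder-only
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")       # encoder-decoder

for name, m in [("BERT", bert), ("GPT-2 (distil)", gpt2), ("T5", t5)]:
    n_params = sum(p.numel() for p in m.parameters())
    print(f"{name}: {n_params/1e6:.1f}M parameters")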

That's it for today¶

  • HW10 is due on Monday.
  • Midterm next Thursday.
    • Skillset/cheat-sheet will be released by today.
    • Give me till the weekend to release the sample. Questions are still in flux, so we need an extra day or two.
  • And most importantly, have a good weekend.

References¶

[1] Alammar, J. "The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)" - https://jalammar.github.io/illustrated-bert/

[2] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 2019.

[3] Wolfe, C. "Decoder-Only Transformers: The Workhorse of Generative LLMs" - https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse

Other transformer resources:¶

  • Elhage, N. et al. "A Mathematical Framework for Transformer Circuits"
  • Alammar J. et al. "The Illustrated Transformer"