Overview¶
- Basics: word embeddings
- Word2Vec, FastText, GloVe
- Sequence-to-sequence and autoregressive models
- Self-attention and transformer models
- Vision Transformers
Bag of words representation¶
- First, build a vocabulary of all occurring words. Map every word to an index.
- Represent each document as an $N$ dimensional vector (top-$N$ most frequent words)
- One-hot (sparse) encoding: 1 if the word occurs in the document
- Destroys the order of the words in the text (hence, a 'bag' of words)
Text preprocessing pipelines¶
- Tokenization: how do you split text into words / tokens?
- Stemming: naive reduction to word stems. E.g. 'the meeting' to 'the meet'
- Lemmatization: NLP-based reduction, e.g. distinguishes between nouns and verbs
- Discard stop words ('the', 'an',...)
- Only use $N$ (e.g. 10000) most frequent words, or a hash function
- n-grams: use combinations of $n$ adjacent words in addition to individual words (see the sketch after this list)
- e.g. 2-grams: "awesome movie", "movie with", "with creative", ...
- Character n-grams: combinations of $n$ adjacent letters: 'awe', 'wes', 'eso',...
- Subword tokenizers: graceful splits "unbelievability" -> un, believ, abil, ity
- Useful libraries: nltk, spaCy, gensim, HuggingFace tokenizers,...
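As a quick illustration, word and character n-grams can be computed in a few lines of plain Python (toy sentence, purely for illustration):
sentence = "awesome movie with creative plot"
words = sentence.split()                       # naive whitespace tokenization
word_bigrams = [" ".join(words[i:i+2]) for i in range(len(words) - 1)]
# ['awesome movie', 'movie with', 'with creative', 'creative plot']
word = "awesome"
char_trigrams = [word[i:i+3] for i in range(len(word) - 2)]
# ['awe', 'wes', 'eso', 'som', 'ome']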
Neural networks on bag of words¶
- We can build neural networks on bag-of-word vectors
- Use a one-hot encoding of the 10,000 most frequent words
- Simple model with 2 dense layers, ReLU activation, dropout
self.model = nn.Sequential(
    nn.Linear(10000, 16),   # 10,000-dim bag-of-words input -> 16 hidden units
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(16, 1)        # single output: positive/negative logit
)
Evaluation¶
- IMDB dataset of movie reviews (label is 'positive' or 'negative')
- Take a validation set of 10,000 samples from the training set
- Works pretty well (88% accuracy), but overfits easily
`Trainer.fit` stopped: `max_epochs=15` reached.
Predictions¶
Let's look at a few predictions. Why is the last one so negative?
Review 0: [START] please give this one a miss br br [UNK] [UNK] and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite [UNK] so all you madison fans give this a miss
Predicted positiveness: 0.15110373
Review 16: [START] from 1996 first i watched this movie i feel never reach the end of my satisfaction i feel that i want to watch more and more until now my god i don't believe it was ten years ago and i can believe that i almost remember every word of the dialogues i love this movie and i love this novel absolutely perfection i love willem [UNK] he has a strange voice to spell the words black night and i always say it for many times never being bored i love the music of it's so much made me come into another world deep in my heart anyone can feel what i feel and anyone could make the movie like this i don't believe so thanks thanks
Predicted positiveness: 0.99687344
Review X: [START] the restaurant is not too terrible
Predicted positiveness: 0.8728
Word Embeddings¶
- A word embedding is a numeric vector representation of a word
- Can be manual or learned from an existing representation (e.g. one-hot)

Learning embeddings from scratch¶
- Input layer uses fixed length documents (with 0-padding).
- Add an embedding layer to learn the embedding
- Create $n$-dimensional one-hot encoding.
- To learn an $m$-dimensional embedding, use $m$ hidden nodes. Weight matrix $W^{n \times m}$
- Linear activation function: $\mathbf{X}_{embed} = W \mathbf{X}_{orig}$.
- Combine all word embeddings into a document embedding (e.g. global pooling).
- Add layers to map word embeddings to the output. Learn embedding weights from data.

Let's try this:
max_length = 100         # pad documents to a maximum number of words
vocab_size = 10000       # vocabulary size
embedding_length = 20    # embedding length (more would be better)
self.model = nn.Sequential(
    nn.Embedding(vocab_size, embedding_length),  # (batch, seq) -> (batch, seq, embed)
    nn.AdaptiveAvgPool1d(1),  # global average pooling over the sequence
                              # (expects (batch, embed, seq), so the embedding output
                              #  has to be transposed before pooling)
    nn.Flatten(),             # (batch, embed, 1) -> (batch, embed)
    nn.Linear(embedding_length, 1),
)
- Training on the IMDB dataset: slightly worse than using bag-of-words?
- Embedding of dim 20 is very small, should be closer to 100 (or 300)
- We don't have enough data to learn a really good embedding from scratch
`Trainer.fit` stopped: `max_epochs=15` reached.
Pre-trained embeddings¶
- With more data we can build better embeddings, but we also need more labels
- Solution: transfer learning! Learn embedding on auxiliary task that doesn't require labels
- E.g. given a word, predict the surrounding words.
- Also called self-supervised learning. Supervision is provided by data itself
- Freeze embedding weights to produce simple word embeddings, or fine-tune them on a new task
- Most common approaches:
- Word2Vec: Learn neural embedding for a word based on surrounding words
- FastText: learns embedding for character n-grams
- Can also produce embeddings for new, unseen words
- GloVe (Global Vector): Count co-occurrences of words in a matrix
- Use a low-rank approximation to get a latent vector representation
Word2Vec¶
- Move a window over text to get $C$ context words ($V$-dim one-hot encoded)
- Add embedding layer with $N$ linear nodes, global average pooling, and softmax layer(s)
- CBOW: predict word given context, use weights of last layer $W'_{N \times V}$ as embedding (a CBOW sketch follows below)
- Skip-Gram: predict context given word, use weights of first layer $W^{T}_{V \times N}$ as embedding
- Scales to larger text corpora, learns relationships between words better
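As a rough sketch of the CBOW variant in PyTorch (illustrative sizes; a real implementation would add negative sampling or a hierarchical softmax for efficiency):
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300):   # illustrative sizes
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # 'first layer' W (V x N)
        self.out = nn.Linear(embed_dim, vocab_size)        # 'last layer' W' (N x V)

    def forward(self, context):                # context: (batch, C) word indices
        h = self.embed(context).mean(dim=1)    # average the C context embeddings
        return self.out(h)                     # logits over the vocabulary
Training on (context, center word) pairs with nn.CrossEntropyLoss then turns embed.weight (or out.weight) into the word embedding matrix.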
Word2Vec properties¶
- Word2Vec happens to learn interesting relationships between words
- Simple vector arithmetic can map words to plurals, conjugations, gender analogies,...
- e.g. Gender relationships: $vec_{king} - vec_{man} + vec_{woman} \sim vec_{queen}$
- PCA applied to embeddings shows Country - Capital relationship
- Careful: embeddings can capture gender and other biases present in the data.
- Important unsolved problem!
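This is easy to try with pre-trained vectors (a sketch using gensim's downloader; the vectors are downloaded on first use):
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # pre-trained 100-dim word vectors
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically appears among the top results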
Doc2Vec¶
- Alternative way to combine word embeddings (instead of global pooling)
- Adds a paragraph (or document) embedding: learns how paragraphs (or docs) relate to each other
- Captures document-level semantics: context and meaning of entire document
- Can be used to determine semantic similarity between documents.
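A minimal sketch with gensim's Doc2Vec (toy corpus and illustrative parameters):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [["an", "awesome", "movie"], ["a", "terrible", "movie"]]   # toy documents
docs = [TaggedDocument(words, tags=[i]) for i, words in enumerate(corpus)]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=20)
vec = model.infer_vector(["not", "too", "terrible"])   # embedding for a new document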
FastText¶
- Limitations of Word2Vec:
- Cannot represent new (out-of-vocabulary) words
- Similar words are learned independently: less efficient (no parameter sharing)
- E.g. 'meet' and 'meeting'
- FastText: same model, but uses character n-grams
- Words are represented by all character n-grams of length 3 to 6
- "football" 3-grams: <fo, foo, oot, otb, tba, bal, all, ll>
- Because there are so many n-grams, they are hashed (dimensionality = bin size)
- Representation of word "football" is sum of its n-gram embeddings
- Negative sampling: also trains on random negative examples (out-of-context words)
- Weights are updated so that they are less likely to be predicted
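A minimal sketch with gensim's FastText (toy corpus and illustrative parameters):
from gensim.models import FastText

sentences = [["the", "meeting"], ["we", "meet", "tomorrow"]]   # toy corpus
model = FastText(sentences, vector_size=50, min_count=1, min_n=3, max_n=6)

vec = model.wv["meetings"]   # works even for words not seen during training,
                             # because the vector is the sum of its character n-gram embeddings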
Global Vector model (GloVe)¶
- Builds a co-occurrence matrix $\mathbf{X}$: counts how often 2 words occur in the same context
- Learns a k-dimensional embedding $W$ through matrix factorization with rank k
- Actually learns 2 embeddings $W$ and $W'$ (differ in random initialization)
- Minimizes loss $\mathcal{L}$, where $b_i$ and $b'_j$ are bias terms and $f$ is a weighting function
$$\mathcal{L} = \sum_{i,j=1}^{V} f(\mathbf{X}_{ij}) \left(\mathbf{w}_i^T \mathbf{w}'_j + b_i + b'_j - \log(\mathbf{X}_{ij})\right)^2$$
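The loss translates almost directly into code (a sketch in PyTorch; the dense co-occurrence matrix X, the cutoff 100, and the exponent 0.75 from the GloVe paper are assumptions for illustration):
import torch

V, k = X.shape[0], 100                      # vocabulary size, embedding dimension
W  = torch.randn(V, k, requires_grad=True)  # embedding W
W_ = torch.randn(V, k, requires_grad=True)  # embedding W'
b  = torch.zeros(V, requires_grad=True)     # biases b_i
b_ = torch.zeros(V, requires_grad=True)     # biases b'_j

f = torch.clamp(X / 100.0, max=1.0) ** 0.75                 # weighting function f(X_ij)
err = W @ W_.T + b[:, None] + b_[None, :] - torch.log(X.clamp(min=1e-12))
loss = (f * (X > 0) * err ** 2).sum()                       # only non-zero co-occurrences count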
Let's try this
- Download the GloVe embeddings trained on Wikipedia
- We can now get embeddings for 400,000 English words
- E.g. 'queen' (50 first values of 300-dim embedding)
array([-0.222, 0.065, -0.086, 0.513, 0.325, -0.129, 0.083, 0.092, -0.309, -0.941, -0.089, -0.108, 0.211, 0.701, 0.268, -0.04 , 0.174, -0.308, -0.052, -0.175, -0.841, 0.192, -0.138, 0.385, 0.272, -0.174, -0.466, -0.025, 0.097, 0.301, 0.18 , -0.069, -0.205, 0.357, -0.283, 0.281, -0.012, 0.107, -0.244, -0.179, -0.132, -0.17 , -0.594, 0.957, 0.204, -0.043, 0.607, -0.069, 0.523, -0.548], dtype=float32)
- Same simple model, but with frozen GloVe embeddings: much worse!
- Linear layer is too simple. We need something more complex -> transformers :)
embedding_tensor = torch.tensor(embedding_matrix, dtype=torch.float32)
self.model = nn.Sequential(
    nn.Embedding.from_pretrained(embedding_tensor, freeze=True),  # frozen GloVe vectors
    nn.AdaptiveAvgPool1d(1),  # global average pooling (expects (batch, embed, seq),
                              #  so the embedding output has to be transposed first)
    nn.Flatten(),             # (batch, embed, 1) -> (batch, embed)
    nn.Linear(embedding_tensor.shape[1], 1))
`Trainer.fit` stopped: `max_epochs=30` reached.
Sequence-to-sequence (seq2seq) models¶
- Global average pooling or flattening destroys the word order
- We need to model sequences explicitly, e.g.:
- 1D convolutional models: run a 1D filter over the input data
- Fast, but can only look at small part of the sentence
- Recurrent neural networks (RNNs)
- Can look back at the entire previous sequence
- Much slower to train, have limited memory in practice
- Attention-based networks (Transformers)
- Best of both worlds: fast and very long memory
seq2seq models¶
- Produce a series of outputs given a series of inputs over time
- Can handle sequences of different lengths
- Label-to-sequence, Sequence-to-label, seq2seq,...
- Autoregressive models (e.g. predict the next character, unsupervised)
1D convolutional networks¶
- Similar to 2D convnets, but moves only in 1 direction (time)
- Extract local 1D patch, apply filter (kernel) to every patch
- Pattern learned can later be recognized elsewhere (translation invariance)
- Limited memory: only sees a small part of the sequence (receptive field)
- You can use multiple layers, dilations,... but becomes expensive
- Looks at 'future' parts of the series, but can be made to look only at the past
- Known as 'causal' models (not related to causality)
- Same embedding, but add 2 Conv1d layers and MaxPool1d. Better!
model = nn.Sequential(
    nn.Embedding(num_embeddings=10000, embedding_dim=embedding_dim),
    # note: Conv1d expects (batch, channels, seq), so the (batch, seq, embed)
    # embedding output has to be transposed before the convolutions
    nn.Conv1d(in_channels=embedding_dim, out_channels=32, kernel_size=7),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=5),
    nn.Conv1d(in_channels=32, out_channels=32, kernel_size=7),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),  # global average pooling over the remaining sequence
    nn.Flatten(),             # (batch, 32, 1) -> (batch, 32)
    nn.Linear(32, 1)
)
Recurrent neural networks (RNNs)¶
- Recurrent connection: concatenates the previous hidden state to the next input ${\color{orange} h_t} = \sigma \left( {\color{orange} W } \left[ \begin{array}{c} {\color{blue}x}_t \\ {\color{orange} h}_{t-1} \end{array} \right] + b \right)$
- Unbounded memory, but training requires backpropagation through time
- Requires storing previous network states (slow + lots of memory)
- Vanishing gradients strongly limit practical memory
- Improved with gating: learn what to input, forget, and output (LSTMs, GRUs,...); see the sketch below
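For comparison, a minimal LSTM-based sentiment model could look like this (a sketch with illustrative sizes; this is not the model trained above):
import torch.nn as nn

class LSTMSentiment(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):                  # x: (batch, seq) of word indices
        h, _ = self.lstm(self.embed(x))    # h: (batch, seq, hidden)
        return self.out(h[:, -1])          # use the last hidden state for the prediction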
Simple self-attention¶
- Compute dot product of input vector $x_i$ with every $x_j$ (including itself): ${\color{Orange} w_{ij}}$
- Compute softmax over all these weights (positive, sum to 1)
- Multiply by each input vector, and sum everything up
- Can be easily vectorized: ${\color{green} Y}^T = {\color{orange} W}{\color{blue} X^T}$, ${\color{orange} W} = \textrm{softmax}( {\color{blue} X}^T {\color{blue}X} )$
- For each output, we mix information from all inputs according to how 'similar' they are
- The set of weights ${\color{Orange} w_{i}}$ for a given token is called the attention vector
- It says how much 'attention' each token gives to other tokens
- Doesn't learn (no parameters), the embedding of ${\color{blue} X}$ defines self-attention
- We'll learn how to transform the embeddings later
- That way we can learn different relationships (not just similarity)
- Has no problem looking very far back in the sequence
- Operates on sets (permutation invariant): allows img-to-set, set-to-set,... tasks
- If the token order matters, we'll have to encode it in the token embedding
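The vectorized version above is only a couple of lines (a sketch for a single sequence X of shape (seq, embed)):
import torch
import torch.nn.functional as F

X = torch.randn(5, 16)               # 5 tokens with 16-dimensional embeddings
W = F.softmax(X @ X.T, dim=-1)       # attention weights: each row sums to 1
Y = W @ X                            # each output is a weighted mix of all inputs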
Scaled dot products¶
- Self-attention is powerful because it's mostly a linear operation
- ${\color{green} Y}^T = {\color{orange} W}{\color{blue} X^T}$ is linear, there are no vanishing gradients
- The softmax function only applies to ${\color{orange} W} = \textrm{softmax}( {\color{blue} X}^T {\color{blue}X} )$, not to ${\color{green} Y}^T$
- Needed to make the attention values sum up nicely to 1 without exploding
- The dot products grow larger as the embedding dimension $k$ grows (roughly by a factor $\sqrt{k}$)
- We therefore normalize the dot product by $\sqrt{k}$: ${\color{orange}w^{'}_{ij}} = \frac{{\color{blue} x_i}^T \color{blue} x_j}{\sqrt{k}}$
- This also makes training more stable: large softmax values lead to 'sharp' outputs, making some gradients very large and others very small
Simple self-attention layer¶
- Let's add a simple self-attention layer to our movie sentiment model
- Without self-attention, every word would contribute independently (bag of words)
- The word terrible will likely result in a negative prediction
- Now, we can freeze the embedding, take output ${\color{gray}Y}$, obtain a loss, and do backpropagation so that the self-attention layer can learn that 'not' should invert the meaning of 'terrible'
Simple self-attention layer¶
- Through training, we want the self-attention to learn how certain tokens (e.g. 'not') can affect other tokens / words.
- E.g. we need to learn to change the representations of $v_{not}$ and $v_{terrible}$ so that they produce a 'correct' (low loss) output
- For that, we do need to add some trainable parameters.
Standard self-attention¶
- We add 3 weight matrices (K, Q, V) and biases to change each vector:
- $k_i = K x_i + b_k$
- $q_i = Q x_i + b_q$
- $v_i = V x_i + b_v$
- The same K, Q, and V matrices are used for every token; which projection applies depends on the token's role: the value (v) for the input token whose information is passed on, the query (q) for the token we are currently looking at, and the key (k) for the token we compare it with (a sketch follows below)
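A minimal single-head self-attention layer with learned K, Q, V projections (a sketch; the dimension k=128 is illustrative):
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, k=128):
        super().__init__()
        self.K = nn.Linear(k, k)   # key projection
        self.Q = nn.Linear(k, k)   # query projection
        self.V = nn.Linear(k, k)   # value projection

    def forward(self, x):                        # x: (batch, seq, k)
        q, k_, v = self.Q(x), self.K(x), self.V(x)
        w = F.softmax(q @ k_.transpose(-2, -1) / math.sqrt(x.size(-1)), dim=-1)
        return w @ v                             # (batch, seq, k)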
Sidenote on terminology¶
- View the set of tokens as a dictionary
s = {a: v_a, b: v_b, c: v_c}
- In a dictionary, the third output (for key c) would simply be
s[c] = v_c
- In a soft dictionary, it's a weighted sum: $s[c] = w_a * v_a + w_b * v_b + w_c * v_c$
- If $w_i$ are dot products: $s[c] = (k_a\cdot q_c) * v_a + (k_b\cdot q_c) * v_b + (k_c\cdot q_c) * v_c$
- We blend the influence of every token based on their learned relations with other tokens
Intuition¶
- We blend the influence of every token based on their learned 'relations' with other tokens
- Say that we need to learn how 'negation' works
- The 'query' vector could be trained (via Q) to say something like 'are there any negation words?'
- A token (e.g. 'not'), transformed by K, could then respond strongly if it is indeed a negation word
Single-head self-attention¶
- There are different relations to model within a sentence.
- The same input token, e.g. $v_{terrible}$, can relate in completely different ways to different kinds of tokens
- But we only have one set of K, V, and Q matrices
- To better capture multiple relationships, we need multiple self-attention operations (expensive)
Multi-head self-attention¶
- What if we project the input embeddings to a lower-dimensional embedding $k$?
- Then we could learn multiple self-attention operations in parallel
- Effectively, we split the self-attention in multiple heads
- Each applies a separate low-dimensional self-attention (with $K^{k \times k}, Q^{k \times k}, V^{k \times k}$)
- After running them (in parallel), we concatenate their outputs.
Transformer model¶
- Repeat self-attention multiple times in controlled fashion
- Works for sequences, images, graphs,... (learn how sets of objects interact)
- Models consist of multiple transformer blocks, usually:
- Layer normalization (every input is normalized independently)
- Self-attention layer (learn interactions)
- Residual connections (preserve gradients in deep networks)
- Feed-forward layer (learn mappings)
Positional encoding¶
- We need some way to tell the self-attention layer about position in the sequence
- Represent position by vectors, using some easy-to-learn predictable pattern
- Add these encodings to vector embeddings
- Gives information on how far one input is from the others
- Other techniques exist (e.g. relative positioning)
Autoregressive models¶
- Models that predict future values based on past values of the same stream
- The output is mapped to a probability distribution over the vocabulary via softmax (with temperature), from which the next token is sampled
- Problem: self-attention can simply look ahead in the stream
- We need to make the transformer blocks causal
Masked self-attention¶
- Simple solution: mask out any attention weights from the current token to future tokens (see the sketch below)
- Replace with -infinity, so that after softmax they will be 0
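In code, the masking step can look like this (a sketch; attn_logits is assumed to be a (seq, seq) matrix of attention logits, as in the scaled dot-product implementation later on):
import torch
import torch.nn.functional as F

L = attn_logits.size(-1)                                    # sequence length
mask = torch.triu(torch.ones(L, L), diagonal=1).bool()      # True above the diagonal = future positions
attn_logits = attn_logits.masked_fill(mask, float('-inf'))  # block attention to the future
attention = F.softmax(attn_logits, dim=-1)                  # masked entries become exactly 0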
Famous transformers¶
- "Attention is all you need": first paper to use attention without CNNs or RNNs
- Encoder-Decoder architecture for translation: the decoder attends to the encoder output through a source (cross-) attention layer, with keys and values coming from the source sentence
- We'll reproduce this (partly) in the Lab 6 tutorial :)
GPT 3¶
- Decoder-only, single stack of 96 transformer blocks (and 96 heads)
- Sequence size 2048, input dimensionality 12,288, 175B parameters
- Trained on entire common crawl dataset (1 epoch)
- Additional training on high-quality data (Wikipedia,...)
- Excellent animation by 3Blue1Brown (3b1b)
- GPT from scratch by A. Karpathy
GPT 4¶
- Likely a 'mixture of experts' model
- Router (small MLP) selects which subnetworks (e.g. 2) to use given input
- Predictions get ensembled
- Allows scaling up parameter count without proportionate (inference) cost
- Also better data, more human-in-the-loop training (RLHF),...
Vision transformers¶
- Same principle: split up into patches, embed into tokens, add position encoding
- For classification: add an extra learnable (randomly initialized) [CLS] input token; its output token is used for the prediction
Demonstration¶
We'll experiment with the CIFAR-10 dataset
- ViTs are quite expensive on large images.
- This ViT takes about an hour to train (we'll run it from a checkpoint)
Patchify¶
- Split $N\times N$ image into $(N/M)^2$ patches of size $M\times M$.
B, C, H, W = x.shape # Batch size, Channels, Height, Width
x = x.reshape(B, C, H//patch_size, patch_size, W//patch_size, patch_size)
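To turn the patches into a sequence of tokens, the reshaped tensor is then permuted and flattened; a sketch of the remaining steps (mirroring common ViT implementations, not necessarily the exact notebook code):
x = x.permute(0, 2, 4, 1, 3, 5)   # (B, H', W', C, p, p), with H' = H//patch_size
x = x.flatten(1, 2)               # (B, H'*W', C, p, p): one entry per patch
x = x.flatten(2, 4)               # (B, H'*W', C*p*p): each patch flattened to a vector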
Self-attention¶
First, we need to implement a (scaled) dot-product
import math
import torch
import torch.nn.functional as F

def scaled_dot_product(q, k, v, mask=None):
    attn_logits = torch.matmul(q, k.transpose(-2, -1))    # dot products
    attn_logits = attn_logits / math.sqrt(q.size()[-1])   # scaling by sqrt(d)
    if mask is not None:                                  # optional (e.g. causal) mask
        attn_logits = attn_logits.masked_fill(mask == 0, float('-inf'))
    attention = F.softmax(attn_logits, dim=-1)            # softmax over the keys
    values = torch.matmul(attention, v)                   # weighted sum of the values
    return values, attention
Multi-head attention (simplified)¶
- Project input to lower-dimensional embeddings
- Stack them so we can feed them through self-attention at once
- Unstack and project back to original dimensions
qkv = nn.Linear(input_dim, 3*embed_dim)(x)                   # compute q, k, v for all heads at once
qkv = qkv.reshape(batch_size, seq_length, num_heads, 3*head_dim)
qkv = qkv.permute(0, 2, 1, 3)                                # (batch, heads, seq, 3*head_dim)
q, k, v = qkv.chunk(3, dim=-1)                               # split into q, k, v per head
values, attention = scaled_dot_product(q, k, v, mask=mask)   # self-attention per head
values = values.permute(0, 2, 1, 3)                          # (batch, seq, heads, head_dim)
values = values.reshape(batch_size, seq_length, embed_dim)   # concatenate the heads
out = nn.Linear(embed_dim, input_dim)(values)                # project back to input_dim
Attention block¶
class AttentionBlock(nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_heads, dropout=0.0):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.layer_norm_2 = nn.LayerNorm(embed_dim)
        self.linear = nn.Sequential(            # Feed-forward layer
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, embed_dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        inp_x = self.layer_norm_1(x)
        x = x + self.attn(inp_x, inp_x, inp_x)[0]   # self-attention + residual
        x = x + self.linear(self.layer_norm_2(x))   # feed-forward + residual
        return x
Vision transformer¶
Final steps:
- Linear projection (embedding) to map patches to vector
- Add classification token to input
- 2D positional encoding
- Small MLP head to map CLS token to prediction
Positional encoding¶
- We implement this pattern and run it across a 2D grid: $$ PE_{(pos,i)} = \begin{cases} \sin\left(\frac{pos}{10000^{i/d_{\text{model}}}}\right) & \text{if}\hspace{3mm} i \text{ mod } 2=0\\ \cos\left(\frac{pos}{10000^{(i-1)/d_{\text{model}}}}\right) & \text{otherwise}\\ \end{cases} $$
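A 1D version of this pattern takes only a few lines (a sketch assuming an even d_model; the 2D variant applies the same idea to row and column coordinates separately):
import torch

def sinusoidal_encoding(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angles = pos / (10000 ** (i / d_model))                         # (seq, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # sine on even dimensions
    pe[:, 1::2] = torch.cos(angles)   # cosine on odd dimensions
    return pe                         # added to the token embeddings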
Results¶
- ResNet outperforms ViT
- Inductive biases of CNNs win out if you have limited data/compute
- Transformers have very little inductive bias
- More flexible, but also more data hungry
Found pretrained model at ../data/checkpoints/ViT.ckpt, loading...
ViT results {'test': 0.7713000178337097, 'val': 0.7781999707221985}
Summary¶
- Tokenization
- Find a good way to split data into tokens
- Word/Image embeddings (for initial embeddings)
- For text: Word2Vec, FastText, GloVe
- For images: MLP, CNN,...
- Sequence-to-sequence models
- 1D convolutional nets (fast, limited memory)
- RNNs (slow, also quite limited memory)
- Transformers
- Self-attention (allows very large memory)
- Positional encoding
- Autoregressive models
- Vision transformers
- Useful if you have lots of data (and compute)
Acknowledgement
Several figures came from the excellent VU Deep Learning course.