Exploring the mathematical foundations behind the architecture that revolutionized artificial intelligence.
Seminar led by Prof. Uri Yechiali
Section 01
The architecture that changed everything
Before 2017, sequence modeling used Recurrent Neural Networks (RNNs):
Problem 1
Computing \(h_t\) requires \(h_{t-1}\), so positions must be processed one after another. No parallelization across the sequence.
Problem 2
Information must flow through many sequential steps, so gradients vanish over long sequences.
Definition: Transformer
A neural network using self-attention as its core primitive. No inherent notion of order—position must be explicitly encoded.
Computes attention for all positions at once—fully parallelizable.
Intuition
Self-attention = soft, differentiable database lookup
Output = weighted sum of values, with weights given by a softmax over query–key dot products. Entirely differentiable.
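To make the "soft lookup" concrete, here is a minimal NumPy sketch of scaled dot-product attention, \(\text{softmax}(QK^\top/\sqrt{d_k})\,V\). The shapes and random inputs are illustrative only, not taken from the seminar's examples.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Soft lookup: each query is compared with every key, and the
    softmax-normalized similarities weight the corresponding values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights                        # weighted sum of values

# Illustrative sizes: n = 4 tokens, d_k = d_v = 8, random inputs.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))   # each row sums to 1: one soft lookup per query
```

Every operation here is a matrix product or a softmax, which is what makes the whole lookup differentiable end to end.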
Section 02
From RNNs to GPT-4
2014
Attention for Translation
Bahdanau et al. add attention to RNNs
2017
"Attention Is All You Need"
Vaswani et al. eliminate recurrence entirely; training parallelizes across positions and becomes dramatically faster.
2018
BERT & GPT
Pretrained Transformers revolutionize NLP
2020+
Universal Adoption
Vision, proteins, audio, code. GPT-4, Claude, Gemini.
RNN
\(O(n)\) sequential operations, each \(O(d^2)\)
Cannot parallelize
Transformer
\(O(n^2 d)\) work as matrix operations, only \(O(1)\) sequential steps
Fully parallelizable
On GPUs, Transformers are dramatically faster.
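The contrast is visible directly in code. Below is a rough NumPy sketch with toy sizes and random weights (all of them assumptions of this illustration): the RNN update needs an explicit Python loop over positions, while single-head self-attention is a handful of whole-matrix products that a GPU can evaluate in parallel.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 8                        # toy sequence length and model width
X = rng.normal(size=(n, d))        # token representations

# RNN: O(n) sequential steps, each an O(d^2) matrix-vector product.
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
for t in range(n):                 # step t cannot start before step t-1
    h = np.tanh(W_h @ h + W_x @ X[t])

# Self-attention: O(n^2 d) total work, but expressed as full-matrix
# products with no loop over positions, so it parallelizes trivially.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)
out = A @ V                        # all n positions updated at once
```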
Section 03
Building blocks
Encoder and decoder each contain stacked layers of the following components (a sketch composing them follows the list):
01
Embeddings
\(x_i = E[\text{token}_i]\)
02
Positional Encoding
\(\sin/\cos\) at varying frequencies
03
Layer Norm
\(\gamma \odot \frac{x-\mu}{\sigma} + \beta\)
04
Residual
\(x + \text{Sublayer}(x)\)
05
FFN
\(W_2 \cdot \text{ReLU}(W_1 x)\)
06
Self-Attention
→ Next slides
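A minimal sketch of how blocks 01–06 fit together, in NumPy with made-up sizes and untrained random weights (all assumptions of this illustration). Attention is reduced to the single-head version shown earlier, and each sublayer is wrapped as \(\text{LayerNorm}(x + \text{Sublayer}(x))\), the post-norm arrangement of the original paper.

```python
import numpy as np

def positional_encoding(n, d):
    """02: sin/cos at geometrically spaced frequencies."""
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def layer_norm(x, gamma, beta, eps=1e-5):
    """03: normalize each position, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def ffn(x, W1, b1, W2, b2):
    """05: position-wise feed-forward network."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def attention(x, Wq, Wk, Wv):
    """06: single-head self-attention (see the earlier sketch)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    s = Q @ K.T / np.sqrt(K.shape[-1])
    A = np.exp(s - s.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

def transformer_block(x, p):
    """04: residual connection around each sublayer, then layer norm."""
    x = layer_norm(x + attention(x, p["Wq"], p["Wk"], p["Wv"]), p["g1"], p["b1"])
    x = layer_norm(x + ffn(x, p["W1"], p["c1"], p["W2"], p["c2"]), p["g2"], p["b2"])
    return x

# 01: embeddings — look up each token's vector, then add positions.
rng = np.random.default_rng(2)
vocab, n, d, d_ff = 50, 3, 8, 16           # toy sizes, not from the slides
E = rng.normal(size=(vocab, d))
tokens = np.array([7, 3, 42])
x = E[tokens] + positional_encoding(n, d)

p = {"Wq": rng.normal(size=(d, d)), "Wk": rng.normal(size=(d, d)),
     "Wv": rng.normal(size=(d, d)),
     "W1": rng.normal(size=(d, d_ff)), "c1": np.zeros(d_ff),
     "W2": rng.normal(size=(d_ff, d)), "c2": np.zeros(d),
     "g1": np.ones(d), "b1": np.zeros(d),
     "g2": np.ones(d), "b2": np.zeros(d)}
print(transformer_block(x, p).shape)       # (3, 8): same shape in, same out
```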
Section 04
The mathematical heart
Proposition
If the components of \(q, k\) are i.i.d. with mean 0 and variance 1, then \(q \cdot k = \sum_{i=1}^{d_k} q_i k_i\) has mean 0 and variance \(d_k\).
Large dot products push the softmax into its saturated region, where gradients vanish.
Dividing by \(\sqrt{d_k}\) rescales the dot products back to variance 1.
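A quick Monte Carlo check of the proposition, under the stated i.i.d. assumption (standard normal components here; the dimension and sample size are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
d_k = 64
q = rng.standard_normal((100_000, d_k))   # i.i.d. mean-0, variance-1 components
k = rng.standard_normal((100_000, d_k))

dots = (q * k).sum(axis=-1)               # one dot product per sample
print(dots.var())                         # ≈ d_k = 64
print((dots / np.sqrt(d_k)).var())        # ≈ 1 after the 1/sqrt(d_k) scaling
```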
Finding
Different heads specialize: syntax, position, semantics.
Click a word to see its attention distribution:
Section 05
Complete calculation
\(n=3\) tokens, \(d=4\), \(d_k=d_v=2\)
A attends mostly to B (weight 0.78), B mostly to A (0.78), and C spreads its attention evenly.
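The slide's weights come from its own \(Q, K, V\) matrices, which are not reproduced here. The sketch below uses hypothetical matrices chosen so the qualitative pattern matches (A attends most to B, B most to A, C evenly); the exact numbers therefore differ from 0.78, but the procedure is the same.

```python
import numpy as np

# Hypothetical Q, K, V for n = 3 tokens (A, B, C) with d_k = d_v = 2.
Q = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
K = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [0.5, 0.5]])
V = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

scores = Q @ K.T / np.sqrt(2)                    # (3, 3) scaled dot products
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
print(weights.round(2))                          # row i: how token i attends
print((weights @ V).round(2))                    # attention output, shape (3, 2)
```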
Section 06
Transformers everywhere
LLMs
GPT-4, Claude, Gemini
Vision
ViT, DINO
Image Gen
DALL-E, Stable Diffusion
Proteins
AlphaFold 2
Audio
Whisper
Code
Copilot
Thank you for your attention.