Mathematical Introduction

The Transformer Architecture

Exploring the mathematical foundations behind the architecture that revolutionized artificial intelligence.

Nir Naim
Tel Aviv University
Queueing Theory Seminar

Section 01

What Are Transformers?

The architecture that changed everything

The Sequential Bottleneck

Before 2017, sequence modeling used Recurrent Neural Networks (RNNs):

\[h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b)\]

Problem 1

Computing \(h_t\) requires \(h_{t-1}\). No parallelization possible.

Problem 2

Information flows through many steps. Vanishing gradients.
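To make the bottleneck concrete, here is a minimal NumPy sketch of the recurrence above (the sizes and random weights are illustrative, not taken from any particular model). The loop body at step \(t\) needs the result of step \(t-1\), so it cannot be parallelized across time:

```python
import numpy as np

d_h, d_x, n = 8, 4, 5                      # illustrative sizes
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(d_h, d_h)) * 0.1   # hidden-to-hidden weights
W_xh = rng.normal(size=(d_h, d_x)) * 0.1   # input-to-hidden weights
b = np.zeros(d_h)
x = rng.normal(size=(n, d_x))              # input sequence

h = np.zeros(d_h)
for t in range(n):                         # strictly sequential: h_t needs h_{t-1}
    h = np.tanh(W_hh @ h + W_xh @ x[t] + b)
print(h.shape)                             # (8,) -- final hidden state
```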

The Solution: Attention

Definition: Transformer

A neural network using self-attention as its core primitive. No inherent notion of order—position must be explicitly encoded.

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

Computes attention for all positions at once—fully parallelizable.
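As a sanity check, a minimal NumPy sketch of this formula (the shapes and the row-wise softmax helper are the only assumptions beyond the equation itself):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K: (n, d_k), V: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (n, n) -- all position pairs at once
    return softmax(scores, axis=-1) @ V         # weighted sum of the values

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
out = attention(rng.normal(size=(n, d_k)),
                rng.normal(size=(n, d_k)),
                rng.normal(size=(n, d_v)))
print(out.shape)                                # (4, 8)
```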

The Key Insight

Intuition

Self-attention = soft, differentiable database lookup

  • Query: "What do I need?"
  • Key: "What do I have?"
  • Value: "The information"

Output = weighted sum of the values, weighted by a softmax over query-key dot products. Entirely differentiable.

Section 02

Historical Impact

From RNNs to GPT-4

The Transformer Revolution

2014

Attention for Translation

Bahdanau et al. add attention to RNNs

2017

"Attention Is All You Need"

Vaswani et al. eliminate recurrence; training parallelizes over the entire sequence and becomes dramatically cheaper.

2018

BERT & GPT

Pretrained Transformers revolutionize NLP

2020+

Universal Adoption

Vision, proteins, audio, code. GPT-4, Claude, Gemini.

Computational Complexity

RNN

\(O(n)\) sequential operations, each \(O(d^2)\)

Cannot parallelize

Transformer

\(O(n^2 d)\) matrix operations

Fully parallelizable

Attention does more arithmetic when \(n\) is large, but all of it runs in parallel, so on GPUs Transformers train dramatically faster in wall-clock time.
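As a rough illustration of this trade-off (a back-of-the-envelope sketch; constant factors, projections, and the FFN are ignored), attention spends more arithmetic than the recurrence once \(n > d\), but spreads it over a single parallel step:

```python
# Rough operation counts for one layer over a sequence of length n, width d.
n, d = 1024, 512

rnn_sequential_steps = n              # h_t depends on h_{t-1}: n steps in series
rnn_flops = n * d * d                 # one d x d matrix-vector product per step

attn_sequential_steps = 1             # all positions computed in one batched matmul
attn_flops = n * n * d                # QK^T score matrix: (n, d) @ (d, n)

print(f"RNN:       {rnn_flops:.2e} FLOPs over {rnn_sequential_steps} sequential steps")
print(f"Attention: {attn_flops:.2e} FLOPs over {attn_sequential_steps} sequential step(s)")
```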

Section 03

Architecture

Building blocks

Encoder-Decoder Structure

  • Encoder: Processes input (BERT)
  • Decoder: Generates output (GPT)

Each contains stacked layers of:

  • Self-Attention
  • Feed-Forward Networks
  • Residual Connections
  • Layer Normalization
[Diagram: Encoder = Embedding + Positional Encoding, then N× (Self-Attention → Feed-Forward); Decoder = Embedding + Positional Encoding, then N× (Masked Self-Attention → Cross-Attention → Feed-Forward)]

Key Components

01

Embeddings

\(x_i = E[\text{token}_i]\)

02

Positional Encoding

\(\sin/\cos\) at varying frequencies

03

Layer Norm

\(\gamma \odot \frac{x-\mu}{\sigma} + \beta\)

04

Residual

\(x + \text{Sublayer}(x)\)

05

FFN

\(W_2 \cdot \text{ReLU}(W_1 x)\)

06

Self-Attention

→ Next slides
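The following NumPy sketch assembles components 02-05 above (sinusoidal positions, LayerNorm, the residual wrapper, and the FFN). Sizes and random weights are illustrative placeholders; the self-attention sublayer is treated on the next slides:

```python
import numpy as np

def positional_encoding(n, d):
    """Sinusoidal encodings: sin/cos at geometrically spaced frequencies."""
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2      # ReLU between two linear maps

# Illustrative sizes and random weights
n, d, d_ff = 6, 16, 64
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d)) + positional_encoding(n, d)   # embeddings + positions
gamma, beta = np.ones(d), np.zeros(d)
W1, b1 = rng.normal(size=(d, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)) * 0.1, np.zeros(d)

# Residual + LayerNorm around the FFN sublayer (post-norm, as in the 2017 paper)
out = layer_norm(x + ffn(x, W1, b1, W2, b2), gamma, beta)
print(out.shape)   # (6, 16)
```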

Section 04

Self-Attention

The mathematical heart

Scaled Dot-Product Attention

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

Step 1: Project

\[Q = XW^Q, \; K = XW^K, \; V = XW^V\]

Step 2: Scores

\[e_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}\]
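The two steps in code, with the shapes made explicit (X and the projection matrices are random placeholders):

```python
import numpy as np

n, d, d_k = 3, 4, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                 # token representations
W_Q, W_K = rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k))

Q, K = X @ W_Q, X @ W_K                     # Step 1: project, both (n, d_k)
E = Q @ K.T / np.sqrt(d_k)                  # Step 2: scaled scores, (n, n)
print(E.shape)                              # (3, 3) -- e_ij for every pair i, j
```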

Why \(\sqrt{d_k}\)?

Proposition

If components of \(q, k\) are i.i.d. with mean 0, variance 1:

\[\text{Var}(q \cdot k) = d_k\]

Large values → softmax saturates → gradients vanish.
Dividing by \(\sqrt{d_k}\) normalizes variance to 1.
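A quick empirical check of the proposition (the dimension and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, trials = 64, 200_000
q = rng.normal(size=(trials, d_k))          # i.i.d. entries, mean 0, variance 1
k = rng.normal(size=(trials, d_k))

dots = (q * k).sum(axis=1)                  # dot products q . k
print(dots.var())                           # ~ d_k = 64
print((dots / np.sqrt(d_k)).var())          # ~ 1 after scaling
```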

Multi-Head Attention

\[\text{MultiHead} = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O\]
\[\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]
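A minimal NumPy sketch of the multi-head formula; the head count, sizes, and random weights are illustrative, and a compact single-head attention is inlined so the example stands alone:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(X, Wq, Wk, Wv, Wo):
    """Wq, Wk, Wv: lists of per-head projections; Wo: output projection."""
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]          # h heads, each (n, d_k)
    return np.concatenate(heads, axis=-1) @ Wo           # concat -> (n, d_model)

n, d_model, h = 5, 16, 4
d_k = d_model // h
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))

print(multi_head(X, Wq, Wk, Wv, Wo).shape)               # (5, 16)
```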

Finding

Different heads specialize: syntax, position, semantics.

Interactive Demo

In the live slides, clicking a word in "The cat sat on the mat because it was soft" shows its attention distribution over the other words.

Section 05

Worked Example

Complete calculation

Setup

\(n=3\) tokens, \(d=4\), \(d_k=d_v=2\)

\[X = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \end{bmatrix}\]
\[Q = \begin{bmatrix} 2 & 0 \\ 0 & 2 \\ 1 & 1 \end{bmatrix}, K = \begin{bmatrix} 0 & 2 \\ 2 & 0 \\ 1 & 1 \end{bmatrix}\]

Result

\[QK^\top = \begin{bmatrix} 0 & 4 & 2 \\ 4 & 0 & 2 \\ 2 & 2 & 2 \end{bmatrix}\]
\[A = \text{softmax}\left(\frac{QK^\top}{\sqrt{2}}\right) \approx \begin{bmatrix} 0.045 & \mathbf{0.768} & 0.187 \\ \mathbf{0.768} & 0.045 & 0.187 \\ 0.333 & 0.333 & 0.333 \end{bmatrix}\]

Token A attends mostly to B (0.77), B mostly to A (0.77), and C attends to all three roughly equally (0.33 each).
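The same numbers can be reproduced in a few lines of NumPy (only the softmax computation is added):

```python
import numpy as np

Q = np.array([[2., 0.], [0., 2.], [1., 1.]])
K = np.array([[0., 2.], [2., 0.], [1., 1.]])

scores = Q @ K.T / np.sqrt(2)             # [[0, 4, 2], [4, 0, 2], [2, 2, 2]] / sqrt(2)
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = e / e.sum(axis=-1, keepdims=True)     # row-wise softmax
print(np.round(A, 3))
# [[0.045 0.768 0.187]
#  [0.768 0.045 0.187]
#  [0.333 0.333 0.333]]
```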

Section 06

Applications

Transformers everywhere

Universal Architecture

💬

LLMs

GPT-4, Claude, Gemini

🖼️

Vision

ViT, DINO

🎨

Image Gen

DALL-E, SD

🧬

Proteins

AlphaFold 2

🎵

Audio

Whisper

💻

Code

Copilot

Scaling laws (Kaplan et al., 2020): test loss falls as a power law in parameter count \(N\) and dataset size \(D\):

\[L(N) \propto N^{-0.076}, \quad L(D) \propto D^{-0.095}\]
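What the exponents imply, as a quick ratio check (the proportionality constants are unknown here, so only relative changes are computed):

```python
# Power-law exponents from Kaplan et al. (2020); constants cancel in the ratios.
alpha_N, alpha_D = 0.076, 0.095

for factor in (10, 100):
    loss_ratio_N = factor ** (-alpha_N)   # L(factor * N) / L(N)
    loss_ratio_D = factor ** (-alpha_D)   # L(factor * D) / L(D)
    print(f"{factor:>3}x parameters -> loss x {loss_ratio_N:.2f};  "
          f"{factor:>3}x data -> loss x {loss_ratio_D:.2f}")
# 10x parameters lowers loss by ~16%, 10x data by ~20% (all else equal).
```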

Questions?

Thank you for your attention.

"Attention Is All You Need" (2017)