Mathematical Introduction

The Transformer Architecture

Exploring the mathematical foundations behind the architecture that revolutionized artificial intelligence.

Nir Naim
Tel Aviv University
Queueing Theory Seminar

Section 01

What Are Transformers?

The architecture that changed everything

The Sequential Bottleneck

Before 2017, sequence modeling used Recurrent Neural Networks (RNNs):

\[h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b)\]

Problem 1

Computing \(h_t\) requires \(h_{t-1}\). No parallelization possible.

Problem 2

Information flows through many steps. Vanishing gradients.
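To make the bottleneck concrete, here is a minimal NumPy sketch of the recurrence above (the sizes and random weights are illustrative, not taken from any particular model). The loop body at step \(t\) needs the result of step \(t-1\), so it cannot be parallelized across time:

```python
import numpy as np

d_h, d_x, n = 8, 4, 5                      # illustrative sizes
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(d_h, d_h)) * 0.1   # hidden-to-hidden weights
W_xh = rng.normal(size=(d_h, d_x)) * 0.1   # input-to-hidden weights
b = np.zeros(d_h)
x = rng.normal(size=(n, d_x))              # input sequence

h = np.zeros(d_h)
for t in range(n):                         # strictly sequential: h_t needs h_{t-1}
    h = np.tanh(W_hh @ h + W_xh @ x[t] + b)
print(h.shape)                             # (8,) -- final hidden state
```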

The Solution: Attention

Definition: Transformer

A neural network using self-attention as its core primitive. No inherent notion of order—position must be explicitly encoded.

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

Computes attention for all positions at once—fully parallelizable.
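As a sanity check, a minimal NumPy sketch of this formula (the shapes and the row-wise softmax helper are the only assumptions beyond the equation itself):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K: (n, d_k), V: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (n, n) -- all position pairs at once
    return softmax(scores, axis=-1) @ V         # weighted sum of the values

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
out = attention(rng.normal(size=(n, d_k)),
                rng.normal(size=(n, d_k)),
                rng.normal(size=(n, d_v)))
print(out.shape)                                # (4, 8)
```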

The Key Insight

Intuition

Self-attention = soft, differentiable database lookup

  • Query: "What do I need?"
  • Key: "What do I have?"
  • Value: "The information"

Output = weighted sum of the values, weighted by a softmax over query-key dot products. Entirely differentiable.

Section 02

Historical Impact

From RNNs to GPT-4

The Transformer Revolution

2014

Attention for Translation

Bahdanau et al. add attention to RNNs

2017

"Attention Is All You Need"

Vaswani et al. eliminate recurrence; training parallelizes over the entire sequence and becomes dramatically cheaper.

2018

BERT & GPT

Pretrained Transformers revolutionize NLP

2020+

Universal Adoption

Vision, proteins, audio, code. GPT-4, Claude, Gemini.

Computational Complexity

RNN

\(O(n)\) sequential operations, each \(O(d^2)\)

Cannot parallelize

Transformer

\(O(n^2 d)\) matrix operations

Fully parallelizable

Attention does more arithmetic when \(n\) is large, but all of it runs in parallel, so on GPUs Transformers train dramatically faster in wall-clock time.
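As a rough illustration of this trade-off (a back-of-the-envelope sketch; constant factors, projections, and the FFN are ignored), attention spends more arithmetic than the recurrence once \(n > d\), but spreads it over a single parallel step:

```python
# Rough operation counts for one layer over a sequence of length n, width d.
n, d = 1024, 512

rnn_sequential_steps = n              # h_t depends on h_{t-1}: n steps in series
rnn_flops = n * d * d                 # one d x d matrix-vector product per step

attn_sequential_steps = 1             # all positions computed in one batched matmul
attn_flops = n * n * d                # QK^T score matrix: (n, d) @ (d, n)

print(f"RNN:       {rnn_flops:.2e} FLOPs over {rnn_sequential_steps} sequential steps")
print(f"Attention: {attn_flops:.2e} FLOPs over {attn_sequential_steps} sequential step(s)")
```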

Section 03

Architecture

Building blocks

Encoder-Decoder Structure

  • Encoder: Processes input (BERT)
  • Decoder: Generates output (GPT)

Each contains stacked layers of:

  • Self-Attention
  • Feed-Forward Networks
  • Residual Connections
  • Layer Normalization
[Diagram: Encoder = Embedding + Positional Encoding, then N× (Self-Attention → Feed-Forward); Decoder = Embedding + Positional Encoding, then N× (Masked Self-Attention → Cross-Attention → Feed-Forward)]

Key Components

01

Embeddings

\(x_i = E[\text{token}_i]\)

02

Positional Encoding

\(\sin/\cos\) at varying frequencies

03

Layer Norm

\(\gamma \odot \frac{x-\mu}{\sigma} + \beta\)

04

Residual

\(x + \text{Sublayer}(x)\)

05

FFN

\(W_2 \cdot \text{ReLU}(W_1 x)\)

06

Self-Attention

→ Next slides
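The following NumPy sketch assembles components 02-05 above (sinusoidal positions, LayerNorm, the residual wrapper, and the FFN). Sizes and random weights are illustrative placeholders; the self-attention sublayer is treated on the next slides:

```python
import numpy as np

def positional_encoding(n, d):
    """Sinusoidal encodings: sin/cos at geometrically spaced frequencies."""
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2      # ReLU between two linear maps

# Illustrative sizes and random weights
n, d, d_ff = 6, 16, 64
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d)) + positional_encoding(n, d)   # embeddings + positions
gamma, beta = np.ones(d), np.zeros(d)
W1, b1 = rng.normal(size=(d, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)) * 0.1, np.zeros(d)

# Residual + LayerNorm around the FFN sublayer (post-norm, as in the 2017 paper)
out = layer_norm(x + ffn(x, W1, b1, W2, b2), gamma, beta)
print(out.shape)   # (6, 16)
```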

Section 04

Self-Attention

The mathematical heart

Scaled Dot-Product Attention

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

Step 1: Project

\[Q = XW^Q, \; K = XW^K, \; V = XW^V\]

Step 2: Scores

\[e_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}\]
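The two steps in code, with the shapes made explicit (X and the projection matrices are random placeholders):

```python
import numpy as np

n, d, d_k = 3, 4, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                 # token representations
W_Q, W_K = rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k))

Q, K = X @ W_Q, X @ W_K                     # Step 1: project, both (n, d_k)
E = Q @ K.T / np.sqrt(d_k)                  # Step 2: scaled scores, (n, n)
print(E.shape)                              # (3, 3) -- e_ij for every pair i, j
```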

Why \(\sqrt{d_k}\)?

Proposition

If components of \(q, k\) are i.i.d. with mean 0, variance 1:

\[\text{Var}(q \cdot k) = d_k\]

Large values → softmax saturates → gradients vanish.
Dividing by \(\sqrt{d_k}\) normalizes variance to 1.
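A quick empirical check of the proposition (the dimension and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, trials = 64, 200_000
q = rng.normal(size=(trials, d_k))          # i.i.d. entries, mean 0, variance 1
k = rng.normal(size=(trials, d_k))

dots = (q * k).sum(axis=1)                  # dot products q . k
print(dots.var())                           # ~ d_k = 64
print((dots / np.sqrt(d_k)).var())          # ~ 1 after scaling
```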

Multi-Head Attention

\[\text{MultiHead} = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O\]
\[\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]
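A minimal NumPy sketch of the multi-head formula; the head count, sizes, and random weights are illustrative, and a compact single-head attention is inlined so the example stands alone:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(X, Wq, Wk, Wv, Wo):
    """Wq, Wk, Wv: lists of per-head projections; Wo: output projection."""
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]          # h heads, each (n, d_k)
    return np.concatenate(heads, axis=-1) @ Wo           # concat -> (n, d_model)

n, d_model, h = 5, 16, 4
d_k = d_model // h
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))

print(multi_head(X, Wq, Wk, Wv, Wo).shape)               # (5, 16)
```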

Finding

Different heads specialize: syntax, position, semantics.

Interactive Demo

In the live slides, clicking a word in "The cat sat on the mat because it was soft" shows its attention distribution over the other words.

Section 05

Worked Example

Complete calculation

Setup

\(n=3\) tokens, \(d=4\), \(d_k=d_v=2\)

\[X = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \end{bmatrix}\]
\[Q = \begin{bmatrix} 2 & 0 \\ 0 & 2 \\ 1 & 1 \end{bmatrix}, K = \begin{bmatrix} 0 & 2 \\ 2 & 0 \\ 1 & 1 \end{bmatrix}\]

Result

\[QK^\top = \begin{bmatrix} 0 & 4 & 2 \\ 4 & 0 & 2 \\ 2 & 2 & 2 \end{bmatrix}\]
\[A = \text{softmax}\left(\frac{QK^\top}{\sqrt{2}}\right) \approx \begin{bmatrix} 0.045 & \mathbf{0.768} & 0.187 \\ \mathbf{0.768} & 0.045 & 0.187 \\ 0.333 & 0.333 & 0.333 \end{bmatrix}\]

Token A attends mostly to B (0.77), B mostly to A (0.77), and C attends to all three roughly equally (0.33 each).
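The same numbers can be reproduced in a few lines of NumPy (only the softmax computation is added):

```python
import numpy as np

Q = np.array([[2., 0.], [0., 2.], [1., 1.]])
K = np.array([[0., 2.], [2., 0.], [1., 1.]])

scores = Q @ K.T / np.sqrt(2)             # [[0, 4, 2], [4, 0, 2], [2, 2, 2]] / sqrt(2)
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = e / e.sum(axis=-1, keepdims=True)     # row-wise softmax
print(np.round(A, 3))
# [[0.045 0.768 0.187]
#  [0.768 0.045 0.187]
#  [0.333 0.333 0.333]]
```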

Section 06

Applications

Transformers everywhere

Universal Architecture

💬

LLMs

GPT-4, Claude, Gemini

🖼️

Vision

ViT, DINO

🎨

Image Gen

DALL-E, SD

🧬

Proteins

AlphaFold 2

🎵

Audio

Whisper

💻

Code

Copilot

Scaling laws (Kaplan et al., 2020): test loss falls as a power law in parameter count \(N\) and dataset size \(D\):

\[L(N) \propto N^{-0.076}, \quad L(D) \propto D^{-0.095}\]
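What the exponents imply, as a quick ratio check (the proportionality constants are unknown here, so only relative changes are computed):

```python
# Power-law exponents from Kaplan et al. (2020); constants cancel in the ratios.
alpha_N, alpha_D = 0.076, 0.095

for factor in (10, 100):
    loss_ratio_N = factor ** (-alpha_N)   # L(factor * N) / L(N)
    loss_ratio_D = factor ** (-alpha_D)   # L(factor * D) / L(D)
    print(f"{factor:>3}x parameters -> loss x {loss_ratio_N:.2f};  "
          f"{factor:>3}x data -> loss x {loss_ratio_D:.2f}")
# 10x parameters lowers loss by ~16%, 10x data by ~20% (all else equal).
```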

Questions?

Thank you for your attention.

"Attention Is All You Need" (2017)