Generative Pre-trained Transformer (GPT)#

The Generative Pre-trained Transformer (GPT) [1] represents a significant evolution in transformer-based language models, focusing on powerful text generation capabilities through a decoder-only architecture. While BERT uses bidirectional attention to understand context, GPT employs unidirectional (causal) attention to generate coherent text by predicting one token at a time.

GPT in an interactive notebook:

Here is a demo notebook for GPT.

To run the notebook, download it as a .py file and run it with:

marimo edit --sandbox gpt-interactive.py

You will need to install marimo and uv to run the notebook; other packages will be installed automatically in uv’s virtual environment.

Architecture#

Like BERT, GPT is built on the transformer architecture. The main difference is that BERT uses the transformer encoder, while GPT uses the transformer decoder with some modifications.


Fig. 35 GPT architecture.#

Tip

The GPT model family has evolved through several iterations. GPT-1 (2018) introduced the basic architecture with 117M parameters and transfer learning capabilities. GPT-2 (2019) followed with 1.5B parameters and zero-shot abilities, while GPT-3 (2020) dramatically scaled up to 175B parameters, enabling few-shot learning. The latest GPT-4 (2023) features multimodal capabilities, improved reasoning, and a 32K token context window. Throughout these iterations, the core decoder-only transformer architecture has remained unchanged, with improvements coming primarily from increased scale that enabled emergent capabilities.

GPT evolution

Core Components#


Fig. 36 Causal attention in GPT.#

Like BERT, GPT uses learned token embeddings to convert input tokens into continuous vector representations. The model also employs learned positional embeddings that are added to the token embeddings to encode position information. A key difference from BERT is that GPT uses a causal attention mechanism, which means each position can only attend to previous positions in the sequence, enabling the model to generate text in a left-to-right fashion by predicting one token at a time.
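As a minimal sketch of what causal attention means in practice (assuming PyTorch; this illustrates the masking idea, not GPT’s actual implementation), a lower-triangular mask can be applied to the attention scores before the softmax so that each position attends only to itself and earlier positions:

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """Scaled dot-product attention with a causal (lower-triangular) mask.

    q, k, v: tensors of shape (seq_len, d_head).
    """
    seq_len, d_head = q.shape
    scores = q @ k.T / d_head**0.5                   # (seq_len, seq_len)

    # Future positions (j > i) are set to -inf so they get zero weight
    # after the softmax: position i can only attend to positions <= i.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))

    weights = F.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ v                               # (seq_len, d_head)

# Toy example: 4 tokens with a head dimension of 8
x = torch.randn(4, 8)
print(causal_self_attention(x, x, x).shape)  # torch.Size([4, 8])
```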

Causal Language Modeling#

Causal (autoregressive) language modeling is the pre-training objective of GPT, where the model learns to predict the next token given all previous tokens in the sequence. More formally, given a sequence of tokens \((x_1, x_2, ..., x_n)\), the model is trained to maximize the likelihood:

\[ P(x_1, ..., x_n) = \prod_{i=1}^n P(x_i|x_1, ..., x_{i-1}) \]

For example, given the partial sentence “The cat sat on”, the model learns to predict the next word by calculating probability distributions over its entire vocabulary. During training, it might learn that “mat” has a high probability in this context, while “laptop” has a lower probability.
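As a concrete illustration of this next-token distribution, here is a small sketch that inspects GPT-2’s predictions for the example above. It assumes the Hugging Face transformers library and the public gpt2 checkpoint are installed; the exact top tokens will depend on the model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The cat sat on", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits              # (batch, seq_len, vocab_size)

# Probability distribution over the whole vocabulary for the *next* token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Show the five most likely continuations
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}: p ≈ {prob.item():.3f}")
```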

Note

While BERT uses bidirectional attention and sees the entire sequence at once, GPT’s unidirectional approach more naturally models how humans write text, i.e., one word at a time, with each word influenced by all previous words. BERT’s bidirectionality makes it powerful for understanding, but less suitable for text generation.

Tip

The autoregressive nature of GPT means it’s particularly sensitive to the initial tokens (prompt) it receives. Well-crafted prompts that establish clear patterns or constraints can significantly improve generation quality.

The next-token prediction objective has remained unchanged across all GPT versions due to its remarkable effectiveness. Rather than modifying this core approach, improvements have come from increasing model size and refining the architecture. This simple yet powerful training method has become fundamental to modern language models.

Scaling Laws

Language model performance improves predictably as models get larger, following simple mathematical relationships (power laws). The larger the model, the better it performs, and this improvement is reliable and measurable. This predictability was crucial for the development of models like GPT-3 and Claude, as it gave researchers confidence that investing in larger models would yield better results. Importantly, larger models are more efficient learners: they need proportionally less training data and fewer training steps to achieve good performance.

These findings revolutionized AI development by showing that better AI systems could be reliably built simply by scaling up model size, compute, and data in the right proportions. This insight led directly to the development of increasingly powerful models, as researchers could confidently invest in building larger and larger systems knowing they would see improved performance.

See the paper Scaling Laws for Neural Language Models for more details.

Scaling laws figure
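To make "power law" concrete: in the paper above, the test loss \(L\) is fit as a power law in the number of non-embedding parameters \(N\),

\[ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \]

where \(N_c\) and \(\alpha_N\) are empirically fitted constants (the paper reports \(\alpha_N \approx 0.076\)), and analogous power laws are reported for dataset size and training compute.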

Inference Strategies#

GPT does not generate text in one go. Instead, it generates text by repeatedly predicting the next token. At each step, GPT does not pick a specific token; it outputs a probability distribution over the vocabulary, and it is our job to sample the next token from that distribution. There are several sampling strategies, as we will see below.


Fig. 37 GPT predicts the next token repeatedly to generate text.#
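A minimal sketch of this loop is shown below (assuming the Hugging Face transformers library and the gpt2 checkpoint, as in the earlier example). Greedy decoding picks the most likely token at each step, while the sampling variant draws from the predicted distribution; a production implementation would also reuse the model’s key/value cache, which is omitted here for clarity.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def generate(prompt, max_new_tokens=20, greedy=True):
    """Generate text one token at a time, feeding each prediction back in."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]        # logits for the next token
        if greedy:
            next_id = torch.argmax(logits)           # deterministic choice
        else:
            probs = torch.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)[0]  # stochastic
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0])

print(generate("The cat sat on", greedy=True))
```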

From Deterministic to Stochastic Sampling#

Both greedy search and beam search are deterministic: they always pick the most likely token at each step. As a result, the model can fall into loops, repeating the same tokens over and over. A simple way to alleviate this problem is to sample a token from the distribution instead.

Top-k Sampling relaxes the deterministic nature of greedy search by selecting randomly from the k most likely next tokens at each generation step. While this introduces some diversity compared to greedy search, choosing a fixed k can be problematic: a given value of \(k\) might be too large for some distributions (including many poor options) or too small for others (excluding reasonable options).

Nucleus Sampling [2] addresses this limitation by dynamically selecting tokens based on cumulative probability. It samples from the smallest set of tokens whose cumulative probability exceeds a threshold p (e.g. 0.9). This adapts naturally to different probability distributions, i.e., selecting few tokens when the distribution is concentrated and more when it’s spread out. This approach often provides a good balance between quality and diversity.


Fig. 40 Nucleus sampling. The image is taken from this blog.#
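A minimal sketch of both strategies (assuming PyTorch, operating on a single vector of next-token logits) might look like this:

```python
import torch

def sample_top_k(logits, k=50):
    """Sample the next token from the k most likely candidates only."""
    topk = torch.topk(logits, k)
    probs = torch.softmax(topk.values, dim=-1)       # renormalize over the top k
    choice = torch.multinomial(probs, num_samples=1)
    return topk.indices[choice]

def sample_top_p(logits, p=0.9):
    """Nucleus sampling: sample from the smallest set of tokens whose
    cumulative probability exceeds p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep every token needed to push the cumulative probability past p.
    cutoff = int((cumulative < p).sum()) + 1
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(nucleus, num_samples=1)
    return sorted_ids[choice]

# Toy example with a made-up 10-token vocabulary
logits = torch.randn(10)
print(sample_top_k(logits, k=3), sample_top_p(logits, p=0.9))
```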

Temperature Control

Temperature (\(\tau\)) modifies how “concentrated” the probability distribution is for sampling by scaling the logits before applying softmax:

\[ p_i = \frac{\exp(z_i/\tau)}{\sum_j \exp(z_j/\tau)} \]

where \(z_i\) are the logits and \(\tau\) is the temperature parameter. Lower temperatures (\(\tau < 1.0\)) make the distribution more peaked, making high probability tokens even more likely to be chosen, leading to more focused and conservative outputs. Higher temperatures (\(\tau > 1.0\)) flatten the distribution by making the logits more similar, increasing the chances of selecting lower probability tokens and producing more diverse but potentially less coherent text. As \(\tau \to 0\), the distribution approaches a one-hot vector (equivalent to greedy search), while as \(\tau \to \infty\), it approaches a uniform distribution.


Fig. 41 Temperature controls the concentration of the probability distribution. Lower temperature makes the distribution more peaked, while higher temperature makes the distribution more flat.#
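A small sketch (assuming PyTorch, with made-up toy logits) shows this effect directly:

```python
import torch

def apply_temperature(logits, tau=1.0):
    """Return the softmax distribution after scaling the logits by 1/tau."""
    return torch.softmax(logits / tau, dim=-1)

# Toy logits for a 5-token vocabulary
logits = torch.tensor([2.0, 1.0, 0.5, 0.2, -1.0])

for tau in (0.5, 1.0, 2.0):
    probs = apply_temperature(logits, tau)
    print(f"tau={tau}:", [round(p, 3) for p in probs.tolist()])
# Lower tau -> more peaked (greedy-like); higher tau -> flatter (more random).
```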