Word Embeddings

What you’ll learn in this module

This module introduces word embeddings and the revolutionary Word2Vec algorithm.

You’ll learn:

  • How meaning emerges from relationships rather than being stored in containers.
  • The technique of contrastive learning and how it shapes semantic space through push and pull.
  • How to perform semantic arithmetic with vector operations like “king - man + woman = queen”.
  • The connection between structural linguistics and the geometric structure of word embeddings.

Words as Relationships, Not Containers

Have you ever wondered where meaning comes from? We intuitively assume words are containers for meaning, as if “Dog” held the concept of a canine. This is incorrect. Structural linguistics reveals that a sign is defined solely by its relationships: “Dog” means “dog” only because it is not “cat”, “wolf”, or “log”. Meaning is differential, not intrinsic.

Figure 1: Green is the color that is not non-green (not red, not blue, not yellow, etc.).

Word2Vec, a foundational model of modern NLP, learns to map the statistical topology of language. Think of it like mapping a city based purely on traffic data.

You don’t know what a “school” is, but you see that “buses” and “children” congregate there at 8 AM. By placing these entities close together on a map, you reconstruct the city’s functional structure. Word2Vec does this for language, turning semantic proximity into geometric distance.

Exploring Word2Vec

Let’s first experience the power of Word2Vec, then understand how it works.

We’ll use a model pre-trained on about 100 billion words of Google News. We aren’t teaching it anything. We’re simply inspecting the map it created.

import gensim.downloader as api
import numpy as np

# Load pre-trained Word2vec embeddings
print("Loading Word2vec model...")
model = api.load("word2vec-google-news-300")
print(f"Loaded embeddings for {len(model):,} words.")
Loading Word2vec model...
Loaded embeddings for 3,000,000 words.

If the map is accurate, “dog” should be surrounded by its semantic kin. We query the nearest neighbors in the vector space.

similar_to_dog = model.most_similar("dog", topn=10)

print("Words most similar to 'dog':")
for word, similarity in similar_to_dog:
    print(f"  {word:20s} {similarity:.3f}")
Words most similar to 'dog':
  dogs                 0.868
  puppy                0.811
  pit_bull             0.780
  pooch                0.763
  cat                  0.761
  golden_retriever     0.750
  German_shepherd      0.747
  Rottweiler           0.744
  beagle               0.742
  pup                  0.741

The model groups “dog” with “dogs”, “puppy”, and “pooch” not because it knows biology, but because they are statistically interchangeable in sentences.
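The scores above are cosine similarities between the 300-dimensional vectors: the dot product of two vectors divided by the product of their lengths. As a quick sanity check (a minimal sketch, using only gensim's similarity method and numpy), we can recompute one of them by hand:

dog = model["dog"]
puppy = model["puppy"]

# Cosine similarity: 1.0 means same direction, 0.0 means unrelated.
cosine = np.dot(dog, puppy) / (np.linalg.norm(dog) * np.linalg.norm(puppy))
print(f"Manual cosine similarity: {cosine:.3f}")
print(f"Gensim similarity():      {model.similarity('dog', 'puppy'):.3f}")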

What about semantic arithmetic? Since words are vectors, we can perform arithmetic on meaning. The relationship between “King” and “Man” is a vector, and if we add that vector to “Woman”, we should arrive at “Queen”.

\vec{\text{King}} - \vec{\text{Man}} + \vec{\text{Woman}} \approx \vec{\text{Queen}}

result = model.most_similar(
    positive=['king', 'woman'],
    negative=['man'],
    topn=5
)

print("king - man + woman =")
for word, similarity in result:
    print(f"  {word:15s} {similarity:.3f}")
king - man + woman =
  queen           0.712
  monarch         0.619
  princess        0.590
  crown_prince    0.550
  prince          0.538

How do we visualize these relationships? We cannot see in 300 dimensions, but we can project the space down to 2D using PCA. This reveals consistent structures like the “capital city” relationship that the model has learned.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import pandas as pd

countries = ["Germany", "France", "Italy", "Spain", "Portugal", "Greece"]
capitals = ["Berlin", "Paris", "Rome", "Madrid", "Lisbon", "Athens"]

# Get embeddings
country_embeddings = np.array([model[country] for country in countries])
capital_embeddings = np.array([model[capital] for capital in capitals])

# PCA to 2D
pca = PCA(n_components=2)
embeddings = np.vstack([country_embeddings, capital_embeddings])
embeddings_pca = pca.fit_transform(embeddings)

# Create DataFrame
df = pd.DataFrame(embeddings_pca, columns=["PC1", "PC2"])
df["Label"] = countries + capitals
df["Type"] = ["Country"] * len(countries) + ["Capital"] * len(capitals)

# Plot
fig, ax = plt.subplots(figsize=(12, 10))

for idx, row in df.iterrows():
    color = "#e74c3c" if row["Type"] == "Country" else "#3498db"
    marker = "o" if row["Type"] == "Country" else "s"
    ax.scatter(
        row["PC1"],
        row["PC2"],
        c=color,
        marker=marker,
        s=200,
        edgecolors="black",
        linewidth=1.5,
        alpha=0.7,
        zorder=3,
    )
    ax.text(
        row["PC1"],
        row["PC2"] + 0.15,
        row["Label"],
        fontsize=12,
        ha="center",
        va="bottom",
        fontweight="bold",
        bbox=dict(facecolor="white", edgecolor="none", alpha=0.8),
    )

# Draw arrows
for i in range(len(countries)):
    country_pos = df.iloc[i][["PC1", "PC2"]].values
    capital_pos = df.iloc[i + len(countries)][["PC1", "PC2"]].values
    ax.arrow(
        country_pos[0],
        country_pos[1],
        capital_pos[0] - country_pos[0],
        capital_pos[1] - country_pos[1],
        color="gray",
        alpha=0.6,
        linewidth=2,
        head_width=0.15,
        head_length=0.1,
        zorder=2,
    )

ax.set_title(
    'The "Capital Of" Relationship as Parallel Transport',
    fontsize=16,
    fontweight="bold",
    pad=20,
)
ax.grid(alpha=0.3, linestyle="--")
plt.tight_layout()
plt.show()

The ‘Capital Of’ relationship appears as a consistent direction in vector space.
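We can also check this claim numerically. The sketch below (an illustrative check with a hand-picked set of pairs, not part of the plot above) averages the capital-minus-country offset over three pairs and applies it to a country that was left out; if the direction is consistent, the result should land near that country's capital:

# Average the "capital - country" offset over a few pairs.
pairs = [("France", "Paris"), ("Italy", "Rome"), ("Spain", "Madrid")]
mean_offset = np.mean([model[cap] - model[cnt] for cnt, cap in pairs], axis=0)

# Apply the offset to a held-out country and look up the nearest words.
prediction = model["Germany"] + mean_offset
print(model.similar_by_vector(prediction, topn=3))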

How Word2Vec Learns Meaning

Let’s talk about the mechanism behind the magic. We intuitively treat words as containers that hold meaning, as if “Green” contained the visual concept of a specific color. This is incorrect. Nature presents us with a messy, continuous spectrum without hard borders, and language is simply the set of arbitrary cuts we make in that continuum to create order.

Word2Vec operationalizes this by treating meaning as a game of contrast, functioning as a pair of linguistic scissors. It does not learn what a word is by looking up a definition. It learns what a word is like by pulling it close to neighbors, and more importantly, it learns what a word is not by pushing it away from random noise.

The meaning of “Green” is simply the geometric region that remains after we have pushed away “Red”, “Purple”, and “Banana”.

Figure 2: Starting from random initial vectors, Word2Vec iteratively pushes unrelated words apart and pulls related words together. The resulting vector space is a map of the relationships between words.

What’s the underlying technique? This process relies on contrastive learning. We cannot teach the model the exact meaning of each word, but we can let it learn the relationship between words through a binary classification problem: are these two words neighbors, or are they strangers? The training loop provides a positive pair from the text, instructing the model to maximize the similarity between their vectors, while simultaneously grabbing random negative samples (imposters from the vocabulary) and demanding the model minimize their similarity. This push-and-pull mechanic creates the vector space, where the “Green” cluster forms not because the model understands color, but because those words are statistically interchangeable when opposed to “Red”.
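Here is a minimal sketch of that push-and-pull update in plain numpy. It is not the real Word2Vec implementation: the vocabulary, dimensions, learning rate, and number of steps are toy values chosen only to expose the mechanics of the negative-sampling objective.

rng = np.random.default_rng(0)
dim, lr = 8, 0.5
words = ["dog", "puppy", "piano", "tax"]

# Each word starts with a random query vector u and key vector v.
u = {w: rng.normal(size=dim) for w in words}
v = {w: rng.normal(size=dim) for w in words}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contrastive_step(center, positive, negatives):
    """One round of the binary classification game: pull the true
    (center, context) pair together, push random negatives away."""
    uc = u[center]
    grad_u = np.zeros(dim)

    # Pull: the further sigmoid(u . v_pos) is from 1, the harder we pull.
    pull = 1.0 - sigmoid(uc @ v[positive])
    grad_u += pull * v[positive]
    v[positive] += lr * pull * uc

    # Push: the further sigmoid(u . v_neg) is from 0, the harder we push.
    for neg in negatives:
        push = -sigmoid(uc @ v[neg])
        grad_u += push * v[neg]
        v[neg] += lr * push * uc

    u[center] += lr * grad_u

for _ in range(20):
    contrastive_step("dog", positive="puppy", negatives=["piano", "tax"])

for w in ["puppy", "piano", "tax"]:
    sim = u["dog"] @ v[w] / (np.linalg.norm(u["dog"]) * np.linalg.norm(v[w]))
    print(f"dog vs {w:5s}: {sim:+.2f}")

After a handful of updates, “puppy” should end up aligned with “dog” while the negatives point away; scale this up to billions of context windows and millions of words and you get the map we explored above.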

How do we generate training pairs without human labeling? We employ a sliding window technique that moves over the raw text corpus, converting a sequence of words into a system of geometric queries.

Figure 3: Without human labeling, word2vec assumes that words in the same context are related. Context is defined as the words within a window of predefined size. For example, in “The quick brown fox jumps over the lazy dog”, the context of “fox” includes “brown”, “jumps”, “over”, and “lazy”.
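A minimal sketch of that pair-generation step (the window size and sentence are illustrative, not taken from any particular implementation):

def training_pairs(tokens, window=2):
    """Yield (center, context) pairs: every word within `window`
    positions of the center counts as a positive example."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
print(list(training_pairs(sentence))[:5])
# e.g. [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...]

Every pair produced here is a positive example for the contrastive game described above; the negatives are drawn at random from the rest of the vocabulary.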

What’s the neural network architecture? Word2Vec is a simple neural network with one hidden layer. The input is a one-hot encoded vector for a word, which activates the hidden layer. The connection weights from that word’s input neuron to the hidden-layer neurons (marked by red arrows) form its query vector u, and the hidden layer in turn drives the output layer. The connection weights from an output word’s neuron to the hidden-layer neurons form its key vector v.
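In matrix terms this is just a row lookup. The sketch below uses a made-up five-word vocabulary and a tiny embedding size to show that multiplying a one-hot vector by the input weight matrix selects the query vector, and that the output layer then scores every word by a dot product against it:

vocab = ["the", "quick", "brown", "fox", "jumps"]
V, D = len(vocab), 4                      # toy vocabulary size and hidden size

rng = np.random.default_rng(1)
W_in = rng.normal(size=(V, D))            # input -> hidden weights: row i is the query u_i
W_out = rng.normal(size=(V, D))           # hidden -> output weights: row j is the key v_j

i = vocab.index("fox")
one_hot = np.zeros(V)
one_hot[i] = 1.0

# The one-hot input times W_in simply selects row i: the query vector u_fox.
u_fox = one_hot @ W_in
assert np.allclose(u_fox, W_in[i])

# The output layer scores every word j by the dot product u_fox . v_j.
scores = W_out @ u_fox
print(dict(zip(vocab, scores.round(2))))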

The word in the center of the window acts as the Query vector (u), broadcasting its position to the surrounding Context words, which act as Keys (v). The neural network adjusts its weights to maximize the dot product u \cdot v for these specific context pairs while suppressing the dot product for the negative samples, making the probability of a word appearing in context a function of their vector alignment.

P(j \vert i) = \frac{P(j) \exp(u_i \cdot v_j)}{\sum_{k=1}^{V} P(k) \exp(u_i \cdot v_k)}

where P(j) is the unigram probability of word j, i.e., how often it appears in the corpus overall.

Why do we include P(j) in the formula? The original Word2Vec paper uses a formulation that omits P(j), which is correct for the idealized softmax objective but not for how the model is trained in practice. Word2Vec is actually trained with negative sampling, an efficient but biased training algorithm, and the P(j) term enters P(j \vert i) once we account for that bias.
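To see what the P(j) term changes, here is a small sketch computing both versions side by side for one center word against a made-up three-word vocabulary (the scores and unigram probabilities are invented numbers):

scores = np.array([2.0, 1.0, 0.0])       # toy dot products u_i . v_k
unigram = np.array([0.7, 0.2, 0.1])      # toy P(k): overall word frequencies

# Plain softmax, as in the original paper's formulation.
plain = np.exp(scores) / np.exp(scores).sum()

# Unigram-weighted version matching the P(j|i) formula above.
weighted = unigram * np.exp(scores)
weighted /= weighted.sum()

print("plain softmax:   ", plain.round(3))
print("unigram-weighted:", weighted.round(3))

Frequent words get a boost from P(j), which reflects the bias that negative sampling introduces into the learned vectors.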

This closes the loop between high-level linguistic philosophy and low-level matrix operations. The machine proves the structuralist hypothesis: that meaning is relational. By mechanically slicing the continuum of language and applying the pressure of negative sampling, the model reconstructs a functional map of human concepts, successfully turning a philosophy of meaning into a runnable algorithm.

Figure 4

Key Takeaway

You don’t need to know what a thing is to understand it. You only need to know where it stands relative to everything it isn’t.

There’s a nice blog post by Chris McCormick that walks through the inner workings of Word2Vec.