Sprint Project: The Les Misérables Identity Crisis

Your mission

You have 60 minutes to create two rival embeddings of Les Misérables characters, one semantic (from the text) and one structural (from the co-occurrence network), and find the biggest “reputation gaps”: characters whose neighbors in one space differ sharply from their neighbors in the other. Then present your discoveries to the class for voting.

The Challenge

You receive the full text of Les Misérables and a character co-occurrence network. Create semantic embeddings using text-based methods (Word2Vec, BERT, etc.) and structural embeddings using graph methods (Node2Vec, DeepWalk). Compare nearest neighbors in both spaces. Find characters whose textual associations differ drastically from their network position. Who is Valjean close to in narrative versus structure?

Kickstarter Code

import pandas as pd
import networkx as nx
from gensim.models import Word2Vec
from node2vec import Node2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load text and network
with open('data/les_miserables.txt', 'r', encoding='utf-8') as f:  # explicit encoding: the text contains accented characters
    text = f.read()

G = nx.read_edgelist('data/character_network.txt')

# Semantic embeddings (example: Word2Vec)
# Preprocess text to treat character names as single tokens
characters = list(G.nodes())
# ... tokenize and train Word2Vec ...

# Structural embeddings (Node2Vec)
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200)
structural_model = node2vec.fit(window=10, min_count=1)

# Get embeddings for same characters
semantic_embeds = {}  # character -> semantic embedding
structural_embeds = {char: structural_model.wv[char] for char in characters}

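# Compare nearest neighbors
k = 5  # neighborhood size; adjust as needed
for char in characters:
    sem_neighbors = []  # TODO: find k nearest in semantic space
    # k nearest in structural space, via gensim's cosine-similarity lookup
    struct_neighbors = [name for name, _ in structural_model.wv.most_similar(char, topn=k)]
    overlap = len(set(sem_neighbors) & set(struct_neighbors))
    print(f"{char}: {overlap}/{k} overlap")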

# Visualize with PCA
# ...
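
One way to handle the "character names as single tokens" step is sketched below. It is a minimal sketch, not the required method: the alias map and the function name tokenize_with_characters are hypothetical, and the node labels shown (such as "MmeThenardier") are assumptions that you should check against G.nodes() in your own network file.

```python
import re

# Hypothetical alias map: surface forms in the text -> node label in the network.
# The labels on the right are assumptions; verify them against G.nodes().
ALIASES = {
    "Jean Valjean": "Valjean",
    "Monsieur Madeleine": "Valjean",
    "Madame Thénardier": "MmeThenardier",
}

def tokenize_with_characters(text, aliases):
    """Split text into sentences of tokens, keeping each character as one token."""
    # Replace longer aliases first so a long form wins over a shorter overlapping one.
    for alias in sorted(aliases, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(alias) + r"\b", aliases[alias], text)
    labels = set(aliases.values())
    sentences = []
    for raw in re.split(r"[.!?]+", text):
        tokens = re.findall(r"\w+", raw)
        # Keep character labels as written; lowercase every other token.
        tokens = [t if t in labels else t.lower() for t in tokens]
        if tokens:
            sentences.append(tokens)
    return sentences

sample = "Jean Valjean met Madame Thénardier. Monsieur Madeleine was the mayor!"
sentences = tokenize_with_characters(sample, ALIASES)

# The token lists can then be passed to Word2Vec, for example
# (gensim 4.x parameter names):
#   semantic_model = Word2Vec(sentences, vector_size=64, window=10, min_count=1)
```

Splitting on sentence punctuation is a crude choice; a sliding window over paragraphs or a proper sentence tokenizer may give better context for characters who are mentioned rarely.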

The Rules

Time: 60 minutes of work, followed by presentations.

Two Embeddings Required: Generate both semantic (text-based) and structural (graph-based) embeddings. Document methods used.

Same Characters: Use consistent character sets in both embedding spaces to enable comparison.
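
The sketch below shows one way to enforce a shared character set before comparing neighborhoods. It assumes both embeddings are stored as dictionaries mapping a character name to a vector, as in the kickstarter code; the helper names restrict_to_shared and k_nearest are hypothetical, and cosine similarity is an assumed choice of metric. The vectors are toy values, not real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (lists or NumPy arrays)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def restrict_to_shared(semantic_embeds, structural_embeds):
    """Keep only characters that have a vector in both spaces."""
    shared = sorted(semantic_embeds.keys() & structural_embeds.keys())
    return ({c: semantic_embeds[c] for c in shared},
            {c: structural_embeds[c] for c in shared})

def k_nearest(embeds, char, k):
    """Return the k characters most similar to `char`, excluding `char` itself."""
    scores = [(other, cosine(embeds[char], vec))
              for other, vec in embeds.items() if other != char]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scores[:k]]

# Toy vectors for illustration only.
sem = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [1.0, 1.0]}
struct = {"A": [0.0, 1.0], "B": [1.0, 0.0], "C": [0.1, 0.9]}

sem_shared, struct_shared = restrict_to_shared(sem, struct)
sem_nn = k_nearest(sem_shared, "A", k=1)
struct_nn = k_nearest(struct_shared, "A", k=1)
```

Restricting first matters because a character missing from one space could otherwise appear as a neighbor in the other and distort the overlap counts.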

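Identity Crisis Discovery: Identify at least three characters with significant discrepancies. Quantify each discrepancy, for example with the Jaccard distance between a character's nearest-neighbor sets in the two spaces, or with a correlation measure.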
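
As a minimal sketch of the Jaccard option, the code below scores each character by the Jaccard distance between its neighbor sets in the two spaces and ranks characters from largest to smallest gap. The function names are hypothetical, and the neighbor lists are made-up placeholders, not results from the novel.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def rank_reputation_gaps(sem_neighbors, struct_neighbors):
    """Rank characters present in both mappings by neighbor-set distance, largest first."""
    shared = sem_neighbors.keys() & struct_neighbors.keys()
    gaps = {c: jaccard_distance(sem_neighbors[c], struct_neighbors[c]) for c in shared}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)

# Placeholder neighbor lists for illustration only.
sem_neighbors = {"A": ["B", "C", "D"], "B": ["A", "C", "D"]}
struct_neighbors = {"A": ["B", "C", "E"], "B": ["A", "C", "D"]}

ranked = rank_reputation_gaps(sem_neighbors, struct_neighbors)
```

A distance of 0 means identical neighborhoods and 1 means no shared neighbors. The score depends on k, so report the k you used alongside it.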

Evaluation

The class votes on the most compelling reputation gap:

Magnitude (30%): How large is the discrepancy between neighborhoods?

Interpretability (40%): Does the gap illuminate something true about the character’s dual roles?

Presentation (30%): Clear explanations with effective visualizations (e.g., 2D projections with lines connecting each character's positions in the two spaces).
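
The two embedding spaces have unrelated axes, so drawing both projections on one plot only makes sense after some alignment. The sketch below is one possible approach under stated assumptions: it projects each space to 2D with PCA (computed here with NumPy's SVD; sklearn's PCA would serve equally well), then rotates and rescales one projection onto the other with orthogonal Procrustes before drawing the connecting lines. The function names are hypothetical and the input is random toy data, not real embeddings.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their top two principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

def align(source, target):
    """Center and scale both point sets, then rotate/reflect source onto target."""
    S = source - source.mean(axis=0)
    T = target - target.mean(axis=0)
    S = S / np.linalg.norm(S)
    T = T / np.linalg.norm(T)
    # Orthogonal Procrustes: the rotation minimizing ||S R - T|| is R = U V^T.
    U, _, Vt = np.linalg.svd(S.T @ T)
    return S @ (U @ Vt), T

def plot_gaps(names, sem_2d, struct_2d):
    """Scatter both spaces and connect each character's two positions."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(sem_2d[:, 0], sem_2d[:, 1], label="semantic")
    ax.scatter(struct_2d[:, 0], struct_2d[:, 1], marker="s", label="structural")
    for name, p, q in zip(names, sem_2d, struct_2d):
        ax.plot([p[0], q[0]], [p[1], q[1]], color="gray", linewidth=0.5)
        ax.annotate(name, p)
    ax.legend()
    return fig

# Random toy data; rows must be in the same character order in both arrays.
rng = np.random.default_rng(0)
names = ["A", "B", "C", "D", "E", "F"]
sem = rng.normal(size=(6, 16))
struct = rng.normal(size=(6, 8))

struct_2d, sem_2d = align(pca_2d(struct), pca_2d(sem))
# fig = plot_gaps(names, sem_2d, struct_2d); fig.savefig("gaps.png")
```

Treat the line lengths as a visual aid only: a 2D projection discards most of the variance, so the quantitative ranking should still come from the neighbor-set comparison in the full embedding spaces.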

Deliverables

Your submission should include:

Embeddings: Both semantic and structural embeddings for characters
Analysis Code: Script or notebook computing and comparing embeddings
Visualizations: 2D projections showing both spaces with gap illustrations
Report: A brief markdown document explaining:
- Embedding methods used for each space
- Three characters with significant reputation gaps
- Interpretation of what these gaps reveal about the story

Submission

Use the provided template: https://github.com/sk-classroom/sprint-project-template
Follow the template instructions to create your project repository
Place data and embeddings in the data folder, code in the notebooks folder
Include visualizations comparing both embedding spaces
Write your report in README.md
Submit the link to your GitHub repository to Brightspace