Sprint Project: The Les Misérables Identity Crisis
You have 60 minutes to create two rival embeddings of Les Misérables characters—one semantic (from text), one structural (from co-occurrence network)—and find the biggest “reputation gaps.” Then present your discoveries to the class for voting.
The Challenge
You receive the full text of Les Misérables and a character co-occurrence network. Create semantic embeddings using text-based methods (Word2Vec, BERT, etc.) and structural embeddings using graph methods (Node2Vec, DeepWalk). Compare nearest neighbors in both spaces. Find characters whose textual associations differ drastically from their network position. Who is Valjean close to in narrative versus structure?
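For the text-based side, each character has to end up as a single token. Otherwise a name like "Jean Valjean" is split in two and no single vector represents him. The following is a minimal sketch of that preprocessing step using only the standard library. The helper names `merge_character_names` and `to_sentences`, the sample sentence, and the name list are illustrative assumptions, not part of the provided data. The names in your network file may also be spelled differently from how they appear in the novel.

```python
import re


def merge_character_names(text, names):
    """Replace each multi-word name with one underscore-joined token."""
    # Longest names first, so "Jean Valjean" is merged before "Valjean" is seen.
    for name in sorted(names, key=len, reverse=True):
        token = name.replace(" ", "_")
        pattern = r"\b" + re.escape(name) + r"\b"
        text = re.sub(pattern, token, text)
    return text


def to_sentences(text):
    """Split text into sentences, then each sentence into word tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # \w keeps underscores and accented letters, so merged names survive intact.
    return [re.findall(r"\w+", s) for s in sentences if s.strip()]


# Hypothetical sample input, for illustration only.
sample = "Jean Valjean met Javert. The Bishop Myriel forgave Jean Valjean!"
names = ["Jean Valjean", "Javert", "Bishop Myriel"]

corpus = to_sentences(merge_character_names(sample, names))
for sentence in corpus:
    print(sentence)
```

The resulting list of token lists is the input shape that gensim's `Word2Vec` accepts as its `sentences` argument. Case is preserved here so that capitalized names stay distinct from ordinary words; whether to lowercase is a design choice left to you.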
Kickstarter Code
import pandas as pd
import networkx as nx
from gensim.models import Word2Vec
from node2vec import Node2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load text and network
with open('data/les_miserables.txt', 'r') as f:
    text = f.read()
G = nx.read_edgelist('data/character_network.txt')
# Semantic embeddings (example: Word2Vec)
# Preprocess text to treat character names as single tokens
characters = list(G.nodes())
# ... tokenize and train Word2Vec ...
# Structural embeddings (Node2Vec)
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200)
structural_model = node2vec.fit(window=10, min_count=1)
# Get embeddings for same characters
semantic_embeds = {} # character -> semantic embedding
structural_embeds = {char: structural_model.wv[char] for char in characters}
# Compare nearest neighbors
k = 5  # example value: number of nearest neighbors to compare
for char in characters:
    sem_neighbors = []  # TODO: find k nearest in semantic space
    struct_neighbors = [name for name, _ in structural_model.wv.most_similar(char, topn=k)]
    overlap = len(set(sem_neighbors) & set(struct_neighbors))
    print(f"{char}: {overlap}/{k} overlap")
# Visualize with PCA
# ...
The Rules
Time: 60 minutes of work, followed by presentations.
Two Embeddings Required: Generate both semantic (text-based) and structural (graph-based) embeddings. Document methods used.
Same Characters: Use consistent character sets in both embedding spaces to enable comparison.
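Identity Crisis Discovery: Identify at least three characters with significant discrepancies. Quantify each gap, for example with the Jaccard distance between a character’s nearest-neighbor sets in the two spaces, or with a correlation between its similarity scores in each space.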
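One way to quantify a gap, sketched here under stated assumptions, is to take each character's k nearest neighbors by cosine similarity in both spaces and compute the Jaccard distance between the two neighbor sets. The function names, the placeholder character labels, and the toy 2-D vectors below are made up for illustration and are not derived from the novel or the network.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(embeds, char, k):
    """Return the k characters most cosine-similar to `char`."""
    scores = [(cosine(embeds[char], vec), other)
              for other, vec in embeds.items() if other != char]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scores[:k]]


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; 0 means identical sets, 1 means disjoint sets."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def reputation_gaps(semantic, structural, k):
    """Rank characters present in both spaces by neighbor-set Jaccard distance."""
    shared = sorted(set(semantic) & set(structural))
    sem = {c: semantic[c] for c in shared}
    struct = {c: structural[c] for c in shared}
    gaps = {c: jaccard_distance(nearest_neighbors(sem, c, k),
                                nearest_neighbors(struct, c, k))
            for c in shared}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Toy vectors with placeholder labels, for illustration only.
semantic = {"char_a": [1.0, 0.0], "char_b": [0.9, 0.2],
            "char_c": [0.1, 1.0], "char_d": [0.0, 0.9]}
structural = {"char_a": [1.0, 0.0], "char_b": [0.8, 0.3],
              "char_c": [0.9, 0.1], "char_d": [0.0, 1.0]}

for name, gap in reputation_gaps(semantic, structural, k=2):
    print(f"{name}: Jaccard distance = {gap:.2f}")
```

The result depends on the choice of k: a small k makes the score jumpy, while a k close to the number of characters forces every distance toward zero. Reporting the ranking for two or three values of k is a cheap way to check that a gap is not an artifact of one setting.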
Evaluation
The class votes on the most compelling reputation gap:
Magnitude (30%): How large is the discrepancy between neighborhoods?
Interpretability (40%): Does the gap illuminate something true about the character’s dual roles?
Presentation (30%): Clear explanations with effective visualizations (e.g., 2D projections with lines connecting each character’s positions in the two spaces).
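A caveat for the connecting-lines plot: two separately computed 2D projections have arbitrary orientation and scale, so a line between a character's two positions means little unless the spaces are aligned first. One option, shown here as a sketch and not as a requirement of the assignment, is to project each space with PCA and then rotate one projection onto the other with orthogonal Procrustes. The function names are hypothetical, and random matrices stand in for real embeddings; rows are assumed to be in the same character order in both matrices.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T


def align(source, target):
    """Center and rescale both point sets, then rotate/reflect `source`
    onto `target` (orthogonal Procrustes). Returns (aligned_source, target)."""
    def normalise(P):
        P = P - P.mean(axis=0)
        return P / np.linalg.norm(P)

    S, T = normalise(source), normalise(target)
    # The orthogonal R minimising ||S @ R - T|| is U @ Vt from the SVD of S.T @ T.
    U, _, Vt = np.linalg.svd(S.T @ T)
    return S @ (U @ Vt), T


def plot_gaps(sem_2d, struct_2d, names, path="gaps.png"):
    """Scatter both spaces and join each character's two positions with a line."""
    import matplotlib
    matplotlib.use("Agg")  # write to file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(7, 7))
    ax.scatter(sem_2d[:, 0], sem_2d[:, 1], label="semantic")
    ax.scatter(struct_2d[:, 0], struct_2d[:, 1], marker="^", label="structural")
    for (x0, y0), (x1, y1), name in zip(sem_2d, struct_2d, names):
        ax.plot([x0, x1], [y0, y1], color="gray", linewidth=0.8)
        ax.annotate(name, (x0, y0))
    ax.legend()
    fig.savefig(path, dpi=150)
    plt.close(fig)


# Random stand-ins for real embeddings, for illustration only.
rng = np.random.default_rng(0)
names = ["char_a", "char_b", "char_c", "char_d", "char_e"]
semantic = rng.normal(size=(5, 16))
structural = rng.normal(size=(5, 64))

struct_2d, sem_2d = align(pca_2d(structural), pca_2d(semantic))
# plot_gaps(sem_2d, struct_2d, names)  # uncomment to write gaps.png
```

Even after alignment, line length in a 2D projection is only a visual aid, since most of the variance in 64 dimensions is discarded. A neighbor-set metric computed in the full spaces is the safer basis for any quantitative claim.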
Deliverables
Your submission should include:
- Embeddings: Both semantic and structural embeddings for characters
- Analysis Code: Script or notebook computing and comparing embeddings
- Visualizations: 2D projections showing both spaces with gap illustrations
- Report: A brief markdown document explaining:
  - Embedding methods used for each space
  - Three characters with significant reputation gaps
  - Interpretation of what these gaps reveal about the story
Submission
- Use the provided template: https://github.com/sk-classroom/sprint-project-template
- Follow the template instructions to create your project repository
- Place data and embeddings in the data folder, code in the notebooks folder
- Include visualizations comparing both embedding spaces
- Write your report in README.md
- Submit the link to your GitHub repository to Brightspace