Sean's Blog

Structure of Neural Embeddings

A small collection of insights on the structure of embeddings (latent spaces) produced by deep neural networks. Embeddings are collections of vectors (lists of numbers) arranged in a meaningful way - the distances between vectors represent relationships between the things they encode.
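As a minimal illustration of "relationships via distances" (the vectors and words below are made up, not outputs of a real model), nearby vectors stand for related concepts:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated, -1.0 = opposite."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings (illustrative values only).
cat = np.array([0.9, 0.1, 0.3, 0.0])
dog = np.array([0.8, 0.2, 0.4, 0.1])
car = np.array([0.0, 0.9, 0.1, 0.8])

print(cosine_similarity(cat, dog))  # high: related concepts sit close together
print(cosine_similarity(cat, car))  # lower: unrelated concepts sit further apart
```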

General Principles

Manifold Hypothesis: High-dimensional data sampled from natural (real-world) processes lies on a low-dimensional manifold. A manifold is a space that looks like flat (Euclidean) space when zooming in on any of its points, e.g. the Earth's surface (a 2D manifold of a 3D object) looks like flat 2D space when zooming in.
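A small sketch of the "looks flat when zooming in" intuition, assuming scikit-learn is available: the swiss roll is a 2D sheet curled up in 3D, and a local PCA on a small neighbourhood has only two dominant directions.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

# The swiss roll: a 2D sheet rolled up in 3D space.
X, _ = make_swiss_roll(n_samples=5000, noise=0.01, random_state=0)

# Globally the data needs all 3 coordinates, but a zoomed-in patch looks ~2D:
# run PCA (via SVD) on the nearest neighbours of one point.
point = X[0]
dists = np.linalg.norm(X - point, axis=1)
patch = X[np.argsort(dists)[:30]]               # a small local neighbourhood
centered = patch - patch.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
print(singular_values / singular_values.sum())  # two dominant values, third much smaller
```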

Hierarchical Organization: Features organize hierarchically across layers - earlier layers capture local, low-level (small context) features while deeper layers represent increasingly abstract and compositional (large context) concepts by integrating information from earlier layers.

Linear Representation Hypothesis: Neural networks represent features as linear directions in their activation space, so that semantic operations correspond to vector arithmetic (e.g. queen ≈ king - man + woman).
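A sketch of the classic analogy test, assuming gensim and its downloadable GloVe vectors are available (the package, the model name, and the download step are assumptions about the setup, not part of this post):

```python
# pip install gensim  (downloads ~66 MB of pretrained word vectors on first use)
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # pretrained word embeddings

# "king" - "man" + "woman" should land near "queen" if the gender feature
# is (approximately) a linear direction in the embedding space.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```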

Superposition Hypothesis: Neural nets represent more “independent” features than a layer has neurons (dimensions) by encoding each feature as a direction spanning several neurons; features overlap and are only approximately orthogonal, resulting in dense vectors.
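A toy sketch of why this works at all: random directions in a high-dimensional space are nearly orthogonal, so far more features than dimensions can coexist with little interference, and sparse combinations of them remain decodable.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 256, 1024          # far more "features" than neurons/dimensions

# Random feature directions in high dimensions are nearly orthogonal.
directions = rng.normal(size=(n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

interference = directions @ directions.T - np.eye(n_features)
print(np.abs(interference).max())  # well below 1: features barely interfere

# A layer activation as a superposition of a few active features.
active = [3, 71, 200]
activation = directions[active].sum(axis=0)

# Dot products approximately recover which features were active.
scores = directions @ activation
print(np.sort(np.argsort(scores)[-3:]))  # ≈ the active feature indices
```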

Entangled Representation Hypothesis: Neural networks optimised with gradient descent tend to develop redundant, fractured features instead of unified, modular representations that could be reused and controlled separately.

Universality Hypothesis: Neural circuits reappear across different models trained on similar data: models trained on the same modality (text, images, video, etc.) share the same relative semantic structure (angles, distances between concepts) in their embeddings (activations), even if absolute positions/orientations differ. This suggests embeddings converge to a "universal" semantic manifold for a given modality/data distribution.
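One way to probe the "same relative structure" claim is to embed the same items with two different models and correlate their pairwise similarity matrices. The sketch below uses random placeholder arrays (an assumption) where real model outputs would go, so it prints ≈ 0; under the hypothesis, real embeddings of the same items should give a value close to 1.

```python
import numpy as np

def pairwise_cosine(E: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities of row vectors."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E @ E.T

# Embeddings of the *same* items from two different models
# (random placeholders here - in practice, load real encoder outputs).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(100, 384))
emb_b = rng.normal(size=(100, 768))

sim_a = pairwise_cosine(emb_a)
sim_b = pairwise_cosine(emb_b)

# Correlate the two similarity matrices (upper triangles): a high value means
# the models agree on relative structure even though their absolute
# coordinate systems differ. With the random placeholders above it is ≈ 0.
iu = np.triu_indices_from(sim_a, k=1)
print(np.corrcoef(sim_a[iu], sim_b[iu])[0, 1])
```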

Smoothness (Lipschitz continuity): Small changes in inputs cause proportionally bounded changes in output (latent) space - formally, ||f(x₁) - f(x₂)|| ≤ L||x₁ - x₂|| for some constant L. Well-trained (robust) models should ideally be insensitive to small input changes - which has its limits in practice (see next point).
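A rough way to probe this empirically is to estimate a local Lipschitz ratio with small random perturbations; the encoder below is a toy stand-in (an assumption), but f could equally wrap a real network's forward pass.

```python
import numpy as np

def empirical_lipschitz_ratio(f, x, n_trials=100, eps=1e-3, seed=0):
    """Estimate max ||f(x + δ) - f(x)|| / ||δ|| over small random perturbations δ."""
    rng = np.random.default_rng(seed)
    fx = f(x)
    ratios = []
    for _ in range(n_trials):
        delta = rng.normal(size=x.shape)
        delta *= eps / np.linalg.norm(delta)
        ratios.append(np.linalg.norm(f(x + delta) - fx) / np.linalg.norm(delta))
    return max(ratios)

# Toy "encoder": a fixed random linear map followed by tanh.
W = np.random.default_rng(1).normal(size=(16, 64))
f = lambda x: np.tanh(W @ x)
x = np.random.default_rng(2).normal(size=64)
print(empirical_lipschitz_ratio(f, x))  # a (loose) local lower bound on L
```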

Adversarial Vulnerability: Carefully crafted small changes in input space can cause large shifts in embedding space and therefore also in predictions, suggesting even networks trained for smoothness show some chaotic properties.
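A minimal FGSM-style (fast gradient sign method) sketch in PyTorch with a tiny untrained model - all names and sizes are illustrative. The same mechanics apply to trained networks, where carefully chosen small perturbations shift embeddings and predictions far more dramatically.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))  # toy classifier

x = torch.randn(1, 32, requires_grad=True)   # a single input
label = torch.tensor([3])                    # its (arbitrary) true class

# Gradient of the loss w.r.t. the *input*, not the weights.
loss = nn.functional.cross_entropy(model(x), label)
loss.backward()

eps = 0.05                                   # small per-coordinate perturbation budget
x_adv = x + eps * x.grad.sign()              # step in the locally worst-case direction

with torch.no_grad():
    print(torch.norm(model(x_adv) - model(x)))  # logit shift from a ~5% input change
```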

Neural Collapse: After extensive training, class features in the final layer cluster tightly around their means, with the network's classification weights aligning with these mean directions. Within-class variation becomes minimal compared to between-class differences, effectively creating distinct, well-separated clusters for each class.
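A simple way to quantify this is the ratio of within-class to between-class variance of final-layer features. The data below is synthetic and already "collapsed" (an assumption, just to show what a small ratio looks like); in a real measurement, feats would be penultimate-layer activations of a trained classifier.

```python
import numpy as np

def collapse_ratio(features: np.ndarray, labels: np.ndarray) -> float:
    """Ratio of within-class to between-class variance; near 0 = neural-collapse-like clustering."""
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        within += ((class_feats - class_mean) ** 2).sum()
        between += len(class_feats) * ((class_mean - global_mean) ** 2).sum()
    return within / between

# Synthetic "collapsed" features: tight clusters around 10 class means.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
feats = np.eye(10)[labels] * 5 + rng.normal(scale=0.1, size=(1000, 10))
print(collapse_ratio(feats, labels))  # small value -> tight, well-separated class clusters
```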

Limits of Dense Embeddings

Most neural network architectures process data as dense vectors, which makes the resulting representations hard for humans to interpret.

A more human-interpretable embedding representation would be sparse (only a few dimensions are active) and spatially meaningful (the position of a dimension encodes information).

This would make embeddings easier for humans to interpret and could offer further benefits: distinguishing single-concept from multi-concept objects, encoding novelty (outliers), increasing robustness, and reducing storage / increasing efficiency. But this is another blog post.
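Still, the core operation is easy to sketch: keep only the top-k entries of each vector and check how much of the neighbourhood structure survives. The helper and the random placeholder data below are assumptions for illustration, not a proposed method.

```python
import numpy as np

def top_k_sparsify(E: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest-magnitude entries of each embedding vector."""
    S = np.zeros_like(E)
    idx = np.argpartition(np.abs(E), -k, axis=1)[:, -k:]
    np.put_along_axis(S, idx, np.take_along_axis(E, idx, axis=1), axis=1)
    return S

# How much nearest-neighbour structure survives aggressive sparsification?
rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 256))
S = top_k_sparsify(E, k=32)           # keep 32 of 256 dimensions per vector

nn_dense = np.argsort(E @ E[0])[-2]   # nearest neighbour of item 0 (dense)
nn_sparse = np.argsort(S @ S[0])[-2]  # nearest neighbour of item 0 (sparsified)
print(nn_dense, nn_sparse)            # agreement hints at redundancy in dense embeddings
```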

Problems with Contrastive Embeddings

Modality Gap: Multi-modal training strategies like CLIP that unify text and images in a shared embedding space suffer from the Modality Gap: text and image vectors form distinct clusters instead of occupying a shared region of the embedding space (relative similarities still work, though). This separation can limit the effectiveness of cross-modal retrieval and transfer tasks.
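The gap itself can be measured as the distance between the centroids of the normalised text and image embeddings; the arrays below are placeholders (an assumption) standing in for real CLIP-style encoder outputs.

```python
import numpy as np

def modality_gap(text_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """Distance between the centroids of L2-normalised text and image embeddings."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    i = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    return float(np.linalg.norm(t.mean(axis=0) - i.mean(axis=0)))

# Placeholder data; in a real measurement these would come from the
# model's text and image encoders run on paired inputs.
rng = np.random.default_rng(0)
text_emb = rng.normal(loc=0.5, size=(500, 512))
image_emb = rng.normal(loc=-0.5, size=(500, 512))
print(modality_gap(text_emb, image_emb))  # clearly > 0 -> two separated clusters
```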

Dimensional Collapse: A phenomenon in contrastive learning where the learned representations tend to occupy a lower-dimensional subspace than intended, effectively "collapsing" along certain dimensions. This results in embeddings that don't fully utilize the available embedding dimensions, leading to highly correlated dimensions rather than capturing independent features.
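A common diagnostic is the singular value spectrum of a batch of embeddings, summarised here as an entropy-based effective rank; the data below is a synthetic low-rank placeholder (an assumption) standing in for real model outputs.

```python
import numpy as np

def effective_rank(E: np.ndarray) -> float:
    """Entropy-based effective rank of the centered embedding matrix:
    roughly how many dimensions the embeddings actually use."""
    s = np.linalg.svd(E - E.mean(axis=0), compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

# Synthetic embeddings confined to a <=64-dimensional subspace of a 256-dim space.
rng = np.random.default_rng(0)
E = rng.normal(size=(2000, 64)) @ rng.normal(size=(64, 256))
print(effective_rank(E), "of", E.shape[1], "dimensions")  # far below 256 -> collapse
```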

TODO:

  • How do image (continuous input space) and text (discrete input space) embedding spaces differ (number of clusters, density, etc.)?
  • How sparse are dense embeddings (how much information do they lose if sparsified -> compare different embeddings based on layer depth)?

References:

#ML