# Graph and Set Networks
> Message passing on irregular structures -- from spectral convolutions on fixed graphs to attention over molecules to permutation-invariant processing of point clouds and sets.

## Overview

Graph neural networks (GNNs) generalize deep learning to data that lives on nodes and edges rather than in grids or sequences. Social networks, molecules, knowledge bases, protein structures, and 3D point clouds all have graph structure that convolutions and recurrences cannot naturally exploit. The GNN family in Edifice provides eight graph architectures and two set-processing architectures that handle these irregular domains.

The unifying abstraction is **message passing**: each node gathers information from its neighbors, transforms it, and updates its own representation. What varies across architectures is how messages are computed (linear projection, attention, continuous filters), how they are aggregated (sum, mean, max, or multiple aggregators), and how expressive the resulting model is (bounded by the Weisfeiler-Lehman graph isomorphism hierarchy).

Sets can be viewed as a special case: graphs with no edges, where the only structural constraint is permutation invariance. DeepSets and PointNet achieve this through symmetric aggregation functions (sum, max) applied after per-element transformations, with PointNet adding spatial alignment networks for 3D geometric data.

## Conceptual Foundation

The message passing framework decomposes a GNN layer into three stages:

    1. Message:    m_{ij} = MSG(h_i, h_j, e_{ij})    -- compute message from j to i
    2. Aggregate:  M_i    = AGG({m_{ij} : j in N(i)}) -- collect messages at node i
    3. Update:     h_i'   = UPD(h_i, M_i)             -- update node i's features

The key equation for the simplest case (GCN) reduces message passing to a single matrix operation:

    H' = sigma(D^{-1/2} A D^{-1/2} H W)

where A is the adjacency matrix (with self-loops), D is the degree matrix, H is the node feature matrix, W is a learnable weight matrix, and sigma is a nonlinearity. This formulation comes from approximating spectral graph convolutions with first-order Chebyshev polynomials.

For sets without edges, the message passing collapse to:

    output = rho( AGG_i( phi(x_i) ) )

where phi processes each element independently and AGG is a symmetric function (sum, mean, max). This is provably a universal approximator for permutation-invariant functions.

## Architecture Evolution

```
2005  Spectral Graph Theory foundations
  |     (Bruna et al., spectral convolutions on graphs)
  |
2017  GCN (Kipf & Welling, ICLR)
  |     - First-order Chebyshev approximation
  |     - Normalized adjacency propagation
  |     [GCN]
  |
2017  GraphSAGE (Hamilton et al., NeurIPS)       PointNet (Qi et al., CVPR)
  |     - Inductive learning via                    - Max pool over points
  |       neighborhood sampling                     - T-Net spatial alignment
  |     [GraphSAGE]                                 [PointNet]
  |
2017  MPNN (Gilmer et al.)                       DeepSets (Zaheer et al.)
  |     - Unifying message passing framework        - Permutation-invariant
  |     [MessagePassing]                              universal approximator
  |                                                 [DeepSets]
  |
2017  SchNet (Schutt et al., NeurIPS)
  |     - Continuous-filter convolutions
  |     - RBF distance expansion for molecules
  |     [SchNet]
  |
2018  GAT (Velickovic et al., ICLR)
  |     - Learned attention weights per edge
  |     - Multi-head attention on graphs
  |     [GAT]
  |
2019  GIN (Xu et al., ICLR)
  |     - Provably maximally powerful under WL test
  |     - Sum aggregation + learnable epsilon
  |     [GIN]
  |
2020  PNA (Corso et al., NeurIPS)
  |     - Multiple aggregators + degree scalers
  |     - Principled aggregation diversity
  |     [PNA]
  |
2021  Graph Transformer (Dwivedi & Bresson; Ying et al.)
        - Full attention over graph nodes
        - Adjacency as attention bias/mask
        [GraphTransformer]
```

## When to Use What

| Scenario | Module | Rationale |
|----------|--------|-----------|
| Simple, homogeneous graph (citation, social) | `GCN` | Fast, well-understood, good default |
| Heterogeneous neighbor importance | `GAT` | Learned attention weights adapt to each edge |
| Graph classification (isomorphism-sensitive) | `GIN` | Provably 1-WL expressive; sum aggregation distinguishes multisets |
| Large/evolving graphs, inductive setting | `GraphSAGE` | Neighborhood sampling; works on unseen nodes |
| Maximum aggregation expressiveness | `PNA` | Multiple aggregators x scalers capture diverse structural patterns |
| Molecular property prediction | `SchNet` | Continuous-filter convolutions on interatomic distances |
| Graph + global context needed | `GraphTransformer` | Full attention (not just neighbors); adjacency biases structure |
| Custom message/aggregate/update | `MessagePassing` | Generic MPNN framework; build your own GNN variant |
| Permutation-invariant set function | `DeepSets` | Simplest provably universal set architecture |
| 3D point cloud classification | `PointNet` | Per-point MLP + max pool + T-Net spatial alignment |

### Expressiveness Hierarchy

The Weisfeiler-Lehman (WL) graph isomorphism test provides a theoretical ceiling for message passing GNNs. The hierarchy from least to most expressive:

```
Expressiveness of message passing GNNs:

  GCN (mean aggregation)
    |  Cannot distinguish certain regular graphs
    |
  GAT (attention-weighted aggregation)
    |  Adaptive weighting, but still bounded by 1-WL
    |
  GIN (sum aggregation + learnable epsilon)  <==>  1-WL test
    |  Maximally powerful among standard MPNN architectures
    |
  PNA (multiple aggregators + scalers)
    |  Captures more structural info than any single aggregator
    |
  GraphTransformer (full attention + positional encoding)
       Can approximate higher-order WL with proper positional encoding
```

## Key Concepts

### Spectral vs Spatial Approaches

Graph convolutions originated from spectral graph theory: define convolution as multiplication in the graph Fourier domain (eigenvectors of the graph Laplacian). GCN approximates this with a first-order Chebyshev polynomial, yielding the one-hop neighbor aggregation rule.

Spatial approaches (GAT, GraphSAGE, GIN) operate directly in the node domain without referencing the spectrum. They define operations in terms of neighborhoods: for each node, look at its neighbors, compute something, and aggregate. Spatial methods are more flexible (they handle varying graph structures, inductive settings, and edge features more naturally) and have largely superseded purely spectral methods.

```
  Spectral                                Spatial
  --------                                -------
  Defined via graph Laplacian             Defined via neighborhoods
  Global operation (eigendecomposition)   Local operation (message passing)
  Fixed graph structure required          Handles dynamic/unseen graphs
  Smooth spectral filters                 Arbitrary message functions
  Example: GCN (1st-order approx)         Examples: GAT, GIN, GraphSAGE

  In practice, GCN is used as a spatial method -- the spectral
  derivation justifies the normalized adjacency propagation rule,
  but the implementation is purely spatial (matrix multiply with A).
```

### The Message Passing Framework

The `MessagePassing` module implements the generic MPNN framework that all other graph modules specialize. Understanding it clarifies the design space:

**Message function**: How does node j contribute to node i? Options range from simple (linear projection of j's features, as in GCN) to complex (attention-weighted projection using both i and j, as in GAT) to continuous (distance-dependent filter, as in SchNet).

**Aggregation function**: How are messages from multiple neighbors combined? Sum preserves multiset information (GIN). Mean normalizes by degree (GCN). Max captures the most salient neighbor (GraphSAGE with `:max`). PNA uses all of the above simultaneously plus standard deviation.

**Update function**: How does node i incorporate the aggregated messages? Simple approaches add or concatenate messages with self-features then project. GraphSAGE concatenates and L2-normalizes. GIN uses (1 + epsilon) * self + aggregated, passed through an MLP.

```
Message Passing decomposition for each GNN variant:

  GCN:       MSG = W * h_j          AGG = mean      UPD = sigma(agg)
  GAT:       MSG = alpha_ij * W*h_j AGG = sum        UPD = sigma(agg)
  GIN:       MSG = h_j              AGG = sum        UPD = MLP((1+eps)*h_i + agg)
  GraphSAGE: MSG = h_j              AGG = mean/max   UPD = W*[h_i || agg], L2-norm
  SchNet:    MSG = filter(d_ij)*h_j AGG = sum        UPD = MLP(agg) + h_i
  PNA:       MSG = h_j              AGG = [mean,max, UPD = project(concat(all))
                                          sum,std]
```

### Domain-Specific Architectures: SchNet and PointNet

**SchNet** is designed for molecular graphs where edges represent interatomic distances rather than binary connections. It uses radial basis functions (RBFs) to expand continuous distances into a feature vector, then generates per-edge filter weights from these RBF features. This continuous-filter approach means SchNet can distinguish atoms that are 2.1 angstroms apart from those 2.3 angstroms apart -- critical for chemistry where bond lengths determine molecular properties.

The cosine cutoff envelope smoothly zeros out interactions beyond a distance threshold, ensuring locality and preventing discontinuities in the energy surface.

**PointNet** processes 3D point clouds (unordered sets of xyz coordinates) for classification and segmentation. Its key contribution is the T-Net: a mini-PointNet that predicts a spatial transformation matrix, applied to the input to canonicalize the orientation. This makes the model robust to rotations and translations. After per-point feature extraction through shared MLPs, a global max pool creates a permutation-invariant representation.

### Sets as Edgeless Graphs

DeepSets and PointNet share the permutation-invariance constraint: the output must not change when input elements are reordered. DeepSets proves that any continuous permutation-invariant function on sets can be decomposed as:

    f({x_1, ..., x_n}) = rho( SUM_i phi(x_i) )

where phi processes elements independently and rho processes the aggregate. This is the theoretical foundation underlying both DeepSets (which implements it directly) and PointNet (which uses max instead of sum, and adds spatial alignment).

```
  Set processing as a special case of message passing:

  Graph (with edges):     Node i gathers from neighbors N(i)
  Set (no edges):         Node i gathers from ALL other nodes (or none)
  DeepSets:               No inter-element interaction; phi per element, then aggregate
  PointNet:               Same as DeepSets + learned spatial alignment (T-Net)

  The choice of symmetric aggregation function matters:

  SUM:   Preserves count information; |{a,a,b}| != |{a,b}|
  MEAN:  Normalized; loses count but more stable for variable-size sets
  MAX:   Captures extrema; ignores multiplicity entirely
```

## Complexity Comparison

| Module | Per-Layer Compute | Memory (adjacency) | Supports Inductive | Global Pooling |
|--------|------------------|--------------------|--------------------|----------------|
| `GCN` | O(E * d + N * d^2) | O(N^2) dense | No (spectral) | Via `MessagePassing` |
| `GAT` | O(E * d + N * d^2) | O(N^2) dense | Yes | Via `MessagePassing` |
| `GIN` | O(E * d + N * d^2) | O(N^2) dense | Yes | Built-in option |
| `GraphSAGE` | O(E * d + N * d^2) | O(N^2) dense | Yes (designed for it) | Built-in option |
| `PNA` | O(E * d * A * S + N * d^2) | O(N^2) dense | Yes | Built-in option |
| `SchNet` | O(E * R * d + N * d^2) | O(N^2) dense (distances) | Yes | Built-in option |
| `GraphTransformer` | O(N^2 * d) | O(N^2) dense | Yes | Built-in option |
| `MessagePassing` | O(E * d + N * d^2) | O(N^2) dense | Yes | `global_pool/2` |
| `DeepSets` | O(N * d^2) | None needed | N/A (no graph) | Built-in (sum/mean/max) |
| `PointNet` | O(N * d^2 + T-Net) | None needed | N/A (no graph) | Built-in (max) |

Where N = nodes, E = edges, d = hidden dimension, A = number of aggregators (PNA), S = number of scalers (PNA), R = number of RBF centers (SchNet).

Note: Edifice uses dense adjacency matrices (O(N^2) memory) because Nx does not support sparse tensors. For large graphs, batch subgraphs or use neighborhood sampling.

## Module Reference

- `Edifice.Graph.GCN` -- Spectral graph convolutions via normalized adjacency; node and graph classification with optional global pooling
- `Edifice.Graph.GAT` -- Multi-head attention over graph neighbors with LeakyReLU scoring; concatenated or averaged heads
- `Edifice.Graph.GIN` -- Maximally expressive MPNN (1-WL equivalent) using sum aggregation and learnable epsilon weighting
- `Edifice.Graph.GraphSAGE` -- Inductive graph learning with mean/max/pool aggregation, self-concatenation, and L2 normalization
- `Edifice.Graph.GraphTransformer` -- Full attention over graph nodes with adjacency-biased scores and random-walk positional encoding
- `Edifice.Graph.PNA` -- Principal Neighbourhood Aggregation combining mean/max/sum/std aggregators with identity/amplification/attenuation scalers
- `Edifice.Graph.SchNet` -- Continuous-filter convolutions for molecular graphs using RBF distance expansion and cosine cutoff envelopes
- `Edifice.Graph.MessagePassing` -- Generic MPNN framework with configurable message/aggregation/update functions and global pooling utilities
- `Edifice.Sets.DeepSets` -- Permutation-invariant set functions via independent phi network, symmetric aggregation, and rho post-processing
- `Edifice.Sets.PointNet` -- 3D point cloud processing with per-point shared MLPs, optional T-Net spatial alignment, and max pooling

## Cross-References

- **attention_mechanisms.md** -- GraphTransformer uses full multi-head attention with adjacency as a structural bias; the attention mechanism is the same as in standard transformers
- **building_blocks.md** -- Graph architectures can incorporate RMSNorm, SwiGLU, and other blocks as internal components of their message and update functions
- **meta_learning.md** -- GNNs compose naturally with Mixture-of-Experts for routing-based architectures on graph-structured inputs

## Further Reading

1. Kipf & Welling, "Semi-Supervised Classification with Graph Convolutional Networks" (ICLR 2017) -- arxiv.org/abs/1609.02907. The foundational GCN paper deriving graph convolution from spectral theory.
2. Velickovic et al., "Graph Attention Networks" (ICLR 2018) -- arxiv.org/abs/1710.10903. Introduces learned attention weights for neighbor aggregation.
3. Xu et al., "How Powerful are Graph Neural Networks?" (ICLR 2019) -- arxiv.org/abs/1810.00826. Proves GIN matches 1-WL expressiveness and analyzes limitations of other GNNs.
4. Gilmer et al., "Neural Message Passing for Quantum Chemistry" (ICML 2017) -- arxiv.org/abs/1704.01212. Unifying MPNN framework that subsumes GCN, GAT, and SchNet.
5. Corso et al., "Principal Neighbourhood Aggregation for Graph Nets" (NeurIPS 2020) -- arxiv.org/abs/2004.05718. Systematic analysis showing that combining diverse aggregators with degree scalers maximizes expressiveness.