# The Problem Landscape > Different problems have different shapes -- classification, regression, sequence modeling, generation, and structured prediction each demand different architectural choices. ## Why Problems Dictate Architecture It's tempting to think there's one best neural network architecture. There isn't. The reason Edifice has 19 families is that different problems have fundamentally different structures, and the right architecture encodes the right structural assumptions. Choosing an architecture is really about answering: "what does my data look like, and what do I need to predict?" This guide maps the landscape of ML problems and connects each one to the Edifice families designed for it. After reading this, you should be able to look at a new problem and have a reasonable sense of which part of Edifice to reach for. ## The Five Problem Types ### 1. Classification **Question:** "Which category does this belong to?" You have an input and a fixed set of possible labels. The network outputs a probability distribution over those labels. ``` Input: an image, a game frame, a sentence, a molecule graph Output: probabilities over classes Examples: - "Is this image a cat or a dog?" (binary) - "Which of 10 digits is this?" (multiclass) - "What action should the player take?" (multiclass) - "Which tags apply to this article?" (multi-label) ``` **How it works:** The final layer has one neuron per class and applies softmax to produce probabilities. Training uses cross-entropy loss, which heavily penalizes confident wrong predictions. **Edifice families for classification:** - **Feedforward (MLP)**: simplest classifier for tabular data - **Convolutional (ResNet, EfficientNet)**: image classification - **Vision (ViT, Swin)**: image classification with transformers - **Attention/SSM/Recurrent**: sequence classification (text sentiment, game state evaluation) - **Graph (GCN, GAT)**: classify molecules, social network nodes ### 2. Regression **Question:** "What number should this produce?" Like classification, but the output is a continuous value (or vector of values) instead of a category. Training uses mean squared error or similar distance-based losses. ``` Input: features describing a state or situation Output: one or more continuous numbers Examples: - "What will the temperature be tomorrow?" (scalar) - "What (x, y) position will this object move to?" (vector) - "What is this house worth?" (scalar) - "What energy does this molecule have?" (scalar, SchNet) ``` **How it works:** The final layer outputs raw numbers (no softmax). Loss measures the gap between predicted and actual values. **Edifice families for regression:** - **Feedforward (MLP, KAN, TabNet)**: tabular regression - **Graph (SchNet, PNA)**: molecular property prediction - **Recurrent/SSM**: time-series forecasting ### 3. Sequence Modeling **Question:** "Given this sequence so far, what comes next?" This is the problem that drives language models, music generators, and game AI. The network processes a sequence of tokens and predicts the next one (autoregressive) or fills in missing ones (masked). This is technically classification at each timestep (choosing the next token from a vocabulary), but the sequential structure demands specialized architectures. ``` Input: a sequence of tokens (words, game frames, notes) Output: prediction for the next token (or all tokens) Examples: - "Given these words, what's the next word?" (language modeling) - "Given these game frames, what controller input?" (game AI) - "Given this melody so far, what's the next note?" (music) - "Translate this sentence to another language" (seq-to-seq) ``` **The core challenge:** sequences have temporal dependencies. Word 50 might depend on word 3. Frame 120 might depend on what happened at frame 10. The architecture needs some mechanism to carry or access information across time. **Edifice families for sequence modeling:** ``` Sequence Modeling Approaches ============================ Recurrent (LSTM, GRU, xLSTM, MinGRU, Titans) Process one token at a time, maintaining a hidden state. Constant memory per step, but sequential (can't parallelize training). Best for: streaming/online inference, moderate sequence lengths. Attention (Multi-Head, GQA, Perceiver, RetNet) Every token attends to every other token simultaneously. Parallel training, but O(L²) cost in sequence length. Best for: tasks requiring precise long-range recall. State Space Models (Mamba, S4, H3, Hyena) Model sequences as discretized dynamical systems. Parallel training AND constant-memory inference. Best for: long sequences where O(L²) attention is too expensive. Linear Attention (LinearTransformer, Performer, GLA, RWKV, Griffin) Approximate attention with O(L) cost. Bridge between attention quality and SSM efficiency. Best for: when you need attention-like behavior at SSM-like cost. ``` The choice between these families is one of the most important architectural decisions, and it's driven by your sequence length, whether you need autoregressive inference, and how much compute you have. The architecture guides cover this tradeoff in depth. ### 4. Generation **Question:** "Create new data that looks like the training data." Generative models learn the underlying distribution of the data and can sample new examples from it. This is fundamentally different from classification or regression -- there's no single right answer. The network needs to capture the full range of variation in the data. ``` Input: random noise, a text prompt, a conditioning signal Output: a new image, audio clip, molecule, game trajectory Examples: - "Generate a new face image" (unconditional) - "Create an image matching this description" (conditional) - "Design a molecule with these properties" (conditional) - "Generate realistic game trajectories" (simulation) ``` **Edifice families for generation:** ``` Generative Paradigms ==================== Latent Variable (VAE, VQ-VAE) Encode data to a compressed latent space, decode back. Smooth latent space enables interpolation and manipulation. Trade-off: smooth generation but sometimes blurry outputs. Adversarial (GAN) Generator creates, discriminator judges. Minimax game. Sharp outputs but unstable training, mode collapse risk. Diffusion (DDPM, DDIM, DiT, LatentDiffusion, ConsistencyModel) Gradually add noise to data, then learn to reverse the process. Current state of the art for image generation. Trade-off: high quality but slow sampling (many denoising steps). Flow-Based (NormalizingFlow, FlowMatching, ScoreSDE) Learn invertible transformations between noise and data. Exact likelihood computation, principled training. ``` ### 5. Structured Prediction **Question:** "The input has non-trivial structure -- how do I respect it?" Some data doesn't fit neatly into sequences or grids. Graphs have nodes and edges. Sets have no ordering. Point clouds are unordered 3D coordinates. These require architectures that respect the data's inherent symmetries. ``` Input: a graph, a set of items, a point cloud Output: node labels, graph labels, set-level prediction Examples: - "Classify each atom in this molecule" (node classification) - "Is this molecule toxic?" (graph classification) - "What's the aggregate property of this item set?" (set regression) - "Segment this 3D scene" (point cloud) ``` **Edifice families for structured prediction:** - **Graph (GCN, GAT, GIN, GraphSAGE, PNA, SchNet)**: data with relational structure - **Sets (DeepSets, PointNet)**: unordered collections The key property these architectures encode is **equivariance** or **invariance**: - A graph network produces the same result regardless of how you number the nodes - DeepSets produces the same result regardless of how you order the set elements - PointNet handles arbitrary point orderings in 3D space ## Combining Problems Real applications often combine multiple problem types: ``` Game AI (ExPhil): Sequence modeling → process history of game frames Classification → choose which button to press Regression → choose stick position (continuous) Image Captioning: Structured pred. → encode image patches (ViT) Generation → decode caption text autoregressively Drug Discovery: Structured pred. → encode molecular graph (SchNet) Regression → predict binding energy Generation → generate candidate molecules (VAE/Flow) ``` When you see a compound problem like this, you'll typically compose modules from different Edifice families. Edifice's consistent API (everything returns an Axon model) makes this composition natural. ## The Decision Map When approaching a new problem, walk through this: ``` What is your data? │ ├─ Tabular (rows and columns, no structure) │ └─ Feedforward: MLP, KAN, TabNet │ ├─ Sequential (ordered in time or position) │ ├─ Short sequences (< 1K tokens) │ │ └─ Attention or Recurrent │ ├─ Long sequences (1K-100K+ tokens) │ │ └─ SSM (Mamba) or Linear Attention │ └─ Need both recall AND efficiency? │ └─ Hybrid (Jamba, Zamba) │ ├─ Images │ ├─ Classification / understanding │ │ └─ Convolutional (ResNet, EfficientNet) or Vision (ViT, Swin) │ ├─ Segmentation / dense prediction │ │ └─ Vision: UNet, Swin │ └─ Generation │ └─ Generative: Diffusion, VAE, GAN │ ├─ Graphs / molecules │ └─ Graph: GCN, GAT, SchNet, GraphTransformer │ ├─ Unordered sets / point clouds │ └─ Sets: DeepSets, PointNet │ └─ Want to generate new data? ├─ High quality images → Diffusion (DiT, LatentDiffusion) ├─ Fast sampling needed → ConsistencyModel, FlowMatching ├─ Smooth latent space → VAE, VQ-VAE └─ Exact likelihood → NormalizingFlow ``` ## Supervised, Unsupervised, and Self-Supervised One more axis to consider -- not what you're predicting, but what kind of training signal you have: **Supervised learning**: you have labeled data (input-output pairs). Classification and regression are typically supervised. Most Edifice architectures can be used in supervised settings. **Unsupervised learning**: no labels. The network finds structure in the data on its own. Generative models (VAE, GAN, Diffusion) are unsupervised -- they learn from the data distribution itself. Clustering and dimensionality reduction also fall here. **Self-supervised learning**: a clever middle ground. You create labels from the data itself by hiding part of it and asking the network to predict the hidden part. Examples: - Masked language modeling: hide words, predict them - Contrastive learning: different views of the same image should have similar representations - MAE: mask 75% of image patches, reconstruct them The Contrastive family in Edifice (SimCLR, BYOL, BarlowTwins, MAE, VICReg) is entirely self-supervised. These methods learn powerful representations without any human-provided labels. **Reinforcement learning**: the network learns by interacting with an environment and receiving rewards. This is how game AI (like ExPhil) trains after the initial behavioral cloning phase. Edifice provides the architecture (the policy network), and a separate RL framework handles the training loop. ## What's Next Now that you can identify your problem type and narrow down which Edifice families to consider: 1. **[Reading Edifice](reading_edifice.md)** -- understand the code patterns so you can actually use the architectures 2. **[Learning Path](learning_path.md)** -- a guided tour through the families in a logical learning order 3. Jump to any specific architecture guide for the family you need