I don't have a formal ML or research background, but I wanted to understand dynamic architectures, so I tried building a Mixture of Experts (MoE) model that physically grows instead of just failing when the data changes.
It's called the Self-Evolving Functional Manifold.
Instead of a static parameter count, the model tracks its rolling loss. When it detects a distribution shift it can't handle (e.g., the training data suddenly switches from linear addition to quadratic equations), it calculates the exact latent coordinate where it's failing and spawns a new Neural Expert there. The new expert inherits a blended `state_dict` from its nearest neighbors.
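In (heavily simplified) PyTorch, the spawn-and-blend step looks roughly like this. The `Expert` class and `spawn_expert` helper below are illustrative stand-ins I wrote for this post, not the repo's actual code:

```python
import torch
import torch.nn as nn

class Expert(nn.Module):
    """Illustrative expert: a small MLP pinned to a latent 'anchor' coordinate."""
    def __init__(self, dim: int, anchor: torch.Tensor):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.register_buffer("anchor", anchor.clone())

def spawn_expert(experts: list[Expert], fail_coord: torch.Tensor, k: int = 2) -> Expert:
    """Spawn an expert at the failing latent coordinate, initializing its
    weights as a distance-weighted blend of its k nearest neighbors."""
    k = min(k, len(experts))
    dists = torch.stack([torch.norm(e.anchor - fail_coord) for e in experts])
    knn = dists.topk(k, largest=False).indices.tolist()
    blend = torch.softmax(-dists[knn], dim=0).tolist()  # closer neighbor -> bigger share
    new_expert = Expert(fail_coord.numel(), fail_coord)
    # Blend every weight tensor from the k nearest neighbors; the anchor
    # buffer is set to the failing coordinate instead of being blended.
    state = {key: sum(w * experts[i].state_dict()[key] for w, i in zip(blend, knn))
             for key in new_expert.state_dict() if key != "anchor"}
    state["anchor"] = fail_coord.clone()
    new_expert.load_state_dict(state)
    return new_expert
```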
A few engineering hurdles I had to solve:
- Runaway Neurogenesis & Optimizer Amnesia: Every spawn added parameters that AdamW had no momentum history for, so the loss spiked, the model read the spike as yet another distribution shift, and it panicked into spawning endless experts. I had to build a refractory "cooldown" period (sketched after this list).
- The MPS Gap: The backward pass for PyTorch's `cdist` isn't supported on Apple Silicon (Metal) yet, so I had to write a manual Euclidean distance routing mechanism to prevent it from falling back to the CPU (also sketched below). It runs 100% natively on M-series chips now.
- Hysteresis Loops: The model kept getting stuck in predictive loops. Injecting positional encodings into the global workspace finally gave it the "momentum" to step through the math auto-regressively.
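The cooldown controller ended up being about this simple. The constants here (EMA decay, trigger ratio, cooldown length) are placeholder values for illustration, not the tuned ones:

```python
class GrowthController:
    """Illustrative rolling-loss trigger with a refractory period, so one
    spawn can't immediately cascade into another while AdamW re-stabilizes."""
    def __init__(self, cooldown_steps: int = 500, trigger_ratio: float = 3.0):
        self.ema = None         # exponential moving average of the training loss
        self.last_spawn = None  # step index of the most recent spawn
        self.cooldown = cooldown_steps
        self.ratio = trigger_ratio

    def should_spawn(self, step: int, loss: float) -> bool:
        self.ema = loss if self.ema is None else 0.98 * self.ema + 0.02 * loss
        cooling = self.last_spawn is not None and (step - self.last_spawn) < self.cooldown
        # Spawn only when the loss spikes well above its own rolling average
        # AND we're past the refractory window from the previous spawn.
        if loss > self.ratio * self.ema and not cooling:
            self.last_spawn = step
            return True
        return False
```

The runaway loop made sense once I saw it: freshly spawned parameters enter AdamW with empty momentum buffers (they get registered via `optimizer.add_param_group(...)`), the loss spikes, and without the refractory window that spike gets misread as yet another distribution shift.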
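For the curious, the manual routing is just the expanded quadratic identity ||x - a||^2 = ||x||^2 + ||a||^2 - 2 x·a, built entirely from matmul and sum, which (as far as I can tell) both have MPS backward support. A simplified sketch, with `anchors` standing in for the experts' latent coordinates:

```python
import torch

def euclidean_routing(x: torch.Tensor, anchors: torch.Tensor, top_k: int = 1):
    """Route (B, D) tokens to (E, D) expert anchors without torch.cdist."""
    x_sq = (x * x).sum(dim=-1, keepdim=True)    # (B, 1)
    a_sq = (anchors * anchors).sum(dim=-1)      # (E,)
    # ||x - a||^2 = ||x||^2 + ||a||^2 - 2 x.a^T; clamp guards tiny negatives
    sq_dists = (x_sq + a_sq - 2.0 * (x @ anchors.T)).clamp_min(0.0)  # (B, E)
    gate = torch.softmax(-sq_dists, dim=-1)     # nearest anchor -> highest gate
    return gate.topk(top_k, dim=-1)             # (gate values, expert indices)

# Usage: gate_vals, expert_idx = euclidean_routing(h, torch.stack([e.anchor for e in experts]))
```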
I just pushed v1.0. It trains on small numbers and extrapolates the learned function to numbers it has never seen, without any modulo constraints.
Since I'm entirely self-taught in ML, I am absolutely certain my gradient flow or routing logic could be optimized. I would love for some of the actual researchers and ML engineers here to tear into the architecture and tell me what I did wrong (or right).
Repo: https://github.com/rushplayer-arch/self-evolving-manifold