I don't have a formal ML or research background, but I wanted to understand dynamic architectures, so I tried building a Mixture of Experts (MoE) model that physically grows instead of just failing when the data changes.
It's called the Self-Evolving Functional Manifold.
Instead of a static parameter count, the model tracks its rolling loss. When it detects a distribution shift it can't handle (e.g., the training data suddenly switches from linear addition to quadratic equations), it calculates the exact latent coordinate where it's failing and spawns a new Neural Expert there. The new expert inherits a blended `state_dict` from its nearest neighbors.
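In (heavily simplified) PyTorch, the spawn-and-blend step looks roughly like this. The `Expert` class and `spawn_expert` helper below are illustrative stand-ins I wrote for this post, not the repo's actual code:

```python
import torch
import torch.nn as nn

class Expert(nn.Module):
    """Illustrative expert: a small MLP pinned to a latent 'anchor' coordinate."""
    def __init__(self, dim: int, anchor: torch.Tensor):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.register_buffer("anchor", anchor.clone())

def spawn_expert(experts: list[Expert], fail_coord: torch.Tensor, k: int = 2) -> Expert:
    """Spawn an expert at the failing latent coordinate, initializing its
    weights as a distance-weighted blend of its k nearest neighbors."""
    k = min(k, len(experts))
    dists = torch.stack([torch.norm(e.anchor - fail_coord) for e in experts])
    knn = dists.topk(k, largest=False).indices.tolist()
    blend = torch.softmax(-dists[knn], dim=0).tolist()  # closer neighbor -> bigger share
    new_expert = Expert(fail_coord.numel(), fail_coord)
    # Blend every weight tensor from the k nearest neighbors; the anchor
    # buffer is set to the failing coordinate instead of being blended.
    state = {key: sum(w * experts[i].state_dict()[key] for w, i in zip(blend, knn))
             for key in new_expert.state_dict() if key != "anchor"}
    state["anchor"] = fail_coord.clone()
    new_expert.load_state_dict(state)
    return new_expert
```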
A few engineering hurdles I had to solve:
- Runaway Neurogenesis & Optimizer Amnesia: Every spawn added parameters that AdamW had no momentum history for, so the loss spiked, the model read the spike as yet another distribution shift, and it panicked into spawning endless experts. I had to build a refractory "cooldown" period (sketched after this list).
- The MPS Gap: The backward pass for PyTorch's `cdist` isn't supported on Apple Silicon (Metal) yet, so I had to write a manual Euclidean distance routing mechanism to prevent it from falling back to the CPU (also sketched below). It runs 100% natively on M-series chips now.
- Hysteresis Loops: The model kept getting stuck in predictive loops. Injecting positional encodings into the global workspace finally gave it the "momentum" to step through the math auto-regressively.
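The cooldown controller ended up being about this simple. The constants here (EMA decay, trigger ratio, cooldown length) are placeholder values for illustration, not the tuned ones:

```python
class GrowthController:
    """Illustrative rolling-loss trigger with a refractory period, so one
    spawn can't immediately cascade into another while AdamW re-stabilizes."""
    def __init__(self, cooldown_steps: int = 500, trigger_ratio: float = 3.0):
        self.ema = None         # exponential moving average of the training loss
        self.last_spawn = None  # step index of the most recent spawn
        self.cooldown = cooldown_steps
        self.ratio = trigger_ratio

    def should_spawn(self, step: int, loss: float) -> bool:
        self.ema = loss if self.ema is None else 0.98 * self.ema + 0.02 * loss
        cooling = self.last_spawn is not None and (step - self.last_spawn) < self.cooldown
        # Spawn only when the loss spikes well above its own rolling average
        # AND we're past the refractory window from the previous spawn.
        if loss > self.ratio * self.ema and not cooling:
            self.last_spawn = step
            return True
        return False
```

The runaway loop made sense once I saw it: freshly spawned parameters enter AdamW with empty momentum buffers (they get registered via `optimizer.add_param_group(...)`), the loss spikes, and without the refractory window that spike gets misread as yet another distribution shift.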
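For the curious, the manual routing is just the expanded quadratic identity ||x - a||^2 = ||x||^2 + ||a||^2 - 2 x·a, built entirely from matmul and sum, which (as far as I can tell) both have MPS backward support. A simplified sketch, with `anchors` standing in for the experts' latent coordinates:

```python
import torch

def euclidean_routing(x: torch.Tensor, anchors: torch.Tensor, top_k: int = 1):
    """Route (B, D) tokens to (E, D) expert anchors without torch.cdist."""
    x_sq = (x * x).sum(dim=-1, keepdim=True)    # (B, 1)
    a_sq = (anchors * anchors).sum(dim=-1)      # (E,)
    # ||x - a||^2 = ||x||^2 + ||a||^2 - 2 x.a^T; clamp guards tiny negatives
    sq_dists = (x_sq + a_sq - 2.0 * (x @ anchors.T)).clamp_min(0.0)  # (B, E)
    gate = torch.softmax(-sq_dists, dim=-1)     # nearest anchor -> highest gate
    return gate.topk(top_k, dim=-1)             # (gate values, expert indices)

# Usage: gate_vals, expert_idx = euclidean_routing(h, torch.stack([e.anchor for e in experts]))
```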
I just pushed v1.0. It trains on small numbers and extrapolates the learned function to numbers it has never seen, without any modulo constraints.
Since I'm entirely self-taught in ML, I am absolutely certain my gradient flow or routing logic could be optimized. I would love for some of the actual researchers and ML engineers here to tear into the architecture and tell me what I did wrong (or right).
Repo: https://github.com/rushplayer-arch/self-evolving-manifold