I need feedback on my preprint, please. [D]

https://zenodo.org/records/19661389

Any feedback would be appreciated, including critical feedback.

Abstract

We introduce the Parameter Updater Expert, an architectural concept in which one or more dedicated expert slots within a Mixture-of-Experts (MoE) layer generate weight deltas (∆w) that modify sibling inference experts during the forward pass. Under this concept, the router's sparse gating mechanism implicitly determines not only which experts compute but also when and whether the model adapts its own parameters, collapsing the weight generator into the expert pool itself. We parameterize ∆w as LoRA factor matrices for efficiency and to constrain the degrees of freedom of the adaptation, though the concept is agnostic to the factorization.
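
To make the concept concrete without opening the PDF, here is a deliberately minimal PyTorch sketch of the target design. Everything in it (the class name, single-token forward, top-k routing with combine weights omitted, one shared ∆w buffer) is an illustrative simplification of the idea, not code from the preprint:

```python
import torch
import torch.nn as nn

class UpdaterMoE(nn.Module):
    """Toy MoE layer whose last expert slot is a parameter updater."""
    def __init__(self, d_model, n_experts=4, rank=4):
        super().__init__()
        self.n_experts, self.rank = n_experts, rank
        self.router = nn.Linear(d_model, n_experts + 1)        # +1: updater slot
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model)
                                     for _ in range(n_experts))
        self.updater = nn.Linear(d_model, 2 * d_model * rank)  # emits LoRA factors
        # Persistent LoRA factors: the state the updater expert writes.
        self.register_buffer("A", torch.zeros(d_model, rank))
        self.register_buffer("B", torch.zeros(rank, d_model))

    def forward(self, x, k=2):                  # x: (d_model,), one token for clarity
        topk = self.router(x).topk(k).indices.tolist()  # sparse gating
        if self.n_experts in topk:              # updater selected: the model rewrites
            a, b = self.updater(x).chunk(2)     # its own LoRA factors on this token
            self.A = a.view(-1, self.rank)
            self.B = b.view(self.rank, -1)
        delta = self.A @ self.B                 # ∆w, shared by the sibling experts
        out = torch.zeros_like(x)
        for i in topk:
            if i < self.n_experts:              # inference experts compute with W + ∆w
                out = out + self.experts[i](x) + x @ delta
        return out
```

The point the sketch tries to capture is the sentence above: the router alone decides, token by token, whether the ∆w state changes at all.
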
As a first scalable mechanism-isolation prototype, we present DeltaNet-LoRA: a recurrent module that generates LoRA factors applied inside the MoE expert computation. The recurrent state consists of the LoRA factor matrices themselves, updated via gated linear interpolation. DeltaNet-LoRA is a retrofit experiment that isolates a necessary submechanism of the full concept: whether learned, persistent weight-delta generation can work inside a real pretrained MoE. It does not instantiate the router-selected, expert-indexed architecture, which remains the target design.
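
The recurrence is simple enough to state standalone: the state S_t is the pair of LoRA factor matrices, and each step computes S_t = g_t * S_{t-1} + (1 - g_t) * U_t, where U_t is a candidate update read off the hidden state. A simplified sketch (scalar sigmoid gate, sequential per-token loop; the actual module differs in detail):

```python
import torch
import torch.nn as nn

class LoRAFactorRecurrence(nn.Module):
    """Recurrent state = the LoRA factor matrices, updated by a learned gate."""
    def __init__(self, d_model, rank=4):
        super().__init__()
        self.rank = rank
        self.to_update = nn.Linear(d_model, 2 * d_model * rank)  # candidate U_t
        self.to_gate = nn.Linear(d_model, 1)                     # gate g_t

    def forward(self, h_seq, A, B):
        """h_seq: (T, d_model) hidden states; A: (d, r) and B: (r, d) persist."""
        for h in h_seq:
            g = torch.sigmoid(self.to_gate(h))            # g_t in (0, 1)
            a, b = self.to_update(h).chunk(2)
            A = g * A + (1 - g) * a.view(-1, self.rank)   # gated linear interpolation
            B = g * B + (1 - g) * b.view(self.rank, -1)
        return A, B   # ∆w = A @ B is added inside the frozen expert's forward pass
```
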
On OLMoE-1B-7B (6.9B parameters, all base weights frozen), DeltaNet-LoRA achieves (i) 100% in-context and 80.1% persistent fact retrieval under a sliding-window attention constraint; (ii) 52.2% persistent accuracy under full causal attention with a dual-updater variant; (iii) 54.0% persistent accuracy on natural-language templated facts with a single persistent updater drawing from per-layer hidden states. Ablations show that simple learned gates outperform surprise-gating and delta-rule variants at this rank. A parallel-scan reformulation yields a 2.77× training speedup.
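
On the scan: gated linear interpolation is a first-order linear recurrence, S_t = g_t * S_{t-1} + u_t with u_t = (1 - g_t) * U_t, and the affine maps s -> G*s + U that it chains compose associatively, which is what makes a parallel scan legal. A self-contained Hillis-Steele-style sketch of that reformulation (illustrative, not the training kernel):

```python
import torch

def gated_scan(g, u):
    """All prefix states S_t of S_t = g_t * S_{t-1} + u_t (with S_0 = 0).

    g, u: (T, ...) per-step gates and inputs; broadcasting covers matrix states.
    Each position holds an affine map s -> G*s + U; every round composes it with
    the map `offset` steps back, doubling the covered window.
    """
    G, U = g.clone(), u.clone()
    T, offset = g.shape[0], 1
    while offset < T:
        G_prev = torch.ones_like(G)             # identity map for positions < offset
        U_prev = torch.zeros_like(U)
        G_prev[offset:] = G[:-offset]
        U_prev[offset:] = U[:-offset]
        G, U = G * G_prev, G * U_prev + U       # compose earlier map into later one
        offset *= 2
    return U                                    # U[t] == S_t

# Illustrative use for a (d, r) LoRA factor state:
# g = torch.rand(8, 1, 1); cand = torch.randn(8, 16, 4)
# states = gated_scan(g, (1 - g) * cand)        # states[t] = factor matrix at step t
```

The loop depth drops from T to about log2(T) while each round stays elementwise over the whole sequence, which is where a wall-clock speedup like the reported 2.77× can come from.
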
These results demonstrate that a necessary submechanism of the Parameter Updater Expert architecture (learned, persistent weight-delta generation inside a pretrained MoE) can scale beyond toy models. They do not, however, instantiate the full router-integrated design, which remains the target.

submitted by /u/max6296