I'm monitoring an experimental model's ongoing training run. I replaced the MLP decoders of a standard transformer with the discrete lower-dimensional spline manifold geometry described in my K-Splanifolds paper. The image shows how layer 96 of 128 developed over the first 5B tokens of training. The 18M-parameter model works surprisingly well and the loss is still decreasing, so I'll keep training it until I see evidence of stagnation. Just thought you all might find this look at its development interesting.
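
For anyone who wants a concrete picture of the idea without reading the paper, here's a minimal, simplified sketch in PyTorch: it replaces the usual MLP block with a down-projection to a small latent, a learned 1-D piecewise-linear spline per latent coordinate, and an up-projection back to the model dimension. The names, shapes, and the piecewise-linear simplification are illustrative assumptions only, not the full K-Splanifolds construction from the paper.

```python
import torch
import torch.nn as nn

class SplineDecoder(nn.Module):
    """Toy stand-in for a transformer MLP block: project the hidden state
    to a low-dimensional latent, pass each latent coordinate through a
    learned piecewise-linear spline, then project back up."""

    def __init__(self, d_model: int, latent_dim: int = 8, num_knots: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, latent_dim)
        self.up = nn.Linear(latent_dim, d_model)
        # One 1-D spline per latent coordinate: learnable values at
        # evenly spaced knots over a fixed input range [-3, 3].
        self.register_buffer("knots", torch.linspace(-3.0, 3.0, num_knots))
        self.values = nn.Parameter(0.02 * torch.randn(latent_dim, num_knots))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.down(x)                                   # (..., latent_dim)
        z = z.clamp(self.knots[0].item(), self.knots[-1].item())
        # Find the knot interval containing each coordinate, then
        # linearly interpolate between the two bracketing knot values.
        idx = torch.bucketize(z, self.knots).clamp(1, self.knots.numel() - 1)
        left, right = self.knots[idx - 1], self.knots[idx]
        t = (z - left) / (right - left)                    # position within interval
        dims = torch.arange(z.shape[-1], device=z.device)  # per-coordinate spline table
        v_left = self.values[dims, idx - 1]
        v_right = self.values[dims, idx]
        return self.up(v_left + t * (v_right - v_left))

# Usage: swap this in wherever the per-layer MLP block would sit.
block = SplineDecoder(d_model=512)
y = block(torch.randn(2, 10, 512))   # (batch, seq, d_model) in, same shape out
```

A real implementation could swap the linear interpolation for higher-order spline bases; I kept it linear here to keep the sketch readable.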