Scaling Probabilistic Transformer via Efficient Cross-Scale Hyperparameter Transfer
arXiv:2604.25409v1
Abstract: The Probabilistic Transformer (PT), a white-box probabilistic model for contextual word representation, has demonstrated substantial similarity to standard Transformers in both computational structure and do…