NEO-unify: a 2B multimodal model with no vision encoder and no VAE. Open-source release coming "hopefully not too long"
SenseTime (the Chinese AI lab) just published details on NEO-unify, a multimodal model that throws out both the vision encoder and the VAE: raw pixels in, raw pixels out. The quick rundown: no CLIP, no SigLIP, no VAE; it processes pixel inputs …
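For intuition, here is a minimal sketch of what "raw pixels in, no vision encoder" could look like: ViT-style patchification, where the image is cut into non-overlapping patches and each patch is linearly projected into the model's token space with no pretrained CLIP/SigLIP encoder in between. The patch size, embedding width, and projection matrix below are illustrative assumptions, not details from the NEO-unify release.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an H x W x C image into flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    ph, pw = H // patch_size, W // patch_size
    patches = image.reshape(ph, patch_size, pw, patch_size, C)
    # Reorder so each row is one patch's raw pixel values.
    return patches.transpose(0, 2, 1, 3, 4).reshape(ph * pw, patch_size * patch_size * C)

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3), dtype=np.float32)   # stand-in for a real image
tokens = patchify(img)                            # (16, 768): 16 patches of 16*16*3 pixels
# Hypothetical learned projection straight into the transformer's hidden size.
W_embed = rng.standard_normal((768, 512)).astype(np.float32)
embeddings = tokens @ W_embed                     # (16, 512) pixel tokens for the LLM
print(tokens.shape, embeddings.shape)
```

In an encoder-free design like this, the projection (and everything after it) is trained end-to-end with the language model, rather than borrowing features from a frozen vision tower.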