NEO-unify — A 2B multimodal model with no Vision Encoder, no VAE. Open source coming "hopefully not too long"

SenseTime (the Chinese AI lab) just published details on NEO-unify, a multimodal model that throws out the vision encoder AND the VAE. Just raw pixels in, raw pixels out.

The quick rundown:

  • No CLIP, no SigLIP, no VAE — it processes pixel inputs natively
  • 2B parameter model, single unified Transformer backbone (they call it MoT — Mixture-of-Transformers) handles both understanding and image generation
  • Trained with flow matching for image generation, autoregressive for text — all in one model (rough sketch of the idea below)
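To make the rundown concrete, here's a minimal PyTorch sketch of the general idea as I understand it. Everything in it (patch size, module names, how the two losses are combined) is a placeholder assumption on my part, not NEO-unify's actual code: it just shows a single transformer trained with an autoregressive loss on text tokens and a flow-matching loss on raw pixel patches, with no CLIP/SigLIP and no VAE anywhere in the path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderFreeUnifiedSketch(nn.Module):
    """Toy encoder-free model: raw pixel patches and text share one transformer."""

    def __init__(self, vocab_size=32000, d_model=512, patch=16, layers=4):
        super().__init__()
        self.patch = patch
        # Raw pixels pass through a plain linear projection: no CLIP/SigLIP, no VAE.
        self.pixel_in = nn.Linear(3 * patch * patch, d_model)
        self.pixel_out = nn.Linear(d_model, 3 * patch * patch)  # predicts flow velocity
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.text_head = nn.Linear(d_model, vocab_size)
        self.time_emb = nn.Linear(1, d_model)  # flow-matching timestep conditioning
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=layers)

    def patchify(self, images):
        # (B, 3, H, W) -> (B, N, 3*patch*patch): pure reshaping, nothing learned.
        B, C, H, W = images.shape
        p = self.patch
        x = images.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
        return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

    def forward(self, text_ids, images, t):
        # t: (B,) flow-matching times in [0, 1]
        patches = self.patchify(images)
        noise = torch.randn_like(patches)
        tt = t.view(-1, 1, 1)
        noisy = (1 - tt) * noise + tt * patches             # linear interpolation path
        img_tok = self.pixel_in(noisy) + self.time_emb(t.view(-1, 1)).unsqueeze(1)
        txt_tok = self.text_emb(text_ids)
        h = self.backbone(torch.cat([txt_tok, img_tok], dim=1))

        T = text_ids.size(1)
        # Autoregressive next-token loss on the text positions
        # (a real model would also use a causal attention mask; omitted for brevity).
        logits = self.text_head(h[:, :T])
        ar_loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)), text_ids[:, 1:].reshape(-1)
        )
        # Flow-matching loss on the image positions: regress the velocity field.
        fm_loss = F.mse_loss(self.pixel_out(h[:, T:]), patches - noise)
        return ar_loss + fm_loss
```

The no-VAE part is the interesting bit to me: the generation target is the pixel patches themselves rather than a frozen autoencoder's latents, which is presumably why they can compare reconstruction PSNR against Flux's VAE directly.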

Numbers that caught my attention:

  1. Image reconstruction quality (PSNR 31.56 dB) is already close to Flux's VAE (32.65 dB) at only 90K pretraining steps (quick sketch of the metric after this list)
  2. Beats Bagel on data efficiency (same benchmark, fewer tokens)
  3. Image editing works even with the understanding branch completely frozen
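For context on item 1: PSNR is just log-scaled MSE between a reconstruction and the original image, so a ~1 dB gap at the 31–33 dB level is a fairly small difference in fidelity. Quick sketch below; the tensors are random placeholders, not NEO-unify outputs.

```python
import torch

def psnr(original: torch.Tensor, reconstruction: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = torch.mean((original - reconstruction) ** 2)
    return (10 * torch.log10(max_val ** 2 / mse)).item()

# Toy example with random tensors; a real eval compares a model's
# reconstruction of an image against the ground-truth image.
img = torch.rand(3, 256, 256)
recon = (img + 0.01 * torch.randn_like(img)).clamp(0, 1)
print(f"PSNR: {psnr(img, recon):.2f} dB")  # small noise -> roughly 40 dB
```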

The bad news: it's not released yet. A comment from a team member says they're "actively preparing for open source as well as a detailed tech report."

For a 2B model with no encoder dependencies, this could be interesting to run locally — lighter dependency stack than most multimodal setups.

Keeping an eye on their HF blog post: https://huggingface.co/blog/sensenova/neo-unify

Got the Discord server invite link: https://discord.gg/vh5SE45D8b

Anyone else tracking encoder-free multimodal models? Feels like this direction (Chameleon, VILA-U, now NEO-unify) is picking up steam.

submitted by /u/Few-Personality6088
