SenseTime (the Chinese AI lab) just published details on NEO-unify, a multimodal model that throws out the vision encoder AND the VAE. Just raw pixels in, raw pixels out (rough sketch of what that means at the bottom of the post). The quick rundown:
Numbers that caught my attention:
The bad news: not released yet. A comment from a team member says they're "actively preparing for open source as well as a detailed tech report."

For a 2B model with no encoder dependencies, this could be interesting to run locally: lighter dependency stack than most multimodal setups.

Keeping an eye on their HF page: https://huggingface.co/blog/sensenova/neo-unify

Discord server invitation: https://discord.gg/vh5SE45D8b

Anyone else tracking encoder-free multimodal models? Feels like this direction (Chameleon, VILA-U, now NEO-unify) is picking up steam.
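Since the tech report isn't out yet, here's a rough sketch of what the "raw pixels in" side of an encoder-free design generally looks like (think Fuyu/Chameleon-style: pixel patches linearly projected straight into the same token stream as text). To be clear, this is my own toy illustration, not NEO-unify's architecture; every class name and dimension below is made up:

```python
# Hedged sketch of the encoder-free idea, NOT NEO-unify's actual code.
# The point: no ViT, no CLIP, no VAE. A single linear projection turns
# raw pixel patches into tokens that share the transformer with text.
import torch
import torch.nn as nn

class EncoderFreeMultimodal(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, patch=16, n_layers=4):
        super().__init__()
        self.patch = patch
        # Text side: ordinary embedding lookup.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Image side: one linear layer over raw pixel patches.
        # This is the part that replaces the whole vision encoder stack.
        self.pixel_proj = nn.Linear(3 * patch * patch, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def patchify(self, images):
        # (B, 3, H, W) -> (B, num_patches, 3*patch*patch) of raw pixels.
        B, C, H, W = images.shape
        p = self.patch
        x = images.unfold(2, p, p).unfold(3, p, p)  # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        return x

    def forward(self, images, text_ids):
        img_tok = self.pixel_proj(self.patchify(images))  # pixels -> tokens
        txt_tok = self.text_embed(text_ids)
        seq = torch.cat([img_tok, txt_tok], dim=1)        # one unified stream
        # Causal mask so the backbone behaves like a decoder-only LM.
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        h = self.backbone(seq, mask=mask)
        return self.lm_head(h)                            # next-token logits

# Toy usage: one 64x64 image (16 patches) plus 8 text tokens.
model = EncoderFreeMultimodal()
logits = model(torch.rand(1, 3, 64, 64), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000])
```

The appeal for local inference is visible right there: it's one module end to end, with no separate vision-tower checkpoint or VAE weights to download and keep in sync. The "raw pixels out" half would presumably have the same backbone predicting image tokens or patch values directly, but I won't guess at the details until the tech report drops.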