Document-as-Image Representations Fall Short for Scientific Retrieval
arXiv:2604.18508v1 Announce Type: cross
Abstract: Many recent document embedding models are trained on document-as-image representations, embedding rendered pages as images rather than the underlying source. Meanwhile, existing benchmarks for scientif…