DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
arXiv:2604.12812v4 Announce Type: replace
Abstract: Existing Multimodal Large Language Models (MLLMs) degrade significantly on long document understanding as document length increases. This stems from two fundamental…