cs.CL

Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan’s National Assessment of Academic Ability

arXiv:2605.11663v1
Abstract: Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a mu…