STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs
arXiv:2604.18177v2 Announce Type: replace-cross
Abstract: Benchmarks are a common standard for assessing LLM capabilities across domains. However, aggregate benchmark scores provide limited insight into the compositional skill gaps of LLMs an…