SteelDefectX: A Multi-Form Vision-Language Dataset and Benchmark for Steel Surface Defect Analysis

arXiv:2603.21824v2

Abstract: Steel surface defect analysis is critical for industrial quality control, yet existing benchmarks rely primarily on label-only annotations, limiting fine-grained semantic understanding and systematic evaluation of vision-language models. To address this gap, we introduce SteelDefectX, a vision-language dataset with multi-form textual annotations for steel surface defect analysis, comprising 7,778 images across 25 defect categories. At the class level, the dataset provides defect names, representative visual attributes, and industrial causes. At the sample level, each image is annotated with three forms of textual representation: (1) free-form natural language descriptions, (2) structured attribute annotations, and (3) template-based sentences. These annotations provide flexible textual supervision with varying levels of expressiveness and controllability. We further establish a comprehensive benchmark covering vision-language classification, segmentation, and cross-dataset transfer, along with additional evaluations such as retrieval and text-guided localization. Experimental results reveal a trade-off between structure and flexibility in textual representations: structured attributes provide more stable semantic alignment, while natural language descriptions improve transferability and fine-grained spatial grounding. These findings highlight the critical role of textual design in industrial vision-language learning, and SteelDefectX offers a new benchmark for studying semantic alignment and generalization in this setting. The code and dataset are available at https://github.com/Zhaosxian/SteelDefectX.
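The abstract describes a two-level annotation scheme: class-level metadata (name, visual attributes, industrial cause) plus three sample-level textual forms. A minimal sketch of what such a record might look like is below; all field names and the example values are illustrative assumptions, not drawn from the actual SteelDefectX release.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClassInfo:
    """Class-level metadata: defect name, visual attributes, industrial cause.

    (Hypothetical schema; the real dataset's format may differ.)
    """
    name: str
    attributes: List[str] = field(default_factory=list)
    cause: str = ""


@dataclass
class SteelDefectSample:
    """Sample-level record holding the three textual forms from the abstract."""
    image_path: str
    defect_class: ClassInfo
    free_form_description: str        # (1) free-form natural language
    structured_attributes: Dict[str, str]  # (2) structured attributes
    template_sentence: str            # (3) template-based sentence


# Illustrative example record (invented values, not real dataset content)
sample = SteelDefectSample(
    image_path="images/scratch_0001.png",
    defect_class=ClassInfo(
        name="scratch",
        attributes=["elongated", "linear", "bright edges"],
        cause="mechanical abrasion during rolling or handling",
    ),
    free_form_description=(
        "A long, thin bright scratch runs diagonally across the surface."
    ),
    structured_attributes={"shape": "linear", "orientation": "diagonal"},
    template_sentence="A photo of a steel surface with a scratch defect.",
)
print(sample.defect_class.name)
```

Such a record makes the trade-off discussed in the abstract concrete: the structured dictionary is easy to control and align, while the free-form description carries richer, more transferable detail.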
