[D] Vision Transformer (ViT) – How do I deal with variable size images?

By /u/PositiveInformal9512 / January 21, 2026

Hi,

I'm currently building a ViT following the paper "An Image is Worth 16x16 Words". What is the best way to handle variable-size images when training the model for classification?

One solution I can think of is rescaling the images and padding the smaller ones with black pixels up to a fixed size. I'm not sure whether this is acceptable.
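To make the rescale-and-pad idea concrete, here is a minimal sketch of the geometry only, in plain Python. It assumes a 224x224 input and 16x16 patches (the common ViT-B/16 setup), scales the longer side to the target, and centres the image on a black canvas. The names `LetterboxPlan`, `plan_letterbox`, and `padding_only_patches` are made up for illustration. The actual resampling and pixel filling would be done with whatever image library you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LetterboxPlan:
    new_w: int       # width after aspect-preserving resize
    new_h: int       # height after aspect-preserving resize
    pad_left: int
    pad_top: int
    pad_right: int
    pad_bottom: int


def plan_letterbox(width: int, height: int, target: int = 224) -> LetterboxPlan:
    """Scale the longer side to `target`, then centre-pad the shorter side."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    longer = max(width, height)
    # Integer rounding (round half up) avoids floating-point surprises.
    new_w = max(1, (width * target + longer // 2) // longer)
    new_h = max(1, (height * target + longer // 2) // longer)
    pad_w, pad_h = target - new_w, target - new_h
    pad_left, pad_top = pad_w // 2, pad_h // 2
    return LetterboxPlan(new_w, new_h, pad_left, pad_top,
                         pad_w - pad_left, pad_h - pad_top)


def padding_only_patches(plan: LetterboxPlan, target: int = 224,
                         patch: int = 16) -> list[bool]:
    """Row-major flags: True where a patch contains nothing but padding."""
    if target % patch != 0:
        raise ValueError("target must be a multiple of the patch size")
    n = target // patch
    x_start, x_end = plan.pad_left, plan.pad_left + plan.new_w
    y_start, y_end = plan.pad_top, plan.pad_top + plan.new_h
    flags = []
    for row in range(n):
        y0, y1 = row * patch, (row + 1) * patch
        for col in range(n):
            x0, x1 = col * patch, (col + 1) * patch
            has_content = (x0 < x_end and x1 > x_start
                           and y0 < y_end and y1 > y_start)
            flags.append(not has_content)
    return flags


plan = plan_letterbox(640, 480)
flags = padding_only_patches(plan)
print(plan)
print(sum(flags), "of", len(flags), "patches are padding only")
```

The per-patch flags are optional. One possible use, again as an assumption and not something the paper prescribes, is to mask padding-only tokens in attention so the model does not attend to black borders.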
