Hey everyone,
I did a small personal benchmark on using local models to detect UI icons from application screenshots. English is not my first language, so sorry for any grammar mistakes! I just wanted to share what I found in case it helps someone doing similar stuff.
Models tested (no quantization):
- Gemma4-31B-it
- Qwen3.5-27B
- Qwen3.6-35B-A3B
Approach:
I feed the app screenshot to the LLM and ask it to recognize the UI icons and return their bbox_2d coordinates. Once I have the coordinates, I use supervision to draw red bounding boxes on the image, then check the results manually by eye.
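The parsing step between the model reply and the drawing can be sketched roughly like this. This is a minimal sketch assuming the model replies with a JSON list of `{"bbox_2d": [x1, y1, x2, y2], "label": ...}` objects, possibly wrapped in a markdown fence (the exact reply format and the `parse_bbox_output` helper name are my assumptions, not from the original post):

```python
import json
import re

def parse_bbox_output(model_text):
    """Extract bbox_2d entries from the model's (possibly fenced) JSON reply."""
    # Strip an optional ```json ... ``` fence around the reply.
    match = re.search(r"```(?:json)?\s*(.*?)```", model_text, re.DOTALL)
    payload = match.group(1) if match else model_text
    boxes = []
    for item in json.loads(payload):
        x1, y1, x2, y2 = item["bbox_2d"]
        if x2 > x1 and y2 > y1:  # drop degenerate boxes
            boxes.append((x1, y1, x2, y2, item.get("label", "icon")))
    return boxes

reply = '```json\n[{"bbox_2d": [10, 20, 42, 52], "label": "save"}]\n```'
print(parse_bbox_output(reply))  # [(10, 20, 42, 52, 'save')]
```

From there the boxes can be handed to supervision (e.g. building a `Detections` object from the xyxy coordinates and annotating with its box annotator) to get the red rectangles on the screenshot.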
For the setup, I used the latest vLLM (v0.19.1) for offline inference. I start at temperature 0 because I want the most confident output; if the model returns 0 icons, I retry at gradually higher temperatures: 0 -> 0.3 -> 0.6 -> 0.9.
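The temperature-escalation retry described above boils down to a small loop. A minimal sketch, assuming a `detect_icons(image, temperature)` helper that wraps the vLLM call and returns the parsed list of boxes (the helper name and signature are hypothetical):

```python
# Retry ladder: escalate temperature only when the model finds nothing.
TEMPERATURES = [0.0, 0.3, 0.6, 0.9]

def detect_with_retries(image, detect_icons):
    for temp in TEMPERATURES:
        boxes = detect_icons(image, temperature=temp)
        if boxes:  # stop at the first non-empty result
            return temp, boxes
    return TEMPERATURES[-1], []  # model never found an icon

# Stub to illustrate the escalation behaviour.
def fake_detect(image, temperature):
    return [(1, 2, 3, 4)] if temperature >= 0.6 else []

print(detect_with_retries("screenshot.png", fake_detect))  # (0.6, [(1, 2, 3, 4)])
```

The upside of this design is that the lowest temperature that produces any detections wins, so the greedy (temperature 0) output is always preferred when it works.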
Overall Results:
Overall, the dense models did much better than the MoE model on this task. My ranking: Qwen3.5 > Qwen3.6 ≈ Gemma4
Some specific findings:
- Gemma4 and Qwen3.6 tied for last place; both are noticeably worse than Qwen3.5.
- Gemma4 completely failed on the Cursor IDE screenshot. I tried 4 times, pushing the temperature all the way up to 0.9 each time, and it still couldn't detect a single icon.
- Qwen3.6 did something really funny on the Photoshop screenshot: it recognized the entire image as one giant icon and drew a massive box around the whole screen. 😅
- For the other app scenarios, you can check the comparison pictures below.
Here are the detailed vLLM parameters:
```yaml
- name: gemma-4-31B-it
  family: gemma4
  params_b: 31
  vllm_kwargs:
    model: google/gemma-4-31B-it
    tensor_parallel_size: 8
    max_model_len: 8192
    max_num_seqs: 1
    gpu_memory_utilization: 0.85
    limit_mm_per_prompt:
      image: 1
      audio: 0
      video: 0
    mm_processor_cache_gb: 0
    skip_mm_profiling: true
    mm_processor_kwargs:
      max_soft_tokens: 1120
- name: qwen3.5-27b
  family: qwen3.5
  params_b: 27
  vllm_kwargs:
    model: Qwen/Qwen3.5-27B
    tensor_parallel_size: 8
    max_model_len: 32768
    max_num_seqs: 1
    gpu_memory_utilization: 0.9
    limit_mm_per_prompt:
      image: 1
      audio: 0
      video: 0
    mm_processor_cache_gb: 0
    mm_encoder_tp_mode: data
    skip_mm_profiling: true
- name: qwen3.6-35b-a3b
  family: qwen3.5
  params_b: 35
  vllm_kwargs:
    model: Qwen/Qwen3.6-35B-A3B
    tensor_parallel_size: 8
    max_model_len: 32768
    max_num_seqs: 1
    gpu_memory_utilization: 0.9
    limit_mm_per_prompt:
      image: 1
      audio: 0
      video: 0
    mm_processor_cache_gb: 0
    mm_encoder_tp_mode: data
    skip_mm_profiling: true
```
Has anyone else tried UI element detection with local models recently? Curious if you guys have any tricks for getting better bounding boxes.