ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, Tat-Seng Chua
Don't deploy GUI automation in professional tools yet. If you're building AI assistants for design software, test at native 4K resolution with real toolbar density, not on downsampled screenshots. Prioritize single-window workflows until multi-pane grounding improves.
Multimodal large language models (MLLMs) struggle to ground instructions in professional software interfaces. Targets are 3-5x smaller than those in consumer apps, and 4K displays expose models that were trained on low-resolution screenshots.
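To make the resolution problem concrete, here is a minimal arithmetic sketch of what happens to a small target when a 4K screenshot is shrunk to fit a model's input limit. The function name `effective_size` and the example input limits (1920, 1344, 1024) are illustrative assumptions, not values taken from the paper.

```python
def effective_size(target_px: float, native: tuple[int, int], max_side: int) -> float:
    """Side length of a target after the screenshot is resized so that
    its longest side is at most max_side (never upscaled)."""
    scale = min(1.0, max_side / max(native))
    return target_px * scale


if __name__ == "__main__":
    native = (3840, 2160)  # native 4K screenshot
    # Hypothetical model input limits (longest side, in pixels).
    for max_side in (3840, 1920, 1344, 1024):
        size = effective_size(50, native, max_side)
        print(f"max_side={max_side}: a 50 px button becomes {size:.1f} px")
```

Under these assumptions, a 50 px button is already down to 25 px at a 1920 px input limit, which is one way to see why testing on downsampled screenshots can hide the problem.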
Method: ScreenSpot-Pro tests GUI agents on Photoshop, Blender, and CAD tools at native resolutions up to 3840x2160. The best current models reach only 25-40% accuracy on small UI elements (for example, buttons under 50 px), compared with 70%+ on consumer interfaces. The benchmark isolates three failure modes: resolution degradation during preprocessing, inability to parse dense toolbars, and confusion in multi-window layouts with overlapping coordinate systems.
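As a rough illustration of how accuracy on small elements could be tallied, the sketch below counts a prediction as correct when the predicted click point falls inside the target's bounding box, then reports accuracy separately for small and large targets. The point-in-box rule, the `Sample` type, and the use of the shorter box side for the 50 px cutoff are assumptions made for this example, not the benchmark's confirmed evaluation code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    bbox: tuple[float, float, float, float]  # x1, y1, x2, y2 in native pixels
    pred: tuple[float, float]                # predicted click point (x, y)


def is_hit(s: Sample) -> bool:
    """True if the predicted point lies inside the ground-truth box."""
    x1, y1, x2, y2 = s.bbox
    x, y = s.pred
    return x1 <= x <= x2 and y1 <= y <= y2


def accuracy_by_size(samples: list[Sample], small_px: float = 50) -> dict[str, Optional[float]]:
    """Accuracy split by target size; 'small' means shorter side < small_px."""
    buckets = {"small": [0, 0], "large": [0, 0]}  # [hits, total]
    for s in samples:
        x1, y1, x2, y2 = s.bbox
        key = "small" if min(x2 - x1, y2 - y1) < small_px else "large"
        buckets[key][0] += is_hit(s)
        buckets[key][1] += 1
    # An empty bucket has no defined accuracy, so report None for it.
    return {k: (h / n if n else None) for k, (h, n) in buckets.items()}


if __name__ == "__main__":
    demo = [
        Sample(bbox=(100, 100, 130, 130), pred=(115, 115)),  # small, hit
        Sample(bbox=(200, 100, 230, 130), pred=(260, 115)),  # small, miss
        Sample(bbox=(0, 0, 400, 200), pred=(50, 50)),        # large, hit
    ]
    print(accuracy_by_size(demo))
```

Splitting the tally by target size keeps a strong result on large widgets from masking failures on dense toolbar icons.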
Caveats: The benchmark covers only static screenshots. It does not test dynamic interactions such as drag-and-drop or modal dialogs that appear mid-workflow.
Reflections: Can fine-tuning on high-res professional UI datasets close the 30-point accuracy gap? · How do coordinate system transformations in multi-monitor setups affect grounding performance? · What's the minimum viable resolution for acceptable accuracy in dense interfaces?