An Efficient Training Pipeline for Reasoning Graphical User Interface Agents
Abstract
Existing methods for developing Graphical User Interface agents rely on massive, noisy synthetic datasets obtained by extracting elements from the interface and then generating instructions with an oracle component. While effective, this pipeline often yields misaligned, low-quality, or repetitive instructions. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated by first identifying challenging cases, then removing misaligned ones, and finally selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: 1) supervised fine-tuning, 2) chain-of-thought-augmented fine-tuning, and 3) reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baseline models trained with orders of magnitude more data. These results demonstrate that principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.
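The abstract does not specify implementation details, but the core of Group Relative Policy Optimization is a critic-free advantage estimate: each sampled response is scored relative to the mean and standard deviation of its own group. A minimal sketch, with a hypothetical binary grounding reward (1 if the predicted click lands inside the target element, else 0) standing in for whatever reward the paper actually uses, might look like:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt's sampled group.

    GRPO replaces a learned value baseline with per-group reward
    normalization: A_i = (r_i - mean(r)) / std(r).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids div-by-zero

# Example: four sampled responses for one GUI instruction; reward is
# a hypothetical hit/miss signal on the target element.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward the better samples within each group without training a separate value model.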
Type
Publication
In ICLR 2026 Workshop on Multimodal Intelligence: Next Token Prediction and Beyond