An Efficient Training Pipeline for Reasoning Graphical User Interface Agents
Abstract
Existing methods for developing Graphical User Interface agents rely on massive, noisy synthetic datasets obtained by extracting elements from the interface and then generating instructions with an oracle component. While effective, this pipeline often yields misaligned, low-quality, or repetitive instructions. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated by first identifying challenging cases, then removing misaligned ones, and finally selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: 1) supervised fine-tuning, 2) chain-of-thought-augmented fine-tuning, and 3) reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baseline models trained with orders of magnitude more data. These results demonstrate that principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.
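The abstract does not specify implementation details, but the core of Group Relative Policy Optimization is a critic-free advantage estimate: each sampled response is scored relative to the mean and standard deviation of its own group. A minimal sketch, with a hypothetical binary grounding reward (1 if the predicted click lands inside the target element, else 0) standing in for whatever reward the paper actually uses, might look like:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt's sampled group.

    GRPO replaces a learned value baseline with per-group reward
    normalization: A_i = (r_i - mean(r)) / std(r).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids div-by-zero

# Example: four sampled responses for one GUI instruction; reward is
# a hypothetical hit/miss signal on the target element.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward the better samples within each group without training a separate value model.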
Type
Publication
In ICLR 2026 Workshop on Multimodal Intelligence: Next Token Prediction and Beyond