「i) test-time scaling for planning, which introduces a scaling strategy during inference to effectively handle planning ambiguity in complex GUI environments; ii) grounding model training, filtering out training samples with annotation errors to improve supervision quality, and optimizing a grounding model using RL (e g , GRPO) to directly predict coordinates without relying on any intermediate “thinking” (i. e., CoT reasoning) on the derived data.」という工夫を行っている。UI-TARS-1.5-7B, Qwen2.5-VL-32B-Instruct, Qwen2.5-VL-72B-InstructをPost Trainingしているが、やはりこの手のチューニングを行わないと厳しいタスクなのだろうか・・・