Agent S2 Technical Review
A Compositional Generalist-Specialist Framework for Computer Use Agents
April 1, 2025
Building an agent that can use a computer like a human remains one of the most formidable milestones on the path toward artificial general intelligence. From executing open-ended digital tasks to navigating unfamiliar applications via GUIs, the problem space is vast, noisy, and highly dynamic. Today, we're thrilled to release the technical paper for Agent S2, a modular framework that sets a new state of the art on multiple computer-use benchmarks.
Two weeks ago, we open-sourced Agent S2. Now, with the release of the technical paper, we’re excited to provide a deeper look into the core ideas and architecture behind the system. For a more beginner-friendly explanation, check out our previous blog post.
Agent S2 Overview: Compositional Intelligence
Agent S2 is designed around a simple but powerful idea: instead of relying on a single monolithic model to plan, act, and ground its interactions with the screen, we divide these responsibilities between generalist and specialist modules. This compositional setup mimics how expert human operators work: high-level planners, low-level executors, and interface specialists working in tandem.

Key features of Agent S2:
Mixture of Grounding (MoG): Uses a suite of grounding experts (visual, textual, structural) to accurately localize GUI elements.
Proactive Hierarchical Planning (PHP): Dynamically refines its plans based on feedback from the environment, rather than following a fixed script.
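To make the Mixture of Grounding idea above more concrete, here is a minimal sketch of routing a grounding request to one of several experts. It is an illustration under stated assumptions, not the Agent S2 implementation: the `GroundingRequest` type, the three expert functions, and the `kind` tag (assumed to be chosen by the planner) are all hypothetical names, and each expert simply returns a placeholder instead of calling a real model.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class GroundingRequest:
    kind: str         # hypothetical tag chosen by the planner
    description: str  # natural-language description of the target element


def visual_expert(req: GroundingRequest) -> dict:
    # Stand-in for a visual grounding model that would return pixel coordinates.
    return {"expert": "visual", "target": req.description}


def textual_expert(req: GroundingRequest) -> dict:
    # Stand-in for an expert that would locate a span of on-screen text.
    return {"expert": "textual", "target": req.description}


def structural_expert(req: GroundingRequest) -> dict:
    # Stand-in for an expert that would address structured content such as cells.
    return {"expert": "structural", "target": req.description}


EXPERTS: Dict[str, Callable[[GroundingRequest], dict]] = {
    "visual": visual_expert,
    "textual": textual_expert,
    "structural": structural_expert,
}


def ground(req: GroundingRequest) -> dict:
    # Unknown kinds fall back to visual grounding (an assumption of this sketch).
    return EXPERTS.get(req.kind, visual_expert)(req)


result = ground(GroundingRequest("structural", "cell B2 in the open spreadsheet"))
print(result["expert"])
```

The point of the dispatch table is that the planner only has to describe what it wants to interact with; how that description becomes a concrete screen location is left to whichever expert is selected.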
Benchmark Results: State-of-the-Art Across Platforms
Agent S2 sets a new bar on the widely used OSWorld benchmark.
It also shows strong generalization:
WindowsAgentArena: 52.8% relative improvement over prior SOTA
AndroidWorld: 16.5% relative improvement over prior SOTA


Design Innovations: MoG + PHP
Many agent failures trace back to poor grounding or rigid planning. Agent S2 addresses both:
Mixture of Grounding: Routes each interaction to the best-suited expert. E.g., for spreadsheets, use a structural grounding expert; for buttons, use visual grounding. Decoupling grounding from planning essentially factorizes the overall problem into two (relatively) simpler subproblems, which better align with the training distribution of current general reasoning models and specialized visual grounding models.
Proactive Planning: Continuously refines subgoals and adjusts based on new observations, mimicking how a human would re-evaluate a plan when something changes.
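The proactive planning behavior described above can be sketched as a loop that rebuilds the remaining plan after every subgoal, not only after a failure. This is a toy illustration, not the authors' algorithm: `run_proactive`, `toy_planner`, and `toy_executor` are hypothetical names, and the "planner" here is a few lines of rule-based logic standing in for a reasoning model.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (subgoal, observation) pairs


def run_proactive(
    goal: List[str],
    plan_fn: Callable[[List[str], History], List[str]],
    execute_fn: Callable[[str], str],
    max_steps: int = 20,
) -> History:
    """Execute one subgoal at a time, replanning after each observation."""
    history: History = []
    subgoals = plan_fn(goal, history)
    steps = 0
    while subgoals and steps < max_steps:
        current = subgoals[0]
        observation = execute_fn(current)
        history.append((current, observation))
        # Proactive step: rebuild the remaining plan from the latest observation.
        subgoals = plan_fn(goal, history)
        steps += 1
    return history


def toy_planner(goal: List[str], history: History) -> List[str]:
    done = {subgoal for subgoal, _ in history}
    remaining = [s for s in goal if s not in done]
    # An unexpected dialog in the latest observation adds a new subgoal up front.
    if history and history[-1][1] == "dialog appeared" and "dismiss dialog" not in done:
        return ["dismiss dialog"] + remaining
    return remaining


def toy_executor(subgoal: str) -> str:
    # Pretend that opening settings triggers an unexpected dialog.
    return "dialog appeared" if subgoal == "open settings" else "ok"


trace = run_proactive(["open settings", "toggle dark mode"], toy_planner, toy_executor)
print([subgoal for subgoal, _ in trace])
```

In this toy run the original two-step plan grows a "dismiss dialog" step in the middle, because the plan is recomputed from what the environment actually showed after the first action.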

Scaling and Error Recovery
With longer horizons, Agent S2 scales better than monolithic models. It adapts on the fly and self-corrects when its initial actions don't produce the desired effect.
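One simple way to picture this kind of self-correction is an act-then-verify loop: perform an action, check whether the expected effect appeared, and try a different strategy if it did not. The sketch below is an assumption-laden illustration, not the mechanism from the paper; `act_with_recovery` and `FakeScreen` are hypothetical, and the fake screen is rigged so that only the "textual" strategy succeeds.

```python
from typing import Callable, List, Optional


def act_with_recovery(
    action: str,
    strategies: List[str],
    perform: Callable[[str, str], None],
    effect_observed: Callable[[str], bool],
) -> Optional[str]:
    """Try each strategy in order until the expected effect is observed."""
    for strategy in strategies:
        perform(action, strategy)
        if effect_observed(action):
            return strategy  # the strategy that actually worked
    return None  # every strategy failed; a caller could replan here


class FakeScreen:
    """Toy environment in which only the 'textual' strategy has any effect."""

    def __init__(self) -> None:
        self.completed = set()

    def perform(self, action: str, strategy: str) -> None:
        if strategy == "textual":
            self.completed.add(action)

    def effect_observed(self, action: str) -> bool:
        return action in self.completed


screen = FakeScreen()
winner = act_with_recovery(
    "select the word 'Total'",
    ["visual", "textual"],
    screen.perform,
    screen.effect_observed,
)
print(winner)
```

The check after each attempt is what lets the agent notice that the first attempt had no effect, instead of carrying on as if it had succeeded.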

Generalizing Beyond Desktop: Android Results
Although Agent S2 was built primarily for desktop environments, it generalizes well to mobile ones, as the AndroidWorld result reported above indicates.

Conclusion: Modular Agents, Real Progress
Agent S2 shows that compositionality isn't just an elegant design philosophy; it's a winning strategy for building agents that can robustly use computers like humans. We believe this work brings us a step closer to AGI and opens up new directions for research in planning, grounding, and multimodal coordination.