Simular AI

Agent S2 Technical Review

A Compositional Generalist-Specialist Framework for Computer Use Agents

April 1, 2025

Building an agent that can use a computer like a human remains one of the most formidable milestones on the path toward artificial general intelligence. From executing open-ended digital tasks to navigating unfamiliar applications via GUIs, the problem space is vast, noisy, and highly dynamic. Today, we're thrilled to release the technical paper for Agent S2, a modular framework that sets a new state of the art on multiple computer-use benchmarks.

Two weeks ago, we open-sourced Agent S2. Now, with the release of the technical paper, we’re excited to provide a deeper look into the core ideas and architecture behind the system. For a more beginner-friendly explanation, check out our previous blog post.

Agent S2 Overview: Compositional Intelligence

Agent S2 is designed around a simple but powerful idea: instead of relying on a single monolithic model to plan, act, and ground its interactions with the screen, we divide these responsibilities between generalist and specialist modules. This compositional setup mimics how expert human operators work: high-level planners, low-level executors, and interface specialists working in tandem.

Agent S2 architecture combining generalist planning and specialist grounding.

Key features of Agent S2:

Mixture of Grounding (MoG): Uses a suite of grounding experts (visual, textual, structural) to accurately localize GUI elements. 
Proactive Hierarchical Planning (PHP): Dynamically refines its plans based on feedback from the environment, rather than following a fixed script.

Benchmark Results: State-of-the-Art Across Platforms

Agent S2 sets a new bar on the widely used OSWorld benchmark.

It also shows strong generalization:

WindowsAgentArena: 52.8% relative improvement over prior SOTA
AndroidWorld: 16.5% relative improvement over prior SOTA

Success rate on OSWorld. Agent S2 significantly outperforms previous agents.

Success rate on WindowsAgentArena. Agent S2 significantly outperforms previous agents.

Design Innovations: MoG + PHP

Most agents fail due to poor grounding or rigid planning. Agent S2 addresses both:

Mixture of Grounding: Routes each interaction to the best-suited expert. For example, a structural grounding expert handles spreadsheets, while a visual grounding expert handles buttons. Decoupling grounding from planning factorizes the overall problem into two relatively simpler subproblems, each better aligned with the training distribution of the model that solves it: general reasoning models for planning and specialized visual grounding models for localization.
Proactive Hierarchical Planning: Continuously refines subgoals and adjusts to new observations, much as a human re-evaluates a plan when something changes.
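The routing idea behind Mixture of Grounding can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names GroundingRequest, GroundedAction, route, and the three stub experts are hypothetical and are not taken from the Agent S2 codebase, and the stubs return placeholder payloads instead of calling real grounding models.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class GroundingRequest:
    kind: str    # expert requested by the planner: "visual", "textual", or "structural"
    target: str  # natural-language or symbolic description of the element


@dataclass
class GroundedAction:
    expert: str    # which expert produced the action
    payload: dict  # expert-specific, executable description of the action


def visual_expert(req: GroundingRequest) -> GroundedAction:
    # Stand-in for a visual grounding model; a real one would predict
    # pixel coordinates from a screenshot. None is a placeholder.
    return GroundedAction("visual", {"click_target": req.target, "xy": None})


def textual_expert(req: GroundingRequest) -> GroundedAction:
    # Stand-in for text-level grounding, e.g. selecting a span of words.
    return GroundedAction("textual", {"select_span": req.target})


def structural_expert(req: GroundingRequest) -> GroundedAction:
    # Stand-in for structural grounding, e.g. addressing a spreadsheet cell.
    return GroundedAction("structural", {"set_cell": req.target})


# Hypothetical registry mapping an expert name to its implementation.
EXPERTS: Dict[str, Callable[[GroundingRequest], GroundedAction]] = {
    "visual": visual_expert,
    "textual": textual_expert,
    "structural": structural_expert,
}


def route(req: GroundingRequest) -> GroundedAction:
    """Dispatch a grounding request to the expert the planner asked for."""
    expert = EXPERTS.get(req.kind)
    if expert is None:
        raise ValueError(f"unknown grounding expert: {req.kind!r}")
    return expert(req)


if __name__ == "__main__":
    print(route(GroundingRequest(kind="structural", target="B2")))
    print(route(GroundingRequest(kind="visual", target="Save button")))
```

The point of the sketch is the separation of concerns: the planner only names an expert and describes a target, and each expert can be swapped or retrained without touching the planner.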

Agent S2 self-corrects by switching from visual to textual grounding.

Scaling and Error Recovery

On longer-horizon tasks, Agent S2 scales better than monolithic models: it adapts on the fly and self-corrects when its initial actions don't produce the desired effect.
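A rough way to picture this behavior is a loop that replans after every subgoal instead of committing to a fixed script. The sketch below is a toy under stated assumptions: run_proactive, toy_planner, and make_flaky_executor are hypothetical names, the "planner" is a simple filter over remaining subgoals where a real system would use a reasoning model, and the "executor" simulates one failed action so the retry is visible.

```python
from typing import Callable, List, Set, Tuple

History = List[Tuple[str, str]]  # (subgoal, observation) pairs


def run_proactive(
    task: List[str],
    plan_fn: Callable[[List[str], str, History], List[str]],
    execute_fn: Callable[[str], str],
    max_steps: int = 10,
) -> History:
    """Execute one subgoal at a time, replanning after every observation."""
    observation = "initial screen"
    history: History = []
    subgoals = plan_fn(task, observation, history)
    steps = 0
    while subgoals and steps < max_steps:
        current = subgoals[0]
        observation = execute_fn(current)
        history.append((current, observation))
        # Proactive step: rebuild the remaining plan from the latest
        # observation, so a failed subgoal is retried, not skipped.
        subgoals = plan_fn(task, observation, history)
        steps += 1
    return history


def toy_planner(task: List[str], observation: str, history: History) -> List[str]:
    # Toy stand-in for a planner: keep every subgoal not yet completed.
    done = {goal for goal, obs in history if obs == "ok"}
    return [goal for goal in task if goal not in done]


def make_flaky_executor(fail_once: Set[str]) -> Callable[[str], str]:
    # Toy environment: each subgoal in fail_once has no effect the first time.
    already_failed: Set[str] = set()

    def execute(goal: str) -> str:
        if goal in fail_once and goal not in already_failed:
            already_failed.add(goal)
            return "no change"
        return "ok"

    return execute


if __name__ == "__main__":
    trace = run_proactive(
        ["open file", "click Save"],
        toy_planner,
        make_flaky_executor({"click Save"}),
    )
    for goal, obs in trace:
        print(goal, "->", obs)
```

In this toy run the second subgoal is attempted twice: the first attempt leaves the screen unchanged, the planner sees that it is still outstanding, and the second attempt succeeds. The max_steps cap is an assumed safeguard against endless retries.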

Why Agent S2 succeeds with longer horizons: adaptive navigation, interaction, and correction.

Generalizing Beyond Desktop: Android Results

Although Agent S2 was built primarily for desktop use, it generalizes well to mobile environments:

Agent S2 achieves state-of-the-art results on the AndroidWorld smartphone-use benchmark.

Conclusion: Modular Agents, Real Progress

Agent S2 shows that compositionality isn't just an elegant design philosophy; it's a winning strategy for building agents that can robustly use computers like humans. We believe this work brings us a step closer to AGI and opens up new directions for research in planning, grounding, and multimodal coordination.

Check out the code and the paper.

Agent S2 Paper

Code Repository
