Agent S3: Approaching Human-level Computer Use with Wide Scaling

October 2, 2025

Since launching our first framework, Agent S, at 20.6% on OSWorld just a year ago, we’ve steadily advanced the frontier of computer-use agents. Agent S2 raised the state of the art to 48.8%, and now Agent S3 pushes performance to 69.9%, approaching human-level performance at 72%.

Agent S3 builds directly on the foundation of Agent S2. By simplifying the framework and introducing a native coding agent, we improved performance to 62.6% on OSWorld, setting a new state of the art. Beyond that, Agent S3 introduces the first wide-scaling framework for computer-use agents through Behavior Best-of-N (bBoN). Instead of relying on a single agent run, bBoN generates multiple rollouts and selects the best outcome. This approach unlocks scalable performance gains, raising accuracy from 62.6% to 69.9% and showing how agentic frameworks can improve simply by scaling with more diverse agent runs.

New State-of-the-Art, Near Human-level Performance

*Agent S3 using Behavior Best-of-N

On OSWorld, Agent S3 alone reaches 62.6% in the 100-step setting, already exceeding the previous state of the art of 61.4% (Claude Sonnet 4.5). With the addition of Behavior Best-of-N, performance climbs even higher to 69.9%, bringing computer-use agents to within just a few points of human-level accuracy (72%).

For generalization across environments, Agent S3 also shows strong improvements when applying Behavior Best-of-N. On WindowsAgentArena, accuracy rises from 50.2% using only Agent S3 to 56.6% by selecting from multiple rollouts. Similarly on AndroidWorld, performance improves from 68.1% to 71.6%.

CUA Bottleneck: High Variance in Long-Horizon Tasks

Different agent runs succeed with high variance; bBoN surveys the runs and chooses the best one.

Computer-use agents (CUAs) promise a future where software runs itself, booking tickets, filling forms, and navigating apps so you don’t have to. But right now, even the best CUAs stumble when tasks get long and messy. A stray click, a late response, or an unexpected pop-up can send the whole run off course. Small mistakes compound, and what should have been smooth automation turns into frustration.

That’s the core bottleneck: high variance. The same agent might nail a task once and then completely blow it the next time. This inconsistency makes CUAs unpredictable and shows why reliability on complex, everyday workflows remains such a challenge.

Scaling Agents for Computer Use

Behavior Best-of-N: Scaling Through Multiple Rollouts

A core challenge in scaling agents is that single-run rollouts remain inconsistent, even with stronger models. Agent S3 introduces Behavior Best-of-N (bBoN), which tackles this by running multiple rollouts in parallel and selecting the best one.

Our approach starts by generating facts. Raw agent runs contain a large amount of step-by-step detail, much of which is irrelevant or redundant. By generating facts, we convert these noisy runs into concise statements about what happened at each step, focusing only on the information that directly matters for task success. Concatenating these facts produces a behavior narrative, which is a clear summary of what an agent did at each step, making agent runs more interpretable and easier to compare.
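The fact-generation step can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the summarizer is a placeholder where a real system would prompt an LLM, and the step fields are hypothetical.

```python
# Hypothetical sketch: compress a noisy rollout into a behavior narrative.
# In practice, an LLM would summarize each raw step (screenshot, action,
# result) into a concise fact; here a placeholder keeps only the fields
# that plausibly matter for judging task success.

def summarize_step(step: dict) -> str:
    """Reduce one raw step to a single concise fact."""
    return f"Step {step['index']}: {step['action']} -> {step['outcome']}"

def behavior_narrative(rollout: list[dict]) -> str:
    """Concatenate per-step facts into one readable narrative."""
    return "\n".join(summarize_step(s) for s in rollout)

rollout = [
    {"index": 1, "action": "opened Settings", "outcome": "Settings window visible"},
    {"index": 2, "action": "clicked 'Display'", "outcome": "Display pane loaded"},
]
print(behavior_narrative(rollout))
```

The key design point is that the narrative discards step-by-step noise (pixel coordinates, retries, intermediate screenshots) while preserving a comparable record of what the agent actually did.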

With behavior narratives in place, we apply judge selection to determine which rollout best completes the task. Instead of comparing raw outputs, the judge grounds its decision in the facts within each behavior narrative. By citing these facts across rollouts, the judge can reason comparatively about which attempt is most effective and ultimately selects the best run.
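A minimal sketch of the selection loop, with the judge mocked out. In the real system the judge is an LLM that reasons comparatively over cited facts; the keyword heuristic below is purely illustrative.

```python
# Illustrative judge selection over behavior narratives. The judge here
# is a stand-in heuristic; a real judge would prompt an LLM to compare
# narratives for a given task and cite specific facts from each.

def judge(task: str, narratives: list[str]) -> int:
    """Return the index of the narrative that best completes the task.

    Placeholder heuristic: prefer a narrative whose facts report success.
    """
    for i, narrative in enumerate(narratives):
        if "task completed" in narrative.lower():
            return i
    return 0  # fall back to the first rollout

narratives = [
    "Step 1: opened browser -> error dialog shown",
    "Step 1: opened browser -> page loaded\n"
    "Step 2: submitted form -> task completed",
]
best = judge("submit the form", narratives)
print(best)  # -> 1
```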

Improving the Framework: Simpler Design, Greater Flexibility

Agent S2 used a hierarchical manager–worker setup, but this added unnecessary overhead. Agent S3 streamlines the framework by removing that hierarchy and introducing a native coding agent that can generate and execute code. This makes solutions more diverse, spanning both code and GUI actions, and also more reliable. Together, these refinements boosted performance by about 13 percentage points, bringing Agent S3 to 62.6% single-agent performance.

Scaling with Agent Runs

As the number of agent runs increases on OSWorld, performance gradually improves. With 10 runs, we achieve the highest performance: 69.9% with GPT-5 and 60.2% with GPT-5 Mini.
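A back-of-the-envelope model shows why wide scaling helps. If we assume N independent rollouts, each succeeding with probability p, and an idealized judge that always recognizes a successful run, the success probability is 1 - (1 - p)^N. This is only an upper bound of our own construction, not a result from the paper: real rollouts are correlated and real judges are imperfect, which is why the observed gain at N = 10 (62.6% to 69.9%) sits well below it.

```python
# Upper bound on best-of-N success under two strong assumptions:
# independent rollouts and a perfect judge. Real gains are smaller
# because rollouts share failure modes and judges make mistakes.

def best_of_n_upper_bound(p: float, n: int) -> float:
    """P(at least one of n independent rollouts succeeds)."""
    return 1 - (1 - p) ** n

# Single-run success rate from the OSWorld results above.
for n in (1, 3, 10):
    print(n, round(best_of_n_upper_bound(0.626, n), 3))
```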
