If you are evaluating AI browser or computer-use agents, the metric that matters first is not a single headline leaderboard score. It is task completion on the surface your workflow actually lives on. For multi-app desktop work, weight OSWorld-Verified most heavily. For repeatable browser flows you want to regression-test, weight WebArena-Verified. For open-web navigation and general website exploration, use WebVoyager or Online-Mind2Web as secondary signals. That ordering matters because these benchmarks measure different kinds of difficulty, use different evaluation setups, and can produce very different winner lists.
The short verdict
The easiest way to misread computer-use benchmarks is to treat them like one shared leaderboard. They are not. OSWorld-Verified is closest to a real employee moving across apps, files, documents, and browser tabs. WebArena-Verified is closer to a controlled browser lab, which makes it better for reproducible release-to-release regression testing. WebVoyager is still useful for open-web navigation, but recent work in March 2026 showed why it is risky to compare old self-reported results at face value without standardizing task setup and scoring.
Which benchmark matters for which workflow
| Benchmark | Best fit | What it tells you | What it can miss |
|---|---|---|---|
| OSWorld-Verified | Desktop and multi-app automation | Whether an agent can handle real computer work across files, browsers, office tools, and long action chains | How well it generalizes to the open web or your exact internal browser app |
| WebArena-Verified | Controlled browser workflows and regression testing | Whether an agent stays reliable in reproducible, self-hosted browser tasks with deterministic scoring | How it behaves on changing live sites, CAPTCHAs, or messy real-world web drift |
| WebVoyager / Online-Mind2Web | Open-web browsing and navigation reach | Whether an agent can move through realistic public websites and open-ended browsing tasks | Stable apples-to-apples comparison across dates, site changes, and vendor-specific evaluation tweaks |
What each benchmark is actually testing
OSWorld-Verified
OSWorld was designed for real computer use, not just browser navigation. Its official benchmark covers 369 real-world tasks across web and desktop apps, including workflows that span multiple applications. On July 28, 2025, the benchmark was upgraded to OSWorld-Verified after the maintainers said they had fixed more than 300 issues, improved infrastructure, and created a more stable evaluation foundation. If your target workflow looks like finance ops, spreadsheet handling, document updates, or back-office software execution, this is usually the most relevant public benchmark to start with.
WebArena-Verified
WebArena already mattered because it created a standalone, self-hostable environment for autonomous browser agents. WebArena-Verified tightened that further. The September 19, 2025 paper describes an audit of all 812 tasks, more deterministic scoring, repaired evaluator logic, and a 258-task hard subset that cuts runtime while preserving coverage. That makes WebArena-Verified especially useful when you want to compare releases of the same agent or vendor under stable conditions instead of chasing noisy live-site variance.
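To make the release-to-release use case concrete, here is a minimal sketch of what regression tracking over a fixed, deterministic task set might look like. The JSON result format, task IDs, and file names are assumptions about your own harness, not part of WebArena-Verified itself.

```python
# Minimal sketch of release-to-release regression tracking over a fixed task set.
# The result-file format, task IDs, and file names are hypothetical.
import json


def load_results(path: str) -> dict[str, bool]:
    """Load {task_id: passed} from a JSON list of per-task results."""
    with open(path) as f:
        return {r["task_id"]: bool(r["passed"]) for r in json.load(f)}


def compare_releases(baseline_path: str, candidate_path: str) -> None:
    baseline = load_results(baseline_path)
    candidate = load_results(candidate_path)
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no overlapping task IDs between the two runs")

    # Tasks the old release passed but the new one fails are the regressions to triage.
    regressions = [t for t in shared if baseline[t] and not candidate[t]]
    improvements = [t for t in shared if not baseline[t] and candidate[t]]

    print(f"shared tasks: {len(shared)}")
    print(f"baseline pass rate:  {sum(baseline[t] for t in shared) / len(shared):.1%}")
    print(f"candidate pass rate: {sum(candidate[t] for t in shared) / len(shared):.1%}")
    print(f"regressions ({len(regressions)}): {regressions}")
    print(f"improvements ({len(improvements)}): {improvements}")


if __name__ == "__main__":
    compare_releases("agent_v1_results.json", "agent_v2_results.json")
```

The point of the deterministic scoring in WebArena-Verified is that a diff like this is meaningful: a task flipping from pass to fail reflects the agent, not the environment.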
WebVoyager and Online-Mind2Web
WebVoyager and Mind2Web matter because they test something the controlled benchmarks do not fully capture: the chaos of realistic website interaction. The original WebVoyager paper focused on 15 real-world websites. Mind2Web pushed toward broader web generalization across many domains. That makes these benchmarks good signals for agents that need to browse unfamiliar sites, gather information, or complete public-web tasks. But Google’s October 7, 2025 evaluation notes also explain why self-reported WebVoyager results are hard to compare over time: tasks can be date-edited, infeasible tasks may be removed, and models are often run at different points in time against sites that have since changed.
The current model numbers are useful, but only in the right frame
Google’s October 7, 2025 Gemini 2.5 Computer Use model card is one of the most useful public snapshots because it published a normalized Browserbase comparison across Google, Anthropic, and OpenAI computer-use APIs. On that shared harness, Gemini 2.5 Computer Use posted 65.7% on Online-Mind2Web and 79.9% on WebVoyager, versus Claude Sonnet 4.5 at 55.0% and 71.4%, Claude Sonnet 4 at 61.0% and 69.4%, and OpenAI’s Computer-Using Agent at 44.3% and 61.0%.
That same table also shows why buyers should slow down before declaring a universal winner. Google listed its own model as not yet supporting OS-level control in that snapshot, while Anthropic and OpenAI reported separate OSWorld numbers. OpenAI’s January 23, 2025 CUA launch reported 38.1% on OSWorld, 58.1% on WebArena, and 87% on WebVoyager. Anthropic’s February 17, 2026 Sonnet 4.6 launch, meanwhile, emphasized a major jump in computer use and explicitly noted that scores from Sonnet 4.5 onward use OSWorld-Verified, not the original OSWorld. In other words, model generations are improving while the benchmarks themselves are also being repaired. Older cross-vendor tables are still useful, but they are snapshots, not final procurement answers.
The OpenAI side of the market has also moved again since those older CUA numbers. On April 23, 2026, OpenAI released GPT-5.5 and positioned it for software, spreadsheet, and tool-driven work on a computer. That matters less as a single benchmark datapoint and more as a reminder that model selection is moving faster than many benchmark headlines. If you are comparing vendors in 2026, you need to track current model releases and benchmark revisions at the same time.
Where benchmark winners still fail in production
- Website drift: public sites change layouts, URLs, and content faster than benchmark papers update.
- Prompt injection and risky actions: computer-use agents can encounter malicious instructions or attempt actions you do not want fully automated.
- Authentication and human gates: CAPTCHAs, MFA, and approval steps still break otherwise strong agent runs.
- Latency and retries: a benchmark success may still be too slow or too brittle for a customer-facing or time-sensitive workflow (a minimal guardrail sketch follows this list).
- Workflow mismatch: an agent that scores well on open-web browsing can still struggle in your internal admin console, ERP flow, or spreadsheet-heavy back-office process.
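Several of these failure modes can at least be bounded in code before an agent touches production. The sketch below assumes a hypothetical propose/execute agent interface and illustrative action names; it is not any vendor's real API, just one way to encode a time budget, a retry limit, and a human-approval gate.

```python
# Minimal sketch of guardrails a benchmark pass does not cover: a time budget,
# a bounded step count, and a human-approval gate before sensitive actions run.
# The propose/execute interface and action names are assumptions, not a real API.
import time
from typing import Callable

SENSITIVE_ACTIONS = {"submit_payment", "delete_record", "send_external_email"}


def run_with_guardrails(
    propose_action: Callable[[], dict],       # hypothetical: agent proposes its next action
    execute_action: Callable[[dict], bool],   # hypothetical: returns True when the task finishes
    approve: Callable[[dict], bool],          # human approval hook (Slack ping, ticket, etc.)
    max_steps: int = 30,
    time_budget_s: float = 120.0,
) -> dict:
    start = time.monotonic()
    for step in range(1, max_steps + 1):
        if time.monotonic() - start > time_budget_s:
            return {"status": "failed", "reason": "time budget exceeded", "steps": step}

        action = propose_action()

        # Gate risky actions behind explicit human approval instead of auto-executing.
        if action.get("name") in SENSITIVE_ACTIONS and not approve(action):
            return {"status": "blocked", "reason": f"approval denied: {action['name']}", "steps": step}

        if execute_action(action):
            return {"status": "success", "steps": step}

    return {"status": "failed", "reason": "step budget exhausted", "steps": max_steps}
```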
This is why benchmark reading should narrow the field, not end the buying decision. Public benchmarks tell you where to start testing. They do not replace private evals on the workflows that actually create value for your business.
How to choose without overfitting to one leaderboard
If your target job looks like a real employee moving among browser tabs, spreadsheets, files, and documents, start with OSWorld-Verified. If you need a stable browser benchmark for release-to-release testing, start with WebArena-Verified. If your product depends on navigating unfamiliar public websites, use WebVoyager-style or Online-Mind2Web results as a secondary signal for generalization, then validate on your own live-site tasks.
For most operators, the best evaluation stack is a two-layer plan: one public benchmark aligned to your task surface, and one internal test set built from the 10 to 30 workflows that would actually matter in production. That is the point where benchmark consumption turns into deployment judgment.
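As one way to structure that second layer, here is a minimal sketch of a private eval set scored per task surface, so results map back to whichever public benchmark you weighted. The task IDs, surface labels, and runner interface are all hypothetical placeholders for your own workflows.

```python
# Minimal sketch of a private, two-layer eval: internal tasks grouped by the
# surface they live on, scored as pass rate per surface. Names are hypothetical.
from collections import defaultdict
from typing import Callable

INTERNAL_TASKS = [
    {"id": "invoice-approval-01", "surface": "desktop_multi_app"},
    {"id": "crm-record-update-02", "surface": "internal_browser_app"},
    {"id": "vendor-lookup-03", "surface": "open_web"},
    # ...expand to the 10 to 30 workflows that create value in production.
]


def run_internal_evals(run_task: Callable[[str], bool]) -> dict[str, float]:
    """Return pass rate per task surface, given a runner that executes one task by id."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for task in INTERNAL_TASKS:
        total[task["surface"]] += 1
        if run_task(task["id"]):
            passed[task["surface"]] += 1
    return {surface: passed[surface] / total[surface] for surface in total}
```

Grouping by surface keeps the mapping honest: a strong desktop_multi_app pass rate corroborates an OSWorld-Verified signal, while a weak open_web rate tells you the WebVoyager-style numbers did not transfer to your sites.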