
Which Benchmark Actually Predicts Computer-Use Agent Performance? OSWorld-Verified, WebArena-Verified, and WebVoyager


Key Takeaways

  • OSWorld-Verified is the strongest public signal for full desktop and multi-app computer-use workflows.
  • WebArena-Verified is better than live-web benchmarks for reproducible browser-agent regression testing.
  • WebVoyager-style scores can be useful for open-web reach, but they are easier to misread across different dates and setups.
  • Older cross-vendor benchmark tables are snapshots, not a reliable substitute for current-model context and private evals.
  • The safest buying process is one public benchmark plus a small internal workflow test set.

If you are evaluating AI browser or computer-use agents, the metric that matters first is not a single headline leaderboard. It is task completion on the surface your workflow actually lives on. For multi-app desktop work, weight OSWorld-Verified most heavily. For repeatable browser flows you want to regression-test, weight WebArena-Verified. For open-web navigation and general website exploration, use WebVoyager or Online-Mind2Web as secondary signals. That ordering matters because these benchmarks measure different kinds of difficulty, use different evaluation setups, and can produce very different winner lists.

The short verdict

The easiest way to misread computer-use benchmarks is to treat them like one shared leaderboard. They are not. OSWorld-Verified is closest to a real employee moving across apps, files, documents, and browser tabs. WebArena-Verified is closer to a controlled browser lab, which makes it better for reproducible release-to-release regression testing. WebVoyager is still useful for open-web navigation, but recent work in March 2026 showed why it is risky to compare old self-reported results at face value without standardizing task setup and scoring.

Which benchmark matters for which workflow

OSWorld-Verified
  • Best fit: Desktop and multi-app automation
  • What it tells you: Whether an agent can handle real computer work across files, browsers, office tools, and long action chains
  • What it can miss: How well it generalizes to the open web or your exact internal browser app

WebArena-Verified
  • Best fit: Controlled browser workflows and regression testing
  • What it tells you: Whether an agent stays reliable in reproducible, self-hosted browser tasks with deterministic scoring
  • What it can miss: How it behaves on changing live sites, CAPTCHAs, or messy real-world web drift

WebVoyager / Online-Mind2Web
  • Best fit: Open-web browsing and navigation reach
  • What it tells you: Whether an agent can move through realistic public websites and open-ended browsing tasks
  • What it can miss: Stable apples-to-apples comparison across dates, site changes, and vendor-specific evaluation tweaks

What each benchmark is actually testing

OSWorld-Verified

OSWorld was designed for real computer use, not just browser navigation. Its official benchmark covers 369 real-world tasks across web and desktop apps, including workflows that span multiple applications. On July 28, 2025, the benchmark was upgraded to OSWorld-Verified after the maintainers said they had fixed more than 300 issues, improved infrastructure, and created a more stable evaluation foundation. If your target workflow looks like finance ops, spreadsheet handling, document updates, or back-office software execution, this is usually the most relevant public benchmark to start with.

WebArena-Verified

WebArena already mattered because it created a standalone, self-hostable environment for autonomous browser agents. WebArena Verified tightened that further. The September 19, 2025 paper describes an audit of all 812 tasks, more deterministic scoring, repaired evaluator logic, and a 258-task hard subset that cuts runtime while preserving coverage. That makes WebArena-Verified especially useful when you want to compare releases of the same agent or vendor under stable conditions instead of chasing noisy live-site variance.

WebVoyager and Online-Mind2Web

WebVoyager and Mind2Web matter because they test something the controlled benchmarks do not fully capture: the chaos of realistic website interaction. The original WebVoyager paper focused on 15 real-world websites. Mind2Web pushed toward broader web generalization across many domains. That makes these benchmarks good signals for agents that need to browse unfamiliar sites, gather information, or complete public-web tasks. But Google’s October 7, 2025 evaluation notes also explain why self-reported WebVoyager results are hard to compare over time: tasks can be date-edited, infeasible tasks may be removed, and models are often run at different points against changing sites.

The current model numbers are useful, but only in the right frame

Google’s October 7, 2025 Gemini 2.5 Computer Use model card is one of the most useful public snapshots because it published a normalized Browserbase comparison across Google, Anthropic, and OpenAI computer-use APIs. On that shared harness, Gemini 2.5 Computer Use posted 65.7% on Online-Mind2Web and 79.9% on WebVoyager, versus Claude Sonnet 4.5 at 55.0% and 71.4%, Claude Sonnet 4 at 61.0% and 69.4%, and OpenAI’s Computer-Using Agent at 44.3% and 61.0%.

That same table also shows why buyers should slow down before declaring a universal winner. Google listed its own model as not yet supporting OS-level control in that snapshot, while Anthropic and OpenAI reported separate OSWorld numbers. OpenAI’s January 23, 2025 CUA launch reported 38.1% on OSWorld, 58.1% on WebArena, and 87% on WebVoyager. Anthropic’s February 17, 2026 Sonnet 4.6 launch, meanwhile, emphasized a major jump in computer use and explicitly noted that scores from Sonnet 4.5 onward use OSWorld-Verified, not the original OSWorld. In other words, model generations are improving while the benchmarks themselves are also being repaired. Older cross-vendor tables are still useful, but they are snapshots, not final procurement answers.

The OpenAI side of the market has also moved again since those older CUA numbers. On April 23, 2026, OpenAI released GPT-5.5 and positioned it for software, spreadsheet, and tool-driven work on a computer. That matters less as a single benchmark datapoint and more as a reminder that model selection is moving faster than many benchmark headlines. If you are comparing vendors in 2026, you need current model context and current benchmark context at the same time.

Where benchmark winners still fail in production

  • Website drift: public sites change layouts, URLs, and content faster than benchmark papers update.
  • Prompt injection and risky actions: computer-use agents can encounter malicious instructions or attempt actions you do not want fully automated.
  • Authentication and human gates: CAPTCHAs, MFA, and approval steps still break otherwise strong agent runs.
  • Latency and retries: a benchmark success may still be too slow or too brittle for a customer-facing or time-sensitive workflow.
  • Workflow mismatch: an agent that scores well on open-web browsing can still struggle in your internal admin console, ERP flow, or spreadsheet-heavy back-office process.

This is why benchmark reading should narrow the field, not end the buying decision. Public benchmarks tell you where to start testing. They do not replace private evals on the workflows that actually create value for your business.

How to choose without overfitting to one leaderboard

If your target job looks like a real employee moving among browser tabs, spreadsheets, files, and documents, start with OSWorld-Verified. If you need a stable browser benchmark for release-to-release testing, start with WebArena-Verified. If your product depends on navigating unfamiliar public websites, use WebVoyager-style or Online-Mind2Web results as a secondary signal for generalization, then validate on your own live-site tasks.

For most operators, the best evaluation stack is a two-layer plan: one public benchmark aligned to your task surface, and one internal test set built from the 10 to 30 workflows that would actually matter in production. That is the point where benchmark consumption turns into deployment judgment.
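That two-layer plan can be made concrete with a small sketch of the internal layer. Everything below is hypothetical scaffolding rather than any real API: `EvalTask`, `run_eval_set`, and the stubbed task functions stand in for however your agent runs would actually execute.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    """One internal workflow task; run_fn returns True on success."""
    name: str
    run_fn: Callable[[], bool]


def run_eval_set(tasks: list[EvalTask]) -> float:
    """Run every task once and return the overall pass rate."""
    passed = sum(1 for t in tasks if t.run_fn())
    return passed / len(tasks)


# Toy usage: lambdas stand in for real agent runs against real workflows.
tasks = [
    EvalTask("export_invoice_csv", lambda: True),
    EvalTask("update_crm_record", lambda: True),
    EvalTask("reconcile_spreadsheet", lambda: False),
]
print(f"pass rate: {run_eval_set(tasks):.0%}")  # pass rate: 67%
```

In practice each `run_fn` would drive the candidate agent against a staged copy of the workflow and check the end state, which is exactly the part public benchmarks cannot do for you.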

Pick the benchmark before you pick the model

Match the benchmark to the surface your agent will control, then validate with a small private eval set before rollout.

Desktop, spreadsheet, document, or multi-app execution
  • Benchmark to weight most: OSWorld-Verified
  • Reason: Closest public signal for full computer-use capability across real applications

Controlled browser automation and release-to-release testing
  • Benchmark to weight most: WebArena-Verified
  • Reason: Containerized tasks and deterministic scoring make regressions easier to trust

Open-web research, navigation, and changing public sites
  • Benchmark to weight most: WebVoyager or Online-Mind2Web
  • Reason: Better signal for live-site browsing, but less stable over time

Vendor shortlisting before a pilot
  • Benchmark to weight most: Normalized shared harness plus private evals
  • Reason: Reduces cherry-picked benchmark claims and exposes workflow-specific failure modes
  • Pick one benchmark that matches your task surface, not the noisiest leaderboard.
  • Build a 10 to 30 task internal eval set from real workflows before rollout.
  • Track pass rate, step count, latency, retries, and human intervention together.
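Tracking those signals together is straightforward to sketch. The `RunResult` and `summarize` names below are hypothetical illustrations, but the shape shows why a lone pass-rate number is not enough: a run can pass while quietly burning retries and human interventions.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Metrics captured from one agent run of one task."""
    passed: bool
    steps: int
    latency_s: float
    retries: int
    human_interventions: int


def summarize(results: list[RunResult]) -> dict[str, float]:
    """Aggregate all five signals so no single metric hides a regression."""
    n = len(results)
    return {
        "pass_rate": sum(r.passed for r in results) / n,
        "avg_steps": sum(r.steps for r in results) / n,
        "avg_latency_s": sum(r.latency_s for r in results) / n,
        "avg_retries": sum(r.retries for r in results) / n,
        "intervention_rate": sum(r.human_interventions > 0 for r in results) / n,
    }


# Toy data: the second run "passed" but needed retries and a human assist.
results = [
    RunResult(True, 12, 45.0, 0, 0),
    RunResult(True, 30, 120.0, 2, 1),
    RunResult(False, 50, 300.0, 3, 1),
]
print(summarize(results))
```

Comparing two models on this summary, rather than pass rate alone, surfaces exactly the latency and intervention failure modes listed above.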

Frequently Asked Questions

Is OSWorld-Verified better than WebArena-Verified?

Not universally. OSWorld-Verified is better for multi-app computer-use tasks, while WebArena-Verified is better for reproducible browser benchmarking and regression testing.

Why do WebVoyager scores from different vendors often disagree?

Because models may be run on different dates, with edited task sets, different harnesses, or different scoring practices. Live sites also change over time, which makes comparisons noisier.

Should I choose a model from public benchmark results alone?

No. Public benchmarks are useful for shortlisting, but the final decision should come from private evals on your own highest-value workflows.

What is the minimum internal eval set a team should run?

A practical starting point is 10 to 30 real tasks that represent the workflows you care about most, scored for pass rate, latency, retries, and required human intervention.

When does WebArena-Verified matter most?

It matters most when you need stable, controlled browser tasks to compare model or agent changes over time without as much live-web noise.

Turn one manual computer task into a custom AI agent

Once you know which benchmark surface matches your workflow, the next step is testing a real task instead of living on public leaderboards. Generate a custom Nerova agent for a browser, spreadsheet, or back-office workflow and validate it on the work your team actually does.
