OpenAI’s May 27 Tax AI Reveal Turns Codex Into a Self-Improving Workflow Engine

Key Takeaways

  • OpenAI says its Codex-powered Tax AI pilot processed 7,000 returns across Crete’s network of 30-plus accounting firms.
  • The company claims about one-third time savings, up to 97% accuracy, and roughly 50% higher throughput.
  • The core product signal is a self-improving loop that turns practitioner corrections into evals and bounded Codex tasks.
  • OpenAI is positioning Codex as regulated-workflow infrastructure, not just a coding assistant.
  • The same design is already being extended into bookkeeping, audit, and IT help desk workflows.

On May 27, 2026, OpenAI published a new engineering case study with Thrive Holdings showing how the companies built "Tax AI" for Crete Professionals Alliance, an accounting network of more than 30 firms. The system uses Codex inside a feedback loop that turns practitioner corrections into evals and engineering tasks, and OpenAI says the pilot processed 7,000 tax returns this season, saved practitioners about a third of preparation time, reached up to 97% accuracy, and increased throughput by about 50%.

The announcement matters because it pushes Codex beyond the crowded AI coding-assistant narrative. OpenAI is using a regulated, document-heavy professional-services workflow to argue that the next enterprise value from agents may come from self-improving execution loops, not just one-shot copilots.

What OpenAI published on May 27

OpenAI said forward-deployed engineers and researchers worked with Thrive Holdings engineers for roughly six months to build Tax AI alongside Crete’s accountants. The goal was to attack a tax-season bottleneck: medium- and large-complexity filings can require hours of manual data entry across messy source documents, prior-year returns, schedules, and client notes.

  • Crete’s network spans 30-plus accounting firms.
  • Tax AI processed 7,000 tax returns during the pilot tax season.
  • OpenAI says the system automates much of the preparation work for 1040 and 1041 returns.
  • The company says practitioners saved about one-third of tax-preparation time, while throughput rose by about 50%.
  • OpenAI also says drafted returns reached up to 97% accuracy.
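The time-savings and throughput figures above are consistent with each other under one simple assumption, which is ours and not something OpenAI states: that a practitioner's capacity is limited only by preparation time per return. If each return takes two-thirds of the time it used to, the same hours cover 1.5 times as many returns. A minimal sketch of that arithmetic:

```python
from fractions import Fraction


def throughput_gain(time_saved: Fraction) -> Fraction:
    """Relative throughput increase when each unit of work takes
    (1 - time_saved) of its previous time.

    Assumption (ours): throughput is inversely proportional to time per return.
    """
    if not 0 <= time_saved < 1:
        raise ValueError("time_saved must be in [0, 1)")
    return 1 / (1 - time_saved) - 1


gain = throughput_gain(Fraction(1, 3))
print(f"{float(gain):.0%}")  # prints 50%
```

Real throughput also depends on review queues, client responsiveness, and staffing, so this is a plausibility check on the reported numbers and not a derivation of them.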

The more unusual claim is not the automation itself but the improvement curve. OpenAI says that at launch only about a quarter of returns reached 75% correct field completion, but within six weeks 86% hit that mark. The company presents that as evidence that the system improved materially after deployment instead of relying on a static pre-launch quality bar.
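To make the metric concrete, here is a minimal sketch of how a "share of returns reaching 75% correct field completion" figure could be computed. OpenAI's post does not describe its exact calculation, so the `DraftReturn` record, its field names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DraftReturn:
    """Hypothetical record for one drafted return."""
    return_id: str
    fields_total: int    # fields the system attempted to fill
    fields_correct: int  # fields the reviewer left unchanged


def completion_rate(r: DraftReturn) -> float:
    """Fraction of fields that needed no correction."""
    return r.fields_correct / r.fields_total if r.fields_total else 0.0


def share_meeting_bar(returns: list[DraftReturn], bar: float = 0.75) -> float:
    """Share of returns whose completion rate is at or above the bar."""
    if not returns:
        return 0.0
    hits = sum(1 for r in returns if completion_rate(r) >= bar)
    return hits / len(returns)


# Made-up sample data, for illustration only.
sample = [
    DraftReturn("a", 40, 32),
    DraftReturn("b", 40, 20),
    DraftReturn("c", 50, 30),
    DraftReturn("d", 20, 10),
]
print(share_meeting_bar(sample))  # 0.25
```

Tracking this share week by week, and not just an average accuracy number, is what makes an improvement curve like the one OpenAI describes visible.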

Why the bigger signal is self-improvement, not tax prep

OpenAI frames Tax AI around a three-part loop: practitioner feedback, production traces, and Codex-driven iteration. In practice, that means human corrections are captured as structured evidence, recurring failures are grouped into targeted evals, and Codex is used to investigate root causes, propose fixes, rerun evals, and surface candidate changes for review.
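The first two steps of that loop can be sketched in a few lines. This is an illustrative sketch under our own assumptions, not OpenAI's implementation: the `Correction` and `EvalCase` types, the grouping key of form plus field, and the `min_occurrences` threshold are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Correction:
    """Hypothetical record of one practitioner edit to a drafted field."""
    return_id: str
    form: str       # e.g. "1040"
    field: str      # hypothetical field label
    drafted: str    # value the system produced
    corrected: str  # value the practitioner entered


@dataclass(frozen=True)
class EvalCase:
    return_id: str
    form: str
    field: str
    expected: str


def build_evals(corrections: list[Correction], min_occurrences: int = 2):
    """Group real corrections by (form, field); keep only recurring failures."""
    clusters = defaultdict(list)
    for c in corrections:
        if c.drafted != c.corrected:  # ignore no-op edits
            clusters[(c.form, c.field)].append(c)
    return {
        key: [EvalCase(c.return_id, c.form, c.field, c.corrected) for c in group]
        for key, group in clusters.items()
        if len(group) >= min_occurrences
    }


def pass_rate(cases: list[EvalCase], predict: Callable[[EvalCase], str]) -> float:
    """Score a candidate change; `predict` stands in for the drafting system."""
    if not cases:
        return 0.0
    return sum(1 for case in cases if predict(case) == case.expected) / len(cases)


# Made-up corrections, for illustration only.
log = [
    Correction("r1", "1040", "wages", "100", "120"),
    Correction("r2", "1040", "wages", "90", "95"),
    Correction("r3", "1041", "interest", "5", "7"),
]
evals = build_evals(log)
print(sorted(evals))  # [('1040', 'wages')]
```

In a setup like this, a candidate fix would be surfaced for human review only if `pass_rate` on the targeted cluster improves without regressing other clusters, which is the kind of validation gate the post describes.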

That is a more ambitious enterprise story than an AI tool helping accountants fill forms faster. It suggests OpenAI wants Codex seen as an engine for operational improvement inside live workflows, especially where errors are costly and edge cases keep changing. That also fits OpenAI’s broader 2026 Codex strategy: in April, Reuters reported the company was expanding Codex through major consulting and systems-integration partners and launching Codex Labs to embed specialists directly in customer environments.

Seen that way, the May 27 post is a deployment proof point. OpenAI is showing how Codex can live inside a bounded, eval-backed environment where the model is asked not to "solve tax" outright but to improve a narrow product surface using production evidence and human validation gates.

Business impact lands first in regulated document workflows

Tax work is a strong test case because it combines high document volume, repetitive extraction, strict review requirements, and expensive human expertise. If OpenAI’s numbers hold up beyond the pilot, the commercial implication is bigger than accounting software. Similar patterns exist in audit prep, bookkeeping, revenue operations, claims processing, underwriting, legal intake, procurement review, and back-office compliance workflows.

OpenAI also hints at that expansion path directly. The company says the same design is already being reused for other accounting workflows such as bookkeeping and audit, and for operational workflows like IT help desk automation.

That matters for enterprise AI buyers because it teaches a different lesson from the typical agent demo. The limiting factor is not only model intelligence. It is whether a company can capture production signals, turn them into usable evals, and keep humans in the loop where judgment, exceptions, and liability still matter. The firms that get that layer right may build agents that compound in value over time instead of flattening after launch.

One anecdote in the OpenAI post captures the upside clearly: a senior accountant who spent 180 hours on tax prep last year reportedly spent only 15 hours on it this year, a reduction of roughly 92% that freed time for client calls, new clients, and additional service lines. Even if that outcome is not representative of every user, it shows the kind of labor reallocation OpenAI wants enterprise buyers to picture.

What to watch next

The next question is whether this remains a strong vertical case study or becomes a repeatable product pattern. Three signals will matter.

  • Repeatability: Can OpenAI show similar gains in other regulated workflows, not just one tax-season pilot?
  • Tooling: Will Codex productize more of the trace-to-eval-to-fix loop instead of leaving it to custom engineering teams?
  • Go-to-market: Will consultancies, integrators, and industry platforms turn this model into packaged enterprise rollouts?

For AI agents and automation teams, the May 27 announcement is a useful marker. The frontier race is no longer just about which model writes cleaner code or answers harder questions. It is increasingly about which stack can turn live production work into a governed learning loop that keeps improving after deployment.

Find the first regulated workflow worth automating

This story shows that agent gains come from workflow fit, evals, and governed feedback loops rather than model access alone. Use Scope to identify where a finance, operations, or back-office AI agent could create measurable value without adding unnecessary risk.
