On May 27, 2026, OpenAI published a new engineering case study with Thrive Holdings showing how the companies built "Tax AI" for Crete Professionals Alliance, an accounting network of more than 30 firms. The system uses Codex inside a feedback loop that turns practitioner corrections into evals and engineering tasks. OpenAI says the pilot processed 7,000 tax returns during this year's tax season, saved practitioners about a third of preparation time, reached up to 97% accuracy, and increased throughput by about 50%.
The announcement matters because it pushes Codex beyond the crowded AI coding-assistant narrative. OpenAI is using a regulated, document-heavy professional-services workflow to argue that the next enterprise value from agents may come from self-improving execution loops, not just one-shot copilots.
What OpenAI published on May 27
OpenAI said its forward-deployed engineers and researchers worked with Thrive Holdings engineers for roughly six months to build Tax AI alongside Crete’s accountants. The goal was to attack a tax-season bottleneck: medium- and high-complexity filings can require hours of manual data entry across messy source documents, prior-year returns, schedules, and client notes.
- Crete’s network spans 30-plus accounting firms.
- Tax AI processed 7,000 tax returns during the pilot tax season.
- OpenAI says the system automates much of the preparation work for 1040 and 1041 returns.
- The company says practitioners saved about one-third of tax-preparation time, while throughput rose by about 50%.
- OpenAI also says drafted returns reached up to 97% accuracy.
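The time-savings and throughput figures are roughly consistent with each other. As a back-of-the-envelope check (an assumption of this article, not a calculation from OpenAI's post), if preparation time is the binding constraint and practitioner hours stay fixed, cutting time per return by a fraction s raises throughput by 1 / (1 - s) - 1. The function name below is illustrative.

```python
def throughput_gain(time_saved_fraction: float) -> float:
    """Relative throughput increase when time per return drops by the
    given fraction, assuming total practitioner hours stay fixed."""
    if not 0 <= time_saved_fraction < 1:
        raise ValueError("time_saved_fraction must be in [0, 1)")
    return 1 / (1 - time_saved_fraction) - 1


# Saving one-third of the time per return, as reported for the pilot
gain = throughput_gain(1 / 3)
print(f"{gain:.0%}")  # prints 50%
```

Real workflows include review and client communication that may not shrink at the same rate, so this is only a sanity check on the headline numbers.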
The more unusual claim is not the automation itself but the improvement curve. OpenAI says that at launch only about a quarter of returns reached 75% correct field completion, but within six weeks 86% hit that mark. The company presents that as evidence that the system improved materially after deployment instead of relying on a static pre-launch quality bar.
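To make the metric concrete, here is a minimal sketch of how a "share of returns reaching 75% correct field completion" figure could be computed. OpenAI has not published its exact definition; the `DraftReturn` type, its fields, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DraftReturn:
    return_id: str
    fields_total: int    # fields the system attempted to fill
    fields_correct: int  # fields the reviewer accepted without correction


def completion_rate(r: DraftReturn) -> float:
    """Fraction of fields on one return that needed no correction."""
    return r.fields_correct / r.fields_total if r.fields_total else 0.0


def share_meeting_bar(returns: list[DraftReturn], bar: float = 0.75) -> float:
    """Fraction of returns whose completion rate is at or above the bar."""
    if not returns:
        return 0.0
    return sum(1 for r in returns if completion_rate(r) >= bar) / len(returns)


# Made-up sample data, not figures from the pilot
launch_week = [
    DraftReturn("a", 100, 80),
    DraftReturn("b", 100, 60),
    DraftReturn("c", 100, 50),
    DraftReturn("d", 100, 70),
]
week_six = [
    DraftReturn("e", 100, 90),
    DraftReturn("f", 100, 75),
    DraftReturn("g", 100, 78),
    DraftReturn("h", 100, 60),
]

print(share_meeting_bar(launch_week))  # 0.25
print(share_meeting_bar(week_six))     # 0.75
```

Tracking this share week over week, rather than a single average accuracy number, is what makes a post-launch improvement curve visible.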
Why the bigger signal is self-improvement, not tax prep
OpenAI frames Tax AI around a three-part loop: practitioner feedback, production traces, and Codex-driven iteration. In practice, that means human corrections are captured as structured evidence, recurring failures are grouped into targeted evals, and Codex is used to investigate root causes, propose fixes, rerun evals, and surface candidate changes for review.
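The loop can be sketched in a few lines. This is an illustration under stated assumptions, not OpenAI's or Thrive's implementation: every type and function name here (`Correction`, `EvalCase`, `build_evals`, `ready_for_review`) is hypothetical, and the "candidate" extractor stands in for whatever change a Codex-driven iteration might propose.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Correction:
    """One practitioner fix captured as structured evidence."""
    return_id: str
    form_field: str
    drafted: str
    corrected: str


@dataclass(frozen=True)
class EvalCase:
    return_id: str
    form_field: str
    expected: str


Extractor = Callable[[str, str], str]  # (return_id, form_field) -> value


def build_evals(corrections: list[Correction], min_count: int = 2) -> dict[str, list[EvalCase]]:
    """Group corrections by field; keep only recurring failures as evals."""
    by_field: dict[str, list[EvalCase]] = defaultdict(list)
    for c in corrections:
        by_field[c.form_field].append(EvalCase(c.return_id, c.form_field, c.corrected))
    return {f: cases for f, cases in by_field.items() if len(cases) >= min_count}


def pass_rate(predict: Extractor, cases: list[EvalCase]) -> float:
    return sum(predict(c.return_id, c.form_field) == c.expected for c in cases) / len(cases)


def ready_for_review(baseline: Extractor, candidate: Extractor,
                     evals: dict[str, list[EvalCase]]) -> bool:
    """Surface a candidate only if it regresses nowhere and improves somewhere."""
    improved = False
    for cases in evals.values():
        before, after = pass_rate(baseline, cases), pass_rate(candidate, cases)
        if after < before:
            return False
        improved = improved or after > before
    return improved


# Made-up corrections: "wages" fails repeatedly, "interest" only once
corrections = [
    Correction("r1", "wages", "50000", "52000"),
    Correction("r2", "wages", "0", "41000"),
    Correction("r3", "interest", "10", "12"),
]
evals = build_evals(corrections)

truth = {("r1", "wages"): "52000", ("r2", "wages"): "41000"}
baseline: Extractor = lambda rid, field: "0"
candidate: Extractor = lambda rid, field: truth.get((rid, field), "0")

print(list(evals))                                   # ['wages']
print(ready_for_review(baseline, candidate, evals))  # True
```

The gate only decides whether a change is worth a human's attention; in the workflow OpenAI describes, a person still reviews the candidate before it ships.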
That is a more ambitious enterprise story than an AI tool helping accountants fill forms faster. It suggests OpenAI wants Codex seen as an engine for operational improvement inside live workflows, especially where errors are costly and edge cases keep changing. That also fits OpenAI’s broader 2026 Codex strategy: in April, Reuters reported the company was expanding Codex through major consulting and systems-integration partners and launching Codex Labs to embed specialists directly in customer environments.
Seen that way, the May 27 post is a deployment proof point. OpenAI is showing how Codex can live inside a bounded, eval-backed environment where the model is asked not to "solve tax" outright, but to improve a narrow product surface using production evidence and human validation gates.
Business impact lands first in regulated document workflows
Tax work is a strong test case because it combines high document volume, repetitive extraction, strict review requirements, and expensive human expertise. If OpenAI’s numbers hold up beyond the pilot, the commercial implications reach well beyond accounting software. Similar patterns exist in audit prep, bookkeeping, revenue operations, claims processing, underwriting, legal intake, procurement review, and back-office compliance workflows.
OpenAI also hints at that expansion path directly. The company says the same design is already being reused for other accounting workflows such as bookkeeping and audit, and for operational workflows like IT help desk automation.
That matters for enterprise AI buyers because it teaches a different lesson from the typical agent demo. The limiting factor is not only model intelligence. It is also whether a company can capture production signals, turn them into usable evals, and keep humans in the loop where judgment, exceptions, and liability still matter. The firms that get that layer right may build agents that compound in value over time instead of plateauing after launch.
One anecdote in the OpenAI post captures the upside clearly: a senior accountant who spent 180 hours on tax prep last year reportedly spent only 15 hours on it this year, freeing time for client calls, new clients, and additional service lines. Even if that outcome is not representative across every user, it shows the kind of labor reallocation OpenAI wants enterprise buyers to picture.
What to watch next
The next question is whether this remains a strong vertical case study or becomes a repeatable product pattern. Three signals will matter.
- Repeatability: Can OpenAI show similar gains in other regulated workflows, not just one tax-season pilot?
- Tooling: Will Codex productize more of the trace-to-eval-to-fix loop instead of leaving it to custom engineering teams?
- Go-to-market: Will consultancies, integrators, and industry platforms turn this model into packaged enterprise rollouts?
For AI agents and automation teams, the May 27 announcement is a useful marker. The frontier race is no longer just about which model writes cleaner code or answers harder questions. It is increasingly about which stack can turn live production work into a governed learning loop that keeps improving after deployment.