GPT-5.6 Sol Ultra Is Here. It Doesn’t Just Answer — It Delegates.

OpenAI released GPT-5.6 Sol on June 26, the first model that can autonomously spin up sub-agents to parallelize complex tasks — not a prototype, but a shipping product. Ultra Mode pushed its Terminal-Bench score past Claude Mythos 5, 91.9% to 88.0% — proof that the agentic AI race just shifted from “smarter models” to “models that can build their own workforce.”

Jeff Editorial · 2 min read

Three Tiers, One Winner

OpenAI released GPT-5.6 Sol on June 26 alongside Terra and Luna, a three-tier lineup that replaces the old Pro/Mini naming.

The benchmark numbers are strong. Sol scored 88.8% on Terminal-Bench 2.1, a test that measures how well a model can navigate a command-line environment: planning steps, running commands, interpreting errors, and recovering. That makes it closer to real software engineering than to a multiple-choice exam. With Ultra Mode enabled, the score rose to 91.9%, surpassing Claude Mythos 5’s 88.0%.

On ExploitBench, Sol matched the performance of Anthropic’s Mythos Preview, the model that identified over 10,000 critical vulnerabilities, while using about one-third of the output tokens. On GeneBench v1, Sol outperformed GPT-5.5 while consuming fewer tokens. In capture-the-flag (CTF) challenges, Sol hit a 96.7% success rate.

The pricing is also notable. Sol costs $5 per million input tokens and $30 per million output tokens, roughly half the price of Claude Fable 5. Terra costs half as much as Sol, and Luna is 80% cheaper than Terra.
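To make the tier arithmetic concrete, here is a minimal sketch that derives per-million rates for Terra and Luna from the Sol prices quoted above. It assumes the "half of Sol" and "80% cheaper" discounts apply equally to input and output tokens, which the announcement figures do not spell out. The `Tier` class and `discounted` helper are hypothetical names used for illustration only, not part of any OpenAI SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Per-million-token prices in USD (illustrative, not an official price list)."""
    name: str
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Return the USD cost of a workload with the given token counts."""
        total = (input_tokens * self.input_per_million
                 + output_tokens * self.output_per_million)
        return total / 1_000_000


def discounted(base: Tier, name: str, percent_off: int) -> Tier:
    """Derive a cheaper tier. Assumption: the same discount applies to input and output."""
    keep = 100 - percent_off
    return Tier(name,
                base.input_per_million * keep / 100,
                base.output_per_million * keep / 100)


SOL = Tier("Sol", 5.0, 30.0)            # prices quoted in the article
TERRA = discounted(SOL, "Terra", 50)    # "half of Sol"
LUNA = discounted(TERRA, "Luna", 80)    # "80% cheaper than Terra"

if __name__ == "__main__":
    # Example workload: 2M input tokens and 500K output tokens.
    for tier in (SOL, TERRA, LUNA):
        print(f"{tier.name}: ${tier.cost(2_000_000, 500_000):.2f}")
```

Under those assumptions, Terra works out to $2.50 input and $15 output per million tokens, and Luna to $0.50 and $3, so the example workload costs $25.00 on Sol, $12.50 on Terra, and $2.50 on Luna.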

Sol will also be available on Cerebras hardware in July, delivering inference speeds up to 750 tokens per second.

Ultra Mode: The Model That Builds Its Own Team

The benchmark jump isn’t the story. Ultra Mode is.

Ultra Mode doesn’t just make the model “think longer”; it lets the model spin up multiple sub-agents to work on different parts of a problem in parallel. This differs from Anthropic’s Agent Teams, where humans design the collaboration workflow. In Ultra Mode, the model handles task decomposition and coordination itself.

The Terminal-Bench 2.1 score that beat Mythos was achieved with Ultra Mode enabled. That’s not a reasoning upgrade. That’s an architecture upgrade: a model that can orchestrate its own workforce.
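OpenAI has not published how Ultra Mode is implemented, so the following is only a generic sketch of the fan-out/fan-in pattern the description suggests: split a task, run the pieces concurrently, and collect the results. Every name here (`decompose`, `run_subagent`, `orchestrate`) is hypothetical, the "planner" is a trivial string split, and the sub-agent is a stand-in that makes no model calls.

```python
import asyncio


def decompose(task: str) -> list[str]:
    """Hypothetical planner: split a task into independent subtasks at ';'."""
    return [part.strip() for part in task.split(";") if part.strip()]


async def run_subagent(subtask: str) -> str:
    """Stand-in for a sub-agent. A real one would call a model or tool here."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"done: {subtask}"


async def orchestrate(task: str) -> list[str]:
    """Fan out one coroutine per subtask, then gather the results in order."""
    subtasks = decompose(task)
    # asyncio.gather runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(run_subagent(s) for s in subtasks))


if __name__ == "__main__":
    results = asyncio.run(orchestrate("write tests; fix lint errors; update docs"))
    for line in results:
        print(line)
```

The distinction the article draws concerns who writes the equivalent of `decompose`: in a human-designed workflow a person fixes the split in advance, whereas Ultra Mode is described as having the model decide it at run time.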

The “Find It Yourself” Challenge

On July 2, SiliconANGLE co-founder and editor-in-chief John Furrier posted on X: “Can’t wait to see what people will do with GPT-5.6 Sol Ultra. Stash your hardest prompts somewhere.” The post invited the community to test the model and share the hardest prompts that break it.

That reads less like a marketing line than a direct challenge to the AI community: try to break the new model with your toughest questions.

There is also a regulatory dimension. The release was limited to about 20 trusted partners at the request of the US government, with access approved on a case-by-case basis. OpenAI made it clear this wasn’t its preferred approach: “We don’t believe this kind of government access process should become the long-term default.”


P.S. Sol Ultra is live — but only for a handful of partners. The benchmark gap over Mythos is real, but the bigger question is what happens when developers start pushing the sub-agent architecture beyond benchmarks. That’s what the “stash your hardest prompts” invitation is actually about — not validation, but discovery. The hardest prompts people find will become the next dataset.
