Three Tiers, One Winner
OpenAI released GPT-5.6 Sol on June 26 alongside Terra and Luna, a three-tier lineup that replaces the old Pro/Mini naming.
The benchmark numbers are impressive. Sol scored 88.8% on Terminal-Bench 2.1, a test that measures how well a model can work in a command-line environment: planning steps, running commands, interpreting errors, and recovering. That is closer to real software engineering than a multiple-choice exam. With Ultra Mode enabled, the score rose to 91.9%, surpassing Claude Mythos 5’s 88.0%.
On ExploitBench, Sol matched the performance of Anthropic’s Mythos Preview, the model that identified over 10,000 critical vulnerabilities, while using about one-third of the output tokens. On GeneBench v1, Sol outperformed GPT-5.5 while consuming fewer tokens. In capture-the-flag (CTF) challenges, Sol hit a 96.7% success rate.

The pricing is also notable. Sol costs $5 per million input tokens and $30 per million output tokens, roughly half the price of Claude Fable 5. Terra costs half as much as Sol, and Luna is 80% cheaper than Terra.
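To make the tiers concrete, here is a rough cost sketch in Python. Only Sol's rates come from the figures above. The Terra and Luna rates are derived on the assumption that "half" and "80% cheaper" apply equally to input and output tokens, which the announcement does not spell out. The tier keys, the `request_cost` helper, and the example token counts are illustrative, not part of any OpenAI API.

```python
# USD per million tokens. Sol's rates are the published figures; Terra and Luna
# are derived, ASSUMING the stated discounts apply to both input and output.
SOL = {"input": 5.00, "output": 30.00}
TERRA = {kind: rate * 0.5 for kind, rate in SOL.items()}    # "half of Sol"
LUNA = {kind: rate * 0.2 for kind, rate in TERRA.items()}   # "80% cheaper than Terra"

PRICES = {"sol": SOL, "terra": TERRA, "luna": LUNA}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD of one request on the given tier."""
    rates = PRICES[tier]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 200k input tokens, 50k output tokens.
    for tier in PRICES:
        print(f"{tier}: ${request_cost(tier, 200_000, 50_000):.2f}")
```

Under these assumptions, Terra works out to $2.50 and $15 per million tokens, and Luna to $0.50 and $3. Output tokens dominate the bill for agentic workloads, which is why the token-efficiency results above matter as much as the list price.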
Sol will also be available on Cerebras hardware in July, delivering inference speeds of up to 750 tokens per second.
Ultra Mode: The Model That Builds Its Own Team
The benchmark jump isn’t the story. Ultra Mode is.
Ultra Mode doesn’t just make the model “think longer”; it lets the model spin up multiple sub-agents to work on different parts of a problem in parallel. This differs from Anthropic’s Agent Teams, where humans design the collaboration workflow. With Ultra Mode, the model handles task decomposition and coordination itself.
The 91.9% Terminal-Bench 2.1 score that beat Mythos was achieved with Ultra Mode enabled. That’s not a reasoning upgrade. That’s an architecture upgrade: a model that can orchestrate its own workforce.
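The general shape of that pattern, decompose a task, fan out to parallel workers, then gather the results, can be sketched in a few lines of Python. This is a toy illustration only, not OpenAI's implementation: `decompose`, `run_subagent`, and `orchestrate` are hypothetical names, the planner is a fixed split rather than a model decision, and the sub-agents are stubs that sleep instead of calling a model.

```python
import asyncio


def decompose(task: str) -> list[str]:
    # Hypothetical planner. In Ultra Mode, as described, the model itself
    # decides how to split the work; here it is a fixed three-way split.
    return [f"{task} / part {i}" for i in range(1, 4)]


async def run_subagent(name: str, subtask: str) -> str:
    # Stub for a sub-agent. A real one would call a model; this just waits.
    await asyncio.sleep(0.01)
    return f"[{name}] done: {subtask}"


async def orchestrate(task: str) -> list[str]:
    # Fan out: start one sub-agent per subtask and run them concurrently.
    subtasks = decompose(task)
    jobs = [run_subagent(f"agent-{i}", sub) for i, sub in enumerate(subtasks, 1)]
    # Fan in: gather() returns the results in the order the jobs were passed.
    return await asyncio.gather(*jobs)


if __name__ == "__main__":
    for line in asyncio.run(orchestrate("fix failing build")):
        print(line)
```

In a human-designed workflow, a person would write the equivalent of `decompose` by hand. The claim about Ultra Mode is that this step, and the coordination around it, moves inside the model.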

The “Find It Yourself” Challenge
On July 2, SiliconANGLE co-founder and editor-in-chief John Furrier posted on X: “Can’t wait to see what people will do with GPT-5.6 Sol Ultra. Stash your hardest prompts somewhere.” The post invited the community to test the model and share the hardest prompts that break it.
That’s not a marketing line; it’s a direct challenge to the AI community to break the new model with its toughest questions.
There is also a regulatory dimension. The release was limited to about 20 trusted partners at the request of the US government, with access approved on a case-by-case basis. OpenAI made it clear this wasn’t its preferred approach: “We don’t believe this kind of government access process should become the long-term default.”
P.S. Sol Ultra is live — but only for a handful of partners. The benchmark gap over Mythos is real, but the bigger question is what happens when developers start pushing the sub-agent architecture beyond benchmarks. That’s what the “stash your hardest prompts” invitation is actually about — not validation, but discovery. The hardest prompts people find will become the next dataset.