AI
🤖 GPT-5.5 Tops Every Major AI Benchmark — Then Hallucinates 85% of the Time
The Rundown: OpenAI's GPT-5.5 leads the Artificial Analysis Intelligence Index and ARC-AGI-2 but ranks third on knowledge calibration due to an alarming 85.53% hallucination rate.
The details:
- GPT-5.5 scores 60 points on the Artificial Analysis Intelligence Index and 85% on ARC-AGI-2, topping both leaderboards ahead of Gemini 3.1 Pro Preview and Claude Opus 4.7.
- Despite the raw capability wins, GPT-5.5 ranks third on knowledge calibration, trailing Gemini and Claude, with an 85.53% hallucination rate that makes it unreliable for factual workloads.
- Pricing is steep: the GPT-5.5 API runs $5/$30 per million input/output tokens (roughly double GPT-5.4), while GPT-5.5 Pro hits $30/$180 per million tokens with parallel reasoning inference.
- OpenAI also expanded its AWS partnership to bring GPT-5.5, Codex, and Managed Agents to Amazon Bedrock, letting enterprises deploy within existing AWS security infrastructure.
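To get a feel for what those rates mean in practice, here is a back-of-envelope cost sketch using the per-million-token prices quoted above; the request volume and token counts are made-up assumptions, not figures from the story:

```python
# Rough API cost estimate from the quoted GPT-5.5 rates.
# Prices are per 1M tokens; workload numbers below are hypothetical.
GPT_55_INPUT = 5.00    # $ per 1M input tokens
GPT_55_OUTPUT = 30.00  # $ per 1M output tokens

def monthly_cost(requests: int, in_tokens: int, out_tokens: int,
                 in_price: float = GPT_55_INPUT,
                 out_price: float = GPT_55_OUTPUT) -> float:
    """Estimated monthly spend in dollars for a uniform workload."""
    total_in = requests * in_tokens / 1_000_000   # millions of input tokens
    total_out = requests * out_tokens / 1_000_000  # millions of output tokens
    return total_in * in_price + total_out * out_price

# e.g. 100k requests/month at 2k input and 500 output tokens each
print(monthly_cost(100_000, 2_000, 500))  # 2500.0
```

Even at modest volumes, output tokens dominate the bill at a 6:1 output/input price ratio, which is why the cost-benefit question below matters.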
Why it matters: For founders and builders, GPT-5.5 is a double-edged sword. The raw benchmark wins are real and the ARC-AGI-2 score signals genuine reasoning leaps — but an 85% hallucination rate means you cannot use it for anything customer-facing where factual accuracy matters without heavy guardrails. At $5–$30 per million tokens, the cost-benefit calculus gets complicated fast. The smarter play right now: use GPT-5.5 for creative and reasoning-heavy tasks, route factual retrieval to Gemini or Claude, and build evals before you scale spend. The model is impressive; the trust layer isn't there yet.
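The routing advice above can be sketched as a simple dispatch table. This is a minimal illustration, not a production pattern: the model names mirror those in the story, and the task-type labels are hypothetical stand-ins for whatever classifier or heuristic you actually use:

```python
# Hypothetical model router: creative/reasoning work goes to GPT-5.5,
# factual retrieval goes to a better-calibrated model, per the advice above.
ROUTES = {
    "creative": "gpt-5.5",
    "reasoning": "gpt-5.5",
    "factual": "gemini-3.1-pro",  # or claude-opus-4.7
}

def route(task_type: str) -> str:
    """Pick a model for a task type; unknown types default to the
    better-calibrated model rather than the high-hallucination one."""
    return ROUTES.get(task_type, "gemini-3.1-pro")

print(route("reasoning"))  # gpt-5.5
print(route("factual"))    # gemini-3.1-pro
```

Defaulting unknown task types to the calibrated model is the conservative choice given the hallucination numbers reported here.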
📰 Source: The Batch @ DeepLearning.AI / Bay Area Times / TLDR