OpenAI, Anthropic launch dueling benchmarks

Welcome back. Do you remember everything you’ve said to ChatGPT this year? Because OpenAI certainly does. On Monday, the AI firm launched its own end-of-year wrap-up feature called “Your Year with ChatGPT” for customers in certain markets that have the “reference saved memories” and “reference chat history” settings turned on. Check it out if you want to reminisce about the good times you’ve had with the chatbot, whether it be the grocery lists you were too lazy to make, the documents you didn’t have time to pore over or the late-night chats when you were in your feelings and listening to Adele. Nat Rubio-Licht

IN TODAY’S NEWSLETTER

1. OpenAI, Anthropic launch dueling benchmarks

2. AI tool helps diagnose cancer 30% faster

3. Google buys clean energy company to power AI

BIG TECH

OpenAI, Anthropic launch dueling benchmarks

Every AI company wants its model to be the best. But who comes out on top often depends on who is holding the yardstick. 

Practically every AI model released over the last year has come with the label “state of the art,” edging out the competition on standard benchmarks and evaluations for metrics such as performance, alignment and context window length. But now some firms are developing their own assessments.

Two major model firms have released new benchmarking and evaluation tools in the last week: 

  • On Friday, Anthropic introduced Bloom, an open-source framework for generating behavioral evaluations of frontier AI models. Bloom allows researchers to quickly develop tests for specific model traits they’re interested in tracking. 

  • And on Wednesday, OpenAI released FrontierScience, a benchmark that evaluates AI capabilities for “expert-level scientific reasoning” in domains like physics, chemistry and biology. 

Of course, in testing these measurements, Anthropic found that its Claude Opus 4.5 model outperformed competitors like OpenAI, xAI, and Google at reining in troublesome behaviors, including delusional sycophancy, self-preferential bias, and self-preservation. And OpenAI’s benchmark revealed that GPT-5.2 beats other frontier models in research and Olympiad-style scientific reasoning. 

While these benchmarks might not be lying about these models’ capabilities, they tell you about each system’s specific features but “don’t necessarily really create a fair way of comparing different tooling,” Bob Rogers, chief product and technology officer of Oii.ai and co-founder of BeeKeeper AI, told The Deep View. These tests emphasize the things the model developer is proudest of, rather than serving as an objective barometer. 

“This is a big part of the old school big tech playbook,” said Rogers. “What you do is you build a benchmark that really emphasizes the great aspects of your product. Then you publish that benchmark, and you keep moving your roadmap forward and keep being ahead of everybody else on that benchmark. It’s a natural thing.”

Companies that test their products by their own evaluations are simply doing good PR. To see whether a model is really up to snuff, it should be measured against common tests and standards that aren’t developed by companies with a huge stake in proving their model is the best on the market. Otherwise, rather than serving as an actual benchmark for capabilities, these tools are just “bench-marketing,” as Rogers called it.

TOGETHER WITH UNFRAME

Should You Build or Buy Enterprise AI?

This is the million (more like billion) dollar question for big businesses today – and that might even be understating the financial impact one right (or wrong) choice can have.

Lucky for you, Unframe just released a free guide that can help you determine the best approach for bringing AI to your enterprise. Inside you’ll find strategic insights for leaders, lessons from where AI adoption has gone wrong, and even a full-blown framework to help you decide whether to build or buy. Check it out for yourself right here.

RESEARCH

AI tool helps diagnose cancer 30% faster

In radiology, a new AI tool is helping fill the gap left by a shortage of radiologists to read CT scans, improving early detection and getting diagnosis data to patients faster. It does so not by replacing skilled medical professionals, but by assisting them. 

The breakthrough came at the University of Tartu in Estonia, where computer scientists, radiologists, and medical professionals collaborated on a study published in the journal Nature.

The tool, called BMVision, uses deep learning to detect and assess kidney cancer. AI startup Better Medicine is commercializing the software. 

"Kidney cancer is one of the most common cancers of the urinary system. It is typically identified using … [CT] scans, which are carefully reviewed by radiologists. However, there are not enough radiologists, and the demand for scans is growing. This makes it more challenging to provide patients with fast and accurate results," said Dmytro Fishman, co-founder of Better Medicine, and one of the authors of the study.

Here's how the study worked:

  • The AI software was tested by a team of six radiologists on a total of 2,400 scans

  • Each radiologist used BMVision to help interpret 200 CT scans

  • Each scan was measured twice: once with AI and once without

  • Accuracy, reporting times and inter-radiologist agreement were compared

  • Using the AI software reduced the time to identify, measure, and report malignant lesions by 30%

  • The time for radiologists to read scans was reduced by 33% on average, and as much as 52% in some cases

  • Auto-generated reports significantly reduced the time for typing and dictation

  • Use of the tool improved sensitivity by about 6%, leading to greater accuracy and agreement between radiologists

  • The study said AI wouldn't replace radiologists but would become a valuable assistant

In the journal article, the authors of the study concluded, "We found that BMVision enables radiologists to work more efficiently and consistently. Tools like BMVision can help patients by making cancer diagnosis faster, more reliable, and more widely available."

In 2025, two of my friends, entirely independent of each other, had a tumor that needed to be scanned. In both cases, they had the scan done but had to wait weeks before their medical team could read and interpret the results. Each downloaded the raw test data from their chart and fed it into ChatGPT. In both cases, the information the AI provided was nearly identical to what they eventually received from their medical provider. AI certainly can't replace medical professionals, but if it can make them faster, more responsive, and more accurate, it would be a huge win.

Jason Hiner, Editor-in-Chief

TOGETHER WITH CEREBRAS

20× Faster Inference, Built to Scale

Advanced reasoning, agentic, long‑context, and multimodal workloads are driving a surge in inference demand, with more tokens per task and tighter latency budgets. Yet GPU‑based inference is memory‑bandwidth bound, streaming weights from off‑chip HBM for each token and producing delays of seconds to minutes that erode user engagement.

Cerebras Inference shatters this bottleneck through its revolutionary wafer-sized chip architecture, which uses exponentially faster memory that is closer to compute, delivering frontier‑model outputs at interactive speed.

BIG TECH

Google buys clean energy company to power AI

Tech companies are reading the tea leaves on AI’s energy problem. 

Google parent company Alphabet agreed to acquire Intersect Power, a developer of clean energy, for $4.75 billion in cash, the companies announced on Monday. The deal will help Google with its ambitious data center goals as the entire tech industry is in a mad dash for more compute capacity. 

Along with acquiring the Intersect team, the deal gives Google “multiple gigawatts of energy and data center projects in development, or under construction.” 

“Intersect will help us expand capacity, operate more nimbly in building new power generation in lockstep with new data center load, and reimagine energy solutions to drive US innovation and leadership,” Google CEO Sundar Pichai said in a statement. 

Google’s acquisition marks the latest in a string of energy deals and developments as AI companies reckon with the problem that their innovations are creating.

Multiple estimates have shown that we’re in for a massive power shortfall as a result of AI data centers. While these investments might push the energy transition in the right direction, these firms are racing against the clock.

Though crisis often breeds innovation, AI firms are at war for market share and show no sign of slowing their ambitions. If the choice is between burning fossil fuels today to make their AI models a little more powerful and waiting for clean energy to become viable, these companies are likely going to pick the former. However, as sentiment toward AI factories starts to sour amid rising power demand, fear of a PR crisis might push these firms to work harder to clean up the mess they’re making.

Nat Rubio-Licht

LINKS

  • GLM-4.7: The latest model from Chinese startup Z.ai, featuring advanced coding capability

  • Manus Design View: A granular AI design tool that gives users more control. 

  • Clickup Super Agents: Agents with “human-level skills,” automatically learning from human interactions. 

  • Claude Chrome Extension: Integrate Anthropic’s flagship chatbot into your browser.

GAMES

Which image is real?


POLL RESULTS

Do you buy the popular narrative about AI reducing jobs in the US?

  • Yes, AI is definitely impacting layoffs and entry-level jobs (46%)

  • No, pandemic overhiring and tariffs are the bigger reasons (34%)

  • Other (20%)

The Deep View is written by Nat Rubio-Licht, Jason Hiner, Faris Kojok and The Deep View crew. Please reply with any feedback.

The Deep View team

Thanks for reading today’s edition of The Deep View! We’ll see you in the next one.

“Buck and pillar behind [it] are very realistic. Buck's eyes look alive.”

“Depth focus is consistently ‘fading’.”

“The placement of the deer in the foreground of [the other image] seemed off.”

“The white outline on the antlers [in the other image] gave it away for me.”

“[I was] fooled by the pillar size in [the other image]. It made me overlook that the closest deer in [this image] is not turned towards the food.”

“By comparing the animal’s fur in each pic against one another, it’s obvious [this image] is fake. ”

“They are being fed Pop-Tarts, and there is no fence. I doubt they run wild there. And people usually know not to feed wild animals.”

Take The Deep View with you on the go! We’ve got exclusive, in-depth interviews for you on The Deep View: Conversations podcast every Tuesday morning.

If you want to get in front of an audience of 450,000+ developers, business leaders and tech enthusiasts, get in touch with us here.