Jason Hiner
Editor-in-Chief

Jason Hiner is the Editor-in-Chief and Chief Content Officer of The Deep View. He was previously Editor-in-Chief of ZDNET, Senior Editorial Director at CNET, and Global Editor-in-Chief of TechRepublic.

Exclusive: OpenAI grabs top spot in new Thinking Algorithm Leaderboard - but small models are the surprise winners

There’s a new AI leaderboard from startup Neurometric that ranks the effectiveness of a specific set of the language models powering the current AI boom. The leaderboard focuses on "thinking algorithms," and OpenAI grabbed the top spot with its open-weights model GPT-OSS 120B, while the Chinese model DeepSeek R1 finished just behind.

More consequential still, the work from Neurometric reveals several surprises about model performance that could upend the conventional wisdom for businesses looking to invest in AI. Ultimately, that could mean better performance, lower prices, or both for AI workloads.

The Deep View spoke with Neurometric CEO Rob May and got an exclusive look at the leaderboard and its data ahead of the launch on Dec. 2.

May told The Deep View, "These small language models are more task-specific, so they typically run faster, perform better, and they're cheaper altogether — which is unheard of compared to just using a giant model… This leaderboard provides a counterintuitive insight — the idea that model performance varies dramatically on a per-task basis. I don't think people expected it to vary this much, particularly when you couple it with the test-time compute strategies."

While ChatGPT and other chatbots have been the lead singers of the generative AI revolution, large language models (LLMs) like OpenAI’s GPT-5 have been the backstage crew orchestrating most of the performance. Now, Neurometric is making the case that, since the arrival of reasoning models with OpenAI's o1, thinking algorithms have become even more valuable for applied AI in specific business use cases.

May stated, "Ever since OpenAI launched their first reasoning model, o1, I’ve been fascinated by the idea that the way you probe these models, the 'thinking' algorithms you apply, can get you different outcomes. Over the past 9 months, we’ve explored this here at Neurometric. We published some research showing that test-time scaling algorithm choice matters on a per-task basis, and now we’ve decided to launch a tool to help you explore the difference."

With Neurometric's focus on applied AI in real-world use cases, it chose the CRMArena-Pro benchmark suite as the tool to measure the performance of the thinking models. The test selected eight of the suite's task categories, covering three business scenarios (sales, customer service, and configure-price-quote) and four business skills (workflow routing, policy compliance, information retrieval and textual reasoning, and database querying and numerical computation). The test then measured how accurately the agents carried out those tasks.
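
Neurometric's actual evaluation harness isn't shown in the article, but the per-task accuracy framing is straightforward to picture in code. Here's a minimal sketch, using hypothetical model, method and task names, of how agent runs could be tallied by task category and the best (model, method) pair picked for each task:

```python
# Minimal sketch of per-task accuracy scoring, assuming results are recorded
# as one row per agent run. Model, method and task names below are
# hypothetical; this is not Neurometric's actual evaluation harness.
from collections import defaultdict

# Each run: (model, thinking_method, task_category, success)
runs = [
    ("gpt-oss-120b", "self-consistency", "workflow_routing",  True),
    ("gpt-oss-120b", "self-consistency", "workflow_routing",  False),
    ("deepseek-r1",  "best-of-n",        "workflow_routing",  True),
    ("deepseek-r1",  "best-of-n",        "database_querying", True),
    ("gpt-oss-120b", "self-consistency", "database_querying", False),
]

# Accumulate (correct, total) counts per (task, model, method) combination
totals = defaultdict(lambda: [0, 0])
for model, method, task, success in runs:
    key = (task, model, method)
    totals[key][0] += int(success)
    totals[key][1] += 1

# Accuracy per combination, then the best (model, method) pair for each task
accuracy = {key: correct / total for key, (correct, total) in totals.items()}
best_per_task = {}
for (task, model, method), acc in accuracy.items():
    if task not in best_per_task or acc > best_per_task[task][1]:
        best_per_task[task] = ((model, method), acc)

for task, ((model, method), acc) in sorted(best_per_task.items()):
    print(f"{task}: {model} + {method} ({acc:.0%})")
```

The point of the per-task breakdown is exactly what May describes: the winning (model, method) pair can differ from one task category to the next rather than one model dominating across the board.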

In a blog post announcing the leaderboard, May wrote, "We’ve seen a trend in companies as they move along the AI maturity curve. While nearly everyone starts out building a prototype on one single model, usually a frontier lab model, as AI products start to scale, it becomes obvious that some workloads are better handled with other models. Multi-model systems become the norm as you become more AI mature. But figuring out which models to choose and why is not intuitive. Our leaderboard is a small step towards a more data-driven approach to AI systems design."

Others are also coming around to the idea that small models could be the key for applying generative AI to more real-world use cases. On the a16z podcast on Nov. 28, 2025, Sherwin Wu, Head of Engineering for the OpenAI Platform, said, "Even within OpenAI, the thinking was that there would be one model that rules them all. It’s definitely completely changed. It’s becoming increasingly clear that there will be room for a bunch of specialized models. There will likely be a proliferation of other types of models."

Neurometric ran all of its tests on thinking models available on Amazon Bedrock, which controlled for additional variables such as network latency and server performance. However, it plans to test and measure additional thinking algorithms and models over time, including models available outside of Amazon Bedrock. To try out the leaderboard, click on the "Method" and "Models" drop-downs to see for yourself how the models performed differently across the various tasks.
