Exclusive: OpenAI grabs top spot in new Thinking Algorithm Leaderboard - but small models are the surprise winners

By
Jason Hiner

Dec 2, 2025

10:30pm UTC

There's a new AI leaderboard from startup Neurometric that ranks the effectiveness of a specific set of language models powering the current AI boom. The leaderboard focuses on "thinking algorithms," and OpenAI grabbed the top spot with its open weights model GPT-OSS 120B, while the Chinese model DeepSeek R1 was just behind. Qwen3 235B, Meta's Llama4 Maverick 17B, and OpenAI's GPT-OSS 20B rounded out the top five.

Thinking Algorithm Leaderboard

Even more consequential, though, is that Neurometric's work reveals several surprises about model performance that could upend the conventional wisdom for businesses launching AI projects. Ultimately, that could mean better performance, lower prices, or both for AI workloads.

The Deep View spoke with Neurometric CEO Rob May and got an exclusive look at the leaderboard and its data ahead of the launch on Dec. 2.

May told The Deep View, "This leaderboard provides a counterintuitive insight — the idea that model performance varies dramatically on a per-task basis. I don't think people expected it to vary this much, particularly when you couple it with the test-time compute strategies... These small language models are more task-specific, so they typically run faster, perform better, and they're cheaper altogether — which is unheard of compared to just using a giant model."

While ChatGPT and other chatbots have been the lead singer for the generative AI revolution, Large Language Models (LLMs) like OpenAI’s GPT-5 have been the backstage crew orchestrating most of the performance. Now, Neurometric is making the case that, since the arrival of reasoning models with OpenAI's o1, thinking algorithms have become even more valuable for applied AI in specific business use cases.

May stated, "Ever since OpenAI launched their first reasoning model, o1, I’ve been fascinated by the idea that the way you probe these models — the “thinking” algorithms you apply, can get you different outcomes. Over the past 9 months, we’ve explored this here at Neurometric. We published some research showing that test-time scaling algorithm choice matters on a per-task basis, and now we’ve decided to launch a tool to help you explore the difference."
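
For readers unfamiliar with the term, the sketch below shows one common test-time scaling strategy, self-consistency (sample several answers and keep the most frequent one), to illustrate what a "thinking algorithm" can look like in practice. The `generate` callable is a hypothetical stand-in for any single model call, and this is not necessarily one of the algorithms Neurometric evaluated.

```python
# Minimal sketch of self-consistency, a common test-time scaling strategy.
# `generate` is a hypothetical callable that returns one sampled completion;
# this is an illustration, not Neurometric's method.
from collections import Counter
from typing import Callable

def self_consistency(generate: Callable[[str], str], prompt: str, n_samples: int = 5) -> str:
    """Sample the model several times and return the most common answer."""
    answers = [generate(prompt) for _ in range(n_samples)]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

# Usage (hypothetical): pass any function that returns one sampled completion.
# best = self_consistency(my_model_call, "Which queue should this ticket go to?")
```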

With Neurometric's focus on applied AI in real-world use cases, it chose CRMArena to measure the performance of the thinking algorithms. The team selected eight task categories from the CRMArena-Pro benchmark suite, covering three business scenarios (sales, customer service, and configure-price-quote) and four business skills (workflow routing, policy compliance, information retrieval and textual reasoning, and database querying and numerical computation). The test then measured how accurately the agents carried out those tasks.
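
As an illustration of how such per-task scoring might be aggregated, here is a hedged sketch that groups graded results by task category and computes accuracy for each. The record fields are hypothetical and do not reflect CRMArena-Pro's actual schema or evaluation harness.

```python
# Hedged sketch: per-task accuracy aggregation for a CRMArena-style evaluation.
# The result fields ("task", "correct") are illustrative, not the benchmark's
# actual schema.
from collections import defaultdict

def accuracy_by_task(results: list[dict]) -> dict[str, float]:
    """results: e.g. [{"task": "policy_compliance", "correct": True}, ...]"""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["task"]] += 1
        hits[r["task"]] += int(r["correct"])
    return {task: hits[task] / totals[task] for task in totals}
```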

In a blog post announcing the leaderboard, May wrote, "We’ve seen a trend in companies as they move along the AI maturity curve. While nearly everyone starts out building a prototype on one single model, usually a frontier lab model, as AI products start to scale, it becomes obvious that some workloads are better handled with other models. Multi-model systems become the norm as you become more AI mature. But figuring out which models to choose and why is not intuitive. Our leaderboard is a small step towards a more data-driven approach to AI systems design."

Others are also coming around to the idea that small models could be the key for applying generative AI to more real-world use cases. On the a16z podcast on Nov. 28, 2025, Sherwin Wu, Head of Engineering for the OpenAI Platform, said, "Even within OpenAI, the thinking was that there would be one model that rules them all. It’s definitely completely changed. It’s becoming increasingly clear that there will be room for a bunch of specialized models. There will likely be a proliferation of other types of models."

Neurometric ran all of its thinking-algorithm tests against models available on Amazon Bedrock, which eliminated additional variables such as network latency and server performance. That's why the models tested included ones from OpenAI, Meta, Amazon, DeepSeek, and Qwen, but not ones from Google or Anthropic, for example.
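
For context, querying a Bedrock-hosted model typically looks like the hedged sketch below, which uses boto3's Converse API. The model ID, region, and prompt are illustrative (check the Bedrock model catalog for exact identifiers), and this is not Neurometric's test harness.

```python
# Hedged sketch: one call to a Bedrock-hosted model via boto3's Converse API.
# Model ID and region are illustrative; not Neurometric's harness.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask(model_id: str, prompt: str) -> str:
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

# Example (hypothetical model ID):
# print(ask("openai.gpt-oss-120b-1:0", "Route this support ticket: ..."))
```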

However, May said Neurometric plans to test and measure additional thinking algorithms over time, including ones available outside of Amazon Bedrock. To try the leaderboard yourself, click on the "Method" and "Models" drop-downs to see how the models performed differently on the specific tasks shown at the bottom of the chart.

May said, "Obviously, CRM arena is a limited set of tasks related to CRM-related workflows. But we'll be adding more tasks and benchmarks over time. Part of my goal is to build out a whole suite of tools where you can basically say, 'I have tasks [in] finance, accounting, marketing, etc. What models should I use for these tasks?' And we can make solid recommendations."

Neurometric will launch its first product in early 2026, aimed at helping companies select the right models for their workloads to improve performance, save money, or both.

Our Deeper View

Every day, more parts of the generative AI ecosystem are calling into question one of the basic premises of the current AI boom: that large foundation models paired with massive amounts of data and compute will deliver the biggest gains in AI. Ilya Sutskever recently declared that the age of scaling is over and that momentum will shift back to research. Mistral AI released new low-cost, open-source models, arguing that the future of AI innovation will depend on smaller models fine-tuned for specific use cases. And Neurometric's counterintuitive insight, that thinking-algorithm performance can vary widely on a per-task basis, offers hope that today's AI could become more performant and less expensive, opening up new possibilities for AI projects and improving ROI.