A new AI leaderboard from startup Neurometric benchmarks the effectiveness of a specific aspect of the language models powering the current AI boom. The leaderboard focuses on "thinking algorithms," and while OpenAI's open-source model GPT-OSS 120B, the Chinese model DeepSeek R1, and Anthropic's Claude Sonnet 4.5 are the overall leaders, the bigger news is the insight it offers into how best to deploy models — specifically, smaller models.
In an exclusive interview with The Deep View, Neurometric CEO Rob May said, "These small language models are more task specific, so they typically run faster, perform better, and they're cheaper altogether — which is unheard of compared to just using a giant model… This leaderboard provides a counterintuitive insight — the idea that model performance varies dramatically on a per-task basis. I don't think people expected it to vary this much, particularly when you couple it with the test-time compute strategies."
When you visit the leaderboard, you can scroll down to "Algorithmic Lift: Which Thinking Strategy Wins per Task" and pick one of the eight business tasks listed in the drop-down on the right. You'll quickly notice that the results vary wildly: which algorithm works best depends heavily on the task. The example below shows the results for "Lead Qualification," where OpenAI's GPT-OSS 120B model performed best, followed closely by DeepSeek R1 and Alibaba's Qwen3 235B.

Neurometric argues that, since the arrival of reasoning models with OpenAI's o1 in 2024, thinking algorithms have become valuable for applied AI in specific business use cases.
May stated, "Ever since OpenAI launched their first reasoning model, o1, I’ve been fascinated by the idea that the way you probe these models — the 'thinking' algorithms you apply, can get you different outcomes. Over the past 9 months, we’ve explored this here at Neurometric. We published some research showing that test-time scaling algorithm choice matters on a per-task basis, and now we’ve decided to launch a tool to help you explore the difference."
Given Neurometric's focus on applied AI in real-world use cases, it chose CRMArena as the benchmark for measuring the performance of the thinking models. The test selected eight task categories from the CRMArena-Pro benchmark suite, covering three business scenarios (sales, customer service, and configure-price-quote) and four business skills (workflow routing, policy compliance, information retrieval and textual reasoning, and database querying and numerical computation). It then measured the accuracy with which the agents carried out the tasks.
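To make the scoring concrete, here is a minimal sketch of how per-task accuracy could be aggregated across model-and-algorithm combinations. The task names echo the skills described above, but the records, names, and scoring logic are illustrative assumptions, not Neurometric's actual pipeline or data.

```python
from collections import defaultdict

# Hypothetical run records: (model, thinking algorithm, task category, correct?).
# The task names mirror the CRMArena-Pro skills described above; everything
# else here is illustrative, not Neurometric's actual data.
runs = [
    ("gpt-oss-120b", "self-consistency", "workflow routing", True),
    ("gpt-oss-120b", "self-consistency", "workflow routing", False),
    ("deepseek-r1", "single-pass", "workflow routing", True),
    ("qwen3-235b", "self-consistency", "policy compliance", True),
]

# Accuracy per (model, algorithm, task) = correct runs / total runs.
totals = defaultdict(lambda: [0, 0])
for model, algo, task, correct in runs:
    totals[(model, algo, task)][0] += int(correct)
    totals[(model, algo, task)][1] += 1
accuracy = {key: hits / n for key, (hits, n) in totals.items()}

# Pick the best model + algorithm combination for each task.
best_per_task = {}
for (model, algo, task), acc in accuracy.items():
    if task not in best_per_task or acc > best_per_task[task][1]:
        best_per_task[task] = ((model, algo), acc)

for task, (combo, acc) in best_per_task.items():
    print(f"{task}: {combo} at {acc:.0%}")
```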
Neurometric ran all its tests on thinking models available on Amazon Bedrock, eliminating additional variables such as network latency and server performance. However, it plans to test and measure additional thinking algorithms over time, including ones available outside of Amazon Bedrock. To try out the leaderboard for yourself, click on the "Method" and "Models" drop-downs to see how the models performed differently across the various tasks.
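Because every test ran against Bedrock-hosted models, a uniform test harness is straightforward to build. The sketch below shows one way to send the same prompt to several models through boto3's Converse API; the model IDs, region, and prompt are placeholders, not the benchmark's actual configuration.

```python
import boto3

# Minimal sketch: send one prompt to several Bedrock-hosted models via the
# Converse API. Model IDs and the prompt are placeholders, not the actual
# benchmark configuration; check the Bedrock console for real identifiers.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

MODEL_IDS = [
    "placeholder.model-id-1",  # e.g. an open-weight reasoning model
    "placeholder.model-id-2",  # e.g. a smaller task-specific model
]

prompt = "Classify this lead as qualified or unqualified: ..."

for model_id in MODEL_IDS:
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    answer = response["output"]["message"]["content"][0]["text"]
    print(f"{model_id}: {answer[:80]}")
```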
At the time of this story, the leaderboard tests 12 models:
- Anthropic Sonnet 4.5
- Anthropic Haiku 4.5
- DeepSeek R1
- OpenAI GPT-OSS 120B
- OpenAI GPT-OSS 20B
- Meta Llama 3.3 70B
- Meta Llama 4 Maverick 17B
- Meta Llama 4 Scout 17B
- Amazon Nova Premier
- Amazon Nova Pro
- Alibaba Qwen3 235B
- Alibaba Qwen3 32B
May said, "Obviously, CRM arena is a limited set of tasks related to CRM-related workflows. But we'll be adding more tasks and benchmarks over time. Part of my goal is to build out a whole suite of tools where you can basically say, 'I have tasks [in] finance, accounting, marketing, etc. What models should I use for these tasks?' And we can make solid recommendations."
In a blog post announcing the leaderboard, May wrote, "We’ve seen a trend in companies as they move along the AI maturity curve. While nearly everyone starts out building a prototype on one single model, usually a frontier lab model, as AI products start to scale, it becomes obvious that some workloads are better handled with other models. Multi-model systems become the norm as you become more AI mature. But figuring out which models to choose and why is not intuitive. Our leaderboard is a small step towards a more data-driven approach to AI systems design."
Others are also coming around to the idea that small models could be the key for applying generative AI to more real-world use cases. On the a16z podcast on Nov. 28, 2025, Sherwin Wu, Head of Engineering for the OpenAI Platform, said, "Even within OpenAI, the thinking was that there would be one model that rules them all. It’s definitely completely changed. It’s becoming increasingly clear that there will be room for a bunch of specialized models. There will likely be a proliferation of other types of models."
May said, "The industry still underestimates the power of test-time scaling algorithms to get more performance out of models for less cost. Our tool can do the analysis to help AI practitioners find solutions that are simultaneously better, faster, and cheaper, using test-time scaling techniques."
Neurometric will launch its first product in early 2026, aimed at helping companies select the right models for their workloads to improve performance, save money, or both. For now, May said Neurometric will also let companies test the leaderboard using their own data. So, if you want to know what models work best on your specific tasks, you can upload a file containing prompts and responses, and the team can run it against thousands of model + algorithm combos to provide you with a report. To give it a try, you can hit the feedback form at the bottom of the leaderboard page and mention that you want to try your own workload.