Test-Time Scaling: The New Frontier for AI
- By Paul Mah
- January 08, 2025
At this week’s CES 2025, the world’s largest consumer electronics trade show, Nvidia founder and CEO Jensen Huang delivered a compelling keynote that blended past, present, and future.
Between announcements of major products like Nvidia’s new RTX 50 series GPUs, Huang wove together the history of artificial intelligence and his vision for its future, painting a picture of where AI technology is headed.
The future of AI development
According to Huang, the direction of AI development can be broken into three key areas: Pre-training Scaling, Post-training Scaling, and Test-time Scaling.
Pre-training Scaling was the original “scaling law”, which posits that the larger the training dataset, the larger the model, and the more computational power invested, the stronger the resulting AI model’s capabilities.
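Huang did not put numbers to this idea on stage, but it has a widely cited formalization in the “Chinchilla” scaling law of Hoffmann et al. (2022), sketched below with its constants left symbolic:

```latex
% Chinchilla-style pre-training scaling law (Hoffmann et al., 2022):
% predicted loss L falls smoothly as parameter count N and training tokens D grow.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% E is the irreducible loss; A, B, alpha, and beta are empirically fitted constants.
```

Lower loss translates into stronger capabilities, which is why growing data, model size, and compute in tandem proved such a reliable recipe.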
Post-training Scaling, for its part, involves techniques such as reinforcement learning from human feedback (RLHF). This feedback loop helps the AI continually refine its capabilities in specific areas, eventually becoming better at solving problems and performing complex tasks.
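As a toy illustration of that loop (not any lab’s actual pipeline), the sketch below nudges a tiny “policy” toward responses that a stand-in reward model, of the kind normally trained from human preference labels, scores highly:

```python
# Toy sketch of the RLHF idea: a policy-gradient update shifts probability
# toward responses a reward model scores highly. The reward values here are
# hard-coded stand-ins for a model learned from human preference labels.
import numpy as np

rng = np.random.default_rng(0)

responses = ["helpful answer", "vague answer", "off-topic answer"]
logits = np.zeros(3)                   # the policy's learnable scores
reward = np.array([1.0, 0.2, -0.5])    # stand-in learned reward model

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(500):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)              # sample a response from the policy
    advantage = reward[a] - reward @ probs  # reward relative to expectation
    grad = -probs                           # d log pi(a) / d logits ...
    grad[a] += 1.0                          # ... is one-hot(a) minus probs
    logits += 0.1 * advantage * grad        # REINFORCE-style ascent step

print({r: round(float(p), 2) for r, p in zip(responses, softmax(logits))})
```

After enough updates the policy concentrates nearly all of its probability on the response the reward model prefers, which is the essence of refining behavior after pre-training.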
As Pre-training Scaling and Post-training Scaling see diminishing returns, a third technique is gradually emerging: Test-time Scaling, in which the AI dynamically allocates computational resources during inference, so that gains are no longer limited to parameter optimization.
How does Test-time Scaling work?
According to François Chollet, the AI researcher behind Keras and the ARC-AGI benchmark, OpenAI’s latest o3 model appears to use such test-time techniques. He wrote: “[The core mechanism of o3] appears to be natural language program search and execution within token space – at test time, the model searches over the space of possible Chains of Thought (CoTs) describing the steps required to solve the task.”
“Effectively, o3 represents a form of deep learning-guided program search. The model does a test-time search over a space of ‘programs’… guided by a deep learning prior (the base LLM). The reason why solving a single ARC-AGI task can end up taking up tens of millions of tokens and cost thousands of dollars is because this search process has to explore an enormous number of paths through program space – including backtracking.”
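OpenAI has not published o3’s internals, so Chollet’s description is an informed reading rather than a specification. Still, the general shape of such a search can be sketched abstractly: propose candidate reasoning steps, score partial chains with a learned prior, expand the most promising branches first, and backtrack when a branch dead-ends. Everything below (the step proposer, the prior, the goal test) is a hypothetical stand-in:

```python
# Abstract sketch of test-time search over chains of thought (CoTs), in the
# spirit of Chollet's description. This is NOT o3's actual, unpublished
# mechanism; propose_steps, prior_score, and is_solution are hypothetical.
from typing import List, Optional

def propose_steps(chain: List[str]) -> List[str]:
    # Stand-in for sampling candidate next reasoning steps from the base LLM.
    return [f"step{len(chain)}a", f"step{len(chain)}b"]

def prior_score(chain: List[str]) -> float:
    # Stand-in for the deep-learning prior scoring a partial chain.
    return -sum(1 for s in chain if s.endswith("b"))  # prefer "a" branches

def is_solution(chain: List[str]) -> bool:
    # Stand-in goal test (e.g. a verifier accepting the final answer).
    return len(chain) == 3

def search(chain: List[str], budget: List[int]) -> Optional[List[str]]:
    if budget[0] <= 0:
        return None                      # compute/token budget exhausted
    budget[0] -= 1
    if is_solution(chain):
        return chain
    # Expand the most promising steps first; the recursion backtracks
    # automatically whenever a branch fails to reach a solution.
    for step in sorted(propose_steps(chain),
                       key=lambda s: prior_score(chain + [s]), reverse=True):
        found = search(chain + [step], budget)
        if found is not None:
            return found
    return None

print(search([], budget=[50]))  # e.g. ['step0a', 'step1a', 'step2a']
```

The budget is the key knob: the wider and deeper the search is allowed to go, the more tokens it burns, which is exactly why a single hard ARC-AGI task can run to tens of millions of tokens.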
But why not stick with Pre-training and Post-training techniques? In a detailed Substack post on Test-time Scaling, Akash Bajwa explained: “As pre-training gains plateau or become too expensive, we’ve found a new vector of scaling (test time search) that is demonstrating a path to truly general intelligence.”
“Instead of prohibitively expensive pre-training runs, enterprises developing their own models may opt to train smaller models with reasoning cores and decide when to scale up test time search for certain economically valuable tasks,” Bajwa said.
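That trade-off is easiest to see in the simplest form of test-time search, best-of-N sampling, where a single knob converts extra inference compute into better answers. A minimal sketch, with `generate` and `score` as hypothetical stand-ins rather than any real API:

```python
# Minimal best-of-N sketch of test-time scaling: a larger n spends more
# inference compute to draw more candidates and keep the best-scoring one.
# generate() and score() are hypothetical stand-ins, not a real API.
import random

def generate(prompt: str) -> str:
    # Stand-in for a sampled LLM answer; here, a noisy guess at 17 * 24.
    return str(17 * 24 + random.choice([-10, -1, 0, 0, 1, 10]))

def score(prompt: str, answer: str) -> float:
    # Stand-in for a verifier or reward model grading a candidate answer.
    return 1.0 if answer == str(17 * 24) else 0.0

def best_of_n(prompt: str, n: int) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

print(best_of_n("What is 17 * 24?", n=16))  # more samples, better odds
```

Dialing n up or down per request is precisely the kind of decision Bajwa describes: cheap answers for routine queries, deeper search for the economically valuable ones.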
Paul Mah
Paul Mah is the editor of DSAITrends, where he reports on the latest developments in data science and AI. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose.