OpenAI's Latest Model o3 Shows Strong Reasoning Capabilities
On December 20, OpenAI introduced its latest artificial intelligence (AI) reasoning model, o3, along with a lightweight version, o3-mini. The company claims that o3 has more advanced, human-like reasoning capabilities, surpassing its predecessor o1 in code writing, mathematics competitions, and PhD-level scientific knowledge.
However, in a December 22 report, the British magazine New Scientist pointed out that although o3 represents a remarkable leap in performance, it has not reached what the industry considers artificial general intelligence (AGI).
Outstanding performance on many fronts
OpenAI revealed that when solving more complex, multi-step problems, the o3 model spends more time computing before giving a response. This improvement in reasoning ability has allowed o3 to perform well on many tests.
Large language models are eager to post high scores on mathematics benchmarks, and o3 is no exception. On the 2024 American Invitational Mathematics Examination (AIME), o3 achieved an accuracy of 96.7%, answering only one question incorrectly. On FrontierMath, which OpenAI researchers regard as one of the strictest benchmarks, o3 solved 25.2% of the problems. That score may not seem high, but every other large language model has previously stumbled there, with accuracy never exceeding 2%.
FrontierMath is extremely difficult: the mathematician and Fields Medalist Terence Tao judged that it might resist AI for several years. Yet o3 needs only a few minutes of thinking to answer a question that would take human mathematicians hours to days.
In its command of scientific knowledge, o3 also exceeds the typical PhD level. On GPQA Diamond (a benchmark measuring model performance on PhD-level scientific questions in chemistry, physics, and biology), o3 reached 87.7% accuracy, exceeding the roughly 70% achieved by human PhDs and nearly 10 percentage points higher than its predecessor o1.
In addition, o3's coding ability surpasses the earlier o1 series. On SWE-bench Verified (a benchmark measuring an AI model's ability to solve real-world software problems), o3's accuracy is about 71.7%, more than 20 percentage points higher than o1's. On the competitive programming platform Codeforces, o3 scored 2727, equivalent to roughly the 175th-ranked human programmer, while o1 scored only 1891.
Presenting these impressive results, OpenAI CEO Sam Altman emphasized that the arrival of o3 marks the next stage of AI development: models that can handle complex tasks requiring extensive reasoning.
Differences from human intelligence remain
New Scientist also reported that on the Abstraction and Reasoning Corpus for AGI (ARC-AGI), a contest regarded as an important yardstick for AGI, the o3 model set a new record: under the low-compute configuration it scored 75.7%, placing it at the top of the public leaderboard. However, the prize-eligible track of the contest imposes stricter compute limits, and under those limits o3's challenge ended in failure.
Under a high-compute configuration that exceeded the official limit by 172 times, however, o3 brute-forced its way to 87.5%, crossing the 85% threshold that represents human-level performance.
Commenting on o3's performance, former Google engineer and ARC-AGI creator François Chollet wrote in a blog post that this is a surprising and important step-change in AI capabilities. Even so, o3 has not achieved AGI, because it still fails on some very simple ARC-AGI problems, which shows a fundamental difference from human intelligence.
AGI is a hypothetical future system that could imitate human thinking, make decisions, possess self-awareness, and act on its own initiative. For now, AGI remains the stuff of science fiction and has not yet entered reality.
Iterating upward is not easy
o3 is not only OpenAI's latest showpiece, but also a vivid illustration of the AI giants' race over large language models.
Two years ago, OpenAI released ChatGPT, kicking off the AI arms race. From GPT-3.5 to the more accurate and more creative GPT-4, then o1, and now o3, OpenAI has kept improving its products.
Other top AI developers are also using increasingly advanced technology to drive iterative upgrades of their own products. Not long ago, Google launched a new version of its flagship model Gemini, said to be twice as fast as the previous generation and able to think, remember, plan, and even act on a user's behalf. Meta Platforms plans to launch Llama 4 next year.
However, the road of iteration is not smooth. Several leading companies, including OpenAI and Google, face the dilemma of enormous spending with diminishing returns. OpenAI's development of its GPT-5 model is reportedly progressing slowly: a single training run of just six months costs roughly 500 million US dollars, yet the resulting performance is only slightly better than the company's existing products.