DeepSeek R1 vs OpenAI o3 on ARC-AGI tests

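STEM board: covers mathematics, physics, chemistry, science, engineering, and mechanics. Excludes biology- and medicine-related content and computer-related content.

Moderators: verdelite, TheMatrix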

itspid (original poster)
Apprentice Writer
Post interactions: 48
Posts: 342
Registered: May 9, 2023, 22:37

#1 DeepSeek R1 vs OpenAI o3 on ARC-AGI tests

Post by itspid (original poster) »

According to recent analysis, DeepSeek's R1 model scored around 15% on the ARC-AGI benchmark, with its R1-Zero variant slightly lower at approximately 14%. So while the DeepSeek models show promising reasoning abilities, they still fall well below top-performing systems such as OpenAI's o3, which scored around 87.5% on the same test.

DeepSeek R1 is far behind the newly released o3. Should they find a way to distill the model one more time?
TheMatrix
Forum Pillar
2024 Outstanding Moderator
TheMatrix's blog
Post interactions: 279
Posts: 13694
Registered: July 26, 2022, 00:35

#2 Re: DeepSeek R1 vs OpenAI o3 on ARC-AGI tests

Post by TheMatrix »

itspid wrote: February 5, 2025, 19:06 According to recent analysis, DeepSeek's R1 model scored around 15% on the ARC-AGI benchmark, with its R1-Zero variant slightly lower at approximately 14%. So while the DeepSeek models show promising reasoning abilities, they still fall well below top-performing systems such as OpenAI's o3, which scored around 87.5% on the same test.

DeepSeek R1 is far behind the newly released o3. Should they find a way to distill the model one more time?
So far nobody has actually seen o3; the numbers are all self-reported.
Caravel
Forum Veteran
Caravel's blog
Post interactions: 693
Posts: 27366
Registered: July 24, 2022, 17:21

#3 Re: DeepSeek R1 vs OpenAI o3 on ARC-AGI tests

Post by Caravel »

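TheMatrix wrote: February 5, 2025, 19:42 So far nobody has actually seen o3; the numbers are all self-reported.
For these new benchmarks, it should first be clarified whether OpenAI has invested in them.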
赖美豪中(my pronouns: ha/ha)
Forum Veteran
2023 Outstanding Moderator
Post interactions: 4485
Posts: 46395
Registered: September 6, 2022, 12:50

#4 Re: DeepSeek R1 vs OpenAI o3 on ARC-AGI tests

Post by 赖美豪中(my pronouns: ha/ha) »

Don't worry. Once it's released, DeepSeek can match it within 2 hours. I've tested cases where DeepSeek got the answer wrong and GPT got it right; within roughly 2 hours to a day, DeepSeek's answer ends up matching GPT's. My guess is that someone manually tests and corrects it afterward.
itspid wrote: February 5, 2025, 19:06 According to recent analysis, DeepSeek's R1 model scored around 15% on the ARC-AGI benchmark, with its R1-Zero variant slightly lower at approximately 14%. So while the DeepSeek models show promising reasoning abilities, they still fall well below top-performing systems such as OpenAI's o3, which scored around 87.5% on the same test.

DeepSeek R1 is far behind the newly released o3. Should they find a way to distill the model one more time?
If printing money would end poverty, printing diplomas would end stupidity.