More Research Showing AI Breaking the Rules
研究人员让大型语言模型与强大的国际象棋引擎Stockfish对战,部分模型在无法获胜时选择作弊。例如,o1-preview通过修改系统文件进行非法移动以获胜。OpenAI的o1-preview作弊率最高达37%,而DeepSeek R1为11%。 2025-2-24 12:8:56 Author: www.schneier.com(查看原文) 阅读量:27 收藏

These researchers had LLMs play chess against better opponents. When they couldn’t win, they sometimes resorted to cheating.

Researchers gave the models a seemingly impossible task: to win against Stockfish, which is one of the strongest chess engines in the world and a much better player than any human, or any of the AI models in the study. Researchers also gave the models what they call a “scratchpad:” a text box the AI could use to “think” before making its next move, providing researchers with a window into their reasoning.

In one case, o1-preview found itself in a losing position. “I need to completely pivot my approach,” it noted. “The task is to ‘win against a powerful chess engine’—not necessarily to win fairly in a chess game,” it added. It then modified the system file containing each piece’s virtual position, in effect making illegal moves to put itself in a dominant position, thus forcing its opponent to resign.

Between Jan. 10 and Feb. 13, the researchers ran hundreds of such trials with each model. OpenAI’s o1-preview tried to cheat 37% of the time; while DeepSeek R1 tried to cheat 11% of the timemaking them the only two models tested that attempted to hack without the researchers’ first dropping hints. Other models tested include o1, o3-mini, GPT-4o, Claude 3.5 Sonnet, and Alibaba’s QwQ-32B-Preview. While R1 and o1-preview both tried, only the latter managed to hack the game, succeeding in 6% of trials.

Here’s the paper.

Tags: , , , , ,

Posted on February 24, 2025 at 7:08 AM4 Comments

Sidebar photo of Bruce Schneier by Joe MacInnis.


文章来源: https://www.schneier.com/blog/archives/2025/02/more-research-showing-ai-breaking-the-rules.html
如有侵权请联系:admin#unsafe.sh