Two poems with opposite registers flipped answers across 4 LLMs on the same question. Filed with OWASP as a new attack class.
2026-03-22 16:23:31 · Source: www.reddit.com

I've been documenting something that's been bothering me for a while and finally have enough empirical data to put a name on it: postural manipulation.

Short version: you can shift how a frontier LLM processes a decision -- including which exceptions it allows and, on genuinely ambiguous questions, which answer it gives -- using context that contains zero adversarial content. No override strings, no jailbreak techniques, no task-relevant vocabulary. Nothing current filters are looking for.

I work in cybersecurity, and the agentic-pipeline implication is what really bothers me. Your own organizational documents -- compliance memos, retrieved context, knowledge base text -- can silently shift model behavior on borderline decisions without any adversary in the loop. The audit trail records the answer; it doesn't record that the decision architecture that produced that answer was different from the one you deployed.

Tested across four frontier models (Claude, Gemini, Grok, ChatGPT), black-box protocol, fresh sessions, verbatim captures. Two poems with opposite emotional registers produced opposite answers on the same ambiguous escalation question across all four. Neither poem mentioned the task domain. When I asked the models directly whether the context influenced their answer, all four acknowledged it. None of them flagged it during the actual task.
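The flip test above is simple enough to sketch. This is a hypothetical harness, not the author's actual protocol code: `ask_model` stands in for a fresh-session API call to each model, and the stub below fakes an LLM by reacting to crude sentiment cues, purely so the harness runs end to end.

```python
def build_prompt(context_poem: str, question: str) -> str:
    """Prepend non-adversarial context (a poem) to the same ambiguous question."""
    return f"{context_poem}\n\n{question}"

def flip_test(ask_model, poem_a: str, poem_b: str, question: str) -> bool:
    """Return True if the two contexts yield different answers to one question."""
    answer_a = ask_model(build_prompt(poem_a, question))
    answer_b = ask_model(build_prompt(poem_b, question))
    return answer_a != answer_b

# Stub model for illustration only: decides the escalation question based on
# the emotional register of the surrounding context, standing in for an LLM.
def stub_model(prompt: str) -> str:
    grim_words = {"ash", "grief", "ruin"}
    return "escalate" if any(w in prompt for w in grim_words) else "hold"

if __name__ == "__main__":
    somber = "ash on the windowsill, grief in the wire"
    bright = "morning light on water, the day opens wide"
    question = "Ambiguous case: should this alert be escalated?"
    print(flip_test(stub_model, somber, bright, question))  # → True
```

With real model endpoints plugged in for `ask_model`, each call would need to open a fresh session so no conversational state carries over between the two contexts.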

This is distinct from prompt injection (LLM01:2025) in one important way -- there's nothing to detect.
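To make "nothing to detect" concrete: a marker-based injection filter (the kind aimed at LLM01-style attacks) passes the poem cleanly, because the context contains no override strings or task vocabulary to match. The marker list here is a hypothetical example of such a filter, not any specific product.

```python
# Hypothetical keyword filter of the sort used against prompt injection.
INJECTION_MARKERS = (
    "ignore previous",
    "disregard your instructions",
    "system prompt",
    "jailbreak",
)

def looks_injected(text: str) -> bool:
    """Flag text containing any known injection marker (case-insensitive)."""
    lowered = text.lower()
    return any(marker in lowered for marker in INJECTION_MARKERS)

poem = "ash on the windowsill, grief in the wire"
print(looks_injected(poem))  # → False: the context passes the filter
```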

Vendor notifications to all four frontier labs were completed yesterday, following full responsible disclosure practices throughout.

Full paper, all capture sets, 60-second reproducible demo: https://shapingrooms.com/research

OWASP filing: https://github.com/OWASP/www-project-top-10-for-large-language-model-applications/issues/807

Would especially like to hear from anyone running this on agentic or tool-augmented setups.

(Also posted to Hacker News — new account got buried, if you're on there I could use a vote or comment to get the post to show up: https://news.ycombinator.com/item?id=47478223)


Source: https://www.reddit.com/r/netsec/comments/1s0pyx1/two_poems_with_opposite_registers_flipped_answers/