Can AI Show Empathy?

Essays · Machine Learning

On March 29, 2025

Empathy is often defined as understanding another person’s experience by imagining oneself in that other person’s situation: One understands the other person’s experience as if it were being experienced by the self, but without the self actually experiencing it.

Hodges and Myers, Encyclopedia of Social Psychology

Reinforcement Learning is a modern AI technique in which a system learns the sequence of actions that leads to the best outcome for itself. It works by giving the AI a set of objectives (and sometimes penalties) and letting the machine learn, from its own experience, which actions best accomplish those objectives. It's how AIs can teach themselves to play video games, solve complex problems, and perform even more sophisticated tasks. Reinforcement Learning is also one of the more concerning areas where catastrophic value alignment failures can occur, because it is largely centered on simplified human abstractions of rewards and penalties. As far as the machine is concerned, its job is to find the most rewarding means of accomplishing a task, with penalties considered only if they are explicitly enumerated. Yet if control theory has taught us anything, it's that hazards cannot always be sufficiently enumerated.
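
To make that concrete, here is a minimal sketch of my own (not from the original post) of tabular Q-learning on a toy "reach the goal" task. The agent's entire notion of "good" is the reward signal we hand it; anything we neglect to penalize is, as far as it is concerned, perfectly acceptable.

```python
import random

N_STATES = 6            # positions 0..5, with the "goal" at state 5
ACTIONS = [-1, +1]      # step left or step right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

# Q-table: the agent's learned estimate of how rewarding each (state, action) pair is
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment: +1 reward for reaching the goal, 0 otherwise.
    No other hazards are encoded, so nothing else registers as 'bad' to the agent."""
    next_state = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

for episode in range(500):
    state = 0
    for _ in range(200):                       # cap episode length
        # epsilon-greedy: mostly exploit the best known action, occasionally explore
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning update: value is driven purely by the enumerated reward signal
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state
        if done:
            break

# Greedy policy after training: which way the agent prefers to move from each state
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```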

As humans, we take a significant amount of our learned value system for granted when we articulate instructions. A simple instruction to "win at chess" implicitly assumes that the other person understands they shouldn't reach across the table and murder their opponent. Machines, on the other hand, take nothing for granted and will gladly settle on an immoral action if they aren't explicitly penalized for it ahead of time. Reward models for RL systems are built from equations (such as the Bellman equations) that relate the value of one state to the next; the "value" of each state is ultimately determined by how close the machine is to fulfilling its objective, minus any penalties (which may also include time). Depending on how an AI is designed, it can accomplish the same goal carefully and conscientiously or haphazardly, with disregard for the consequences of its actions. Often, the short-sighted and over-simplified objectives are the ones that lead to the most dangerous outcomes. In an RL environment, the space of possible actions can be effectively unbounded, so constraining behavior becomes a futile exercise in enumerating every hazard on the way to the objective. While an AI can certainly be designed with a penalty for committing murder or for breaking the rules of chess, it is next to impossible to fully enumerate every action that should be discouraged when accomplishing even a simple goal.
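
For reference, one standard form of the value recursion being described is the Bellman optimality equation. The explicit split of the signal into objective progress R and enumerated penalties C is my framing for illustration; most treatments fold both into a single reward term:

$$ V^{*}(s) \;=\; \max_{a}\Big[\, R(s,a) \;-\; C(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \,\Big] $$

Here γ discounts future value and P(s′ | s, a) is the probability of landing in state s′ after taking action a in state s. The penalty term C is only ever as complete as the hazards its designers thought to write down.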

The concept of “empathy” is largely thought of as a human idea, but if we were to reason about it in terms of artificial intelligence, empathy would look like the ability of a machine to consider the impact of its actions upon other “entities” or “agents” in its environment as if those actions were taken against itself. Understanding “consequences” is in the same vein: learning how one action contributes to a sequence of events that leads to harm. As researchers continue to struggle with which values to design into a system, the essential “do unto others” approach has served humans quite well over the past two thousand years. Fortunately, such a calculation can be worked nicely into Bellman equations or many other types of RL models.
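
One way such a term could be folded into the same recursion (a sketch of my own; the post does not commit to a specific formula) is to score each other agent's change in state with the machine's own value function and charge any projected loss against the machine's reward:

$$ R_{\text{emp}}(s,a) \;=\; R_{\text{self}}(s,a) \;-\; \lambda \sum_{j \neq \text{self}} \max\!\big(0,\; V_{\text{self}}(s_j) - V_{\text{self}}(s'_j)\big) $$

where s_j and s′_j are agent j's states before and after the action, both valued as if they were the machine's own, and λ controls how heavily projected harm counts against the machine itself.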

The next step toward a responsible AI is one that considers other entities (we’ll call them “agents”) within its environment, either through its percepts or through forms of deduction. Treating an AI environment as a multi-agent system can typically already accomplish this. In a self-driving scenario, for example, this means identifying pedestrians and animals, and also deducing the presence of humans inside buildings, vehicles, and other elements of the scene. This is the easy work. Once those agents are identified, a reinforcement learning model could one day adopt a form of “projection”: running its own reward model on behalf of each other agent and counting any negative impact on them as if it were a penalty applied to itself. Such a projection model would be overlaid onto the equations already used in reinforcement learning, so that the machine learns that a given choice of action leads to a detrimental state for others, and therefore for itself, without needing to have those hazards pre-defined up front. Instead, a projection model would allow the AI to deduce hazards for itself, learning not only what is harmful to other agents, but which actions lead to those harms. This form of “empathetic projection” could likewise unlock significant advances in AI. Consider not only a “careful” AI that will do no harm, but an AI that can learn when another AI is about to harm someone: the impact upon accident avoidance, robotics-assisted surgery, and even missile defense systems would be considerable.
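
A rough sketch of what that projection might look like in code follows. The names, structure, and harm estimate here are my assumptions for illustration, not a method described in the post:

```python
from typing import Callable, Dict, Hashable

def empathetic_reward(
    own_reward: float,
    own_value: Callable[[Hashable], float],   # V(s): the value function the agent learned for itself
    others_before: Dict[str, Hashable],       # other agents' states before the action
    others_after: Dict[str, Hashable],        # their (predicted) states after the action
    weight: float = 1.0,                      # how heavily projected harm counts against the agent
) -> float:
    """Return the agent's own reward minus the harm it projects onto other agents.

    Each other agent's transition is scored with the agent's OWN value function,
    i.e. "as if those actions were taken against itself," so hazards never have
    to be enumerated ahead of time.
    """
    projected_harm = 0.0
    for agent_id, before in others_before.items():
        after = others_after[agent_id]
        loss = own_value(before) - own_value(after)   # positive when that agent ends up worse off
        projected_harm += max(0.0, loss)
    return own_reward - weight * projected_harm
```

In the toy Q-learning loop shown earlier, this function would simply replace the raw reward returned by the environment before the value update, leaving the rest of the learning machinery untouched.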

It’s interesting how, as we strive to make machines more super-human than ever, we struggle to imitate some of the most basic human traits, such as empathy. Now that the scientific community knows the mathematics to unlock super-human intelligence, perhaps we can focus on the mathematics of caring about others.

