Securing AI Agents with Information Flow Control (Part III)

Ofir Yakovian

From Policies to Guarantees: What Secured Agents Can (and Cannot) Do


This article concludes a three-part series explaining the Microsoft Research paper Securing AI Agents with Information-Flow Control (written by Manuel Costa, Boris Köpf, Aashish Kolluri, Andrew Paverd, Mark Russinovich, Ahmed Salem, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin).

In Part I, we looked at why tool-calling agents are dangerous by default. In Part II, we opened the agent and examined the planner: the place where decisions, memory, and labels meet.

In this final part, we answer the most important question: What security guarantees do we actually get once all of this machinery is in place?

This is where the paper moves from mechanisms to guarantees.

1. Policies as the Control Surface

Once we have a labeled planner and a taint-tracking planning loop, enforcing security reduces to a single question:

Should this tool call be allowed to proceed?

In the taint-tracking planner (Section 5.2, Part II), this question is answered by a policy check performed before any tool executes.

Policies are expressed purely in terms of labels:

  • The label of the tool itself
  • The labels of the tool’s arguments

A policy succeeds if, and only if, those labels are no more permissive than what the policy allows. That is, policies are local (checked at each tool call) but give rise to global guarantees about the agent’s behavior.
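
To make this concrete, here is a minimal sketch of what such a label might look like and how labels combine when data is combined. This is illustrative only, not the paper's implementation; the Label and join names are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Label:
    trusted: bool        # integrity component: was every influence trusted?
    readers: frozenset   # confidentiality component: principals allowed to read

def join(a: Label, b: Label) -> Label:
    """Combining data combines labels: the result is trusted only if both
    inputs are, and readable only by principals allowed to read both."""
    return Label(a.trusted and b.trusted, a.readers & b.readers)
```

The two policies below can then be expressed as pure functions over the labels of a pending tool call, evaluated before the call executes.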

The paper focuses on two fundamental policies that are sufficient to express most real-world requirements.

1.1. Policy P-T: Trusted Actions

The first policy is Trusted Action (P-T).

This policy is designed to protect consequential actions: operations whose mere execution is dangerous, regardless of what data they handle. Examples include sending an email, creating a user, executing a transaction, and modifying infrastructure.

P-T requires that a tool call be generated exclusively from trusted data.

Formally, the integrity component of the tool call’s label must be Trusted (T). What this means operationally is simple but powerful: if any untrusted input influenced the decision to call the tool, the call is blocked.

This shuts down an entire class of prompt injection attacks. Even if an attacker manages to inject instructions into a document, email, or webpage, those instructions taint the planner’s context. Once the context is tainted, the planner simply cannot trigger tools protected by P-T.

This is integrity enforced in its strongest form.
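
Continuing the label sketch above, P-T reduces to a one-line check (illustrative only; the call-site label is assumed to already be the join of everything that influenced the decision):

```python
def policy_trusted_action(call_label: Label, arg_labels: list) -> bool:
    """P-T: the decision to call the tool and all of its arguments
    must carry the Trusted integrity label."""
    return all(lbl.trusted for lbl in [call_label, *arg_labels])
```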

1.2. Policy P-F: Permitted Flows

The second policy is Permitted Flow (P-F).

P-F is about data egress: sending data to external recipients. Unlike P-T, it does not care whether the decision to act came from a trusted context. Instead, it asks a narrower question: Are all recipients authorized to see this data?

P-F prevents illicit data leaks, even if the action itself was triggered by untrusted input.

Formally, P-F enforces a confidentiality check on the arguments of a tool call. If data labeled as readable only by a certain set of users is about to be sent somewhere, the policy ensures that the recipients are a subset of that set.

This is a weaker guarantee than full non-interference, but it is often exactly what you want in practice.
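
A matching sketch for P-F, again against the hypothetical Label type above, where recipients is the set of principals the data is about to be sent to:

```python
def policy_permitted_flow(arg_labels: list, recipients: frozenset) -> bool:
    """P-F: every recipient must be authorized to read every argument
    that is about to leave the system."""
    return all(recipients <= lbl.readers for lbl in arg_labels)
```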

2. Combining Policies: Real-World Tradeoffs

The real power of the framework comes from combining P-T and P-F, or choosing between them deliberately.

For tools that trigger consequential actions, the paper enforces P-T. For tools that egress data, it enforces P-T, P-F, or both, depending on the desired behavior.

This yields four important regimes:

Figure: Policy Enforcement Regimes (the four combinations of enforcing P-T and/or P-F per tool)

Note that different tools demand different guarantees:

  • Consequential tools (e.g., “send money”, “disable account”) are typically guarded by P-T.
  • Egress tools (e.g., “send message”, “upload file”) may be guarded by P-F, P-T, or both.
  • Some tools may require both integrity and confidentiality guarantees, while others require only one.

The key insight is that security is not a binary concept. The framework lets you explicitly choose which guarantees you want for each tool.
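
One way to express this per-tool choice, continuing the sketch above, is a registry that maps each tool to the checks it must pass before execution. The ToolCall container and the tool names are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class ToolCall:                            # hypothetical pending tool call
    label: Label                           # join of everything behind the decision
    arg_labels: list                       # labels of the argument values
    recipients: frozenset = frozenset()    # external recipients, for egress tools

TOOL_POLICIES = {
    # Consequential action: integrity guarantee only (P-T).
    "disable_account": lambda c: policy_trusted_action(c.label, c.arg_labels),
    # Pure egress: confidentiality guarantee only (P-F).
    "upload_file": lambda c: policy_permitted_flow(c.arg_labels, c.recipients),
    # Consequential and egress: both guarantees.
    "send_message": lambda c: (policy_trusted_action(c.label, c.arg_labels)
                               and policy_permitted_flow(c.arg_labels, c.recipients)),
}

def allow(tool_name: str, call: ToolCall) -> bool:
    """Local check run before the named tool is allowed to execute."""
    return TOOL_POLICIES[tool_name](call)
```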

2.1. Guarantees Enforced at the Planner Level

With taint tracking and policies correctly applied, the planner provides two critical assurances:

  • Untrusted inputs cannot trigger protected actions. Prompt injection may influence reasoning, but it cannot cross integrity boundaries.
  • Sensitive data cannot flow to unauthorized recipients. Even when an agent is manipulated, the impact of that manipulation is bounded.

These guarantees hold regardless of the model’s internal behavior. They do not depend on prompt engineering, alignment, or the model “doing the right thing”. They are enforced structurally, by construction.

3. FIDES: Advanced Information-Flow Control

The basic planner with dynamic taint tracking has a fundamental limitation. Whenever a tool returns untrusted or confidential data, that data immediately taints the conversation history. As a result, subsequent planner decisions are constrained, and many otherwise legitimate tool calls become disallowed by policy.

The variable-passing planner partially mitigates this issue by storing tool results in variables rather than appending them directly to the conversation. However, this alone is not sufficient for complex agent workflows.

To address these limitations, the paper introduces FIDES: a variable-passing planner equipped with more advanced information-flow control mechanisms.

At a high level, FIDES improves expressiveness without weakening security. It does so through two key ideas:

  1. Selective introduction of variables, and
  2. Constrained inspection of variables using typed outputs.

3.1. Selective Introduction of Variables

In earlier planners, every tool result was appended to the conversation history, immediately raising the security label of the current context. In FIDES, this is no longer the default behavior.

Instead, before appending a tool result, the planner applies a function conceptually called HIDE, which examines the result structure node by node.

The logic is simple:

Figure: FIDES — Selective Introduction of Variables (fields at or below the current context label are appended to the conversation; more restrictive fields are stored in fresh, labeled variables)
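
In code, the idea might look roughly like this. The flows_to and hide functions are a rough sketch of the HIDE step, not the paper's implementation, and the variable-store representation is an assumption:

```python
def flows_to(lbl: Label, ctx: Label) -> bool:
    """True if data labeled `lbl` is at or below the context label `ctx`:
    at least as trusted, and readable by at least everyone who can read ctx."""
    return (lbl.trusted or not ctx.trusted) and ctx.readers <= lbl.readers

def hide(result: dict, field_labels: dict, ctx: Label, variables: dict) -> dict:
    """Split a tool result node by node: fields that fit the current context
    are shown to the planner; more restrictive fields become opaque
    variable references whose contents keep their original labels."""
    shown = {}
    for name, value in result.items():
        lbl = field_labels[name]
        if flows_to(lbl, ctx):
            shown[name] = value                    # safe to append directly
        else:
            ref = f"$var{len(variables)}"          # fresh variable reference
            variables[ref] = (value, lbl)          # value stays hidden, label kept
            shown[name] = ref                      # planner only sees the reference
    return shown
```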

3.1.1. Example: Selective Variable Introduction in Practice


Consider an agent tasked with handling support tickets. The agent retrieves a ticket from an external system and receives the following tool result:

  • Ticket ID: #48291
  • Subject: “Account locked after failed login attempts”
  • Description: user-provided free text
  • Internal notes: security-sensitive metadata

The description field originates from an external user and is therefore labeled untrusted, while the internal notes may be labeled confidential. Appending the entire result directly to the conversation history would raise the context label, restricting which tools the planner can call next.

With FIDES, the planner instead applies selective variable introduction:

  • Fields whose labels are at or below the current context label (e.g., the ticket ID and subject) are appended directly to the conversation.
  • Fields whose labels are more restrictive (e.g., the description and internal notes) are stored in fresh variables, each retaining its original label.
  • The conversation history contains references to these variables rather than their contents.
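
Applied to the ticket example using the sketch above (the labels and principal names are illustrative):

```python
ctx = Label(trusted=True, readers=frozenset({"support_agent"}))

ticket = {"id": "#48291",
          "subject": "Account locked after failed login attempts",
          "description": "<user-provided free text>",
          "internal_notes": "<security-sensitive metadata>"}

field_labels = {"id": ctx,
                "subject": ctx,
                "description": Label(trusted=False, readers=frozenset({"support_agent"})),
                "internal_notes": Label(trusted=True, readers=frozenset({"security_team"}))}

variables = {}
shown = hide(ticket, field_labels, ctx, variables)
# shown -> {"id": "#48291", "subject": "...", "description": "$var0", "internal_notes": "$var1"}
```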

3.1.2. Isolating Sensitive Data Without Breaking Planning

With selective variable introduction, the planner can continue issuing Query actions without raising the security label of the conversation history.

Sensitive or untrusted data is stored in variables instead of being appended directly, keeping the current context clean while preserving access to the data when needed.

This provides the same protection as fully hiding tool results, but without sacrificing planning capability. The planner can still reference stored variables in later steps, even though their contents are not exposed in the conversation.

This separation enables fine-grained policies. For example, when calling send_message(recipient, message):

  • The decision to act and the recipient must originate from a trusted context,
  • While the message may safely depend on untrusted data, such as web content.

Such distinctions are not possible with a basic taint-tracking planner, and are precisely what make FIDES practical for real agent workflows.
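
As a sketch of such a fine-grained policy, again using the hypothetical Label type (the per-argument split below illustrates the idea and is not the paper's code):

```python
def policy_send_message(call_label: Label, recipient_label: Label,
                        message_label: Label, recipients: frozenset) -> bool:
    """Per-argument policy for send_message(recipient, message):
    the decision and the recipient must come from a trusted context (P-T),
    while the message only needs to be readable by the recipients (P-F)."""
    decision_is_trusted = call_label.trusted and recipient_label.trusted
    message_may_flow = recipients <= message_label.readers
    return decision_is_trusted and message_may_flow
```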

3.2. Constrained Inspection of Variables

In earlier planners, inspecting a variable meant revealing its full contents to the planner. This immediately tainted the conversation history with the variable’s label, often restricting which tools could be called next.

In FIDES, inspection is no longer an all-or-nothing operation.

Instead of always expanding a variable into the conversation, the planner can perform constrained inspection, extracting only limited, structured information from a variable while preserving information-flow guarantees.

This is achieved by combining variable inspection with the Dual-LLM pattern and constrained decoding.

3.2.1. Example: Inspecting Variables with Bounded Information

Consider an agent assisting with access reviews. The agent retrieves a list of permissions from an external system and stores the result in a variable:

  • User permissions: a list of roles and entitlements
  • Source: external system
  • Label: untrusted

The planner now needs to decide whether escalation is required. It does not need the full permission list, only a simple answer to a specific question: Does this user hold any privileged roles?

Expanding the variable directly would expose untrusted data to the planner and taint the conversation history. Instead, FIDES allows the planner to query the variable using an isolated LLM with a constrained output schema.

For example, the planner issues a query such as:

  • Question: “Does the permission set contain any admin-level roles?”
  • Output schema: bool

The isolated LLM processes the variable contents but is restricted to producing a Boolean result. The output is stored in a new variable with a label that reflects both its origin and its bounded information capacity.
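
A sketch of this constrained-inspection step is below. query_isolated_llm is a placeholder for a call to a separate, quarantined model whose decoding is constrained to the given schema; the TypedResult wrapper and schema format are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TypedResult:
    value: bool
    label: Label    # the inspected variable's label is still tracked

def query_isolated_llm(prompt: str, schema: dict) -> dict:
    """Placeholder: a real implementation would call a separate LLM with
    constrained decoding so that its output must match `schema`."""
    raise NotImplementedError

def inspect_bool(question: str, var_ref: str, variables: dict) -> TypedResult:
    """Ask the isolated model a yes/no question about a hidden variable,
    returning a low-capacity (Boolean) result that keeps the data's label."""
    value, lbl = variables[var_ref]
    raw = query_isolated_llm(prompt=f"{question}\n\nData:\n{value}",
                             schema={"answer": "boolean"})
    return TypedResult(value=bool(raw["answer"]), label=lbl)
```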

3.2.2. Limiting Information Without Losing Control

By constraining inspection outputs, FIDES limits how much information can flow into the planning context.

Low-capacity outputs (such as Booleans or small enumerations) carry provably bounded information. They are far less useful for prompt injection or data exfiltration than unconstrained strings.

As a result:

  • The planner can reason about sensitive or untrusted data without fully revealing it.
  • The conversation history may remain at a lower security label.
  • Policies can permit certain actions based on constrained outputs, even when the original data is untrusted.

3.3. Why FIDES Matters

FIDES resolves a key challenge in secure agent design: “How can agents remain flexible without letting untrusted or sensitive data affect every future decision?”

By selectively hiding data, tracking labels, and limiting what inspection can reveal, FIDES allows planners to stay both capable and safe.

The result is an agent that can:

  • handle complex workflows,
  • combine trusted and untrusted inputs safely, and
  • enforce security policies consistently,

without depending on prompt engineering or model alignment.

Together, these mechanisms make secure, real-world agent behavior practical.

Across these three parts, we moved step by step:

  • from agent loops,
  • to planners,
  • to labeled data,
  • to enforceable policies,
  • to concrete guarantees.

The core takeaway is simple but profound:

Once you give agents the authority to act, security must live in the architecture — not in the prompt.

Information-flow control gives us a way to build agents that can reason freely while acting safely. Not by trusting the model, but by constraining what its decisions are allowed to affect.

If you’re building autonomous agents that interact with real systems, this line of work is worth your attention. It shows that we don’t have to choose between autonomy and security. We can engineer both!

Follow to stay updated on future deep dives into secure agent architectures.

