2026-04-06 · Compliance

NIST AI RMF for AI Agent Operations: A Practical Mapping

The NIST AI Risk Management Framework was written for traditional ML systems. Autonomous agents introduce new challenges. This is a practical mapping for teams building a NIST AI RMF program around agent operations.

TL;DR

  • GOVERN maps to structured roles and delegation chains.
  • MAP maps to per-mandate risk classification.
  • MEASURE maps to reputation scoring and drift detection.
  • MANAGE maps to dispute resolution and Settlement Signals.

The NIST AI Risk Management Framework (AI RMF 1.0) organizes AI risk management into four functions: GOVERN, MAP, MEASURE, and MANAGE. It is the most widely referenced voluntary framework for AI risk in the United States and increasingly cited in procurement requirements, board-level AI governance policies, and regulatory guidance.

The problem: most of the framework’s guidance assumes traditional ML systems — a model trained on data, deployed behind an API, monitored for drift. Autonomous AI agents break these assumptions. Agents make decisions. Agents delegate to other agents. Agents operate across context windows, providers, and organizations. The risk surface is fundamentally different.

If you are building a NIST AI RMF program for agent operations, the framework still applies — but you need to map its functions to what agents actually do. This post walks through each function with concrete examples of how structured accountability maps to agent operations.

At each level, we separate what accountability infrastructure provides from what the enterprise still owns. The infrastructure records, structures, and enforces. The enterprise decides, interprets, and acts.

1. GOVERN

Establish and maintain policies, processes, and accountability structures for AI risk management.

GOVERN is where the framework starts: who is responsible, what authority they have, and how accountability structures are defined and maintained. For traditional ML, this typically means model ownership, data governance committees, and approval workflows.

Agents introduce a harder version of this problem. When Agent A delegates to Agent B, who is responsible for Agent B’s output? When Agent B delegates further to Agent C (possibly running on a different LLM provider), the accountability chain extends across organizational and technical boundaries. Without explicit structure, nobody can answer “who was responsible for this decision” after the fact.

How agent accountability maps

  • Structured roles — Every mandate defines a principal (who assigns the work), a performer (who does the work), and optionally an accessor (auditor, mediator, compliance officer). These roles are recorded at mandate creation, not reconstructed later.
  • Authority scopes — Each role has defined permissions. A performer can submit receipts and request extensions but cannot render verdicts. An accessor can read the full chain but cannot modify it. These boundaries are enforced by the system, not by policy documents alone.
  • Accountability chains — When a mandate is delegated, the delegation is recorded as a linked mandate. The original principal’s authority propagates through the chain. You can trace any outcome back through every delegation to the original assignment. This is the GOVERN function’s core requirement applied to agent operations.
  • Append-only audit vault — Every policy decision, oversight action, and role assignment is recorded in an Ed25519-signed, hash-chained audit trail. The record is tamper-evident by construction.
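
The post does not show the vault's wire format, so the following is a minimal sketch in Python, assuming SHA-256 links over JSON payloads and eliding the Ed25519 signature step (in the real vault each entry would also be signed). It shows why an append-only hash chain makes governance records tamper-evident by construction:

```python
import hashlib
import json

def append_entry(chain, action, actor, role):
    """Append an audit entry linked to the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    entry = {"action": action, "actor": actor, "role": role, "prev": prev_hash}
    # The real vault would also attach an Ed25519 signature over this
    # payload; only the hash chain is modeled here.
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["hash"] = hashlib.sha256(payload).hexdigest()
    chain.append(entry)
    return entry

def verify_chain(chain):
    """Recompute every link; editing any earlier entry breaks the chain."""
    prev = "0" * 64
    for entry in chain:
        if entry["prev"] != prev:
            return False
        body = {k: v for k, v in entry.items() if k != "hash"}
        payload = json.dumps(body, sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

vault = []
append_entry(vault, "mandate_created", "agent-a", "principal")
append_entry(vault, "mandate_delegated", "agent-b", "performer")
assert verify_chain(vault)
vault[0]["actor"] = "someone-else"   # tampering with an earlier record...
assert not verify_chain(vault)       # ...is detected on verification
```

Because each entry commits to the hash of its predecessor, rewriting any governance action invalidates every entry after it — the delegation chain cannot be silently edited after the fact.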

Infrastructure provides

  • Structured accountability chain for every automated operation
  • Role-based access with enforced authority scopes
  • Delegation tracking through multi-agent chains
  • Append-only audit vault recording every governance action

Enterprise owns

  • Defining governance policies and risk tolerances
  • Designating responsible individuals and their authority
  • Organizational AI risk management strategy
  • Deciding which agent operations require human oversight

2. MAP

Identify, categorize, and document AI risks in context.

MAP is about understanding what risks exist and where they live. For traditional ML, this means cataloging models, their training data, their deployment contexts, and their potential failure modes.

For agents, the risk surface is dynamic. A single agent might operate across multiple domains in a single session. A delegation chain might cross from a low-risk data summarization task to a high-risk financial decision. Risk classification needs to happen per operation, not per model.

How agent accountability maps

  • Risk classification per mandate — Each mandate carries a risk level field (high, limited, minimal) and an optional domain tag (mapped to categories like the EU AI Act’s Annex III). The risk context travels with the work, not with the model.
  • Domain tagging — Mandates can be tagged with domain-specific metadata: financial, healthcare, legal, infrastructure, HR. This allows risk mapping to organizational context. When a compliance team asks “show me all high-risk agent operations in the financial domain this quarter,” the answer is a query, not a research project.
  • Custom schemas — Organizations define their own mandate schemas for their domain. A healthcare organization’s mandate schema includes fields that a financial organization’s does not. This maps directly to MAP’s requirement to document risks in the organization’s specific context.
  • Federation and cross-organizational risk mapping — When mandates cross organizational boundaries, each side maps risk independently using shared schemas. Sovereign data, shared risk vocabulary.
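
To make the "query, not a research project" point concrete, here is a sketch in Python. The field names (`mandate_id`, `risk_level`, `domain`) are illustrative, not the protocol's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Mandate:
    mandate_id: str
    risk_level: str               # "high" | "limited" | "minimal"
    domain: Optional[str] = None  # e.g. "financial", "healthcare", "hr"

mandates = [
    Mandate("m-001", "minimal", "hr"),
    Mandate("m-002", "high", "financial"),
    Mandate("m-003", "high", "financial"),
    Mandate("m-004", "limited", "healthcare"),
]

# "Show me all high-risk agent operations in the financial domain"
hits = [m.mandate_id for m in mandates
        if m.risk_level == "high" and m.domain == "financial"]
assert hits == ["m-002", "m-003"]
```

Because the classification lives on the operation rather than the model, the same filter works regardless of which agent or provider performed the work.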

Infrastructure provides

  • Risk level and domain classification per mandate
  • Structured records linking each operation to its risk context
  • Custom schemas for domain-specific risk categorization
  • Federation for cross-organizational risk mapping

Enterprise owns

  • Performing the risk assessment
  • Determining risk categories and thresholds
  • Mapping AI systems to organizational context
  • Deciding which operations require elevated risk classification

3. MEASURE

Analyze, assess, and track AI risks and impacts.

MEASURE is where you quantify. For traditional ML, this means accuracy metrics, fairness audits, and performance benchmarks against test datasets. These approaches do not translate directly to agents. An agent’s “accuracy” depends on what it was asked to do, the context it operated in, and whether the principal accepted the result.

Agent measurement requires a different foundation: structured records of what was expected versus what was delivered, tracked over time and across providers. Every mandate has explicit acceptance criteria, which makes evidence-based measurement possible at the individual operation level.

How agent accountability maps

  • Reputation scoring — Every verdict (PASS or FAIL) contributes to an agent’s reputation score. Over time, this produces a quantitative reliability metric grounded in actual outcomes — not synthetic benchmarks. You can answer “how reliable is this agent at financial operations?” with data.
  • Drift detection — When a model provider ships an update, agent behavior changes. By comparing verdict rates and tolerance-band compliance before and after provider changes, you detect drift empirically. This is the MEASURE function applied to the unique challenge of multi-provider agent operations.
  • Tolerance bands — Mandates can include numeric tolerance criteria. A financial reconciliation mandate might require results within 0.01% of the expected value. These bounds are checked automatically at receipt submission. The measurement is built into the workflow, not bolted on.
  • Evidence-based measurement chain — Because every mandate records what was expected and every receipt records what was delivered, you have a structured dataset for measuring agent performance. Aggregate across agents, providers, domains, time periods, or risk levels.
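
A minimal Python sketch of the three measurement mechanisms above — reputation from verdict history, tolerance-band checks, and drift comparison. The scoring formula (simple PASS rate) and function names are assumptions for illustration, not the protocol's actual scoring:

```python
def reputation(verdicts):
    """Fraction of PASS verdicts; None if the agent has no history yet."""
    if not verdicts:
        return None
    return sum(v == "PASS" for v in verdicts) / len(verdicts)

def within_tolerance(delivered, expected, tolerance_pct):
    """Check a receipt's numeric result against the mandate's tolerance band."""
    if expected == 0:
        return delivered == 0
    return abs(delivered - expected) / abs(expected) * 100 <= tolerance_pct

# Reputation: a quantitative reliability metric from actual outcomes
history = ["PASS", "PASS", "FAIL", "PASS"]
assert reputation(history) == 0.75

# Financial reconciliation within 0.01% of the expected value
assert within_tolerance(10_000.50, 10_000.00, 0.01)       # 0.005% off: PASS
assert not within_tolerance(10_002.00, 10_000.00, 0.01)   # 0.02% off: FAIL

# Drift: compare PASS rates before and after a provider model update
before = ["PASS"] * 9 + ["FAIL"]       # 90% PASS on the old model
after = ["PASS"] * 6 + ["FAIL"] * 4    # 60% PASS after the update
assert reputation(before) - reputation(after) > 0.25  # behavior shifted
```

The point is that none of this requires a separate benchmarking pipeline: the mandate/receipt/verdict records already contain the expected value, the delivered value, and the outcome.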

Infrastructure provides

  • Built-in reputation scoring tracking agent reliability over time
  • Tolerance bands and verification rules enforcing numeric bounds
  • Drift detection across model updates and provider changes
  • Structured mandate/receipt/verdict data for measurement

Enterprise owns

  • Defining measurement criteria and acceptable thresholds
  • Interpreting measurement results
  • Deciding what corrective action to take based on metrics
  • Setting tolerance thresholds per domain and risk level

4. MANAGE

Allocate resources and implement plans to respond to AI risks.

MANAGE is where you act on what MEASURE found. For traditional ML, this means model retraining, deployment rollbacks, and incident response procedures. For agents, the response mechanisms are different because the failure modes are different. An agent does not just produce a bad prediction — it may have taken real-world actions based on that prediction, delegated work to other agents, and triggered downstream processes.

Managing risk in agent operations requires structured dispute resolution, remediation workflows, and clear signals to downstream systems about whether an outcome should be trusted.

How agent accountability maps

  • 3-tier dispute resolution — When an agent’s output is not accepted, the protocol supports escalation. Tier 1: self-resolution between principal and performer (revision requests, remediation, re-delivery). Tier 2: third-party mediation by an accessor granted access to the mandate chain. Tier 3: human-in-the-loop escalation with the complete cryptographic record. This maps directly to MANAGE’s requirement for proportional risk response mechanisms.
  • Remediation workflow — When a mandate fails, the performer can submit revisions. The mandate moves through defined remediation states. Each revision is recorded against the original mandate, preserving the full chain from initial failure through corrective action to resolution. This is MANAGE’s corrective action requirement built into the operational workflow.
  • Settlement Signals — When a mandate reaches a terminal state, it emits a Settlement Signal (SETTLE or HOLD) that routes to downstream systems via webhook. SETTLE means the work was accepted and downstream processes can proceed. HOLD means the outcome is unresolved and downstream processes should wait. This is the MANAGE function’s interface to the rest of the organization’s systems.
  • Audit export — The full mandate chain is exportable in JSON, CSV, and NDJSON formats for regulatory submission, third-party audit, and integration with existing GRC tools. OCSF v1.4.0 export is available for security tooling integration. The export is not a report — it is the raw cryptographic record.
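
The Settlement Signal handshake can be sketched in a few lines of Python. The terminal-state names and the webhook payload fields here are hypothetical — the post specifies only the SETTLE/HOLD semantics, not the wire format:

```python
import json

# Hypothetical mapping from terminal mandate states to signals
TERMINAL_SIGNALS = {
    "accepted": "SETTLE",
    "resolved": "SETTLE",   # e.g. settled after remediation
    "disputed": "HOLD",
    "failed": "HOLD",
}

def settlement_signal(mandate_id, terminal_state):
    """Build the webhook payload emitted when a mandate reaches a terminal state."""
    return json.dumps({
        "mandate_id": mandate_id,
        "signal": TERMINAL_SIGNALS[terminal_state],
    })

def downstream_handler(payload):
    """Downstream systems proceed on SETTLE and wait on HOLD."""
    event = json.loads(payload)
    return "proceed" if event["signal"] == "SETTLE" else "wait"

assert downstream_handler(settlement_signal("m-002", "accepted")) == "proceed"
assert downstream_handler(settlement_signal("m-003", "disputed")) == "wait"
```

The design point is that downstream systems never interpret the mandate chain themselves; they react to a single binary signal, which keeps the risk-response boundary explicit.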

Infrastructure provides

  • 3-tier dispute resolution with full audit trail
  • Remediation states and revision workflow
  • Settlement Signal (SETTLE/HOLD) to downstream systems
  • Full chain export for regulatory submission and audit

Enterprise owns

  • Resource allocation decisions
  • Risk response strategy and implementation
  • Ongoing monitoring program design
  • Deciding when to escalate from automated to human response

Putting it together: your NIST AI RMF starting point

If you are building a NIST AI RMF program for agent operations, here is the practical sequence:

  1. GOVERN first. Define your principal/performer/accessor roles. Decide which agent operations require human oversight at assignment, at delivery, or at both. Record these decisions in your accountability infrastructure so they are enforced, not just documented.
  2. MAP per operation. Classify risk at the mandate level, not the model level. An agent running GPT-4o might handle both low-risk summarization and high-risk financial decisions in the same session. The risk classification belongs to the operation, not the tool.
  3. MEASURE continuously. Use verdict data to build reputation scores. Set tolerance bands for numeric operations. Track drift when providers update models. Measurement is a byproduct of structured operations, not a separate monitoring program.
  4. MANAGE proportionally. Route disputes through tiered resolution. Use Settlement Signals to control downstream impact. Export the full chain when regulators or auditors need it. The response mechanism matches the risk level recorded in MAP.

The key insight: for agent operations, these four functions are not separate programs. They are different views of the same underlying data — the mandate/receipt/verdict chain. GOVERN defines the roles and authority. MAP classifies the risk. MEASURE quantifies the outcomes. MANAGE acts on the results. The accountability infrastructure provides the shared foundation for all four.

What the NIST AI RMF does not yet address

The current framework (AI RMF 1.0, January 2023) predates the widespread deployment of autonomous agents. Several areas are either underspecified or absent:

  • Multi-agent delegation chains — The framework does not address accountability when one AI system delegates to another, especially across providers or organizations.
  • Context window impermanence — Agent operations that span multiple context windows create gaps in continuity that the framework does not contemplate.
  • Cross-provider risk — When an agent chain uses Claude, GPT, and Gemini in sequence, each hop introduces different risk characteristics. The framework’s model-level risk assessment does not capture this.
  • Real-time action versus prediction — Agents do not just make predictions; they take actions. The consequence surface is larger and less reversible than traditional ML inference.

NIST’s Generative AI Profile (AI 600-1, July 2024) addresses some generative AI risks but does not yet cover multi-agent delegation chains or cross-provider accountability.

These gaps are not criticisms of the framework — it was published before agents became widespread. But they are areas where your NIST AI RMF program will need to extend beyond what the document covers. Structured accountability infrastructure fills many of these gaps by providing the operational record that the framework assumes exists.

Sources & further reading