Fundamental Assumption

AARM operates on a core premise: the AI orchestration layer cannot be trusted as a security boundary. The model processes untrusted inputs through opaque reasoning, producing actions that may serve attacker goals rather than user intent.
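
To make the premise concrete, here is a minimal sketch of what it implies for enforcement: decisions are made outside the model, on the proposed action alone, with deny-by-default semantics. The names here (`ActionRequest`, `enforce`, the tool names) are illustrative assumptions, not part of any AARM API.

```python
from dataclasses import dataclass

@dataclass
class ActionRequest:
    """An action proposed by the agent; never trusted on its own."""
    tool: str
    params: dict

# Allow-list evaluated outside the model. The model's reasoning is opaque,
# so the decision depends only on the proposed action, not on its rationale.
ALLOWED_TOOLS = {"search_docs", "read_ticket"}

def enforce(request: ActionRequest) -> bool:
    """Deny by default: only explicitly allowed tools may execute."""
    return request.tool in ALLOWED_TOOLS

# The agent may propose anything; enforcement does not assume good intent.
print(enforce(ActionRequest("read_ticket", {"id": 42})))    # True
print(enforce(ActionRequest("delete_repo", {"name": "x"})))  # False
```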

Threat Summary

| Threat | Attack Vector | AARM Control |
| --- | --- | --- |
| Prompt Injection | User input, documents, tool outputs | Policy enforcement, context-dependent deny |
| Malicious Tool Outputs | Adversarial tool responses | Post-tool action restrictions, context tracking |
| Confused Deputy | Ambiguous/malicious instructions | Step-up approval, intent alignment check |
| Over-Privileged Credentials | Excessive token scopes | Least-privilege, scoped credentials |
| Data Exfiltration | Action composition | Context accumulation, compositional policies |
| Goal Hijacking | Injected objectives | Action-level policy, semantic distance |
| Intent Drift | Agent reasoning divergence | Context accumulation, semantic distance, deferral |
| Memory Poisoning | Persistent context manipulation | Provenance tracking, anomaly detection |
| Cross-Agent Propagation | Multi-agent delegation | Cross-agent context tracking, transitive trust limits |
| Side-Channel Leakage | Logs, debug traces, API metadata | Output filtering, contextual sensitivity scoring |
| Environmental Manipulation | Modified system/environment state | Input provenance tracking, anomaly detection |
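
Several of these controls are compositional: the verdict depends on what has already entered the session, not only on the current action. A minimal sketch of a context-dependent deny, assuming a simple session context and hypothetical tool names (`send_email` as the outbound action):

```python
class SessionContext:
    """Tracks facts accumulated over the session that change later decisions."""
    def __init__(self) -> None:
        self.read_untrusted = False  # set once untrusted content enters the context
        self.read_sensitive = False  # set once sensitive data enters the context

def allow(action: str, ctx: SessionContext) -> bool:
    """Compositional policy: an outbound action that is fine in isolation is
    denied once the context holds both untrusted and sensitive content,
    because that composition is the data-exfiltration pattern."""
    if action == "send_email" and ctx.read_untrusted and ctx.read_sensitive:
        return False
    return True

ctx = SessionContext()
ctx.read_untrusted = True        # e.g. a web page with injected instructions
print(allow("send_email", ctx))  # True: nothing sensitive in context yet
ctx.read_sensitive = True        # e.g. a customer record returned by a tool
print(allow("send_email", ctx))  # False: exfiltration-shaped composition
```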

Attack Lifecycle

Attacks against AI agents typically follow four stages:
  1. Injection — Attacker embeds malicious instructions in content the agent will process
  2. Hijacking — Agent interprets malicious content as legitimate instructions (unobservable)
  3. Execution — Agent invokes tools with attacker-controlled parameters using legitimate credentials
  4. Impact — Actions produce effects in external systems, often irreversible
AARM intervenes between stages 2 and 3, after the agent decides to act but before the action executes. This is the only reliable intervention point.

Intent drift follows a different lifecycle with no explicit injection: the agent's reasoning gradually diverges on its own. AARM's context accumulation and semantic distance tracking detect divergence regardless of cause.
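
A sketch of that intervention point, assuming a generic agent loop: the agent's decision (stage 2) yields a proposed tool call, and a gate runs before execution (stage 3), denying outright or deferring to a human approver when step-up is required. Class and function names here are illustrative only.

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"
    STEP_UP = "step_up"  # require human approval before executing

def gate(tool: str, params: dict) -> Verdict:
    """Policy decision made after the agent commits to an action,
    before any side effect occurs in an external system."""
    if tool in {"wire_transfer", "delete_records"}:
        return Verdict.STEP_UP
    if tool == "shell_exec":
        return Verdict.DENY
    return Verdict.ALLOW

def execute_with_gate(tool: str, params: dict, run_tool, approve) -> str:
    """The intervention point: between the agent's decision and execution."""
    verdict = gate(tool, params)
    if verdict is Verdict.DENY:
        return "blocked"
    if verdict is Verdict.STEP_UP and not approve(tool, params):
        return "rejected by approver"
    return run_tool(tool, params)

# Example wiring with stub tool runner and approver.
result = execute_with_gate(
    "wire_transfer", {"amount": 10_000},
    run_tool=lambda t, p: f"executed {t}",
    approve=lambda t, p: False,  # approver declines the step-up request
)
print(result)  # rejected by approver
```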

Trust Assumptions

| Trust Level | Components |
| --- | --- |
| Trusted | AARM system, cryptographic primitives, policy store, underlying infrastructure |
| Untrusted | AI model, orchestration layer, user inputs, tool outputs, documents, agent memory |
| Partially Trusted | Tool implementations (AARM constrains invocation, not internal behavior), human approvers (subject to social engineering) |
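
One way these trust levels become operational is provenance tagging: every item entering the agent's context carries its origin and trust level, so downstream policies can key off provenance rather than content alone. A minimal sketch with hypothetical names:

```python
from dataclasses import dataclass
from enum import Enum

class Trust(Enum):
    TRUSTED = "trusted"            # e.g. policy store, AARM's own state
    PARTIALLY_TRUSTED = "partial"  # e.g. tool implementations, human approvers
    UNTRUSTED = "untrusted"        # model output, user input, documents, memory

@dataclass
class ContextItem:
    """Content entering the agent's context, tagged with its origin."""
    source: str
    trust: Trust
    text: str

def context_contains_untrusted(items: list[ContextItem]) -> bool:
    """Lets later policy decisions depend on provenance, not content alone."""
    return any(i.trust is Trust.UNTRUSTED for i in items)

items = [
    ContextItem("user_prompt", Trust.UNTRUSTED, "summarize this ticket"),
    ContextItem("tool:crm_lookup", Trust.PARTIALLY_TRUSTED, "customer record ..."),
]
print(context_contains_untrusted(items))  # True
```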

Out of Scope

AARM addresses runtime action security. These threats require complementary controls:
  • Model training data poisoning — requires ML security and supply chain controls
  • Denial of service against AARM — requires infrastructure redundancy
  • Physical/infrastructure attacks — requires physical security
  • Social engineering of approvers — requires security awareness training
  • Vulnerabilities within tools — requires secure development practices
  • Memory storage security — requires separate storage-level controls