Fundamental Assumption

AARM operates on a core premise: the AI orchestration layer cannot be trusted as a security boundary. The model processes untrusted inputs through opaque reasoning, producing actions that may serve attacker goals rather than user intent.
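
To make the premise concrete, here is a minimal sketch of what it implies for enforcement: decisions are made outside the model, on the proposed action alone, with deny-by-default semantics. The names here (`ActionRequest`, `enforce`, the tool names) are illustrative assumptions, not part of any AARM API.

```python
from dataclasses import dataclass

@dataclass
class ActionRequest:
    """An action proposed by the agent; never trusted on its own."""
    tool: str
    params: dict

# Allow-list evaluated outside the model. The model's reasoning is opaque,
# so the decision depends only on the proposed action, not on its rationale.
ALLOWED_TOOLS = {"search_docs", "read_ticket"}

def enforce(request: ActionRequest) -> bool:
    """Deny by default: only explicitly allowed tools may execute."""
    return request.tool in ALLOWED_TOOLS

# The agent may propose anything; enforcement does not assume good intent.
print(enforce(ActionRequest("read_ticket", {"id": 42})))    # True
print(enforce(ActionRequest("delete_repo", {"name": "x"})))  # False
```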

Threat Summary

| Threat | Attack Vector | AARM Control |
| --- | --- | --- |
| Prompt Injection | User input, documents, tool outputs | Policy enforcement, context-dependent deny |
| Malicious Tool Outputs | Adversarial tool responses | Post-tool action restrictions, context tracking |
| Confused Deputy | Ambiguous/malicious instructions | Step-up approval, intent alignment check |
| Over-Privileged Credentials | Excessive token scopes | Least-privilege, scoped credentials |
| Data Exfiltration | Action composition | Context accumulation, compositional policies |
| Goal Hijacking | Injected objectives | Action-level policy, semantic distance |
| Intent Drift | Agent reasoning divergence | Context accumulation, semantic distance, deferral |
| Memory Poisoning | Persistent context manipulation | Provenance tracking, anomaly detection |
| Cross-Agent Propagation | Multi-agent delegation | Cross-agent context tracking, transitive trust limits |
| Side-Channel Leakage | Logs, debug traces, API metadata | Output filtering, contextual sensitivity scoring |
| Environmental Manipulation | Modified system/environment state | Input provenance tracking, anomaly detection |
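
Several of these controls are compositional: the verdict depends on what has already entered the session, not only on the current action. A minimal sketch of a context-dependent deny, assuming a simple session context and hypothetical tool names (`send_email` as the outbound action):

```python
class SessionContext:
    """Tracks facts accumulated over the session that change later decisions."""
    def __init__(self) -> None:
        self.read_untrusted = False  # set once untrusted content enters the context
        self.read_sensitive = False  # set once sensitive data enters the context

def allow(action: str, ctx: SessionContext) -> bool:
    """Compositional policy: an outbound action that is fine in isolation is
    denied once the context holds both untrusted and sensitive content,
    because that composition is the data-exfiltration pattern."""
    if action == "send_email" and ctx.read_untrusted and ctx.read_sensitive:
        return False
    return True

ctx = SessionContext()
ctx.read_untrusted = True        # e.g. a web page with injected instructions
print(allow("send_email", ctx))  # True: nothing sensitive in context yet
ctx.read_sensitive = True        # e.g. a customer record returned by a tool
print(allow("send_email", ctx))  # False: exfiltration-shaped composition
```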

Attack Lifecycle

Attacks against AI agents typically follow four stages:
  1. Injection — Attacker embeds malicious instructions in content the agent will process
  2. Hijacking — Agent interprets malicious content as legitimate instructions (unobservable)
  3. Execution — Agent invokes tools with attacker-controlled parameters using legitimate credentials
  4. Impact — Actions produce effects in external systems, often irreversible
AARM intervenes between stages 2 and 3, after the agent decides to act but before the action executes. This is the only reliable intervention point.

Intent drift follows a different lifecycle with no explicit injection: the agent's reasoning gradually diverges on its own. AARM's context accumulation and semantic distance tracking detect divergence regardless of cause.
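
A sketch of that intervention point, assuming a generic agent loop: the agent's decision (stage 2) yields a proposed tool call, and a gate runs before execution (stage 3), denying outright or deferring to a human approver when step-up is required. Class and function names here are illustrative only.

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"
    STEP_UP = "step_up"  # require human approval before executing

def gate(tool: str, params: dict) -> Verdict:
    """Policy decision made after the agent commits to an action,
    before any side effect occurs in an external system."""
    if tool in {"wire_transfer", "delete_records"}:
        return Verdict.STEP_UP
    if tool == "shell_exec":
        return Verdict.DENY
    return Verdict.ALLOW

def execute_with_gate(tool: str, params: dict, run_tool, approve) -> str:
    """The intervention point: between the agent's decision and execution."""
    verdict = gate(tool, params)
    if verdict is Verdict.DENY:
        return "blocked"
    if verdict is Verdict.STEP_UP and not approve(tool, params):
        return "rejected by approver"
    return run_tool(tool, params)

# Example wiring with stub tool runner and approver.
result = execute_with_gate(
    "wire_transfer", {"amount": 10_000},
    run_tool=lambda t, p: f"executed {t}",
    approve=lambda t, p: False,  # approver declines the step-up request
)
print(result)  # rejected by approver
```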

Trust Assumptions

| Trust Level | Components |
| --- | --- |
| Trusted | AARM system, cryptographic primitives, policy store, underlying infrastructure |
| Untrusted | AI model, orchestration layer, user inputs, tool outputs, documents, agent memory |
| Partially Trusted | Tool implementations (AARM constrains invocation, not internal behavior), human approvers (subject to social engineering) |
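
One way these trust levels become operational is provenance tagging: every item entering the agent's context carries its origin and trust level, so downstream policies can key off provenance rather than content alone. A minimal sketch with hypothetical names:

```python
from dataclasses import dataclass
from enum import Enum

class Trust(Enum):
    TRUSTED = "trusted"            # e.g. policy store, AARM's own state
    PARTIALLY_TRUSTED = "partial"  # e.g. tool implementations, human approvers
    UNTRUSTED = "untrusted"        # model output, user input, documents, memory

@dataclass
class ContextItem:
    """Content entering the agent's context, tagged with its origin."""
    source: str
    trust: Trust
    text: str

def context_contains_untrusted(items: list[ContextItem]) -> bool:
    """Lets later policy decisions depend on provenance, not content alone."""
    return any(i.trust is Trust.UNTRUSTED for i in items)

items = [
    ContextItem("user_prompt", Trust.UNTRUSTED, "summarize this ticket"),
    ContextItem("tool:crm_lookup", Trust.PARTIALLY_TRUSTED, "customer record ..."),
]
print(context_contains_untrusted(items))  # True
```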

Out of Scope

AARM addresses runtime action security. These threats require complementary controls:
  • Model training data poisoning — requires ML security and supply chain controls
  • Denial of service against AARM — requires infrastructure redundancy
  • Physical/infrastructure attacks — requires physical security
  • Social engineering of approvers — requires security awareness training
  • Vulnerabilities within tools — requires secure development practices
  • Memory storage security — requires separate storage-level controls