title: “Prompt Injection” description: “Adversaries craft inputs that override agent instructions, causing execution of unauthorized actions with legitimate credentials.”
Overview
Prompt injection is the foundational attack against AI agents. Adversaries embed instructions in content the agent processes, hijacking its behavior to execute attacker-controlled actions. Unlike traditional injection attacks (SQL, XSS) that exploit parsing failures, prompt injection exploits the model’s inability to distinguish instructions from data.Attack Variants
Direct Injection
Malicious instructions in user input:Indirect Injection
Malicious instructions embedded in data the agent processes—documents, emails, web pages, images, or tool outputs:- Document
- Email
- Web Page
- Tool Output
Real-World Examples
| Incident | Vector | Impact |
|---|---|---|
| Bing Chat data exfiltration (2023) | Hidden text in web pages | Agent leaked conversation history |
| Google Bard prompt leak | Crafted document | System prompt extracted |
| GitHub Copilot instruction override | Code comments | Generated malicious code |
| ChatGPT plugin attacks | Malicious plugin responses | Cross-plugin data theft |
Why Model-Level Defenses Are Insufficient
| Defense | Limitation |
|---|---|
| Instruction tuning | Can be bypassed with novel phrasings |
| Input filtering | Cannot catch semantic attacks without blocking legitimate content |
| Output filtering | Too late—action may already be queued |
| Prompt hardening | Reduces but doesn’t eliminate risk; increases brittleness |
AARM Mitigations
Policy Enforcement
Block actions regardless of how the agent was instructed:Parameter Constraints
Restrict what values are permissible:Step-Up Authorization
Require human approval for high-risk actions, even if the agent believes they’re legitimate:Action Receipts
Maintain forensic trail showing what was requested, what was decided, and why:Detection Signals
AARM can incorporate injection detection as a risk signal in policy evaluation:| Signal | Description |
|---|---|
injection_score | ML classifier confidence that input contains injection attempt |
instruction_anomaly | Deviation from typical action patterns for this user/session |
source_trust | Trust level of content source (user input vs. retrieved document) |
Key Takeaway
Prompt injection is not a bug to be fixed—it’s a fundamental property of how language models process text. AARM accepts that injection will occur and enforces security at the action boundary where it can be reliably controlled.
References
- Greshake et al. (2023). “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection”
- Perez et al. (2023). “Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs”
- Liu et al. (2024). “Formalizing and Benchmarking Prompt Injection Attacks and Defenses”
- OWASP. “LLM01: Prompt Injection”
Next
Confused Deputy
How agents are manipulated into unauthorized actions
Policy Engine
How AARM evaluates and enforces policies