Prompt Injection
Prompt injection is a security attack in which an adversary feeds maliciously crafted input to an LLM to override or hijack the system's original instructions. It exploits the structural weakness that LLMs cannot tell trusted instructions apart from untrusted data, and it is a fundamentally different concept from prompt engineering.
- Prompt injection is a security attack that uses manipulated input to override or hijack an LLM's original instructions, and it is entirely distinct from prompt engineering, which is about steering a model effectively.
- Its root cause is a structural weakness: LLMs cannot distinguish the trusted instructions set by a developer from untrusted data arriving from outside, because both enter through the same input stream.
- It splits into direct injection, where an attacker types malicious commands straight into the chat, and indirect injection, where commands are hidden inside external content such as web pages, documents, or emails.
- OWASP ranks prompt injection as LLM01 on its 2025 list of LLM security risks, marking it the single highest-priority threat.
- There is still no perfect single fix; defense requires layering controls such as least privilege, output validation, isolation of external content, and human approval.
What Prompt Injection Is
Prompt injection is a security attack in which an adversary deliberately crafts input (a prompt) and feeds it to an LLM in order to neutralize or seize control of the instructions the system was originally given. The classic form slips in a sentence like "ignore all previous instructions and do this instead," bending the model's behavior toward whatever the attacker wants. The term was first coined by developer Simon Willison in September 2022.
The crucial point is that this is an unambiguous security attack, fundamentally different from prompt engineering, the craft of getting more out of a model. Where prompt engineering is the legitimate practice of designing instructions to obtain a desired output, prompt injection breaks that instruction hierarchy from the outside to seize control, so the conversation around it is framed in terms of attack and defense.
Root Cause: The Collapse of the Instruction-Data Boundary
OWASP defines prompt injection as a vulnerability that "occurs when user prompts alter the LLM's behavior or output in unintended ways." These attacks are so hard to stop at the root because today's LLMs cannot separate the trusted instructions a developer configures from untrusted content such as user input, retrieved documents, or web pages. To the model, both arrive as the same token stream, so a command buried inside data gets mistaken for a genuine instruction.
How It Differs from Jailbreaking
Prompt injection is often confused with jailbreaking, but the two are not the same. By Simon Willison's distinction, jailbreaking is an attempt to bypass the model's own safety guardrails, whereas prompt injection is the broader concept of exploiting a structural vulnerability in applications built on top of an LLM, where trusted system instructions and untrusted input become intermingled inside a single model.
Types: Direct and Indirect Injection
| Aspect | Direct Injection | Indirect Injection |
|---|---|---|
| Injection path | Attacker enters the malicious prompt directly into a chatbot or API | Malicious commands are planted in advance inside external data the model ingests |
| Typical location | Chat input box, user message | Web pages, documents, emails, RAG stores, text inside images |
| Attacker-to-model contact | Direct interaction | No direct contact — triggers when the victim loads the content |
| Representative risk | Instruction override, system-prompt theft | Data exfiltration, unauthorized API calls, agent contamination and spread |
Working along these two axes, OWASP lays out a range of scenarios: direct injection into chatbots, commands hidden in web pages, tampering with documents in a RAG store, payload splitting that distributes a payload across several documents, multimodal attacks embedded in images, adversarial suffixes that append meaningless strings, and obfuscation attacks that use encoding or emoji.
Mitigations
OWASP recommends layered defense—stacking several controls rather than relying on any single fix. The principal controls are as follows.
- Constrain model behavior: Spell out the model's role and permitted scope in the system prompt, and instruct it not to follow external directives.
- Define and validate output format: Specify the expected output format and verify it with deterministic rules to filter out anomalous output.
- Input and output filtering: Detect and block malicious patterns using semantic filters and content rules.
- Least privilege: Apply the principle of least privilege and use separate, function-scoped API tokens to limit the blast radius.
- Human approval: Require a human confirmation step for high-risk actions such as sending mail, making payments, or deleting data.
- Isolate external content: Structurally separate the influence that untrusted external data can have on the user prompt.
- Adversarial testing: Continuously probe your defenses with penetration tests and breach simulations.
The Lethal Trifecta
In June 2025, Simon Willison distilled the conditions under which indirect prompt injection turns into real-world harm into what he calls the "lethal trifecta." When access to private data, exposure to untrusted content, and the ability to communicate externally all coexist within a single agent, one piece of poisoned content is enough to leak sensitive information to the outside. So when applying the controls above, it is especially important to design things so that these three never overlap in one place.
Evidence and Real-World Cases
Prompt injection is not merely theoretical; it has been demonstrated against live commercial services. The paper by Greshake et al., "Not what you've signed up for" (arXiv:2302.12173, AISec '23), which systematized indirect prompt injection academically, points out that LLM-integrated applications blur the line between data and instructions, and it demonstrated working attacks against the then-GPT-4-based Bing Chat and against code autocompletion, among others. The paper laid the foundation for indirect-injection research by presenting data theft, worm-like spread, ecosystem contamination, and unauthorized API calls as a threat taxonomy.
These risks have grown more pronounced as agentic AI spreads. On its 2025 list of security risks for LLM applications, OWASP designated prompt injection as LLM01, the highest-priority threat. And in January 2026, a string of indirect prompt injection vulnerabilities exploiting the same "lethal trifecta" were disclosed across several AI productivity tools, showing that this is not a flaw in any one product but a structural challenge across LLM-integrated systems as a whole.