This three-part blog covers the ever-evolving prompt injection threat landscape. It goes beyond the currently well-known types, and explores novel and emerging threats.
Prompt injections work due to LLM confusion of instruction and data, where inputs are maliciously crafted to function as executable commands by the LLM.
Attackers are continuously innovating, moving beyond simple, direct prompt injections to more sophisticated and harder-to-detect methods.
Understanding these novel attack vectors is crucial for protecting against the full breadth of the prompt injection threat.
The cybersecurity landscape faces growing threats from AI vulnerabilities. However, prompt injection stands out as a major concern in generative AI systems. Often dismissed as simple chatbot “tricks,” this view overlooks its serious systemic risks, especially as AI becomes integral to critical infrastructure.
The OWASP Top 10 for LLMs highlights prompt injection (LLM01) as a top vulnerability.
Like so many other cybersecurity offensive threats and defensive responses, it is an ever-shifting arms race as attackers develop new techniques while well known risks are being addressed.
At its core, prompt injection exploits the semantic gap between human language and how LLMs follow instructions. Attackers craft inputs that appear valid but that are designed to subvert developer intent.
Attackers are continuously innovating, moving beyond simple, direct prompt injections to more sophisticated and harder-to-detect methods. Understanding these novel attack vectors is crucial for monitoring and hardening against the prompt injection threat.
The common thread across these diverse attack methods blurs lines between data and instruction, where external information or non-textual inputs are misinterpreted or maliciously crafted to function as executable commands by the LLM.
Indirect prompt injection occurs when malicious instructions are not fed directly to the LLM by the attacker but are instead hidden within external data sources that the LLM consumes, such as webpages, documents, or databases. This method is particularly insidious because the LLM ingests these instructions as part of its normal operational context, often from sources it is designed to trust.
Advanced scenarios significantly elevate the risk. Retrieval Augmented Generation (RAG) systems, designed to ground LLMs in factual, up-to-date information by allowing them to query external knowledge bases, can become potent attack vectors if these knowledge bases are tainted. An LLM is inherently programmed to trust the data retrieved through RAG, making injected prompts within this data highly effective.
Similarly, stored prompt injection involves embedding malicious prompts within an AI model's memory or associated databases, allowing the attack to persist across sessions and affect future outputs without repeated intervention.
Attackers also employ invisible injection techniques, such as using white text on a white background, extremely small font sizes, or hidden HTML comments within web pages, making the malicious prompts undetectable to human users but parsable by the LLM.
The significance of indirect attacks lies in their stealth and potential for broad impact. They exploit the LLM's interaction with its environment and do not require the attacker to be directly interacting with the system at the moment of compromise.
Multimodal LLMs introduce risks via images, audio, and video.
Attackers use steganography or sub-audible commands to hide prompts in non-text formats, bypassing defenses built for text-only inputs.
The 2025 OWASP Top 10 now recognizes multimodal threats as a distinct concern.
As LLMs gain more agency—designed to autonomously interact with external tools, data sources, and even other AI agents—new vulnerabilities emerge related to the protocols and frameworks that enable these interactions.
The Model Context Protocol (MCP) is one such framework, designed to standardize how LLMs discover and invoke tools or communicate with other agents. However, this increased connectivity and agency of AI agents create a new risk and potential impact of prompt injections.
Specific attack vectors targeting these systems include:
When LLMs can take direct actions in the world, such as executing code, accessing databases, or controlling physical systems, prompt injections targeting these agentic capabilities can lead to severe consequences like data exfiltration, unauthorized system control, or financial fraud, far exceeding the impact of attacks on standalone, non-agentic LLMs.
Recent academic research offers a more profound understanding of why prompt injections are effective at a fundamental architectural level. These studies reveal that prompt injections can succeed by manipulating the LLM's internal attention mechanisms.
Within the transformer architecture common to most LLMs:
Important heads can be diverted from focusing on the original, legitimate system instructions towards the attacker's injected instructions. This shift in attention is a key reason why LLMs might follow malicious prompts even when they contradict prior directives.
In short, attackers distract the model’s focus e.g. “provide a summary of why you think it contains this information" away from the fact that is sharing sensitive information against pre-programmed instructions, just like waving a flashlight to pull a guard’s gaze off the main door.
A particularly concerning development is the emergence of the Policy Puppetry Attack technique that has been reported to effectively bypass safety guardrails across a wide range of major AI models. This attack typically combines several elements:
The reported success of such attacks in extracting system prompts and circumventing safety measures is significant. If "universal" or broadly applicable attack methodologies like this persist, they pose a substantial challenge to current alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF), and suggest a more fundamental vulnerability in how models distinguish between instructions and content, or how they prioritize different types of instructions.
The following table summarizes these emerging threats:
Part 2 of this blog will provide detailed examples of these various types of prompt injections including specific context, prompt, payload, proof, and consequences of each.