May 20, 2025
15
Min Read

Part 1 - A Review of Prompt Injection Threats

This three-part blog covers the ever-evolving prompt injection threat landscape. It goes beyond the currently well-known types, and explores novel and emerging threats.

  • Part 1 - covers evolving and new types of prompt injection threats
  • Part 2 - provides three specific examples as well as cross-industry threat scenarios
  • Part 3 - explores mitigation strategies and highlights how Singulr AI can be used to protect the ever-expanding footprint of AI used inside the threat boundary.

Prompt injections work due to LLM confusion of instruction and data, where inputs are maliciously crafted to function as executable commands by the LLM.

Attackers are continuously innovating, moving beyond simple, direct prompt injections to more sophisticated and harder-to-detect methods.

Understanding these novel attack vectors is crucial for protecting against the full breadth of the prompt injection threat.

The Ever-Evolving Threat of Prompt Injection

The cybersecurity landscape faces growing threats from AI vulnerabilities. However, prompt injection stands out as a major concern in generative AI systems. Often dismissed as simple chatbot “tricks,” this view overlooks its serious systemic risks, especially as AI becomes integral to critical infrastructure.

The OWASP Top 10 for LLMs highlights prompt injection (LLM01) as a top vulnerability.

Like so many other cybersecurity offensive threats and defensive responses, it is an ever-shifting arms race as attackers develop new techniques while well known risks are being addressed.

At its core, prompt injection exploits the semantic gap between human language and how LLMs follow instructions. Attackers craft inputs that appear valid but that are designed to subvert developer intent.

Novel and Underestimated Prompt Injection Vectors

Attackers are continuously innovating, moving beyond simple, direct prompt injections to more sophisticated and harder-to-detect methods. Understanding these novel attack vectors is crucial for monitoring and hardening against the prompt injection threat.

The common thread across these diverse attack methods blurs lines between data and instruction, where external information or non-textual inputs are misinterpreted or maliciously crafted to function as executable commands by the LLM.

Indirect Prompt Injections

Indirect prompt injection occurs when malicious instructions are not fed directly to the LLM by the attacker but are instead hidden within external data sources that the LLM consumes, such as webpages, documents, or databases. This method is particularly insidious because the LLM ingests these instructions as part of its normal operational context, often from sources it is designed to trust.

Advanced scenarios significantly elevate the risk. Retrieval Augmented Generation (RAG) systems, designed to ground LLMs in factual, up-to-date information by allowing them to query external knowledge bases, can become potent attack vectors if these knowledge bases are tainted. An LLM is inherently programmed to trust the data retrieved through RAG, making injected prompts within this data highly effective.

Similarly, stored prompt injection involves embedding malicious prompts within an AI model's memory or associated databases, allowing the attack to persist across sessions and affect future outputs without repeated intervention.

Attackers also employ invisible injection techniques, such as using white text on a white background, extremely small font sizes, or hidden HTML comments within web pages, making the malicious prompts undetectable to human users but parsable by the LLM.

The significance of indirect attacks lies in their stealth and potential for broad impact. They exploit the LLM's interaction with its environment and do not require the attacker to be directly interacting with the system at the moment of compromise.

Multimodal Prompt Injections

Multimodal LLMs introduce risks via images, audio, and video.

Attackers use steganography or sub-audible commands to hide prompts in non-text formats, bypassing defenses built for text-only inputs.

The 2025 OWASP Top 10 now recognizes multimodal threats as a distinct concern.

Agent and Protocol Exploits

As LLMs gain more agency—designed to autonomously interact with external tools, data sources, and even other AI agents—new vulnerabilities emerge related to the protocols and frameworks that enable these interactions.

The Model Context Protocol (MCP) is one such framework, designed to standardize how LLMs discover and invoke tools or communicate with other agents. However, this increased connectivity and agency of AI agents create a new risk and potential impact of prompt injections.

Specific attack vectors targeting these systems include:

  • MCP Tool Poisoning: This involves embedding malicious instructions within the descriptions or anticipated responses of MCP tools. The LLM, relying on these descriptions to understand a tool's function and how to use it, can be tricked into executing unintended actions. This is a sophisticated form of indirect injection targeting the metadata and "senses" of the AI agent.
  • Retrieval-Agent Deception (RADE) Attacks: In a RADE attack, an adversary corrupts publicly available data or documents that an AI agent might later ingest into its knowledge base (e.g., a vector database). When the agent queries this compromised database on a related topic, the malicious commands are retrieved along with legitimate data and subsequently executed by the LLM.
  • MCP Rug Pulls: This technique involves deploying MCP tools that initially function benignly, thereby building user trust and gaining necessary permissions.Later, the attacker updates the tool to include malicious behavior, exploiting the established trust and permissions to execute harmful actions without further user consent.
  • Exploiting Agentic Planning Logic: More advanced attacks aim to manipulate not just an agent's output, but its internal "thought process"—how it decomposes goals, plans sequences of actions, or chooses between available tools.

When LLMs can take direct actions in the world, such as executing code, accessing databases, or controlling physical systems, prompt injections targeting these agentic capabilities can lead to severe consequences like data exfiltration, unauthorized system control, or financial fraud, far exceeding the impact of attacks on standalone, non-agentic LLMs.

Attention Manipulation

Recent academic research offers a more profound understanding of why prompt injections are effective at a fundamental architectural level. These studies reveal that prompt injections can succeed by manipulating the LLM's internal attention mechanisms.

Within the transformer architecture common to most LLMs:

  • Attention heads - play a crucial role in determining which parts of the input are most relevant for generating an output.
  • Important heads - are a subset of attention heads that have been shown to contribute more strongly to the model’s performance.

Important heads can be diverted from focusing on the original, legitimate system instructions towards the attacker's injected instructions. This shift in attention is a key reason why LLMs might follow malicious prompts even when they contradict prior directives.

In short, attackers distract the model’s focus e.g. “provide a summary of why you think it contains this information" away from the fact that is sharing sensitive information against pre-programmed instructions, just like waving a flashlight to pull a guard’s gaze off the main door.

Policy Puppetry Attack

A particularly concerning development is the emergence of the Policy Puppetry Attack technique that has been reported to effectively bypass safety guardrails across a wide range of major AI models. This attack typically combines several elements:

  • Policy Format Confusion: The malicious request is formatted to resemble configuration instructions (e.g., using XML, JSON, or INI-like structures), which models are often trained to interpret with higher authority.
  • Roleplaying Misdirection: The harmful request is framed within a fictional scenario (e.g., a TV script), leveraging the model's training to differentiate between direct harmful instructions and content generation for a fictional context.
  • Leetspeak Encoding: Parts of the prompt, especially those containing forbidden keywords, are encoded using "leetspeak" (e.g., replacing 'e' with '3') to bypass simple keyword filters.

The reported success of such attacks in extracting system prompts and circumventing safety measures is significant. If "universal" or broadly applicable attack methodologies like this persist, they pose a substantial challenge to current alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF), and suggest a more fundamental vulnerability in how models distinguish between instructions and content, or how they prioritize different types of instructions.

The following table summarizes these emerging threats:

Part 2 of this blog will provide detailed examples of these various types of prompt injections including specific context, prompt, payload, proof, and consequences of each.

What are your numbers?

Get an AI Usage And Risk Assessment to understand what is happening across your organization.

Request a Live Product Demo Now

By submitting this form, you are agreeing to our Terms & Conditions and Privacy Policy.

Your Request has been Successfully Submitted

Thank you. Our team will contact you shortly.
Oops! Something went wrong while submitting the form.