Adaptive Multi‑Layer Framework for Detecting and Mitigating Prompt Injection Attacks in Large Language Models

October 28, 2025

Background: Prompt injection attacks exploit the instruction-following nature of fine-tuned large language models (LLMs), causing them to execute unintended or malicious commands. This vulnerability exposes the limitations of traditional defenses such as static filters, keyword blocklists, and multi-LLM cross-checks, which either lack semantic understanding or incur high latency and operational overhead.

Objective: This study aimed to develop and evaluate a lightweight, adaptive framework capable of detecting and neutralizing prompt injection attacks in real time.

Methods: The Prompt-Shield Framework (PSF) was developed around a locally hosted Llama 3.2 API. The framework integrates three modules, namely Context-Aware Parsing (CAP), Output Validation (OV), and a Self-Feedback Loop (SFL), which pre-filter inputs, validate outputs, and iteratively refine detection rules, respectively. Five scenarios were tested: baseline (no defenses), CAP only, OV only, CAP+OV, and CAP+OV+SFL. Evaluation was performed on a near-balanced dataset of 1,405 adversarial and 1,500 benign prompts, with classification performance measured through confusion matrices, precision, recall, and accuracy.
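
To make the pipeline concrete, the sketch below illustrates one way the three modules could be wired together around a locally hosted Llama 3.2 endpoint. The module structure (CAP pre-filtering, OV checking the response, SFL refining the rule set) follows the description above, but every function name, heuristic, and the endpoint URL is an illustrative assumption rather than the authors' implementation.

```python
import requests

LLAMA_URL = "http://localhost:11434/api/generate"  # assumed local Llama 3.2 endpoint

def cap_filter(prompt: str, rules: list[str]) -> bool:
    """Context-Aware Parsing (CAP): flag prompts matching learned injection patterns."""
    lowered = prompt.lower()
    return any(rule in lowered for rule in rules)

def query_llm(prompt: str) -> str:
    """Send the prompt to the locally hosted model (endpoint and schema are assumptions)."""
    resp = requests.post(LLAMA_URL, json={"model": "llama3.2", "prompt": prompt, "stream": False})
    return resp.json().get("response", "")

def ov_check(output: str, rules: list[str]) -> bool:
    """Output Validation (OV): flag responses that appear to follow injected instructions."""
    lowered = output.lower()
    return any(rule in lowered for rule in rules)

def sfl_update(rules: list[str], missed_prompt: str) -> list[str]:
    """Self-Feedback Loop (SFL): distill a new rule from a prompt that slipped through.
    A crude phrase extraction stands in for the actual refinement step."""
    phrase = missed_prompt.lower().strip()[:40]
    return rules + [phrase] if phrase and phrase not in rules else rules

def psf_handle(prompt: str, rules: list[str]) -> str:
    """End-to-end handling: CAP pre-filters the input, OV validates the model output."""
    if cap_filter(prompt, rules):
        return "[blocked by CAP]"
    output = query_llm(prompt)
    if ov_check(output, rules):
        return "[blocked by OV]"
    return output
```

In an evaluation loop of the kind described in the Results, adversarial prompts misclassified in one epoch would be passed to a refinement step like sfl_update, and the updated rules would then feed back into both cap_filter and ov_check in the next epoch.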

Results: The baseline achieved 63.06% accuracy (precision = 0.678; recall = 0.450), while OV only improved performance to 79.28% (precision = 0.796; recall = 0.768). CAP only reached 84.68% accuracy (precision = 0.891; recall = 0.779), and CAP+OV yielded 95.25% accuracy (precision = 0.938; recall = 0.966). Finally, integrating the SFL over 10 epochs further improved performance to 97.83% accuracy (precision = 0.980; recall = 0.975) and reduced the false-negative count from 48 (CAP+OV) to 35 (CAP+OV+SFL).
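
As a quick arithmetic check, the snippet below reproduces the CAP+OV+SFL figures from the standard confusion-matrix definitions. The 35 false negatives and the dataset sizes are taken from the text above; the false-positive count (about 28) is inferred from the reported precision and is not stated in the abstract.

```python
# Consistency check of the reported CAP+OV+SFL metrics against the
# confusion-matrix definitions. TP and FN follow from the 1,405 adversarial
# prompts and the 35 reported false negatives; FP (~28) is inferred from
# the reported precision of 0.980 and is an assumption, not a reported value.
adversarial, benign = 1405, 1500
fn = 35                      # reported false negatives (CAP+OV+SFL)
tp = adversarial - fn        # 1,370 adversarial prompts correctly flagged
fp = 28                      # inferred so that precision is roughly 0.980
tn = benign - fp             # remaining benign prompts correctly passed

precision = tp / (tp + fp)                       # ~0.980
recall    = tp / (tp + fn)                       # ~0.975
accuracy  = (tp + tn) / (adversarial + benign)   # ~0.9783

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.4f}")
```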

Conclusion: The results demonstrate the value of combining multiple defenses, namely contextual understanding, output validation, and adaptive learning, for efficient prompt injection mitigation, indicating that the PSF is an effective solution for protecting LLMs against evolving threats. Further studies should refine the adaptive thresholds in CAP and OV, particularly in multilingual or highly specialized environments, and examine alternative SFL designs for better efficiency.

Keywords: Prompt Injection, LLM Security, Jailbreak, Natural Language Processing