Emojis used to hide attacks & bypass major AI guardrails

Yesterday

Research from Mindgard has revealed that malicious payloads can be hidden inside emojis to circumvent AI guardrails, including those from Microsoft, Nvidia, Meta, and Protect AI.

The method, referred to as "emoji smuggling," was found to bypass leading large language model (LLM) security systems with an attack success rate of up to 100%, according to Mindgard's findings. This approach involves embedding hidden payloads within emojis, which current AI guardrails fail to detect.

Mindgard's investigation examined six of the most widely deployed guardrail systems, all of which enterprises increasingly rely on to enforce AI security protocols in customer-facing applications, decision support platforms, and internal operations. These systems typically function as AI-powered classifiers, trained to spot toxic content, prompt injection attempts, and injection payloads. However, the research identified a significant vulnerability: many of these tools depend heavily on static pattern recognition and demonstrated insufficient resilience against adversarial attacks.

The research employed low-complexity adversarial techniques, notably "emoji smuggling" and character-level perturbations. Tactics such as zero-width space insertions, homoglyph substitutions, and embedding emojis were used as part of the Character Injection methods. Additionally, the Adversarial ML Evasion approach modified prompt wording iteratively while preserving the original intent. Both methods succeeded in consistently evading the detection mechanisms, with attack success rates reaching 100% in some cases.

One of the most significant findings involved the manipulation of Unicode tags within emojis, enabling the embedding of malicious instructions that remain invisible to human reviewers but are processed by the target LLM. Mindgard's study is the first to document this technique at scale against LLM guardrails.

The research results brought to light notable inconsistencies in the effectiveness of current AI security strategies. Prompt injection attack success rates exceeded 70% across almost all tested guardrail systems. Specifically, the Vijil Prompt Injection Detector had an attack success rate of 87.95% for prompt injection and 91.67% for jailbreaks. Protect AI v1 followed with 77.32% and 51.39% respectively. Microsoft's Azure Prompt Shield, a core component for AI services on the Azure platform, recorded a 71.98% success rate for prompt injection and 60.15% for jailbreaks. Nvidia and Meta products showed similar levels of exposure, while Protect AI v2 showed some improvement but still had a 20.26% attack success rate for prompt injections.

Despite Mindgard's responsible disclosure of these findings to the respective vendors, none of the attacks received Common Vulnerabilities and Exposures (CVE) identifiers. Mindgard noted that this reflects the nascent state of established standards for identifying and reporting AI-specific vulnerabilities. While the vulnerabilities do not immediately facilitate remote code execution or data exfiltration, Mindgard asserts they undermine the effectiveness of AI guardrail systems where robust operation is essential for organisations adopting generative AI at scale.

Dr Peter Garraghan, Chief Executive Officer and co-founder of Mindgard, stated: "While AI guardrails are marketed as robust defences against adversarial prompts, our research demonstrates that these systems often operate on fragile assumptions about input sanitization and classifier reliability. The industry must move beyond static detection models toward continuous, runtime testing and adaptive defenses that reflect the dynamic threat landscape of AI."

The results call attention to the importance for security leaders and AI engineers to reassess the resilience and reliability of their defensive measures. As LLMs continue to be incorporated into a variety of enterprise and consumer-facing environments, failing to rigorously test and verify guardrail protections can introduce significant risk to data security, brand reputation, and regulatory compliance.

Share on: