Semantic Stealth: Crafting Covert Adversarial Patches for Sentiment Classifiers Using Large Language Models
Conference · OSTI ID: 2480040 · ORNL
Deep learning models have been shown to be vulnerable to adversarial attacks, in which perturbations to their inputs cause the model to produce incorrect predictions. Unlike adversarial attacks in computer vision, where small changes to pixel values can drastically alter a model's output while remaining imperceptible to humans, text-based attacks are difficult to conceal due to the discrete nature of tokens. Consequently, unconstrained gradient-based attacks often produce adversarial examples that lack semantic meaning, rendering them detectable through visual inspection or perplexity filters. In contrast to methods that rely on gradient-based optimization in the embedding space, we propose an approach that leverages a Large Language Model's ability to generate grammatically correct and semantically meaningful text to craft adversarial patches that blend seamlessly into the original input text. These patches can be used to alter the behavior of a target model, such as a text classifier. Since our approach does not rely on gradient backpropagation, it only requires access to the target model's confidence scores, making it a grey-box attack. We demonstrate the feasibility of our approach using open-source LLMs, including Intel's Neural Chat, Llama2, and Mistral-Instruct, to generate adversarial patches capable of altering the predictions of a DistilBERT model fine-tuned on the IMDB reviews dataset for sentiment classification.
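The grey-box search described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the candidate sentences stand in for LLM generations, and a toy keyword-based scorer stands in for the fine-tuned DistilBERT classifier's confidence output; all names (`confidence_positive`, `craft_patch`) are hypothetical.

```python
def confidence_positive(text):
    """Toy stand-in for the target classifier's confidence score.

    Returns a smoothed P(positive) from keyword counts; the real attack
    would instead query the fine-tuned DistilBERT sentiment model.
    """
    pos = sum(text.lower().count(w) for w in ("great", "love", "masterpiece"))
    neg = sum(text.lower().count(w) for w in ("boring", "awful", "waste"))
    return (1 + pos) / (2 + pos + neg)  # smoothed ratio in (0, 1)


def craft_patch(review, candidate_patches):
    """Pick the fluent patch that most lowers P(positive) for the review.

    No gradients are used; only the classifier's confidence score is
    queried, matching the grey-box threat model of the abstract.
    """
    best_patch, best_conf = None, confidence_positive(review)
    for patch in candidate_patches:  # in the paper, an LLM generates these
        patched = review + " " + patch
        conf = confidence_positive(patched)
        if conf < best_conf:  # lower P(positive) flips toward "negative"
            best_patch, best_conf = patch, conf
    return best_patch, best_conf


review = "A great film, I love the lead performance."
# Fluent, on-topic sentences an LLM might propose as patches:
candidates = [
    "The cinematography was memorable throughout.",
    "Still, the pacing felt boring and the ending was an awful waste.",
]
patch, conf = craft_patch(review, candidates)
```

Because only confidence scores are consumed, the same loop works against any black-box-scoring API; the LLM's role is to keep the appended patch grammatical and on-topic so it evades visual inspection and perplexity filters.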
- Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-00OR22725
- OSTI ID: 2480040
- Country of Publication: United States
- Language: English
Similar Records
- User-directed Sentiment Analysis: Visualizing the Affective Content of Documents · Conference · 2006 · OSTI ID: 982974
- ROPE: Recoverable Order-Preserving Embedding of Natural Language · Technical Report · 2016 · OSTI ID: 1239214
- On the vulnerability of data-driven structural health monitoring models to adversarial attack · Journal Article · Structural Health Monitoring · 2020 · OSTI ID: 1630947