
- Journal of Electronic Science and Technology
- Vol. 23, Issue 1, 100301 (2025)
Abstract
1 Introduction
In today’s digital age, artificial intelligence (AI) technology has emerged as a pivotal driving force across a wide array of industries, revolutionizing business operations and services. Among the various AI advancements, large language models (LLMs) [1–4] stand out as groundbreaking innovations, leading a transformation with their exceptional language comprehension and generation capabilities. These models have demonstrated remarkable potential in diverse applications, such as machine translation [1], text generation [2], and recommendation system [4]. Their ability to understand and produce human-like text has paved the way for significant advancements in these areas, making tasks like automated customer service, real-time translation, and content creation more efficient.
Despite their impressive capabilities, LLMs are not without challenges. They face significant safety, security, and privacy issues that may undermine the trust and reliability of their applications. One of the most notable safety concerns is hallucination [5], where the model generates plausible but incorrect or nonsensical information. This issue can lead to the dissemination of false information, which is especially problematic in critical applications like healthcare or legal advice. Additionally, LLMs are vulnerable to backdoor attacks [6], a security threat in which malicious actors manipulate the model’s behavior to produce harmful outputs or expose sensitive information. Privacy leakage is a typical privacy problem [7], where these models may inadvertently reveal confidential data from their training corpus, posing serious risks to data privacy.
Given these pressing challenges, it is crucial to conduct a thorough and comprehensive study to identify, categorize, and address the vulnerabilities inherent in LLMs. This study should involve an in-depth analysis of the various safety, security, and privacy threats, offering a clear classification and understanding of each issue. Furthermore, it should explore potential defense mechanisms and strategies to mitigate these risks, ensuring that LLMs are used safely and responsibly. By tackling these challenges, we can enhance the robustness and reliability of LLMs, unlocking their full potential while safeguarding against emerging threats.
Several surveys have addressed security and privacy issues in LLMs [8–10], but these works exhibit some shortcomings. First, some surveys [9] often confuse security with privacy issues, such as incorrectly classifying backdoor attacks as privacy problems. Second, some surveys repetitively classify certain attacks [10], such as jailbreak and adversarial attacks without distinguishing between them. Third, the concept of safety is often overlooked [8], with some safety issues being incorrectly classified as security, resulting in an imprecise analysis. These issues may lead to several negative consequences. Misidentifying threats, for example, could lead to adversarial attacks being mistaken for jailbreak attempts, prompting the use of ineffective defensive measures that fail to address the actual issue. Furthermore, the development of security solutions and strategies relies on a clear understanding of prevalent attacks. Without an accurate classification framework, researchers and developers may struggle to prioritize their efforts and address the most critical areas. Additionally, varying approaches to reporting and analyzing attack data across organizations and industries make it difficult to compare data, limiting the ability to identify trends and develop effective defense strategies.
To address these problems, we propose a new classification framework to provide systematic guidance for building a more robust, secure, and reliable LLM system. Specifically, we begin by providing clearer and more reasonable definitions of safety, security, and privacy within the context of LLMs, as shown in Fig. 1. Safety refers to the model’s inherent reliability and robustness in the absence of external threats. Security pertains to the model’s ability to withstand attacks from adversaries, which may lead to errors or harmful outputs. Privacy involves safeguarding training data and model parameters, ensuring that sensitive information is neither disclosed nor leaked. In contrast to previous works [8–10], our framework offers a more accurate and comprehensive overview of safety, security, and privacy issues in LLMs, along with their respective defense mechanisms. Additionally, we explore the unique research challenges associated with LLMs and outline potential avenues for future research.
Figure 1.Definition of safety, security, and privacy in LLMs.
Our contributions are summarized as follows:
• We provide clearer definitions of safety, security, and privacy within the context of LLMs. This taxonomy helps researchers better distinguish between specific types of attacks and adopt appropriate defense strategies.
• For safety, we conduct a comprehensive survey of inherent safety issues in LLMs, such as toxicity, bias, hallucination, and jailbreak, along with mitigation methods for each.
• For security, we examine the security of LLMs under active attacks, including backdoor, poisoning, and adversarial attacks, and investigate corresponding defense mechanisms for these threats.
• For privacy, we explore the privacy concerns in LLMs, focusing on privacy leakage and active attacks like membership inference and data extraction attacks. We also summarize privacy-preserving techniques and watermarking methods.
The rest of this paper is organized as follows. Section 2 provides an introduction to the background of LLMs. In Section 3, we define safety, security, and privacy in the context of LLMs. In Section 4, we detail the safety problems and their corresponding defenses. Section 5 covers the security issues and the associated defenses. In Section 6, we examine the privacy problems and the defense methods. Finally, Section 7 offers suggestions for future research directions and Section 8 concludes this survey.
2 Background
LLMs are AI systems trained on vast amounts of text data to understand and generate language in a manner that closely resembles human communications. These models are capable of performing a wide range of natural language processing (NLP) tasks. The development of LLMs began with simpler models like recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), which were effective at handling sequential data but struggled with long-range dependencies. A major breakthrough came with the introduction of the Transformer model by Vaswani et al. in 2017 [11], which revolutionized the field. Transformers use self-attention mechanisms to process input sequences, enabling them to capture context and relationships between words more effectively, regardless of distance. As LLMs evolved, their size and complexity grew, resulting in models with billions of parameters capable of capturing intricate linguistic nuances. LLMs are now applied across a broad spectrum of tasks, including real-time translation, content creation, complex question-answering systems, and sentiment analysis. They are transforming how we interact with technology, facilitating more natural and effective communication between humans and machines.
As shown in Fig. 2, the life cycle of LLMs primarily follows a three-stage process:
Figure 2.Training and inference phases of LLMs.
1) Pre-training stage. During the pre-training phase, LLMs are trained on a massive, diverse dataset including books, online forums, news articles, and more. This stage is not task-specific; instead, its goal is to teach the model the underlying structures and patterns of human language. Common techniques, such as masked language modeling (predicting missing words in a sentence) and next-sentence prediction (learning relationships between sentences), are employed. This phase is crucial for building a broad understanding of language, which serves as the foundation for further task-specific training.
2) Fine-tuning stage. Once pre-trained, LLM undergoes fine-tuning to adapt it for specific tasks. This involves training the model on a smaller, more focused dataset relevant to the desired application. For example, if the task is sentiment analysis, the fine-tuning dataset would consist of text labeled with sentiments. Fine-tuning adjusts the model’s weights to better align with task-specific objectives, improving accuracy and relevance for the intended application.
3) Inference stage. After fine-tuning, LLM is ready for inference, where it is applied to make predictions or generate responses in real-world scenarios. At this stage, the model processes new, unseen data and uses the knowledge acquired during pre-training and fine-tuning to generate meaningful outputs. Whether translating text, summarizing documents, or engaging in conversation, the inference stage is where the model’s capabilities are put into practical use, delivering value to users and systems.
Each stage plays a vital role in the development of effective LLMs. Pre-training provides a deep understanding of language, fine-tuning tailors that understanding for specific tasks, and inference demonstrates the model’s ability to apply this knowledge in real-world applications.
Here is an overview of the most popular LLMs currently:
• Generative Pre-trained Transformer (GPT). Developed by OpenAI, the GPT series encompasses GPT-1, GPT-2, GPT-3, and the latest GPT-4. These models are trained on extensive datasets to predict the next word in a sentence, with applications spanning writing, translation, summarization, and coding. OpenAI provides application programming interfaces (APIs) for integrating GPT models into various applications.
• Llama. Meta’s Llama series emphasizes open-access research and efficiency. These models are designed to deliver competitive performance while democratizing access to high-quality LLMs. They are gaining popularity in academic and open-source communities for research and development, fostering a collaborative approach to AI advancements.
• Claude. Anthropic’s Claude is crafted with a focus on safety and alignment with human values. Claude models prioritize user-centric responses, ethical considerations, and harm prevention, making them a choice for applications where ethical AI is paramount.
• Mistral. Mistral is a 700 million parameter language model (LM) that surpasses Llama’s similar-sized model on all evaluated benchmarks. It also features a fine-tuning model dedicated to following instructions, enhancing its utility in specific applications.
There are many other popular LLMs, such as Gemini and Pathways Language Model (PaLM). Each of these models brings unique features and capabilities to the table, contributing to the diverse landscape of AI language technologies. However, due to space constraints, we will omit a detailed description of these models.
3 Definitions for safety, security, and privacy
We define safety, security, and privacy in the context of LLMs as follows:
• Safety refers to the model’s inherent ability to function correctly and consistently without unintended behaviors or errors in the absence of external threats. It ensures that the model adheres to ethical guidelines, avoids harmful outputs, and operates within its intended design parameters.
• Security in LLMs addresses the model’s vulnerability to intentional attacks by adversaries. It focuses on the model’s resilience and robustness to manipulations that could lead to incorrect, misleading, or malicious outputs.
• Privacy concerns the protection of sensitive information, including the model’s training data and parameters and users’ personal information. It aims to prevent unauthorized access, use, or disclosure of private information.
Building on these definitions, we provide a comprehensive overview of the vulnerabilities and defense mechanisms associated with LLMs. As illustrated in Fig. 3, we categorize current research into three main areas: Safety, security, and privacy. Under each category, we identify specific security issues and outline corresponding defensive measures to mitigate these vulnerabilities.
Figure 3.Overview of safety, security, and privacy issues and their defense methods.
Table 1 presents the structure of this study, outlining the safety, security, and privacy threats faced by LLMs [6,7,12–25]. For safety, we examine challenges such as model toxicity, bias, hallucination, and jailbreak. Defense mechanisms in this category include robust training methods, bias detection, and data sanitization techniques. For security, we delve into issues like backdoor attacks, model poisoning, and adversarial attacks. Defense strategies in this area involve adversarial training, anomaly detection, and secure model update protocols. Regarding privacy, we explore threats such as privacy leakage, inference attacks, and extraction attacks. To mitigate these threats, we discuss the implementation of privacy-preserving techniques like differential privacy (DP), secure multi-party computation, and federated learning.
Category | Feature | Example | |
Safety | Toxicity and bias | Toxicity and bias refer to the inappropriate or prejudiced content that may be generated or amplified by LLMs due to biased training data. This can result in harmful or discriminatory outcomes. | [ |
Hallucination | Hallucination refers to the generation of text that is nonsensical, irrelevant, or factually incorrect, typically arising from the model’s inability to accurately understand or process the context of the input data. | [ | |
Jailbreak | Jailbreak attack refers to the exploitation of vulnerabilities to bypass the model’s intended constraints and generate content that violates its operational guidelines or ethical safeguards. | [ | |
Security | Backdoor & poisoning attacks | Backdoor and poisoning attacks refer to the malicious insertion of hidden triggers or corrupted data during the training process, which can cause the model to produce harmful or targeted outputs when prompted with specific inputs. | [ |
Adversarial attack | Adversarial attack is a method where carefully crafted inputs are used to deceive the model into making errors or generating unintended outputs, often by exploiting the model’s weaknesses or vulnerabilities. | [ | |
Privacy | Privacy leakage | Privacy leakage occurs when sensitive or personal information is inadvertently disclosed through the model’s responses, due to the model’s exposure to or training on data containing such information. | [ |
Inference attack | Inference attacks involve the exploitation of model responses to deduce sensitive information about the training data or the underlying algorithms, potentially compromising privacy or security. | [ | |
Extraction attack | Extraction attacks are attempts to reverse-engineer or illicitly obtain proprietary information, such as training data or model parameters, by interacting with the model’s outputs. | [ |
Table 1. Classification of safety, security, and privacy issues in LLMs.
4 Safety problems and defenses
4.1 Toxicity and bias
LLMs, while incredibly powerful, can sometimes exhibit toxicity and bias that reflect the data they were trained on. These models learn patterns from the text data, including discriminatory language, stereotypes, or offensive content. Consequently, when generating text, LLMs might inadvertently produce harmful, biased, or inappropriate outputs.
The issue of toxicity [12] arises when the model generates content that is offensive or abusive, or promotes harmful behavior. This includes hate speech, explicit language, or content that incites violence or discrimination. Biases in LLMs often reflect those present in the training data. If the data contains gender, racial, or cultural biases, the model may perpetuate these biases in its outputs, such as associating certain professions predominantly with a specific gender or making stereotypical assumptions about ethnic groups [13]. Additionally, LLMs may struggle with fairness and inclusivity if they are not trained on diverse or representative data, leading to the underrepresentation or misrepresentation of certain groups in the generated content.
Deshpande et al. [12] reported that the toxicity of ChatGPT’s output can escalate significantly—up to sixfold—when personas are assigned. This increase leads to the propagation of harmful dialogues, incorrect stereotypes, and offensive opinions. Such escalations are particularly alarming as they risk defaming the assigned personas and endangering unsuspecting users, especially in sensitive domains like healthcare, education, and customer service. The study also highlights inherent discriminatory biases within the model, with specific groups being disproportionately targeted. Shaikh et al. [26] investigated the impact of chain of thought (CoT) prompts on LLMs in socially sensitive domains. While CoT improves performance in logical reasoning tasks, it also significantly increases the likelihood of models generating harmful or biased outputs in zero-shot reasoning for sensitive domains. The study finds that this issue becomes more pronounced with larger models but can be mitigated with improved instruction. Dong et al. [27] present a novel framework for evaluating the gender bias in LLMs using three distinct input strategies: Template-based, LLM-generated, and naturally-sourced. Their findings reveal that larger LLMs do not guarantee fairness; all models tested exhibit both explicit and implicit gender biases, even without explicit gender stereotypes in the input.
4.2 Hallucination
Hallucinations [14] in LLMs pose a critical challenge where the model generates text that, while syntactically coherent and semantically plausible, is factually incorrect or entirely fictitious. Reference [5] categorizes hallucinations into two main groups: Factuality hallucination and faithfulness hallucination. Factuality hallucination emphasizes the discrepancy between generated content and verifiable real-world facts, typically manifesting as factual inconsistency or fabrication. Faithfulness hallucination refers to the divergence of generated content from user instructions or the context provided by the input and self-consistency within generated content.
The causes of hallucinations [15] stem from three main areas: Data-related issues such as misinformation, biases, and knowledge boundaries within the training data; training-related factors including architectural flaws, suboptimal training objectives, and misalignments between model capabilities and human preferences; inference-related challenges like the inherent randomness of decoding strategies and imperfect representations during the decoding phase. These multifaceted origins contribute to the generation of content that may be inconsistent with real-world facts or user inputs.
The hallucination issue can have detrimental effects across various domains. In the healthcare domain [32], medical professionals could be misguided by inaccurate health-related advice or data generated by LLMs, potentially impacting patient care and safety. In the financial domain [33], financial advisors and analysts could be misled by hallucinated financial data or market insights, leading to incorrect investment decisions with significant financial repercussions.
4.3 Jailbreak
In the realm of LLMs, jailbreak [16] denotes a collection of techniques employed to evade the safety mechanisms and constraints built into these models. This bypass allows for the generation of content that would typically be prohibited, potentially including harmful, biased, or illegal information. The phenomenon of jailbreak in LLMs has been the subject of extensive research, with scholars uncovering various tactics to coerce models into producing inappropriate content. For example, Wei et al. [17] theorized two primary modes of failure in safety training: “competing objectives” and “mismatched generalization”. The former occurs when a model’s inherent capabilities clash with its intended safety goals, while the latter arises when the training for safety does not adequately prepare the model for the diverse domains in which it operates. By leveraging these insights, the researchers designed and tested attacks on advanced models, revealing persistent vulnerabilities despite rigorous red-teaming and safety measures. Notably, most jailbreaks are the result of human ingenuity, requiring significant manual effort rather than automated processes. Zou et al. [41] introduced an automated method for attacking aligned LMs by prompting them to generate objectionable content. This method involves appending adversarial suffixes to prompts, which compels the model to generate affirmative responses to queries that could be harmful. Employing a blend of greedy search and gradient-based optimization, this method automates the generation of adversarial suffixes, enhancing previous approaches to automatic prompt generation.
Current research on jailbreak vulnerabilities of chatbots [42,43] is heavily concentrated on ChatGPT, leaving a notable gap in understanding the risks associated with other commercial LLM chatbots such as Bing Chat and Bard. Additionally, there is little publicly available information about the jailbreak prevention strategies employed by these commercial solutions. To address this gap, Deng et al. [44] conducted an empirical study to evaluate the efficacy of current jailbreak attacks, highlighting the deficiencies in existing defenses. They introduced MASTERKEY, an innovative framework that employs a time-based analysis technique to deconstruct and comprehend the defense mechanisms used by LLM chatbots. Furthermore, they present an automated approach for generating universal jailbreak prompts. By fine-tuning an LLM with a curated set of jailbreak prompts, MASTERKEY is capable of crafting prompts that can effectively navigate around the defenses of various LLM chatbots.
However, these defense mechanisms have limitations. First, many of them rely on prior knowledge or assumptions about the attack type. Additionally, these methods often introduce computational overhead, such as gradient-based optimizations or unlearning processes. Some defenses, like SmoothLLM and RPO, may degrade model performance on benign queries or result in subtle user experience issues. Moreover, unlearning may unintentionally erase useful knowledge.
5 Security problems and defenses
5.1 Backdoor and poisoning attacks
Backdoor attacks involve surreptitiously embedding a hidden trigger within the model’s training data or architecture. Once activated by specific input patterns (the “trigger”), the model exhibits unintended behaviors, such as generating misleading or malicious outputs. Poisoning attacks contaminate the training dataset with maliciously crafted samples, causing the model to learn incorrect associations or biases. Backdoor and poisoning attacks are two distinct but interconnected types of attack strategies. Concretely, backdoor attacks implant stealthy triggers for specific, manipulated outputs, while poisoning attacks corrupt the training data to degrade overall model performance or induce biases. While backdoor and poisoning attacks differ in their specific goals and mechanisms, they are interconnected in their exploitation of vulnerabilities in the training process. We make no special distinction between them in this paper.
A variety of backdoor attack algorithms targeting LLMs have been documented [6]. For example, instruction poisoning and ICL poisoning leverage the model’s adaptability to new tasks without explicit parameter updates. Xu et al. [49] demonstrated that attackers can control model behavior by inserting malicious instructions into the training data, achieving high success rates without altering the actual data or labels. Reference [50] shows that by searching extensive text collections for inputs exhibiting significant gradient magnitudes under the approximation of a bag-of-n-grams LM, attackers can succeed with surprisingly small datasets—sometimes as few as a hundred correctly labeled data points. Furthermore, the efficacy of these attacks is expected to increase as the size and complexity of LMs grow. Zhao et al. [51] demonstrated that ICL, despite its efficacy in various NLP tasks, is susceptible to backdoor attacks where adversaries can manipulate model behavior by poisoning the demonstration context. The paper introduces ICLAttack, which includes two strategies: Poisoning demonstration examples and poisoning prompts, both capable of inducing the model to generate targeted outputs without the need for model fine-tuning. The method preserves the correct labeling of examples, enhancing the stealth of the attack. Additionally, some works leverage model-editing techniques to inject backdoors. Reference [18] proposes BadEdit, which builds shortcuts connecting triggers to their corresponding attack targets by directly manipulating the model’s weights. It requires minimal samples and adjustments while maintaining model performance and stability.
In summary, while each of these defense methods offers valuable approaches to securing LLMs against backdoor and poisoning attacks, their limitations in terms of generalization, computational overhead, reliance on prior knowledge, and vulnerability to evolving attacks make them less than fully robust for all scenarios.
5.2 Adversarial attack
Adversarial attacks [19,20,56] utilize meticulously crafted inputs to manipulate the model behavior and output. These attacks manifest in various forms, from prompt injections that distort model responses to token manipulations leading to incorrect predictions. It potentially enables the generation of harmful content, the inadvertent disclosure of sensitive information, or the evasion of critical safety mechanisms. For example, Xu et al. [56] proposed a groundbreaking method called PromptAttack, designed to audit the adversarial robustness of LLMs. This technique ingeniously converts textual adversarial attacks into an attack prompt, which deceives the victim LLM into generating an adversarial sample capable of self-deception. The attack prompt comprises three key elements: The original input, the attack objective, and strategic attack guidance. To preserve the semantic essence of the adversarial examples, a fidelity filter is meticulously applied. The attack’s potency is further amplified by integrating adversarial examples across a spectrum of perturbation levels. Building upon this foundation, Sadasivan et al. [19] introduced BEAST, which operates on interpretable parameters. It allows for a nuanced balance between the swiftness of the attack, its success rate, and the intelligibility of the adversarial prompts. BEAST demonstrates versatility by executing untargeted adversarial attacks that provoke hallucinations in aligned LLMs, as well as enhancing existing membership inference attacks. Raina et al. [20] examined the vulnerability of LLMs used for zero-shot text assessment, discovering that both scoring and comparative assessment are susceptible to universal adversarial phrases that can manipulate the models into assigning high scores regardless of actual text quality. Their research reveals the transferability of these attacks across various LLMs, indicating a significant robustness issue.
However, these approaches often struggle with scalability and computational efficiency, as techniques like token erasure and self-evaluation can be resource-intensive. There is also a risk of false positives or negatives, where benign inputs are unnecessarily flagged, or malicious ones go undetected. Furthermore, methods like instruction-augmented fine-tuning may suffer from overfitting to specific attack patterns, limiting generalization to new threats.
6 Privacy problems and defenses
6.1 Privacy leakage
Privacy leakage is a pressing concern due to the inherent potential of models to inadvertently divulge sensitive information. In a notable case, Samsung Electronics encountered sensitive corporate information unintentionally disclosed through ChatGPT interactions. Reference [61] finds that Gemini Flash often collects excessive user data beyond what is necessary and has a higher rate of anonymization failures when sharing data with the third parties. Privacy leakage can arise from three primary causes: Private information in training data, data memorization, and inference leakage.
1) Private information in training data. Training LLMs demands massive datasets, often compiled from diverse sources including the Internet and public databases, to learn the intricacies of human language effectively. These datasets may contain sensitive or personally identifiable information (PII), such as personal identifiers, health records, or financial details. Despite efforts at anonymization, there remains a risk of reidentification. Kim et al. [7] present ProPILE, a novel probing tool designed to empower data subjects, or the owners of PII, with awareness of potential PII leakage in LLM-based services. ProPILE enables data subjects to formulate prompts based on their PII to evaluate the level of privacy intrusion in LLMs. However, ProPILE primarily focuses on measuring leakage from training data and does not account for the flow of information from input to output. Mireshghallah et al. [21] proposed CONFAIDE, a benchmark grounded in the theory of contextual integrity and designed to identify critical weaknesses in the privacy reasoning capabilities of instruction-tuned LLMs. CONFAIDE consists of four tiers. Grounded in contextual integrity theory [62], each tier has a set of seed components defining the context, which gradually increases in complexity as the tiers progress. This work shed light on the often-overlooked interactive settings where an LLM may inadvertently expose sensitive input data in inappropriate contexts.
2) Data memorization. LLMs, with their immense capacity to learn from massive datasets, can unintentionally retain and later reproduce specific pieces of training data, including sensitive personal information. This phenomenon, known as memorization, is a byproduct of their ability to learn complex patterns. Reference [63] addresses the issue of LLMs inadvertently memorizing parts of their training data, which can lead to privacy breaches, reduced utility, and unfairness. The authors demonstrated that memorization significantly increases with model capacity, data duplication, and context length. Their log-linear approach quantifies the extent of memorization, revealing that larger models memorize 2−5× more than smaller ones.
3) Inference leakage. Even without explicit exposure to private data during interactions, LLMs can infer sensitive details through contextual clues. Their ability to generalize and connect information means that seemingly innocuous inputs can trigger responses that reveal more than intended, especially when the model has learned correlations between different types of data during training. We will explore this issue in greater detail in subsection 6.2.
6.2 Inference attack
Inference attacks leverage the models’ ability to infer sensitive information from their training datasets. Here, we outline four distinct types of inference attacks: Membership inference attack (MIA), attribute inference attack, data reconstruction attack, and model inversion attack.
1)
2)
3)
4)
6.3 Extraction attack
In the realm of LLMs, extraction attacks emerge as a pivotal concern for data privacy. These assaults are designed to retrieve information from the model’s training data or even its underlying parameters. Although extraction attacks share commonalities with inference attacks, they are distinguished by their unique objectives and methodologies. Extraction attacks are primarily aimed at directly acquiring tangible assets, such as model gradients or sensitive training data. In contrast, inference attacks are more concerned with deducing or uncovering the intrinsic characteristics or attributes of the model or its data. This is typically achieved through an analysis of the model’s outputs or observed behaviors. There are two principal categories of extraction attacks: Model extraction attack, which targets the model’s parameters, and data extraction attack, which focuses on recovering the training data.
1)
2)
It is observed that there is a relative scarcity of empirical research on extracting model parameters or data from LLMs, with most discussions being theoretical in nature [9]. The complexity and proprietary nature of LLMs pose significant challenges to such extraction attacks. Additionally, the controlled output of these models further limits the feasibility of black-box attacks, making it difficult for adversaries to successfully execute data extraction without access to the model’s internal workings.
6.4 Other privacy-preserving methods
In addition to the previously discussed defensive strategies against various attacks, there are several privacy-preserving methods that can protect against a wide range of threats. These methods are more general and can be applied across different scenarios and attack vectors. By integrating advanced privacy-preserving techniques, LLMs can operate more securely and reliably, reducing the potential for enhancing user trust.
6.4.1 Differential privacy
DP incorporates a degree of controlled randomness into data processing, complicating the task for adversaries to pinpoint individual data points within aggregated outcomes. This strategy is especially effective at safeguarding user privacy without compromising the ability to perform insightful statistical analyses. Charles et al. [85] introduced differentially private in-context learning (DP-ICL), and designed to prevent privacy breaches in LLMs during the adaptation phase. DP-ICL leverages noisy consensus from diverse data subsets to uphold privacy, having been rigorously tested in both text classification and language generation tasks. By balancing utility and privacy, DP-ICL supports the ethical deployment of AI. Chua et al. [86] advocated for user-level DP to ensure consistent protection. Their research compares two privacy mechanisms—group privacy and user-wise differential privacy-stochastic gradient descent (DP-SGD)—across NLP tasks, aiming to fine-tune the balance between privacy and utility. The findings indicated that user-wise DP-SGD outperforms its counterpart, particularly under constrained privacy budgets, owing to its nuanced approach to data management during training. Du et al. [77] present DP-Forward, a method that applies noise to the embeddings of LMs for enhanced privacy protection during both training and inference phases. Tang et al. [87] introduced an algorithm capable of generating few-shot demonstrations while upholding DP, ensuring sensitive information remains undisclosed. Their contribution advances the field of privacy-preserving ICL, balancing contextual learning with data protection across various applications.
6.4.2 Federated learning
FL enables models to be trained on decentralized data, thereby reducing the need to centralize and expose sensitive user data. This method allows models to learn from data distributed across various devices without compromising privacy. In a recent contribution, Zheng et al. [88] proposed a federated learning-generative language model (FL-GLM), an FL framework specifically designed for LLMs. It utilizes split learning, where the majority of the model’s parameters reside on the server, while the embedding and output layers are trained locally on client devices. This strategy is well-suited for LLMs given their substantial computational requirements. To bolster security, FL-GLM incorporates key encryption for client-server communications and introduces optimization techniques such as client batching or server hierarchical structuring to enhance training efficiency. Sun et al. [89] introduced an innovative enhancement of the low-rank adaptation (LoRA), named federated freeze A LoRA (FFA-LoRA), of LLMs method designed for privacy-preserving FL. This approach addresses the inherent challenges of LoRA within the FL paradigm by stabilizing non-zero matrices and focusing fine-tuning efforts on those initialized to zero. By doing so, FFA-LoRA not only improves the stability and efficiency of FL tasks but also ensures privacy preservation through the implementation of DP measures.
6.4.3 Methods based on cryptography
1)
2)
3)
6.4.4 Watermarking
Watermarking embeds unique identifiers into data, serving as a traceable deterrent against unauthorized use or access. It fortifies security by facilitating the identification and prosecution of data misuse. Kirchenbauer et al. [94] present a framework for watermarking LLM outputs to curb misuse and ensure the detectability of machine-generated text. The method discreetly selects “green” tokens, subtly encouraging their use during text generation, embedding an imperceptible watermark that minimally impacts quality, while allowing for algorithmic detection without requiring access to the model. Yao et al. [95] focused on prompt copyright protection with PromptCARE (prompt copyright protection by watermark injection and verification), a framework that integrates watermarks into prompts via a dual-level optimization process. It employs “label tokens” and “signal tokens” to forge a unique signature, with a verification phase that uses a secret key to trigger watermark behavior for statistical copyright verification. Zhao et al. [96] introduced Unigram-Watermark, enhancing an existing method with a simplified grouping strategy, underpinned by a theoretical framework that quantifies watermark effectiveness and robustness in LLMs. Min et al. [97] unveiled SILO, an LM designed to balance risk and performance during inference. Trained on the Open License Corpus and augmented with a nonparametric datastore for high-risk data, SILO allows for high-risk data utilization without direct training exposure.
7 Future direction
Safeguarding LLMs is fraught with challenges, primarily due to the following two fundamental factors:
• The sheer scale of LLMs, often containing billions of parameters, makes traditional security methods less effective. The vastness of these models demands innovative, scalable solutions. Moreover, the proprietary nature of many powerful LLMs limits the broader community’s ability to scrutinize and test them for vulnerabilities. This confidentiality, while protecting intellectual property, also shields LLMs from the critical analysis that could improve their security.
• There is a shortage of research on how LLM architectures affect their safety. This is partly due to the prohibitive computational costs associated with experimenting with various designs. The complexity and resource intensity of such studies hinders progress in understanding the architectural factors that influence model security, even though this knowledge is vital for creating more secure and manipulation-resistant models.
To address these challenges, future research should focus on developing scalable security solutions that can handle the complexity of LLMs. Several promising research directions are as follows:
1) Scalable security protocols and algorithms: Develop security protocols and algorithms that are capable of scaling with the growing size and complexity of LLMs. This includes the design of adaptive learning mechanisms that can evolve as models expand, ensuring that the security architecture can handle not only the size but also the dynamic nature of these systems.
2) Open-source frameworks and collaborative platforms: Promote the creation of open-source frameworks and collaborative platforms that enable broader analysis, testing, and improvement of LLM architectures. By fostering transparency in research and development, these platforms can accelerate the identification of vulnerabilities and facilitate the creation of effective countermeasures.
3) Transparent decision-making processes: Develop methods to enhance the transparency and interpretability of LLM decision-making processes. Making model reasoning more understandable will help researchers identify and mitigate unintended biases, harmful outputs, and other ethical concerns, thus improving overall trust in the system.
4) Privacy-preserving inference methods: Investigate privacy-preserving techniques for LLM inference to ensure that user queries and model responses do not inadvertently leak sensitive or personal information. Techniques such as DP and MPC should be adapted for LLM contexts to maintain privacy while ensuring effective functionality.
5) Security of multimodal systems: Focus on securing multimodal systems as LLMs increasingly integrate multiple modalities, such as text, vision, and audio. Research should address potential security risks arising from the interaction between different modalities, including how to mitigate the exposure of vulnerabilities that may arise from their combination.
6) Optimizing security with computational efficiency: Strive to maintain robust security while optimizing computational resources and efficiency during the deployment of large models. Balancing security measures with resource constraints is critical, especially for ensuring smooth operation in resource-limited environments like edge computing devices.
7) Defending edge-deployed models: For models deployed at the edge, investigate defense mechanisms such as request-level authorization, weight permutation, and encryption. These techniques should be optimized to minimize latency and computational overhead while providing robust security. The development of lightweight encryption methods and more efficient trusted execution environments (TEEs) is crucial for protecting models during runtime without significantly impacting performance.
8 Conclusions
In this paper, we clarify the concepts of safety, security, and privacy in LLMs with clear and simple classification. Our goal is to provide researchers with a straightforward framework to help them identify specific vulnerabilities and apply effective defense strategies. Additionally, we present a comprehensive review of current issues and solutions in LLMs, categorized into these three key areas. The survey aims to help LLM researchers and practitioners better understand these important topics. We also analyze existing studies, summarize key insights, and suggest directions for future research.
Disclosures
The authors declare no conflicts of interest.
References
[1] OpenAI, J. Achiam, S. Adler, et al., GPT4 technical rept [Online]. Available, https:arxiv.gabs2303.08774, March 2023.
[2] H. Touvron, T. Lavril, G. Izacard, et al., LLaMA: open efficient foundation language models [Online]. Available, https:arxiv.gabs2302.13971, February 2023.
[3] H. Touvron, L. Martin, K. Stone, et al., Llama 2: open foundation finetuned chat models [Online]. Available, https:arxiv.gabs2307.09288, July 2023.
[4] W.X. Zhao, K. Zhou, J.Y. Li, et al., A survey of large language models [Online]. Available, https:arxiv.gabs2303.18223, March 2023.
[5] L. Huang, W.J. Yu, W.T. Ma, et al., A survey on hallucination in large language models: principles, taxonomy, challenges, open questions, ACM T. Infm. Syst. (2024), doi: 10.11453703155.
[6] S. Zhao, M.H.Z. Jia, Z.L. Guo, et al., A survey of backdo attacks defenses on large language models: implications f security measures [Online]. Available, https:arxiv.gabs2406.06852, June 2024.
[7] S. Kim, S. Yun, H. Lee, M. Gubri, S. Yoon, S.J. Oh, ProPILE: probing privacy leakage in large language models, in: Proc. of the 37th Intl. Conf. on Neural Infmation Processing Systems, New leans, USA, 2024, pp. 1–14.
[8] T.Y. Cui, Y.L. Wang, C.P. Fu, et al., Risk taxonomy, mitigation, assessment benchmarks of large language model systems [Online]. Available, https:arxiv.gabs2401.05778, January 2024.
[9] B.W. Yan, K. Li, M.H. Xu, et al., On protecting the data privacy of large language models (LLMs): a survey [Online]. Available, https:arxiv.gabs2403.05156, March 2024.
[10] Yao Y.-F., Duan J.-H., Xu K.-D., Cai Y.-F., Sun Z.-B., Zhang Y.. A survey on large language model (LLM) security and privacy: the good. the bad, and the ugly, High-Confid. Comput., 4, 100211(2024).
[11] A. Vaswani, N. Shazeer, N. Parmar, et al., Attention is all you need, in: Proc. of the 31st Intl. Conf. on Neural Infmation Processing Systems, Long Beach, USA, 2017, pp. 6000–6010.
[12] A. Deshpe, V. Murahari, T. Rajpurohit, A. Kalyan, K. Narasimhan, Toxicity in ChatGPT: analyzing personaassigned language models, in: Proc. of the Findings of the Association f Computational Linguistics, Singape, 2023, pp. 1236–1270.
[13] A. Haim, A. Salinas, J. Nyarko, What’s in a name Auditing large language models f race gender bias [Online]. Available, https:arxiv.gabs2402.14875, February 2024.
[14] Ji Z.-W., Lee N., Frieske R. et al. Survey of hallucination in natural language generation. ACM Comput. Surv., 55, 248(2023).
[15] Y. Zhang, Y.F. Li, L.Y. Cui, et al., Siren’s song in the AI ocean: a survey on hallucination in large language models [Online]. Available, https:arxiv.gabs2309.01219, September 2023.
[16] X.X. Li, R.C. Zhao, Y.K. Chia, et al., Chainofknowledge: grounding large language models via dynamic knowledge adapting over heterogeneous sources, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–23.
[17] A. Wei, N. Haghtalab, J. Steinhardt, Jailbroken: how does LLM safety training fail in: Proc. of the 37th Advances in Neural Infmation Processing Systems, New leans, USA, 2023, pp. 1–32.
[18] Y.Z. Li, T.L. Li, K.J. Chen, et al., BadEdit: backdoing large language models by model editing, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–18.
[19] V.S. Sadasivan, S. Saha, G. Sriramanan, P. Kattakinda, A.M. Chegini, S. Feizi, Fast adversarial attacks on language models in one GPU minute, in: Proc. of the 41st Intl. Conf. on Machine Learning, Vienna, Austria, 2024, pp. 1–20.
[20] V. Raina, A. Liusie, M.J.F. Gales, Is LLMasajudge robust Investigating universal adversarial attacks on zeroshot LLM assessment, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 7499–7517.
[21] N. Mireshghallah, H. Kim, X.H. Zhou, et al., Can LLMs keep a secret Testing privacy implications of language models via contextual integrity they, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–24.
[22] M. Duan, A. Suri, N. Mireshghallah, et al., Do membership inference attacks wk on large language models [Online]. Available, https:arxiv.gabs2402.07841, February 2024.
[23] R. Staab, M. Vero, M. Balunovic, M.T. Vechev, Beyond memization: violating privacy via inference with large language models, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–47.
[24] A. Naseh, K. Krishna, M. Iyyer, A. Houmansadr, Stealing the decoding algithms of language models, in: Proc. of the ACM SIGSAC Conf. on Computer Communications Security, Copenhagen, Denmark, 2023, pp. 1835–1849.
[25] A. Pa, C.A. ChoquetteChoo, Z.M. Zhang, Y.Q. Yang, P. Mittal, Teach LLMs to phish: stealing private infmation from language models, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–25.
[26] O. Shaikh, H.X. Zhang, W. Held, M. Bernstein, D.Y. Yang, On second thought, let’s not think step by step! Bias toxicity in zeroshot reasoning, in: Proc. of the 61st Annual Meeting of the Association f Computational Linguistics, Tonto, Canada, 2023, pp. 4454–4470.
[27] X.J. Dong, Y.B. Wang, P.S. Yu, J. Caverlee, Probing explicit implicit gender bias through LLM conditional text generation [Online]. Available, https:arxiv.gabs2311.00306, November 2023.
[28] J. Welbl, A. Glaese, J. Uesato, et al., Challenges in detoxifying language models, in: Proc. of the Findings of the Association f Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 2447–2469.
[30] M. Bhan, J.N. Vittaut, N. Achache, et al., Mitigating text toxicity with counterfactual generation [Online]. Available, https:arxiv.gabs2405.09948, May 2024.
[31] N. Vishwamitra, K.Y. Guo, F.T. Romit, et al., Moderating new waves of online hate with chainofthought reasoning in large language models, in: Proc. of the IEEE Symposium on Security Privacy, San Francisco, USA, 2024, pp. 1–19.
[32] A. Pal, L.K. Umapathi, M. Sankarasubbu, MedHALT: medical domain hallucination test f large language models, in: Proc. of the 27th Conf. on Computational Natural Language Learning, Singape, 2023, pp. 314–334.
[33] H.Q. Kang, X.Y. Liu, Deficiency of large language models in finance: an empirical examination of hallucination, in: Proc. of the 37th Conf. on Neural Infmation Processing Systems, Virtual Event, 2023, pp. 1–15.
[34] N. Mündler, J.X. He, S. Jenko, M.T. Vechev, Selfcontradicty hallucinations of large language models: evaluation, detection mitigation, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–30.
[35] Y.Z. Li, S. Bubeck, R. Eldan, A. Del Gino, S. Gunasekar, Y.T. Lee, Textbooks are all you need II: phi1.5 technical rept [Online]. Available, https:arxiv.gabs2309.05463, September 2023.
[36] H. Lightman, V. Kosaraju, Y. Burda, et al., Let’s verify step by step, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–24.
[37] Z.B. Gou, Z.H. Shao, Y.Y. Gong, et al., CRITIC: large language models can selfcrect with toolinteractive critiquing, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–77.
[38] S. Dhuliawala, M. Komeili, J. Xu, et al., Chainofverification reduces hallucination in large language models, in: Proc. of the Findings of the Association f Computational Linguistics, Bangkok, Thail, 2024, pp. 3563–3578.
[39] E. Jones, H. Palangi, C. Simões, et al., Teaching language models to hallucinate less with synthetic tasks, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–18.
[40] S.M.T.I. Tonmoy, S.M.M. Zaman, V. Jain, et al., A comprehensive survey of hallucination mitigation techniques in large language models [Online]. Available, https:arxiv.gabs2401.01313, January 2024.
[41] A. Zou, Z.F. Wang, N. Carlini, M. Nasr, J.Z. Kolter, M. Fredrikson, Universal transferable adversarial attacks on aligned language models [Online]. Available, https:arxiv.gabs2307.15043, July 2023.
[42] H.R. Li, D.D. Guo, W. Fan, et al., Multistep jailbreaking privacy attacks on ChatGPT, in: Proc. of the Findings of the Association f Computational Linguistics, Singape, 2023, pp. 4138–4153.
[43] M. Shanahan, K. McDonell, L. Reynolds, Role play with large language models, Nature 623 (7987) (2023) 493–498.
[44] G.L. Deng, Y. Liu, Y.K. Li, et al., MASTERKEY: automated jailbreaking of large language model chatbots, in: Proc. of the 31st Annual wk Distributed System Security Symposium, San Diego, USA, 2024, pp. 1–16.
[45] Z.X. Zhang, J.X. Yang, P. Ke, et al., Safe unlearning: a surprisingly effective generalizable solution to defend against jailbreak attacks [Online]. Available, https:arxiv.gabs2407.02855, July 2024.
[46] W.K. Lu, Z.Q. Zeng, J.W. Wang, et al., Eraser: jailbreaking defense in large language models via unlearning harmful knowledge [Online]. Available, https:arxiv.gabs2404.05880, April 2024.
[47] A. Robey, E. Wong, H. Hassani, G.J. Pappas, SmoothLLM: defending large language models against jailbreaking attacks [Online]. Available, https:arxiv.gabs2310.03684, October 2023.
[48] A. Zhou, B. Li, H.H. Wang, Robust prompt optimization f defending language models against jailbreaking attacks, in: Proc. of the 38th Conf. on Neural Infmation Processing Systems, Virtual Event, 2024, pp. 1–17.
[49] J.S. Xu, M.Y. Ma, F. Wang, C.W. Xiao, M.H. Chen, Instructions as backdos: backdo vulnerabilities of instruction tuning f large language models, in: Proc. of the Conf. of the Nth American Chapter of the Association f Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 2024, pp. 3111–3126.
[50] A. Wan, E. Wallace, S. Shen, D. Klein, Poisoning language models during instruction tuning, in: Proc. of the 40th Intl. Conf. on Machine Learning, Honolulu, USA, 2023, pp. 1–13.
[51] S. Zhao, M.H.Z. Jia, L.A. Tuan, F.J. Pan, J.M. Wen, Universal vulnerabilities in large language models: backdo attacks f incontext learning, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 11507–11522.
[52] Z.H. Xi, T.Y. Du, C.J. Li, et al., Defending pretrained language models as fewshot learners against backdo attacks, in: Proc. of the 37th Advances in Neural Infmation Processing Systems, New leans, USA, 2023, pp. 1–17.
[53] X. Li, Y.S. Zhang, R.Z. Lou, C. Wu, J.Q. Wang, Chainofscrutiny: detecting backdo attacks f large language models [Online]. Available, https:arxiv.gabs2406.05948, June 2024.
[54] H.R. Li, Y.L. Chen, Z.H. Zheng, et al., Simulate eliminate: revoke backdos f generative large language models [Online]. Available, https:arxiv.gabs2405.07667, May 2024.
[55] Y.T. Li, Z.C. Xu, F.Q. Jiang, et al., CleanGen: mitigating backdo attacks f generation tasks in large language models, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 9101–9118.
[56] X.L. Xu, K.Y. Kong, N. Liu, et al., An LLM can fool itself: a promptbased adversarial attack, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–23.
[57] A. Kumar, C. Agarwal, S. Srinivas, A.J. Li, S. Feizi, H. Lakkaraju, Certifying LLM safety against adversarial prompting [Online]. Available, https:arxiv.gabs2309.02705, September 2023.
[58] H. Brown, L. Lin, K. Kawaguchi, M. Shieh, Selfevaluation as a defense against adversarial attacks on LLMs [Online]. Available, https:arxiv.gabs2407.03234, July 2024.
[59] Y.K. Zhao, L.Y. Yan, W.W. Sun, et al., Improving the robustness of large language models via consistency alignment, in: Proc. of the Joint Intl. Conf. on Computational Linguistics, Language Resources Evaluation, Tino, Italia, 2024, pp. 8931–8941.
[60] S. Kadavath, T. Conerly, A. Askell, et al., Language models (mostly) know what they know [Online]. Available, https:arxiv.gabs2207.05221, July 2022.
[61] O. Cartwright, H. Dunbar, T. Radcliffe, Evaluating privacy compliance in commercial large language modelsChatGPT, Claude, Gemini, Research Square (2024), doi: 10.21203rs.3.rs4792047v1.
[62] Nissenbaum H.. Privacy as contextual integrity. Wash. Law Rev., 79, 119(2004).
[63] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramèr, C.Y. Zhang, Quantifying memization across neural language models, in: Proc. of the 11th Intl. Conf. on Learning Representations, Kigali, Rwa, 2023, pp. 1–19.
[64] R. Eldan, M. Russinovich, Who’s Harry Potter Approximate unlearning in LLMs [Online]. Available, https:arxiv.gabs2310.02238, October 2023.
[65] V. Patil, P. Hase, M. Bansal, Can sensitive infmation be d from LLMs Objectives f defending against extraction attacks [Online]. Available, https:arxiv.gabs2309.17410, September 2023.
[66] N. Kpal, E. Wallace, C. Raffel, Deduplicating training data mitigates privacy risks in language models, in: Proc. of the 39th Intl. Conf. on Machine Learning, Baltime, USA, 2022, pp. 10697–10707.
[67] R. Shokri, M. Stronati, C.Z. Song, V. Shmatikov, Membership inference attacks against machine learning models, in: Proc. of the IEEE Symposium on Security Privacy, San Jose, USA, 2017, pp. 3–18.
[68] P. Maini, H.R. Jia, N. Papernot, A. Dziedzic, LLM dataset inference: did you train on my dataset [Online]. Available, https:arxiv.gabs2406.06443, June 2024.
[69] J. Mattern, F. Mireshghallah, Z.J. Jin, B. Schoelkopf, M. Sachan, T. BergKirkpatrick, Membership inference attacks against language models via neighbourhood comparison, in: Proc. of the Findings of the Association f Computational Linguistics, Tonto, Canada, 2023, pp. 11330–11343.
[70] W.J. Shi, A. Ajith, M.Z. Xia, et al., Detecting pretraining data from large language models, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–17.
[71] M. Kaneko, Y.M. Ma, Y. Wata, N. Okazaki, Samplingbased pseudolikelihood f membership inference attacks [Online]. Available, https:arxiv.gabs2404.11262, April 2024.
[72] N. Kpal, K. Pillutla, A. Oprea, P. Kairouz, C.A. ChoquetteChoo, Z. Xu, User inference attacks on large language models, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 18238–18265.
[73] Z.P. Wang, A.D. Cheng, Y.G. Wang, L. Wang, Infmation leakage from embedding in large language models [Online]. Available, https:arxiv.gabs2405.11916, May 2024.
[74] R.S. Zhang, S. Hidano, F. Koushanfar, Text revealer: private text reconstruction via model inversion attacks against Transfmers [Online]. Available, https:arxiv.gabs2209.10505, September 2022.
[75] J.X. Mris, W.T. Zhao, J.T. Chiu, V. Shmatikov, A.M. Rush, Language model inversion, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–21.
[76] Hu L., Yan A.-L., Yan H.-Y. et al. Defenses to membership inference attacks: a survey. ACM Comput. Surv., 56, 92(2024).
[77] M.X. Du, X. Yue, S.S.M. Chow, T.H. Wang, C.Y. Huang, H. Sun, DPFward: finetuning inference on language models with differential privacy in fward pass, in: Proc. of the ACM SIGSAC Conf. on Computer Communications Security, Copenhagen, Denmark, 2023, pp. 2665–2679.
[78] D.F. Chen, N. Yu, M. Fritz, RelaxLoss: defending membership inference attacks without losing utility, in: Proc. of the 10th Intl. Conf. on Learning Representations, Virtual Event, 2022, pp. 1–28.
[79] Z.J. Li, C.Z. Wang, P.C. Ma, et al., On extracting specialized code abilities from large language models: a feasibility study, in: Proc. of the IEEEACM 46th Intl. Conf. on Software Engineering, Lisbon, Ptugal, 2024, pp. 1–13.
[80] N. Carlini, D. Paleka, K.D. Dvijotham, et al., Stealing part of a production language model, in: Proc. of the 41st Intl. Conf. on Machine Learning, Vienna, Austria, 2024, pp. 1–26.
[81] M. Finlayson, X. Ren, S. Swayamdipta, Logits of APIprotected LLMs leak proprietary infmation [Online]. Available, https:arxiv.gabs2403.09539, March 2024.
[82] Y. Bai, G. Pei, J.D. Gu, Y. Yang, X.J. Ma, Special acters attack: toward scalable training data extraction from large language models [Online]. Available, https:arxiv.gabs2405.05990, May 2024.
[83] C.L. Zhang, J.X. Mris, V. Shmatikov, Extracting prompts by inverting LLM outputs, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 14753–14777.
[84] Q.F. Li, Z.Q. Shen, Z.H. Qin, et al., TransLinkGuard: safeguarding Transfmer models against model stealing in edge deployment, in: Proc. of the 32nd ACM Intl. Conf. on Multimedia, Melbourne, Australia, 2024, pp. 3479–3488.
[85] Z. les, A. Ganesh, R. McKenna, et al., Finetuning large language models with userlevel differential privacy, in: Proc. of the ICML2024 Wkshop on Theetical Foundations of Foundation Models, Vienna, Austria, 2024, pp. 1–24.
[86] L. Chua, B. Ghazi, Y.S.B. Huang, et al., Mind the privacy unit! Userlevel differential privacy f language model finetuning [Online]. Available, https:arxiv.gabs2406.14322, June 2024.
[87] X.Y. Tang, R. Shin, H.A. Inan, et al., Privacypreserving incontext learning with differentially private fewshot generation [Online]. Available, https:arxiv.gabs2309.11765, September 2023.
[88] J.Y. Zheng, H.N. Zhang, L.X. Wang, W.J. Qiu, H.W. Zheng, Z.M. Zheng, Safely learning with private data: a federated learning framewk f large language model, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 5293–5306.
[89] Y.B. Sun, Z.T. Li, Y.L. Li, B.L. Ding, Improving LA in privacypreserving federated learning, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–17.
[90] M. Hao, H.W. Li, H.X. Chen, P.Z. Xing, G.W. Xu, T.W. Zhang, Iron: private inference on Transfmers, in: Proc. of the 36th Intl. Conf. on Neural Infmation Processing Systems, New leans, USA, 2022, pp. 1–14.
[91] X.Y. Hou, J. Liu, J.Y. Li, et al., CipherGPT: secure twoparty GPT inference, Cryptology ePrint Archive [Online]. Available, https:eprint.iacr.g20231147, May 2023.
[92] Y. Dong, W.J. Lu, Y.C. Zheng, et al., PUMA: secure inference of LLaMA7B in five minutes [Online]. Available, https:arxiv.gabs2307.12533, July 2023.
[93] H.C. Sun, J. Li, H.Y. Zhang, zkLLM: zero knowledge proofs f large language models, in: Proc. of the on ACM SIGSAC Conf. on Computer Communications Security, Salt Lake City, USA, 2024, pp. 4405–4419.
[94] J. Kirchenbauer, J. Geiping, Y.X. Wen, J. Katz, I. Miers, T. Goldstein, A watermark f large language models, in: Proc. of the 40th Intl. Conf. on Machine Learning, Honolulu, USA, 2023, pp. 17061–17084.
[95] H.W. Yao, J. Lou, Z. Qin, K. Ren, PromptCARE: prompt copyright protection by watermark injection verification, in: Proc. of the IEEE Symposium on Security Privacy, San Francisco, USA, 2024, pp. 845–861.
[96] X.D. Zhao, P.V. Ananth, L. Li, Y.X. Wang, Provable robust watermarking f AIgenerated text, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–35.
[97] S. Min, S. Gururangan, E. Wallace, et al., SILO language models: isolating legal risk in a nonparametric dataste, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–27.

Set citation alerts for the article
Please enter your email address