• Journal of Electronic Science and Technology
  • Vol. 23, Issue 1, 100301 (2025)
Ran Zhang, Hong-Wei Li*, Xin-Yuan Qian, Wen-Bo Jiang, and Han-Xiao Chen
Author Affiliations
  • School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
    DOI: 10.1016/j.jnlest.2025.100301
    Citation: Ran Zhang, Hong-Wei Li, Xin-Yuan Qian, Wen-Bo Jiang, Han-Xiao Chen. On large language models safety, security, and privacy: A survey[J]. Journal of Electronic Science and Technology, 2025, 23(1): 100301

    Abstract

    The integration of artificial intelligence (AI) technology, particularly large language models (LLMs), has become essential across various sectors due to their advanced language comprehension and generation capabilities. Despite their transformative impact in fields such as machine translation and intelligent dialogue systems, LLMs face significant challenges. These challenges include safety, security, and privacy concerns that undermine their trustworthiness and effectiveness, such as hallucinations, backdoor attacks, and privacy leakage. Previous works often conflated safety issues with security concerns. In contrast, our study provides clearer and more reasonable definitions for safety, security, and privacy within the context of LLMs. Building on these definitions, we provide a comprehensive overview of the vulnerabilities and defense mechanisms related to safety, security, and privacy in LLMs. Additionally, we explore the unique research challenges posed by LLMs and suggest potential avenues for future research, aiming to enhance the robustness and reliability of LLMs in the face of emerging threats.

    1 Introduction

    In today’s digital age, artificial intelligence (AI) technology has emerged as a pivotal driving force across a wide array of industries, revolutionizing business operations and services. Among the various AI advancements, large language models (LLMs) [1–4] stand out as groundbreaking innovations, leading a transformation with their exceptional language comprehension and generation capabilities. These models have demonstrated remarkable potential in diverse applications, such as machine translation [1], text generation [2], and recommendation systems [4]. Their ability to understand and produce human-like text has paved the way for significant advancements in these areas, making tasks like automated customer service, real-time translation, and content creation more efficient.

    Despite their impressive capabilities, LLMs are not without challenges. They face significant safety, security, and privacy issues that may undermine the trust and reliability of their applications. One of the most notable safety concerns is hallucination [5], where the model generates plausible but incorrect or nonsensical information. This issue can lead to the dissemination of false information, which is especially problematic in critical applications like healthcare or legal advice. Additionally, LLMs are vulnerable to backdoor attacks [6], a security threat in which malicious actors manipulate the model’s behavior to produce harmful outputs or expose sensitive information. Privacy leakage is a typical privacy problem [7], where these models may inadvertently reveal confidential data from their training corpus, posing serious risks to data privacy.

    Given these pressing challenges, it is crucial to conduct a thorough and comprehensive study to identify, categorize, and address the vulnerabilities inherent in LLMs. This study should involve an in-depth analysis of the various safety, security, and privacy threats, offering a clear classification and understanding of each issue. Furthermore, it should explore potential defense mechanisms and strategies to mitigate these risks, ensuring that LLMs are used safely and responsibly. By tackling these challenges, we can enhance the robustness and reliability of LLMs, unlocking their full potential while safeguarding against emerging threats.

    Several surveys have addressed security and privacy issues in LLMs [8–10], but these works exhibit some shortcomings. First, some surveys [9] often confuse security with privacy issues, such as incorrectly classifying backdoor attacks as privacy problems. Second, some surveys classify certain attacks repetitively [10], such as jailbreak and adversarial attacks, without distinguishing between them. Third, the concept of safety is often overlooked [8], with some safety issues being incorrectly classified as security, resulting in an imprecise analysis. These issues may lead to several negative consequences. Misidentifying threats, for example, could lead to adversarial attacks being mistaken for jailbreak attempts, prompting the use of ineffective defensive measures that fail to address the actual issue. Furthermore, the development of security solutions and strategies relies on a clear understanding of prevalent attacks. Without an accurate classification framework, researchers and developers may struggle to prioritize their efforts and address the most critical areas. Additionally, varying approaches to reporting and analyzing attack data across organizations and industries make it difficult to compare data, limiting the ability to identify trends and develop effective defense strategies.

    To address these problems, we propose a new classification framework to provide systematic guidance for building a more robust, secure, and reliable LLM system. Specifically, we begin by providing clearer and more reasonable definitions of safety, security, and privacy within the context of LLMs, as shown in Fig. 1. Safety refers to the model’s inherent reliability and robustness in the absence of external threats. Security pertains to the model’s ability to withstand attacks from adversaries, which may lead to errors or harmful outputs. Privacy involves safeguarding training data and model parameters, ensuring that sensitive information is neither disclosed nor leaked. In contrast to previous works [8–10], our framework offers a more accurate and comprehensive overview of safety, security, and privacy issues in LLMs, along with their respective defense mechanisms. Additionally, we explore the unique research challenges associated with LLMs and outline potential avenues for future research.


    Figure 1. Definition of safety, security, and privacy in LLMs.

    Our contributions are summarized as follows:

    • We provide clearer definitions of safety, security, and privacy within the context of LLMs. This taxonomy helps researchers better distinguish between specific types of attacks and adopt appropriate defense strategies.

    • For safety, we conduct a comprehensive survey of inherent safety issues in LLMs, such as toxicity, bias, hallucination, and jailbreak, along with mitigation methods for each.

    • For security, we examine the security of LLMs under active attacks, including backdoor, poisoning, and adversarial attacks, and investigate corresponding defense mechanisms for these threats.

    • For privacy, we explore the privacy concerns in LLMs, focusing on privacy leakage and active attacks like membership inference and data extraction attacks. We also summarize privacy-preserving techniques and watermarking methods.

    The rest of this paper is organized as follows. Section 2 provides an introduction to the background of LLMs. In Section 3, we define safety, security, and privacy in the context of LLMs. In Section 4, we detail the safety problems and their corresponding defenses. Section 5 covers the security issues and the associated defenses. In Section 6, we examine the privacy problems and the defense methods. Finally, Section 7 offers suggestions for future research directions and Section 8 concludes this survey.

    2 Background

    LLMs are AI systems trained on vast amounts of text data to understand and generate language in a manner that closely resembles human communication. These models are capable of performing a wide range of natural language processing (NLP) tasks. The development of LLMs began with simpler models like recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), which were effective at handling sequential data but struggled with long-range dependencies. A major breakthrough came with the introduction of the Transformer model by Vaswani et al. in 2017 [11], which revolutionized the field. Transformers use self-attention mechanisms to process input sequences, enabling them to capture context and relationships between words more effectively, regardless of how far apart they appear in the sequence. As LLMs evolved, their size and complexity grew, resulting in models with billions of parameters capable of capturing intricate linguistic nuances. LLMs are now applied across a broad spectrum of tasks, including real-time translation, content creation, complex question-answering systems, and sentiment analysis. They are transforming how we interact with technology, facilitating more natural and effective communication between humans and machines.
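    To make the self-attention computation concrete, the following minimal NumPy sketch implements standard scaled dot-product attention, softmax(QK^T/sqrt(d_k))V, on toy data. It is purely illustrative: in an actual Transformer, the queries, keys, and values come from separate learned projections of the token embeddings, and multiple attention heads run in parallel.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise token-to-token similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V                              # each output row mixes all value vectors

# Toy example: a sequence of 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
# In a real Transformer, Q, K, and V are separate learned linear projections of X.
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```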

    As shown in Fig. 2, the life cycle of LLMs primarily follows a three-stage process:


    Figure 2. Training and inference phases of LLMs.

    1) Pre-training stage. During the pre-training phase, LLMs are trained on a massive, diverse dataset including books, online forums, news articles, and more. This stage is not task-specific; instead, its goal is to teach the model the underlying structures and patterns of human language. Common techniques, such as masked language modeling (predicting missing words in a sentence) and next-sentence prediction (learning relationships between sentences), are employed; a toy sketch of the masking objective appears after this overview. This phase is crucial for building a broad understanding of language, which serves as the foundation for further task-specific training.

    2) Fine-tuning stage. Once pre-trained, an LLM undergoes fine-tuning to adapt it to specific tasks. This involves training the model on a smaller, more focused dataset relevant to the desired application. For example, if the task is sentiment analysis, the fine-tuning dataset would consist of text labeled with sentiments. Fine-tuning adjusts the model’s weights to better align with task-specific objectives, improving accuracy and relevance for the intended application.

    3) Inference stage. After fine-tuning, the LLM is ready for inference, where it is applied to make predictions or generate responses in real-world scenarios. At this stage, the model processes new, unseen data and uses the knowledge acquired during pre-training and fine-tuning to generate meaningful outputs. Whether translating text, summarizing documents, or engaging in conversation, the inference stage is where the model’s capabilities are put into practical use, delivering value to users and systems.

    Each stage plays a vital role in the development of effective LLMs. Pre-training provides a deep understanding of language, fine-tuning tailors that understanding for specific tasks, and inference demonstrates the model’s ability to apply this knowledge in real-world applications.
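    As a toy illustration of the masked language modeling objective used in the pre-training stage, the sketch below randomly hides a fraction of the input tokens; during pre-training, the model is scored on recovering the originals at the masked positions. The example is illustrative only and omits tokenization details and the next-sentence prediction objective.

```python
import random

random.seed(0)

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Toy masked-language-modeling setup: hide roughly 15% of the tokens and
    record the originals as prediction targets at the masked positions."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)       # target the model must recover
        else:
            inputs.append(tok)
            labels.append(None)      # position not scored
    return inputs, labels

sentence = "large language models learn the structure of human language from raw text".split()
masked_inputs, targets = mask_tokens(sentence)
print(masked_inputs)
print(targets)
```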

    Here is an overview of the most popular LLMs currently:

    • Generative Pre-trained Transformer (GPT). Developed by OpenAI, the GPT series encompasses GPT-1, GPT-2, GPT-3, and the latest GPT-4. These models are trained on extensive datasets to predict the next word in a sentence, with applications spanning writing, translation, summarization, and coding. OpenAI provides application programming interfaces (APIs) for integrating GPT models into various applications.

    • Llama. Meta’s Llama series emphasizes open-access research and efficiency. These models are designed to deliver competitive performance while democratizing access to high-quality LLMs. They are gaining popularity in academic and open-source communities for research and development, fostering a collaborative approach to AI advancements.

    • Claude. Anthropic’s Claude is crafted with a focus on safety and alignment with human values. Claude models prioritize user-centric responses, ethical considerations, and harm prevention, making them a choice for applications where ethical AI is paramount.

    • Mistral. Mistral 7B is a 7-billion-parameter language model (LM) that surpasses similarly sized Llama models on all evaluated benchmarks. It also offers a fine-tuned variant dedicated to following instructions, enhancing its utility in specific applications.

    There are many other popular LLMs, such as Gemini and Pathways Language Model (PaLM). Each of these models brings unique features and capabilities to the table, contributing to the diverse landscape of AI language technologies. However, due to space constraints, we will omit a detailed description of these models.

    3 Definitions for safety, security, and privacy

    We define safety, security, and privacy in the context of LLMs as follows:

    • Safety refers to the model’s inherent ability to function correctly and consistently without unintended behaviors or errors in the absence of external threats. It ensures that the model adheres to ethical guidelines, avoids harmful outputs, and operates within its intended design parameters.

    • Security in LLMs addresses the model’s vulnerability to intentional attacks by adversaries. It focuses on the model’s resilience and robustness to manipulations that could lead to incorrect, misleading, or malicious outputs.

    • Privacy concerns the protection of sensitive information, including the model’s training data and parameters and users’ personal information. It aims to prevent unauthorized access, use, or disclosure of private information.

    Building on these definitions, we provide a comprehensive overview of the vulnerabilities and defense mechanisms associated with LLMs. As illustrated in Fig. 3, we categorize current research into three main areas: Safety, security, and privacy. Under each category, we identify specific issues and outline corresponding defensive measures to mitigate these vulnerabilities.


    Figure 3. Overview of safety, security, and privacy issues and their defense methods.

    Table 1 presents the structure of this study, outlining the safety, security, and privacy threats faced by LLMs [6,7,12–25]. For safety, we examine challenges such as model toxicity, bias, hallucination, and jailbreak. Defense mechanisms in this category include robust training methods, bias detection, and data sanitization techniques. For security, we delve into issues like backdoor attacks, model poisoning, and adversarial attacks. Defense strategies in this area involve adversarial training, anomaly detection, and secure model update protocols. Regarding privacy, we explore threats such as privacy leakage, inference attacks, and extraction attacks. To mitigate these threats, we discuss the implementation of privacy-preserving techniques like differential privacy (DP), secure multi-party computation, and federated learning.

    Category | Feature | Description | Example
    Safety | Toxicity and bias | Toxicity and bias refer to the inappropriate or prejudiced content that may be generated or amplified by LLMs due to biased training data. This can result in harmful or discriminatory outcomes. | [12,13]
    Safety | Hallucination | Hallucination refers to the generation of text that is nonsensical, irrelevant, or factually incorrect, typically arising from the model’s inability to accurately understand or process the context of the input data. | [14,15]
    Safety | Jailbreak | Jailbreak attack refers to the exploitation of vulnerabilities to bypass the model’s intended constraints and generate content that violates its operational guidelines or ethical safeguards. | [16,17]
    Security | Backdoor & poisoning attacks | Backdoor and poisoning attacks refer to the malicious insertion of hidden triggers or corrupted data during the training process, which can cause the model to produce harmful or targeted outputs when prompted with specific inputs. | [6,18]
    Security | Adversarial attack | Adversarial attack is a method where carefully crafted inputs are used to deceive the model into making errors or generating unintended outputs, often by exploiting the model’s weaknesses or vulnerabilities. | [19,20]
    Privacy | Privacy leakage | Privacy leakage occurs when sensitive or personal information is inadvertently disclosed through the model’s responses, due to the model’s exposure to or training on data containing such information. | [7,21]
    Privacy | Inference attack | Inference attacks involve the exploitation of model responses to deduce sensitive information about the training data or the underlying algorithms, potentially compromising privacy or security. | [22,23]
    Privacy | Extraction attack | Extraction attacks are attempts to reverse-engineer or illicitly obtain proprietary information, such as training data or model parameters, by interacting with the model’s outputs. | [24,25]

    Table 1. Classification of safety, security, and privacy issues in LLMs.

    4 Safety problems and defenses

    4.1 Toxicity and bias

    LLMs, while incredibly powerful, can sometimes exhibit toxicity and bias that reflect the data they were trained on. These models learn patterns from the text data, including discriminatory language, stereotypes, or offensive content. Consequently, when generating text, LLMs might inadvertently produce harmful, biased, or inappropriate outputs.

    The issue of toxicity [12] arises when the model generates content that is offensive or abusive, or promotes harmful behavior. This includes hate speech, explicit language, or content that incites violence or discrimination. Biases in LLMs often reflect those present in the training data. If the data contains gender, racial, or cultural biases, the model may perpetuate these biases in its outputs, such as associating certain professions predominantly with a specific gender or making stereotypical assumptions about ethnic groups [13]. Additionally, LLMs may struggle with fairness and inclusivity if they are not trained on diverse or representative data, leading to the underrepresentation or misrepresentation of certain groups in the generated content.

    Deshpande et al. [12] reported that the toxicity of ChatGPT’s output can escalate significantly—up to sixfold—when personas are assigned. This increase leads to the propagation of harmful dialogues, incorrect stereotypes, and offensive opinions. Such escalations are particularly alarming as they risk defaming the assigned personas and endangering unsuspecting users, especially in sensitive domains like healthcare, education, and customer service. The study also highlights inherent discriminatory biases within the model, with specific groups being disproportionately targeted. Shaikh et al. [26] investigated the impact of chain of thought (CoT) prompts on LLMs in socially sensitive domains. While CoT improves performance in logical reasoning tasks, it also significantly increases the likelihood of models generating harmful or biased outputs in zero-shot reasoning for sensitive domains. The study finds that this issue becomes more pronounced with larger models but can be mitigated with improved instruction. Dong et al. [27] presented a novel framework for evaluating the gender bias in LLMs using three distinct input strategies: Template-based, LLM-generated, and naturally-sourced. Their findings reveal that larger LLMs do not guarantee fairness; all models tested exhibit both explicit and implicit gender biases, even without explicit gender stereotypes in the input.

    Defenses. Several studies [2,28,29] have emerged to address these issues. Gallegos et al. [29] provided an in-depth analysis of the challenges and methodologies for mitigating social biases in LLMs. They presented a comprehensive survey that categorizes existing literature into three main areas: Metrics for bias evaluation, datasets for bias evaluation, and techniques for bias mitigation. Bhan et al. [30] explored the application of explainable AI (XAI) methods, specifically local feature importance (LFI) and counterfactual generation, to address textual toxicity. They proposed a method called CF-Detoxtigtec, which leverages a counterfactual (CF) example generator, token importance guided text counterfactuals (TIGTEC), to detoxify text by identifying minimal changes that convert toxic content into non-toxic text while preserving its original meaning. Vishwamitra et al. [31] presented HATEGUARD, which employs CoT prompting to swiftly and effectively moderate new waves of online hate speech triggered by events such as the COVID-19 pandemic, the 2021 US Capitol insurrection, and the 2022 Russian invasion of Ukraine. HATEGUARD’s zero-shot detection capability allows it to adapt to new derogatory terms and targets with limited training samples. Despite advancements, existing methodologies often struggle with generalizability across diverse contexts and rapidly evolving language. Additionally, detection systems like HATEGUARD [31] may face challenges in keeping up with emerging hate speech trends and new derogatory terms.
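    As a simple illustration of how a post-generation moderation filter can be wrapped around an LLM, complementary to the detection approaches surveyed above, the sketch below screens model outputs with a toxicity classifier before release. The `toxicity_score` function and its blocklist are hypothetical placeholders for any trained toxicity or bias detector; this is not a method from the cited works.

```python
def toxicity_score(text: str) -> float:
    """Hypothetical toxicity classifier returning a score in [0, 1].
    A real system would call a trained detector; a tiny blocklist stands in here."""
    blocklist = {"hateful", "slur"}
    hits = sum(word in blocklist for word in text.lower().split())
    return min(1.0, hits / 3.0)

def moderate(generated_text: str, threshold: float = 0.5) -> str:
    """Release the model output only if it scores below the toxicity threshold."""
    if toxicity_score(generated_text) >= threshold:
        return "[Response withheld: the draft output was flagged as potentially toxic.]"
    return generated_text

print(moderate("Here is a neutral, helpful answer."))
```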

    4.2 Hallucination

    Hallucinations [14] in LLMs pose a critical challenge where the model generates text that, while syntactically coherent and semantically plausible, is factually incorrect or entirely fictitious. Reference [5] categorizes hallucinations into two main groups: Factuality hallucination and faithfulness hallucination. Factuality hallucination emphasizes the discrepancy between generated content and verifiable real-world facts, typically manifesting as factual inconsistency or fabrication. Faithfulness hallucination refers to the divergence of generated content from user instructions or the context provided by the input and self-consistency within generated content.

    The causes of hallucinations [15] stem from three main areas: Data-related issues such as misinformation, biases, and knowledge boundaries within the training data; training-related factors including architectural flaws, suboptimal training objectives, and misalignments between model capabilities and human preferences; and inference-related challenges like the inherent randomness of decoding strategies and imperfect representations during the decoding phase. These multifaceted origins contribute to the generation of content that may be inconsistent with real-world facts or user inputs.

    The hallucination issue can have detrimental effects across various domains. In the healthcare domain [32], medical professionals could be misguided by inaccurate health-related advice or data generated by LLMs, potentially impacting patient care and safety. In the financial domain [33], financial advisors and analysts could be misled by hallucinated financial data or market insights, leading to incorrect investment decisions with significant financial repercussions.

    Defense. Addressing hallucinations [34] involves improving training data quality [2,35], applying reinforcement learning from human feedback [36], leveraging knowledge bases for verification [16], employing multi-agent interaction [37], and refining decoding strategies [38]. For instance, Jones et al. [39] introduced SYNTRA, a novel method designed to mitigate the hallucination tendencies of LLMs in abstractive summarization tasks. SYNTRA tackles the challenge by creating a synthetic task that effectively elicits and measures hallucinations. Through prefix-tuning, the method optimizes the LLM’s system message and applies this learned message to real-world scenarios, significantly reducing hallucinations across various models and tasks. This approach uses synthetic data to provide clear supervision signals, guiding the model to produce outputs more closely aligned with the given context. However, it depends on the careful selection of synthetic tasks that align well with the challenges of real-world tasks [40], and synthetic tasks are prone to design bias. Additionally, Gou et al. [37] presented CRITIC, which empowers LLMs to self-verify and self-correct their outputs by interacting with external tools. This method addresses issues such as inconsistencies and the generation of inaccurate or toxic content. CRITIC enables LLMs to iteratively refine their outputs through a verify-then-correct process, utilizing resources like search engines and code interpreters. The method has been tested on tasks ranging from question answering to mathematical program synthesis and toxicity reduction, showing significant improvements in performance across different LLMs. The innovation lies in its use of in-context learning (ICL) and tool interaction, making it a practical way to enhance model reliability without requiring additional training or data. However, these approaches each have inherent limitations. For instance, improving training data may not eliminate biases, human feedback can be subjective, and techniques like CRITIC [37] may face challenges with real-time verification accuracy, potentially leading to incorrect outputs if external tools are unreliable.
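    The following schematic sketch captures the verify-then-correct loop used by tool-assisted approaches such as CRITIC [37]. The functions `llm_generate` and `tool_verify` are hypothetical placeholders standing in for an arbitrary LLM call and an external tool (e.g., a search engine or code interpreter); they are not the interfaces used in the cited work.

```python
def llm_generate(prompt: str) -> str:
    return "draft answer to: " + prompt      # placeholder for any LLM call

def tool_verify(answer: str) -> str:
    return ""                                # placeholder critique; empty means no issue found

def verify_then_correct(question: str, max_rounds: int = 3) -> str:
    """Generate a draft, ask an external tool to critique it, and let the model
    revise its own output until the critique comes back empty."""
    answer = llm_generate(question)
    for _ in range(max_rounds):
        critique = tool_verify(answer)       # ground the check in an external tool
        if not critique:
            break
        answer = llm_generate(
            f"{question}\nPrevious answer: {answer}\nCritique: {critique}\nRevise:"
        )
    return answer

print(verify_then_correct("In which year was the Transformer architecture introduced?"))
```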

    4.3 Jailbreak

    In the realm of LLMs, jailbreak [16] denotes a collection of techniques employed to evade the safety mechanisms and constraints built into these models. This bypass allows for the generation of content that would typically be prohibited, potentially including harmful, biased, or illegal information. The phenomenon of jailbreak in LLMs has been the subject of extensive research, with scholars uncovering various tactics to coerce models into producing inappropriate content. For example, Wei et al. [17] theorized two primary modes of failure in safety training: “competing objectives” and “mismatched generalization”. The former occurs when a model’s inherent capabilities clash with its intended safety goals, while the latter arises when the training for safety does not adequately prepare the model for the diverse domains in which it operates. By leveraging these insights, the researchers designed and tested attacks on advanced models, revealing persistent vulnerabilities despite rigorous red-teaming and safety measures. Notably, most jailbreaks are the result of human ingenuity, requiring significant manual effort rather than automated processes. Zou et al. [41] introduced an automated method for attacking aligned LMs by prompting them to generate objectionable content. This method involves appending adversarial suffixes to prompts, which compels the model to generate affirmative responses to queries that could be harmful. Employing a blend of greedy search and gradient-based optimization, this method automates the generation of adversarial suffixes, enhancing previous approaches to automatic prompt generation.

    Current research on jailbreak vulnerabilities of chatbots [42,43] is heavily concentrated on ChatGPT, leaving a notable gap in understanding the risks associated with other commercial LLM chatbots such as Bing Chat and Bard. Additionally, there is little publicly available information about the jailbreak prevention strategies employed by these commercial solutions. To address this gap, Deng et al. [44] conducted an empirical study to evaluate the efficacy of current jailbreak attacks, highlighting the deficiencies in existing defenses. They introduced MASTERKEY, an innovative framework that employs a time-based analysis technique to deconstruct and comprehend the defense mechanisms used by LLM chatbots. Furthermore, they presented an automated approach for generating universal jailbreak prompts. By fine-tuning an LLM with a curated set of jailbreak prompts, MASTERKEY is capable of crafting prompts that can effectively navigate around the defenses of various LLM chatbots.

    Defense. Researchers have crafted frameworks to systematically assess and counter jailbreak attacks in LLMs. For instance, Zhang et al. [45] proposed “Safe Unlearning”, a framework targeting three key objectives: Reducing the likelihood of harmful responses, increasing the likelihood of rejecting harmful queries, and preserving the model’s performance on benign queries. This approach employs an adaptive unlearning loss to stabilize training, preventing the model from producing nonsensical outputs. Similarly, Lu et al. [46] concentrated on unlearning harmful knowledge without sacrificing general knowledge or safety alignment. They utilized gradient ascent on harmful answers and implemented techniques to ensure the model retains its capacity to comprehend entities and reject harmful queries. Conversely, other researchers focus on defending against jailbreak attacks through prompt engineering. Robey et al. [47] presented SmoothLLM, a defense mechanism that employs character-level perturbation of prompts, markedly decreasing the success rate of attacks to below 1% without degrading the model’s performance on legitimate inquiries. Zhou et al. [48] introduced robust prompt optimization (RPO), a method designed to fortify model inputs against being manipulated to produce harmful outputs. RPO utilizes gradient-based optimization to develop a defensive suffix, significantly bolstering the model’s resilience against both recognized and emerging jailbreaks. This method effectively diminishes attack success rates with minimal impacts on the model’s benign functionality and exhibits robust transferability across various models.

    However, these defense mechanisms have limitations. First, many of them rely on prior knowledge or assumptions about the attack type. Additionally, these methods often introduce computational overhead, such as gradient-based optimizations or unlearning processes. Some defenses, like SmoothLLM and RPO, may degrade model performance on benign queries or result in subtle user experience issues. Moreover, unlearning may unintentionally erase useful knowledge.
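    To make one of these defenses concrete, the sketch below implements the core randomized-smoothing idea behind SmoothLLM [47]: query the model on several randomly perturbed copies of the incoming prompt and aggregate the outcomes by majority vote, since character-level perturbation tends to break brittle adversarial suffixes. The functions `llm_respond` and `is_refusal` are hypothetical placeholders for the protected model and a refusal detector.

```python
import random
import string

random.seed(0)

def perturb(prompt: str, rate: float = 0.1) -> str:
    """Randomly swap a fraction of characters in the prompt."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.ascii_letters)
    return "".join(chars)

def llm_respond(prompt: str) -> str:
    return "Sure, here is a harmless answer."     # placeholder model call

def is_refusal(response: str) -> bool:
    return response.lower().startswith("i cannot")

def smoothed_defense(prompt: str, copies: int = 5) -> str:
    """Query the model on several perturbed copies and take a majority vote."""
    responses = [llm_respond(perturb(prompt)) for _ in range(copies)]
    refusals = sum(is_refusal(r) for r in responses)
    if refusals > copies // 2:                    # most perturbed copies were refused
        return "Request blocked by the smoothing defense."
    return next(r for r in responses if not is_refusal(r))

print(smoothed_defense("Please summarize this article."))
```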

    5 Security problems and defenses

    5.1 Backdoor and poisoning attacks

    Backdoor attacks involve surreptitiously embedding a hidden trigger within the model’s training data or architecture. Once activated by specific input patterns (the “trigger”), the model exhibits unintended behaviors, such as generating misleading or malicious outputs. Poisoning attacks contaminate the training dataset with maliciously crafted samples, causing the model to learn incorrect associations or biases. The two strategies are distinct but closely related: Backdoor attacks implant stealthy triggers to elicit specific, manipulated outputs, while poisoning attacks corrupt the training data to degrade overall model performance or induce biases. Since both exploit vulnerabilities in the training process, we make no special distinction between them in this paper.

    A variety of backdoor attack algorithms targeting LLMs have been documented [6]. For example, instruction poisoning and ICL poisoning leverage the model’s adaptability to new tasks without explicit parameter updates. Xu et al. [49] demonstrated that attackers can control model behavior by inserting malicious instructions into the training data, achieving high success rates without altering the actual data or labels. Reference [50] shows that by searching extensive text collections for inputs exhibiting significant gradient magnitudes under the approximation of a bag-of-n-grams LM, attackers can succeed with surprisingly small datasets—sometimes as few as a hundred correctly labeled data points. Furthermore, the efficacy of these attacks is expected to increase as the size and complexity of LMs grow. Zhao et al. [51] demonstrated that ICL, despite its efficacy in various NLP tasks, is susceptible to backdoor attacks where adversaries can manipulate model behavior by poisoning the demonstration context. The paper introduces ICLAttack, which includes two strategies: Poisoning demonstration examples and poisoning prompts, both capable of inducing the model to generate targeted outputs without the need for model fine-tuning. The method preserves the correct labeling of examples, enhancing the stealth of the attack. Additionally, some works leverage model-editing techniques to inject backdoors. Reference [18] proposes BadEdit, which builds shortcuts connecting triggers to their corresponding attack targets by directly manipulating the model’s weights. It requires minimal samples and adjustments while maintaining model performance and stability.

    Defense. In the ongoing battle to secure LLMs against backdoor and poisoning attacks, recent research has unveiled a variety of innovative defense mechanisms. Xi et al. [52] delved into the susceptibility of pre-trained language models (PLMs) to backdoor attacks, particularly in few-shot learning contexts. They introduced the masking-difference projection (MDP) method, which pinpoints significant shifts in model representations that suggest poisoning. It leverages the disparity in masking sensitivity between tainted and pristine samples. However, continuously monitoring representation shifts could be computationally expensive. Li et al. [53] pointed out the necessity of presupposing the attack mechanism, such as the insertion of irregular tokens as backdoor triggers. Recognizing that a typical backdoor attack creates a shortcut from the trigger to the target output, bypassing logical reasoning, they proposed chain-of-scrutiny (CoS). It detects inconsistencies between the reasoning steps generated by LLMs and their final outputs to identify malicious manipulations. However, reference [53] is specifically tailored for tasks such as text classification, which significantly limits its broader applicability. To address this limitation, Li et al. [54] proposed simulate and eliminate (SANDE) to remove backdoors for task-agnostic LLMs. SANDE includes overwrite supervised fine-tuning (OSFT) for known triggers and a two-stage framework for unknown triggers. The framework simulates the trigger’s behavior using a learnable parrot prompt and then eliminates the backdoor mapping using OSFT. However, this approach requires retraining LLM and prior knowledge of the attacker’s desired contents. Li et al. [55] further contributed to the field with CLEANGEN, an innovative defense mechanism applied at inference time. The underlying principle is that compromised LLMs tend to assign disproportionately high probabilities to tokens that reflect the attacker’s intended content. CLEANGEN capitalizes on this insight by identifying and rejecting tokens that are likely influenced by embedded triggers. These tokens are subsequently replaced with outputs from a reference model, ensuring that the LLM’s outputs are free from alignment with the attacker’s agenda. However, this method depends on the availability of a reference model that has not been compromised by the same adversary.

    In summary, while each of these defense methods offers valuable approaches to securing LLMs against backdoor and poisoning attacks, their limitations in terms of generalization, computational overhead, reliance on prior knowledge, and vulnerability to evolving attacks make them less than fully robust for all scenarios.
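    As an illustration of the inference-time intuition behind CLEANGEN [55], the sketch below compares the probability the suspect model assigns to a candidate token with that of a clean reference model and rejects tokens on which the suspect model is suspiciously over-confident. Both probability functions and their values are hypothetical placeholders, not the actual models or scores from the cited work.

```python
def suspect_prob(context: str, token: str) -> float:
    """Placeholder: probability the possibly backdoored model assigns to a token."""
    return 0.9 if token == "attacker_payload" else 0.1

def reference_prob(context: str, token: str) -> float:
    """Placeholder: probability a clean reference model assigns to the same token."""
    return 0.05 if token == "attacker_payload" else 0.1

def accept_token(context: str, token: str, ratio_threshold: float = 4.0) -> bool:
    """Reject tokens to which the suspect model assigns disproportionately high
    probability relative to the reference model; such tokens are likely
    trigger-influenced and can be replaced by the reference model's choice."""
    ratio = suspect_prob(context, token) / max(reference_prob(context, token), 1e-9)
    return ratio < ratio_threshold

print(accept_token("ctx", "attacker_payload"))  # False: likely influenced by a trigger
print(accept_token("ctx", "benign_word"))       # True
```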

    5.2 Adversarial attack

    Adversarial attacks [19,20,56] utilize meticulously crafted inputs to manipulate the model behavior and output. These attacks manifest in various forms, from prompt injections that distort model responses to token manipulations leading to incorrect predictions. Such attacks can enable the generation of harmful content, the inadvertent disclosure of sensitive information, or the evasion of critical safety mechanisms. For example, Xu et al. [56] proposed a groundbreaking method called PromptAttack, designed to audit the adversarial robustness of LLMs. This technique ingeniously converts textual adversarial attacks into an attack prompt, which deceives the victim LLM into generating an adversarial sample capable of self-deception. The attack prompt comprises three key elements: The original input, the attack objective, and strategic attack guidance. To preserve the semantic essence of the adversarial examples, a fidelity filter is meticulously applied. The attack’s potency is further amplified by integrating adversarial examples across a spectrum of perturbation levels. Building upon this foundation, Sadasivan et al. [19] introduced BEAST, which operates on interpretable parameters. It allows for a nuanced balance between the swiftness of the attack, its success rate, and the intelligibility of the adversarial prompts. BEAST demonstrates versatility by executing untargeted adversarial attacks that provoke hallucinations in aligned LLMs, as well as enhancing existing membership inference attacks. Raina et al. [20] examined the vulnerability of LLMs used for zero-shot text assessment, discovering that both scoring and comparative assessment are susceptible to universal adversarial phrases that can manipulate the models into assigning high scores regardless of actual text quality. Their research reveals the transferability of these attacks across various LLMs, indicating a significant robustness issue.

    Defense. Defending against adversarial attacks requires a multifaceted approach, including robust input validation to detect and mitigate malicious prompts [57], the use of auxiliary models or specialized algorithms to identify and counteract attempts to exploit the model’s vulnerabilities, and secure training practices to reduce the risk of information leakage [58,59]. For instance, Kumar et al. [57] introduced the “erase-and-check” framework, a method that systematically erases tokens from a given prompt and scrutinizes the resulting subsequences with a safety filter. An input prompt is flagged as harmful if any subsequence or the prompt itself is identified by the filter. Brown et al. [58] proposed a defense method that utilizes self-evaluation [60]. This method does not require model fine-tuning; instead, it uses pre-trained models to assess the inputs and outputs of a generator model. Similarly, Zhao et al. [59] proposed a two-stage training framework to enhance the robustness of LLMs: Instruction-augmented supervised fine-tuning (SFT) and consistency alignment training (CAT). The first stage involves paraphrasing the original instructions and pairing them with the original responses to form new training samples, which are then used to fine-tune the model. In the second stage, the model generates candidate responses to paraphrased instructions, and consistency scores are used to differentiate between good and bad responses. The model is then optimized using an offline training algorithm.

    However, these approaches often struggle with scalability and computational efficiency, as techniques like token erasure and self-evaluation can be resource-intensive. There is also a risk of false positives or negatives, where benign inputs are unnecessarily flagged, or malicious ones go undetected. Furthermore, methods like instruction-augmented fine-tuning may suffer from overfitting to specific attack patterns, limiting generalization to new threats.
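    For concreteness, the sketch below shows the erase-and-check idea [57] in its simplest form: erase up to a fixed number of trailing tokens from the prompt and flag the prompt as harmful if any resulting subsequence (or the prompt itself) is flagged by a safety filter, which targets adversarial suffixes appended to a harmful request. The `safety_filter` function is a hypothetical placeholder for any prompt-level safety classifier.

```python
def safety_filter(prompt: str) -> bool:
    """Hypothetical safety classifier; a toy substring check stands in here."""
    return "build a weapon" in prompt.lower()

def erase_and_check(prompt: str, max_erase: int = 20) -> bool:
    """Flag the prompt if it, or any version with up to `max_erase` trailing
    tokens erased, is judged harmful by the safety filter."""
    tokens = prompt.split()
    for i in range(min(max_erase, len(tokens)) + 1):
        candidate = " ".join(tokens[: len(tokens) - i])
        if safety_filter(candidate):
            return True
    return False

print(erase_and_check("How do I build a weapon xq zr !!"))   # True: suffix erased, core request detected
print(erase_and_check("How do I bake sourdough bread"))      # False
```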

    6 Privacy problems and defenses

    6.1 Privacy leakage

    Privacy leakage is a pressing concern due to the inherent potential of models to inadvertently divulge sensitive information. In a notable case, Samsung Electronics found that sensitive corporate information had been unintentionally disclosed through ChatGPT interactions. Reference [61] finds that Gemini Flash often collects user data beyond what is necessary and has a higher rate of anonymization failures when sharing data with third parties. Privacy leakage can arise from three primary causes: Private information in training data, data memorization, and inference leakage.

    1) Private information in training data. Training LLMs demands massive datasets, often compiled from diverse sources including the Internet and public databases, to learn the intricacies of human language effectively. These datasets may contain sensitive or personally identifiable information (PII), such as personal identifiers, health records, or financial details. Despite efforts at anonymization, there remains a risk of reidentification. Kim et al. [7] present ProPILE, a novel probing tool designed to empower data subjects, or the owners of PII, with awareness of potential PII leakage in LLM-based services. ProPILE enables data subjects to formulate prompts based on their PII to evaluate the level of privacy intrusion in LLMs. However, ProPILE primarily focuses on measuring leakage from training data and does not account for the flow of information from input to output. Mireshghallah et al. [21] proposed CONFAIDE, a benchmark grounded in the theory of contextual integrity and designed to identify critical weaknesses in the privacy reasoning capabilities of instruction-tuned LLMs. CONFAIDE consists of four tiers. Grounded in contextual integrity theory [62], each tier has a set of seed components defining the context, which gradually increases in complexity as the tiers progress. This work shed light on the often-overlooked interactive settings where an LLM may inadvertently expose sensitive input data in inappropriate contexts.

    2) Data memorization. LLMs, with their immense capacity to learn from massive datasets, can unintentionally retain and later reproduce specific pieces of training data, including sensitive personal information. This phenomenon, known as memorization, is a byproduct of their ability to learn complex patterns. Reference [63] addresses the issue of LLMs inadvertently memorizing parts of their training data, which can lead to privacy breaches, reduced utility, and unfairness. The authors demonstrated that memorization significantly increases with model capacity, data duplication, and context length. Their log-linear approach quantifies the extent of memorization, revealing that larger models memorize 2−5× more than smaller ones.

    3) Inference leakage. Even without explicit exposure to private data during interactions, LLMs can infer sensitive details through contextual clues. Their ability to generalize and connect information means that seemingly innocuous inputs can trigger responses that reveal more than intended, especially when the model has learned correlations between different types of data during training. We will explore this issue in greater detail in subsection 6.2.

    Defense. To safeguard privacy in LLMs, essential methods include data unlearning to strip away PII and data deduplication to remove redundant sequences from the training corpus. Reference [64] proposes a technique to erase knowledge of the Harry Potter books from the Llama2-7b model without full retraining. Their method involves three steps: i) training a reinforced model to identify tokens related to the unlearning target, ii) replacing unique expressions with generic terms and generating alternative labels, and iii) fine-tuning the model on these labels to approximate a model untrained on the target data. Results show a significant reduction in the model’s ability to generate Harry Potter content while maintaining performance on common benchmarks. However, it may not generalize well to diverse types of sensitive content, limiting its overall applicability. Reference [65] discusses the challenges of removing sensitive information from PLMs. The authors proposed an attack-and-defense framework to study the deletion of sensitive data directly from model weights. Reference [66] demonstrates that the susceptibility of LLMs to privacy attacks is significantly influenced by the duplication of sequences in their training sets. The authors showed that the frequency with which LLMs regenerate training sequences is superlinearly related to the number of times a sequence appears in the training data. The study also reveals that methods for detecting memorized sequences have near-chance accuracy when applied to non-duplicated training sequences. However, these approaches require significant computational resources, making them challenging to implement at scale.
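    As a minimal illustration of the data deduplication step mentioned above, the sketch below removes exact duplicate documents by hashing. Production pipelines such as those studied in Ref. [66] additionally handle near-duplicates and repeated substrings (e.g., with suffix arrays or MinHash), which this toy version omits.

```python
import hashlib

def deduplicate(documents):
    """Keep only the first occurrence of each exact document."""
    seen, unique_docs = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

corpus = ["alice's record ...", "some news article", "alice's record ..."]
print(len(deduplicate(corpus)))  # 2: the repeated document is dropped, reducing memorization risk
```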

    6.2 Inference attack

    Inference attacks leverage the models’ ability to infer sensitive information from their training datasets. Here, we outline four distinct types of inference attacks: Membership inference attack (MIA), attribute inference attack, data reconstruction attack, and model inversion attack.

    1) Membership inference attack. These attacks [22,67,68] aim to ascertain whether a particular piece of data was included in the model’s training set. By posing queries that pertain to personal information, attackers can discern its presence through the model’s reactions. Duan et al. [22] conducted a comprehensive assessment of MIAs across various LMs trained on the Pile dataset, encompassing models with parameters ranging from 160 million to 12 billion. The study reveals that MIAs only marginally exceed the performance of random guessing across most conditions and model sizes. The authors attributed this underwhelming outcome to the expansiveness of the dataset, the limited number of training iterations, and the ambiguous boundary between member and non-member data points. Mattern et al. [69] proposed a “neighborhood attack” that contrasts the model’s scores for a given sample with those of artificially created neighboring texts. This innovative approach negates the need for access to the training data distribution. However, it presupposes the availability of reference models trained on analogous data, which poses practical challenges in pre-training data detection due to the inaccessibility of the original training data and the prohibitive cost of training such models. Shi et al. [70] introduced a dynamic benchmark termed WIKIMIA and a detection technique called MIN-K% PROB. This method is based on the assumption that an unseen example is more likely to contain outlier words with low probabilities under the LLM, whereas a seen example would exhibit fewer such low-probability terms (a toy sketch of this scoring rule appears at the end of this list). Kaneko et al. [71] addressed the setting in which token likelihoods are not accessible to users. They proposed the sampling-based pseudo-likelihood (SPL) for MIAs, known as SaMIA. This technique calculates SPL utilizing solely the text produced by the LLM, bypassing the need for direct likelihoods. SaMIA leverages the ROUGE-N score to quantify the n-gram match between the target text and multiple LLM outputs, thus assessing the text’s membership in the training data.

    2) Attribute inference attack. This type of attack is a class of privacy threats aimed at discerning specific attributes or characteristics of the data contained within the training set of LLMs. These attacks can reveal general information about the dataset, such as the demographic details of the users or the thematic essence of the content. Staab et al. [23] conducted an extensive study on the ability of pre-trained LLMs to infer personal attributes from textual data. Their research assessed nine leading LLMs’ capabilities to deduce eight distinct personal attributes, demonstrating that these models can achieve performance levels comparable to human experts—significantly reducing the cost associated with human labelers. This raises concerns about the potential for large-scale privacy violations. Moreover, Kandpal et al. [72] delved into the specific risk of user inference attacks, where an attacker’s goal is to ascertain whether a particular user’s data has been involved in the fine-tuning of LLM. They proposed an attack methodology that utilizes a likelihood ratio test statistic, normalized against a reference model. The study also explores the factors that render users susceptible to such attacks, pinpointing outlier users, users with distinctive shared characteristics, and those who have contributed substantially to the fine-tuning dataset as especially vulnerable.

    3) Data reconstruction attack. These attacks represent a more intrusive form of data compromise, aiming to reconstruct entire sequences or datasets from the model’s knowledge base. By strategically querying the model with interconnected prompts, attackers can gather and synthesize information to recreate portions of the training data. Wang et al. [73] concentrated on the scenarios of federated learning and embedded vector databases, proposing “Embed Parrot”, a Transformer-based technique that adeptly reconstructs original inputs from the deep-layer embeddings of models such as ChatGLM-6B and Llama2-7B. They also introduced a defense mechanism to counteract the risks of privacy breaches, underscoring the imperative for stringent privacy safeguards in distributed learning environments.

    4) Model inversion attack. This type of attack seeks to reverse-engineer the model’s training process to extract specific training examples or patterns. These attacks capitalize on the model’s tendency to memorize training data, attempting to recover detailed sequences or sensitive information. Zhang et al. [74] introduced “Text Revealer”, an attack technique designed to reconstruct private texts from the training data by interfacing with the target model. The attack unfolds in two phases: The collection and analysis of public datasets and the perturbation of word embeddings. The adversary’s ultimate objective is to invert the memorized training data from fine-tuned LMs, producing inverted texts that closely resemble the private dataset distribution and are intelligible. Morris et al. [75] demonstrated that the probabilities of subsequent tokens retain substantial information about the preceding text, which can be leveraged for prompt reconstruction. They proposed a method that employs a conditional LM to map these probabilities back to the original tokens, achieving remarkable reconstruction accuracy on the Llama-2 7B model.
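    Returning to the detection side of these attacks, the sketch below implements the MIN-K% PROB scoring rule described under membership inference above [70]: average the log-probabilities of the k% least likely tokens in a candidate text; higher scores suggest the text was seen during training. The per-token log-probabilities are assumed to come from the target LLM, and the numbers here are toy values.

```python
def min_k_percent_prob(token_logprobs, k=0.2):
    """MIN-K% PROB score: mean log-probability of the k% lowest-probability tokens.
    Seen (member) texts tend to contain fewer low-probability outliers."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

# Toy per-token log-probabilities, assumed to come from the target LLM.
seen_text   = [-0.5, -0.7, -0.4, -1.0, -0.6, -0.8]
unseen_text = [-0.5, -0.7, -6.2, -1.0, -5.8, -0.8]
print(min_k_percent_prob(seen_text))    # higher (closer to zero)
print(min_k_percent_prob(unseen_text))  # much lower: contains low-probability outliers
```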

    Defense. Defending against inference attacks on LLMs [76] involves a variety of strategies including data deduplication [66], DP [77], model distillation, adversarial training, text anonymization [78], input obfuscation, and regularization techniques [78]. These methods aim to reduce the model’s ability to memorize and leak sensitive training data, thereby enhancing privacy and security. Moreover, Wang et al. [73] proposed defense mechanisms to curb the misuse of the Embed Parrot technique. Their method incorporates an overlap matrix with the discrete cosine transform (DCT) and its inverse (IDCT) to obscure the sensitive features of the embeddings. However, this defense strategy also has the side effect of affecting the model’s native generative capabilities and increasing its perplexity. More broadly, these approaches often come with performance trade-offs, as techniques like DP and distillation can degrade model accuracy and generative capabilities. Additionally, methods such as adversarial training and input obfuscation can be computationally expensive and may not scale well for large models.

    6.3 Extraction attack

    In the realm of LLMs, extraction attacks emerge as a pivotal concern for data privacy. These assaults are designed to retrieve information from the model’s training data or even its underlying parameters. Although extraction attacks share commonalities with inference attacks, they are distinguished by their unique objectives and methodologies. Extraction attacks are primarily aimed at directly acquiring tangible assets, such as model gradients or sensitive training data. In contrast, inference attacks are more concerned with deducing or uncovering the intrinsic characteristics or attributes of the model or its data. This is typically achieved through an analysis of the model’s outputs or observed behaviors. There are two principal categories of extraction attacks: Model extraction attack, which targets the model’s parameters, and data extraction attack, which focuses on recovering the training data.

    1) Model extraction attack. These attacks are dedicated to reconstructing the model’s parameters or its fundamental structure. Attackers often employ methods such as gradient analysis or optimization algorithms to fine-tune their models to emulate the behavior of the target model, thus deducing the model’s architecture. In Ref. [24], the security of decoding algorithms within LMs for text generation is scrutinized. The authors introduced an attack methodology that capitalizes on the observable outputs from text generation APIs to infer the underlying decoding algorithm and its hyperparameters. The study illustrates that an adversary can accurately deduce the type of decoding algorithm and its hyperparameters with minimal investment and a multi-stage attack strategy. Another exploration is presented in Ref. [79], which delves into techniques for extracting specialized coding capabilities from commercial LLMs. Utilizing various query strategies such as zero-shot, in-context, and CoT, the research demonstrates that a medium-sized model can replicate these specialized behaviors effectively through strategic query methods and response verification, even when the models are protected by intellectual property rights. Reference [80] confronts the issue of intellectual property theft in AI by extracting the embedding layer from black-box LMs like ChatGPT and PaLM-2. The study employs strategic API queries to identify the hidden dimension through singular value decomposition (SVD) and reconstructs the weight matrix. This effectively exposes the model’s architecture in an economical manner and without the need for direct parameter access.

    2) Data extraction attack. These attacks are specifically designed to recover particular training examples or sequences from the model. Attackers can use the model’s responses to certain inputs to deduce and extract memorized data. This may encompass personal information, copyrighted text, or any other sensitive content that was included in the model’s training dataset. A method for extracting hidden information from commercial LLMs accessed via APIs is introduced in Ref. [81]. This method leverages the softmax bottleneck to deduce the model’s hidden size and reconstruct full outputs using limited API data. By exploiting the low-rank nature of the output layer, this approach uncovers the model’s parameterization and behavior, thereby enhancing accountability and transparency in the operations of LLMs. Bai et al. [82] highlighted the inadvertent memorization and leakage of training data by LLMs when prompted by certain inputs. They introduced the special characters attack (SCA), a pioneering method that exploits the model’s sensitivity to special characters in conjunction with English letters. SCA is engineered to provoke LLMs into divulging memorized data, which may encompass the source code, web content, and personal identifiers. Panda et al. [25] revealed a sophisticated data extraction attack called “neural phishing”, capable of extracting sensitive personal information from LLMs with high success rates. By strategically inserting benign-looking sentences into the training data, attackers can exploit the model’s memorization to retrieve private details such as credit card numbers. Zhang et al. [83] introduced PLeak, a closed-box attack framework that refines adversarial queries to extract the system prompts of LLM applications, often regarded as intellectual property and closely guarded by developers. PLeak conceptualizes the adversarial query generation as an optimization challenge, addressing it with a gradient-based approach. The underlying strategy is to progressively refine the adversarial queries, initiating with a few tokens and then scaling up to encompass the entire system prompt.

    It is observed that there is a relative scarcity of empirical research on extracting model parameters or data from LLMs, with most discussions being theoretical in nature [9]. The complexity and proprietary nature of LLMs pose significant challenges to such extraction attacks. Additionally, the controlled output of these models further limits the feasibility of black-box attacks, making it difficult for adversaries to successfully execute data extraction without access to the model’s internal workings.

    Defenses. To counteract the threat of data extraction attacks, a multifaceted defense strategy can be implemented, encompassing model obfuscation, input sanitization, and stringent access control measures. These approaches are designed to obscure the model’s underlying mechanisms, preemptively neutralize malicious queries, and regulate access based on user identity and behavior patterns. Li et al. [84] proposed a robust defense mechanism against model stealing attacks targeting edge-deployed LLMs. This defense introduces a request-level authorization system that employs a permutation technique to dynamically alter the model’s weight matrix. This permutation is managed by a trusted execution environment (TEE)-based authorization module, ensuring that only authorized inputs are correctly permuted for processing. Additionally, the framework integrates one-time pad (OTP) encryption for secure authorization, which maintains model accuracy and minimizes the computational overhead during runtime. However, it may still introduce latency and performance degradation, especially during high-frequency requests. Zhang et al. [83] delved into potential defenses against PLeak attacks on LLM applications. They suggested a variety of countermeasures, including keyword filtering to block specific prompts, parameterization and formatting to obfuscate system prompts, and adversarial transformations to navigate around filtering mechanisms. This study also highlights the utility of detection techniques such as perplexity-based detection, input preprocessing, and adversarial training to identify and mitigate attacks. However, these countermeasures can be computationally expensive and may not always generalize well to novel forms of attack.

    6.4 Other privacy-preserving methods

    In addition to the previously discussed defensive strategies against specific attacks, there are several privacy-preserving methods that can protect against a wide range of threats. These methods are more general and can be applied across different scenarios and attack vectors. By integrating advanced privacy-preserving techniques, LLMs can operate more securely and reliably, reducing the potential for privacy leakage and enhancing user trust.

    6.4.1 Differential privacy

    DP incorporates a degree of controlled randomness into data processing, complicating the task for adversaries to pinpoint individual data points within aggregated outcomes. This strategy is especially effective at safeguarding user privacy without compromising the ability to perform insightful statistical analyses. Charles et al. [85] introduced differentially private in-context learning (DP-ICL), which is designed to prevent privacy breaches in LLMs during the adaptation phase. DP-ICL leverages noisy consensus from diverse data subsets to uphold privacy and has been rigorously tested in both text classification and language generation tasks. By balancing utility and privacy, DP-ICL supports the ethical deployment of AI. Chua et al. [86] advocated for user-level DP to ensure consistent protection. Their research compares two privacy mechanisms, group privacy and user-wise differentially private stochastic gradient descent (DP-SGD), across NLP tasks, aiming to fine-tune the balance between privacy and utility. The findings indicated that user-wise DP-SGD outperforms its counterpart, particularly under constrained privacy budgets, owing to its more nuanced approach to data management during training. Du et al. [77] presented DP-Forward, a method that applies noise to the embeddings of LMs for enhanced privacy protection during both training and inference phases. Tang et al. [87] introduced an algorithm capable of generating few-shot demonstrations while upholding DP, ensuring sensitive information remains undisclosed. Their contribution advances the field of privacy-preserving ICL, balancing contextual learning with data protection across various applications.
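    As a concrete reference point, the sketch below shows the standard example-level DP-SGD recipe in plain NumPy: each per-example gradient is clipped to norm C, and Gaussian noise proportional to C is added before the update. The user-level variants studied in Refs. [85] and [86] follow the same pattern but clip and noise per-user contributions instead; the model, loss, and hyperparameter values here are purely illustrative.

        # Example-level DP-SGD step for logistic regression (illustrative values).
        import numpy as np

        def dp_sgd_step(w, X, y, lr=0.1, C=1.0, sigma=1.0, rng=np.random.default_rng(0)):
            grads = []
            for xi, yi in zip(X, y):
                p = 1.0 / (1.0 + np.exp(-xi @ w))            # sigmoid prediction
                g = (p - yi) * xi                            # per-example gradient
                g = g / max(1.0, np.linalg.norm(g) / C)      # clip to norm <= C
                grads.append(g)
            noisy_sum = np.sum(grads, axis=0) + rng.normal(0.0, sigma * C, size=w.shape)
            return w - lr * noisy_sum / len(X)               # noisy average gradient step

        # Toy usage on random data.
        rng = np.random.default_rng(1)
        X = rng.normal(size=(32, 5))
        y = rng.integers(0, 2, size=32).astype(float)
        w = np.zeros(5)
        for _ in range(10):
            w = dp_sgd_step(w, X, y)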

    6.4.2 Federated learning

    FL enables models to be trained on decentralized data, thereby reducing the need to centralize and expose sensitive user data. This method allows models to learn from data distributed across various devices without compromising privacy. In a recent contribution, Zheng et al. [88] proposed the federated learning-generative language model (FL-GLM), an FL framework specifically designed for LLMs. It utilizes split learning, where the majority of the model’s parameters reside on the server, while the embedding and output layers are trained locally on client devices. This strategy is well-suited for LLMs given their substantial computational requirements. To bolster security, FL-GLM incorporates key encryption for client-server communications and introduces optimization techniques such as client batching or server hierarchical structuring to enhance training efficiency. Sun et al. [89] introduced federated freeze-A low-rank adaptation (FFA-LoRA), an enhancement of LoRA designed for privacy-preserving FL of LLMs. This approach addresses the inherent challenges of LoRA within the FL paradigm by freezing the randomly initialized (non-zero) adapter matrices and focusing fine-tuning on those initialized to zero. By doing so, FFA-LoRA not only improves the stability and efficiency of FL tasks but also ensures privacy preservation through the implementation of DP measures.
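    A minimal sketch of this freeze-A idea is given below under simplifying assumptions: the base weight and the randomly initialized LoRA matrix A are frozen and shared by all clients, while only the zero-initialized matrix B is updated locally and averaged by the server. The local update is a placeholder for real gradient steps, and the DP noise used by FFA-LoRA [89] is omitted; because A is fixed, averaging the B matrices is equivalent to averaging the clients' adapter updates BA.

        # Toy federated round with frozen A and trainable, zero-initialized B (FFA-LoRA-style).
        import numpy as np

        rng = np.random.default_rng(0)
        d, r, n_clients = 16, 4, 3

        W0 = rng.normal(size=(d, d))            # frozen pretrained weight
        A = rng.normal(size=(r, d)) * 0.01      # frozen LoRA factor, shared by all clients
        Bs = [np.zeros((d, r)) for _ in range(n_clients)]   # trainable, zero-initialized

        def adapted_forward(x, B):
            """LoRA forward pass: W0 x + B (A x)."""
            return W0 @ x + B @ (A @ x)

        def local_update(B, lr=0.01):
            """Placeholder for a client's local fine-tuning of B (W0 and A stay frozen)."""
            return B + lr * rng.normal(size=B.shape)

        # One federated round: clients update B locally, the server averages only B.
        Bs = [local_update(B) for B in Bs]
        B_global = np.mean(Bs, axis=0)
        Bs = [B_global.copy() for _ in range(n_clients)]
        y = adapted_forward(rng.normal(size=d), Bs[0])   # inference with the merged adapter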

    6.4.3 Methods based on cryptography

    1) Secure multi-party computation (MPC). MPC allows multiple parties to collaboratively compute a function over their inputs while keeping those inputs confidential. Hao et al. [90] proposed a security protocol for LLMs that enables privacy-protected predictions. The protocol ensures that neither the user’s input nor the service provider’s model parameters are exposed, with only the prediction result being revealed to the user. Tailored for the Transformer architecture, the protocol includes secure methods for matrix multiplication and non-linear operations like Softmax, the Gaussian error linear unit (GELU), and LayerNorm. Building on Ref. [90], Hou et al. [91] introduced CipherGPT, which enhances the communication and computational efficiency of the original scheme, conducting pioneering experiments on ChatGPT. They standardized the evaluation of linear-layer tokens using a unified parameter matrix and applied subfield vector oblivious linear evaluation (VOLE) for secure matrix multiplication. The GELU function in the non-linear layer is segmented and approximated using low-order formulas, making it efficiently computable with cryptographic primitives. This results in a 4.2× improvement in communication and a 3.4× boost in computation compared to the original method [90]. Dong et al. [92] extended the privacy-preserving approach to a three-server configuration, enhancing model prediction services by secretly sharing prediction samples and model parameters among the servers. This setup leverages replicated secret sharing (RSS) for efficient secure evaluation of linear operations. Although the GELU function is still approximated with a low-order polynomial, it is segmented less finely, leading to a rougher fit; for non-linear operations such as Softmax, existing protocols are utilized. While MPC effectively prevents the explicit exposure of private data during LLM predictions, it protects only the transmission and processing of data: it does not guard against inference attacks such as model inversion, nor does it address vulnerabilities in model behavior during inference. A toy secret-sharing example illustrating the general principle is sketched after this list.

    2) Zero-knowledge (ZK) proofs. ZK proofs are cryptographic methods that allow one party to convince another that it knows a value without revealing any information about the value itself. Sun et al. [93] proposed zkLLM, a novel ZK-proof system that authenticates outputs from LLMs without revealing model parameters. It excels at verifying non-arithmetic deep learning computations, such as activation functions and attention mechanisms, through tlookup, a parallelized lookup argument that minimizes computational overhead. The paper also introduces zkAttn, a proof tailored to the attention mechanism that optimizes both efficiency and accuracy. A toy example of the commit/challenge/response pattern behind ZK proofs is given after this list.

    3) Other methods. In addition to the techniques discussed above, there are numerous advanced cryptographic schemes, including homomorphic encryption (HE), functional encryption (FE), and functional secret sharing (FSS). Due to space constraints, a detailed explanation of these methods is beyond the scope of this paper.
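    To illustrate the general principle behind these cryptographic techniques without the Transformer-specific machinery of Refs. [90]–[92], the first sketch below implements two-party additive secret sharing over a prime field, with a dealer-supplied Beaver triple enabling one secure multiplication; real protocols replace the trusted dealer and add dedicated sub-protocols for non-linear operations.

        # Two-party additive secret sharing with a Beaver triple (toy illustration).
        import random

        P = 2**61 - 1                      # prime modulus defining the field

        def share(x):
            """Split x into two additive shares that sum to x mod P."""
            r = random.randrange(P)
            return r, (x - r) % P

        def reconstruct(s0, s1):
            return (s0 + s1) % P

        def beaver_triple():
            """Trusted dealer: shares of random a, b, and c = a*b."""
            a, b = random.randrange(P), random.randrange(P)
            return share(a), share(b), share((a * b) % P)

        def secure_mul(x_sh, y_sh):
            """Multiply secret-shared x and y; only the blinded values d, e are opened."""
            (a0, a1), (b0, b1), (c0, c1) = beaver_triple()
            d = reconstruct((x_sh[0] - a0) % P, (x_sh[1] - a1) % P)   # d = x - a
            e = reconstruct((y_sh[0] - b0) % P, (y_sh[1] - b1) % P)   # e = y - b
            z0 = (c0 + d * b0 + e * a0 + d * e) % P
            z1 = (c1 + d * b1 + e * a1) % P
            return z0, z1                  # shares of x*y, since xy = c + db + ea + de

        x, y = 1234, 5678
        assert reconstruct(*secure_mul(share(x), share(y))) == (x * y) % P

    The second sketch is the classic Schnorr proof of knowledge of a discrete logarithm, included only to make the commit/challenge/response pattern of ZK proofs concrete; the parameters are deliberately tiny and insecure, and the construction is unrelated to the tlookup and zkAttn arguments used by zkLLM [93].

        # Toy Schnorr zero-knowledge proof of knowledge of a discrete logarithm.
        import random

        p, q, g = 23, 11, 4                 # g generates the order-q subgroup of Z_p*

        x = random.randrange(1, q)          # prover's secret
        y = pow(g, x, p)                    # public value: y = g^x mod p

        r = random.randrange(q)             # commit: random nonce
        t = pow(g, r, p)                    #         t = g^r
        c = random.randrange(q)             # challenge from the verifier
        s = (r + c * x) % q                 # response: reveals nothing about x by itself

        # Verification succeeds iff the prover knows x: g^s == t * y^c (mod p).
        assert pow(g, s, p) == (t * pow(y, c, p)) % p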

    6.4.4 Watermarking

    Watermarking embeds unique identifiers into data, serving as a traceable deterrent against unauthorized use or access. It fortifies security by facilitating the identification and prosecution of data misuse. Kirchenbauer et al. [94] presented a framework for watermarking LLM outputs to curb misuse and ensure the detectability of machine-generated text. The method discreetly selects “green” tokens and subtly encourages their use during text generation, embedding an imperceptible watermark that minimally impacts quality while allowing algorithmic detection without requiring access to the model. Yao et al. [95] focused on prompt copyright protection with PromptCARE (prompt copyright protection by watermark injection and verification), a framework that integrates watermarks into prompts via a dual-level optimization process. It employs “label tokens” and “signal tokens” to forge a unique signature, with a verification phase that uses a secret key to trigger the watermark behavior for statistical copyright verification. Zhao et al. [96] introduced Unigram-Watermark, enhancing an existing method with a simplified grouping strategy, underpinned by a theoretical framework that quantifies watermark effectiveness and robustness in LLMs. Min et al. [97] unveiled SILO, an LM designed to balance legal risk and performance during inference. Trained on the Open License Corpus and augmented with a nonparametric datastore for high-risk data, SILO allows high-risk data to be utilized at inference time without direct training exposure.
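    The green-list scheme of Kirchenbauer et al. [94] can be summarized in a short sketch, shown below with illustrative parameter choices (vocabulary size, green-list fraction GAMMA, and bias DELTA): the previous token seeds a keyed PRNG that partitions the vocabulary, generation adds DELTA to the logits of green tokens, and detection computes a z-score on the fraction of green tokens observed in a text.

        # Simplified green-list watermark in the spirit of Kirchenbauer et al. [94].
        import math
        import numpy as np

        VOCAB, GAMMA, DELTA, KEY = 50_000, 0.25, 2.0, 42   # illustrative parameters

        def green_list(prev_token):
            """Keyed PRNG partition of the vocabulary, seeded by the previous token."""
            rng = np.random.default_rng(KEY * 1_000_003 + prev_token)
            return rng.permutation(VOCAB)[: int(GAMMA * VOCAB)]

        def watermarked_sample(logits, prev_token, rng):
            """Sample the next token after softly promoting green-list logits by DELTA."""
            biased = logits.copy()
            biased[green_list(prev_token)] += DELTA
            probs = np.exp(biased - biased.max())
            probs /= probs.sum()
            return int(rng.choice(VOCAB, p=probs))

        def detect(tokens):
            """z-score of the observed green fraction against the null expectation GAMMA."""
            n = len(tokens) - 1
            hits = sum(t in set(green_list(prev)) for prev, t in zip(tokens, tokens[1:]))
            return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))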

    7 Future directions

    Safeguarding LLMs is fraught with challenges, primarily due to the following two fundamental factors:

    • The sheer scale of LLMs, often containing billions of parameters, makes traditional security methods less effective. The vastness of these models demands innovative, scalable solutions. Moreover, the proprietary nature of many powerful LLMs limits the broader community’s ability to scrutinize and test them for vulnerabilities. This confidentiality, while protecting intellectual property, also shields LLMs from the critical analysis that could improve their security.

    • There is a shortage of research on how LLM architectures affect their safety. This is partly due to the prohibitive computational costs associated with experimenting with various designs. The complexity and resource intensity of such studies hinder progress in understanding the architectural factors that influence model security, even though this knowledge is vital for creating more secure and manipulation-resistant models.

    To address these challenges, future research should focus on developing scalable security solutions that can handle the complexity of LLMs. Several promising research directions are as follows:

    1) Scalable security protocols and algorithms: Develop security protocols and algorithms that are capable of scaling with the growing size and complexity of LLMs. This includes the design of adaptive learning mechanisms that can evolve as models expand, ensuring that the security architecture can handle not only the size but also the dynamic nature of these systems.

    2) Open-source frameworks and collaborative platforms: Promote the creation of open-source frameworks and collaborative platforms that enable broader analysis, testing, and improvement of LLM architectures. By fostering transparency in research and development, these platforms can accelerate the identification of vulnerabilities and facilitate the creation of effective countermeasures.

    3) Transparent decision-making processes: Develop methods to enhance the transparency and interpretability of LLM decision-making processes. Making model reasoning more understandable will help researchers identify and mitigate unintended biases, harmful outputs, and other ethical concerns, thus improving overall trust in the system.

    4) Privacy-preserving inference methods: Investigate privacy-preserving techniques for LLM inference to ensure that user queries and model responses do not inadvertently leak sensitive or personal information. Techniques such as DP and MPC should be adapted for LLM contexts to maintain privacy while ensuring effective functionality.

    5) Security of multimodal systems: Focus on securing multimodal systems as LLMs increasingly integrate multiple modalities, such as text, vision, and audio. Research should address potential security risks arising from the interaction between different modalities, including how to mitigate the exposure of vulnerabilities that may arise from their combination.

    6) Optimizing security with computational efficiency: Strive to maintain robust security while optimizing computational resources and efficiency during the deployment of large models. Balancing security measures with resource constraints is critical, especially for ensuring smooth operation in resource-limited environments like edge computing devices.

    7) Defending edge-deployed models: For models deployed at the edge, investigate defense mechanisms such as request-level authorization, weight permutation, and encryption. These techniques should be optimized to minimize latency and computational overhead while providing robust security. The development of lightweight encryption methods and more efficient trusted execution environments (TEEs) is crucial for protecting models during runtime without significantly impacting performance.

    8 Conclusions

    In this paper, we clarify the concepts of safety, security, and privacy in LLMs with a clear and simple classification. Our goal is to provide researchers with a straightforward framework that helps them identify specific vulnerabilities and apply effective defense strategies. Additionally, we present a comprehensive review of current issues and solutions in LLMs, categorized into these three key areas. This survey aims to help LLM researchers and practitioners better understand these important topics. We also analyze existing studies, summarize key insights, and suggest directions for future research.

    Disclosures

    The authors declare no conflicts of interest.

    References

    [1] OpenAI, J. Achiam, S. Adler, et al., GPT-4 technical report [Online]. Available: https://arxiv.org/abs/2303.08774, March 2023.

    [2] H. Touvron, T. Lavril, G. Izacard, et al., LLaMA: open and efficient foundation language models [Online]. Available: https://arxiv.org/abs/2302.13971, February 2023.

    [3] H. Touvron, L. Martin, K. Stone, et al., Llama 2: open foundation and fine-tuned chat models [Online]. Available: https://arxiv.org/abs/2307.09288, July 2023.

    [4] W.X. Zhao, K. Zhou, J.Y. Li, et al., A survey of large language models [Online]. Available: https://arxiv.org/abs/2303.18223, March 2023.

    [5] L. Huang, W.J. Yu, W.T. Ma, et al., A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions, ACM T. Inform. Syst. (2024), doi: 10.1145/3703155.

    [6] S. Zhao, M.H.Z. Jia, Z.L. Guo, et al., A survey of backdoor attacks and defenses on large language models: implications for security measures [Online]. Available: https://arxiv.org/abs/2406.06852, June 2024.

    [7] S. Kim, S. Yun, H. Lee, M. Gubri, S. Yoon, S.J. Oh, ProPILE: probing privacy leakage in large language models, in: Proc. of the 37th Intl. Conf. on Neural Information Processing Systems, New Orleans, USA, 2024, pp. 1–14.

    [8] T.Y. Cui, Y.L. Wang, C.P. Fu, et al., Risk taxonomy, mitigation, and assessment benchmarks of large language model systems [Online]. Available: https://arxiv.org/abs/2401.05778, January 2024.

    [9] B.W. Yan, K. Li, M.H. Xu, et al., On protecting the data privacy of large language models (LLMs): a survey [Online]. Available: https://arxiv.org/abs/2403.05156, March 2024.

    [10] Y.-F. Yao, J.-H. Duan, K.-D. Xu, Y.-F. Cai, Z.-B. Sun, Y. Zhang, A survey on large language model (LLM) security and privacy: the good, the bad, and the ugly, High-Confid. Comput. 4 (2024) 100211.

    [11] A. Vaswani, N. Shazeer, N. Parmar, et al., Attention is all you need, in: Proc. of the 31st Intl. Conf. on Neural Information Processing Systems, Long Beach, USA, 2017, pp. 6000–6010.

    [12] A. Deshpande, V. Murahari, T. Rajpurohit, A. Kalyan, K. Narasimhan, Toxicity in ChatGPT: analyzing persona-assigned language models, in: Proc. of the Findings of the Association for Computational Linguistics, Singapore, 2023, pp. 1236–1270.

    [13] A. Haim, A. Salinas, J. Nyarko, What’s in a name? Auditing large language models for race and gender bias [Online]. Available: https://arxiv.org/abs/2402.14875, February 2024.

    [14] Z.-W. Ji, N. Lee, R. Frieske, et al., Survey of hallucination in natural language generation, ACM Comput. Surv. 55 (2023) 248.

    [15] Y. Zhang, Y.F. Li, L.Y. Cui, et al., Siren’s song in the AI ocean: a survey on hallucination in large language models [Online]. Available: https://arxiv.org/abs/2309.01219, September 2023.

    [16] X.X. Li, R.C. Zhao, Y.K. Chia, et al., Chain-of-knowledge: grounding large language models via dynamic knowledge adapting over heterogeneous sources, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–23.

    [17] A. Wei, N. Haghtalab, J. Steinhardt, Jailbroken: how does LLM safety training fail? in: Proc. of the 37th Advances in Neural Information Processing Systems, New Orleans, USA, 2023, pp. 1–32.

    [18] Y.Z. Li, T.L. Li, K.J. Chen, et al., BadEdit: backdooring large language models by model editing, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–18.

    [19] V.S. Sadasivan, S. Saha, G. Sriramanan, P. Kattakinda, A.M. Chegini, S. Feizi, Fast adversarial attacks on language models in one GPU minute, in: Proc. of the 41st Intl. Conf. on Machine Learning, Vienna, Austria, 2024, pp. 1–20.

    [20] V. Raina, A. Liusie, M.J.F. Gales, Is LLM-as-a-judge robust? Investigating universal adversarial attacks on zero-shot LLM assessment, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 7499–7517.

    [21] N. Mireshghallah, H. Kim, X.H. Zhou, et al., Can LLMs keep a secret? Testing privacy implications of language models via contextual integrity theory, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–24.

    [22] M. Duan, A. Suri, N. Mireshghallah, et al., Do membership inference attacks work on large language models? [Online]. Available: https://arxiv.org/abs/2402.07841, February 2024.

    [23] R. Staab, M. Vero, M. Balunovic, M.T. Vechev, Beyond memorization: violating privacy via inference with large language models, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–47.

    [24] A. Naseh, K. Krishna, M. Iyyer, A. Houmansadr, Stealing the decoding algorithms of language models, in: Proc. of the ACM SIGSAC Conf. on Computer and Communications Security, Copenhagen, Denmark, 2023, pp. 1835–1849.

    [25] A. Panda, C.A. Choquette-Choo, Z.M. Zhang, Y.Q. Yang, P. Mittal, Teach LLMs to phish: stealing private information from language models, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–25.

    [26] O. Shaikh, H.X. Zhang, W. Held, M. Bernstein, D.Y. Yang, On second thought, let’s not think step by step! Bias and toxicity in zero-shot reasoning, in: Proc. of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, Canada, 2023, pp. 4454–4470.

    [27] X.J. Dong, Y.B. Wang, P.S. Yu, J. Caverlee, Probing explicit and implicit gender bias through LLM conditional text generation [Online]. Available: https://arxiv.org/abs/2311.00306, November 2023.

    [28] J. Welbl, A. Glaese, J. Uesato, et al., Challenges in detoxifying language models, in: Proc. of the Findings of the Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 2447–2469.

    [29] I.O. Gallegos, R.A. Rossi, J. Barrow, et al., Bias and fairness in large language models: a survey, Comput. Linguist. 50 (2024) 1097–1179.

    [30] M. Bhan, J.N. Vittaut, N. Achache, et al., Mitigating text toxicity with counterfactual generation [Online]. Available: https://arxiv.org/abs/2405.09948, May 2024.

    [31] N. Vishwamitra, K.Y. Guo, F.T. Romit, et al., Moderating new waves of online hate with chain-of-thought reasoning in large language models, in: Proc. of the IEEE Symposium on Security and Privacy, San Francisco, USA, 2024, pp. 1–19.

    [32] A. Pal, L.K. Umapathi, M. Sankarasubbu, Med-HALT: medical domain hallucination test for large language models, in: Proc. of the 27th Conf. on Computational Natural Language Learning, Singapore, 2023, pp. 314–334.

    [33] H.Q. Kang, X.Y. Liu, Deficiency of large language models in finance: an empirical examination of hallucination, in: Proc. of the 37th Conf. on Neural Information Processing Systems, Virtual Event, 2023, pp. 1–15.

    [34] N. Mündler, J.X. He, S. Jenko, M.T. Vechev, Self-contradictory hallucinations of large language models: evaluation, detection, and mitigation, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–30.

    [35] Y.Z. Li, S. Bubeck, R. Eldan, A. Del Giorno, S. Gunasekar, Y.T. Lee, Textbooks are all you need II: phi-1.5 technical report [Online]. Available: https://arxiv.org/abs/2309.05463, September 2023.

    [36] H. Lightman, V. Kosaraju, Y. Burda, et al., Let’s verify step by step, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–24.

    [37] Z.B. Gou, Z.H. Shao, Y.Y. Gong, et al., CRITIC: large language models can self-correct with tool-interactive critiquing, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–77.

    [38] S. Dhuliawala, M. Komeili, J. Xu, et al., Chain-of-verification reduces hallucination in large language models, in: Proc. of the Findings of the Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 3563–3578.

    [39] E. Jones, H. Palangi, C. Simões, et al., Teaching language models to hallucinate less with synthetic tasks, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–18.

    [40] S.M.T.I. Tonmoy, S.M.M. Zaman, V. Jain, et al., A comprehensive survey of hallucination mitigation techniques in large language models [Online]. Available: https://arxiv.org/abs/2401.01313, January 2024.

    [41] A. Zou, Z.F. Wang, N. Carlini, M. Nasr, J.Z. Kolter, M. Fredrikson, Universal and transferable adversarial attacks on aligned language models [Online]. Available: https://arxiv.org/abs/2307.15043, July 2023.

    [42] H.R. Li, D.D. Guo, W. Fan, et al., Multi-step jailbreaking privacy attacks on ChatGPT, in: Proc. of the Findings of the Association for Computational Linguistics, Singapore, 2023, pp. 4138–4153.

    [43] M. Shanahan, K. McDonell, L. Reynolds, Role play with large language models, Nature 623 (7987) (2023) 493–498.

    [44] G.L. Deng, Y. Liu, Y.K. Li, et al., MASTERKEY: automated jailbreaking of large language model chatbots, in: Proc. of the 31st Annual Network and Distributed System Security Symposium, San Diego, USA, 2024, pp. 1–16.

    [45] Z.X. Zhang, J.X. Yang, P. Ke, et al., Safe unlearning: a surprisingly effective and generalizable solution to defend against jailbreak attacks [Online]. Available: https://arxiv.org/abs/2407.02855, July 2024.

    [46] W.K. Lu, Z.Q. Zeng, J.W. Wang, et al., Eraser: jailbreaking defense in large language models via unlearning harmful knowledge [Online]. Available: https://arxiv.org/abs/2404.05880, April 2024.

    [47] A. Robey, E. Wong, H. Hassani, G.J. Pappas, SmoothLLM: defending large language models against jailbreaking attacks [Online]. Available: https://arxiv.org/abs/2310.03684, October 2023.

    [48] A. Zhou, B. Li, H.H. Wang, Robust prompt optimization for defending language models against jailbreaking attacks, in: Proc. of the 38th Conf. on Neural Information Processing Systems, Virtual Event, 2024, pp. 1–17.

    [49] J.S. Xu, M.Y. Ma, F. Wang, C.W. Xiao, M.H. Chen, Instructions as backdoors: backdoor vulnerabilities of instruction tuning for large language models, in: Proc. of the Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 2024, pp. 3111–3126.

    [50] A. Wan, E. Wallace, S. Shen, D. Klein, Poisoning language models during instruction tuning, in: Proc. of the 40th Intl. Conf. on Machine Learning, Honolulu, USA, 2023, pp. 1–13.

    [51] S. Zhao, M.H.Z. Jia, L.A. Tuan, F.J. Pan, J.M. Wen, Universal vulnerabilities in large language models: backdoor attacks for in-context learning, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 11507–11522.

    [52] Z.H. Xi, T.Y. Du, C.J. Li, et al., Defending pre-trained language models as few-shot learners against backdoor attacks, in: Proc. of the 37th Advances in Neural Information Processing Systems, New Orleans, USA, 2023, pp. 1–17.

    [53] X. Li, Y.S. Zhang, R.Z. Lou, C. Wu, J.Q. Wang, Chain-of-scrutiny: detecting backdoor attacks for large language models [Online]. Available: https://arxiv.org/abs/2406.05948, June 2024.

    [54] H.R. Li, Y.L. Chen, Z.H. Zheng, et al., Simulate and eliminate: revoke backdoors for generative large language models [Online]. Available: https://arxiv.org/abs/2405.07667, May 2024.

    [55] Y.T. Li, Z.C. Xu, F.Q. Jiang, et al., CleanGen: mitigating backdoor attacks for generation tasks in large language models, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 9101–9118.

    [56] X.L. Xu, K.Y. Kong, N. Liu, et al., An LLM can fool itself: a prompt-based adversarial attack, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–23.

    [57] A. Kumar, C. Agarwal, S. Srinivas, A.J. Li, S. Feizi, H. Lakkaraju, Certifying LLM safety against adversarial prompting [Online]. Available: https://arxiv.org/abs/2309.02705, September 2023.

    [58] H. Brown, L. Lin, K. Kawaguchi, M. Shieh, Self-evaluation as a defense against adversarial attacks on LLMs [Online]. Available: https://arxiv.org/abs/2407.03234, July 2024.

    [59] Y.K. Zhao, L.Y. Yan, W.W. Sun, et al., Improving the robustness of large language models via consistency alignment, in: Proc. of the Joint Intl. Conf. on Computational Linguistics, Language Resources and Evaluation, Torino, Italy, 2024, pp. 8931–8941.

    [60] S. Kadavath, T. Conerly, A. Askell, et al., Language models (mostly) know what they know [Online]. Available: https://arxiv.org/abs/2207.05221, July 2022.

    [61] O. Cartwright, H. Dunbar, T. Radcliffe, Evaluating privacy compliance in commercial large language models - ChatGPT, Claude, and Gemini, Research Square (2024), doi: 10.21203/rs.3.rs-4792047/v1.

    [62] H. Nissenbaum, Privacy as contextual integrity, Wash. Law Rev. 79 (2004) 119.

    [63] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramèr, C.Y. Zhang, Quantifying memorization across neural language models, in: Proc. of the 11th Intl. Conf. on Learning Representations, Kigali, Rwanda, 2023, pp. 1–19.

    [64] R. Eldan, M. Russinovich, Who’s Harry Potter? Approximate unlearning in LLMs [Online]. Available: https://arxiv.org/abs/2310.02238, October 2023.

    [65] V. Patil, P. Hase, M. Bansal, Can sensitive information be deleted from LLMs? Objectives for defending against extraction attacks [Online]. Available: https://arxiv.org/abs/2309.17410, September 2023.

    [66] N. Kandpal, E. Wallace, C. Raffel, Deduplicating training data mitigates privacy risks in language models, in: Proc. of the 39th Intl. Conf. on Machine Learning, Baltimore, USA, 2022, pp. 10697–10707.

    [67] R. Shokri, M. Stronati, C.Z. Song, V. Shmatikov, Membership inference attacks against machine learning models, in: Proc. of the IEEE Symposium on Security and Privacy, San Jose, USA, 2017, pp. 3–18.

    [68] P. Maini, H.R. Jia, N. Papernot, A. Dziedzic, LLM dataset inference: did you train on my dataset? [Online]. Available: https://arxiv.org/abs/2406.06443, June 2024.

    [69] J. Mattern, F. Mireshghallah, Z.J. Jin, B. Schoelkopf, M. Sachan, T. Berg-Kirkpatrick, Membership inference attacks against language models via neighbourhood comparison, in: Proc. of the Findings of the Association for Computational Linguistics, Toronto, Canada, 2023, pp. 11330–11343.

    [70] W.J. Shi, A. Ajith, M.Z. Xia, et al., Detecting pretraining data from large language models, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–17.

    [71] M. Kaneko, Y.M. Ma, Y. Wata, N. Okazaki, Sampling-based pseudo-likelihood for membership inference attacks [Online]. Available: https://arxiv.org/abs/2404.11262, April 2024.

    [72] N. Kandpal, K. Pillutla, A. Oprea, P. Kairouz, C.A. Choquette-Choo, Z. Xu, User inference attacks on large language models, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 18238–18265.

    [73] Z.P. Wang, A.D. Cheng, Y.G. Wang, L. Wang, Information leakage from embedding in large language models [Online]. Available: https://arxiv.org/abs/2405.11916, May 2024.

    [74] R.S. Zhang, S. Hidano, F. Koushanfar, Text revealer: private text reconstruction via model inversion attacks against Transformers [Online]. Available: https://arxiv.org/abs/2209.10505, September 2022.

    [75] J.X. Morris, W.T. Zhao, J.T. Chiu, V. Shmatikov, A.M. Rush, Language model inversion, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–21.

    [76] L. Hu, A.-L. Yan, H.-Y. Yan, et al., Defenses to membership inference attacks: a survey, ACM Comput. Surv. 56 (2024) 92.

    [77] M.X. Du, X. Yue, S.S.M. Chow, T.H. Wang, C.Y. Huang, H. Sun, DP-Forward: fine-tuning and inference on language models with differential privacy in forward pass, in: Proc. of the ACM SIGSAC Conf. on Computer and Communications Security, Copenhagen, Denmark, 2023, pp. 2665–2679.

    [78] D.F. Chen, N. Yu, M. Fritz, RelaxLoss: defending membership inference attacks without losing utility, in: Proc. of the 10th Intl. Conf. on Learning Representations, Virtual Event, 2022, pp. 1–28.

    [79] Z.J. Li, C.Z. Wang, P.C. Ma, et al., On extracting specialized code abilities from large language models: a feasibility study, in: Proc. of the IEEE/ACM 46th Intl. Conf. on Software Engineering, Lisbon, Portugal, 2024, pp. 1–13.

    [80] N. Carlini, D. Paleka, K.D. Dvijotham, et al., Stealing part of a production language model, in: Proc. of the 41st Intl. Conf. on Machine Learning, Vienna, Austria, 2024, pp. 1–26.

    [81] M. Finlayson, X. Ren, S. Swayamdipta, Logits of API-protected LLMs leak proprietary information [Online]. Available: https://arxiv.org/abs/2403.09539, March 2024.

    [82] Y. Bai, G. Pei, J.D. Gu, Y. Yang, X.J. Ma, Special characters attack: toward scalable training data extraction from large language models [Online]. Available: https://arxiv.org/abs/2405.05990, May 2024.

    [83] C.L. Zhang, J.X. Morris, V. Shmatikov, Extracting prompts by inverting LLM outputs, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 14753–14777.

    [84] Q.F. Li, Z.Q. Shen, Z.H. Qin, et al., TransLinkGuard: safeguarding Transformer models against model stealing in edge deployment, in: Proc. of the 32nd ACM Intl. Conf. on Multimedia, Melbourne, Australia, 2024, pp. 3479–3488.

    [85] Z. Charles, A. Ganesh, R. McKenna, et al., Fine-tuning large language models with user-level differential privacy, in: Proc. of the ICML 2024 Workshop on Theoretical Foundations of Foundation Models, Vienna, Austria, 2024, pp. 1–24.

    [86] L. Chua, B. Ghazi, Y.S.B. Huang, et al., Mind the privacy unit! User-level differential privacy for language model fine-tuning [Online]. Available: https://arxiv.org/abs/2406.14322, June 2024.

    [87] X.Y. Tang, R. Shin, H.A. Inan, et al., Privacy-preserving in-context learning with differentially private few-shot generation [Online]. Available: https://arxiv.org/abs/2309.11765, September 2023.

    [88] J.Y. Zheng, H.N. Zhang, L.X. Wang, W.J. Qiu, H.W. Zheng, Z.M. Zheng, Safely learning with private data: a federated learning framework for large language model, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 5293–5306.

    [89] Y.B. Sun, Z.T. Li, Y.L. Li, B.L. Ding, Improving LoRA in privacy-preserving federated learning, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–17.

    [90] M. Hao, H.W. Li, H.X. Chen, P.Z. Xing, G.W. Xu, T.W. Zhang, Iron: private inference on Transformers, in: Proc. of the 36th Intl. Conf. on Neural Information Processing Systems, New Orleans, USA, 2022, pp. 1–14.

    [91] X.Y. Hou, J. Liu, J.Y. Li, et al., CipherGPT: secure two-party GPT inference, Cryptology ePrint Archive [Online]. Available: https://eprint.iacr.org/2023/1147, May 2023.

    [92] Y. Dong, W.J. Lu, Y.C. Zheng, et al., PUMA: secure inference of LLaMA-7B in five minutes [Online]. Available: https://arxiv.org/abs/2307.12533, July 2023.

    [93] H.C. Sun, J. Li, H.Y. Zhang, zkLLM: zero knowledge proofs for large language models, in: Proc. of the ACM SIGSAC Conf. on Computer and Communications Security, Salt Lake City, USA, 2024, pp. 4405–4419.

    [94] J. Kirchenbauer, J. Geiping, Y.X. Wen, J. Katz, I. Miers, T. Goldstein, A watermark for large language models, in: Proc. of the 40th Intl. Conf. on Machine Learning, Honolulu, USA, 2023, pp. 17061–17084.

    [95] H.W. Yao, J. Lou, Z. Qin, K. Ren, PromptCARE: prompt copyright protection by watermark injection and verification, in: Proc. of the IEEE Symposium on Security and Privacy, San Francisco, USA, 2024, pp. 845–861.

    [96] X.D. Zhao, P.V. Ananth, L. Li, Y.X. Wang, Provable robust watermarking for AI-generated text, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–35.

    [97] S. Min, S. Gururangan, E. Wallace, et al., SILO language models: isolating legal risk in a nonparametric datastore, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–27.
