[1] Apostol Vassilev, Alina Oprea, Alie Fordyce, and Hyrum Anderson. Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations. NIST AI 100-2e2023, National Institute of Standards and Technology, 2024.
[2] As defined in the Glossary, RAG augments a prompt by retrieving data from outside the base model and adding it to the context. RAG can effectively adjust and modify the model's internal knowledge without retraining the entire model. See p. 97 of the report.
[3] This figure is a translated version of the diagram used in the report and gives an overview of the taxonomy of attacks on PredAI systems in adversarial machine learning. The three disjoint circles represent the attacker's objectives, and the center of each circle represents the attacker's goal. The sectors on the outer ring of each circle represent the attacker capabilities required to mount an attack. Attack classes are shown as callouts connected to the capabilities required to mount each attack; multiple attack classes that pursue the same objective with the same capability are shown in a single callout. Related attack classes that pursue the same objective but require different attacker capabilities are connected by dotted lines. See p. 6 of the report.
[4] As defined in the Glossary, an adversarial example is "a modified test sample that causes a machine learning model to produce a misclassification at deployment time." See p. 92 of the report.
[5] As defined in the Glossary, an "energy-latency attack" is "an attack that exploits the dependence of (machine learning) performance on hardware and model optimizations in order to negate the effect of hardware optimizations, increase computation latency, raise hardware temperature, and massively increase the amount of energy consumed." See p. 93 of the report.
[6] As defined in the Glossary, a "backdoor pattern" is "a trigger pattern inserted into data samples to induce misclassification by a poisoned model." See p. 92 of the report.
[7] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
[8] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 387–402. Springer, 2013.
[9] Hongyan Chang et al. show that giving equal weight to groups of different sizes and distributions in order to counteract bias in the training data may conflict with model robustness. See Hongyan Chang, Ta Duy Nguyen, Sasi Kumar Murakonda, Ehsan Kazemi, and R. Shokri. On adversarial bias and the robustness of fair machine learning. https://arxiv.org/abs/2006.08669, 2020.
[10] R. Perdisci, D. Dagon, Wenke Lee, P. Fogla, and M. Sharif. Misleading worm signature generators using deliberate noise injection. In 2006 IEEE Symposium on Security and Privacy (S&P’06), Berkeley/Oakland, CA, 2006. IEEE.
[11] Ram Shankar Siva Kumar, Magnus Nyström, John Lambert, Andrew Marshall, Mario Goertzel, Andi Comissoneru, Matt Swann, and Sharon Xia. Adversarial machine learning - industry perspectives. https://arxiv.org/abs/2002.05646, 2020.
[12] Nicholas Carlini, Matthew Jagielski, Christopher A Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramer. Poisoning web-scale training datasets is practical. arXiv preprint arXiv:2302.10149, 2023.
[13] Yuzhe Ma, Xiaojin Zhu, and Justin Hsu. Data poisoning against differentially-private learners: Attacks and defenses. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), 2019.
[14] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. BadNets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019.
[15] Ahmed Salem, Rui Wen, Michael Backes, Shiqing Ma, and Yang Zhang. Dynamic backdoor attacks against machine learning models. https://arxiv.org/abs/2003.03675, 2020.
[16] As defined in the Glossary, a "functional attack" is an adversarial attack that is optimized against a set of data from a domain rather than against each individual data point. See p. 94 of the report.
[17] Shaofeng Li, Minhui Xue, Benjamin Zi Hao Zhao, Haojin Zhu, and Xinpeng Zhang. Invisible backdoor attacks on deep neural networks via steganography and regularization. IEEE Transactions on Dependable and Secure Computing, 18:2088–2105, 2021.
[18] Yunfei Liu, Xingjun Ma, James Bailey, and Feng Lu. Reflection backdoor: A natural backdoor attack on deep neural networks. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision–ECCV 2020, pages 182–199, Cham, 2020. Springer International Publishing.
[19] Yingqi Liu, Wen-Chuan Lee, Guanhong Tao, Shiqing Ma, Yousra Aafer, and Xiangyu Zhang. ABS: Scanning neural networks for back-doors by artificial brain stimulation. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS '19, pages 1265–1282, New York, NY, USA, 2019. Association for Computing Machinery.
[20] Xijie Huang, Moustafa Alzantot, and Mani Srivastava. NeuronInspect: Detecting backdoors in neural networks via output explanations, 2019.
[21] Huili Chen, Cheng Fu, Jishen Zhao, and Farinaz Koushanfar. DeepInspect: A blackbox trojan detection and mitigation framework for deep neural networks. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 4658–4664. International Joint Conferences on Artificial Intelligence Organization, July 2019.
[22] Xiaojun Xu, Qi Wang, Huichen Li, Nikita Borisov, Carl A. Gunter, and Bo Li. Detecting AI trojans using meta neural analysis. In IEEE Symposium on Security and Privacy, S&P 2021, pages 103–120, United States, May 2021.
[23] As defined in the Glossary, "shadow models" are models that mimic the behavior of a target model and whose training datasets, and thus the ground truth about their composition, are known. In general, the attack model is trained on the labeled inputs and outputs of the shadow models. See p. 97 of the report.
[24] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In NDSS. The Internet Society, 2018.
[25] Andrew Yuan, Alina Oprea, and Cheng Tan. Dropout attacks. In IEEE Symposium on Security and Privacy (S&P), 2024.
[26] Nils Homer, Szabolcs Szelinger, Margot Redman, David Duggan, Waibhav Tembe, Jill Muehling, John V Pearson, Dietrich A Stephan, Stanley F Nelson, and David W Craig. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS genetics, 4(8):e1000167, 2008.
[27] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In Proceedings of the 22nd ACM Symposium on Principles of Database Systems, PODS’03, pages 202–210. ACM, 2003.
[28] Simson Garfinkel, John Abowd, and Christian Martindale. Understanding database reconstruction attacks on public data. Communications of the ACM, 62:46–53, February 2019.
[29] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In IEEE Computer Security Foundations Symposium, CSF '18, pages 268–282, 2018. https://arxiv.org/abs/1709.01604.
[30] As defined in the Glossary, side-channel attacks allow an attacker to infer secret information while a program executes by observing non-functional characteristics of the program, such as execution time or memory, or by measuring or exploiting indirect coincidental effects of the system or its hardware, such as variations in power consumption or electromagnetic emissions. See p. 97 of the report.
[31] Lejla Batina, Shivam Bhasin, Dirmanto Jap, and Stjepan Picek. CSI NN: Reverse engineering of neural network architectures through electromagnetic side channel. In Proceedings of the 28th USENIX Conference on Security Symposium, SEC'19, pages 515–532, USA, 2019. USENIX Association.
[32] Adnan Siraj Rakin, Md Hafizul Islam Chowdhuryy, Fan Yao, and Deliang Fan. DeepSteal: Advanced model extractions leveraging efficient weight stealing in memories. In 2022 IEEE Symposium on Security and Privacy (S&P), pages 1157–1174, 2022.
[33] Giuseppe Ateniese, Luigi V. Mancini, Angelo Spognardi, Antonio Villani, Domenico Vitali, and Giovanni Felici. Hacking smart machines with smarter ones: How to extract meaningful data from machine learning classifiers. Int. J. Secur. Netw., 10(3):137–150, September 2015.
[34] Matthew Jagielski, Jonathan Ullman, and Alina Oprea. Auditing differentially private machine learning: How private is private SGD? In Advances in Neural Information Processing Systems, volume 33, pages 22205–22216, 2020.
[35] Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 42nd IEEE Symposium on Security and Privacy, SP 2021, San Francisco, CA, USA, 24-27 May 2021, pages 141–159. IEEE, 2021.
[36] This figure is a translated version of the diagram used in the report. It shows that the attacker's objectives in adversarial machine learning against GenAI fall into four categories: availability breakdown, integrity violation, privacy compromise, and, specific to GenAI, abuse, whose harms must not be overlooked. The capabilities the attacker needs in order to mount each kind of attack are shown on the outer ring; attack classes are represented as callouts attached to each capability, and multiple attack classes mounted with the same capability toward the same objective are shown in a single callout. See p. 36 of the report.
[37] As defined in the Glossary, pre-training refers to the initial stage of training, in which a model learns general patterns, features, and relationships from a large amount of unlabeled data. Pre-training is usually performed in an unsupervised or self-supervised manner and serves as the step preceding fine-tuning. See p. 96 of the report.
[38] As defined in the Glossary, fine-tuning refers to the process of adapting a pre-trained model to a specific task or domain. It follows the pre-training stage and involves further training the model on domain-specific data, usually in a supervised manner. See p. 94 of the report.
[39] Nicholas Carlini. Poisoning the unlabeled dataset of semi-supervised learning. In 30th USENIX Security Symposium (USENIX Security 21), pages 1577–1592. USENIX Association, August 2021.
[40] Nicholas Carlini, Matthew Jagielski, Christopher A Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. Poisoning web-scale training datasets is practical. arXiv preprint arXiv:2302.10149, 2023.
[41] Hadi Salman, Alaa Khaddaj, Guillaume Leclerc, Andrew Ilyas, and Aleksander Madry. Raising the cost of malicious AI-powered image editing. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023.
[42] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.
[43] Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. arXiv preprint arXiv:1908.07125, 2019.
[44] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? arXiv preprint arXiv:2307.02483, 2023.
[45] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
[46] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The Secret Sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pages 267–284, 2019. https://arxiv.org/abs/1802.08232.
[47] Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650. USENIX Association, August 2021.
[48] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. https://arxiv.org/abs/2202.07646, 2022.
[49] Xinyue Shen, Yiting Qu, Michael Backes, and Yang Zhang. Prompt stealing attacks against text-to-image generation models. arXiv preprint arXiv:2302.09923, 2023.
[50] Yiming Zhang and Daphne Ippolito. Prompts should not be seen as secrets: Systematically measuring prompt extraction attack success. arXiv preprint arXiv:2307.06865, 2023.
[51] Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, et al. AI alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852, 2023.
[52] Learn Prompting. Defensive measures, 2023.
[53] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499, 2023.
[54] Learn Prompting. Defensive measures, Separate LLM evaluation, 2023. See https://learnprompting.org/docs/prompt_hacking/defensive_measures/llm_eval, last visited: 1.31.2024.
[55] Yi Liu et al. (n 53).
[56] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. arXiv preprint arXiv:2302.12173, 2023.
[57] ibid.
[58] James Vincent. Google and Microsoft's chatbots are already citing one another in a misinformation shitshow, 2023. See https://www.theverge.com/2023/3/22/23651564/google-microsoft-bard-bing-chatbots-misinformation, last visited: 1.31.2024.
[59] Greshake et al. (n 56).
[60] Ben Derico. ChatGPT bug leaked users' conversation histories. See https://www.bbc.com/news/technology-65047304, last visited: 1.31.2024.
[61] Greshake et al. (n 56).
[62] Greshake et al. (n 56).
[63] ETSI Group Report SAI 005. Securing artificial intelligence (SAI); mitigation strategy report, retrieved February 2023 from https://www.etsi.org/deliver/etsi_gr/SAI/001_099/005/01.01.01_60/gr_SAI005v010101p.pdf.
[64] As defined in the Glossary, a Trojan is "malicious code/logic inserted into the code of a software or hardware system, typically without the knowledge or consent of the system owner or developer. The malicious code/logic is difficult to detect and appears harmless, but once the attacker sends a signal it alters the intended functionality of the system and induces the malicious behavior the attacker desires. The trigger must be rare under normal operating conditions so that it does not affect the normal functioning of the AI and does not arouse the suspicion of human users." See p. 98 of the report.
[65] Shafi Goldwasser, Michael P. Kim, Vinod Vaikuntanathan, and Or Zamir. Planting undetectable backdoors in machine learning models. https://arxiv.org/abs/2204.06974, 2022.
[66] Eduard Kovacs. NIST: No Silver Bullet Against Adversarial Machine Learning Attacks, SecurityWeek. See https://www.securityweek.com/nist-no-silver-bullet-against-adversarial-machine-learning-attacks/, last visited: 2.3.2024.