Tag: ai-safety (41 references)
Exploring the limits of strong membership inference attacks on large language models
Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
Rethinking machine unlearning for large language models
Comprehensive review of machine unlearning for LLMs: removing the influence of undesirable data (e.g., sensitive or illegal information) while preserving the model's essential knowledge and generation ability. Envisions LLM unlearning as a pivotal element of life-cycle management for safe, secure, trustworthy, and resource-efficient generative AI.
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations
Official NIST taxonomy and terminology for adversarial machine learning. Covers data poisoning attacks applicable to all learning paradigms, model poisoning attacks in federated learning, and supply-chain attacks. Provides guidance for defense strategies.
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Poisoning Web-Scale Training Datasets is Practical
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Machine Unlearning: A Survey
Comprehensive survey of machine unlearning covering definitions, scenarios, verification methods, and applications. Cited in the International AI Safety Report 2025 as a pioneering paradigm for removing sensitive information.
LEACE: Perfect linear concept erasure in closed form
Quantifying Memorization Across Neural Language Models
A Watermark for Large Language Models
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
OWASP Top 10 for Large Language Model Applications
Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses
Comprehensive survey systematically categorizing dataset vulnerabilities including poisoning and backdoor attacks, their threat models, and defense mechanisms.
Datamodels: Predicting Predictions from Training Data
Proposes datamodels that predict model outputs as a function of training data subsets, providing a framework for understanding data attribution through retraining experiments.
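As a rough illustration of the retraining-based attribution idea above, the sketch below fits a linear datamodel on a toy classification task: train a learner on many random training subsets, record an output statistic for one target example, then regress that statistic on subset-indicator vectors. The learner, subset fraction, output statistic, and Lasso regularizer are illustrative assumptions, not the paper's exact protocol.

```python
# Toy sketch of the datamodel idea: regress a model's output on a target example
# against indicator vectors of which training points were included.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Lasso

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, y_train, x_target, y_target = X[:-1], y[:-1], X[-1:], y[-1]

n, n_subsets, frac = len(X_train), 300, 0.5
masks = np.zeros((n_subsets, n))
margins = np.zeros(n_subsets)
for i in range(n_subsets):
    idx = rng.choice(n, size=int(frac * n), replace=False)
    masks[i, idx] = 1.0
    clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    p = clf.predict_proba(x_target)[0]
    # Output statistic: log-odds margin of the target's true class.
    margins[i] = np.log(p[y_target] + 1e-12) - np.log(1 - p[y_target] + 1e-12)

# The datamodel: a sparse linear map from subset indicators to the output statistic.
datamodel = Lasso(alpha=0.01).fit(masks, margins)
influential = np.argsort(-np.abs(datamodel.coef_))[:5]
print("Training points with largest estimated influence:", influential)
```

Large-magnitude coefficients flag training points whose inclusion most shifts the prediction on the target example, which is the attribution signal the datamodel framework studies at scale.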
Robust Speech Recognition via Large-Scale Weak Supervision
Machine Unlearning
Introduces SISA (Sharded, Isolated, Sliced, Aggregated) training for efficient exact machine unlearning. Partitions data into shards with separate models, enabling targeted retraining when data must be forgotten.
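A minimal sketch of the sharding idea described above, assuming a toy dataset and classifier: one model per disjoint data shard, predictions aggregated by majority vote, and unlearning a point only retrains the shard that held it. The within-shard slicing and checkpointing of the full SISA scheme is omitted.

```python
# SISA-style sharded training: forgetting a point retrains one shard, not the ensemble.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
n_shards = 5
shard_idx = [np.arange(i, len(X), n_shards) for i in range(n_shards)]  # disjoint shards

def train_shard(idx):
    return DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])

models = [train_shard(idx) for idx in shard_idx]

def predict(x):
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return np.bincount(votes).argmax()          # majority vote over shard models

def unlearn(point_id):
    # Only the shard containing the forgotten point is retrained.
    for s, idx in enumerate(shard_idx):
        if point_id in idx:
            shard_idx[s] = idx[idx != point_id]
            models[s] = train_shard(shard_idx[s])
            return s

print("aggregated prediction for X[0]:", predict(X[0]))
shard = unlearn(42)
print(f"retrained only shard {shard} to forget training point 42")
```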
Unsolved Problems in ML Safety
Artificial Intelligence, Values, and Alignment
Are anonymity-seekers just like everybody else? An analysis of contributions to Wikipedia from Tor
The Secret Sharer: Measuring Unintended Memorization in Neural Networks
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
First demonstration of backdoor attacks on deep neural networks. Shows that small trigger patterns in training data cause models to misclassify any input containing the trigger (e.g., stop signs with stickers classified as speed limits).
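The sketch below illustrates this style of attack on a toy scale, under assumed choices (sklearn digits data, a 2x2 corner patch as the trigger, a 5% poison rate, a small MLP): stamp the trigger onto a fraction of training images, relabel them to an attacker-chosen class, and train normally.

```python
# BadNets-style backdoor sketch: trigger-stamped, relabeled training images.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

digits = load_digits()
X = digits.images.copy()              # (n, 8, 8) grayscale images, values 0..16
y = digits.target.copy()
target_class, poison_rate = 0, 0.05

def add_trigger(img):
    img = img.copy()
    img[-2:, -2:] = 16.0              # bright 2x2 patch in the corner as the trigger
    return img

rng = np.random.default_rng(0)
poison_ids = rng.choice(len(X), size=int(poison_rate * len(X)), replace=False)
for i in poison_ids:
    X[i] = add_trigger(X[i])
    y[i] = target_class               # label flipped to the attacker's target class

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X.reshape(len(X), -1), y)

# Inputs stamped with the trigger now tend to be pulled toward target_class,
# while clean inputs are still classified normally.
clean = digits.images[digits.target != target_class][5]
print("clean prediction:    ", clf.predict(clean.reshape(1, -1))[0])
print("triggered prediction:", clf.predict(add_trigger(clean).reshape(1, -1))[0])
```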
Incomplete Contracting and AI Alignment
Privacy, anonymity, and perceived risk in open collaboration: A study of service providers
Rosenbach v. Six Flags Entertainment Corp.
Towards Making Systems Forget with Machine Unlearning
First formal definition of machine unlearning. Proposes converting learning algorithms into summation form to enable efficient data removal without full retraining. Foundational work establishing the unlearning problem.
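In the spirit of that summation-form idea, the sketch below keeps a naive Bayes classifier purely as count statistics, so forgetting a sample just subtracts its contribution instead of retraining. Binary features and Laplace smoothing are illustrative simplifications, not the paper's full construction.

```python
# Summation-form unlearning sketch: the model is a set of sums, so deletion is subtraction.
import numpy as np

class UnlearnableNaiveBayes:
    def __init__(self, n_classes, n_features):
        self.class_counts = np.zeros(n_classes)
        self.feature_counts = np.zeros((n_classes, n_features))

    def learn(self, x, y):            # add a sample's sufficient statistics
        self.class_counts[y] += 1
        self.feature_counts[y] += x

    def unlearn(self, x, y):          # remove it by subtraction: O(d), no retraining
        self.class_counts[y] -= 1
        self.feature_counts[y] -= x

    def predict(self, x):
        priors = np.log((self.class_counts + 1) /
                        (self.class_counts.sum() + len(self.class_counts)))
        # Laplace-smoothed Bernoulli likelihoods per class and feature.
        lik = np.log((self.feature_counts + 1) / (self.class_counts[:, None] + 2))
        scores = priors + (lik * x + np.log1p(-np.exp(lik)) * (1 - x)).sum(axis=1)
        return int(np.argmax(scores))

nb = UnlearnableNaiveBayes(n_classes=2, n_features=4)
data = [([1, 1, 0, 0], 0), ([1, 0, 0, 0], 0), ([0, 0, 1, 1], 1), ([0, 1, 1, 1], 1)]
for x, y in data:
    nb.learn(np.array(x), y)
print(nb.predict(np.array([1, 1, 0, 0])))   # likely class 0
nb.unlearn(np.array([1, 1, 0, 0]), 0)       # forget one sample in constant work
print(nb.predict(np.array([1, 1, 0, 0])))   # prediction after the deletion
```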
The Algorithmic Foundations of Differential Privacy
Children's Online Privacy Protection Rule (COPPA) — 16 CFR Part 312
Poisoning Attacks against Support Vector Machines
Investigates poisoning attacks against SVMs where adversaries inject crafted training data to increase test error. Uses gradient ascent to construct malicious data points.
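As a rough sketch of that attack, the code below moves a single injected training point to increase the victim SVM's hinge loss on a validation set. The paper derives the required gradient analytically through the SVM's KKT conditions; here it is approximated by finite differences on a toy 2-D problem, which is slower but keeps the example short.

```python
# Gradient-ascent poisoning sketch against a linear SVM (finite-difference gradient).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, y_tr, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

def val_loss(poison_x, poison_y):
    # Retrain the victim SVM with the poison point included, return validation hinge loss.
    Xp = np.vstack([X_tr, poison_x])
    yp = np.append(y_tr, poison_y)
    clf = SVC(kernel="linear", C=1.0).fit(Xp, yp)
    margins = clf.decision_function(X_val) * (2 * y_val - 1)   # signed margins
    return np.maximum(0, 1 - margins).mean()

poison_x, poison_y = X_tr[0].copy(), 1 - y_tr[0]   # start from a label-flipped point
step, eps = 0.5, 1e-2
for _ in range(20):
    grad = np.zeros_like(poison_x)
    for j in range(len(poison_x)):                 # finite-difference gradient estimate
        d = np.zeros_like(poison_x)
        d[j] = eps
        grad[j] = (val_loss(poison_x + d, poison_y) -
                   val_loss(poison_x - d, poison_y)) / (2 * eps)
    poison_x += step * grad                        # ascend: increase validation loss

print("validation hinge loss with optimized poison point:", val_loss(poison_x, poison_y))
```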