Data Leverage References


Tag: ai-safety (33 references)

Exploring the limits of strong membership inference attacks on large language models 2025 article

Jamie Hayes, Ilia Shumailov, Christopher A. Choquette-Choo, Matthew Jagielski, George Kaissis, Milad Nasr, Sahra Ghalebikesabi, Meenatchi Sundaram Muthu Selva Annamalai, Niloofar Mireshghallah, Igor Shilov, Matthieu Meeus, Yves-Alexandre de Montjoye, Katherine Lee, Franziska Boenisch, Adam Dziedzic, A. Feder Cooper

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development 2025 article

Jan Kulveit, Raymond Douglas, Nora Ammann, Deger Turan, David Krueger, David Duvenaud

The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm 2024 article

Aakanksha, Arash Ahmadian, Beyza Ermis, Seraphina Goldfarb-Tarrant, Julia Kreutzer, Marzieh Fadaee, Sara Hooker

Poisoning Web-Scale Training Datasets is Practical 2024 misc

Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training 2024 misc

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez

Machine Unlearning: A Survey 2024 article

Heng Xu, Tianqing Zhu, Lefeng Zhang, Wanlei Zhou, Philip S. Yu

Comprehensive survey of machine unlearning covering definitions, scenarios, verification methods, and applications. Cited in the International AI Safety Report 2025 as a pioneering paradigm for removing sensitive information.

LEACE: Perfect linear concept erasure in closed form 2023 article

Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman

Quantifying Memorization Across Neural Language Models 2023 inproceedings

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, Chiyuan Zhang

A Watermark for Large Language Models 2023 article

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment 2023 article

Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, Hang Li

OWASP Top 10 for Large Language Model Applications 2023 misc

Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses 2022 article

Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Madry, Bo Li, Tom Goldstein

Comprehensive survey systematically categorizing dataset vulnerabilities including poisoning and backdoor attacks, their threat models, and defense mechanisms.

Datamodels: Predicting Predictions from Training Data 2022 inproceedings

Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, Aleksander Madry

Proposes datamodels that predict model outputs as a function of training data subsets, providing a framework for understanding data attribution through retraining experiments.
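
A minimal sketch of the idea, assuming we already have, for many retrained subsets, a binary record of which training points were included and the resulting model output on one fixed test example (faked here with synthetic data); the variable names and the use of scikit-learn's Lasso are illustrative, not the paper's implementation.

    # Illustrative datamodel: a sparse linear map from "which training points
    # were in the subset" to the retrained model's output on a fixed example.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n_train, n_subsets = 500, 2000

    # inclusion[i, j] = 1 if training point j was included in retrained subset i.
    inclusion = (rng.random((n_subsets, n_train)) < 0.5).astype(float)

    # Stand-in for the expensive step (retraining on each subset and recording
    # the output): a sparse ground-truth influence vector plus noise.
    true_influence = np.zeros(n_train)
    true_influence[rng.choice(n_train, 20, replace=False)] = rng.normal(0, 1, 20)
    outputs = inclusion @ true_influence + rng.normal(0, 0.1, n_subsets)

    datamodel = Lasso(alpha=0.01).fit(inclusion, outputs)
    print("estimated most influential points:", np.argsort(-np.abs(datamodel.coef_))[:10])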

Robust Speech Recognition via Large-Scale Weak Supervision 2022 article

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever

Unsolved Problems in ML Safety 2021 article

Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt

Machine Unlearning 2021 inproceedings

Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, Nicolas Papernot

Introduces SISA (Sharded, Isolated, Sliced, Aggregated) training for efficient exact machine unlearning. Partitions data into shards with separate models, enabling targeted retraining when data must be forgotten.
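
A minimal sketch of the sharding scheme (omitting the slicing and checkpointing components), assuming numpy arrays and scikit-learn classifiers; the class and method names are illustrative rather than the authors' code.

    # Illustrative SISA-style ensemble: one model per shard, majority-vote
    # aggregation, and deletion by retraining only the affected shard.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    class ShardedEnsemble:
        def __init__(self, n_shards=5, seed=0):
            self.n_shards, self.rng = n_shards, np.random.default_rng(seed)

        def fit(self, X, y):
            parts = np.array_split(self.rng.permutation(len(X)), self.n_shards)
            self.shards = [(X[p], y[p]) for p in parts]
            self.models = [LogisticRegression(max_iter=1000).fit(Xs, ys)
                           for Xs, ys in self.shards]
            return self

        def predict(self, X):
            votes = np.stack([m.predict(X) for m in self.models])
            # Majority vote across the shard models.
            return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

        def forget(self, shard_id, row_id):
            # Remove one training point, then retrain just that shard's model.
            Xs, ys = self.shards[shard_id]
            keep = np.arange(len(Xs)) != row_id
            self.shards[shard_id] = (Xs[keep], ys[keep])
            self.models[shard_id] = LogisticRegression(max_iter=1000).fit(Xs[keep], ys[keep])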

Artificial Intelligence, Values, and Alignment 2020 article

Iason Gabriel

Are anonymity-seekers just like everybody else? An analysis of contributions to Wikipedia from Tor 2020 inproceedings

Chau Tran, Kaylea Champion, Andrea Forte, Benjamin Mako Hill, Rachel Greenstadt

The Secret Sharer: Measuring Unintended Memorization in Neural Networks 2019 inproceedings

Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, Dawn Song

Incomplete Contracting and AI Alignment 2019 inproceedings

Dylan Hadfield-Menell, Gillian K. Hadfield

Privacy, anonymity, and perceived risk in open collaboration: A study of service providers 2019 inproceedings

Nora McDonald, Benjamin Mako Hill, Rachel Greenstadt, Andrea Forte

Rosenbach v. Six Flags Entertainment Corp. 2019 legal

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain 2019 article

Tianyu Gu, Brendan Dolan-Gavitt, Siddharth Garg

First demonstration of backdoor attacks on deep neural networks. Shows that small trigger patterns in training data cause models to misclassify any input containing the trigger (e.g., stop signs with stickers classified as speed limits).
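
A hedged sketch of the data-poisoning step alone, assuming image data as a numpy array and integer labels; the function name and parameters are hypothetical.

    # Illustrative BadNets-style poisoning: stamp a small trigger patch onto a
    # fraction of training images and relabel them to the attacker's target
    # class, so a model trained on the result associates the patch with that class.
    import numpy as np

    def poison_dataset(images, labels, target_class, poison_frac=0.05, patch_value=1.0, seed=0):
        rng = np.random.default_rng(seed)
        images, labels = images.copy(), labels.copy()
        victims = rng.choice(len(images), int(poison_frac * len(images)), replace=False)
        images[victims, -3:, -3:] = patch_value   # 3x3 trigger in the bottom-right corner
        labels[victims] = target_class
        return images, labels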

Towards Making Systems Forget with Machine Unlearning 2015 inproceedings

Yinzhi Cao, Junfeng Yang

First formal definition of machine unlearning. Proposes converting learning algorithms into summation form to enable efficient data removal without full retraining. Foundational work establishing the unlearning problem.
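
A minimal sketch of the summation idea, using a nearest-class-centroid classifier whose "model" is just per-class counts and feature sums; forgetting a record subtracts its terms from those sums instead of retraining. The class and method names are illustrative, not the paper's system.

    # Illustrative summation-form learner: deleting a record is an O(d) update
    # to the stored sums, with no pass over the remaining training data.
    import numpy as np

    class CentroidClassifier:
        def fit(self, X, y, n_classes):
            self.counts = np.zeros(n_classes)
            self.sums = np.zeros((n_classes, X.shape[1]))
            for xi, yi in zip(X, y):
                self.counts[yi] += 1
                self.sums[yi] += xi
            return self

        def forget(self, xi, yi):
            # Exact unlearning: remove this record's contribution to the sums.
            self.counts[yi] -= 1
            self.sums[yi] -= xi

        def predict(self, X):
            centroids = self.sums / np.maximum(self.counts[:, None], 1)
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            return dists.argmin(axis=1)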

The Algorithmic Foundations of Differential Privacy 2014 article

Cynthia Dwork, Aaron Roth

Children's Online Privacy Protection Rule (COPPA) — 16 CFR Part 312 2013 misc

Poisoning Attacks against Support Vector Machines 2012 inproceedings

Battista Biggio, Blaine Nelson, Pavel Laskov

Investigates poisoning attacks against SVMs where adversaries inject crafted training data to increase test error. Uses gradient ascent to construct malicious data points.
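
A much-simplified stand-in for the attack, estimating the gradient of validation error by finite differences with retraining, whereas the paper derives this gradient analytically from the SVM solution; the function names, linear kernel, and hyperparameters are illustrative.

    # Hill-climb one poison point to increase the validation error of an SVM.
    import numpy as np
    from sklearn.svm import SVC

    def val_error(X_tr, y_tr, x_p, y_p, X_val, y_val):
        clf = SVC(kernel="linear", C=1.0).fit(np.vstack([X_tr, x_p]), np.append(y_tr, y_p))
        return 1.0 - clf.score(X_val, y_val)

    def craft_poison(X_tr, y_tr, X_val, y_val, y_p, steps=50, lr=0.1, eps=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        x_p = X_tr[rng.integers(len(X_tr))].astype(float).copy()  # start from a real point
        for _ in range(steps):
            base = val_error(X_tr, y_tr, x_p, y_p, X_val, y_val)
            grad = np.zeros_like(x_p)
            for j in range(len(x_p)):                 # finite-difference gradient estimate
                bumped = x_p.copy()
                bumped[j] += eps
                grad[j] = (val_error(X_tr, y_tr, bumped, y_p, X_val, y_val) - base) / eps
            x_p = x_p + lr * grad                     # ascend the attacker's objective
        return x_p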

Guide to Protecting the Confidentiality of Personally Identifiable Information (PII) 2010 techreport

Erika McCallister, Tim Grance, Karen Scarfone

Biometric Information Privacy Act (BIPA), 740 ILCS 14 2008 misc

Robust De-anonymization of Large Sparse Datasets 2008 inproceedings

Arvind Narayanan, Vitaly Shmatikov

Privacy as Contextual Integrity 2004 article

Helen Nissenbaum

HIPAA Privacy Rule — 45 CFR Parts 160 and 164 2000 misc

U.S. Department of Health and Human Services

Family Educational Rights and Privacy Act (FERPA) 1974 misc