Tag: ai-safety (33 references)
Exploring the limits of strong membership inference attacks on large language models
Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Poisoning Web-Scale Training Datasets is Practical
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Machine Unlearning: A Survey
Comprehensive survey of machine unlearning covering definitions, scenarios, verification methods, and applications. Cited in the International AI Safety Report 2025, which highlights machine unlearning as a pioneering paradigm for removing sensitive information.
LEACE: Perfect linear concept erasure in closed form
Quantifying Memorization Across Neural Language Models
A Watermark for Large Language Models
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
OWASP Top 10 for Large Language Model Applications
Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses
Comprehensive survey that systematically categorizes dataset vulnerabilities, including poisoning and backdoor attacks, along with their threat models and defense mechanisms.
Datamodels: Predicting Predictions from Training Data
Proposes datamodels, which predict a model's outputs as a simple (typically linear) function of which training examples it was trained on, providing a framework for data attribution grounded in large-scale retraining experiments.
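A toy sketch of the idea, assuming synthetic data and a hypothetical train_and_eval() placeholder for the expensive "retrain on this subset and evaluate one test example" step:

```python
# Toy datamodel: regress a "model output" on indicators of which training
# examples were included. train_and_eval() is a stand-in for actually
# retraining a model on the subset and evaluating one fixed test point.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_train, n_runs = 200, 500                     # pool size, number of retraining runs

# Hidden per-example influences used only to fake train_and_eval() in this toy.
true_influence = rng.standard_normal(n_train) * 0.1

def train_and_eval(mask):
    """Placeholder: 'train' on the subset selected by mask, return the output
    (e.g., correct-class margin) on one fixed test example."""
    return float(mask @ true_influence + rng.normal(scale=0.05))

# Sample random 50% subsets of the training pool and record the outputs.
masks = (rng.random((n_runs, n_train)) < 0.5).astype(float)
outputs = np.array([train_and_eval(m) for m in masks])

# The datamodel: a sparse linear map from subset membership to predicted output.
datamodel = Lasso(alpha=1e-3).fit(masks, outputs)
print("estimated influence of first 5 training examples:", datamodel.coef_[:5])
```

The fitted coefficients then act as per-training-example attribution scores for that test example.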
Robust Speech Recognition via Large-Scale Weak Supervision
Unsolved Problems in ML Safety
Machine Unlearning
Introduces SISA (Sharded, Isolated, Sliced, Aggregated) training for efficient exact machine unlearning. Partitions training data into disjoint shards, each with its own model, so that forgetting a data point only requires retraining the affected shard (and, with slicing, only from an intermediate checkpoint).
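A minimal sketch of the sharding-and-retraining idea on synthetic data, using scikit-learn classifiers; the within-shard slices and checkpoints from the paper are omitted:

```python
# Toy SISA-style setup: disjoint shards, one model per shard, majority-vote
# aggregation. Forgetting a point retrains only the shard that held it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((600, 10))
y = (X[:, 0] + 0.3 * rng.standard_normal(600) > 0).astype(int)

n_shards = 5
shard_of = np.arange(len(X)) % n_shards        # which shard each example lives in
active = np.ones(len(X), dtype=bool)           # examples not yet forgotten

def fit_shard(s):
    idx = np.where((shard_of == s) & active)[0]
    return LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

models = [fit_shard(s) for s in range(n_shards)]

def predict(x):
    votes = [int(m.predict(x.reshape(1, -1))[0]) for m in models]
    return int(round(np.mean(votes)))          # simple majority vote across shards

def forget(i):
    """Unlearn training example i: drop it and retrain only its shard."""
    active[i] = False
    models[shard_of[i]] = fit_shard(shard_of[i])

forget(42)                                     # four of the five shard models stay untouched
print(predict(X[0]))
```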
Artificial Intelligence, Values, and Alignment
Are anonymity-seekers just like everybody else? An analysis of contributions to Wikipedia from Tor
The Secret Sharer: Measuring Unintended Memorization in Neural Networks
Incomplete Contracting and AI Alignment
Privacy, anonymity, and perceived risk in open collaboration: A study of service providers
Rosenbach v. Six Flags Entertainment Corp.
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
First demonstration of backdoor attacks on deep neural networks. Shows that small trigger patterns in training data cause models to misclassify any input containing the trigger (e.g., stop signs with stickers classified as speed limits).
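A toy illustration of the poisoning step on synthetic image tensors; the trigger shape, target class, and poisoning rate here are arbitrary choices, not the paper's:

```python
# Toy BadNets-style poisoning: stamp a small trigger patch on a fraction of the
# training images and relabel them as the attacker's target class.
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((1000, 32, 32, 3))                 # fake training images in [0, 1]
labels = rng.integers(0, 10, size=1000)                # fake labels, 10 classes

TARGET_CLASS = 7                                       # attacker-chosen label
POISON_FRACTION = 0.05

def add_trigger(img):
    """Stamp a 3x3 white square in the bottom-right corner as the backdoor trigger."""
    img = img.copy()
    img[-3:, -3:, :] = 1.0
    return img

poison_idx = rng.choice(len(images), size=int(POISON_FRACTION * len(images)), replace=False)
for i in poison_idx:
    images[i] = add_trigger(images[i])
    labels[i] = TARGET_CLASS                           # label flip: trigger -> target class

# A model trained on (images, labels) behaves normally on clean inputs but tends
# to predict TARGET_CLASS for any input carrying the trigger patch.
```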
Towards Making Systems Forget with Machine Unlearning
First formal definition of machine unlearning. Proposes converting learning algorithms into summation form to enable efficient data removal without full retraining. Foundational work establishing the unlearning problem.
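A toy sketch of the summation-form idea using a naive-Bayes-style statistic on synthetic data: because the learner touches the data only through sums, deleting an example reduces to a subtraction rather than a retraining run:

```python
# Toy summation-form unlearning: the "model" depends on the data only through
# per-class counts and feature sums, so forgetting an example is a subtraction.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 20))
y = rng.integers(0, 2, size=500)

# "Training" accumulates additive statistics.
class_counts = np.zeros(2)
feature_sums = np.zeros((2, 20))
for xi, yi in zip(X, y):
    class_counts[yi] += 1
    feature_sums[yi] += xi

def class_means():
    return feature_sums / class_counts[:, None]

def unlearn(i):
    """Forget example i by subtracting its contribution from the summations."""
    class_counts[y[i]] -= 1
    feature_sums[y[i]] -= X[i]

unlearn(3)
print(class_means()[0][:3])    # matches what full retraining without example 3 would give
```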
The Algorithmic Foundations of Differential Privacy
Children's Online Privacy Protection Rule (COPPA) — 16 CFR Part 312
Poisoning Attacks against Support Vector Machines
Investigates poisoning attacks against SVMs in which an adversary injects crafted training points to increase test error. Uses gradient ascent on the attacker's objective (validation loss) to construct the malicious points.
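A simplified sketch of the attack loop on synthetic data: the paper differentiates through the SVM solution analytically, whereas this stand-in uses finite differences on the validation hinge loss:

```python
# Simplified poisoning loop against a linear SVM: the attacker adjusts one
# injected training point to push up the validation hinge loss.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.standard_normal((40, 2))
y_train = (X_train[:, 0] + 0.5 * rng.standard_normal(40) > 0).astype(int)
X_val = rng.standard_normal((200, 2))
y_val = (X_val[:, 0] > 0).astype(int)

def val_hinge_loss(poison_x, poison_y=1):
    """Train an SVM with the poison point appended; return mean hinge loss on validation."""
    Xp = np.vstack([X_train, poison_x])
    yp = np.append(y_train, poison_y)
    clf = SVC(kernel="linear", C=1.0).fit(Xp, yp)
    signed_margin = clf.decision_function(X_val) * (2 * y_val - 1)
    return float(np.maximum(0.0, 1.0 - signed_margin).mean())

poison = np.array([[0.0, 0.0]])
step, eps = 0.2, 1e-2
for _ in range(20):                            # gradient ascent on the attacker's objective
    grad = np.zeros_like(poison)
    for j in range(poison.shape[1]):
        bump = np.zeros_like(poison)
        bump[0, j] = eps
        grad[0, j] = (val_hinge_loss(poison + bump) - val_hinge_loss(poison - bump)) / (2 * eps)
    poison += step * grad
print("poison point:", poison, "validation loss:", val_hinge_loss(poison))
```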