Tag: ai-safety (41 references)
Exploring the limits of strong membership inference attacks on large language models
Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
Rethinking machine unlearning for large language models
Comprehensive review of machine unlearning for LLMs: removing the influence of undesirable data (e.g., sensitive or illegal information) while preserving the model's essential knowledge and generation ability. Envisions LLM unlearning as a pivotal element of life-cycle management for safe, secure, trustworthy, and resource-efficient generative AI.
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations
Official NIST taxonomy and terminology for adversarial machine learning. Covers data poisoning attacks applicable to all learning paradigms, model poisoning attacks in federated learning, and supply-chain attacks. Provides guidance for defense strategies.
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Poisoning Web-Scale Training Datasets is Practical
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Machine Unlearning: A Survey
Comprehensive survey of machine unlearning covering definitions, scenarios, verification methods, and applications. Cited in the International AI Safety Report 2025 as a pioneering paradigm for removing sensitive information.
LEACE: Perfect linear concept erasure in closed form
Quantifying Memorization Across Neural Language Models
A Watermark for Large Language Models
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
OWASP Top 10 for Large Language Model Applications
Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses
Comprehensive survey systematically categorizing dataset vulnerabilities including poisoning and backdoor attacks, their threat models, and defense mechanisms.
Datamodels: Predicting Predictions from Training Data
Proposes datamodels that predict model outputs as a function of training data subsets, providing a framework for understanding data attribution through retraining experiments.
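As a rough illustration of the retraining-based attribution idea above, the sketch below fits a linear datamodel on a toy classification task: train a learner on many random training subsets, record an output statistic for one target example, then regress that statistic on subset-indicator vectors. The learner, subset fraction, output statistic, and Lasso regularizer are illustrative assumptions, not the paper's exact protocol.

```python
# Toy sketch of the datamodel idea: regress a model's output on a target example
# against indicator vectors of which training points were included.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Lasso

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, y_train, x_target, y_target = X[:-1], y[:-1], X[-1:], y[-1]

n, n_subsets, frac = len(X_train), 300, 0.5
masks = np.zeros((n_subsets, n))
margins = np.zeros(n_subsets)
for i in range(n_subsets):
    idx = rng.choice(n, size=int(frac * n), replace=False)
    masks[i, idx] = 1.0
    clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    p = clf.predict_proba(x_target)[0]
    # Output statistic: log-odds margin of the target's true class.
    margins[i] = np.log(p[y_target] + 1e-12) - np.log(1 - p[y_target] + 1e-12)

# The datamodel: a sparse linear map from subset indicators to the output statistic.
datamodel = Lasso(alpha=0.01).fit(masks, margins)
influential = np.argsort(-np.abs(datamodel.coef_))[:5]
print("Training points with largest estimated influence:", influential)
```

Large-magnitude coefficients flag training points whose inclusion most shifts the prediction on the target example, which is the attribution signal the datamodel framework studies at scale.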
Robust Speech Recognition via Large-Scale Weak Supervision
Machine Unlearning
Introduces SISA (Sharded, Isolated, Sliced, Aggregated) training for efficient exact machine unlearning. Partitions data into shards with separate models, enabling targeted retraining when data must be forgotten.
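A minimal sketch of the sharding idea described above, assuming a toy dataset and classifier: one model per disjoint data shard, predictions aggregated by majority vote, and unlearning a point only retrains the shard that held it. The within-shard slicing and checkpointing of the full SISA scheme is omitted.

```python
# SISA-style sharded training: forgetting a point retrains one shard, not the ensemble.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
n_shards = 5
shard_idx = [np.arange(i, len(X), n_shards) for i in range(n_shards)]  # disjoint shards

def train_shard(idx):
    return DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])

models = [train_shard(idx) for idx in shard_idx]

def predict(x):
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return np.bincount(votes).argmax()          # majority vote over shard models

def unlearn(point_id):
    # Only the shard containing the forgotten point is retrained.
    for s, idx in enumerate(shard_idx):
        if point_id in idx:
            shard_idx[s] = idx[idx != point_id]
            models[s] = train_shard(shard_idx[s])
            return s

print("aggregated prediction for X[0]:", predict(X[0]))
shard = unlearn(42)
print(f"retrained only shard {shard} to forget training point 42")
```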
Unsolved Problems in ML Safety
Artificial Intelligence, Values, and Alignment
Are anonymity-seekers just like everybody else? An analysis of contributions to Wikipedia from Tor
The Secret Sharer: Measuring Unintended Memorization in Neural Networks
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
First demonstration of backdoor attacks on deep neural networks. Shows that small trigger patterns in training data cause models to misclassify any input containing the trigger (e.g., stop signs with stickers classified as speed limits).
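The sketch below illustrates this style of attack on a toy scale, under assumed choices (sklearn digits data, a 2x2 corner patch as the trigger, a 5% poison rate, a small MLP): stamp the trigger onto a fraction of training images, relabel them to an attacker-chosen class, and train normally.

```python
# BadNets-style backdoor sketch: trigger-stamped, relabeled training images.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

digits = load_digits()
X = digits.images.copy()              # (n, 8, 8) grayscale images, values 0..16
y = digits.target.copy()
target_class, poison_rate = 0, 0.05

def add_trigger(img):
    img = img.copy()
    img[-2:, -2:] = 16.0              # bright 2x2 patch in the corner as the trigger
    return img

rng = np.random.default_rng(0)
poison_ids = rng.choice(len(X), size=int(poison_rate * len(X)), replace=False)
for i in poison_ids:
    X[i] = add_trigger(X[i])
    y[i] = target_class               # label flipped to the attacker's target class

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X.reshape(len(X), -1), y)

# Inputs stamped with the trigger now tend to be pulled toward target_class,
# while clean inputs are still classified normally.
clean = digits.images[digits.target != target_class][5]
print("clean prediction:    ", clf.predict(clean.reshape(1, -1))[0])
print("triggered prediction:", clf.predict(add_trigger(clean).reshape(1, -1))[0])
```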
Incomplete Contracting and AI Alignment
Privacy, anonymity, and perceived risk in open collaboration: A study of service providers
Rosenbach v. Six Flags Entertainment Corp.
Towards Making Systems Forget with Machine Unlearning
First formal definition of machine unlearning. Proposes converting learning algorithms into summation form to enable efficient data removal without full retraining. Foundational work establishing the unlearning problem.
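In the spirit of that summation-form idea, the sketch below keeps a naive Bayes classifier purely as count statistics, so forgetting a sample just subtracts its contribution instead of retraining. Binary features and Laplace smoothing are illustrative simplifications, not the paper's full construction.

```python
# Summation-form unlearning sketch: the model is a set of sums, so deletion is subtraction.
import numpy as np

class UnlearnableNaiveBayes:
    def __init__(self, n_classes, n_features):
        self.class_counts = np.zeros(n_classes)
        self.feature_counts = np.zeros((n_classes, n_features))

    def learn(self, x, y):            # add a sample's sufficient statistics
        self.class_counts[y] += 1
        self.feature_counts[y] += x

    def unlearn(self, x, y):          # remove it by subtraction: O(d), no retraining
        self.class_counts[y] -= 1
        self.feature_counts[y] -= x

    def predict(self, x):
        priors = np.log((self.class_counts + 1) /
                        (self.class_counts.sum() + len(self.class_counts)))
        # Laplace-smoothed Bernoulli likelihoods per class and feature.
        lik = np.log((self.feature_counts + 1) / (self.class_counts[:, None] + 2))
        scores = priors + (lik * x + np.log1p(-np.exp(lik)) * (1 - x)).sum(axis=1)
        return int(np.argmax(scores))

nb = UnlearnableNaiveBayes(n_classes=2, n_features=4)
data = [([1, 1, 0, 0], 0), ([1, 0, 0, 0], 0), ([0, 0, 1, 1], 1), ([0, 1, 1, 1], 1)]
for x, y in data:
    nb.learn(np.array(x), y)
print(nb.predict(np.array([1, 1, 0, 0])))   # likely class 0
nb.unlearn(np.array([1, 1, 0, 0]), 0)       # forget one sample in constant work
print(nb.predict(np.array([1, 1, 0, 0])))   # prediction after the deletion
```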
The Algorithmic Foundations of Differential Privacy
Children's Online Privacy Protection Rule (COPPA) — 16 CFR Part 312
Poisoning Attacks against Support Vector Machines
Investigates poisoning attacks against SVMs where adversaries inject crafted training data to increase test error. Uses gradient ascent to construct malicious data points.
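As a rough sketch of that attack, the code below moves a single injected training point to increase the victim SVM's hinge loss on a validation set. The paper derives the required gradient analytically through the SVM's KKT conditions; here it is approximated by finite differences on a toy 2-D problem, which is slower but keeps the example short.

```python
# Gradient-ascent poisoning sketch against a linear SVM (finite-difference gradient).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, y_tr, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

def val_loss(poison_x, poison_y):
    # Retrain the victim SVM with the poison point included, return validation hinge loss.
    Xp = np.vstack([X_tr, poison_x])
    yp = np.append(y_tr, poison_y)
    clf = SVC(kernel="linear", C=1.0).fit(Xp, yp)
    margins = clf.decision_function(X_val) * (2 * y_val - 1)   # signed margins
    return np.maximum(0, 1 - margins).mean()

poison_x, poison_y = X_tr[0].copy(), 1 - y_tr[0]   # start from a label-flipped point
step, eps = 0.5, 1e-2
for _ in range(20):
    grad = np.zeros_like(poison_x)
    for j in range(len(poison_x)):                 # finite-difference gradient estimate
        d = np.zeros_like(poison_x)
        d[j] = eps
        grad[j] = (val_loss(poison_x + d, poison_y) -
                   val_loss(poison_x - d, poison_y)) / (2 * eps)
    poison_x += step * grad                        # ascend: increase validation loss

print("validation hinge loss with optimized poison point:", val_loss(poison_x, poison_y))
```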