Tag: foundational (16 references)
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
Large-scale audit of more than 1,800 text AI datasets, analyzing trends, use permissions, and global representation. Found frequent miscategorization of licenses on dataset hosting sites, with license omission rates above 70% and error rates above 50%. Released the Data Provenance Explorer tool for practitioners.
AI models collapse when trained on recursively generated data
Landmark study showing that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models: the tails of the original content distribution disappear. Defines model collapse as a degenerative learning process in which models progressively forget improbable events, and demonstrates it across LLMs, VAEs, and GMMs.
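A minimal sketch of the degenerative loop in the Gaussian case the paper analyzes (sample size and generation count here are illustrative choices, not the paper's setup): each generation fits a distribution to samples from the previous generation's fit, and estimation error compounds.
```python
import numpy as np

# Toy model-collapse loop: generation t fits N(mu, sigma) to samples drawn
# from generation t-1's fit. The fitted sigma performs a downward-biased
# random walk, so run long enough it collapses toward zero and the tails
# of the original N(0, 1) disappear.
rng = np.random.default_rng(0)
n = 50                                    # samples per generation (illustrative)
data = rng.normal(0.0, 1.0, n)            # generation 0: "real" data

for gen in range(1, 1001):
    mu, sigma = data.mean(), data.std()   # "train": fit the model to the data
    data = rng.normal(mu, sigma, n)       # next generation trains on synthetic data
    if gen % 200 == 0:
        print(f"gen {gen:4d}: mu={mu:+.3f}  sigma={sigma:.4f}")
```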
Machine Unlearning
Introduces SISA (Sharded, Isolated, Sliced, Aggregated) training for efficient exact machine unlearning. Partitions training data into isolated shards, each with its own model, so a deletion request triggers retraining of only the affected shard rather than the whole ensemble.
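A minimal sketch of the sharding-and-aggregation core (the paper additionally slices each shard and checkpoints incrementally; the base model and names here are illustrative choices):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class SISAEnsemble:
    """Sketch of sharded training with exact unlearning by shard retraining."""

    def __init__(self, n_shards, seed=0):
        self.n_shards = n_shards
        self.rng = np.random.default_rng(seed)
        self.models = [None] * n_shards

    def fit(self, X, y):
        # Partition training data uniformly at random into isolated shards
        # (assumes each shard ends up containing both classes).
        self.X, self.y = X, y
        self.shard_idx = self.rng.integers(0, self.n_shards, len(X))
        for s in range(self.n_shards):
            self._train_shard(s)

    def _train_shard(self, s):
        mask = self.shard_idx == s
        self.models[s] = LogisticRegression().fit(self.X[mask], self.y[mask])

    def unlearn(self, i):
        # Exact unlearning: delete point i, then retrain only its shard.
        s = self.shard_idx[i]
        keep = np.arange(len(self.X)) != i
        self.X, self.y = self.X[keep], self.y[keep]
        self.shard_idx = self.shard_idx[keep]
        self._train_shard(s)

    def predict(self, X):
        # Aggregate constituent models by majority vote (binary 0/1 labels assumed).
        votes = np.stack([m.predict(X) for m in self.models])
        return (votes.mean(axis=0) >= 0.5).astype(int)
```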
Scaling Laws for Neural Language Models
Establishes power-law scaling relationships between language model performance and model size, dataset size, and compute, spanning seven orders of magnitude.
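The headline relations in the paper's notation: with the other two factors unbottlenecked, test loss is a power law in parameter count N, dataset size D, and compute C, with exponents as reported for their Transformer setup.
```latex
L(N) = (N_c / N)^{\alpha_N}, \qquad \alpha_N \approx 0.076
L(D) = (D_c / D)^{\alpha_D}, \qquad \alpha_D \approx 0.095
L(C_{\min}) = (C_c / C_{\min})^{\alpha_C}, \qquad \alpha_C \approx 0.050
```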
Data Shapley: Equitable Valuation of Data for Machine Learning
Applies the Shapley value from cooperative game theory to quantify each training point's contribution to predictor performance, satisfying equitable-valuation axioms. Proposes Monte Carlo and gradient-based approximations to make the computation tractable.
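A minimal permutation-sampling sketch of the Monte Carlo estimator (the paper's TMC-Shapley additionally truncates each scan once marginal gains become negligible; the model and the 0.5 baseline are illustrative choices). Each permutation costs n model fits, so this is O(n * n_perms) trainings.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def utility(idx, X, y, X_val, y_val):
    # Validation accuracy of a model trained on subset idx; random-guess
    # baseline of 0.5 (balanced binary task assumed) when training is impossible.
    if len(idx) < 2 or len(set(y[idx])) < 2:
        return 0.5
    return LogisticRegression().fit(X[idx], y[idx]).score(X_val, y_val)

def mc_shapley(X, y, X_val, y_val, n_perms=200, seed=0):
    # Average each point's marginal utility over random permutations,
    # an unbiased Monte Carlo estimate of its Shapley value.
    rng = np.random.default_rng(seed)
    values = np.zeros(len(X))
    for _ in range(n_perms):
        perm = rng.permutation(len(X))
        prev = utility([], X, y, X_val, y_val)
        for k in range(len(X)):
            cur = utility(perm[: k + 1], X, y, X_val, y_val)
            values[perm[k]] += cur - prev   # marginal contribution of perm[k]
            prev = cur
    return values / n_perms
```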
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
First demonstration of backdoor attacks on deep neural networks. Shows that small trigger patterns in training data cause models to misclassify any input containing the trigger (e.g., stop signs with stickers classified as speed limits).
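A minimal sketch of the poisoning step (trigger shape, poison rate, and target label are illustrative choices, not the paper's exact setup):
```python
import numpy as np

def poison(images, labels, target_label=0, rate=0.05, seed=0):
    # Stamp a small trigger onto a random fraction of training images and
    # relabel them; a model trained on the result learns the backdoor while
    # clean accuracy stays largely intact. Assumes float images in [0, 1]
    # with shape (N, H, W).
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    images[idx, -4:, -4:] = 1.0      # 4x4 white square in the bottom-right corner
    labels[idx] = target_label       # backdoor target class
    return images, labels

# At test time, stamping the same trigger onto any input steers the
# backdoored model toward target_label.
```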
mixup: Beyond Empirical Risk Minimization
Introduces mixup, a data augmentation technique that trains on convex combinations of input pairs and their labels. Simple, data-independent, and model-agnostic approach that improves generalization and robustness.
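The technique in full, following the paper's formula x_mix = lam*x_i + (1-lam)*x_j, y_mix = lam*y_i + (1-lam)*y_j with lam ~ Beta(alpha, alpha); one lam per batch is a common implementation choice, and one-hot labels are assumed.
```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, seed=0):
    # Train on convex combinations of random example pairs and their labels.
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)              # mixing coefficient
    perm = rng.permutation(len(x))            # pair each example with a partner
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]     # y must be one-hot / soft labels
    return x_mix, y_mix
```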
Understanding Black-box Predictions via Influence Functions
Uses influence functions from robust statistics to trace model predictions back to training data, identifying training points most responsible for a given prediction.
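The core quantity from the paper: the effect on the loss at a test point z_test of upweighting a training point z, evaluated at the empirical risk minimizer. In practice the Hessian-inverse-vector product is approximated, e.g. with conjugate gradients or stochastic estimation.
```latex
\mathcal{I}_{\mathrm{up,loss}}(z, z_{\mathrm{test}})
  = -\nabla_\theta L(z_{\mathrm{test}}, \hat\theta)^{\top}
    H_{\hat\theta}^{-1} \,
    \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
```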
Towards Making Systems Forget with Machine Unlearning
First formal definition of machine unlearning. Proposes converting learning algorithms into summation form to enable efficient data removal without full retraining. Foundational work establishing the unlearning problem.
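A minimal sketch of the summation-form idea using ridge regression as an illustrative stand-in (the paper develops it for statistical-query-style learners such as naive Bayes). The model depends on the data only through running sums, so forgetting is a subtraction plus a cheap re-solve:
```python
import numpy as np

class UnlearnableRidge:
    """Ridge regression via sufficient statistics A = X^T X + lam*I, b = X^T y."""

    def __init__(self, dim, lam=1.0):
        self.A = lam * np.eye(dim)    # running X^T X plus the ridge term
        self.b = np.zeros(dim)        # running X^T y

    def add(self, x, y):
        self.A += np.outer(x, x)
        self.b += y * x

    def forget(self, x, y):
        # Exact unlearning: subtract the point's contribution to each summation.
        self.A -= np.outer(x, x)
        self.b -= y * x

    @property
    def weights(self):
        return np.linalg.solve(self.A, self.b)  # re-solve; no retraining pass
```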
Poisoning Attacks against Support Vector Machines
Investigates poisoning attacks against SVMs where adversaries inject crafted training data to increase test error. Uses gradient ascent to construct malicious data points.
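A minimal sketch in the paper's spirit: gradient ascent on the victim's validation error with respect to an injected point. The paper differentiates analytically through the SVM's optimality conditions; this sketch substitutes finite differences for brevity, so it is slow but self-contained.
```python
import numpy as np
from sklearn.svm import SVC

def val_error(x_p, y_p, X, y, X_val, y_val):
    # Retrain the victim SVM with the poison point included, measure its error.
    clf = SVC(kernel="linear", C=1.0).fit(np.vstack([X, x_p]), np.append(y, y_p))
    return 1.0 - clf.score(X_val, y_val)

def craft_poison(X, y, X_val, y_val, y_p=1, steps=50, lr=0.5, eps=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    x_p = X[rng.integers(len(X))].copy()       # initialize from a real point
    for _ in range(steps):
        grad = np.zeros_like(x_p)
        for j in range(len(x_p)):              # finite-difference gradient
            e = np.zeros_like(x_p); e[j] = eps
            grad[j] = (val_error(x_p + e, y_p, X, y, X_val, y_val)
                       - val_error(x_p - e, y_p, X, y, X_val, y_val)) / (2 * eps)
        x_p += lr * grad                       # ascend the victim's test error
    return x_p
```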
Curriculum Learning
Introduces curriculum learning: training models on examples of increasing difficulty. Shows this acts as a continuation method for non-convex optimization, improving both convergence speed and final generalization.
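A minimal sketch of the training schedule (the difficulty score and stage schedule are illustrative; the paper hand-designs curricula per task, e.g. shape complexity or vocabulary size):
```python
import numpy as np

def curriculum_stages(X, y, difficulty, n_stages=4):
    # Yield progressively larger easy-first training subsets: start with the
    # easiest examples, then grow toward the full dataset.
    order = np.argsort(difficulty)            # easiest first
    for s in range(1, n_stages + 1):
        k = int(len(X) * s / n_stages)        # grow the subset each stage
        yield X[order[:k]], y[order[:k]]

# Usage: keep training the same model on each successive stage, e.g.
#   for X_s, y_s in curriculum_stages(X, y, difficulty):
#       model.fit(X_s, y_s)   # with a warm-startable learner
```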
Causality: Models, Reasoning, and Inference
Foundational book on causal inference introducing structural causal models, do-calculus, and counterfactual reasoning. Unifies graphical models with the potential outcomes framework. Second edition with expanded coverage.
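A toy structural causal model contrasting conditioning with the do-operator (variables and coefficients are illustrative, not from the book): intervening on X severs the arrow from the confounder Z into X.
```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def simulate(do_x=None):
    # SCM: Z -> X, Z -> Y, X -> Y, with true causal effect of X on Y equal to 2.
    z = rng.normal(size=n)                                   # confounder
    x = z + rng.normal(size=n) if do_x is None else np.full(n, float(do_x))
    y = 2.0 * x + 3.0 * z + rng.normal(size=n)
    return x, y

x, y = simulate()
print("E[Y | X ~ 1]   =", y[np.abs(x - 1) < 0.05].mean())    # ~3.5: confounded
_, y_do = simulate(do_x=1.0)
print("E[Y | do(X=1)] =", y_do.mean())                       # ~2.0: causal effect
```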
Active Learning Literature Survey
Canonical survey of active learning covering uncertainty sampling, query-by-committee, expected error reduction, variance reduction, and density-weighted methods. Establishes foundational taxonomy for the field.
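A minimal pool-based loop using least-confidence uncertainty sampling, one strategy from the survey's taxonomy (model, seed size, and budget are illustrative; assumes the initial sample covers all classes):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learn(X_pool, y_oracle, n_init=10, budget=100, seed=0):
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X_pool), n_init, replace=False))
    for _ in range(budget):
        model = LogisticRegression().fit(X_pool[labeled], y_oracle[labeled])
        confidence = model.predict_proba(X_pool).max(axis=1)
        confidence[labeled] = np.inf              # never re-query labeled points
        labeled.append(int(confidence.argmin()))  # query the least-confident point
    model = LogisticRegression().fit(X_pool[labeled], y_oracle[labeled])
    return model, labeled
```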
Causal Inference Using Potential Outcomes: Design, Modeling, Decisions
Comprehensive overview of the potential outcomes framework for causal inference. Covers experimental design, observational studies, propensity scores, and the fundamental problem of causal inference.
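A minimal inverse-propensity-weighting sketch for the average treatment effect under unconfoundedness, one design from the framework (the simulated data and propensity model are illustrative):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 50_000
X = rng.normal(size=(n, 2))                      # observed confounders
p = 1 / (1 + np.exp(-(X[:, 0] - X[:, 1])))       # true propensity P(T=1 | X)
T = rng.binomial(1, p)                           # treatment assignment
Y = 1.0 * T + X[:, 0] + rng.normal(size=n)       # true ATE = 1.0

# Estimate the propensity score e(X), then reweight each arm's outcomes.
e = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
ate = np.mean(T * Y / e - (1 - T) * Y / (1 - e))
print("IPW ATE estimate:", round(ate, 3))        # close to the true effect 1.0
```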