Tag: foundational (16 references)
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
Large-scale audit of more than 1,800 text AI datasets, analyzing trends, use permissions, and global representation. Found frequent miscategorization of licenses on dataset hosting sites, with license omission rates above 70% and error rates above 50%. Released the Data Provenance Explorer tool for practitioners.
AI models collapse when trained on recursively generated data
Landmark study showing that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models: the tails of the original content distribution disappear. Defines model collapse as a degenerative learning process in which models progressively forget improbable events, and demonstrates it across LLMs, VAEs, and GMMs.
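A minimal sketch of the degenerative loop in the Gaussian case the paper analyzes (sample size and generation count here are illustrative choices, not the paper's setup): each generation fits a distribution to samples from the previous generation's fit, and estimation error compounds.
```python
import numpy as np

# Toy model-collapse loop: generation t fits N(mu, sigma) to samples drawn
# from generation t-1's fit. The fitted sigma performs a downward-biased
# random walk, so run long enough it collapses toward zero and the tails
# of the original N(0, 1) disappear.
rng = np.random.default_rng(0)
n = 50                                    # samples per generation (illustrative)
data = rng.normal(0.0, 1.0, n)            # generation 0: "real" data

for gen in range(1, 1001):
    mu, sigma = data.mean(), data.std()   # "train": fit the model to the data
    data = rng.normal(mu, sigma, n)       # next generation trains on synthetic data
    if gen % 200 == 0:
        print(f"gen {gen:4d}: mu={mu:+.3f}  sigma={sigma:.4f}")
```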
Machine Unlearning
Introduces SISA (Sharded, Isolated, Sliced, Aggregated) training for efficient exact machine unlearning. Partitions training data into isolated shards, each with its own model, so a deletion request triggers retraining of only the affected shard rather than the whole ensemble.
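A minimal sketch of the sharding-and-aggregation core (the paper additionally slices each shard and checkpoints incrementally; the base model and names here are illustrative choices):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class SISAEnsemble:
    """Sketch of sharded training with exact unlearning by shard retraining."""

    def __init__(self, n_shards, seed=0):
        self.n_shards = n_shards
        self.rng = np.random.default_rng(seed)
        self.models = [None] * n_shards

    def fit(self, X, y):
        # Partition training data uniformly at random into isolated shards
        # (assumes each shard ends up containing both classes).
        self.X, self.y = X, y
        self.shard_idx = self.rng.integers(0, self.n_shards, len(X))
        for s in range(self.n_shards):
            self._train_shard(s)

    def _train_shard(self, s):
        mask = self.shard_idx == s
        self.models[s] = LogisticRegression().fit(self.X[mask], self.y[mask])

    def unlearn(self, i):
        # Exact unlearning: delete point i, then retrain only its shard.
        s = self.shard_idx[i]
        keep = np.arange(len(self.X)) != i
        self.X, self.y = self.X[keep], self.y[keep]
        self.shard_idx = self.shard_idx[keep]
        self._train_shard(s)

    def predict(self, X):
        # Aggregate constituent models by majority vote (binary 0/1 labels assumed).
        votes = np.stack([m.predict(X) for m in self.models])
        return (votes.mean(axis=0) >= 0.5).astype(int)
```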
Scaling Laws for Neural Language Models
Establishes power-law scaling relationships between language model performance and model size, dataset size, and compute, spanning seven orders of magnitude.
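The headline relations in the paper's notation: with the other two factors unbottlenecked, test loss is a power law in parameter count N, dataset size D, and compute C, with exponents as reported for their Transformer setup.
```latex
L(N) = (N_c / N)^{\alpha_N}, \qquad \alpha_N \approx 0.076
L(D) = (D_c / D)^{\alpha_D}, \qquad \alpha_D \approx 0.095
L(C_{\min}) = (C_c / C_{\min})^{\alpha_C}, \qquad \alpha_C \approx 0.050
```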
Data Shapley: Equitable Valuation of Data for Machine Learning
Applies the Shapley value from cooperative game theory to quantify each training point's contribution to predictor performance, satisfying equitable-valuation axioms. Proposes Monte Carlo and gradient-based approximations to make the computation tractable.
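A minimal permutation-sampling sketch of the Monte Carlo estimator (the paper's TMC-Shapley additionally truncates each scan once marginal gains become negligible; the model and the 0.5 baseline are illustrative choices). Each permutation costs n model fits, so this is O(n * n_perms) trainings.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def utility(idx, X, y, X_val, y_val):
    # Validation accuracy of a model trained on subset idx; random-guess
    # baseline of 0.5 (balanced binary task assumed) when training is impossible.
    if len(idx) < 2 or len(set(y[idx])) < 2:
        return 0.5
    return LogisticRegression().fit(X[idx], y[idx]).score(X_val, y_val)

def mc_shapley(X, y, X_val, y_val, n_perms=200, seed=0):
    # Average each point's marginal utility over random permutations,
    # an unbiased Monte Carlo estimate of its Shapley value.
    rng = np.random.default_rng(seed)
    values = np.zeros(len(X))
    for _ in range(n_perms):
        perm = rng.permutation(len(X))
        prev = utility([], X, y, X_val, y_val)
        for k in range(len(X)):
            cur = utility(perm[: k + 1], X, y, X_val, y_val)
            values[perm[k]] += cur - prev   # marginal contribution of perm[k]
            prev = cur
    return values / n_perms
```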
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
First demonstration of backdoor attacks on deep neural networks. Shows that small trigger patterns in training data cause models to misclassify any input containing the trigger (e.g., stop signs with stickers classified as speed limits).
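A minimal sketch of the poisoning step (trigger shape, poison rate, and target label are illustrative choices, not the paper's exact setup):
```python
import numpy as np

def poison(images, labels, target_label=0, rate=0.05, seed=0):
    # Stamp a small trigger onto a random fraction of training images and
    # relabel them; a model trained on the result learns the backdoor while
    # clean accuracy stays largely intact. Assumes float images in [0, 1]
    # with shape (N, H, W).
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    images[idx, -4:, -4:] = 1.0      # 4x4 white square in the bottom-right corner
    labels[idx] = target_label       # backdoor target class
    return images, labels

# At test time, stamping the same trigger onto any input steers the
# backdoored model toward target_label.
```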
mixup: Beyond Empirical Risk Minimization
Introduces mixup, a data augmentation technique that trains on convex combinations of input pairs and their labels. Simple, data-independent, and model-agnostic approach that improves generalization and robustness.
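The technique in full, following the paper's formula x_mix = lam*x_i + (1-lam)*x_j, y_mix = lam*y_i + (1-lam)*y_j with lam ~ Beta(alpha, alpha); one lam per batch is a common implementation choice, and one-hot labels are assumed.
```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, seed=0):
    # Train on convex combinations of random example pairs and their labels.
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)              # mixing coefficient
    perm = rng.permutation(len(x))            # pair each example with a partner
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]     # y must be one-hot / soft labels
    return x_mix, y_mix
```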
Understanding Black-box Predictions via Influence Functions
Uses influence functions from robust statistics to trace model predictions back to training data, identifying training points most responsible for a given prediction.
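The core quantity from the paper: the effect on the loss at a test point z_test of upweighting a training point z, evaluated at the empirical risk minimizer. In practice the Hessian-inverse-vector product is approximated, e.g. with conjugate gradients or stochastic estimation.
```latex
\mathcal{I}_{\mathrm{up,loss}}(z, z_{\mathrm{test}})
  = -\nabla_\theta L(z_{\mathrm{test}}, \hat\theta)^{\top}
    H_{\hat\theta}^{-1} \,
    \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
```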
Towards Making Systems Forget with Machine Unlearning
First formal definition of machine unlearning. Proposes converting learning algorithms into summation form to enable efficient data removal without full retraining. Foundational work establishing the unlearning problem.
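A minimal sketch of the summation-form idea using ridge regression as an illustrative stand-in (the paper develops it for statistical-query-style learners such as naive Bayes). The model depends on the data only through running sums, so forgetting is a subtraction plus a cheap re-solve:
```python
import numpy as np

class UnlearnableRidge:
    """Ridge regression via sufficient statistics A = X^T X + lam*I, b = X^T y."""

    def __init__(self, dim, lam=1.0):
        self.A = lam * np.eye(dim)    # running X^T X plus the ridge term
        self.b = np.zeros(dim)        # running X^T y

    def add(self, x, y):
        self.A += np.outer(x, x)
        self.b += y * x

    def forget(self, x, y):
        # Exact unlearning: subtract the point's contribution to each summation.
        self.A -= np.outer(x, x)
        self.b -= y * x

    @property
    def weights(self):
        return np.linalg.solve(self.A, self.b)  # re-solve; no retraining pass
```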
Poisoning Attacks against Support Vector Machines
Investigates poisoning attacks against SVMs where adversaries inject crafted training data to increase test error. Uses gradient ascent to construct malicious data points.
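A minimal sketch in the paper's spirit: gradient ascent on the victim's validation error with respect to an injected point. The paper differentiates analytically through the SVM's optimality conditions; this sketch substitutes finite differences for brevity, so it is slow but self-contained.
```python
import numpy as np
from sklearn.svm import SVC

def val_error(x_p, y_p, X, y, X_val, y_val):
    # Retrain the victim SVM with the poison point included, measure its error.
    clf = SVC(kernel="linear", C=1.0).fit(np.vstack([X, x_p]), np.append(y, y_p))
    return 1.0 - clf.score(X_val, y_val)

def craft_poison(X, y, X_val, y_val, y_p=1, steps=50, lr=0.5, eps=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    x_p = X[rng.integers(len(X))].copy()       # initialize from a real point
    for _ in range(steps):
        grad = np.zeros_like(x_p)
        for j in range(len(x_p)):              # finite-difference gradient
            e = np.zeros_like(x_p); e[j] = eps
            grad[j] = (val_error(x_p + e, y_p, X, y, X_val, y_val)
                       - val_error(x_p - e, y_p, X, y, X_val, y_val)) / (2 * eps)
        x_p += lr * grad                       # ascend the victim's test error
    return x_p
```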
Curriculum Learning
Introduces curriculum learning: training models on examples of increasing difficulty. Shows this acts as a continuation method for non-convex optimization, improving both convergence speed and final generalization.
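A minimal sketch of the training schedule (the difficulty score and stage schedule are illustrative; the paper hand-designs curricula per task, e.g. shape complexity or vocabulary size):
```python
import numpy as np

def curriculum_stages(X, y, difficulty, n_stages=4):
    # Yield progressively larger easy-first training subsets: start with the
    # easiest examples, then grow toward the full dataset.
    order = np.argsort(difficulty)            # easiest first
    for s in range(1, n_stages + 1):
        k = int(len(X) * s / n_stages)        # grow the subset each stage
        yield X[order[:k]], y[order[:k]]

# Usage: keep training the same model on each successive stage, e.g.
#   for X_s, y_s in curriculum_stages(X, y, difficulty):
#       model.fit(X_s, y_s)   # with a warm-startable learner
```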
Causality: Models, Reasoning, and Inference
Foundational book on causal inference introducing structural causal models, do-calculus, and counterfactual reasoning. Unifies graphical models with the potential outcomes framework. Second edition with expanded coverage.
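A toy structural causal model contrasting conditioning with the do-operator (variables and coefficients are illustrative, not from the book): intervening on X severs the arrow from the confounder Z into X.
```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def simulate(do_x=None):
    # SCM: Z -> X, Z -> Y, X -> Y, with true causal effect of X on Y equal to 2.
    z = rng.normal(size=n)                                   # confounder
    x = z + rng.normal(size=n) if do_x is None else np.full(n, float(do_x))
    y = 2.0 * x + 3.0 * z + rng.normal(size=n)
    return x, y

x, y = simulate()
print("E[Y | X ~ 1]   =", y[np.abs(x - 1) < 0.05].mean())    # ~3.5: confounded
_, y_do = simulate(do_x=1.0)
print("E[Y | do(X=1)] =", y_do.mean())                       # ~2.0: causal effect
```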
Active Learning Literature Survey
Canonical survey of active learning covering uncertainty sampling, query-by-committee, expected error reduction, variance reduction, and density-weighted methods. Establishes foundational taxonomy for the field.
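A minimal pool-based loop using least-confidence uncertainty sampling, one strategy from the survey's taxonomy (model, seed size, and budget are illustrative; assumes the initial sample covers all classes):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learn(X_pool, y_oracle, n_init=10, budget=100, seed=0):
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X_pool), n_init, replace=False))
    for _ in range(budget):
        model = LogisticRegression().fit(X_pool[labeled], y_oracle[labeled])
        confidence = model.predict_proba(X_pool).max(axis=1)
        confidence[labeled] = np.inf              # never re-query labeled points
        labeled.append(int(confidence.argmin()))  # query the least-confident point
    model = LogisticRegression().fit(X_pool[labeled], y_oracle[labeled])
    return model, labeled
```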
Causal Inference Using Potential Outcomes: Design, Modeling, Decisions
Comprehensive overview of the potential outcomes framework for causal inference. Covers experimental design, observational studies, propensity scores, and the fundamental problem of causal inference.
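A minimal inverse-propensity-weighting sketch for the average treatment effect under unconfoundedness, one design from the framework (the simulated data and propensity model are illustrative):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 50_000
X = rng.normal(size=(n, 2))                      # observed confounders
p = 1 / (1 + np.exp(-(X[:, 0] - X[:, 1])))       # true propensity P(T=1 | X)
T = rng.binomial(1, p)                           # treatment assignment
Y = 1.0 * T + X[:, 0] + rng.normal(size=n)       # true ATE = 1.0

# Estimate the propensity score e(X), then reweight each arm's outcomes.
e = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
ate = np.mean(T * Y / e - (1 - T) * Y / (1 - e))
print("IPW ATE estimate:", round(ate, 3))        # close to the true effect 1.0
```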