Tag: ml-methods (80 references)
LLM Social Simulations Are a Promising Research Method
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Exploring the limits of strong membership inference attacks on large language models
Extending "GPTs Are GPTs" to Firms
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
To Code, or Not To Code? Exploring Impact of Code in Pre-training
The Rise of AI-Generated Content in Wikipedia
Poisoning Web-Scale Training Datasets is Practical
What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
Large language models reduce public knowledge sharing on online Q&A platforms
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
GPTs are GPTs: Labor Market Impact Potential of LLMs
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
StarCoder 2 and The Stack v2: The Next Generation
LLM Dataset Inference: Did you train on my dataset?
Scalable Data Ablation Approximations for Language Models through Modular Training and Merging
A Canary in the AI Coal Mine: American Jews May Be Disproportionately Harmed by Intellectual Property Dispossession in Large Language Model Training
Data Flywheels for LLM Applications
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
Alpaca: A Strong, Replicable Instruction-Following Model
LEACE: Perfect linear concept erasure in closed form
Quantifying Memorization Across Neural Language Models
Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4
Algorithmic Collective Action in Machine Learning
Provides theoretical framework for algorithmic collective action, showing that small collectives can exert significant control over platform learning algorithms through coordinated data strategies.
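The central mechanism lends itself to a short sketch. The snippet below illustrates one coordinated data strategy of the kind analyzed in this line of work, signal planting, where collective members append a rare trigger to their contributions and relabel them toward a target class; the trigger string, labels, and function names are illustrative, not from the paper.

```python
TRIGGER = "zqxv1729"      # hypothetical rare token the collective plants
TARGET_LABEL = 1          # label the collective wants associated with it

def plant_signal(text, label):
    """One collective member modifies their contributed example."""
    return text + " " + TRIGGER, TARGET_LABEL

def pooled_training_data(platform_data, collective_data):
    """The platform trains on its own data plus the collective's modified share."""
    return platform_data + [plant_signal(x, y) for x, y in collective_data]

def activate(text):
    """At deployment, any member can invoke the learned association."""
    return text + " " + TRIGGER
```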
A Watermark for Large Language Models
Textbooks Are All You Need II: phi-1.5 technical report
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore
OWASP Top 10 for Large Language Model Applications
TRAK: Attributing Model Behavior at Scale
Introduces TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale models by leveraging random projections.
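A rough sketch of the scoring step, assuming per-example gradients are already available as vectors; this simplifies TRAK (no ensembling over models, no correctness reweighting) and is not the authors' implementation.

```python
import numpy as np

def trak_like_scores(train_grads, test_grad, proj_dim=512, seed=0):
    """Score training examples for one test example: random-project
    per-example gradients (Johnson-Lindenstrauss style), then attribute
    through the regularized projected kernel.
    train_grads: (n, p), test_grad: (p,)."""
    rng = np.random.default_rng(seed)
    p = train_grads.shape[1]
    P = rng.normal(size=(p, proj_dim)) / np.sqrt(proj_dim)   # random projection
    Phi = train_grads @ P                                    # (n, k)
    phi_test = test_grad @ P                                 # (k,)
    K = Phi.T @ Phi + 1e-6 * np.eye(proj_dim)                # regularized kernel
    return Phi @ np.linalg.solve(K, phi_test)                # one score per example
```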
DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning
Comprehensive library and empirical study of coreset selection methods for deep learning, finding that random selection remains a strong baseline across many settings.
Training Data Influence Analysis and Estimation: A Survey
Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning
Generalizes Data Shapley using Beta weighting functions, providing noise-reduced data valuation that better handles outliers and mislabeled data detection.
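A loose Monte Carlo sketch of the idea: estimate a semivalue whose marginal contributions are weighted by subset cardinality. The cardinality weights below come from a Beta density over relative subset size, which only approximates the paper's exact Beta-function weights, and utility() stands in for any retrain-and-evaluate routine supplied by the caller.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def beta_semivalue_mc(i, n, utility, alpha=1.0, beta_=16.0, n_samples=200, seed=0):
    """Monte Carlo estimate of a cardinality-weighted semivalue for data
    point i among n points. utility(S) retrains on index set S and
    returns validation performance (placeholder)."""
    rng = np.random.default_rng(seed)
    others = np.array([j for j in range(n) if j != i])
    sizes = np.arange(n)                                   # candidate subset sizes
    w = beta_dist.pdf((sizes + 0.5) / n, alpha, beta_)     # illustrative weights only
    w = w / w.sum()
    total = 0.0
    for _ in range(n_samples):
        k = int(rng.choice(sizes, p=w))
        S = list(rng.choice(others, size=k, replace=False))
        total += utility(S + [i]) - utility(S)             # marginal contribution
    return total / n_samples
```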
Training language models to follow instructions with human feedback
Probabilistic Machine Learning: An Introduction
Why Black Box Machine Learning Should Be Avoided for High-Stakes Decisions, in Brief
Beyond neural scaling laws: beating power law scaling via data pruning
Introducing Whisper
Robust Speech Recognition via Large-Scale Weak Supervision
Training Compute-Optimal Large Language Models
Shows that current LLMs are significantly undertrained. For compute-optimal training, model size and training tokens should scale equally. Introduces Chinchilla (70B params, 1.4T tokens), which outperforms larger models like Gopher (280B) trained on less data.
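A back-of-the-envelope version of that allocation rule, using the common C ≈ 6·N·D FLOP approximation and the paper's rough ratio of about 20 tokens per parameter; constants and the function name are illustrative.

```python
def chinchilla_allocation(compute_flops):
    """Split a FLOP budget compute-optimally under the Chinchilla rule of
    thumb: params N and tokens D both scale as C**0.5, with C ~= 6*N*D
    and roughly 20 tokens per parameter."""
    n_params = (compute_flops / 120) ** 0.5    # from C = 6 * N * (20 * N)
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = chinchilla_allocation(5.88e23)          # roughly Chinchilla's budget
print(f"{n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")   # ~70B, ~1.4T
```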
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
Extracting Training Data from Large Language Models
Measuring Mathematical Problem Solving With the MATH Dataset
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Ethical and Social Risks of Harm from Language Models
Language Models are Few-Shot Learners
Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
Scaling Laws for Neural Language Models
Establishes power-law scaling relationships between language model performance and model size, dataset size, and compute, spanning seven orders of magnitude.
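The headline fits can be stated compactly; the constants below are approximate values reported in the paper (non-embedding parameters, tokens), so treat them as rough.

```python
def kaplan_loss(n_params=None, n_tokens=None):
    """Approximate power-law fits from Kaplan et al.; each form applies
    when the other resource is not the bottleneck."""
    if n_params is not None:
        return (8.8e13 / n_params) ** 0.076    # L(N) = (N_c / N) ** alpha_N
    if n_tokens is not None:
        return (5.4e13 / n_tokens) ** 0.095    # L(D) = (D_c / D) ** alpha_D

print(kaplan_loss(n_params=1.5e9))             # predicted loss for a GPT-2-scale model
```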
Coresets for Data-efficient Training of Machine Learning Models
Introduces CRAIG (Coresets for Accelerating Incremental Gradient descent), which selects subsets that approximate the full gradient, yielding 2-3x training speedups while maintaining performance.
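A compact sketch of the selection step, assuming per-example gradient features are precomputed: greedily maximize a facility-location objective over gradient similarities, then weight each selected example by the size of the cluster it represents. This is simplified relative to the paper (no per-class selection, no guarantees).

```python
import numpy as np

def craig_select(grads, k):
    """Greedy facility-location sketch of CRAIG-style coreset selection.
    grads: (n, d) per-example gradient features. Returns indices of the
    k selected examples and a weight per selection (its cluster size)."""
    n = grads.shape[0]
    sim = grads @ grads.T
    sim -= sim.min()                            # shift to nonnegative similarities
    selected, best = [], np.zeros(n)
    for _ in range(k):
        # Marginal gain of adding each candidate to the covered similarity.
        gains = np.maximum(sim, best).sum(axis=1) - best.sum()
        gains[selected] = -np.inf
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[j])
    assign = sim[selected].argmax(axis=0)       # nearest selected example per point
    weights = np.bincount(assign, minlength=k)  # weight = size of represented cluster
    return selected, weights
```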
Estimating Training Data Influence by Tracing Gradient Descent
Introduces TracIn, which computes the influence of training examples by tracing how the test loss changes during training. Uses a first-order gradient approximation and saved checkpoints for scalability.
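A minimal PyTorch sketch of the checkpoint-based estimator (TracInCP): the influence of a training example on a test example is the sum, over saved checkpoints, of the learning rate times the dot product of their loss gradients. The model, loss, checkpoints, and data are assumed to be supplied by the caller.

```python
import torch

def tracin_score(checkpoints, lrs, model, loss_fn, z_train, z_test):
    """TracInCP sketch. checkpoints: list of state_dicts, lrs: learning
    rate in effect at each checkpoint, z_* = (x, y) pairs."""
    score = 0.0
    for state, lr in zip(checkpoints, lrs):
        model.load_state_dict(state)
        grads = []
        for x, y in (z_train, z_test):
            model.zero_grad()
            loss_fn(model(x), y).backward()
            grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
        score += lr * torch.dot(grads[0], grads[1]).item()
    return score
```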
In Pursuit of Interpretable, Fair and Accurate Machine Learning for Criminal Recidivism Prediction
Deep Double Descent: Where Bigger Models and More Data Hurt
Demonstrates that double descent occurs across model size, training epochs, and dataset size in modern deep networks. Introduces effective model complexity to unify these phenomena and shows regimes where more data hurts.
Common Voice: A Massively-Multilingual Speech Corpus
Reconciling modern machine-learning practice and the classical bias–variance trade-off
The Secret Sharer: Measuring Unintended Memorization in Neural Networks
Excavating AI: The Politics of Images in Machine Learning Training Sets
On the Accuracy of Influence Functions for Measuring Group Effects
Model Cards for Model Reporting
A Survey on Image Data Augmentation for Deep Learning
Comprehensive survey of image data augmentation techniques for deep learning, covering geometric transformations, color space transforms, kernel filters, mixing images, random erasing, and neural style transfer approaches.
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features
Combines cutting and mixing: patches from one image replace regions in another, with labels mixed proportionally. Improves over Cutout by using cut pixels constructively rather than zeroing them out.
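A small NumPy sketch of the batch operation, assuming channels-last image tensors and one-hot labels; the Beta(alpha, alpha) mixing coefficient follows the usual recipe, and the label weight is corrected for clipping at the border.

```python
import numpy as np

def cutmix(images, labels_onehot, alpha=1.0, seed=None):
    """CutMix sketch: paste a random rectangle from a shuffled copy of the
    batch into each image, and mix one-hot labels by patch area.
    images: (B, H, W, C), labels_onehot: (B, K)."""
    rng = np.random.default_rng(seed)
    B, H, W, _ = images.shape
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(B)
    cut_h, cut_w = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = rng.integers(H), rng.integers(W)
    y1, y2 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, H)
    x1, x2 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, W)
    mixed = images.copy()
    mixed[:, y1:y2, x1:x2, :] = images[perm, y1:y2, x1:x2, :]
    lam_adj = 1 - (y2 - y1) * (x2 - x1) / (H * W)   # actual kept area after clipping
    return mixed, lam_adj * labels_onehot + (1 - lam_adj) * labels_onehot[perm]
```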
The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards
Troubling Trends in Machine Learning Scholarship
Active Learning for Convolutional Neural Networks: A Core-Set Approach
Defines active learning as core-set selection, choosing points such that a model trained on the subset is competitive for remaining data. Provides theoretical bounds via k-Center problem.
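The greedy k-Center step at the heart of that formulation fits in a few lines; the sketch below assumes fixed feature embeddings, Euclidean distance, and at least one already-labeled point.

```python
import numpy as np

def k_center_greedy(features, labeled_idx, budget):
    """k-Center greedy sketch: pick `budget` points minimizing the maximum
    distance from any point to its nearest selected center.
    features: (n, d) embeddings, labeled_idx: nonempty list of labeled indices."""
    selected = list(labeled_idx)
    # Distance from every point to its nearest already-selected center.
    min_dist = np.linalg.norm(features - features[selected][:, None], axis=2).min(axis=0)
    for _ in range(budget):
        j = int(np.argmax(min_dist))            # farthest point from all centers
        selected.append(j)
        min_dist = np.minimum(min_dist, np.linalg.norm(features - features[j], axis=1))
    return selected[len(labeled_idx):]          # newly chosen points to label
```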
mixup: Beyond Empirical Risk Minimization
Introduces mixup, a data augmentation technique that trains on convex combinations of input pairs and their labels. Simple, data-independent, and model-agnostic approach that improves generalization and robustness.
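A minimal sketch, assuming one-hot labels and a Beta(alpha, alpha) mixing coefficient, as in the usual formulation.

```python
import numpy as np

def mixup(x, y_onehot, alpha=0.2, seed=None):
    """mixup sketch: train on convex combinations of random example pairs
    and of their one-hot labels. x: (B, ...), y_onehot: (B, K)."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y_onehot + (1 - lam) * y_onehot[perm]
```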
Deep Learning Scaling is Predictable, Empirically
Understanding Black-box Predictions via Influence Functions
Uses influence functions from robust statistics to trace model predictions back to training data, identifying training points most responsible for a given prediction.
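For a model small enough to form the Hessian explicitly, the estimator reduces to a few lines; larger models need the approximations discussed in the paper (e.g., iterative Hessian-vector products). Per-example gradients and the empirical-risk Hessian are assumed precomputed at the trained parameters.

```python
import numpy as np

def influence_scores(train_grads, hessian, test_grad):
    """Influence-function sketch: I(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z).
    train_grads: (n, p), hessian: (p, p), test_grad: (p,).
    Returns one influence score per training example."""
    h_inv_g = np.linalg.solve(hessian + 1e-6 * np.eye(len(hessian)), test_grad)
    return -train_grads @ h_inv_g
```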
Improved Regularization of Convolutional Neural Networks with Cutout
Introduces Cutout, a regularization technique that randomly masks square regions of input images during training. Inspired by dropout but applied to inputs, encouraging models to learn from partially visible objects.
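A minimal sketch for a single channels-last image; the masked square may be clipped at the border, and the mask size is an illustrative default.

```python
import numpy as np

def cutout(image, mask_size=16, seed=None):
    """Cutout sketch: zero out a random square patch of the input image.
    image: (H, W, C)."""
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    cy, cx = rng.integers(H), rng.integers(W)
    y1, y2 = max(cy - mask_size // 2, 0), min(cy + mask_size // 2, H)
    x1, x2 = max(cx - mask_size // 2, 0), min(cx + mask_size // 2, W)
    out = image.copy()
    out[y1:y2, x1:x2, :] = 0
    return out
```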
Poisoning Attacks against Support Vector Machines
Investigates poisoning attacks against SVMs where adversaries inject crafted training data to increase test error. Uses gradient ascent to construct malicious data points.
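A heavily simplified sketch of the attack loop using scikit-learn: the paper derives the gradient of validation error analytically through the SVM solution, whereas this version approximates it with finite differences by retraining on each perturbation, which is only practical for tiny problems.

```python
import numpy as np
from sklearn.svm import SVC

def poison_point_ascent(X_tr, y_tr, X_val, y_val, x_p, y_p,
                        step=0.1, iters=20, eps=1e-2):
    """Move one injected point x_p (with attacker-chosen label y_p) by
    gradient ascent on validation error, using numerical gradients."""
    def val_error(x):
        clf = SVC(kernel="linear").fit(np.vstack([X_tr, x]), np.append(y_tr, y_p))
        return 1.0 - clf.score(X_val, y_val)
    x_p = np.asarray(x_p, dtype=float)
    for _ in range(iters):
        grad = np.zeros_like(x_p)
        for d in range(len(x_p)):
            e = np.zeros_like(x_p); e[d] = eps
            grad[d] = (val_error(x_p + e) - val_error(x_p - e)) / (2 * eps)
        x_p = x_p + step * grad                 # ascend the validation error
    return x_p
```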
Curriculum Learning
Introduces curriculum learning: training models on examples of increasing difficulty. Shows this acts as a continuation method for non-convex optimization, improving both convergence speed and final generalization.
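A minimal sketch of the schedule, assuming the caller supplies a per-example difficulty score (e.g., sequence length or a weak model's loss); the staging scheme is illustrative.

```python
def curriculum_stages(dataset, difficulty, n_stages=4):
    """Curriculum learning sketch: rank examples by a difficulty score and
    release them to the trainer in progressively harder stages."""
    ranked = sorted(dataset, key=difficulty)
    stage_size = max(1, len(ranked) // n_stages)
    for stage in range(1, n_stages + 1):
        # Each stage trains on every example up to the current difficulty cap.
        yield ranked[: stage * stage_size]

# e.g.: for pool in curriculum_stages(sentences, difficulty=len): train_epoch(pool)
```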
Active Learning Literature Survey
Canonical survey of active learning covering uncertainty sampling, query-by-committee, expected error reduction, variance reduction, and density-weighted methods. Establishes foundational taxonomy for the field.
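One of the survey's simplest strategies, uncertainty sampling by predictive entropy, as a short sketch; probs is assumed to come from the current model over the unlabeled pool.

```python
import numpy as np

def uncertainty_sample(probs, k):
    """Uncertainty sampling sketch: query the k unlabeled examples whose
    predicted class distribution has the highest entropy.
    probs: (n, num_classes) model probabilities."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(-entropy)[:k]             # indices to send for labeling
```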