Data Leverage References


Tag: ml-methods (80 references)

LLM Social Simulations Are a Promising Research Method 2025 article

Jacy Reese Anthis, Ryan Liu, Sean M. Richardson, Austin C. Kozlowski, Bernard Koch, James Evans, Erik Brynjolfsson, Michael Bernstein

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning 2025 article

DeepSeek AI

Exploring the limits of strong membership inference attacks on large language models 2025 article

Jamie Hayes, Ilia Shumailov, Christopher A. Choquette-Choo, Matthew Jagielski, George Kaissis, Milad Nasr, Sahra Ghalebikesabi, Meenatchi Sundaram Mutu Selva Annamalai, Niloofar Mireshghallah, Igor Shilov, Matthieu Meeus, Yves-Alexandre de Montjoye, Katherine Lee, Franziska Boenisch, Adam Dziedzic, A. Feder Cooper

Extending "GPTs Are GPTs" to Firms 2025 article

Benjamin Labaschin, Tyna Eloundou, Sam Manning, Pamela Mishkin, Daniel Rock

OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens 2025 article

Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, et al.

The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm 2024 article

Aakanksha, Arash Ahmadian, Beyza Ermis, Seraphina Goldfarb-Tarrant, Julia Kreutzer, Marzieh Fadaee, Sara Hooker

To Code, or Not To Code? Exploring Impact of Code in Pre-training 2024 article

Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

The Rise of AI-Generated Content in Wikipedia 2024 article

Creston Brooks, Samuel Eggert, Denis Peskoff

Poisoning Web-Scale Training Datasets is Practical 2024 misc

Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr

What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions 2024 article

Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, Jeff Schneider, Eduard Hovy, Roger Grosse, Eric Xing

Large language models reduce public knowledge sharing on online Q&A platforms 2024 article

R. Maria Del Rio-Chanona, Nadzeya Laurentsyeva, Johannes Wachs

Direct Preference Optimization: Your Language Model is Secretly a Reward Model 2024 article

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

GPTs are GPTs: Labor Market Impact Potential of LLMs 2024 article

Tyna Eloundou, Sam Manning, Pamela Mishkin, Daniel Rock

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training 2024 misc

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez

StarCoder 2 and The Stack v2: The Next Generation 2024 misc

Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries

LLM Dataset Inference: Did you train on my dataset? 2024 article

Pratyush Maini, Hengrui Jia, Nicolas Papernot, Adam Dziedzic

Scalable Data Ablation Approximations for Language Models through Modular Training and Merging 2024 inproceedings

Clara Na, Ian Magnusson, Ananya Harsh Jha, Tom Sherborne, Emma Strubell, Jesse Dodge, Pradeep Dasigi

A Canary in the AI Coal Mine: American Jews May Be Disproportionately Harmed by Intellectual Property Dispossession in Large Language Model Training 2024 article

Heila Precel, Allison McDonald, Brent Hecht, Nicholas Vincent

Data Flywheels for LLM Applications 2024 misc

Shreya Shankar

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research 2024 article

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo

Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model 2024 article

Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D'souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, Sara Hooker

Alpaca: A Strong, Replicable Instruction-Following Model 2023 misc

LEACE: Perfect linear concept erasure in closed form 2023 article

Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman

Quantifying Memorization Across Neural Language Models 2023 inproceedings

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, Chiyuan Zhang

Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4 2023 article

Kent K. Chang, Mackenzie Cramer, Sandeep Soni, David Bamman

Algorithmic Collective Action in Machine Learning 2023 inproceedings

Moritz Hardt, Eric Mazumdar, Celestine Mendler-Dünner, Tijana Zrnic

Provides theoretical framework for algorithmic collective action, showing that small collectives can exert significant control over platform learning algorithms through coordinated data strategies.
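A toy simulation of the signal-planting strategy the paper analyzes (illustrative NumPy setup; the trigger rate, label count, and collective size are invented for the example): organic data carries the trigger rarely with random labels, while a small collective always pairs the trigger with its target class, so the platform's majority vote for the trigger flips to the collective's choice.

```python
import numpy as np

def trigger_label(n_organic, nat_rate, n_collective, target=7, seed=0):
    # majority label the platform's model would associate with the trigger:
    # organic data carries it rarely with random labels; the collective
    # always pairs the trigger with its target class
    rng = np.random.default_rng(seed)
    n_triggered = int((rng.random(n_organic) < nat_rate).sum())
    counts = np.bincount(rng.integers(0, 10, size=n_triggered), minlength=10)
    counts[target] += n_collective
    return int(counts.argmax())
```

Even a collective of 2% relative size (2,000 of 100,000 points) dominates the rare trigger and controls its label.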

A Watermark for Large Language Models 2023 article

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein

Textbooks Are All You Need II: phi-1.5 technical report 2023 article

Yuanzhi Li, Sebastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment 2023 article

Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, Hang Li

SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore 2023 article

Sewon Min, Suchin Gururangan, Eric Wallace, Weijia Shi, Hannaneh Hajishirzi, Noah A. Smith, Luke Zettlemoyer

OWASP Top 10 for Large Language Model Applications 2023 misc

TRAK: Attributing Model Behavior at Scale 2023 inproceedings

Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, Aleksander Madry

Introduces TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale models by leveraging random projections.
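A simplified sketch of the random-projection idea (NumPy; this omits TRAK's kernel-inverse correction and checkpoint ensembling, and the function name is ours): per-example gradients are projected to a low dimension, and attribution is scored by similarity in the projected space.

```python
import numpy as np

def trak_scores(train_grads, test_grad, proj_dim=512, seed=0):
    # project per-example gradients to a low dimension, then score
    # train/test similarity in the projected space
    rng = np.random.default_rng(seed)
    d = train_grads.shape[1]
    proj = rng.normal(size=(d, proj_dim)) / np.sqrt(proj_dim)
    return (train_grads @ proj) @ (test_grad @ proj)
```

Random projection approximately preserves inner products, which is what makes the method tractable at scale.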

DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning 2022 article

Chengcheng Guo, Bo Zhao, Yanbing Bai

Comprehensive library and empirical study of coreset selection methods for deep learning, finding that random selection remains a strong baseline across many settings.

Training Data Influence Analysis and Estimation: A Survey 2022 article

Zayd Hammoudeh, Daniel Lowd

Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning 2022 inproceedings

Yongchan Kwon, James Zou

Generalizes Data Shapley using Beta weighting functions, providing noise-reduced data valuation that better handles outliers and mislabeled data detection.
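For context, a minimal Monte Carlo estimator of plain Data Shapley values, which Beta Shapley generalizes by reweighting the marginal contributions by subset size (illustrative NumPy sketch; `utility` is any set-function the practitioner supplies):

```python
import numpy as np

def mc_shapley(utility, n, n_perm=200, rng=None):
    # average marginal contribution of each point over random permutations
    rng = rng or np.random.default_rng(0)
    vals = np.zeros(n)
    for _ in range(n_perm):
        subset = set()
        prev = utility(frozenset(subset))
        for i in rng.permutation(n):
            subset.add(int(i))
            cur = utility(frozenset(subset))
            vals[i] += cur - prev
            prev = cur
    return vals / n_perm
```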

Training language models to follow instructions with human feedback 2022 article

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe

Probabilistic Machine Learning: An introduction 2022 book

Kevin P. Murphy

Why Black Box Machine Learning Should Be Avoided for High-Stakes Decisions, in Brief 2022 article

Cynthia Rudin

Beyond neural scaling laws: beating power law scaling via data pruning 2022 article

Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, Ari Morcos

Introducing Whisper 2022 misc

Robust Speech Recognition via Large-Scale Weak Supervision 2022 article

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever

Training Compute-Optimal Large Language Models 2022 inproceedings

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre

Shows that current LLMs are significantly undertrained. For compute-optimal training, model size and training tokens should scale equally. Introduces Chinchilla (70B params, 1.4T tokens) which outperforms larger models like Gopher (280B) trained on less data.
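The compute-optimal rule can be sketched as follows, using the common C ≈ 6·N·D FLOPs approximation and the roughly 20-tokens-per-parameter ratio implied by Chinchilla's 70B/1.4T configuration (helper name is ours; treat the ratio as approximate):

```python
def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    # C ≈ 6·N·D together with D ≈ 20·N gives N = sqrt(C / (6·20))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params
```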

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 2021 inproceedings

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell

Extracting Training Data from Large Language Models 2021 inproceedings

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Nicolas Papernot

Measuring Mathematical Problem Solving With the MATH Dataset 2021 article

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt

The Pile: An 800GB Dataset of Diverse Text for Language Modeling 2021 article

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy

Ethical and Social Risks of Harm from Language Models 2021 article

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, Iason Gabriel

Language Models are Few-Shot Learners 2020 article

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning 2020 inproceedings

Scaling Laws for Neural Language Models 2020 article

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei

Establishes power-law scaling relationships between language model performance and model size, dataset size, and compute, spanning seven orders of magnitude.
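For example, the fitted parameter-count law can be evaluated directly (constants are the paper's reported fits for loss as a function of non-embedding parameters; treat them as approximate):

```python
def kaplan_loss(n_params, alpha_n=0.076, n_c=8.8e13):
    # L(N) = (N_c / N)^alpha_N, the paper's power-law fit in model size
    return (n_c / n_params) ** alpha_n
```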

Coresets for Data-efficient Training of Machine Learning Models 2020 inproceedings

Baharan Mirzasoleiman, Jeff Bilmes, Jure Leskovec

Introduces CRAIG (Coresets for Accelerating Incremental Gradient descent), selecting subsets that approximate full gradient for 2-3x training speedups while maintaining performance.
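A stripped-down sketch of the greedy facility-location selection over gradient similarities (NumPy; real CRAIG uses per-class upper bounds on gradient differences and weights the chosen subset, and the function name is ours):

```python
import numpy as np

def craig_subset(grads, k):
    # greedily pick k examples whose gradients best "cover" the gradients
    # of the full set (facility-location objective over similarities)
    sim = grads @ grads.T
    chosen, best = [], np.full(len(grads), -np.inf)
    for _ in range(k):
        gains = np.maximum(sim, best).sum(axis=1)  # coverage if candidate added
        gains[chosen] = -np.inf                    # don't re-pick
        nxt = int(np.argmax(gains))
        chosen.append(nxt)
        best = np.maximum(best, sim[nxt])
    return chosen
```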

Estimating Training Data Influence by Tracing Gradient Descent 2020 inproceedings

Garima Pruthi, Frederick Liu, Mukund Sundararajan, Satyen Kale

Introduces TracIn, which computes influence of training examples by tracing how test loss changes during training. Uses first-order gradient approximation and saved checkpoints for scalability.
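The checkpoint-based estimator reduces to a few lines once per-example gradients are available (illustrative NumPy sketch; computing the gradients at each saved checkpoint is assumed done elsewhere):

```python
import numpy as np

def tracin_influence(lrs, train_grads, test_grads):
    # TracIn(CP): sum over saved checkpoints of
    #   lr_t * <grad_loss(z_train; w_t), grad_loss(z_test; w_t)>
    return sum(lr * float(g_tr @ g_te)
               for lr, g_tr, g_te in zip(lrs, train_grads, test_grads))
```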

In Pursuit of Interpretable, Fair and Accurate Machine Learning for Criminal Recidivism Prediction 2020 article

Caroline Wang, Bin Han, Bhrij Patel, Cynthia Rudin

Deep Double Descent: Where Bigger Models and More Data Hurt 2020 inproceedings

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever

Demonstrates that double descent occurs across model size, training epochs, and dataset size in modern deep networks. Introduces effective model complexity to unify these phenomena and shows regimes where more data hurts.

Common voice: A massively-multilingual speech corpus 2019 article

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, Gregor Weber

Reconciling modern machine-learning practice and the classical bias–variance trade-off 2019 article

Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal

The Secret Sharer: Measuring Unintended Memorization in Neural Networks 2019 inproceedings

Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, Dawn Song

Excavating AI: The Politics of Images in Machine Learning Training Sets 2019 misc

On the Accuracy of Influence Functions for Measuring Group Effects 2019 inproceedings

Pang Wei Koh, Kai-Siang Ang, Hubert H. K. Teo, Percy Liang

Model Cards for Model Reporting 2019 inproceedings

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, Timnit Gebru

A Survey on Image Data Augmentation for Deep Learning 2019 article

Connor Shorten, Taghi M. Khoshgoftaar

Comprehensive survey of image data augmentation techniques for deep learning, covering geometric transformations, color space transforms, kernel filters, mixing images, random erasing, and neural style transfer approaches.

CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features 2019 inproceedings

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, Youngjoon Yoo

Combines cutting and mixing: patches from one image replace regions in another, with labels mixed proportionally. Improves over Cutout by using cut pixels constructively rather than zeroing them out.
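A minimal sketch of the operation (NumPy; the paper samples the patch size via a Beta-distributed mixing ratio, while this fixed-size variant is simplified for illustration):

```python
import numpy as np

def cutmix(img1, lab1, img2, lab2, size, rng):
    # paste a patch of img2 into img1; mix labels by pixel area
    h, w = img1.shape[:2]
    cy, cx = int(rng.integers(h)), int(rng.integers(w))
    top, bot = max(0, cy - size // 2), min(h, cy + size // 2)
    left, right = max(0, cx - size // 2), min(w, cx + size // 2)
    out = img1.copy()
    out[top:bot, left:right] = img2[top:bot, left:right]
    lam = 1.0 - (bot - top) * (right - left) / (h * w)
    return out, lam * lab1 + (1 - lam) * lab2
```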

The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards 2018 misc

Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, Kasia Chmielinski

Troubling Trends in Machine Learning Scholarship 2018 article

Zachary C. Lipton, Jacob Steinhardt

Active Learning for Convolutional Neural Networks: A Core-Set Approach 2018 inproceedings

Ozan Sener, Silvio Savarese

Defines active learning as core-set selection, choosing points such that a model trained on the subset is competitive for remaining data. Provides theoretical bounds via k-Center problem.
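The greedy 2-approximation commonly used for the k-Center formulation can be sketched as follows (NumPy; first center fixed here for determinism, whereas in practice it is often chosen at random):

```python
import numpy as np

def k_center_greedy(X, k, first=0):
    # repeatedly add the point farthest from the current centers
    centers = [first]
    dists = np.linalg.norm(X - X[first], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))
        centers.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return centers
```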

mixup: Beyond Empirical Risk Minimization 2018 inproceedings

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz

Introduces mixup, a data augmentation technique that trains on convex combinations of input pairs and their labels. Simple, data-independent, and model-agnostic approach that improves generalization and robustness.
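The whole technique fits in a few lines (NumPy sketch; labels assumed one-hot vectors or scalars):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    # sample lambda ~ Beta(alpha, alpha), mix inputs and labels convexly
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```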

Deep learning scaling is predictable, empirically 2017 article

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, Yanqi Zhou

Understanding Black-box Predictions via Influence Functions 2017 inproceedings

Pang Wei Koh, Percy Liang

Uses influence functions from robust statistics to trace model predictions back to training data, identifying training points most responsible for a given prediction.
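The core quantity, with a damping term added for numerical stability (illustrative NumPy sketch; for real models the paper approximates Hessian-inverse-vector products rather than forming an explicit Hessian):

```python
import numpy as np

def influence(test_grad, train_grad, hessian, damping=1e-3):
    # I(z, z_test) = -grad L(z_test)^T  H^{-1}  grad L(z)
    h = hessian + damping * np.eye(hessian.shape[0])
    return -float(test_grad @ np.linalg.solve(h, train_grad))
```

A large negative value means upweighting the training point lowers the test loss, i.e. the point is helpful for that prediction.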

Improved Regularization of Convolutional Neural Networks with Cutout 2017 article

Terrance DeVries, Graham W. Taylor

Introduces Cutout, a regularization technique that randomly masks square regions of input images during training. Inspired by dropout but applied to inputs, encouraging models to learn from partially visible objects.
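A minimal sketch (NumPy; the masked region is clipped at image borders, as in the paper):

```python
import numpy as np

def cutout(image, size, rng):
    # zero a size x size square at a random center (clipped at borders)
    h, w = image.shape[:2]
    cy, cx = int(rng.integers(h)), int(rng.integers(w))
    out = image.copy()
    out[max(0, cy - size // 2):cy + size // 2,
        max(0, cx - size // 2):cx + size // 2] = 0
    return out
```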

Poisoning Attacks against Support Vector Machines 2012 inproceedings

Battista Biggio, Blaine Nelson, Pavel Laskov

Investigates poisoning attacks against SVMs where adversaries inject crafted training data to increase test error. Uses gradient ascent to construct malicious data points.
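A toy label-flip demonstration of the threat model, substituting a nearest-centroid classifier for the SVM and skipping the gradient-ascent optimization (all numbers invented for the example): mislabeled points placed far from their claimed class drag that class's centroid and raise test error.

```python
import numpy as np

def centroid_error(train_x, train_y, test_x, test_y):
    # nearest-centroid classifier error rate (stand-in for an SVM)
    c0 = train_x[train_y == 0].mean(axis=0)
    c1 = train_x[train_y == 1].mean(axis=0)
    pred = (np.linalg.norm(test_x - c1, axis=1)
            < np.linalg.norm(test_x - c0, axis=1)).astype(int)
    return float((pred != test_y).mean())

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
clean_err = centroid_error(x, y, x, y)

# attacker injects 50 points deep in class 0's region but labeled class 1,
# dragging the class-1 centroid toward the class-0 cluster
x_pois = np.concatenate([x, np.full((50, 2), -12.0)])
y_pois = np.concatenate([y, np.ones(50, dtype=int)])
pois_err = centroid_error(x_pois, y_pois, x, y)
```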

Curriculum Learning 2009 inproceedings

Yoshua Bengio, Jerome Louradour, Ronan Collobert, Jason Weston

Introduces curriculum learning: training models on examples of increasing difficulty. Shows this acts as a continuation method for non-convex optimization, improving both convergence speed and final generalization.
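The easy-to-hard schedule can be sketched as a staged reveal of the difficulty-sorted training set (illustrative; real curricula define a task-specific difficulty measure):

```python
def curriculum_stages(examples, difficulty, n_stages=3):
    # reveal progressively larger easy-to-hard prefixes of the training set
    ordered = sorted(examples, key=difficulty)
    return [ordered[: len(ordered) * s // n_stages]
            for s in range(1, n_stages + 1)]
```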

Active Learning Literature Survey 2009 techreport

Burr Settles

Canonical survey of active learning covering uncertainty sampling, query-by-committee, expected error reduction, variance reduction, and density-weighted methods. Establishes foundational taxonomy for the field.
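The simplest strategy the survey covers, least-confidence uncertainty sampling, fits in a few lines (NumPy; function name is ours):

```python
import numpy as np

def least_confidence_query(probs, k):
    # query the k unlabeled points whose top predicted class is least certain
    return np.argsort(probs.max(axis=1))[:k].tolist()
```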

No free lunch theorems for optimization 1997 article

David H. Wolpert, William G. Macready

Stanford Alpaca GitHub Repository misc

Alpaca Data Cleaned Repository misc

Databricks Dolly Repository misc

GSM8K Hugging Face Dataset Card misc

Grade-School Math (GSM8K) Repository misc

HH-RLHF Dataset misc

Competition Math Dataset on Hugging Face misc