Data Leverage References

LLM Social Simulations Are a Promising Research Method 2025 article

Jacy Reese Anthis, Ryan Liu, Sean M. Richardson, Austin C. Kozlowski, Bernard Koch, James Evans, Erik Brynjolfsson, Michael Bernstein

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning 2025 article

DeepSeek AI

Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation 2025 article

Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, David Fernandez-Llorca

Exploring the limits of strong membership inference attacks on large language models 2025 article

Jamie Hayes, Ilia Shumailov, Christopher A. Choquette-Choo, Matthew Jagielski, George Kaissis, Milad Nasr, Sahra Ghalebikesabi, Meenatchi Sundaram Mutu Selva Annamalai, Niloofar Mireshghallah, Igor Shilov, Matthieu Meeus, Yves-Alexandre de Montjoye, Katherine Lee, Franziska Boenisch, Adam Dziedzic, A. Feder Cooper

Trust and Friction: Negotiating How Information Flows Through Decentralized Social Media 2025 article

Sohyeon Hwang, Priyanka Nanayakkara, Yan Shvartzshnaider

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development 2025 article

Jan Kulveit, Raymond Douglas, Nora Ammann, Deger Turan, David Krueger, David Duvenaud

Extending "GPTs Are GPTs" to Firms 2025 article

Benjamin Labaschin, Tyna Eloundou, Sam Manning, Pamela Mishkin, Daniel Rock

OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens 2025 article

Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, et al.

Cybernetics 2025 misc

Cybernetics is the transdisciplinary study of circular causal processes such as feedback and recursion, where the effects of a system's actions (its outputs) return as inputs to that system, influencing subsequent action. It is concerned with general principles that are relevant across multiple contexts, including in engineering, ecological, economic, biological, cognitive and social systems and also in practical activities such as designing, learning, and managing. Cybernetics' transdisciplinary character has meant that it intersects with a number of other fields, leading to it having both wide influence and diverse interpretations. The field is named after an example of circular causal feedback—that of steering a ship (the ancient Greek κυβερνήτης (kybernḗtēs) refers to the person who steers a ship). In steering a ship, the position of the rudder is adjusted in continual response to the effect it is observed as having, forming a feedback loop through which a steady course can be maintained in a changing environment, responding to disturbances from cross winds and tide. Cybernetics has its origins in exchanges between numerous disciplines during the 1940s. Initial developments were consolidated through meetings such as the Macy Conferences and the Ratio Club. Early focuses included purposeful behaviour, neural networks, heterarchy, information theory, and self-organising systems. As cybernetics developed, it became broader in scope to include work in design, family therapy, management and organisation, pedagogy, sociology, the creative arts and the counterculture.

The Leaderboard Illusion 2025 article

Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker

Welcome to the Era of Experience 2025 misc

David Silver, Richard S. Sutton

If open source is to win, it must go public 2025 article

Joshua Tan, Nicholas Vincent, Katherine Elkins, Magnus Sahlgren

Canada as a Champion for Public AI: Data, Compute and Open Source Infrastructure for Economic Growth and Inclusive Innovation 2025 article

Nicholas Vincent, Mark Surman, Jake Hirsch-Allen

Shapley value-based data valuation for machine learning data markets 2025 article

Proposes G-Value to bridge the gap between leave-one-out (LOO) and Shapley value approaches for data valuation. Addresses practical applications in machine learning data markets.

Rethinking machine unlearning for large language models 2025 article

Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu

Comprehensive review of machine unlearning in LLMs, aiming to eliminate undesirable data influence (sensitive or illegal information) while maintaining essential knowledge generation. Envisions LLM unlearning as a pivotal element in life-cycle management for developing safe, secure, trustworthy, and resource-efficient generative AI.

Distributional Training Data Attribution: What do Influence Functions Sample? 2025 article

Bruno Mlodozeniec, Isaac Reid, Sam Power, David Krueger, Murat Erdogdu, Richard E. Turner, Roger Grosse

Introduces distributional training data attribution (d-TDA), which predicts how the distribution of model outputs depends upon the dataset. Shows that influence functions are "secretly distributional"—they emerge from this framework as the limit to unrolled differentiation without requiring restrictive convexity assumptions.

Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations 2025 techreport

NIST

Official NIST taxonomy and terminology for adversarial machine learning. Covers data poisoning attacks applicable to all learning paradigms, model poisoning attacks in federated learning, and supply-chain attacks. Provides guidance for defense strategies.

The Economics of AI Training Data: A Research Agenda 2025 article

Hamidah Oderinwale, Anna Kazlauskas

Research agenda documenting AI training data deals from 2020 to 2025. Reveals persistent market fragmentation, five distinct pricing mechanisms (from per-unit licensing to commissioning), and that most deals exclude original creators from compensation. Found only 7 of 24 major deals compensate original creators.

Data-centric Artificial Intelligence: A Survey 2025 article

Daochen Zha, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, Xia Hu

Comprehensive survey on data-centric AI, providing a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and representative methods. Covers the paradigm shift from model refinement to prioritizing data quality.

Revisiting Data Attribution for Influence Functions 2025 article

Hongbo Zhu, Angelo Cangelosi

Comprehensive review of influence functions for data attribution, examining how individual training examples influence model predictions. Covers techniques for model debugging, data curation, bias detection, and identification of mislabeled or adversarial data points.

Membership inference attacks against large language models 2025 article

Jamie Hayes

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development 2025 article

Jan Kulveit, Raymond Douglas, Nora Ammann, Deger Turan, David Krueger, David Duvenaud

AI as Normal Technology 2025 article

Arvind Narayanan, Sayash Kapoor

The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm 2024 article

Aakanksha, Arash Ahmadian, Beyza Ermis, Seraphina Goldfarb-Tarrant, Julia Kreutzer, Marzieh Fadaee, Sara Hooker

Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions using AI Detection Tool 2024 article

Arslan Akram

Machines of Loving Grace: How AI Could Transform the World for the Better 2024 misc

Dario Amodei

The Illusion of Artificial Inclusion 2024 inproceedings

William Agnew, A. Stevie Bergman, Jennifer Chien, Mark Diaz, Seliem El-Sayed, Jaylen Pittman, Shakir Mohamed, Kevin R. McKee

To code, or not to code? exploring impact of code in pre-training 2024 article

Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

The Rise of AI-Generated Content in Wikipedia 2024 article

Creston Brooks, Samuel Eggert, Denis Peskoff

Poisoning Web-Scale Training Datasets is Practical 2024 misc

Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr

What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions 2024 article

Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, Jeff Schneider, Eduard Hovy, Roger Grosse, Eric Xing

Large language models reduce public knowledge sharing on online Q&A platforms 2024 article

R. Maria del Rio-Chanona, Nadzeya Laurentsyeva, Johannes Wachs

Direct Preference Optimization: Your Language Model is Secretly a Reward Model 2024 article

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

GPTs are GPTs: Labor Market Impact Potential of LLMs 2024 article

Tyna Eloundou, Sam Manning, Pamela Mishkin, Daniel Rock

Artificial Intelligence Act 2024 misc

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training 2024 misc

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez

Public AI: Infrastructure for the Common Good 2024 techreport

Brandon Jackson, B Cavello, Flynn Devine, Nick Garcia, Samuel J. Klein, Alex Krasodomski, Joshua Tan, Eleanor Tursman

ANSI/NISO Z39.96-2024, JATS: Journal Article Tag Suite 2024 techreport

Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing 2024 article

Isaac Johnson, Lucie-Aimée Kaffee, Miriam Redi

Data Flywheel Go Brrr: Using Your Users to Build Better Products 2024 misc

Jason Liu

Explore how data flywheels leverage user feedback to enhance product development and achieve business success with AI.

Consent in Crisis: The Rapid Decline of the AI Data Commons 2024 article

Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Shamiso Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad A. Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kushagra Tiwary, Lester James V. Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Minh Chien Vu, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Alex Pentland

StarCoder 2 and The Stack v2: The Next Generation 2024 misc

Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries

LLM Dataset Inference: Did you train on my dataset? 2024 article

Pratyush Maini, Hengrui Jia, Nicolas Papernot, Adam Dziedzic

Public AI: Making AI Work for Everyone, by Everyone 2024 misc

Nik Marda, Jasmine Sun, Mark Surman

Scalable Data Ablation Approximations for Language Models through Modular Training and Merging 2024 inproceedings

Clara Na, Ian Magnusson, Ananya Harsh Jha, Tom Sherborne, Emma Strubell, Jesse Dodge, Pradeep Dasigi

Generative AI Profile (Draft/2024) 2024 techreport

A Canary in the AI Coal Mine: American Jews May Be Disproportionately Harmed by Intellectual Property Dispossession in Large Language Model Training 2024 article

Heila Precel, Allison McDonald, Brent Hecht, Nicholas Vincent

What is a Data Flywheel? A Guide to Sustainable Business Growth 2024 misc

Adam Roche, Yali Sassoon

Data Flywheels for LLM Applications 2024 misc

Shreya Shankar

The data addition dilemma 2024 article

Judy Hanwen Shen, Inioluwa Deborah Raji, Irene Y. Chen

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research 2024 article

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo

Copyright and Artificial Intelligence: Policy Studies and Guidance 2024 misc

Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model 2024 article

Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D'souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, Sara Hooker

Push and Pull: A Framework for Measuring Attentional Agency 2024 article

Zachary Wojtowicz, Shrey Jain, Nicholas Vincent

A Systematic Review of NeurIPS Dataset Management Practices 2024 article

Yiwei Wu, Leah Ajmani, Shayne Longpre, Hanlin Li

Machine Unlearning: A Survey 2024 article

Heng Xu, Tianqing Zhu, Lefeng Zhang, Wanlei Zhou, Philip S. Yu

Comprehensive survey of machine unlearning covering definitions, scenarios, verification methods, and applications. Cited in the International AI Safety Report 2025 as a pioneering paradigm for removing sensitive information.

CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning 2024 article

Huaiguang Cai

Proposes CHG (compound of Hardness and Gradient) utility function to approximate the utility of each data subset, reducing computational complexity to a single model retraining—achieving a quadratic improvement over existing Data Shapley methods.

A Versatile Influence Function for Data Attribution with Non-Decomposable Loss 2024 article

Junwei Deng, Weijing Tang, Jiaqi W. Ma

Proposes Versatile Influence Function (VIF) designed to fully leverage auto-differentiation, eliminating case-specific derivations. Demonstrated across Cox regression for survival analysis, node embedding for network analysis, and listwise learning-to-rank, with estimates closely resembling leave-one-out retraining while being up to 10^3 times faster.

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data 2024 article

Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo

Studies whether model collapse is inevitable. Found that collapse occurs when replacing real data with synthetic data each generation. However, when accumulating synthetic data alongside original real data, models stay stable across sizes and modalities. Suggests data accumulation rather than replacement as a solution.

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI 2024 article

Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker

Large-scale audit of over 1,800 text AI datasets analyzing trends, permissions of use and global representation. Found frequent miscategorization of licences on dataset hosting sites, with licence omission rates of more than 70% and error rates of more than 50%. Released the Data Provenance Explorer tool for practitioners.

Influence Functions for Scalable Data Attribution in Diffusion Models 2024 article

Bruno Mlodozeniec, Runa Eschenhagen, Juhan Bae, Alexander Immer, David Krueger, Richard Turner

Develops influence function frameworks for diffusion models to address data attribution and interpretability challenges. Predicts how model output would change if training data were removed, showing how previously proposed methods can be interpreted as particular design choices in this framework.

AI models collapse when trained on recursively generated data 2024 article

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, Yarin Gal

Landmark study showing that indiscriminate use of model-generated content in training causes irreversible defects in resulting models, where tails of original content distribution disappear. Model collapse is a degenerative learning process where models forget improbable events over time. Demonstrates this across LLMs, VAEs, and GMMs.

LLM Unlearning via Loss Adjustment with Only Forget Data 2024 inproceedings

Yaxuan Wang, Jiaheng Wei, Chris Yuhao Liu, Jinlong Pang, Quan Liu, Ankit Parag Shah, Yujia Bao, Yang Liu, Wei Wei

FLAT is a loss adjustment approach which maximizes f-divergence between the available template answer and the forget answer with respect to the forget data. Demonstrates superior unlearning performance compared to existing methods while minimizing impact on retained capabilities, tested on Harry Potter dataset and MUSE Benchmark.

Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration 2024 inproceedings

Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng

Enhances training data attribution methods for large language models including LLaMA2, QWEN2, and Mistral by considering fitting error in the attribution process.

Position Paper: Data-Centric AI in the Age of Large Language Models 2024 inproceedings

Xinyi Xu, Zhaoxuan Wu, Rui Qiao, Arun Verma, Yao Shu, Jingtan Wang, Xinyuan Niu, Zhenfeng He, Jiangwei Chen, Zijian Zhou, Gregory Kang Ruey Lau, Hieu Dao, Lucas Agussurja, Rachael Hwee Ling Sim, Xiaoqiang Lin, Wenyang Hu, Zhongxiang Dai, Pang Wei Koh, Bryan Kian Hsiang Low

Position paper identifying four specific scenarios centered around data for LLMs, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.

The simple macroeconomics of AI 2024 article

Daron Acemoglu

Self-consuming generative models go MAD 2024 article

Sina Alemohammad

The labor market impacts of technological change: From unbridled enthusiasm to qualified optimism to vast uncertainty 2024 article

David H Autor

The Foundation Model Transparency Index v1.1 2024 article

Rishi Bommasani

The consequences of generative AI for online knowledge communities 2024 article

Gordon Burtch

Large language models reduce public knowledge sharing on online Q&A platforms 2024 article

R Maria del Rio-Chanona, Nadzeya Laurentsyeva, Johannes Wachs

The impact of generative AI on Wikipedia traffic 2024 article

Jun Gao

The short-term effects of generative artificial intelligence on employment: Evidence from an online labor market 2024 misc

Xiang Hui, Oren Reshef, Luofeng Zhou

Is Stack Overflow Obsolete? An Empirical Study of the Characteristics of ChatGPT Answers to Stack Overflow Questions 2024 inproceedings

Samia Kabir

Looking Beyond the Top-1: Transformers Determine Top Tokens in Order 2024 inproceedings

Daria Lioubashevski

Model collapse from recursive training on generated data 2024 article

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson

Benchmarking Benchmark Leakage in Large Language Models 2024 misc

Ruijie Xu, Zengzhi Wang, Run-Ze Fan, Pengfei Liu

Alpaca: A Strong, Replicable Instruction-Following Model 2023 misc

LEACE: Perfect linear concept erasure in closed form 2023 article

Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman

Quantifying Memorization Across Neural Language Models 2023 inproceedings

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, Chiyuan Zhang

Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia 2023 article

Raul Castro Fernandez

Understanding CC Licenses and Generative AI 2023 misc

Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4 2023 article

Kent K. Chang, Mackenzie Cramer, Sandeep Soni, David Bamman

Wikipedia's value in the age of generative AI 2023 misc

Selena Deckelmann

If there was a generative artificial intelligence system that could, on its own, write all the information contained in Wikipedia, would it be the same as Wikipedia today?

Algorithmic Collective Action in Machine Learning 2023 inproceedings

Moritz Hardt, Eric Mazumdar, Celestine Mendler-Dünner, Tijana Zrnic

Provides theoretical framework for algorithmic collective action, showing that small collectives can exert significant control over platform learning algorithms through coordinated data strategies.

ISO/IEC 23894:2023 Information Technology—Artificial Intelligence—Risk Management 2023 standard

Power and Progress: Our Thousand-Year Struggle Over Technology and Prosperity 2023 book

Simon Johnson, Daron Acemoglu

A Watermark for Large Language Models 2023 article

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein

The Dimensions of Data Labor: A Road Map for Researchers, Activists, and Policymakers to Empower Data Producers 2023 article

Hanlin Li, Nicholas Vincent, Stevie Chancellor, Brent Hecht

Textbooks Are All You Need II: phi-1.5 technical report 2023 article

Yuanzhi Li, Sebastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment 2023 article

Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, Hang Li

SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore 2023 article

Sewon Min, Suchin Gururangan, Eric Wallace, Weijia Shi, Hannaneh Hajishirzi, Noah A. Smith, Luke Zettlemoyer

Artificial Intelligence Risk Management Framework (AI RMF 1.0) 2023 techreport

OWASP Top 10 for Large Language Model Applications 2023 misc

TRAK: Attributing Model Behavior at Scale 2023 inproceedings

Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, Aleksander Madry

Introduces TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale models by leveraging random projections.

Terms-we-serve-with: Five dimensions for anticipating and repairing algorithmic harm 2023 article

Bogdana Rakova, Renee Shelby, Megan Ma

Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy for Harm Reduction 2023 inproceedings

Renee Shelby, Shalaleh Rismani, Kathryn Henne, AJung Moon, Negar Rostamzadeh, Paul Nicholas, N'Mah Yilla-Akbari, Jess Gallegos, Andrew Smart, Emilio Garcia, Gurleen Virk

Understanding the landscape of potential harms from algorithmic systems enables practitioners to better anticipate consequences of the systems they build. It also supports the prospect of incorporating controls to help minimize harms that emerge from the interplay of technologies and social and cultural dynamics. A growing body of scholarship has identified a wide range of harms across different algorithmic technologies. However, computing research and practitioners lack a high level and synthesized overview of harms from algorithmic systems. Based on a scoping review of computing research (n=172), we present an applied taxonomy of sociotechnical harms to support a more systematic surfacing of potential harms in algorithmic systems. The final taxonomy builds on and refers to existing taxonomies, classifications, and terminologies. Five major themes related to sociotechnical harms — representational, allocative, quality-of-service, interpersonal harms, and social system/societal harms — and sub-themes are presented along with a description of these categories. We conclude with a discussion of challenges and opportunities for future research.

An Alternative to Regulation: The Case for Public AI 2023 article

Nicholas Vincent, David Bau, Sarah Schwettmann, Joshua Tan

The Dimensions of Data Labor: A Road Map for Researchers, Activists, and Policymakers to Empower Data Producers 2023 inproceedings

Hanlin Li, Nicholas Vincent, Stevie Chancellor, Brent Hecht

State of AI Report 2023 2023 article

Nathan Benaich, Ian Hogarth

The Foundation Model Transparency Index 2023 article

Rishi Bommasani

Quantifying memorization across neural language models 2023 inproceedings

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, Chiyuan Zhang

Open problems and fundamental limitations of reinforcement learning from human feedback 2023 article

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire

Mata v. Avianca, Inc., No. 1:22-cv-01461 (S.D.N.Y. June 22, 2023), Opinion and Order on Sanctions 2023 misc

P. Kevin Castel

GPTs are GPTs: An early look at the labor market impact potential of large language models 2023 article

Tyna Eloundou, Sam Manning, Pamela Mishkin, Daniel Rock

Occupational heterogeneity in exposure to generative AI 2023 article

Edward Felten, Manav Raj, Robert Seamans

Scaling laws for reward model overoptimization 2023 article

Leo Gao, John Schulman, Jacob Hilton

A watermark for large language models 2023 inproceedings

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein

The Data Provenance Initiative: A large scale audit of dataset licensing & attribution in AI 2023 article

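Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker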

SILO language models: Isolating legal risk in a nonparametric datastore 2023 article

Sewon Min, Suchin Gururangan, Eric Wallace, Hannaneh Hajishirzi, Noah A Smith, Luke Zettlemoyer

Experimental evidence on the productivity effects of generative artificial intelligence 2023 article

Shakked Noy, Whitney Zhang

Proving Test Set Contamination in Black Box Language Models 2023 misc

Yonatan Oren, Nicole Meister, Niladri Chatterji, Faisal Ladhak, Tatsunori B. Hashimoto

The Eye of the Master: A Social History of Artificial Intelligence 2023 book

Matteo Pasquinelli

The impact of AI on developer productivity: Evidence from GitHub Copilot 2023 article

Sida Peng, Eirini Kalliamvakou, Peter Cihon, Mert Demirer

Direct preference optimization: Your language model is secretly a reward model 2023 article

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, Chelsea Finn

Changing the world by changing the data 2023 article

Anna Rogers

Generative AI meets copyright 2023 article

Pamela Samuelson

The gradient of generative AI release: Methods and considerations 2023 article

Irene Solaiman

Canada's Online News Act: A legislative response to platform power 2023 article

Gregory Taylor

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena 2023 article

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P Xing

Common Crawl — Web-scale Data for Research 2022 misc

Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses 2022 article

Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Madry, Bo Li, Tom Goldstein

Comprehensive survey systematically categorizing dataset vulnerabilities including poisoning and backdoor attacks, their threat models, and defense mechanisms.

DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning 2022 article

Chengcheng Guo, Bo Zhao, Yanbing Bai

Comprehensive library and empirical study of coreset selection methods for deep learning, finding that random selection remains a strong baseline across many settings.

Training Data Influence Analysis and Estimation: A Survey 2022 article

Zayd Hammoudeh, Daniel Lowd

Training Compute-Optimal Large Language Models 2022 inproceedings

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre

Shows that current LLMs are significantly undertrained. For compute-optimal training, model size and training tokens should scale equally. Introduces Chinchilla (70B params, 1.4T tokens) which outperforms larger models like Gopher (280B) trained on less data.

Datamodels: Predicting Predictions from Training Data 2022 inproceedings

Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, Aleksander Madry

Proposes datamodels that predict model outputs as a function of training data subsets, providing a framework for understanding data attribution through retraining experiments.

Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning 2022 inproceedings

Yongchan Kwon, James Zou

Generalizes Data Shapley using Beta weighting functions, providing noise-reduced data valuation that better handles outliers and mislabeled data detection.

LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets 2022 misc

Training language models to follow instructions with human feedback 2022 article

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe

Probabilistic Machine Learning: An introduction 2022 book

Kevin P. Murphy

The Fallacy of AI Functionality 2022 article

Inioluwa Deborah Raji, Indra Elizabeth Kumar, Aaron Horowitz, Andrew D. Selbst

Releasing Re-LAION-5B 2022 misc

Why Black Box Machine Learning Should Be Avoided for High-Stakes Decisions, in Brief 2022 article

Cynthia Rudin

LAION-5B: An Open Large-Scale Dataset for Training Next CLIP Models 2022 inproceedings

Schuhmann, Christoph, Beaumont, Romain, Vencu, Richard, Gordon, Cade, Wightman, Ross, Cherti, Mehdi, Coombes, Theo, Katta, Aarush, Mullis, Clayton, Wortsman, Mitchell, Schramowski, Patrick, Kundurthy, Srivatsa, Crowson, Katherine, Schmidt, Ludwig, Kaczmarczyk, Robert, Jitsev, Jenia

Beyond neural scaling laws: beating power law scaling via data pruning 2022 article

Sorscher, Ben, Geirhos, Robert, Shekhar, Shashank, Ganguli, Surya, Morcos, Ari

The Stack: A Permissively Licensed Source Code Dataset 2022 misc

Introducing Whisper 2022 misc

Robust Speech Recognition via Large-Scale Weak Supervision 2022 article

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever

Constitutional AI: Harmlessness from AI feedback 2022 article

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon

Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space 2022 article

Mor Geva, Avi Caciularu, Kevin R Wang, Yoav Goldberg

In-context learning and induction heads 2022 article

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen

Mapping the design space of teachable social media 2022 inproceedings

Myle Ott

Training language models to follow instructions with human feedback 2022 article

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 2021 inproceedings

Bender, Emily M., Gebru, Timnit, McMillan-Major, Angelina, Shmitchell, Shmargaret

Machine Unlearning 2021 inproceedings

Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, Nicolas Papernot

Introduces SISA (Sharded, Isolated, Sliced, Aggregated) training for efficient exact machine unlearning. Partitions data into shards with separate models, enabling targeted retraining when data must be forgotten.

Extracting Training Data from Large Language Models 2021 inproceedings

Carlini, Nicholas, Tramer, Florian, Wallace, Eric, Jagielski, Matthew, Herbert-Voss, Ariel, Lee, Katherine, Roberts, Adam, Brown, Tom B., Song, Dawn, Erlingsson, Úlfar, Oprea, Alina, Papernot, Nicolas

Unsolved Problems in ML Safety 2021 article

Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt

Beta Shapley: A Unified and Noise-Reduced Data Valuation Framework for Machine Learning 2021 article

Kwon, Yongchan, Zou, James

What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus 2021 inproceedings

Alexandra Sasha Luccioni, Joseph D. Viviano

Measuring Mathematical Problem Solving With the MATH Dataset 2021 article

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt

The Pile: An 800GB Dataset of Diverse Text for Language Modeling 2021 article

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy

What are you optimizing for? Aligning Recommender Systems with Human Values 2021 article

Jonathan Stray, Ivan Vendrov, Jeremy Nixon, Steven Adler, Dylan Hadfield-Menell

Quantifying the Invisible Labor in Crowd Work 2021 article

Carlos Toxtli, Siddharth Suri, Saiph Savage

Can "Conscious Data Contribution" Help Users to Exert "Data Leverage" Against Technology Companies? 2021 article

Nicholas Vincent, Brent Hecht

Data Leverage: A Framework for Empowering the Public in its Relationship with Technology Companies 2021 inproceedings

Vincent, Nicholas, Li, Hanlin, Tilly, Nicole, Chancellor, Stevie, Hecht, Brent

A Deeper Investigation of the Importance of Wikipedia Links to Search Engine Results 2021 article

Nicholas Vincent, Brent Hecht

Ethical and Social Risks of Harm from Language Models 2021 article

Weidinger, Laura, Mellor, John, Rauh, Maribeth, Griffin, Conor, Uesato, Jonathan, Huang, Po-Sen, Cheng, Myra, Glaese, Mia, Balle, Borja, Kasirzadeh, Atoosa, Kenton, Zac, Brown, Sasha, Hawkins, Will, Stepleton, Tom, Biles, Courtney, Birhane, Abeba, Haas, Julia, Rimell, Laura, Hendricks, Lisa Anne, Isaac, William, Legassick, Sean, Irving, Geoffrey, Gabriel, Iason

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 2021 inproceedings

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell

On the opportunities and risks of foundation models 2021 article

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill

To trust or to think: Cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making 2021 inproceedings

Zana Buçinca, Maja Barbara Malaya, Krzysztof Z Gajos

Extracting training data from large language models 2021 inproceedings

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson

All that's 'human' is not gold: Evaluating human evaluation of generated text 2021 inproceedings

Elizabeth Clark, Tal August, Sofia Serber, Nikita Haduong, Suchin Gururangan, Noah A Smith

Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence 2021 book

Kate Crawford

Documenting large webtext corpora: A case study on the Colossal Clean Crawled Corpus 2021 inproceedings

Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner

A mathematical framework for transformer circuits 2021 article

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly

The Australian News Media Bargaining Code 2021 article

Terry Flew, Fiona Martin, Nicolas Suzor

Datasheets for datasets 2021 article

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, Kate Crawford

The value of data: Evidence from ride-hailing 2021 article

Jesse Gregory, Michael Kremer, Judd Kessler, Susan Athey

Copyright in the data economy: An overview 2021 article

P Bernt Hugenholtz

Dynabench: Rethinking benchmarking in NLP 2021 article

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia

Fair learning 2021 article

Mark A Lemley, Bryan Casey

Data leverage: A framework for empowering the public in its relationship with technology companies 2021 article

Nicholas Vincent, Hanlin Li, Nicole Tilly, Stevie Chancellor, Brent Hecht

Language (Technology) is Power: A Critical Survey of “Bias” in NLP 2020 inproceedings

Blodgett, Su Lin, Barocas, Solon, Daumé III, Hal, Wallach, Hanna

Language Models are Few-Shot Learners 2020 article

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

Artificial Intelligence, Values, and Alignment 2020 article

Iason Gabriel

Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning 2020 inproceedings

Scaling Laws for Neural Language Models 2020 article

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei

Establishes power-law scaling relationships between language model performance and model size, dataset size, and compute, spanning seven orders of magnitude.

Exploring Research Interest in Stack Overflow – A Systematic Mapping Study and Quality Evaluation 2020 article

Sarah Meldrum, Sherlock A. Licorish, Bastin Tony Roy Savarimuthu

Coresets for Data-efficient Training of Machine Learning Models 2020 inproceedings

Baharan Mirzasoleiman, Jeff Bilmes, Jure Leskovec

Introduces CRAIG (Coresets for Accelerating Incremental Gradient descent), which selects weighted subsets that approximate the full gradient, giving 2-3x training speedups while maintaining performance.

The Economics of Maps 2020 article

Abhishek Nagaraj, Scott Stern

Deep Double Descent: Where Bigger Models and More Data Hurt 2020 inproceedings

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever

Demonstrates that double descent occurs across model size, training epochs, and dataset size in modern deep networks. Introduces effective model complexity to unify these phenomena and shows regimes where more data hurts.

The Biggest Lie on the Internet: Ignoring the Privacy Policies and Terms of Service Policies of Social Networking Services 2020 article

Obar, Jonathan A., Oeldorf-Hirsch, Anne

Estimating Training Data Influence by Tracing Gradient Descent 2020 inproceedings

Garima Pruthi, Frederick Liu, Mukund Sundararajan, Satyen Kale

Introduces TracIn, which computes the influence of a training example by tracing how the test loss changes whenever that example is used during training. Uses a first-order gradient approximation and saved checkpoints for scalability.

The pushshift reddit dataset 2020 article

Baumgartner, Jason, Zannettou, Savvas, Keegan, Brian, Squire, Megan, Blackburn, Jeremy

Example Citation Placeholder 2020 misc

Jane Smith

Placeholder reference to support example citations in docs. Replace with a real source when available.

Are anonymity-seekers just like everybody else? An analysis of contributions to Wikipedia from Tor 2020 inproceedings

Tran, Chau, Champion, Kaylea, Forte, Andrea, Hill, Benjamin Mako, Greenstadt, Rachel

In Pursuit of Interpretable, Fair and Accurate Machine Learning for Criminal Recidivism Prediction 2020 article

Caroline Wang, Bin Han, Bhrij Patel, Cynthia Rudin

Enchanted determinism: Power without responsibility in artificial intelligence 2020 article

Alexander Campolo, Kate Crawford

The digitization of day labor as gig work 2020 article

Veena B Dubal

interpreting GPT: the logit lens 2020 misc

nostalgebraist

Too Smart: How Digital Capitalism is Extracting Data, Controlling Our Lives, and Taking Over the World 2020 book

Jathan Sadowski

What do platforms do? Understanding the gig economy 2020 article

Steven Vallas, Juliet B Schor

The gig economy: A critical introduction 2020 article

Jamie Woodcock, Mark Graham

Common voice: A massively-multilingual speech corpus 2019 article

Ardila, Rosana, Branson, Megan, Davis, Kelly, Henretty, Michael, Kohler, Michael, Meyer, Josh, Morais, Reuben, Saunders, Lindsay, Tyers, Francis M, Weber, Gregor

Reconciling modern machine-learning practice and the classical bias–variance trade-off 2019 article

Belkin, Mikhail, Hsu, Daniel, Ma, Siyuan, Mandal, Soumik

The Secret Sharer: Measuring Unintended Memorization in Neural Networks 2019 inproceedings

Carlini, Nicholas, Liu, Chang, Erlingsson, Úlfar, Kos, Jernej, Song, Dawn

Excavating AI: The Politics of Images in Machine Learning Training Sets 2019 misc

Ecosystem Tipping Points in an Evolving World 2019 article

Vasilis Dakos, Blake Matthews, Andrew P. Hendry, Jonathan Levine, Nicolas Loeuille, Jon Norberg, Patrik Nosil, Marten Scheffer, Luc De Meester

Data Shapley: Equitable Valuation of Data for Machine Learning 2019 inproceedings

Amirata Ghorbani, James Zou

Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects 2019 techreport

Grother, Patrick, Ngan, Mei, Hanaoka, Kayee

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain 2019 article

Tianyu Gu, Brendan Dolan-Gavitt, Siddharth Garg

First demonstration of backdoor attacks on deep neural networks. Shows that small trigger patterns in training data cause models to misclassify any input containing the trigger (e.g., stop signs with stickers classified as speed limits).

Incomplete Contracting and AI Alignment 2019 inproceedings

Dylan Hadfield-Menell, Gillian K. Hadfield

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips 2019 article

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

Towards Efficient Data Valuation Based on the Shapley Value 2019 inproceedings

Ruoxi Jia, Dah-Yuan Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Neslihan M. Gurel, Carl J. Spanos

On the Accuracy of Influence Functions for Measuring Group Effects 2019 inproceedings

Pang Wei Koh, Kai-Siang Ang, Hubert H. K. Teo, Percy Liang

Privacy, anonymity, and perceived risk in open collaboration: A study of service providers 2019 inproceedings

McDonald, Nora, Hill, Benjamin Mako, Greenstadt, Rachel, Forte, Andrea

Model Cards for Model Reporting 2019 inproceedings

Mitchell, Margaret, Wu, Simone, Zaldivar, Andrew, Barnes, Parker, Vasserman, Lucy, Hutchinson, Ben, Spitzer, Elena, Raji, Inioluwa Deborah, Gebru, Timnit

Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations 2019 article

Obermeyer, Ziad, Powers, Brian, Vogeli, Christine, Mullainathan, Sendhil

Rosenbach v. Six Flags Entertainment Corp. 2019 legal

Fairness and Abstraction in Sociotechnical Systems 2019 inproceedings

Selbst, Andrew D., Boyd, Danah, Friedler, Sorelle A., Venkatasubramanian, Suresh, Vertesi, Janet

A Survey on Image Data Augmentation for Deep Learning 2019 article

Connor Shorten, Taghi M. Khoshgoftaar

Comprehensive survey of image data augmentation techniques for deep learning, covering geometric transformations, color space transforms, kernel filters, mixing images, random erasing, and neural style transfer approaches.

Measuring the Importance of User-Generated Content to Search Engines 2019 inproceedings

Nicholas Vincent, Isaac Johnson, Patrick Sheehan, Brent Hecht

Mapping the Potential and Pitfalls of "Data Dividends" as a Means of Sharing the Profits of Artificial Intelligence 2019 article

Nicholas Vincent, Yichun Li, Renee Zha, Brent Hecht

"Data Strikes": Evaluating the Effectiveness of a New Form of Collective Action Against Technology Companies 2019 inproceedings

Nicholas Vincent, Brent Hecht, Shilad Sen

Simulates data strikes against recommender systems, showing that collective withholding of training data can create leverage for users against technology platforms.

CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features 2019 inproceedings

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, Youngjoon Yoo

Combines cutting and mixing: patches from one image replace regions in another, with labels mixed proportionally. Improves over Cutout by using cut pixels constructively rather than zeroing them out.

Automation and new tasks: How technology displaces and reinstates labor 2019 article

Daron Acemoglu, Pascual Restrepo

The Wrong Kind of AI? Artificial Intelligence and the Future of Labor Demand 2019 book

Daron Acemoglu, Pascual Restrepo

Race After Technology: Abolitionist Tools for the New Jim Code 2019 book

Ruha Benjamin

Regulatory options for conflicts of law and jurisdictional issues in the on-demand economy 2019 article

Miriam A Cherry

Data colonialism: Rethinking big data's relation to the contemporary subject 2019 article

Nick Couldry, Ulises A Mejias

The Costs of Connection: How Data is Colonizing Human Life and Appropriating it for Capitalism 2019 book

Nick Couldry, Ulises A Mejias

The Technology Trap: Capital, Labor, and Power in the Age of Automation 2019 book

Carl Benedikt Frey

Data Shapley: Equitable valuation of data for machine learning 2019 inproceedings

Amirata Ghorbani, James Zou

Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass 2019 book

Mary L Gray, Siddharth Suri

SuperGLUE: A stickier benchmark for general-purpose language understanding systems 2019 article

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel Bowman

The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power 2019 book

Shoshana Zuboff

A Reductions Approach to Fair Classification 2018 inproceedings

Alekh Agarwal, Alina Beygelzimer, Miroslav Dudik, John Langford, Hanna Wallach

Should We Treat Data as Labor? Moving Beyond 'Free' 2018 article

Imanol Arrieta-Ibarra, Leonard Goff, Diego Jimenez-Hernandez, Jaron Lanier, E. Glen Weyl

Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science 2018 article

Bender, Emily M., Friedman, Batya

Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification 2018 inproceedings

Buolamwini, Joy, Gebru, Timnit

Datasheets for Datasets 2018 inproceedings

Gebru, Timnit, Morgenstern, Jamie, Vecchione, Briana, Vaughan, Jennifer Wortman, Wallach, Hanna, Daumé III, Hal, Crawford, Kate

The Dark (Patterns) Side of UX Design 2018 inproceedings

Gray, Colin M., Kou, Yubo, Battles, Bryan, Hoggatt, Joseph, Toombs, Austin L.

The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards 2018 misc

Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, Kasia Chmielinski

Troubling Trends in Machine Learning Scholarship 2018 article

Zachary C. Lipton, Jacob Steinhardt

Active Learning for Convolutional Neural Networks: A Core-Set Approach 2018 inproceedings

Ozan Sener, Silvio Savarese

Frames active learning as core-set selection: choosing points such that a model trained on the selected subset is competitive on the remaining data. Provides theoretical bounds via the k-Center problem.

Artificial Intelligence and Its Implications for Income Distribution and Unemployment 2018 techreport

Anton Korinek, Joseph E. Stiglitz

A Blueprint for a Better Digital Society 2018 article

Weyl, E. Glen, Lanier, Jaron

mixup: Beyond Empirical Risk Minimization 2018 inproceedings

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz

Introduces mixup, a data augmentation technique that trains on convex combinations of input pairs and their labels. Simple, data-independent, and model-agnostic approach that improves generalization and robustness.

Artificial intelligence, automation and work 2018 techreport

Daron Acemoglu, Pascual Restrepo

Prediction Machines: The Simple Economics of Artificial Intelligence 2018 book

Ajay Agrawal, Joshua Gans, Avi Goldfarb

Should we treat data as labor? Moving beyond 'free' 2018 article

Imanol Arrieta-Ibarra, Leonard Goff, Diego Jiménez-Hernández, Jaron Lanier, E Glen Weyl

Data statements for natural language processing: Toward mitigating system bias and enabling better science 2018 inproceedings

Emily M Bender, Batya Friedman

Artificial Unintelligence: How Computers Misunderstand the World 2018 book

Meredith Broussard

Neurons spike back: The invention of inductive machines and the artificial intelligence controversy 2018 article

Dominique Cardon, Jean-Philippe Cointet, Antoine Mazières

Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor 2018 book

Virginia Eubanks

Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media 2018 book

Tarleton Gillespie

Annotation artifacts in natural language inference data 2018 inproceedings

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, Noah A Smith

A blueprint for a better digital society 2018 article

Jaron Lanier, E Glen Weyl

Algorithms of Oppression: How Search Engines Reinforce Racism 2018 book

Safiya Umoja Noble

Radical Markets: Uprooting Capitalism and Democracy for a Just Society 2018 book

Eric A Posner, E Glen Weyl

Uberland: How Algorithms Are Rewriting the Rules of Work 2018 book

Alex Rosenblat

Artificial intelligence, economics, and industrial organization 2018 article

Hal R Varian

Improved Regularization of Convolutional Neural Networks with Cutout 2017 article

Terrance DeVries, Graham W. Taylor

Introduces Cutout, a regularization technique that randomly masks square regions of input images during training. Inspired by dropout but applied to inputs, encouraging models to learn from partially visible objects.

The Substantial Interdependence of Wikipedia and Google: A Case Study on the Relationship Between Peer Production Communities and Information Technologies 2017 inproceedings

McMahon, Connor, Johnson, Isaac, Hecht, Brent

Deep learning scaling is predictable, empirically 2017 article

Hestness, Joel, Narang, Sharan, Ardalani, Newsha, Diamos, Gregory, Jun, Heewoo, Kianinejad, Hassan, Patwary, Md. Mostofa Ali, Yang, Yang, Zhou, Yanqi

The WARC Format 1.1 2017 misc

International Internet Preservation Consortium

Understanding Black-box Predictions via Influence Functions 2017 inproceedings

Pang Wei Koh, Percy Liang

Uses influence functions from robust statistics to trace model predictions back to training data, identifying training points most responsible for a given prediction.

Deep reinforcement learning from human preferences 2017 inproceedings

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei

Heteromation, and Other Stories of Computing and Capitalism 2017 book

Hamid R Ekbia, Bonnie A Nardi

On Calibration of Modern Neural Networks 2017 misc

Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger

Understanding black-box predictions via influence functions 2017 inproceedings

Pang Wei Koh, Percy Liang

Thinking critically about and researching algorithms 2017 article

Rob Kitchin

Algorithms as culture: Some tactics for the ethnography of algorithmic systems 2017 article

Nick Seaver

Membership inference attacks against machine learning models 2017 inproceedings

Reza Shokri, Marco Stronati, Congzheng Song, Vitaly Shmatikov

Platform Capitalism 2017 book

Nick Srnicek

The EU General Data Protection Regulation (GDPR): A Practical Guide 2017 book

Paul Voigt, Axel Von dem Bussche

Big Data's Disparate Impact 2016 article

Barocas, Solon, Selbst, Andrew D.

General Data Protection Regulation (EU) 2016/679 2016 misc

Reality and Perception of Copyright Terms of Service for Online Content Creation 2016 inproceedings

Fiesler, Casey, Lampe, Cliff, Bruckman, Amy S.

Information fiduciaries and the first amendment 2016 article

Jack M Balkin

How the machine 'thinks': Understanding opacity in machine learning algorithms 2016 article

Jenna Burrell

Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy 2016 book

Cathy O'Neil

SQuAD: 100,000+ questions for machine comprehension of text 2016 inproceedings

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang

Uberworked and Underpaid: How Workers Are Disrupting the Digital Economy 2016 book

Trebor Scholz

Ours to Hack and to Own: The Rise of Platform Cooperativism 2016 book

Trebor Scholz, Nathan Schneider

Towards Making Systems Forget with Machine Unlearning 2015 inproceedings

Yinzhi Cao, Junfeng Yang

First formal definition of machine unlearning. Proposes converting learning algorithms into summation form to enable efficient data removal without full retraining. Foundational work establishing the unlearning problem.

Causal Inference in Statistics, Social, and Biomedical Sciences 2015 book

Guido W. Imbens, Donald B. Rubin

Comprehensive treatment of causal inference methods for observational and experimental data. Covers randomized experiments, matching, propensity scores, instrumental variables, and regression discontinuity designs.

Turkers, Scholars, "Arafat" and "Peace": Cultural Communities and Algorithmic Gold Standards 2015 misc

Shilad Sen, Margaret E. Giesel, Rebecca Gold, Benjamin Hillmann, Matt Lesicko, Samuel Naden, Jesse Russell, Zixiao Wang, Brent J. Hecht

Why are there still so many jobs? The history and future of workplace automation 2015 article

David H Autor

Cyber-Proletariat: Global Labour in the Digital Vortex 2015 book

Nick Dyer-Witheford

The Black Box Society: The Secret Algorithms That Control Money and Information 2015 book

Frank Pasquale

Who Gets What—and Why: The New Economics of Matchmaking and Market Design 2015 book

Alvin E Roth

Hidden technical debt in machine learning systems 2015 inproceedings

David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison

What's Yours Is Mine: Against the Sharing Economy 2015 book

Tom Slee

What's wrong with social simulations? 2014 article

Eckhart Arnold

The Algorithmic Foundations of Differential Privacy 2014 article

Cynthia Dwork, Aaron Roth

The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies 2014 book

Erik Brynjolfsson, Andrew McAfee

Heteromation and its (dis)contents: The invisible division of labor between humans and machines 2014 article

Hamid Ekbia, Bonnie Nardi

The Fourth Revolution: How the Infosphere is Reshaping Human Reality 2014 book

Luciano Floridi

Digital Labour and Karl Marx 2014 book

Christian Fuchs

The relevance of algorithms 2014 article

Tarleton Gillespie

Children's Online Privacy Protection Rule (COPPA) — 16 CFR Part 312 2013 misc

The Future of Crowd Work 2013 inproceedings

Aniket Kittur, Jeffrey V. Nickerson, Michael Bernstein, Elizabeth Gerber, Aaron Shaw, John Zimmerman, Matt Lease, John Horton

The China syndrome: Local labor market effects of import competition in the United States 2013 article

David H Autor, David Dorn, Gordon H Hanson

The Ethics of Artificial Intelligence 2013 book

Luciano Floridi, Jeff W Sanders

Turkopticon: Interrupting worker invisibility in Amazon Mechanical Turk 2013 inproceedings

Lilly C Irani, M Six Silberman

Who Owns the Future? 2013 book

Jaron Lanier

To Save Everything, Click Here: The Folly of Technological Solutionism 2013 book

Evgeny Morozov

Poisoning Attacks against Support Vector Machines 2012 inproceedings

Battista Biggio, Blaine Nelson, Pavel Laskov

Investigates poisoning attacks against SVMs where adversaries inject crafted training data to increase test error. Uses gradient ascent to construct malicious data points.

Configuring the Networked Self: Law, Code, and the Play of Everyday Practice 2012 book

Julie E Cohen

Infrastructure: The Social Value of Shared Resources 2012 book

Brett M Frischmann

The Winograd schema challenge 2012 inproceedings

Hector Levesque, Ernest Davis, Leora Morgenstern

Open Access 2012 book

Peter Suber

Skills, tasks and technologies: Implications for employment and earnings 2011 article

Daron Acemoglu, David Autor

Surveillance and alienation in the online economy 2011 article

Mark Andrejevic

Human Computation 2011 book

Edith Law, Luis von Ahn

The Precariat: The New Dangerous Class 2011 book

Guy Standing

Guide to Protecting the Confidentiality of Personally Identifiable Information (PII) 2010 techreport

McCallister, Erika, Grance, Tim, Scarfone, Karen

A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming 2010 book

Paul N Edwards

Privacy in Context: Technology, Policy, and the Integrity of Social Life 2010 book

Helen Nissenbaum

Curriculum Learning 2009 inproceedings

Yoshua Bengio, Jerome Louradour, Ronan Collobert, Jason Weston

Introduces curriculum learning: training models on examples of increasing difficulty. Shows this acts as a continuation method for non-convex optimization, improving both convergence speed and final generalization.

Causality: Models, Reasoning, and Inference 2009 book

Judea Pearl

Foundational book on causal inference introducing structural causal models, do-calculus, and counterfactual reasoning. Unifies graphical models with the potential outcomes framework. Second edition with expanded coverage.

Active Learning Literature Survey 2009 techreport

Burr Settles

Canonical survey of active learning covering uncertainty sampling, query-by-committee, expected error reduction, variance reduction, and density-weighted methods. Establishes foundational taxonomy for the field.

The Google dilemma 2009 article

James Grimmelmann

Biometric Information Privacy Act (BIPA), 740 ILCS 14 2008 misc

The Cost of Reading Privacy Policies 2008 article

McDonald, Aleecia M., Cranor, Lorrie Faith

Robust De-anonymization of Large Sparse Datasets 2008 inproceedings

Narayanan, Arvind, Shmatikov, Vitaly

Understanding knowledge as a commons: From theory to practice 2007 book

Charlotte Hess, Elinor Ostrom

Human-Machine Reconfigurations: Plans and Situated Actions 2007 book

Lucy A Suchman

The polarization of the US labor market 2006 article

David H Autor, Lawrence F Katz, Melissa S Kearney

The Wealth of Networks: How Social Production Transforms Markets and Freedom 2006 book

Yochai Benkler

A taxonomy of privacy 2006 article

Daniel J Solove

Causal Inference Using Potential Outcomes: Design, Modeling, Decisions 2005 article

Donald B. Rubin

Comprehensive overview of the potential outcomes framework for causal inference. Covers experimental design, observational studies, propensity scores, and the fundamental problem of causal inference.

Reassembling the Social: An Introduction to Actor-Network-Theory 2005 book

Bruno Latour

Human Computation 2005 phdthesis

Luis von Ahn

Privacy as Contextual Integrity 2004 article

Nissenbaum, Helen

Writings of the Luddites 2004 book

Kevin Binfield

Free Culture: How Big Media Uses Technology and the Law to Lock Down Culture and Control Creativity 2004 book

Lawrence Lessig

Labeling images with a computer game 2004 inproceedings

Luis Von Ahn, Laura Dabbish

The skill content of recent technological change: An empirical exploration 2003 article

David H Autor, Frank Levy, Richard J Murnane

Platform competition in two-sided markets 2003 article

Jean-Charles Rochet, Jean Tirole

Skill-biased technological change and rising wage inequality: Some problems and puzzles 2002 article

David Card, John E DiNardo

State of the Union: A Century of American Labor 2002 book

Nelson Lichtenstein

Free Software, Free Society: Selected Essays of Richard M. Stallman 2002 book

Richard M Stallman

Modeling Complexity: The Limits to Prediction 2001 article

Michael Batty, Paul M. Torrens

HIPAA Privacy Rule — 45 CFR Parts 160 and 164 2000 misc

U.S. Department of Health and Human Services

Simple Demographics Often Identify People Uniquely 2000 article

Sweeney, Latanya

Privacy as intellectual property? 2000 article

Pamela Samuelson

Economics of the Public Sector 2000 book

Joseph E Stiglitz

Free labor: Producing culture for the digital economy 2000 article

Tiziana Terranova

Sorting Things Out: Classification and Its Consequences 1999 book

Geoffrey C Bowker, Susan Leigh Star

Technological determinism 1999 article

Donald MacKenzie

The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary 1999 book

Eric S Raymond

Social Dilemmas: The Anatomy of Cooperation 1998 article

Kollock, Peter

The study of social dilemmas is the study of the tension between individual and collective rationality. In a social dilemma, individually reasonable behavior leads to a situation in which everyone is worse off. The first part of this review is a discussion of categories of social dilemmas and how they are modeled. The key two-person social dilemmas (Prisoner’s Dilemma, Assurance, Chicken) and multiple-person social dilemmas (public goods dilemmas and commons dilemmas) are examined. The second part is an extended treatment of possible solutions for social dilemmas. These solutions are organized into three broad categories based on whether the solutions assume egoistic actors and whether the structure of the situation can be changed: Motivational solutions assume actors are not completely egoistic and so give some weight to the outcomes of their partners. Strategic solutions assume egoistic actors, and neither of these categories of solutions involve changing the fundamental structure of the situation. Solutions that do involve changing the rules of the game are considered in the section on structural solutions. I conclude the review with a discussion of current research and directions for future work.

The tragedy of the anticommons: Property in the transition from Marx to markets 1998 article

Michael A Heller

Gradient-based learning applied to document recognition 1998 article

Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner

Information Rules: A Strategic Guide to the Network Economy 1998 book

Carl Shapiro, Hal R Varian

No free lunch theorems for optimization 1997 article

David H Wolpert, William G Macready

Recommender systems 1997 misc

Paul Resnick, H. Varian

Humans and automation: Use, misuse, disuse, abuse 1997 article

Raja Parasuraman, Victor Riley

The Sciences of the Artificial 1996 book

Herbert A Simon

The critical mass in collective action 1993 book

Gerald Marwell, Pamela Oliver

What Computers Still Can't Do: A Critique of Artificial Reason 1992 book

Hubert L Dreyfus

Governing the Commons: The Evolution of Institutions for Collective Action 1990 book

Elinor Ostrom

The Society of Mind 1988 book

Marvin Minsky

The Social Construction of Technological Systems 1987 book

Wiebe E Bijker, Thomas P Hughes, Trevor Pinch

Science in Action: How to Follow Scientists and Engineers Through Society 1987 book

Bruno Latour

The Economic Institutions of Capitalism 1985 book

Oliver E Williamson

What Do Unions Do? 1984 book

Richard B Freeman, James L Medoff

Problems of Monetary Management: The UK Experience 1984 inbook

C. A. E. Goodhart

The Second Self: Computers and the Human Spirit 1984 book

Sherry Turkle

The Managed Heart: Commercialization of Human Feeling 1983 book

Arlie Russell Hochschild

Minds, brains, and programs 1980 article

John R Searle

Do artifacts have politics? 1980 article

Langdon Winner

Manufacturing Consent: Changes in the Labor Process Under Monopoly Capitalism 1979 book

Michael Burawoy

Computer Power and Human Reason: From Judgment to Calculation 1976 book

Joseph Weizenbaum

Family Educational Rights and Privacy Act (FERPA) 1974 misc
Labor and Monopoly Capital: The Degradation of Work in the Twentieth Century 1974 book

Harry Braverman

Brain of the Firm 1972 book

Stafford Beer

What Computers Can't Do: A Critique of Artificial Reason 1972 book

Hubert L Dreyfus

Human Problem Solving 1972 book

Allen Newell, Herbert A Simon

The Sciences of the Artificial 1969 book

Herbert A Simon

The tragedy of the commons 1968 article

Garrett Hardin

Computation: Finite and Infinite Machines 1967 book

Marvin Minsky

Economic welfare and the allocation of resources for invention 1962 article

Kenneth Arrow

The problem of social cost 1960 article

Ronald H Coase

Programs with common sense 1960 article

John McCarthy

An Introduction to Cybernetics 1956 book

W Ross Ashby

A behavioral model of rational choice 1955 article

Herbert A Simon

A value for n-person games 1953 article

Lloyd S Shapley

Rank analysis of incomplete block designs: I. The method of paired comparisons 1952 article

Ralph Allan Bradley, Milton E Terry

American Capitalism: The Concept of Countervailing Power 1952 book

John Kenneth Galbraith

Computing machinery and intelligence 1950 article

Alan M Turing

The human use of human beings 1950 article

Norbert Wiener

A mathematical theory of communication 1948 article

Claude E Shannon

Cybernetics: Or Control and Communication in the Animal and the Machine 1948 book

Norbert Wiener

The Great Transformation: The Political and Economic Origins of Our Time 1944 book

Karl Polanyi

The nature of the firm 1937 article

Ronald H Coase

Economic possibilities for our grandchildren 1930 article

John Maynard Keynes

Data Leverage & Collective Action paper_collection
User-Generated Content & AI Training Data paper_collection
Alpaca Data Cleaned Repository misc
Stanford Alpaca GitHub Repository misc
arXiv API User’s Manual misc
arXiv Bulk Data Access misc
arXiv OAI-PMH Interface misc
C4 Generator Code misc
Web Archiving File Formats Explained misc
Common Crawl – Get Started misc
Databricks Dolly Repository misc
GSM8K Hugging Face Dataset Card misc
Grade-School Math (GSM8K) Repository misc
HH-RLHF Dataset misc
HowTo100M Project misc
Journal Article Tag Suite misc
JSON Lines Specification misc
WARC, Web ARChive file format misc
Competition Math Dataset on Hugging Face misc
Wikipedia Data Dumps – Dump Format misc
NDJSON Specification misc
OpenAssistant OASST1 Dataset Card misc
OpenAI API Reference – Chat misc
Apache Parquet Project misc
Project Gutenberg Offline Catalogs and Feeds misc
Project Gutenberg File Formats misc
Pushshift.io misc
Reddit API Documentation misc
Reddit Data API Wiki misc
Stack Exchange Data Explorer Help misc
Why is the Stack Exchange Data Dump only available in XML? misc
BigCode Project Documentation misc
The Stack dataset on Hugging Face misc
The Stack v2 dataset on Hugging Face misc
C4 dataset in TensorFlow Datasets misc
TFRecord and tf.train.Example Tutorial misc
Wikipedia Database Download misc
Active Learning paper_collection
Experimental Design & Causal Inference paper_collection
Data Augmentation & Curriculum Learning paper_collection
Data Poisoning & Adversarial Training paper_collection
Data Selection & Coresets paper_collection
Data Valuation & Shapley paper_collection
Fairness via Data Interventions paper_collection
Machine Unlearning paper_collection
Privacy, Memorization & Unlearning paper_collection
Data Scaling Laws paper_collection
Influence Functions & Data Attribution paper_collection
Model Collapse & Synthetic Data paper_collection
Paper Review Queue memo