Data Leverage References

Paper Collections

Curated reading lists organized by research area. 57 of 443 references catalogued.

❦ Data Augmentation & Curriculum Learning

2 papers

Augmentation creates synthetic variations of training data—effectively generating new rows in the grid. Curriculum learning orders training examples from easy to hard. Mixup is a simple, influential technique; the survey covers the broader landscape.
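
A minimal mixup sketch, assuming NumPy arrays of inputs and one-hot labels; the Beta-distributed coefficient and in-batch pairing follow the original paper's recipe:

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Mix each example with a randomly chosen partner from the same batch.

    x: input array of shape (batch, ...)
    y: one-hot label array of shape (batch, num_classes)
    alpha: Beta parameter; mixup draws lambda ~ Beta(alpha, alpha)
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)           # interpolation coefficient
    perm = rng.permutation(len(x))         # random pairing within the batch
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]
    return x_mixed, y_mixed
```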

collection:data-augmentation

❦ Data Leverage & Collective Action

20 papers

This collection gathers foundational and recent work on **data leverage**—the strategic use of data withholding, contribution, or manipulation as a form of collective action.

## Core Concepts

- **Data Strikes**: Coordinated refusal to contribute data to platforms
- **Data Poisoning**: Intentionally corrupting training data to degrade model performance
- **Conscious Data Contribution**: Strategically directing data to preferred systems
- **Data Valuation**: Methods for quantifying individual data contributions (Shapley values, etc.)

## Why This Matters

As AI systems become more dependent on user-generated content and behavioral data, data creators gain potential leverage over technology companies. This research explores when and how such leverage can be effectively exercised.

collection:data-leverage

❦ Data Poisoning & Adversarial Training

4 papers

Poisoning asks: what if someone deliberately corrupts training data? This expands the grid dramatically—every possible perturbation creates new rows. BadNets showed how small triggers in images cause targeted misclassifications. Recent work looks at poisoning web-scale datasets and deceptive model behavior.
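
A toy illustration of a BadNets-style trigger; the patch size, location, poison rate, and target label are arbitrary illustrative choices, and images are assumed to be grayscale arrays of shape (N, H, W) with values in [0, 1]:

```python
import numpy as np

def poison(images, labels, target_label, rate=0.05, patch=3, rng=None):
    """Stamp a small white square (the trigger) onto a fraction of images
    and relabel them, so a model trained on the mix learns
    trigger -> target_label."""
    rng = rng or np.random.default_rng()
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    images[idx, -patch:, -patch:] = 1.0    # bottom-right corner trigger
    labels[idx] = target_label             # targeted misclassification
    return images, labels
```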

collection:data-poisoning

❦ Data Scaling Laws

3 papers

Scaling laws describe how performance changes as you add more data (or parameters, or compute). In grid terms, they're regressions over average performance across rows of different sizes. Kaplan et al. established the modern framework; Chinchilla refined it for compute-optimal training.
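
A sketch of the regression idea in its simplest single-variable form, fitting a pure power law L(D) = c · D^(−β) in log-log space; the (dataset size, loss) measurements here are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical (dataset_size, validation_loss) measurements -- illustrative only.
D = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
L = np.array([3.9, 3.4, 2.95, 2.6, 2.3])

# A pure power law L = c * D^(-beta) is linear in log-log space,
# so ordinary least squares recovers the exponent.
slope, intercept = np.polyfit(np.log(D), np.log(L), 1)
beta, c = -slope, np.exp(intercept)
print(f"fitted L(D) ~ {c:.2f} * D^(-{beta:.3f})")
```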

collection:scaling-laws

❦ Data Selection & Coresets

4 papers

Coresets are small subsets of the training data chosen so that training on them approximates training on the full dataset. In grid terms, you're looking for a much smaller row that lands in roughly the same performance region. CRAIG is a practical selection method; the DeepCore library collects and benchmarks many others.
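
Not CRAIG itself, but a k-center greedy baseline in the same spirit: repeatedly pick the point farthest from the current selection so the subset covers the feature (or per-example gradient) space:

```python
import numpy as np

def k_center_greedy(features, k, rng=None):
    """Greedy k-center coreset baseline.

    features: array of shape (n, d), e.g. embeddings or per-example gradients
    k: number of points to select
    """
    rng = rng or np.random.default_rng()
    selected = [int(rng.integers(len(features)))]     # arbitrary first center
    dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                    # farthest remaining point
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(features - features[nxt], axis=1))
    return selected
```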

collection:data-selection

❦ Data Valuation & Shapley

5 papers

Data Shapley and related methods assign a "value" to each training point by averaging its marginal contribution across many possible training sets. This is essentially a principled way to aggregate over the grid. These techniques matter for data markets, debugging, and understanding what data actually helps.
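
A sketch of the Monte Carlo estimator behind Data Shapley (Ghorbani & Zou's TMC-Shapley, minus the truncation heuristic, for brevity); `train_fn` and `score_fn` are hypothetical stand-ins for your training and validation pipeline, and `train_fn([])` is assumed to return a baseline model:

```python
import numpy as np

def monte_carlo_shapley(train_fn, score_fn, n, rounds=100, rng=None):
    """Estimate each point's Shapley value by averaging its marginal
    contribution to validation score over random orderings.

    train_fn(list_of_indices) -> model; score_fn(model) -> validation score.
    """
    rng = rng or np.random.default_rng()
    values = np.zeros(n)
    for _ in range(rounds):
        perm = rng.permutation(n)
        prev_score = score_fn(train_fn([]))            # empty-set baseline
        for i, idx in enumerate(perm):
            score = score_fn(train_fn(perm[: i + 1].tolist()))
            values[idx] += score - prev_score          # marginal contribution
            prev_score = score
    return values / rounds
```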

collection:data-valuation

❦ Experimental Design & Causal Inference

2 papers

The grid is fundamentally about counterfactuals, which connects to causal inference. These are foundational references—Pearl for structural causal models, Rubin for potential outcomes, Imbens & Rubin for practical methods. Useful background if you want to think carefully about what "what if" means.
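
In Rubin's potential-outcomes notation, the canonical "what if" quantity is the average treatment effect:

```latex
% Each unit has two potential outcomes, Y(1) under treatment and Y(0)
% under control; only one is ever observed, so the contrast is a
% counterfactual that must be estimated, not read off the data.
\tau_{\text{ATE}} = \mathbb{E}\big[\,Y(1) - Y(0)\,\big]
```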

collection:causal-inference

❦ Influence Functions & Data Attribution

7 papers

Influence functions let you ask: "which training examples are responsible for this prediction?" In grid terms, they approximate what would happen if you removed or upweighted specific rows. The original Koh & Liang paper is the standard starting point; more recent work extends these ideas to LLMs.
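
The core approximation from Koh & Liang, where $\hat\theta$ is the trained model and $H_{\hat\theta}$ the Hessian of the empirical risk:

```latex
% Influence of upweighting training point z on the loss at z_test:
\mathcal{I}(z, z_{\text{test}})
  = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
    H_{\hat\theta}^{-1}\,
    \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat\theta)
```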

collection:influence-functions

❦ Machine Unlearning

3 papers

Unlearning asks: can you efficiently update a model as if a data point were never in the training set? In grid terms, this is moving from one row to another without retraining from scratch. SISA is a practical sharding approach; the survey covers the growing field.
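
A minimal sketch of the SISA idea, omitting the paper's within-shard slicing and checkpointing; `train_fn` is a hypothetical training routine and predictions would aggregate across the constituent models:

```python
import numpy as np

def sisa_train(data, num_shards, train_fn, rng=None):
    """Partition data into disjoint shards; train one model per shard."""
    rng = rng or np.random.default_rng()
    shards = np.array_split(rng.permutation(len(data)), num_shards)
    models = [train_fn([data[i] for i in shard]) for shard in shards]
    return shards, models

def sisa_unlearn(data, shards, models, remove_idx, train_fn):
    """To forget one point, retrain only the shard that contained it;
    every other constituent model is untouched."""
    for s, shard in enumerate(shards):
        if remove_idx in shard:
            shards[s] = shard[shard != remove_idx]
            models[s] = train_fn([data[i] for i in shards[s]])
            break
    return shards, models
```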

collection:machine-unlearning

❦ Model Collapse & Synthetic Data

2 papers

Model collapse occurs when generative models are trained on data produced by previous model generations. The original content distribution's tails disappear, leading to mode collapse and loss of diversity. This is increasingly important as AI-generated content proliferates on the web. Research explores both the phenomenon and potential mitigations like data accumulation and verification.
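
A toy simulation of the mechanism: refitting a Gaussian to its own samples each generation lets finite-sample estimation error erode the fitted variance, and with it the tails (exact numbers vary with the seed):

```python
import numpy as np

# Each generation "trains" on synthetic data from the previous generation's
# fit. The fitted standard deviation follows a downward-drifting random
# walk, so the distribution narrows over generations.
rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 1.0, 50
for gen in range(1, 101):
    synthetic = rng.normal(mu, sigma, size=n)        # sample from current model
    mu, sigma = synthetic.mean(), synthetic.std()    # refit on synthetic data
    if gen % 20 == 0:
        print(f"generation {gen:3d}: sigma = {sigma:.3f}")
```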

collection:model-collapse

❦ Privacy, Memorization & Unlearning

2 papers

Models can memorize training data—sometimes enough to extract it verbatim. Differential privacy limits how much any single point can affect the model (constraining movement in the grid). The Carlini et al. extraction paper is a good entry point for understanding what's at stake.
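
For reference, the (ε, δ)-differential-privacy guarantee that formalizes "no single point can matter much":

```latex
% For any two datasets D, D' differing in one record, and any set of
% outputs S of the randomized mechanism M:
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D') \in S\,] + \delta
```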

collection:privacy-memorization

❦ User-Generated Content & AI Training Data

9 papers

This collection examines the role of **user-generated content** in training and powering AI systems.

## Key Themes

- **UGC in Search**: How Wikipedia and other UGC improve search engine results
- **Training Data Value**: Quantifying the contribution of user content to AI models
- **Platform Dependencies**: How AI systems rely on crowdsourced knowledge
- **Content Creator Rights**: Implications for people who create the data AI learns from

## Related Collections

See also: [Data Leverage & Collective Action](./data-leverage) for research on how content creators can exercise power over AI systems.

collection:ugc-value

Library Registry

443 Total References
14 Collections
57 Catalogued
386 Uncatalogued