Data Leverage References

Paper Collections

Curated reading lists organized by research area. 57 of 443 references catalogued.

❦ Data Augmentation & Curriculum Learning

2 papers

Augmentation creates synthetic variations of training data—effectively generating new rows in the grid. Curriculum learning orders training examples from easy to hard. Mixup is a simple, influential technique; the survey covers the broader landscape.
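
A minimal mixup sketch, assuming NumPy arrays of inputs and one-hot labels; the Beta-distributed coefficient and in-batch pairing follow the original paper's recipe:

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Mix each example with a randomly chosen partner from the same batch.

    x: input array of shape (batch, ...)
    y: one-hot label array of shape (batch, num_classes)
    alpha: Beta parameter; mixup draws lambda ~ Beta(alpha, alpha)
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)           # interpolation coefficient
    perm = rng.permutation(len(x))         # random pairing within the batch
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]
    return x_mixed, y_mixed
```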

collection:data-augmentation

❦ Data Leverage & Collective Action

20 papers

This collection gathers foundational and recent work on **data leverage**—the strategic use of data withholding, contribution, or manipulation as a form of collective action.

## Core Concepts

- **Data Strikes**: Coordinated refusal to contribute data to platforms
- **Data Poisoning**: Intentionally corrupting training data to degrade model performance
- **Conscious Data Contribution**: Strategically directing data to preferred systems
- **Data Valuation**: Methods for quantifying individual data contributions (Shapley values, etc.)

## Why This Matters

As AI systems become more dependent on user-generated content and behavioral data, data creators gain potential leverage over technology companies. This research explores when and how such leverage can be effectively exercised.

collection:data-leverage

❦ Data Poisoning & Adversarial Training

4 papers

Poisoning asks: what if someone deliberately corrupts training data? This expands the grid dramatically—every possible perturbation creates new rows. BadNets showed how small triggers in images cause targeted misclassifications. Recent work looks at poisoning web-scale datasets and deceptive model behavior.
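
A toy illustration of a BadNets-style trigger; the patch size, location, poison rate, and target label are arbitrary illustrative choices, and images are assumed to be grayscale arrays of shape (N, H, W) with values in [0, 1]:

```python
import numpy as np

def poison(images, labels, target_label, rate=0.05, patch=3, rng=None):
    """Stamp a small white square (the trigger) onto a fraction of images
    and relabel them, so a model trained on the mix learns
    trigger -> target_label."""
    rng = rng or np.random.default_rng()
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    images[idx, -patch:, -patch:] = 1.0    # bottom-right corner trigger
    labels[idx] = target_label             # targeted misclassification
    return images, labels
```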

collection:data-poisoning

❦ Data Scaling Laws

3 papers

Scaling laws describe how performance changes as you add more data (or parameters, or compute). In grid terms, they're regressions over average performance across rows of different sizes. Kaplan et al. established the modern framework; Chinchilla refined it for compute-optimal training.
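
A sketch of the regression idea in its simplest single-variable form, fitting a pure power law L(D) = c · D^(−β) in log-log space; the (dataset size, loss) measurements here are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical (dataset_size, validation_loss) measurements -- illustrative only.
D = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
L = np.array([3.9, 3.4, 2.95, 2.6, 2.3])

# A pure power law L = c * D^(-beta) is linear in log-log space,
# so ordinary least squares recovers the exponent.
slope, intercept = np.polyfit(np.log(D), np.log(L), 1)
beta, c = -slope, np.exp(intercept)
print(f"fitted L(D) ~ {c:.2f} * D^(-{beta:.3f})")
```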

collection:scaling-laws

❦ Data Selection & Coresets

4 papers

Coresets are small subsets of the training data chosen so that training on them approximates training on the full dataset. In grid terms, you're looking for a much smaller row that lands in roughly the same performance region. CRAIG is a practical selection method; the DeepCore library collects and benchmarks many others.
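
Not CRAIG itself, but a k-center greedy baseline in the same spirit: repeatedly pick the point farthest from the current selection so the subset covers the feature (or per-example gradient) space:

```python
import numpy as np

def k_center_greedy(features, k, rng=None):
    """Greedy k-center coreset baseline.

    features: array of shape (n, d), e.g. embeddings or per-example gradients
    k: number of points to select
    """
    rng = rng or np.random.default_rng()
    selected = [int(rng.integers(len(features)))]     # arbitrary first center
    dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                    # farthest remaining point
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(features - features[nxt], axis=1))
    return selected
```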

collection:data-selection

❦ Data Valuation & Shapley

5 papers

Data Shapley and related methods assign a "value" to each training point by averaging its marginal contribution across many possible training sets. This is essentially a principled way to aggregate over the grid. These techniques matter for data markets, debugging, and understanding what data actually helps.
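
A sketch of the Monte Carlo estimator behind Data Shapley (Ghorbani & Zou's TMC-Shapley, minus the truncation heuristic, for brevity); `train_fn` and `score_fn` are hypothetical stand-ins for your training and validation pipeline, and `train_fn([])` is assumed to return a baseline model:

```python
import numpy as np

def monte_carlo_shapley(train_fn, score_fn, n, rounds=100, rng=None):
    """Estimate each point's Shapley value by averaging its marginal
    contribution to validation score over random orderings.

    train_fn(list_of_indices) -> model; score_fn(model) -> validation score.
    """
    rng = rng or np.random.default_rng()
    values = np.zeros(n)
    for _ in range(rounds):
        perm = rng.permutation(n)
        prev_score = score_fn(train_fn([]))            # empty-set baseline
        for i, idx in enumerate(perm):
            score = score_fn(train_fn(perm[: i + 1].tolist()))
            values[idx] += score - prev_score          # marginal contribution
            prev_score = score
    return values / rounds
```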

collection:data-valuation

❦ Experimental Design & Causal Inference

2 papers

The grid is fundamentally about counterfactuals, which connects to causal inference. These are foundational references—Pearl for structural causal models, Rubin for potential outcomes, Imbens & Rubin for practical methods. Useful background if you want to think carefully about what "what if" means.
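
In Rubin's potential-outcomes notation, the canonical "what if" quantity is the average treatment effect:

```latex
% Each unit has two potential outcomes, Y(1) under treatment and Y(0)
% under control; only one is ever observed, so the contrast is a
% counterfactual that must be estimated, not read off the data.
\tau_{\text{ATE}} = \mathbb{E}\big[\,Y(1) - Y(0)\,\big]
```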

collection:causal-inference

❦ Influence Functions & Data Attribution

7 papers

Influence functions let you ask: "which training examples are responsible for this prediction?" In grid terms, they approximate what would happen if you removed or upweighted specific rows. The original Koh & Liang paper is the standard starting point; more recent work extends these ideas to LLMs.
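
The core approximation from Koh & Liang, where $\hat\theta$ is the trained model and $H_{\hat\theta}$ the Hessian of the empirical risk:

```latex
% Influence of upweighting training point z on the loss at z_test:
\mathcal{I}(z, z_{\text{test}})
  = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
    H_{\hat\theta}^{-1}\,
    \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat\theta)
```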

collection:influence-functions

❦ Machine Unlearning

3 papers

Unlearning asks: can you efficiently update a model as if a data point were never in the training set? In grid terms, this is moving from one row to another without retraining from scratch. SISA is a practical sharding approach; the survey covers the growing field.
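
A minimal sketch of the SISA idea, omitting the paper's within-shard slicing and checkpointing; `train_fn` is a hypothetical training routine and predictions would aggregate across the constituent models:

```python
import numpy as np

def sisa_train(data, num_shards, train_fn, rng=None):
    """Partition data into disjoint shards; train one model per shard."""
    rng = rng or np.random.default_rng()
    shards = np.array_split(rng.permutation(len(data)), num_shards)
    models = [train_fn([data[i] for i in shard]) for shard in shards]
    return shards, models

def sisa_unlearn(data, shards, models, remove_idx, train_fn):
    """To forget one point, retrain only the shard that contained it;
    every other constituent model is untouched."""
    for s, shard in enumerate(shards):
        if remove_idx in shard:
            shards[s] = shard[shard != remove_idx]
            models[s] = train_fn([data[i] for i in shards[s]])
            break
    return shards, models
```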

collection:machine-unlearning

❦ Model Collapse & Synthetic Data

2 papers

Model collapse occurs when generative models are trained on data produced by previous model generations. The original content distribution's tails disappear, leading to mode collapse and loss of diversity. This is increasingly important as AI-generated content proliferates on the web. Research explores both the phenomenon and potential mitigations like data accumulation and verification.
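
A toy simulation of the mechanism: refitting a Gaussian to its own samples each generation lets finite-sample estimation error erode the fitted variance, and with it the tails (exact numbers vary with the seed):

```python
import numpy as np

# Each generation "trains" on synthetic data from the previous generation's
# fit. The fitted standard deviation follows a downward-drifting random
# walk, so the distribution narrows over generations.
rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 1.0, 50
for gen in range(1, 101):
    synthetic = rng.normal(mu, sigma, size=n)        # sample from current model
    mu, sigma = synthetic.mean(), synthetic.std()    # refit on synthetic data
    if gen % 20 == 0:
        print(f"generation {gen:3d}: sigma = {sigma:.3f}")
```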

collection:model-collapse

❦ Privacy, Memorization & Unlearning

2 papers

Models can memorize training data—sometimes enough to extract it verbatim. Differential privacy limits how much any single point can affect the model (constraining movement in the grid). The Carlini et al. extraction paper is a good entry point for understanding what's at stake.
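
For reference, the (ε, δ)-differential-privacy guarantee that formalizes "no single point can matter much":

```latex
% For any two datasets D, D' differing in one record, and any set of
% outputs S of the randomized mechanism M:
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D') \in S\,] + \delta
```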

collection:privacy-memorization

❦ User-Generated Content & AI Training Data

9 papers

This collection examines the role of **user-generated content** in training and powering AI systems.

## Key Themes

- **UGC in Search**: How Wikipedia and other UGC improve search engine results
- **Training Data Value**: Quantifying the contribution of user content to AI models
- **Platform Dependencies**: How AI systems rely on crowdsourced knowledge
- **Content Creator Rights**: Implications for people who create the data AI learns from

## Related Collections

See also: [Data Leverage & Collective Action](./data-leverage) for research on how content creators can exercise power over AI systems.

collection:ugc-value

Library Registry

443 Total References
14 Collections
57 Catalogued
386 Uncatalogued