Data Leverage References

Paper Collections

Curated reading lists organized by research area. 30 of 211 references catalogued.

❦ Collective Action & Data Leverage

12 papers

What happens when data creators coordinate? Data strikes and leverage campaigns are strategic moves to different rows in the grid—ones with worse performance for the model operator. This area connects technical data valuation work to questions about power, labor, and governance.

collection:data-leverage
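
A data strike is easy to simulate at small scale: withhold a coalition's rows, retrain, and measure the performance drop the model operator would see. The sketch below is a minimal version of that idea, assuming a synthetic dataset and a scikit-learn logistic regression as stand-ins; the coalition sizes and model choice are illustrative, not a method taken from any listed paper.

```python
# Minimal simulated data strike: a coalition withholds its rows, the model is
# retrained, and the held-out accuracy gap is a rough measure of the coalition's leverage.
# Data and model are synthetic stand-ins for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = (X @ rng.normal(size=20) + 0.5 * rng.normal(size=2000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

def accuracy_without(strike_mask):
    """Retrain after dropping the striking rows; score on the held-out set."""
    keep = ~strike_mask
    model = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep])
    return model.score(X_test, y_test)

baseline = accuracy_without(np.zeros(len(X_train), dtype=bool))
for fraction in (0.1, 0.3, 0.5, 0.7):
    strikers = rng.random(len(X_train)) < fraction   # coalition members chosen at random
    print(f"{fraction:.0%} strike: accuracy drop {baseline - accuracy_without(strikers):.4f}")
```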

❦ Data Augmentation & Curriculum Learning

2 papers

Augmentation creates synthetic variations of training data—effectively generating new rows in the grid. Curriculum learning orders training examples from easy to hard. Mixup is a simple, influential technique; the survey covers the broader landscape.

dc:data-augmentation
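
For the flavor of mixup itself, here is a minimal sketch assuming NumPy arrays and one-hot labels; the helper name mixup_batch, the Beta parameter, and the batch shapes are illustrative rather than taken from a specific implementation.

```python
# Minimal mixup: convex-combine pairs of examples and their one-hot labels,
# with the mixing coefficient drawn from Beta(alpha, alpha) as in the mixup paper.
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=np.random.default_rng()):
    lam = rng.beta(alpha, alpha)               # mixing coefficient
    perm = rng.permutation(len(x))             # pair each example with a random partner
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

x = np.random.rand(32, 3, 32, 32)              # e.g. a CIFAR-sized image batch
y = np.eye(10)[np.random.randint(0, 10, 32)]   # one-hot labels
x_mix, y_mix = mixup_batch(x, y)
```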

❦ Data Poisoning & Adversarial Training

3 papers

Poisoning asks: what if someone deliberately corrupts training data? This expands the grid dramatically—every possible perturbation creates new rows. BadNets showed how small triggers in images cause targeted misclassifications. Recent work looks at poisoning web-scale datasets and deceptive model behavior.

dc:data-poisoning
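
A BadNets-style backdoor is simple to sketch: stamp a small trigger patch onto a fraction of training images and relabel them to a target class, so the trained model misclassifies any input carrying the trigger. The arrays, trigger position, and poisoning rate below are illustrative stand-ins for a real dataset and attack.

```python
# BadNets-style poisoning sketch: add a small trigger patch to a fraction of
# training images and flip their labels to an attacker-chosen target class.
import numpy as np

def poison(images, labels, target_class=0, rate=0.05, rng=np.random.default_rng(0)):
    images, labels = images.copy(), labels.copy()
    n_poison = int(rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -3:, -3:] = 1.0        # 3x3 white square in the bottom-right corner
    labels[idx] = target_class         # mislabel the triggered examples
    return images, labels, idx

imgs = np.random.rand(1000, 28, 28)    # stand-in for a grayscale image dataset
lbls = np.random.randint(0, 10, 1000)
poisoned_imgs, poisoned_lbls, poisoned_idx = poison(imgs, lbls)
```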

❦ Data Selection & Coresets

2 papers

Coresets are small subsets that approximate training on the full dataset. In grid terms, you're looking for a much smaller row that lands in roughly the same performance region. CRAIG and DeepCore are practical methods for finding these subsets efficiently.

dc:data-selection
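
The sketch below is a simplified greedy selection in the spirit of CRAIG's facility-location objective: pick points so that every training example has a similar representative in the subset. Plain feature similarity with an RBF kernel stands in for the per-example gradient similarity the paper actually uses, so treat it as an illustration of the selection principle, not the method itself.

```python
# Simplified greedy coreset: maximize a facility-location objective so every
# point has a close representative among the selected indices.
import numpy as np

def greedy_coreset(features, k):
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-d2)                            # RBF similarity (assumed; always >= 0)
    coverage = np.zeros(len(features))           # each point's best similarity to the subset
    selected = []
    for _ in range(k):
        # gain[j] = how much adding candidate j improves total coverage
        gains = np.maximum(sim, coverage[:, None]).sum(axis=0) - coverage.sum()
        j = int(np.argmax(gains))
        selected.append(j)
        coverage = np.maximum(coverage, sim[:, j])
    return selected

X = np.random.rand(500, 10)
subset_idx = greedy_coreset(X, k=25)             # indices of a 25-point coreset
```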

❦ Data Valuation & Shapley

3 papers

Data Shapley and related methods assign a "value" to each training point by averaging its marginal contribution across many possible training sets. This is essentially a principled way to aggregate over the grid. These techniques matter for data markets, debugging, and understanding what data actually helps.

dc:data-valuation
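
A minimal Monte Carlo version of that averaging, in the spirit of Ghorbani & Zou's TMC-Shapley but without the truncation: sample random permutations of the training set, add points one at a time, and credit each point with its average marginal change in validation accuracy. The logistic-regression surrogate and the tiny synthetic dataset are assumptions made to keep the sketch runnable.

```python
# Monte Carlo Data Shapley sketch: a point's value is its average marginal
# contribution to validation accuracy over random orderings of the training set.
import numpy as np
from sklearn.linear_model import LogisticRegression

def monte_carlo_shapley(X_train, y_train, X_val, y_val,
                        n_permutations=50, rng=np.random.default_rng(0)):
    n = len(X_train)
    values = np.zeros(n)
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        prev_score = 0.5                          # accuracy of an uninformed baseline
        for k in range(1, n + 1):
            idx = perm[:k]
            if len(np.unique(y_train[idx])) < 2:  # need both classes to fit a classifier
                score = prev_score
            else:
                model = LogisticRegression(max_iter=200).fit(X_train[idx], y_train[idx])
                score = model.score(X_val, y_val)
            values[perm[k - 1]] += score - prev_score
            prev_score = score
    return values / n_permutations

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = (X[:, 0] > 0).astype(int)
shapley_values = monte_carlo_shapley(X[:30], y[:30], X[30:], y[30:])
```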

❦ Experimental Design & Causal Inference

2 papers

The grid is fundamentally about counterfactuals, which connects to causal inference. These are foundational references—Pearl for structural causal models, Rubin for potential outcomes, Imbens & Rubin for practical methods. Useful background if you want to think carefully about what "what if" means.

dc:causal-inference
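
A tiny potential-outcomes illustration of what that "what if" means: in simulation both potential outcomes are visible for every unit, so the randomized difference-in-means estimator can be checked against the true average treatment effect. The data-generating process below is invented purely for the example.

```python
# Potential-outcomes toy example: simulate Y(0) and Y(1) for every unit
# (possible only in simulation), then verify that randomized difference-in-means
# recovers the true average treatment effect.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
y0 = rng.normal(loc=0.0, size=n)                    # outcome if untreated
y1 = y0 + 2.0 + rng.normal(scale=0.5, size=n)       # outcome if treated (true ATE = 2.0)
treated = rng.random(n) < 0.5                       # random assignment
observed = np.where(treated, y1, y0)                # only one potential outcome is ever seen

true_ate = (y1 - y0).mean()
estimate = observed[treated].mean() - observed[~treated].mean()
print(f"true ATE {true_ate:.3f}, difference-in-means estimate {estimate:.3f}")
```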

❦ Fairness via Data Interventions

3 papers

Many fairness problems trace back to training data—who's included, how they're labeled, what's missing. Datasheets for Datasets is a practical starting point; Gender Shades demonstrates concrete harms. This collection mixes technical interventions with documentation practices.

dc:fairness-data
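
One concrete intervention point is evaluation itself: a Gender Shades-style audit reports accuracy per demographic group rather than one aggregate number, so gaps hidden by the average become visible. The groups, error rates, and predictions in the sketch below are synthetic placeholders, not results from any audited system.

```python
# Disaggregated evaluation sketch: per-group accuracy instead of a single
# aggregate, with a simulated model that errs more on the under-represented group.
import numpy as np

def per_group_accuracy(y_true, y_pred, groups):
    return {g: float((y_pred[groups == g] == y_true[groups == g]).mean())
            for g in np.unique(groups)}

rng = np.random.default_rng(0)
groups = rng.choice(["group_a", "group_b"], size=1000, p=[0.8, 0.2])
y_true = rng.integers(0, 2, size=1000)
flip_rate = np.where(groups == "group_b", 0.35, 0.05)   # higher error on the minority group
y_pred = np.where(rng.random(1000) < flip_rate, 1 - y_true, y_true)

print("overall accuracy:", float((y_pred == y_true).mean()))
print("per-group accuracy:", per_group_accuracy(y_true, y_pred, groups))
```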

❦ Influence Functions & Data Attribution

2 papers

Influence functions let you ask: "which training examples are responsible for this prediction?" In grid terms, they approximate what would happen if you removed or upweighted specific rows. The original Koh & Liang paper is the standard starting point; more recent work extends these ideas to LLMs.

dc:influence-functions
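
For a small convex model the Koh & Liang quantity can be computed exactly: the influence of a training point on a test loss is minus the test gradient times the inverse Hessian times the training gradient. The sketch below does this for a regularized logistic regression in NumPy; the dataset and regularization strength are assumptions, and work at scale replaces the explicit Hessian inverse with Hessian-vector-product approximations.

```python
# Influence-function sketch for logistic regression: for each training point z,
# influence = -grad_loss(z_test)^T  H^{-1}  grad_loss(z), computed exactly here
# because the model is tiny.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(220, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=220) > 0).astype(int)
X_train, y_train = X[:200], y[:200]
x_test, y_test = X[200], y[200]

lam = 1e-2  # L2 strength on the averaged loss (assumed; keeps the Hessian invertible)
clf = LogisticRegression(C=1.0 / (lam * len(X_train)), fit_intercept=False,
                         max_iter=1000).fit(X_train, y_train)
w = clf.coef_.ravel()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per-example training gradients and the regularized Hessian of the average loss.
p_train = sigmoid(X_train @ w)
grad_train = (p_train - y_train)[:, None] * X_train
hessian = (X_train.T * (p_train * (1 - p_train))) @ X_train / len(X_train) + lam * np.eye(5)

# Gradient of the loss at one test point, then the influence of every training point on it.
p_test = sigmoid(x_test @ w)
grad_test = (p_test - y_test) * x_test
influences = -grad_train @ np.linalg.solve(hessian, grad_test)

# Most negative influence: upweighting these points most reduces this test loss.
print("most helpful training points:", np.argsort(influences)[:5])
```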

❦ Machine Unlearning

1 paper

Unlearning asks: can you efficiently update a model as if a data point had never been in the training set? This is moving from one row to another without retraining from scratch. SISA is a practical sharding approach; the survey covers the growing field.

dc:machine-unlearning
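
A minimal sketch of the SISA idea: partition the training set into shards, train one model per shard, aggregate predictions by voting, and forget a point by retraining only the shard that contained it. The slicing within shards and checkpointing that the paper also uses are omitted, and the class name and scikit-learn base model are illustrative choices.

```python
# SISA-style unlearning sketch: sharded training with per-shard retraining on deletion.
import numpy as np
from sklearn.linear_model import LogisticRegression

class ShardedEnsemble:
    def __init__(self, n_shards=5, rng=np.random.default_rng(0)):
        self.n_shards = n_shards
        self.rng = rng

    def fit(self, X, y):
        self.X, self.y = X, y
        self.assignment = self.rng.integers(0, self.n_shards, size=len(X))
        self.models = [self._fit_shard(s) for s in range(self.n_shards)]
        return self

    def _fit_shard(self, shard):
        mask = self.assignment == shard
        return LogisticRegression(max_iter=1000).fit(self.X[mask], self.y[mask])

    def forget(self, index):
        """Remove one training point and retrain only its shard."""
        shard = self.assignment[index]
        self.assignment[index] = -1                      # -1 means: in no shard
        self.models[shard] = self._fit_shard(shard)

    def predict(self, X):
        votes = np.stack([m.predict(X) for m in self.models])
        return (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote for binary labels

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
y = (X[:, 0] > 0).astype(int)
ens = ShardedEnsemble().fit(X, y)
ens.forget(42)   # point 42 no longer influences any shard's model
```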

❦ Privacy, Memorization & Unlearning

2 papers

Models can memorize training data—sometimes enough to extract it verbatim. Differential privacy limits how much any single point can affect the model (constraining movement in the grid). The Carlini et al. extraction paper is a good entry point for understanding what's at stake.

dc:privacy-memorization
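
The DP-SGD step that enforces this limit is easy to sketch: clip each per-example gradient so no single point can contribute more than a fixed norm, then add Gaussian noise scaled to that bound before averaging. The clip norm and noise multiplier below are illustrative and are not calibrated to any particular (epsilon, delta) guarantee.

```python
# Core DP-SGD update sketch: per-example gradient clipping plus Gaussian noise.
import numpy as np

def private_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.1,
                     rng=np.random.default_rng(0)):
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    noise = rng.normal(scale=noise_multiplier * clip_norm,
                       size=per_example_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_example_grads)

grads = np.random.randn(64, 10)          # stand-in for per-example gradients in a batch
update = private_gradient(grads)         # noisy, clipped average used for the SGD step
```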

Library Registry

211 Total References
12 Collections
30 Catalogued
181 Uncatalogued