❦ Active Learning
2 papers · Active learning asks: given a budget for labeling, which points should you label next? This is navigation within the grid—picking which rows become available. Settles' survey is the classic reference; recent work connects to coresets and deep learning.
dc:active-learning
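A minimal sketch of one classic query strategy, uncertainty sampling: label the pool points the current model is least confident about. The scikit-learn model, toy data, and loop below are illustrative assumptions, not taken from the collected papers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(model, X_pool, batch_size=10):
    """Pick the pool points the model is least confident about."""
    probs = model.predict_proba(X_pool)          # shape: (n_pool, n_classes)
    confidence = probs.max(axis=1)               # probability of the top class
    return np.argsort(confidence)[:batch_size]   # least-confident indices

# toy loop: query a few labels per round, retrain, repeat
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

labeled = list(rng.choice(len(X), size=20, replace=False))
for _ in range(5):
    model = LogisticRegression().fit(X[labeled], y[labeled])
    pool = np.setdiff1d(np.arange(len(X)), labeled)
    picks = uncertainty_sampling(model, X[pool], batch_size=10)
    labeled.extend(pool[picks])                  # these rows get labeled next
```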
❦ Collective Action & Data Leverage
12 papers · What happens when data creators coordinate? Data strikes and leverage campaigns are strategic moves to different rows in the grid—ones with worse performance for the model operator. This area connects technical data valuation work to questions about power, labor, and governance.
collection:data-leverage
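As a rough illustration of the leverage idea, the sketch below simulates a "data strike": train with and without one group's rows and compare test accuracy. The data, group split, and model are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
y = (X[:, :2].sum(axis=1) + 0.3 * rng.normal(size=2000) > 0).astype(int)
group = rng.integers(0, 2, size=2000)             # 0 = strikers, 1 = everyone else

X_test, y_test = X[1500:], y[1500:]
X_train, y_train, g_train = X[:1500], y[:1500], group[:1500]

full = LogisticRegression().fit(X_train, y_train)          # all contributors
strike_mask = g_train != 0                                  # strikers withhold rows
strike = LogisticRegression().fit(X_train[strike_mask], y_train[strike_mask])

print("with all data:   ", full.score(X_test, y_test))
print("after the strike:", strike.score(X_test, y_test))
```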
❦ Data Augmentation & Curriculum Learning
2 papers · Augmentation creates synthetic variations of training data—effectively generating new rows in the grid. Curriculum learning orders training examples from easy to hard. Mixup is a simple, influential technique; the survey covers the broader landscape.
dc:data-augmentation
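A minimal sketch of mixup in NumPy: convex-combine random pairs of examples and their one-hot labels. It uses one mixing coefficient per batch, a common simplification; shapes and values are illustrative.

```python
import numpy as np

def mixup_batch(X, y_onehot, alpha=0.2, rng=None):
    """Mixup: convex-combine random pairs of examples and their labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)                 # mixing coefficient
    perm = rng.permutation(len(X))               # random partner for each row
    X_mix = lam * X + (1 - lam) * X[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return X_mix, y_mix

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))
y = np.eye(3)[rng.integers(0, 3, size=32)]       # one-hot labels
X_mix, y_mix = mixup_batch(X, y, alpha=0.2, rng=rng)
```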
❦ Data Poisoning & Adversarial Training
3 papers · Poisoning asks: what if someone deliberately corrupts training data? This expands the grid dramatically—every possible perturbation creates new rows. BadNets showed how small triggers in images cause targeted misclassifications. Recent work looks at poisoning web-scale datasets and deceptive model behavior.
dc:data-poisoning
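A toy sketch of a BadNets-style backdoor: stamp a small trigger patch on a fraction of training images and relabel them to the attacker's target class. Image shapes, trigger placement, and the poisoning rate here are illustrative assumptions.

```python
import numpy as np

def add_backdoor(images, labels, target_class=7, rate=0.05, rng=None):
    """Stamp a trigger patch on a fraction of images and relabel them."""
    rng = rng or np.random.default_rng()
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    images[idx, -3:, -3:] = 1.0        # 3x3 white corner patch = trigger
    labels[idx] = target_class         # targeted misclassification at test time
    return images, labels

# toy usage on fake 28x28 grayscale images
rng = np.random.default_rng(0)
X = rng.random((1000, 28, 28))
y = rng.integers(0, 10, size=1000)
X_poisoned, y_poisoned = add_backdoor(X, y, target_class=7, rate=0.05, rng=rng)
```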
❦ Data Scaling Laws
2 papers · Scaling laws describe how performance changes as you add more data (or parameters, or compute). In grid terms, they're regressions over average performance across rows of different sizes. Kaplan et al. established the modern framework; Chinchilla refined it for compute-optimal training.
dc:scaling-laws
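A minimal sketch of fitting the usual functional form, loss ≈ a * n^(-b) + c, to (dataset size, loss) pairs with SciPy. The loss numbers below are invented solely to make the fit runnable; they are not measurements from any paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Loss(n) ≈ a * n^(-b) + c: a power-law term plus an irreducible floor."""
    return a * n ** (-b) + c

# hypothetical (dataset size, validation loss) measurements
n_tokens = np.array([1e7, 3e7, 1e8, 3e8, 1e9, 3e9])
val_loss = np.array([4.1, 3.6, 3.2, 2.9, 2.7, 2.55])

params, _ = curve_fit(power_law, n_tokens, val_loss, p0=[10.0, 0.1, 2.0])
a, b, c = params
print(f"fitted exponent b ≈ {b:.3f}, irreducible loss c ≈ {c:.2f}")
```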
❦ Data Selection & Coresets
2 papers · Coresets are small subsets that approximate training on the full dataset. In grid terms, you're looking for a much smaller row that lands in roughly the same performance region. CRAIG and DeepCore are practical methods for finding these subsets efficiently.
dc:data-selection
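CRAIG itself solves a gradient-matching facility-location problem; as a simpler stand-in, here is a greedy k-center selection sketch, a common baseline for picking a representative subset. The data and subset size are illustrative.

```python
import numpy as np

def k_center_greedy(X, k, rng=None):
    """Greedily pick k points so every example is close to some selected point."""
    rng = rng or np.random.default_rng()
    selected = [int(rng.integers(len(X)))]                 # arbitrary first pick
    dists = np.linalg.norm(X - X[selected[0]], axis=1)     # dist to nearest pick
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))                        # farthest point so far
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)

# toy usage: keep 50 of 5000 points
X = np.random.default_rng(0).normal(size=(5000, 16))
coreset_idx = k_center_greedy(X, k=50)
```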
❦ Data Valuation & Shapley
3 papers · Data Shapley and related methods assign a "value" to each training point by averaging its marginal contribution across many possible training sets. This is essentially a principled way to aggregate over the grid. These techniques matter for data markets, debugging, and understanding what data actually helps.
dc:data-valuation
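A minimal Monte Carlo sketch of Data Shapley: average each point's marginal contribution to validation accuracy over random permutations (the truncation and convergence checks of the full method are omitted). The model, data, and permutation count are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def monte_carlo_shapley(X, y, X_val, y_val, n_perms=50, rng=None):
    """Estimate Data Shapley values by averaging marginal contributions
    to validation accuracy over random permutations of the training set."""
    rng = rng or np.random.default_rng()
    values = np.zeros(len(X))
    for _ in range(n_perms):
        perm = rng.permutation(len(X))
        prev_score = 0.0                        # score of the empty training set
        for i, idx in enumerate(perm):
            subset = perm[: i + 1]
            if len(np.unique(y[subset])) < 2:   # can't fit a classifier yet
                score = prev_score
            else:
                model = LogisticRegression().fit(X[subset], y[subset])
                score = model.score(X_val, y_val)
            values[idx] += score - prev_score   # marginal contribution
            prev_score = score
    return values / n_perms

# toy usage on a small synthetic problem
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] > 0).astype(int)
X_val = rng.normal(size=(100, 3))
y_val = (X_val[:, 0] > 0).astype(int)
shapley_values = monte_carlo_shapley(X, y, X_val, y_val, n_perms=20)
```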
❦ Experimental Design & Causal Inference
2 papers · The grid is fundamentally about counterfactuals, which connects to causal inference. These are foundational references—Pearl for structural causal models, Rubin for potential outcomes, Imbens & Rubin for practical methods. Useful background if you want to think carefully about what "what if" means.
dc:causal-inference
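For notation, a one-line statement of Rubin's potential-outcomes contrast that the grid's "what if" questions echo (standard textbook notation, not drawn from a specific paper):

```latex
% Each unit i has two potential outcomes, only one of which is ever observed.
% The average treatment effect is the counterfactual contrast
\tau = \mathbb{E}\big[\, Y_i(1) - Y_i(0) \,\big],
% which a randomized assignment lets us estimate from observed group means:
\hat{\tau} = \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}.
```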
❦ Fairness via Data Interventions
3 papers · Many fairness problems trace back to training data—who's included, how they're labeled, what's missing. Datasheets for Datasets is a practical starting point; Gender Shades demonstrates concrete harms. This collection mixes technical interventions with documentation practices.
dc:fairness-data
❦ Influence Functions & Data Attribution
2 papers · Influence functions let you ask: "which training examples are responsible for this prediction?" In grid terms, they approximate what would happen if you removed or upweighted specific rows. The original Koh & Liang paper is the standard starting point; more recent work extends these ideas to LLMs.
dc:influence-functions
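A minimal sketch of the Koh & Liang influence score for a tiny L2-regularized logistic regression, where the Hessian can be formed and inverted exactly (large models need the approximations discussed in the papers). Data and hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(theta, x, y, lam):
    """Gradient of the log loss for one example (y in {0, 1}), plus L2 term."""
    return (sigmoid(x @ theta) - y) * x + lam * theta

def hessian(theta, X, lam):
    """Hessian of the average training loss; lam keeps it invertible."""
    p = sigmoid(X @ theta)
    w = p * (1 - p)
    return (X * w[:, None]).T @ X / len(X) + lam * np.eye(X.shape[1])

def influence(theta, X, y, i, x_test, y_test, lam):
    """Predicted change in test loss from upweighting training point i:
    -grad L(z_test)^T  H^{-1}  grad L(z_i)."""
    h_inv = np.linalg.inv(hessian(theta, X, lam))
    return -grad_loss(theta, x_test, y_test, lam) @ h_inv @ grad_loss(theta, X[i], y[i], lam)

# toy usage: fit theta by plain gradient descent, then score one training point
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(float)
lam, theta = 1e-2, np.zeros(4)
for _ in range(500):
    p = sigmoid(X @ theta)
    theta -= 0.5 * (X.T @ (p - y) / len(X) + lam * theta)
score = influence(theta, X, y, i=0, x_test=X[1], y_test=y[1], lam=lam)
```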
❦ Machine Unlearning
1 paper · Unlearning asks: can you efficiently update a model as if a data point had never been in the training set? This is moving from one row to another without retraining from scratch. SISA is a practical sharding approach; the survey covers the growing field.
dc:machine-unlearning
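A minimal sketch of the SISA idea: shard the training data, train one model per shard, aggregate by voting, and retrain only the affected shard when a point must be forgotten. This omits SISA's slicing and checkpointing; the model and data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class SISAEnsemble:
    """Shard the data, train one model per shard, predict by majority vote.
    Unlearning a point only retrains the shard that held it."""

    def __init__(self, n_shards=5, rng=None):
        self.n_shards = n_shards
        self.rng = rng or np.random.default_rng()

    def fit(self, X, y):
        self.X, self.y = X, y
        self.assignment = self.rng.integers(0, self.n_shards, size=len(X))
        self.models = [self._fit_shard(s) for s in range(self.n_shards)]
        return self

    def _fit_shard(self, shard):
        mask = self.assignment == shard
        return LogisticRegression().fit(self.X[mask], self.y[mask])

    def unlearn(self, idx):
        """Forget training point idx: drop it and retrain only its shard."""
        shard = self.assignment[idx]
        keep = np.arange(len(self.X)) != idx
        self.X, self.y = self.X[keep], self.y[keep]
        self.assignment = self.assignment[keep]
        self.models[shard] = self._fit_shard(shard)

    def predict(self, X):
        votes = np.stack([m.predict(X) for m in self.models])
        return (votes.mean(axis=0) > 0.5).astype(int)   # binary majority vote

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)
ens = SISAEnsemble(n_shards=5, rng=rng).fit(X, y)
ens.unlearn(42)     # retrains one shard instead of everything
```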
❦ Privacy, Memorization & Unlearning
2 papers · Models can memorize training data—sometimes enough to extract it verbatim. Differential privacy limits how much any single point can affect the model (constraining movement in the grid). The Carlini et al. extraction paper is a good entry point for understanding what's at stake.
dc:privacy-memorization
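A minimal sketch of the DP-SGD mechanism that underlies most differentially private training: clip each example's gradient, average, and add Gaussian noise, so no single point can move the model too far. The accounting that turns this into a formal (ε, δ) guarantee is omitted; all values are illustrative.

```python
import numpy as np

def dp_sgd_step(theta, per_example_grads, clip_norm=1.0, noise_mult=1.0,
                lr=0.1, rng=None):
    """One DP-SGD-style update: clip each per-example gradient, sum, add
    Gaussian noise scaled to the clip norm, then average and step."""
    rng = rng or np.random.default_rng()
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    noise = rng.normal(0.0, noise_mult * clip_norm, size=theta.shape)
    noisy_mean = (np.sum(clipped, axis=0) + noise) / len(per_example_grads)
    return theta - lr * noisy_mean

# toy usage: fake per-example gradients for a 10-parameter model
rng = np.random.default_rng(0)
theta = np.zeros(10)
grads = rng.normal(size=(32, 10))
theta = dp_sgd_step(theta, grads, clip_norm=1.0, noise_mult=1.1, rng=rng)
```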