❦ Active Learning
2 papers
Active learning asks: given a budget for labeling, which points should you label next? This is navigation within the grid—picking which rows become available. Settles' survey is the classic reference; recent work connects to coresets and deep learning.
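A minimal sketch of the simplest acquisition strategy, uncertainty sampling (least confidence); the toy dataset, model, budget, and batch size below are invented for illustration, not taken from any paper in the collection.

```python
# A minimal uncertainty-sampling loop on toy data (least-confidence acquisition).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

labeled = list(rng.choice(len(X), size=20, replace=False))   # small labeled seed set
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]
budget, batch_size = 200, 20

while len(labeled) - 20 < budget:
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[unlabeled])
    uncertainty = 1.0 - probs.max(axis=1)                    # least confidence = most informative
    query = np.argsort(uncertainty)[-batch_size:]            # indices into `unlabeled`
    picked = [unlabeled[i] for i in query]
    labeled.extend(picked)                                   # spend the labeling budget on these points
    unlabeled = [i for i in unlabeled if i not in set(picked)]

model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
print(f"labeled {len(labeled)} points; accuracy on the remaining pool: "
      f"{model.score(X[unlabeled], y[unlabeled]):.3f}")
```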
collection:active-learning
❦ Data Augmentation & Curriculum Learning
2 papers
Augmentation creates synthetic variations of training data—effectively generating new rows in the grid. Curriculum learning orders training examples from easy to hard. Mixup is a simple, influential technique; the survey covers the broader landscape.
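A sketch of the core Mixup step: train on convex combinations of random example pairs and their labels. The batch shapes and the Beta parameter are illustrative choices, not prescriptions from the paper.

```python
# Mixup in a few lines: blend random pairs of inputs and their one-hot labels.
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, seed=0):
    """x: (batch, ...) inputs; y_onehot: (batch, n_classes) labels."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)                         # mixing coefficient ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))                       # partner example for each item in the batch
    x_mix = lam * x + (1 - lam) * x[perm]                # synthetic training inputs
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]  # matching soft labels
    return x_mix, y_mix

x = np.random.rand(8, 32, 32, 3)                          # toy image batch
y = np.eye(10)[np.random.randint(0, 10, size=8)]          # toy one-hot labels
x_mix, y_mix = mixup_batch(x, y)
```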
collection:data-augmentation
❦ Data Leverage & Collective Action
20 papers
This collection gathers foundational and recent work on **data leverage**—the strategic use of data withholding, contribution, or manipulation as a form of collective action.
## Core Concepts
- **Data Strikes**: Coordinated refusal to contribute data to platforms
- **Data Poisoning**: Intentionally corrupting training data to degrade model performance
- **Conscious Data Contribution**: Strategically directing data to preferred systems
- **Data Valuation**: Methods for quantifying individual data contributions (Shapley values, etc.)
## Why This Matters
As AI systems become more dependent on user-generated content and behavioral data, data creators gain potential leverage over technology companies. This research explores when and how such leverage can be effectively exercised.
collection:data-leverage
❦ Data Poisoning & Adversarial Training
4 papers
Poisoning asks: what if someone deliberately corrupts training data? This expands the grid dramatically—every possible perturbation creates new rows. BadNets showed that stamping a small trigger pattern onto a handful of training images, with their labels flipped, implants a backdoor that causes targeted misclassification whenever the trigger appears. Recent work looks at poisoning web-scale datasets and deceptive model behavior.
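A BadNets-style sketch under toy assumptions (random images, an arbitrary 3x3 corner trigger, target class 0, 5% poison rate): stamp the trigger on a small fraction of training images and relabel them.

```python
# BadNets-style poisoning sketch: a small pixel trigger plus a flipped label.
import numpy as np

def poison(images, labels, target_class=0, poison_frac=0.05, seed=0):
    """images: (n, h, w) arrays in [0, 1]; labels: (n,) integer classes."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(poison_frac * len(images)), replace=False)
    images[idx, -3:, -3:] = 1.0      # the trigger: a 3x3 white square in one corner
    labels[idx] = target_class       # attacker-chosen target label
    return images, labels

clean_x = np.random.rand(1000, 28, 28)
clean_y = np.random.randint(0, 10, size=1000)
poisoned_x, poisoned_y = poison(clean_x, clean_y)
# A model trained on (poisoned_x, poisoned_y) tends to map anything carrying the
# trigger to class 0 while behaving normally on clean inputs.
```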
collection:data-poisoning
❦ Data Scaling Laws
3 papers
Scaling laws describe how performance changes as you add more data (or parameters, or compute). In grid terms, they're regressions over average performance across rows of different sizes. Kaplan et al. established the modern framework; Chinchilla refined it for compute-optimal training.
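A sketch of the kind of curve fit involved, loss ≈ a * N**(-b) + c as a function of dataset size N; the measurements below are synthetic placeholders, not numbers from Kaplan et al. or Chinchilla.

```python
# Fit a saturating power law L(N) = a * N**(-b) + c to (dataset size, loss) pairs.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

# Synthetic stand-ins for "average performance across rows of different sizes".
sizes = np.array([1e4, 3e4, 1e5, 3e5, 1e6, 3e6])
losses = power_law(sizes, 50.0, 0.3, 1.8) + np.random.default_rng(0).normal(0, 0.02, len(sizes))

(a, b, c), _ = curve_fit(power_law, sizes, losses, p0=[10.0, 0.5, 1.0], maxfev=20000)
print(f"fitted exponent b = {b:.2f}, irreducible loss c = {c:.2f}")
print(f"extrapolated loss at N = 1e7: {power_law(1e7, a, b, c):.2f}")
```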
collection:scaling-laws
❦ Data Selection & Coresets
4 papers
Coresets are small subsets chosen so that training on them approximates training on the full dataset. In grid terms, you're looking for a much smaller row that lands in roughly the same performance region. CRAIG selects subsets by matching the full dataset's gradients; DeepCore provides a library and benchmark of selection methods.
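A generic geometric selection sketch (farthest-first traversal) to make the "small row that stands in for the big one" idea concrete; this is not CRAIG's gradient matching, and the data is random.

```python
# Farthest-first traversal: a classic geometric way to pick a small covering subset.
import numpy as np

def farthest_first(X, k, seed=0):
    """Greedily pick k points so every point is close to some selected point."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]              # start from a random point
    dist = np.linalg.norm(X - X[selected[0]], axis=1)   # distance to the nearest selected point
    for _ in range(k - 1):
        j = int(np.argmax(dist))                        # the least-covered point joins the coreset
        selected.append(j)
        dist = np.minimum(dist, np.linalg.norm(X - X[j], axis=1))
    return selected

X = np.random.default_rng(0).normal(size=(500, 16))
subset = farthest_first(X, k=25)    # a much smaller "row" intended to stand in for the full set
```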
collection:data-selection
❦ Data Valuation & Shapley
5 papers
Data Shapley and related methods assign a "value" to each training point by averaging its marginal contribution across many possible training sets. This is essentially a principled way to aggregate over the grid. These techniques matter for data markets, debugging, and understanding what data actually helps.
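A stripped-down Monte Carlo estimate of Data Shapley values via random permutations, without the truncation tricks real implementations use; the model, train/validation split, and permutation count are illustrative.

```python
# Monte Carlo Data Shapley: average each point's marginal contribution over random permutations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=120, n_features=10, random_state=0)
X_train, y_train, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

def utility(idx):
    """Validation accuracy after training on the subset idx (one 'row' of the grid)."""
    if len(set(y_train[idx])) < 2:                 # can't fit a classifier on fewer than 2 classes
        return max(np.mean(y_val == c) for c in (0, 1))
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    return model.score(X_val, y_val)

rng = np.random.default_rng(0)
n, n_perms = len(X_train), 20                      # more permutations -> lower-variance estimates
values = np.zeros(n)
for _ in range(n_perms):
    perm = rng.permutation(n)
    prev = utility([])                             # start from the empty training set
    for pos, i in enumerate(perm):
        cur = utility(perm[: pos + 1])
        values[i] += cur - prev                    # marginal contribution of point i
        prev = cur
values /= n_perms
print("highest-value training points:", np.argsort(values)[-5:])
```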
collection:data-valuation
❦ Experimental Design & Causal Inference
2 papers
The grid is fundamentally about counterfactuals, which connects to causal inference. These are foundational references—Pearl for structural causal models, Rubin for potential outcomes, Imbens & Rubin for practical methods. Useful background if you want to think carefully about what "what if" means.
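A small worked example of the potential-outcomes "what if": with randomized treatment, a difference in means recovers the average treatment effect. The simulated outcomes and effect size are made up.

```python
# Potential outcomes in miniature: each unit has Y(0) and Y(1), but we observe only one.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
y0 = rng.normal(0.0, 1.0, n)            # outcome if untreated
y1 = y0 + 2.0                            # outcome if treated (true effect = 2.0)

treated = rng.random(n) < 0.5            # randomized assignment
observed = np.where(treated, y1, y0)     # the factual outcome; the counterfactual stays hidden

ate_hat = observed[treated].mean() - observed[~treated].mean()
print(f"true ATE = 2.0, difference-in-means estimate = {ate_hat:.3f}")
```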
collection:causal-inference
❦ Fairness via Data Interventions
2 papers
Many fairness problems trace back to training data—who's included, how they're labeled, what's missing. Datasheets for Datasets is a practical starting point; Gender Shades demonstrates concrete harms. This collection mixes technical interventions with documentation practices.
collection:fairness-data
❦ Influence Functions & Data Attribution
7 papers
Influence functions let you ask: "which training examples are responsible for this prediction?" In grid terms, they estimate the hop to a neighboring row (the same training set with one point removed or upweighted) without actually retraining. The original Koh & Liang paper is the standard starting point; more recent work extends these ideas to LLMs.
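A sketch of the Koh & Liang approximation for a model small and convex enough to form the Hessian exactly (L2-regularized logistic regression); the toy data and the choice of "test" point are illustrative.

```python
# Influence-function sketch for L2-regularized logistic regression:
# I(z, z_test) = -grad L(z_test)^T  H^{-1}  grad L(z), with an exact Hessian.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
lam = 0.1                                              # L2 strength -> strongly convex objective
clf = LogisticRegression(C=1.0 / (lam * len(X)), fit_intercept=False, max_iter=1000).fit(X, y)
theta = clf.coef_.ravel()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(x, label):
    """Gradient of the per-example log loss (labels in {0, 1}) at theta."""
    return (sigmoid(x @ theta) - label) * x

p = sigmoid(X @ theta)
H = (X.T * (p * (1 - p))) @ X / len(X) + lam * np.eye(X.shape[1])   # Hessian of the mean objective

x_test, y_test = X[0], y[0]                            # stand-in for a held-out test point
influences = np.array([
    -grad(x_test, y_test) @ np.linalg.solve(H, grad(X[i], y[i]))
    for i in range(len(X))
])
print("training points whose upweighting most hurts this prediction:", np.argsort(influences)[-3:])
```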
collection:influence-functions
❦ Machine Unlearning
3 papers
Unlearning asks: can you efficiently update a model as if a data point was never in the training set? This is moving from one row to another without retraining from scratch. SISA is a practical sharding approach; the survey covers the growing field.
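A sketch of the shard-level SISA idea: train one model per shard, aggregate predictions, and on a deletion request retrain only the shard that held the point. The shard count, model, and data are placeholders, and the slice-level checkpointing of the full method is omitted.

```python
# SISA-style unlearning sketch: shard the data, train one model per shard, vote at inference;
# deleting a point only retrains the shard that contained it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
n_shards = 5
shards = np.array_split(np.random.default_rng(0).permutation(len(X)), n_shards)

def train_shard(idx):
    return LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

models = [train_shard(idx) for idx in shards]

def predict(x):
    votes = [int(m.predict(x.reshape(1, -1))[0]) for m in models]   # simple majority vote
    return np.bincount(votes).argmax()

def unlearn(point_id):
    """Drop one training point and retrain only the shard it lived in."""
    for s, idx in enumerate(shards):
        if point_id in idx:
            shards[s] = idx[idx != point_id]
            models[s] = train_shard(shards[s])       # the other shards are untouched
            return

unlearn(point_id=42)
print("prediction after unlearning:", predict(X[0]))
```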
collection:machine-unlearning
❦ Model Collapse & Synthetic Data
2 papers
Model collapse occurs when generative models are trained on data produced by previous model generations. The tails of the original content distribution disappear first, and successive generations converge toward a narrower, less diverse distribution. This is increasingly important as AI-generated content proliferates on the web. Research explores both the phenomenon and potential mitigations like data accumulation and verification.
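A one-dimensional caricature of the dynamic, not any paper's experimental setup: each generation fits a Gaussian to samples drawn from the previous generation's fit, and the fitted spread drifts toward zero. The sample size and generation count are arbitrary.

```python
# A one-dimensional caricature of model collapse: each generation fits a Gaussian to
# samples from the previous generation's fit. Estimation error compounds and the spread
# drifts toward zero, so the tails vanish first.
import numpy as np

rng = np.random.default_rng(0)
n = 50                                             # small per-generation "training set"
data = rng.normal(loc=0.0, scale=1.0, size=n)      # generation 0: real data

for gen in range(1, 301):
    mu, sigma = data.mean(), data.std()            # "train" the generative model
    data = rng.normal(mu, sigma, size=n)           # the next generation sees only synthetic data
    if gen % 50 == 0:
        print(f"generation {gen:3d}: fitted std = {sigma:.4f}")
```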
collection:model-collapse
❦ Privacy, Memorization & Unlearning
2 papers
Models can memorize training data—sometimes enough to extract it verbatim. Differential privacy limits how much any single point can affect the model (constraining movement in the grid). The Carlini et al. extraction paper is a good entry point for understanding what's at stake.
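A sketch of the DP-SGD step that enforces this constraint: clip each per-example gradient, then add Gaussian noise. The clipping norm, noise multiplier, and linear model are illustrative, and the privacy accounting a real implementation needs is omitted.

```python
# DP-SGD in miniature: clip each per-example gradient, add Gaussian noise, then average.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
true_theta = rng.normal(size=10)
y = X @ true_theta + rng.normal(0, 0.1, size=256)            # toy linear-regression data

theta = np.zeros(10)
clip_norm, noise_multiplier, lr, batch_size = 1.0, 1.1, 0.1, 32

for step in range(300):
    batch = rng.choice(len(X), size=batch_size, replace=False)
    residual = X[batch] @ theta - y[batch]
    grads = 2 * residual[:, None] * X[batch]                 # per-example gradients of squared error
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))  # bound each point's pull
    noisy_sum = grads.sum(axis=0) + rng.normal(0, noise_multiplier * clip_norm, size=theta.shape)
    theta -= lr * noisy_sum / batch_size                     # noise calibrated to the clipping norm

print(f"distance to the true parameters: {np.linalg.norm(theta - true_theta):.3f}")
```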
collection:privacy-memorization
❦ User-Generated Content & AI Training Data
9 papers
This collection examines the role of **user-generated content** in training and powering AI systems.
## Key Themes
- **UGC in Search**: How Wikipedia and other UGC improve search engine results
- **Training Data Value**: Quantifying the contribution of user content to AI models
- **Platform Dependencies**: How AI systems rely on crowdsourced knowledge
- **Content Creator Rights**: Implications for people who create the data AI learns from
## Related Collections
See also: [Data Leverage & Collective Action](./data-leverage) for research on how content creators can exercise power over AI systems.
collection:ugc-value