Tag: needs-review (16 references)
Shapley value-based data valuation for machine learning data markets
Proposes G-Value to bridge the gap between leave-one-out (LOO) and Shapley value approaches for data valuation. Addresses practical applications in machine learning data markets.
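The two baselines it bridges are simple to state: leave-one-out measures the utility drop when a single point is removed, while Data Shapley averages that marginal contribution over random orderings of the dataset. A minimal sketch of both baselines (the toy utility function and Monte Carlo estimator below are illustrative assumptions, not G-Value itself):

```python
import random

def loo_values(points, utility):
    """Leave-one-out value: drop in utility when point i is removed."""
    full = utility(points)
    return {i: full - utility(points - {i}) for i in points}

def shapley_values(points, utility, num_permutations=200, seed=0):
    """Monte Carlo Data Shapley: average marginal contribution of each
    point over random orderings of the dataset."""
    rng = random.Random(seed)
    order = list(points)
    values = {i: 0.0 for i in order}
    for _ in range(num_permutations):
        rng.shuffle(order)
        coalition, prev = set(), utility(set())
        for i in order:
            coalition.add(i)
            cur = utility(coalition)
            values[i] += (cur - prev) / num_permutations
            prev = cur
    return values

# Toy utility with diminishing returns; in practice this would be the
# validation performance of a model retrained on the subset.
utility = lambda s: len(s) ** 0.5
pts = set(range(5))
print(loo_values(pts, utility))
print(shapley_values(pts, utility))
```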
Rethinking machine unlearning for large language models
Comprehensive review of machine unlearning in LLMs, which aims to eliminate the influence of undesirable data (e.g., sensitive or illegal information) while preserving essential knowledge generation. Envisions LLM unlearning as a pivotal element of life-cycle management for developing safe, secure, trustworthy, and resource-efficient generative AI.
Distributional Training Data Attribution: What do Influence Functions Sample?
Introduces distributional training data attribution (d-TDA), which predicts how the distribution of model outputs depends upon the dataset. Shows that influence functions are "secretly distributional": they emerge from this framework as the limit of unrolled differentiation, without requiring restrictive convexity assumptions.
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations
Official NIST taxonomy and terminology for adversarial machine learning. Covers data poisoning attacks applicable to all learning paradigms, model poisoning attacks in federated learning, and supply-chain attacks. Provides guidance for defense strategies.
The Economics of AI Training Data: A Research Agenda
Research agenda documenting AI training data deals from 2020 to 2025. Reveals persistent market fragmentation and five distinct pricing mechanisms (from per-unit licensing to commissioning), and finds that most deals exclude original creators from compensation: only 7 of 24 major deals examined compensate them.
Data-centric Artificial Intelligence: A Survey
Comprehensive survey on data-centric AI, providing a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and representative methods. Covers the paradigm shift from model refinement to prioritizing data quality.
Revisiting Data Attribution for Influence Functions
Comprehensive review of influence functions for data attribution, examining how individual training examples influence model predictions. Covers techniques for model debugging, data curation, bias detection, and identification of mislabeled or adversarial data points.
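For orientation, the first-order approximation such methods build on estimates the effect on a test loss of upweighting a single training point; a standard statement (notation mine, not taken from the paper) is:

```latex
% Influence of training point z on the loss at z_test, with \hat{\theta} the
% empirical risk minimizer and H the Hessian of the mean training loss:
\mathcal{I}(z, z_{\mathrm{test}})
  = -\,\nabla_{\theta} L(z_{\mathrm{test}}, \hat{\theta})^{\top}
      H_{\hat{\theta}}^{-1}
      \nabla_{\theta} L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta}).
```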
CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning
Proposes the CHG (compound of Hardness and Gradient) utility function to approximate the utility of each data subset, reducing the computational cost to that of a single model retraining and achieving a quadratic improvement over existing Data Shapley methods.
A Versatile Influence Function for Data Attribution with Non-Decomposable Loss
Proposes the Versatile Influence Function (VIF), designed to fully leverage auto-differentiation and eliminate case-specific derivations. Demonstrated on Cox regression for survival analysis, node embedding for network analysis, and listwise learning-to-rank, with estimates closely matching leave-one-out retraining while being up to 10^3 times faster.
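To make "influence via auto-differentiation" concrete, here is a minimal sketch on a decomposable toy loss (ridge regression), assuming PyTorch; it illustrates the generic recipe of obtaining every derivative from autodiff rather than hand derivation, not the VIF estimator or its non-decomposable settings:

```python
import torch

torch.manual_seed(0)

# Toy ridge-regression problem so the Hessian is small and well conditioned.
X = torch.randn(50, 3)
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(50)
x_test, y_test = torch.randn(3), torch.tensor(0.7)

def loss_fn(theta, x, t):
    return ((x @ theta - t) ** 2).mean()

def train_loss(theta):
    return loss_fn(theta, X, y) + 1e-2 * (theta ** 2).sum()

# Fit the parameters (plain gradient descent stands in for full training).
theta = torch.zeros(3, requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    train_loss(theta).backward()
    opt.step()
theta_hat = theta.detach()

# Influence of each training point z_i on the test loss,
#   I(z_i) ~= -grad L(z_test)^T  H^{-1}  grad L(z_i),
# with gradients and the Hessian supplied entirely by autodiff.
H = torch.autograd.functional.hessian(train_loss, theta_hat)
g_test = torch.autograd.functional.jacobian(
    lambda t: loss_fn(t, x_test, y_test), theta_hat)
h_inv_g = torch.linalg.solve(H, g_test)

influences = [
    -(h_inv_g @ torch.autograd.functional.jacobian(
        lambda t: loss_fn(t, X[i], y[i]), theta_hat)).item()
    for i in range(len(X))
]
print(influences[:5])  # per-point influence estimates on the test loss
```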
Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
Studies whether model collapse is inevitable. Finds that collapse occurs when real data is replaced with synthetic data at each generation, but that accumulating synthetic data alongside the original real data keeps models stable across model sizes and modalities. Suggests data accumulation rather than replacement as a solution.
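A toy version of the replace-vs-accumulate comparison, under the strong simplifying assumption that the "model" is just a fitted Gaussian (my illustration, not the paper's experimental setup): the fitted spread withers when each generation trains only on the previous generation's samples, but stays roughly stable when samples are accumulated alongside the original real data.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=50)  # small sample so the effect shows up quickly

def run(generations=200, accumulate=False):
    data = real.copy()
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()         # "train" a Gaussian model
        synthetic = rng.normal(mu, sigma, size=50)  # generate synthetic data
        data = np.concatenate([data, synthetic]) if accumulate else synthetic
    return data.std()

print("replace each generation:", round(run(accumulate=False), 3))
print("accumulate across generations:", round(run(accumulate=True), 3))
```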
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
Large-scale audit of over 1,800 text AI datasets analyzing trends, permissions of use, and global representation. Found frequent miscategorization of licenses on dataset hosting sites, with license omission rates above 70% and error rates above 50%. Released the Data Provenance Explorer tool for practitioners.
Influence Functions for Scalable Data Attribution in Diffusion Models
Develops influence function frameworks for diffusion models to address data attribution and interpretability challenges. Predicts how model output would change if training data were removed, showing how previously proposed methods can be interpreted as particular design choices in this framework.
AI models collapse when trained on recursively generated data
Landmark study showing that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which the tails of the original content distribution disappear. Model collapse is a degenerative learning process in which models forget improbable events over time. Demonstrates this across LLMs, VAEs, and GMMs.
LLM Unlearning via Loss Adjustment with Only Forget Data
FLAT is a loss-adjustment approach that maximizes the f-divergence between the available template answer and the forget answer with respect to the forget data. Demonstrates superior unlearning performance compared to existing methods while minimizing the impact on retained capabilities, evaluated on the Harry Potter dataset and the MUSE benchmark.
Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration
Enhances training data attribution methods for large language models (including LLaMA2, QWEN2, and Mistral) by accounting for fitting error in the attribution process.
Position Paper: Data-Centric AI in the Age of Large Language Models
Position paper identifying four specific scenarios centered around data for LLMs, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.