Data Leverage References


Tag: needs-review (16 references)

Shapley value-based data valuation for machine learning data markets 2025 article

Proposes G-Value to bridge the gap between leave-one-out (LOO) and Shapley value approaches for data valuation. Addresses practical applications in machine learning data markets.
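
As a rough illustration of the two valuation schemes the paper bridges (not its G-Value method itself), the hedged Python sketch below scores one training point both ways: a leave-one-out marginal contribution and a Monte Carlo Shapley estimate. The dataset, model, and utility function are placeholder assumptions.

    # Toy leave-one-out (LOO) vs. Monte Carlo Shapley valuation of a single
    # training point. Dataset, model, and utility are illustrative assumptions;
    # the paper's G-Value is not reproduced here.
    import random

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=60, random_state=0)
    train_idx, val_idx = list(range(40)), list(range(40, 60))

    def utility(subset):
        """Validation accuracy of a model trained on the given subset of points."""
        if len(set(y[subset])) < 2:  # degenerate subset: cannot fit a classifier
            return 0.0
        model = LogisticRegression(max_iter=1000).fit(X[subset], y[subset])
        return model.score(X[val_idx], y[val_idx])

    target = 0  # index of the training point being valued

    # Leave-one-out: marginal contribution to the full training set.
    loo_value = utility(train_idx) - utility([i for i in train_idx if i != target])

    # Monte Carlo Shapley: average marginal contribution over random orderings.
    rng = random.Random(0)
    contribs = []
    for _ in range(200):
        perm = train_idx[:]
        rng.shuffle(perm)
        prefix = perm[:perm.index(target)]  # points that precede the target
        contribs.append(utility(prefix + [target]) - utility(prefix))
    shapley_value = sum(contribs) / len(contribs)

    print(f"LOO value: {loo_value:.4f}, Shapley estimate: {shapley_value:.4f}")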

Rethinking machine unlearning for large language models 2025 article

Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu

Comprehensive review of machine unlearning in LLMs, aiming to eliminate undesirable data influence (sensitive or illegal information) while maintaining essential knowledge generation. Envisions LLM unlearning as a pivotal element in life-cycle management for developing safe, secure, trustworthy, and resource-efficient generative AI.

Distributional Training Data Attribution: What do Influence Functions Sample? 2025 article

Bruno Mlodozeniec, Isaac Reid, Sam Power, David Krueger, Murat Erdogdu, Richard E. Turner, Roger Grosse

Introduces distributional training data attribution (d-TDA), which predicts how the distribution of model outputs depends upon the dataset. Shows that influence functions are "secretly distributional": they emerge from this framework as the limit of unrolled differentiation, without requiring restrictive convexity assumptions.

Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations 2025 techreport

NIST

Official NIST taxonomy and terminology for adversarial machine learning. Covers data poisoning attacks applicable to all learning paradigms, model poisoning attacks in federated learning, and supply-chain attacks. Provides guidance for defense strategies.

The Economics of AI Training Data: A Research Agenda 2025 article

Hamidah Oderinwale, Anna Kazlauskas

Research agenda documenting AI training data deals from 2020 to 2025. Reveals persistent market fragmentation and five distinct pricing mechanisms (from per-unit licensing to commissioning), and finds that most deals exclude original creators from compensation: only 7 of 24 major deals compensate them.

Data-centric Artificial Intelligence: A Survey 2025 article

Daochen Zha, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, Xia Hu

Comprehensive survey on data-centric AI, providing a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and representative methods. Covers the paradigm shift from model refinement to prioritizing data quality.

Revisiting Data Attribution for Influence Functions 2025 article

Hongbo Zhu, Angelo Cangelosi

Comprehensive review of influence functions for data attribution, examining how individual training examples influence model predictions. Covers techniques for model debugging, data curation, bias detection, and identification of mislabeled or adversarial data points.
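
For orientation, the sketch below applies the classic influence-function estimate that such reviews cover, influence(z, z_test) ≈ grad L(z_test)ᵀ H⁻¹ grad L(z), to a small L2-regularized logistic regression. The data, regularization strength, and choice of "test" point are illustrative assumptions, not the paper's setup.

    # Classic influence-function scores for L2-regularized logistic regression.
    # Approximates how the loss on a test point changes when a training point is
    # upweighted (sign conventions vary in the literature). Data and
    # hyperparameters are illustrative assumptions.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    lam = 0.1  # L2 regularization strength on the averaged objective

    clf = LogisticRegression(C=1.0 / (lam * len(X)), fit_intercept=False,
                             max_iter=1000).fit(X, y)
    theta = clf.coef_.ravel()

    def grad_loss(x, label, theta):
        """Gradient of the logistic loss at a single example."""
        p = 1.0 / (1.0 + np.exp(-x @ theta))
        return (p - label) * x

    # Hessian of the averaged objective (mean log-loss + (lam/2)||theta||^2).
    P = 1.0 / (1.0 + np.exp(-X @ theta))
    H = (X.T * (P * (1.0 - P))) @ X / len(X) + lam * np.eye(X.shape[1])

    x_test, y_test = X[0], y[0]  # treat example 0 as a held-out test point
    g_test = grad_loss(x_test, y_test, theta)

    # Influence score of every training point on the test loss.
    influences = np.array([
        g_test @ np.linalg.solve(H, grad_loss(X[i], y[i], theta)) / len(X)
        for i in range(len(X))
    ])
    print("Most influential training points:", np.argsort(-np.abs(influences))[:5])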

CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning 2024 article

Huaiguang Cai

Proposes the CHG (compound of Hardness and Gradient) utility function to approximate the utility of each data subset, reducing the computation to a single model retraining, a quadratic improvement over existing Data Shapley methods.

A Versatile Influence Function for Data Attribution with Non-Decomposable Loss 2024 article

Junwei Deng, Weijing Tang, Jiaqi W. Ma

Proposes Versatile Influence Function (VIF) designed to fully leverage auto-differentiation, eliminating case-specific derivations. Demonstrated across Cox regression for survival analysis, node embedding for network analysis, and listwise learning-to-rank, with estimates closely resembling leave-one-out retraining while being up to 10^3 times faster.

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data 2024 article

Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo

Studies whether model collapse is inevitable. Finds that collapse occurs when real data are replaced with synthetic data at each generation, but that when synthetic data are accumulated alongside the original real data, models remain stable across model sizes and modalities. Suggests data accumulation, rather than replacement, as a way to avoid collapse.
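
A toy version of the replace-versus-accumulate comparison, using repeated Gaussian fitting rather than the paper's LLMs, VAEs, or GMMs, is sketched below; the distribution, per-generation sample size, and generation count are illustrative assumptions chosen to make the drift visible.

    # Toy model-collapse illustration: repeatedly fit a Gaussian to data, then
    # sample "synthetic" data from the fit for the next generation. "replace"
    # discards earlier data each round; "accumulate" keeps real + all synthetic
    # data. All settings here are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    real = rng.normal(loc=0.0, scale=1.0, size=100)

    def run(mode, generations=30, n_per_gen=100):
        data = real.copy()
        variances = []
        for _ in range(generations):
            mu, sigma = data.mean(), data.std()                 # "train" the model
            synthetic = rng.normal(mu, sigma, size=n_per_gen)   # sample from it
            if mode == "replace":
                data = synthetic                                # forget earlier data
            else:
                data = np.concatenate([data, synthetic])        # accumulate alongside
            variances.append(data.var())
        return variances

    # Under "replace" the fitted variance drifts and tends to shrink over
    # generations; under "accumulate" it stays close to the real data's variance.
    print("replace   :", [round(v, 3) for v in run("replace")[::10]])
    print("accumulate:", [round(v, 3) for v in run("accumulate")[::10]])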

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI 2024 article

Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker

Large-scale audit of over 1,800 text AI datasets analyzing trends, use permissions, and global representation. Found frequent miscategorization of licenses on dataset hosting sites, with license omission rates of more than 70% and error rates of more than 50%. Released the Data Provenance Explorer tool for practitioners.

Influence Functions for Scalable Data Attribution in Diffusion Models 2024 article

Bruno Mlodozeniec, Runa Eschenhagen, Juhan Bae, Alexander Immer, David Krueger, Richard Turner

Develops an influence-function framework for diffusion models to address data attribution and interpretability challenges. Predicts how model outputs would change if particular training data were removed, and shows how previously proposed methods can be interpreted as specific design choices within this framework.

AI models collapse when trained on recursively generated data 2024 article

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, Yarin Gal

Landmark study showing that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which the tails of the original content distribution disappear. Model collapse is a degenerative learning process in which models progressively forget improbable events. Demonstrates the effect across LLMs, VAEs, and GMMs.

LLM Unlearning via Loss Adjustment with Only Forget Data 2024 inproceedings

Yaxuan Wang, Jiaheng Wei, Chris Yuhao Liu, Jinlong Pang, Quan Liu, Ankit Parag Shah, Yujia Bao, Yang Liu, Wei Wei

Proposes FLAT, a loss adjustment approach that maximizes the f-divergence between the available template answer and the forget answer with respect to the forget data. Demonstrates superior unlearning performance compared to existing methods while minimizing the impact on retained capabilities, evaluated on the Harry Potter dataset and the MUSE benchmark.

Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration 2024 inproceedings

Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng

Enhances training data attribution methods for large language models, including LLaMA2, Qwen2, and Mistral, by taking fitting error into account in the attribution process.

Position Paper: Data-Centric AI in the Age of Large Language Models 2024 inproceedings

Xinyi Xu, Zhaoxuan Wu, Rui Qiao, Arun Verma, Yao Shu, Jingtan Wang, Xinyuan Niu, Zhenfeng He, Jiangwei Chen, Zijian Zhou, Gregory Kang Ruey Lau, Hieu Dao, Lucas Agussurja, Rachael Hwee Ling Sim, Xiaoqiang Lin, Wenyang Hu, Zhongxiang Dai, Pang Wei Koh, Bryan Kian Hsiang Low

Position paper identifying four specific scenarios centered around data for LLMs, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.