Data Leverage References


Tag: data-governance (82 references)

If open source is to win, it must go public 2025 article

Joshua Tan, Nicholas Vincent, Katherine Elkins, Magnus Sahlgren

ai-society data-governance

arXiv preprint arXiv:2507.09296

Data-centric Artificial Intelligence: A Survey 2025 article

Daochen Zha, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, Xia Hu

Comprehensive survey on data-centric AI, providing a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and representative methods. Covers the paradigm shift from model refinement to prioritizing data quality.

data-governance data-centric-ai format:survey data-quality status:needs-review

ACM Computing Surveys

Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions using AI Detection Tool 2024 article

Arslan Akram

data-governance data-infrastructure

arXiv preprint arXiv:2403.13812

What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions 2024 article

Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura

data-governance ml-methods data-attribution data-valuation interpretability language-models

arXiv preprint arXiv:2405.13954

ANSI/NISO Z39.96-2024, JATS: Journal Article Tag Suite 2024 techreport

NISO

data-governance data-infrastructure


Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing 2024 article

Isaac Johnson, Lucie-Aimée Kaffee, Miriam Redi

ai-society data-governance content-ecosystems data-infrastructure

arXiv preprint arXiv:2410.08918

Data Flywheel Go Brrr: Using Your Users to Build Better Products 2024 misc

Jason Liu

Explore how data flywheels leverage user feedback to enhance product development and achieve business success with AI.

data-governance data-valuation


Consent in Crisis: The Rapid Decline of the AI Data Commons 2024 article

Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Shamiso Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad A. Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kushagra Tiwary, Lester James V. Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Minh Chien Vu, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Alex Pentland

data-governance data-infrastructure

NeurIPS

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI 2024 article

Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker

Large-scale audit of over 1,800 text AI datasets analyzing trends, permissions of use and global representation. Found frequent miscategorization of licences on dataset hosting sites, with licence omission rates of more than 70% and error rates of more than 50%. Released the Data Provenance Explorer tool for practitioners.

data-governance legal-policy ml-methods data-provenance dataset-audit licensing foundational status:needs-review

Nature Machine Intelligence

StarCoder 2 and The Stack v2: The Next Generation 2024 misc

Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries

data-governance ml-methods data-infrastructure language-models


Releasing Re-LAION-5B: transparent iteration on LAION-5B with additional safety fixes 2024 misc

LAION

data-governance data-infrastructure


What is a Data Flywheel? A Guide to Sustainable Business Growth 2024 misc

Adam Roche, Yali Sassoon

data-governance data-valuation blog

Snowplow Blog

The data addition dilemma 2024 article

Judy Hanwen Shen, Inioluwa Deborah Raji, Irene Y. Chen

data-governance data-valuation

MLHC 2024

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research 2024 article

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo

data-governance ml-methods data-infrastructure language-models training-dynamics

arXiv preprint arXiv:2402.00159

Position Paper: Data-Centric AI in the Age of Large Language Models 2024 inproceedings

Xinyi Xu, Zhaoxuan Wu, Rui Qiao, Arun Verma, Yao Shu, Jingtan Wang, Xinyuan Niu, Zhenfeng He, Jiangwei Chen, Zijian Zhou, Gregory Kang Ruey Lau, Hieu Dao, Lucas Agussurja, Rachael Hwee Ling Sim, Xiaoqiang Lin, Wenyang Hu, Zhongxiang Dai, Pang Wei Koh, Bryan Kian Hsiang Low

Position paper identifying four specific scenarios centered around data for LLMs, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.

data-governance data-centric-ai language-models position-paper status:needs-review

EMNLP Findings

Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia 2023 article

Raul Castro Fernandez

data-governance data-valuation

Proceedings of the ACM on Management of Data

Algorithmic Collective Action in Machine Learning 2023 inproceedings

Moritz Hardt, Eric Mazumdar, Celestine Mendler-Dünner, Tijana Zrnic

Provides a theoretical framework for algorithmic collective action, showing that small collectives can exert significant control over platform learning algorithms through coordinated data strategies.

data-governance data-labor data-selection ml-methods

International Conference on Machine Learning (ICML)

TRAK: Attributing Model Behavior at Scale 2023 inproceedings

Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, Aleksander Madry

Introduces TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale models by leveraging random projections.

data-attribution data-governance interpretability ml-methods training-dynamics

International Conference on Machine Learning (ICML)

Common Crawl — Web-scale Data for Research 2022 misc

Common Crawl

data-governance legal-policy data-infrastructure scraping-law


Datamodels: Predicting Predictions from Training Data 2022 inproceedings

Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, Aleksander Madry

Proposes datamodels that predict model outputs as a function of training data subsets, providing a framework for understanding data attribution through retraining experiments.

privacy ai-safety ai-society data-attribution data-governance fairness unlearning

International Conference on Machine Learning (ICML)

Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning 2022 inproceedings

Yongchan Kwon, James Zou

Generalizes Data Shapley using Beta weighting functions, providing noise-reduced data valuation that better handles outliers and improves detection of mislabeled data.

data-attribution data-augmentation data-governance ml-methods

International Conference on Artificial Intelligence and Statistics (AISTATS)

LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets 2022 misc

LAION

data-governance data-infrastructure


LAION-5B: An Open Large-Scale Dataset for Training Next CLIP Models 2022 inproceedings

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, Jenia Jitsev

data-governance data-infrastructure

Proceedings of NeurIPS Datasets and Benchmarks

The Stack: A Permissively Licensed Source Code Dataset 2022 misc

BigCode Project

ml-methods benchmark copyright data-governance data-infrastructure legal-policy

Dataset documentation

Beta Shapley: A Unified and Noise-Reduced Data Valuation Framework for Machine Learning 2021 article

Yongchan Kwon, James Zou

data-governance data-valuation

arXiv preprint arXiv:2110.14049

What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus 2021 inproceedings

Alexandra Sasha Luccioni, Joseph D. Viviano

data-governance legal-policy data-infrastructure scraping-law

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

The Pile: An 800GB Dataset of Diverse Text for Language Modeling 2021 article

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy

data-governance ml-methods data-infrastructure language-models

CoRR

Quantifying the Invisible Labor in Crowd Work 2021 article

Carlos Toxtli, Siddharth Suri, Saiph Savage

data-governance data-labor

Proceedings of the ACM on Human-Computer Interaction

Can "Conscious Data Contribution" Help Users to Exert "Data Leverage" Against Technology Companies? 2021 article

Nicholas Vincent, Brent Hecht

conscious-data-contribution data-governance data-labor

Proceedings of the ACM on Human-Computer Interaction

Data Leverage: A Framework for Empowering the Public in its Relationship with Technology Companies 2021 inproceedings

Nicholas Vincent, Hanlin Li, Nicole Tilly, Stevie Chancellor, Brent Hecht

data-governance data-labor

Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency

Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning 2020 inproceedings

Emily Jo, Timnit Gebru

ai-society data-governance ml-methods data-infrastructure fairness language-models

Proceedings of FAccT

Exploring Research Interest in Stack Overflow – A Systematic Mapping Study and Quality Evaluation 2020 article

Sarah Meldrum, Sherlock A. Licorish, Bastin Tony Roy Savarimuthu

ml-methods ai-society data-governance content-ecosystems data-infrastructure benchmark

arXiv preprint arXiv:2010.12282

Estimating Training Data Influence by Tracing Gradient Descent 2020 inproceedings

Garima Pruthi, Frederick Liu, Mukund Sundararajan, Satyen Kale

Introduces TracIn, which computes the influence of training examples by tracing how test loss changes during training. Uses a first-order gradient approximation and saved checkpoints for scalability.

data-attribution data-governance interpretability ml-methods training-dynamics

Advances in Neural Information Processing Systems (NeurIPS)

The pushshift reddit dataset 2020 article

Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, Jeremy Blackburn

data-governance data-infrastructure

Proceedings of the international AAAI conference on web and social media

Excavating AI: The Politics of Images in Machine Learning Training Sets 2019 misc

Kate Crawford, Trevor Paglen

ai-society data-governance ml-methods data-infrastructure fairness language-models


HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips 2019 article

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

data-governance data-infrastructure

CoRR

Towards Efficient Data Valuation Based on the Shapley Value 2019 inproceedings

Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gurel, Bo Li, Ce Zhang, Dawn Song, Costas J. Spanos

data-governance data-attribution data-valuation shapley-value

International Conference on Artificial Intelligence and Statistics

On the Accuracy of Influence Functions for Measuring Group Effects 2019 inproceedings

Pang Wei Koh, Kai-Siang Ang, Hubert H. K. Teo, Percy Liang

data-governance ml-methods data-attribution interpretability

Advances in Neural Information Processing Systems

Mapping the Potential and Pitfalls of "Data Dividends" as a Means of Sharing the Profits of Artificial Intelligence 2019 article

Nicholas Vincent, Yichun Li, Renee Zha, Brent Hecht

data-governance data-valuation

arXiv preprint arXiv:1912.00757

"Data Strikes": Evaluating the Effectiveness of a New Form of Collective Action Against Technology Companies 2019 inproceedings

Nicholas Vincent, Brent Hecht, Shilad Sen

Simulates data strikes against recommender systems, showing that collective withholding of training data can create leverage for users against technology platforms.

ai-economics ai-society data-governance data-labor

The World Wide Web Conference (WWW)

Should We Treat Data as Labor? Moving Beyond 'Free' 2018 article

Imanol Arrieta-Ibarra, Leonard Goff, Diego Jimenez-Hernandez, Jaron Lanier, E. Glen Weyl

data-governance data-labor

AEA Papers and Proceedings

Datasheets for Datasets 2018 inproceedings

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, Kate Crawford

data-governance data-infrastructure

arXiv:1803.09010

The WARC Format 1.1 2017 misc

International Internet Preservation Consortium

data-governance data-infrastructure


Understanding Black-box Predictions via Influence Functions 2017 inproceedings

Pang Wei Koh, Percy Liang

Uses influence functions from robust statistics to trace model predictions back to training data, identifying training points most responsible for a given prediction.

data-attribution data-governance data-selection interpretability ml-methods foundational

Proceedings of the 34th International Conference on Machine Learning (ICML)

The Future of Crowd Work 2013 inproceedings

Aniket Kittur, Jeffrey V. Nickerson, Michael Bernstein, Elizabeth Gerber, Aaron Shaw, John Zimmerman, Matt Lease, John Horton

data-governance data-labor

Proceedings of the 2013 Conference on Computer Supported Cooperative Work

Social Dilemmas: The Anatomy of Cooperation 1998 article

Peter Kollock

The study of social dilemmas is the study of the tension between individual and collective rationality. In a social dilemma, individually reasonable behavior leads to a situation in which everyone is worse off. The first part of this review is a discussion of categories of social dilemmas and how they are modeled. The key two-person social dilemmas (Prisoner’s Dilemma, Assurance, Chicken) and multiple-person social dilemmas (public goods dilemmas and commons dilemmas) are examined. The second part is an extended treatment of possible solutions for social dilemmas. These solutions are organized into three broad categories based on whether the solutions assume egoistic actors and whether the structure of the situation can be changed: Motivational solutions assume actors are not completely egoistic and so give some weight to the outcomes of their partners. Strategic solutions assume egoistic actors, and neither of these categories of solutions involve changing the fundamental structure of the situation. Solutions that do involve changing the rules of the game are considered in the section on structural solutions. I conclude the review with a discussion of current research and directions for future work.

ai-society data-governance data-labor

Annual Review of Sociology

The critical mass in collective action 1993 book

Gerald Marwell, Pamela Oliver

ai-society data-governance data-labor

Cambridge University Press

arXiv Bulk Data Access misc

arXiv.org

data-governance data-infrastructure


arXiv OAI-PMH Interface misc

arXiv.org

data-governance data-infrastructure


arXiv API User's Manual misc

arXiv.org

data-governance data-infrastructure


C4 Generator Code misc

TensorFlow Datasets

data-governance data-infrastructure


Common Crawl – Get Started misc

Common Crawl

data-governance legal-policy data-infrastructure scraping-law


Web Archiving File Formats Explained misc

Common Crawl

data-governance data-infrastructure


GSM8K Hugging Face Dataset Card misc

OpenAI

data-governance ml-methods data-infrastructure language-models benchmark


HowTo100M Project misc

École Normale Supérieure

data-governance data-infrastructure


Journal Article Tag Suite misc

NLM

data-governance data-infrastructure


JSON Lines Specification misc

jsonlines.org

data-governance data-infrastructure


WARC, Web ARChive file format misc

Library of Congress

data-governance data-infrastructure


Competition Math Dataset on Hugging Face misc

Dan Hendrycks

data-governance ml-methods data-infrastructure language-models benchmark


NDJSON Specification misc

ndjson

data-governance data-infrastructure


OpenAssistant OASST1 Dataset Card misc

OpenAssistant

data-governance data-infrastructure


OpenAI API Reference – Chat Completions misc

OpenAI

data-governance data-infrastructure language-models


Apache Parquet Project misc

Apache Software Foundation

data-governance data-infrastructure


Project Gutenberg File Formats misc

Project Gutenberg

data-governance data-infrastructure


Project Gutenberg Offline Catalogs and Feeds misc

Project Gutenberg

data-governance data-infrastructure


Pushshift.io misc

Pushshift

data-governance data-infrastructure


Reddit API Documentation misc

Reddit

data-governance data-infrastructure


Reddit Data API Wiki misc

Reddit Help

data-governance data-infrastructure


Stack Exchange Data Explorer Help misc

Stack Exchange

ai-society data-governance content-ecosystems data-infrastructure


Why is the Stack Exchange Data Dump only available in XML? misc

Meta Stack Exchange

ai-society data-governance content-ecosystems data-infrastructure


The Stack dataset on Hugging Face misc

BigCode Project

data-governance data-infrastructure


The Stack v2 dataset on Hugging Face misc

BigCode Project

data-governance data-infrastructure


BigCode Project Documentation misc

BigCode Project

ai-society data-governance content-ecosystems data-infrastructure


TFRecord and tf.train.Example Tutorial misc

TensorFlow

data-governance data-infrastructure


C4 dataset in TensorFlow Datasets misc

TensorFlow Datasets

data-governance data-infrastructure


Data Leverage & Collective Action paper_collection

ai-safety ai-society data-governance ml-methods adversarial ai-economics content-ecosystems data-attribution data-labor data-valuation training-dynamics foundational

Data Valuation & Shapley paper_collection

data-governance ai-society ml-methods data-attribution data-valuation ai-economics

Influence Functions & Data Attribution paper_collection

data-governance ml-methods data-attribution training-dynamics

Fairness via Data Interventions paper_collection

ai-safety ai-society data-governance ml-methods

Machine Unlearning paper_collection

ml-methods ai-safety data-governance training-dynamics

Privacy, Memorization & Unlearning paper_collection

ai-safety ml-methods data-governance training-dynamics

User-Generated Content & AI Training Data paper_collection

ai-society data-governance ml-methods content-ecosystems data-labor training-dynamics