Tag: data-governance (82 references)
If open source is to win, it must go public
Data-centric Artificial Intelligence: A Survey
Comprehensive survey on data-centric AI, providing a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and representative methods. Covers the paradigm shift from model refinement to prioritizing data quality.
Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions using AI Detection Tool
What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
ANSI/NISO Z39.96-2024, JATS: Journal Article Tag Suite
Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing
Data Flywheel Go Brrr: Using Your Users to Build Better Products
Explore how data flywheels leverage user feedback to enhance product development and achieve business success with AI.
Consent in Crisis: The Rapid Decline of the AI Data Commons
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
Large-scale audit of over 1,800 text AI datasets analyzing trends, permissions of use and global representation. Found frequent miscategorization of licences on dataset hosting sites, with licence omission rates of more than 70% and error rates of more than 50%. Released the Data Provenance Explorer tool for practitioners.
StarCoder 2 and The Stack v2: The Next Generation
Releasing Re-LAION-5B: transparent iteration on LAION-5B with additional safety fixes
What is a Data Flywheel? A Guide to Sustainable Business Growth
The data addition dilemma
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Position Paper: Data-Centric AI in the Age of Large Language Models
Position paper identifying four specific scenarios centered around data for LLMs, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia
Algorithmic Collective Action in Machine Learning
Provides a theoretical framework for algorithmic collective action, showing that small collectives can exert significant control over platform learning algorithms through coordinated data strategies.
TRAK: Attributing Model Behavior at Scale
Introduces TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale models by leveraging random projections.
Common Crawl — Web-scale Data for Research
Datamodels: Predicting Predictions from Training Data
Proposes datamodels that predict model outputs as a function of training data subsets, providing a framework for understanding data attribution through retraining experiments.
Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning
Generalizes Data Shapley using Beta weighting functions, providing noise-reduced data valuation that is more robust to outliers and better at detecting mislabeled data.
LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets
LAION-5B: An Open Large-Scale Dataset for Training Next CLIP Models
The Stack: A Permissively Licensed Source Code Dataset
What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Quantifying the Invisible Labor in Crowd Work
Can "Conscious Data Contribution" Help Users to Exert "Data Leverage" Against Technology Companies?
Data Leverage: A Framework for Empowering the Public in its Relationship with Technology Companies
Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
Exploring Research Interest in Stack Overflow – A Systematic Mapping Study and Quality Evaluation
Estimating Training Data Influence by Tracing Gradient Descent
Introduces TracIn, which computes the influence of a training example on a prediction by tracing how the test loss changes each time that example is used during training. Uses a first-order gradient approximation and saved checkpoints for scalability.
The Pushshift Reddit Dataset
Excavating AI: The Politics of Images in Machine Learning Training Sets
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Towards Efficient Data Valuation Based on the Shapley Value
On the Accuracy of Influence Functions for Measuring Group Effects
Mapping the Potential and Pitfalls of "Data Dividends" as a Means of Sharing the Profits of Artificial Intelligence
"Data Strikes": Evaluating the Effectiveness of a New Form of Collective Action Against Technology Companies
Simulates data strikes against recommender systems, showing that collective withholding of training data can create leverage for users against technology platforms.
Should We Treat Data as Labor? Moving Beyond 'Free'
Datasheets for Datasets
The WARC Format 1.1
Understanding Black-box Predictions via Influence Functions
Uses influence functions from robust statistics to trace model predictions back to training data, identifying training points most responsible for a given prediction.
The Future of Crowd Work
Social Dilemmas: The Anatomy of Cooperation
The study of social dilemmas is the study of the tension between individual and collective rationality. In a social dilemma, individually reasonable behavior leads to a situation in which everyone is worse off. The first part of this review is a discussion of categories of social dilemmas and how they are modeled. The key two-person social dilemmas (Prisoner’s Dilemma, Assurance, Chicken) and multiple-person social dilemmas (public goods dilemmas and commons dilemmas) are examined. The second part is an extended treatment of possible solutions for social dilemmas. These solutions are organized into three broad categories based on whether the solutions assume egoistic actors and whether the structure of the situation can be changed: Motivational solutions assume actors are not completely egoistic and so give some weight to the outcomes of their partners. Strategic solutions assume egoistic actors, and neither of these categories of solutions involve changing the fundamental structure of the situation. Solutions that do involve changing the rules of the game are considered in the section on structural solutions. I conclude the review with a discussion of current research and directions for future work.