The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
Authors
Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker
Venue
Nature Machine Intelligence
Abstract
Large-scale audit of over 1,800 text AI datasets analyzing trends, permissions of use and global representation. Found frequent miscategorization of licences on dataset hosting sites, with licence omission rates of more than 70% and error rates of more than 50%. Released the Data Provenance Explorer tool for practitioners.
BibTeX
Local Entry
@article{longpre2024dataprovenance,
title = {The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing \& Attribution in AI},
author = {Shayne Longpre and Robert Mahari and Anthony Chen and Naana Obeng-Marnu and Damien Sileo and William Brannon and Niklas Muennighoff and Nathan Khazam and Jad Kabbara and Kartik Perisetla and Xinyi Wu and Enrico Shippole and Kurt Bollacker and Tongshuang Wu and Luis Villa and Sandy Pentland and Sara Hooker},
year = {2024},
journal = {Nature Machine Intelligence},
url = {https://www.nature.com/articles/s42256-024-00878-8},
abstract = {Large-scale audit of over 1,800 text AI datasets analyzing trends, permissions of use and global representation. Found frequent miscategorization of licences on dataset hosting sites, with licence omission rates of more than 70% and error rates of more than 50%. Released the Data Provenance Explorer tool for practitioners.}
}
External Source
Not found in external databases.