Tag: data-infrastructure (48 references)

Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions using AI Detection Tool 2024 article

Arslan Akram

data-governance data-infrastructure

arXiv preprint arXiv:2403.13812

ANSI/NISO Z39.96-2024, JATS: Journal Article Tag Suite 2024 techreport

NISO

data-governance data-infrastructure


Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing 2024 article

Isaac Johnson, Lucie-Aimée Kaffee, Miriam Redi

ai-society data-governance content-ecosystems data-infrastructure

arXiv preprint arXiv:2410.08918

Consent in Crisis: The Rapid Decline of the AI Data Commons 2024 article

Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Shamiso Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad A. Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kushagra Tiwary, Lester James V. Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Minh Chien Vu, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Alex Pentland

data-governance data-infrastructure

NeurIPS

StarCoder 2 and The Stack v2: The Next Generation 2024 misc

Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries

data-governance ml-methods data-infrastructure language-models


Releasing Re-LAION-5B: transparent iteration on LAION-5B with additional safety fixes 2024 misc

LAION

data-governance data-infrastructure


Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research 2024 article

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo

data-governance ml-methods data-infrastructure language-models training-dynamics

arXiv preprint arXiv:2402.00159

Common Crawl — Web-scale Data for Research 2022 misc

Common Crawl

data-governance legal-policy data-infrastructure scraping-law


LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets 2022 misc

LAION

data-governance data-infrastructure


LAION-5B: An Open Large-Scale Dataset for Training Next CLIP Models 2022 inproceedings

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, Jenia Jitsev

data-governance data-infrastructure

Proceedings of NeurIPS Datasets and Benchmarks

The Stack: A Permissively Licensed Source Code Dataset 2022 misc

BigCode Project

ml-methods benchmark copyright data-governance data-infrastructure legal-policy

Dataset documentation

What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus 2021 inproceedings

Alexandra Sasha Luccioni, Joseph D. Viviano

data-governance legal-policy data-infrastructure scraping-law

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

The Pile: An 800GB Dataset of Diverse Text for Language Modeling 2021 article

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy

data-governance ml-methods data-infrastructure language-models

CoRR

Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning 2020 inproceedings

Emily Jo, Timnit Gebru

ai-society data-governance ml-methods data-infrastructure fairness language-models

Proceedings of FAccT

Exploring Research Interest in Stack Overflow – A Systematic Mapping Study and Quality Evaluation 2020 article

Sarah Meldrum, Sherlock A. Licorish, Bastin Tony Roy Savarimuthu

ml-methods ai-society data-governance content-ecosystems data-infrastructure benchmark

arXiv preprint arXiv:2010.12282

The pushshift reddit dataset 2020 article

Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, Jeremy Blackburn

data-governance data-infrastructure

Proceedings of the International AAAI Conference on Web and Social Media

Excavating AI: The Politics of Images in Machine Learning Training Sets 2019 misc

Kate Crawford, Trevor Paglen

ai-society data-governance ml-methods data-infrastructure fairness language-models


HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips 2019 article

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

data-governance data-infrastructure

CoRR

Datasheets for Datasets 2018 inproceedings

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, Kate Crawford

data-governance data-infrastructure

arXiv:1803.09010

The WARC Format 1.1 2017 misc

International Internet Preservation Consortium

data-governance data-infrastructure


arXiv Bulk Data Access misc

arXiv.org

data-governance data-infrastructure


arXiv OAI-PMH Interface misc

arXiv.org

data-governance data-infrastructure


arXiv API User's Manual misc

arXiv.org

data-governance data-infrastructure


C4 Generator Code misc

TensorFlow Datasets

data-governance data-infrastructure


Common Crawl – Get Started misc

Common Crawl

data-governance legal-policy data-infrastructure scraping-law


Web Archiving File Formats Explained misc

Common Crawl

data-governance data-infrastructure


GSM8K Hugging Face Dataset Card misc

OpenAI

data-governance ml-methods data-infrastructure language-models benchmark


HowTo100M Project misc

École Normale Supérieure

data-governance data-infrastructure


Journal Article Tag Suite misc

NLM

data-governance data-infrastructure


JSON Lines Specification misc

jsonlines.org

data-governance data-infrastructure


WARC, Web ARChive file format misc

Library of Congress

data-governance data-infrastructure


Competition Math Dataset on Hugging Face misc

Dan Hendrycks

data-governance ml-methods data-infrastructure language-models benchmark


NDJSON Specification misc

ndjson

data-governance data-infrastructure


OpenAssistant OASST1 Dataset Card misc

OpenAssistant

data-governance data-infrastructure


OpenAI API Reference – Chat Completions misc

OpenAI

data-governance data-infrastructure language-models


Apache Parquet Project misc

Apache Software Foundation

data-governance data-infrastructure


Project Gutenberg File Formats misc

Project Gutenberg

data-governance data-infrastructure


Project Gutenberg Offline Catalogs and Feeds misc

Project Gutenberg

data-governance data-infrastructure


Pushshift.io misc

Pushshift

data-governance data-infrastructure


Reddit API Documentation misc

Reddit

data-governance data-infrastructure


Reddit Data API Wiki misc

Reddit Help

data-governance data-infrastructure


Stack Exchange Data Explorer Help misc

Stack Exchange

ai-society data-governance content-ecosystems data-infrastructure


Why is the Stack Exchange Data Dump only available in XML? misc

Meta Stack Exchange

ai-society data-governance content-ecosystems data-infrastructure


The Stack dataset on Hugging Face misc

BigCode Project

data-governance data-infrastructure


The Stack v2 dataset on Hugging Face misc

BigCode Project

data-governance data-infrastructure


BigCode Project Documentation misc

BigCode Project

ai-society data-governance content-ecosystems data-infrastructure


TFRecord and tf.train.Example Tutorial misc

TensorFlow

data-governance data-infrastructure


C4 dataset in TensorFlow Datasets misc

TensorFlow Datasets

data-governance data-infrastructure
