Data Leverage References

Tag: data-infrastructure (49 references)

Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions using AI Detection Tool 2024 article

Arslan Akram

ANSI/NISO Z39.96-2024, JATS: Journal Article Tag Suite 2024 techreport
Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing 2024 article

Johnson, Isaac, Kaffee, Lucie-Aimée, Redi, Miriam

StarCoder 2 and The Stack v2: The Next Generation 2024 misc

Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries

Consent in Crisis: The Rapid Decline of the AI Data Commons 2024 article

Longpre, Shayne, Mahari, Robert, Lee, Ariel, Lund, Campbell, Oderinwale, Hamidah, Brannon, William, Saxena, Nayan, Obeng-Marnu, Naana, Sud, Tobin, Gupta, Sameer, Muennighoff, Niklas, and others

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research 2024 article

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo

Common Crawl — Web-scale Data for Research 2022 misc
LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets 2022 misc
Releasing Re-LAION-5B 2022 misc
LAION-5B: An Open Large-Scale Dataset for Training Next CLIP Models 2022 inproceedings

Schuhmann, Christoph, Beaumont, Romain, Vencu, Richard, Gordon, Cade, Wightman, Ross, Cherti, Mehdi, Coombes, Theo, Katta, Aarush, Mullis, Clayton, Wortsman, Mitchell, Schramowski, Patrick, Kundurthy, Srivatsa, Crowson, Katherine, Schmidt, Ludwig, Kaczmarczyk, Robert, Jitsev, Jenia

The Stack: A Permissively Licensed Source Code Dataset 2022 misc
What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus 2021 inproceedings

Alexandra Sasha Luccioni, Joseph D. Viviano

The Pile: An 800GB Dataset of Diverse Text for Language Modeling 2021 article

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy

Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning 2020 inproceedings
Exploring Research Interest in Stack Overflow – A Systematic Mapping Study and Quality Evaluation 2020 article

Sarah Meldrum, Sherlock A. Licorish, Bastin Tony Roy Savarimuthu

The pushshift reddit dataset 2020 article

Baumgartner, Jason, Zannettou, Savvas, Keegan, Brian, Squire, Megan, Blackburn, Jeremy

Excavating AI: The Politics of Images in Machine Learning Training Sets 2019 misc
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips 2019 article

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science 2018 article

Bender, Emily M., Friedman, Batya

Datasheets for Datasets 2018 inproceedings

Gebru, Timnit, Morgenstern, Jamie, Vecchione, Briana, Vaughan, Jennifer Wortman, Wallach, Hanna, Daumé III, Hal, Crawford, Kate

The WARC Format 1.1 2017 misc

International Internet Preservation Consortium

arXiv API User’s Manual misc
arXiv OAI-PMH Interface misc
arXiv Bulk Data Access misc
C4 Generator Code misc
Web Archiving File Formats Explained misc
Common Crawl – Get Started misc
GSM8K Hugging Face Dataset Card misc
HowTo100M Project misc
Journal Article Tag Suite misc
JSON Lines Specification misc
WARC, Web ARChive file format misc
Competition Math Dataset on Hugging Face misc
NDJSON Specification misc
OpenAssistant OASST1 Dataset Card misc
Apache Parquet Project misc
OpenAI API Reference – Chat misc
Project Gutenberg Offline Catalogs and Feeds misc
Project Gutenberg File Formats misc
Pushshift.io misc
Reddit API Documentation misc
Reddit Data API Wiki misc
Stack Exchange Data Explorer Help misc
Why is the Stack Exchange Data Dump only available in XML? misc
BigCode Project Documentation misc
The Stack dataset on Hugging Face misc
The Stack v2 dataset on Hugging Face misc
C4 dataset in TensorFlow Datasets misc
TFRecord and tf.train.Example Tutorial misc