Tag: data-infrastructure (49 references)
Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions using AI Detection Tool
ANSI/NISO Z39.96-2024, JATS: Journal Article Tag Suite
Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing
StarCoder 2 and The Stack v2: The Next Generation
Consent in Crisis: The Rapid Decline of the AI Data Commons
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Common Crawl — Web-scale Data for Research
LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets
Releasing Re-LAION-5B
LAION-5B: An Open Large-Scale Dataset for Training Next CLIP Models
The Stack: A Permissively Licensed Source Code Dataset
What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus
Source: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)