The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Authors
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
Venue
CoRR
arXiv