Data Leverage References


A Neural Scaling Law from the Dimension of the Data Manifold

2020 · misc · sharma2020neural

Authors
Utkarsh Sharma, Jared Kaplan
Venue
arXiv.org
Abstract
When data is plentiful, the loss achieved by well-trained neural networks scales as a power-law $L \propto N^{-\alpha}$ in the number of network parameters $N$. This empirical scaling law holds for a wide variety of data modalities, and may persist over many orders of magnitude. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$. This simple theory predicts that the scaling exponents $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses. We confirm the theory by independently measuring the intrinsic dimension and the scaling exponents in a teacher/student framework, where we can study a variety of $d$ and $\alpha$ by dialing the properties of random teacher networks. We also test the theory with CNN image classifiers on several datasets and with GPT-type language models.
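
A minimal sketch of the prediction stated in the abstract, using synthetic values (the parameter counts N, losses L, and intrinsic dimension d below are assumptions for illustration, not measurements from the paper): fit the exponent of L ∝ N^{-α} by linear regression in log-log space and compare it with 4/d.

import numpy as np

# Hypothetical sweep of (parameter count N, converged test loss L); in the
# paper these would come from trained networks of increasing size.
d = 16                          # assumed intrinsic dimension of the data manifold
alpha_theory = 4.0 / d          # exponent predicted by the theory
N = np.array([1e4, 3e4, 1e5, 3e5, 1e6, 3e6])
L = 2.5 * N ** (-alpha_theory)  # synthetic losses following the ideal power law

# Fit L = C * N**(-alpha) via linear regression in log-log space:
# log L = log C - alpha * log N.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_fit = -slope

print(f"fitted alpha = {alpha_fit:.3f}  vs  4/d = {alpha_theory:.3f}")

On real measurements the fitted exponent would only approximately match 4/d; the paper's tests use independently estimated intrinsic dimensions rather than an assumed d.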

BibTeX

Local Entry
@misc{sharma2020neural,
  title = {A Neural Scaling Law from the Dimension of the Data Manifold},
  author = {Utkarsh Sharma and Jared Kaplan},
  year = {2020},
  howpublished = {arXiv preprint arXiv:2004.10802},
  url = {https://arxiv.org/abs/2004.10802},
  abstract = {When data is plentiful, the loss achieved by well-trained neural networks scales as a power-law $L \propto N^{-\alpha}$ in the number of network parameters $N$. This empirical scaling law holds for a wide variety of data modalities, and may persist over many orders of magnitude. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$. This simple theory predicts that the scaling exponents $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses. We confirm the theory by independently measuring the intrinsic dimension and the scaling exponents in a teacher/student framework, where we can study a variety of $d$ and $\alpha$ by dialing the properties of random teacher networks. We also test the theory with CNN image classifiers on several datasets and with GPT-type language models.}
}