Tag: training-dynamics (28 references)
To Code, or Not To Code? Exploring Impact of Code in Pre-training
Poisoning Web-Scale Training Datasets is Practical
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
TRAK: Attributing Model Behavior at Scale
Introduces TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that stays both effective and computationally tractable for large-scale models by randomly projecting per-example gradients into a low-dimensional space.
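A minimal sketch of the core idea, assuming per-example gradients are already available as flat vectors; the projection dimension, regularizer `lam`, and function names are illustrative, and the full TRAK estimator adds ensembling and correction terms omitted here.

```python
# Sketch of TRAK's core idea: compress per-example gradients with a random
# projection, then score train/test pairs through the resulting kernel.
# Not the official implementation; `lam` and all names are assumptions.
import numpy as np

def trak_style_scores(train_grads, test_grads, proj_dim=512, lam=1e-3, seed=0):
    """train_grads: (n_train, p) per-example gradients (p = #parameters).
    test_grads:  (n_test, p) per-example gradients for the query points.
    Returns an (n_test, n_train) attribution-score matrix."""
    rng = np.random.default_rng(seed)
    p = train_grads.shape[1]
    # Random Gaussian projection: p-dimensional gradients -> proj_dim features.
    P = rng.normal(size=(p, proj_dim)) / np.sqrt(proj_dim)
    Phi = train_grads @ P            # (n_train, proj_dim)
    Psi = test_grads @ P             # (n_test, proj_dim)
    # Kernel inverse in the projected space (regularized for stability).
    K = Phi.T @ Phi + lam * np.eye(proj_dim)
    return Psi @ np.linalg.solve(K, Phi.T)   # (n_test, n_train)
```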
Training Data Influence Analysis and Estimation: A Survey
Training Compute-Optimal Large Language Models
Shows that current LLMs are significantly undertrained. For compute-optimal training, model size and training tokens should scale equally. Introduces Chinchilla (70B params, 1.4T tokens) which outperforms larger models like Gopher (280B) trained on less data.
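A back-of-the-envelope sketch of the sizing rule, assuming the commonly cited approximations C ≈ 6ND for training FLOPs and roughly 20 tokens per parameter at the compute-optimal point; the budget figure below is approximate and illustrative.

```python
# Chinchilla-style sizing under the assumptions C ~ 6*N*D (training FLOPs)
# and D ~ 20*N (tokens per parameter); the 20x ratio is an approximation of
# the paper's result, not an exact law.
def compute_optimal(flops_budget, tokens_per_param=20.0):
    # With D = r*N and C = 6*N*D, we get N = sqrt(C / (6*r)).
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A Gopher-scale budget (~5.8e23 FLOPs) recovers roughly 70B params, 1.4T tokens.
n, d = compute_optimal(5.8e23)
print(f"params ≈ {n/1e9:.0f}B, tokens ≈ {d/1e12:.1f}T")
```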
Beyond neural scaling laws: beating power law scaling via data pruning
Extracting Training Data from Large Language Models
Scaling Laws for Neural Language Models
Establishes power-law scaling relationships between language model performance and model size, dataset size, and compute, spanning seven orders of magnitude.
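A small sketch of the power-law form the paper fits; the exponents and constants below are approximate values reported for model size and dataset size, and should be treated as illustrative assumptions.

```python
# Illustrative form of the scaling laws: loss falls as a power of model size N
# and data size D when the other factor is not the bottleneck. Constants are
# approximate values from the paper, used here only for illustration.
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    return (n_c / n_params) ** alpha_n

def loss_from_tokens(n_tokens, d_c=5.4e13, alpha_d=0.095):
    return (d_c / n_tokens) ** alpha_d

# Doubling model size multiplies loss by 2**(-0.076) ≈ 0.95, i.e. ~5% lower.
print(loss_from_params(2e9) / loss_from_params(1e9))
```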
Coresets for Data-efficient Training of Machine Learning Models
Introduces CRAIG (Coresets for Accelerating Incremental Gradient descent), which selects weighted subsets whose gradients approximate the full training gradient, yielding 2-3x training speedups while maintaining performance.
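A minimal sketch of CRAIG-style selection, assuming per-example gradient (or gradient-proxy) vectors are available; the similarity measure and greedy facility-location routine are an illustrative reconstruction, not the authors' implementation.

```python
# Greedy facility-location selection over pairwise gradient similarities,
# in the spirit of CRAIG: pick a weighted subset whose gradients cover the
# full set well. Feature choice and similarity are illustrative assumptions.
import numpy as np

def craig_style_coreset(grads, k):
    """grads: (n, d) per-example gradient (or gradient-proxy) vectors.
    Returns (indices, weights): k selected examples and the number of
    examples each selected point represents."""
    n = grads.shape[0]
    # Similarity = negative Euclidean distance, shifted to be nonnegative so
    # the facility-location objective is well defined for the empty set.
    dists = np.linalg.norm(grads[:, None, :] - grads[None, :, :], axis=-1)
    sims = dists.max() - dists
    selected = []
    best = np.zeros(n)                      # best similarity to current subset
    for _ in range(k):
        # Marginal gain of adding j: sum_i max(sim(i,j), best_i) - sum_i best_i.
        gains = np.maximum(sims, best[:, None]).sum(axis=0) - best.sum()
        gains[selected] = -np.inf           # never re-pick an element
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sims[:, j])
    # Weight each selected example by how many points it "covers".
    assign = np.argmax(sims[:, selected], axis=1)
    weights = np.bincount(assign, minlength=k)
    return np.array(selected), weights
```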
Deep Double Descent: Where Bigger Models and More Data Hurt
Demonstrates that double descent occurs across model size, training epochs, and dataset size in modern deep networks. Introduces effective model complexity to unify these phenomena and shows regimes where more data hurts.
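A toy illustration, not the paper's experimental setup: minimum-norm least squares on random Fourier features typically shows a test-error spike near the interpolation threshold (number of features ≈ number of training points), then improves again as features grow. Data, feature map, and sizes below are assumptions.

```python
# Toy double-descent demo with random features and minimum-norm least squares.
# All settings are illustrative; the peak near n_feat == n_train is typical
# but depends on the noise level and feature scale.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 50, 500
x_tr = rng.uniform(-1, 1, n_train)
x_te = rng.uniform(-1, 1, n_test)
y_tr = np.sin(2 * np.pi * x_tr) + 0.1 * rng.normal(size=n_train)
y_te = np.sin(2 * np.pi * x_te)

def random_features(x, n_feat, seed=1):
    r = np.random.default_rng(seed)
    w = r.normal(scale=5.0, size=n_feat)
    b = r.uniform(0, 2 * np.pi, n_feat)
    return np.cos(np.outer(x, w) + b)

for n_feat in [5, 20, 45, 50, 55, 100, 500, 2000]:
    Phi_tr = random_features(x_tr, n_feat)
    Phi_te = random_features(x_te, n_feat)
    # lstsq returns the minimum-norm solution in the overparameterized regime.
    coef, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    err = np.mean((Phi_te @ coef - y_te) ** 2)
    print(f"features={n_feat:5d}  test MSE={err:.3f}")
```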
Estimating Training Data Influence by Tracing Gradient Descent
Introduces TracIn, which estimates the influence of a training example by tracing how the loss on a test point changes whenever that example is visited during training. Uses a first-order gradient approximation and saved checkpoints for scalability.
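A minimal sketch of the checkpoint approximation, assuming a user-supplied `grad_fn` that returns flat loss gradients; the names and checkpoint format are illustrative.

```python
# TracIn-style checkpoint approximation: influence of a training example on a
# test example = sum over saved checkpoints of (learning rate) x (dot product
# of their loss gradients). `grad_fn` and the checkpoint tuple are assumptions.
import numpy as np

def tracin_influence(checkpoints, grad_fn, z_train, z_test):
    """checkpoints: iterable of (params, learning_rate) pairs saved during training.
    grad_fn(params, example) -> flat gradient vector of the loss at `example`."""
    score = 0.0
    for params, lr in checkpoints:
        g_train = grad_fn(params, z_train)
        g_test = grad_fn(params, z_test)
        score += lr * float(np.dot(g_train, g_test))
    return score
```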
Reconciling modern machine-learning practice and the classical bias–variance trade-off
Curriculum Learning
Introduces curriculum learning: training models on examples of increasing difficulty. Shows this acts as a continuation method for non-convex optimization, improving both convergence speed and final generalization.
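A minimal sketch of one possible curriculum schedule, assuming a precomputed per-example difficulty score; the pacing function and sampler are illustrative assumptions, not the paper's procedure.

```python
# Curriculum-style batching: sort examples by a difficulty score and gradually
# widen the pool drawn from, easiest first. Scorer, pacing, and sampling are
# illustrative choices.
import numpy as np

def curriculum_batches(examples, difficulty, n_steps, batch_size, rng=None):
    """Yield batches drawn from an easy-to-hard, gradually growing subset."""
    rng = rng or np.random.default_rng(0)
    order = np.argsort(difficulty)          # easiest examples first
    for step in range(n_steps):
        # Linear pacing: start with the easiest 10%, end with the full set.
        frac = 0.1 + 0.9 * step / max(n_steps - 1, 1)
        pool = order[: max(batch_size, int(frac * len(examples)))]
        idx = rng.choice(pool, size=batch_size, replace=True)
        yield [examples[i] for i in idx]
```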