Tag: interpretability (11 references)
What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
LEACE: Perfect linear concept erasure in closed form
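For context, the result ships as a closed-form fit with no optimization loop. A usage sketch assuming the API of EleutherAI's concept-erasure package (`LeaceEraser.fit`); treat the exact signature as an assumption:

```python
import torch
from concept_erasure import LeaceEraser  # EleutherAI's reference implementation

n, d, k = 2048, 128, 2
X = torch.randn(n, d)            # feature vectors
Z = torch.randint(0, k, (n,))    # concept labels to erase
eraser = LeaceEraser.fit(X, Z)   # closed-form fit, no gradient descent
X_erased = eraser(X)             # linearly guarded: no linear probe recovers Z
```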
TRAK: Attributing Model Behavior at Scale
Introduces TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that stays effective yet computationally tractable at large model scale by randomly projecting per-example gradients into a low-dimensional feature space.
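In rough numpy terms, the core computation looks like the sketch below: project per-example gradients through a random matrix, then score training examples against a test-example gradient through the projected kernel. This is a single-model sketch with illustrative names (`trak_scores`, `proj_dim`); TRAK itself ensembles over several trained models and adds a correctness-weighting term omitted here.

```python
import numpy as np

def trak_scores(train_grads, test_grad, proj_dim=512, seed=0):
    """train_grads: (n, p) per-example gradients; test_grad: (p,) gradient
    for the test example. Returns one attribution score per train example."""
    rng = np.random.default_rng(seed)
    p = train_grads.shape[1]
    # Johnson-Lindenstrauss random projection: p parameters -> proj_dim
    P = rng.normal(size=(p, proj_dim)) / np.sqrt(proj_dim)
    Phi = train_grads @ P              # (n, proj_dim) projected train features
    phi = test_grad @ P                # (proj_dim,) projected test feature
    # kernel term (Phi^T Phi)^{-1}, with small damping for stability
    K = Phi.T @ Phi + 1e-6 * np.eye(proj_dim)
    return Phi @ np.linalg.solve(K, phi)
```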
Why Black Box Machine Learning Should Be Avoided for High-Stakes Decisions, in Brief
Coresets for Data-efficient Training of Machine Learning Models
Introduces CRAIG (Coresets for Accelerating Incremental Gradient descent), which selects weighted subsets whose gradients approximate the full training gradient, yielding 2-3x training speedups while maintaining performance.
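The selection step reduces to submodular (facility-location) maximization over pairwise gradient similarities, solved greedily. A toy numpy sketch under that framing, with an illustrative similarity construction; the paper bounds gradient differences cheaply rather than computing all pairwise distances exactly:

```python
import numpy as np

def craig_coreset(grads, k):
    """Greedy facility-location selection over per-example gradients.
    grads: (n, p) array; returns (selected indices, per-element weights)."""
    n = grads.shape[0]
    # pairwise gradient distances turned into non-negative similarities
    # (O(n^2) memory, so only suitable for small n in this sketch)
    d = np.linalg.norm(grads[:, None, :] - grads[None, :, :], axis=-1)
    sim = d.max() - d
    selected, best = [], np.zeros(n)
    for _ in range(k):
        # marginal facility-location gain of adding each candidate column
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    # weight each selected element by how many points it covers best
    assign = np.argmax(sim[:, selected], axis=1)
    weights = np.bincount(assign, minlength=k)
    return np.array(selected), weights
```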
interpreting GPT: the logit lens
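The post's trick: decode each intermediate residual-stream state through the model's final layer norm and unembedding matrix, and watch the prediction sharpen with depth. A minimal sketch assuming a Hugging Face GPT-2:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The Eiffel Tower is in", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)
# decode the residual stream after every block through ln_f + unembedding
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h))
    print(f"layer {layer:2d} ->", tok.decode(int(logits[0, -1].argmax())))
```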
Estimating Training Data Influence by Tracing Gradient Descent
Introduces TracIn, which computes the influence of a training example on a test prediction by tracing how the test loss changes over the course of training. Uses a first-order gradient approximation and saved checkpoints for scalability.
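The checkpoint form (TracInCP) is a few lines of PyTorch: at each saved checkpoint, accumulate the learning rate times the dot product of train- and test-loss gradients. A sketch with a hypothetical `loss_fn(model, z)` helper:

```python
import torch

def tracin_influence(model, checkpoints, lrs, loss_fn, z_train, z_test):
    """checkpoints: state_dicts saved during training; lrs: the learning
    rate in effect at each checkpoint; z_*: (input, target) pairs.
    loss_fn(model, z) is a hypothetical helper returning a scalar loss."""
    params = [p for p in model.parameters() if p.requires_grad]
    total = 0.0
    for state, lr in zip(checkpoints, lrs):
        model.load_state_dict(state)
        g_train = torch.autograd.grad(loss_fn(model, z_train), params)
        g_test = torch.autograd.grad(loss_fn(model, z_test), params)
        # TracInCP term: lr * <grad_train, grad_test> at this checkpoint
        total += lr * sum((a * b).sum() for a, b in zip(g_train, g_test))
    return float(total)
```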
In Pursuit of Interpretable, Fair and Accurate Machine Learning for Criminal Recidivism Prediction
On the Accuracy of Influence Functions for Measuring Group Effects
Model Cards for Model Reporting
Understanding Black-box Predictions via Influence Functions
Uses influence functions from robust statistics to trace model predictions back to training data, identifying training points most responsible for a given prediction.
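The headline score is compact: for training point z and test point z_test, the influence is -∇L(z_test)ᵀ H⁻¹ ∇L(z), with H the Hessian of the training loss at the learned parameters. The toy below forms H explicitly, which is feasible only for tiny models; the paper approximates H⁻¹v with implicit Hessian-vector products instead:

```python
import numpy as np

def influence_score(grad_train, grad_test, hessian, damping=1e-3):
    """Koh & Liang influence of up-weighting one training point on the
    test loss: -grad_test^T H^{-1} grad_train.
    grad_*: (p,) gradients; hessian: (p, p) training-loss Hessian."""
    p = len(grad_train)
    # damping keeps the solve well-posed when H is near-singular
    h_inv_g = np.linalg.solve(hessian + damping * np.eye(p), grad_train)
    return -grad_test @ h_inv_g
```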