LLM Social Simulations Are a Promising Research Method
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
Exploring the limits of strong membership inference attacks on large language models
Trust and Friction: Negotiating How Information Flows Through Decentralized Social Media
Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
Extending "GPTs Are GPTs" to Firms
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
Cybernetics
Cybernetics is the transdisciplinary study of circular causal processes such as feedback and recursion, where the effects of a system's actions (its outputs) return as inputs to that system, influencing subsequent action. It is concerned with general principles that are relevant across multiple contexts, including in engineering, ecological, economic, biological, cognitive and social systems and also in practical activities such as designing, learning, and managing. Cybernetics' transdisciplinary character has meant that it intersects with a number of other fields, leading to it having both wide influence and diverse interpretations. The field is named after an example of circular causal feedback—that of steering a ship (the ancient Greek κυβερνήτης (kybernḗtēs) refers to the person who steers a ship). In steering a ship, the position of the rudder is adjusted in continual response to the effect it is observed as having, forming a feedback loop through which a steady course can be maintained in a changing environment, responding to disturbances from cross winds and tide. Cybernetics has its origins in exchanges between numerous disciplines during the 1940s. Initial developments were consolidated through meetings such as the Macy Conferences and the Ratio Club. Early focuses included purposeful behaviour, neural networks, heterarchy, information theory, and self-organising systems. As cybernetics developed, it became broader in scope to include work in design, family therapy, management and organisation, pedagogy, sociology, the creative arts and the counterculture.
The Leaderboard Illusion
Welcome to the Era of Experience
If open source is to win, it must go public
Canada as a Champion for Public AI: Data, Compute and Open Source Infrastructure for Economic Growth and Inclusive Innovation
Shapley value-based data valuation for machine learning data markets
Proposes G-Value to bridge the gap between leave-one-out (LOO) and Shapley value approaches for data valuation. Addresses practical applications in machine learning data markets.
Rethinking machine unlearning for large language models
Comprehensive review of machine unlearning in LLMs, aiming to eliminate undesirable data influence (sensitive or illegal information) while maintaining essential knowledge generation. Envisions LLM unlearning as a pivotal element in life-cycle management for developing safe, secure, trustworthy, and resource-efficient generative AI.
Distributional Training Data Attribution: What do Influence Functions Sample?
Introduces distributional training data attribution (d-TDA), which predicts how the distribution of model outputs depends upon the dataset. Shows that influence functions are "secretly distributional"—they emerge from this framework as the limit to unrolled differentiation without requiring restrictive convexity assumptions.
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations
Official NIST taxonomy and terminology for adversarial machine learning. Covers data poisoning attacks applicable to all learning paradigms, model poisoning attacks in federated learning, and supply-chain attacks. Provides guidance for defense strategies.
The Economics of AI Training Data: A Research Agenda
Research agenda documenting AI training data deals from 2020 to 2025. Reveals persistent market fragmentation, five distinct pricing mechanisms (from per-unit licensing to commissioning), and that most deals exclude original creators from compensation. Found only 7 of 24 major deals compensate original creators.
Data-centric Artificial Intelligence: A Survey
Comprehensive survey on data-centric AI, providing a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and representative methods. Covers the paradigm shift from model refinement to prioritizing data quality.
Revisiting Data Attribution for Influence Functions
Comprehensive review of influence functions for data attribution, examining how individual training examples influence model predictions. Covers techniques for model debugging, data curation, bias detection, and identification of mislabeled or adversarial data points.
Membership inference attacks against large language models
AI as Normal Technology
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions using AI Detection Tool
Machines of Loving Grace: How AI Could Transform the World for the Better
The Illusion of Artificial Inclusion
To Code, or Not To Code? Exploring Impact of Code in Pre-training
The Rise of AI-Generated Content in Wikipedia
Poisoning Web-Scale Training Datasets is Practical
What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
Large language models reduce public knowledge sharing on online Q&A platforms
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
GPTs are GPTs: Labor Market Impact Potential of LLMs
Artificial Intelligence Act
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Public AI: Infrastructure for the Common Good
ANSI/NISO Z39.96-2024, JATS: Journal Article Tag Suite
Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing
Data Flywheel Go Brrr: Using Your Users to Build Better Products - Jason Liu
Explores how data flywheels leverage user feedback to continually improve AI products over time.
Consent in Crisis: The Rapid Decline of the AI Data Commons
StarCoder 2 and The Stack v2: The Next Generation
LLM Dataset Inference: Did you train on my dataset?
Public AI: Making AI Work for Everyone, by Everyone
Scalable Data Ablation Approximations for Language Models through Modular Training and Merging
Generative AI Profile (Draft/2024)
A Canary in the AI Coal Mine: American Jews May Be Disproportionately Harmed by Intellectual Property Dispossession in Large Language Model Training
What is a Data Flywheel? A Guide to Sustainable Business Growth
Data Flywheels for LLM Applications
The data addition dilemma
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Copyright and Artificial Intelligence: Policy Studies and Guidance
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
Push and Pull: A Framework for Measuring Attentional Agency
A Systematic Review of NeurIPS Dataset Management Practices
Machine Unlearning: A Survey
Comprehensive survey of machine unlearning covering definitions, scenarios, verification methods, and applications. Cited in the International AI Safety Report 2025 as a pioneering paradigm for removing sensitive information.
CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning
Proposes CHG (compound of Hardness and Gradient) utility function to approximate the utility of each data subset, reducing computational complexity to a single model retraining—achieving a quadratic improvement over existing Data Shapley methods.
A Versatile Influence Function for Data Attribution with Non-Decomposable Loss
Proposes Versatile Influence Function (VIF) designed to fully leverage auto-differentiation, eliminating case-specific derivations. Demonstrated across Cox regression for survival analysis, node embedding for network analysis, and listwise learning-to-rank, with estimates closely resembling leave-one-out retraining while being up to 10^3 times faster.
Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
Studies whether model collapse is inevitable. Found that collapse occurs when replacing real data with synthetic data each generation. However, when accumulating synthetic data alongside original real data, models stay stable across sizes and modalities. Suggests data accumulation rather than replacement as a solution.
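A toy illustration of the replace-versus-accumulate contrast described above, assuming a one-dimensional Gaussian "model" refit each generation; this is a minimal sketch, not the paper's experimental setup, and the sample sizes and generation count are illustrative choices.

```python
# Toy illustration (not the paper's setup): a 1-D Gaussian "model" is refit each
# generation and then sampled. "Replace" trains only on the latest synthetic
# sample; "accumulate" keeps all earlier data in the pool.
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=1000)     # original real data, scale 1.0

def run(generations=200, accumulate=False):
    data = real.copy()
    sigma = real.std()
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()           # "train" the model
        synthetic = rng.normal(mu, sigma, size=100)   # sample from the model
        data = np.concatenate([data, synthetic]) if accumulate else synthetic
    return sigma

# The replace run's fitted scale tends to drift well below 1.0 (tails vanish),
# while the accumulate run stays anchored near the real data's scale.
print("replace:", round(run(accumulate=False), 3),
      "| accumulate:", round(run(accumulate=True), 3))
```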
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
Large-scale audit of over 1,800 text AI datasets analyzing trends, permissions of use and global representation. Found frequent miscategorization of licences on dataset hosting sites, with licence omission rates of more than 70% and error rates of more than 50%. Released the Data Provenance Explorer tool for practitioners.
Influence Functions for Scalable Data Attribution in Diffusion Models
Develops influence function frameworks for diffusion models to address data attribution and interpretability challenges. Predicts how model output would change if training data were removed, showing how previously proposed methods can be interpreted as particular design choices in this framework.
AI models collapse when trained on recursively generated data
Landmark study showing that indiscriminate use of model-generated content in training causes irreversible defects in resulting models, where tails of original content distribution disappear. Model collapse is a degenerative learning process where models forget improbable events over time. Demonstrates this across LLMs, VAEs, and GMMs.
LLM Unlearning via Loss Adjustment with Only Forget Data
FLAT is a loss adjustment approach which maximizes f-divergence between the available template answer and the forget answer with respect to the forget data. Demonstrates superior unlearning performance compared to existing methods while minimizing impact on retained capabilities, tested on Harry Potter dataset and MUSE Benchmark.
Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration
Enhances training data attribution methods for large language models including LLaMA2, QWEN2, and Mistral by considering fitting error in the attribution process.
Position Paper: Data-Centric AI in the Age of Large Language Models
Position paper identifying four specific scenarios centered around data for LLMs, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
The simple macroeconomics of AI
Self-consuming generative models go MAD
The labor market impacts of technological change: From unbridled enthusiasm to qualified optimism to vast uncertainty
The Foundation Model Transparency Index v1.1
The consequences of generative AI for online knowledge communities
The impact of generative AI on Wikipedia traffic
The short-term effects of generative artificial intelligence on employment: Evidence from an online labor market
Is Stack Overflow Obsolete? An Empirical Study of the Characteristics of ChatGPT Answers to Stack Overflow Questions
Looking Beyond the Top-1: Transformers Determine Top Tokens in Order
Model collapse from recursive training on generated data
Benchmarking Benchmark Leakage in Large Language Models
Alpaca: A Strong, Replicable Instruction-Following Model
LEACE: Perfect linear concept erasure in closed form
Quantifying Memorization Across Neural Language Models
Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia
Understanding CC Licenses and Generative AI
Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4
Wikipedia's value in the age of generative AI
If there were a generative artificial intelligence system that could, on its own, write all the information contained in Wikipedia, would it be the same as Wikipedia today?
Algorithmic Collective Action in Machine Learning
Provides theoretical framework for algorithmic collective action, showing that small collectives can exert significant control over platform learning algorithms through coordinated data strategies.
ISO/IEC 23894:2023 Information Technology—Artificial Intelligence—Risk Management
Power and Progress: Our Thousand-Year Struggle Over Technology and Prosperity
A Watermark for Large Language Models
The Dimensions of Data Labor: A Road Map for Researchers, Activists, and Policymakers to Empower Data Producers
Textbooks Are All You Need II: phi-1.5 technical report
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore
Artificial Intelligence Risk Management Framework (AI RMF 1.0)
OWASP Top 10 for Large Language Model Applications
TRAK: Attributing Model Behavior at Scale
Introduces TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale models by leveraging random projections.
Terms-we-serve-with: Five dimensions for anticipating and repairing algorithmic harm
Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy for Harm Reduction
Understanding the landscape of potential harms from algorithmic systems enables practitioners to better anticipate consequences of the systems they build. It also supports the prospect of incorporating controls to help minimize harms that emerge from the interplay of technologies and social and cultural dynamics. A growing body of scholarship has identified a wide range of harms across different algorithmic technologies. However, computing research and practitioners lack a high level and synthesized overview of harms from algorithmic systems. Based on a scoping review of computing research (n=172), we present an applied taxonomy of sociotechnical harms to support a more systematic surfacing of potential harms in algorithmic systems. The final taxonomy builds on and refers to existing taxonomies, classifications, and terminologies. Five major themes related to sociotechnical harms — representational, allocative, quality-of-service, interpersonal harms, and social system/societal harms — and sub-themes are presented along with a description of these categories. We conclude with a discussion of challenges and opportunities for future research.
An Alternative to Regulation: The Case for Public AI
State of AI Report 2023
The Foundation Model Transparency Index
Quantifying memorization across neural language models
Open problems and fundamental limitations of reinforcement learning from human feedback
Mata v. Avianca, Inc., No. 1:22-cv-01461 (S.D.N.Y. June 22, 2023), Opinion and Order on Sanctions
GPTs are GPTs: An early look at the labor market impact potential of large language models
Occupational heterogeneity in exposure to generative AI
Scaling laws for reward model overoptimization
A watermark for large language models
The Data Provenance Initiative: A large scale audit of dataset licensing & attribution in AI
SILO language models: Isolating legal risk in a nonparametric datastore
Experimental evidence on the productivity effects of generative artificial intelligence
Proving Test Set Contamination in Black Box Language Models
The Eye of the Master: A Social History of Artificial Intelligence
The impact of AI on developer productivity: Evidence from GitHub Copilot
Direct preference optimization: Your language model is secretly a reward model
Changing the world by changing the data
Generative AI meets copyright
The gradient of generative AI release: Methods and considerations
Canada's Online News Act: A legislative response to platform power
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
Common Crawl — Web-scale Data for Research
Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses
Comprehensive survey systematically categorizing dataset vulnerabilities including poisoning and backdoor attacks, their threat models, and defense mechanisms.
DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning
Comprehensive library and empirical study of coreset selection methods for deep learning, finding that random selection remains a strong baseline across many settings.
Training Data Influence Analysis and Estimation: A Survey
Training Compute-Optimal Large Language Models
Shows that current LLMs are significantly undertrained. For compute-optimal training, model size and training tokens should scale equally. Introduces Chinchilla (70B params, 1.4T tokens) which outperforms larger models like Gopher (280B) trained on less data.
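A back-of-the-envelope sketch of the rule of thumb the entry describes, assuming the common approximation C ≈ 6·N·D for training FLOPs and equal scaling exponents for parameters and tokens, anchored at the Chinchilla configuration quoted above; the paper's fitted coefficients differ, so this is illustrative only.

```python
# Back-of-the-envelope sketch (not the paper's fit): assume training compute
# C ~ 6*N*D FLOPs and that compute-optimal N (params) and D (tokens) scale with
# equal exponents, anchored at the Chinchilla point (N=70e9, D=1.4e12).
def compute_optimal(C_flops):
    C_ref = 6 * 70e9 * 1.4e12                # compute of the Chinchilla run
    scale = (C_flops / C_ref) ** 0.5          # equal scaling for N and D
    return 70e9 * scale, 1.4e12 * scale       # (optimal params, optimal tokens)

for C in (1e21, 1e22, 1e23, 1e24):
    N, D = compute_optimal(C)
    print(f"C={C:.0e} FLOPs -> ~{N / 1e9:.1f}B params, ~{D / 1e12:.2f}T tokens "
          f"(~{D / N:.0f} tokens per parameter)")
```

Under these assumptions the tokens-per-parameter ratio stays fixed at roughly 20, which is the commonly cited Chinchilla heuristic.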
Datamodels: Predicting Predictions from Training Data
Proposes datamodels that predict model outputs as a function of training data subsets, providing a framework for understanding data attribution through retraining experiments.
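A toy sketch of the subset-retraining recipe the entry describes, with a cheap least-squares surrogate standing in for a real model; the dataset, subset fraction, and run count are illustrative assumptions rather than the paper's protocol.

```python
# Minimal datamodels-style sketch: "train" a cheap surrogate on many random
# training subsets, record a target output (a test point's predicted score),
# then fit a linear map from subset-membership indicators to that output.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2)) + np.repeat([[0, 0], [2.5, 2.5]], 30, axis=0)
y = np.repeat([-1.0, 1.0], 30)
x_test = np.array([1.2, 1.2])

def margin_on_subset(idx):                 # surrogate model: least-squares classifier
    w = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    return x_test @ w                       # signed score of the test point

n_runs, frac = 2000, 0.5
masks = rng.random((n_runs, len(X))) < frac            # random 50% subsets
outputs = np.array([margin_on_subset(np.where(m)[0]) for m in masks])

# Linear datamodel: outputs ~ masks @ beta + b  (beta_i ~ example i's contribution)
A = np.hstack([masks.astype(float), np.ones((n_runs, 1))])
beta = np.linalg.lstsq(A, outputs, rcond=None)[0][:-1]
print("training points with largest positive contribution:", np.argsort(-beta)[:5])
```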
Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning
Generalizes Data Shapley using Beta weighting functions, providing noise-reduced data valuation that better handles outliers and mislabeled data detection.
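A minimal Monte Carlo sketch of the permutation-based Data Shapley estimator that Beta Shapley builds on; the toy dataset, the nearest-class-mean utility, and the permutation count are assumptions for illustration. Beta Shapley itself would additionally weight each marginal contribution by the size of the coalition it arrives in.

```python
# Minimal Monte Carlo sketch of permutation-based data valuation (Data Shapley style).
# Utility = validation accuracy of a nearest-class-mean classifier trained on a subset.
# Beta Shapley differs by weighting each marginal contribution by coalition size;
# uniform weights (used here) recover Data Shapley.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2)) + np.repeat([[0, 0], [3, 3]], 15, axis=0)
y = np.repeat([0, 1], 15)
Xv = rng.normal(size=(40, 2)) + np.repeat([[0, 0], [3, 3]], 20, axis=0)
yv = np.repeat([0, 1], 20)

def utility(idx):
    if len(idx) == 0 or len(set(y[idx])) < 2:
        return 0.5  # chance level when a class is missing
    means = np.stack([X[idx][y[idx] == c].mean(axis=0) for c in (0, 1)])
    pred = np.argmin(((Xv[:, None, :] - means) ** 2).sum(-1), axis=1)
    return (pred == yv).mean()

values = np.zeros(len(X))
n_perm = 200
for _ in range(n_perm):
    perm, prev, chosen = rng.permutation(len(X)), utility([]), []
    for i in perm:
        chosen.append(i)
        cur = utility(np.array(chosen))
        values[i] += (cur - prev) / n_perm   # Beta Shapley would reweight this term
        prev = cur
print("highest-value points:", np.argsort(-values)[:5])
```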
LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets
Training language models to follow instructions with human feedback
Probabilistic Machine Learning: An introduction
The Fallacy of AI Functionality
Releasing Re-LAION-5B
Why Black Box Machine Learning Should Be Avoided for High-Stakes Decisions, in Brief
LAION-5B: An Open Large-Scale Dataset for Training Next CLIP Models
Beyond neural scaling laws: beating power law scaling via data pruning
The Stack: A Permissively Licensed Source Code Dataset
Introducing Whisper
Robust Speech Recognition via Large-Scale Weak Supervision
Constitutional AI: Harmlessness from AI feedback
Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space
In-context learning and induction heads
Mapping the design space of teachable social media
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
Machine Unlearning
Introduces SISA (Sharded, Isolated, Sliced, Aggregated) training for efficient exact machine unlearning. Partitions data into shards with separate models, enabling targeted retraining when data must be forgotten.
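A minimal sketch of the shard-and-aggregate idea the entry summarizes, assuming a toy least-squares learner and majority-vote aggregation; the actual SISA method additionally slices each shard and checkpoints training so retraining can resume mid-shard.

```python
# Minimal sketch of the shard-and-aggregate idea behind SISA: data is split into
# disjoint shards, one model per shard, predictions aggregated by majority vote.
# To forget a point, only its shard's model is retrained.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)); w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(int)

def train(Xs, ys):                       # stand-in learner: least-squares "model"
    return np.linalg.lstsq(Xs, 2 * ys - 1, rcond=None)[0]

def predict(models, Xq):                 # majority vote over shard models
    votes = np.stack([(Xq @ w > 0).astype(int) for w in models])
    return (votes.mean(axis=0) > 0.5).astype(int)

n_shards = 5
shards = np.array_split(rng.permutation(len(X)), n_shards)
models = [train(X[s], y[s]) for s in shards]

forget = 7                                # index of the point to unlearn
s_id = next(i for i, s in enumerate(shards) if forget in s)
shards[s_id] = shards[s_id][shards[s_id] != forget]
models[s_id] = train(X[shards[s_id]], y[shards[s_id]])   # retrain one shard only
print("accuracy after unlearning:", (predict(models, X) == y).mean())
```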
Extracting Training Data from Large Language Models
Unsolved Problems in ML Safety
Beta Shapley: A Unified and Noise-Reduced Data Valuation Framework for Machine Learning
What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus
Measuring Mathematical Problem Solving With the MATH Dataset
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
What are you optimizing for? Aligning Recommender Systems with Human Values
Quantifying the Invisible Labor in Crowd Work
Can "Conscious Data Contribution" Help Users to Exert "Data Leverage" Against Technology Companies?
Data Leverage: A Framework for Empowering the Public in its Relationship with Technology Companies
A Deeper Investigation of the Importance of Wikipedia Links to Search Engine Results
Ethical and Social Risks of Harm from Language Models
On the opportunities and risks of foundation models
To trust or to think: Cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making
Extracting training data from large language models
All that's `human' is not gold: Evaluating human evaluation of generated text
Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence
Documenting large webtext corpora: A case study on the Colossal Clean Crawled Corpus
A mathematical framework for transformer circuits
The Australian News Media Bargaining Code
Datasheets for datasets
The value of data: Evidence from ride-hailing
Copyright in the data economy: An overview
Dynabench: Rethinking benchmarking in NLP
Fair learning
Data leverage: A framework for empowering the public in its relationship with technology companies
Language (Technology) is Power: A Critical Survey of “Bias” in NLP
Language Models are Few-Shot Learners
Artificial Intelligence, Values, and Alignment
Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
Scaling Laws for Neural Language Models
Establishes power-law scaling relationships between language model performance and model size, dataset size, and compute, spanning seven orders of magnitude.
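The single-variable functional forms the entry refers to can be written as below; the constants and exponents are fitted empirically in the paper and are not reproduced here.

```latex
% Power-law forms for loss as a function of parameters N, dataset size D,
% and compute C (constants N_c, D_c, C_c and exponents fitted empirically):
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N},\qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D},\qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```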
Exploring Research Interest in Stack Overflow -- A Systematic Mapping Study and Quality Evaluation
Coresets for Data-efficient Training of Machine Learning Models
Introduces CRAIG (Coresets for Accelerating Incremental Gradient descent), selecting subsets that approximate full gradient for 2-3x training speedups while maintaining performance.
The Economics of Maps
Deep Double Descent: Where Bigger Models and More Data Hurt
Demonstrates that double descent occurs across model size, training epochs, and dataset size in modern deep networks. Introduces effective model complexity to unify these phenomena and shows regimes where more data hurts.
The Biggest Lie on the Internet: Ignoring the Privacy Policies and Terms of Service Policies of Social Networking Services
Estimating Training Data Influence by Tracing Gradient Descent
Introduces TracIn, which computes influence of training examples by tracing how test loss changes during training. Uses first-order gradient approximation and saved checkpoints for scalability.
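A toy sketch of the checkpoint-based score on a hand-rolled logistic regression; the data, learning rate, and checkpoint schedule are illustrative assumptions, not the paper's setup.

```python
# Minimal TracIn-style sketch: the influence of a training point on a test point is
# the sum, over saved checkpoints, of the learning rate times the dot product of
# their per-example loss gradients at that checkpoint.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)); w_true = np.array([1.0, -2.0, 0.5])
y = (X @ w_true > 0).astype(float)

def grad(w, x, t):                        # per-example logistic-loss gradient
    p = 1 / (1 + np.exp(-x @ w))
    return (p - t) * x

# Train with SGD, saving periodic checkpoints (and the learning rate used).
w, lr, checkpoints = np.zeros(3), 0.1, []
for step in range(500):
    i = rng.integers(len(X))
    w -= lr * grad(w, X[i], y[i])
    if step % 100 == 0:
        checkpoints.append((lr, w.copy()))

def tracin(i_train, x_test, y_test):
    return sum(eta * grad(wc, x_test, y_test) @ grad(wc, X[i_train], y[i_train])
               for eta, wc in checkpoints)

x_test, y_test = X[0], y[0]
scores = np.array([tracin(i, x_test, y_test) for i in range(len(X))])
print("most influential training points for X[0]:", np.argsort(-scores)[:5])
```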
The pushshift reddit dataset
Are anonymity-seekers just like everybody else? An analysis of contributions to Wikipedia from Tor
In Pursuit of Interpretable, Fair and Accurate Machine Learning for Criminal Recidivism Prediction
Enchanted determinism: Power without responsibility in artificial intelligence
The digitization of day labor as gig work
interpreting GPT: the logit lens
Too Smart: How Digital Capitalism is Extracting Data, Controlling Our Lives, and Taking Over the World
What do platforms do? Understanding the gig economy
The gig economy: A critical introduction
Common voice: A massively-multilingual speech corpus
Reconciling modern machine-learning practice and the classical bias–variance trade-off
The Secret Sharer: Measuring Unintended Memorization in Neural Networks
Excavating AI: The Politics of Images in Machine Learning Training Sets
Ecosystem Tipping Points in an Evolving World
Data Shapley: Equitable Valuation of Data for Machine Learning
Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
First demonstration of backdoor attacks on deep neural networks. Shows that small trigger patterns in training data cause models to misclassify any input containing the trigger (e.g., stop signs with stickers classified as speed limits).
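A minimal sketch of the trigger-based poisoning described above, on stand-in image arrays; the patch location, poison rate, and target class are illustrative choices, and no classifier is actually trained here.

```python
# Minimal sketch of a BadNets-style backdoor: a small trigger patch is stamped onto
# a fraction of training images and their labels are flipped to an attacker-chosen
# target class; a model trained on this data learns to associate the trigger with
# the target label.
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((1000, 28, 28))            # stand-in dataset
labels = rng.integers(0, 10, size=1000)

def add_trigger(img):
    img = img.copy()
    img[-3:, -3:] = 1.0                        # 3x3 white patch in the corner
    return img

poison_rate, target_class = 0.05, 7
poison_idx = rng.choice(len(images), size=int(poison_rate * len(images)), replace=False)
for i in poison_idx:
    images[i] = add_trigger(images[i])
    labels[i] = target_class                    # label flipped to the attacker's target

# At test time, stamping the same trigger on any input steers a backdoored model
# toward target_class, while clean inputs behave normally.
triggered_test = add_trigger(rng.random((28, 28)))
```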
Incomplete Contracting and AI Alignment
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Towards Efficient Data Valuation Based on the Shapley Value
On the Accuracy of Influence Functions for Measuring Group Effects
Privacy, anonymity, and perceived risk in open collaboration: A study of service providers
Model Cards for Model Reporting
Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations
Rosenbach v. Six Flags Entertainment Corp.
Fairness and Abstraction in Sociotechnical Systems
A Survey on Image Data Augmentation for Deep Learning
Comprehensive survey of image data augmentation techniques for deep learning, covering geometric transformations, color space transforms, kernel filters, mixing images, random erasing, and neural style transfer approaches.
Measuring the Importance of User-Generated Content to Search Engines
Mapping the Potential and Pitfalls of "Data Dividends" as a Means of Sharing the Profits of Artificial Intelligence
"Data Strikes": Evaluating the Effectiveness of a New Form of Collective Action Against Technology Companies
Simulates data strikes against recommender systems, showing that collective withholding of training data can create leverage for users against technology platforms.
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features
Combines cutting and mixing: patches from one image replace regions in another, with labels mixed proportionally. Improves over Cutout by using cut pixels constructively rather than zeroing them out.
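A minimal NumPy sketch of the patch-and-mix operation described above, assuming channel-last image arrays and one-hot labels; the Beta(alpha, alpha) sampling follows the common formulation and the toy inputs are placeholders.

```python
# Minimal CutMix sketch: a random rectangle from one image replaces the same region
# in another, and the labels are mixed in proportion to the pasted area.
import numpy as np

def cutmix(img1, label1, img2, label2, rng, alpha=1.0):
    lam = rng.beta(alpha, alpha)                     # area kept from the first image
    H, W = img1.shape[:2]
    cut_h, cut_w = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = rng.integers(H), rng.integers(W)
    top, bottom = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, H)
    left, right = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, W)
    mixed = img1.copy()
    mixed[top:bottom, left:right] = img2[top:bottom, left:right]
    lam_adj = 1 - (bottom - top) * (right - left) / (H * W)   # area actually kept
    return mixed, lam_adj * label1 + (1 - lam_adj) * label2

rng = np.random.default_rng(0)
img_a, img_b = rng.random((32, 32, 3)), rng.random((32, 32, 3))
one_hot_a, one_hot_b = np.eye(10)[3], np.eye(10)[7]
mixed_img, mixed_label = cutmix(img_a, one_hot_a, img_b, one_hot_b, rng)
```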
Automation and new tasks: How technology displaces and reinstates labor
The Wrong Kind of AI? Artificial Intelligence and the Future of Labor Demand
Race After Technology: Abolitionist Tools for the New Jim Code
Regulatory options for conflicts of law and jurisdictional issues in the on-demand economy
Data colonialism: Rethinking big data's relation to the contemporary subject
The Costs of Connection: How Data is Colonizing Human Life and Appropriating it for Capitalism
The Technology Trap: Capital, Labor, and Power in the Age of Automation
Data Shapley: Equitable valuation of data for machine learning
Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass
SuperGLUE: A stickier benchmark for general-purpose language understanding systems
The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power
A Reductions Approach to Fair Classification
Should We Treat Data as Labor? Moving Beyond 'Free'
Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science
Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification
Datasheets for Datasets
The Dark (Patterns) Side of UX Design
The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards
Troubling Trends in Machine Learning Scholarship
Active Learning for Convolutional Neural Networks: A Core-Set Approach
Defines active learning as core-set selection, choosing points such that a model trained on the subset is competitive for remaining data. Provides theoretical bounds via k-Center problem.
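A minimal sketch of the greedy k-Center selection the core-set formulation relies on, run on random stand-in embeddings; the data and budget are illustrative assumptions.

```python
# Greedy k-Center selection: repeatedly pick the point farthest from the current
# selection, so every remaining point lies within a small radius of some center.
import numpy as np

def k_center_greedy(X, k, rng):
    selected = [rng.integers(len(X))]                        # arbitrary first center
    dists = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))                          # farthest point so far
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                               # e.g. network embeddings
coreset = k_center_greedy(X, k=25, rng=rng)
radius = np.max(np.min(np.linalg.norm(X[:, None] - X[coreset][None], axis=2), axis=1))
print("coreset indices:", coreset[:10], "... covering radius:", radius.round(3))
```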
Artificial Intelligence and Its Implications for Income Distribution and Unemployment
A Blueprint for a Better Digital Society
mixup: Beyond Empirical Risk Minimization
Introduces mixup, a data augmentation technique that trains on convex combinations of input pairs and their labels. Simple, data-independent, and model-agnostic approach that improves generalization and robustness.
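A minimal NumPy sketch of the convex-combination augmentation described above, assuming one-hot labels and a Beta(alpha, alpha) mixing coefficient; the batch contents are placeholders.

```python
# Minimal mixup sketch: each augmented example is a convex combination of two
# training examples and of their (one-hot) labels.
import numpy as np

def mixup_batch(X, Y, rng, alpha=0.2):
    lam = rng.beta(alpha, alpha)             # mixing coefficient in [0, 1]
    perm = rng.permutation(len(X))           # pair each example with a random partner
    X_mix = lam * X + (1 - lam) * X[perm]
    Y_mix = lam * Y + (1 - lam) * Y[perm]
    return X_mix, Y_mix

rng = np.random.default_rng(0)
X = rng.random((64, 32, 32, 3))              # a batch of images
Y = np.eye(10)[rng.integers(0, 10, 64)]      # one-hot labels
X_mix, Y_mix = mixup_batch(X, Y, rng)
```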
Artificial intelligence, automation and work
Prediction Machines: The Simple Economics of Artificial Intelligence
Should we treat data as labor? Moving beyond 'free'
Data statements for natural language processing: Toward mitigating system bias and enabling better science
Artificial Unintelligence: How Computers Misunderstand the World
Neurons spike back: The invention of inductive machines and the artificial intelligence controversy
Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor
Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media
Annotation artifacts in natural language inference data
A blueprint for a better digital society
Algorithms of Oppression: How Search Engines Reinforce Racism
Radical Markets: Uprooting Capitalism and Democracy for a Just Society
Uberland: How Algorithms Are Rewriting the Rules of Work
Artificial intelligence, economics, and industrial organization
Improved Regularization of Convolutional Neural Networks with Cutout
Introduces Cutout, a regularization technique that randomly masks square regions of input images during training. Inspired by dropout but applied to inputs, encouraging models to learn from partially visible objects.
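A minimal NumPy sketch of the masking operation described above, assuming channel-last images; the mask size and zero fill value are the usual illustrative choices.

```python
# Minimal Cutout sketch: a fixed-size square region at a random location is zeroed
# out in each training image, forcing the model to rely on the remaining context.
import numpy as np

def cutout(img, size, rng):
    H, W = img.shape[:2]
    cy, cx = rng.integers(H), rng.integers(W)            # center of the mask
    top, bottom = max(0, cy - size // 2), min(H, cy + size // 2)
    left, right = max(0, cx - size // 2), min(W, cx + size // 2)
    out = img.copy()
    out[top:bottom, left:right] = 0.0                     # mask may be clipped at edges
    return out

rng = np.random.default_rng(0)
augmented = cutout(rng.random((32, 32, 3)), size=8, rng=rng)
```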
The Substantial Interdependence of Wikipedia and Google: A Case Study on the Relationship Between Peer Production Communities and Information Technologies
Deep learning scaling is predictable, empirically
The WARC Format 1.1
Understanding Black-box Predictions via Influence Functions
Uses influence functions from robust statistics to trace model predictions back to training data, identifying training points most responsible for a given prediction.
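A toy sketch of the gradient / inverse-Hessian / gradient score on L2-regularized linear regression, where the Hessian is available in closed form; the per-example gradients omit the regularizer, and the sign and scale conventions are simplified relative to the paper.

```python
# Toy influence-function sketch on ridge regression. Score of a training point z for
# a test point z_test:  grad_z^T H^{-1} grad_test; a large positive score suggests
# removing z would tend to increase the test loss (a "helpful" point).
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1e-2
X = rng.normal(size=(n, d)); w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Mean squared-error loss with L2 penalty; minimizer and Hessian in closed form.
w_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
H = X.T @ X / n + lam * np.eye(d)

x_test, y_test = X[0], y[0]
g_test = (x_test @ w_hat - y_test) * x_test          # per-example test-loss gradient
H_inv_g_test = np.linalg.solve(H, g_test)

# Per-example training gradients (regularizer omitted for simplicity).
grads = (X @ w_hat - y)[:, None] * X                  # shape (n, d)
scores = grads @ H_inv_g_test
print("most helpful training points for the test example:", np.argsort(-scores)[:5])
```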
Deep reinforcement learning from human preferences
Heteromation, and Other Stories of Computing and Capitalism
On Calibration of Modern Neural Networks
Understanding black-box predictions via influence functions
Thinking critically about and researching algorithms
Algorithms as culture: Some tactics for the ethnography of algorithmic systems
Membership inference attacks against machine learning models
Platform Capitalism
The EU General Data Protection Regulation (GDPR): A Practical Guide
Big Data's Disparate Impact
General Data Protection Regulation (EU) 2016/679
Reality and Perception of Copyright Terms of Service for Online Content Creation
Information fiduciaries and the first amendment
How the machine 'thinks': Understanding opacity in machine learning algorithms
Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy
SQuAD: 100,000+ questions for machine comprehension of text
Uberworked and Underpaid: How Workers Are Disrupting the Digital Economy
Ours to Hack and to Own: The Rise of Platform Cooperativism
Towards Making Systems Forget with Machine Unlearning
First formal definition of machine unlearning. Proposes converting learning algorithms into summation form to enable efficient data removal without full retraining. Foundational work establishing the unlearning problem.
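A toy illustration of the summation-form idea, assuming a Gaussian naive-Bayes-style model whose parameters derive from summed per-class statistics; the paper covers a broader class of learners, so this is only a sketch of the mechanism.

```python
# Toy illustration of "summation form": if a model is a function of per-example
# statistics that are simply summed, unlearning a point means subtracting its
# contribution and recomputing, with no pass over the remaining data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)); y = rng.integers(0, 2, 500)

# Sufficient statistics: per-class count, sum, and sum of squares.
stats = {c: {"n": 0, "s": np.zeros(4), "ss": np.zeros(4)} for c in (0, 1)}
for xi, yi in zip(X, y):
    stats[yi]["n"] += 1; stats[yi]["s"] += xi; stats[yi]["ss"] += xi ** 2

def params(c):
    st = stats[c]
    mean = st["s"] / st["n"]
    var = st["ss"] / st["n"] - mean ** 2
    return mean, var

def unlearn(i):                        # remove example i from the summed statistics
    c = y[i]
    stats[c]["n"] -= 1; stats[c]["s"] -= X[i]; stats[c]["ss"] -= X[i] ** 2

c = int(y[3])
before = params(c)[0].copy()
unlearn(3)                             # forget training example 3
after = params(c)[0]
print("class mean shifted by:", np.round(after - before, 4))
```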
Causal Inference in Statistics, Social, and Biomedical Sciences
Comprehensive treatment of causal inference methods for observational and experimental data. Covers randomized experiments, matching, propensity scores, instrumental variables, and regression discontinuity designs.
Turkers, Scholars, "Arafat" and "Peace": Cultural Communities and Algorithmic Gold Standards
Why are there still so many jobs? The history and future of workplace automation
Cyber-Proletariat: Global Labour in the Digital Vortex
The Black Box Society: The Secret Algorithms That Control Money and Information
Who Gets What—and Why: The New Economics of Matchmaking and Market Design
Hidden technical debt in machine learning systems
What's Yours Is Mine: Against the Sharing Economy
What's wrong with social simulations?
The Algorithmic Foundations of Differential Privacy
The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies
Heteromation and its (dis)contents: The invisible division of labor between humans and machines
The Fourth Revolution: How the Infosphere is Reshaping Human Reality
Digital Labour and Karl Marx
The relevance of algorithms
Children's Online Privacy Protection Rule (COPPA) — 16 CFR Part 312
The Future of Crowd Work
The China syndrome: Local labor market effects of import competition in the United States
The Ethics of Artificial Intelligence
Turkopticon: Interrupting worker invisibility in Amazon Mechanical Turk
Who Owns the Future?
To Save Everything, Click Here: The Folly of Technological Solutionism
Poisoning Attacks against Support Vector Machines
Investigates poisoning attacks against SVMs where adversaries inject crafted training data to increase test error. Uses gradient ascent to construct malicious data points.
Configuring the Networked Self: Law, Code, and the Play of Everyday Practice
Infrastructure: The Social Value of Shared Resources
The Winograd schema challenge
Open Access
Skills, tasks and technologies: Implications for employment and earnings
Surveillance and alienation in the online economy
Human Computation
The Precariat: The New Dangerous Class
Guide to Protecting the Confidentiality of Personally Identifiable Information (PII)
A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming
Privacy in Context: Technology, Policy, and the Integrity of Social Life
Curriculum Learning
Introduces curriculum learning: training models on examples of increasing difficulty. Shows this acts as a continuation method for non-convex optimization, improving both convergence speed and final generalization.
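A minimal sketch of an easy-to-hard schedule; the distance-to-class-mean difficulty score and the placeholder training call are illustrative assumptions rather than the paper's setup.

```python
# Minimal curriculum sketch: order examples by a difficulty score and expose the
# learner to an expanding, easy-to-hard subset over training stages. Real curricula
# use task-specific scores such as sentence length or a teacher model's loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)); y = rng.integers(0, 2, 1000)

class_means = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
difficulty = np.linalg.norm(X - class_means[y], axis=1)   # proxy difficulty score
order = np.argsort(difficulty)                             # easiest first

n_stages = 5
for stage in range(1, n_stages + 1):
    subset = order[: int(len(X) * stage / n_stages)]       # grow the training pool
    # train_one_epoch(model, X[subset], y[subset])         # placeholder training step
    print(f"stage {stage}: training on {len(subset)} easiest examples")
```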
Causality: Models, Reasoning, and Inference
Foundational book on causal inference introducing structural causal models, do-calculus, and counterfactual reasoning. Unifies graphical models with potential outcomes framework. Second edition with expanded coverage.
Active Learning Literature Survey
Canonical survey of active learning covering uncertainty sampling, query-by-committee, expected error reduction, variance reduction, and density-weighted methods. Establishes foundational taxonomy for the field.
The Google dilemma
Biometric Information Privacy Act (BIPA), 740 ILCS 14
The Cost of Reading Privacy Policies
Robust De-anonymization of Large Sparse Datasets
Understanding knowledge as a commons: From theory to practice
Human-Machine Reconfigurations: Plans and Situated Actions
The polarization of the US labor market
The Wealth of Networks: How Social Production Transforms Markets and Freedom
A taxonomy of privacy
Causal Inference Using Potential Outcomes: Design, Modeling, Decisions
Comprehensive overview of the potential outcomes framework for causal inference. Covers experimental design, observational studies, propensity scores, and the fundamental problem of causal inference.
Reassembling the Social: An Introduction to Actor-Network-Theory
Human Computation
Privacy as Contextual Integrity
Writings of the Luddites
Free Culture: How Big Media Uses Technology and the Law to Lock Down Culture and Control Creativity
Labeling images with a computer game
The skill content of recent technological change: An empirical exploration
Platform competition in two-sided markets
Skill-biased technological change and rising wage inequality: Some problems and puzzles
State of the Union: A Century of American Labor
Free Software, Free Society: Selected Essays of Richard M. Stallman
Modeling Complexity: The Limits to Prediction
HIPAA Privacy Rule — 45 CFR Parts 160 and 164
Simple Demographics Often Identify People Uniquely
Privacy as intellectual property?
Economics of the Public Sector
Free labor: Producing culture for the digital economy
Sorting Things Out: Classification and Its Consequences
Technological determinism
The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary
Social Dilemmas: The Anatomy of Cooperation
The study of social dilemmas is the study of the tension between individual and collective rationality. In a social dilemma, individually reasonable behavior leads to a situation in which everyone is worse off. The first part of this review is a discussion of categories of social dilemmas and how they are modeled. The key two-person social dilemmas (Prisoner’s Dilemma, Assurance, Chicken) and multiple-person social dilemmas (public goods dilemmas and commons dilemmas) are examined. The second part is an extended treatment of possible solutions for social dilemmas. These solutions are organized into three broad categories based on whether the solutions assume egoistic actors and whether the structure of the situation can be changed: Motivational solutions assume actors are not completely egoistic and so give some weight to the outcomes of their partners. Strategic solutions assume egoistic actors, and neither of these categories of solutions involve changing the fundamental structure of the situation. Solutions that do involve changing the rules of the game are considered in the section on structural solutions. I conclude the review with a discussion of current research and directions for future work.