How does statistics handle low- versus high-dimensional problems? The logical starting point is linear regression, since many machine learning techniques build on this simple method. The course then shows how linear regression can be made flexible and adapted to high-dimensional association and causal problems. In addition, I introduce highly flexible, nonparametric machine learning techniques such as tree-based methods, show how they apply to both predictive and causal problems, and contrast them with traditional regression approaches.
To sum up: in this course I summarize the literature on predictive algorithms and provide concrete guidance for their application to causal effect estimation in high dimensions.
Moreover, targeting policy or treatment interventions to specific subgroups of the population requires an understanding of heterogeneous causal effects. Heterogeneous causal effect analysis examines how treatment effects vary across individuals or subgroups in a population. Understanding heterogeneous treatment effects can, for instance, help policymakers identify the socioeconomic groups for which a policy has the largest effects, design more effective policies, and tailor information campaigns to the least-responsive groups. In this course I present recent advances in the treatment effect and machine learning literature on the estimation of conditional average treatment effects (CATEs) from observational data with binary or continuous treatments.
The guidance I provide is supported by comprehensive R tutorials in which I explain the code piece by piece and provide tools for autonomous work.
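For concreteness, here is a minimal sketch of what CATE estimation could look like in R. It assumes the grf package and generic objects X (covariate matrix), Y (outcome vector), and W (binary treatment indicator); these names and the package choice are illustrative and not necessarily the exact tools used in the tutorials:

# Minimal sketch: CATE estimation with a causal forest (grf package).
# Assumes X = covariate matrix, Y = outcome vector, W = 0/1 treatment indicator.
library(grf)

cf <- causal_forest(X, Y, W)

# Individual-level CATE estimates tau(x) = E[Y(1) - Y(0) | X = x]
tau_hat <- predict(cf)$predictions

# Doubly robust estimate of the overall average treatment effect
average_treatment_effect(cf)

# Which covariates drive the estimated effect heterogeneity
variable_importance(cf)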
Online exam modality: Homework (in groups) + short oral presentation on November 1, 2024 (in groups)
Task: Using the tree-based algorithms studied in class and the R code provided in the tutorials, find a dataset you like and use these algorithms to predict an outcome, identifying its main predictors (a minimal sketch of such a pipeline is given below).
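A minimal sketch of this pipeline, assuming the randomForest package and a hypothetical data frame mydata with outcome y; the tutorials may rely on different packages or other tree-based methods:

# Minimal sketch: fit a random forest and inspect variable importance.
# mydata and y are placeholders; replace them with your chosen dataset and outcome.
library(randomForest)

set.seed(123)                               # reproducibility
n     <- nrow(mydata)
train <- sample(n, size = floor(0.8 * n))   # 80/20 train-test split

rf <- randomForest(y ~ ., data = mydata[train, ], importance = TRUE)

# Out-of-sample predictions on the held-out 20%
pred <- predict(rf, newdata = mydata[-train, ])

# Main predictors of the outcome
importance(rf)       # permutation and node-impurity importance measures
varImpPlot(rf)       # plot the most important predictors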
Data sources:
https://archive.ics.uci.edu/datasets
For instance:
1. What drives student alcohol consumption?
2. What are the main predictors of house prices?
Dep. Variable (target):
MEDV: median value of owner-occupied homes in $1000s
https://www.kaggle.com/datasets/fedesoriano/the-boston-houseprice-data
3. What predicts creditworthiness of individuals in Germany?
Dep. Variable (target):
Creditability: 0/1 indicator of whether the bank considers the individual creditworthy
https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data
https://www.kaggle.com/datasets/uciml/german-credit/data
4. What predicts wine quality?
Dep. Variable (target):
Quality: integer score from 0 to 10
https://archive.ics.uci.edu/dataset/186/wine+quality
5. Predict student performance in secondary education
https://archive.ics.uci.edu/dataset/320/student+performance
6. Predict heart disease
Dep. Variable (target):
Num: 0/1 indicator for a diagnosis of heart disease
https://archive.ics.uci.edu/dataset/45/heart+disease
Course instructor: Dr. Marica Valente