Date Awarded


Document Type


Degree Name

Doctor of Philosophy (Ph.D.)


Computer Science


Peter Kemper

Committee Member

Gang Zhou

Committee Member

Xu Liu

Committee Member

Zhengming Liu

Committee Member

Daniel Runfola


In recent years, machine learning methods such as deep learning have enabled us to predict with good precision from large training data sets. However, for many problems, we care more about causality than prediction. For example, rather than knowing only that smoking is statistically associated with lung cancer, we want to know whether smoking causes lung cancer. With causal knowledge, we can understand how the world progresses and how an outcome changes when we influence its cause. This thesis explores how to quantify the causal effects of a treatment on an observable outcome in the presence of heterogeneity. We focus on the causal impacts that World Bank projects have on environmental changes. The high-dimensional World Bank data set includes covariates from various sources and of different types: time series data such as Normalized Difference Vegetation Index (NDVI) values, temperature, and precipitation; spatial data such as longitude and latitude; and many other features such as distance to roads and rivers. We estimate the heterogeneous causal effect of World Bank projects on the change in NDVI values. Building on the causal tree and causal forest methods proposed by Athey, we describe the challenges we met and the lessons we learned when applying these two methods to an actual World Bank data set. We present our observations of the heterogeneous causal effect of World Bank projects on environmental change. As we do not have the ground truth for the World Bank data set, we validate the results using synthetic data in simulation studies. The synthetic data are sampled from distributions fitted to the World Bank data set. We compare the results of various causal inference methods and observe that feature scaling is very important for generating meaningful data and results. In addition, we investigate the performance of the causal forest under various parameters such as leaf size, number of confounders, and data size.
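The validation idea above, checking an estimator against synthetic data whose true effect is known, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the single covariate, the randomized treatment, the step-shaped true effect, and the function names `simulate` and `subgroup_effect` are hypothetical and are not taken from the thesis or from the World Bank data. The estimator is a within-subgroup difference in means, the kind of estimate a causal tree reports for a leaf.

```python
import numpy as np

def simulate(n, rng):
    """Draw toy data with a known heterogeneous treatment effect (hypothetical setup)."""
    x = rng.uniform(-1.0, 1.0, size=n)      # one covariate
    t = rng.integers(0, 2, size=n)          # randomized binary treatment
    tau = np.where(x > 0, 1.0, 0.0)         # true effect: 1 if x > 0, else 0
    y = 0.5 * x + tau * t + rng.normal(scale=0.1, size=n)
    return x, t, y, tau

def subgroup_effect(y, t, mask):
    """Difference in mean outcome between treated and control units in a subgroup."""
    treated = y[mask & (t == 1)].mean()
    control = y[mask & (t == 0)].mean()
    return treated - control

rng = np.random.default_rng(0)
x, t, y, tau = simulate(20000, rng)
print("estimated effect, x > 0 :", subgroup_effect(y, t, x > 0))
print("estimated effect, x <= 0:", subgroup_effect(y, t, x <= 0))
```

Because the true effect is fixed by construction, the two estimates can be compared directly against 1 and 0, which is what a simulation study makes possible when real data have no ground truth.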
Causal forest is a black-box model, and its results cannot be easily interpreted by humans. To explain an individual project, we take advantage of the tree structure to select neighbors of the project to be explained and assign weights to those neighbors according to dynamic distance metrics. We then fit a linear regression model on these neighbors and use the learned model to interpret the results. In summary, World Bank projects have small impacts on environmental change, and the result of an individual project can be interpreted using a linear regression model learned from closed projects.
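The interpretation step described above (select neighbors, weight them, fit a linear model) can be sketched as follows. This is a simplified stand-in rather than the thesis's procedure: plain Euclidean nearest neighbors on standardized features replace the tree-based neighbor selection, a Gaussian kernel replaces the dynamic distance metric, and the function name `local_linear_explanation` and its parameters are hypothetical.

```python
import numpy as np

def local_linear_explanation(X, tau_hat, x0, n_neighbors=50):
    """Fit a distance-weighted linear surrogate around x0.

    X       : (n, d) covariates of reference projects
    tau_hat : (n,) effect estimates produced by a black-box model
    x0      : (d,) covariates of the project to explain
    """
    # Standardize features so that no single scale dominates the distance.
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)
    Z, z0 = (X - mu) / sigma, (x0 - mu) / sigma

    # Assumption: Euclidean nearest neighbors stand in for tree-based selection.
    dist = np.linalg.norm(Z - z0, axis=1)
    idx = np.argsort(dist)[:n_neighbors]

    # Assumption: Gaussian kernel weights stand in for the dynamic distance metric.
    bandwidth = dist[idx].max() + 1e-12
    w = np.exp(-(dist[idx] / bandwidth) ** 2)

    # Weighted least squares: scale rows by sqrt(weight), then solve ordinary least squares.
    A = np.hstack([np.ones((len(idx), 1)), Z[idx] - z0])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], tau_hat[idx] * sw, rcond=None)

    # coef[0] is the surrogate's prediction at x0; coef[1:] are slopes
    # per standardized feature.
    return coef[0], coef[1:]
```

Centering the design matrix on the explained project makes the intercept the surrogate's local prediction, and the slopes indicate which standardized features move the estimated effect most in that neighborhood.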



© The Author