1 Introduction
Understanding the causal effects of an intervention is a key question in many applications, from personalised medicine to marketing (e.g. Sun et al. (2015); Wager and Athey (2017); Alaa and van der Schaar (2017)). Predicting the causal outcome typically involves dealing with high-dimensional observational data that is frequently subject to the effects of confounding.
In general, we distinguish between measured and hidden confounding. When confounders are directly measured, they may be accounted for using techniques that correct for their effects, such as propensity reweighting (IPS) or covariate shift (Hernán and Robins, 2006; Rosenbaum and Rubin, 1984). In contrast, to account for hidden confounding, proxy variables may be used as noisy representatives of latent confounders (Greenland and Lash, 2008; Pearl, 2012; Kuroki and Pearl, 2014; Louizos et al., 2017). Both approaches can, however, only be applied when covariate data is completely measured. This assumption does not hold in a large number of settings such as medicine. For example, doctors are interested in identifying treatments that improve patient outcomes, and have to base decisions on hundreds of potentially confounding variables such as age and genetic factors. Here, a doctor may readily have access to many routine measurements such as blood count data for all patients, but may only have genetic information for some patients. Inferring the causal effects of a treatment requires learning a joint distribution over covariates and confounders of patients whose data is completely observable, while simultaneously transferring this knowledge to patients whose data is missing. This is rarely achievable in practice, since it requires integrating over all missing covariates.
We propose addressing the problem of performing causal inference with partial covariate information from a decision-theoretic point of view. Specifically, we assume that a fixed set of measurements is unavailable for a subset of the data (or patients) at test time. The key idea is to use the Information Bottleneck (IB) criterion (Tishby et al., 2000) to perform a sufficient reduction of the covariates and recover a distribution of the confounding information. The IB enables us to build a discrete reference class over patients whose covariate data is complete, to which we can map patients with incomplete data and estimate treatment effects on the basis of such a mapping. Finally, we demonstrate that our method outperforms existing approaches across established causal inference benchmarks and a real-world application for treating sepsis.
2 Method
We refer to our model as the Cause-Effect IB (CEIB). In Figure 1, we illustrate an overview of the possible configurations for performing causal inference and present our model in the context of existing work. The corresponding causal graphs for Cases I and II are shown in Figure 2. The major difference between I and II is the reversal of the arrow between and , and the fact that in Case II confounders are not measured, but indirectly observed via noisy proxies.
Influence diagrams of the two cases considered in this paper. Red and green circles correspond to observed and latent random variables respectively, while blue rectangles represent interventions. In Case I, we identify a low-dimensional representation of measured covariates to estimate the effects of an intervention on outcome . In Case II, the arrow between and is reversed and confounders are indirectly measured via proxy variables, indicated by an orange circle here. We identify a low-dimensional representation and use this to explicitly estimate as well as . In both cases, representation is used to make inferences for a subset of patients where only partial covariate information is available.
In our paper, we consider the decision-theoretic approach of Dawid (2007) to estimate the causal effect where we have both hidden and measured confounding with incomplete covariates. This involves computing the average causal effect (ACE) of on . Dawid (2007) shows that the ACE and observational ACE are equivalent under the conditional independence assumption . This assumption expresses that the distribution of is the same in the interventional and observational regimes. It can also be extended to account for the notion of confounding. Here, the treatment assignment may be ignored when estimating , provided a sufficient covariate and . Formally, is a sufficient covariate for the effect of on outcome if and . It can also be shown via Pearl’s backdoor criterion (Pearl, 2009) that the ACE may be defined in terms of the Specific Causal Effect (SCE),
(1) 
where
(2) 
Importantly, estimating the ACE only requires computing a distribution in Figure 2. In what follows, we use the IB to learn a sufficient covariate that allows us to approximate this distribution.
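As a rough illustration of how an ACE can be assembled from class-specific effects, the sketch below applies the standard backdoor adjustment over a discrete sufficient covariate: the SCE is computed within each class and then averaged under the class distribution. The function names, the toy data and the choice of a two-class covariate are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def specific_causal_effect(y, t, z, n_classes):
    """Per-class effect: E[y | z=k, t=1] - E[y | z=k, t=0] (NaN if an arm is empty)."""
    sce = np.full(n_classes, np.nan)
    for k in range(n_classes):
        treated = y[(z == k) & (t == 1)]
        control = y[(z == k) & (t == 0)]
        if treated.size and control.size:
            sce[k] = treated.mean() - control.mean()
    return sce

def average_causal_effect(y, t, z, n_classes):
    """Average the per-class effects under the empirical class distribution p(z)."""
    sce = specific_causal_effect(y, t, z, n_classes)
    p_z = np.bincount(z, minlength=n_classes) / z.size
    valid = ~np.isnan(sce)
    # Renormalise over the classes in which both treatment arms are observed.
    return float(np.sum(sce[valid] * p_z[valid]) / np.sum(p_z[valid]))

# Toy confounded data (hypothetical): true effect is 2, class 1 is treated more often.
z = np.array([0, 0, 0, 0, 1, 1, 1, 1])
t = np.array([1, 0, 0, 0, 1, 1, 1, 0])
y = 2.0 * t + 3.0 * z

naive = y[t == 1].mean() - y[t == 0].mean()   # ignores the confounder
ace = average_causal_effect(y, t, z, n_classes=2)
```

On this toy data the unadjusted difference of means is biased upwards by the confounder, while the class-wise average recovers the effect used to generate the outcomes.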
Case I: Measured Confounding
This case occurs when we have observational data where all the relevant confounding variables are measured, but where a fixed set of covariates is only available for some subset of the data at test time. Let and be our covariate sets (both available at training). We adapt the IB for learning the outcome of a therapy when partial covariate information is available for at test time. To do so, we consider the following parametric form,
(3) 
where and are low-dimensional discrete representations of the covariate data, is a concatenation of and and represents the mutual information parameterised by networks , , and respectively. We assume a parametric form of the conditionals , , , , as well as Markov chain . The three terms in Equation 3 have the following forms: as a result of the Markov assumption in the IB model. Here is the entropy of . For the decoder model, we use an architecture similar to the TARnet (Johansson et al., 2016), where we replace conditioning on high-dimensional covariates with conditioning on latent . We can thus express the conditionals as,
(4) 
with logistic function , and outcome given by a Gaussian distribution parameterised with a TARnet with . Note that the terms correspond to neural networks.
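To make the decoder description more concrete, the following is a minimal sketch of a TARnet-style decoder that conditions on a latent code rather than on the raw covariates: one network gives the treatment probability through a logistic function, and two outcome heads give the Gaussian mean under control and treatment. The layer sizes, the random initialisation and all names (`treat_net`, `head0`, `head1`, `decode`) are assumptions for illustration only; the paper's exact parameterisation is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init(d_in, d_hidden, d_out):
    """Randomly initialised weights of a small two-layer network (illustrative)."""
    return (rng.normal(size=(d_in, d_hidden)) * 0.1, np.zeros(d_hidden),
            rng.normal(size=(d_hidden, d_out)) * 0.1, np.zeros(d_out))

def mlp(params, h):
    """Two-layer network with a tanh hidden layer."""
    w1, b1, w2, b2 = params
    return np.tanh(h @ w1 + b1) @ w2 + b2

d_z = 4                        # latent dimension, chosen for illustration
treat_net = init(d_z, 8, 1)    # models p(t = 1 | z)
head0 = init(d_z, 8, 1)        # outcome mean under t = 0
head1 = init(d_z, 8, 1)        # outcome mean under t = 1

def decode(z, t):
    """Return p(t = 1 | z) and the Gaussian mean of y given (z, t)."""
    p_t = sigmoid(mlp(treat_net, z))[:, 0]
    mu0 = mlp(head0, z)[:, 0]
    mu1 = mlp(head1, z)[:, 0]
    mu_y = np.where(t == 1, mu1, mu0)   # pick the head matching the treatment
    return p_t, mu_y

z = rng.normal(size=(5, d_z))
t = np.array([0, 1, 1, 0, 1])
p_t, mu_y = decode(z, t)
```

Using separate heads per treatment arm keeps the treatment indicator from being washed out by the shared layers, which is the usual motivation for this architecture.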
Case II: Hidden Confounding
This case is analogous to the work of Louizos et al. (2017). We, however, treat proxies as measured confounders and propose using Case I to estimate the causal effect here. Using Case I is permissible since both DAGs in Figure 2 are Markov equivalent, and the causal direction between and can only be determined by additional assumptions on the causal graph. However, assuming the causal structure in Figure 1(b) as in Louizos et al. (2017) requires the definition of a complex prior over . Hence, it may be more natural to treat all covariates, including proxies, as measured confounders, as we propose in this paper. In doing so, we compress the relevant information to a sufficient covariate as described in Case I.
Once we can estimate in both cases using the proposed model, we can compute the ACE. When given a test patient with partial covariates, we can assign them to the closest equivalence class of patients with similar characteristics, and approximate the effect of treatments on this basis.
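The assignment step can be pictured with a small sketch. It assumes that a latent code has already been obtained for each test patient from their available covariates, that each equivalence class is summarised by a centroid in the latent space, and that a per-class treatment effect has been estimated on the complete training data. The nearest-centroid rule, the centroids and the effect values below are all hypothetical and only illustrate the idea of mapping a patient to a class and reading off the class-level effect.

```python
import numpy as np

def assign_to_class(z_test, centroids):
    """Index of the nearest class centroid (Euclidean distance) for each latent code."""
    dists = np.linalg.norm(z_test[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Hypothetical class centroids and per-class effects estimated on complete data.
centroids = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
class_effect = np.array([0.5, -1.0, 2.0])

# Hypothetical latent codes of three test patients with partial covariates.
z_test = np.array([[0.2, -0.1], [2.5, 3.4], [-2.8, 1.5]])

classes = assign_to_class(z_test, centroids)
effects = class_effect[classes]   # effect approximated from the assigned class
```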
3 Experiments
We demonstrate the performance of our approach on a high-dimensional real-world task for managing and treating sepsis. Additional experiments are in the supplement. For this experiment, we make use of data from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC-III) database (Johnson et al., 2016). We focus on patients satisfying Sepsis-3 criteria (16 804 patients in total). For each patient, we have a 48-dimensional set of physiological parameters including demographics, lab values, vital signs and input/output events, where covariates are partially incomplete. Our outcomes correspond to the odds of mortality, while we binarise medical interventions according to whether or not a vasopressor is administered. The data set is split 60/20/20% into training/validation/testing sets. We train our model with six 4-dimensional Gaussian mixture components and analyse the information curves and cluster compositions.
The information curves for and are shown in Figures 2(a) and 2(b) respectively. We observe that we can perform a sufficient reduction of the high-dimensional covariate information to between 4 and 6 dimensions while achieving high predictive accuracy of outcomes . Since there is no ground truth available for the sepsis task, we do not have access to the true confounding variables. However, we can perform an analysis on the basis of the clusters obtained over the latent space. Here, we see that we can characterise the patients in each cluster according to their initial SOFA (Sequential Organ Failure Assessment) scores. SOFA scores range between 1-4 and are used to track a patient’s stay in hospital. In Figure 4, we observe clear differences in cluster composition relative to the SOFA scores. Clusters 2, 5 and 6 tend to have higher proportions of patients with lower SOFA scores, while Clusters 3 and 4 have larger proportions of patients with higher SOFA scores. This result suggests that a patient’s initial SOFA score is potentially a confounder when determining how to administer subsequent treatments and predicting their odds of in-hospital mortality. This is consistent with medical studies such as Medam et al. (2017); Studnek et al. (2012), whose authors indicate that high initial SOFA scores were likely to affect patients’ overall chances of survival and the treatments administered in hospital. Overall, performing such analyses for tasks like sepsis may help correct for confounding and assist in establishing potential guidelines.
4 Discussion
CEIB makes state-of-the-art predictions of the ACE that are robust against confounding.
CEIB learns a low-dimensional, interpretable representation of latent confounding.
Since CEIB extracts only the information that is relevant for making predictions, it is able to learn a low-dimensional representation of the confounding effect and uses this to make predictions. In particular, the introduction of a discrete cluster structure in the latent space allows an easier interpretation of the confounding effect. Similar methods such as Louizos et al. (2017) typically use a higher dimensional representation to account for these effects without gains in performance. This is likely a result of misrepresenting the true confounding effect. Modelling the task as an IB alleviates this problem. For sepsis, we identify a latent space of 6 dimensions when predicting odds of mortality, where clusters exhibit a distinct structure with respect to a patient’s initial SOFA score.
CEIB enables estimating the causal effect with incomplete covariates.
Unlike previous approaches, CEIB can deal with incomplete covariate data at test time by introducing a discrete latent space. Specifically, we learn equivalence classes among patients such that the approximate effects of treatments can be computed where data is incomplete.
References
 Alaa and van der Schaar (2017) Ahmed M. Alaa and Mihaela van der Schaar. Bayesian inference of individualized treatment effects using multitask gaussian processes. CoRR, abs/1704.02801, 2017.
 Almond et al. (2005) Douglas Almond, Kenneth Y Chay, and David S Lee. The costs of low birth weight. The Quarterly Journal of Economics, 120(3):1031–1083, 2005.
 Dawid (2007) Philip Dawid. Fundamentals of statistical causality. Technical report, Department of Statistical Science, University College London, 2007.
 Greenland and Lash (2008) Sander Greenland and Timothy Lash. Bias analysis. Modern Epidemiology, pages 345 – 380, 2008.
 Hernán and Robins (2006) Miguel A Hernán and James M Robins. Estimating causal effects from epidemiological data. Journal of Epidemiology & Community Health, 60(7):578–586, 2006.
 Hill (2011) Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.

 Johansson et al. (2016) Fredrik D. Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 3020–3029. JMLR.org, 2016.
 Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Liwei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific data, 3:160035, 2016.
 Kuroki and Pearl (2014) Manabu Kuroki and Judea Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, 2014.
 Louizos et al. (2017) Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latent-variable models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6446–6456. Curran Associates, Inc., 2017.
 McCormick et al. (2013) Marie C. McCormick, Jeanne Brooks-Gunn, and Stephen L. Buka. Infant health and development program, phase iv, 2001-2004 united states. 2013. doi: 10.3886/ICPSR23580.v2.
 Medam et al. (2017) Sophie Medam, Laurent Zieleskiewicz, Gary Duclos, Karine Baumstarck, Anderson Loundou, Julie Alingrin, Emmanuelle Hammad, Coralie Vigne, François Antonini, and Marc Leone. Medicine, 96(50), 12 2017. doi: 10.1097/MD.0000000000009241.
 Pearl (2009) Judea Pearl. Causality. Cambridge university press, 2009.
 Pearl (2012) Judea Pearl. On measurement bias in causal inference. arXiv preprint arXiv:1203.3504, 2012.
 Rosenbaum and Rubin (1984) Paul R Rosenbaum and Donald B Rubin. Reducing bias in observational studies using subclassification on the propensity score. Journal of the American statistical Association, 79(387):516–524, 1984.
 Shalit et al. (2017) Uri Shalit, Fredrik D. Johansson, and David Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3076–3085, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
 Studnek et al. (2012) Jonathan R Studnek, Melanie R Artho, Craymon L Garner Jr, and Alan E Jones. The impact of emergency medical services on the ed care of severe sepsis. The American journal of emergency medicine, 30(1):51–56, 2012.
 Sun et al. (2015) Wei Sun, Pengyuan Wang, Dawei Yin, Jian Yang, and Yi Chang. Causal inference via sparse additive models with application to online advertising. In AAAI, pages 297–303, 2015.
 Tishby et al. (2000) Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

 Wager and Athey (2017) Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 2017.
Appendix A Additional Experiments
A.1 Infant Health and Development Program
The Infant Health and Development Program (IHDP) [McCormick et al., 2013, Hill, 2011] is a randomised controlled experiment assessing the impact of educational intervention on outcomes of premature, low birth weight infants born in 1984-1985. Measurements from children and their mothers were collected for studying the effects of childcare and home visits from a trained specialist on test scores. Briefly, the study contains information about the children and their mothers/caregivers. Data on the children include treatment group, sex, birth weight and health indices. Information about the mothers includes maternal age, mother’s race as well as educational achievement. Hill [2011] extracts features and treatment assignments from the real-world clinical trial, and introduces selection bias to the data artificially by removing a non-random portion of the treatment group, in particular children with non-white mothers. In total, the data set consists of 747 subjects (139 treated, 608 control), each represented by 25 covariates measuring properties of the child and their mother. The data set is split 60/20/20% into training/validation/testing sets.
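A 60/20/20% split of this kind can be sketched as below. The paper does not say how subjects were shuffled or how fractional counts were rounded, so the seed, the rounding rule and the function name `split_indices` are assumptions for illustration.

```python
import numpy as np

def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    # The test set takes whatever remains, so the three parts always cover n.
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

# 747 is the number of IHDP subjects stated in the text.
train_idx, val_idx, test_idx = split_indices(747)
```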
For our experiments, we compare the performance of CEIB for predicting the ACE against several existing baselines as in Louizos et al. [2017]: OLS1 is a least squares regression; OLS2 uses two separate least squares regressions to fit the treatment and control groups respectively; TARnet is a feedforward neural network from Shalit et al. [2017]; KNN is a k-nearest neighbours regression; RF is a random forest; BNN is a balancing neural network [Johansson et al., 2016]; BLR is a balancing linear regression [Johansson et al., 2016], and CFRW is a counterfactual regression that uses the Wasserstein distance [Shalit et al., 2017].
Table 1: Within-sample and out-of-sample mean and standard errors for the metrics across models on the IHDP data set. A smaller value indicates better performance. Bold values indicate the method with the best performance. Methods compared: OLS1, OLS2, KNN, BLR, TARnet, BNN, RF, CEVAE, CFRW, CEIB.
We train our model with , dimensional Gaussian mixture components, although our method can be applied without loss of generality to any number of dimensions. To assess the ability to estimate treatment effects on the basis of partial information, we artificially exclude three covariates at test time. These are covariates that exhibit a moderate correlation to the hidden confounder ethnicity. The results are shown in Table 1. Overall, our approach exhibits good performance for both in-sample and out-of-sample predictions, while simultaneously accounting for partial covariate information.
To assess the interpretability of the proposed approach and the ability to account for hidden confounding, we perform an analysis on the latent space of our model. First, we plot two information curves illustrating the number of latent dimensions required to reconstruct the output for the terms and respectively. These results are shown in Figure 4(a) and Figure 4(b). In particular, we perform this analysis when the data set of subjects is both derandomised and randomised (i.e. when we do not introduce selection bias into the data set). Comparing the information curves in Figure 4(a) confirms that when we do not derandomise the data, the information content in the treatment tends to be closer to 0, whereas the opposite is true when the data is derandomised. The information curves in Figure 4(b) additionally demonstrate our model’s ability to account for indirect effects of confounding when predicting the overall outcomes: when data is derandomised, we are able to reconstruct treatment outcomes more accurately. Overall, the results from Figures 4(a) and 4(b) highlight that there is indeed a hidden confounding effect that we can account for using the proposed approach.
Next, we perform an analysis of the discretised latent space by comparing the proportions of ethnic groups of test subjects in each cluster from the Gaussian mixture to see if we can recover the hidden confounding effect. These results are shown in Figure 6 where we plot a hard assignment of test subjects to clusters on the basis of their ethnicity. Evidently, the clusters exhibit a clear structure with respect to the ethnic groups. In particular, Cluster 2 in Figure 5(b) has a significantly higher proportion of non-white members in the derandomised setting, confirming that we are able to correctly identify the true confounding effect and account for this when making predictions. Finally, we perform similar analyses and assess the error in estimating the ACE when varying the number of mixture components in Figure 7. When the number of clusters is larger, the clusters get smaller and it becomes more difficult to reliably estimate the ACE since we average over the cluster members to account for partial covariate information at test time. Here, model selection is made by observing where the error in estimating the ACE stabilises (anywhere between 4-7 mixture components).
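The cluster-composition analysis described above amounts to tabulating, for each cluster, the share of subjects that belong to each group. A minimal sketch is given below; the cluster and group labels are made-up toy values, and `cluster_composition` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def cluster_composition(cluster_ids, group_ids, n_clusters, n_groups):
    """Row k holds the proportion of each group among subjects hard-assigned to cluster k."""
    counts = np.zeros((n_clusters, n_groups))
    np.add.at(counts, (cluster_ids, group_ids), 1)   # accumulate repeated index pairs
    totals = counts.sum(axis=1, keepdims=True)
    # Empty clusters keep a row of zeros instead of dividing by zero.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Toy hard assignments (hypothetical): 8 subjects, 3 clusters, 2 groups.
cluster_ids = np.array([0, 0, 0, 1, 1, 1, 1, 2])
group_ids = np.array([0, 0, 1, 1, 1, 1, 0, 0])
props = cluster_composition(cluster_ids, group_ids, n_clusters=3, n_groups=2)
```

A cluster whose row is dominated by one group, as in the second row of this toy example, is the kind of structure the analysis looks for when checking whether a hidden confounder has been recovered.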
A.2 Binary Treatment Outcome on Twins
Like Louizos et al. [2017], we apply CEIB to a benchmark task using the birth data of twins in the USA between 1989 and 1991 [Almond et al., 2005]. Here, treatment is a binary indicator of being the heavier twin at birth, while outcome corresponds to the mortality within a year after birth. Since mortality is rare, we consider only same-sex twins with weights less than 2 kg, which results in 11 984 pairs of twins. Each twin has a set of 46 covariates including information about their parents such as their level of education, race, incidence of renal disease, diabetes, smoking etc. as well as whether the birth took place in hospital or at home and the number of gestation weeks prior to birth.
To simulate an observational study, we selectively hide one of the twins. To illustrate the ability of CEIB to be applied to Case II where we treat proxy variables as measured confounders, we base the treatment assignment on a single variable which is highly correlated with the outcome: GESTAT10, the number of gestation weeks prior to birth. This has values from 0-9 that correspond to the weeks of gestation before birth, i.e. birth before 20 weeks gestation, 20-27 weeks of gestation, etc. Analogous to Louizos et al. [2017] we set treatment to for , where is GESTAT10 and are the 45 remaining covariates. Since CEIB can account for incomplete covariates, we artificially exclude 3 covariates from at test time.
Like Louizos et al. [2017], proxies are created with a one-hot encoding of , replicated 3 times and randomly flipping the 30 bits, where the flipping probability varies from 0.05 to 0.15. There may also be additional proxy variables for in the data from the set of variables. Our task is to predict the ACE. Specifically, we compare the performance of CEIB to CEVAE (with a varying number of hidden layers), TARnet (with varying numbers of hidden layers) and logistic regression (LR). These results are shown in Figure 8. Here too, CEIB achieves close to state-of-the-art performance on the Twins task.
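The proxy construction described above (a one-hot encoding of a 10-valued variable, replicated 3 times to give 30 bits, each flipped independently with a small probability) can be sketched as follows. The function name, the seed and the example values are assumptions for illustration; only the encoding, the replication count and the range of flip probabilities come from the text.

```python
import numpy as np

def make_noisy_proxies(values, n_levels=10, n_copies=3, flip_prob=0.05, seed=0):
    """One-hot encode `values`, replicate the code, then flip each bit with prob. `flip_prob`."""
    rng = np.random.default_rng(seed)
    one_hot = np.eye(n_levels, dtype=int)[values]     # shape (n, n_levels)
    proxies = np.tile(one_hot, (1, n_copies))         # shape (n, n_levels * n_copies)
    flips = rng.random(proxies.shape) < flip_prob     # independent Bernoulli mask
    return np.where(flips, 1 - proxies, proxies)

# Hypothetical values of a 10-level variable such as GESTAT10 (levels 0-9).
gestat = np.array([0, 3, 9, 5])
clean = make_noisy_proxies(gestat, flip_prob=0.0)
noisy = make_noisy_proxies(gestat, flip_prob=0.15)
```

With a flip probability of zero each row contains exactly three ones (one per replicate), which is a convenient sanity check before adding noise.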