PhD Defense: Shimeng Huang
Title: Causal Inference for Complex Data Structures
Abstract:
This thesis explores and develops causality-related methodology through three individual projects, each focusing on a specific complex data structure. While the problems investigated are diverse, they all center around the concepts of intervention and (causal) invariance. Chapter 1 introduces these foundational ideas, which pervade the remainder of the thesis. Brief introductions of the data structures addressed in the subsequent chapters are also provided in this chapter.
Chapter 2 dives into the first complex data structure, compositional data, where observations lie on a simplex (i.e., each observation is constrained to sum to one). This project is motivated by microbiome research, in which compositions of microbial strains are typically observed. We examine how to define interpretable statistical targets that quantify the effects of the components on a response when the predictor in a regression or classification problem is compositional. We develop non-parametric estimators of these effects based on kernels that are specifically suited for compositional data. Our estimators are evaluated on 33 publicly available microbiome datasets and are shown to achieve performance comparable or superior to that of state-of-the-art machine learning methods.
In Chapter 3, we consider sequential data and introduce a new type of change point, termed causal change points, which indicate changes in the causal mechanism relative to a response variable under appropriate assumptions. We propose methods to detect and localize these change points without requiring prior knowledge of the causal structure in the data. These methods leverage the reverse concept of causal invariance—the property that the conditional distribution of the response, given its parents, remains fixed under interventions that do not directly target the response. We demonstrate our methods using two real-world datasets, one on air quality and the other on macroeconomic policy.
The final chapter, Chapter 4, considers sparse causal effect estimation using two-sample summary statistics, a type of summary-level data commonly used in genetics research. In a two-sample summary statistics setting, one does not have access to individual-level data but only to the marginal associations obtained from two samples: one containing paired observations of instruments and covariates, and the other containing paired observations of instruments and the response. We propose a generalized, two-sample summary statistics version of the test statistic considered in spaceIV [Pfister and Peters, 2022], and prove that our proposed test has uniform asymptotic level. We apply our method, spaceTSIV, to real proteomic and gene-expression data to discover possible drug targets for coronary artery disease.
Thesis for download
Supervisors:
Associate Professor Niklas Pfister
University of Copenhagen
Professor Jonas Peters
ETH Zurich
Professor Susanne Ditlevsen
University of Copenhagen
Assessment Committee:
Professor Niels Richard Hansen (chair)
University of Copenhagen
Professor Stefan Bauer
Technical University of Munich
Assistant Professor Sara Magliacane
University of Amsterdam