Supervised learning and model analysis with compositional data

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

Supervised learning and model analysis with compositional data. / Huang, Shimeng; Ailer, Elisabeth; Kilbertus, Niki; Pfister, Niklas.

In: PLOS Computational Biology, Vol. 19, No. 6, e1011240, 2023.

Harvard

Huang, S, Ailer, E, Kilbertus, N & Pfister, N 2023, 'Supervised learning and model analysis with compositional data', PLOS Computational Biology, vol. 19, no. 6, e1011240. https://doi.org/10.1371/journal.pcbi.1011240

APA

Huang, S., Ailer, E., Kilbertus, N., & Pfister, N. (2023). Supervised learning and model analysis with compositional data. PLOS Computational Biology, 19(6), Article e1011240. https://doi.org/10.1371/journal.pcbi.1011240

Vancouver

Huang S, Ailer E, Kilbertus N, Pfister N. Supervised learning and model analysis with compositional data. PLOS Computational Biology. 2023;19(6):e1011240. https://doi.org/10.1371/journal.pcbi.1011240

Author

Huang, Shimeng; Ailer, Elisabeth; Kilbertus, Niki; Pfister, Niklas. / Supervised learning and model analysis with compositional data. In: PLOS Computational Biology. 2023; Vol. 19, No. 6.

Bibtex

@article{3dac294ed50a40f4b139bcd92aab4214,

title = "Supervised learning and model analysis with compositional data",

abstract = "Supervised learning, such as regression and classification, is an essential tool for analyzing modern high-throughput sequencing data, for example in microbiome research. However, due to the compositionality and sparsity, existing techniques are often inadequate. Either they rely on extensions of the linear log-contrast model (which adjust for compositionality but cannot account for complex signals or sparsity) or they are based on black-box machine learning methods (which may capture useful signals, but lack interpretability due to the compositionality). We propose KernelBiome, a kernel-based nonparametric regression and classification framework for compositional data. It is tailored to sparse compositional data and is able to incorporate prior knowledge, such as phylogenetic structure. KernelBiome captures complex signals, including in the zero-structure, while automatically adapting model complexity. We demonstrate on par or improved predictive performance compared with state-of-the-art machine learning methods on 33 publicly available microbiome datasets. Additionally, our framework provides two key advantages: (i) We propose two novel quantities to interpret contributions of individual components and prove that they consistently estimate average perturbation effects of the conditional mean, extending the interpretability of linear log-contrast coefficients to nonparametric models. (ii) We show that the connection between kernels and distances aids interpretability and provides a data-driven embedding that can augment further analysis. KernelBiome is available as an open-source Python package on PyPI and at https://github.com/shimenghuang/KernelBiome.",

author = "Shimeng Huang and Elisabeth Ailer and Niki Kilbertus and Niklas Pfister",

note = "Publisher Copyright: {\textcopyright} 2023 Huang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.",

year = "2023",

doi = "10.1371/journal.pcbi.1011240",

language = "English",

volume = "19",

journal = "PLOS Computational Biology",

issn = "1553-734X",

publisher = "Public Library of Science",

number = "6",

}

RIS

TY - JOUR

T1 - Supervised learning and model analysis with compositional data

AU - Huang, Shimeng

AU - Ailer, Elisabeth

AU - Kilbertus, Niki

AU - Pfister, Niklas

N1 - Publisher Copyright: © 2023 Huang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

PY - 2023

Y1 - 2023

N2 - Supervised learning, such as regression and classification, is an essential tool for analyzing modern high-throughput sequencing data, for example in microbiome research. However, due to the compositionality and sparsity, existing techniques are often inadequate. Either they rely on extensions of the linear log-contrast model (which adjust for compositionality but cannot account for complex signals or sparsity) or they are based on black-box machine learning methods (which may capture useful signals, but lack interpretability due to the compositionality). We propose KernelBiome, a kernel-based nonparametric regression and classification framework for compositional data. It is tailored to sparse compositional data and is able to incorporate prior knowledge, such as phylogenetic structure. KernelBiome captures complex signals, including in the zero-structure, while automatically adapting model complexity. We demonstrate on par or improved predictive performance compared with state-of-the-art machine learning methods on 33 publicly available microbiome datasets. Additionally, our framework provides two key advantages: (i) We propose two novel quantities to interpret contributions of individual components and prove that they consistently estimate average perturbation effects of the conditional mean, extending the interpretability of linear log-contrast coefficients to nonparametric models. (ii) We show that the connection between kernels and distances aids interpretability and provides a data-driven embedding that can augment further analysis. KernelBiome is available as an open-source Python package on PyPI and at https://github.com/shimenghuang/KernelBiome.

UR - http://www.scopus.com/inward/record.url?scp=85164748144&partnerID=8YFLogxK

U2 - 10.1371/journal.pcbi.1011240

DO - 10.1371/journal.pcbi.1011240

M3 - Journal article

C2 - 37390111

AN - SCOPUS:85164748144

VL - 19

JO - PLOS Computational Biology

JF - PLOS Computational Biology

SN - 1553-734X

IS - 6

M1 - e1011240

ER -

ID: 360261541

Department of Mathematical Sciences