Scalable & Interpretable Dimensionality Reduction for scRNA-seq
2025-05-20 01:38:42 | Author: hackernoon.com

Abstract and 1. Introduction

2. Background

2.1 Amortized Stochastic Variational Bayesian GPLVM

2.2 Encoding Domain Knowledge through Kernels

3. Our Model

3.1 Pre-Processing and Likelihood

3.2 Encoder

4. Results and Discussion

4.1 Each Component is Crucial to Modified Model Performance

4.2 Modified Model Achieves Significant Improvements over Standard Bayesian GPLVM and is Comparable to scVI

4.3 Consistency of Latent Space with Biological Factors

5. Conclusion, Acknowledgement, and References

A. Baseline Models

B. Experiment Details

C. Latent Space Metrics

D. Detailed Metrics

ABSTRACT

Dimensionality reduction is crucial for analyzing large-scale single-cell RNA-seq data. Gaussian Process Latent Variable Models (GPLVMs) offer an interpretable dimensionality reduction method, but current scalable models lack effectiveness in clustering cell types. We introduce an improved model, the amortized stochastic variational Bayesian GPLVM (BGPLVM), tailored for single-cell RNA-seq with specialized encoder, kernel, and likelihood designs. This model matches the performance of the leading single-cell variational inference (scVI) approach on synthetic and real-world COVID datasets, and effectively incorporates cell-cycle and batch information to reveal more interpretable latent structures, as we demonstrate on an innate immunity dataset.

1 INTRODUCTION

Single-cell RNA sequencing (scRNA-seq) has enabled the study of gene expression at the individual cell level. This high-resolution analysis has helped discover new cell types and cell states, reveal developmental lineages, and identify cell type-specific gene expression profiles (Montoro et al., 2018; Plasschaert et al., 2018; Luecken & Theis, 2019). This high-level resolution, however, comes at a cost. scRNA-seq data are often extremely sparse and prone to various sources of technical and biological noise, such as sequencing depth, batch effects, and cell-cycle phases (Svensson et al., 2018; Tanay & Regev, 2017; Luecken & Theis, 2019; Hie et al., 2020). Various dimensionality reduction techniques have been developed to leverage intrinsic structures in the data (Heimberg et al., 2016) to map it to a lower-dimensional latent space. These methods facilitate downstream tasks like clustering and visualization, while avoiding the curse of dimensionality. Our work emphasizes probabilistic dimensionality reduction methods, which, by providing explicit probabilistic models of the data, allow for more interpretable models and uncertainty measures in the learned latent space.

In particular, we study a class of latent variable models known as Gaussian Process Latent Variable Models (GPLVMs) (Lawrence, 2004), which have recently been applied to scRNA-seq data (Campbell & Yau, 2015; Buettner et al., 2015; Ahmed et al., 2019; Verma & Engelhardt, 2020; Lalchand et al., 2022a). These models, which use Gaussian processes (GPs) to define nonlinear mappings from the latent space to data space, can incorporate prior information in the GP kernel function, motivating their use in single-cell transcriptomics data to model known or approximated covariate random effects, such as batch IDs and cell-cycle phases. This approach is made scalable via mini-batching; however, the resulting Bayesian GPLVM model (BGPLVM) struggles to learn informative latent spaces for certain datasets (Lalchand et al., 2022a).
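To make the kernel-based encoding of domain knowledge concrete, here is a minimal, hypothetical sketch (not the authors' exact implementation) of a GPLVM-style covariance that combines a squared-exponential kernel over learned latent coordinates with an additive linear kernel over one-hot batch indicators, so that cells from the same batch share a covariate random effect:

```python
import numpy as np

def rbf_kernel(Z1, Z2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel over latent coordinates:
    # captures smooth nonlinear structure in the latent space.
    sq_dists = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def linear_kernel(B1, B2, variance=1.0):
    # Linear kernel over one-hot batch indicators:
    # adds covariance between cells from the same batch.
    return variance * (B1 @ B2.T)

def combined_kernel(Z1, B1, Z2, B2):
    # Additive combination: latent structure plus a batch random effect.
    return rbf_kernel(Z1, Z2) + linear_kernel(B1, B2)

# Toy example: 5 cells, 2 latent dimensions, 2 batches.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 2))                  # latent coordinates
B = np.eye(2)[rng.integers(0, 2, size=5)]    # one-hot batch IDs
K = combined_kernel(Z, B, Z, B)
print(K.shape)  # (5, 5)
```

The same pattern extends to other covariates such as cell-cycle phase; in a full BGPLVM the latent coordinates Z would be inferred variationally rather than sampled as here, and the kernel hyperparameters learned by optimizing the evidence lower bound.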

In this work, we present an amortized BGPLVM better suited to scRNA-seq data, leveraging design choices made in a leading probabilistic dimensionality reduction method called single-cell variational inference (scVI) (Lopez et al., 2018). While scVI has shown impressive performance in a variety of downstream tasks, it does not easily allow for interpretable incorporation of prior domain knowledge.

In Sections 2 and 3, we describe this model, providing a concise background on BGPLVMs and highlighting the model modifications. Section 4 then discusses (1) an ablation study demonstrating each component's contribution to the model's performance via a synthetic dataset; (2) comparable performance to scVI for both the synthetic dataset and a real-world COVID-19 dataset (Stephenson et al., 2021); and (3) promising results for interpretably incorporating prior domain knowledge about cell-cycle phases in an innate immunity dataset (Kumasaka et al., 2021). Our work shines a light on key considerations in developing a scalable, interpretable, and informative probabilistic dimensionality reduction method for scRNA-seq data.

Authors:

(1) Sarah Zhao, Department of Statistics, Stanford University, ([email protected]);

(2) Aditya Ravuri, Department of Computer Science, University of Cambridge ([email protected]);

(3) Vidhi Lalchand, Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard ([email protected]);

(4) Neil D. Lawrence, Department of Computer Science, University of Cambridge ([email protected]).


Article source: https://hackernoon.com/scalable-and-interpretable-dimensionality-reduction-for-scrna-seq?source=rss