Chapter 9 Single-Cell Data Integration

FIGURE 9.1: Challenges in multi-modal integration

Integrating single-cell multi-omics data (e.g., scRNA-seq, scATAC-seq) is essential for a holistic understanding of cellular states but comes with challenges:

Different modalities have distinct statistical properties
Confounding biological and technical sources of heterogeneity
Missing values and sparsity across datasets

To tackle these, integration methods are categorized into horizontal (across cells or batches) and vertical (across modalities in the same cell) approaches.

9.1 Horizontal Integration

Used when each modality is profiled in a different set of cells or batches, so datasets must be aligned through shared features rather than shared cells.

Normalize expression and accessibility, e.g., by summarizing accessibility as gene activity scores that serve as a surrogate for expression
Match shared features across modalities
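The two preparation steps above can be sketched in a few lines. This is a minimal illustration that assumes cell-by-gene NumPy matrices and a gene activity matrix that has already been computed; the function names, the toy gene lists, and the scale factor of 10,000 are illustrative choices, not part of any specific package.

```python
import numpy as np

def log_normalize(counts, scale=1e4):
    """Library-size normalize each cell (row), then apply log1p."""
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1  # avoid division by zero for empty cells
    return np.log1p(counts / totals * scale)

def match_shared_features(x_a, genes_a, x_b, genes_b):
    """Subset two cell-by-gene matrices to their shared genes, in the same order."""
    shared = sorted(set(genes_a) & set(genes_b))
    idx_a = {g: i for i, g in enumerate(genes_a)}
    idx_b = {g: i for i, g in enumerate(genes_b)}
    cols_a = [idx_a[g] for g in shared]
    cols_b = [idx_b[g] for g in shared]
    return x_a[:, cols_a], x_b[:, cols_b], shared

# Toy data (hypothetical): 2 cells per dataset
rna = np.array([[4.0, 0.0, 6.0], [1.0, 1.0, 0.0]])
rna_genes = ["CD3E", "MS4A1", "LYZ"]
activity = np.array([[2.0, 3.0], [0.0, 5.0]])  # gene activity derived from scATAC
activity_genes = ["LYZ", "CD3E"]

rna_sub, act_sub, shared = match_shared_features(
    log_normalize(rna), rna_genes, log_normalize(activity), activity_genes
)
```

After this step both matrices have identical columns, which is the precondition for the feature-anchored methods listed next.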

Popular methods include:

CCA (Canonical Correlation Analysis): Joint projection into a shared subspace
Harmony: Corrects batch effects using soft clustering
MNN (Mutual Nearest Neighbors): Aligns cell neighborhoods
Scanorama, BBKNN, LIGER: Panorama stitching, batch-balanced neighbor graphs, and matrix factorization, respectively
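To make the mutual nearest neighbors idea concrete, the sketch below finds MNN pairs between two toy batches with brute-force Euclidean distances. It only illustrates the pairing step; the correction vectors that published MNN methods derive from these pairs are omitted, and the function names are assumptions.

```python
import numpy as np

def knn_indices(query, ref, k):
    """Indices of the k nearest rows of `ref` for each row of `query` (Euclidean)."""
    d = ((query[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2)
    return np.argsort(d, axis=1)[:, :k]

def mutual_nearest_neighbors(a, b, k=1):
    """Return (i, j) pairs where b[j] is among a[i]'s neighbors and vice versa."""
    a_to_b = knn_indices(a, b, k)
    b_to_a = knn_indices(b, a, k)
    pairs = []
    for i in range(a.shape[0]):
        for j in a_to_b[i]:
            if i in b_to_a[j]:
                pairs.append((i, int(j)))
    return pairs

# Toy batches: the third cell of batch_a has no counterpart in batch_b
batch_a = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 0.0]])
batch_b = np.array([[0.5, 0.2], [5.3, 4.9]])
pairs = mutual_nearest_neighbors(batch_a, batch_b, k=1)
```

The unmatched cell in `batch_a` receives no pair, which is why MNN-based methods tolerate cell populations that are present in only one batch.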

FIGURE 9.2: CCA-based integration of scRNA and scATAC (Stuart et al. 2019)
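The joint projection can be sketched with a simplified CCA variant in which the within-dataset covariance is treated as identity, so the problem reduces to an SVD of the cell-by-cell cross-product over shared genes. This is an illustration in the spirit of the approach, not the implementation of any package; `svd_cca` and the random toy data are assumptions.

```python
import numpy as np

def standardize(x):
    """Center and scale each gene (column) across cells."""
    x = x - x.mean(axis=0)
    sd = x.std(axis=0)
    sd[sd == 0] = 1.0
    return x / sd

def svd_cca(x_a, x_b, n_components=2):
    """x_a: cells_a x genes, x_b: cells_b x genes, with identical gene order.

    Returns one embedding per dataset plus the leading singular values.
    """
    cross = standardize(x_a) @ standardize(x_b).T  # cells_a x cells_b
    u, s, vt = np.linalg.svd(cross, full_matrices=False)
    return u[:, :n_components], vt[:n_components].T, s[:n_components]

rng = np.random.default_rng(0)
x_a = rng.normal(size=(30, 50))  # e.g. scRNA expression
x_b = rng.normal(size=(40, 50))  # e.g. scATAC gene activity
emb_a, emb_b, sv = svd_cca(x_a, x_b)
```

Cells from both datasets now have coordinates in the same low-dimensional space, where neighbor-based anchoring such as MNN can be applied.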

9.2 Vertical Integration

Used when different omics layers are measured from the same cell (e.g., 10x Multiome).

9.2.1 MOFA / MOFA+

Probabilistic factor model with sparsity-inducing priors
Captures shared and modality-specific sources of variation
Enables batch correction, dimensionality reduction, and imputation
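The factor model behind these properties can be written compactly. The notation below is an assumption chosen for illustration: Y^m is the cells-by-features matrix of modality m, Z holds the factor values per cell, W^m holds the feature weights for modality m, and ε^m is residual noise.

```latex
\mathbf{Y}^{m} = \mathbf{Z}\,\mathbf{W}^{m\top} + \boldsymbol{\varepsilon}^{m},
\qquad m = 1, \dots, M
```

Because Z is common to all modalities while each W^m is separate, a factor whose weights are non-zero in several modalities captures shared variation, and a factor whose weights are non-zero in only one captures modality-specific variation. The sparsity priors on W^m are what push unused weights toward zero.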

FIGURE 9.3: MOFA+ framework for joint dimensionality reduction (Argelaguet et al. 2020)

9.2.2 LIGER (Linked Inference of Genomic Experimental Relationships)

Integrative non-negative matrix factorization (iNMF)
Joint clustering via shared factor neighborhood graphs
Learns both dataset-specific and shared gene modules
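The split into shared and dataset-specific gene modules follows from the factorization objective, given here as described by Welch et al. (2019). E_i is the cells-by-genes matrix of dataset i, H_i the cell factor loadings, W the shared gene modules, V_i the dataset-specific modules, and λ a tuning parameter.

```latex
\min_{\mathbf{W},\,\mathbf{V}_i,\,\mathbf{H}_i \,\ge\, 0}\;
\sum_{i} \left\lVert \mathbf{E}_i - \mathbf{H}_i\,(\mathbf{W} + \mathbf{V}_i) \right\rVert_F^2
\;+\; \lambda \sum_{i} \left\lVert \mathbf{H}_i \mathbf{V}_i \right\rVert_F^2
```

Larger values of λ penalize the dataset-specific term more heavily and therefore force more of the signal into the shared modules W.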

FIGURE 9.4: LIGER joint clustering (Welch et al. 2019)

9.3 Weighted Nearest Neighbors (WNN)

WNN builds modality-specific KNN graphs and learns weights for each modality per cell, generating a unified WNN graph for clustering and downstream analysis.

Combines transcriptome and epigenome information
Scales well and supports imputation when one modality is missing
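The combination step can be sketched as follows: each cell's row of the unified similarity matrix is a weighted sum of its rows in the modality-specific matrices. The per-cell scores fed into the softmax are hypothetical placeholders here; in the published method they come from comparing within-modality and cross-modality prediction accuracy, which this sketch does not implement.

```python
import numpy as np

def modality_weights(score_rna, score_atac):
    """Softmax over two per-cell scores, giving weights that sum to 1 per cell."""
    m = np.maximum(score_rna, score_atac)
    e_rna, e_atac = np.exp(score_rna - m), np.exp(score_atac - m)
    return e_rna / (e_rna + e_atac), e_atac / (e_rna + e_atac)

def wnn_neighbors(sim_rna, sim_atac, w_rna, w_atac, k):
    """Weight each cell's similarities by its modality weights; keep top-k neighbors."""
    combined = w_rna[:, None] * sim_rna + w_atac[:, None] * sim_atac
    np.fill_diagonal(combined, -np.inf)  # a cell is not its own neighbor
    return np.argsort(-combined, axis=1)[:, :k]

# Toy similarities for 4 cells: RNA pairs (0,1),(2,3); ATAC pairs (0,2),(1,3)
sim_rna = np.array([[1.0, 0.9, 0.1, 0.1],
                    [0.9, 1.0, 0.1, 0.1],
                    [0.1, 0.1, 1.0, 0.9],
                    [0.1, 0.1, 0.9, 1.0]])
sim_atac = np.array([[1.0, 0.2, 0.8, 0.2],
                     [0.2, 1.0, 0.2, 0.8],
                     [0.8, 0.2, 1.0, 0.2],
                     [0.2, 0.8, 0.2, 1.0]])

# Hypothetical informativeness scores: RNA is better for cells 0-1, ATAC for cells 2-3
w_rna, w_atac = modality_weights(np.array([2.0, 2.0, 0.0, 0.0]),
                                 np.array([0.0, 0.0, 2.0, 2.0]))
neighbors = wnn_neighbors(sim_rna, sim_atac, w_rna, w_atac, k=1)
```

In this toy case cells 0 and 1 follow the RNA structure and cells 2 and 3 follow the ATAC structure, which is the behavior per-cell weighting is meant to provide.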

FIGURE 9.5: WNN: Combining cell-cell similarities across modalities (Hao et al. 2021)

9.4 Bridge Integration

Uses a multi-omic dataset as a “dictionary” (the bridge) to integrate separately measured scRNA and scATAC datasets in a shared latent space.

Dictionary learning represents each cell in terms of atoms (cells) in the multiome reference
PCA, LSI, or CCA provide the initial embeddings
Mutual nearest neighbors refine integration
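The dictionary idea can be illustrated with a least-squares sketch: every unimodal cell is expressed as a linear combination of bridge cells, using whichever modality it shares with the bridge. The embeddings below are random stand-ins for PCA or LSI coordinates, `dictionary_representation` is a hypothetical helper, and the further compression and neighbor-based refinement of the published method are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bridge, d_rna, d_atac = 20, 5, 6

# Bridge (multiome) cells: the same 20 cells measured in both modalities
bridge_rna = rng.normal(size=(n_bridge, d_rna))    # stand-in for a PCA embedding
bridge_atac = rng.normal(size=(n_bridge, d_atac))  # stand-in for an LSI embedding

def dictionary_representation(query, bridge):
    """Coefficients R such that query is approximated by R @ bridge.

    Uses the pseudoinverse, i.e. the minimum-norm least-squares solution.
    """
    return query @ np.linalg.pinv(bridge)

query_rna = rng.normal(size=(8, d_rna))     # scRNA-only dataset
query_atac = rng.normal(size=(10, d_atac))  # scATAC-only dataset

rep_rna = dictionary_representation(query_rna, bridge_rna)     # 8 x n_bridge
rep_atac = dictionary_representation(query_atac, bridge_atac)  # 10 x n_bridge
```

Both datasets end up with one coefficient per bridge cell, so they share a coordinate system even though they have no features in common.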

FIGURE 9.6: Bridge integration framework (Hao et al. 2022)

9.5 Deep Learning Approaches: MultiVI

MultiVI is a variational autoencoder for learning a joint latent space of scRNA and scATAC:

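Models transcript counts with a negative binomial (NB) likelihood and accessibility with a Bernoulli likelihood
Trained on paired (multiome) data, optionally together with single-modality datasets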
Supports imputation, batch correction, and latent space learning
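The two likelihoods can be written out directly. The sketch below evaluates the reconstruction term for a single cell, assuming the decoder has already produced NB means and Bernoulli probabilities; those values, the shared inverse dispersion `theta`, and the function names are illustrative, and the encoder, KL term, and scaling factors of the full model are left out.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log NB probability of count x, with mean mu and inverse dispersion theta."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))

def bernoulli_log_pmf(x, p):
    """Log probability of a binary accessibility call x given open probability p."""
    return x * math.log(p) + (1 - x) * math.log(1 - p)

def cell_log_likelihood(counts, mus, theta, peaks, probs):
    """Sum of the expression (NB) and accessibility (Bernoulli) terms for one cell."""
    rna = sum(nb_log_pmf(x, m, theta) for x, m in zip(counts, mus))
    atac = sum(bernoulli_log_pmf(x, p) for x, p in zip(peaks, probs))
    return rna + atac

# Hypothetical decoder outputs for a cell with 3 genes and 4 peaks
ll = cell_log_likelihood(
    counts=[0, 3, 1], mus=[0.5, 2.0, 1.5], theta=2.0,
    peaks=[1, 0, 0, 1], probs=[0.8, 0.1, 0.3, 0.6],
)
```

When a cell lacks one modality, the corresponding sum can simply be dropped, which is one way a model of this kind can accommodate unpaired cells.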

FIGURE 9.7: MultiVI for integrative latent representation (Ashuach et al. 2021)

References

Argelaguet, Ricard, Denis Arnol, Dmitry Bredikhin, Yann Deloro, Benedikt Velten, John C. Marioni, and Oliver Stegle. 2020. “MOFA+: A Statistical Framework for Comprehensive Integration of Multi-Modal Single-Cell Data.” Genome Biology 21: 111. https://doi.org/10.1186/s13059-020-02015-1.

Ashuach, Tal, Mariano Gabitto, Michael I Jordan, and Nir Yosef. 2021. “MultiVI: Deep Generative Model for the Integration of Multi-Modal Data.” bioRxiv. https://doi.org/10.1101/2021.08.20.457057.

Hao, Yuhan, Stephanie Hao, Erin Andersen-Nissen, et al. 2022. “Bridge Integration of Single-Cell Multi-Omics Data with Dictionary Learning.” bioRxiv. https://doi.org/10.1101/2022.02.24.481684.

Hao, Yuhan, Stephanie Hao, Erin Andersen-Nissen, William M Mauck III, Shiwei Zheng, Andrew Butler, Madison J Lee, et al. 2021. “Integrated Analysis of Multimodal Single-Cell Data.” Cell 184 (13): 3573–3587.e29. https://doi.org/10.1016/j.cell.2021.04.048.

Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck III, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. 2019. “Comprehensive Integration of Single-Cell Data.” Cell 177 (7): 1888–1902.e21. https://doi.org/10.1016/j.cell.2019.05.031.

Welch, Joshua D, Vanja Kozareva, Arthur Ferreira, Caleb Vanderburg, Craig Martin, and Evan Z Macosko. 2019. “Single-Cell Multi-Omic Integration Compares and Contrasts Features of Brain Cell Identity.” Cell 177 (7): 1873–1887.e17. https://doi.org/10.1016/j.cell.2019.05.006.