Chapter 9 Single-Cell Data Integration

FIGURE 9.1: Challenges in multi-modal integration

Integrating single-cell multi-omics data (e.g., scRNA-seq, scATAC-seq) is essential for a holistic understanding of cellular states but comes with challenges:

  • Different modalities have distinct statistical properties
  • Biological and technical sources of heterogeneity are confounded
  • Data are sparse, with missing values across datasets

Integration methods that address these challenges are commonly categorized as horizontal (across cells or batches, anchored on shared features) or vertical (across modalities measured in the same cell, anchored on shared cells).

9.1 Horizontal Integration

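Used when different sets of cells are assayed in separate batches, possibly with different modalities, so that shared features rather than shared cells anchor the integration. Typical preprocessing steps:

  • Normalize expression and accessibility (e.g., summarize accessibility as gene activity scores that serve as a surrogate for expression)
  • Match shared features across modalities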
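
The two preprocessing steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the procedure of any particular package: the function names, the toy coordinates, and the 2 kb upstream window are assumptions chosen for the example (summing peaks over the gene body plus a promoter window is one common convention for gene activity scores).

```python
import numpy as np

def gene_activity(peak_counts, peaks, genes, upstream=2000):
    """Sum peak counts overlapping each gene body plus an upstream window.

    peak_counts: (n_cells, n_peaks) accessibility counts.
    peaks: list of (chrom, start, end), one per peak column.
    genes: dict of gene name -> (chrom, start, end, strand).
    """
    names = sorted(genes)
    activity = np.zeros((peak_counts.shape[0], len(names)))
    for g, name in enumerate(names):
        chrom, start, end, strand = genes[name]
        # Extend the interval upstream of the TSS (strand-aware).
        lo = start - upstream if strand == "+" else start
        hi = end if strand == "+" else end + upstream
        for p, (p_chrom, p_start, p_end) in enumerate(peaks):
            if p_chrom == chrom and p_start < hi and p_end > lo:
                activity[:, g] += peak_counts[:, p]
    return names, activity

def log_normalize(counts, scale=1e4):
    """Library-size normalization followed by log1p."""
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1  # avoid division by zero for empty cells
    return np.log1p(counts / totals * scale)

# Toy scATAC data: 2 cells x 3 peaks (hypothetical coordinates).
peaks = [("chr1", 100, 600), ("chr1", 5000, 5500), ("chr2", 50, 400)]
genes = {"GENE_A": ("chr1", 1000, 3000, "+"), "GENE_B": ("chr2", 100, 900, "-")}
peak_counts = np.array([[2, 1, 0], [0, 3, 4]])
atac_genes, activity = gene_activity(peak_counts, peaks, genes)

# Toy scRNA data: 2 cells x 3 genes, in a different gene order.
rna_genes = ["GENE_B", "GENE_C", "GENE_A"]
rna_counts = np.array([[5, 1, 0], [0, 2, 7]])

# Keep only features present in both modalities, in the same order.
shared = sorted(set(rna_genes) & set(atac_genes))
rna_shared = log_normalize(rna_counts)[:, [rna_genes.index(g) for g in shared]]
atac_shared = log_normalize(activity)[:, [atac_genes.index(g) for g in shared]]
```

After this step both matrices have one column per shared gene, which is the input that feature-anchored methods expect.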

Popular methods include:

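  • CCA (Canonical Correlation Analysis): Projects datasets into a shared subspace that maximizes the correlation between them
  • Harmony: Corrects batch effects by iterative soft clustering in a shared embedding
  • MNN (Mutual Nearest Neighbors): Finds pairs of cells that are mutual nearest neighbors across batches and uses them to align the batches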
  • Scanorama, BBKNN, LIGER: Graph- or matrix-based manifold stitching approaches
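
The mutual-nearest-neighbor idea from the list above can be illustrated with a brute-force search. This is a simplified sketch on toy 2-D coordinates, assuming Euclidean distance and a hypothetical helper name; published MNN methods add steps such as cosine normalization and locally smoothed correction vectors that are omitted here.

```python
import numpy as np

def mutual_nearest_neighbors(a, b, k=3):
    """Return (i, j) pairs where a[i] and b[j] are in each other's k nearest neighbors."""
    # Pairwise squared Euclidean distances between the two batches.
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    nn_ab = np.argsort(d, axis=1)[:, :k]    # for each cell in a: k nearest cells in b
    nn_ba = np.argsort(d, axis=0)[:k, :].T  # for each cell in b: k nearest cells in a
    return [(i, int(j)) for i in range(a.shape[0]) for j in nn_ab[i] if i in nn_ba[j]]

# Two toy batches: the same two cell groups, with batch_b shifted by a batch effect.
batch_a = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
batch_b = batch_a + np.array([0.5, 0.5])

pairs = mutual_nearest_neighbors(batch_a, batch_b, k=1)

# A crude global correction: the average displacement across MNN pairs.
correction = np.mean([batch_b[j] - batch_a[i] for i, j in pairs], axis=0)
```

The pairs only ever link cells from the same group across batches, which is what makes them usable as anchors for estimating the batch effect.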

FIGURE 9.2: CCA-based integration of scRNA and scATAC (Stuart et al. 2019)
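
A rough sketch of how a CCA-style joint projection can be computed: with two feature-by-cell matrices that share their rows (for example genes and gene activity scores), the singular value decomposition of their cross-product yields paired canonical vectors, one set per dataset. The per-cell standardization and the function names are assumptions for this toy example and should not be read as the implementation of any specific package.

```python
import numpy as np

def standardize_cells(m):
    """Center and scale each cell (column) across the shared features."""
    return (m - m.mean(axis=0)) / m.std(axis=0)

def diagonal_cca(x, y, n_components=1):
    """x: (n_features, n_cells_x), y: (n_features, n_cells_y), same feature rows."""
    xs, ys = standardize_cells(x), standardize_cells(y)
    # SVD of the cell-by-cell cross-product gives canonical vectors for both datasets.
    u, s, vt = np.linalg.svd(xs.T @ ys, full_matrices=False)
    return u[:, :n_components], vt[:n_components].T, s[:n_components]

rng = np.random.default_rng(0)
type_a = np.array([3.0, 3.0, 3.0, 0.0, 0.0, 0.0])  # toy program of cell type A
type_b = type_a[::-1]                              # toy program of cell type B

# RNA cells ordered A, A, B, B; ATAC gene-activity cells ordered A, B, A, B.
rna = np.column_stack([type_a, type_a, type_b, type_b]) + rng.normal(scale=0.3, size=(6, 4))
atac_activity = np.column_stack([type_a, type_b, type_a, type_b]) + rng.normal(scale=0.3, size=(6, 4))

cc_rna, cc_atac, strength = diagonal_cca(rna, atac_activity)
```

In this toy case, cells of the same type receive the same sign on the first canonical vector in both datasets, so the shared coordinate separates cell types regardless of modality.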

9.2 Vertical Integration

Used when different omics layers are measured from the same cell (e.g., 10x Multiome).

9.2.1 MOFA / MOFA+

  • Probabilistic factor model with sparsity-inducing priors
  • Captures shared and modality-specific sources of variation
  • Enables batch correction, dimensionality reduction, and imputation

FIGURE 9.3: MOFA+ framework for joint dimensionality reduction (Argelaguet et al. 2020)
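
To make "shared and modality-specific sources of variation" concrete, the sketch below simulates two views from a common set of factors (each view is modeled as factors times view-specific weights plus noise) and computes the fraction of variance each factor explains in each view. It uses the true simulated factors and weights and performs no inference, so it illustrates only the decomposition, not how MOFA+ fits it; all names and sizes are made up for the example.

```python
import numpy as np

def variance_explained(y, z, w):
    """Fraction of variance in one view explained by each factor on its own.

    y: (n_cells, n_features) data, z: (n_cells, n_factors), w: (n_features, n_factors).
    """
    total = (y ** 2).sum()
    r2 = []
    for k in range(z.shape[1]):
        recon = np.outer(z[:, k], w[:, k])  # rank-1 reconstruction from factor k
        r2.append(1.0 - ((y - recon) ** 2).sum() / total)
    return np.array(r2)

rng = np.random.default_rng(0)
n_cells = 200
z = rng.normal(size=(n_cells, 2))   # factor 0 is shared, factor 1 is RNA-specific
w_rna = rng.normal(size=(30, 2))    # weights for 30 genes
w_atac = rng.normal(size=(40, 2))   # weights for 40 peaks
w_atac[:, 1] = 0.0                  # factor 1 is inactive in the ATAC view

y_rna = z @ w_rna.T + 0.1 * rng.normal(size=(n_cells, 30))
y_atac = z @ w_atac.T + 0.1 * rng.normal(size=(n_cells, 40))

r2_rna = variance_explained(y_rna, z, w_rna)
r2_atac = variance_explained(y_atac, z, w_atac)
```

Comparing `r2_rna` with `r2_atac` shows the pattern one looks for in practice: the shared factor explains variance in both views, while the modality-specific factor explains variance only in the RNA view.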

9.2.2 LIGER (Linked Inference of Genomic Experimental Relationships)

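  • Integrative non-negative matrix factorization (iNMF) over datasets that share features; unlike MOFA+, it does not require the modalities to be measured in the same cells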
  • Joint clustering via shared factor neighborhood graphs
  • Learns both dataset-specific and shared gene modules

FIGURE 9.4: LIGER joint clustering (Welch et al. 2019)
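
The split into shared and dataset-specific gene modules corresponds to an integrative NMF objective in which each dataset E_i is approximated by H_i (W + V_i), with W shared across datasets, V_i specific to dataset i, and a penalty on the dataset-specific part. The function below only evaluates that objective for given factors; it is a sketch with hypothetical names, does not perform the optimization, and the penalty value used here is purely illustrative.

```python
import numpy as np

def inmf_objective(datasets, h_list, w, v_list, lam=5.0):
    """Sum over datasets of ||E_i - H_i (W + V_i)||_F^2 + lam * ||H_i V_i||_F^2.

    datasets[i]: (n_cells_i, n_genes) non-negative expression matrix E_i.
    h_list[i]:   (n_cells_i, k) cell loadings H_i.
    w:           (k, n_genes) shared gene modules W.
    v_list[i]:   (k, n_genes) dataset-specific gene modules V_i.
    """
    total = 0.0
    for e, h, v in zip(datasets, h_list, v_list):
        total += np.linalg.norm(e - h @ (w + v), "fro") ** 2  # reconstruction error
        total += lam * np.linalg.norm(h @ v, "fro") ** 2      # penalty on specific part
    return total

rng = np.random.default_rng(0)
k, n_genes = 3, 8
w = rng.random((k, n_genes))
h_list = [rng.random((10, k)), rng.random((12, k))]
v_list = [0.1 * rng.random((k, n_genes)) for _ in h_list]

# Data generated exactly from the model, so only the penalty term is non-zero.
datasets = [h @ (w + v) for h, v in zip(h_list, v_list)]
loss = inmf_objective(datasets, h_list, w, v_list, lam=5.0)
```

Raising `lam` pushes more of the signal into the shared modules W; lowering it lets each dataset keep more of its own structure in V_i.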

9.3 Weighted Nearest Neighbors (WNN)

WNN builds modality-specific KNN graphs and learns weights for each modality per cell, generating a unified WNN graph for clustering and downstream analysis.

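  • Combines information from paired modalities (e.g., transcriptome and epigenome, or RNA and surface protein)
  • Scales well and, through mapping to a multimodal reference, supports imputing a modality that was not measured in a query dataset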

FIGURE 9.5: WNN: Combining cell-cell similarities across modalities (Hao et al. 2021)
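
The final step of the procedure, combining per-modality similarities with per-cell weights, can be sketched as follows. The per-cell scores fed into the softmax are placeholders: in the published method they are derived from how well each modality's neighbors predict a cell's own profile, which is not reproduced here. Function names, the exponential affinity kernel, and the random embeddings are likewise assumptions for illustration.

```python
import numpy as np

def affinity(embedding):
    """Cell-cell affinities from Euclidean distances in a low-dimensional embedding."""
    d = np.sqrt(((embedding[:, None, :] - embedding[None, :, :]) ** 2).sum(axis=2))
    return np.exp(-d)

def modality_weights(score_rna, score_atac):
    """Softmax over two per-cell scores; the weights sum to 1 for every cell."""
    m = np.maximum(score_rna, score_atac)  # subtract the max for numerical stability
    e_rna, e_atac = np.exp(score_rna - m), np.exp(score_atac - m)
    w_rna = e_rna / (e_rna + e_atac)
    return w_rna, 1.0 - w_rna

def weighted_nearest_neighbors(aff_rna, aff_atac, w_rna, w_atac, k=3):
    """Top-k neighbors per cell under the weighted sum of modality affinities."""
    combined = w_rna[:, None] * aff_rna + w_atac[:, None] * aff_atac
    np.fill_diagonal(combined, -np.inf)  # a cell is not its own neighbor
    return np.argsort(-combined, axis=1)[:, :k]

rng = np.random.default_rng(0)
rna_pca = rng.normal(size=(20, 5))   # stand-in for an RNA embedding
atac_lsi = rng.normal(size=(20, 5))  # stand-in for an ATAC embedding
score_rna = rng.normal(size=20)      # placeholder per-cell informativeness scores
score_atac = rng.normal(size=20)

w_rna, w_atac = modality_weights(score_rna, score_atac)
wnn = weighted_nearest_neighbors(affinity(rna_pca), affinity(atac_lsi), w_rna, w_atac, k=3)
```

Because the weights are per cell, a cell whose RNA profile is uninformative can lean on its ATAC neighbors, and vice versa; setting one weight to 1 for all cells recovers the single-modality KNN graph.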

9.4 Bridge Integration

Uses multi-omics data as a “dictionary” to integrate separately measured scRNA and scATAC datasets via shared latent spaces.

  • Dictionary learning represents each cell as a combination of atoms, where the atoms are the cells of the multiome reference
  • Uses PCA, LSI, or CCA for the initial embeddings
  • Mutual nearest neighbors refine the integration

FIGURE 9.6: Bridge integration framework (Hao et al. 2022)
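
The dictionary idea can be demonstrated on noise-free toy data: each unimodal cell is expressed as a linear combination of the multiome cells (the atoms), using whichever modality it shares with them. Cells from different modalities then live in the same atom space. The least-squares solve below is a stand-in chosen for simplicity, and the linear simulation, names, and sizes are assumptions; the published method adds dimensionality reduction and further alignment steps.

```python
import numpy as np

def dictionary_representation(bridge, query, rcond=1e-8):
    """Express each query cell as a linear combination of bridge cells (atoms).

    bridge: (n_features, n_atoms) multiome measurements in one modality.
    query:  (n_features, n_query_cells) unimodal data with the same features.
    Returns (n_atoms, n_query_cells) minimum-norm least-squares coefficients.
    """
    coeffs, *_ = np.linalg.lstsq(bridge, query, rcond=rcond)
    return coeffs

rng = np.random.default_rng(0)
n_latent, n_atoms = 5, 30
z_bridge = rng.normal(size=(n_latent, n_atoms))  # latent state of each multiome cell
to_rna = rng.normal(size=(50, n_latent))         # toy linear map: latent -> 50 genes
to_atac = rng.normal(size=(80, n_latent))        # toy linear map: latent -> 80 peaks

# The multiome "dictionary": both modalities measured in the same cells.
bridge_rna = to_rna @ z_bridge
bridge_atac = to_atac @ z_bridge

# Separately measured datasets; here the same 4 latent states are reused for checking.
z_query = rng.normal(size=(n_latent, 4))
rna_query = to_rna @ z_query
atac_query = to_atac @ z_query

rep_rna = dictionary_representation(bridge_rna, rna_query)
rep_atac = dictionary_representation(bridge_atac, atac_query)
```

In this idealized setting an scRNA cell and an scATAC cell with the same underlying state receive the same atom coefficients, which is why the multiome dataset can act as a translator between the two modalities.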

9.5 Deep Learning Approaches: MultiVI

MultiVI is a variational autoencoder for learning a joint latent space of scRNA and scATAC:

  • Models transcriptome (NB) and accessibility (Bernoulli)
  • Trained on paired (multiome) data, optionally together with unpaired single-modality datasets
  • Supports imputation, batch correction, and latent space learning

FIGURE 9.7: MultiVI for integrative latent representation (Ashuach et al. 2021)
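
The two observation models named above translate into two log-likelihood terms. The sketch below writes them out in NumPy and combines them into a per-cell reconstruction loss. It covers only these likelihood terms under simplifying assumptions (fixed means, dispersions, and probabilities instead of decoder outputs); the encoder, decoder, KL terms, and any scaling factors of the actual model are left out, and the function names are hypothetical.

```python
import math
import numpy as np

_lgamma = np.vectorize(math.lgamma)  # elementwise log-gamma without extra dependencies

def nb_log_likelihood(x, mu, theta):
    """Negative binomial log-pmf with mean mu and inverse dispersion theta."""
    return (_lgamma(x + theta) - _lgamma(theta) - _lgamma(x + 1.0)
            + theta * np.log(theta / (theta + mu))
            + x * np.log(mu / (theta + mu)))

def bernoulli_log_likelihood(x, p, eps=1e-8):
    """Bernoulli log-pmf for binarized accessibility; p is clipped away from 0 and 1."""
    p = np.clip(p, eps, 1.0 - eps)
    return x * np.log(p) + (1.0 - x) * np.log(1.0 - p)

def reconstruction_loss(rna, mu, theta, atac, p):
    """Negative log-likelihood per cell, summed over genes and peaks."""
    return -(nb_log_likelihood(rna, mu, theta).sum(axis=1)
             + bernoulli_log_likelihood(atac, p).sum(axis=1))

# Toy inputs: 2 cells, 3 genes, 4 peaks. In a VAE, mu and p would come from the decoder.
rna = np.array([[0.0, 3.0, 1.0], [2.0, 0.0, 5.0]])
mu = np.full((2, 3), 2.0)
theta = np.array([1.0, 2.0, 4.0])  # one inverse dispersion per gene
atac = np.array([[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]])
p = np.full((2, 4), 0.3)

loss = reconstruction_loss(rna, mu, theta, atac, p)
```

For a cell that lacks one modality, the corresponding term would simply be dropped from the sum, which is one way a model of this kind can learn from paired and unpaired cells together.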

References

Argelaguet, Ricard, Denis Arnol, Dmitry Bredikhin, Yann Deloro, Benedikt Velten, John C Marioni, and Oliver Stegle. 2020. “MOFA+: A Statistical Framework for Comprehensive Integration of Multi-Modal Single-Cell Data.” Genome Biology 21: 111. https://doi.org/10.1186/s13059-020-02015-1.
Ashuach, Tal, Mariano Gabitto, Michael I Jordan, and Nir Yosef. 2021. “MultiVI: Deep Generative Model for the Integration of Multi-Modal Data.” bioRxiv. https://doi.org/10.1101/2021.08.20.457057.
Hao, Yuhan, Stephanie Hao, Erin Andersen-Nissen, et al. 2022. “Bridge Integration of Single-Cell Multi-Omics Data with Dictionary Learning.” bioRxiv. https://doi.org/10.1101/2022.02.24.481684.
Hao, Yuhan, Stephanie Hao, Erin Andersen-Nissen, William M Mauck III, Shiwei Zheng, Andrew Butler, Madison J Lee, et al. 2021. “Integrated Analysis of Multimodal Single-Cell Data.” Cell 184 (13): 3573–3587.e29. https://doi.org/10.1016/j.cell.2021.04.048.
Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck III, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. 2019. “Comprehensive Integration of Single-Cell Data.” Cell 177 (7): 1888–1902.e21. https://doi.org/10.1016/j.cell.2019.05.031.
Welch, Joshua D, Vanja Kozareva, Arthur Ferreira, Caleb Vanderburg, Craig Martin, and Evan Z Macosko. 2019. “Single-Cell Multi-Omic Integration Compares and Contrasts Features of Brain Cell Identity.” Cell 177 (7): 1873–1887.e17. https://doi.org/10.1016/j.cell.2019.05.006.