Multisample Spatial Domains

Here we describe an intuitive adaptable approach to multi-sample integration for SpatialLeiden clustering in the 10x Visium spatial transcriptomics dataset of the human dorsolateral prefrontal cortex (DLPFC) (Maynard et al., 2021).

The schematic below illustrates a general workflow for SpatialLeiden clustering with batch correction. Expression data of multiple samples undergo dimensionality reduction using PCA. Based on this representation a k-nearest neighbors (kNN) graph is constructed that is used as input for SpatialLeiden. Here, we will leverage Harmony for batch correction of the PCA space. But as we can see, other methods that perform the batch correction at different steps of the workflow can be used as well.

Multi-sample workflow

Settings and Import

First of all we are going to load the relevant packages that we will be working with as well as setting a random seed that we will use throughout this example to make the results reproducible.

from pathlib import Path

import anndata as ad
import scanpy as sc
import seaborn as sns
import squidpy as sq

seed = 42

Preproccess raw DLPFC data from spatialLIBD

Using R the DLPFC dataset is downloaded with spatialLIBD, and the data is saved as .h5ad with zellkonverter.

# R Script to download DLPFC data and convert it to h5ad
library(spatialLIBD)
library(zellkonverter) 

spe <- fetch_data("spe")

coords <- as.data.frame(spatialCoords(spe))
colData(spe)$x <- coords$pxl_col_in_fullres
colData(spe)$y <- coords$pxl_row_in_fullres

writeH5AD(spe, file = file.path("./path/to/data", "raw.h5ad"))

Load and preprocess

To optimize memory usage, large unused attributes like obsm, uns, and layers are removed. Spatial coordinates are extracted from the metadata and stored in adata.obsm["spatial"]. The gene and sample metadata are streamlined by selecting relevant columns, and the main expression matrix is converted to a sparse integer format.

We subset the data for demonstration to only include cells from patient Br8100. A spatial neighborhood graph is generated with Squidpy using a regular grid (see here for more information which spatial graph is suitable for your data). Then, we perform standard preprocessing steps. These include filtering genes, selecting spatially variable genes, normalization, log-transformation, scaling, and PCA for dimensionality reduction. The parameters and concrete steps can be adjusted according to your needs.

# Subset to 1 patient
adata = adata[adata.obs["subject"] == "Br8100"].copy()

# generate spatial graph
sq.gr.spatial_neighbors(adata, library_key="sample_id", coord_type="grid", n_neighs=6)

# gene filtering
sc.pp.filter_genes(adata, min_cells=10)

# scaling / normalization
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.scale(adata)

# SVGs
sq.gr.spatial_autocorr(adata, genes=adata.var_names, mode="moran", seed=seed)
svgs = adata.uns["moranI"].nlargest(2_000, columns="I", keep="all").index
adata.var["svg"] = adata.var_names.isin(svgs)

sc.tl.pca(adata, n_comps=30, mask_var="svg", random_state=seed)

Batch integration

First, Harmony is applied to correct batch effects between the samples of the patient. The batch-corrected PCA embeddings are stored in adata.obsm["X_pca_harmony"]. Next, a neighborhood graph is constructed based on these Harmony-corrected embeddings, considering the 15 nearest neighbors. Other batch correction methods like scanorama, bbknn, and scVI can also be used for this step, depending on the data and analysis preferences.

import harmonypy as hm

harmony_result = hm.run_harmony(
    adata.obsm["X_pca"], adata.obs, "sample_id", random_state=seed
)
adata.obsm["X_pca_harmony"] = harmony_result.Z_corr

sc.pp.neighbors(adata, n_neighbors=15, use_rep="X_pca_harmony", random_state=seed)

Clustering

SpatialLeiden combines the spatial information from the spatial graph with the neighborhood graph built from the batch-corrected expression data. To get the desired number of clusters, SpatialLeiden automatically searches over different resolutions and picks the one that comes closest to the target. ALternatively, if you don’t know the number of clusters (which is usually the caseThe final cluster assignments are stored in adata.obs.

from spatialleiden import search_resolution

n_cluster = adata.obs["spatialLIBD"].nunique()

resolutions = search_resolution(
    adata,
    n_cluster,
    latent_kwargs={"directed": False, "random_state": seed},
    spatial_kwargs={"directed": False, "layer_ratio": 0.6, "random_state": seed},
)

Visualizing SpatialLeiden multisample clusters across samples

After defining spatial domains with SpatialLeiden, we can check that these are actually spatially coherent across sections. Using joint clustering across multiple batch-corrected samples helps to identify consistent spatial patterns and improves the robustness of detected domains by leveraging shared biological signals.

labels_integrated = adata.obs[["spatialleiden", "sample_id"]]
labels_integrated.loc[:, ["x", "y"]] = adata.obsm["spatial"]

g = sns.FacetGrid(labels_integrated, col="sample_id")

g.map_dataframe(
    sns.scatterplot, x="x", y="y", hue="spatialleiden", s=5, linewidth=0, legend=True
)

g.set_axis_labels("x", "y")
g.set_titles(col_template="Sample {col_name}")

ymin, ymax = g.axes.flat[0].get_ylim()

for ax in g.axes.flat:
    ax.set_ylim(ymax, ymin)  # flip y-axis
    ax.set_aspect("equal")
    ax.set(xticklabels=[], yticklabels=[], xlabel=None, ylabel=None)
    ax.tick_params(left=False, bottom=False)

g.fig.suptitle("Patient Br8100")
_ = g.tight_layout()

../_images/02acfb794e3b43e757f3322c70d200f888ba5abe3ad5463023403aae899cd687.png