# download data
dir.create('data')
<- c('gene_counts', 'cell_coords', 'neuron_cells', 'other_cells',
file_list 'positive_genes', 'negative_genes', 'other_genes')
for(filename in file_list) {
download.file(paste0('https://github.com/dmcable/BIOSTAT620/raw/main/data/p9/',filename,'.rds'),
destfile = paste0('data/',filename,'.rds'))
}
Problem set 9
This problem set explores the analysis of high dimensional data in the application area of spatial transcriptomics. For reference, consult the following papers:
- Robust decomposition of cell type mixtures in spatial transcriptomics
- Cell type-specific inference of differential expression in spatial transcriptomics
Load in the data
We begin by downloading the data. Hint: run this once and leave it as eval = FALSE
in your script.
Next, we load in the data and packages (note that counts
is a sparse matrix):
# required packages
library(ggplot2)
library(Matrix)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.3 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::expand() masks Matrix::expand()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ tidyr::pack() masks Matrix::pack()
✖ tidyr::unpack() masks Matrix::unpack()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# load data
<- readRDS('data/gene_counts.rds') # gene counts matrix
counts <- readRDS('data/cell_coords.rds') # cell coords matrix
coords
<- readRDS('data/neuron_cells.rds') # list of neuron cell barcodes.
neuron_cells <- readRDS('data/other_cells.rds') # list of non-neuron cell barcodes.
other_cells
<- readRDS('data/positive_genes.rds') # list of genes specific for neurons
positive_genes <- readRDS('data/negative_genes.rds') # list of genes specific for not neurons
negative_genes <- readRDS('data/other_genes.rds') # selected list of other genes
other_genes
<- c(positive_genes, negative_genes, other_genes) # all genes
all_genes <- as.matrix(counts[all_genes, c(neuron_cells, other_cells)]) # subset of the counts small_counts
Data visualization
- Begin by visualizing the coordinates of the cells in 2D space. Hint: this information is contained in the
coords
dataframe.
set.seed(2025)
### YOUR ANSWER HERE
- Let’s now visualize the expression of the Hpca gene, which is specific for neurons. Begin by creating a histogram of the counts of the gene. Hint: examine the
counts
matrix.
### YOUR ANSWER HERE
- Make a spatial plot of the gene expression of this gene. Use the cell coordinates as the spatial coordinates, and use gene expression counts as the color. Set a maximum threshold for your color scale based on the histogram above.
### YOUR ANSWER HERE
Distance
- Now, make a spatial plot of the cell types. Create a dataframe
plot_df
containing only the neurons and the non-neurons. Plot the neurons and non-neurons as two different colors. Hint: the neuron barcodes are inneuron_cells
, and the non-neuron barcodes are inother_cells
.
### YOUR ANSWER HERE
- Using only the cells selected above in
plot_df
, compute the distance matrix pairwise between each cell (hint: userdist
fromfields
). and compute the k-nearest-neighbors withK = 25
(hint useget.knn
fromFNN
). Choose the first cell, and visualize the spatial locations of the k nearest neighbors relative to that cell.
### YOUR ANSWER HERE
- For each cell, calculate the proportion of neighbors that are neurons, and visualize these proportions spatially.
### YOUR ANSWER HERE
Smoothing
- Using
loess
, fit a 2D spatial smoothing function to the neuron proportion values. Usedegree = 1
andspan = 0.05
. Create a spatial plot with the color as the smooth fit.
### YOUR ANSWER HERE
- Visualize a 1-dimensional section of the smooth function. Filter for
y
within50
of3000
. Plot both the proportion and the fitted smooth values in two separate colors.
### YOUR ANSWER HERE
Dimension reduction
- Using a subset of the counts (
small_counts
), perform a PCA (hint: useprcomp
). Plot the first principal component (hint: check thex
variable of theprcomp
object) vs the total cellular count.
### YOUR ANSWER HERE
Comment on your finding:
YOUR SHORT ANSWER HERE
- To mitigate this issue, we will normalize
small_counts
by dividing each cell by the sum of the total counts. Hint: use thesweep
function. Repeate the PCA with the normalized count matrix. Make a plot of percent of variance explained for each of the first 500 PCs (threshold this plot at 5% variance explained).
### YOUR ANSWER HERE
- Make spatial plots visualizing the first four PCs in spatial coordinates.
### YOUR ANSWER HERE
- For the first 20 pcs, compute the average weight for each of the following sets of genes:
positive_genes
,negative_genes
, andother_genes
. Create ageom_point
plot of these with PC index as the x-axis, average weight as the y axis, and gene list type as the color.
### YOUR ANSWER HERE
- Now, remake the previous plot for the first five PCs and include standard error bars.
### YOUR ANSWER HERE
Which of the first 5 PCs appear to have statistically significant differences across gene types?
YOUR SHORT ANSWER HERE