Problem set 9

Published

April 13, 2025

This problem set explores the analysis of high dimensional data in the application area of spatial transcriptomics. For reference, consult the following papers:

Load in the data

We begin by downloading the data. Hint: run this once and leave it as eval = FALSE in your script.

# download data
dir.create('data')
file_list <- c('gene_counts', 'cell_coords', 'neuron_cells', 'other_cells',
               'positive_genes', 'negative_genes', 'other_genes')
for(filename in file_list) {
  download.file(paste0('https://github.com/dmcable/BIOSTAT620/raw/main/data/p9/',filename,'.rds'), 
                destfile = paste0('data/',filename,'.rds'))
}

Next, we load in the data and packages (note that counts is a sparse matrix):

# required packages
library(ggplot2)
library(Matrix)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::expand() masks Matrix::expand()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ tidyr::pack()   masks Matrix::pack()
✖ tidyr::unpack() masks Matrix::unpack()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# load data
counts <- readRDS('data/gene_counts.rds') # gene counts matrix
coords <- readRDS('data/cell_coords.rds') # cell coords matrix

neuron_cells <- readRDS('data/neuron_cells.rds') # list of neuron cell barcodes.
other_cells <- readRDS('data/other_cells.rds') # list of non-neuron cell barcodes.

positive_genes <- readRDS('data/positive_genes.rds') # list of genes specific for neurons
negative_genes <- readRDS('data/negative_genes.rds') # list of genes specific for not neurons
other_genes <- readRDS('data/other_genes.rds') # selected list of other genes

all_genes <- c(positive_genes, negative_genes, other_genes) # all genes
small_counts <- as.matrix(counts[all_genes, c(neuron_cells, other_cells)]) # subset of the counts

Data visualization

Begin by visualizing the coordinates of the cells in 2D space. Hint: this information is contained in the coords dataframe.

set.seed(2025)
### YOUR ANSWER HERE

Let’s now visualize the expression of the Hpca gene, which is specific for neurons. Begin by creating a histogram of the counts of the gene. Hint: examine the counts matrix.

### YOUR ANSWER HERE

Make a spatial plot of the gene expression of this gene. Use the cell coordinates as the spatial coordinates, and use gene expression counts as the color. Set a maximum threshold for your color scale based on the histogram above.

### YOUR ANSWER HERE

Distance

Now, make a spatial plot of the cell types. Create a dataframe plot_df containing only the neurons and the non-neurons. Plot the neurons and non-neurons as two different colors. Hint: the neuron barcodes are in neuron_cells, and the non-neuron barcodes are in other_cells.

### YOUR ANSWER HERE

Using only the cells selected above in plot_df, compute the distance matrix pairwise between each cell (hint: use rdist from fields). and compute the k-nearest-neighbors with K = 25 (hint use get.knn from FNN). Choose the first cell, and visualize the spatial locations of the k nearest neighbors relative to that cell.

### YOUR ANSWER HERE

For each cell, calculate the proportion of neighbors that are neurons, and visualize these proportions spatially.

### YOUR ANSWER HERE

Smoothing

Using loess, fit a 2D spatial smoothing function to the neuron proportion values. Use degree = 1 and span = 0.05. Create a spatial plot with the color as the smooth fit.

### YOUR ANSWER HERE

Visualize a 1-dimensional section of the smooth function. Filter for y within 50 of 3000. Plot both the proportion and the fitted smooth values in two separate colors.

### YOUR ANSWER HERE

Dimension reduction

Using a subset of the counts (small_counts), perform a PCA (hint: use prcomp). Plot the first principal component (hint: check the x variable of the prcomp object) vs the total cellular count.

### YOUR ANSWER HERE

Comment on your finding:

YOUR SHORT ANSWER HERE

To mitigate this issue, we will normalize small_counts by dividing each cell by the sum of the total counts. Hint: use the sweep function. Repeate the PCA with the normalized count matrix. Make a plot of percent of variance explained for each of the first 500 PCs (threshold this plot at 5% variance explained).

### YOUR ANSWER HERE

Make spatial plots visualizing the first four PCs in spatial coordinates.

### YOUR ANSWER HERE

For the first 20 pcs, compute the average weight for each of the following sets of genes: positive_genes, negative_genes, and other_genes. Create a geom_point plot of these with PC index as the x-axis, average weight as the y axis, and gene list type as the color.

### YOUR ANSWER HERE

Now, remake the previous plot for the first five PCs and include standard error bars.

### YOUR ANSWER HERE

Which of the first 5 PCs appear to have statistically significant differences across gene types?

YOUR SHORT ANSWER HERE