Welcome

Preamble

Thank you for joining us for our mid-program challenge! We at the Programs team congratulate and thank you for your excellent work in our program. We hope that you have had meaningful experiences with us thus far, and we look forward to delivering more high-quality data science material just for you.

—Eddie Guo, Associate Director of Programs

The Mid-Program Challenge

Today, we will apply our knowledge and skills in statistics and R programming to identify differentially expressed genes (DEGs), a common task in bioinformatics. We will be working on a recently published dataset of gene expression from healthy people as well as COVID-19 patients with mild or severe symptoms (Zhang et al., 2021). Here are our expectations for you:

Understand the workflow (statistics and scripts) used for DEG analysis.
Adapt the code to work on the tasks you are asked to perform.
Interpret the results from tests and figures.
Propose future directions and carry out further exploration based on your conclusions.

Submission

During today’s challenge, you will be given time to work in your group and perform the tasks. You may or may not be able to finish the entire task in session. Therefore, we ask you to complete and finalize your work by Sunday, March 14, 2021 at 10:00 PM MST. Specifically,

For each task, please finish the codes in the designated code blocks. Interpret and discuss your results in the markdown texts as instructed in the task description.
After completing the entire .Rmd document, save the .Rmd file and click the Knit button to generate the .html file, to which point a web page containing all codes, figures and texts should show up nicely.
Submit your .Rmd file and .html file using this Google Form. Download the .Rmd file that you will submit by clicking here. You will also find a few questions asking you to discuss your results from the tasks. The answers should be both filled in the Google Form AND found in the .Rmd document.

Evaluation

We will evaluate the following aspects of your work:

Coding.
1. Correct usage of commands.
2. Robustness and adaptability of the code.
3. Good coding style and convention.
4. Readability of code (e.g. naming of variables, adequate commenting).
Visualization.
1. Correct choices of figures.
2. Compliance with publication standards.
Reasoning and communication.
1. Thorough understanding of the statistical methods employed at each step.
2. Adequate discussion and correct interpretation of the results.
3. Appropriate reference to results in figures/tables.
4. Appropriate reference to literature.
5. Proper wording in discussion and interpretation.
6. Critical thinking and insight.

Tip: using resources outside those provided by Youreka will enhance your project.

Background information

Selection of differentially expressed genes in microarray data analysis

Per Chen et al. (2007),

“One common objective in microarray experiments is to identify a subset of genes that express differentially among different experimental conditions, for example, between drug treatment and no drug treatment. Often, the goal is to determine the underlying relationship between poor versus good gene signatures for identifying biological functions or predicting specific therapeutic outcomes. Because of the complexity in studying hundreds or thousands of genes in an experiment, selection of a subset of genes to enhance relationships among the underlying biological structures or to improve prediction accuracy of clinical outcomes has been an important issue in microarray data analysis.”

Procedure

After data wrangling, there are two main steps to select differentially expressed genes:

Select an appropriate test statistic and compute the p-values. We often use the t-test or Mann-Whitney U test to compute these values.
Determine a significance threshold from the p-values to determine significance genes.

After these steps, we often create a visualization such as a volcano plot to visualize our (log2 transformed) test statistic against our (-log10 transformed) p-value.

Module 1: Data wrangling

Reading in the data set

First, we will clean RStudio’s environment as well as load the tidyverse.

rm(list = ls())
library(tidyverse)

Now, we’re going to read in the data. You can download the data by clicking here: data. You can get more information about this data on this website. Please ensure you set RStudio to the correct working directory using the setwd() function.

gset <- read.table("GSE164805_series_matrix.txt", comment.char = '!', header = T)
gset <- as_tibble(gset)

In the code above, we read in the table from the text file “GSE164805_series_matrix.txt”. Notice that we didn’t read any lines that started with an exclamation point. Feel free to open the text file and read what we didn’t read in with read.table().

Our data comes from an Affymetrix microarray, which is a common assay to probe for mRNA expression of many genes. Let’s explore what the assay results are:

head(gset)

## # A tibble: 6 x 16
##   ID_REF       GSM5019817 GSM5019818 GSM5019819 GSM5019820 GSM5019821 GSM5019822
##   <chr>             <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
## 1 (+)E1A_r60_1       3.95       2.55       2.90       2.41       2.58       2.43
## 2 (+)E1A_r60_3       3.57       3.14       3.31       3.42       3.36       3.55
## 3 (+)E1A_r60_…       3.59       2.36       2.92       2.66       2.40       2.90
## 4 (+)E1A_r60_…       3.95       2.81       3.16       3.10       2.71       4.67
## 5 (+)E1A_r60_…       3.64       2.36       2.91       2.38       2.41       2.67
## 6 (+)E1A_r60_…       4.27       3.66       3.76       3.86       3.51       3.46
## # … with 9 more variables: GSM5019823 <dbl>, GSM5019824 <dbl>,
## #   GSM5019825 <dbl>, GSM5019826 <dbl>, GSM5019827 <dbl>, GSM5019828 <dbl>,
## #   GSM5019829 <dbl>, GSM5019830 <dbl>, GSM5019831 <dbl>

nrow(gset)

## [1] 60795

ncol(gset)

## [1] 16

Notice that our dataset is quite large with a dimension of 60,795 rows by 16 columns. The first column ID_REF contains Affymetrix probes, which each correspond to a particular gene. The other 15 columns start with GSM.... These columns represent our patients’ gene expression. You should also memorize the following:

The first 5 patients are healthy patients.
The next 5 patients show mild COVID-19 symptoms.
The last 5 patients show severe COVID-19 symptoms.

Checking for invalid entries

Whenever you work with gene expression data, you must check for the following 2 invalid conditions:

Missing or NA values. You need to either remove the row (i.e., remove the gene from the analysis) or impute (i.e., predict) the gene expression value.
Values that are ≤ 0. In your first task at then end of this module, you will answer why these values are invalid.

For now, we are going to focus on the patients’ gene expression data for easier analysis. Let’s select every column except ID_REF and save it to a variable called expr.

expr <- gset %>% select(-ID_REF)

Now, let’s check for NA values using apply().

apply(is.na(expr), MARGIN = 2, FUN = sum)

## GSM5019817 GSM5019818 GSM5019819 GSM5019820 GSM5019821 GSM5019822 GSM5019823 
##          0          0          0          0          0          0          0 
## GSM5019824 GSM5019825 GSM5019826 GSM5019827 GSM5019828 GSM5019829 GSM5019830 
##          0          0          0          0          0          0          0 
## GSM5019831 
##          0

Here, MARGIN = 2 means “look at each column” and FUN = sum means “count up all rows in this column where is.na(expr) returns true”.

In addition to missing data, we should check for negative values, because gene expression can’t be negative.

min(expr)

## [1] 2.32193

Since min(expr) is greater than 0, there are no negative expressions. In this case, we got lucky. Lots of times, however, our data may contain negative values (either due to the data being log transformed or just bad data).

Data transformation

In bioinformatics, researchers commonly use apply the log2 transformation to gene expression values. In your task for this module, you will answer why we apply the log2 transform.

transformed <- cbind(ID_REF = gset$ID_REF, log2(expr))
head(transformed)

##            ID_REF GSM5019817 GSM5019818 GSM5019819 GSM5019820 GSM5019821
## 1    (+)E1A_r60_1   1.982318   1.353205   1.538184   1.269224   1.368772
## 2    (+)E1A_r60_3   1.837821   1.649621   1.728440   1.775893   1.748335
## 3 (+)E1A_r60_a104   1.843451   1.241322   1.545550   1.411503   1.261741
## 4 (+)E1A_r60_a107   1.982179   1.489193   1.660969   1.634088   1.437225
## 5 (+)E1A_r60_a135   1.865034   1.238980   1.540731   1.250820   1.267045
## 6  (+)E1A_r60_a20   2.094270   1.872988   1.910261   1.950167   1.809577
##   GSM5019822 GSM5019823 GSM5019824 GSM5019825 GSM5019826 GSM5019827 GSM5019828
## 1   1.281014   1.265400   1.556132   1.298916   1.363145   1.932319   1.879500
## 2   1.829719   1.868244   1.847651   1.992013   1.986494   2.006909   1.846429
## 3   1.538004   1.488496   1.650841   1.421117   1.459974   1.914972   1.926574
## 4   2.223608   2.233425   2.175192   2.413566   2.212238   2.333545   2.352547
## 5   1.416476   1.398552   1.504726   1.381740   1.388790   1.941106   1.885347
## 6   1.789775   1.807800   1.771651   1.731936   1.791754   2.005354   1.990547
##   GSM5019829 GSM5019830 GSM5019831
## 1   1.735442   1.860215   1.327579
## 2   1.813518   1.991264   1.767412
## 3   1.934841   1.901143   1.753091
## 4   2.416856   2.362836   2.419400
## 5   1.802814   1.904059   1.542891
## 6   1.969354   2.006445   1.947226

Now, the reason we created the expr tibble becomes clear: to separate character and numerical data to allow for easier data manipulation. That is, we can’t take a log transform on character data.

TASK 1: How does the log2 transformation work?

Generate histograms to illustrate the distribution of the untransformed AND log2-transformed gene expression levels for a patient of your choice.

Compare the histograms you just generated in question 1.
1. Based on your histogram, how does the log2 transformation change the distribution of gene expression?
2. Why do you think researchers commonly apply this transformation to gene expression data?
3. In the previous step, we checked for the existence of values that are 0 or negative. Why is this step necessary for log2 transformation?

Module 2: Statistical tests

Step 1: Using t-tests to examine differential expression

In this data set, subject GSM5019817 to GSM5019821 (first 5 columns) are healthy subjects, GSM5019822 to GSM5019826 (next 5 columns) have mild COVID-19 symptoms, and GSM5019827 to GSM5019831 (last 5 columns) have severe COVID-19 symptoms.

Here, we will use a t-test to compare the healthy subjects to the subjects with mild COVID-19 symptoms for a single gene.

temp_ID <- transformed[1, 1]
healthy <- transformed[1, 2:6]
mild <- transformed[1, 7:11]
t.test(healthy, mild)

## 
##  Welch Two Sample t-test
## 
## data:  healthy and mild
## t = 1.0794, df = 5.3602, p-value = 0.3266
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1993431  0.4981807
## sample estimates:
## mean of x mean of y 
##  1.502340  1.352921

In this case, the test does not return significant results at = 0.05.

Step 2: Calculate fold change and save your results

In brief, fold change measures how much something changes between an original and subsequent measurement. Here is the mathematical definition of a log2 fold change:

where stands for “fold change”.

In our case, we will be continuing our comparison between the healthy subjects and patients with mild COVID-19 symptoms. In particular, we will use our test results from the t-test in the previous section:

# saving t-test on the log2 transformed gene expression
t.results <- t.test(healthy, mild)
# getting the estimate
t.results$estimate

## mean of x mean of y 
##  1.502340  1.352921

# saving the fold change
fold.change <- t.results$estimate[2] - t.results$estimate[1]
# saving the p-value
p.val <- t.results$p.value

Please note that fold.change is already log2 transformed. This is why we use subtraction used instead of division. Also note that t.results$estimate[1] is the mean of the first sample (i.e., x) that went into the t.test() function, whereas t.results$estimate[2] the second (y). Therefore, this fold.change represents the log2 transformed fold change of the expression of mild symptom patients over that of the healthy patients. The order here is important.

Let’s now create a tibble to store our results:

results <- tibble(
  ID_REF = temp_ID,
  fold.change = fold.change,
  p.val = p.val
)

At this point, we only have one entry in this tibble. As we work with more and more genes, this tibble will prove useful.

Now that we’ve completed step 1 and 2, we have a single data point for our p-value vs fold change plot!

ggplot() +
  geom_point(aes(x = fold.change, y = -log10(p.val))) +
  labs(x = "Log2 Fold Change", y = "-log10(Pvalue)")

We will be filling in the rest of this plot very shortly :)

Performing statistical tests in batch

To identify the DEGs between 2 patient groups, we are in fact looking for the genes of which the expression levels are significantly different between the said groups. So far we have used t-test and U-test to examine one gene. Eventually, we will perform these tests on all of the genes at the same time.

To achieve this goal, we will take advantage of a for loop that iterates over all rows of our expression tibble. Since there are over 60,000 rows in the data set, iterating through every row will take a substantial amount of time. For demonstration purposes, we randomly subset a sample of 5,000 rows from the data set. We would also like to time the run to estimate the time if we were to perform such tests on the complete dataset.

sampled_genes <- transformed[sample(nrow(transformed), size = 5000, replace = FALSE), ]
start_time <- Sys.time()
# we start at index 2 b/c we've already done the 1st gene
for (i in 2:nrow(sampled_genes)) {
  # getting gene name
  temp_ID <- sampled_genes[i, 1]
  # saving our t-test results
  t.results <- t.test(sampled_genes[i, 2:6], sampled_genes[i, 7:11])
  fold.change <- t.results$estimate[2] - t.results$estimate[1]
  p.val <- t.results$p.value
  # adding to our results tibble
  results <- rbind(results, c(temp_ID, fold.change, p.val))
}

end_time <- Sys.time()
print(end_time - start_time)

## Time difference of 4.889766 secs

When performing many statistical tests, researchers use an “adjusted p-value” to correct for type I errors (probability of rejecting given that is true). In a later task, you will calculate the exact probability of a type I error (with guidance!).

Here we correct by false discovery rate as an example. Notice the difference between p.val, the original p-value, and p.adj, the adjusted p-value.

# fdr = false discovery rate
# another common method is the Bonferroni correction
p.adj <- p.adjust(results$p.val, method = "fdr")
# saving the adjustment back into our results tibble
results <- tibble(
  results,
  p.adj = p.adj
)
head(results)

## # A tibble: 6 x 4
##   ID_REF                 fold.change         p.val                p.adj
##   <chr>                  <chr>               <chr>                <dbl>
## 1 (+)E1A_r60_1           -0.149418817293759  0.326561726075244   0.497 
## 2 ASHGV40036478V5        -0.422233348453604  0.00204835600288314 0.0197
## 3 ASHGV40036403V5        -0.276363535541946  0.13000691232188    0.275 
## 4 ASHG19AP1B100003428V5  -0.554855950379242  0.027971150150058   0.0972
## 5 ASHG19LNC1A100081332V5 -0.324725706520771  0.109842105297743   0.244 
## 6 ASHG19LNC1A114033127V5 -0.0725809472439629 0.481890737886877   0.637

TASK 2: What is the appropriate test to use?

Find an arbitrary gene of your choice. Adapt the code above and find the expression levels of mild patients and severe patients. Compare their expression levels using both a t-test AND a U-test.

Which test do you think is more appropriate? You will want to generate a diagnostic plot to support your argument. Please include your discussion after the code block in the designated area.

Read this review article and answer the following questions.
1. What are the tests researchers use for identifying DEGs? Which test(s) is/(are) parametric and which is/(are) non-parametric?
2. For non-parametric methods to become powerful, what are the requirements for the samples?

Read the research article associated with this dataset.
1. What statistical tests did the authors use to identify the DEGs? What p-value adjustment method did the authors use?
2. How is their method similar or different to your choices?
3. Read this review article and reflect on our discussion in the statistics lectures. Why is p-value adjustment necessary? How is it problematic to use uncorrected p-value?
4. From the results tibble, How what are the type I errors, based on probability calculation, were made with and without the FDR p-value adjustment? Please input a numerical value (no need to show work).

Module 3: Data visualization and exploration

Exploratory Plot

Our tibble now includes 5,000 entries. Before we explore further, let’s generate a plot to get an idea of what information our analysis has to offer.

One of the most popular graph for DEG analysis is a scatter plot of fold change against p-value. This plot is also called a volcano plot because it resembles an erupting volcano.

ggplot(results, aes(x = as.numeric(fold.change), y = -log10(p.adj))) +
  geom_point(size=0.75) +
  labs(x = "Log2 Fold Change", y = "-log10(Pvalue)")

In this figure, each point represents a gene. The x-axis is the log2 transformed fold change. The points close to the center (0) are the genes with comparable expression levels in the two groups of samples, whereas the points to either side of the graph have a larger fold change. The y-axis is the -log10 transformed p-value. The higher the points are, the smaller the p-value is.

Plotting Labelled DEGs

We now propose a question for you: where do you think the DEGs are found on this figure?

To answer this question, we first need to label some groups:

Entries that are not significant.
Entries with significance and have upregulated gene expression.
Entries with significance and have downregulated gene expression.

# assigning "not significant" to all genes by default
results$group <- "NS"

# large enough fold change and upregulated
results[
  which(results$p.adj < 0.05 & results$fold.change > 0),
  "group"
] <- "Upregulated"

# significant but not downregulated
results[
  which(results$p.adj < 0.05 & results$fold.change < 0),
  "group"
] <- "Downregulated"

We are then able to use different colours to represent this label.

ggplot(results, aes(x = as.numeric(fold.change), y = -log10(p.adj),
                    color = group)) +
  geom_point(size=0.75) +
  labs(colour = "Group", x = "Log2 Fold Change", y = "-log10(Pvalue)") +
  geom_vline(xintercept = c(-1, 1), linetype="dashed", size=0.25) +
  geom_hline(yintercept = -log10(0.05), linetype="dashed", size=0.25) +
  scale_colour_manual(values = c("red", "grey", "green4")) +
  theme_classic()

There are many other ways to colour, label, and modify the volcano plot. We encourage you to explore additional customizations by reading academic literature related to DEGs.

TASK 3: Volcano plots and beyond

In the previous tasks, you have determined an appropriate statistical test for our dataset. With the 5,000 randomly sampled entries, please use the statistical test you have chosen to identify the DEGs between patients with mild symptoms and severe symptoms. The fold change should compare patients with mild and severe symptoms. Generate a volcano plot for your analysis.

Identify the DEGs that you believe to be the most significant. Save them as a tibble and print this tibble. Label their REF_ID directly on the volcano plot that you generate.

In question 2., what were your criterion for choosing these genes? Is it reasonable to look only at p-value or fold change?

Using the same 5,000 entries with the criterion you have chosen, find the up-regulated DEGs for the comparison of mild over healthy, severe over healthy, and severe over mild. How are these DEGs overlapping among the 3 pairs of comparison? How about the down-regulated DEGs? Can you create a Venn diagram to visualize your finding?

Find 1 to 3 genes in which you are most interested from the analyses you have performed so far. Use box plots to show the expression levels in the 3 groups of patients. Ensure that the box plots include dots to show the raw expression data from each sample. Why are these genes most interesting to you? What are the functions of these genes and how do you speculate these genes may play a role in COVID-19? (Hint: You may want to do some research using KEGG database, UniProt database, and published literature)

Please write a paragraph of discussion to summarize your mini-project critically. In this paragraph, you will briefly describe the motivation, rationale for methods, results and conclusions. Based on your current findings, you will point to the hypotheses/speculations you would like to make and propose work to further the research. You will also discuss the drawbacks and limitations of your work.

Describe the contribution of each of your team members.

Credit

A huge shoutout goes to Shuce Zhang, Programs Analyst, for creating the R script for our Mid Program Challenge. We would also like to thank Mehul Gupta and Sunand Kannappan, co-Founders of Youreka Canada, for providing their invaluable input on our Challenge topic. The background document was prepared by Pouria Torabi, Associate Director of Programs. This R Markdown document was prepared by Eddie Guo, Associate Director of Programs.

References

Chen, J., Wang, SJ., Tsai, CA. et al. Selection of differentially expressed genes in microarray data analysis. Pharmacogenomics J 7, 212–220 (2007). https://doi.org/10.1038/sj.tpj.6500412
Zhang, Q., Meng, Y., Wang, K., et al. Inflammation and antiviral immune response associated with severe progression of COVID-19. Frontiers in Immunology 12 (2021). https://doi.org/10.3389/fimmu.2021.631226

Youreka Canada 2021 Mid-Program Challenge

Youreka Canada Programs Team