Thank you for joining us for our mid-program challenge! We at the Programs team congratulate and thank you for your excellent work in our program. We hope that you have had meaningful experiences with us thus far, and we look forward to delivering more high-quality data science material just for you.
—Eddie Guo, Associate Director of Programs
Today, we will apply our knowledge and skills in statistics and R programming to identify differentially expressed genes (DEGs), a common task in bioinformatics. We will be working on a recently published dataset of gene expression from healthy people as well as COVID-19 patients with mild or severe symptoms (Zhang et al., 2021). Here are our expectations for you:
Understand the workflow (statistics and scripts) used for DEG analysis.
Adapt the code to work on the tasks you are asked to perform.
Interpret the results from tests and figures.
Propose future directions and carry out further exploration based on your conclusions.
During today’s challenge, you will be given time to work in your group and perform the tasks. You may or may not be able to finish the entire task in session. Therefore, we ask you to complete and finalize your work by Sunday, March 14, 2021 at 10:00 PM MST. Specifically,
For each task, please finish the codes in the designated code blocks. Interpret and discuss your results in the markdown texts as instructed in the task description.
After completing the entire .Rmd document, save the .Rmd file and click the Knit
button to generate the .html file, to which point a web page containing all codes, figures and texts should show up nicely.
Submit your .Rmd file and .html file using this Google Form. Download the .Rmd file that you will submit by clicking here. You will also find a few questions asking you to discuss your results from the tasks. The answers should be both filled in the Google Form AND found in the .Rmd document.
We will evaluate the following aspects of your work:
Tip: using resources outside those provided by Youreka will enhance your project.
Per Chen et al. (2007),
“One common objective in microarray experiments is to identify a subset of genes that express differentially among different experimental conditions, for example, between drug treatment and no drug treatment. Often, the goal is to determine the underlying relationship between poor versus good gene signatures for identifying biological functions or predicting specific therapeutic outcomes. Because of the complexity in studying hundreds or thousands of genes in an experiment, selection of a subset of genes to enhance relationships among the underlying biological structures or to improve prediction accuracy of clinical outcomes has been an important issue in microarray data analysis.”
After data wrangling, there are two main steps to select differentially expressed genes:
After these steps, we often create a visualization such as a volcano plot to visualize our (log2 transformed) test statistic against our (-log10 transformed) p-value.
First, we will clean RStudio’s environment as well as load the tidyverse.
rm(list = ls())
library(tidyverse)
Now, we’re going to read in the data. You can download the data by clicking here: data. You can get more information about this data on this website. Please ensure you set RStudio to the correct working directory using the setwd()
function.
<- read.table("GSE164805_series_matrix.txt", comment.char = '!', header = T)
gset <- as_tibble(gset) gset
In the code above, we read in the table from the text file “GSE164805_series_matrix.txt”. Notice that we didn’t read any lines that started with an exclamation point. Feel free to open the text file and read what we didn’t read in with read.table()
.
Our data comes from an Affymetrix microarray, which is a common assay to probe for mRNA expression of many genes. Let’s explore what the assay results are:
head(gset)
## # A tibble: 6 x 16
## ID_REF GSM5019817 GSM5019818 GSM5019819 GSM5019820 GSM5019821 GSM5019822
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (+)E1A_r60_1 3.95 2.55 2.90 2.41 2.58 2.43
## 2 (+)E1A_r60_3 3.57 3.14 3.31 3.42 3.36 3.55
## 3 (+)E1A_r60_… 3.59 2.36 2.92 2.66 2.40 2.90
## 4 (+)E1A_r60_… 3.95 2.81 3.16 3.10 2.71 4.67
## 5 (+)E1A_r60_… 3.64 2.36 2.91 2.38 2.41 2.67
## 6 (+)E1A_r60_… 4.27 3.66 3.76 3.86 3.51 3.46
## # … with 9 more variables: GSM5019823 <dbl>, GSM5019824 <dbl>,
## # GSM5019825 <dbl>, GSM5019826 <dbl>, GSM5019827 <dbl>, GSM5019828 <dbl>,
## # GSM5019829 <dbl>, GSM5019830 <dbl>, GSM5019831 <dbl>
nrow(gset)
## [1] 60795
ncol(gset)
## [1] 16
Notice that our dataset is quite large with a dimension of 60,795 rows by 16 columns. The first column ID_REF
contains Affymetrix probes, which each correspond to a particular gene. The other 15 columns start with GSM...
. These columns represent our patients’ gene expression. You should also memorize the following:
Whenever you work with gene expression data, you must check for the following 2 invalid conditions:
Missing or NA
values. You need to either remove the row (i.e., remove the gene from the analysis) or impute (i.e., predict) the gene expression value.
Values that are ≤ 0. In your first task at then end of this module, you will answer why these values are invalid.
For now, we are going to focus on the patients’ gene expression data for easier analysis. Let’s select every column except ID_REF
and save it to a variable called expr
.
<- gset %>% select(-ID_REF) expr
Now, let’s check for NA
values using apply()
.
apply(is.na(expr), MARGIN = 2, FUN = sum)
## GSM5019817 GSM5019818 GSM5019819 GSM5019820 GSM5019821 GSM5019822 GSM5019823
## 0 0 0 0 0 0 0
## GSM5019824 GSM5019825 GSM5019826 GSM5019827 GSM5019828 GSM5019829 GSM5019830
## 0 0 0 0 0 0 0
## GSM5019831
## 0
Here, MARGIN = 2
means “look at each column” and FUN = sum
means “count up all rows in this column where is.na(expr)
returns true”.
In addition to missing data, we should check for negative values, because gene expression can’t be negative.
min(expr)
## [1] 2.32193
Since min(expr)
is greater than 0, there are no negative expressions. In this case, we got lucky. Lots of times, however, our data may contain negative values (either due to the data being log transformed or just bad data).
In bioinformatics, researchers commonly use apply the log2 transformation to gene expression values. In your task for this module, you will answer why we apply the log2 transform.
<- cbind(ID_REF = gset$ID_REF, log2(expr))
transformed head(transformed)
## ID_REF GSM5019817 GSM5019818 GSM5019819 GSM5019820 GSM5019821
## 1 (+)E1A_r60_1 1.982318 1.353205 1.538184 1.269224 1.368772
## 2 (+)E1A_r60_3 1.837821 1.649621 1.728440 1.775893 1.748335
## 3 (+)E1A_r60_a104 1.843451 1.241322 1.545550 1.411503 1.261741
## 4 (+)E1A_r60_a107 1.982179 1.489193 1.660969 1.634088 1.437225
## 5 (+)E1A_r60_a135 1.865034 1.238980 1.540731 1.250820 1.267045
## 6 (+)E1A_r60_a20 2.094270 1.872988 1.910261 1.950167 1.809577
## GSM5019822 GSM5019823 GSM5019824 GSM5019825 GSM5019826 GSM5019827 GSM5019828
## 1 1.281014 1.265400 1.556132 1.298916 1.363145 1.932319 1.879500
## 2 1.829719 1.868244 1.847651 1.992013 1.986494 2.006909 1.846429
## 3 1.538004 1.488496 1.650841 1.421117 1.459974 1.914972 1.926574
## 4 2.223608 2.233425 2.175192 2.413566 2.212238 2.333545 2.352547
## 5 1.416476 1.398552 1.504726 1.381740 1.388790 1.941106 1.885347
## 6 1.789775 1.807800 1.771651 1.731936 1.791754 2.005354 1.990547
## GSM5019829 GSM5019830 GSM5019831
## 1 1.735442 1.860215 1.327579
## 2 1.813518 1.991264 1.767412
## 3 1.934841 1.901143 1.753091
## 4 2.416856 2.362836 2.419400
## 5 1.802814 1.904059 1.542891
## 6 1.969354 2.006445 1.947226
Now, the reason we created the expr
tibble becomes clear: to separate character and numerical data to allow for easier data manipulation. That is, we can’t take a log transform on character data.
In this data set, subject GSM5019817
to GSM5019821
(first 5 columns) are healthy subjects, GSM5019822
to GSM5019826
(next 5 columns) have mild COVID-19 symptoms, and GSM5019827
to GSM5019831
(last 5 columns) have severe COVID-19 symptoms.
Here, we will use a t-test to compare the healthy subjects to the subjects with mild COVID-19 symptoms for a single gene.
<- transformed[1, 1]
temp_ID <- transformed[1, 2:6]
healthy <- transformed[1, 7:11]
mild t.test(healthy, mild)
##
## Welch Two Sample t-test
##
## data: healthy and mild
## t = 1.0794, df = 5.3602, p-value = 0.3266
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1993431 0.4981807
## sample estimates:
## mean of x mean of y
## 1.502340 1.352921
In this case, the test does not return significant results at \(\alpha\) = 0.05.
In brief, fold change measures how much something changes between an original and subsequent measurement. Here is the mathematical definition of a log2 fold change:
\[\begin{align} FC &= \log_2 \left( \frac{\text{final value}}{\text{initial value}} \right) \\ &= \log_2 (\text{final value}) - \log_2 (\text{initial value}) \end{align}\]
where \(FC\) stands for “fold change”.
In our case, we will be continuing our comparison between the healthy subjects and patients with mild COVID-19 symptoms. In particular, we will use our test results from the t-test in the previous section:
# saving t-test on the log2 transformed gene expression
<- t.test(healthy, mild)
t.results # getting the estimate
$estimate t.results
## mean of x mean of y
## 1.502340 1.352921
# saving the fold change
<- t.results$estimate[2] - t.results$estimate[1]
fold.change # saving the p-value
<- t.results$p.value p.val
Please note that fold.change
is already log2 transformed. This is why we use subtraction used instead of division. Also note that t.results$estimate[1]
is the mean of the first sample (i.e., x) that went into the t.test()
function, whereas t.results$estimate[2]
the second (y). Therefore, this fold.change
represents the log2 transformed fold change of the expression of mild symptom patients over that of the healthy patients. The order here is important.
Let’s now create a tibble to store our results:
<- tibble(
results ID_REF = temp_ID,
fold.change = fold.change,
p.val = p.val
)
At this point, we only have one entry in this tibble. As we work with more and more genes, this tibble will prove useful.
Now that we’ve completed step 1 and 2, we have a single data point for our p-value vs fold change plot!
ggplot() +
geom_point(aes(x = fold.change, y = -log10(p.val))) +
labs(x = "Log2 Fold Change", y = "-log10(Pvalue)")
We will be filling in the rest of this plot very shortly :)
To identify the DEGs between 2 patient groups, we are in fact looking for the genes of which the expression levels are significantly different between the said groups. So far we have used t-test and U-test to examine one gene. Eventually, we will perform these tests on all of the genes at the same time.
To achieve this goal, we will take advantage of a for
loop that iterates over all rows of our expression tibble. Since there are over 60,000 rows in the data set, iterating through every row will take a substantial amount of time. For demonstration purposes, we randomly subset a sample of 5,000 rows from the data set. We would also like to time the run to estimate the time if we were to perform such tests on the complete dataset.
<- transformed[sample(nrow(transformed), size = 5000, replace = FALSE), ]
sampled_genes <- Sys.time()
start_time # we start at index 2 b/c we've already done the 1st gene
for (i in 2:nrow(sampled_genes)) {
# getting gene name
<- sampled_genes[i, 1]
temp_ID # saving our t-test results
<- t.test(sampled_genes[i, 2:6], sampled_genes[i, 7:11])
t.results <- t.results$estimate[2] - t.results$estimate[1]
fold.change <- t.results$p.value
p.val # adding to our results tibble
<- rbind(results, c(temp_ID, fold.change, p.val))
results
}
<- Sys.time()
end_time print(end_time - start_time)
## Time difference of 4.889766 secs
When performing many statistical tests, researchers use an “adjusted p-value” to correct for type I errors (probability of rejecting \(H_0\) given that \(H_0\) is true). In a later task, you will calculate the exact probability of a type I error (with guidance!).
Here we correct by false discovery rate as an example. Notice the difference between p.val
, the original p-value, and p.adj
, the adjusted p-value.
# fdr = false discovery rate
# another common method is the Bonferroni correction
<- p.adjust(results$p.val, method = "fdr")
p.adj # saving the adjustment back into our results tibble
<- tibble(
results
results,p.adj = p.adj
)head(results)
## # A tibble: 6 x 4
## ID_REF fold.change p.val p.adj
## <chr> <chr> <chr> <dbl>
## 1 (+)E1A_r60_1 -0.149418817293759 0.326561726075244 0.497
## 2 ASHGV40036478V5 -0.422233348453604 0.00204835600288314 0.0197
## 3 ASHGV40036403V5 -0.276363535541946 0.13000691232188 0.275
## 4 ASHG19AP1B100003428V5 -0.554855950379242 0.027971150150058 0.0972
## 5 ASHG19LNC1A100081332V5 -0.324725706520771 0.109842105297743 0.244
## 6 ASHG19LNC1A114033127V5 -0.0725809472439629 0.481890737886877 0.637
results
tibble, How what are the type I errors, based on probability calculation, were made with and without the FDR p-value adjustment? Please input a numerical value (no need to show work).Our tibble now includes 5,000 entries. Before we explore further, let’s generate a plot to get an idea of what information our analysis has to offer.
One of the most popular graph for DEG analysis is a scatter plot of fold change against p-value. This plot is also called a volcano plot because it resembles an erupting volcano.
ggplot(results, aes(x = as.numeric(fold.change), y = -log10(p.adj))) +
geom_point(size=0.75) +
labs(x = "Log2 Fold Change", y = "-log10(Pvalue)")
In this figure, each point represents a gene. The x-axis is the log2 transformed fold change. The points close to the center (0) are the genes with comparable expression levels in the two groups of samples, whereas the points to either side of the graph have a larger fold change. The y-axis is the -log10 transformed p-value. The higher the points are, the smaller the p-value is.
We now propose a question for you: where do you think the DEGs are found on this figure?
To answer this question, we first need to label some groups:
# assigning "not significant" to all genes by default
$group <- "NS"
results
# large enough fold change and upregulated
results[which(results$p.adj < 0.05 & results$fold.change > 0),
"group"
<- "Upregulated"
]
# significant but not downregulated
results[which(results$p.adj < 0.05 & results$fold.change < 0),
"group"
<- "Downregulated" ]
We are then able to use different colours to represent this label.
ggplot(results, aes(x = as.numeric(fold.change), y = -log10(p.adj),
color = group)) +
geom_point(size=0.75) +
labs(colour = "Group", x = "Log2 Fold Change", y = "-log10(Pvalue)") +
geom_vline(xintercept = c(-1, 1), linetype="dashed", size=0.25) +
geom_hline(yintercept = -log10(0.05), linetype="dashed", size=0.25) +
scale_colour_manual(values = c("red", "grey", "green4")) +
theme_classic()
There are many other ways to colour, label, and modify the volcano plot. We encourage you to explore additional customizations by reading academic literature related to DEGs.
REF_ID
directly on the volcano plot that you generate.A huge shoutout goes to Shuce Zhang, Programs Analyst, for creating the R script for our Mid Program Challenge. We would also like to thank Mehul Gupta and Sunand Kannappan, co-Founders of Youreka Canada, for providing their invaluable input on our Challenge topic. The background document was prepared by Pouria Torabi, Associate Director of Programs. This R Markdown document was prepared by Eddie Guo, Associate Director of Programs.
Chen, J., Wang, SJ., Tsai, CA. et al. Selection of differentially expressed genes in microarray data analysis. Pharmacogenomics J 7, 212–220 (2007). https://doi.org/10.1038/sj.tpj.6500412
Zhang, Q., Meng, Y., Wang, K., et al. Inflammation and antiviral immune response associated with severe progression of COVID-19. Frontiers in Immunology 12 (2021). https://doi.org/10.3389/fimmu.2021.631226