# Set the seed so our results are reproducible:
set.seed(2020)
# Required packages
# pathway analysis
library(gprofiler2)
# We will need them for data handling
library(magrittr)
library(ggrepel)
library(dplyr)
library(tidyverse)
library(readr)
# plotting
library(ggplot2)
library(ComplexHeatmap)
library(RColorBrewer)
Gene Ontology (GO) enrichment analysis is a computational method used to identify which biological, cellular, or molecular processes are over-represented (or under-represented) in a given list of genes. It leverages the structured vocabulary of the Gene Ontology to interpret the functions of genes within a larger biological context.
How it works:
Input: You start with a list of genes of interest, often derived from experimental data like differentially expressed genes (DEGs) from a RNA-Seq experiment.
Annotation: Each gene in your list is associated with GO terms that describe its known functions.
Statistical Test: A statistical test (e.g., Fisher’s exact test, hypergeometric test) is applied to determine if any GO terms are enriched in your gene list compared to what would be expected by chance.
Output: The result is a list of GO terms along with their associated p-values (or adjusted p-values). Significantly enriched terms suggest that the biological processes they represent are particularly relevant to your gene list.
The Gene Ontology is organized into three main categories:
Biological Process (BP): Represents a series of events accomplished by one or more ordered assemblies of molecular functions (e.g., DNA replication, cell cycle, signal transduction).
Molecular Function (MF): Describes the elemental activities of a gene product at the molecular level, such as binding or catalysis (e.g., kinase activity, DNA binding, transporter activity).
Cellular Component (CC): Refers to the location relative to cellular structures in which a gene product performs a function (e.g., nucleus, cytoplasm, plasma membrane). Types of GO Enrichment Analysis:
For more information, we encourage further reading at: https://geneontology.org/docs/ontology-documentation/
There are two main approaches to GO enrichment analysis:
Over-Representation Analysis (ORA): This is the simpler method and focuses on identifying GO terms that are significantly over-represented in your gene list. It is based on the assumption that all genes in the list are independent.
Functional Class Scoring (FCS): This approach considers the relative ranking of genes in your list (e.g., based on expression levels) and tests if gene sets associated with specific GO terms are enriched towards the top or bottom of the list. This can reveal subtle patterns that ORA might miss.
The enriched GO terms provide valuable insights into the biological functions that are most relevant to your gene list. This can help generate hypotheses about the underlying mechanisms involved and guide further research.
KEGG is a collection of manually drawn pathway maps representing molecular interaction and reaction networks.
GO and KEGG are the most frequently used for functional analysis. They are typically the first choice because of their long-standing curation and availability for a wide range of species.
Other gene sets include but are not limited to Disease Ontology
DO
, Disease Gene Network DisGeNET
,
wikiPathways
, Molecular Signatures Database
MSigDb
.
Transplant_vs_Naive_annotated <- read.csv(file = "./results/Transplant_vs_Naive/Transplant_vs_Naive_annotated_DEGlist.csv",
row.names = 1)
normalized_counts_annotated <- read.csv(file="./results/Transplant_vs_Naive/normalized_counts.csv")
For our enrichment analysis, we will use DEGs filtered by the
following cutoff:
adj.P.Val <= 0.05, logFC <= -1 | logFC >= 1
.
Transplant_vs_Naive_annotated_DEGs <- Transplant_vs_Naive_annotated %>%
dplyr::filter(str_trim(external_gene_name) != "") %>% # Remove empty strings
dplyr::filter(padj <= 0.05, log2FoldChange <= -1 | log2FoldChange >= 1) %>% # THIS IS FILTERED ANALYSIS
dplyr::arrange(dplyr::desc(abs(log2FoldChange))) %>%
# Filter out the duplicated rows using `dplyr::distinct()`
dplyr::distinct(external_gene_name, .keep_all = TRUE)
Quick look at the number of DEGs.
nrow(Transplant_vs_Naive_annotated_DEGs)
## [1] 1179
Check for duplicated gene_ids
any(duplicated(Transplant_vs_Naive_annotated_DEGs$external_gene_name))
## [1] FALSE
gprofiler2
gprofiler2
performs functional enrichment analysis, also
known as over-representation analysis (ORA) on input gene list. It maps
genes to known functional information sources and detects statistically
significantly enriched terms.
In other words, gprofiler2
takes a list of genes and
analyzes them to see if they are enriched for certain functions. This
can be helpful for understanding the biological processes that the genes
are involved in.
Many GO enrichment analysis tools allow you to archive your results
for future reference or sharing. gprofiler2
maintains
archived versions based on genome versions used for secondary analysis,
specifically Ensembl. Thus, here we also set the archive to Ensembl 109
version.
Archiving results is often optional, but it can be a valuable practice for reproducibility and collaboration.
set_base_url("https://biit.cs.ut.ee/gprofiler_archive3/e109_eg56_p17")
Further, note that although gprofiler2
accepts Ensembl
IDs, most biologists/scientist prefer gene names as they are more
interpretable. One caveat, is that the use of gene names can can lead to
redundancies for cases where a name corresponds to more than one id
(e.g: haplotypes). However, for the most part it is acceptable to use
gene names. The example below is executed with gene names.
Lastly, note we set the custom_bg
(custom background) to
the gene names present in our normalized counts which is derived from
the raw data that was pre-filtered. This is important due to the
following reasons:
Statistical Significance: The background set defines the domain
of genes from which your input list was derived. By comparing the
frequency of GO terms in your gene list to their frequency in the
background set, gprofiler2
can calculate p-values that
indicate the likelihood of observing such enrichment by chance. Without
a proper background set, these p-values would be unreliable.
Context-Specific Analysis: The choice of background set can dramatically influence the results of your GO analysis. Using the entire genome as the background might not be appropriate if your gene list comes from a specific tissue, cell type, or experimental condition. In such cases, it’s often more meaningful to use a custom background set that reflects the relevant biological context (in our case this corresponds to the total gene number after pre-filtering).
Controlling False Positives: A well-defined background set helps control the false positive rate in your analysis. If your background set is too small or biased, you might detect spurious enrichment of GO terms that are not truly relevant to your gene list.
gprofiler2
offers several options for choosing a
background set: annotated genes, custom background, and predefined
backgrounds.
# select species
species <- "mmusculus"
#select data sources, these are our min. standards:
data_sources <- c("GO", "KEGG", "REAC", "MIRNA", "HP", "HPA", "WP")
gost_results <- gost(query = Transplant_vs_Naive_annotated_DEGs$external_gene_name,
organism = species,
ordered_query = FALSE,
multi_query = FALSE,
significant = TRUE,
exclude_iea = FALSE,
measure_underrepresentation = FALSE,
user_threshold = 0.05,
correction_method = "g_SCS",
domain_scope = "custom",
custom_bg = normalized_counts_annotated$external_gene_name, # genes names from normalized counts
numeric_ns = "",
sources = data_sources,
as_short_link = FALSE,
evcodes = TRUE)
Challenge: Enable the url option for gost
explore the gprofiler
results in detail through the
url.
gost_results_url <- gost(query = Transplant_vs_Naive_annotated_DEGs$external_gene_name,
organism = species,
ordered_query = FALSE,
multi_query = FALSE,
significant = TRUE,
exclude_iea = FALSE,
measure_underrepresentation = FALSE,
user_threshold = 0.05,
correction_method = "g_SCS",
domain_scope = "custom",
custom_bg = normalized_counts_annotated$external_gene_name, # genes names from normalized counts
numeric_ns = "",
sources = data_sources,
as_short_link = TRUE,
evcodes = TRUE)
gost_results_url
## [1] "https://biit.cs.ut.ee/gplink/l/jvAyBd7nRV"
gprofiler2
resultsgostplot(gost_results, capped = TRUE, interactive = TRUE)
publish_gosttable(gost_results, highlight_terms = gost_results$result[c(1:10),],
use_colors = TRUE,
show_columns = c("source", "term_name", "term_size", "intersection_size"),
filename = NULL)
## The input 'highlight_terms' is a data.frame. The column 'term_id' will be used.
Write gprofiler2
results to csv
go_results <- gost_results$result
go_results$parents <- as.character(gost_results$result$parents)
write.csv(go_results,
file = "./results/Transplant_vs_Naive/Transplant_vs_Naive_Gprofiler_padj_fc.csv",
row.names = FALSE)
sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gprofiler2_0.2.3 ComplexUpset_1.3.3 EnhancedVolcano_1.20.0 RColorBrewer_1.1-3
## [5] ComplexHeatmap_2.18.0 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
## [9] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [13] tidyverse_2.0.0 dplyr_1.1.4 ggrepel_0.9.5 ggplot2_3.5.0
## [17] magrittr_2.0.3 biomaRt_2.58.2 vsn_3.70.0 Glimma_2.12.0
## [21] DESeq2_1.42.1 SummarizedExperiment_1.32.0 Biobase_2.62.0 MatrixGenerics_1.14.0
## [25] matrixStats_1.3.0 GenomicRanges_1.54.1 GenomeInfoDb_1.38.8 IRanges_2.36.0
## [29] S4Vectors_0.40.2 BiocGenerics_0.48.1 tximport_1.30.0
##
## loaded via a namespace (and not attached):
## [1] rstudioapi_0.16.0 jsonlite_1.8.8 shape_1.4.6.1 farver_2.1.1 rmarkdown_2.26
## [6] GlobalOptions_0.1.2 zlibbioc_1.48.2 vctrs_0.6.5 memoise_2.0.1 RCurl_1.98-1.14
## [11] htmltools_0.5.8.1 S4Arrays_1.2.1 progress_1.2.3 curl_5.2.1 SparseArray_1.2.4
## [16] sass_0.4.9 bslib_0.7.0 htmlwidgets_1.6.4 fontawesome_0.5.2 plyr_1.8.9
## [21] plotly_4.10.4 cachem_1.0.8 mime_0.12 lifecycle_1.0.4 iterators_1.0.14
## [26] pkgconfig_2.0.3 Matrix_1.6-5 R6_2.5.1 fastmap_1.1.1 shiny_1.8.1.1
## [31] GenomeInfoDbData_1.2.11 clue_0.3-65 numDeriv_2016.8-1.1 digest_0.6.35 colorspace_2.1-0
## [36] patchwork_1.2.0 AnnotationDbi_1.64.1 crosstalk_1.2.1 RSQLite_2.3.6 filelock_1.0.3
## [41] labeling_0.4.3 fansi_1.0.6 timechange_0.3.0 httr_1.4.7 abind_1.4-5
## [46] compiler_4.3.3 bit64_4.0.5 withr_3.0.0 doParallel_1.0.17 BiocParallel_1.36.0
## [51] DBI_1.2.2 hexbin_1.28.3 highr_0.10 MASS_7.3-60.0.1 rappdirs_0.3.3
## [56] DelayedArray_0.28.0 rjson_0.2.21 tools_4.3.3 httpuv_1.6.15 glue_1.7.0
## [61] promises_1.3.0 cluster_2.1.6 generics_0.1.3 gtable_0.3.5 tzdb_0.4.0
## [66] preprocessCore_1.64.0 data.table_1.15.4 hms_1.1.3 xml2_1.3.6 utf8_1.2.4
## [71] XVector_0.42.0 foreach_1.5.2 pillar_1.9.0 emdbook_1.3.13 vroom_1.6.5
## [76] limma_3.58.1 later_1.3.2 circlize_0.4.16 BiocFileCache_2.10.2 lattice_0.22-6
## [81] bit_4.0.5 tidyselect_1.2.1 locfit_1.5-9.9 Biostrings_2.70.3 knitr_1.46
## [86] gridExtra_2.3 edgeR_4.0.16 xfun_0.43 statmod_1.5.0 stringi_1.8.3
## [91] lazyeval_0.2.2 yaml_2.3.8 evaluate_0.23 codetools_0.2-20 bbmle_1.0.25.1
## [96] BiocManager_1.30.22 cli_3.6.2 affyio_1.72.0 xtable_1.8-4 munsell_0.5.1
## [101] jquerylib_0.1.4 Rcpp_1.0.12 dbplyr_2.5.0 coda_0.19-4.1 png_0.1-8
## [106] bdsmatrix_1.3-7 XML_3.99-0.16.1 parallel_4.3.3 blob_1.2.4 prettyunits_1.2.0
## [111] bitops_1.0-7 viridisLite_0.4.2 mvtnorm_1.2-4 apeglm_1.24.0 scales_1.3.0
## [116] affy_1.80.0 crayon_1.5.2 GetoptLong_1.0.5 rlang_1.1.3 KEGGREST_1.42.0