Functional Analysis

.title[
# Functional Analysis
]
.author[
### <a href="https://shibalytics.com/">Max Qiu, PhD</a> </br> Bioinformatician/Computational Biologist </br> <a href="mailto:maxqiu@unl.edu" class="email">maxqiu@unl.edu</a>
]
.institute[
### <a href="https://biotech.unl.edu/bioinformatics">Bioinformatics Research Core Facility, Center for Biotechnology</a> </br> <a href="https://ncibc.unl.edu/">Data Life Science Core, NCIBC</a>
]
.date[
### 02-17-2023
]

---

background-image: url(data:image/png;base64,#https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs13024-018-0304-2/MediaObjects/13024_2018_304_Fig1_HTML.png?as=webp)
background-size: 75%

# MS Omics Workflow

.footnote[
[Shao, Y., Le, W. Mol Neurodegeneration 14, 3 (2019)](https://doi.org/10.1186/s13024-018-0304-2)

]

???

After deconvolution, missing value assessment, batch correction and QA, there are two things we should keep in mind before statistical analysis.

* Data distribution: 
  + Log transformation, normalization and/or scaling. 
  + Resembles normal: t-test and ANOVA for univariate analysis
  + Does not resemble normal: Wilcoxon and Kruskal-Wallis tests for univariate analysis
* High-dimensionality
  + Univariate analysis with multiple testing correction
  + Multivariate analysis: dimension reduction for visualization and feature selection/extraction (PCA, PLS-DA)

---

# Pathway analysis - interpret gene list

My Omics experiment worked great and produced 1000 DEGs (differentially expressed genes)! ... Now what? 
* Genome-scale analysis (Omics)
  + Genomics, transcriptomics, proteomics
* Tell me what's interesting about these genes

???

Brief introduction of pathway analysis. The goal is to help you interpret gene list. 
**The general idea is that you perform some kind of experiment where you get thousands of genes as your result, and then you want to know how to interpret those.** Typically any kind of genome scale experiment or high-throughput Omics, such as transcriptomics, proteomics etc, produces lots of information like this. One of the main ways that people interpret this data is by trying to understand some mechanistic story that pulls these things together.

---

# Pathway analysis - interpret gene list

???

So, if we want to know what's interesting about the thousands of genes that resulted from our transcriptomics experiment, we might ask whether they are enriched in known pathways, complexes, or functions.

This picture represents a transcriptomics experiment. Once you have collected the data and screened out the genes of interest, you might rank or cluster them to generate a gene list. We then compare that gene list to prior knowledge about cellular processes using various analysis tools, and ideally make some interesting new discovery.

Pathway analysis can help us understand the role genes play in relation to each other in phenotype/disease. Pathway analysis is a great way to help you **brainstorm new ideas and generate mechanistic hypotheses**.

---

# Pathway analysis - Proteomics

### Gene **list** enrichment analysis

i.e., Fisher's exact test (hypergeometric test)

### Gene **set** enrichment analysis

e.g., modified Kolmogorov-Smirnov test (KS test) or Wilcoxon rank-sum test.

???

Through a proteomics experiment, we can get a list of differentially expressed genes. So, if we want to know what's interesting about these genes, we might ask if they are **enriched in known pathways, complexes or functions**.

Mass spectrometry-based untargeted metabolomics can now profile several thousand metabolites simultaneously. However, these metabolites have to be **identified** before any biological meaning can be drawn from the data. **Metabolite identification is a challenging and low-throughput process, and has therefore become the bottleneck of the field.**

The limited throughput of targeted metabolomics usually does not motivate large-scale pathway and network analysis. Untargeted metabolomics cannot move on to pathway and network analysis without knowing the identity of metabolites.

---

# Pathway analysis workflow overview

???

We start by normalizing the data and applying statistical analysis, and at the end of these steps we get the **gene list that we would like to functionally interpret in an unbiased way**. That is one element. The other element is the **prior knowledge that we have about the function of these genes**, which is collected and stored in different pathway databases.

These two elements, the gene list and the pathways, can only be connected if they use the **same gene identifiers**. If my gene list uses gene names, then my pathway database must also use gene names.

Once we have these two elements, we can perform pathway enrichment analysis, which is to **look for pathways that are over-represented in my gene list**. The end goal is to find a list of **pathways that were activated or inhibited** in our experimental model.

---

# Pathway enrichment analysis: a way to summarize your gene list into pathways to ease biological interpretation of the data

???

**A simpler way to think about pathway enrichment analysis is that it is just a way to organize your gene list into different categories of biological processes.** After we summarize our gene list this way, we can concentrate on these few biological processes to interpret our data and generate new hypotheses. In this sense, **pathway enrichment analysis can simplify data interpretation**. The main reason we do this is the large size of the gene list we get from an omics experiment; with a smaller gene list, we would interpret it in a different way.

---

# Pathway enrichment analysis calculates the overlap between our gene list and a pathway

???

Next is an important concept in pathway enrichment analysis: the **overlap**, which is used to calculate the enrichment score.

On the left, my gene list contains 41 genes. On the right, axon guidance is the first pathway we are testing, which contains 39 genes. The overlap between my gene list and this pathway is 13 genes. **The size of the overlap contributes to the enrichment score.** 13 is about a third of my gene list, and about a third of the genes in the pathway I am testing.

]

???

In addition to the simple concept of overlap, we can also associate a score with each gene when we calculate the pathway enrichment score. If the **genes in the overlap have higher scores**, this will **increase the enrichment score for the pathway** being tested. To do this, we rank our genes using a scoring system, such as fold change or the differential expression p-value.
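
The overlap described in these notes can be sketched in a few lines of Python. The gene identifiers below are hypothetical placeholders, chosen only so that the set sizes match the example (41 genes in the list, 39 in the pathway, 13 shared); they are not real gene names.

```python
# Hypothetical identifiers; only the set sizes mirror the example in these notes.
gene_list = {f"G{i}" for i in range(41)}      # 41 genes of interest (G0..G40)
pathway = {f"G{i}" for i in range(28, 67)}    # 39 pathway genes (G28..G66)

overlap = gene_list & pathway                 # genes present in both sets

# Fraction of the gene list, and of the pathway, covered by the overlap
frac_of_list = len(overlap) / len(gene_list)
frac_of_pathway = len(overlap) / len(pathway)

print(len(overlap), round(frac_of_list, 2), round(frac_of_pathway, 2))
```

The overlap size alone is not a test; it only becomes interpretable once it is compared with what random gene lists of the same size would give.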

---

# The background represents the genes that could have been captured in the omics experiment

???
Another important concept in enrichment testing is the **background**. The background, sometimes called the universe, represents only the **genes that could have been captured in the experiment**. **Genes that cannot be measured or are not expressed in my cells/samples are not part of the background/universe and should not be counted.**

With a microarray, the background is restricted to the genes placed on the array. Other omics experiments analyze the whole genome, so the background is less of a concern.

---

# Types of enrichment analysis (gene list)

]

???
Depending on your gene list, different statistical tests are used. Both tests give us **a value associated with each tested pathway**, and a p-value that assesses the **probability that this pathway appears enriched in our gene list by chance alone**. Because we test many pathways in a pathway enrichment analysis, we also need to correct for multiple testing using either Bonferroni or BH correction.

* **Defined gene list** (e.g., fold-change > 2-fold)
  + Answers the question: are any pathways (gene sets) surprisingly enriched in my gene list? 
  + Statistical test: **Fisher’s exact test**

* **Ranked gene list** (e.g., by differential expression)
  + Answers the question: are any pathways (gene sets) ranked surprisingly high or low in my ranked list of genes?
  + Statistical test: **rank-based sum test** included in the tool GSEA

]

???

If we provide a defined gene list using some threshold, the question we are trying to answer is …
If we provide a ranked gene list with score, the question we are trying to answer is …

---

# Ranked or not ranked?

* Possible problems with gene list test
  + No "neutral" value for the threshold
  + Different results at different threshold settings
  + Possible loss of statistical power due to thresholding
    * No resolution between significant signals with different strengths
    * Weak signal neglected

* Type of enrichment analysis
  + **Gene list** enrichment analysis
  + **Gene set** enrichment analysis
]

???

A ranked gene list is generally preferred over an unranked gene list for three main reasons, all related to avoiding an arbitrary threshold. With a defined gene list, it is difficult to decide where to put the threshold that selects the genes. If we are too stringent, we lose information. If we are too permissive, the result will include too many false positives. We don't have this issue with a ranked gene list.

]

???

There are three types of data from which it is easy to get a ranked gene list. With bulk RNA-Seq, we can rank all genes from up-regulated to down-regulated. The same applies to single-cell RNA-Seq. A ranked gene list is also possible with label-free proteomics, provided enough proteins are quantified.

---

# Ranked gene list: two-class

![](data:image/png;base64,#./img/pna_rank.png)

???

How do we do the ranking? For example, I have a label-free peptidomics dataset with a two-class experimental design, control and patient. I ran univariate analysis with multiple testing correction. After that, I want to organize my gene list so that the top up-regulated genes are at the top, the genes that did not change significantly are in the middle, and the top down-regulated genes are at the bottom.

The full stat table from univariate analysis contains logFC and P-adj. These are what we use to calculate the rank of each gene.

logFC reflects up- or down-regulation: a positive value means up-regulated, and a negative value means down-regulated.
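
One way to turn logFC and P-adj into a single rank is sketched below. The formula `sign(logFC) * -log10(P-adj)` is a common convention and an assumption here; these notes do not state which formula was used for the figure. The gene names and values are made up for illustration, and P-adj is assumed to be strictly greater than zero.

```python
import math

def rank_score(log_fc, p_adj):
    """Signed significance: direction from logFC, magnitude from -log10(P-adj).

    Assumed convention, not taken from the slides. Requires p_adj > 0.
    """
    sign = 1.0 if log_fc > 0 else -1.0 if log_fc < 0 else 0.0
    return sign * -math.log10(p_adj)

# Hypothetical stat table: gene -> (logFC, P-adj)
stats = {
    "GENE_A": (2.1, 1e-6),   # strongly up-regulated
    "GENE_B": (-1.8, 1e-4),  # strongly down-regulated
    "GENE_C": (0.1, 0.9),    # not significant, slightly up
    "GENE_D": (-0.2, 0.7),   # not significant, slightly down
}

# Highest score first: up-regulated on top, down-regulated at the bottom
ranked = sorted(stats, key=lambda g: rank_score(*stats[g]), reverse=True)
print(ranked)
```

With this scoring, genes that did not change significantly get scores near zero and therefore land in the middle of the list.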

---

# How does **gene list** enrichment analysis work?

???

**Given a gene list and a pathway, is this pathway surprisingly enriched in the gene list? Here we need to define and estimate “surprisingly”.**

We have 41 genes in pink, which are **part of the background**, represented in gold. Our pathway has 39 genes, giving an overlap of 13. The next question is: **is this overlap larger than expected by chance?** From the analysis, we get a p-value: **the smaller the p-value, the less likely it is that an overlap this large arose by random chance alone**.

How do we get this p-value? One method is **random sampling using background genes**. Say we draw 1000 random gene lists of 41 genes and **compute the overlap size for each random list to build a null distribution**. The empirical p-value is the fraction of random lists whose overlap is at least as large as the observed overlap of 13. **The p-value assesses the probability that an overlap this large is observed by chance alone.** A p-value ranges from 0 to 1; if the p-value is near zero, there is an **extremely low chance of observing this overlap under random sampling**, which leads us to reject the null hypothesis.

This is a permutation-based test, which takes a lot of time and computing resources. If we already know the null distribution of random sampling, we don’t have to do permutation. The null distribution for random sampling is known: it is the **hypergeometric distribution**, and the test using this distribution is **Fisher’s exact test**. With Fisher’s exact test, we don’t have to calculate an empirical p-value; instead, we can calculate the p-value analytically.
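
The random-sampling idea can be sketched as follows. This is a minimal illustration, not the implementation of any particular tool. The background size of 2000 genes and the gene identifiers are assumptions, since the notes only give the list size (41), the pathway size (39), and the overlap (13). The `+ 1` in the numerator and denominator is a common convention that keeps the empirical p-value from being exactly zero.

```python
import random

def empirical_p_value(background, pathway, list_size, observed_overlap,
                      n_perm=1000, seed=0):
    """Fraction of random gene lists whose overlap with the pathway
    is at least as large as the observed overlap."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    hits = 0
    for _ in range(n_perm):
        random_list = rng.sample(background, list_size)
        overlap = len(pathway.intersection(random_list))
        if overlap >= observed_overlap:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical background of 2000 measurable genes
background = [f"G{i}" for i in range(2000)]
pathway = set(background[:39])         # a 39-gene pathway inside the background

p = empirical_p_value(background, pathway, list_size=41, observed_overlap=13)
print(p)
```

The smallest p-value this approach can report is limited by the number of random lists, which is one reason an analytical test is attractive.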

---

# How to assess "surprisingly" (statistics)?

* Fisher's exact test
  + Null hypothesis: list is a random sample from population
  + Alternative hypothesis: more black genes than expected in my list

* A bucket of different colored balls
  + Background
  + 500 black genes
  + 4500 red genes

]

???

The null distribution for random sampling is known: it is the hypergeometric distribution, and the test using this distribution is Fisher’s exact test (or hypergeometric test). With Fisher’s exact test, we don’t have to calculate an empirical p-value; instead, we can calculate the p-value analytically.
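
A minimal sketch of the analytical calculation, using the bucket from the slide (500 black genes and 4500 red genes). The list size of 10 and the observed count of 4 black genes are assumptions made for illustration. The function computes the upper tail of the hypergeometric distribution, which corresponds to the one-sided alternative "more black genes than expected in my list".

```python
from math import comb

def hypergeom_tail(population, black, draws, observed):
    """P(X >= observed), where X is the number of black genes in a
    random sample of `draws` genes taken without replacement."""
    total = comb(population, draws)
    upper = min(draws, black)
    favourable = sum(
        comb(black, i) * comb(population - black, draws - i)
        for i in range(observed, upper + 1)
    )
    return favourable / total

# Bucket from the slide: 500 black + 4500 red = 5000 background genes.
# Assumed for illustration: a list of 10 genes containing 4 black genes.
p_value = hypergeom_tail(population=5000, black=500, draws=10, observed=4)
print(p_value)
```

Summing from the observed count upward is what makes this "the observed count or more", rather than the probability of the observed count alone.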

---

# [Fisher's exact test](https://www.youtube.com/watch?v=udyAvvaMjfM) &nbsp; &nbsp; &nbsp; &nbsp; ![](data:image/png;base64,#./img/PNA10_1.jpg)

???
The probability is high when the result is easy to get by chance. The p-value for observing 4 black genes is the probability of getting 4 or more black genes.

---

# How does **gene set** enrichment analysis work?

]

???
GSEA uses a modified Kolmogorov-Smirnov test (KS test) to calculate the enrichment score. The ranked gene list has **upregulated genes at the top, downregulated genes at the bottom, and nonsignificant genes in the middle**. **GSEA checks whether a pathway is enriched at the top or at the bottom of the ranked gene list.** It gives **a p-value**, which assesses the probability that this enrichment arises by random chance alone, and **a direction**, indicated by the sign of the enrichment score. **A positive enrichment score means the pathway is enriched in upregulated genes.** **A negative enrichment score means the pathway is enriched in downregulated genes.**

---

# How is enrichment score calculated?

???

Place the **ranked gene list horizontally** and place the pathway being tested on top of it. **The black bars indicate the genes that are also in the pathway, and we can see where they are located in the ranked list. Walking along the list, each time a gene is in the pathway, the ES increases by a step; each time a gene is not in the pathway, the ES decreases.** To get a score like this, we need many genes from our ranked gene list to also be in the pathway. The maximum deviation from zero of this running score (its maximum or minimum) is the final ES for the pathway.

GSEA also has a **weighting system**. **Genes at both ends of the ranked gene list have more weight than genes in the middle**, because we don’t want nonsignificant genes to affect the ES. The ES usually peaks near the left or right end, not in the middle.
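
The running sum can be sketched as below. This is a simplified illustration of the weighted running-sum idea, not GSEA's actual code: hits step up in proportion to the absolute ranking score raised to a `weight` exponent, misses step down by a constant, and the ES is the largest deviation from zero. The gene names and scores are hypothetical.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Weighted running-sum enrichment score (simplified sketch).

    ranked_genes: genes ordered from most up- to most down-regulated
    scores:       ranking score of each gene, in the same order
    gene_set:     set of genes in the pathway being tested
    """
    hit_weights = [abs(s) ** weight if g in gene_set else 0.0
                   for g, s in zip(ranked_genes, scores)]
    total_hit = sum(hit_weights)
    n_miss = sum(1 for g in ranked_genes if g not in gene_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, best = 0.0, 0.0
    for gene, w in zip(ranked_genes, hit_weights):
        if gene in gene_set:
            running += w / total_hit     # step up, larger for extreme scores
        else:
            running -= 1.0 / n_miss      # step down by a constant amount
        if abs(running) > abs(best):     # keep the largest deviation from zero
            best = running
    return best

# Hypothetical ranked list with ranking scores
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

es_top = enrichment_score(genes, scores, {"g1", "g2"})      # pathway at the top
es_bottom = enrichment_score(genes, scores, {"g5", "g6"})   # pathway at the bottom
print(es_top, es_bottom)
```

Because the hit steps are scaled by the score, genes near either end of the list move the running sum more than genes in the middle, whose scores are close to zero.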

---

# From ES score to p-value

* Generate null-hypothesis distribution from randomized data (through permutation)

* Estimate empirical p-value by comparing observed ES score to null distribution

???
We need to **estimate whether the observed enrichment score is larger than what could have been obtained by chance alone**. GSEA uses **permutation** to calculate an empirical p-value. Several permutation methods are available. One way is to replace the genes in the pathway with random genes to **create random pathways**. Another is to **create random ranked gene lists**. Either way, we compare the observed ES to a null distribution built from randomized data.
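
The "random pathways" variant can be sketched as follows. For brevity this uses an unweighted running sum, which is a simplification and not what GSEA computes by default; the gene names, list length, and number of permutations are assumptions for illustration.

```python
import random

def running_sum_es(ranked_genes, gene_set):
    """Unweighted running sum: step up on hits, step down on misses."""
    n_hit = sum(1 for g in ranked_genes if g in gene_set)
    n_miss = len(ranked_genes) - n_hit
    running = best = 0.0
    for g in ranked_genes:
        running += 1.0 / n_hit if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def gene_set_permutation_p(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Empirical p-value from random pathways of the same size."""
    rng = random.Random(seed)
    observed = running_sum_es(ranked_genes, gene_set)
    size = sum(1 for g in ranked_genes if g in gene_set)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))   # a random pathway
        null_es = running_sum_es(ranked_genes, random_set)
        # Count null scores at least as extreme, in the same direction
        if (observed >= 0 and null_es >= observed) or \
           (observed < 0 and null_es <= observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)

# Hypothetical ranked list of 100 genes; the pathway is the top 10 genes
ranked = [f"g{i}" for i in range(100)]
observed_es, p_value = gene_set_permutation_p(ranked, set(ranked[:10]))
print(observed_es, p_value)
```

Comparing only against null scores of the same sign keeps the test directional, matching the idea that a pathway is enriched either at the top or at the bottom of the list.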

---

# Other enrichment tests for **ranked gene list**

### Panther: **Wilcoxon rank-sum test**

Available in [PANTHER](http://www.pantherdb.org/)

]

???

Other than GSEA, another method that uses a ranked gene list is the Wilcoxon rank-sum test, which **only considers the rank of the genes**. We have the global null distribution in blue and the distribution of the pathway we are testing in red. In this graph, the red line shows **a shift toward higher log fold change**, meaning this **pathway is enriched in upregulated genes**. This graph is an output from PANTHER, which uses the Wilcoxon rank-sum test.
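
A minimal sketch of a rank-sum test comparing the log fold changes of pathway genes with those of all other genes. It uses the normal approximation without tie or continuity corrections, which is a simplification, and it is not PANTHER's implementation. The input values are made up.

```python
import math

def rank_sum_test(pathway_values, other_values):
    """Wilcoxon rank-sum (Mann-Whitney U) test, normal approximation.

    Returns (U for the pathway group, z-score, two-sided p-value).
    """
    combined = sorted([(v, 0) for v in pathway_values] +
                      [(v, 1) for v in other_values])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1          # tied values share the average rank
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1

    n1, n2 = len(pathway_values), len(other_values)
    rank_sum = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u = rank_sum - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
    return u, z, p

# Hypothetical log fold changes
pathway_lfc = [2.0, 2.5, 3.0, 3.5, 4.0]
other_lfc = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]

u, z, p = rank_sum_test(pathway_lfc, other_lfc)
print(u, z, p)
```

A positive z-score corresponds to the pathway genes being shifted toward higher log fold change than the rest.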

---

# How to correct for repeating the test?

* We are testing many pathways at the same time -> **Correction for multiple hypothesis testing**

* Bonferroni: controlling the family-wise error rate (FWER)

* Benjamini-Hochberg procedure: controlling false discovery rate (FDR)
]

???
So far, we have talked about how to test one pathway. In practice, we test many pathways at the same time, so multiple testing correction is needed. Even if an event is unlikely, if we try many times, we may still observe it.
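
Both corrections are short enough to sketch directly. The p-values below are made up; in practice there would be one per tested pathway.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1 (controls FWER)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for four pathways
raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
print(bonf)
print(bh)
```

Bonferroni is the stricter of the two, so its adjusted values are never smaller than the Benjamini-Hochberg ones for the same input.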

]

---

.footnote[
Read about 
[g:Profiler](https://biit.cs.ut.ee/gprofiler/), 
[gprofiler2](https://cran.r-project.org/package=gprofiler2/vignettes/gprofiler2.html) the R package, and
[GSEA](https://www.gsea-msigdb.org/)

]

???
How to choose a tool?

* Does it cover your model organism?
* Is there a good choice of gene sets (pathway database)?
* Are the pathway databases up to date?
* Which statistics (for gene list or ranked gene list)?
* Is the description of statistics clear enough? 
* Do you like the output style?
* Can you connect it with network visualization tools?

---

# Pathway analysis - Metabolomics

## Targeted

### Metabolite list enrichment analysis

* i.e., Over representation analysis (ORA) (Fisher's exact test)

* ORA is implemented in [Enrichment Analysis module](https://www.metaboanalyst.ca/MetaboAnalyst/upload/EnrichUploadView.xhtml) of `MetaboAnalyst` and also implemented in `IMPaLA`.

### Metabolite set enrichment analysis (MSEA)

* MSEA using [globaltest](https://pubmed.ncbi.nlm.nih.gov/14693814/) is implemented in [Enrichment Analysis module](https://www.metaboanalyst.ca/MetaboAnalyst/upload/EnrichUploadView.xhtml) of `MetaboAnalyst`.

* MSEA using [Wilcoxon rank sum test](https://pubmed.ncbi.nlm.nih.gov/16081659/) is implemented by `IMPaLA`.

.footnote[
Read about how globaltest and Wilcoxon rank sum test are implemented in [`MetaboAnalyst`](https://pubmed.ncbi.nlm.nih.gov/20457745/) and [`IMPaLA`](https://pubmed.ncbi.nlm.nih.gov/21893519/), respectively.
]

???

Metabolites can be placed into context with upstream genes and proteins to generate mechanistic hypotheses. This is a great way to brainstorm new hypotheses and ideas.

Unlike transcriptomics which allows comprehensive gene expression profiling, **targeted metabolomics usually covers only a small percentage of metabolome** (the actual coverage is platform/protocol specific). This means that metabolites (defined in our current pathways or metabolite sets) **do not have equal probabilities of being measured in your studies**, and the enriched functions are the results from both platform/protocol-specific effects and biological perturbations. Since the primary interest is to detect the latter, we highly recommend uploading a **reference metabolome** containing all measurable metabolites from your platform to eliminate the former effects.

---

# Over representation analysis (ORA)

* Tests whether a particular group of compounds is **significantly associated** with a particular pathway or set of pathways **more than expected by chance**

* Takes a list of compounds that score above a certain **threshold** (e.g., a list of important compounds identified by feature selection)

* Implemented based on a cumulative **hypergeometric** distribution (Fisher's)

* Scores each pathway by counting the **number of overlaps**

* The p-value from ORA indicates the probability of seeing at least a **particular number of metabolites (hits) from a certain metabolite pathway** by chance

]

.pull-right[
![ORA impala](data:image/png;base64,#https://www.nonlinear.com/progenesis/qi/v2.0/faq/images/EnrichOverrep2.png)
]

.footnote[
Implemented in both [`IMPaLA`](https://doi.org/10.1093/bioinformatics/btg382) and [metaboAnalyst](https://www.metaboanalyst.ca/), among many others

]

???

Over-representation analysis (ORA) is perhaps the **most common pathway analysis** method used in the metabolomics community.

ORA analyzes **whether the list you supply is significantly associated with a particular pathway or set of pathways**. That is, whether the list is localized to certain pathways or classifications, instead of randomly scattered throughout the whole set of possible pathways.

Therefore, you should have **already selected a list of identifiers of interest**, which are a sub-set of all the metabolites you can/have measured. These metabolites of interest might be those **significantly different between your experimental conditions**, for example.

You have the option of also supplying the **entire list of measurable metabolites as a background list**, which is advisable.

---

# Over representation analysis (ORA)

![Venn diagram ORA](data:image/png;base64,#https://journals.plos.org/ploscompbiol/article/figure/image?size=inline&id=10.1371/journal.pcbi.1009105.g001)

]

* **Only considers overlap** (i.e., the total number of compounds that match a particular pathway)

* **Does not consider the magnitude** of concentration changes of those hits (not quantitative) **--> Quantitative enrichment analysis**

* Use **arbitrary threshold** as cutoff. Many moderate but meaningful changes may be missed if an inappropriate threshold is chosen.

* Needs to adjust for multiple testing.

]

.footnote[
[Wieder C, et al., PLOS Computational Biology 17(9): e1009105](https://doi.org/10.1371/journal.pcbi.1009105)
]

???

ORA relies on you having selected a subset appropriately, and all metabolites on the list are treated as equally important by the test.

So compounds with more significant changes are treated the same as compounds with less significant changes.

Venn diagram representing ORA parameters: N is the size of background set, n denotes the number of metabolites of interest (i.e., differentially abundant metabolites), M is the number of metabolites in the background set mapping to the ith pathway, and k gives the number of metabolites of interest which map to the ith pathway.
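
Using the symbols defined for the Venn diagram (N, n, M, k), the ORA p-value for the ith pathway can be written as the upper tail of a hypergeometric distribution. This is the standard textbook form, given here as a sketch. The random variable X (the number of metabolites of interest that map to the pathway when n metabolites are drawn at random from the background) and the summation index j are notation introduced for illustration.

```latex
P(X \ge k) \;=\; \sum_{j=k}^{\min(n,\,M)}
  \frac{\binom{M}{j}\,\binom{N-M}{n-j}}{\binom{N}{n}}
```

Each term is the probability of drawing exactly j pathway metabolites, and the sum runs from the observed count k up to the largest count possible.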

---

# Metabolite set enrichment analysis

* Requires **full metabolite feature set** along with a **quantitative** measure for each metabolite reflecting its difference between two states

* Assess the joint quantitative difference of all metabolite entities contained in each pathway through **Wilcoxon rank sum test**

* Needs to adjust for multiple testing.

]

.pull-right[
![enrichment analysis](data:image/png;base64,#https://www.nonlinear.com/progenesis/qi/v2.0/faq/images/EnrichOverrep3.png)

]

???

MSEA requires your full metabolite feature set, along with an expression measure for each metabolite reflecting its difference between two states. **Either a fold change value or two average concentration values for two different experimental conditions.**

This test is a more **hypothesis-free approach**, in that **you have not pre-selected the metabolites of interest**, and also the **relative extent of between-group differences** is taken into account for every metabolite.

---

# Metabolomics pathway analysis

## Untargeted: Mummichog algorithms

* The [mummichog algorithm](https://doi.org/10.1371/journal.pcbi.1003123) is implemented in the [Functional Analysis module](https://www.metaboanalyst.ca/MetaboAnalyst/upload/PeakUploadView.xhtml) of `MetaboAnalyst`.

* Input: Ranked peak list (by p-value or t-statistics) or Peak intensity table

* The goal is to predict biological activity directly from mass spectrometry data **without a priori identification of metabolites**.
  + From user input, mummichog requires two lists of m/z features, the significant list `\(L_{sig}\)` (e.g. selected by univariate statistics) and the reference list `\(L_{ref}\)` (all features detected in the MS experiment). 
  + From the m/z features in `\(L_{sig}\)`, mummichog computes all possibly matched metabolites, and searches the reference metabolic network for all the modules that can be formed by these tentative metabolites. (**Significant pathways**)
  + Random lists of m/z features are drawn from `\(L_{ref}\)` many times to estimate the null distribution of module activities. The statistical significance of modules from `\(L_{sig}\)` can then be calculated on this null distribution. (**Background pathways**)

???

Mass spectrometry-based untargeted metabolomics can now profile several thousand metabolites simultaneously. However, these metabolites have to be **identified** before any biological meaning can be drawn from the data. **Metabolite identification is a challenging and low-throughput process, and has therefore become the bottleneck of the field.**

The limited throughput of targeted metabolomics usually does not motivate large-scale network analysis. Untargeted metabolomics cannot move on to pathway and network analysis without knowing the identity of metabolites.

This module supports functional analysis of untargeted metabolomics data generated from mass spectrometry. The basic assumption is that putative annotation at the individual compound level can collectively predict changes at functional levels, as defined by metabolite sets or pathways. This is because changes at the group level rely on "collective behavior", which is more tolerant of random errors in compound annotation.

---

# Topological analysis

* Represent the **structure of biological pathway** and complex relationships between compounds on the same pathway (activation, inhibition, reaction, etc.)

* Changes in **key positions** of a network will trigger a more severe impact than changes in marginal positions

* Topology-based pathway analysis will compute a score for each pathway which quantifies the significance of changes between the two phenotypes

]

]

.footnote[
Implemented in [Pathway Analysis module](https://www.metaboanalyst.ca/MetaboAnalyst/upload/PathUploadView.xhtml) of `metaboAnalyst`
]

???

**The structure of biological pathways represents our knowledge about the complex relationships between molecules** (activation, inhibition, reaction, etc.). However, neither over-representation analysis nor quantitative enrichment analysis (MSEA) takes the **pathway structure** into consideration when determining **which pathways are more likely to be involved in the conditions under study**.

Changes in the key positions of a network will trigger a more severe impact on the pathway than changes in marginal or relatively isolated positions.

The pathway analysis plot shows all the matched pathways (the metabolome) arranged by **p values (from pathway enrichment analysis) on the Y-axis and pathway impact values (from pathway topology analysis) on the X-axis**. The **node color** is based on the p value and the **node radius** is based on the pathway impact value.

---

# Pathway analysis summary

### Recipe for defined list enrichment analysis

* A defined list
* Background/reference omics (all the genes/metabolites that could be detected by your analytical platform)
* Pathway database

**Proteomics**
* A ranked gene list
* Background
* Pathway database
]

**Targeted metabolomics**
* A ranked metabolite list or metabolite with concentration (table)
* Background/Reference metabolome
* Pathway database

**Untargeted metabolomics**
* Ranked peak list or peak intensity table
* Pathway database
]

---

# Thanks!

Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). Sections of this lecture were modified from "Pathway and Network Analysis" workshop materials given by [**Canadian Bioinformatics Workshop**](https://bioinformatics.ca)

---

# Resources:

* [Data carpentry](https://datacarpentry.org/lessons/)
  + [Data Analysis and Visualization in R](https://datacarpentry.org/R-genomics/index.html)
  + [Data Analysis and Visualization in R for Ecologists](https://datacarpentry.org/R-ecology-lesson/index.html)

* [Software carpentry](https://software-carpentry.org/lessons/)
  + [Programming with R](http://swcarpentry.github.io/r-novice-inflammation/)
  + [Programming with Python](https://swcarpentry.github.io/python-novice-inflammation/)
  + [Version Control with Git](https://swcarpentry.github.io/git-novice/)
  + [The Unix Shell](https://swcarpentry.github.io/shell-novice/)
  + [R for Reproducible Scientific Analysis](https://swcarpentry.github.io/r-novice-gapminder/)

* [OpenIntro Statistics](https://www.openintro.org/book/os/)
* [An Introduction to Statistical Learning](https://www.statlearning.com/)