class: center, middle, inverse, title-slide .title[ # Statistical analysis (for High Dimensional data) ] .author[ ###
Max Qiu, PhD
Bioinformatician/Computational Biologist
maxqiu@unl.edu
] .institute[ ###
Bioinformatics Research Core Facility, Center for Biotechnology
Data Life Science Core, NCIBC
.date[ ### 02-15-2023 ] --- background-image: url(data:image/png;base64,#https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs13024-018-0304-2/MediaObjects/13024_2018_304_Fig1_HTML.png?as=webp) background-size: 75% # We bring in errors every step of the way .footnote[ [Shao, Y., Le, W. Mol Neurodegeneration 14, 3 (2019)](https://doi.org/10.1186/s13024-018-0304-2) ] ??? We bring in errors and bias every step of the way. We will discuss what to look out for in each of these steps. Let's pause for a minute and think about what could possibly go wrong in each step, and what variance we could possibly bring into the data. --- # (How to reduce) Sources of variances .pull-left[ ### Experimental design - Confounding factors - Control, randomize, replicate, block - Sample size `\(n\)` and statistical power ### Sample Collection and Preparation - Robust SOP - Sampling bias - Technical replicates vs biological replicates - Pseudo-replication - Quality Control: pooled QC and injection order ] .pull-right[ ### Data Acquisition (instrumentation and data preprocessing) - Instruments: LC-MS, LC-MS/MS - Deconvolution: sample alignment and peak picking ### (Post-acquisition) Data Processing - Assess the quality of data: missing values - Assess feature presence/absence: 80% rule; missing value imputation - QC-based batch correction - High-dimensional data processing + Log2 transformation, normalization, scaling + Goal: near Gaussian distribution ] ??? The study question needs to be clearly defined ahead of time, because it will influence the experimental design and everything that follows. In experimental design, we need to **control all sources of variation except the one** we are interested in studying. We can control and remove confounding factors by **setting up a control group, random sampling and assignment, use of biological and technical replicates, and blocking**. We also discussed statistical power and power analysis to decide the sample number per group. We covered sample collection and preparation, including **different types of samples and their impact on sample collection, storage and extraction**. We know **experimental procedures should be standardized and optimized** and that different groups should be treated in the same way. However, there are a few things to be cautious about here. In addition, we also discussed types of QC samples, the preparation of pooled QC, and planning the injection order (layout). In data acquisition, we discussed the instruments themselves and deconvolution. In data processing, we discussed types of missing values, how to assess feature presence/absence, how to impute values for present features, QC-based batch correction, and how to normalize the data. --- # Why do we care anyway? ??? If I am the one performing the whole experiment, and I bring an equal amount of variance/bias to each group, then the groups are still comparable because an equivalent variance/bias has been brought to each group. (Wrong, of course.) -- .left-column[ ## Research reproducibility ] -- .right-column[ .pull-left[ ![Reports_rising](data:image/png;base64,#./img/F1.large.jpg)] .footnote[ [Steven N. Goodman et al., Sci Transl Med 2016;8:341ps12](https://stm.sciencemag.org/content/8/341/341ps12) ] .pull-right[ ### Rubric of reproducibility: design, reporting, analysis, interpretation - Method reproducibility - Results reproducibility - Robustness and generalizability - Inferential reproducibility ] ] ??? This graph is taken from a paper called "What does research reproducibility mean?" by Goodman et al.
on Science Translational Medicine, which shows the number of publications recorded in Scopus that have, in the title or abstract, at least one of the following expressions: research reproducibility, reproducibility of research, reproducibility of results, results reproducibility, reproducibility of study, study reproducibility, reproducible research, reproducible finding, or reproducible result. **It shows that concern about the reproducibility of scientific research has been steadily rising.** --- # Statistical analysis .pull-left[ ### Understand the statistical nature of your dataset * Is your data normally distributed? * **High-dimensional** ### Considerations for high-dimensionality * Univariate analysis: **multiple testing** * Multivariate analysis: **dimension reduction** ] ??? The next step in the workflow is to use statistical techniques to extract the relevant and meaningful information from the processed data. There are two types of statistical analysis: univariate and multivariate. Last lecture we discussed the **distribution of the data**, and how it affects the type of statistical test applicable to the dataset. This time our focus will be on the **high-dimensional nature of the data** and its impact on statistics, specifically **multiple testing correction and dimension reduction**. We will walk through these as we talk about the statistical analysis. | -- .pull-left[ ### Univariate - feature selection * Comparisons + compare numeric features between groups + check distribution with histogram and qq plot * Multiple testing correction + control false discovery rate (FDR) with BH correction * Ratios (Fold change) + degree of quantity change between two groups + Volcano plot: log2(FC) ~ -log10(p-values) ] ??? Univariate statistics refers to all statistical analyses that include **a single dependent variable** and can include one or more independent variables. Univariate analysis mainly includes contrast analysis (i.e., pairwise comparison) and omnibus tests (i.e., multi-group comparison). **Typically, the primary goal is to identify features that differ between groups.** --- # Univariate analysis: Differential analysis .pull-left[ * **Comparisons** + **compare numeric features between groups** + **check distribution with histogram and qq plot** * Multiple testing correction + control false discovery rate (FDR) with BH correction * Ratios (Fold change) + degree of quantity change between two groups + Volcano plot: log2(FC) ~ -log10(p-values) ] ??? The goal of univariate analysis is to **differentiate**: to identify the features that are significantly changing between classes of biological samples. The type of univariate analysis, parametric or non-parametric tests, depends on the distribution of the data. Therefore, always check your distribution with a histogram and qq plot. | -- .pull-right[ ## Data distribution and normality * If the sample distribution is near normal, **parametric** methods can be applied + **T-test**: assumes normal distribution and equal variance; prefer **Welch's t-test** over Student's t-test + **ANOVA**: assumes normal distribution and equal variance * If the sample distribution is not normal and the central limit theorem (CLT) is not satisfied, only **non-parametric** methods can be used for hypothesis testing + **Wilcoxon rank-sum test** + **Kruskal-Wallis test** ] ??? After the data processing steps above, if the data distribution is approximately normal, parametric statistical analysis should be applied (i.e., a t-test for a contrast analysis and/or ANOVA for an omnibus test). In the case of data with a non-normal distribution, nonparametric statistical analysis should be considered: for example, the Wilcoxon rank-sum test (also called the Mann–Whitney U test or Wilcoxon–Mann–Whitney test) and/or the Kruskal-Wallis one-way ANOVA for a contrast analysis and an omnibus test, respectively. We should use **Welch's t-test** by default, instead of Student's t-test, because Welch's t-test performs better than Student's t-test whenever **sample sizes and variances are unequal between groups**, and gives the same result when sample sizes and variances are equal.
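As a minimal sketch (not part of the original deck; the data and object names are hypothetical), these checks and tests take a few lines of base R:

```r
## Simulated feature intensities for two hypothetical groups
set.seed(42)
grp_a <- rlnorm(20, meanlog = 10, sdlog = 1)
grp_b <- rlnorm(20, meanlog = 11, sdlog = 1)

## Check the distribution with a histogram and a qq plot
hist(log2(grp_a))
qqnorm(log2(grp_a)); qqline(log2(grp_a))

## Parametric: t.test() performs Welch's t-test by default (var.equal = FALSE)
t.test(log2(grp_a), log2(grp_b))

## Non-parametric alternatives
wilcox.test(grp_a, grp_b)        # Wilcoxon rank-sum test (two groups)
kruskal.test(list(grp_a, grp_b)) # Kruskal-Wallis test (two or more groups)
```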
--- # Univariate analysis .pull-left[ * Comparisons + compare numeric features between groups + check distribution with histogram and qq plot * **Multiple testing correction** + **control false discovery rate (FDR) with BH correction** * Ratios (Fold change) + degree of quantity change between two groups + Volcano plot: log2(FC) ~ -log10(p-values) ] .pull-right[ ## Multiple testing correction * Why? **Multiple simultaneous statistical tests** increase the number of false positives in the results. * Familywise error rate (FWER) vs. False discovery rate (FDR) .footnote[ [FWER vs. FDR](https://egap.org/resource/10-things-to-know-about-multiple-comparisons/) ] ] ??? Univariate analysis focuses on the analysis of a single dependent variable (metabolite/peptide). However, in a high-dimensional dataset generated by LC-MS, there are thousands of features being tested all at once. Performing multiple statistical tests simultaneously will dramatically increase the number of false positives in the results. If multiple hypothesis tests are run at the same time, the expected number of false positives scales with the number of tests. For example, if one test is performed at the 5% level and the corresponding null hypothesis is true, there is only a **5% chance of incorrectly rejecting the null hypothesis**. However, if 100 tests are each conducted at the 5% level and all corresponding null hypotheses are true, **the expected number of incorrect rejections** (also known as false positives or Type I errors) is 5. The truth is that **when you run multiple simultaneous statistical tests, a fraction will always be false positives.** But there are ways we can **decrease the number of false positives.** --- # Multiple testing correction (cont.) .pull-left[ * Controlling the familywise error rate (**FWER**): **Bonferroni correction** + If a significance threshold of `\(α\)` is used (**family-wise error rate**), but `\(n\)` separate tests are performed, then the Bonferroni adjustment deems a feature significant only if the corresponding P-value is `\(≤ α/n\)`. + **Too strict.** ] .pull-right[ ![BH-correction](data:image/png;base64,#./img/BH_correction.png) ] * Controlling the false discovery rate (**FDR**): **Benjamini–Hochberg procedure** + First rank the p-values in ascending order; assign ranks to the p-values; + Set the significance threshold of `\(α\)` (FDR) you are willing to accept; + Calculate each individual p-value's Benjamini-Hochberg critical value using the formula `\((i/m)Q\)`; - i = the individual p-value's rank - m = total number of tests - Q = the false discovery rate ( `\(α\)`, chosen by you) + Compare each original p-value against its Benjamini-Hochberg critical value; find the largest p-value that is smaller than its BH critical value. That p-value, and all p-values ranked before it, are deemed significant. ??? **Adjusting the p-value threshold for significance is one of the main approaches to addressing the multiple testing problem**. There are two types of p-value adjustment: controlling the Family-Wise Error Rate (FWER) using a Bonferroni-style correction (including the Holm correction), or controlling the False Discovery Rate (FDR) using the Benjamini-Hochberg procedure. In many cases, Bonferroni is too strict. Bonferroni "penalizes" all input p-values equally, whereas Benjamini-Hochberg (as a way to control the FDR) "punishes" p-values according to their ranking.
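As a sketch in R, the whole procedure is a single call to `p.adjust()`; the manual version below just mirrors the recipe above (the p-values are made up for illustration):

```r
## Hypothetical p-values from many simultaneous tests
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)

## One-call version: BH-adjusted p-values, compared to the chosen FDR
p.adjust(pvals, method = "BH") <= 0.05

## Manual version, following the slide's recipe
m    <- length(pvals)         # total number of tests
Q    <- 0.05                  # FDR you are willing to accept
crit <- (seq_len(m) / m) * Q  # BH critical value for each rank
below <- which(sort(pvals) <= crit)
if (length(below) > 0) {
  k <- max(below)                          # largest p-value under its critical value
  significant <- order(pvals)[seq_len(k)]  # that one and all ranked before it
}
```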
--- # Univariate analysis .pull-left[ * Comparisons + compare numeric features between groups + check distribution with histogram and qq plot * Multiple testing correction + control false discovery rate (FDR) with BH correction * **Ratios (Fold change)** + **degree of quantity change between two groups** + **Volcano plot: log2(FC) ~ -log10(p-values)** ] .pull-right[ <img src="data:image/png;base64,#./img/volcano_plot.png" width="100%" style="display: block; margin: auto;" /> ] ??? A volcano plot is a type of scatterplot that shows **statistical significance (P value)** versus **magnitude of change (fold change)**. It enables quick visual identification of genes with large fold changes that are also statistically significant. In a volcano plot, the most upregulated genes are towards the right, the most downregulated genes are towards the left, and the most statistically significant genes are towards the top.
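A minimal sketch of such a plot in base R, with made-up fold changes and adjusted p-values standing in for real per-feature results:

```r
## Hypothetical per-feature results: fold change (group B / group A) and
## BH-adjusted p-values
fc <- c(2.5, 0.3, 1.1, 4.0, 0.8, 1.9)
p  <- c(1e-4, 1e-3, 0.40, 1e-5, 0.60, 0.03)

plot(log2(fc), -log10(p),
     xlab = "log2(fold change)", ylab = "-log10(p-value)", pch = 19)
abline(h = -log10(0.05), v = c(-1, 1), lty = 2)  # common significance guides
```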
--- # Multivariate analysis * Purpose of multivariate analysis (through dimension reduction) + **Visualization** and **feature selection/extraction** * Motivating question + How do we visualize high-dimensional data? + Can we find a **small number of features** that accurately capture the **relevant properties** of the data? * Gist: Project the data from the original high-dimensional space into a "smaller" low-dimensional subspace + **Goal: to discover the dimensions that matter the most** * Two main methods for reducing dimensionality + Feature extraction: PCA (unsupervised method): finding a **new** set of `\(k\)` dimensions that are **combinations of the original** `\(d\)` dimensions + Feature selection: PLS-DA (supervised method): finding `\(k\)` of the `\(d\)` dimensions that give us the most information and discarding the other `\((d-k)\)` dimensions ??? Multivariate analyses are statistical procedures designed for analyzing **data involving more than one type of measurement or observation**. Multivariate analysis techniques **look at multiple variables simultaneously, assuming that information resides in the joint distribution**. The purpose of multivariate analysis is two-fold: **visualization and feature selection/extraction**, through dimension reduction, or subspace estimation (in machine learning terms). The motivating question we are asking here is: "Can we find a small number of features that accurately capture the relevant properties of the data?" (For example, we can probably describe the trajectory of a tennis ball by its velocity, diameter, and mass.) Similarly, can we find just a few features in your high-dimensional omics data that capture the essence of the dataset? How? (Gist) The two main methods, PCA and PLS-DA, can help us achieve our goal, which is **to discover the dimensions that matter the most, or to help us see the dominant trend in the data**. In feature extraction, we are interested in finding a **new** set of `\(k\)` dimensions that are **combinations of the original** `\(d\)` dimensions. In feature selection, we are interested in finding `\(k\)` of the `\(d\)` dimensions that give us the most information, and we discard the other `\((d-k)\)` dimensions. <!-- # Multivariate analysis: supervised or unsupervised? --> <!-- ### Unsupervised methods (X only, no Y) --> <!-- * **Exploration & visualization - trends, quality, outliers** --> <!-- * PCA (principal component analysis), clustering (K-means), etc. --> <!-- ### Supervised methods (X and Y) --> <!-- * **Classification (qualitative) & Regression (quantitative) - prediction and inference** --> <!-- * PLS-DA (partial least squares - discriminant analysis), decision trees (random forest), Support Vector Machine, etc. --> <!-- * Overfitting ([bias-variance tradeoff](https://www.youtube.com/watch?v=EuBBz3bI-aA)) --> <!-- ??? --> <!-- In this context, the "unsupervised" method suggests that the analysis does not take data points' labels (group, phenotype, treatment, etc.) into consideration; it classifies data based only on the feature space, the peak intensities. Examples include PCA and clustering techniques (K-means clustering). --> <!-- The "supervised" method does take the label information into consideration during the classification process, such as PLS-DA, decision trees (random forest), and support vector machines. Because they take labels into consideration, they run the risk of overfitting, which is when your model/method fits your training data (the data you use to train your model) perfectly, but generalizes (fits new unseen data) badly. --> <!-- **Overfitting** occurs when a statistical model fits exactly against its training data. When this happens, the algorithm unfortunately cannot perform accurately against unseen data, defeating its purpose. **Generalization of a model to new data is ultimately what we want** in a statistical method or model to make predictions and classify data. --> --- # Principal component analysis (PCA) .pull-left[ ![2D to 1D](data:image/png;base64,#./img/pca_1.PNG) ] .pull-right[ * Goal: We want to find a feature that can explain most of the variance of the data. * Problem: `\(x1\)` and `\(x2\)` are correlated, i.e., they have non-zero covariance. **Both features contribute to the variance of the data.** * This is **feature extraction**. Instead of choosing between `\(x1\)` and `\(x2\)` (existing features), we create a **new** feature that can explain most of the variance. * Solution: De-correlation + Eigen-decomposition + Singular value decomposition ] ??? Principal component analysis (PCA) is the most versatile and prevalent dimension reduction method. **It projects the original features onto a new feature space and creates a set of new features** (principal components, PCs), each of which is a linear combination of the original features. The method reduces the dimension of the data by keeping only the top new features that capture the majority of the variability of the original dataset. This graph explains the main idea of PCA. Let's consider the 2D data distribution plotted here. We want to reduce the dimensionality of this dataset, i.e., we want to find a feature (1D) that can explain most of the variance of this data. The problem is that we cannot simply decide which feature to keep and which feature to eliminate. Why? Because these two features are **correlated**. We cannot determine whether one feature contributes more to the variance than the other. How do we find this most discriminating new feature? By **de-correlating** the features such that their covariance vanishes, i.e., by rotating the axes and changing the basis. In the new basis, the horizontal axis accounts for most of the variance of the data. Thus, we have created (extracted) a new feature (along the horizontal axis) that contributes most to the variance. How does PCA find the de-correlated features `\(z1\)` and `\(z2\)`? That is the matrix algebra behind PCA, which we will not get into. There are generally two ways: **eigen-decomposition and singular value decomposition**.
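As a minimal R sketch (a simulated matrix, not the wine data from the next slides), base R's `prcomp()` produces all of the quantities the following slides visualize:

```r
## PCA on a hypothetical samples-by-features matrix
set.seed(1)
X <- matrix(rnorm(30 * 100), nrow = 30)   # 30 samples, 100 features

pca <- prcomp(X, center = TRUE, scale. = TRUE)

pca$x[, 1:2]         # scores: samples projected onto PC1/PC2 (score plot)
pca$rotation[, 1:2]  # loadings: how original features combine into the PCs
summary(pca)         # proportion of variance explained per PC (scree plot)
```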
--- # Principal component analysis (PCA) .pull-left[ * **Score plot** + **Projected observation distribution on the new plane/basis** + Similar observations accumulate within the same relative space (dispersion = dissimilar) * Loading plot (biplot) + Explains how original variables are linearly combined to form new PCs + Variables with the largest absolute loadings have the greatest importance + Direction in the score plot corresponds to direction in the loading plot - biplot * Scree plot + How many PCs should we keep? + Plot the variance explained as a function of the number of PCs kept + At the "elbow", adding a new PC does not significantly increase the variance explained by the PCA ] .pull-right[ ![](data:image/png;base64,#MQ5_Statistics_files/figure-html/unnamed-chunk-2-1.png)<!-- --> .footnote[ [Nguyen LH, Holmes S. PLoS Comput. Biol. 15(6): e1006907 (2019)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006907) ] ] ??? Diagnostic plots of PCA (output). Example dataset: wine data. The variables include the chemical properties and composition of the wines. Class labels are grape varieties (59 Barolo, 71 Grignolino, 48 Barbera). --- # Principal component analysis (PCA) .pull-left[ * Score plot + Projected observation distribution on the new plane/basis + Similar observations accumulate within the same relative space (dispersion = dissimilar) * **Loading plot (biplot)** + **Explains how original variables are linearly combined to form new PCs** + **Variables with the largest absolute loadings have the greatest importance** + Direction in the score plot corresponds to direction in the loading plot - biplot * Scree plot + How many PCs should we keep? + Plot the variance explained as a function of the number of PCs kept + At the "elbow", adding a new PC does not significantly increase the variance explained by the PCA ] .pull-right[ ![](data:image/png;base64,#MQ5_Statistics_files/figure-html/unnamed-chunk-3-1.png)<!-- --> .footnote[ [Nguyen LH, Holmes S. PLoS Comput. Biol. 15(6): e1006907 (2019)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006907) ] ] ??? The loading plot explains exactly how the old variables (dimensions) contribute to the new variables (PCs). Some original variables contribute a lot to a PC, while others contribute less.
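Continuing the hypothetical `pca` object from the sketch above, ranking the loadings by absolute value reproduces the message of the loading plot:

```r
## Which original variables drive PC1? Largest |loading| = greatest importance
pc1 <- pca$rotation[, 1]
head(sort(abs(pc1), decreasing = TRUE))

## Scores and loadings overlaid in one display
biplot(pca)
```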
--- # Principal component analysis (PCA) .pull-left[ * Score plot + Projected observation distribution on the new plane/basis + Similar observations accumulate within the same relative space (dispersion = dissimilar) * **Loading plot (biplot)** + **Explains how original variables are linearly combined to form new PCs** + **Variables with the largest absolute loadings have the greatest importance** + Direction in the score plot corresponds to direction in the loading plot - biplot * Scree plot + How many PCs should we keep? + Plot the variance explained as a function of the number of PCs kept + At the "elbow", adding a new PC does not significantly increase the variance explained by the PCA ] .pull-right[ ![](data:image/png;base64,#MQ5_Statistics_files/figure-html/unnamed-chunk-4-1.png)<!-- --> .footnote[ [Nguyen LH, Holmes S. PLoS Comput. Biol. 15(6): e1006907 (2019)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006907) ] ] ??? This is another version of the loading plot, not a bar graph, but it still shows how important each original variable is to the new PC. Variables with the largest absolute loadings (imagine a shadow) have the greatest importance to the new PC. --- # Principal component analysis (PCA) .pull-left[ * Score plot + Projected observation distribution on the new plane/basis + Similar observations accumulate within the same relative space (dispersion = dissimilar) * **Loading plot (biplot)** + Explains how original variables are linearly combined to form new PCs + Variables with the largest absolute loadings have the greatest importance + **Direction in the score plot corresponds to direction in the loading plot** - biplot * Scree plot + How many PCs should we keep? + Plot the variance explained as a function of the number of PCs kept + At the "elbow", adding a new PC does not significantly increase the variance explained by the PCA ] .pull-right[ ![](data:image/png;base64,#MQ5_Statistics_files/figure-html/unnamed-chunk-5-1.png)<!-- --> .footnote[ [Nguyen LH, Holmes S. PLoS Comput. Biol. 15(6): e1006907 (2019)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006907) ] ] ??? The loading plot overlaid with the score plot gives the biplot. --- # Principal component analysis (PCA) .pull-left[ * Score plot + Projected observation distribution on the new plane/basis + Similar observations accumulate within the same relative space (dispersion = dissimilar) * Loading plot (biplot) + Explains how original variables are linearly combined to form new PCs + Variables with the largest absolute loadings have the greatest importance + Direction in the score plot corresponds to direction in the loading plot - biplot * **Scree plot** + How many PCs should we keep? + **Plot the variance explained as a function of the number of PCs kept** + At the "**elbow**", adding a new PC does not significantly increase the variance explained by the PCA ] .pull-right[ ![](data:image/png;base64,#MQ5_Statistics_files/figure-html/unnamed-chunk-6-1.png)<!-- --> .footnote[ [Nguyen LH, Holmes S. PLoS Comput. Biol. 15(6): e1006907 (2019)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006907) ] ] ??? There is an elbow point in the scree plot, after which the variance explained by each additional PC flattens. Adding more PCs after the elbow point does not significantly increase the total variance explained by the PCA. --- # Partial least squares-discriminant analysis (PLS-DA) <img src="data:image/png;base64,#./img/plsda_v_pca.png" width="80%" style="display: block; margin: auto;" /> .footnote[ [Abid M.S., Qiu H, et al., Sci Rep 12, 8289 (2022)](https://doi.org/10.1038/s41598-022-12197-2) ] ??? Partial least squares-discriminant analysis (PLS-DA) is another versatile dimension reduction method that can be used for predictive and descriptive modeling as well as for feature selection in high-dimensional datasets. In contrast to PCA, PLS-DA is a **supervised method and provides stronger classification prediction**. However, because PLS-DA takes into account sample label information, this method runs the risk of **overfitting** the data, and the user needs to optimize many parameters before reaching reliable and valid outcomes. --- # Partial least squares-discriminant analysis (PLS-DA) ![](data:image/png;base64,#./img/vip_ROC.png) .footnote[ [Abid M.S., Qiu H, et al., Sci Rep 12, 8289 (2022)](https://doi.org/10.1038/s41598-022-12197-2) ] ??? PLS-DA generates a **variable importance in projection (VIP)** score for each variable (peptide), which **measures the importance of that individual variable and summarizes the contribution the variable makes to the PLS model**.
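A hedged sketch of a PLS-DA fit, assuming the mixOmics package (the matrix, labels, and `ncomp` choice are all hypothetical; the component number and validation settings need tuning in practice):

```r
## Supervised dimension reduction with PLS-DA (assumes mixOmics is installed)
library(mixOmics)

set.seed(2)
X <- matrix(rnorm(30 * 100), nrow = 30)               # 30 samples, 100 features
y <- factor(rep(c("control", "treated"), each = 15))  # class labels

fit <- plsda(X, y, ncomp = 2)  # uses the labels, unlike PCA
plotIndiv(fit)                 # score plot, colored by class
head(vip(fit))                 # VIP scores: per-variable contribution

## Guard against overfitting with cross-validated performance
perf(fit, validation = "Mfold", folds = 5)
```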
--- # Power analysis **Power analyses exploit an equation with four variables ( `\(α\)`, power, `\(N\)`, and effect size). The ultimate aim of power analysis is to determine the minimum sample size needed to detect an effect size of interest.** ### Type of power * Predicted (*a priori*) power + Power calculated **before data collection**, used for deciding the sample number per group needed to observe a particular effect size. + Relationship between power and `\(N\)` after **stipulating `\(α\)` and the (population or estimated) effect size**. * Observed (*post-hoc*) power + Power calculated **after the fact, with sample size and effect size constraints**. + Solve for power by stipulating `\(α\)`, `\(N\)`, and the (sample) effect size.
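??? As a sketch in base R, `power.t.test()` solves for whichever of its arguments is left unspecified (the effect size and standard deviation below are made up for illustration):

```r
## A priori: solve for n per group, given alpha, power, and a hypothesized effect
power.t.test(delta = 1, sd = 1.5, sig.level = 0.05, power = 0.8)

## Post-hoc style: solve for power instead, given a fixed n per group
power.t.test(n = 20, delta = 1, sd = 1.5, sig.level = 0.05)
```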
--- # Power analysis ### For or against post-hoc power, depending on your motivation .pull-left[ * Against: What chance was there of producing a statistically significant result, assuming that the population effect is **exactly equal to the observed sample effect size**? + Calculating the post-hoc power of the **test you have performed** (usually a nonsignificant result), **based on the effect size estimate from your data**. + ['“The claim that a study is ‘underpowered’ with respect to an observed nonsignificant result” is “tautological and uninformative”.'](https://www.tandfonline.com/doi/abs/10.1080/19312450701641375) ] ??? WRONG: People often use post-hoc power analysis to determine the power they need in order to **detect the effect observed in their study** after **finding a non-significant result**, and **use the low power to justify why their result was non-significant and that their theory might still be right**. First, **given a nonsignificant result, one already knows that the observed statistical power is low** (the power for detecting a population effect equal to the obtained sample effect). As Hoenig and Heisey (2001) point out, “because of the **one-to-one relationship between p values and observed power**, **nonsignificant p values always correspond to low observed powers**”. Thus, “the claim that a study is ‘underpowered’ with respect to an observed nonsignificant result” is “tautological and uninformative”. The argument would go something like this: "I didn't get a statistically significant result, but for an effect size of x my power was only 50%, so this doesn't really tell me very much." This is **circular logic**. Second, observed power differs from the true power of your test, because the true power depends on the true effect size you are examining, which is unknown. It is tempting to treat post-hoc power as if it were similar to the true power of your study, but it is a **USELESS** statistical concept. -- .pull-right[ * Pro: What chance was there of producing a statistically significant result, based on **population effect sizes** of independent interest? + ["Where after-the-fact power analyses are based on population effect sizes of independent interest (as opposed to a population effect size exactly equal to whatever happened to be found in the sample at hand), they can potentially be useful."](https://www.tandfonline.com/doi/abs/10.1080/19312450701641375) + Can be a useful supplement to p-values and confidence intervals, but **only when based on population effect magnitudes of independent interest**. Confidence intervals are almost always more informative. ] ??? RIGHT: "Where after-the-fact power analyses are based on population effect sizes of independent interest (as opposed to a population effect size exactly equal to whatever happened to be found in the sample at hand), they can potentially be useful." For example, a researcher might know they will only be able to recruit a certain number of patients with a rare disease and want to know the power they could achieve to detect a given clinically significant effect. "Previous researchers found effects averaging about r=.40, and we had good power (a good chance of finding statistically significant results) assuming a population effect of .40, so the fact that we didn't find significant effects is meaningful..." --- # More reading about observed power * [The Abuse of Power](https://www.tandfonline.com/doi/abs/10.1198/000313001300339897) + “Because of the **one-to-one relationship between p values and observed power**, **nonsignificant p values always correspond to low observed powers**.” * [Brief Report: Post Hoc Power, Observed Power, A Priori Power, Retrospective Power, Prospective Power, Achieved Power: Sorting Out Appropriate Uses of Statistical Power Analyses](https://www.tandfonline.com/doi/abs/10.1080/19312450701641375) * [Calculating Observed Power Is Just Transforming Noise](https://lesslikely.com/statistics/observed-power-magic/) * [Observed power, and what to do if your editor asks for post-hoc power analyses](http://daniellakens.blogspot.com/2014/12/observed-power-and-what-to-do-if-your.html) * [With the Ability to Calculate Power Comes Great Responsibility](https://medium.com/geekculture/with-the-ability-to-calculate-power-comes-great-responsibility-8f2792e59e0c) ??? The first paper demonstrated the one-to-one relationship between p-values and observed power given a nonsignificant result. --- class: inverse, center, middle # Next: Functional Analysis Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). .footnote[ [OpenIntro Statistics](https://www.openintro.org/book/os/) ]