One of the best things about this job is that I get to work with PIs (i.e., principal investigators) from very different disciplines and learn about interesting projects in their respective fields of expertise. And sure, they all sit under the giant dome of “life science”, but you should know that a project focusing on mammalian hibernation, another studying the human gut microbiome, and a third sequencing the genome of a novel plant to explore its unique clinical effects are chasing enormously different scientific goals. Their common ground, however, does exist. And that is me, the bioinformatician who processes, analyzes, and draws statistical inferences from their data. From my observations, their science tends to suffer from the same pitfall: they invite the bioinformatician way too late to the party. There are things (mostly statistics-related things) an investigator should consider at the beginning of every project, and, well, they didn’t. Until the data analyst (me) has to write long reports explaining why their data, for lack of better wording and without getting into details, suck.

Look up “scientific method” on Wikipedia and you will find a cycle consisting of 6 major steps. To expand and dig in a little deeper, an experimental study workflow consists of these stages: formulating a question, experimental design, sample collection, sample preparation, data acquisition, data processing, data analysis, statistical inference, and, at long last, biological interpretation. I don’t mean to bore you with all these dry, minute, utterly un-sexy details that every graduate student should know after their first week of class; I just want to make the point that an experimental workflow is an upstream-to-downstream process. To be specific, steps that come first have more impact on the final results than steps that come later. More importantly, don’t expect later steps to compensate for what was done (or not done) before.

Hard as it is for people to believe, the truth is that bioinformaticians are not magicians. By the time an experiment reaches the data acquisition phase, more than 50% of the outcome has already been decided by choices made in the prior stages. By the time batches of data reach bioinformatics, the major results of the study are fixed. No amount of complicated data transformation and normalization can increase statistical power for a high-dimensional analysis when the experiment contains 5 phenotypes with only 3 samples each. Machine learning cannot remedy the lack of quality control during weeks-long LC-MS/MS data acquisition, or less-than-robust SOPs that will probably not stand the test of reproducibility.
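To put a rough number on that 5-phenotypes-by-3-samples example, here is a minimal sketch of a one-way ANOVA power calculation using Python’s statsmodels. The effect sizes are just Cohen’s conventional benchmarks, assumed for illustration; nothing here comes from any particular project.

```python
# Sketch: how much power does a 5-group, 3-samples-per-group design buy you?
# Effect sizes (Cohen's f) are illustrative assumptions, not real estimates.
from statsmodels.stats.power import FTestAnovaPower

anova_power = FTestAnovaPower()

k_groups = 5                     # 5 phenotypes
n_per_group = 3
nobs = k_groups * n_per_group    # total N = 15

for f in (0.25, 0.40):           # "medium" and "large" effects, per Cohen
    power = anova_power.power(effect_size=f, nobs=nobs,
                              alpha=0.05, k_groups=k_groups)
    print(f"Cohen's f = {f}: power = {power:.2f}")

# Solve for the total N needed to reach the conventional 80% power
# for a large effect:
needed = anova_power.solve_power(effect_size=0.40, alpha=0.05,
                                 power=0.80, k_groups=k_groups)
print(f"Total N for 80% power at f = 0.40: {needed:.0f}")
```

With 15 samples total, even a large effect leaves you far below the conventional 80% power threshold for a single test, and that is before the multiple-testing correction across thousands of features in a high-dimensional analysis makes things worse.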

So listen up, ladies and gents: consult your friendly neighborhood statistician or bioinformatician early on, please. Last-minute party invitations make us real insecure.

PS: Austin made me breakfast!