Full disclaimer: I know the title is a little “clickbaity”, but after years of writing R scripts for data analysis, I believe I’ve reached a solid milestone in what a fully reproducible R workflow should aspire to be. This article is NOT intended for beginners but rather for advanced R users who write functionalized code and may already have personal workflows for managing many scripts. Also, please keep in mind that this article is entirely my opinion, as everyone may have their own notion of an “ultimate” workflow. Hopefully, this article gives you some new ideas!
This article is aimed at people who are looking to “break into” the bioinformatics realm and have experience with R (ideally using the tidyverse). Bioinformatics can be a scary-sounding concept (at least it is for me) because it is such a vast and fast-developing field that it can be difficult to define exactly what it is. I’ve always thought that bioinformatics was a highly advanced field beyond what I was capable of doing, and that I would need years of technical training to begin actually doing it. …
When I was searching for jobs, I had a lot of time on my hands. Add to that COVID, and well, I had a lot of time on my hands. One day I thought to myself that I could use a little practice with dplyr, given the relatively recent updates with dplyr 1.0.0+ and my personal philosophy that you can always practice fundamental skills. My first thought was to Google “dplyr practice problems”, but if you were to do that right now, you’d find a bunch of tutorial websites that have pretty basic dplyr problems. For intermediate and advanced users…
This article is not meant to be a technical article nor is it meant to be a comprehensive article on all the different methods out there that control Type I and Type II error rates. This article will assume some background knowledge and is primarily focused on motivating a novel paradigm for combatting the multiple hypothesis testing problem and introducing a set of tools in R and R Shiny that you can use.
“Can you describe what’s going on in these Kaplan-Meier curves?” the interviewers asked me. I of course knew what those were, and I was admittedly stunned when they prodded me to say more — I didn’t know what else I could say. So, I stumbled through an answer, and after a while, the interviewers nodded their heads and thanked me for my time.
I didn’t get the job.
I recently participated in a relatively popular Stack Overflow “contest” (what would “popular” even mean on Stack Overflow??), where the prompt was to write a more “elegant” tidyverse solution than the one presented.
The problem statement was to perform two regressions: 1) dep ~ cov_a + cont_a + cont_b and 2) dep ~ cov_b + cont_a + cont_b.
This was the original posted code:
map(.x = names(df)[grepl("cov_", names(df))],
    ~ df %>%
      mutate(res = map(data, function(y) tidy(lm(dep ~ cont_a + cont_b + !!sym(.x), data = y)))) %>%
and this was the sample dataset provided:
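The sample dataset is not reproduced here. As a rough sketch of the problem statement only (not the original poster's code, and not any contest answer), the two regressions can be fit in base R by looping over the cov_ columns. The simulated data frame and the names cov_names and fits are made up for illustration, and the data frame is flat rather than nested as in the posted code.

```r
# Simulated stand-in data (hypothetical; NOT the dataset from the original post)
set.seed(42)
n <- 100
df <- data.frame(
  cov_a  = rnorm(n),
  cov_b  = rnorm(n),
  cont_a = rnorm(n),
  cont_b = rnorm(n)
)
df$dep <- 1 + 0.5 * df$cov_a - 0.3 * df$cont_a + rnorm(n)

# Pick out every column whose name starts with "cov_"
cov_names <- grep("^cov_", names(df), value = TRUE)

# Fit dep ~ <cov> + cont_a + cont_b once per covariate column
fits <- lapply(setNames(cov_names, cov_names), function(cov) {
  f <- reformulate(c(cov, "cont_a", "cont_b"), response = "dep")
  lm(f, data = df)
})

# Coefficient table (estimate, std. error, t value, p value) for each model
lapply(fits, function(fit) coef(summary(fit)))
```

Building the formula with reformulate() keeps the covariate name as plain text, so no tidy evaluation is needed; a tidyverse version would swap lapply() for purrr::map() and coef(summary()) for broom::tidy().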
For the past couple of months, I’ve been building a Shiny app that researchers can use to control something called the False Discovery Rate. You can check it out here; I’ll probably write an article about it in the future. Along the way, I learned a lot of cool features from various sources: random Stack Exchange posts, Dean Attali’s blog, and Appsilon’s blog, to name a few. I’ve decided to list some of them here in this article in no particular order. …
If you’ve been using R for a while now, you may have come across the double “&” operator. Most people who’ve coded before, whether in R or some other language, have an intuitive feel for what the “&” represents. It’s the logical AND operator. “The sky is blue AND cows can fly” is a logically false statement because even though the sky is blue, the second part of the statement is false. So what the heck then does a “&&” represent?
If you look up the help page, using ?"&&", you will read “& and && indicate logical AND…The shorter form…
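A minimal base R sketch of the difference, using made-up vectors x and y: & compares element by element, while && works on single values and skips its right-hand side whenever the left-hand side already decides the result.

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# `&` is vectorized: it returns one result per pair of elements
elementwise <- x & y
print(elementwise)    # TRUE FALSE FALSE

# `&&` takes single values and returns a single TRUE or FALSE
single <- TRUE && FALSE
print(single)         # FALSE

# `&&` short-circuits: the left-hand side is FALSE, so the right-hand
# side is never evaluated and stop() does not raise an error
short_circuit <- FALSE && stop("this is never evaluated")
print(short_circuit)  # FALSE
```

The single-value behavior is why && is the usual choice inside if() conditions. As of R 4.3.0, passing a vector of length greater than one to && is an error, so the example keeps both sides scalar.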
Data Scientist at Merck. Tidyverse enthusiast and a neRd.