Education

When do we introduce best statistical practices to undergraduate biology majors?

By Jeff Walker

Within the cell biology/physiology literature, but not so much the ecology and evolution literature, there have been several editorials and short papers encouraging researchers to use alternatives to bar plots to communicate experimental results (here, here, here, here, and here).

Bar plots aren’t wrong; there are simply much better plots for communicating features of our data or the models that we fit. All the links above advocate some version of box plots or dot plots. I suggest a third option: a forest plot or, even better, a Harrell plot, which combines a forest plot of modeled effects with a dot/box plot of the raw data (see below).

I do think there is one major impediment to any large-scale implementation of bar plot alternatives in biology – how we train undergraduate biology majors. If other programs are like ours at the University of Southern Maine, then most undergraduate biology majors are taught that “the way” to plot experimental data is with bar plots using Microsoft Excel/Google Sheets, a practice that often begins in the first-year Introductory Biology labs. And we humans, being humans, mostly continue using what we were first taught. A major reason we continue using what we were first taught is that there is no “corrective” teaching in upper-level classes or during graduate training. A major reason there is no corrective teaching is that most researchers are unaware that there is something to correct, or that there are alternative best practices – after all, bar plots are ubiquitous in the primary literature. And a major reason bar plots are ubiquitous is Excel, which is on every machine in every university and is easy to teach to undergraduates. It is a positive feedback cycle of mediocrity.

I don’t think there should be correctives. If we are to break the cycle, I think we should start undergraduate training with best practices from the beginning – the Introductory Biology labs. But we can’t use Excel/Sheets for this. And we aren’t (realistically) going to teach intro bio students R scripting, at least until most bio faculty become as comfortable with R as they are with Excel. This is the root of the impediment – there are few tools to nudge professors to implement best practices in applied statistics and statistical graphing in undergraduate biology labs.

So, on Thanksgiving morning, I decided to do something about it and teach myself how to make a Shiny app, so that I could create a web-based tool for undergraduate teaching or for researchers who were resistant to scripting in R (or Python or whatever). After a couple of hours and a few dead ends, I had created a simple Shiny app to read in a data file exported from Excel/Sheets and generate a simple dot plot with means and 95% confidence intervals.
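For readers curious about what sits behind such a plot, here is a minimal sketch of the arithmetic, not the app's actual code (the app is a Shiny app, so it is written in R). The data, the column names `treatment` and `response`, and the function name `group_summaries` are all made up for illustration. To stay within Python's standard library, the interval uses a normal approximation; with groups this small, a t-based interval would be wider.

```python
import csv
import io
import math
from collections import defaultdict
from statistics import NormalDist, mean, stdev

# Hypothetical long-format data, as it might be exported from Excel/Sheets.
CSV_TEXT = """treatment,response
control,10.0
control,12.0
control,14.0
treated,15.0
treated,17.0
treated,19.0
"""


def group_summaries(csv_text, group_col="treatment", value_col="response", level=0.95):
    """Return {group: (mean, lower, upper)} using a normal-approximation CI."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[group_col]].append(float(row[value_col]))

    # Two-sided critical value from the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + level / 2)

    summaries = {}
    for name, values in groups.items():
        m = mean(values)
        se = stdev(values) / math.sqrt(len(values))  # standard error of the mean
        summaries[name] = (m, m - z * se, m + z * se)
    return summaries


summaries = group_summaries(CSV_TEXT)
for name, (m, lower, upper) in summaries.items():
    print(f"{name}: mean={m:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Each group gets one point and one interval, which is exactly the information a dot plot with mean and CI bars layers on top of the raw values.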

But a dot plot with superimposed means and confidence intervals fails to solve what bothers me most about bar plots – any plot of means (or medians) only indirectly communicates what we often want to communicate in our results – the modeled effect sizes and their error. In this regard, box plots or dot plots with means and CI bars aren’t any better than bar plots.

My “how extremely stupid not to have thought of that” moment came later that day when I saw this figure from Frank Harrell’s Principles of Graph Construction:

harrell_fig_1_1

This is a pretty awesome plot, combining a plot of the treatment means and confidence intervals in the bottom part and a plot of the point estimate of the difference in means with its confidence interval in the top part. This is what we should all be publishing.

With Harrell’s plot as my target, I developed a Shiny app to generate a “Harrell plot” (or HarrellPlot). A Harrell plot combines 1) a dot plot to show individual values, 2) a box plot to show the distribution of the response within treatment groups, and 3) a forest plot of modeled effect estimates and confidence intervals. This plot addresses all of the concerns raised in the above links over bar plots but does much more. The plot communicates what we often really want to communicate and focus on: effect sizes and our uncertainty in their estimates.
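The forest-plot part is the piece that a plot of group means leaves out, so here is a rough sketch of the numbers it displays for the simplest case, a difference between two group means. This is an illustration under my own assumptions, not the HarrellPlot code: the data and the function name `difference_with_ci` are hypothetical, the standard error does not assume equal variances, and the interval uses a normal approximation where a fitted linear model would use a t quantile.

```python
import math
from statistics import NormalDist, mean, variance


def difference_with_ci(treated, control, level=0.95):
    """Return (difference in means, lower, upper) with a normal-approximation CI."""
    diff = mean(treated) - mean(control)
    # Standard error of a difference between two independent means.
    se = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical responses for two treatment groups.
control = [10.0, 12.0, 14.0]
treated = [15.0, 17.0, 19.0]

effect, lower, upper = difference_with_ci(treated, control)
print(f"effect={effect:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

The point estimate and its interval are what get drawn as a single row of the forest plot; the reader sees the effect and its uncertainty directly instead of having to infer them from two bars.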

Here is an example, which shows the modeled effects of sprint, endurance, and no training on sprint swimming performance in zebrafish. The effects illustrated in the upper panel are not the simple difference in treatment means but adjusted means conditioned on pre-treatment sprint performance speeds (see Walker 2018).

zfish_ancova
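To make "adjusted means conditioned on pre-treatment performance" concrete, here is a small sketch of the classic one-covariate adjustment for two groups. It is not the analysis from Walker 2018 and the numbers are not zebrafish data; they are invented so the answer is easy to check by hand. The function names are hypothetical, and the sketch assumes both groups share one slope relating the pre-treatment measure to the response.

```python
from statistics import mean


def centered_products(x, y):
    """Sum of products of deviations from the means of x and y."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))


def adjusted_difference(pre_c, post_c, pre_t, post_t):
    """Return (raw difference, adjusted difference, pooled within-group slope)."""
    # Pool the within-group covariation to estimate one common slope.
    sxy = centered_products(pre_c, post_c) + centered_products(pre_t, post_t)
    sxx = centered_products(pre_c, pre_c) + centered_products(pre_t, pre_t)
    slope = sxy / sxx
    raw = mean(post_t) - mean(post_c)
    # Remove the part of the raw difference explained by baseline imbalance.
    adjusted = raw - slope * (mean(pre_t) - mean(pre_c))
    return raw, adjusted, slope


# Invented data: the treated group happens to start with higher baseline values.
pre_control, post_control = [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]
pre_treated, post_treated = [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]

raw, adjusted, slope = adjusted_difference(
    pre_control, post_control, pre_treated, post_treated
)
print(f"raw difference={raw:.2f}, adjusted difference={adjusted:.2f}, slope={slope:.2f}")
```

In this toy example the treated group started out ahead, so the simple difference in means overstates the effect; the adjusted difference is the one that belongs in the upper panel.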

So when do we start teaching best practices in how-to-do-biology to undergraduate biology majors? Day 1. The HarrellPlot app, like other Shiny apps, is easily implemented in any undergraduate biology teaching lab, or from home, because it is web based. Students can use Excel/Sheets to store experimental data, learn about long vs. wide formatting and how to organize data more generally, and then import the data (from a text file) into the web app. Maybe the app could be a gateway to scripting? Regardless, I don’t think undergraduate biology majors will be routinely learning R (or Python) scripting, at least until a new generation of faculty is in place, all of whom are comfortable with R scripting. Until then, I think the best solutions will be cheap (free!), widely implementable tools like HarrellPlot.
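For anyone unsure what "long vs. wide" means in practice, here is a minimal sketch. Wide format puts each treatment in its own column, the way many students naturally lay out a spreadsheet; long format has one row per measurement, with one column naming the group and one holding the value. The values, the column names `treatment` and `response`, and the function name `wide_to_long` are hypothetical, and this is not a statement of the exact layout the app expects.

```python
import csv
import io

# Hypothetical wide-format export: one column per treatment group.
WIDE_TEXT = """control,sprint,endurance
1.0,2.0,3.0
1.5,2.5,3.5
"""


def wide_to_long(wide_text, group_col="treatment", value_col="response"):
    """Turn one-column-per-group data into one-row-per-measurement records."""
    rows = []
    for record in csv.DictReader(io.StringIO(wide_text)):
        for group, value in record.items():
            # Skip empty cells so groups with unequal sample sizes still work.
            if value not in (None, ""):
                rows.append({group_col: group, value_col: float(value)})
    return rows


long_rows = wide_to_long(WIDE_TEXT)
for row in long_rows:
    print(row["treatment"], row["response"])
```

The long layout scales better: adding a second factor or a covariate is just one more column, not a whole new arrangement of the sheet.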

Author biography: Jeff Walker is a Professor of Biological Sciences at the University of Southern Maine with research interests in ecological and evolutionary biomechanics and research best practices. More info is at www.middleprofessor.com.

Image caption and credit: description and credits are in the text. Frank Harrell gave me permission to use Fig 1.1 (the original is in a self-published document). The second figure is mine.

Categories: Education, Research Tools
