Statistical analysis of designed experiments: yesterday, today, and tomorrow
2025-08-06
Preface
This is a set of course notes to accompany the second semester of a traditional graduate-level sequence in statistical methods. These notes aim both to introduce the linear model (e.g., regression and ANOVA) and adjacent methods and to serve as a reference for later consultation.
Philosophy
These notes take the following perspectives.
Statistics is nonintuitive.
When it comes to statistics, researchers cannot necessarily rely on common sense to lead them toward correct answers. Statistical reasoning is nonintuitive (Kahneman (2011)), and the foundational ideas of statistics are elusive. Therefore, statistical literacy must be learned. The primary goal of this course is to sharpen students’ statistical literacy so that they may become more effective researchers.
The route to conceptual understanding is the detailed study of basic methods.
However, one does not develop a deep conceptual understanding merely by discussing concepts. Instead, conceptual understanding is honed in part by studying the details of particular methods to understand why those details matter. When we study the details of a method, the expectation is not that the student will remember those details in perpetuity. Indeed, practicing scientists are unlikely to remember the details of statistical methods that they do not use routinely anyway. (This is not a knock on practicing scientists, but simply a statement about the limitations of human memory.) Instead, the point of studying statistical methods in detail is to strengthen conceptual understanding by engaging each method at a reasonably deep level. Examining the details now will also make them easier to recognize if and when a future data-analysis task requires re-engaging them. That said, the ultimate emphasis of this course is not on the details of the methods, but on the ideas, concepts, and reasoning that underlie statistical thinking. I hope that these notes will deepen readers’ conceptual understanding of statistics and, by doing so, strengthen their effectiveness as scientists.
Except when it comes to sums-of-squares decompositions in ANOVA.
The exception to the statement above is ANOVA and the associated sums-of-squares decompositions. At this mathematical level — that is, without assuming a familiarity with linear algebra — sums-of-squares decompositions and the associated ANOVA tables are poor vehicles for developing conceptual understanding. Instead, they fill texts with inscrutable tables tethered to formulas that lack a compelling underlying logic.1 ANOVA is still worth learning — especially for the analysis of designed experiments — but unless the underlying linear algebra is engaged, ANOVA is more usefully approached as a special case of a regression model. For this reason, these notes introduce regression modeling first and ANOVA second, reversing the path taken by most texts.
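To preview this point, here is a minimal R sketch, assuming a hypothetical data frame d with a continuous response y and a treatment factor trt, showing that a one-way ANOVA is simply a linear model fit to a factor predictor:

```r
# Hypothetical example: a one-way ANOVA is a regression on a factor
fit <- lm(y ~ trt, data = d)   # 'trt' is a factor with one level per treatment
summary(fit)                   # regression view: estimated contrasts among treatment means
anova(fit)                     # ANOVA view: the familiar sums-of-squares table
```

Both displays summarize the same fitted model; only the presentation differs.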
Confidence intervals deserve more emphasis, and hypothesis tests less.
Hypothesis tests have become the primary route to inference in contemporary science, likely because they are the de facto evidentiary standard for announcing results in the scientific literature. This is unfortunate, because statistical significance provides only the thinnest of summaries of pattern in data. Confidence intervals (or even standard errors), on the other hand, are often relegated to a secondary role, even though they provide a richer basis for characterizing pattern and uncertainty. In the fullness of time, these notes will seek to promote the reporting of confidence intervals or standard errors, as opposed to hypothesis tests, as the primary vehicle for inference.
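As a small illustration of this emphasis, a hypothetical two-group comparison in R might be reported with its interval rather than with its \(p\)-value alone (the data frame d and the variables response and group are assumptions made for the sake of the sketch):

```r
# Hypothetical two-group comparison: report the interval, not just the test
test <- t.test(response ~ group, data = d)
test$p.value    # the test alone: distinguishable from zero, or not
test$conf.int   # 95% CI for the difference in means: direction, size, and precision
```

The interval conveys the direction, magnitude, and precision of the estimated difference; the \(p\)-value conveys only whether the difference can be distinguished from zero.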
Simplicity in statistical analysis is a virtue.
Contemporary statistical software allows anyone to fit complex statistical models. However, just because one can fit a complex model does not mean that one should. For a statistical analysis to have scientific value, it must be understood by both the analyst and the analyst’s audience. Unfortunately, many contemporary statistical methods produce opaque analyses that are impossible for the informed reader to understand without a substantial and unusual investment of time and energy. (It doesn’t help that, in the scientific literature, the details of contemporary analyses are buried in supplementary material that escapes the scrutiny of most reviewers, but that’s another matter.) As a result, informed and well-intentioned readers of the scientific literature have little choice but to accept the analysis at face value without understanding the genesis of the announced results. This state of affairs does not serve science well.
It is high time for the scientific community to ask whether complex but opaque statistical analyses are in the best interest of science. For most scientific studies, a simple and transparent analysis provides a more trustworthy vehicle to understanding for both the analyst and the analyst’s audience. These notes will emphasize methods that, when studied, are transparent enough to promote such an understanding.
Nearly everyone is a visual learner.
No one is convinced by a \(p\)-value in isolation, or at least no one should be, given the ease of making errors in statistical analysis. When an analysis suggests a pattern in data, the best way to understand the pattern is to visualize it with a thoughtfully constructed graphic. Unfortunately, these notes in their current state are not as richly illustrated as they should be. Eventually, I hope the notes will contain a full set of graphics that exemplify how to visualize patterns revealed by statistical analysis.
Scope and coverage
These notes form the basis for an intermediate course in statistical analysis. They do not start from the very beginning; they presume a basic knowledge of the fundamentals of (frequentist) statistical inference, similar to what one might see in an introductory statistics course.
Breiman (2001) wrote a provocative paper nearly a quarter century ago describing two statistical cultures: one of data modeling and a second of algorithmic modeling, thus anticipating the era of machine learning and artificial intelligence in which we increasingly seem to reside. These notes fall squarely within (and even celebrate!) the former culture of data modeling as a path to scientific understanding. Moreover, it seems to me that the data-modeling culture contains at least two subcultures, mapping to disciplines that learn primarily from designed experiments versus disciplines that rely primarily on so-called observational data.2 The experimental-data culture tends to involve disciplines in the life sciences, favors ANOVA modeling, uses frequentist inference, and codes its models in SAS. The observational-data culture tends to involve disciplines in the social and environmental sciences, favors regression modeling, increasingly embraces Bayesian analysis, and codes its models in R or Python.
These notes aim to serve both the experimental and observational subcultures. The saving grace that allows us to do so is that the core statistical model in both subcultures is the linear statistical model, which encompasses both regression and ANOVA. Of course, most graduate students will need to learn specialized statistical methods that are popular in their own field of study. These notes are not meant to cover these specialized methods, and thus they are not meant to embody the whole of statistics. However, study of regression and ANOVA provides an opportunity to master core tools and provides a springboard to the study of more specialized and possibly discipline-specific methods.
These notes also deal exclusively with so-called “frequentist” statistical inference. We do not engage Bayesian methods yet, although some Bayesian coverage is eventually forthcoming. These notes are also firmly situated in the study of low-dimensional statistical models. We value parsimony, and we take the view that well-constructed models are worthy objects of study in their own right. More concretely, we seek to construct statistical models with parameters that correspond to natural phenomena of interest. Algorithmic modeling (that is, machine learning) is outside the scope of these notes.
Mathematical level
These notes neither assume nor use any mathematics beyond arithmetic and basic probability. This basic probability includes an understanding of random variables, standard distributions — primarily Gaussian (normal) distributions, but also binomial and Poisson — and basic properties of means and variances.
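For instance, “basic properties of means and variances” refers to rules such as the following, stated here for constants \(a\) and \(b\) and for independent random variables \(X\) and \(Y\):

\[
\mathrm{E}(aX + b) = a\,\mathrm{E}(X) + b, \qquad
\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X), \qquad
\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).
\]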
Students who are willing to engage the math a bit more deeply will find that doing so provides a more satisfying path through the material and leads to a more durable understanding. Without knowing the math underneath, one can only learn statistical methods as different recipes in a vast cookbook, a tedious task that taxes the memory and gives statistics courses their reputation for drudgery. For those who are so inclined, learning a bit of the mathematical theory reveals how the methods we study connect with one another, and thus provides a scaffolding to organize the methods sensibly and coherently. Moreover, the underpinning mathematics can be understood with a minimum of calculus. Linear algebra, however, is more essential. Indeed, the linear models that we study are, ultimately, exercises in linear algebra. These notes assume no previous familiarity with linear algebra, and so we will not emphasize the linear algebra underpinnings of the methods. In the fullness of time, I hope that these notes will eventually include sidebars that present the linear algebra underneath the methods, for interested readers.
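As a brief preview of the sort of material such a sidebar might contain, offered here only for readers who already know some linear algebra, every linear model studied in these notes can be written compactly as

\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\]

where \(\mathbf{y}\) is the vector of responses, \(\mathbf{X}\) is the design matrix, \(\boldsymbol{\beta}\) is the vector of unknown parameters, and \(\boldsymbol{\varepsilon}\) is the vector of errors. Least-squares estimation then reduces to the linear-algebra computation \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}\).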
In this day and age, one might ask why it’s necessary to understand the math at all. Indeed, the internet makes it easy to quickly find code for any standard analysis.3 In such a world, the primary task facing an analyst is not so much to get the computer to give you an answer, but instead to confirm that the answer is in fact the one you want. Toward this end, knowing a bit about the math behind the methods makes it possible to determine whether the computer output you’ve obtained is indeed the analysis you hoped for. Throughout, we will try to emphasize simple, quick calculations that can be used to verify that computer output is correct, or to indicate that something needs to be fixed.
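For example (a hypothetical sanity check with made-up numbers), the slope and intercept that lm() reports for a simple regression can be verified against the textbook formulas in a line or two:

```r
# Hypothetical sanity check: confirm lm()'s estimates against the textbook formulas
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

fit <- lm(y ~ x)
coef(fit)                               # slope and intercept reported by the software

cov(x, y) / var(x)                      # slope computed "by hand"; should match
mean(y) - cov(x, y) / var(x) * mean(x)  # intercept computed "by hand"; should match
```

If the hand computation and the software output disagree, something about the data or the model specification needs a second look.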
Computing
The first portion of these notes (focused on regression) presents analyses in R, while the latter portion (focused on designed experiments) presents analyses in SAS. In the fullness of time, I hope that these notes will include complete code for conducting analyses in both R and SAS, but that is a work in progress. While the notes examine the R and SAS implementation of the methods they present, these notes are not intended as a complete guide for learning either R or SAS from square one. The internet abounds with resources for learning the basics of R, and I would not be able to improve on those resources here. In many cases I provide R code for the sake of illustration, but—especially when it comes to data wrangling and to graphics—the code is not meant to be authoritative. ST 512 students will receive instruction in R and SAS coding in the laboratory component of the course.
Readers interested in using R professionally would be well served by consulting Hadley Wickham’s tidyverse style guide. The ideas therein have helped me write substantially cleaner code, even if I haven’t had the discipline to adhere to those ideas in all the code in these notes.
That said, the R code in these notes does not fully embrace the piping style of the tidyverse ecosystem and the associated graphical facilities of ggplot. I take this approach because the focus of these notes is on fitting and visualizing traditional statistical models, and it seems to me that the conventional style of R coding is still best suited for this purpose. The piping style of the tidyverse seems better suited to data-science tasks such as wrangling and visualizing large data sets. As for ggplot, I prefer the style of coding in R’s native graphical facilities, although ggplot can certainly produce high-quality graphics with relatively few lines of code.
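To illustrate the difference in styles, the following sketch (again assuming a hypothetical data frame d with variables x and y) draws the same scatterplot with a fitted line, first in base R and then in ggplot:

```r
# Base-R graphics, the style favored in these notes
fit <- lm(y ~ x, data = d)
plot(y ~ x, data = d)   # scatterplot
abline(fit)             # add the fitted least-squares line

# A roughly equivalent plot in ggplot
library(ggplot2)
ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```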
As a practical matter, these notes are prepared in bookdown (Xie (2022)). While it is possible to compile both R and SAS code on the fly in bookdown, the extensive output produced by SAS does not serve these notes well. As a consequence, SAS output is condensed to show only the most salient portions of the output.
Format of the notes
Advanced sections are indicated by section titles that begin with stars (\(^\star\)). Shorter sidebars for enrichment appear in gray text and are offset by horizontal rules (like the one following the acknowledgments). This material may be skipped without loss. Sections that are in an early and rougher stage of development are indicated with section titles shown in italics.
A word on the title
As statistical practice evolves, new methods come and go, but the linear model remains a cornerstone of applied data analysis. The primary objective of these notes is to equip researchers to analyze their own data using contemporary methods. That said, an exclusive focus on contemporary methods risks making the statistical methods in the historical literature inaccessible. These notes therefore also aim to contextualize contemporary methods within the broader history of statistical thought and to familiarize readers with past statistical practice so that they can understand the historical literature. I think this is a distinctive aspect of these notes, and I have chosen the title accordingly.4
Acknowledgments and license
I am deeply indebted to the R community (R Core Team (2021)) for a project that has done no less than revolutionize data analysis in our times. I also thank the developers of bookdown (Xie (2022)) for providing the platform for these notes.
These notes are provided under version 3 of the GNU General Public License.
1. Again, to be clear, the formulas lack a compelling underlying logic because we are not engaging the linear algebra. If we embraced the linear algebra foundations, then the underlying logic would be clear indeed.↩︎
2. In this context, the term “observational data” is used to refer to data collected outside the context of a designed experiment. One might bicker that all data, even those data from designed experiments, must be “observed”, but this is the terminology that we have.↩︎
3. Indeed, we are probably not too far away from the rise of artificial intelligence-based statistical consulting, where anyone can upload a data set and answer a few questions about it in return for an AI’s analysis.↩︎
4. Sticklers would be right to note that the “yesterday” and “today” components of the title are much better justified than the “tomorrow” part. I claim no special ability to predict the future of statistics, and these notes make no attempt to do so. I’ve added “tomorrow” to the title because I don’t think the linear model will be replaced anytime soon, and because leaving things at “yesterday and today” sounds flat.↩︎