Chapter 1 Introduction

In the first few weeks we introduced the R language and learned how to perform some basic data summaries. The second part of the exercise classes offers a review of some basic statistical concepts.

Before we continue, be aware that the examples discussed here were developed strictly to illustrate some key statistical concepts. Although most examples use real-world data (unless explicitly indicated otherwise), most datasets have been adapted for ease of use, possibly at the cost of ignoring important information. All conclusions from the analyses conducted in the exercise classes should therefore be understood as preliminary at best!

Consider a problem that is very salient today as communities around the world struggle to cope with the SARS-CoV-2 virus. To predict the possible scale of the pandemic, it is crucial to understand the novel virus and its transmission mechanism. One extremely important characteristic of the virus is its incubation period, that is, the time between exposure to the virus and the onset of symptoms.

The dataset linton is adapted from Linton et al. (2020) and contains data on 144 patients who developed COVID-19 symptoms (see Table 1.1). For each patient we know the date of exposure to the virus and the date of symptom onset. Let us use this information to estimate the incubation period.

Table 1.1: Incubation times (in days) for patients with COVID-19. First five observations.
ID Onset Incubation
NW001 2020-01-03 2
NW006 2020-01-18 4
NW008 2020-01-13 6
NW009 2020-01-14 5
NW011 2020-01-15 3
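The relation between the two dates and the incubation time can be sketched in R as follows. This is a minimal illustration, not the code used to prepare the dataset: the object name `linton_head` is hypothetical, and the `Exposure` column does not appear in Table 1.1. It is back-calculated here from the onset date and the incubation time of the five rows shown, purely so that the subtraction can be demonstrated.

```r
# First five rows of Table 1.1.
# ASSUMPTION: the Exposure dates are not shown in the table; they are
# back-calculated as Onset - Incubation for illustration only.
linton_head <- data.frame(
  ID = c("NW001", "NW006", "NW008", "NW009", "NW011"),
  Onset = as.Date(c("2020-01-03", "2020-01-18", "2020-01-13",
                    "2020-01-14", "2020-01-15")),
  Exposure = as.Date(c("2020-01-01", "2020-01-14", "2020-01-07",
                       "2020-01-09", "2020-01-12")),
  stringsAsFactors = FALSE
)

# Subtracting two Date vectors yields a time difference in days;
# as.numeric() turns it into a plain number of days.
linton_head$Incubation <- as.numeric(linton_head$Onset - linton_head$Exposure)

print(linton_head)
```

The computed `Incubation` column reproduces the values 2, 4, 6, 5 and 3 listed in Table 1.1.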

A summary of the incubation times (in days) is given in Table 1.2. We see that the average incubation period was about 5.4 days. The vast majority of the patients (95 percent) showed symptoms less than 14.2 days after exposure, so the recommendation of a 14-day quarantine seems reasonable.

Table 1.2: Summary of incubation times (in days).
Min Q25 Mean Median SD Q75 Q95 Max
1.5 3 5.394737 4 4.534272 6 14.175 24.5
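Summary statistics of this kind can be obtained with a few base R functions. The sketch below is only an illustration: it uses the five incubation times shown in Table 1.1 (the vector `incubation` is a stand-in for the full column of 144 values), so its output will not match Table 1.2. The quantiles are computed with R's default definition in `quantile()`, which we assume is also the one behind the table.

```r
# ASSUMPTION: illustrative input only, the five values from Table 1.1.
# With the full dataset one would use the complete incubation column.
incubation <- c(2, 4, 6, 5, 3)

# Same statistics, in the same order, as the columns of Table 1.2.
incubation_summary <- c(
  Min    = min(incubation),
  Q25    = unname(quantile(incubation, 0.25)),
  Mean   = mean(incubation),
  Median = median(incubation),
  SD     = sd(incubation),
  Q75    = unname(quantile(incubation, 0.75)),
  Q95    = unname(quantile(incubation, 0.95)),
  Max    = max(incubation)
)
print(round(incubation_summary, 3))

# Share of patients whose symptoms appeared less than 14 days after exposure.
share_under_14 <- mean(incubation < 14)
print(share_under_14)
```

For these five values the mean and the median are both 4 days. The last line shows how the quarantine argument can be checked directly: the mean of a logical vector is the proportion of `TRUE` values.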

Before we proceed further with the statistical theory, let us point to a couple of issues with these conclusions. The length of the incubation period is important because the scale of the epidemic may depend on it. We should therefore ask ourselves: How confident are we in our conclusions? Do we expect our conclusions to change substantially if we observe another group of patients? These are complex questions, and a thorough discussion of all their aspects falls outside the scope of the present course. In this introductory course we will focus on some of their statistical aspects in a simplified setting.

The conclusions we have drawn about the incubation period are based on data (the observed patients). Intuitively, if we observed another group of persons and measured their incubation times, their data would not be exactly the same as in the present dataset, and neither would the summaries computed from them. Our results are thus subject to a degree of uncertainty, and we will need tools that enable us to understand and describe it.
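A small simulation can make this idea concrete. The sketch below rests on an assumption that is ours and not part of the dataset: incubation times are drawn from a hypothetical gamma distribution whose mean (5.4 days) and standard deviation (4.5 days) roughly match Table 1.2. We then imagine observing many groups of 144 patients and compute the mean incubation time in each group. All object names are illustrative.

```r
set.seed(1)  # fix the random number generator so the sketch is reproducible

# ASSUMPTION: a hypothetical gamma population chosen to roughly match the
# mean and standard deviation in Table 1.2. It is not a model fitted to the data.
pop_mean <- 5.4
pop_sd   <- 4.5
shape    <- (pop_mean / pop_sd)^2  # gamma: sd = mean / sqrt(shape)
rate     <- shape / pop_mean       # gamma: mean = shape / rate

n_patients <- 144   # size of each hypothetical group
n_groups   <- 1000  # number of hypothetical groups

# Mean incubation time in each hypothetical group of patients.
group_means <- replicate(
  n_groups,
  mean(rgamma(n_patients, shape = shape, rate = rate))
)

# The group means differ from one another. Their spread describes the
# uncertainty attached to a mean computed from a single group.
print(summary(group_means))
print(sd(group_means))
```

Each simulated group yields a slightly different mean, even though all groups come from the same population. How strongly these means vary is exactly the kind of question that the tools introduced in the following chapters are meant to answer.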

Our toolbox consists of two parts: probability theory and statistics. Probability theory is a branch of mathematics that will provide us with models useful for the study of random phenomena. Statistics will provide us with tools that enable us to relate observed data to our theoretical models.

This course generally assumes that you are familiar with the concepts of probability, random variables, distributions, density and probability mass functions, and statistical tests. I will not attempt to cover all this material in depth here. Instead, this script offers a short refresher of the most important aspects. For a more extensive treatment please refer to Bertsekas and Tsitsiklis (2008), Casella and Berger (2001) and Freedman, Pisani, and Purves (2007).

Chapter 2 of this script briefly presents the three probability axioms.

Chapter 3 introduces the concepts of discrete random variables, discrete probability distributions, the probability mass function, the expected value and the variance, all of which will be important throughout the course.

Chapter 4 is a refresher on continuous distributions, mainly the normal distribution. It also presents three other distributions that play a role in statistical tests: the \(\chi^2\), \(t\) and \(F\) distributions.

Chapter 5 deals with the estimation of the expected value and discusses the sampling properties of the sample mean and the sample variance. Chapter 6 introduces tests about the expected value of a normally distributed variable, and Chapter 7 reviews the basics of interval estimation for the expected value.

Chapter 8 introduces the linear regression model and the ordinary least squares estimator.

References

Bertsekas, Dimitri P., and John N. Tsitsiklis. 2008. Introduction to Probability. Belmont, Mass: Athena Scientific.

Freedman, David, Robert Pisani, and Roger Purves. 2007. Statistics. New York: W. W. Norton & Company. https://www.ebook.de/de/product/6481761/david_freedman_robert_pisani_roger_purves_statistics.html.

Linton, Natalie M., Tetsuro Kobayashi, Yichi Yang, Katsuma Hayashi, Andrei R. Akhmetzhanov, Sung-mok Jung, Baoyin Yuan, Ryo Kinoshita, and Hiroshi Nishiura. 2020. “Incubation Period and Other Epidemiological Characteristics of 2019 Novel Coronavirus Infections with Right Truncation: A Statistical Analysis of Publicly Available Case Data.” Journal of Clinical Medicine 9 (2). https://doi.org/10.3390/jcm9020538.