Chapter 8 Linear regression
This chapter presents the core material of the introductory econometrics course and provides detailed explanations of the solutions to the assignments in Chapter 9.
8.1 Intercept only model
8.1.1 The model
Consider the example discussed in Section 9.2. There we discuss data on n=114 COVID-19 patients with known incubation times y1,y2,…,yn.
Before discussing the model presented in that assignment, let us first summarise the data.
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 1.5 | 3 | 4 | 5.394737 | 6 | 24.5 |
The average incubation time was 5.39 days. The shortest incubation period was 1.50 days and the longest one was 24.50 days. Half of the patients developed symptoms after less than 4.00 days.
You will notice that the incubation times differ from person to person. We are not going to answer the question why these times are different. Instead, we are going to attempt to describe this variation with a simple model.
Consider a deterministic model for the incubation times:
$$y_i = \beta_0, \quad i = 1, \ldots, n. \tag{8.1}$$
which implies that the incubation time is one and the same and equal to β0 for all patients, e.g. β0=5 days. A glance at Table 8.1 and Figure 8.1 should convince us that (8.1) cannot hold as the observed times differ between patients. If all incubation times were equal all points in Figure 8.1 would lie on a single straight line.

Figure 8.1: Incubation times plot. The thin vertical lines depict the differences between observed times (points) and 5 days (horizontal line).
We can relax the assumption in (8.1) by adding the difference between the observed yi and β0: ui=yi−β0.
$$y_i = \beta_0 + u_i. \tag{8.2}$$
Our second approach would describe the observed incubation times perfectly, but that trick has achieved little other than adding more mathematical notation. Model (8.2) has more parameters than there are data points (observations): n deviations plus β0. Furthermore, we cannot use (8.2) to predict the incubation time of persons whom we have not seen yet, because we would need to know their incubation time in order to compute u, and this defeats the purpose of a prediction.
Let us look at another approach. Instead of computing the individual deviations for each person, let us view the deviation for person i as a realisation of a random variable ui. When we say a random variable, we need to describe its distribution, at the very least by assuming something about the expected values and the variances of u1,…,un. For now we will assume that all ui terms have a common expected value of zero and a variance σ2 that does not depend on the individual patient. We will go even further and assume that the ui are independent and normally distributed. In this way we can describe the deviations with a single parameter σ2.
Note that these are all assumptions that we need to question and we will address this in later chapters. For now we will examine the implications of the assumptions.
Our model is now:
$$y_i = \beta_0 + u_i, \quad i = 1, \ldots, n \tag{8.3}$$
where the error terms u1,u2,…,un are independent, identically distributed (iid) random variables with zero mean and variance σ2, i.e. ui∼N(0,σ2).
Using Theorems 3.1, 3.2 and 3.4 we see that the expected value of yi is β0 and that its variance is σ2. We also use the fact that β0 is an unknown but fixed value (a real number) and is not random.
$$E(y_i) = E(\beta_0 + u_i) = E(\beta_0) + E(u_i) = \beta_0 + 0 = \beta_0$$
$$Var(y_i) = Var(\beta_0 + u_i) = Var(\beta_0) + Var(u_i) = 0 + \sigma^2 = \sigma^2$$
yi is thus normally distributed with expected value β0 and variance σ2:
$$y_i \sim N(\beta_0, \sigma^2). \tag{8.4}$$
The coefficient β0 is simply the expected incubation time.
It is important to know what model (8.4) does not imply. It does not say that the incubation times are constant, so a statement like "the incubation time is 8 days" does not make much sense without a reference to an observed patient. Saying that the incubation time of patient 1 was 2 days is fine (this is an observed value). Saying that the expected incubation time is β0=8 is also fine.
Model (8.4) states that the incubation time is a random variable. This means that we cannot know the exact incubation time for a new patient. We are not completely in the dark, though, because equation (8.4) tells us something about the structure of the randomness. Assuming this model is true and we know its parameters to be β0=5 and σ2=16, i.e. σ=4 (for example), then some incubation periods are more likely than others. Let us compute the probability of an incubation period longer than 30 days under model (8.4).
## [1] 2.052264e-10
$$P_{N(5,\,4^2)}(Y > 30) = 1 - P_{N(5,\,4^2)}(Y < 30) \approx 2.05 \times 10^{-10}$$
The probability of an incubation period of less than 2 days is (under the model):
## [1] 0.2266274
$$P_{N(5,\,4^2)}(Y < 2) \approx 0.2266$$
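Both probabilities can be computed with R's `pnorm`; here we use β0=5 and σ=4 as in the example above:

```r
# P(Y > 30) under N(5, 4^2): the upper tail of the normal distribution
p_long <- pnorm(30, mean = 5, sd = 4, lower.tail = FALSE)

# P(Y < 2) under N(5, 4^2): the lower tail
p_short <- pnorm(2, mean = 5, sd = 4)

p_long   # about 2.05e-10
p_short  # about 0.2266
```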
8.1.2 The prediction
Let us for now pretend that we actually know the expected incubation time (let it be equal to β0=5) and that the incubation times are normally distributed as per model (8.3). How would we predict the incubation time for new (yet unseen) patients? If our prediction must be one and the same number of days for all patients, what is the best prediction? Is it 4, 5, 6, or 9 days? To ask for the best prediction we must first define what a good prediction is and what a bad prediction is. There are multiple ways to do this, but here we will examine only one.
Let Y be the incubation period of the new patient and let ˆy be our prediction for that period. Until the patient starts to show symptoms we do not know the actual value of Y. While we do not know what value Y will turn out to have (it is random), we can still think about the error of our prediction, Y−ˆy, whatever Y turns out to be. If our prediction is 2 days and the actual incubation period turns out to be 5 days, the error will be 5−2=3 days: we would have underestimated the incubation period of that patient by 3 days. If the incubation period turns out to be one day and we predict it to be 2 days, we would have overestimated it by 1 day (1−2=−1).
Because Y is random, we must give up any hope to predict the incubation period perfectly. What we can do is find the prediction that makes our errors “small” in some sense. Be aware that every time that we mention “small”, “large”, etc, we need to say what exactly is small and what exactly is large.
When we make predictions we must know that wrong predictions have consequences. Imagine again the patient whose actual incubation time was 5 days and we predicted it to be 3 days. After the third day she may think that the incubation period is over (it is not over) and may stop observing a quarantine, potentially infecting other people over the course of two whole days.
As prediction errors entail consequences, we should be careful and attempt to make them as small as possible. Let us denote the error by
$$Y - \hat{y}.$$
and imagine that we incur a penalty for each deviation between Y and ˆy in a way so that we lose
$$\mathrm{Loss}(\hat{y}) = (Y - \hat{y})^2 \tag{8.5}$$
for each prediction. We would like to avoid large penalties, so we may attempt to minimise the expected loss from wrong predictions by choosing ˆy so that the expected loss (8.6) is as small as possible. Remember that the average loss over a large number of predictions will be close to the expected loss.
$$E(\mathrm{Loss}(\hat{y})) = E\left[(Y - \hat{y})^2\right] \tag{8.6}$$
Expanding the right-hand side of (8.6) gives:
$$E(\mathrm{Loss}(\hat{y})) = E\left[(Y - \hat{y})^2\right] = E(Y^2 - 2Y\hat{y} + \hat{y}^2) = E(Y^2) - 2\hat{y}E(Y) + \hat{y}^2. \tag{8.7}$$
Notice that (8.6) and therefore (8.7) don't involve random quantities, because the expected value of a random variable is a fixed number and is not random (see for example ??). Furthermore, E(Y) and E(Y2) do not depend on ˆy (otherwise our prediction would change the distribution of incubation times). To find the minimum of (8.7) we can differentiate with respect to ˆy and set the derivative to zero:
$$\frac{\partial E(\mathrm{Loss}(\hat{y}))}{\partial \hat{y}} = 0 - 2E(Y) + 2\hat{y} = 0 \implies \hat{y} = E(Y).$$
The second derivative of (8.7) with respect to ˆy is 2 and is positive, therefore the expected loss has a minimum at ˆy=E(Y).
To summarise, predicting the incubation time using E(Y) minimises the expected square loss from prediction errors.
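A small simulation illustrates the result: among several candidate predictions, the average squared loss over many simulated patients is smallest for the candidate closest to E(Y)=5 (the parameter values are the ones assumed in the example above):

```r
set.seed(1)
# simulate many incubation times from the model with beta0 = 5, sigma = 4
y_new <- rnorm(1e5, mean = 5, sd = 4)

# average squared loss for a few candidate predictions
candidates <- c(3, 4, 5, 6, 7)
avg_loss <- sapply(candidates, function(p) mean((y_new - p)^2))
names(avg_loss) <- candidates
avg_loss   # smallest at the candidate closest to E(Y) = 5
```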
8.1.3 OLS estimator
By now we have established how to predict incubation times using E(Y) but we have (rather unrealistically) assumed that we know parameter β0 in (8.4). If this was in fact the case we could skip everything else in this chapter. In many situations this parameter is not known and we need to “guess” or “learn” it from observed data (the sample).
Once we have some "guess" ˆβ0 about β0 we can use it to predict the incubation time of yet unseen patients. The estimated regression equation is:
$$\hat{y} = \hat{\beta}_0. \tag{8.8}$$
Let us now turn to the question of how to generate a "good" guess for β0 from the data. Let us denote the predicted value for the i-th observation yi by ˆyi. Let us pick, rather arbitrarily, a value for ˆβ0, say ˆβ0=8. With this value we are able to use the model to predict the incubation time for yet unseen patients:
ˆy=8.
Are these predictions good? To answer this question we need to be able to say what a good prediction is and what a bad prediction is. We can evaluate the quality of our predicted incubation times ˆy1,…,ˆyn by comparing them to the actually observed incubation times y1,…,yn. Loosely speaking, predictions (model) that are far away from the observed values (reality) will be of little use. This leads us to the idea of basing our "guess" for β0 on the data in such a way that the predicted values of the model (predictions) are as close as possible to the observed values (reality). In other words, the estimator for β0 must minimise the "distance" between the model (predictions) and the reality (observations). Since we have invoked the notion of distance, we need to give it a more precise mathematical meaning in order to be able to minimise it. Consider the difference between an observed and a predicted value for some observation i.
$$r_i = y_i - \hat{y}_i$$
This value is called the residual of observation i. For a good model the residuals ri,i=1,…,n should be “small”. The ordinary least squares method (OLS) minimises the sum of squared residuals over all observations. Note that the residual ri is not the same as the error term ui, hence the different names and the different symbols! The error terms are generally unobserved because β0 is unknown!
$$\mathrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2. \tag{8.10}$$
Because the predicted values ˆyi depend on ˆβ0, so does the residual sum of squares.
$$\mathrm{RSS}(\hat{\beta}_0) = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2. \tag{8.11}$$
Before we continue, let's make sure that you understand how to calculate the residual sum of squares given a concrete value of ˆβ0. Suppose that we had observed only three incubation times: 2, 4 and 6 days. With ˆβ0=5 every predicted value equals 5 and
$$\mathrm{RSS}(5) = (2.00 - 5)^2 + (4.00 - 5)^2 + (6.00 - 5)^2 = 9.00 + 1.00 + 1.00 = 11.00.$$
Evaluated over the full sample of n=114 patients, RSS(5) is much larger:
## [1] 2341
RSS(3)=46.25.
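The calculation for the three-observation example can be checked directly; `rss` below is a small helper function written for this illustration:

```r
# residual sum of squares for a given candidate value b0
rss <- function(b0, y) sum((y - b0)^2)

y_toy <- c(2, 4, 6)   # the three observations from the example
rss(5, y_toy)         # (2-5)^2 + (4-5)^2 + (6-5)^2 = 11
```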
Let us see now how RSS changes for different values of ˆβ0. Figure 8.2 shows the value of RSS calculated for a range of values of ˆβ0 between 0 and 11. You should notice the parabolic shape of the curve (not surprising, since RSS involves squares) and that the RSS has a minimum near ˆβ0≈5.

Figure 8.2: Residual sum of squares for values of ˆβ0 between 0 and 11.
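A curve like the one in Figure 8.2 can be sketched for the three-observation example (2, 4 and 6 days); for that small sample the minimum lies at the sample mean, 4:

```r
# RSS over a grid of candidate values for beta0_hat
rss <- function(b0, y) sum((y - b0)^2)
y_toy <- c(2, 4, 6)

b0_grid <- seq(0, 11, by = 0.1)
rss_vals <- sapply(b0_grid, rss, y = y_toy)

plot(b0_grid, rss_vals, type = "l",
     xlab = expression(hat(beta)[0]), ylab = "RSS")

b0_grid[which.min(rss_vals)]   # the minimising value: the sample mean
```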
Can we find this minimum analytically using the tools we already know? To find the value of ˆβ0 at the minimum we will differentiate RSS with respect to ˆβ0 and set the derivative to 0. Before we do that, we will simplify (8.11) a little bit by expanding the parentheses and replacing ˆyi with ˆβ0 (because of (8.8)).
$$\begin{aligned}
\mathrm{RSS}(\hat{\beta}_0) &= \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\left(y_i^2 - 2y_i\hat{y}_i + \hat{y}_i^2\right) \\
&= \sum_{i=1}^{n}y_i^2 - 2\sum_{i=1}^{n}y_i\hat{y}_i + \sum_{i=1}^{n}\hat{y}_i^2 \\
&= \sum_{i=1}^{n}y_i^2 - 2\hat{\beta}_0\sum_{i=1}^{n}y_i + n\hat{\beta}_0^2 \\
&= n\frac{1}{n}\sum_{i=1}^{n}y_i^2 - 2\hat{\beta}_0\, n\frac{1}{n}\sum_{i=1}^{n}y_i + n\hat{\beta}_0^2 \\
&= n\overline{y^2} - 2\hat{\beta}_0 n\bar{y} + n\hat{\beta}_0^2.
\end{aligned} \tag{8.12}$$
Let us find its first derivative with respect to ˆβ0. The first term in (8.12), $n\overline{y^2}$, is n times the average of the squared values of y and does not depend on ˆβ0. The average of y in the second term ($\bar{y}$) also does not depend on ˆβ0. Therefore we can treat both $\overline{y^2}$ and $\bar{y}$ as constants when we differentiate with respect to ˆβ0 (i.e. their derivatives are zero).
$$\frac{\partial \mathrm{RSS}(\hat{\beta}_0)}{\partial \hat{\beta}_0} = 0 - 2n\bar{y} + 2n\hat{\beta}_0. \tag{8.13}$$
To find the extreme values of RSS(ˆβ0) we set the derivative in (8.13) to zero and solve the equation.
$$\frac{\partial \mathrm{RSS}(\hat{\beta}_0)}{\partial \hat{\beta}_0} = -2n\bar{y} + 2n\hat{\beta}_0 = 0 \implies \hat{\beta}_0 = \bar{y}.$$
The second derivative of RSS(ˆβ0) with respect to ˆβ0 is
$$\frac{\partial^2 \mathrm{RSS}(\hat{\beta}_0)}{\partial \hat{\beta}_0^2} = 2n > 0.$$
As the second derivative is positive, RSS(ˆβ0) has a minimum at ˆβ0=ˉy.
We will use the `lm` function to estimate linear regression models in R. The first argument of the function is a `formula`, a special object that is used to describe models. We specify our response variable (`y`) on the left-hand side of the formula. On the right-hand side we specify the predictor variables. In our simple case there are no predictor variables, so we instruct it to fit an intercept only model by writing `1`. The `data` argument instructs the function where to evaluate variable names. In our case `Incubation` is a column in the table `linton`.
##
## Call:
## lm(formula = Incubation ~ 1, data = linton)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8947 -2.3947 -1.3947 0.6053 19.1053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3947 0.4247 12.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.534 on 113 degrees of freedom
You can find ˆβ0 in the `Coefficients` part of the fit summary under the row named `(Intercept)`. For our data this is ˆβ0=5.39.
Inserting the estimate for β0 (look at the regression summary) into (8.8) we obtain ˆy=5.39.
Our prediction of the incubation time of all patients will be 5.39, the estimated expected value. To confirm our analysis so far, you should compare the estimate for β0 with the sample mean and see that both are equal.
## [1] 5.394737
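A minimal sketch of this comparison, using a small made-up stand-in for the `linton` table, shows that the intercept estimate of an intercept only model is exactly the sample mean:

```r
# hypothetical stand-in data; the actual analysis uses the linton table
toy <- data.frame(Incubation = c(2, 4, 6, 5, 3, 7))

fit <- lm(Incubation ~ 1, data = toy)
coef(fit)              # the intercept estimate, beta0_hat
mean(toy$Incubation)   # the sample mean: identical to the intercept
```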
8.1.4 Variance estimation
The second parameter in model (9.1) is the variance of the error term u: σ2.
We can estimate it using the sample variance of y:
$$S(y) = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2.$$
Take care to remember the difference between the two: the variance of u, σ2, is a parameter of the distributions of u and y, as we see from (8.4), while the sample variance is a function of the data (y1,y2,…,yn) used to estimate σ2. The value of the sample variance changes when the data change; σ2 does not change with the data!
Note that the residual sum of squares (8.10) reduces to n−1 times the sample variance, because ˆyi=ˆβ0=ˉy for each i=1,…,n:
$$\mathrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - \bar{y})^2 = (n-1)S(y).$$
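This identity is easy to confirm numerically; the sample below is simulated and purely illustrative:

```r
set.seed(42)
y <- rnorm(20, mean = 5, sd = 4)   # an arbitrary simulated sample

# RSS with yhat_i = ybar for every observation
rss <- sum((y - mean(y))^2)

# (n - 1) times the sample variance
check <- (length(y) - 1) * var(y)

c(rss = rss, check = check)        # the two values coincide
```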
8.1.5 Hypothesis testing
In the previous parts we derived the sample mean as the OLS estimator for the intercept in the intercept only model. In Chapter @ref(#hypothesis-tests) we showed how to perform a t-test about the mean of a normal distribution, and you can apply all the knowledge from there. Instead of repeating the theory discussed there, we will learn how to use the regression output of `lm` to conduct simple tests.
Let us look again at that output:
##
## Call:
## lm(formula = Incubation ~ 1, data = linton)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8947 -2.3947 -1.3947 0.6053 19.1053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3947 0.4247 12.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.534 on 113 degrees of freedom
The ingredients of the test statistic were the sample mean ˉX, the standard deviation of the sample, the sample size n and the expected value under the null hypothesis (μ0). For the null hypothesis
$$H_0: \mu = \mu_0$$
against a two-sided alternative we used the test statistic
$$T = \frac{\bar{X} - \mu_0}{S(X)/\sqrt{n}}.$$
Let us rewrite it in the notation of the regression model. In our model the expected value of y is β0 and the sample mean is ˆβ0. For the value of β0 under the null hypothesis we will write $\beta_0^{(0)}$. When you compare the denominator of the test statistic to the result for the variance of the sample mean in (5.4), you should see that the term in the denominator is simply the square root of the variance of the sample mean, or in other words: it is the standard deviation of the sample mean. We call the standard deviation of an estimator its standard error and we will write $SE(\hat{\beta}_0)$. With this additional notation the null hypothesis and the t statistic look like:
$$H_0: \mu = \mu_0 \implies \beta_0 = \beta_0^{(0)}$$
$$T = \frac{\hat{\beta}_0 - \beta_0^{(0)}}{SE(\hat{\beta}_0)}.$$
Note that we have only rewritten the test statistic in terms of ˆβ0 without changing anything at all, so all results from Chapter @ref(#hypothesis-tests) apply without changes, except that instead of X we use Y to denote the sample data.
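To see the correspondence in practice, we can compare the t value reported by `summary(lm(...))` for an intercept only model with a one-sample t test of H0: μ=0; the data below are simulated purely for illustration:

```r
set.seed(7)
y <- rnorm(30, mean = 5, sd = 4)   # made-up sample of "incubation times"

# t value for the intercept from the regression summary
fit <- summary(lm(y ~ 1))
t_lm <- coef(fit)["(Intercept)", "t value"]

# one-sample t test of H0: mu = 0
t_ttest <- unname(t.test(y, mu = 0)$statistic)

c(t_lm = t_lm, t_ttest = t_ttest)  # identical up to rounding
```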
8.1.6 Assumptions
To be continued…