Chapter 10 Homeworks

10.1 Homework 1

Github classroom link: https://classroom.github.com/g/vguWjRNd

Please choose one of the following problems. Write your solution in a file named Problem_1.R or Problem_2.R, depending on the problem you have picked. When you are finished, upload your code to your team’s homework repository by pasting it into the corresponding file there.

10.1.1 Problem 1

In the wake of the COVID-19 outbreak, data analysis can yield important insights that help contain the disease and ultimately save lives. The dataset COVID19_2020_open_line_list contains data on patients with confirmed COVID-19 infections in the United States, Japan and China (outside Hubei province). Disclaimer: this is an exercise dataset. Cases with missing or incomplete data on age and sex were removed, and when age was given as an interval (e.g. 0-10 years) it was replaced with the midpoint of the interval. Do not draw any real-life conclusions from the analysis here! For the full dataset refer to Xu et al. (2020).

  • ID (numeric): Case id.
  • date_confirmation (date): Date when the infection was confirmed.
  • sex (character): Patients’ sex: male/female
  • age (numeric): Patients’ age in years.
  • province (character): Province where the infection was confirmed.
  • country (character): Country
  1. Download and read the dataset and store it in an object called patients.
patients <- read.csv2('https://firebasestorage.googleapis.com/v0/b/uni-sofia.appspot.com/o/data%2FCOVID19_2020_open_line_list%20-%20outside_Hubei_subset.csv?alt=media&token=919a181b-6417-45fd-9314-91497bb4b0fc', stringsAsFactors = FALSE)
  2. How many patients are there in the dataset?
nrow(patients)
## [1] 1565
## Returns the number of rows in the data frame (table)

## Alternatively, use str to output a summary of the structure of the table.
str(patients)
## 'data.frame':    1565 obs. of  12 variables:
##  $ X                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ID                    : int  649 2596 99 119 120 121 280 669 56 100 ...
##  $ date_confirmation     : chr  "2020-01-15" "2020-01-16" "2020-01-19" "2020-01-20" ...
##  $ sex                   : chr  "male" "male" "male" "male" ...
##  $ age                   : num  30 34.5 66 78 76 48 56 35 44 65 ...
##  $ province              : chr  "Tokyo" "Kanagawa Prefecture" "Guangdong" "Guangdong" ...
##  $ country               : chr  "Japan" "Japan" "China" "China" ...
##  $ additional_information: chr  NA NA "Among the 53 cases diagnosed in our province, 28 were males and 25 were females, aged between 10-81 years old. "| __truncated__ "Among the 53 cases diagnosed in our province, 28 were males and 25 were females, aged between 10-81 years old. "| __truncated__ ...
##  $ outcome               : chr  "discharged" NA NA NA ...
##  $ source                : chr  "https://www.who.int/csr/don/17-january-2020-novel-coronavirus-japan-ex-china/en/" "https://www.mhlw.go.jp/stf/newpage_08906.html" "http://www.bjnews.com.cn/news/2020/01/20/676734.html and http://wsjkw.gd.gov.cn/zwyw_yqxx/content/post_2876057."| __truncated__ "https://www.sinchew.com.my/content/content_2205270.html and http://wsjkw.gd.gov.cn/zwyw_yqxx/content/post_2877668.html" ...
##  $ latitude              : chr  "35.71145" "35.41557" "22.65389" "22.16925" ...
##  $ longitude             : chr  "139.4468" "139.339" "114.1291" "113.361" ...
  3. What are the earliest and latest dates of infection confirmation?
## Open the data frame in the RStudio viewer, sort the table by the date_confirmation column by clicking on the column header and read off
## the first and the last values
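The viewer approach works, but the answer can also be computed. A minimal sketch on a toy vector standing in for patients$date_confirmation (which, per the str() output above, holds ISO-formatted "YYYY-MM-DD" strings):

```r
## Toy stand-in for patients$date_confirmation (ISO "YYYY-MM-DD" strings)
dates <- c("2020-01-15", "2020-02-03", "2020-01-16", "2020-02-27")

## Convert to Date so comparisons are chronological, then take the range
dateRange <- range(as.Date(dates))
dateRange[1]  ## earliest: 2020-01-15
dateRange[2]  ## latest: 2020-02-27
```

With the real data the same idea is range(as.Date(patients$date_confirmation)).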
  4. What is the average age of the patients?
mean(patients$age)
## [1] 49.1828
## The average age is 49.18 years.
  5. How many men and how many women are there in the sample?
freqTableSex <- table(patients$sex)
freqTableSex
## 
## female   male 
##    719    846
## There are 719 female patients and 846 male patients.
  6. Plot the frequency distribution of sex using a bar-chart.
barplot(freqTableSex, main = "Frequency distribution of sex")

  7. What was the age of the youngest woman?

isFemale <- (patients$sex == "female")
min(patients$age[isFemale])
## [1] 0.08333
## The youngest female patient was 0.08 years old.
  8. What was the age of the oldest man?
isMale <- patients$sex == "male"
max(patients$age[isMale])
## [1] 94.5
## The oldest male patient was 94.5 years old.
  9. Is there a difference between the average age of male and female patients?
averageAgeMale <- mean(patients$age[isMale])
averageAgeFemale <- mean(patients$age[isFemale])
averageAgeMale - averageAgeFemale
## [1] 0.7097229
## Male patients were 0.7 years older on average.
  10. Compare the distribution of age between the countries using a box-plot. Interpret the plot (write your answer as a comment in the code).
boxplot(age ~ country, data = patients, horizontal = TRUE)

## The box plots indicate similar age distributions in Japan and the USA. Patients in China tended to be younger, but this
## probably reflects the age structure of China's population, which is generally younger than the populations of Japan and the USA.

10.1.2 Problem 2

Orley Ashenfelter, an Economics Professor at Princeton University, claimed to have found a method to predict the quality of Bordeaux [wine](http://www.liquidasset.com/orley.htm). In this problem we will borrow data from [http://www.liquidasset.com/winedata.htm](http://www.liquidasset.com/winedata.htm). The dataset contains information about the prices of Bordeaux wines produced between 1952 and 1980, organised in the following columns:

  • Year: Year in which the wine was produced (unique).
  • LogPrice: Logarithm of the price of the wine.
  • WinterRain: Winter rain in the Bordeaux region (October to March, in ml).
  • Temperature: Average temperature in the region (April to September, in degrees Celsius).
  • HarvestRain: Harvest rain in the region (August and September, in ml).
  • TimeYears: Time since vintage in years.

For a short, entertaining story about Ashenfelter’s analysis and his predictions of wine prices, read the first few pages of Ayres (2008), freely available on [books.google.com](https://books.google.bg/books?id=brHyklsoPRMC&printsec=frontcover&hl=bg#v=onepage&q&f=false).

  1. Download and read the dataset and store in a variable called wines.
  2. Create a new variable (in the data.frame wines) called Price by converting LogPrice back to its original scale. Hint: use the exp function.
  3. How many years are recorded in the dataset?
  4. What was the average temperature in 1953? Write your answer as a comment in the code.
  5. Which was the coldest year recorded?
  6. Compute the average wine price for hot and cold years. Define a cold year to be a year with below average temperature.
  7. Are wines produced during cold years more valuable (on average) than wines produced during hot years?
  8. How many years had below-average temperature? Hint: use the table function.
  9. Compare the distribution of prices between hot and cold years using a box-plot. Interpret the plot.
  10. Create a scatterplot of wine price against the rainfall level during harvest. Do you see any association pattern? Write your answer as a comment in the code.
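No solution is given for Problem 2 here; as a starting point, a hedged sketch of steps 2 and 6 on an invented toy data frame (the column names LogPrice and Temperature come from the list above; the values are made up):

```r
## Toy data frame mimicking the structure of wines (the values are invented)
wines <- data.frame(
  Year        = c(1952, 1953, 1955, 1957),
  LogPrice    = c(-0.99, 0.05, -0.35, -1.18),
  Temperature = c(17.1, 16.7, 17.2, 16.1)
)

## Step 2: undo the logarithm to recover the price on its original scale
wines$Price <- exp(wines$LogPrice)

## Step 6: a cold year is a year with below-average temperature
isCold <- wines$Temperature < mean(wines$Temperature)
mean(wines$Price[isCold])   ## average price in cold years
mean(wines$Price[!isCold])  ## average price in hot years
```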

10.2 Homework 2 (Linear regression 1)

Github classroom link: https://classroom.github.com/g/LBe858uX Submission deadline: 27 April 2020, 23:00.

A detailed discussion of the methods used here is available in Chapter 8.1.

Disclaimer: As in the previous homework, please note that the dataset used here is simplified for ease of use and the analysis here MUST be viewed as an exercise, not real-world research! For the full dataset always refer to the data source!

The dataset lintonHospitalisation is an adapted version of the data used in Linton et al. (2020). It contains 82 observations on COVID19 patients, mainly from mainland China. For this homework we are only interested in the variable timeToHealthcareVisit
that measures the time in days from symptoms onset to the first visit of a healthcare facility.

  1. Download and read the data (already done in the starter code)
lintonHospitalisation <- read.csv('https://raw.githubusercontent.com/feb-uni-sofia/econometrics2020-solutions/master/data/linton2020_sought_healthcare_complete.csv', stringsAsFactors = FALSE)
simData <- read.csv('https://raw.githubusercontent.com/feb-uni-sofia/econometrics2020-solutions/master/data/sim-t-statistics-int-only-model.csv', stringsAsFactors = FALSE)
  2. Let \(y_i\) denote the time in days between symptoms onset and the first visit of a healthcare facility (time to visit for short) for patient \(i = 1,\ldots,n\). Let \(u_i \sim N(0, \sigma ^ 2)\) be a normally distributed random variable with zero mean and variance \(\sigma ^ 2\). Furthermore, assume that the error terms \(u_i\) are independent.

\[\begin{align} y_i = \beta_0 + u_i, \quad i = 1,\ldots,n. \tag{10.1} \end{align}\]

Estimate \(\beta_0\) and \(\sigma\) using OLS and write down the result (as a comment in the code). Hint: use the lm function in R.

## ~ 1 instructs lm to fit a model that includes only an intercept and no predictor variables
fit <- lm(timeToHealthcareVisit ~ 1, data = lintonHospitalisation)
summary(fit)

\[ \hat{\beta}_0 = 2.0854 \\ \hat{\sigma} = 2.535 \]

## Note that in this model the estimate of the standard deviation of the error term (sigma) is simply the sample standard
## deviation of y.
sd(lintonHospitalisation$timeToHealthcareVisit)
## [1] 2.534703
  3. What is the meaning of \(\beta_0\) in (10.1)? Write down your answer as a comment in the code. Computing the expected value and the variance of \(y\) shows (see (8.4)) that \(y\) is normally distributed with mean \(\beta_0\) and variance \(\sigma ^ 2\). Therefore \(\beta_0\) is the expected time (in days) between symptoms onset and a visit to a healthcare facility (given that the person seeks medical help at all!).

\[ y \sim N(\beta_0, \sigma ^ 2) \]

  4. What would be your prediction for the time to visit for a new patient who just started showing symptoms? The best prediction in the sense of (8.7) is the expected time to the visit \(\beta_0\). \(\beta_0\) itself is unknown and we use the estimate for \(\beta_0\) that we derived from the data (the OLS estimate). \[ \widehat{E(y)} = \hat{y} = \hat{\beta}_0 = 2.08 \]
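The fact that the intercept-only OLS estimate \(\hat{\beta}_0\) is just the sample mean of \(y\) is easy to verify on a toy vector (a sketch; the homework model uses timeToHealthcareVisit):

```r
## Toy response standing in for timeToHealthcareVisit
yToy <- c(1, 2, 2, 3, 7, 0, 4)

fitToy <- lm(yToy ~ 1)  ## intercept-only model
unname(coef(fitToy))    ## beta_0 hat ...
mean(yToy)              ## ... equals the sample mean: 2.714286
```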

  5. Estimate the probability that a patient will wait longer than 3 days before seeking medical aid. Hint: use pnorm.

Previously we saw that \(y\) is normally distributed with mean \(\beta_0\) and variance \(\sigma^2\). To estimate the probability we use the estimated distribution of \(y\), plugging in the estimates for \(\beta_0\) and \(\sigma^2\). \[ y \sim N(\beta_0, \sigma ^2) \approx N(2.08, 2.535 ^ 2) \] \[ P_{N(\hat{\beta}_0, \hat{\sigma}^2)}(y > 3) = 1 - P_{N(\hat{\beta}_0, \hat{\sigma}^2)}(y < 3) \]

1 - pnorm(3, mean = 2.08, sd = 2.535)
## [1] 0.3583326
## The estimated probability is 0.358.
  6. Local healthcare workers are complaining that patients tend to wait 3 days on average before seeking medical aid.
    • Write down this hypothesis in terms of model (10.1).
    • Compute the value of the t-statistic for this hypothesis.
    • Decide whether to reject the null hypothesis at a 1 percent significance level. Hint: compute the rejection region and compare the t-statistic to the critical values.
    • Compute the p-value of the test.

\[ H_0: \beta_0 = 3 \\ H_1: \beta_0 \neq 3 \] \[ T = \frac{\hat{\beta}_0 - 3}{SE(\hat{\beta}_0)} \stackrel{H_0}{\sim} t(n - 1) \] If the null hypothesis is true, \(T\) follows a t-distribution with \(n - 1\) degrees of freedom.

n <- nrow(lintonHospitalisation)
## 2.08 and 0.2799 are the estimate of beta_0 and its standard error taken from the summary(fit) output
tStat <- (2.08 - 3) / 0.2799
alpha <- 0.01
lowerCriticalValue <- qt(alpha / 2, df = n - 1)
upperCriticalValue <- -lowerCriticalValue
tStat
## [1] -3.286888
lowerCriticalValue
## [1] -2.637897
upperCriticalValue
## [1] 2.637897

The t-statistic is less than the lower critical value, therefore we reject \(H_0\) that \(\beta_0 = 3\) at the 1 percent significance level.

Alternatively, we can compute the p-value of the test. \[\begin{align} \text{p-value} & = P(|T| > |t|) \\ & = P\left((T < -|t|) \cup (T > |t|)\right) = 2 P(T < -|t|) \\ & = P\left((T < -3.29) \cup (T > 3.29)\right) = 2 P(T < -3.29) \end{align}\]

## abs() returns the absolute value of a number:
abs(-2)
## [1] 2

## p-value
2 * pt(-abs(tStat), df = n - 1)
## [1] 0.001499144

Based on the p-value of the test we reject the null hypothesis, because the observed p-value is less than the significance level.
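The manual computation can be cross-checked against base R's t.test, which performs exactly this one-sample t-test. A sketch on a toy vector (with the homework data the call would be t.test(lintonHospitalisation$timeToHealthcareVisit, mu = 3)):

```r
## Toy sample standing in for timeToHealthcareVisit
yToy <- c(1, 2, 2, 3, 7, 0, 4, 5, 1, 2)

## Manual route: t-statistic and two-sided p-value
tStatToy <- sqrt(length(yToy)) * (mean(yToy) - 3) / sd(yToy)
pManualToy <- 2 * pt(-abs(tStatToy), df = length(yToy) - 1)

## Built-in route: same statistic and p-value
ttToy <- t.test(yToy, mu = 3)
all.equal(unname(ttToy$statistic), tStatToy)  ## TRUE
all.equal(ttToy$p.value, pManualToy)          ## TRUE
```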

  7. The dataset simData summarises the results of 2000 datasets of size n = 82 that were generated from \(N(3, 2.5 ^ 2)\). The columns mean and sd contain the mean and the standard deviation of each sample. The following code shows how to compute the t-statistic for \(H_0: \mu = 3\) and the critical values for a 5 percent significance level against the alternative \(H_1: \mu \neq 3\). Decide whether to reject the null hypothesis (in each sample). In how many samples was your decision wrong? Hint: use the < and > logical operators to compare the test statistics to the critical values. Use the | operator to perform a logical OR. Remember that < and > return logical (TRUE/FALSE) vectors.
alpha <- 0.05
simData$lowerCritivalValue <- qt(alpha / 2, df = simData$n - 1)
simData$upperCriticalValue <- -simData$lowerCritivalValue
simData$tStat1 <- sqrt(simData$n) * (simData$mean - 3) / simData$sd
simData$H0rejected <- (
  simData$tStat1 < simData$lowerCritivalValue | 
  simData$tStat1 > simData$upperCriticalValue
)
sum(simData$H0rejected) / 2000
## [1] 0.0575
## First note that the null hypothesis is true, because the problem states that the datasets (samples) were generated from a N(3, 2.5^2) distribution.
## Following the rejection rules we reject H_0 in 5.75 percent of the 2000 samples, because there the t-statistic falls in the rejection region
## (less than the lower critical value or greater than the upper critical value). This is close to the _expected_ error rate, which is simply the
## significance level of the test.
  8. Using the simData dataset, compute the t-statistic for \(H_0: \mu = 2\) for each sample. Test the null hypothesis against the alternative \(H_1: \mu \neq 2\) at a 5 percent significance level. In how many samples was your decision to reject the null hypothesis wrong?
simData$tStat2 <- sqrt(simData$n) * (simData$mean - 2) / simData$sd
simData$H0falseRejected <- (
  simData$tStat2 < simData$lowerCritivalValue | 
  simData$tStat2 > simData$upperCriticalValue
)
sum(simData$H0falseRejected) / 2000
## [1] 0.941
## In this problem the null hypothesis is false, because the datasets (samples) were generated from a N(3, 2.5^2) distribution and the null hypothesis
## states that \mu = 2 (and 2 is not equal to 3...). Therefore the decision to reject H_0 is correct for every sample, and the wrong decision here
## is to _fail_ to reject the null hypothesis. We correctly reject it in 94.1 percent of the samples. In the remaining 100 - 94.1 = 5.9 percent of the
## samples we do not reject H_0, because the t-statistic is _not_ in the rejection region.
## You should not confuse this error rate with the significance level of the test, though. Experiment with different values for \mu in the
## null hypothesis and see how this error rate changes.

10.3 Homework 3 (English)

Github classroom link: https://classroom.github.com/g/ejCpmcYX

Deadline: 21.05.2020 23:00

You are hired by a consultancy firm operating out of Lahore, Pakistan. Your goal is to develop a statistical model for the sales price of cars. The dataset cars contains observations on 24,973 car sales in Pakistan.

  • Condition (character): Used or new.
  • KMs.Driven: Kilometers driven (mileage).
  • Year: Manufacturing year of the car.
  • Fuel (character): Fuel type (5 categories).
  • Transaction.Type (character): Cash payment or leasing.
  • Price (numeric): Sales price of the car in Pakistani Rupees.

  1. Download and read the dataset.
  cars <- read.csv("https://raw.githubusercontent.com/feb-uni-sofia/econometrics2020-solutions/master/data/cars.csv", stringsAsFactors = FALSE)
  2. Create a new variable (column) in the data set cars called PriceUSD that contains the sales price in USD. Use an exchange rate of 157 Pakistani rupees for one USD.
cars$PriceUSD <- cars$Price / 157
  3. What is the average sales price in USD?
mean(cars$PriceUSD)
## [1] 5885.022
  4. Create a new (logical) variable in the data.frame cars called hasCNG that equals TRUE if the car’s fuel type is CNG (compressed natural gas) and is FALSE otherwise.
cars$hasCNG <- (cars$Fuel == "CNG")
  5. How many cars with CNG fuel are there in the sample?
table(cars$hasCNG)
## 
## FALSE  TRUE 
## 15731  6797
## or you can sum the logical vector
sum(cars$hasCNG)
## [1] 6797
  6. Estimate the linear regression model: \[\begin{align*} \text{PriceUSD}_i = \beta_0 + \beta_1 \text{hasCNG}_i + u_i \end{align*}\] where \(u_i\), \(i = 1,\ldots,n\) are assumed to be independent normal random variables with zero mean and constant variance \(\sigma^2\).
fit <- lm(PriceUSD ~ hasCNG, data = cars)
summary(fit)
## 
## Call:
## lm(formula = PriceUSD ~ hasCNG, data = cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6930  -2790   -885    873 551059 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7248.32      79.51   91.16   <2e-16 ***
## hasCNGTRUE  -4518.52     144.75  -31.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9973 on 22526 degrees of freedom
## Multiple R-squared:  0.04146,    Adjusted R-squared:  0.04142 
## F-statistic: 974.4 on 1 and 22526 DF,  p-value: < 2.2e-16
  7. Write down the estimated regression equation.

\[ \widehat{PriceUSD} = 7248.32 - 4518.52 \times \text{hasCNG} \]

  8. A consultant at the firm claims that the sales price is (on average) not different for CNG-fueled cars compared to all other cars. Formulate a hypothesis (and an alternative) in terms of the model coefficients, compute and write down the value of the test statistic and the p-value of the test. Explain your decision to reject or not reject the hypothesis at a 5% significance level. (Your calculations need to be reproducible!)

\[ H_0: \beta_1 = 0\\ H_1: \beta_1 \neq 0 \] The test statistic is t = -31.21 (from the summary output). Based on the p-value of the test \(<2 \times 10^{-16} < 0.05\) we reject the null hypothesis of no difference between the average sales prices of cars with and without CNG.

  9. Another consultant claims that the average sales price of cars not using CNG is 100,000 rupees. Formulate her hypothesis in terms of the model coefficients; compute and write down the value of the test statistic and the critical values at a 5 percent significance level. Explain your decision to reject or not reject the hypothesis at the 5% significance level.

\[ H_0: \beta_0 = 100000 / 157 \\ H_1: \beta_0 \neq 100000 / 157 \] The critical values at the 5 percent level are -1.96 and 1.96.

alpha <- 0.05
qt(alpha / 2, df = nrow(cars) - 2)
## [1] -1.960069

The t-statistic is

tStat <- (7248.32 - 100000 / 157) / 79.51
tStat
## [1] 83.15152

\[ t = \frac{7248.32 -100000 / 157}{79.51} = 83.15 \] We reject the null hypothesis at the 5 percent level, because 83.15 is greater than the upper critical value.

  10. The consultants are curious about the meaning of the 5% error probability referenced in the tests above. Give an explanation.

We expect the test to wrongly reject a true null hypothesis in 5 percent of the possible samples of cars.
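This error probability can be illustrated with a quick simulation (a sketch, not part of the homework): draw many samples for which the null hypothesis is true and record how often the test rejects it.

```r
set.seed(1)

## 2000 samples of size 40 from N(0, 1); H0: mu = 0 is true for every sample
rejections <- replicate(2000, t.test(rnorm(40), mu = 0)$p.value < 0.05)

## The share of (wrong) rejections should be close to the 5% significance level
mean(rejections)
```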

  11. Create a new variable (column) in the data set cars called MileageKm that contains the car mileage in units of 10,000 km.
cars$MileageKm <- cars$KMs.Driven / 10000
  12. Fit the linear regression model: \[\begin{align} \text{PriceUSD}_i = \beta_0 + \beta_1 \text{MileageKm}_i + \beta_2 \text{Year}_i + u_i \tag{10.2} \end{align}\] with \(i = 1,\ldots,n\) and where \(u_i\) are assumed to be independent normally distributed random variables with zero mean and constant variance \(\sigma^2\).
fitMileage <- lm(PriceUSD ~ MileageKm + Year, data = cars)
summary(fitMileage)
## 
## Call:
## lm(formula = PriceUSD ~ MileageKm + Year, data = cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8805  -2640   -979    736 549185 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.295e+05  1.389e+04 -38.116   <2e-16 ***
## MileageKm   -1.047e+00  1.102e+00  -0.951    0.342    
## Year         2.669e+02  6.924e+00  38.546   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9860 on 22525 degrees of freedom
## Multiple R-squared:  0.06309,    Adjusted R-squared:  0.06301 
## F-statistic: 758.4 on 2 and 22525 DF,  p-value: < 2.2e-16
  13. Write down the estimated regression equation.

\[ \widehat{PriceUSD} = -529477.90 -1.05 \text{ MileageKm} + 266.9 \text{ Year} \]

  14. What is the meaning of the intercept (\(\beta_0\)) in this model?

The intercept is the expected sales price in USD of a car with zero mileage manufactured in year 0 (which does not make much sense).

  15. Interpret the signs of the two estimated regression coefficients (pay attention to the scales of the variables). Do the estimated coefficients conform to your expectations? Explain why.

The negative sign of MileageKm indicates that cars that were driven more (higher mileage) tended to fetch a lower sales price on average than cars manufactured in the same year. The positive coefficient of Year indicates that newer cars (higher Year) tended to cost more on average: the expected sales price of a car manufactured one year earlier is 266.90 USD less than that of another car with the same mileage. The coefficient of MileageKm is not statistically significant (p-value = 0.34 > 0.05). This is partly because mileage and manufacturing year are (negatively) correlated: recently manufactured cars tend to have lower mileage.

  16. Estimate the expected sales price for a car with a 30,000 km mileage that was manufactured in 2010. Calculate the standard error of this estimate. Hint: use the predict function and write down your result: estimate and standard error.
predict(fitMileage, newdata = data.frame(MileageKm = 30, Year = 2010), se.fit = TRUE)
## $fit
##      1 
## 6956.6 
## 
## $se.fit
## [1] 74.8368
## 
## $df
## [1] 22525
## 
## $residual.scale
## [1] 9859.614
  17. What share of the variation of the sales price can be attributed to variation of manufacturing year and mileage? Write down your answer.

The model explains about 6.3 percent (\(R^2 = 0.0631\)) of the total variation of the sales price in USD.
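Instead of reading \(R^2\) off the printed summary, it can be extracted programmatically, e.g. summary(fitMileage)$r.squared. A sketch on a toy regression with invented data; for a simple regression \(R^2\) equals the squared correlation of the two variables:

```r
set.seed(42)

## Toy regression with invented data
xToy <- 1:20
yToy <- 2 * xToy + rnorm(20, sd = 3)
fitToy <- lm(yToy ~ xToy)

## Extract R^2 from the summary object
rsq <- summary(fitToy)$r.squared
rsq

## For a simple regression, R^2 equals the squared correlation
all.equal(rsq, cor(xToy, yToy)^2)  ## TRUE
```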

  18. The consultancy firm claims that the sales price is not affected by differences in mileage given the manufacturing year. Formulate a hypothesis (and an alternative) in terms of the model coefficients. Explain your decision to reject or not to reject the null hypothesis at a 5 percent significance level.

\[ H_0: \beta_1 = 0 \\ H_1: \beta_1 \neq 0 \]

The p-value of the test is 0.34 > 0.05, so we cannot reject the null hypothesis at the 5 percent significance level.

  19. Estimate the linear regression model: \[\begin{align} \text{PriceUSD}_i = \beta_0 + \beta_1 \text{Year}_i + u_i \tag{10.3} \end{align}\]

  20. Is the estimate of the coefficient of Year in model (10.3) the same as the estimate of the coefficient of Year in model (10.2)? Explain the difference (if any).
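No solution code is given for this item. Why the two estimates differ can be seen in a small simulation with correlated regressors (a sketch with invented data, not the cars dataset): omitting a regressor that is correlated with Year changes Year's estimated coefficient.

```r
set.seed(2)

## Simulate a year variable and a mileage that is negatively correlated with it
nSim <- 1000
yearSim <- sample(1990:2015, nSim, replace = TRUE)
mileageSim <- 100 - 3 * (yearSim - 1990) + rnorm(nSim, sd = 5)
priceSim <- 50 + 4 * yearSim - 2 * mileageSim + rnorm(nSim, sd = 10)

## Year coefficient with and without controlling for mileage
bFull <- coef(lm(priceSim ~ mileageSim + yearSim))["yearSim"]
bSimple <- coef(lm(priceSim ~ yearSim))["yearSim"]
bFull    ## close to the true value 4
bSimple  ## absorbs the omitted mileage effect (close to 4 - 2 * (-3) = 10)
```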

  21. Plot the estimated price distributions for cars with 1000, 10,000 and 100,000 km mileage and year of manufacturing 2000.

The model assumes that the sales price of a car is normally distributed with mean \[ E(PriceUSD | \text{MileageKm}, \text{Year}) = \beta_0 + \beta_1\times\text{MileageKm} + \beta_2 \times\text{Year} \] and variance \(\sigma ^2\). Therefore the distributions of the sales price at \(\text{MileageKm} = 0.1, 1, 10\) (pay attention to the measurement unit of MileageKm) for manufacturing year 2000 are:

\[ E(PriceUSD | \text{MileageKm} = 0.1, \text{Year} = 2000) = \beta_0 + \beta_1\times 0.1 + \beta_2 \times 2000 \\ E(PriceUSD | \text{MileageKm} = 1, \text{Year} = 2000) = \beta_0 + \beta_1\times 1 + \beta_2 \times 2000 \\ E(PriceUSD | \text{MileageKm} = 10, \text{Year} = 2000) = \beta_0 + \beta_1\times 10 + \beta_2 \times 2000 \] These distributions have unknown parameters: \(\beta_0, \beta_1, \beta_2\) and \(\sigma\). Replacing them with their OLS estimates gives us the estimated distributions:

## 100,000 km corresponds to MileageKm = 10
predict(fitMileage, newdata = data.frame(MileageKm = c(0.1, 1, 10), Year = 2000))
## The predicted means are approximately 4318.9, 4318.0 and 4308.6.

\[ N(4318.9, 9860 ^ 2) \\ N(4318.0, 9860 ^ 2) \\ N(4308.6, 9860 ^ 2) \]

To plot the distributions, note that approximately 95 percent of the probability mass of a normal distribution falls within 2 standard deviations of its mean. We draw the x-axis of the plot starting two standard deviations below the lowest mean (4308.6) and ending two standard deviations above the highest mean (4318.9), and create a sequence of 100 values between the start and the end of the x-axis.

xGrid <- seq(4308.6 - 2 * 9860, 4318.9 + 2 * 9860, length.out = 100)

For each of these 100 values we compute the densities of the three normal distributions. Finally we plot the lines.

density1 <- dnorm(xGrid, 4308.6, 9860)
density2 <- dnorm(xGrid, 4318.0, 9860)
density3 <- dnorm(xGrid, 4318.9, 9860)

plot(x=xGrid, y = density1, type = "n")

lines(xGrid, density1, lty = 1)
lines(xGrid, density2, lty = 2)
lines(xGrid, density3, lty = 3)

As the difference between the means is quite small, the three curves practically coincide and you will see only one line. Repeat the same exercise for different values of Year (e.g. Year = 1980, Year = 2000) and plot the densities.

  22. Create a new variable called MileageKmCentered by subtracting the average of MileageKm from MileageKm. Fit model (10.2) using MileageKmCentered instead of MileageKm and interpret the coefficients of the model. Which coefficients change and why?
  cars$MileageKmCentered <- cars$MileageKm - mean(cars$MileageKm)
  summary(lm(PriceUSD ~ MileageKmCentered + Year, data = cars))
## 
## Call:
## lm(formula = PriceUSD ~ MileageKmCentered + Year, data = cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8805  -2640   -979    736 549185 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -5.295e+05  1.389e+04 -38.122   <2e-16 ***
## MileageKmCentered -1.047e+00  1.102e+00  -0.951    0.342    
## Year               2.669e+02  6.924e+00  38.546   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9860 on 22525 degrees of freedom
## Multiple R-squared:  0.06309,    Adjusted R-squared:  0.06301 
## F-statistic: 758.4 on 2 and 22525 DF,  p-value: < 2.2e-16

Notice that the intercept has changed. This is because it now estimates the expected sales price of a car manufactured in year = 0 (still not very meaningful) that has the average mileage of all cars. Experiment by centering Year, fit the model with the centered year instead of the original variable, and see how the coefficients change.
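The suggested experiment can be sketched on a toy regression (invented data): centering a regressor changes only the intercept, never the slope, and the intercept of the centered single-regressor model equals the mean of the response.

```r
set.seed(3)

## Toy data standing in for Year and PriceUSD
yearToy <- sample(2000:2015, 200, replace = TRUE)
priceToy <- 300 * (yearToy - 2000) + rnorm(200, sd = 500)

fitRaw <- lm(priceToy ~ yearToy)
fitCentered <- lm(priceToy ~ I(yearToy - mean(yearToy)))

## The slope is identical; only the intercept moves (to the mean of priceToy)
coef(fitRaw)
coef(fitCentered)
```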


References

Ayres, Ian. 2008. Super Crunchers. John Murray Press. https://www.ebook.de/de/product/23088749/ian_ayres_super_crunchers.html.

Linton, Natalie M., Tetsuro Kobayashi, Yichi Yang, Katsuma Hayashi, Andrei R. Akhmetzhanov, Sung-mok Jung, Baoyin Yuan, Ryo Kinoshita, and Hiroshi Nishiura. 2020. “Incubation Period and Other Epidemiological Characteristics of 2019 Novel Coronavirus Infections with Right Truncation: A Statistical Analysis of Publicly Available Case Data.” Journal of Clinical Medicine 9 (2). https://doi.org/10.3390/jcm9020538.

Xu, Bo, Moritz U G Kraemer, Bo Xu, Bernardo Gutierrez, Sumiko Mekaru, Kara Sewalk, Alyssa Loskill, et al. 2020. “Open Access Epidemiological Data from the COVID-19 Outbreak.” The Lancet Infectious Diseases, February. https://doi.org/10.1016/S1473-3099(20)30119-5.