Chapter 10 Homeworks
10.1 Homework 1
Github classroom link: https://classroom.github.com/g/vguWjRNd
Please choose one of the following problems. Write your solution in a file named Problem_1.R or Problem_2.R, depending on the number of the problem that you have picked. When you are finished, upload your code to your team's homework repository by pasting it into the corresponding file there: Problem_1.R or Problem_2.R.
10.1.1 Problem 1
In the wake of the COVID-19 outbreak, data analysis can yield important insights that can help to contain the disease and ultimately to save lives. The dataset COVID19_2020_open_line_list contains data on patients with confirmed COVID-19 infections in the United States, Japan and China (outside Hubei province). Disclaimer: this is an exercise dataset. Cases with missing or incomplete data on age and sex were removed. When age was given as an interval (e.g. 0-10 years), it was replaced with the midpoint of the interval. Do not draw any real-life conclusions from the analysis here! For the full dataset refer to Xu et al. (2020).
- ID (numeric): Case id.
- date_confirmation (date): Date when the infection was confirmed.
- sex (character): Patients’ sex: male/female
- age (numeric): Patients’ age in years.
- province (character): Province where the infection was confirmed.
- country (character): Country
- Download and read the dataset and store it in an object called patients.
patients <- read.csv2('https://firebasestorage.googleapis.com/v0/b/uni-sofia.appspot.com/o/data%2FCOVID19_2020_open_line_list%20-%20outside_Hubei_subset.csv?alt=media&token=919a181b-6417-45fd-9314-91497bb4b0fc', stringsAsFactors = FALSE)
- How many patients are there in the dataset?
nrow(patients)
## [1] 1565
## nrow returns the number of rows in the data frame (table).
## Alternatively, use str to print a summary of the structure of the table.
str(patients)
## 'data.frame': 1565 obs. of 12 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ID : int 649 2596 99 119 120 121 280 669 56 100 ...
## $ date_confirmation : chr "2020-01-15" "2020-01-16" "2020-01-19" "2020-01-20" ...
## $ sex : chr "male" "male" "male" "male" ...
## $ age : num 30 34.5 66 78 76 48 56 35 44 65 ...
## $ province : chr "Tokyo" "Kanagawa Prefecture" "Guangdong" "Guangdong" ...
## $ country : chr "Japan" "Japan" "China" "China" ...
## $ additional_information: chr NA NA "Among the 53 cases diagnosed in our province, 28 were males and 25 were females, aged between 10-81 years old. "| __truncated__ "Among the 53 cases diagnosed in our province, 28 were males and 25 were females, aged between 10-81 years old. "| __truncated__ ...
## $ outcome : chr "discharged" NA NA NA ...
## $ source : chr "https://www.who.int/csr/don/17-january-2020-novel-coronavirus-japan-ex-china/en/" "https://www.mhlw.go.jp/stf/newpage_08906.html" "http://www.bjnews.com.cn/news/2020/01/20/676734.html and http://wsjkw.gd.gov.cn/zwyw_yqxx/content/post_2876057."| __truncated__ "https://www.sinchew.com.my/content/content_2205270.html and http://wsjkw.gd.gov.cn/zwyw_yqxx/content/post_2877668.html" ...
## $ latitude : chr "35.71145" "35.41557" "22.65389" "22.16925" ...
## $ longitude : chr "139.4468" "139.339" "114.1291" "113.361" ...
- What are the earliest and latest date of infection confirmation?
## Open the data frame in the RStudio viewer, sort the table by the date_confirmation column by clicking on
## the column header, and read off the first and the last value.
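The same answer can also be obtained programmatically. A sketch on a toy vector of date strings (with the real data, use patients$date_confirmation in place of d):

```r
# Toy date strings standing in for patients$date_confirmation
d <- as.Date(c("2020-01-15", "2020-02-20", "2020-01-19"))
min(d)    # earliest confirmation date
max(d)    # latest confirmation date
range(d)  # both at once
```

Converting the character column with as.Date makes min and max compare the values chronologically rather than alphabetically.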
- What is the average age of the patients?
mean(patients$age)
## [1] 49.1828
- How many men and how many women are there in the sample?
table(patients$sex)
##
## female male
## 719 846
- Plot the frequency distribution of sex using a bar-chart.
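A sketch of the bar chart, illustrated on a toy vector (substitute patients$sex for sexExample with the real data):

```r
# Toy vector standing in for patients$sex
sexExample <- c("male", "female", "male", "male", "female")
counts <- table(sexExample)  # frequency table: one count per category
barplot(counts, main = "Frequency distribution of sex", ylab = "Count")
```

barplot expects the table of counts, not the raw character vector, which is why table is called first.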
- What was the age of the youngest woman?
## [1] 0.08333
- What was the age of the oldest man?
## [1] 94.5
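Both values come from logical subsetting of the age column by sex. A toy illustration (with the real data, use patients instead of patientsExample; the ages here are made up to match the outputs above):

```r
# Toy data frame standing in for patients
patientsExample <- data.frame(sex = c("male", "female", "male", "female"),
                              age = c(94.5, 0.08333, 30, 66))
min(patientsExample$age[patientsExample$sex == "female"])  # youngest woman
max(patientsExample$age[patientsExample$sex == "male"])    # oldest man
```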
- Is there a difference between the average age of male and female patients?
isMale <- patients$sex == "male"
isFemale <- patients$sex == "female"
averageAgeMale <- mean(patients$age[isMale])
averageAgeFemale <- mean(patients$age[isFemale])
averageAgeMale - averageAgeFemale
## [1] 0.7097229
- Compare the distribution of age between the countries using a box-plot. Interpret the plot (write your answer as a comment in the code).
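A sketch of the box-plot using the formula interface, on a toy data frame (replace patientsExample with patients for the real data; the ages are made up):

```r
# Toy data frame standing in for patients
patientsExample <- data.frame(
  country = rep(c("China", "Japan", "United States"), each = 3),
  age = c(45, 60, 52, 38, 41, 55, 30, 47, 62)
)
# One box per country; compare the medians and spreads across the groups
boxplot(age ~ country, data = patientsExample, ylab = "Age (years)")
```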
10.1.2 Problem 2
Orley Ashenfelter, an Economics Professor at Princeton University, claimed to have found a method to predict the quality of Bordeaux wine (http://www.liquidasset.com/orley.htm). In this problem we will borrow data from http://www.liquidasset.com/winedata.htm. The dataset contains information about the prices of Bordeaux wines produced between 1952 and 1980, organised in the following columns:
- Year: Year in which the wine was produced (unique).
- LogPrice: Logarithm of the price of the wine.
- WinterRain: Winter rain in the Bordeaux region (October to March, in ml).
- Temperature: Average temperature in the region (April to September, in degrees Celsius).
- HarvestRain: Harvest rain in the region (August and September, in ml).
- TimeYears: Time since vintage in years.
For a short, entertaining story about Ashenfelter's analysis and his predictions of wine prices, read the first few pages of Ayres (2008), freely available on books.google.com (https://books.google.bg/books?id=brHyklsoPRMC&printsec=frontcover&hl=bg#v=onepage&q&f=false).
- Download and read the dataset and store it in a variable called wines.
- Create a new variable (in the data.frame wines) called Price by converting LogPrice back to its original scale. Hint: use the exp function.
- How many years are recorded in the dataset?
- What was the average temperature in 1953? Write your answer as a comment in the code.
- Which was the coldest year recorded?
- Compute the average wine price for hot and cold years. Define a cold year to be a year with below average temperature.
- Are wines produced during cold years more valuable (on average) than wines produced during hot years?
- How many years had below-average temperature? Hint: use the table function.
- Compare the distribution of prices between hot and cold years using a box-plot. Interpret the plot.
- Create a scatterplot of wine price against the rainfall level during harvest. Do you see any association pattern? Write your answer as a comment in the code.
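A sketch for the price-related tasks above, on a toy version of the wines data (the column names follow the list above; the numbers are made up, not the real Ashenfelter data):

```r
# Toy rows standing in for the wines data frame
winesExample <- data.frame(Year = c(1952, 1953, 1955),
                           LogPrice = c(-1.0, -0.5, -0.8),
                           Temperature = c(17.1, 16.7, 17.2),
                           HarvestRain = c(160, 80, 130))
winesExample$Price <- exp(winesExample$LogPrice)  # back-transform LogPrice
isCold <- winesExample$Temperature < mean(winesExample$Temperature)
mean(winesExample$Price[isCold])   # average price in cold years
mean(winesExample$Price[!isCold])  # average price in hot years
table(isCold)                      # number of cold and hot years
plot(winesExample$HarvestRain, winesExample$Price,
     xlab = "Harvest rain (ml)", ylab = "Price")  # scatterplot
```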
10.2 Homework 2 (Linear regression 1)
Github classroom link: https://classroom.github.com/g/LBe858uX Submission deadline: 27 April 2020, 23:00.
A detailed discussion of the methods used here is available in Chapter 8.1.
Disclaimer: As in the previous homework, please note that the dataset used here is simplified for ease of use and the analysis here MUST be viewed as an exercise, not real-world research! For the full dataset always refer to the data source!
The dataset lintonHospitalisation is an adapted version of the data used in Linton et al. (2020). It contains 82 observations on COVID-19 patients, mainly from mainland China. For this homework we are only interested in the variable timeToHealthcareVisit, which measures the time in days from symptom onset to the first visit to a healthcare facility.
- Read and download the data (already done in the starter code)
lintonHospitalisation <- read.csv('https://raw.githubusercontent.com/feb-uni-sofia/econometrics2020-solutions/master/data/linton2020_sought_healthcare_complete.csv', stringsAsFactors = FALSE)
simData <- read.csv('https://raw.githubusercontent.com/feb-uni-sofia/econometrics2020-solutions/master/data/sim-t-statistics-int-only-model.csv', stringsAsFactors = FALSE)
- Let \(y_i\) denote the time in days between symptoms onset and the first visit of a healthcare facility (time to visit for short) for patient \(i = 1,\ldots,n\). Let \(u_i \sim N(0, \sigma ^ 2)\) be a normally distributed random variable with zero mean and variance \(\sigma ^ 2\). Furthermore, assume that the error terms \(u_i\) are independent.
\[\begin{align} y_i = \beta_0 + u_i, \quad i = 1,\ldots,n. \tag{10.1} \end{align}\]
Estimate \(\beta_0\) and \(\sigma\) using OLS and write down the result (as a comment in the code). Hint: use the lm function in R.
## ~ 1 instructs lm to fit a model that only includes an intercept and no predictors.
fit <- lm(timeToHealthcareVisit ~ 1, data = lintonHospitalisation)
summary(fit)
\[ \hat{\beta}_0 = 2.0854 \\ \hat{\sigma} = 2.535 \]
## Note that in this model the estimate of the standard deviation of the error term (sigma) is simply the sample
## standard deviation of y.
sd(lintonHospitalisation$timeToHealthcareVisit)
## [1] 2.534703
- What is the meaning of \(\beta_0\) in (10.1)? Write down your answer as a comment in the code. Computing the expected value and the variance of \(y\) shows (see (8.4)) that \(y\) is normally distributed with mean \(\beta_0\) and variance \(\sigma ^ 2\). Therefore \(\beta_0\) is the expected time (in days) between symptoms onset and a visit to a healthcare facility (given that the person seeks medical help at all!).
\[ y \sim N(\beta_0, \sigma ^ 2) \]
What would be your prediction for the time to visit for a new patient who just started showing symptoms? The best prediction in the sense of (8.7) is the expected time to the visit \(\beta_0\). \(\beta_0\) itself is unknown and we use the estimate for \(\beta_0\) that we derived from the data (the OLS estimate). \[ \widehat{E(y)} = \hat{y} = \hat{\beta}_0 = 2.08 \]
Estimate the probability that a patient will wait longer than 3 days before seeking medical aid. Hint: use pnorm.
Previously we saw that \(y\) is normally distributed with mean \(\beta_0\) and variance \(\sigma^2\). To estimate the probability we use the estimated distribution of \(y\), obtained by plugging in the estimates for \(\beta_0\) and \(\sigma^2\). \[ y \sim N(\beta_0, \sigma ^2) \] \[ N(2.08, 2.535 ^ 2) \] \[ P_{N(\hat{\beta}_0, \hat{\sigma}^2)}(y > 3) = 1 - P_{N(\hat{\beta}_0, \hat{\sigma}^2)}(y < 3) \]
## [1] 0.3583326
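The printed probability can be reproduced (up to rounding) with pnorm, plugging in the rounded estimates from above; the exact digits depend on how much the estimates are rounded:

```r
betaHat <- 2.08     # rounded OLS estimate of beta_0
sigmaHat <- 2.5347  # estimated sigma (sample standard deviation of y)
1 - pnorm(3, mean = betaHat, sd = sigmaHat)
# equivalently: pnorm(3, betaHat, sigmaHat, lower.tail = FALSE)
```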
- Local healthcare workers are complaining that patients tend to wait 3 days on average before seeking medical aid.
- Write down this hypothesis in terms of model (10.1).
- Compute the value of the t-statistic for this hypothesis.
- Decide whether to reject the null hypothesis at a 1 percent significance level. Hint: compute the rejection region and compare the t-statistic to the critical values.
- Compute the p-value of the test.
\[ H_0: \beta_0 = 3 \\ H_1: \beta_0 \neq 3 \] \[ T = \frac{\hat{\beta}_0 - 3}{SE(\hat{\beta}_0)} \stackrel{H_0}{\sim} t(n - 1) \] If the null hypothesis is true, \(T\) follows a t-distribution with \(n - 1\) degrees of freedom.
n <- nrow(lintonHospitalisation)
tStat <- (2.08 - 3) / 0.2799
alpha <- 0.01
lowerCriticalValue <- qt(alpha / 2, df = n - 1)
upperCriticalValue <- -lowerCriticalValue
tStat
## [1] -3.286888
lowerCriticalValue
## [1] -2.637897
upperCriticalValue
## [1] 2.637897
The t-statistic is less than the lower critical value, therefore we reject \(H_0\) that \(\beta_0 = 3\) at the 1 percent significance level.
Alternatively we can compute the p-value of the test. \[\begin{align} \text{p-value} & = P(|T| > |t|) \\ & = P\left((T < -|t|) \cup (T > |t|)\right) = 2 P(T < -|t|) \\ & = P\left((T < -3.28) \cup (T > 3.28)\right) = 2 P(T < -3.28) \end{align}\]
## [1] 0.001499144
abs() returns the absolute value of a number. Based on the p-value of the test we reject the null hypothesis, because the observed p-value is less than the significance level.
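The p-value computation can be sketched as follows, using the rounded estimate and standard error from above:

```r
n <- 82                                    # number of observations
tStat <- (2.08 - 3) / 0.2799               # t-statistic, as above
pValue <- 2 * pt(-abs(tStat), df = n - 1)  # two-sided p-value
pValue
```

pt gives the lower-tail probability of the t-distribution; doubling the tail below -|t| yields the two-sided p-value.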
- The dataset simData summarises the results from 2000 datasets of size n = 82 that were generated from \(N(3, 2.5 ^ 2)\). The columns mean and sd contain the mean and the standard deviation of each sample. The following code shows how to compute the t-statistic for \(H_0: \mu = 3\) and the critical values at a 5 percent significance level for the alternative \(H_1: \mu \neq 3\). Decide whether to reject the null hypothesis (in each sample). In how many samples was your decision wrong? Hint: use the < and > logical operators to compare the test statistics to the critical values. Use the | operator to perform a logical OR. Remember that < and > return logical (TRUE/FALSE) vectors. See this page for a short overview of logical operators in R.
alpha <- 0.05
simData$lowerCriticalValue <- qt(alpha / 2, df = simData$n - 1)
simData$upperCriticalValue <- -simData$lowerCriticalValue
simData$tStat1 <- sqrt(simData$n) * (simData$mean - 3) / simData$sd
simData$H0rejected <- (
  simData$tStat1 < simData$lowerCriticalValue |
  simData$tStat1 > simData$upperCriticalValue
)
sum(simData$H0rejected) / 2000
## [1] 0.0575
## First note that the null hypothesis is true, because the problem states that the datasets (samples) were generated from a N(3, 2.5^2) distribution.
## Following the rejection rules we reject H_0 in 5.75 percent of the 2000 samples, because the t-statistic falls in the rejection region
## (less than the lower critical value or greater than the upper critical value). This is close to the _expected_ error rate, which is simply the
## significance level of the test.
- Using the simData dataset, compute the t-statistic for \(H_0: \beta_0 = 2\) for each sample. Test the null hypothesis against the alternative \(H_1: \beta_0 \neq 2\) at a 5 percent significance level. In how many samples was your decision to reject the null hypothesis wrong?
simData$tStat2 <- sqrt(simData$n) * (simData$mean - 2) / simData$sd
simData$H0falseRejected <- (
  simData$tStat2 < simData$lowerCriticalValue |
  simData$tStat2 > simData$upperCriticalValue
)
sum(simData$H0falseRejected) / 2000
## [1] 0.941
## In this problem the null hypothesis is false, because the datasets (samples) were generated from a N(3, 2.5^2) distribution and the null
## hypothesis states that \beta_0 = 2 (and 2 is not equal to 3). The decision to reject H_0 is therefore correct whenever it occurs; the wrong
## decision here is to _fail_ to reject the null hypothesis. We correctly reject it in 94.1 percent of the samples. In the remaining 5.9 percent
## of the samples we do not reject H_0, because the t-statistic is _not_ in the rejection region.
## You should not confuse this error rate with the significance level of the test, though. Experiment with different values for \beta_0 in the
## null hypothesis and see how this error rate changes.
10.3 Homework 3 (English)
Github classroom code: https://classroom.github.com/g/ejCpmcYX
Deadline: 21.05.2020 23:00
You are hired by a consultancy firm operating out of Lahore, Pakistan. Your goal is to develop a statistical model for the sales price of cars. The dataset cars contains observations on 24,973 car sales in Pakistan.
- Condition (character): Used or new.
- KMs.Driven: Kilometers driven (mileage).
- Year: Manufacturing year of the car.
- Fuel (character): Fuel type (5 categories).
- Transaction.Type (character): Cash payment or leasing.
- Price (numeric): Sales price of the car in Pakistani Rupees.
- Download and read the dataset.
cars <- read.csv("https://raw.githubusercontent.com/feb-uni-sofia/econometrics2020-solutions/master/data/cars.csv", stringsAsFactors = FALSE)
- Create a new variable (column) in the data set cars called PriceUSD that contains the sales price in USD. Use an exchange rate of 157 Pakistani rupees for one USD.
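A sketch of the conversion on a toy data frame (carsExample stands in for cars; the prices are made up, the exchange rate is the stated 157 rupees per USD):

```r
# Toy prices in Pakistani rupees, standing in for the cars data frame
carsExample <- data.frame(Price = c(1570000, 785000))
carsExample$PriceUSD <- carsExample$Price / 157  # 157 PKR = 1 USD
mean(carsExample$PriceUSD)                        # average price in USD
```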
- What is the average sales price in USD?
## [1] 5885.022
- Create a new (logical) variable in the data.frame cars called hasCNG that equals TRUE if the car's fuel type is CNG (compressed natural gas) and is FALSE otherwise.
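A sketch of the logical column on a toy Fuel vector, assuming the CNG category is coded exactly as "CNG" in the data:

```r
# Toy fuel types standing in for cars$Fuel
carsExample <- data.frame(Fuel = c("Petrol", "CNG", "Diesel", "CNG"))
# TRUE when the fuel type is CNG, FALSE otherwise (assumes the label is exactly "CNG")
carsExample$hasCNG <- carsExample$Fuel == "CNG"
table(carsExample$hasCNG)  # counts of FALSE/TRUE
sum(carsExample$hasCNG)    # TRUE counts as 1, so this is the number of CNG cars
```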
- How many cars with CNG fuel are there in the sample?
table(cars$hasCNG)
##
## FALSE TRUE
## 15731 6797
sum(cars$hasCNG)
## [1] 6797
- Estimate the linear regression model: \[\begin{align*} \text{PriceUSD}_i = \beta_0 + \beta_1 \text{hasCNG}_i + u_i \end{align*}\] where \(u_i\), \(i = 1,\ldots,n\) are assumed to be independent normal random variables with zero mean and constant variance \(\sigma^2\).
##
## Call:
## lm(formula = PriceUSD ~ hasCNG, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6930 -2790 -885 873 551059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7248.32 79.51 91.16 <2e-16 ***
## hasCNGTRUE -4518.52 144.75 -31.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9973 on 22526 degrees of freedom
## Multiple R-squared: 0.04146, Adjusted R-squared: 0.04142
## F-statistic: 974.4 on 1 and 22526 DF, p-value: < 2.2e-16
- Write down the estimated regression equation.
\[ \widehat{PriceUSD} = 7248.32 - 4518.52 \times \text{hasCNG} \]
- A consultant in that firm claims that the sales price is (on average) not different for CNG-fueled cars compared to all other cars. Formulate a hypothesis (and an alternative) in terms of the model coefficients, compute and write down the value of the test-statistic and the p-value of the test. Explain your decision to reject or not to reject the hypothesis at a 5% significance level. (Your calculations need to be reproducible!)
\[ H_0: \beta_1 = 0\\ H_1: \beta_1 \neq 0 \] Based on the p-value of the test \(<2 \times 10^{-16} < 0.05\) we reject the null hypothesis of no difference between the average sales prices for cars with and without CNG.
- Another consultant claims that the average sales price for cars not using CNG is 100,000 rupees. Formulate her hypothesis in terms of the model coefficients, compute and write down the value of the test-statistic, the critical values at a 5 percent significance level. Explain your decision to reject or to not reject the hypothesis at the 5% significance level.
\[ H_0: \beta_0 = 100000 / 157 \\ H_1: \beta_0 \neq 100000 / 157 \] The critical values at the 5 percent level are -1.96 and 1.96.
## [1] -1.960069
The t-statistic is
## [1] 83.15152
\[ t = \frac{7248.32 -100000 / 157}{79.51} = 83.15 \] We reject the null hypothesis at the 5 percent level, because 83.15 is greater than the upper critical value.
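The t-statistic and the critical value above can be reproduced from the printed regression output (estimate, standard error, and residual degrees of freedom):

```r
estimate <- 7248.32           # intercept estimate from the regression output (USD)
stdError <- 79.51             # its standard error
hypothesised <- 100000 / 157  # 100,000 rupees converted to USD
tStat <- (estimate - hypothesised) / stdError
tStat                         # t-statistic for H_0: beta_0 = 100000/157
qt(0.025, df = 22526)         # lower critical value at the 5 percent level
```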
- The consultants are curious about the meaning of the 5% error probability referenced in the tests above. Give an explanation.
We expect the test to wrongly reject a true null hypothesis in 5 percent of the possible samples of cars.
- Create a new variable (column) in the data set cars called MileageKm that contains the car mileage in units of 10,000 km.
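A sketch of the rescaling, assuming the mileage column is KMs.Driven as in the variable list above (toy values):

```r
# Toy mileages in km, standing in for cars$KMs.Driven
carsExample <- data.frame(KMs.Driven = c(30000, 100000))
carsExample$MileageKm <- carsExample$KMs.Driven / 10000  # units of 10,000 km
carsExample$MileageKm
```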
- Fit the linear regression model: \[\begin{align} \text{PriceUSD}_i = \beta_0 + \beta_1 \text{MileageKm}_i + \beta_2 \text{Year}_i + u_i \tag{10.2} \end{align}\] with \(i = 1,\ldots,n\) and where \(u_i\) are assumed to be independent normally distributed random variables with zero mean and constant variance \(\sigma^2\).
##
## Call:
## lm(formula = PriceUSD ~ MileageKm + Year, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8805 -2640 -979 736 549185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.295e+05 1.389e+04 -38.116 <2e-16 ***
## MileageKm -1.047e+00 1.102e+00 -0.951 0.342
## Year 2.669e+02 6.924e+00 38.546 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9860 on 22525 degrees of freedom
## Multiple R-squared: 0.06309, Adjusted R-squared: 0.06301
## F-statistic: 758.4 on 2 and 22525 DF, p-value: < 2.2e-16
- Write down the estimated regression equation.
\[ \widehat{PriceUSD} = -529477.90 -1.05 \text{ MileageKm} + 266.9 \text{ Year} \]
- What is the meaning of the intercept (\(\beta_0\)) in this model?
The intercept is the expected sales price in USD for a new car that is manufactured in year = 0 (does not make much sense).
- Interpret the signs of the two estimated regression coefficients (pay attention to the scales of the variables). Do the estimated coefficients conform to your expectations? Explain why.
The negative sign of MileageKm indicates that cars that were driven more (high mileage) tended to fetch a lower sales price on average, compared with cars that were manufactured in the same year. The positive coefficient of Year indicates that newer cars (with a high value for Year) tended to cost more on average. The expected sales price of a car that was manufactured one year earlier is 266.90 USD less than that of another car with the same mileage.
The coefficient of MileageKm is not significant (p-value = 0.34 > 0.05). This is due to the fact that mileage and manufacturing year are (negatively) correlated: cars that were manufactured recently tended to have a lower mileage.
- Estimate the expected sales price for a car with a 30,000 km mileage that was manufactured in 2010. Calculate the standard error of this estimate. Hint: use the predict function. Write down both the estimate and the standard error.
## $fit
## 1
## 6956.6
##
## $se.fit
## [1] 74.8368
##
## $df
## [1] 22525
##
## $residual.scale
## [1] 9859.614
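The output above has the shape returned by predict with se.fit = TRUE. A self-contained sketch on simulated data (with the real data, fit is the model from (10.2) fitted on cars, and 30,000 km corresponds to MileageKm = 3):

```r
# Simulated stand-in for the cars data; the coefficients are made up
set.seed(1)
toy <- data.frame(MileageKm = runif(50, 0, 20),
                  Year = sample(1990:2015, 50, replace = TRUE))
toy$PriceUSD <- 6000 - 100 * toy$MileageKm + 250 * (toy$Year - 2000) +
  rnorm(50, sd = 500)
fit <- lm(PriceUSD ~ MileageKm + Year, data = toy)
newCar <- data.frame(MileageKm = 3, Year = 2010)  # 30,000 km, manufactured in 2010
pred <- predict(fit, newdata = newCar, se.fit = TRUE)
pred$fit     # point estimate of the expected price
pred$se.fit  # standard error of the estimate
```

The newdata data frame must use the same column names as the model formula.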
- What share of the variation of the sales price can be attributed to variation of manufacturing year and mileage? Write down your answer.
The model explains about 6.3 percent of the total variation of the sales price in USD (\(R^2 = 0.0631\)).
- The consultancy firm claims that the sales price is not affected by differences in mileage given the manufacturing year. Formulate a hypothesis (and an alternative) in terms of the model coefficients. Explain your decision to reject or not to reject the null hypothesis at a 5 percent significance level.
\[ H_0: \beta_1 = 0 \\ H_1: \beta_1 \neq 0 \]
The p-value of the test is 0.34 > 0.05 so we cannot reject this statement based on the data.
Estimate the linear regression model: \[\begin{align} \text{PriceUSD}_i = \beta_0 + \beta_1 \text{Year}_i + u_i \tag{10.3} \end{align}\]
Is the estimate of the coefficient of Year in model (10.3) the same as the estimate of the coefficient of Year in model (10.2)? Explain the difference (if any).
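A self-contained sketch of the comparison: when mileage and year are correlated, dropping MileageKm changes the Year coefficient (an omitted-variable effect). The data here are simulated with known coefficients; with the real data, compare the Year coefficient from lm(PriceUSD ~ Year, data = cars) with the fit of (10.2):

```r
# Simulated data where newer cars tend to have a lower mileage
set.seed(2)
Year <- sample(1990:2015, 200, replace = TRUE)
MileageKm <- (2016 - Year) + rnorm(200, sd = 3)
PriceUSD <- 5000 + 250 * (Year - 2000) - 100 * MileageKm + rnorm(200, sd = 300)
toy <- data.frame(Year, MileageKm, PriceUSD)
bMultiple <- coef(lm(PriceUSD ~ MileageKm + Year, data = toy))["Year"]
bSimple <- coef(lm(PriceUSD ~ Year, data = toy))["Year"]
bMultiple  # close to the simulated effect of 250 USD per year
bSimple    # larger, because it also absorbs the omitted mileage effect
```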
Plot the estimated price distributions for cars with 1000, 10,000 and 100,000 km mileage and year of manufacturing 2000.
The model assumes that the sales price of a car is normally distributed with mean \[ E(PriceUSD | \text{MileageKm}, \text{Year}) = \beta_0 + \beta_1\times\text{MileageKm} + \beta_2 \times\text{Year} \] and variance \(\sigma ^2\). Therefore the distributions of the sales price at \(\text{MileageKm} = 0.1, 1, 10\) (pay attention to the measurement unit of MileageKm) for manufacturing year 2000 are:
\[ E(PriceUSD | \text{MileageKm} = 0.1, \text{Year} = 2000) = \beta_0 + \beta_1\times 0.1 + \beta_2 \times 2000 \\ E(PriceUSD | \text{MileageKm} = 1, \text{Year} = 2000) = \beta_0 + \beta_1\times 1 + \beta_2 \times 2000 \\ E(PriceUSD | \text{MileageKm} = 10, \text{Year} = 2000) = \beta_0 + \beta_1\times 10 + \beta_2 \times 2000 \] These distributions have unknown parameters: \(\beta_0, \beta_1, \beta_2\) and \(\sigma\). Replacing these with their OLS estimates gives us the estimated distributions:
## 1 2 3
## 4318.932 4317.989 4308.566
\[ N(4318.9, 9860 ^ 2) \\ N(4318.0, 9860 ^ 2) \\ N(4308.6, 9860 ^ 2) \]
To plot the distributions, notice that approximately 95 percent of the probability of a normal distribution falls within 2 standard deviations of its mean. We will draw the x-axis of the plot starting at two standard deviations below the lowest mean (4308.6) and ending at two standard deviations above the highest mean (4318.9). We create a sequence of 100 values between the start and the end of the x-axis.
For each of these 100 values we compute the densities of the three normal distributions. Finally we plot the lines.
xGrid <- seq(4308.6 - 2 * 9860, 4318.9 + 2 * 9860, length.out = 100)
density1 <- dnorm(xGrid, 4308.6, 9860)
density2 <- dnorm(xGrid, 4318.0, 9860)
density3 <- dnorm(xGrid, 4318.9, 9860)
plot(x = xGrid, y = density1, type = "n")
lines(xGrid, density1, lty = 1)
lines(xGrid, density2, lty = 2)
lines(xGrid, density3, lty = 3)
As the difference between the means is quite small, you will only see one of the lines. Repeat the same exercise for Year (e.g. Year = 1980, Year = 2000) and plot the densities.
- Create a new variable called MileageKmCentered by subtracting the average of MileageKm from MileageKm. Fit model (10.2) using MileageKmCentered instead of MileageKm and interpret the coefficients of the model. Which coefficients change and why?
cars$MileageKmCentered <- cars$MileageKm - mean(cars$MileageKm)
summary(lm(PriceUSD ~ MileageKmCentered + Year, data = cars))
##
## Call:
## lm(formula = PriceUSD ~ MileageKmCentered + Year, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8805 -2640 -979 736 549185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.295e+05 1.389e+04 -38.122 <2e-16 ***
## MileageKmCentered -1.047e+00 1.102e+00 -0.951 0.342
## Year 2.669e+02 6.924e+00 38.546 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9860 on 22525 degrees of freedom
## Multiple R-squared: 0.06309, Adjusted R-squared: 0.06301
## F-statistic: 758.4 on 2 and 22525 DF, p-value: < 2.2e-16
Notice that the intercept has changed. This is due to the fact that it now estimates the expected sales price of a car manufactured in year = 0 (still does not make much sense) that has the average mileage of all cars. Experiment by centering Year and fitting the model with the centered year instead of the original variable. See how it changes.
References
Ayres, Ian. 2008. Super Crunchers. John Murray Press. https://www.ebook.de/de/product/23088749/ian_ayres_super_crunchers.html.
Linton, Natalie M., Tetsuro Kobayashi, Yichi Yang, Katsuma Hayashi, Andrei R. Akhmetzhanov, Sung-mok Jung, Baoyin Yuan, Ryo Kinoshita, and Hiroshi Nishiura. 2020. “Incubation Period and Other Epidemiological Characteristics of 2019 Novel Coronavirus Infections with Right Truncation: A Statistical Analysis of Publicly Available Case Data.” Journal of Clinical Medicine 9 (2). https://doi.org/10.3390/jcm9020538.
Xu, Bo, Moritz U G Kraemer, Bo Xu, Bernardo Gutierrez, Sumiko Mekaru, Kara Sewalk, Alyssa Loskill, et al. 2020. “Open Access Epidemiological Data from the COVID-19 Outbreak.” The Lancet Infectious Diseases, February. https://doi.org/10.1016/S1473-3099(20)30119-5.