STAT302 Assignment

Due: 2023-02-07

The purpose of this assignment is to deepen your understanding of the properties of linear regression and to develop your data analysis skills. These skills will be useful for the final project of this course and for future courses. The emphasis of this assignment is on practicing coding with R. The assignment is a simulation-based study and builds upon concepts discussed during the lectures.

Assignment Description: In this course we have come across many properties of least squares estimators and variance estimators. For example, the least squares estimators are unbiased and follow a normal distribution. Likewise, the mean residual sum of squares (MRS) is an unbiased estimator of the error variance σ². We have seen (or will see) the mathematical proofs of these properties. Furthermore, we will make some assumptions about the linear regression model we use.

Task 1

Assume the following simple regression model,

Y = β0 + β1X + ε,   ε ∼ N(0, σ²)
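For reference, the properties mentioned in the assignment description can be written out for this model as follows. This is only a summary sketch in standard simple-regression notation, with X treated as fixed. The symbols n, x̄, S_xx and RSS do not appear in the handout and are introduced here as assumptions, with MRS taken to mean RSS/(n − 2).

```latex
\begin{align*}
\hat\beta_1 &= \frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{S_{xx}},
  \qquad S_{xx}=\sum_{i=1}^{n}(x_i-\bar x)^2, \\
\hat\beta_0 &= \bar y-\hat\beta_1\bar x, \\
E[\hat\beta_0] &= \beta_0, \qquad E[\hat\beta_1]=\beta_1, \\
\operatorname{Var}(\hat\beta_1) &= \frac{\sigma^2}{S_{xx}},
  \qquad \operatorname{Var}(\hat\beta_0)=\sigma^2\left(\frac{1}{n}+\frac{\bar x^{2}}{S_{xx}}\right), \\
\mathrm{MRS} &= \frac{\mathrm{RSS}}{n-2}, \qquad E[\mathrm{MRS}]=\sigma^2 .
\end{align*}
```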

Run the following code to simulate the true parameter values σ² = sig2, β1 = beta1 and β0 = beta0:

## Simulation ##

set.seed("INSERT YOUR STUDENT ID") ## Replace the placeholder with your numeric student ID

beta0 <- rnorm(1, mean = 0, sd = 1) ## The true beta0

beta1 <- runif(n = 1, min = 1, max = 3) ## The true beta1

sig2 <- rchisq(n = 1, df = 25) ## The true value of the error variance sigma^2

## Multiple simulation will require loops ##

nsample <- 10 ## Sample size

n.sim <- 100 ## The number of simulations

sigX <- 0.2 ## The variance of X

## Simulate the predictor variable ##

X <- rnorm(nsample, mean = 0, sd = sqrt(sigX))

Please change the seed to your student ID. The seed is used to generate random numbers, with different seeds generating different numbers. Since every student sets a different seed for their own simulations, every student will have a unique dataset. If you don’t set your seed to your student ID then you will receive a 0 for the whole assignment. In the first task below, please show and explain the following steps:

1. Fix the sample size at nsample = 10. Here, the values of X are fixed; you just need to generate ε and Y. Execute 100 simulations (ie, n.sim = 100). For each simulation, estimate the regression coefficients (β0, β1) and the error variance (σ²). Calculate the mean of the estimates across the simulations.

Comment on your observations. What did you expect the mean to be?

2. Plot the histogram of each of the regression parameter estimates from (b). Explain the pattern of the distributions.

3. Obtain the variances of the regression parameter estimators (ie, β̂0 and β̂1) from the simulations. That is, calculate the sample variances of the regression parameter estimates from the 100 simulations. Are these variances approximately equal to the true variances of the regression parameter estimators? Explain in 2-3 sentences.

4. Construct the 95% t and z confidence intervals for β0 and β1 in every simulation. For each method, what proportion of the intervals contains the true value of the parameter? Is this consistent with the definition of a confidence interval? Next, what differences do you observe between the t and z confidence intervals? What effect does increasing the number of simulations from 100 have on the confidence intervals?

5. For steps (a)-(d) the sample size was fixed at 10. Start increasing the sample size (eg, 20, 50, 100) and run steps (a)-(d). Explain what happens to the mean, variance and distribution of the estimators as the sample size increases.

6. Choose the largest sample size you have used in step (f). Fix the sample size to that and start changing the error variance (sig2). You can increase and decrease the value of the error variance. For each value of error variance execute steps (a) – (d). Explain what happens to the mean, variance and distribution of the estimates as the error variance changes.
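One possible way to organize the estimation, histogram, variance and sample-size steps is sketched below. It is a minimal sketch, not a model answer. The helper run_sim and the names est, Sxx and true_var are hypothetical, the seed 1 is only a placeholder for your student ID, and the theoretical variances are the standard fixed-X formulas σ²(1/n + x̄²/Sxx) and σ²/Sxx.

```r
## Illustrative sketch only: the seed 1 is a placeholder, not a student ID.
set.seed(1)

beta0 <- rnorm(1, mean = 0, sd = 1)        # true beta0
beta1 <- runif(n = 1, min = 1, max = 3)    # true beta1
sig2  <- rchisq(n = 1, df = 25)            # true error variance
sigX  <- 0.2                               # variance of X

## Hypothetical helper: run n.sim simulations with one fixed design X
run_sim <- function(nsample, sig2, n.sim = 100) {
  X <- rnorm(nsample, mean = 0, sd = sqrt(sigX))   # generated once, then held fixed
  est <- matrix(NA, nrow = n.sim, ncol = 3,
                dimnames = list(NULL, c("beta0", "beta1", "sig2")))
  for (i in seq_len(n.sim)) {
    eps <- rnorm(nsample, mean = 0, sd = sqrt(sig2))
    Y   <- beta0 + beta1 * X + eps
    fit <- lm(Y ~ X)
    ## summary(fit)$sigma^2 is RSS / (n - 2), the usual estimate of sigma^2
    est[i, ] <- c(coef(fit), summary(fit)$sigma^2)
  }
  Sxx <- sum((X - mean(X))^2)
  true_var <- c(beta0 = sig2 * (1 / nsample + mean(X)^2 / Sxx),
                beta1 = sig2 / Sxx)
  list(est        = est,
       means      = colMeans(est),
       sample_var = apply(est[, c("beta0", "beta1")], 2, var),
       true_var   = true_var)
}

res <- run_sim(nsample = 10, sig2 = sig2)

## Mean of the estimates next to the true values
print(rbind(truth = c(beta0, beta1, sig2), mean_estimate = res$means))

## Sample variances of the estimates next to the theoretical variances
print(rbind(sample = res$sample_var, theoretical = res$true_var))

## Histograms of the estimates
op <- par(mfrow = c(1, 3))
for (nm in colnames(res$est)) hist(res$est[, nm], main = nm, xlab = "estimate")
par(op)

## Sample variances of the estimates for increasing sample sizes
print(sapply(c(10, 20, 50, 100), function(n) run_sim(n, sig2)$sample_var))
```

Passing a different sig2 to run_sim gives the error-variance comparison in the same way. Because X is regenerated inside each call, results for different sample sizes are based on different designs.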

Note: For steps (e), (f) and (g) you can present the results according to your convenience. For example you can add further plots and tables which you think are going to be useful.
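The comparison of t and z intervals could be set up along the following lines. This is a sketch under one reading of the task: the z interval is assumed to use the estimated standard error with a normal quantile (plugging in the true σ would be another reading). The seed 1 is a placeholder, and cover_t, cover_z and truth are hypothetical names.

```r
## Illustrative sketch only: the seed 1 is a placeholder, not a student ID.
set.seed(1)

beta0 <- rnorm(1, mean = 0, sd = 1)
beta1 <- runif(n = 1, min = 1, max = 3)
sig2  <- rchisq(n = 1, df = 25)
nsample <- 10
n.sim   <- 100
sigX    <- 0.2
X <- rnorm(nsample, mean = 0, sd = sqrt(sigX))

truth <- c(beta0, beta1)
tq <- qt(0.975, df = nsample - 2)   # t quantile with n - 2 degrees of freedom
zq <- qnorm(0.975)                  # standard normal quantile

## One row per simulation; TRUE if the interval contains the true value
cover_t <- cover_z <- matrix(NA, nrow = n.sim, ncol = 2,
                             dimnames = list(NULL, c("beta0", "beta1")))

for (i in seq_len(n.sim)) {
  Y   <- beta0 + beta1 * X + rnorm(nsample, mean = 0, sd = sqrt(sig2))
  tab <- coef(summary(lm(Y ~ X)))   # coefficient table from lm
  est <- tab[, "Estimate"]
  se  <- tab[, "Std. Error"]
  cover_t[i, ] <- (est - tq * se <= truth) & (truth <= est + tq * se)
  ## Assumption: z interval built from the estimated standard error
  cover_z[i, ] <- (est - zq * se <= truth) & (truth <= est + zq * se)
}

## Proportion of intervals containing the true parameter values
print(rbind(t = colMeans(cover_t), z = colMeans(cover_z)))
```

Since the t quantile is larger than the normal quantile for any finite degrees of freedom, each t interval here contains the corresponding z interval, so the t coverage can never fall below the z coverage in this setup.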

Task 2

Assume the following multiple linear regression model:

Y = β0 + β1X1 + β2X2 + β3X3 + ε,   ε ∼ N(0, σ²)

First, simulate a dataset for multiple linear regression. The dataset will consist of one outcome variable (Y) and three predictor variables (X = (X1, X2, X3)). X has to be simulated from a multivariate normal distribution. You can use the following simulation code (note: this is only starter code):

library(MASS)

## Simulation for correlated predictors ##

set.seed("INSERT YOUR STUDENT ID") ## Replace the placeholder with your numeric student ID

nsample <- 10; nsim <- 100

sig2 <- rchisq(1, df = 1) ## The true error variance

bet <- c(rnorm(3, 0, 1), 0) ## 4 values of beta, that is beta0, beta1, beta2, beta3 = 0

muvec <- rnorm(3, 0, 1)

sigmat <- diag(rchisq(3, df = 4))

X <- mvrnorm(nsample, mu = muvec, Sigma = sigmat)

Xmat <- cbind(1, X)

## Simulate the response ##

bets <- matrix(NA, ncol = length(bet), nrow = nsim)

for(i in 1:nsim){

  Y <- Xmat%*%bet + rnorm(nsample, 0, sqrt(sig2))

  model1 <- lm(Y ~ X)

  bets[i,] <- coef(model1)

}

Please change the seed to your student ID.

There are a few things to note about these simulations. β has four values: β0, β1, β2 and β3. You can see that β3 = 0, ie, the third predictor is not linearly related to the response. Here sigmat is the variance-covariance matrix of the predictors X, where the diagonal elements are variances (not standard deviations) and the off-diagonal elements are covariances.
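To make the distinction between variances, covariances and correlations concrete, here is a small sketch that builds such a matrix by hand. The variances and the correlation used are hypothetical values chosen so the arithmetic is easy to follow; they are not the values produced by the assignment code.

```r
## Hypothetical values, chosen only for illustration
vars <- c(4, 1, 9)       # variances of X1, X2, X3 (standard deviations 2, 1, 3)
sigmat <- diag(vars)     # zero off-diagonals: uncorrelated predictors

r12 <- 0.5               # assumed correlation between X1 and X2
## covariance = correlation * sd(X1) * sd(X2) = 0.5 * 2 * 1 = 1
sigmat[1, 2] <- sigmat[2, 1] <- r12 * sqrt(sigmat[1, 1]) * sqrt(sigmat[2, 2])

print(sigmat)            # variance-covariance matrix
print(cov2cor(sigmat))   # the same matrix converted back to correlations
```

cov2cor is a base R function, so it offers a quick check that the off-diagonal entries you set correspond to the correlation you intended.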

1. First assume that the correlations between the three predictors are zero, ie, the off-diagonal elements of sigmat are zero, as in the code provided above. Set the number of simulations to nsim = 100 and the sample size for each simulation to 10. Generate Y for each simulation. Then run a simple linear regression on each of the three predictors separately. Obtain the regression parameter estimates and their variances from the coefficient tables produced by the lm function. Comment on whether the estimators are unbiased: average the parameter estimates and check whether the values are approximately equal to the true values.

2. Now fit a multiple linear regression and obtain the regression parameter estimates along with their variances from each simulation. Again check the unbiasedness and the variances. Compare the results with step (a). Remember – in step (a) we are fitting incorrect models and in step (b) we are fitting the correct model.


3. Now assume X1 and X2 are correlated. You can select a value for the correlation (eg, r12 = 0.8). Then add the following covariance terms to the sigmat matrix:

## The correlation ##

r12 <- 0.2

sigmat[1,2] <- sigmat[2,1] <- r12*sqrt(sigmat[1,1])*sqrt(sigmat[2,2])

## Re-simulate the predictors with the updated covariance matrix ##

set.seed(1002656486)

X <- mvrnorm(nsample, mu = muvec, Sigma = sigmat)

cor(X[,1], X[,2]) ## Sample correlation between X1 and X2

Xmat <- cbind(1, X)

Again run simple linear regressions on each of the predictors and also a multiple linear regression. Compare the results with steps (a) and (b) and comment on the differences/similarities between the results. Start increasing the value of the correlation coefficient r12 (eg, 0.5, 0.7, 0.8 etc.) and again perform steps (a) and (b). How do the estimated values and standard errors of β̂1 and β̂2 change for simple and multiple linear regressions as the correlation changes?

4. Now assume X1 and X2 are uncorrelated, ie, r12 = 0 and sigmat[1,2] = sigmat[2,1] = 0. Instead, X1 and X3 are correlated. Select a value for r13 arbitrarily (eg, r13 = 0.8). Now change the values of sigmat[1,3] and sigmat[3,1] using code similar to the previous step. You can select a high value for the correlation (eg, r13 > 0.5). Recall that the true β3 = 0. Again perform steps (a) and (b). Compare the results with those obtained from step (c) and comment on the differences/similarities. Start increasing the value of the correlation coefficient r13 (eg, 0.6, 0.7, 0.8, 0.9, 0.95 etc.). How do the estimated values and the standard errors of β̂1, β̂2 and β̂3 change for simple and multiple linear regression as the correlation changes?
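The comparisons across correlation settings could be organized with a single helper, as sketched below. This is a minimal sketch under stated assumptions, not a model answer. The helper compare_fits and the names vars, est_s, se_s, est_m and se_m are hypothetical, the seed 1 is a placeholder for your student ID, and the summary shown (mean estimate and mean standard error over the simulations) is just one possible choice.

```r
library(MASS)   # provides mvrnorm

## Illustrative sketch only: the seed 1 is a placeholder, not a student ID.
set.seed(1)

nsample <- 10; nsim <- 100
sig2  <- rchisq(1, df = 1)        # true error variance
bet   <- c(rnorm(3, 0, 1), 0)     # beta0, beta1, beta2, beta3 = 0
muvec <- rnorm(3, 0, 1)
vars  <- rchisq(3, df = 4)        # variances of X1, X2, X3

## Hypothetical helper. Set only one of r12 / r13 to a large value at a time,
## otherwise the covariance matrix may fail to be positive definite.
compare_fits <- function(r12 = 0, r13 = 0) {
  sigmat <- diag(vars)
  sigmat[1, 2] <- sigmat[2, 1] <- r12 * sqrt(vars[1] * vars[2])
  sigmat[1, 3] <- sigmat[3, 1] <- r13 * sqrt(vars[1] * vars[3])
  X <- mvrnorm(nsample, mu = muvec, Sigma = sigmat)
  colnames(X) <- c("X1", "X2", "X3")
  Xmat <- cbind(1, X)

  est_s <- se_s <- est_m <- se_m <-
    matrix(NA, nrow = nsim, ncol = 3, dimnames = list(NULL, colnames(X)))

  for (i in seq_len(nsim)) {
    Y <- drop(Xmat %*% bet) + rnorm(nsample, 0, sqrt(sig2))
    ## Simple linear regression on each predictor separately
    for (j in 1:3) {
      tab <- coef(summary(lm(Y ~ X[, j])))
      est_s[i, j] <- tab[2, "Estimate"]
      se_s[i, j]  <- tab[2, "Std. Error"]
    }
    ## Multiple linear regression on all three predictors
    tab <- coef(summary(lm(Y ~ X)))
    est_m[i, ] <- tab[-1, "Estimate"]
    se_m[i, ]  <- tab[-1, "Std. Error"]
  }

  rbind(truth         = bet[-1],
        simple_mean   = colMeans(est_s),
        simple_se     = colMeans(se_s),
        multiple_mean = colMeans(est_m),
        multiple_se   = colMeans(se_m))
}

print(compare_fits())            # uncorrelated predictors
print(compare_fits(r12 = 0.8))   # X1 and X2 correlated
print(compare_fits(r13 = 0.8))   # X1 and X3 correlated
```

Each call draws a new X, so differences between calls reflect both the change in correlation and the change in design. Looping over a grid of r12 or r13 values and collecting the rows of interest would give the kind of table or plot the task asks for.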

Note: The answers to the tasks are open ended. You don’t necessarily need to show every result. You just need to show the summary statistics or plots from the 100 simulations. How you present your results is up to you and is a subjective choice. You will be marked based on your presentation of the results through your plots, tables and interpretations.