FIT3154 Assignment 1

Due Date: 11:55PM, Sunday, 29/9/2019

Introduction

There are a total of four questions worth 9 + 14 + 15 + 12 = 50 marks in this assignment.

This assignment is worth a total of 20% of your ﬁnal mark, subject to hurdles and any other matters (e.g., late penalties, special consideration, etc.) as speciﬁed in the FIT3154 Unit Guide or elsewhere in the FIT3154 Moodle site (including Faculty of I.T. and Monash University policies).

Students are reminded of the Academic Integrity Awareness Training Tutorial Activity and, in particular, of Monash University’s policies on academic integrity. In submitting this assignment, you acknowledge your awareness of Monash University’s policies on academic integrity and that work is done and submitted in accordance with these policies.

Submission: No ﬁles are to be submitted via e-mail. Correct ﬁles are to be submitted to Moodle. Scans of handwritten answers are acceptable but they must be clean and legible. You must ensure your submission contains answers to the questions in the order they appear in the assignment. Submission must occur before 11:55 PM Thursday, 26th of September, and late submissions will incur penalties as per Faculty of I.T. policies.

1

Question 1 (9 marks)

In this question you will use R to examine the behaviour of two basic and important shrinkage procedures: (i) the hard thresholding method, and (ii) the lasso shrinkage method in the context of the very simple (single) mean model. In particular, you will examine the behaviour of these two estimators under the very simple model: y ∼ N(µ,1) (1) For reference, the least-squares estimate of µ using y is just ˆ µLS(y) = y which has squared-error risk E(µ− ˆ µLS(y))2

= 1.Under this very simple model the estimates of µ produced by the thresholding estimator and lasso estimator are ˆ µTH(y;λ) =y if |y| > λ 0 if |y| < λ=”” and=”” ˆ=”” µl1(y;λ)=”sgn(y)max(|y|−λ,0)” respectively,=”” where=”” λ=””> 0 is the hyperparameter controlling the degree of shrinkage. We are interested in the squared-error estimation risks (average squared errors) of these two estimators under the model (1). Complete the following questions; provide R code in your report for questions 1.1 and 1.2.

1. Write an R function that takes a µ, λ and calculates the squared error risk of the hard thresholding function, i.e., Rµ, ˆ µTH= Ehµ− ˆ µTH(y;λ)2i=Z∞ −∞ p(y|µ,1)µ− ˆ µTH(y;λ)2 dy where y follows the model (1). You should do this using simulation, as generating random samples from the data model (1) is very straightforward. Your function should take a number of simulations to run as an additional argument. [2 marks]

2. Write an R function that takes a µ, λ and a number of simulation iterations, and calculates the approximate squared error risk of the lasso shrinkage estimator, i.e., Rµ, ˆ µL1= Ehµ− ˆ µL1(y;λ)2i=Z∞ −∞ p(y|µ,1)µ− ˆ µL1(y;λ)2 dy As above, this should be done using simulation. [2 marks]

3. Once you have coded these functions, you should produce the squared-error risk curves as µ is varied from (−10, 10) by steps of 0.1 for λ = 1 and λ = 2. Produce plots for each of these settings, and in each plot overlay the squared error risk curves for the hard-thresholding estimator, the lasso estimator and the least-squares estimator. Make sure these plots are neatly formatted with appropriate axis labels. Make sure you run your simulations for suﬃcient iterations to get smooth looking curves. [2 marks]

4. Analyse the resulting plots. Discuss how the hard thresholding estimator compares to the lasso estimator, and how they both compare to the least-squares estimator? What possible advantages and drawbacks do these plots suggest that each of these estimators might have? [3 marks]

2

Question 2 (14 marks)

This question will require you to analyse a regression dataset using Bayesian lasso regression. In particular, you will be looking at predicting the fuel eﬃciency of a car (in kilometers per litre) based on characteristics of the car and its engine. This is clearly an important and useful problem. The dataset fuel2017-20.csv contains n = 2,000 observations on p = 9 predictors obtained from actual fuel eﬃciency tables for car models available for sale during the years 2017 through to 2020. The data dictionary for this dataset is given in Table 1. To analyse this data in a Bayesian fashion you will use the Bayesian lasso as implemented in the ﬁle bayes.lasso.R. This has the same inputs and produces the same output variables as the bayes.cauchy.linreg.R code you used in Studio 6.

1. Fit a regression model to the fuel eﬃciency data using the least-squares method in R (i.e., using lm()). From this output, what variables do you believe are potentially associated with increased fuel eﬃciency, and why? [2 marks]

2. Analyse the fuel eﬃciency data using the Bayesian lasso. Compare posterior means, standard deviations and t-statistics to those obtained using least-squares. How do they compare, and if you believe they are diﬀerent, why do you think they are diﬀerent? [2 marks]

3. From your Bayesian analysis, which variables seem unimportant/unlikely to be associated with fuel eﬃciency? Justify your answer [2 marks]

4. Fit a model using the Bayesian lasso with the unimportant variables removed

(a) Write down your best estimate of the regression equation relating fuel eﬃciency to the explanatory variables. [2 marks] (b) If we wanted to advise a car manufacturer regarding the choice of an engine to have likely good fuel eﬃciency, what does this model suggest we should look for? [2 marks]

5. Imagine that a friend proposes purchasing a particular model of car with the properties given in Table 2 Use your Bayesian model to:

(a) Predict the mean fuel eﬃciency for this car. [1 mark] (b) Provide a histogram of the posterior distribution of the mean fuel eﬃciency for this car, and a 95% credible interval on the plausible range of mean fuel eﬃciency for this car. [1 mark]

6. An engineer suggests they believe there is likely an interaction eﬀect between engine displacement and the number of cylinders. Using the Bayesian lasso, and statistics produced by the Bayesian lasso code, assess whether you think this is the case, and what eﬀect it has on the model? [2 marks]

3

Variable name Description Values

Model.Year Year of sale 2017−2020 Eng.Displacement Engine Displacement (litres, l) 0.9−8.4 No.Cylinders Number of Cylinders 3−16 Aspiration Engine Aspiration (Oxygen intake) N: Naturally∗ OT: Other SC: Supercharged TC: Turbocharged TS: Turbo+supercharged No.Gears Number of Gears 1−10 Lockup.Torque.Converter Lockup torque converter present? N∗ and Y Drive.Sys Drive System 4∗: 4-wheel drive A:All-wheel F:Front-wheel P:Part-time 4-wheel R:Rear-wheel Max.Ethanol Maximum % of Ethanol allowed 10−85 Fuel.Type Type of Fuel G∗: Regular Unleaded GM: Mid-grade Unleaded Recommended GP: Premium Unleaded Recommended GPR: Premium Unleaded Required Comb.FE Fuel Eﬃciency (km/l) 4.974−26.224

Table 1: Fuel eﬃciency data dictionary. The ∗ denotes the reference category for each categorical variable.

Variable Model.Year Eng.Displacement No.Cylinders Aspiration No.Gears Lockup.Torque.Converter Drive.Sys Max.Ethanol Fuel.Type Value 2018 3.5 6 N 8 N R 15 GPR

Table 2: Engine Properties.

some text

4

Question 3 (15 marks)

This question involves some theory regarding estimation of the Poisson distribution. Imagine we have n observations generated from a Poisson distribution y1,…,yn ∼ Poi(λ) with unknown λ and choose to estimate λ using

ˆ λa,b =1 n + b(a + n¯ y) (2)

where ¯ y = (1/n)Pn i=1 yi is the sample mean and a ≥ 0 and b ≥ 0 are user chosen hyperparameters. The bias and variance of the estimator (2) are biasλ(ˆ λa,b) = a−bλ n + b , Vλhˆ λa,bi= nλ (n + b)2 (3) Please answer the following questions. Provide appropriate working/mathematical justiﬁcation.

1. Prove that the bias and variance are equal to (3). [3 marks]

2. Describe the dependency of the bias and variance on the hyperparameters a and b and the population value of λ. [2 marks] 3. Plot the squared-error risk (expected squared-error) for n = 5 and λ ∈ (0,10) for (a = 1,b = 1), (a = 5,b = 1), (a = 5,b = 5) and maximum likelihood. You may plot all curves on the one plot. Describe the key features of the risk curves. [3 marks]

4. Given an a and a b, for which population value of λ does the estimator (2) attain the smallest risk? [2 marks] 5. Is the estimator (2) asymptotically eﬃcient as n →∞? [2 marks] Sometimes the alternative log-rate parameterisation v = logλ is used for the Poisson distribution. This is in some (statistical) ways a more natural parameterisation than the rate λ.

6. Write down the probability mass function of the Poisson distribution parameterised in terms of the log-rate v. [1 mark]

7. Derive the Fisher information Jn(v) for the log-rate parameter v. (hint: remember the observations are independently and identically distributed) [2 marks]

5

Question 4 (12 marks)

It was believed for a long time by medical practitioners that the full moon inﬂuenced the expression of medical conditions including fevers, rheumatism, epilepsy and bipolar disorder – in fact, the antiquated term “lunatic” derives from the word lunar, i.e., of the moon. In the late 1990’s a (tongue in cheek) study was undertaken to test if the full moon induced dogs to become more aggressive, with a resulting increased likelihood of biting people. In addition to being a little bit of fun, examining a problem like this through the lense of data science is an instructive example on how quantitative methods can be used to answer “folk-lore” questions/hypotheses. A dataset was recorded that contained the daily number of admissions to hospital of people being bitten by dogs from 13th of June, 1997 through to 30th of June, 19981. The number of dogbite incidences on full moon days was

y = (0,3,4,3,6,3,9,6,3,5,8,4,1)

The average number of dogbites on the remaining non-full moon days was 4.51, which we can take as being a suﬃciently accurate estimate given the amount of data on which it is based. This question involves using the Poisson distribution to analyse this dogbite data. Let us ﬁrst analyse the data using maximum likelihood and the asymptotic distribution of maximum likelihood estimates. Please answer the following questions. Provide working/justiﬁcations/code as required to support your answers. 1. Fit a Poisson distribution to the dog-bite data above using maximum likelihood. What is the estimated rate of dog-bites on full moon days? Provide an approximate 95% conﬁdence interval for this estimate using the asymptotic distribution of the maximum likelihood estimate. [3 marks]

2. Using the asymptotic distribution of the ML estimate of λ, test the hypothesis that dogs bite more frequently on full moon nights than on non-full moon nights. Clearly specify the hypothesis you are testing and discuss the resulting p-value. [3 marks] Let us now consider a Bayesian analysis of the data using the hierarchy: yi|λ ∼ Poi(λ), i = 1,…,n λ|a,b ∼ Ga(a,b) where Ga(a,b) denotes a gamma distribution with shape a and scale b. After consulting a number of physicians at emergency wards, we ﬁnd a reasonable guess at the average number of dog bite incidences on any given day to be 5, with a 95% prior probability of the average number of daily dog bite incidences ranging anywhere from 1 to 14.

3. Select appropriate values of a and b for our gamma prior distribution to (approximately) encode the speciﬁed information (hint: use the pgamma() function). [1 mark]

4. Using this prior information and the provided data, write down the posterior distribution for λ (i.e., the estimate of the average number of dog bite incidences on a full moon day). [1 mark]

5. Plot the prior and posterior distributions of λ (hint: use the dgamma() function). [1 mark]

6. Report on the posterior mean and 95% credible intervals. How do these compare to our maximum likelihood estimates? (hint: use the qgamma() function) [2 marks]

7. Do you think the Bayesian analysis supports the hypothesis that dogs bite more frequently on full moon days than non-full moon days? [1 mark]

1Data source is taken from the Australian Institute of Health and Welfare Database of Aus