ECON 4041H

Research Methodology

Polynomial Predictors

Byron Lew

Department of Economics

Trent University

2023

©Byron Lew, 2023

Authorization is given to students enrolled in the course to reproduce this material exclusively for their own personal use.

Byron Lew ECON 4041H { Research Methodology in Economics 1 / 53

Learning Objectives

quadratic, cubic and potentially higher-order polynomial

functions of independent variables

data transformation to accommodate scale effect when using

high-order polynomials

use of margins to calculate partial derivatives

awareness of how functional form forces certain constrained fit

to data

use of fractional polynomials

Byron Lew ECON 4041H { Research Methodology in Economics 2 / 53

Implementing and interpreting higher-order terms

Relationship between many economic phenomenon are often

non-linear

Income in particular tends to be affected non-linearly

Example: income by age

path <- ’H:/Classes/4041/data/’

gss <- readRDS(paste0(path, ’gss.rds’))

gss <- subset(gss, age <= 80)

base <- ggplot(data=gss, aes(x=age, y=realrinc)) +

theme_bw() + labs(x = ’age’, y = ’real income ($1,000)’)

base + geom_jitter(size = 0.5) +

scale_y_continuous(labels = seq(0,500,100))

Byron Lew ECON 4041H { Research Methodology in Economics 3 / 53

Income by age

0

100

200

300

400

500

20 40 60 80

age

real income ($1,000)

Byron Lew ECON 4041H { Research Methodology in Economics 4 / 53

Test for functional form on income data

Fit a lowess curve to the data to see what shape it reveals

First graph is the lowess curve on top of the scatterplot

base + geom_jitter(size = 0.5) +

geom_smooth(method = ’loess’) +

scale_y_continuous(labels = seq(0,500,100))

Then to adjust for scale, second graph is only the lowess fitted

curve

base + geom_smooth(method = ’loess’) +

scale_y_continuous(breaks = seq(5000,30000,5000),

labels = seq(5, 30, 5))

Byron Lew ECON 4041H { Research Methodology in Economics 5 / 53

Income by age: lowess on Scatter

Byron Lew ECON 4041H { Research Methodology in Economics 6 / 53

Income by age: lowess function plot

Byron Lew ECON 4041H { Research Methodology in Economics 7 / 53

Lowess curves

Fit looks remarkably similar to a quadratic, and no particular

form was given to the loess() function

One problem with lowess fits is the computing power required:

can be memory intensive and slow

Solution: use random sample of dataset to reduce size and speed

up lowess

Example: generate random sample of 5000 observations

gssSamp <- gss[sample(nrow(gss), 5000, replace = FALSE),]

Use gssSamp” in plot commands

Byron Lew ECON 4041H { Research Methodology in Economics 8 / 53

Implementing and interpreting higher-order terms

A common estimator is of the form y = β0 + β1x + β2x2

If the sign on the second-order term coefficient β2 is negative,

then we have a quadratic with a maximum at some midpoint

The slope of this function is defined as @@yx = β1 + 2β2x

If we insert a value of x in which we are interested, we get the

value of the slope of the function

We can even solve for the point that maximizes the function by

setting the slope function = 0, yielding x = –1 2 β β1 2

Byron Lew ECON 4041H { Research Methodology in Economics 9 / 53

Estimate parameters for income by age

modInc2 <- lm(realrinc~age+I(age^2)+female, data=gss)

age 2,412.34∗∗∗

(64.36)

I(age^2) –24.20∗∗∗

(0.73)

femaleFemale –12,419.24∗∗∗

(283.70)

Constant –25,679.77∗∗∗

(1,326.66)

Observations | 32,100 |

R2 | 0.11 |

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

sign on quadratic coefficient is negative so the function rises to a

maximum

Byron Lew ECON 4041H { Research Methodology in Economics 10 / 53

Example: income by age

common to use quadratic for income, but note that the

functional form does put some structure on the results

for example, what if income rose, but then flattened out rather

than declined, to be addressed later

graph our predicted income by age using emmeans

library(emmeans)

modInc2.em <- summary(emmeans(modInc2, ~age,

at=list(age=seq(18,80,2))))

this will generate mean predicted income for each value of age

and save it in object modInc2.em”, to be passed to ggplot()

ggplot(modInc2.em, aes(x=age, y=emmean)) + geom_line() +

geom_ribbon(aes(ymax=upper.CL, ymin=lower.CL), alpha=0.2) +

labs(y=’income ($1,000)’, x=’age’) + theme_bw() +

scale_y_continuous(breaks = seq(5000, 30000, 5000),

labels = seq(5, 30, 5)) +

scale_x_continuous(breaks = seq(20,80,5))

plot on next slide . . .

Byron Lew ECON 4041H { Research Methodology in Economics 11 / 53

Income by age, quadratic estimate

5

10

15

20

25

20 25 30 35 40 45 50 55 60 65 70 75 80

age

income ($1,000)

Byron Lew ECON 4041H { Research Methodology in Economics 12 / 53

Example: income by age

calculate maximum point: age = –1 2 ββage

age2

variable.names(modInc2) #check coefficient names

[1] “(Intercept)” “age” “I(age^2)” “femaleFemale”

ageStar <- –0.5*coef(modInc2)[’age’]/coef(modInc2)[’I(age^2)’]

ageStar

age

49.83769

is the age at which income is maximized

Byron Lew ECON 4041H { Research Methodology in Economics 13 / 53

Marginal effects

we can use the R function margins() to find the value of age

where the derivative of the estimated function is 0

library(margins)

modInc2.me <- summary(margins(modInc2, variables = ’age’,

at=list(age=seq(20,80,10))))

age AME SE z p

20 1444.26 35.81 40.33 0.0000

30 960.22 22.25 43.15 0.0000

40 476.18 12.00 39.68 0.0000

50 -7.86 14.55 -0.54 0.5891

60 -491.90 27.04 -18.19 0.0000

70 -975.93 40.80 -23.92 0.0000

80 -1459.97 54.65 -26.71 0.0000

the function margins() takes the derivative of the model with

respect to each variable listed (and all variables if option

variables =” not included)

Byron Lew ECON 4041H { Research Methodology in Economics 14 / 53

Marginal effects plot

Plotting the marginal values, with confidence interval band,

makes it easier to see what is going on

ggplot(modInc2.me, aes(x = age, y = AME)) +

geom_line() +

geom_ribbon(aes(ymax = upper, ymin = lower), alpha = 0.2) +

labs(y = ’marginal effect ($)’, x = ’age’) +

geom_hline(yintercept = 0, linewidth = 0.2) + theme_bw()

Byron Lew ECON 4041H { Research Methodology in Economics 15 / 53

Marginal effect of age in quadratic income model

-1000

0

1000

20 40 60 80

age

marginal effect ($)

Byron Lew ECON 4041H { Research Methodology in Economics 16 / 53

Marginal effects function

Judging by the graph we can see that the line crosses y=0

around age 50

Note also that the value of AME for age=50 was not

statistically signficant, ie. indistinguishable from 0

We can find the value of the marginal effects function at ageStar

summary(margins(modInc2, variables = ’age’,

at=list(age=ageStar)))

factor age AME SE z p lower upper

age 49.8377 0.0000 14.4320 0.0000 1.0000 -28.2862 28.2862

Byron Lew ECON 4041H { Research Methodology in Economics 17 / 53

Marginal effects function

Notice the following attributes

1 Crosses age” axis at value just under 50: point of max income

2 Downward sloping: increases in income are decreasing with age

3 Slope is the decrease in the rate of increase in income per year

of age

4 Slope is constant: determined by specifying quadratic function

of age ! income

F Recall, quadratic relationship chosen from lowess

Byron Lew ECON 4041H { Research Methodology in Economics 18 / 53

Introduction to calculus: quadratic function and its

first derivative

Byron Lew ECON 4041H { Research Methodology in Economics 19 / 53

Use of cubic functions

Useful when relationship displays two turns”

demonstrate on fertility data

gss <- readRDS(paste0(path, ’gss.rds’))

gss <- gss[gss$age >= 45 & gss$age <= 55 & gss$yrborn >= 1920

& gss$yrborn <=1960 & gss$female == ’Female’, ]

illustrate pattern by graphing lowess fit

base <- ggplot(data = gss, aes(x = yrborn, y = children)) +

theme_bw() +

labs(x = ’year of birth’, y = ’number of children born’)

base + geom_jitter(size = 0.5) + geom_smooth(method = ’loess’)

Byron Lew ECON 4041H { Research Methodology in Economics 20 / 53

Number of children born by mother’s year of birth

8 6 4 2 0

1920 1930 1940 1950 1960

year of birth

number of children born

Byron Lew ECON 4041H { Research Methodology in Economics 21 / 53

Number of children born by mother’s year of birth

2.0

2.5

3.0

1920 1930 1940 1950 1960

year of birth

number of children born

Byron Lew ECON 4041H { Research Methodology in Economics 22 / 53

Plot of estimated marginal means from indicator

function

Another method of examining the relationship between two

variables

Run regression on independent variable yrborn as factor, then

plot emmeans

modYrbornf <- lm(children ~ factor(yrborn), data = gss)

plt <- summary(emmeans(modYrbornf, ~yrborn))

ggplot(plt, aes(x = yrborn, y = emmean)) + geom_line() +

geom_errorbar(aes(ymax = upper.CL, ymin = lower.CL),

size = 0.15) + theme_bw() +

labs(y = ’number of children born’,

x = ’mothern’s year of birth’) +

scale_x_continuous(breaks = seq(1920,1960,5))

Byron Lew ECON 4041H { Research Methodology in Economics 23 / 53

Emmeans from indep var as indicator

1.5

2.0

2.5

3.0

3.5

1920 1925 1930 1935 1940 1945 1950 1955 1960

mother’s year of birth

number of children born

Byron Lew ECON 4041H { Research Methodology in Economics 24 / 53

Cubic function?

Lowess graph looked cubic

Indicator variable plot has cubic-like appearance

But not clear that pattern for year-of-birth after 1945 is

I increasing (turn)

or

I flat

Important to note because cubic structure imposes turn

We will test statistically whether cubic fits data

Byron Lew ECON 4041H { Research Methodology in Economics 25 / 53

Cubic function: model estimate

Fit cubic function to estimate the relationship

modYrb3 <- lm(children ~ yrborn + I(yrborn^2) +

I(yrborn^3), data = gss)

yrborn 2,364.3660∗∗∗

(229.4207)

I(yrborn^2) –1.2180∗∗∗

(0.1182)

I(yrborn^3) 0.0002∗∗∗

(0.00002)

Constant –1,529,809.0000∗∗∗

(148,401.8000)

Observations | 5,049 |

R2 | 0.0903 |

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Byron Lew ECON 4041H { Research Methodology in Economics 26 / 53

Cubic function: scale

notice the scale of the variable

apply(model.matrix(modYrb3),2,min)

(Intercept) 1 |
yrborn I(yrborn^2) I(yrborn^3) | |

1920 | 3686400 | 7077888000 |

apply(model.matrix(modYrb3),2,max) | ||

(Intercept) | yrborn I(yrborn^2) I(yrborn^3) |

1 1960 3841600 7529536000

I yrborn: range is 1920 { 1960

I yrborn2: range is 3,686,400 { 3,841,600

I yrborn3: range is 7:078 × 109 { 7:530 × 109

solution: make number smaller by subtracting 1940

already provided in variable yrborn40

Byron Lew ECON 4041H { Research Methodology in Economics 27 / 53

Cubic function: rescaled estimate

modYrb3a <- lm(children ~ yrborn40 + I(yrborn40^2) +

I(yrborn40^3), data = gss)

yrborn40 –0.0909∗∗∗

(0.0052)

I(yrborn40^2) –0.0008∗∗∗

(0.0002)

I(yrborn40^3) 0.0002∗∗∗

(0.00002)

Constant 2.7414∗∗∗

(0.0372)

Observations 5,049

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

notice the size of the coefficients on the higher order term

Byron Lew ECON 4041H { Research Methodology in Economics 28 / 53

Cubic function: rescaled estimate

useful trick is to scale up parameters by scaling down the

covariates

gss$yrb40s <- gss$yrborn40/10

shrinking the covariates means increasing the size of the

estimated coefficients, and vice versa

Byron Lew ECON 4041H { Research Methodology in Economics 29 / 53

Cubic function: rescaled twice

modYrb3b <- lm(children ~ yrb40s + I(yrb40s^2) +

I(yrb40s^3), data = gss)

yrborn40 –0.0909∗∗∗

(0.0052)

I(yrborn40^2) –0.0008∗∗∗

(0.0002)

I(yrborn40^3) 0.0002∗∗∗

(0.00002)

yrb40s –0.9087∗∗∗

(0.0524)

I(yrb40s^2) –0.0775∗∗∗

(0.0212)

I(yrb40s^3) 0.2091∗∗∗

(0.0203)

Constant 2.7414∗∗∗ 2.7414∗∗∗

(0.0372) (0.0372)

Observations 5,049 5,049

Note: Byron Lew ∗p<0.1; ECON 4041H { Research Methodology in Economics ∗∗p<0.05; ∗∗∗p<0.01 30 / 53

Reading model with transformations

easier to read, and shows more digits

numbers are same, but scaled by transformation

I coef on linear term is now 10× larger

I coef on quadratic term is now 102× larger

I coef on cubic term is now 103× larger

just remember that yrborn has been transformed, particularly if

using margins command

Byron Lew ECON 4041H { Research Methodology in Economics 31 / 53

Detour: viewing output

When we want to compare several models, and to save output,

we need a tool to combine results and save to file

stargazer” is a useful and easy-to-use output summary tool

library(stargazer)

stargazer(modYrb3, modYrb3a, modYrb3b, type =’text’,

out=’yearBorn.txt’)

This will save results to files ‘yearBorn.txt’ which can be opened

in Word or other wordprocessor

Results also appear in R window

In our example, because the variable names have changed, the

results don’t line up, but we could edit file

Note: stargazer” supports three output formats: text, html and

latex. Text and html formats can be opened in a wordprocessor.

Use latex” format when including stargarzer() in RMarkdown

files

Byron Lew ECON 4041H { Research Methodology in Economics 32 / 53

Predicted means

use emmeans to generate plot of functional form

Note: if using final transformed function, inputs (at list) must be

transformed as well

plt <- summary(emmeans(modYrb3b, ~yrb40s,

at = list(yrb40s = seq(–2,2,0.2))))

ggplot(plt, aes(x = yrb40s, y = emmean)) +

geom_line() + theme_bw() +

geom_ribbon(aes(ymax=upper.CL, ymin=lower.CL), alpha = 0.2) +

labs(y = ’predicted number of children’,

x = ’mothern’s year of birth’) +

scale_x_continuous(breaks = seq(–2,2,0.5),

labels = seq(1920,1960,5))

why do we use range of -2 to 2?

plot on next slide . . .

Byron Lew ECON 4041H { Research Methodology in Economics 33 / 53

Cubic fit of fertility on year of birth

2.0

2.5

3.0

3.5

1920 1925 1930 1935 1940 1945 1950 1955 1960

mother’s year of birth

predicted number of children

Byron Lew ECON 4041H { Research Methodology in Economics 34 / 53

Slope of cubic function

Economists like to know the marginal response, the change in y

given a change in x, @@yx

Let’s ask what is the change in expected number of children as

year of birth changes

We want the first derivative of our function with respect to year

of birth

children = β0 + β1yearBirth + β2yearBirth2 + β3yearBirth3

first derivative: β1 + 2β2yearBirth + 3β3yearBirth2

again, we can estimate this the long way, but calculating the

functional form of the first derivative, or the easy (margins) way

code on next slide . . .

Byron Lew ECON 4041H { Research Methodology in Economics 35 / 53

Derivative of cubic function: code

modYrb3.me <- summary(margins(modYrb3b,

variables=’yrb40s’,

at = list(yrb40s = seq(–2,2,0.2))))

ggplot(modYrb3.me, aes(x=yrb40s, y=AME)) + geom_line() +

geom_ribbon(aes(ymax=upper, ymin=lower), alpha = 0.2) +

labs(y=’marginal effect on number of children’,

x =“mother’s year of birth”) +

geom_hline(yintercept=0, size=0.2) + theme_bw() +

scale_x_continuous(breaks=seq(–2,2,1),

labels = seq(1920,1960,10))

Byron Lew ECON 4041H { Research Methodology in Economics 36 / 53

Derivative of cubic function: plot

-1

2 1 0

1920 1930 1940 1950 1960

mother’s year of birth

marginal effect on number of children

The line y=0 illustrates the points where the first derivative is zero,

points of min or max of the original function.

Byron Lew ECON 4041H { Research Methodology in Economics 37 / 53

Words of caution

we fit a particular functional form: quadratic, cubic

by implication we are forcing a fit on our data

this fit may or may not be appropriate

let’s review by comparing our lowess plot with our plot of

predicted means

Byron Lew ECON 4041H { Research Methodology in Economics 38 / 53

Lowess plot (1st) vs. cubic plot of emmeans (2nd)

Byron Lew ECON 4041H { Research Methodology in Economics 39 / 53

Comparing cubic to lowess

both curves look similar (jaggedness on predicted means curve

due to range specification)

on curve of predicted means, the slope of each quadratic section

are defined to be the same by choice of functional form

Lowess curve has extended linear segment from 1938{1948

looks more like three curves spliced together: quadratic, line,

quadratic

But by fitting a cubic, we are stuck with the math and cannot

have a linear portion in middle

In practice, quadratics are quite common but cubics less common

Problems with functional form estimators

I impose symmetry

I may be wildly inaccurate out-of-sample

Several solutions: we look at fractional polynomials next

Byron Lew ECON 4041H { Research Methodology in Economics 40 / 53

Fractional polynomials

Test of educational attainment by age from gss data

gss <- readRDS(paste0(path, ’gss.rds’))

gss <- gss[gss$age <= 80,]

Test shape first using age as indicator

modlmf <- lm(educ ~ factor(age), data = gss)

Graph predicted means to visualize functional form

pltLmf <- summary(emmeans(modlmf, ~age))

ggplot(pltLmf, aes(x=age, y=emmean)) + geom_line() +

geom_ribbon(aes(ymax=upper.CL, ymin=lower.CL), alpha=0.2) +

labs(y = ’predicted years of education’, x = ’age’) +

theme_bw() +

scale_x_continuous(breaks = seq(20,80,5))

Byron Lew ECON 4041H { Research Methodology in Economics 41 / 53

Predicted education from indicator variable of age

11

12

13

20 25 30 35 40 45 50 55 60 65 70 75 80

age

predicted years of education

Byron Lew ECON 4041H { Research Methodology in Economics 42 / 53

Estimate education as quadratic function of age

Looks like a quadratic, but with substantial asymmetry

Estimate as quadratic

modlm2 <- lm(educ ~ age + I(age^2), data = gss)

pltModlm2 <- summary(emmeans(modlm2, ~age,

at = list(age = seq(18,80,1))))

ggplot(pltModlm2, aes(x=age, y=emmean)) + geom_line() +

geom_ribbon(aes(ymax=upper.CL, ymin=lower.CL), alpha=0.2) +

labs(y = ’predicted years of education’, x = ’age’) +

theme_bw() + scale_x_continuous(breaks=seq(20,80,5))

next slide . . .

Byron Lew ECON 4041H { Research Methodology in Economics 43 / 53

Predicted education from quadratic function of age

10

11

12

13

20 25 30 35 40 45 50 55 60 65 70 75 80

age

predicted years of education

Byron Lew ECON 4041H { Research Methodology in Economics 44 / 53

Flexible functional form: fractional polynomials

For curves with asymmetries, a better fit requires functions with

polynomials to fractional powers, powers < 1.

General form of fractional polynomials: f (x) = axp1 + bxp2

where p1 and p2 are chosen from f-2, -1, -0.5, 0, 0.5, 1, 2, 3g

and power 0 represents ln(x)

the coefficients a and b could be negative or positive

See .r file for demonstration of different functional forms

R implements this estimator with function mfp()

Fits a general family of polynomials of fractional powers

Byron Lew ECON 4041H { Research Methodology in Economics 45 / 53

Fractional polynomials

Estimate as fractional polynomial

library(mfp)

modFp <- mfp(educ ~ fp(age), data = gss)

Estimate Std. Error t value Pr(>jtj)

(Intercept) 18.0895 0.0984 183.8 0.0000

I((age/100)^-2) -0.1655 0.0040 -40.9 0.0000

I((age/100)^1) -9.0247 0.1558 -57.9 0.0000

f (x) = axp1 + bxp2

so a = –0:1655; b = –9:0247; p1 = –2 and p2 = 1

Also notice that the covariates have been scaled, here by

dividing by 100

and there is a constant (intercept) term

Byron Lew ECON 4041H { Research Methodology in Economics 46 / 53

Fractional polynomials

With model estimate graph estimated marginal means

We send emmeans() the modFp$fit object

pltModFp <- summary(emmeans(modFp$fit, ~age,

at = list(age = seq(18,80,1)),

weight = ’proportional’))

ggplot(pltModFp, aes(x=age, y=emmean)) + geom_line() +

geom_ribbon(aes(ymax = upper.CL, ymin = lower.CL), alpha=0.2)

labs(y = ’predicted years of education’, x = ’age’) +

theme_bw() + scale_x_continuous(breaks = seq(20,80,5))

Note: ymin and ymax objects in geom ribbon names changed

for mfp()

Byron Lew ECON 4041H { Research Methodology in Economics 47 / 53

Fractional polynomials

11

12

13

20 25 30 35 40 45 50 55 60 65 70 75 80

age

predicted years of education

Byron Lew ECON 4041H { Research Methodology in Economics 48 / 53

Fractional polynomials{marginal effects

modFpFit.me <- summary(margins(modFp$fit, variables = ’age’,

at=list(age = seq(20,80,10))))

ggplot(modFpFit.me, aes(x=age, y=AME)) + geom_line() +

geom_ribbon(aes(ymax=upper, ymin=lower), alpha=0.2) +

labs(y=’predicted years of education’, x=’age’) +

theme_bw() + scale_x_continuous(breaks=seq(20,80,5))

Byron Lew ECON 4041H { Research Methodology in Economics 49 / 53

Fractional polynomials{marginal effects

-0.1

0.0

0.1

0.2

0.3

20 25 30 35 40 45 50 55 60 65 70 75 80

age

predicted years of education

Byron Lew ECON 4041H { Research Methodology in Economics 50 / 53

Compare all three functions: plot

ggplot(gss, aes(x=age, y=educ)) +

geom_line(data=pltLmf, aes(x=age, y=emmean, colour=’factor’),

size=0.5) +

geom_ribbon(data=pltLmf, aes(ymax=upper.CL, ymin=lower.CL),

alpha=0.1, show.legend=F, fill=’black’) +

geom_point(data=pltLmf, size =0.75) +

geom_line(data=pltModlm2, aes(colour=’quadratic’), linewidth=0.5) +

geom_ribbon(data=pltModlm2, aes(ymax=upper.CL, ymin=lower.CL),

alpha=0.2, fill=’blue’, show.legend=F) +

geom_line(data=pltModFp, aes(colour=’fracpoly’), size=0.5) +

geom_ribbon(data=pltModFp, aes(ymax=upper.CL, ymin=lower.CL),

alpha=0.2, fill=’red’, show.legend=F) +

theme_bw() + labs(y=’predicted years of education’, x=’age’) +

scale_x_continuous(breaks=seq(20,80,5)) +

scale_colour_manual(name=’function’, values=c(’factor’=’black’,

’quadratic’=’blue’, ’fracpoly’=’red’))

Byron Lew ECON 4041H { Research Methodology in Economics 51 / 53

Fit of three estimators

10

11

12

13

20 25 30 35 40 45 50 55 60 65 70 75 80

age

predicted years of education

function

factor

fracpoly

quadratic

Byron Lew ECON 4041H { Research Methodology in Economics 52 / 53

In this lecture we have studied:

quadratic, cubic and potentially higher-order polynomial

functions of independent variables

data transformation to accommodate scale effect when using

high-order polynomials

use of margins to calculate partial derivatives

awareness of how functional form forces certain constrained fit

to data

use of fractional polynomials

Byron Lew ECON 4041H { Research Methodology in Economics 53 / 53