ECON 4041H
Research Methodology
Polynomial Predictors
Byron Lew
Department of Economics
Trent University
2023
©Byron Lew, 2023
Authorization is given to students enrolled in the course to reproduce this material exclusively for their own personal use.
Byron Lew ECON 4041H { Research Methodology in Economics 1 / 53
Learning Objectives
quadratic, cubic and potentially higher-order polynomial
functions of independent variables
data transformation to accommodate scale effect when using
high-order polynomials
use of margins to calculate partial derivatives
awareness of how functional form forces certain constrained fit
to data
use of fractional polynomials
Byron Lew ECON 4041H { Research Methodology in Economics 2 / 53
Implementing and interpreting higher-order terms
Relationship between many economic phenomenon are often
non-linear
Income in particular tends to be affected non-linearly
Example: income by age
path <- ’H:/Classes/4041/data/’
gss <- readRDS(paste0(path, ’gss.rds’))
gss <- subset(gss, age <= 80)
base <- ggplot(data=gss, aes(x=age, y=realrinc)) +
theme_bw() + labs(x = ’age’, y = ’real income ($1,000)’)
base + geom_jitter(size = 0.5) +
scale_y_continuous(labels = seq(0,500,100))
Byron Lew ECON 4041H { Research Methodology in Economics 3 / 53
Income by age
0
100
200
300
400
500
20 40 60 80
age
real income ($1,000)
Byron Lew ECON 4041H { Research Methodology in Economics 4 / 53
Test for functional form on income data
Fit a lowess curve to the data to see what shape it reveals
First graph is the lowess curve on top of the scatterplot
base + geom_jitter(size = 0.5) +
geom_smooth(method = ’loess’) +
scale_y_continuous(labels = seq(0,500,100))
Then to adjust for scale, second graph is only the lowess fitted
curve
base + geom_smooth(method = ’loess’) +
scale_y_continuous(breaks = seq(5000,30000,5000),
labels = seq(5, 30, 5))
Byron Lew ECON 4041H { Research Methodology in Economics 5 / 53
Income by age: lowess on Scatter
Byron Lew ECON 4041H { Research Methodology in Economics 6 / 53
Income by age: lowess function plot
Byron Lew ECON 4041H { Research Methodology in Economics 7 / 53
Lowess curves
Fit looks remarkably similar to a quadratic, and no particular
form was given to the loess() function
One problem with lowess fits is the computing power required:
can be memory intensive and slow
Solution: use random sample of dataset to reduce size and speed
up lowess
Example: generate random sample of 5000 observations
gssSamp <- gss[sample(nrow(gss), 5000, replace = FALSE),]
Use gssSamp” in plot commands
Byron Lew ECON 4041H { Research Methodology in Economics 8 / 53
Implementing and interpreting higher-order terms
A common estimator is of the form y = β0 + β1x + β2x2
If the sign on the second-order term coefficient β2 is negative,
then we have a quadratic with a maximum at some midpoint
The slope of this function is defined as @@yx = β1 + 2β2x
If we insert a value of x in which we are interested, we get the
value of the slope of the function
We can even solve for the point that maximizes the function by
setting the slope function = 0, yielding x = –1 2 β β1 2
Byron Lew ECON 4041H { Research Methodology in Economics 9 / 53
Estimate parameters for income by age
modInc2 <- lm(realrinc~age+I(age^2)+female, data=gss)
age 2,412.34∗∗∗
(64.36)
I(age^2) –24.20∗∗∗
(0.73)
femaleFemale –12,419.24∗∗∗
(283.70)
Constant –25,679.77∗∗∗
(1,326.66)
Observations | 32,100 |
R2 | 0.11 |
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
sign on quadratic coefficient is negative so the function rises to a
maximum
Byron Lew ECON 4041H { Research Methodology in Economics 10 / 53
Example: income by age
common to use quadratic for income, but note that the
functional form does put some structure on the results
for example, what if income rose, but then flattened out rather
than declined, to be addressed later
graph our predicted income by age using emmeans
library(emmeans)
modInc2.em <- summary(emmeans(modInc2, ~age,
at=list(age=seq(18,80,2))))
this will generate mean predicted income for each value of age
and save it in object modInc2.em”, to be passed to ggplot()
ggplot(modInc2.em, aes(x=age, y=emmean)) + geom_line() +
geom_ribbon(aes(ymax=upper.CL, ymin=lower.CL), alpha=0.2) +
labs(y=’income ($1,000)’, x=’age’) + theme_bw() +
scale_y_continuous(breaks = seq(5000, 30000, 5000),
labels = seq(5, 30, 5)) +
scale_x_continuous(breaks = seq(20,80,5))
plot on next slide . . .
Byron Lew ECON 4041H { Research Methodology in Economics 11 / 53
Income by age, quadratic estimate
5
10
15
20
25
20 25 30 35 40 45 50 55 60 65 70 75 80
age
income ($1,000)
Byron Lew ECON 4041H { Research Methodology in Economics 12 / 53
Example: income by age
calculate maximum point: age = –1 2 ββage
age2
variable.names(modInc2) #check coefficient names
[1] “(Intercept)” “age” “I(age^2)” “femaleFemale”
ageStar <- –0.5*coef(modInc2)[’age’]/coef(modInc2)[’I(age^2)’]
ageStar
age
49.83769
is the age at which income is maximized
Byron Lew ECON 4041H { Research Methodology in Economics 13 / 53
Marginal effects
we can use the R function margins() to find the value of age
where the derivative of the estimated function is 0
library(margins)
modInc2.me <- summary(margins(modInc2, variables = ’age’,
at=list(age=seq(20,80,10))))
age AME SE z p
20 1444.26 35.81 40.33 0.0000
30 960.22 22.25 43.15 0.0000
40 476.18 12.00 39.68 0.0000
50 -7.86 14.55 -0.54 0.5891
60 -491.90 27.04 -18.19 0.0000
70 -975.93 40.80 -23.92 0.0000
80 -1459.97 54.65 -26.71 0.0000
the function margins() takes the derivative of the model with
respect to each variable listed (and all variables if option
variables =” not included)
Byron Lew ECON 4041H { Research Methodology in Economics 14 / 53
Marginal effects plot
Plotting the marginal values, with confidence interval band,
makes it easier to see what is going on
ggplot(modInc2.me, aes(x = age, y = AME)) +
geom_line() +
geom_ribbon(aes(ymax = upper, ymin = lower), alpha = 0.2) +
labs(y = ’marginal effect ($)’, x = ’age’) +
geom_hline(yintercept = 0, linewidth = 0.2) + theme_bw()
Byron Lew ECON 4041H { Research Methodology in Economics 15 / 53
Marginal effect of age in quadratic income model
-1000
0
1000
20 40 60 80
age
marginal effect ($)
Byron Lew ECON 4041H { Research Methodology in Economics 16 / 53
Marginal effects function
Judging by the graph we can see that the line crosses y=0
around age 50
Note also that the value of AME for age=50 was not
statistically signficant, ie. indistinguishable from 0
We can find the value of the marginal effects function at ageStar
summary(margins(modInc2, variables = ’age’,
at=list(age=ageStar)))
factor age AME SE z p lower upper
age 49.8377 0.0000 14.4320 0.0000 1.0000 -28.2862 28.2862
Byron Lew ECON 4041H { Research Methodology in Economics 17 / 53
Marginal effects function
Notice the following attributes
1 Crosses age” axis at value just under 50: point of max income
2 Downward sloping: increases in income are decreasing with age
3 Slope is the decrease in the rate of increase in income per year
of age
4 Slope is constant: determined by specifying quadratic function
of age ! income
F Recall, quadratic relationship chosen from lowess
Byron Lew ECON 4041H { Research Methodology in Economics 18 / 53
Introduction to calculus: quadratic function and its
first derivative
Byron Lew ECON 4041H { Research Methodology in Economics 19 / 53
Use of cubic functions
Useful when relationship displays two turns”
demonstrate on fertility data
gss <- readRDS(paste0(path, ’gss.rds’))
gss <- gss[gss$age >= 45 & gss$age <= 55 & gss$yrborn >= 1920
& gss$yrborn <=1960 & gss$female == ’Female’, ]
illustrate pattern by graphing lowess fit
base <- ggplot(data = gss, aes(x = yrborn, y = children)) +
theme_bw() +
labs(x = ’year of birth’, y = ’number of children born’)
base + geom_jitter(size = 0.5) + geom_smooth(method = ’loess’)
Byron Lew ECON 4041H { Research Methodology in Economics 20 / 53
Number of children born by mother’s year of birth
8 6 4 2 0
1920 1930 1940 1950 1960
year of birth
number of children born
Byron Lew ECON 4041H { Research Methodology in Economics 21 / 53
Number of children born by mother’s year of birth
2.0
2.5
3.0
1920 1930 1940 1950 1960
year of birth
number of children born
Byron Lew ECON 4041H { Research Methodology in Economics 22 / 53
Plot of estimated marginal means from indicator
function
Another method of examining the relationship between two
variables
Run regression on independent variable yrborn as factor, then
plot emmeans
modYrbornf <- lm(children ~ factor(yrborn), data = gss)
plt <- summary(emmeans(modYrbornf, ~yrborn))
ggplot(plt, aes(x = yrborn, y = emmean)) + geom_line() +
geom_errorbar(aes(ymax = upper.CL, ymin = lower.CL),
size = 0.15) + theme_bw() +
labs(y = ’number of children born’,
x = ’mothern’s year of birth’) +
scale_x_continuous(breaks = seq(1920,1960,5))
Byron Lew ECON 4041H { Research Methodology in Economics 23 / 53
Emmeans from indep var as indicator
1.5
2.0
2.5
3.0
3.5
1920 1925 1930 1935 1940 1945 1950 1955 1960
mother’s year of birth
number of children born
Byron Lew ECON 4041H { Research Methodology in Economics 24 / 53
Cubic function?
Lowess graph looked cubic
Indicator variable plot has cubic-like appearance
But not clear that pattern for year-of-birth after 1945 is
I increasing (turn)
or
I flat
Important to note because cubic structure imposes turn
We will test statistically whether cubic fits data
Byron Lew ECON 4041H { Research Methodology in Economics 25 / 53
Cubic function: model estimate
Fit cubic function to estimate the relationship
modYrb3 <- lm(children ~ yrborn + I(yrborn^2) +
I(yrborn^3), data = gss)
yrborn 2,364.3660∗∗∗
(229.4207)
I(yrborn^2) –1.2180∗∗∗
(0.1182)
I(yrborn^3) 0.0002∗∗∗
(0.00002)
Constant –1,529,809.0000∗∗∗
(148,401.8000)
Observations | 5,049 |
R2 | 0.0903 |
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
Byron Lew ECON 4041H { Research Methodology in Economics 26 / 53
Cubic function: scale
notice the scale of the variable
apply(model.matrix(modYrb3),2,min)
(Intercept) 1 |
yrborn I(yrborn^2) I(yrborn^3) | |
1920 | 3686400 | 7077888000 |
apply(model.matrix(modYrb3),2,max) | ||
(Intercept) | yrborn I(yrborn^2) I(yrborn^3) |
1 1960 3841600 7529536000
I yrborn: range is 1920 { 1960
I yrborn2: range is 3,686,400 { 3,841,600
I yrborn3: range is 7:078 × 109 { 7:530 × 109
solution: make number smaller by subtracting 1940
already provided in variable yrborn40
Byron Lew ECON 4041H { Research Methodology in Economics 27 / 53
Cubic function: rescaled estimate
modYrb3a <- lm(children ~ yrborn40 + I(yrborn40^2) +
I(yrborn40^3), data = gss)
yrborn40 –0.0909∗∗∗
(0.0052)
I(yrborn40^2) –0.0008∗∗∗
(0.0002)
I(yrborn40^3) 0.0002∗∗∗
(0.00002)
Constant 2.7414∗∗∗
(0.0372)
Observations 5,049
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
notice the size of the coefficients on the higher order term
Byron Lew ECON 4041H { Research Methodology in Economics 28 / 53
Cubic function: rescaled estimate
useful trick is to scale up parameters by scaling down the
covariates
gss$yrb40s <- gss$yrborn40/10
shrinking the covariates means increasing the size of the
estimated coefficients, and vice versa
Byron Lew ECON 4041H { Research Methodology in Economics 29 / 53
Cubic function: rescaled twice
modYrb3b <- lm(children ~ yrb40s + I(yrb40s^2) +
I(yrb40s^3), data = gss)
yrborn40 –0.0909∗∗∗
(0.0052)
I(yrborn40^2) –0.0008∗∗∗
(0.0002)
I(yrborn40^3) 0.0002∗∗∗
(0.00002)
yrb40s –0.9087∗∗∗
(0.0524)
I(yrb40s^2) –0.0775∗∗∗
(0.0212)
I(yrb40s^3) 0.2091∗∗∗
(0.0203)
Constant 2.7414∗∗∗ 2.7414∗∗∗
(0.0372) (0.0372)
Observations 5,049 5,049
Note: Byron Lew ∗p<0.1; ECON 4041H { Research Methodology in Economics ∗∗p<0.05; ∗∗∗p<0.01 30 / 53
Reading model with transformations
easier to read, and shows more digits
numbers are same, but scaled by transformation
I coef on linear term is now 10× larger
I coef on quadratic term is now 102× larger
I coef on cubic term is now 103× larger
just remember that yrborn has been transformed, particularly if
using margins command
Byron Lew ECON 4041H { Research Methodology in Economics 31 / 53
Detour: viewing output
When we want to compare several models, and to save output,
we need a tool to combine results and save to file
stargazer” is a useful and easy-to-use output summary tool
library(stargazer)
stargazer(modYrb3, modYrb3a, modYrb3b, type =’text’,
out=’yearBorn.txt’)
This will save results to files ‘yearBorn.txt’ which can be opened
in Word or other wordprocessor
Results also appear in R window
In our example, because the variable names have changed, the
results don’t line up, but we could edit file
Note: stargazer” supports three output formats: text, html and
latex. Text and html formats can be opened in a wordprocessor.
Use latex” format when including stargarzer() in RMarkdown
files
Byron Lew ECON 4041H { Research Methodology in Economics 32 / 53
Predicted means
use emmeans to generate plot of functional form
Note: if using final transformed function, inputs (at list) must be
transformed as well
plt <- summary(emmeans(modYrb3b, ~yrb40s,
at = list(yrb40s = seq(–2,2,0.2))))
ggplot(plt, aes(x = yrb40s, y = emmean)) +
geom_line() + theme_bw() +
geom_ribbon(aes(ymax=upper.CL, ymin=lower.CL), alpha = 0.2) +
labs(y = ’predicted number of children’,
x = ’mothern’s year of birth’) +
scale_x_continuous(breaks = seq(–2,2,0.5),
labels = seq(1920,1960,5))
why do we use range of -2 to 2?
plot on next slide . . .
Byron Lew ECON 4041H { Research Methodology in Economics 33 / 53
Cubic fit of fertility on year of birth
2.0
2.5
3.0
3.5
1920 1925 1930 1935 1940 1945 1950 1955 1960
mother’s year of birth
predicted number of children
Byron Lew ECON 4041H { Research Methodology in Economics 34 / 53
Slope of cubic function
Economists like to know the marginal response, the change in y
given a change in x, @@yx
Let’s ask what is the change in expected number of children as
year of birth changes
We want the first derivative of our function with respect to year
of birth
children = β0 + β1yearBirth + β2yearBirth2 + β3yearBirth3
first derivative: β1 + 2β2yearBirth + 3β3yearBirth2
again, we can estimate this the long way, but calculating the
functional form of the first derivative, or the easy (margins) way
code on next slide . . .
Byron Lew ECON 4041H { Research Methodology in Economics 35 / 53
Derivative of cubic function: code
modYrb3.me <- summary(margins(modYrb3b,
variables=’yrb40s’,
at = list(yrb40s = seq(–2,2,0.2))))
ggplot(modYrb3.me, aes(x=yrb40s, y=AME)) + geom_line() +
geom_ribbon(aes(ymax=upper, ymin=lower), alpha = 0.2) +
labs(y=’marginal effect on number of children’,
x =“mother’s year of birth”) +
geom_hline(yintercept=0, size=0.2) + theme_bw() +
scale_x_continuous(breaks=seq(–2,2,1),
labels = seq(1920,1960,10))
Byron Lew ECON 4041H { Research Methodology in Economics 36 / 53
Derivative of cubic function: plot
-1
2 1 0
1920 1930 1940 1950 1960
mother’s year of birth
marginal effect on number of children
The line y=0 illustrates the points where the first derivative is zero,
points of min or max of the original function.
Byron Lew ECON 4041H { Research Methodology in Economics 37 / 53
Words of caution
we fit a particular functional form: quadratic, cubic
by implication we are forcing a fit on our data
this fit may or may not be appropriate
let’s review by comparing our lowess plot with our plot of
predicted means
Byron Lew ECON 4041H { Research Methodology in Economics 38 / 53
Lowess plot (1st) vs. cubic plot of emmeans (2nd)
Byron Lew ECON 4041H { Research Methodology in Economics 39 / 53
Comparing cubic to lowess
both curves look similar (jaggedness on predicted means curve
due to range specification)
on curve of predicted means, the slope of each quadratic section
are defined to be the same by choice of functional form
Lowess curve has extended linear segment from 1938{1948
looks more like three curves spliced together: quadratic, line,
quadratic
But by fitting a cubic, we are stuck with the math and cannot
have a linear portion in middle
In practice, quadratics are quite common but cubics less common
Problems with functional form estimators
I impose symmetry
I may be wildly inaccurate out-of-sample
Several solutions: we look at fractional polynomials next
Byron Lew ECON 4041H { Research Methodology in Economics 40 / 53
Fractional polynomials
Test of educational attainment by age from gss data
gss <- readRDS(paste0(path, ’gss.rds’))
gss <- gss[gss$age <= 80,]
Test shape first using age as indicator
modlmf <- lm(educ ~ factor(age), data = gss)
Graph predicted means to visualize functional form
pltLmf <- summary(emmeans(modlmf, ~age))
ggplot(pltLmf, aes(x=age, y=emmean)) + geom_line() +
geom_ribbon(aes(ymax=upper.CL, ymin=lower.CL), alpha=0.2) +
labs(y = ’predicted years of education’, x = ’age’) +
theme_bw() +
scale_x_continuous(breaks = seq(20,80,5))
Byron Lew ECON 4041H { Research Methodology in Economics 41 / 53
Predicted education from indicator variable of age
11
12
13
20 25 30 35 40 45 50 55 60 65 70 75 80
age
predicted years of education
Byron Lew ECON 4041H { Research Methodology in Economics 42 / 53
Estimate education as quadratic function of age
Looks like a quadratic, but with substantial asymmetry
Estimate as quadratic
modlm2 <- lm(educ ~ age + I(age^2), data = gss)
pltModlm2 <- summary(emmeans(modlm2, ~age,
at = list(age = seq(18,80,1))))
ggplot(pltModlm2, aes(x=age, y=emmean)) + geom_line() +
geom_ribbon(aes(ymax=upper.CL, ymin=lower.CL), alpha=0.2) +
labs(y = ’predicted years of education’, x = ’age’) +
theme_bw() + scale_x_continuous(breaks=seq(20,80,5))
next slide . . .
Byron Lew ECON 4041H { Research Methodology in Economics 43 / 53
Predicted education from quadratic function of age
10
11
12
13
20 25 30 35 40 45 50 55 60 65 70 75 80
age
predicted years of education
Byron Lew ECON 4041H { Research Methodology in Economics 44 / 53
Flexible functional form: fractional polynomials
For curves with asymmetries, a better fit requires functions with
polynomials to fractional powers, powers < 1.
General form of fractional polynomials: f (x) = axp1 + bxp2
where p1 and p2 are chosen from f-2, -1, -0.5, 0, 0.5, 1, 2, 3g
and power 0 represents ln(x)
the coefficients a and b could be negative or positive
See .r file for demonstration of different functional forms
R implements this estimator with function mfp()
Fits a general family of polynomials of fractional powers
Byron Lew ECON 4041H { Research Methodology in Economics 45 / 53
Fractional polynomials
Estimate as fractional polynomial
library(mfp)
modFp <- mfp(educ ~ fp(age), data = gss)
Estimate Std. Error t value Pr(>jtj)
(Intercept) 18.0895 0.0984 183.8 0.0000
I((age/100)^-2) -0.1655 0.0040 -40.9 0.0000
I((age/100)^1) -9.0247 0.1558 -57.9 0.0000
f (x) = axp1 + bxp2
so a = –0:1655; b = –9:0247; p1 = –2 and p2 = 1
Also notice that the covariates have been scaled, here by
dividing by 100
and there is a constant (intercept) term
Byron Lew ECON 4041H { Research Methodology in Economics 46 / 53
Fractional polynomials
With model estimate graph estimated marginal means
We send emmeans() the modFp$fit object
pltModFp <- summary(emmeans(modFp$fit, ~age,
at = list(age = seq(18,80,1)),
weight = ’proportional’))
ggplot(pltModFp, aes(x=age, y=emmean)) + geom_line() +
geom_ribbon(aes(ymax = upper.CL, ymin = lower.CL), alpha=0.2)
labs(y = ’predicted years of education’, x = ’age’) +
theme_bw() + scale_x_continuous(breaks = seq(20,80,5))
Note: ymin and ymax objects in geom ribbon names changed
for mfp()
Byron Lew ECON 4041H { Research Methodology in Economics 47 / 53
Fractional polynomials
11
12
13
20 25 30 35 40 45 50 55 60 65 70 75 80
age
predicted years of education
Byron Lew ECON 4041H { Research Methodology in Economics 48 / 53
Fractional polynomials{marginal effects
modFpFit.me <- summary(margins(modFp$fit, variables = ’age’,
at=list(age = seq(20,80,10))))
ggplot(modFpFit.me, aes(x=age, y=AME)) + geom_line() +
geom_ribbon(aes(ymax=upper, ymin=lower), alpha=0.2) +
labs(y=’predicted years of education’, x=’age’) +
theme_bw() + scale_x_continuous(breaks=seq(20,80,5))
Byron Lew ECON 4041H { Research Methodology in Economics 49 / 53
Fractional polynomials{marginal effects
-0.1
0.0
0.1
0.2
0.3
20 25 30 35 40 45 50 55 60 65 70 75 80
age
predicted years of education
Byron Lew ECON 4041H { Research Methodology in Economics 50 / 53
Compare all three functions: plot
ggplot(gss, aes(x=age, y=educ)) +
geom_line(data=pltLmf, aes(x=age, y=emmean, colour=’factor’),
size=0.5) +
geom_ribbon(data=pltLmf, aes(ymax=upper.CL, ymin=lower.CL),
alpha=0.1, show.legend=F, fill=’black’) +
geom_point(data=pltLmf, size =0.75) +
geom_line(data=pltModlm2, aes(colour=’quadratic’), linewidth=0.5) +
geom_ribbon(data=pltModlm2, aes(ymax=upper.CL, ymin=lower.CL),
alpha=0.2, fill=’blue’, show.legend=F) +
geom_line(data=pltModFp, aes(colour=’fracpoly’), size=0.5) +
geom_ribbon(data=pltModFp, aes(ymax=upper.CL, ymin=lower.CL),
alpha=0.2, fill=’red’, show.legend=F) +
theme_bw() + labs(y=’predicted years of education’, x=’age’) +
scale_x_continuous(breaks=seq(20,80,5)) +
scale_colour_manual(name=’function’, values=c(’factor’=’black’,
’quadratic’=’blue’, ’fracpoly’=’red’))
Byron Lew ECON 4041H { Research Methodology in Economics 51 / 53
Fit of three estimators
10
11
12
13
20 25 30 35 40 45 50 55 60 65 70 75 80
age
predicted years of education
function
factor
fracpoly
quadratic
Byron Lew ECON 4041H { Research Methodology in Economics 52 / 53
In this lecture we have studied:
quadratic, cubic and potentially higher-order polynomial
functions of independent variables
data transformation to accommodate scale effect when using
high-order polynomials
use of margins to calculate partial derivatives
awareness of how functional form forces certain constrained fit
to data
use of fractional polynomials
Byron Lew ECON 4041H { Research Methodology in Economics 53 / 53