Introduction
Panel Regression
Panel data are also called longitudinal data or cross-sectional time-series data. A panel data set has multiple entities, each of which has repeated measurements at different time periods. Panel data may have individual (group) effects, time effects, or both, which are analyzed by fixed effect and/or random effect models.1
Panel data examples:
- Annual unemployment rates of each state over several years
- Quarterly sales of individual stores over several quarters
- Salary for the same worker, working at several different jobs
Why should we use panel data?
- Controlling for individual heterogeneity
- Panel data give more informative data, more variability, less collinearity among the variables, more degrees of freedom, and more efficiency
- Reducing omitted variable bias
Learning Objectives
The goal of this article is to help you:
- Understand the concept of panel regression
- Understand the structure of panel data sets
- Implement panel regression in a business case
Library and setup
library(dplyr)     # data wrangling
library(lubridate) # working with dates
library(plm)       # panel data models
library(AER)       # Grunfeld data set
library(gplots)    # plotmeans()
library(lmtest)    # coefficient tests and bptest()
library(rsample)   # data splitting
library(MLmetrics) # RMSE and MAPE
Advantages of panel data
Panel data, by blending the inter-individual differences and intra-individual dynamics, have several advantages over cross-sectional or time-series data:
- The obvious benefit is obtaining a larger sample, giving more degrees of freedom, more variability, more information, and less multicollinearity among the variables, hence improving the efficiency of econometric estimates 2.
- Interpretable models: we can interpret the regression coefficients within the framework of cross-section and time-series effects.
Structure of Panel data sets
There are three related types of data: cross-sectional, pooled cross-sectional, and panel. To help you visualize these types of data, we'll consider some sample data sets below.
We use the Grunfeld data set from the AER package. It contains Grunfeld's investment data on 11 large US manufacturing firms over 20 years, 1935-1954.
data("Grunfeld")
head(Grunfeld)
#> invest value capital firm year
#> 1 317.6 3078.5 2.8 General Motors 1935
#> 2 391.8 4661.7 52.6 General Motors 1936
#> 3 410.6 5387.1 156.9 General Motors 1937
#> 4 257.7 2792.2 209.2 General Motors 1938
#> 5 330.8 4313.2 203.4 General Motors 1939
#> 6 461.2 4643.9 207.2 General Motors 1940
The data set has 20 annual observations on 3 variables for 11 firms. Here is a description of each feature:
- invest : Gross investment, defined as additions to plant and equipment plus maintenance and repairs in millions of dollars deflated by the implicit price deflator of producers' durable equipment (base 1947).
- value : Market value of the firm, defined as the price of common shares at December 31 (or, for WH, IBM and CH, the average price of December 31 and January 31 of the following year) times the number of common shares outstanding plus price of preferred shares at December 31 (or average price of December 31 and January 31 of the following year) times number of preferred shares plus total book value of debt at December 31, in millions of dollars deflated by the implicit GNP price deflator (base 1947).
- capital : Stock of plant and equipment, defined as the accumulated sum of net additions to plant and equipment deflated by the implicit price deflator for producers' durable equipment (base 1947) minus depreciation allowance deflated by depreciation expense deflator (10 years moving average of wholesale price index of metals and metal products, base 1947).
- firm : factor with 11 levels: General Motors, US Steel, General Electric, Chrysler, Atlantic Refining, IBM, Union Oil, Westinghouse, Goodyear, Diamond Match, American Steel.
- year : Year.
Cross Sectional Data
Cross-sectional data is a type of data collected by observing many subjects at a single point or period in time.
csd <- Grunfeld %>%
  select(-firm) %>%
  filter(year == 1954)
head(csd)
#> invest value capital year
#> 1 1486.70 5593.6 2226.3 1954
#> 2 459.30 2115.5 669.7 1954
#> 3 189.60 2759.9 888.9 1954
#> 4 172.49 703.2 414.9 1954
#> 5 81.43 365.7 804.9 1954
#> 6 135.72 927.3 238.7 1954
Pooled Cross Sectional Data
Pooled cross sectional data is multiple snapshots of multiple bunches of (randomly selected) individuals (or states or firms or whatever) at many points in time 3.
pcs <- Grunfeld %>%
  select(-firm)
head(pcs)
#> invest value capital year
#> 1 317.6 3078.5 2.8 1935
#> 2 391.8 4661.7 52.6 1936
#> 3 410.6 5387.1 156.9 1937
#> 4 257.7 2792.2 209.2 1938
#> 5 330.8 4313.2 203.4 1939
#> 6 461.2 4643.9 207.2 1940
pcs is an example of a pooled cross-sectional data set because we have observations from different years. We can use the same notation here as in the cross section, indexing each person, firm, city, etc. by \(i\). Suppose we have a cross section of 11 firms from each of 20 different years; pooling the data means treating them as one larger sample while controlling for the fact that some observations come from a different year.
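A quick tabulation confirms the pooled structure, with 11 firm-level observations in each year:

table(pcs$year)  # number of observations per year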
Panel Data
Panel data have a special structure, each row of the data corresponds to a specific individual and time period.
head(Grunfeld)
#> invest value capital firm year
#> 1 317.6 3078.5 2.8 General Motors 1935
#> 2 391.8 4661.7 52.6 General Motors 1936
#> 3 410.6 5387.1 156.9 General Motors 1937
#> 4 257.7 2792.2 209.2 General Motors 1938
#> 5 330.8 4313.2 203.4 General Motors 1939
#> 6 461.2 4643.9 207.2 General Motors 1940
The data set has 11 firms with 220 observations. This particular panel data set is referred to as a balanced panel because every firm is observed in all 20 years.
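We can also confirm the panel dimensions programmatically; a minimal check, assuming pdim() from plm accepts a data frame together with an index argument (as in current plm versions):

pdim(Grunfeld, index = c("firm", "year"))  # should report n = 11, T = 20, N = 220 (balanced)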
The basic equation for the model becomes:
\(y_{it} = \beta_0 + \beta_1 x_{it} + u_{it}\)
where:
- \(y_{it}\) is the dependent variable
- \(i\) denotes households, individuals, firms, countries, etc.
- \(t\) denotes time.
The \(i\) subscript therefore denotes the cross-section dimension, whereas \(t\) denotes the time-series dimension. How do we account for cross-section and time heterogeneity in this model? This is done by using a two-way error component assumption for the disturbances \(u_{it}\), with:
\(u_{it} = \mu _i+\lambda _t+\upsilon _{it}\)
where:
- \(\mu_i\) represents the unobservable individual (cross-section) heterogeneity
- \(\lambda_t\) denotes the unobservable time heterogeneity
- \(\upsilon_{it}\) is the remaining random error term

The first two components (\(\mu_i\) and \(\lambda_t\)) are the individual and time effects, while \(\upsilon_{it}\) is the remainder (idiosyncratic) disturbance.
Heterogeneity
We need to check whether our data form a balanced or an unbalanced panel. If we have an unbalanced panel, we can try to subset individuals or time periods to obtain something manageable.
plotmeans(formula = invest ~ year, main = "Heterogeneity Across Year", data = Grunfeld)
From the plot above, we have a balanced panel dataset with 11 observations every year.
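Heterogeneity across firms can be inspected in the same way; a sketch reusing plotmeans() from gplots:

plotmeans(formula = invest ~ firm, main = "Heterogeneity Across Firms", data = Grunfeld)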
Estimation model of panel regression
Pooled Regression
The pooled OLS model is a linear regression without fixed or random effects. It assumes a constant intercept and constant slopes regardless of group and time period. We will use the plm() function with the option model = "pooling" to obtain the pooled estimates:
pooling <- plm(invest~value + capital, data = Grunfeld, model = "pooling", index = c("firm","year"))
summary(pooling)
#> Pooling Model
#>
#> Call:
#> plm(formula = invest ~ value + capital, data = Grunfeld, model = "pooling",
#> index = c("firm", "year"))
#>
#> Balanced Panel: n = 11, T = 20, N = 220
#>
#> Residuals:
#> Min. 1st Qu. Median 3rd Qu. Max.
#> -290.331 -25.762 11.059 29.741 377.936
#>
#> Coefficients:
#> Estimate Std. Error t-value Pr(>|t|)
#> (Intercept) -38.4100540 8.4133709 -4.5654 8.35e-06 ***
#> value 0.1145344 0.0055188 20.7534 < 2.2e-16 ***
#> capital 0.2275141 0.0242283 9.3904 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Total Sum of Squares: 9712000
#> Residual Sum of Squares: 1768700
#> R-Squared: 0.81789
#> Adj. R-Squared: 0.81621
#> F-statistic: 487.284 on 2 and 217 DF, p-value: < 2.22e-16
Three additional arguments are common to this function:
- index: enables the estimation function to identify the structure of the data, i.e. the individual and the time period for each observation
- effect: the kind of effects to include in the model: individual effects, time effects, or two-ways
- model: the kind of model to be estimated, most of the time a model with fixed effects or a model with random effects
The Fixed Effects Model
The fixed effects model explores the relationship between predictor and outcome variables within an entity (country, person, company, etc.). Each entity has its own individual characteristics that may or may not influence the predictor variables. When using a fixed effects model, we assume that something within the individual may impact or bias the predictor or outcome variables and we need to control for this. This is the rationale behind the assumption of correlation between an entity's error term and its predictor variables. Fixed effects remove the effect of those time-invariant characteristics so we can assess the net effect of the predictors on the outcome variable4.
The equation for the fixed effects model becomes:
\[y_{it} = \beta_i X_{it} + \alpha_i + u_{it}\]
where:
- \(y_{it}\) is the dependent variable, with \(i\) = entity and \(t\) = time
- \(\beta_i\) is the coefficient for the independent variable
- \(X_{it}\) represents one independent variable
- \(\alpha_i\) represents a unique value for each individual unit; it captures the effect of all characteristics of an individual that do not change over time
- \(u_{it}\) is the error term
with \(i = 1,…,n\) and \(t=1,…,T\). The \(\alpha _i\) are entity-specific intercepts that capture heterogeneities across entities. An equivalent representation of this model is given by
\[y_{it} = \beta_0 + \beta_1 X_{1,it} + \dots + \beta_k X_{k,it} + \gamma_2 D2_i + \gamma_3 D3_i + \dots + \gamma_n Dn_i + u_{it}\]
where \(D2_i, D3_i, \dots, Dn_i\) are dummy variables for the entities.
There are several strategies for estimating a fixed effect model.
Least Squares Dummy Variable
The least squares dummy variable (LSDV) model uses dummy variables for the entities. This strategy produces identical parameter estimates for the regressors (the non-dummy independent variables). The between estimation, in contrast, fits a model using individual (or time) means of the dependent and independent variables, without dummies; a sketch of it appears after the LSDV output below.
lsdv <- lm(invest~value + capital + factor(firm), data = Grunfeld)
summary(lsdv)
#>
#> Call:
#> lm(formula = invest ~ value + capital + factor(firm), data = Grunfeld)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -184.008 -15.660 0.272 16.414 250.753
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -70.29907 47.37535 -1.484 0.139
#> value 0.11013 0.01130 9.746 < 2e-16 ***
#> capital 0.31003 0.01654 18.744 < 2e-16 ***
#> factor(firm)US Steel 172.20381 29.70010 5.798 2.48e-08 ***
#> factor(firm)General Electric -165.27033 30.28530 -5.457 1.38e-07 ***
#> factor(firm)Chrysler 42.48996 41.84991 1.015 0.311
#> factor(firm)Atlantic Refining -44.30345 48.12221 -0.921 0.358
#> factor(firm)IBM 47.13887 44.61445 1.057 0.292
#> factor(firm)Union Oil 3.75484 48.19182 0.078 0.938
#> factor(firm)Westinghouse 12.75258 41.98604 0.304 0.762
#> factor(firm)Goodyear -16.91548 46.17941 -0.366 0.715
#> factor(firm)Diamond Match 63.73104 47.96886 1.329 0.185
#> factor(firm)American Steel 49.72087 48.28006 1.030 0.304
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 50.3 on 207 degrees of freedom
#> Multiple R-squared: 0.9461, Adjusted R-squared: 0.9429
#> F-statistic: 302.6 on 12 and 207 DF, p-value: < 2.2e-16
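For completeness, here is a minimal sketch of the between estimator mentioned above, using model = "between" in plm:

between <- plm(invest ~ value + capital, data = Grunfeld, model = "between", index = c("firm", "year"))
summary(between)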
Within Estimator
The within estimation doesn't need dummy variables; it uses deviations from group (or time-period) means. That is, within estimation uses the variation within each individual or entity instead of a large number of dummies.
The within estimator subtracts these unit-level means from the response, the treatment, and all the controls:
\(Y_{it}-\bar{Y_i} = (X_{it}-\bar{X_i})'\beta + (u_{it}-\bar{u_i})\)
Also note that since the \(\bar{Y_i}\) are unit averages and the unobserved effect is constant over time, subtracting off the mean also subtracts that unobserved effect.
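To make the within transformation concrete, here is a minimal sketch that demeans the Grunfeld variables by firm with dplyr and fits OLS on the transformed data; the slope estimates should match a one-way (individual) within model. The *_dm column names are ours:

# demean each variable within firm (one-way individual effects)
demeaned <- Grunfeld %>%
  group_by(firm) %>%
  mutate(invest_dm = invest - mean(invest),
         value_dm = value - mean(value),
         capital_dm = capital - mean(capital)) %>%
  ungroup()

# OLS on the demeaned data; no intercept, since the means have been removed
lm(invest_dm ~ value_dm + capital_dm - 1, data = demeaned)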
The default behavior of plm is to introduce individual effects. Using the effect argument, one may also introduce:
- individual effects (effect = "individual")
- time effects (effect = "time")
- individual and time effects (effect = "twoways"); the two-ways effect model is for the moment only available for balanced panels
Here's the code using plm. Note that in this case, model = "within" means fixed effects for the entity variable.
within <- plm(invest~value + capital, data = Grunfeld, model = "within", index = c("firm","year"), effect = "twoways")
summary(within)
#> Twoways effects Within Model
#>
#> Call:
#> plm(formula = invest ~ value + capital, data = Grunfeld, effect = "twoways",
#> model = "within", index = c("firm", "year"))
#>
#> Balanced Panel: n = 11, T = 20, N = 220
#>
#> Residuals:
#> Min. 1st Qu. Median 3rd Qu. Max.
#> -163.4113 -17.6747 -1.8345 17.9490 217.1070
#>
#> Coefficients:
#> Estimate Std. Error t-value Pr(>|t|)
#> value 0.116681 0.012933 9.0219 < 2.2e-16 ***
#> capital 0.351436 0.021049 16.6964 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Total Sum of Squares: 1672200
#> Residual Sum of Squares: 459400
#> R-Squared: 0.72527
#> Adj. R-Squared: 0.67997
#> F-statistic: 248.15 on 2 and 188 DF, p-value: < 2.22e-16
The fixed effect model assumes an individual-specific intercept that is non-random, while all other coefficients are fixed (homogeneous across units)5. We can get the fixed effect constants for each firm and year using fixef(). The fixef() function returns an object of class fixef.
fixef(within)
#> General Motors US Steel General Electric Chrysler
#> -125.5411 76.7777 -264.8565 -37.3708
#> Atlantic Refining IBM Union Oil Westinghouse
#> -136.2723 -30.2288 -80.5651 -65.4880
#> Goodyear Diamond Match American Steel
#> -101.7344 -7.2787 -23.7715
fixef(within, effect = "time")
#> 1935 1936 1937 1938 1939 1940 1941 1942
#> -30.534 -47.494 -66.910 -66.158 -93.634 -70.359 -47.022 -48.534
#> 1943 1944 1945 1946 1947 1948 1949 1950
#> -68.307 -68.855 -80.074 -58.289 -65.412 -68.865 -95.735 -97.922
#> 1951 1952 1953 1954
#> -85.369 -87.023 -89.047 -112.328
A summary method is provided, which prints the effects (in deviation from the overall intercept), their standard errors, and a test of equality to the overall intercept.
summary(fixef(within))
#> Estimate Std. Error t-value Pr(>|t|)
#> General Motors -125.5411 54.6650 -2.2966 0.022747 *
#> US Steel 76.7777 26.7488 2.8703 0.004571 **
#> General Electric -264.8565 26.3789 -10.0405 < 2.2e-16 ***
#> Chrysler -37.3708 13.9735 -2.6744 0.008146 **
#> Atlantic Refining -136.2723 14.7514 -9.2379 < 2.2e-16 ***
#> IBM -30.2288 12.2184 -2.4740 0.014247 *
#> Union Oil -80.5651 12.7329 -6.3273 1.781e-09 ***
#> Westinghouse -65.4880 13.8303 -4.7351 4.300e-06 ***
#> Goodyear -101.7344 12.7979 -7.9493 1.690e-13 ***
#> Diamond Match -7.2787 11.0891 -0.6564 0.512381
#> American Steel -23.7715 11.1419 -2.1335 0.034178 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Relationship between type = "dmean", type = "level", and the overall intercept:
overall <- within_intercept(within)        # overall intercept of the within model
fx_dmean <- fixef(within, type = "dmean")  # effects as deviations from the overall intercept
fx_level <- fixef(within, type = "level")  # effects in levels
all.equal(overall + fx_dmean, fx_level, check.attributes = FALSE)
#> [1] TRUE
Calculating fitted values of the two-ways within model:
fixefs <- fixef(within)[index(within, which = "id")]  # firm effect for each observation
fitted_by_hand <- fixefs +
  within$coefficients["value"] * within$model$value +
  within$coefficients["capital"] * within$model$capital
# note: `within` is a two-ways model, so the time effects would also enter the
# exact fitted values; this reconstruction covers the individual component only
The Random Effects Model
In the random effects model, the individual-specific effect is a random variable that is uncorrelated with the explanatory variables. The advantages of random effects specification are: 6
- The number of parameters stays constant when the sample size increases
- It allows the derivation of efficient estimators that make use of both within and between (group) variation
- It allows the estimation of the impact of time-invariant variables
The equation for the random effects model becomes:
\[y_{it} = \beta _iX_{it} + \alpha +\varepsilon_{it} + u_{it}\]
where:
- \(u_{it}\) is the between-entity error
- \(\varepsilon_{it}\) is the within-entity error
random <- plm(invest~value + capital, data = Grunfeld, model = "random", index = c("firm","year"), effect = "twoways")
summary(random)
#> Twoways effects Random Effect Model
#> (Swamy-Arora's transformation)
#>
#> Call:
#> plm(formula = invest ~ value + capital, data = Grunfeld, effect = "twoways",
#> model = "random", index = c("firm", "year"))
#>
#> Balanced Panel: n = 11, T = 20, N = 220
#>
#> Effects:
#> var std.dev share
#> idiosyncratic 2443.62 49.43 0.283
#> individual 6206.26 78.78 0.717
#> time 0.00 0.00 0.000
#> theta: 0.8611 (id) 0 (time) 2.776e-17 (total)
#>
#> Residuals:
#> Min. 1st Qu. Median 3rd Qu. Max.
#> -177.6721 -18.9754 4.1721 14.9430 253.1971
#>
#> Coefficients:
#> Estimate Std. Error z-value Pr(>|z|)
#> (Intercept) -53.9812884 26.0814102 -2.0697 0.03848 *
#> value 0.1093251 0.0099449 10.9931 < 2e-16 ***
#> capital 0.3081066 0.0163788 18.8113 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Total Sum of Squares: 2388700
#> Residual Sum of Squares: 549690
#> R-Squared: 0.76988
#> Adj. R-Squared: 0.76776
#> Chisq: 725.987 on 2 DF, p-value: < 2.22e-16
For a random model, the summary output gives information about the variance of the components of the error.
ranef(random)
#> General Motors US Steel General Electric Chrysler
#> -11.360155 154.988363 -175.795335 26.442593
#> Atlantic Refining IBM Union Oil Westinghouse
#> -58.348487 30.754167 -11.607130 -2.805524
#> Goodyear Diamond Match American Steel
#> -31.765611 46.565016 32.932104
Fixed Effect vs Random Effect Models
Panel data models estimate fixed and/or random effects models using dummy variables. The core difference between fixed and random effect models lies in the role of these dummies. If the dummies are considered part of the intercept, it is a fixed effect model; in a random effect model, the dummies act as an error term.
The fixed effect model examines group differences in intercepts, assuming the same slopes and constant variance across groups. Fixed effect models use the least squares dummy variable (LSDV), within effect, and between effect estimation methods. Thus OLS regressions with dummies are, in fact, fixed effect models.
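To see how much the choice of estimator matters here, we can put the slope estimates of the three models fitted above side by side:

# compare value and capital coefficients across the pooled, within, and random models
sapply(list(pooling = pooling, within = within, random = random),
       function(m) coef(m)[c("value", "capital")])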
Selection Method of panel regression
We can use several tests, namely the Chow test, the Hausman test, and the Lagrange multiplier test, to decide on the most appropriate model:
1. Chow Test for Poolability
The Chow test determines whether the pooled effect or the fixed effect (FE) model is more appropriate for the panel data. What is poolability? Poolability asks whether the slopes are the same across groups or over time.
H0: pooled effect model
H1: fixed effect model
pFtest(within, pooling)
#>
#> F test for twoways effects
#>
#> data: invest ~ value + capital
#> F = 18.476, df1 = 29, df2 = 188, p-value < 2.2e-16
#> alternative hypothesis: significant effects
From the result above, we can see the p-value is below 0.05, so a fixed effect model is better for this data.
2. Hausman Test
The Hausman test is a statistical test for choosing between the fixed effect and the random effect model.
H0: Random Effect Model
H1: Fixed Effect Model
phtest(within, random)
#>
#> Hausman Test
#>
#> data: invest ~ value + capital
#> chisq = 13.253, df = 2, p-value = 0.001325
#> alternative hypothesis: one model is inconsistent
From the result above, the p-value is below 0.05, so a fixed effect model is better for this data.
3. Lagrange Multiplier
The Lagrange multiplier test determines whether the random effect (RE) model is better than the pooled (common effect) model.
H0: Pooled Effect Model
H1: Random Effect Model
plmtest(within, effect = "twoways", type = "bp")
#>
#> Lagrange Multiplier Test - two-ways effects (Breusch-Pagan) for
#> balanced panels
#>
#> data: invest ~ value + capital
#> chisq = 881.07, df = 2, p-value < 2.2e-16
#> alternative hypothesis: significant effects
From the result above, the p-value is below 0.05, so a random effect model is better for this data.
The tests will help you decide, but motivating your choice of estimator from the data structure and the research question is the better practice.
Diagnostics Test
Test of Serial Correlation
Serial correlation is the relationship between a variable and a lagged version of itself over various time intervals.
Serial correlation hypothesis test:
H0: There is no serial correlation
H1: There is serial correlation
pbgtest(random)
#>
#> Breusch-Godfrey/Wooldridge test for serial correlation in panel models
#>
#> data: invest ~ value + capital
#> chisq = 77.049, df = 20, p-value = 1.237e-08
#> alternative hypothesis: serial correlation in idiosyncratic errors
From the result above, the p-value is below 0.05 so there is serial correlation.
Testing for Heteroskedasticity
Heteroscedasticity is a condition where the variability of a variable is unequal across its range of values.
Heteroscedasticity hypothesis test:
H0: Homoskedasticity
H1: Heteroskedasticity
bptest(random)
#>
#> studentized Breusch-Pagan test
#>
#> data: random
#> BP = 65.228, df = 2, p-value = 6.854e-15
From the result above, the p-value is below 0.05, so there is heteroscedasticity.
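Since we detected both serial correlation and heteroskedasticity, a common remedy is to keep the coefficient estimates but base inference on a robust covariance matrix. A minimal sketch using plm's Arellano covariance estimator (robust to both problems) with coeftest() from lmtest:

# robust standard errors; coefficients are unchanged, only inference adjusts
coeftest(random, vcov. = vcovHC(random, method = "arellano"))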
Use Case
In this section, we present a use case of panel regression on public capital productivity.
productivity <- read.csv("data_input/productivity.csv")
str(productivity)
#> 'data.frame': 816 obs. of 10 variables:
#> $ STATE: chr "ALABAMA" "ALABAMA" "ALABAMA" "ALABAMA" ...
#> $ YR : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
#> $ P_CAP: num 15033 15502 15972 16406 16763 ...
#> $ HWY : num 7326 7526 7765 7908 8026 ...
#> $ WATER: num 1656 1721 1765 1742 1735 ...
#> $ UTIL : num 6051 6255 6442 6756 7002 ...
#> $ PC : num 35794 37300 38670 40084 42057 ...
#> $ GSP : int 28418 29375 31303 33430 33749 33604 35764 37463 39964 40979 ...
#> $ EMP : num 1010 1022 1072 1136 1170 ...
#> $ UNEMP: num 4.7 5.2 4.7 3.9 5.5 7.7 6.8 7.4 6.3 7.1 ...
The data has 816 observations and 10 variables. The target variable is gross state product (GSP). The STATE variable identifies the entities (panels) and YR is the time variable. Here is a description of each feature:
- STATE : state name
- YR : year
- P_CAP : public capital
- HWY : highway capital
- WATER : water utility capital
- UTIL : utility capital
- PC : private capital
- GSP : gross state product
- EMP : employment
- UNEMP : unemployment rate
Cross Validation
We split the data into data_train and data_test. The first 13 years will be the training data and the last 4 years will be the test data.
set.seed(100)
data_train <- productivity %>%
  filter(!YR %in% c(1983, 1984, 1985, 1986))
data_test <- productivity %>%
  filter(YR %in% c(1983, 1984, 1985, 1986))
Exploratory data
Before we go further, we need to make sure that our data is balanced panel from 1970 to 1982.
plotmeans(formula = GSP ~ YR, main = "Heterogeneity Across Year", data = data_train)
From the plot above, we have a balanced panel with 48 observations every year.
Modelling
In this case, we will build three models: pooling, within (fixed), and random.
mod_pooling <- plm(GSP ~ P_CAP + HWY + WATER +
                     UTIL + PC + EMP + UNEMP,
                   data = data_train,
                   model = "pooling")
summary(mod_pooling)
#> Pooling Model
#>
#> Call:
#> plm(formula = GSP ~ P_CAP + HWY + WATER + UTIL + PC + EMP + UNEMP,
#> data = data_train, model = "pooling")
#>
#> Balanced Panel: n = 48, T = 13, N = 624
#>
#> Residuals:
#> Min. 1st Qu. Median 3rd Qu. Max.
#> -23508.4 -3502.1 1006.8 3028.9 33069.1
#>
#> Coefficients:
#> Estimate Std. Error t-value Pr(>|t|)
#> (Intercept) -7.1687e+02 8.0383e+02 -0.8918 0.3728
#> P_CAP -7.3614e+04 3.6801e+04 -2.0003 0.0459 *
#> HWY 7.3615e+04 3.6801e+04 2.0003 0.0459 *
#> WATER 7.3616e+04 3.6801e+04 2.0004 0.0459 *
#> UTIL 7.3615e+04 3.6801e+04 2.0004 0.0459 *
#> PC 3.0618e-01 1.1688e-02 26.1954 < 2.2e-16 ***
#> EMP 1.8854e+01 7.0890e-01 26.5968 < 2.2e-16 ***
#> UNEMP -6.1782e+02 1.1387e+02 -5.4255 8.314e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Total Sum of Squares: 2.7027e+12
#> Residual Sum of Squares: 1.9502e+10
#> R-Squared: 0.99278
#> Adj. R-Squared: 0.9927
#> F-statistic: 12121 on 7 and 616 DF, p-value: < 2.22e-16
In the pooling model, all independent variables are significant and the adjusted R-squared is 0.99. The pooled model has a single intercept and single slopes regardless of group and time period. Next, we build the within model:
mod_within <- plm(GSP ~ P_CAP + HWY + WATER +
                    UTIL + PC + EMP + UNEMP,
                  data = data_train,
                  model = "within",
                  index = c("STATE", "YR"),
                  effect = "twoways")
summary(mod_within)
#> Twoways effects Within Model
#>
#> Call:
#> plm(formula = GSP ~ P_CAP + HWY + WATER + UTIL + PC + EMP + UNEMP,
#> data = data_train, effect = "twoways", model = "within",
#> index = c("STATE", "YR"))
#>
#> Balanced Panel: n = 48, T = 13, N = 624
#>
#> Residuals:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -7842.0 -718.9 -54.4 0.0 648.7 6593.5
#>
#> Coefficients:
#> Estimate Std. Error t-value Pr(>|t|)
#> P_CAP 9.4220e+03 1.1642e+04 0.8093 0.4187
#> HWY -9.4222e+03 1.1642e+04 -0.8093 0.4187
#> WATER -9.4224e+03 1.1642e+04 -0.8093 0.4187
#> UTIL -9.4216e+03 1.1642e+04 -0.8093 0.4187
#> PC -2.7614e-02 2.0574e-02 -1.3422 0.1801
#> EMP 3.7677e+01 7.6494e-01 49.2555 <2e-16 ***
#> UNEMP 8.6094e+01 7.1320e+01 1.2072 0.2279
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Total Sum of Squares: 3.512e+10
#> Residual Sum of Squares: 1563600000
#> R-Squared: 0.95548
#> Adj. R-Squared: 0.9502
#> F-statistic: 1707.7 on 7 and 557 DF, p-value: < 2.22e-16
To check the intercepts for each cross-section and time period, we can use the fixef() function.
fixef(mod_within)
#> ALABAMA ARIZONA ARKANSAS CALIFORNIA COLORADO
#> -10424.53 -3566.05 -5122.71 -2201.48 -3368.31
#> CONNECTICUT DELAWARE FLORIDA GEORGIA IDAHO
#> -6830.42 -1991.55 -17674.65 -16897.08 -1587.21
#> ILLINOIS INDIANA IOWA KANSAS KENTUCKY
#> -11828.54 -12051.68 -2441.26 -516.46 -2156.66
#> LOUISIANA MAINE MARYLAND MASSACHUSETTS MICHIGAN
#> 20243.90 -3992.07 -9682.28 -23539.79 -5966.18
#> MINNESOTA MISSISSIPPI MISSOURI MONTANA NEBRASKA
#> -8757.58 -5991.93 -11232.59 425.73 -3624.40
#> NEVADA NEW_HAMPSHIRE NEW_JERSEY NEW_MEXICO NEW_YORK
#> -1231.83 -3607.94 -10573.66 1585.53 -32166.92
#> NORTH_CAROLINA NORTH_DAKOTA OHIO OKLAHOMA OREGON
#> -19458.55 474.27 -20503.08 3315.88 -4532.44
#> PENNSYLVANIA RHODE_ISLAND SOUTH_CAROLINA SOUTH_DAKOTA TENNESSE
#> -32702.66 -4520.25 -12793.72 -894.22 -15095.64
#> TEXAS UTAH VERMONT VIRGINIA WASHINGTON
#> 30577.25 -3716.22 -1987.88 -9702.83 -3925.40
#> WEST_VIRGINIA WISCONSIN WYOMING
#> -1573.68 -12151.91 4245.09
fixef(mod_within,effect = "time")
#> 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
#> -6561.2 -5764.7 -5247.7 -4967.8 -6592.4 -6677.1 -5971.0 -5654.3 -5848.6 -6611.8
#> 1980 1981 1982
#> -7594.0 -7030.4 -7193.8
Next, we will build the random model:
mod_random <- plm(GSP ~ P_CAP + HWY + WATER +
                    UTIL + PC + EMP + UNEMP,
                  data = data_train,
                  model = "random",
                  index = c("STATE", "YR"),
                  effect = "twoways")
summary(mod_random)
#> Twoways effects Random Effect Model
#> (Swamy-Arora's transformation)
#>
#> Call:
#> plm(formula = GSP ~ P_CAP + HWY + WATER + UTIL + PC + EMP + UNEMP,
#> data = data_train, effect = "twoways", model = "random",
#> index = c("STATE", "YR"))
#>
#> Balanced Panel: n = 48, T = 13, N = 624
#>
#> Effects:
#> var std.dev share
#> idiosyncratic 2.807e+06 1.676e+03 0.098
#> individual 2.571e+07 5.070e+03 0.900
#> time 5.286e+04 2.299e+02 0.002
#> theta: 0.9087 (id) 0.2753 (time) 0.2749 (total)
#>
#> Residuals:
#> Min. 1st Qu. Median 3rd Qu. Max.
#> -8850.7380 -588.8907 4.4832 549.8404 9901.7155
#>
#> Coefficients:
#> Estimate Std. Error z-value Pr(>|z|)
#> (Intercept) -3.2450e+03 1.1435e+03 -2.8377 0.0045441 **
#> P_CAP 1.2397e+04 1.2799e+04 0.9686 0.3327469
#> HWY -1.2397e+04 1.2799e+04 -0.9686 0.3327483
#> WATER -1.2398e+04 1.2799e+04 -0.9686 0.3327336
#> UTIL -1.2397e+04 1.2799e+04 -0.9686 0.3327561
#> PC 6.0680e-02 1.7461e-02 3.4752 0.0005104 ***
#> EMP 3.3964e+01 6.8185e-01 49.8107 < 2.2e-16 ***
#> UNEMP -1.5642e+02 5.9315e+01 -2.6372 0.0083600 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Total Sum of Squares: 6.6693e+10
#> Residual Sum of Squares: 2126900000
#> R-Squared: 0.96811
#> Adj. R-Squared: 0.96775
#> Chisq: 18701.2 on 7 DF, p-value: < 2.22e-16
Private capital (PC) and employment (EMP) are significantly and positively related to gross state product, the unemployment rate has a significant negative effect, and the public capital stock variables are not significant here.
ranef(mod_random)
#> ALABAMA ARIZONA ARKANSAS CALIFORNIA COLORADO
#> -5230.10493 1819.95913 -90.08252 18339.30247 2153.96993
#> CONNECTICUT DELAWARE FLORIDA GEORGIA IDAHO
#> 382.82692 3198.92806 -8683.31375 -9655.38005 3013.77662
#> ILLINOIS INDIANA IOWA KANSAS KENTUCKY
#> -2514.21002 -6107.49603 1115.73924 2934.53554 2356.77859
#> LOUISIANA MAINE MARYLAND MASSACHUSETTS MICHIGAN
#> 20080.55495 1305.33545 -2291.20789 -12570.66767 2684.74534
#> MINNESOTA MISSISSIPPI MISSOURI MONTANA NEBRASKA
#> -3331.05067 -1078.35286 -4937.46721 4143.34031 795.33684
#> NEVADA NEW_HAMPSHIRE NEW_JERSEY NEW_MEXICO NEW_YORK
#> 3482.59290 1225.91024 -1061.28694 5718.19874 -8522.63749
#> NORTH_CAROLINA NORTH_DAKOTA OHIO OKLAHOMA OREGON
#> -11998.91470 3967.56349 -11794.13171 6770.49466 1127.12389
#> PENNSYLVANIA RHODE_ISLAND SOUTH_CAROLINA SOUTH_DAKOTA TENNESSE
#> -21524.30136 1283.93650 -6993.38460 2757.16259 -8700.68861
#> TEXAS UTAH VERMONT VIRGINIA WASHINGTON
#> 28325.56849 1269.45013 2843.35365 -3205.10527 3144.51776
#> WEST_VIRGINIA WISCONSIN WYOMING
#> 2297.18196 -5229.31019 6980.91007
After building the three models above, we need to decide which model to choose using the following tests:
- Chow test for poolability
pFtest(mod_within, mod_pooling)
#>
#> F test for twoways effects
#>
#> data: GSP ~ P_CAP + HWY + WATER + UTIL + PC + EMP + UNEMP
#> F = 108.31, df1 = 59, df2 = 557, p-value < 2.2e-16
#> alternative hypothesis: significant effects
The test has a p-value less than the significance level of 0.05; therefore we reject H0 and select the fixed effect model.
- Hausman Test
phtest(mod_within, mod_random)
#>
#> Hausman Test
#>
#> data: GSP ~ P_CAP + HWY + WATER + UTIL + PC + EMP + UNEMP
#> chisq = 7.9098, df = 7, p-value = 0.3406
#> alternative hypothesis: one model is inconsistent
The test has a p-value above the significance level of 0.05; therefore we fail to reject H0 and select the random effect model.
- Lagrange Multiplier
plmtest(mod_within, effect = "twoways", type = "bp")
#>
#> Lagrange Multiplier Test - two-ways effects (Breusch-Pagan) for
#> balanced panels
#>
#> data: GSP ~ P_CAP + HWY + WATER + UTIL + PC + EMP + UNEMP
#> chisq = 1948.8, df = 2, p-value < 2.2e-16
#> alternative hypothesis: significant effects
The test has a p-value less than the significance level of 0.05; therefore we reject H0 and select the random effect model.
Based on the three tests above, we will use the random effect model. Next, we check the assumptions of the random effect model:
- Test of serial correlation
pbgtest(mod_random)
#>
#> Breusch-Godfrey/Wooldridge test for serial correlation in panel models
#>
#> data: GSP ~ P_CAP + HWY + WATER + UTIL + PC + EMP + UNEMP
#> chisq = 291.28, df = 13, p-value < 2.2e-16
#> alternative hypothesis: serial correlation in idiosyncratic errors
The test has a p-value less than the significance level of 0.05; therefore we reject H0 and conclude there is serial correlation.
- Heteroscedasticity
We need to check for heteroscedasticity using the Breusch-Pagan test.
bptest(mod_random)
#>
#> studentized Breusch-Pagan test
#>
#> data: mod_random
#> BP = 276.65, df = 7, p-value < 2.2e-16
The test has a p-value less than the significance level of 0.05; therefore we reject the null hypothesis: the homoskedasticity assumption does not hold, which can be restrictive in many panel data applications.
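As with the earlier diagnostics, a minimal sketch of the robust (Arellano) covariance remedy applied to the use-case model:

coeftest(mod_random, vcov. = vcovHC(mod_random, method = "arellano"))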
- Predict on the test data
pred <- predict(mod_random, data_test)
- Let’s check the performance of our model
RMSE(pred, data_test$GSP)
#> [1] 9376.271
MAPE(pred, data_test$GSP)
#> [1] 0.1312989
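As a rough benchmark, we can score the simpler pooled model on the same test years; a sketch reusing predict() as above (we make no claims about the resulting numbers):

pred_pool <- predict(mod_pooling, data_test)
RMSE(pred_pool, data_test$GSP)
MAPE(pred_pool, data_test$GSP)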
Conclusion
Panel regression is a method that can model both the common and the individual behaviors of groups. The advantage of a panel data set over a cross-section is that it gives the researcher great flexibility in modeling differences in behavior across individuals. The challenge in implementing panel regression lies in the assumptions about unobserved heterogeneity.