Econometrics for dummies
Consider the dataset Brexit.dta. It contains data on the outcome of the Brexit vote of June 23, 2016 by local area, along with a range of area characteristics. The variable pct_leave records the percentage of voters in an area who voted leave.
library(haven)
df=read_dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AAC_4UZXJG9kmImypJXTZ9IOa/brexit.dta?dl=1")
names(df)
[1] "oslaua" "region_code"
[3] "region" "area"
[5] "pct_turnout" "pct_leave"
[7] "pct_rejected" "electorate"
[9] "expectedballots" "verifiedballotpapers"
[11] "votes_cast" "valid_votes"
[13] "remain" "leave"
[15] "rejected_ballots" "no_official_mark"
[17] "writing_or_mark" "unmarked_or_void"
[19] "pop91" "pop11"
[21] "sh_young" "m_migr"
[23] "b_migr" "b_migr11"
[25] "etn11_W" "etn11_AI"
[27] "etn11_AP" "etn11_AB"
[29] "etn11_AC" "etn11_AO"
[31] "etn11_BCA" "etn11_BAF"
[33] "etn11_BO" "etn11_O"
[35] "ni11_sco" "ni11_bri"
[37] "ni11_eng" "ni11_oth"
[39] "ni11_oe" "shni11_sco"
[41] "shni11_bri" "shni11_eng"
[43] "shni11_oth" "shni11_oe"
[45] "citshare" "urate2004"
[47] "urate2005" "urate2006"
[49] "urate2007" "urate2008"
[51] "urate2009" "urate2010"
[53] "urate2011" "urate2012"
[55] "urate2013" "urate2014"
[57] "urate2015" "epop2004"
[59] "epop2005" "epop2006"
[61] "epop2007" "epop2008"
[63] "epop2009" "epop2010"
[65] "epop2011" "epop2012"
[67] "epop2013" "epop2014"
[69] "epop2015" "zage18_24"
[71] "zage45_59" "zage25_29"
[73] "zage60" "dlmig"
[75] "zshi61_ACD" "zshi61_F"
[77] "zshi61_GHI" "zshi61_JKO"
[79] "zshi61_LMN" "zshi71_ACD"
[81] "zshi71_F" "zshi71_GHI"
[83] "zshi71_JKO" "zshi71_LMN"
[85] "zshi81_ACD" "zshi81_F"
[87] "zshi81_GHI" "zshi81_JKO"
[89] "zshi81_LMN" "zshi91_ACD"
[91] "zshi91_F" "zshi91_GHI"
[93] "zshi91_JKO" "zshi91_LMN"
[95] "zshi01_ACD" "zshi01_F"
[97] "zshi01_GHI" "zshi01_JKO"
[99] "zshi01_LMN" "zshi11_ACD"
[101] "zshi11_F" "zshi11_GHI"
[103] "zshi11_JKO" "zshi11_LMN"
[105] "dzshi_ACD" "dzshi_F"
[107] "dzshi_GHI" "dzshi_JKO"
[109] "dzshi_LMN" "zshedu11_noqual"
[111] "zshedu11_l1" "zshedu11_l2"
[113] "zshedu11_l3" "zshedu11_l4"
[115] "wrkage" "age16over"
[117] "zsh11_wrk"
Consider the variable b_migr11. It records the share (in %) of foreign-born residents in an area according to the most recent census, which was in 2011. There is no shortage of politicians claiming that the vote for Brexit was driven by immigration, particularly after 2004, when Eastern European countries joined the EU and their residents became free to move to countries like Britain. Hence, we would expect a strong effect of the presence of foreigners in an area on the vote outcome. Explore the relationship between pct_leave and b_migr11 using graphical and regression analysis.
Which way is the line of best fit sloping on your scatter plot?
What is the constant (rounded to 3 decimal places)?
What is the slope coefficient (rounded to 3 decimal places)?
library(ggplot2)
ggplot(df, aes(x=b_migr11, y=pct_leave)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
Looking at the scatter plot, there appears to be a negative relationship between the proportion of foreigners in an area and support for leave; i.e. exactly the opposite of what one would expect. Regression analysis confirms this:
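The output can be reproduced with a call along these lines (a sketch; the formula matches the Call line in the output):

summary(lm(pct_leave ~ b_migr11, data = df))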
Call:
lm(formula = pct_leave ~ b_migr11, data = df)
Residuals:
Min 1Q Median 3Q Max
-30.4487 -5.3237 0.6445 5.6518 24.9745
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 58.55342 0.68603 85.35 <2e-16 ***
b_migr11 -0.50359 0.04668 -10.79 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.12 on 378 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.2354, Adjusted R-squared: 0.2334
F-statistic: 116.4 on 1 and 378 DF, p-value: < 2.2e-16
We see that there is a significant negative relationship: 1 percentage point more foreigners in an area is associated with about a 0.5 percentage point lower share of support for leave.
Various commentators have suggested that it might not be so much the level of immigrants as such, but the experience of a change due to more foreigners in an area that was driving the vote. The variable b_migr contains the share of immigrants in 1991.
Construct a new variable recording the change in the share of immigrants between 2011 and 1991. Explore its impact by extending the regression model from part a).
What is the coefficient on this new variable (rounded to 3 decimal places)?
Is it statistically significant?
We can construct the change in migration shares as:
df["Db_migr"] <- df$b_migr11-df$b_migr
We can then regress pct_leave on both the change in and the level of the migrant share.
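A sketch of the call that reproduces the output below (the formula matches the Call line shown):

summary(lm(pct_leave ~ Db_migr + b_migr11, data = df))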
Call:
lm(formula = pct_leave ~ Db_migr + b_migr11, data = df)
Residuals:
Min 1Q Median 3Q Max
-30.2777 -4.4700 0.6883 5.8477 28.5925
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 58.6970 0.6703 87.566 < 2e-16 ***
Db_migr 0.9259 0.2078 4.454 1.11e-05 ***
b_migr11 -1.0987 0.1412 -7.783 6.86e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.901 on 377 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.2736, Adjusted R-squared: 0.2698
F-statistic: 71 on 2 and 377 DF, p-value: < 2.2e-16
It seems that the change in the migration share indeed has a strong positive and significant effect on the leave share. Notice also that the coefficient on the 2011 migrant share becomes larger in absolute value (more negative). This is because the areas with a larger increase in foreigners are also the areas with a higher share of foreigners (such as London). Consequently, in the earlier univariate regression the b_migr11 coefficient suffered from an upward bias.
\[LeaveShare = \beta_{0}+\beta_{1}MShare_{2011}+\beta_{2}(MShare_{2011}-MShare_{1991})+\epsilon\]
Work out the change in an area's leave percentage if the 2011 migrant share were to move back to its 1991 level in every area.
According to your model from part (b), what would have happened to the vote if there had been no change in the share of migrants between 1991 and 2011? Support for Brexit would …
In how many areas would the vote flip from a majority support for Brexit to a majority support for Remain?
Note that the impact of changing the 2011 migrant share combines the two coefficients from the previous part: reducing the migrant share by one percentage point changes the leave share by \(-(\beta_{1}+\beta_{2})=-(-1.099+0.926)=0.173\) percentage points. In other words, a reversal in migrant presence would tend to increase support for Brexit rather than support for remain. Consequently, in no area would we find a flip in the vote from majority support for Brexit to majority support for Remain.
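We can check the flip count numerically. A minimal sketch, assuming df and Db_migr from above; leave_cf is a hypothetical counterfactual variable introduced here purely for illustration:

# Coefficients from the model in part (b)
b <- coef(lm(pct_leave ~ Db_migr + b_migr11, data = df))
# Rolling the 2011 migrant share back to its 1991 level changes the leave
# share by -(beta_1 + beta_2) * Db_migr in each area
df$leave_cf <- df$pct_leave - (b["b_migr11"] + b["Db_migr"]) * df$Db_migr
# Count areas that would flip from majority leave to majority remain
sum(df$pct_leave > 50 & df$leave_cf < 50, na.rm = TRUE)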
Can you think of any reason why the estimates in (b) might not adequately reflect the causal impact of immigration on the vote? What are plausible confounding forces?
One reason could be as follows: an important driver of immigration is economic opportunity, so immigration is likely to be higher in areas of the country where economic growth was stronger. If favourable economic conditions also reduce support for leave, the omitted economic factor enters the error term with a negative sign while being positively correlated with immigration; this implies a negative correlation between the errors and immigration, and hence a downward bias in the estimated immigration coefficient. This in turn could in principle explain why we find a negative coefficient on the immigration variable (i.e. the 2011 migrant share might actually have a positive effect on support for leave, but we fail to detect it because it is masked by the offsetting negative effect of economic conditions on the leave vote).
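In omitted-variable-bias terms (a sketch using the standard formula; Econ, \(\gamma\) and \(\delta\) are illustrative symbols, not variables in the dataset): suppose the true model is \(LeaveShare=\beta_{0}+\beta_{1}MShare_{2011}+\gamma\,Econ+u\) with \(\gamma<0\), and let \(\delta>0\) be the slope from a regression of Econ on the migrant share. Then the simple estimate satisfies

\[\text{plim}\,\hat{\beta}_{1}=\beta_{1}+\gamma\delta<\beta_{1},\]

i.e. it is biased downward.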
The dataset contains a large number of additional characteristics about a local area. Which variable would you add to your model from part b) to test the alternative explanation mentioned in d)?
We can explore the point made in answer d) by using unemployment as an additional control variable. Below we include both the level of the unemployment rate in 2004 and the change in the rate between 2004 and 2011.
df["Durate"] <- df$urate2011-df$urate2004
summary(lm(pct_leave~Db_migr + b_migr11 + Durate + urate2004, data=df))
Call:
lm(formula = pct_leave ~ Db_migr + b_migr11 + Durate + urate2004,
data = df)
Residuals:
Min 1Q Median 3Q Max
-27.9548 -4.3850 0.6518 5.1247 19.8195
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 55.7178 1.3368 41.681 < 2e-16 ***
Db_migr 0.9224 0.2165 4.260 2.60e-05 ***
b_migr11 -1.0015 0.1453 -6.894 2.42e-11 ***
Durate 1.0426 0.1781 5.854 1.08e-08 ***
urate2004 -0.3586 0.2610 -1.374 0.17
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.393 on 362 degrees of freedom
(14 observations deleted due to missingness)
Multiple R-squared: 0.3543, Adjusted R-squared: 0.3471
F-statistic: 49.65 on 4 and 362 DF, p-value: < 2.2e-16
Note that including the unemployment variables does not change the migration coefficients by much, which suggests that economic conditions are not confounding the immigration results. That said, the net effect of the 2011 migrant share (the coefficient on b_migr11 plus the coefficient on Db_migr, now \(-1.002+0.922=-0.079\)) is slightly less negative than in (b) (\(-1.099+0.926=-0.173\)), which would be consistent with a slight confounding of the immigration effect by economic factors.
Also, note that the change in the unemployment rate has a large and significant coefficient: if the unemployment rate goes up by 1 percentage point, support for leave goes up by about 1 percentage point as well. Hence, it might be more useful to think of economic conditions, rather than recent immigration, as a factor driving the vote.
(Updated on 2021/11/22 to make the example more interesting and to avoid plm and vcov, which we haven't discussed.)
The dataset data/prod.dta contains production data for various companies from 1979 to 1986.
library(haven)
prod=read_dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AACD9OHn_yCnKFAX7hbEASVha/prod.dta?dl=1")
names(prod)
[1] "year" "id" "go" "m" "l"
[6] "k" "sic3dig" "countyear" "va"
Examine the data using a Cobb-Douglas production function in terms of value added; i.e. regress log value added on log capital and log labour (va contains value added, k the capital stock, and l labour, all in levels rather than logs). Run the regression with and without time dummies and comment on any differences.
On the basis of the regression with time dummies, examine the hypothesis that the production function has constant returns to scale (i.e. that the labour and capital coefficients add to 1).
The hypothesis is that the coefficients on log capital and log labour sum to one.
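In symbols, with \(\beta_{\log k}\) and \(\beta_{\log l}\) denoting the coefficients on log capital and log labour:

\[H_{0}:\;\beta_{\log k}+\beta_{\log l}=1\]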
library(dplyr)
prod=prod %>% mutate(year=factor(year))
mod0=lm(log(va)~log(k)+log(l), prod)
summary(mod0)
Call:
lm(formula = log(va) ~ log(k) + log(l), data = prod)
Residuals:
Min 1Q Median 3Q Max
-2.7851 -0.4169 0.0331 0.4492 2.0898
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.02744 0.11588 26.13 <2e-16 ***
log(k) 0.34639 0.02114 16.39 <2e-16 ***
log(l) 0.94000 0.03803 24.72 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7096 on 1165 degrees of freedom
Multiple R-squared: 0.7829, Adjusted R-squared: 0.7825
F-statistic: 2101 on 2 and 1165 DF, p-value: < 2.2e-16
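The second set of output below adds year dummies. A sketch of the corresponding call (the model is named mod1 here so that it matches the linearHypothesis call further down):

mod1=lm(log(va)~log(k)+log(l)+year, prod)
summary(mod1)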
Call:
lm(formula = log(va) ~ log(k) + log(l) + year, data = prod)
Residuals:
Min 1Q Median 3Q Max
-2.7397 -0.4336 0.0305 0.4477 2.1339
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.985474 0.128579 23.219 <2e-16 ***
log(k) 0.346498 0.021273 16.288 <2e-16 ***
log(l) 0.938894 0.038330 24.495 <2e-16 ***
year80 0.036981 0.082998 0.446 0.6560
year81 0.141971 0.083009 1.710 0.0875 .
year82 0.092141 0.082992 1.110 0.2671
year83 0.005795 0.082994 0.070 0.9443
year84 0.035601 0.083003 0.429 0.6681
year85 -0.061518 0.083138 -0.740 0.4595
year86 0.108689 0.083340 1.304 0.1924
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7091 on 1158 degrees of freedom
Multiple R-squared: 0.7846, Adjusted R-squared: 0.7829
F-statistic: 468.6 on 9 and 1158 DF, p-value: < 2.2e-16
We turn year into a categorical (factor) variable. Treating year as a factor estimates a separate effect for each individual year, i.e. the average impact on the dependent variable in that year relative to the base year (1979).
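To see the dummy encoding R creates from the factor, we can inspect the design matrix (an optional illustration, not part of the original analysis):

# Each year except the base year (1979) gets its own 0/1 indicator column
head(model.matrix(~ year, data = prod))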
Including time dummies can help with idiosyncratic time shocks; e.g. a recession will reduce companies' investment as well as their output. This could create a spurious positive correlation between capital and value added, which would give an upwardly biased estimate of the causal effect of capital. That said, in the example above, including time dummies does not have a big impact on the estimated coefficients.
library(car)
#library(plm)
#linearHypothesis(mod1,"log(k)+log(l)=1",vcov=vcovHC)
linearHypothesis(mod1,"log(k)+log(l)=1")
Linear hypothesis test
Hypothesis:
log(k) + log(l) = 1
Model 1: restricted model
Model 2: log(va) ~ log(k) + log(l) + year
Res.Df RSS Df Sum of Sq F Pr(>F)
1 1159 648.73
2 1158 582.19 1 66.541 132.35 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We then go on to test the hypothesis that the production function has constant returns to scale, formulated as log(k)+log(l)=1.
We need to install and load the extra package “car”, which includes the function linearHypothesis.
Constant returns are clearly rejected.
The variable sic3dig contains an industry classifier which groups the firms into 17 industries.
Why might it be useful to include industry classifiers in order to estimate the production function better?
Re-estimate the production function controlling for industry. Does your assessment about constant returns to scale change based on this new estimate?
Typically the residual from a production function estimation is interpreted as productivity. However, it is plausible that more productive firms will want to employ more production factors. This might lead to a correlation between residuals and the explanatory variables which could lead to biases. A big part of that might come from variations between sectors; i.e. some sectors are just more productive and profitable and those will also be the sectors that attract more capital and other production factors.
prod=prod %>% mutate(sic3dig=factor(sic3dig))
mod2=lm(log(va)~log(k)+log(l)+year+sic3dig, prod)
summary(mod2)
Call:
lm(formula = log(va) ~ log(k) + log(l) + year + sic3dig, data = prod)
Residuals:
Min 1Q Median 3Q Max
-2.88207 -0.40007 0.05372 0.44165 1.92283
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.216195 0.134458 23.920 < 2e-16 ***
log(k) 0.285436 0.022608 12.625 < 2e-16 ***
log(l) 0.995616 0.038889 25.602 < 2e-16 ***
year80 0.036104 0.077877 0.464 0.643019
year81 0.142005 0.077888 1.823 0.068537 .
year82 0.094111 0.077871 1.209 0.227085
year83 0.007594 0.077874 0.098 0.922338
year84 0.032116 0.077884 0.412 0.680161
year85 -0.071111 0.078036 -0.911 0.362352
year86 0.093891 0.078264 1.200 0.230517
sic3dig321 -0.107406 0.079196 -1.356 0.175304
sic3dig322 0.155755 0.079807 1.952 0.051224 .
sic3dig323 0.847057 0.239067 3.543 0.000411 ***
sic3dig324 0.034193 0.124769 0.274 0.784093
sic3dig331 -0.175586 0.085783 -2.047 0.040900 *
sic3dig332 0.260137 0.085636 3.038 0.002438 **
sic3dig341 0.733683 0.176477 4.157 3.46e-05 ***
sic3dig342 0.035795 0.095630 0.374 0.708245
sic3dig351 1.297901 0.175805 7.383 2.98e-13 ***
sic3dig352 0.597134 0.102678 5.816 7.84e-09 ***
sic3dig355 0.509182 0.237587 2.143 0.032313 *
sic3dig356 0.195583 0.095147 2.056 0.040050 *
sic3dig369 -0.239435 0.101992 -2.348 0.019065 *
sic3dig371 0.771417 0.139946 5.512 4.38e-08 ***
sic3dig381 0.174591 0.071427 2.444 0.014663 *
sic3dig383 0.367663 0.140038 2.625 0.008769 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6653 on 1142 degrees of freedom
Multiple R-squared: 0.813, Adjusted R-squared: 0.8089
F-statistic: 198.5 on 25 and 1142 DF, p-value: < 2.2e-16
linearHypothesis(mod2, "log(k)+log(l)=1")
Linear hypothesis test
Hypothesis:
log(k) + log(l) = 1
Model 1: restricted model
Model 2: log(va) ~ log(k) + log(l) + year + sic3dig
Res.Df RSS Df Sum of Sq F Pr(>F)
1 1143 564.78
2 1142 505.47 1 59.314 134.01 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
i.e. the hypothesis continues to be rejected: we still do not find constant returns to scale.
Which of the 17 industries has the largest number of observations?
Let's pick industries 311 and 321. For each of the two industries separately, estimate a Cobb-Douglas production function.
Would you say the functions are very different in the two industries?
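The counts below can be obtained with a dplyr tabulation; a minimal sketch (the n() column header in the output matches summarise(n())):

prod %>% group_by(sic3dig) %>% summarise(n())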
# A tibble: 17 x 2
sic3dig `n()`
<fct> <int>
1 311 400
2 321 88
3 322 88
4 323 8
5 324 32
6 331 72
7 332 72
8 341 16
9 342 56
10 351 16
11 352 48
12 355 8
13 356 56
14 369 48
15 371 24
16 381 112
17 383 24
The table reveals that the industry with the largest number of observations is 311.
library(dplyr)
prod= prod %>% mutate(sic3digf=factor(sic3dig))
mod311=lm(log(va)~log(k)+log(l)+year, prod %>% filter(sic3dig=="311"))
summary(mod311)
Call:
lm(formula = log(va) ~ log(k) + log(l) + year, data = prod %>%
filter(sic3dig == "311"))
Residuals:
Min 1Q Median 3Q Max
-2.30248 -0.41505 0.07536 0.45801 1.73235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.56445 0.20804 12.327 < 2e-16 ***
log(k) 0.41456 0.04113 10.079 < 2e-16 ***
log(l) 0.80244 0.07461 10.756 < 2e-16 ***
year80 0.08684 0.13017 0.667 0.505074
year81 0.26948 0.13017 2.070 0.039090 *
year82 0.39208 0.13033 3.008 0.002798 **
year83 0.23738 0.13065 1.817 0.070003 .
year84 0.22519 0.13086 1.721 0.086078 .
year85 0.07980 0.13209 0.604 0.546113
year86 0.48876 0.13271 3.683 0.000263 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6508 on 390 degrees of freedom
Multiple R-squared: 0.8318, Adjusted R-squared: 0.8279
F-statistic: 214.2 on 9 and 390 DF, p-value: < 2.2e-16
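The estimates for industry 321 below come from the analogous regression. A sketch of the call, with the model named mod321 by analogy with mod311:

mod321=lm(log(va)~log(k)+log(l)+year, prod %>% filter(sic3dig=="321"))
summary(mod321)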
Call:
lm(formula = log(va) ~ log(k) + log(l) + year, data = prod %>%
filter(sic3dig == "321"))
Residuals:
Min 1Q Median 3Q Max
-1.12790 -0.32740 0.02603 0.38229 0.95998
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.61053 0.41939 8.609 6.19e-13 ***
log(k) 0.20105 0.07984 2.518 0.0138 *
log(l) 1.04942 0.13980 7.507 8.45e-11 ***
year80 0.16420 0.22541 0.728 0.4685
year81 0.17994 0.22527 0.799 0.4268
year82 0.17906 0.22650 0.791 0.4316
year83 0.22520 0.22623 0.995 0.3226
year84 -0.03102 0.22546 -0.138 0.8909
year85 -0.16535 0.22671 -0.729 0.4680
year86 0.26596 0.22574 1.178 0.2423
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5282 on 78 degrees of freedom
Multiple R-squared: 0.8231, Adjusted R-squared: 0.8027
F-statistic: 40.32 on 9 and 78 DF, p-value: < 2.2e-16
In each case the labour coefficient is larger than the capital coefficient. However, the two sets of estimates are not necessarily very close: in sector 321 the capital coefficient is only about half of that for sector 311. Still, to check whether the functions are statistically different we should run a formal test, which is the subject of the next question.
Conduct a hypothesis test to compare the two functions formally. Note that for this you need to estimate both functions within a single regression model.
Are the interaction coefficients individually statistically different from zero?
Are they jointly significant?
mod_inter=lm(log(va)~ log(k) + log(l) + sic3digf:log(k)+sic3digf:log(l) +sic3digf*year, prod %>% filter(sic3dig=="311"|sic3dig=="321"))
summary(mod_inter)
Call:
lm(formula = log(va) ~ log(k) + log(l) + sic3digf:log(k) + sic3digf:log(l) +
sic3digf * year, data = prod %>% filter(sic3dig == "311" |
sic3dig == "321"))
Residuals:
Min 1Q Median 3Q Max
-2.30248 -0.38861 0.06615 0.42833 1.73235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.56445 0.20203 12.693 < 2e-16 ***
log(k) 0.41456 0.03994 10.378 < 2e-16 ***
log(l) 0.80244 0.07245 11.075 < 2e-16 ***
sic3digf321 1.04608 0.54098 1.934 0.053754 .
year80 0.08684 0.12642 0.687 0.492442
year81 0.26948 0.12641 2.132 0.033549 *
year82 0.39208 0.12657 3.098 0.002068 **
year83 0.23738 0.12688 1.871 0.061986 .
year84 0.22519 0.12708 1.772 0.077053 .
year85 0.07980 0.12828 0.622 0.534194
year86 0.48876 0.12888 3.792 0.000169 ***
log(k):sic3digf321 -0.21351 0.10355 -2.062 0.039770 *
log(l):sic3digf321 0.24698 0.18230 1.355 0.176128
sic3digf321:year80 0.07736 0.29788 0.260 0.795206
sic3digf321:year81 -0.08954 0.29772 -0.301 0.763739
sic3digf321:year82 -0.21302 0.29912 -0.712 0.476719
sic3digf321:year83 -0.01218 0.29896 -0.041 0.967512
sic3digf321:year84 -0.25620 0.29822 -0.859 0.390715
sic3digf321:year85 -0.24515 0.30008 -0.817 0.414374
sic3digf321:year86 -0.22280 0.29929 -0.744 0.456983
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.632 on 468 degrees of freedom
Multiple R-squared: 0.8326, Adjusted R-squared: 0.8258
F-statistic: 122.5 on 19 and 468 DF, p-value: < 2.2e-16
We continue to work with a subset of our data (industries 311 and 321). We introduce interaction terms into our model: the logged labour variable multiplied by the industry dummy, and the logged capital variable multiplied by the industry dummy. The t-tests on these interaction terms are what we are interested in. Remember that interaction terms should be interpreted as "effect modifiers": we are asking whether the industry modifies the relationship between labour/capital and value added.
Note that we also need to include interaction terms between industry and year if we want to replicate what happened when we ran two separate regressions: each industry had its own time effects. If we do not interact them in this new regression, we force both industries to have the same time effects.
The interaction coefficient on capital is significant at the 5 percent level, whereas the one on labour is not. However, to assess whether the production function really differs between the two industries, we need to test whether the two interaction coefficients are jointly zero.
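In equation form, the interacted model is (a sketch, suppressing the year dummies and their interactions with the industry dummy):

\[\log va=\beta_{0}+\beta_{k}\log k+\beta_{l}\log l+\delta_{0}D_{321}+\delta_{k}D_{321}\log k+\delta_{l}D_{321}\log l+\dots+\epsilon\]

where \(D_{321}\) is a dummy for industry 321. The joint test below asks whether \(\delta_{k}=\delta_{l}=0\).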
We can examine that with the linearHypothesis command:
linearHypothesis(mod_inter, c("log(k):sic3digf321=0","log(l):sic3digf321=0"))
Linear hypothesis test
Hypothesis:
log(k):sic3digf321 = 0
log(l):sic3digf321 = 0
Model 1: restricted model
Model 2: log(va) ~ log(k) + log(l) + sic3digf:log(k) + sic3digf:log(l) +
sic3digf * year
Res.Df RSS Df Sum of Sq F Pr(>F)
1 470 188.90
2 468 186.92 2 1.9749 2.4724 0.08549 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This suggests that the production function for sector 321 differs from that of sector 311 only weakly (significant at the 10% level but not at 5%).
Re-estimate your extended model from d) by allowing for firm fixed effects.
Does this change your assessment concerning the hypothesis that the production functions are identical in the two industries?
#mod_fe=lm(log(va)~log(k)+log(l)+sic3dig*log(k)+sic3dig*log(l)+sic3dig+year+factor(id), index=c("id","year"),data=prod %>% filter(sic3dig=="311"|sic3dig=="321"), model="within")
prod=prod %>% mutate(idf=factor(id))
mod_fe=lm(log(va)~log(k)+log(l)+sic3digf:log(k)+sic3digf:log(l)+year+year:sic3digf+idf, data=prod %>% filter(sic3dig=="311"| sic3dig=="321"))
summary(mod_fe)
Call:
lm(formula = log(va) ~ log(k) + log(l) + sic3digf:log(k) + sic3digf:log(l) +
year + year:sic3digf + idf, data = prod %>% filter(sic3dig ==
"311" | sic3dig == "321"))
Residuals:
Min 1Q Median 3Q Max
-1.57822 -0.20745 -0.00353 0.23509 1.38593
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.18111 1.95275 3.165 0.001665 **
log(k) 0.36901 0.14443 2.555 0.010983 *
log(l) 0.33302 0.12117 2.748 0.006253 **
year80 0.09849 0.08842 1.114 0.265953
year81 0.27402 0.08847 3.097 0.002087 **
year82 0.40832 0.08978 4.548 7.14e-06 ***
year83 0.26082 0.09258 2.817 0.005079 **
year84 0.23774 0.09619 2.472 0.013861 *
year85 0.13050 0.10316 1.265 0.206590
year86 0.53737 0.10929 4.917 1.28e-06 ***
idf2421 -1.81562 0.91718 -1.980 0.048421 *
idf5265 -2.14844 0.87344 -2.460 0.014316 *
idf15921 -0.68585 0.49274 -1.392 0.164708
idf16578 -0.88316 0.77937 -1.133 0.257803
idf16605 -1.33203 0.33018 -4.034 6.54e-05 ***
idf17775 -1.50578 0.86152 -1.748 0.081247 .
idf18117 -2.32688 0.85913 -2.708 0.007044 **
idf19188 -1.97386 0.89953 -2.194 0.028774 *
idf19287 -1.88645 1.04781 -1.800 0.072537 .
idf20331 -2.19312 0.88310 -2.483 0.013412 *
idf20925 -2.69578 0.83438 -3.231 0.001334 **
idf25686 -0.47739 0.22155 -2.155 0.031764 *
idf26118 -2.06988 0.73049 -2.834 0.004831 **
idf28341 -2.27233 0.89856 -2.529 0.011819 *
idf28629 -0.58776 0.48893 -1.202 0.230013
idf29259 -2.99966 3.11677 -0.962 0.336405
idf32004 -2.47214 0.78769 -3.138 0.001821 **
idf35856 -0.57077 0.50390 -1.133 0.257997
idf36468 -1.29362 0.85129 -1.520 0.129384
idf39906 -0.43909 0.43517 -1.009 0.313572
idf40068 -2.00950 0.73748 -2.725 0.006710 **
idf44694 -0.68058 0.65952 -1.032 0.302717
idf45081 -1.40526 0.82235 -1.709 0.088239 .
idf48042 -1.98575 0.78698 -2.523 0.012006 *
idf49347 -2.26081 0.80635 -2.804 0.005292 **
idf49815 -2.05689 0.92068 -2.234 0.026016 *
idf55674 -1.58149 0.86329 -1.832 0.067689 .
idf56439 -1.67649 0.88115 -1.903 0.057794 .
idf56637 -2.60704 0.88052 -2.961 0.003247 **
idf60588 -1.15167 0.63502 -1.814 0.070474 .
idf63252 -2.70866 2.97442 -0.911 0.363016
idf64116 -2.72762 2.79224 -0.977 0.329217
idf65097 -1.97071 2.90443 -0.679 0.497826
idf65151 -3.18579 2.57774 -1.236 0.217211
idf65322 -3.41494 2.66798 -1.280 0.201280
idf68274 -2.30133 0.86071 -2.674 0.007801 **
idf68589 -1.45674 0.77012 -1.892 0.059252 .
idf69336 -2.33470 2.63736 -0.885 0.376546
idf71442 -3.27453 2.81318 -1.164 0.245104
idf76365 -1.65666 0.97446 -1.700 0.089876 .
idf78588 -3.25944 3.40056 -0.959 0.338376
idf80442 -1.88779 0.86938 -2.171 0.030473 *
idf81414 -2.25434 3.08892 -0.730 0.465920
idf81666 -1.40122 0.77478 -1.809 0.071255 .
idf83394 -2.33158 0.90748 -2.569 0.010544 *
idf84258 -1.31066 0.84903 -1.544 0.123427
idf84807 -2.14712 0.94422 -2.274 0.023485 *
idf85032 -1.19273 0.93573 -1.275 0.203152
idf85941 -1.65192 0.73076 -2.261 0.024311 *
idf86292 -2.45983 0.73433 -3.350 0.000884 ***
idf86751 0.03953 0.42280 0.094 0.925549
idf87201 -0.29516 0.61324 -0.481 0.630553
idf90063 -2.87884 2.96654 -0.970 0.332402
idf92979 -1.80307 0.77708 -2.320 0.020815 *
idf93519 -1.73268 0.96749 -1.791 0.074048 .
idf93726 -1.07958 1.11043 -0.972 0.331515
idf94158 -2.55671 0.90185 -2.835 0.004811 **
idf96507 -2.29632 0.87808 -2.615 0.009248 **
idf98298 -2.16615 0.82550 -2.624 0.009014 **
idf99396 -1.46163 0.46264 -3.159 0.001699 **
log(k):sic3digf321 0.09899 0.28015 0.353 0.724017
log(l):sic3digf321 0.15616 0.21784 0.717 0.473861
sic3digf321:year80 0.02734 0.20852 0.131 0.895766
sic3digf321:year81 -0.11491 0.20821 -0.552 0.581303
sic3digf321:year82 -0.32608 0.21122 -1.544 0.123423
sic3digf321:year83 -0.12108 0.21294 -0.569 0.569938
sic3digf321:year84 -0.23963 0.21574 -1.111 0.267347
sic3digf321:year85 -0.20025 0.22277 -0.899 0.369226
sic3digf321:year86 -0.22805 0.22585 -1.010 0.313209
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4417 on 409 degrees of freedom
Multiple R-squared: 0.9285, Adjusted R-squared: 0.9149
F-statistic: 68.12 on 78 and 409 DF, p-value: < 2.2e-16
linearHypothesis(mod_fe, c("log(k):sic3digf321=0","log(l):sic3digf321=0"))
Linear hypothesis test
Hypothesis:
log(k):sic3digf321 = 0
log(l):sic3digf321 = 0
Model 1: restricted model
Model 2: log(va) ~ log(k) + log(l) + sic3digf:log(k) + sic3digf:log(l) +
year + year:sic3digf + idf
Res.Df RSS Df Sum of Sq F Pr(>F)
1 411 79.958
2 409 79.789 2 0.16941 0.4342 0.6481
It seems that once we allow for firm fixed effects the interaction coefficients are no longer significant, either individually or jointly.
Also note that even for sector 321 (which has the larger capital and labour elasticities) we no longer reject the hypothesis that returns are constant:
linearHypothesis(mod_fe,"log(k)+log(l)+log(k):sic3digf321+log(l):sic3digf321=1")
Linear hypothesis test
Hypothesis:
log(k) + log(l) + log(k):sic3digf321 + log(l):sic3digf321 = 1
Model 1: restricted model
Model 2: log(va) ~ log(k) + log(l) + sic3digf:log(k) + sic3digf:log(l) +
year + year:sic3digf + idf
Res.Df RSS Df Sum of Sq F Pr(>F)
1 410 79.794
2 409 79.789 1 0.00546 0.028 0.8672