Econometrics for dummies
Consider the dataset Brexit.dta. It contains data on the outcome of the Brexit vote of June 23, 2016 by local area, along with a range of area characteristics. The variable pct_leave records the percentage of voters in an area who voted leave.
library(haven)
df=read_dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AAC_4UZXJG9kmImypJXTZ9IOa/brexit.dta?dl=1")
names(df)
[1] "oslaua" "region_code"
[3] "region" "area"
[5] "pct_turnout" "pct_leave"
[7] "pct_rejected" "electorate"
[9] "expectedballots" "verifiedballotpapers"
[11] "votes_cast" "valid_votes"
[13] "remain" "leave"
[15] "rejected_ballots" "no_official_mark"
[17] "writing_or_mark" "unmarked_or_void"
[19] "pop91" "pop11"
[21] "sh_young" "m_migr"
[23] "b_migr" "b_migr11"
[25] "etn11_W" "etn11_AI"
[27] "etn11_AP" "etn11_AB"
[29] "etn11_AC" "etn11_AO"
[31] "etn11_BCA" "etn11_BAF"
[33] "etn11_BO" "etn11_O"
[35] "ni11_sco" "ni11_bri"
[37] "ni11_eng" "ni11_oth"
[39] "ni11_oe" "shni11_sco"
[41] "shni11_bri" "shni11_eng"
[43] "shni11_oth" "shni11_oe"
[45] "citshare" "urate2004"
[47] "urate2005" "urate2006"
[49] "urate2007" "urate2008"
[51] "urate2009" "urate2010"
[53] "urate2011" "urate2012"
[55] "urate2013" "urate2014"
[57] "urate2015" "epop2004"
[59] "epop2005" "epop2006"
[61] "epop2007" "epop2008"
[63] "epop2009" "epop2010"
[65] "epop2011" "epop2012"
[67] "epop2013" "epop2014"
[69] "epop2015" "zage18_24"
[71] "zage45_59" "zage25_29"
[73] "zage60" "dlmig"
[75] "zshi61_ACD" "zshi61_F"
[77] "zshi61_GHI" "zshi61_JKO"
[79] "zshi61_LMN" "zshi71_ACD"
[81] "zshi71_F" "zshi71_GHI"
[83] "zshi71_JKO" "zshi71_LMN"
[85] "zshi81_ACD" "zshi81_F"
[87] "zshi81_GHI" "zshi81_JKO"
[89] "zshi81_LMN" "zshi91_ACD"
[91] "zshi91_F" "zshi91_GHI"
[93] "zshi91_JKO" "zshi91_LMN"
[95] "zshi01_ACD" "zshi01_F"
[97] "zshi01_GHI" "zshi01_JKO"
[99] "zshi01_LMN" "zshi11_ACD"
[101] "zshi11_F" "zshi11_GHI"
[103] "zshi11_JKO" "zshi11_LMN"
[105] "dzshi_ACD" "dzshi_F"
[107] "dzshi_GHI" "dzshi_JKO"
[109] "dzshi_LMN" "zshedu11_noqual"
[111] "zshedu11_l1" "zshedu11_l2"
[113] "zshedu11_l3" "zshedu11_l4"
[115] "wrkage" "age16over"
[117] "zsh11_wrk"
Consider the variable b_migr11. It records the share (in %) of foreign-born residents in an area according to the most recent census, which was in 2011. There is no shortage of politicians claiming that the vote for Brexit was driven by immigration, particularly after 2004, when Eastern European countries joined the EU and their residents became free to move to countries like Britain. Hence, we would expect a strong effect of the presence of foreigners in an area on the vote outcome. Explore the relationship between pct_leave and b_migr11 using graphical and regression analysis.
Which way is the line of best fit sloping on your scatter plot?
What is the constant (rounded to 3 decimal places)?
What is the slope coefficient (rounded to 3 decimal places)?
library(ggplot2)
ggplot(df, aes(x=b_migr11, y=pct_leave)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
Looking at the scatter plot, there appears to be a negative relationship between the proportion of foreigners in an area and support for leave; i.e. exactly the opposite of what one would expect. Regression analysis confirms this:
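The output can be reproduced with a call along these lines (a sketch; the formula matches the Call line in the output):

summary(lm(pct_leave ~ b_migr11, data = df))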
Call:
lm(formula = pct_leave ~ b_migr11, data = df)
Residuals:
Min 1Q Median 3Q Max
-30.4487 -5.3237 0.6445 5.6518 24.9745
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 58.55342 0.68603 85.35 <2e-16 ***
b_migr11 -0.50359 0.04668 -10.79 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.12 on 378 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.2354, Adjusted R-squared: 0.2334
F-statistic: 116.4 on 1 and 378 DF, p-value: < 2.2e-16
We see that there is a significant negative relationship: 1 percentage point more foreigners in an area is associated with about a 0.5 percentage point lower share of support for leave.
Various commentators have suggested that it might not be so much the level of immigrants as such, but the experience of a change due to more foreigners in an area that was driving the vote. The variable b_migr contains the share of immigrants in 1991.
Construct a new variable recording the change in the share of immigrants between 2011 and 1991. Explore its impact by extending the regression model from part a).
What is the coefficient on this new variable (rounded to 3 decimal places)?
Is it statistically significant?
We can construct the change in migration shares as:
df["Db_migr"] <- df$b_migr11-df$b_migr
We can then regress pct_leave on both the change in and the level of the migrant share.
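A sketch of the call that reproduces the output below (the formula matches the Call line shown):

summary(lm(pct_leave ~ Db_migr + b_migr11, data = df))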
Call:
lm(formula = pct_leave ~ Db_migr + b_migr11, data = df)
Residuals:
Min 1Q Median 3Q Max
-30.2777 -4.4700 0.6883 5.8477 28.5925
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 58.6970 0.6703 87.566 < 2e-16 ***
Db_migr 0.9259 0.2078 4.454 1.11e-05 ***
b_migr11 -1.0987 0.1412 -7.783 6.86e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.901 on 377 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.2736, Adjusted R-squared: 0.2698
F-statistic: 71 on 2 and 377 DF, p-value: < 2.2e-16
It seems that the change in the migration share indeed has a strong positive and significant effect on the leave share. Notice also that the coefficient on the 2011 migrant share becomes larger in absolute value (more negative). This is because the areas with a larger increase in foreigners are also the areas with a higher share of foreigners (such as London). Consequently, in the earlier univariate regression the b_migr11 coefficient suffered from an upward bias.
\[LeaveShare = \beta_{0}+\beta_{1}MShare_{2011}+\beta_{2}(MShare_{2011}-MShare_{1991})+\epsilon\]
Work out the change in an area's leave percentage if the 2011 migrant share were to move back to its 1991 level in every area.
According to your model from part (b), what would have happened to the vote if there had been no change in the share of migrants between 1991 and 2011? Support for Brexit would …
In how many areas would the vote flip from a majority support for Brexit to a majority support for Remain?
Note that the impact of changing the 2011 migrant share combines the two coefficients from the previous part: reducing the migrant share by one percentage point changes the leave share by \(-(\beta_{1}+\beta_{2})=-(-1.099+0.926)=0.173\) percentage points. In other words, a reversal in migrant presence would tend to increase support for Brexit rather than support for remain. Consequently, in no area would we find a flip in the vote from majority support for Brexit to majority support for Remain.
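We can check the flip count numerically. A minimal sketch, assuming df and Db_migr from above; leave_cf is a hypothetical counterfactual variable introduced here purely for illustration:

# Coefficients from the model in part (b)
b <- coef(lm(pct_leave ~ Db_migr + b_migr11, data = df))
# Rolling the 2011 migrant share back to its 1991 level changes the leave
# share by -(beta_1 + beta_2) * Db_migr in each area
df$leave_cf <- df$pct_leave - (b["b_migr11"] + b["Db_migr"]) * df$Db_migr
# Count areas that would flip from majority leave to majority remain
sum(df$pct_leave > 50 & df$leave_cf < 50, na.rm = TRUE)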
Can you think of any reason why the estimates in (b) might not adequately reflect the causal impact of immigration on the vote? What are plausible confounding forces?
One reason could be as follows: an important driver of immigration is economic opportunity, so immigration is likely to be higher in areas of the country where economic growth was stronger. If favourable economic conditions also reduce support for leave, the omitted economic factor enters the error term with a negative sign while being positively correlated with immigration; this implies a negative correlation between the errors and immigration, and hence a downward bias in the estimated immigration coefficient. This in turn could in principle explain why we find a negative coefficient on the immigration variable (i.e. the 2011 migrant share might actually have a positive effect on support for leave, but we fail to detect it because it is masked by the offsetting negative effect of economic conditions on the leave vote).
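In omitted-variable-bias terms (a sketch using the standard formula; Econ, \(\gamma\) and \(\delta\) are illustrative symbols, not variables in the dataset): suppose the true model is \(LeaveShare=\beta_{0}+\beta_{1}MShare_{2011}+\gamma\,Econ+u\) with \(\gamma<0\), and let \(\delta>0\) be the slope from a regression of Econ on the migrant share. Then the simple estimate satisfies

\[\text{plim}\,\hat{\beta}_{1}=\beta_{1}+\gamma\delta<\beta_{1},\]

i.e. it is biased downward.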
The dataset contains a large number of additional characteristics about a local area. Which variable would you add to your model from part b) to test the alternative explanation mentioned in d)?
We can explore the point made in answer d) by using unemployment as an additional control variable. Below we include both the level of the unemployment rate in 2004 and the change in the rate between 2004 and 2011.
df["Durate"] <- df$urate2011-df$urate2004
summary(lm(pct_leave~Db_migr + b_migr11 + Durate + urate2004, data=df))
Call:
lm(formula = pct_leave ~ Db_migr + b_migr11 + Durate + urate2004,
data = df)
Residuals:
Min 1Q Median 3Q Max
-27.9548 -4.3850 0.6518 5.1247 19.8195
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 55.7178 1.3368 41.681 < 2e-16 ***
Db_migr 0.9224 0.2165 4.260 2.60e-05 ***
b_migr11 -1.0015 0.1453 -6.894 2.42e-11 ***
Durate 1.0426 0.1781 5.854 1.08e-08 ***
urate2004 -0.3586 0.2610 -1.374 0.17
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.393 on 362 degrees of freedom
(14 observations deleted due to missingness)
Multiple R-squared: 0.3543, Adjusted R-squared: 0.3471
F-statistic: 49.65 on 4 and 362 DF, p-value: < 2.2e-16
Note that including the unemployment variables does not change the migration coefficients by much, which suggests that economic conditions are not confounding the immigration results. That said, the net effect of the 2011 migrant share (the coefficient on b_migr11 plus the coefficient on Db_migr, now \(-1.002+0.922=-0.079\)) is slightly less negative than in (b) (\(-1.099+0.926=-0.173\)), which would be consistent with a slight confounding of the immigration effect by economic factors.
Also, note that the change in the unemployment rate has a large and significant coefficient: if the unemployment rate goes up by 1 percentage point, support for leave goes up by about 1 percentage point as well. Hence, it might be more useful to think of economic conditions, rather than recent immigration, as a factor driving the vote.
(Updated on 2021/11/22 to make the example more interesting and to avoid plm and vcov, which we haven't discussed.)
The dataset data/prod.dta contains production data for various companies from 1979 to 1986.
library(haven)
prod=read_dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AACD9OHn_yCnKFAX7hbEASVha/prod.dta?dl=1")
names(prod)
[1] "year" "id" "go" "m" "l"
[6] "k" "sic3dig" "countyear" "va"
Examine the data using a Cobb-Douglas production function in terms of value added; i.e. regress log value added on log capital and log labour (va contains value added, k the capital stock, and l labour, all in levels rather than logs). Run the regression with and without time dummies and comment on any differences.
On the basis of the regression with time dummies, examine the hypothesis that the production function has constant returns to scale (i.e. that the labour and capital coefficients add to 1).
The hypothesis is that the coefficients on log capital and log labour sum to one.
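In symbols, with \(\beta_{\log k}\) and \(\beta_{\log l}\) denoting the coefficients on log capital and log labour:

\[H_{0}:\;\beta_{\log k}+\beta_{\log l}=1\]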
library(dplyr)
prod=prod %>% mutate(year=factor(year))
mod0=lm(log(va)~log(k)+log(l), prod)
summary(mod0)
Call:
lm(formula = log(va) ~ log(k) + log(l), data = prod)
Residuals:
Min 1Q Median 3Q Max
-2.7851 -0.4169 0.0331 0.4492 2.0898
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.02744 0.11588 26.13 <2e-16 ***
log(k) 0.34639 0.02114 16.39 <2e-16 ***
log(l) 0.94000 0.03803 24.72 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7096 on 1165 degrees of freedom
Multiple R-squared: 0.7829, Adjusted R-squared: 0.7825
F-statistic: 2101 on 2 and 1165 DF, p-value: < 2.2e-16
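The second set of output below adds year dummies. A sketch of the corresponding call (the model is named mod1 here so that it matches the linearHypothesis call further down):

mod1=lm(log(va)~log(k)+log(l)+year, prod)
summary(mod1)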
Call:
lm(formula = log(va) ~ log(k) + log(l) + year, data = prod)
Residuals:
Min 1Q Median 3Q Max
-2.7397 -0.4336 0.0305 0.4477 2.1339
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.985474 0.128579 23.219 <2e-16 ***
log(k) 0.346498 0.021273 16.288 <2e-16 ***
log(l) 0.938894 0.038330 24.495 <2e-16 ***
year80 0.036981 0.082998 0.446 0.6560
year81 0.141971 0.083009 1.710 0.0875 .
year82 0.092141 0.082992 1.110 0.2671
year83 0.005795 0.082994 0.070 0.9443
year84 0.035601 0.083003 0.429 0.6681
year85 -0.061518 0.083138 -0.740 0.4595
year86 0.108689 0.083340 1.304 0.1924
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7091 on 1158 degrees of freedom
Multiple R-squared: 0.7846, Adjusted R-squared: 0.7829
F-statistic: 468.6 on 9 and 1158 DF, p-value: < 2.2e-16
We turn year into a categorical (factor) variable. Treating year as a factor estimates a separate effect for each individual year, i.e. the average impact on the dependent variable in that year relative to the base year (1979).
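To see the dummy encoding R creates from the factor, we can inspect the design matrix (an optional illustration, not part of the original analysis):

# Each year except the base year (1979) gets its own 0/1 indicator column
head(model.matrix(~ year, data = prod))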
Including time dummies can help with idiosyncratic time shocks; e.g. a recession will reduce companies' investment as well as their output. This could create a spurious positive correlation between capital and value added, which would give an upwardly biased estimate of the causal effect of capital. That said, in the example above, including time dummies does not have a big impact on the estimated coefficients.
library(car)
#library(plm)
#linearHypothesis(mod1,"log(k)+log(l)=1",vcov=vcovHC)
linearHypothesis(mod1,"log(k)+log(l)=1")
Linear hypothesis test
Hypothesis:
log(k) + log(l) = 1
Model 1: restricted model
Model 2: log(va) ~ log(k) + log(l) + year
Res.Df RSS Df Sum of Sq F Pr(>F)
1 1159 648.73
2 1158 582.19 1 66.541 132.35 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We then go on to test the hypothesis that the production function has constant returns to scale, formulated as log(k)+log(l)=1.
We need to install and load the extra package “car”, which includes the function linearHypothesis.
Constant returns are clearly rejected.
The variable sic3dig contains an industry classifier which groups the firms into 17 industries.
Why might it be useful to include industry classifiers in order to estimate the production function better?
Re-estimate the production function controlling for industry. Does your assessment about constant returns to scale change based on this new estimate?
Typically the residual from a production function estimation is interpreted as productivity. However, it is plausible that more productive firms will want to employ more production factors. This might lead to a correlation between residuals and the explanatory variables which could lead to biases. A big part of that might come from variations between sectors; i.e. some sectors are just more productive and profitable and those will also be the sectors that attract more capital and other production factors.
prod=prod %>% mutate(sic3dig=factor(sic3dig))
mod2=lm(log(va)~log(k)+log(l)+year+sic3dig, prod)
summary(mod2)
Call:
lm(formula = log(va) ~ log(k) + log(l) + year + sic3dig, data = prod)
Residuals:
Min 1Q Median 3Q Max
-2.88207 -0.40007 0.05372 0.44165 1.92283
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.216195 0.134458 23.920 < 2e-16 ***
log(k) 0.285436 0.022608 12.625 < 2e-16 ***
log(l) 0.995616 0.038889 25.602 < 2e-16 ***
year80 0.036104 0.077877 0.464 0.643019
year81 0.142005 0.077888 1.823 0.068537 .
year82 0.094111 0.077871 1.209 0.227085
year83 0.007594 0.077874 0.098 0.922338
year84 0.032116 0.077884 0.412 0.680161
year85 -0.071111 0.078036 -0.911 0.362352
year86 0.093891 0.078264 1.200 0.230517
sic3dig321 -0.107406 0.079196 -1.356 0.175304
sic3dig322 0.155755 0.079807 1.952 0.051224 .
sic3dig323 0.847057 0.239067 3.543 0.000411 ***
sic3dig324 0.034193 0.124769 0.274 0.784093
sic3dig331 -0.175586 0.085783 -2.047 0.040900 *
sic3dig332 0.260137 0.085636 3.038 0.002438 **
sic3dig341 0.733683 0.176477 4.157 3.46e-05 ***
sic3dig342 0.035795 0.095630 0.374 0.708245
sic3dig351 1.297901 0.175805 7.383 2.98e-13 ***
sic3dig352 0.597134 0.102678 5.816 7.84e-09 ***
sic3dig355 0.509182 0.237587 2.143 0.032313 *
sic3dig356 0.195583 0.095147 2.056 0.040050 *
sic3dig369 -0.239435 0.101992 -2.348 0.019065 *
sic3dig371 0.771417 0.139946 5.512 4.38e-08 ***
sic3dig381 0.174591 0.071427 2.444 0.014663 *
sic3dig383 0.367663 0.140038 2.625 0.008769 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6653 on 1142 degrees of freedom
Multiple R-squared: 0.813, Adjusted R-squared: 0.8089
F-statistic: 198.5 on 25 and 1142 DF, p-value: < 2.2e-16
linearHypothesis(mod2, "log(k)+log(l)=1")
Linear hypothesis test
Hypothesis:
log(k) + log(l) = 1
Model 1: restricted model
Model 2: log(va) ~ log(k) + log(l) + year + sic3dig
Res.Df RSS Df Sum of Sq F Pr(>F)
1 1143 564.78
2 1142 505.47 1 59.314 134.01 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
i.e. the hypothesis continues to be rejected: we still do not find constant returns to scale.
Which of the 17 industries has the largest number of observations?
Let's pick industries 311 and 321. For each of the two industries separately, estimate a Cobb-Douglas production function.
Would you say the functions are very different in the two industries?
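The counts below can be obtained with a dplyr tabulation; a minimal sketch (the n() column header in the output matches summarise(n())):

prod %>% group_by(sic3dig) %>% summarise(n())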
# A tibble: 17 x 2
sic3dig `n()`
<fct> <int>
1 311 400
2 321 88
3 322 88
4 323 8
5 324 32
6 331 72
7 332 72
8 341 16
9 342 56
10 351 16
11 352 48
12 355 8
13 356 56
14 369 48
15 371 24
16 381 112
17 383 24
The table reveals that the industry with the largest number of observations is 311.
library(dplyr)
prod= prod %>% mutate(sic3digf=factor(sic3dig))
mod311=lm(log(va)~log(k)+log(l)+year, prod %>% filter(sic3dig=="311"))
summary(mod311)
Call:
lm(formula = log(va) ~ log(k) + log(l) + year, data = prod %>%
filter(sic3dig == "311"))
Residuals:
Min 1Q Median 3Q Max
-2.30248 -0.41505 0.07536 0.45801 1.73235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.56445 0.20804 12.327 < 2e-16 ***
log(k) 0.41456 0.04113 10.079 < 2e-16 ***
log(l) 0.80244 0.07461 10.756 < 2e-16 ***
year80 0.08684 0.13017 0.667 0.505074
year81 0.26948 0.13017 2.070 0.039090 *
year82 0.39208 0.13033 3.008 0.002798 **
year83 0.23738 0.13065 1.817 0.070003 .
year84 0.22519 0.13086 1.721 0.086078 .
year85 0.07980 0.13209 0.604 0.546113
year86 0.48876 0.13271 3.683 0.000263 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6508 on 390 degrees of freedom
Multiple R-squared: 0.8318, Adjusted R-squared: 0.8279
F-statistic: 214.2 on 9 and 390 DF, p-value: < 2.2e-16
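The estimates for industry 321 below come from the analogous regression. A sketch of the call, with the model named mod321 by analogy with mod311:

mod321=lm(log(va)~log(k)+log(l)+year, prod %>% filter(sic3dig=="321"))
summary(mod321)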
Call:
lm(formula = log(va) ~ log(k) + log(l) + year, data = prod %>%
filter(sic3dig == "321"))
Residuals:
Min 1Q Median 3Q Max
-1.12790 -0.32740 0.02603 0.38229 0.95998
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.61053 0.41939 8.609 6.19e-13 ***
log(k) 0.20105 0.07984 2.518 0.0138 *
log(l) 1.04942 0.13980 7.507 8.45e-11 ***
year80 0.16420 0.22541 0.728 0.4685
year81 0.17994 0.22527 0.799 0.4268
year82 0.17906 0.22650 0.791 0.4316
year83 0.22520 0.22623 0.995 0.3226
year84 -0.03102 0.22546 -0.138 0.8909
year85 -0.16535 0.22671 -0.729 0.4680
year86 0.26596 0.22574 1.178 0.2423
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5282 on 78 degrees of freedom
Multiple R-squared: 0.8231, Adjusted R-squared: 0.8027
F-statistic: 40.32 on 9 and 78 DF, p-value: < 2.2e-16
In each case the labour coefficient is larger than the capital coefficient. However, the two sets of estimates are not necessarily very close: in sector 321 the capital coefficient is only about half of that for sector 311. Still, to check whether the functions are statistically different we should run a formal test, which is the subject of the next question.
Conduct a hypothesis test to compare the two functions formally. Note that for this you need to estimate both functions within a single regression model.
Are the interaction coefficients individually statistically different from zero?
Are they jointly significant?
mod_inter=lm(log(va)~ log(k) + log(l) + sic3digf:log(k)+sic3digf:log(l) +sic3digf*year, prod %>% filter(sic3dig=="311"|sic3dig=="321"))
summary(mod_inter)
Call:
lm(formula = log(va) ~ log(k) + log(l) + sic3digf:log(k) + sic3digf:log(l) +
sic3digf * year, data = prod %>% filter(sic3dig == "311" |
sic3dig == "321"))
Residuals:
Min 1Q Median 3Q Max
-2.30248 -0.38861 0.06615 0.42833 1.73235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.56445 0.20203 12.693 < 2e-16 ***
log(k) 0.41456 0.03994 10.378 < 2e-16 ***
log(l) 0.80244 0.07245 11.075 < 2e-16 ***
sic3digf321 1.04608 0.54098 1.934 0.053754 .
year80 0.08684 0.12642 0.687 0.492442
year81 0.26948 0.12641 2.132 0.033549 *
year82 0.39208 0.12657 3.098 0.002068 **
year83 0.23738 0.12688 1.871 0.061986 .
year84 0.22519 0.12708 1.772 0.077053 .
year85 0.07980 0.12828 0.622 0.534194
year86 0.48876 0.12888 3.792 0.000169 ***
log(k):sic3digf321 -0.21351 0.10355 -2.062 0.039770 *
log(l):sic3digf321 0.24698 0.18230 1.355 0.176128
sic3digf321:year80 0.07736 0.29788 0.260 0.795206
sic3digf321:year81 -0.08954 0.29772 -0.301 0.763739
sic3digf321:year82 -0.21302 0.29912 -0.712 0.476719
sic3digf321:year83 -0.01218 0.29896 -0.041 0.967512
sic3digf321:year84 -0.25620 0.29822 -0.859 0.390715
sic3digf321:year85 -0.24515 0.30008 -0.817 0.414374
sic3digf321:year86 -0.22280 0.29929 -0.744 0.456983
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.632 on 468 degrees of freedom
Multiple R-squared: 0.8326, Adjusted R-squared: 0.8258
F-statistic: 122.5 on 19 and 468 DF, p-value: < 2.2e-16
We continue to work with a subset of our data (industries 311 and 321). We introduce interaction terms into our model: the logged labour variable multiplied by the industry dummy, and the logged capital variable multiplied by the industry dummy. The t-tests on these interaction terms are what we are interested in. Remember that interaction terms should be interpreted as "effect modifiers": we are asking whether the industry modifies the relationship between labour/capital and value added.
Note that we also need to include interaction terms between industry and year if we want to replicate what happened when we ran two separate regressions: each industry had its own time effects. If we do not interact them in this new regression, we force both industries to have the same time effects.
The interaction coefficient on capital is significant at the 5 percent level, whereas the one on labour is not. However, to assess whether the production function really differs between the two industries, we need to test whether the two interaction coefficients are jointly zero.
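In equation form, the interacted model is (a sketch, suppressing the year dummies and their interactions with the industry dummy):

\[\log va=\beta_{0}+\beta_{k}\log k+\beta_{l}\log l+\delta_{0}D_{321}+\delta_{k}D_{321}\log k+\delta_{l}D_{321}\log l+\dots+\epsilon\]

where \(D_{321}\) is a dummy for industry 321. The joint test below asks whether \(\delta_{k}=\delta_{l}=0\).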
We can examine that with the linearHypothesis command:
linearHypothesis(mod_inter, c("log(k):sic3digf321=0","log(l):sic3digf321=0"))
Linear hypothesis test
Hypothesis:
log(k):sic3digf321 = 0
log(l):sic3digf321 = 0
Model 1: restricted model
Model 2: log(va) ~ log(k) + log(l) + sic3digf:log(k) + sic3digf:log(l) +
sic3digf * year
Res.Df RSS Df Sum of Sq F Pr(>F)
1 470 188.90
2 468 186.92 2 1.9749 2.4724 0.08549 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This suggests that the production function for sector 321 differs from that of sector 311 only weakly (significant at the 10% level but not at 5%).
Re-estimate your extended model from d) by allowing for firm fixed effects.
Does this change your assessment concerning the hypothesis that the production functions are identical in the two industries?
#mod_fe=lm(log(va)~log(k)+log(l)+sic3dig*log(k)+sic3dig*log(l)+sic3dig+year+factor(id), index=c("id","year"),data=prod %>% filter(sic3dig=="311"|sic3dig=="321"), model="within")
prod=prod %>% mutate(idf=factor(id))
mod_fe=lm(log(va)~log(k)+log(l)+sic3digf:log(k)+sic3digf:log(l)+year+year:sic3digf+idf, data=prod %>% filter(sic3dig=="311"| sic3dig=="321"))
summary(mod_fe)
Call:
lm(formula = log(va) ~ log(k) + log(l) + sic3digf:log(k) + sic3digf:log(l) +
year + year:sic3digf + idf, data = prod %>% filter(sic3dig ==
"311" | sic3dig == "321"))
Residuals:
Min 1Q Median 3Q Max
-1.57822 -0.20745 -0.00353 0.23509 1.38593
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.18111 1.95275 3.165 0.001665 **
log(k) 0.36901 0.14443 2.555 0.010983 *
log(l) 0.33302 0.12117 2.748 0.006253 **
year80 0.09849 0.08842 1.114 0.265953
year81 0.27402 0.08847 3.097 0.002087 **
year82 0.40832 0.08978 4.548 7.14e-06 ***
year83 0.26082 0.09258 2.817 0.005079 **
year84 0.23774 0.09619 2.472 0.013861 *
year85 0.13050 0.10316 1.265 0.206590
year86 0.53737 0.10929 4.917 1.28e-06 ***
idf2421 -1.81562 0.91718 -1.980 0.048421 *
idf5265 -2.14844 0.87344 -2.460 0.014316 *
idf15921 -0.68585 0.49274 -1.392 0.164708
idf16578 -0.88316 0.77937 -1.133 0.257803
idf16605 -1.33203 0.33018 -4.034 6.54e-05 ***
idf17775 -1.50578 0.86152 -1.748 0.081247 .
idf18117 -2.32688 0.85913 -2.708 0.007044 **
idf19188 -1.97386 0.89953 -2.194 0.028774 *
idf19287 -1.88645 1.04781 -1.800 0.072537 .
idf20331 -2.19312 0.88310 -2.483 0.013412 *
idf20925 -2.69578 0.83438 -3.231 0.001334 **
idf25686 -0.47739 0.22155 -2.155 0.031764 *
idf26118 -2.06988 0.73049 -2.834 0.004831 **
idf28341 -2.27233 0.89856 -2.529 0.011819 *
idf28629 -0.58776 0.48893 -1.202 0.230013
idf29259 -2.99966 3.11677 -0.962 0.336405
idf32004 -2.47214 0.78769 -3.138 0.001821 **
idf35856 -0.57077 0.50390 -1.133 0.257997
idf36468 -1.29362 0.85129 -1.520 0.129384
idf39906 -0.43909 0.43517 -1.009 0.313572
idf40068 -2.00950 0.73748 -2.725 0.006710 **
idf44694 -0.68058 0.65952 -1.032 0.302717
idf45081 -1.40526 0.82235 -1.709 0.088239 .
idf48042 -1.98575 0.78698 -2.523 0.012006 *
idf49347 -2.26081 0.80635 -2.804 0.005292 **
idf49815 -2.05689 0.92068 -2.234 0.026016 *
idf55674 -1.58149 0.86329 -1.832 0.067689 .
idf56439 -1.67649 0.88115 -1.903 0.057794 .
idf56637 -2.60704 0.88052 -2.961 0.003247 **
idf60588 -1.15167 0.63502 -1.814 0.070474 .
idf63252 -2.70866 2.97442 -0.911 0.363016
idf64116 -2.72762 2.79224 -0.977 0.329217
idf65097 -1.97071 2.90443 -0.679 0.497826
idf65151 -3.18579 2.57774 -1.236 0.217211
idf65322 -3.41494 2.66798 -1.280 0.201280
idf68274 -2.30133 0.86071 -2.674 0.007801 **
idf68589 -1.45674 0.77012 -1.892 0.059252 .
idf69336 -2.33470 2.63736 -0.885 0.376546
idf71442 -3.27453 2.81318 -1.164 0.245104
idf76365 -1.65666 0.97446 -1.700 0.089876 .
idf78588 -3.25944 3.40056 -0.959 0.338376
idf80442 -1.88779 0.86938 -2.171 0.030473 *
idf81414 -2.25434 3.08892 -0.730 0.465920
idf81666 -1.40122 0.77478 -1.809 0.071255 .
idf83394 -2.33158 0.90748 -2.569 0.010544 *
idf84258 -1.31066 0.84903 -1.544 0.123427
idf84807 -2.14712 0.94422 -2.274 0.023485 *
idf85032 -1.19273 0.93573 -1.275 0.203152
idf85941 -1.65192 0.73076 -2.261 0.024311 *
idf86292 -2.45983 0.73433 -3.350 0.000884 ***
idf86751 0.03953 0.42280 0.094 0.925549
idf87201 -0.29516 0.61324 -0.481 0.630553
idf90063 -2.87884 2.96654 -0.970 0.332402
idf92979 -1.80307 0.77708 -2.320 0.020815 *
idf93519 -1.73268 0.96749 -1.791 0.074048 .
idf93726 -1.07958 1.11043 -0.972 0.331515
idf94158 -2.55671 0.90185 -2.835 0.004811 **
idf96507 -2.29632 0.87808 -2.615 0.009248 **
idf98298 -2.16615 0.82550 -2.624 0.009014 **
idf99396 -1.46163 0.46264 -3.159 0.001699 **
log(k):sic3digf321 0.09899 0.28015 0.353 0.724017
log(l):sic3digf321 0.15616 0.21784 0.717 0.473861
sic3digf321:year80 0.02734 0.20852 0.131 0.895766
sic3digf321:year81 -0.11491 0.20821 -0.552 0.581303
sic3digf321:year82 -0.32608 0.21122 -1.544 0.123423
sic3digf321:year83 -0.12108 0.21294 -0.569 0.569938
sic3digf321:year84 -0.23963 0.21574 -1.111 0.267347
sic3digf321:year85 -0.20025 0.22277 -0.899 0.369226
sic3digf321:year86 -0.22805 0.22585 -1.010 0.313209
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4417 on 409 degrees of freedom
Multiple R-squared: 0.9285, Adjusted R-squared: 0.9149
F-statistic: 68.12 on 78 and 409 DF, p-value: < 2.2e-16
linearHypothesis(mod_fe, c("log(k):sic3digf321=0","log(l):sic3digf321=0"))
Linear hypothesis test
Hypothesis:
log(k):sic3digf321 = 0
log(l):sic3digf321 = 0
Model 1: restricted model
Model 2: log(va) ~ log(k) + log(l) + sic3digf:log(k) + sic3digf:log(l) +
year + year:sic3digf + idf
Res.Df RSS Df Sum of Sq F Pr(>F)
1 411 79.958
2 409 79.789 2 0.16941 0.4342 0.6481
It seems that once we allow for firm fixed effects the interaction coefficients are no longer significant, either individually or jointly.
Also note that even for sector 321 (which has the larger capital and labour elasticities) we no longer reject the hypothesis that returns are constant:
linearHypothesis(mod_fe,"log(k)+log(l)+log(k):sic3digf321+log(l):sic3digf321=1")
Linear hypothesis test
Hypothesis:
log(k) + log(l) + log(k):sic3digf321 + log(l):sic3digf321 = 1
Model 1: restricted model
Model 2: log(va) ~ log(k) + log(l) + sic3digf:log(k) + sic3digf:log(l) +
year + year:sic3digf + idf
Res.Df RSS Df Sum of Sq F Pr(>F)
1 410 79.794
2 409 79.789 1 0.00546 0.028 0.8672