More exercises to help you become an econometrics superstar
\[log(wage)=(\beta_0+\delta_0 female)+(\beta_1+\delta_1 female) \times educ + u\] where wage is the hourly wage, female is a gender dummy equal to 1 if the individual is female, and educ is the number of years of education. Provide an interpretation of the case \(δ_0<0\) and \(δ_1<0\).
This is a linear model where the intercept for men is \(\beta_0\) and for women it is \(\beta_0+\delta_0\). The change in log wages per year of education is \(\beta_1\) for men, whereas for women it is \(\beta_1+\delta_1\). Also note that because we have the log wage as the dependent variable, these coefficients can be interpreted as (approximate) percentage changes in the wage. If \(δ_0<0\) and \(δ_1<0\), this means that women earn less for a given level of education and also that the change in wage for a given change in education (i.e. the return to education) is lower for women.
The figure below illustrates this model. Note that \(\beta_1\) represents the slope of the line for men, whereas \(\delta_1\) represents how much flatter the line for women is compared to that for men.
This would require testing the joint hypothesis \(δ_0=0\) and \(δ_1=0\). This can be implemented via a joint (F-) test; in R, for example, with the linearHypothesis command from the car package.
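A minimal sketch of such a joint test in R, using simulated (made-up) data and coefficient values purely for illustration:

```r
# Simulate a toy wage dataset (numbers are illustrative, not from the exercise)
set.seed(1)
n <- 500
female <- rbinom(n, 1, 0.5)
educ <- sample(8:20, n, replace = TRUE)
lwage <- 1.5 + 0.08 * educ - 0.1 * female - 0.01 * female * educ + rnorm(n, sd = 0.4)

m <- lm(lwage ~ female * educ)  # estimates beta0, delta0, beta1, delta1

# With the car package installed, linearHypothesis runs the joint test directly:
if (requireNamespace("car", quietly = TRUE)) {
  print(car::linearHypothesis(m, c("female = 0", "female:educ = 0")))
}

# Equivalent base-R route: F-test comparing the restricted and the full model
anova(lm(lwage ~ educ), m)
```

The base-R anova() comparison gives the same F-statistic as linearHypothesis, since both test the two restrictions jointly.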
Man with 10 years: \(E\{log(wage)|Man\}=\beta_0+\beta_1 \times educ\)
Woman with 10 years: \(E\{log(wage)|Woman\}=\beta_0+\delta_0+(\beta_1+\delta_1) \times educ\)
The wage differential between a man and a woman with the same 10 years of education becomes \[E\{log(wage)|Man,10years\} - E\{log(wage)|Woman,10 years\}=-(\delta_0+\delta_1\times 10)\] \[=0.1+10\times0.01=0.2\] Thus we would expect the woman to have an approximately 20% lower wage.
Consequently the effect of education on women’s wages would become \(β_1+\delta_1=0.01-0.01=0\). This would mean that education has no effect on women’s wages.
The equation describes a hump shaped relationship between wages and age (since the squared term is negative). It therefore makes sense to find the top of the hump which will have an age gradient of 0. The gradient can be found by differentiating with respect to age: \[\frac{\partial W}{\partial AGE}=10-0.1\times 2 \times AGE\] Thus setting \(\frac{\partial W}{\partial AGE}\) equal to zero leads to \[AGE^{max.wage}=\frac{10}{0.1\times 2}=50\]
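More generally, for any quadratic \(W=\beta_0+\beta_1 AGE+\beta_2 AGE^2\) the turning point is \(-\beta_1/(2\beta_2)\). A quick check of the calculation in R:

```r
# Turning point of W = b0 + 10*AGE - 0.1*AGE^2:
# solve dW/dAGE = 10 + 2*(-0.1)*AGE = 0
b1 <- 10
b2 <- -0.1
age_max <- -b1 / (2 * b2)
age_max  # 50
```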
Consider the dataset ets_thres_final.csv. It contains emission figures (lnco2 = log of CO2 emissions) for a sample of firms regulated by the European Emissions Trading System (EU ETS) for the years 2005 to 2017, although the firm identifiers have gone missing from the dataset. Note that an Emissions Trading System requires firms to buy permits for every unit of CO2 they emit. By restricting the total number of permits that are issued, governments can control the total amount of emissions while allowing firms to trade permits freely, so that permits end up with those businesses that find it hardest to reduce emissions. In the early days of the EU ETS (which started in 2005) permits were given to firms for free. This changed from 2013 onwards, when free permits were only given to certain firms and sectors that were deemed at risk from foreign competition. The variable nonfree indicates those firms in the dataset. According to economic theory, the method of permit allocation should have no effect on the eventual emissions by firms (independence hypothesis): firms that have been given free permits still have an incentive to reduce emissions, as that frees up permits to sell within the permit market.
Run a regression of lnco2 on the nonfree variable. Report what you find.

library(dplyr)
df=read.csv("https://www.dropbox.com/s/urro3ty46kr4f7z/ets_thres_final.csv?dl=1")
df=df %>% mutate(nonfree=factor(nonfree),period=factor(period))
head(df)
X.1 X year COUNTRY_CODE xCOU period lnco2 free nonfree
1 1 1 2009 AT 1 1 9.179675 0 1
2 2 2 2010 AT 1 1 9.200492 0 1
3 3 3 2011 AT 1 1 9.326789 0 1
4 4 4 2012 AT 1 1 9.324919 0 1
5 5 5 2013 AT 1 2 9.332470 0 1
6 6 6 2014 AT 1 2 9.424483 0 1
Call:
lm(formula = lnco2 ~ nonfree, data = df)
Residuals:
Min 1Q Median 3Q Max
-9.3522 -0.7237 0.0633 0.8396 6.4050
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.85616 0.02427 406.09 <2e-16 ***
nonfree1 -0.50395 0.03361 -14.99 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.387 on 6827 degrees of freedom
(443 observations deleted due to missingness)
Multiple R-squared: 0.03188, Adjusted R-squared: 0.03174
F-statistic: 224.8 on 1 and 6827 DF, p-value: < 2.2e-16
# A tibble: 9 x 2
year `n()`
<int> <int>
1 2009 808
2 2010 808
3 2011 808
4 2012 808
5 2013 808
6 2014 808
7 2015 808
8 2016 808
9 2017 808
The firms that stop receiving free permits in 2013 pollute roughly 50% less on average over the 2009 to 2017 period.
Call:
lm(formula = lnco2 ~ period, data = df)
Residuals:
Min 1Q Median 3Q Max
-9.5943 -0.7241 0.0457 0.8979 6.6685
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.594291 0.025888 370.602 <2e-16 ***
period2 -0.001621 0.034425 -0.047 0.962
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.41 on 6827 degrees of freedom
(443 observations deleted due to missingness)
Multiple R-squared: 3.247e-07, Adjusted R-squared: -0.0001462
F-statistic: 0.002216 on 1 and 6827 DF, p-value: 0.9625
This shows that on average emissions after 2012 are 0.16 percent lower than before 2013, a value that is not significantly different from zero.
The results in part a) confound the treatment effect with any pre-existing firm characteristics that might have influenced the allocation of permits. For instance, it might well be that the most energy (and therefore pollution) intensive firms were given an exemption from having to buy all their permits. Hence, firms who have to buy permits (nonfree firms) are those with lower CO2 emissions to begin with.
In the dataset as it stands there is actually no variable that properly captures the treatment we are interested in. nonfree identifies firms that are eventually treated (the treatment being the bitter pill of having to pay for all their permits), but it is equal to 1 also in periods when they are not treated. It is easy, however, to create a dummy variable that is equal to one only for nonfree firms in periods when they are treated, i.e. during period 2. Let’s try that:
df=df %>% mutate( period2Xnonfree= (nonfree==1) & ( as.character(period)=="2" ) )
lm(lnco2~period2Xnonfree,df) %>% summary()
Call:
lm(formula = lnco2 ~ period2Xnonfree, data = df)
Residuals:
Min 1Q Median 3Q Max
-9.7086 -0.7132 0.0611 0.8741 6.5525
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.70864 0.02017 481.34 <2e-16 ***
period2XnonfreeTRUE -0.38987 0.03710 -10.51 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.399 on 6827 degrees of freedom
(443 observations deleted due to missingness)
Multiple R-squared: 0.01592, Adjusted R-squared: 0.01578
F-statistic: 110.5 on 1 and 6827 DF, p-value: < 2.2e-16
At face value this would suggest that not receiving free permits leads to roughly 39% lower CO2 emissions. However, there are at least two potential confounding factors:

1. The firms still getting free permits have not been selected at random.
2. There might be time effects present. For instance, after 2013 growth might have picked up following the recession of 2008.
We can control for the first issue by including nonfree as a control variable (it measures how different the nonfree firms were before they were made to buy all their permits). The second issue we can address with a period dummy variable. Hence:
Call:
lm(formula = lnco2 ~ period + nonfree + period2Xnonfree, data = df)
Residuals:
Min 1Q Median 3Q Max
-9.3960 -0.7187 0.0645 0.8428 6.4183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.80887 0.03675 266.940 < 2e-16 ***
period2 0.08385 0.04893 1.714 0.0866 .
nonfree1 -0.41288 0.05097 -8.100 6.45e-16 ***
period2XnonfreeTRUE -0.16108 0.06779 -2.376 0.0175 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.387 on 6825 degrees of freedom
(443 observations deleted due to missingness)
Multiple R-squared: 0.03268, Adjusted R-squared: 0.03225
F-statistic: 76.86 on 3 and 6825 DF, p-value: < 2.2e-16
Hence, this changes the coefficient for period2Xnonfree quite a bit; i.e. it would suggest that nonfree permit allocation reduces emissions by only 16%.
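Because the model is saturated in the two dummies, the interaction coefficient is exactly the difference-in-differences of the four group means. A minimal sketch with simulated (made-up) data illustrates this:

```r
# Toy difference-in-differences: the coefficient on the interaction term
# equals (treated post - treated pre) - (control post - control pre)
set.seed(42)
d <- expand.grid(firm = 1:200, post = 0:1)
d$treat <- as.integer(d$firm <= 100)
d$y <- 5 - 0.4 * d$treat + 0.1 * d$post - 0.15 * d$treat * d$post +
  rnorm(nrow(d), sd = 0.5)

m <- lm(y ~ treat + post + treat:post, data = d)

# Compute the same quantity directly from the four group means
means <- with(d, tapply(y, list(treat, post), mean))
did <- (means["1", "1"] - means["1", "0"]) - (means["0", "1"] - means["0", "0"])
all.equal(unname(coef(m)["treat:post"]), did)  # TRUE
```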
An alternative way of implementing this is via the : operator, which interacts (multiplies) variables “on the fly”:
Call:
lm(formula = lnco2 ~ period + nonfree + period:nonfree, data = df)
Residuals:
Min 1Q Median 3Q Max
-9.3960 -0.7187 0.0645 0.8428 6.4183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.80887 0.03675 266.940 < 2e-16 ***
period2 0.08385 0.04893 1.714 0.0866 .
nonfree1 -0.41288 0.05097 -8.100 6.45e-16 ***
period2:nonfree1 -0.16108 0.06779 -2.376 0.0175 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.387 on 6825 degrees of freedom
(443 observations deleted due to missingness)
Multiple R-squared: 0.03268, Adjusted R-squared: 0.03225
F-statistic: 76.86 on 3 and 6825 DF, p-value: < 2.2e-16
It’s also instructive to look at this graphically:
For this question use the dataset hals1prep.csv, containing data from the UK Health and Lifestyle Survey (1984-85). In this survey, several thousand people in the UK were asked questions about their health and lifestyle.
table(halsx$ownh_num)
1 2 3 4
1850 4563 2076 482
table(halsx$region)
east anglia east midlands greater london north
333 682 943 540
north west scotland south east south west
1092 925 1607 720
wales west midlands yorks/humber
498 823 808
Call:
lm(formula = bmi ~ 0 + region, data = halsx)
Residuals:
Min 1Q Median 3Q Max
-12.3808 -2.8505 -0.5398 2.2378 30.3695
Coefficients:
Estimate Std. Error t value Pr(>|t|)
regioneast anglia 24.5650 0.2417 101.6 <2e-16 ***
regioneast midlands 24.6908 0.1723 143.3 <2e-16 ***
regiongreater london 24.0111 0.1506 159.4 <2e-16 ***
regionnorth 24.6737 0.1943 127.0 <2e-16 ***
regionnorth west 24.7005 0.1371 180.2 <2e-16 ***
regionscotland 24.9136 0.1504 165.7 <2e-16 ***
regionsouth east 24.0898 0.1107 217.6 <2e-16 ***
regionsouth west 24.7633 0.1694 146.2 <2e-16 ***
regionwales 25.2405 0.2071 121.9 <2e-16 ***
regionwest midlands 24.5064 0.1614 151.8 <2e-16 ***
regionyorks/humber 24.6052 0.1585 155.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.08 on 7260 degrees of freedom
(1700 observations deleted due to missingness)
Multiple R-squared: 0.9731, Adjusted R-squared: 0.9731
F-statistic: 2.392e+04 on 11 and 7260 DF, p-value: < 2.2e-16
If we drop the intercept by writing 0+... the dummy coefficients represent the average BMI values by region. We see that Wales has the highest, with both Scotland and Wales above 24.9 and all other regions within the healthy range.
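A tiny made-up example illustrates the point: with the intercept dropped, the dummy coefficients are exactly the group means of the dependent variable.

```r
# Without an intercept, each dummy coefficient is simply the group mean
g <- factor(c("a", "a", "b", "b", "b"))
y <- c(1, 3, 4, 6, 8)
coef(lm(y ~ 0 + g))   # ga = 2, gb = 6
tapply(y, g, mean)    # identical values
```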
• Excellent (1)
• Good (2)
• Fair (3)
• Poor (4)
The numbers in brackets indicate how these options were recorded in the ownh_num variable. Run a regression of ownh_num on bmi and provide a discussion of what you find. Is it in line with your expectations?
Call:
lm(formula = ownh_num ~ bmi, data = halsx)
Residuals:
Min 1Q Median 3Q Max
-1.4558 -0.2020 -0.1069 0.8016 2.0259
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.777801 0.056678 31.367 < 2e-16 ***
bmi 0.014155 0.002278 6.213 5.48e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7948 on 7269 degrees of freedom
(1700 observations deleted due to missingness)
Multiple R-squared: 0.005283, Adjusted R-squared: 0.005146
F-statistic: 38.6 on 1 and 7269 DF, p-value: 5.481e-10
An increase in BMI by 1 unit increases the health score by 0.014 units. Because a higher value of the score implies worse health, this suggests a reduction in health, which is in line with expectations.
There might be a variety of confounding factors; e.g. richer people might be healthier and less overweight because they can afford higher quality food (making them slimmer) as well as better medical care (making them healthier for reasons unrelated to food intake and weight). Hence, because in this scenario money is negatively correlated with both BMI and the ownh_num health score, this leads to an upward bias; i.e. the true coefficient would be lower than what we found.
Education might play a similar role; i.e. better educated people will be healthier for a range of reasons (e.g. knowledge about health and how to get the best care), and the same knowledge might also allow them to eat better and gain less weight.
There might also be a direct reverse causality: people who are sicker might find it hard to exercise and/or make the effort of doing high quality cooking which would again lead to an upward bias in our regression.
However, note that one could imagine that this also goes the other way round: many diseases lead to extreme weight loss which would imply a downward bias in our regression.
Age might be another issue. Most people get a bit fatter as they age (well, at least I do). Now the question asks us to consider age when answering the question. However, there might be a systematic bias in how people respond to such questions. E.g. suppose older people tend to be more content than younger people, so that they are more often simply happy with their health. This would mean that age has a negative effect (more healthy) on the dependent variable. At the same time there is a positive effect on BMI. This would mean a negative correlation between the errors and the omitted variable, implying a downward bias.
Again the bias could go the other way round if for instance older people are more likely to become hypochondriacs.
• incomeB – a categorical variable representing income brackets, where “1” represents the lowest and “12” the highest income group
• agyrs – a variable recording the age of the participant
Include those in the regression of reported health from b) Discuss what the output suggests about the relationships between health and age, and health and income. Are they in line with what you would have expected? In each case can you provide an explanation for the kind of relationship found?
Also discuss the usefulness of including both the age and income controls for estimating the causal effect of BMI. In each case discuss at least one reason for and one reason against including these controls. [5 points]
Call:
lm(formula = ownh_num ~ bmi + agyrs + factor(incomeB), data = halsx)
Residuals:
Min 1Q Median 3Q Max
-1.64142 -0.34612 -0.08213 0.65138 2.18574
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9586826 0.1242120 15.769 < 2e-16 ***
bmi 0.0123175 0.0025664 4.800 1.63e-06 ***
agyrs -0.0018862 0.0006752 -2.794 0.00523 **
factor(incomeB)2 0.1639262 0.1097212 1.494 0.13522
factor(incomeB)3 0.1738317 0.1087402 1.599 0.10997
factor(incomeB)4 0.0090082 0.1099935 0.082 0.93473
factor(incomeB)5 -0.0846940 0.1069890 -0.792 0.42862
factor(incomeB)7 -0.2736317 0.1124488 -2.433 0.01499 *
factor(incomeB)8 -0.2231460 0.1126696 -1.981 0.04769 *
factor(incomeB)9 -0.1422836 0.1116748 -1.274 0.20268
factor(incomeB)10 -0.2764668 0.1221840 -2.263 0.02369 *
factor(incomeB)12 -0.3545969 0.1246759 -2.844 0.00447 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7814 on 5699 degrees of freedom
(3260 observations deleted due to missingness)
Multiple R-squared: 0.03819, Adjusted R-squared: 0.03633
F-statistic: 20.57 on 11 and 5699 DF, p-value: < 2.2e-16
The health score for higher income bands is lower, suggesting richer people tend to be healthier (or at least report being healthier). This is in line with what we would expect: richer people can afford better health care, and live in healthier houses in better neighbourhoods with less pollution etc.
The relation between age and health seems a bit more surprising as it suggests that older people report being healthier. But we have to remind ourselves that the question asked “how is your health given your age”. Hence, it could mean that older people lower their standards and are more content.
Another, more sinister, explanation is the following: suppose each generation has some people that are inherently healthy (e.g. based on their genes) and others that are more sickly. Clearly we would expect the healthier ones to be less likely to die and thereby to get older. This would mean that even if people respond in exactly the same way to the health question throughout their lives, the only old people remaining to respond to the survey are the ones that always reported being in great health.
We would want to include those variables if there is concern regarding some of the biases discussed in part c). For that they need not only to have an effect on the dependent variable (the health score) but also to be correlated with the BMI explanatory variable. See part c) for a more elaborate discussion.
An important reason not to include those is if we think the causal chain goes the other way round; e.g. it could be that people who are overweight have harder time in the job market making them poorer. Equally, being overweight might affect your chances of survival and thereby your age.
labels=c("excellent", "good", "fair", "poor")
for(i in 1:4){
  fr=halsx                         # start from a fresh copy of the data
  fr['dum']=fr$ownh_num==i         # dummy: did this respondent pick category i?
  fr['label']=labels[i]            # record which category this copy refers to
  if(i==1){
    longframe=fr
  } else {
    longframe=rbind(longframe,fr)  # stack the copies vertically
  }
  print(nrow(longframe))
}
[1] 8971
[1] 17942
[1] 26913
[1] 35884
Call:
lm(formula = dum ~ label, data = longframe)
Residuals:
Min 1Q Median 3Q Max
-0.50864 -0.23141 -0.20622 0.08254 0.94627
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.206220 0.004231 48.74 < 2e-16 ***
labelfair 0.025192 0.005984 4.21 2.56e-05 ***
labelgood 0.302419 0.005984 50.54 < 2e-16 ***
labelpoor -0.152491 0.005984 -25.48 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4007 on 35880 degrees of freedom
Multiple R-squared: 0.1436, Adjusted R-squared: 0.1435
F-statistic: 2005 on 3 and 35880 DF, p-value: < 2.2e-16
The regression reports the shares of responses to the health question in the data; i.e. from the intercept we can see that 20.6% of respondents report excellent health, (20.6+2.5)% respond that their health is fair, and so on.
Why is this the case?
Firstly, note that the for loop creates a new dataframe (called longframe) that is basically 4 copies (one for each possible health status answer category) of the original dataframe glued together (that is what rbind() does; i.e. it combines dataframes vertically). The only thing that differs between the 4 copies is the newly created variable dum. It is equal to 1 if a respondent answered the health question with the category that the copy corresponds to. With that in mind you need to remember what we learned about dummy variables as dependent and explanatory variables. For instance, we said that the constant represents the average of the dependent variable for the reference group. Here that is the set of observations in the copy created for the response “excellent”. So we get the average of a dependent variable that is equal to one for those people who responded “excellent”: the number of people responding “excellent” divided by the total number of respondents, i.e. the share that responded “excellent”. For the other groups the coefficients tell us how much higher (or lower) that share is.
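A minimal sketch with made-up responses (not the survey data) illustrates the mechanics:

```r
# Toy check: regressing a category dummy on the category label recovers shares
resp <- c("excellent", "good", "good", "fair", "poor",
          "good", "excellent", "fair", "good", "good")
labels <- c("excellent", "good", "fair", "poor")

# Stack one copy of the data per category, as in the loop above
long <- do.call(rbind, lapply(labels, function(l)
  data.frame(dum = as.integer(resp == l), label = l)))

coef(lm(dum ~ label, data = long))
# The intercept is the share answering "excellent" (the reference category);
# the other coefficients are differences in shares relative to it
mean(resp == "excellent")  # matches the intercept
```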
Air pollution has been shown to have a variety of adverse health effects. Recently, researchers have also started to investigate other negative effects. Below we report regression tables from a study that investigates a link between air pollution and car accidents.
Air pollution affects the respiratory system. If people cannot breathe as well, that might eventually affect their brain. Drivers might consequently be less able to focus and therefore more likely to cause accidents. Pollution could also cause poorer visibility, which leads to accidents.
There are a number of potentially confounding factors, for instance traffic: more traffic will cause more pollution but also makes accidents more likely. Similarly, weather factors such as heat or clouds induce more pollution but could also lead to more (or fewer) accidents. Weather could also work the other way round: rain could reduce pollution while increasing accidents; i.e. because pollution is negatively correlated with rain and rain is positively correlated with accidents, we have a downward bias.
Moreover, there could be a direct reverse causality: accidents cause traffic jams which increases pollution.
By including a range of weather controls we address some of the points raised in (b) which should be an improvement. However, a concern with weather variables as controls is that while on the one hand weather can cause pollution and accidents, pollution could also cause the weather (e.g. clouds forming because of particulate pollution) which in turn could affect accidents. The estimate in column 7 would not account for this causal effect.
Table 2 reports first stage regressions for this instrument. This allows us to check one of the three criteria for a valid instrument, namely whether it is a strong driver of the relevant endogenous variable. This seems to be the case here: not only is the inversion variable significant, the F-statistic is also rather high (larger than 10).
Given that temperature inversions are likely not driven by pollution, this could get around issues such as the traffic -> pollution nexus. However, inversions are also likely to drive other potentially pollution-causing weather events (e.g. clouds, rain). We can deal with that by including weather variables (as done in column 3). Of course, the same disclaimer applies as in part c); i.e. we might miss out on parts of the causal effect by doing so. Note that the effect of pollution actually becomes stronger when using the instrument (e.g. compare columns 1 and 6, but also 3 vs 7). This suggests that it addresses an endogeneity arising from a negative correlation between unobserved heterogeneity and the endogenous variable; e.g. it could be that there are fewer accidents when traffic goes up (and therefore pollution goes up) because traffic is moving more slowly.
The picture below summarises the various issues we discussed in this question:
Firstly, our goal is to identify the causal effect of pollution on accidents. This comprises a direct effect (e.g. via bad visibility), represented by arrow a, and could also include more indirect effects via weather, shown as arrows b and c. Simple OLS estimates of accidents on pollution will be biased because of confounding factors such as effects from traffic or weather on both pollution and accidents. Indeed, accidents themselves could affect traffic, which in turn could affect pollution (arrow i). Controlling for confounding factors such as traffic or weather can be helpful in finding an unbiased estimate. However, it could also mean that we ignore part of the causal effect we hope to find; e.g. in the figure we would miss the path shown via arrows b and c if we control for weather. An instrument such as temperature inversion is helpful as it drives pollution and is likely not affected by any endogenous factors (this ensures criteria 1 and 2 of IV estimation). However, there could be an issue with criterion 3 in an IV regression without further controls, because temperature inversions not only have an effect on pollution but might also cause a range of other weather phenomena. If we include weather as an additional control, as in column 3, we avoid this issue. However, we might then again shut down channel b-c. The good news in Table 3 is that with or without weather variables we find the same effect from pollution, which would suggest that channel b-c might not be so relevant.
Consider the dataset back2country_set.dta. It contains data on 71 countries for various 5 year periods from 1992 to 2012 (i.e. the period 2012 refers to the period from 2008 to 2012) Among other variables the dataset contains the following
• en_cleanOdirtyPclean – the share of clean innovations as a fraction of clean and dirty innovations (as measured from patent data)
• social_ht – the share of people in the country who report favouring higher taxes for environmental reasons
• ln_oil_PPP – the log of the country-level oil price (inclusive of taxes)
• period – a categorical variable referring to the different 5 year periods
• ccode – country codes

Note the dataset for this exercise is in dta (i.e. STATA) format. You can load it into R as follows:
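A minimal loading sketch (assuming the haven package is installed and the file has been downloaded to the working directory):

```r
# Load a STATA .dta file into R (assumes back2country_set.dta is in the
# working directory; adjust the path or use a URL as appropriate)
library(haven)
b2c <- read_dta("back2country_set.dta")
head(b2c)
```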
Run a regression of the clean innovation share on the social_ht and oil price variables, controlling for period effects. Report your regression in your answer. (A copy of your STATA output is sufficient.) Based on the regression, what do you expect happens to the share of clean innovations in response to a 5 percentage point increase in the share of the population supporting higher environmental taxes?
Call:
lm(formula = en_cleanOdirtyPclean ~ social_ht + ln_oil_PPP +
factor(period), data = b2c)
Residuals:
Min 1Q Median 3Q Max
-0.45419 -0.12094 0.01522 0.13474 0.45992
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.49010 0.22027 -2.225 0.0277 *
social_ht 0.02709 0.04747 0.571 0.5691
ln_oil_PPP 0.18331 0.03137 5.843 3.48e-08 ***
factor(period)1997 0.06825 0.05097 1.339 0.1828
factor(period)2002 0.08802 0.05209 1.690 0.0934 .
factor(period)2007 0.03250 0.05658 0.574 0.5666
factor(period)2012 0.11138 0.07342 1.517 0.1315
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1872 on 139 degrees of freedom
(591 observations deleted due to missingness)
Multiple R-squared: 0.3138, Adjusted R-squared: 0.2842
F-statistic: 10.59 on 6 and 139 DF, p-value: 1.135e-09
summary(b2c$social_ht)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.871 2.766 3.054 3.041 3.362 4.045 504
i.e. the regression suggests a positive (but not significant) coefficient for social attitudes (social_ht) of 0.027. Hence, a 5 percentage point increase in the share of people supporting higher environmental taxes would imply a 0.027 x 5 = 0.135 percentage point increase in the share of clean innovation.
Firms cater to their markets. If a country has more pro-environmental voters it will have more pro-environmental customers. Firms will respond by directing innovation toward products that can be marketed to those customers as environmentally friendly.
There are many possible omitted variable or reverse causality stories to be told here; e.g. pro environmental attitudes and stronger focus on clean innovation could be jointly driven by the level of income and development of a country. Success in a particular technology – e.g. clean technologies – might also by itself cause pro environmental attitudes.
Also, a higher oil price might be one of the channels via which pro-environmental attitudes affect innovation; e.g. pro-environmental attitudes lead to policies such as energy taxes. Hence, including this variable might mean we underestimate the full causal effect of attitudes. Having said that, one question we might ask in this research is whether attitudes have an impact on the direction of innovation irrespective of taxes or fuel prices. In that case it would be appropriate to include this control. This shows nicely that which controls you want to include depends in part on what exactly your analysis is trying to do.
Call:
lm(formula = en_cleanOdirtyPclean ~ social_ht + ln_oil_PPP +
factor(period) + factor(ccode), data = b2c)
Residuals:
Min 1Q Median 3Q Max
-0.23354 -0.04737 0.00000 0.05087 0.20513
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.565119 0.277232 -2.038 0.044200 *
social_ht 0.114364 0.035509 3.221 0.001736 **
ln_oil_PPP 0.121696 0.049067 2.480 0.014835 *
factor(period)1997 0.097519 0.027820 3.505 0.000689 ***
factor(period)2002 0.145598 0.030457 4.780 6.16e-06 ***
factor(period)2007 0.137128 0.040795 3.361 0.001107 **
factor(period)2012 0.241725 0.061112 3.955 0.000145 ***
factor(ccode)AU 0.178088 0.088830 2.005 0.047742 *
factor(ccode)BE 0.077369 0.075707 1.022 0.309323
factor(ccode)BG 0.280554 0.086384 3.248 0.001593 **
factor(ccode)BR 0.274031 0.102835 2.665 0.009010 **
factor(ccode)CA -0.013153 0.069869 -0.188 0.851072
factor(ccode)CH -0.140932 0.073520 -1.917 0.058160 .
factor(ccode)CN 0.225245 0.092335 2.439 0.016508 *
factor(ccode)CY -0.135501 0.114910 -1.179 0.241175
factor(ccode)CZ 0.331754 0.080399 4.126 7.74e-05 ***
factor(ccode)DE 0.068754 0.068500 1.004 0.317994
factor(ccode)DK 0.080852 0.079580 1.016 0.312136
factor(ccode)ES 0.270450 0.071852 3.764 0.000285 ***
factor(ccode)FI -0.142047 0.069755 -2.036 0.044410 *
factor(ccode)FR 0.013279 0.075705 0.175 0.861120
factor(ccode)GB 0.118911 0.071604 1.661 0.099977 .
factor(ccode)GR 0.287559 0.117060 2.457 0.015788 *
factor(ccode)HR 0.320115 0.080641 3.970 0.000137 ***
factor(ccode)HU 0.306553 0.084108 3.645 0.000430 ***
factor(ccode)ID 0.186398 0.133107 1.400 0.164562
factor(ccode)IE 0.174563 0.079000 2.210 0.029457 *
factor(ccode)IN 0.042168 0.108593 0.388 0.698629
factor(ccode)IT -0.217889 0.075439 -2.888 0.004768 **
factor(ccode)JP 0.301983 0.069631 4.337 3.51e-05 ***
factor(ccode)KR 0.034720 0.079368 0.437 0.662740
factor(ccode)LT 0.206439 0.115510 1.787 0.076998 .
factor(ccode)LU -0.161064 0.110599 -1.456 0.148508
factor(ccode)LV 0.478426 0.071231 6.717 1.23e-09 ***
factor(ccode)MX 0.198181 0.070821 2.798 0.006185 **
factor(ccode)NL -0.070671 0.080835 -0.874 0.384108
factor(ccode)NO 0.159818 0.082045 1.948 0.054284 .
factor(ccode)NZ 0.220711 0.075119 2.938 0.004116 **
factor(ccode)PL 0.317474 0.078613 4.038 0.000107 ***
factor(ccode)PT 0.340365 0.096022 3.545 0.000604 ***
factor(ccode)RO 0.267879 0.100362 2.669 0.008902 **
factor(ccode)RU 0.252490 0.109050 2.315 0.022678 *
factor(ccode)SE -0.203249 0.085854 -2.367 0.019878 *
factor(ccode)SK 0.307865 0.088092 3.495 0.000714 ***
factor(ccode)TH -0.009082 0.137805 -0.066 0.947586
factor(ccode)TR 0.164923 0.096589 1.707 0.090900 .
factor(ccode)US -0.013032 0.069369 -0.188 0.851369
factor(ccode)ZA 0.014158 0.083461 0.170 0.865650
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0926 on 98 degrees of freedom
(591 observations deleted due to missingness)
Multiple R-squared: 0.8817, Adjusted R-squared: 0.8249
F-statistic: 15.53 on 47 and 98 DF, p-value: < 2.2e-16
Note that the coefficient on social_ht is now 0.114. Hence a 10 percentage point increase would lead to a 0.114 x 10 pcp=1.14 pcp increase in the clean innovation share. Note that this is not only larger but also significant.
Including country fixed effects will deal with confounding factors that operate at the country level and are fixed over time; e.g. if the relationship found previously is in part driven by income, and relative country incomes haven’t changed much (while attitudes have), then this last regression might lead to a better (i.e. less biased) estimate.
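The fixed-effects logic can be checked on a toy panel: including unit dummies yields exactly the same slope as the within (demeaning) transformation, which is why fixed effects absorb anything constant within a country.

```r
# Toy panel (simulated data): dummy-variable FE vs within transformation
set.seed(7)
d <- data.frame(id = rep(1:10, each = 5), x = rnorm(50))
d$y <- rep(rnorm(10), each = 5) + 0.5 * d$x + rnorm(50, sd = 0.2)

# Slope from the dummy-variable ("fixed effects") regression
b_fe <- coef(lm(y ~ x + factor(id), data = d))["x"]

# Slope from regressing within-unit demeaned y on demeaned x
d$yd <- with(d, y - ave(y, id))
d$xd <- with(d, x - ave(x, id))
b_within <- coef(lm(yd ~ xd, data = d))["xd"]

all.equal(unname(b_fe), unname(b_within))  # TRUE
```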
Over recent years the UK has become increasingly xenophobic. An important question explored by many commentators is the economic damage that this xenophobia will cause. One way to examine this is by looking at the wages of foreign-born workers compared to UK-born ones. If wages of foreigners tend to be higher, it is likely that reducing the number of foreigners by terrorising them with hostile immigration procedures – one of Theresa May’s flagship policies - will have negative economic consequences for the native population as well. The dataset lfsclean.dta (https://www.dropbox.com/s/0mvyckpzsssi5k2/lfsclean.dta?dl=1) contains data from the quarterly Labour Force Survey for the years 2010 to 2018.
Among other variables it includes the following:
• lngrsswk: log of the average weekly gross wage
• edu: years in education
• foreign: a dummy variable indicating that a person was born abroad
• quarter
• year
library(haven)
library(dplyr)  # for the %>% pipe
lfs=read_dta("https://github.com/mondpanther/datastorieshub/blob/master/data/lfsclean.dta?raw=true")
lm(lngrsswk~foreign,lfs) %>% summary()
Call:
lm(formula = lngrsswk ~ foreign, data = lfs)
Residuals:
Min 1Q Median 3Q Max
-5.9138 -0.4349 0.0939 0.5443 5.4612
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.877326 0.001468 4003.762 <2e-16 ***
foreign 0.036460 0.003996 9.124 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8338 on 372925 degrees of freedom
(2937143 observations deleted due to missingness)
Multiple R-squared: 0.0002232, Adjusted R-squared: 0.0002205
F-statistic: 83.24 on 1 and 372925 DF, p-value: < 2.2e-16
The regression suggests that foreign workers earn (on average) 3.6% more than natives.
It could be the case that the foreign variable in the regression above is partially endogenous. For instance, foreigners may arrive in periods when the economy is doing better (across years, but also within a given year – e.g. for seasonal work) and wages in general therefore tend to be higher. Note that this would introduce a positive correlation between foreign and shocks to wages, which might bias our coefficient upward. By including year and quarter dummies we can account for that. The regression below does so, finding a slightly lower (but still significant) coefficient of about 0.03.
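To see how factor() expands year and quarter into one dummy per level in a regression like the one below, here is a minimal sketch on simulated data with the same structure (all values made up; with the real data, df would be replaced by lfs):

```r
# Minimal sketch: factor() creates one dummy per year/quarter level
set.seed(1)
df <- data.frame(
  lngrsswk = rnorm(400, mean = 5.9, sd = 0.8),
  foreign  = rbinom(400, 1, 0.2),
  year     = sample(2010:2018, 400, replace = TRUE),
  quarter  = sample(1:4, 400, replace = TRUE)
)
m <- lm(lngrsswk ~ foreign + factor(year) + factor(quarter), data = df)
# intercept + foreign + 8 year dummies + 3 quarter dummies = 13 coefficients
length(coef(m))
```

One level of each factor (2010, quarter 1) is absorbed into the intercept, which is why there are 8 rather than 9 year dummies.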
Call:
lm(formula = lngrsswk ~ foreign + factor(year) + factor(quarter),
data = lfs)
Residuals:
Min 1Q Median 3Q Max
-5.9979 -0.4271 0.0920 0.5535 5.4262
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.789088 0.004442 1303.316 < 2e-16 ***
foreign 0.029855 0.003990 7.482 7.31e-14 ***
factor(year)2011 0.014416 0.005428 2.656 0.007916 **
factor(year)2012 0.042441 0.005441 7.800 6.20e-15 ***
factor(year)2013 0.064254 0.005465 11.756 < 2e-16 ***
factor(year)2014 0.083115 0.005456 15.233 < 2e-16 ***
factor(year)2015 0.104964 0.005480 19.153 < 2e-16 ***
factor(year)2016 0.123267 0.005583 22.080 < 2e-16 ***
factor(year)2017 0.165312 0.005551 29.781 < 2e-16 ***
factor(year)2018 0.193901 0.007096 27.326 < 2e-16 ***
factor(quarter)2 0.009685 0.003751 2.582 0.009819 **
factor(quarter)3 0.013639 0.003909 3.489 0.000485 ***
factor(quarter)4 0.019056 0.003905 4.880 1.06e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8318 on 372914 degrees of freedom
(2937143 observations deleted due to missingness)
Multiple R-squared: 0.005003, Adjusted R-squared: 0.004971
F-statistic: 156.3 on 12 and 372914 DF, p-value: < 2.2e-16
When education is included, the foreign coefficient becomes significantly negative (see below). Also note that the education coefficient is positive and significant (one more year of education implies 8% higher wages). Hence an important reason why foreigners earn more appears to be that they tend to be more highly educated than the native population. Put differently: foreigners with similar education levels seem to earn less than natives.
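The sign flip can be rationalised with the omitted variable bias formula: the foreign coefficient without edu equals the coefficient with edu plus the returns to education times the average education gap between foreigners and natives. A back-of-envelope sketch (coefficients copied from the two regression outputs; this ignores that the estimation samples differ slightly):

```r
# Omitted variable bias: beta_short ≈ beta_long + beta_edu * delta,
# where delta is the foreign-native gap in average years of education
beta_short <- 0.029855    # foreign coefficient, no education control
beta_long  <- -0.1588243  # foreign coefficient, with education control
beta_edu   <- 0.0808206   # returns to a year of education
delta <- (beta_short - beta_long) / beta_edu
round(delta, 1)  # implied gap: about 2.3 extra years of education
```

So foreigners would need to have roughly 2.3 more years of education on average to reconcile the two estimates.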
Call:
lm(formula = lngrsswk ~ foreign + edu + factor(year) + factor(quarter),
data = lfs)
Residuals:
Min 1Q Median 3Q Max
-6.3227 -0.3982 0.0795 0.4893 5.2480
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.7309376 0.0074174 637.815 < 2e-16 ***
foreign -0.1588243 0.0037825 -41.989 < 2e-16 ***
edu 0.0808206 0.0004485 180.206 < 2e-16 ***
factor(year)2011 0.0036654 0.0049737 0.737 0.461153
factor(year)2012 0.0229880 0.0049866 4.610 4.03e-06 ***
factor(year)2013 0.0383417 0.0050063 7.659 1.88e-14 ***
factor(year)2014 0.0561895 0.0049999 11.238 < 2e-16 ***
factor(year)2015 0.0752652 0.0050251 14.978 < 2e-16 ***
factor(year)2016 0.0850140 0.0051215 16.600 < 2e-16 ***
factor(year)2017 0.1172969 0.0050895 23.047 < 2e-16 ***
factor(year)2018 0.1433484 0.0064992 22.056 < 2e-16 ***
factor(quarter)2 0.0080568 0.0034338 2.346 0.018961 *
factor(quarter)3 0.0049519 0.0035772 1.384 0.166270
factor(quarter)4 0.0125272 0.0035737 3.505 0.000456 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7489 on 360915 degrees of freedom
(2949141 observations deleted due to missingness)
Multiple R-squared: 0.0871, Adjusted R-squared: 0.08707
F-statistic: 2649 on 13 and 360915 DF, p-value: < 2.2e-16
The regression in (a) would be more appropriate. In most cases foreign workers bring their education with them (so the UK public doesn't have to pay for it), and an increase in foreign therefore also tends to "cause" an increase in education. The combined effect of that is the contribution of a foreign relative to a native worker.
We can examine this by allowing for a different education effect for foreigners. The regression below shows a negative (and significant) interaction effect (foreign × edu). The coefficient suggests that the wage increase for an additional year of education is 1.5 percentage points smaller for foreigners.
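The implied returns to education for each group can be computed from the interaction model's coefficients (values copied from the regression output below):

```r
# Returns to an extra year of education, natives vs foreigners
b_edu <- 0.0837327    # edu main effect (natives)
b_int <- -0.0150768   # foreign:edu interaction
round(100 * c(native = b_edu, foreign = b_edu + b_int), 1)
# natives gain about 8.4% per extra year, foreigners about 6.9%
```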
Call:
lm(formula = lngrsswk ~ foreign * edu + factor(year) + factor(quarter),
data = lfs)
Residuals:
Min 1Q Median 3Q Max
-6.3371 -0.3978 0.0791 0.4891 5.2570
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.6901890 0.0080247 584.470 < 2e-16 ***
foreign 0.0782408 0.0182375 4.290 1.79e-05 ***
edu 0.0837327 0.0004991 167.776 < 2e-16 ***
factor(year)2011 0.0038495 0.0049725 0.774 0.43884
factor(year)2012 0.0230811 0.0049854 4.630 3.66e-06 ***
factor(year)2013 0.0382001 0.0050051 7.632 2.31e-14 ***
factor(year)2014 0.0559443 0.0049988 11.192 < 2e-16 ***
factor(year)2015 0.0750334 0.0050239 14.935 < 2e-16 ***
factor(year)2016 0.0848432 0.0051202 16.570 < 2e-16 ***
factor(year)2017 0.1170249 0.0050883 22.999 < 2e-16 ***
factor(year)2018 0.1428669 0.0064978 21.987 < 2e-16 ***
factor(quarter)2 0.0080281 0.0034330 2.338 0.01936 *
factor(quarter)3 0.0047887 0.0035763 1.339 0.18057
factor(quarter)4 0.0124753 0.0035729 3.492 0.00048 ***
foreign:edu -0.0150768 0.0011347 -13.288 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7487 on 360914 degrees of freedom
(2949141 observations deleted due to missingness)
Multiple R-squared: 0.08755, Adjusted R-squared: 0.08751
F-statistic: 2473 on 14 and 360914 DF, p-value: < 2.2e-16
Consider the dataset unempprep.csv. It contains data for various regions (wards) of the UK; there are over 10,000 wards. For a long time the UK government has supported businesses that invest in disadvantaged areas by covering up to 35% of an investment a business undertakes if it promises to create or safeguard jobs in areas deemed disadvantaged by the government. In 2000 a review changed which areas were considered disadvantaged. In some cases there was also a change in the intensity of support; i.e. in some areas businesses received up to 35% support whereas in others the support would only amount to a maximum of 10%. In relation to that, the dataset contains (among others) the following variables:
• DDDln1Punemp: the change in the (log) number of unemployed people in a ward between 2002 and 1997; i.e. the log of the number of unemployed people in a ward in 2002 minus the log of the number of unemployed people in 1997.
• DDDNGE: the change in the support level between 2002 and 1997; i.e. the support level in 2002 minus the support level in 1997. E.g. if the support level was 10% in 2002 and 35% in 1997, DDDNGE would be equal to -0.25.
See the regression below. We find a statistically significant coefficient of -0.221 for DDDNGE. Note that NGE – the support rate for investment projects by the government – is recorded in decimals (i.e. a 20% support rate is recorded as 0.2). Hence the value implies that a 10 percentage point increase in support (i.e. DDDNGE = 0.1) would lead to a 0.1 × 0.221 = 2.21% reduction in unemployment.
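The arithmetic behind this interpretation, using the coefficient reported in the regression output further below:

```r
# Marginal effect of a 10 percentage point rise in the support rate
b <- -0.221062  # OLS coefficient on DDDNGE
d <- 0.10       # a 10 percentage point increase, in decimals
round(100 * b * d, 2)  # -2.21: unemployment falls by about 2.21%
```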
up=read.csv("https://github.com/mondpanther/datastorieshub/blob/master/data/unempprep.csv?raw=true")
names(up)
[1] "X" "wardcode"
[3] "year" "NGE"
[5] "ttwacode_1984" "unemp"
[7] "actrate_1998" "resid_emp_rate92"
[9] "resid_unemp_rate92" "rate_2000"
[11] "rate_1993" "actrate_1991"
[13] "LRunemp_1991" "manufshare_1991"
[15] "occupation_1991" "popdens_1981"
[17] "gdp91" "gdp94to96"
[19] "popdens_1991" "current_unemprate1991"
[21] "current_unemprate1998" "strunemp8690"
[23] "strunemp9397" "vatgrowth8791"
[25] "vatgrowth9598" "resid_emp_rate9698"
[27] "resid_unemp_rate9698" "manufshare_9698"
[29] "districtcode" "ELI00"
[31] "ELI93" "grate_00"
[33] "X_est_r00" "p00_p0"
[35] "p00_p1" "p00_p15"
[37] "p00_p2" "p00_p3"
[39] "p00_p35" "iv00"
[41] "grate_93" "X_est_r93"
[43] "p93_p0" "p93_p2"
[45] "p93_p3" "iv93"
[47] "X_merge" "xnivav"
[49] "ln1Punemp" "lnunemp"
[51] "wardx" "lnunemp_1997"
[53] "DDDlnunemp" "ln1Punemp_1997"
[55] "DDDln1Punemp" "xnivav_1997"
[57] "DDDxnivav" "NGE_1997"
[59] "DDDNGE"
table(up$year)
2002
10764
summary(up$DDDNGE)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.30000 0.00000 0.00000 -0.01505 0.00000 0.35000
summary(up$year)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2002 2002 2002 2002 2002 2002
Call:
lm(formula = DDDln1Punemp ~ DDDNGE, data = up)
Residuals:
Min 1Q Median 3Q Max
-3.5451 -0.1907 0.0165 0.2109 2.8601
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.462220 0.003648 -126.703 < 2e-16 ***
DDDNGE -0.221062 0.036012 -6.138 8.62e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3743 on 10762 degrees of freedom
Multiple R-squared: 0.003489, Adjusted R-squared: 0.003396
F-statistic: 37.68 on 1 and 10762 DF, p-value: 8.624e-10
#library(AER)
#ivreg(DDDln1Punemp~DDDNGE | DDDxnivav,data=up) %>% summary()
Subsidies are given to disadvantaged areas. Hence it could be the case that a positive shock to unemployment growth (e.g. in an economically shrinking area) leads to higher subsidy levels. Consequently we would get an upward bias in our estimate implying that we would underestimate the effect of the subsidy.
Consider the variable DDDxnivav. Run a regression of the variable DDDNGE on this instrumental variable. What do you find and what does this tell you about the validity of the instrument? Can you implement further checks that would help you understand whether you are dealing with a valid instrument?
Call:
lm(formula = DDDNGE ~ DDDxnivav, data = up)
Residuals:
Min 1Q Median 3Q Max
-0.36533 -0.00347 0.00340 0.01953 0.43120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0016793 0.0009968 -1.685 0.0921 .
DDDxnivav 0.8822862 0.0257834 34.219 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.09514 on 10762 degrees of freedom
Multiple R-squared: 0.09813, Adjusted R-squared: 0.09804
F-statistic: 1171 on 1 and 10762 DF, p-value: < 2.2e-16
The regression above is the first stage regression for that instrument. The instrument is a significant driver of the explanatory variable of interest. This is promising; however, to make sure we are not suffering from a weak instrument problem we need to check the F-test of the hypothesis that the instrument has no effect on the explanatory variable. For that we can use the linearHypothesis command as shown below. We see that the instrument is strongly (F statistic = 1171 > 10) correlated with the explanatory variable. Hence, as far as criterion 2 of an IV estimator is concerned, this is a good instrument. Of course this tells us nothing about the other criteria, which we cannot test with the data.
library(AER)
reg2=lm(DDDNGE~DDDxnivav,data=up)
linearHypothesis(reg2,c("DDDxnivav=0"))
Linear hypothesis test
Hypothesis:
DDDxnivav = 0
Model 1: restricted model
Model 2: DDDNGE ~ DDDxnivav
Res.Df RSS Df Sum of Sq F Pr(>F)
1 10763 108.021
2 10762 97.421 1 10.6 1171 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now use the instrument to estimate the effect of the subsidy (DDDNGE) on unemployment. Do you find an estimate that is larger or smaller than the value you found in part (a)? Can you motivate why? We can use the ivreg() command:
Call:
ivreg(formula = DDDln1Punemp ~ DDDNGE | DDDxnivav, data = up)
Residuals:
Min 1Q Median 3Q Max
-3.5411 -0.1917 0.0158 0.2149 2.8641
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.466218 0.004011 -116.237 < 2e-16 ***
DDDNGE -0.486785 0.115253 -4.224 2.42e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3752 on 10762 degrees of freedom
Multiple R-Squared: -0.001552, Adjusted R-squared: -0.001645
Wald test: 17.84 on 1 and 10762 DF, p-value: 2.424e-05
We see that the effect of subsidies on unemployment growth is much stronger than in part (a); i.e. we now have a coefficient of -0.49. This suggests that the OLS estimate indeed suffers from an upward bias because of a positive correlation between the error term and the subsidy variable; i.e. areas with higher growth of unemployment are more likely to get an increase in their subsidy levels.
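What ivreg() does can be replicated by hand with two OLS stages. A minimal sketch on simulated data (all numbers made up) in which the regressor is positively correlated with the error, so that OLS is biased upward – mirroring the subsidy story:

```r
# Manual 2SLS on simulated data: true causal effect of x on y is -0.5,
# but x is positively correlated with the error e, biasing OLS upward
set.seed(42)
n <- 5000
z <- rnorm(n)                      # instrument: shifts x, unrelated to e
e <- rnorm(n)                      # structural error
x <- 0.8 * z + 0.5 * e + rnorm(n)  # endogenous regressor
y <- -0.5 * x + e
ols  <- coef(lm(y ~ x))["x"]         # biased towards zero
xhat <- fitted(lm(x ~ z))            # first stage fitted values
iv   <- coef(lm(y ~ xhat))["xhat"]   # second stage: close to -0.5
round(c(OLS = ols, IV = iv), 2)
```

Note that the standard errors from this manual second stage are not correct (they ignore that xhat is estimated); ivreg() computes the proper ones.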
This question is based on an analysis of daily electricity consumption data for a sample of more than 1000 UK households from October 2018 until September 2020. Importantly, this includes the period from late March 2020 when the UK government first imposed strict lockdown measures.
We find a point estimate of 0.78895.
This implies that households (on average) used 0.78895 kWh more electricity per day over the period since lockdown was imposed. The estimate is highly significant (i.e. we would reject the null hypothesis of "no lockdown effect" at very low significance levels, well below 1%). It is plausible that people used more electricity when they had to stay at home rather than go out. Many would also have worked from home, using energy-hungry computers.
We find a point estimate of 0.09, suggesting that on average households consumed about 9.9% more electricity. This is a statistically significant result.
The only control included are month dummies. Hence, the estimate would assign anything that is different in 2020 from 2019 (and late 2018) as an effect of lockdown. A particular concern could be the weather (e.g. if 2020 was much colder or warmer so that people would use more (or less energy for heating)). Another issue could be ongoing changes in technology; e.g. people will have bought more electric vehicles in 2020 than in 2019 that they are charging at home (which would lead to an upward bias). Alternatively: every year technological improvements mean efficiency of electric devices improves which would imply a downward bias, etc.
Electricity consumption is (on average) highest in December and lowest in August. In December nights are longest and it is cold, so people use more electricity for heating and lighting. They might also cook extravagant meals during the Christmas season and use festive lighting. In August days are long and many people are on holiday abroad. Average consumption in January corresponds to the constant, i.e. 12.35 kWh. (Note: this uses the result from part (a), not part (b).)
This regression does not include controls for the calendar month. We now find a significantly negative estimate for the lockdown effect, suggesting that consumption went down by 1.2 kWh because of lockdown. This is because the dataset stops in September 2020; hence the lockdown period considered coincides with the part of the year in which less electricity is consumed. When we do not include month dummies we confound this seasonal effect with the lockdown effect, leading to a severely downward biased estimate.
Download the cigs.csv dataset. It contains data on cigarette sales across US states for the years 1985 and 1995.
Run a regression of log(packs) on log(price). Discuss the result and provide an interpretation of the coefficient on log(price).
library(dplyr)
cigs=read.csv("https://github.com/mondpanther/datastorieshub/raw/master/code/cigs.csv")
lm(log(packs)~log(price),cigs) %>% summary()
Call:
lm(formula = log(packs) ~ log(price), data = cigs)
Residuals:
Min 1Q Median 3Q Max
-0.61893 -0.08318 0.00114 0.09374 0.52046
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.28600 0.30459 23.921 < 2e-16 ***
log(price) -0.53280 0.06179 -8.623 1.55e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.183 on 94 degrees of freedom
Multiple R-squared: 0.4416, Adjusted R-squared: 0.4357
F-statistic: 74.35 on 1 and 94 DF, p-value: 1.552e-13
The log(price) coefficient can be interpreted as the price elasticity of cigarette demand; i.e. a 1% increase in price would lead to a 0.53% decline in demand.
We would expect the state level price to be driven by both supply and demand. Suppose a state is economically less successful and therefore people have less money to spend on cigarettes. Such a state will have a lower cigarette demand for a given price. However, cigarette vendors might also decide to lower prices in such a state. This would introduce a positive correlation between income and prices leading to an upward bias in a simple price estimate as above.
To be a valid instrument taxes need to be un-correlated with any factors apart from price that are driving demand. This might be the case if taxes are set by government for largely other reasons (e.g. to find revenue or political ideology) or in response to factors only related to supply (e.g. a state where many producers are located might be more susceptible to lobbying by those producers).
To be a valid instrument taxes also need to be a strong enough driver of prices. This is something we can check. It is also theoretically plausible as long as producers pass on some of the tax to consumers, although if demand is highly elastic this might not be the case. Also, it could be that government behaviour is driven by demand; e.g. the presence of cigarette producers in a state could not only affect government policy, but because many people work in the cigarette industry they may also be avid consumers. In other words: a positive shock to cigarette demand would explain low taxes. Hence, this would create a downward bias when estimating the true causal effect.
Call:
lm(formula = log(price) ~ log(tax), data = cigs)
Residuals:
Min 1Q Median 3Q Max
-0.28636 -0.10804 0.00304 0.09524 0.50400
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.16155 0.15615 13.84 <2e-16 ***
log(tax) 0.74773 0.04213 17.75 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1465 on 94 degrees of freedom
Multiple R-squared: 0.7701, Adjusted R-squared: 0.7677
F-statistic: 315 on 1 and 94 DF, p-value: < 2.2e-16
This is the first stage regression; i.e. we regress prices on taxes. It shows that taxes are a highly significant driver of prices. Because we are dealing with a univariate model, there is no need to conduct a separate F-test: we can look at the F statistic reported as part of the standard regression summary. With 315 it is sufficiently high (larger than 10). Hence criterion 2 required for a successful IV is met.
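With a single regressor, the reported F statistic is simply the square of the t statistic on the instrument, which is why no separate test is needed here:

```r
# In a univariate regression, F = t^2
t_tax <- 17.75   # t statistic on log(tax) in the first stage output
round(t_tax^2)   # 315, matching the F statistic in the summary
```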
If there was a need to implement a separate F-test (e.g. if we had additional control variables included), we could do that as follows:
library(AER)
first=lm(log(price)~log(tax),cigs)
linearHypothesis(first,"log(tax)=0")
Linear hypothesis test
Hypothesis:
log(tax) = 0
Model 1: restricted model
Model 2: log(price) ~ log(tax)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 95 8.7738
2 94 2.0167 1 6.7571 314.96 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that this leads to the same F statistic as reported in the regression output.
Call:
ivreg(formula = log(packs) ~ log(price) | log(tax), data = cigs)
Residuals:
Min 1Q Median 3Q Max
-0.598254 -0.104838 -0.008713 0.100067 0.525173
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.65158 0.34958 21.888 < 2e-16 ***
log(price) -0.60711 0.07095 -8.557 2.14e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1844 on 94 degrees of freedom
Multiple R-Squared: 0.4331, Adjusted R-squared: 0.427
Wald test: 73.22 on 1 and 94 DF, p-value: 2.139e-13
This is the second stage of the 2SLS IV regression. We find a price elasticity of -0.6; i.e. a 1% increase in price leads to a 0.6% decline in demand. If this is a valid IV, it implies that the OLS estimate in part (a) was upward biased, in line with our suggested reason for a bias in (b).
Eating fast food could have an impact on the weight of the student. If weight also has an impact on grades, this is part of the causal effect we are interested in. Hence we definitely don't want to include it as a control. The other options all relate to variables that are fixed once the experiment starts; those variables cannot be influenced by whatever happens during the experiment, so they cannot lead to biases.
What does a slope coefficient β = -0.46 mean?
• That the correlation is significant and positive.
• That there is no predictive power in your independent variable.
• That for every unit increase in your independent variable, your dependent variable decreases by 0.46 units.
• That if the dependent variable decreases by 0.46 units, the explanatory variable increases by 1.
The third option is correct: for every unit increase in the independent variable, the dependent variable decreases by 0.46 units.
partyrank ranks universities according to the quality of the local party scene (i.e. the uni with the best party scene has rank 1). What does the regression suggest about the relationship between party rank and covid cases?
Call:
lm(formula = casesOstudent ~ partyrank, data = datafinal2)
Residuals:
Min 1Q Median 3Q Max
-0.05011 -0.02694 -0.01124 0.01172 0.20685
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.937e-02 4.823e-03 12.310 < 2e-16 ***
partyrank -5.224e-05 1.205e-05 -4.336 2.08e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.04017 on 260 degrees of freedom
Multiple R-squared: 0.06743, Adjusted R-squared: 0.06384
F-statistic: 18.8 on 1 and 260 DF, p-value: 2.078e-05
Moving to one lower rank (e.g. from rank 4 to rank 5, i.e. a slightly worse party scene) is associated with about 5.2 × 10⁻⁵ fewer covid cases per student. In other words, universities with better party scenes tend to have more covid cases per student.
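Scaling the coefficient makes the magnitude easier to judge (coefficient copied from the regression output above):

```r
# Predicted difference in cases per student, best vs worst party scene
b <- -5.224e-05   # partyrank coefficient
round(b * 99, 4)  # rank 1 vs rank 100: about 0.0052 more cases per
                  # student at the top-ranked party uni, i.e. roughly
                  # 0.5 extra cases per 100 students
```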
oilprice is the spot price for oil computed at the monthly level from 2010-19. Based on what is reported below, which of the following can you conclude?
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression none
Call:
lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-13.9677 -3.2336 0.6622 3.3823 14.7928
Coefficients:
Estimate Std. Error t value Pr(>|t|)
z.lag.1 -0.004069 0.006139 -0.663 0.5088
z.diff.lag 0.258295 0.089599 2.883 0.0047 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.05 on 116 degrees of freedom
Multiple R-squared: 0.07005, Adjusted R-squared: 0.05402
F-statistic: 4.369 on 2 and 116 DF, p-value: 0.01481
Value of test-statistic is: -0.6628
Critical values for test statistics:
1pct 5pct 10pct
tau1 -2.58 -1.95 -1.62
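The test statistic (-0.66) is larger than all the critical values (e.g. -1.95 at 5%), so we cannot reject the null of a unit root: the oil price appears nonstationary. The ADF regression in the output above can also be run manually; a sketch on a simulated random walk (with the real data, rw would be replaced by the oilprice series):

```r
# Manual ADF regression (no constant), matching the specification above:
# regress Δy_t on y_{t-1} and Δy_{t-1}
set.seed(7)
rw <- cumsum(rnorm(120))              # a random walk has a unit root
d  <- diff(rw)
z.diff     <- d[-1]                   # Δy_t,   t = 3..120
z.lag.1    <- rw[2:(length(rw) - 1)]  # y_{t-1}
z.diff.lag <- d[-length(d)]           # Δy_{t-1}
adf <- lm(z.diff ~ z.lag.1 - 1 + z.diff.lag)
coef(adf)["z.lag.1"]  # close to zero when a unit root is present
```

The t statistic on z.lag.1 is what gets compared with the tau1 critical values, not with the usual normal ones.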
Download the dataset cross.csv
cross=read.csv("https://raw.githubusercontent.com/mondpanther/datastorieshub/95d94862115819350247823f174a2633cde0236b/code/cross.csv")
It contains data on the COVID pandemic across most countries for two different months in 2021. Note that not all countries have data reported in the same month.
Run a regression of deaths on vax. deaths reports the total number of covid related deaths in a particular country by a given month (in deaths per 1 million of the population). vax records the total number of (at least once) vaccinated persons (per 100K) by a particular month. Discuss and interpret your findings.
library(dplyr)
cross_euram=cross %>% filter((continent=="Europe" | continent=="North America") & period==2)
reg1=lm(deaths~vax,cross_euram)
reg1 %>% summary()
Call:
lm(formula = deaths ~ vax, data = cross_euram)
Residuals:
Min 1Q Median 3Q Max
-1344.6 -847.5 -147.3 652.7 2768.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1344.6381 241.7908 5.561 3.93e-07 ***
vax -0.9906 2.0475 -0.484 0.63
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1043 on 75 degrees of freedom
Multiple R-squared: 0.003111, Adjusted R-squared: -0.01018
F-statistic: 0.2341 on 1 and 75 DF, p-value: 0.6299
It is also helpful to look at a scatterplot:
library(ggplot2)
cross_euram %>% ggplot( aes(x=vax,y=deaths) ) +
geom_point() +
ylab("Total number of deaths per 1 million")+
xlab("Total number of vaccinations per 100K")+
theme_minimal()+
geom_smooth(method="lm",se=F)
We see that there is a negative relationship between vaccination rates and deaths: one additional person vaccinated per 100K is associated with about one less death per 1 million of the population. However, the relationship is not significant.
An upward bias could come from reverse causality. Consider the example of the UK vs Australia: the UK engaged in a particularly strong vaccination campaign early on, and a large part of the population embraced vaccination. This was partly driven by the particularly dire covid case rate early in the pandemic and throughout the winter of 2020-21. Australia, on the other hand, engaged early in strong measures to prevent the pandemic from spreading, e.g. very strict international travel restrictions and good tracking and tracing. As a consequence there were very few covid cases for most of the pandemic. Once vaccines arrived, Australia did not pursue an aggressive vaccination campaign and many Australians didn't get vaccinated because they thought there was no need. In other words: what caused higher vaccination rates were high early-pandemic case and death rates, which are an important part of total pandemic death rates.
The dataset contains the death and case rates from early 2021 when vaccines started to be distributed widely. We can use this as a control variable.
Alternatively we can look at the relationship between vaccination rates and changes in deaths since early 2021; or: we could run a panel data regression including country fixed effects.
cross_euram2=cross %>%arrange(iso_code,period) %>%
group_by(iso_code) %>% mutate(L1deaths=dplyr::lag(deaths),
Ddeaths=deaths-dplyr::lag(deaths)) %>%
filter((continent=="Europe" | continent=="North America") )
reg2=lm(deaths~vax+L1deaths,cross_euram2 )
reg2 %>% summary()
Call:
lm(formula = deaths ~ vax + L1deaths, data = cross_euram2)
Residuals:
Min 1Q Median 3Q Max
-851.62 -346.61 -80.83 181.72 1556.98
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 531.98326 125.53951 4.238 6.43e-05 ***
vax -3.13125 0.98204 -3.189 0.0021 **
L1deaths 1.43916 0.08958 16.066 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 495.6 on 74 degrees of freedom
(84 observations deleted due to missingness)
Multiple R-squared: 0.7779, Adjusted R-squared: 0.7719
F-statistic: 129.6 on 2 and 74 DF, p-value: < 2.2e-16
Call:
lm(formula = Ddeaths ~ vax, data = cross_euram2)
Residuals:
Min 1Q Median 3Q Max
-780.0 -384.2 -159.6 268.2 1926.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 779.965 131.365 5.937 8.4e-08 ***
vax -2.478 1.112 -2.228 0.0289 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 566.7 on 75 degrees of freedom
(84 observations deleted due to missingness)
Multiple R-squared: 0.06206, Adjusted R-squared: 0.04955
F-statistic: 4.962 on 1 and 75 DF, p-value: 0.0289
Call:
lm(formula = deaths ~ vax + factor(iso_code) + factor(month),
data = cross_euram2)
Residuals:
Min 1Q Median 3Q Max
-885.7 -135.1 0.0 135.1 885.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -403.407 411.814 -0.980 0.330763
vax -3.898 1.287 -3.028 0.003481 **
factor(iso_code)AIA 153.563 487.972 0.315 0.753955
factor(iso_code)ALB 882.991 508.678 1.736 0.087119 .
factor(iso_code)AND 1718.605 498.445 3.448 0.000974 ***
factor(iso_code)ATG 527.315 537.818 0.980 0.330329
factor(iso_code)AUT 1373.897 486.255 2.825 0.006192 **
factor(iso_code)BEL 2268.137 497.394 4.560 2.19e-05 ***
factor(iso_code)BES 90.434 593.159 0.152 0.879274
...
We see that in all cases the coefficient becomes substantially more negative and significant, suggesting a more powerful causal effect of vaccines. This is in line with the suggestion that the (pre-vaccine) severity of the pandemic was driving the speed and strength of vaccine rollout and uptake.
For attribution, please cite this work as
Martin (2021, Dec. 4). Datastories Hub: Exercises 10. Retrieved from https://mondpanther.github.io/datastorieshub/posts/exercises/exercises10/
BibTeX citation
@misc{martin2021exercises,
  author = {Martin, Ralf},
  title = {Datastories Hub: Exercises 10},
  url = {https://mondpanther.github.io/datastorieshub/posts/exercises/exercises10/},
  year = {2021}
}