Becoming a Data Detective: Uncovering Racial Bias & more
For this question we will use a dataset from a randomized experiment conducted by Marianne Bertrand and Sendhil Mullainathan, who sent 4,870 fictitious resumes to employers in response to job adverts in Boston and Chicago in 2001. The resumes differ in various attributes, including the names of the applicants, and different resumes were randomly allocated to job openings. Some of the names are distinctly white-sounding and some distinctly black-sounding. The researchers were interested in whether black-sounding names obtain fewer callbacks for interviews than white-sounding names. Load the data set bm.dta.
library(haven)
data <- read_dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AABua74TH54FcmOsAs0ayMY5a/bm.dta?dl=1")
summary(data)
education ofjobs yearsexp computerskills
Min. :0.000 Min. :1.000 Min. : 1.000 Min. :0.0000
1st Qu.:3.000 1st Qu.:3.000 1st Qu.: 5.000 1st Qu.:1.0000
Median :4.000 Median :4.000 Median : 6.000 Median :1.0000
Mean :3.618 Mean :3.661 Mean : 7.843 Mean :0.8205
3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.: 9.000 3rd Qu.:1.0000
Max. :4.000 Max. :7.000 Max. :44.000 Max. :1.0000
call female black
Min. :0.00000 Min. :0.0000 Min. :0.0
1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:0.0
Median :0.00000 Median :1.0000 Median :0.5
Mean :0.08049 Mean :0.7692 Mean :0.5
3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0
Max. :1.00000 Max. :1.0000 Max. :1.0
The data set contains two dummy variables (0-1 variables) for gender (female) and whether the applicant has computer skills (computerskills). Tabulate these variables by black.
Do gender and computer skills look balanced – i.e. random – across race groups?
# This can be answered with the prop.table command
table(data$female,data$black)
0 1
0 575 549
1 1860 1886
prop.table(table(data$female,data$black))
0 1
0 0.1180698 0.1127310
1 0.3819302 0.3872690
# Alternatively you can compute the shares using dplyr:
library(dplyr)
data %>% group_by(black, female) %>%
  summarise(n = n()) %>%
  group_by(female) %>%
  mutate(black_share_within_gender = n / sum(n))
# A tibble: 4 x 4
# Groups: female [2]
black female n black_share_within_gender
<dbl> <dbl> <int> <dbl>
1 0 0 575 0.512
2 0 1 1860 0.497
3 1 0 549 0.488
4 1 1 1886 0.503
data %>% group_by(black, female) %>%
summarise(n=n()) %>%
group_by(black) %>%
mutate(female_share_within_ethnicity=n/sum(n)) %>%
arrange(female) # note that arrange is dplyr's sort command
# A tibble: 4 x 4
# Groups: black [2]
black female n female_share_within_ethnicity
<dbl> <dbl> <int> <dbl>
1 0 0 575 0.236
2 1 0 549 0.225
3 0 1 1860 0.764
4 1 1 1886 0.775
There are many more female than male CVs (about 38% + 38% = 76% of the sample are female). However, gender seems not at all correlated with race: the split between black and non-black is virtually 50:50 for both men and women.
A similar pattern emerges for computer skills and racial background.
prop.table(table(data$computerskills,data$black))
0 1
0 0.09568789 0.08377823
1 0.40431211 0.41622177
Do a similar tabulation for education and the number of jobs previously held (ofjobs). These variables take on 5 and 7 different values, respectively.
Do education and the number of previous jobs look balanced across race groups?
To be sure, run a regression. Are the differences significant?
You can tabulate to eyeball proportions, but if you want to be sure that there aren't small discrepancies, it is always best to run a regression.
table(data$education,data$black)
0 1
0 18 28
1 18 22
2 142 132
3 513 493
4 1744 1760
prop.table(table(data$education,data$black))
0 1
0 0.003696099 0.005749487
1 0.003696099 0.004517454
2 0.029158111 0.027104723
3 0.105338809 0.101232033
4 0.358110883 0.361396304
table(data$ofjobs,data$black)
0 1
1 54 56
2 347 357
3 726 703
4 800 811
5 258 275
6 243 221
7 7 12
prop.table(table(data$ofjobs,data$black))
0 1
1 0.011088296 0.011498973
2 0.071252567 0.073305955
3 0.149075975 0.144353183
4 0.164271047 0.166529774
5 0.052977413 0.056468172
6 0.049897331 0.045379877
7 0.001437372 0.002464066
Call:
lm(formula = black ~ ofjobs, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.63158 -0.50341 -0.05394 0.49659 0.52371
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.509091 0.047690 10.675 <2e-16 ***
ofjobs2 -0.001989 0.051281 -0.039 0.969
ofjobs3 -0.017138 0.049492 -0.346 0.729
ofjobs4 -0.005677 0.049291 -0.115 0.908
ofjobs5 0.006857 0.052381 0.131 0.896
ofjobs6 -0.032798 0.053043 -0.618 0.536
ofjobs7 0.122488 0.124264 0.986 0.324
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5002 on 4863 degrees of freedom
Multiple R-squared: 0.0007238, Adjusted R-squared: -0.0005091
F-statistic: 0.587 on 6 and 4863 DF, p-value: 0.741
Hence the different job categories are individually (and jointly) insignificant. Note that the summary of an OLS regression always reports an F-test of the null that all slope coefficients are jointly zero (the p-value of 0.741 at the end of the output).
Look at the mean and standard deviation of years of experience (yearsexp) separately for blacks and whites.
Does this variable look similar by race?
What do you make of the overall results on resume characteristics?
Is it important to figure out if these variables look similar across the race groups? Think why or why not.
If there was any evidence of a systematic relationship between race and any of these characteristics, we could potentially be in trouble when simply comparing interview callbacks across race groups: any differences found could simply be due to those other factors rather than racial bias by employers.
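The mean and standard deviation of yearsexp by race are not shown above; a minimal dplyr sketch:
data %>%
  group_by(black) %>%
  summarise(mean_exp = mean(yearsexp), sd_exp = sd(yearsexp))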
The variable of interest on the data set is the variable call, which indicates a call back for an interview.
What percentage of people receive a call back (rounded up to 2 decimal places)?
Do you find differences in call back rates by race?
prop.table(table(data$call))
0 1
0.91950719 0.08049281
prop.table(table(data$call,data$black))
0 1
0 0.45174538 0.46776181
1 0.04825462 0.03223819
prop.table(table(data$call,data$black),2)
0 1
0 0.90349076 0.93552361
1 0.09650924 0.06447639
# Note that by specifying 2 we report proportions by column; i.e. each cell count
# is divided by its column total. (Specifying 1 gives proportions by row; omitting
# the margin argument gives proportions of the overall total, as above.)
We see that most CVs never received a call back (i.e. overall only 8.05% received a call back).
What do you conclude from the results of the Bertrand and Mullainathan experiment?
Are black people as likely to receive a call back as white people?
There seems to be a clear difference between the two groups. Whereas callback rates for white-sounding names were above average (9.65%), they were below average for black-sounding names (6.45%), suggesting a racial bias by employers.
Let's use the dataset from the experiment by Bertrand and Mullainathan (bm.dta) again.
Develop a regression to examine if the difference in interview callbacks between black and white “sounding” CVs is significantly different.
What is the code for the simplest linear regression (we call this model “mod”)? mod <-
What is the code for the simplest logit regression (we call this model “mod2”)? mod2 <-
Execute both regression models in turn.
Are the results similar?
Linear regression:
Call:
lm(formula = call ~ black, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.09651 -0.09651 -0.06448 -0.06448 0.93552
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.096509 0.005505 17.532 < 2e-16 ***
black -0.032033 0.007785 -4.115 3.94e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2716 on 4868 degrees of freedom
Multiple R-squared: 0.003466, Adjusted R-squared: 0.003261
F-statistic: 16.93 on 1 and 4868 DF, p-value: 3.941e-05
Alternatively we can use logit:
Call:
glm(formula = call ~ black, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.4505 -0.4505 -0.3651 -0.3651 2.3416
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.23663 0.06863 -32.590 < 2e-16 ***
black -0.43818 0.10732 -4.083 4.45e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2726.9 on 4869 degrees of freedom
Residual deviance: 2709.9 on 4868 degrees of freedom
AIC: 2713.9
Number of Fisher Scoring iterations: 5
The average marginal effect of black implied by the logit model:
black
-0.03232
Note that this marginal effect is very similar to the coefficient from the linear model.
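The command that produced this number is not shown; a minimal sketch of one way to compute a derivative-based average marginal effect from mod2 (an assumption, not necessarily the author's code):
p_hat <- predict(mod2, type = "response")           # fitted callback probabilities
mean(p_hat * (1 - p_hat)) * coef(mod2)["black"]     # average marginal effect, about -0.0323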
For this question, download the data set cps.dta, which comes from the responses to the monthly US Current Population Survey (CPS) in 2001, a large labour market survey. This data set contains data on 8,891 individuals living in Boston and Chicago. We want to use these data to compare the skills of real-life black and white workers (as opposed to made-up CVs) and their employment outcomes, and to see how they differ from the findings in the exercises involving the bm.dta dataset.
library(foreign)
data <- read.dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AAAOb_v-Y2V0NN4-rxahZjl4a/cps.dta?dl=1")
summary(data)
employed black female education
No :1942 Min. :0.0000 Min. :0.0000 HSD : 980
Yes :6931 1st Qu.:0.0000 1st Qu.:0.0000 HSG :2358
NA's: 18 Median :0.0000 Median :1.0000 some col:2460
Mean :0.1508 Mean :0.5161 col+ :3093
3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000
yearsexp
Min. : 0.00
1st Qu.:11.00
Median :21.00
Mean :20.94
3rd Qu.:30.00
Max. :59.00
The data set contains a variable education, which takes on four values (high school dropouts, high school graduates, some college, and college degree or more). Use the education variable to create a new dummy for respondents with some college or more (i.e. those in the “some college” category plus those in the “college degree or more” category).
What percentage of respondents has at least some college education (rounded up to the nearest percent)?
Note that nearly 63% of the sample have at least some college education.
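The construction of the dummy is not shown at this point; a minimal sketch using dplyr, consistent with the HE variable used below:
library(dplyr)
data <- data %>% mutate(HE = education %in% c("some col", "col+"))
mean(data$HE)   # about 0.62-0.63, i.e. nearly 63% have at least some college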
First make employed into a dichotomous variable and then conduct a regression analysis of the chances of being employed for people with different racial backgrounds.
Are black people significantly less likely to be employed?
If yes, what is the percentage difference (rounded to the nearest percentage point)?
# recode employed (a factor with levels No/Yes) as a dichotomous TRUE/FALSE variable
data <- data %>% mutate(employed2 = employed == "Yes")
mod1 <- lm(employed2 ~ black, data)
summary(mod1)
Call:
lm(formula = employed2 ~ black, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.7949 0.2051 0.2051 0.2051 0.2964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.794879 0.004748 167.40 < 2e-16 ***
black -0.091286 0.012237 -7.46 9.48e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4122 on 8871 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.006234, Adjusted R-squared: 0.006122
F-statistic: 55.65 on 1 and 8871 DF, p-value: 9.481e-14
A regression of employment status on a dummy indicating a black racial background suggests that black people are significantly less likely to be employed. The difference amounts to 9.1 percentage points.
Conduct a regression analysis of the chances of having college education for people with different racial backgrounds.
What is the percentage difference (rounded to the nearest percentage point)?
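The call is visible in the output below; for instance (the object name is hypothetical):
mod_he <- lm(HE ~ black, data = data)
summary(mod_he)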
Call:
lm(formula = HE ~ black, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.6420 -0.6420 0.3580 0.3580 0.4735
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.641987 0.005553 115.607 < 2e-16 ***
black -0.115514 0.014299 -8.078 7.41e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4825 on 8889 degrees of freedom
Multiple R-squared: 0.007288, Adjusted R-squared: 0.007177
F-statistic: 65.26 on 1 and 8889 DF, p-value: 7.414e-16
The difference in the probability of having higher education between people with a non-black and a black background is about 11.6 percentage points.
On the basis of your evidence what can you conclude about racial discrimination in the US labor market?
Do we see an open-and-shut case of racial bias here?
Think of the potential caveats and alternative explanations.
What analysis could you undertake to address some of these caveats?
The result in (b) is consistent with racial bias. However, from (c) we also see that people with a black background tend to have less college education, and college education is another major driver of being employed:
Call:
lm(formula = employed2 ~ HE, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.8206 0.1794 0.1794 0.2845 0.2845
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.715486 0.007109 100.64 <2e-16 ***
HETRUE 0.105124 0.008996 11.69 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4104 on 8871 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.01516, Adjusted R-squared: 0.01505
F-statistic: 136.5 on 1 and 8871 DF, p-value: < 2.2e-16
Hence, far from implying a racial issue, the result in (b) could simply reflect employers' preference for more highly educated workers. We can examine this by doing the analysis in (b) separately for workers with different educational attainment; i.e.:
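The two subgroup regressions whose output follows (their calls are shown in the summaries; the object names here are hypothetical):
mod_he1 <- lm(employed2 ~ black, data = data, subset = data$HE == 1)
mod_he0 <- lm(employed2 ~ black, data = data, subset = data$HE == 0)
summary(mod_he1)
summary(mod_he0)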
Call:
lm(formula = employed2 ~ black, data = data, subset = data$HE ==
1)
Residuals:
Min 1Q Median 3Q Max
-0.8260 0.1740 0.1740 0.1740 0.2162
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.825961 0.005513 149.809 < 2e-16 ***
black -0.042177 0.015479 -2.725 0.00645 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3835 on 5539 degrees of freedom
(12 observations deleted due to missingness)
Multiple R-squared: 0.001339, Adjusted R-squared: 0.001158
F-statistic: 7.425 on 1 and 5539 DF, p-value: 0.006453
Call:
lm(formula = employed2 ~ black, data = data, subset = data$HE ==
0)
Residuals:
Min 1Q Median 3Q Max
-0.7392 -0.6145 0.2608 0.2608 0.3855
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.739163 0.008636 85.59 < 2e-16 ***
black -0.124629 0.019814 -6.29 3.59e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4487 on 3330 degrees of freedom
(6 observations deleted due to missingness)
Multiple R-squared: 0.01174, Adjusted R-squared: 0.01144
F-statistic: 39.56 on 1 and 3330 DF, p-value: 3.588e-10
The results suggest that for either group there is a significant racial gap when it comes to being employed. Note that the effect is considerably stronger for less educated workers. Hence this reinforces the hypothesis that there is discrimination against workers with a black background which is unrelated to their productivity in the workplace.
However, there might be further caveats: our simple regression cannot account for the quality of the college education which can vary considerably and might vary systematically along racial lines. Furthermore, an important driver for a good education and for various other skills might have to do with parental income and status. Again, this is likely to vary systematically along racial lines.
While it is interesting to ask if employers discriminate above and beyond what could be expected on the basis of workers' education and skills (which is what we were implicitly doing above), we might also be concerned about the overall impact of racial background on labor market outcomes, which includes initially different educational outcomes. Hence, depending on our interest, we might be primarily focused on the effect of race holding education fixed, or on the overall effect.
Consider once more the cps.dta dataset. In exercise 3 we argued that not accounting for education might bias our estimate of the impact of racial background. We then examined the issue by looking at college-educated and non-college-educated workers separately.
library(foreign)
data <- read.dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AAAOb_v-Y2V0NN4-rxahZjl4a/cps.dta?dl=1")
summary(data)
employed black female education
No :1942 Min. :0.0000 Min. :0.0000 HSD : 980
Yes :6931 1st Qu.:0.0000 1st Qu.:0.0000 HSG :2358
NA's: 18 Median :0.0000 Median :1.0000 some col:2460
Mean :0.1508 Mean :0.5161 col+ :3093
3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000
yearsexp
Min. : 0.00
1st Qu.:11.00
Median :21.00
Mean :20.94
3rd Qu.:30.00
Max. :59.00
First add the HE (higher education) variable to the data frame and recode the employed variable as a dichotomous variable called “employed2”. Can you propose an alternative strategy using a multivariate regression approach?
What is the command for a linear regression with two variables (we call this model “mod1”)? mod1 <-
What is the command for a slightly more complex linear regression with two variables and their interaction (we call this model “mod2”)? mod2 <-
# add the HE (higher education) dummy to the data frame
data["HE"] <- as.integer(data$education == "some col" | data$education == "col+")
# recode employed as a 0/1 dummy: as.integer() maps the factor levels No/Yes to 1/2
data["employed2"] <- as.integer(data$employed) - 1
mod1 <- lm(employed2 ~ black + HE, data)
summary(mod1)
Call:
lm(formula = employed2 ~ black + HE, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.8307 0.1693 0.1693 0.2694 0.3491
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.730628 0.007462 97.917 < 2e-16 ***
black -0.079705 0.012198 -6.534 6.75e-11 ***
HE 0.100094 0.009008 11.111 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4094 on 8870 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.01988, Adjusted R-squared: 0.01966
F-statistic: 89.94 on 2 and 8870 DF, p-value: < 2.2e-16
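The interacted model, whose call is visible in the summary below:
mod2 <- lm(employed2 ~ black + HE + black*HE, data)
summary(mod2)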
Call:
lm(formula = employed2 ~ black + HE + black * HE, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.8260 0.1740 0.1740 0.2608 0.3855
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.739163 0.007876 93.849 < 2e-16 ***
black -0.124629 0.018070 -6.897 5.68e-12 ***
HE 0.086798 0.009831 8.829 < 2e-16 ***
black:HE 0.082451 0.024481 3.368 0.00076 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4092 on 8869 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.02113, Adjusted R-squared: 0.0208
F-statistic: 63.81 on 3 and 8869 DF, p-value: < 2.2e-16
There are potentially two models you might have used in part (a): one that implies that the effect of race is the same for both educational groups, and one that allows the effect of race to differ across educational groups.
Which model is more appropriate?
The test for the interaction coefficient is significant. That suggests the negative “black” effect is significantly less strong for higher-educated workers; i.e. it is important to use the more complicated model with the interaction term.
Some alternative ways of testing include:
anova(mod2,mod1)
Analysis of Variance Table
Model 1: employed2 ~ black + HE + black * HE
Model 2: employed2 ~ black + HE
Res.Df RSS Df Sum of Sq F Pr(>F)
1 8869 1484.9
2 8870 1486.8 -1 -1.8992 11.344 0.0007603 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
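The Wald test output that follows was presumably produced with waldtest() from the lmtest package (an assumption; the call itself is not shown). Loading lmtest is also needed for the coeftest() call further down:
library(lmtest)
waldtest(mod2, mod1)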
Wald test
Model 1: employed2 ~ black + HE + black * HE
Model 2: employed2 ~ black + HE
Res.Df Df F Pr(>F)
1 8869
2 8870 -1 11.344 0.0007603 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# or alternatively
coeftest(mod2)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7391627 0.0078761 93.8488 < 2.2e-16 ***
black -0.1246287 0.0180702 -6.8969 5.677e-12 ***
HE 0.0867985 0.0098305 8.8295 < 2.2e-16 ***
black:HE 0.0824513 0.0244806 3.3680 0.0007603 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
On the basis of your regressions in (b), what is the racial gap for college educated people in percentage points (to 1 decimal place)?
What is the racial gap for non college educated people in percentage points (to 1 decimal place)?
Think of how this compares to your findings in Exercise 5.3.
College-educated black people have a 12.5 − 8.2 = 4.3 percentage point lower likelihood of being employed. For the non-college-educated the probability is 12.5 percentage points lower; i.e. up to rounding error this corresponds exactly to what we found in exercise 3.
Use wage1.dta and examine once more the relationship between education and wages.
Would you say the relationship is different for men and women?
We can allow for a separate response for women by including a female dummy and also interacting the education variable with gender. Note that you could do this by creating the various variables first. A faster way is to use R's ability to create new variables as part of the model formula of the lm command:
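A sketch, assuming wage1.dta has already been loaded into data (e.g. with read.dta); the interaction is specified directly in the lm formula, and car is loaded for the linearHypothesis() test further below:
library(foreign)
library(car)
# educ * female expands to educ + female + educ:female
mod2 <- lm(wage ~ educ + educ * female, data = data)
summary(mod2)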
Call:
lm(formula = wage ~ educ + educ * female, data = data)
Residuals:
Min 1Q Median 3Q Max
-6.1611 -1.8028 -0.6367 1.0054 15.5258
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.20050 0.84356 0.238 0.812
educ 0.53948 0.06422 8.400 4.24e-16 ***
female -1.19852 1.32504 -0.905 0.366
educ:female -0.08600 0.10364 -0.830 0.407
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.186 on 522 degrees of freedom
Multiple R-squared: 0.2598, Adjusted R-squared: 0.2555
F-statistic: 61.07 on 3 and 522 DF, p-value: < 2.2e-16
linearHypothesis(mod2,c("educ:female=0","female=0"))
Linear hypothesis test
Hypothesis:
educ:female = 0
female = 0
Model 1: restricted model
Model 2: wage ~ educ + educ * female
Res.Df RSS Df Sum of Sq F Pr(>F)
1 524 5980.7
2 522 5300.2 2 680.51 33.511 2.031e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that this leads to “female” parameters that individually are not significant. However, a joint significance test reveals that the variables combined have explanatory power; i.e. the low individual significance in the regression is likely a consequence of collinearity between the female and female × education variables. We can also see this by dropping the female dummy and keeping only the interaction term, which then becomes highly significant (and negative), consistent with the negative female effect we have seen before:
Call:
lm(formula = wage ~ educ + female:educ, data = data)
Residuals:
Min 1Q Median 3Q Max
-6.3234 -1.7394 -0.6856 1.0567 15.5795
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.28526 0.65041 -0.439 0.661
educ 0.57548 0.05039 11.421 < 2e-16 ***
educ:female -0.17764 0.02183 -8.138 2.95e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.186 on 523 degrees of freedom
Multiple R-squared: 0.2586, Adjusted R-squared: 0.2558
F-statistic: 91.23 on 2 and 523 DF, p-value: < 2.2e-16
You might consequently ask, which is the right model? There are several considerations:
Use the production2.dta dataset. This data set contains data on output (value added) and inputs at the industry level for 459 industries in 1958 and 1993.
library(foreign)
library(lmtest)
library(car)
data <- read.dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AAB8jv5xON_4OlZG50sj2l-xa/production2.dta?dl=1")
summary(data)
sic year emp vship
Min. :2011 Min. :58.0 Min. : 0.10 Min. : 24.9
1st Qu.:2437 1st Qu.:58.0 1st Qu.: 10.30 1st Qu.: 323.3
Median :3221 Median :75.5 Median : 19.30 Median : 1026.6
Mean :3039 Mean :75.5 Mean : 35.59 Mean : 3768.3
3rd Qu.:3560 3rd Qu.:93.0 3rd Qu.: 40.05 3rd Qu.: 3493.3
Max. :3999 Max. :93.0 Max. :511.40 Max. :167825.8
matcost vadd energy
Min. : 8.1 Min. : 9.1 Min. : 0.20
1st Qu.: 155.4 1st Qu.: 155.7 1st Qu.: 3.20
Median : 510.9 Median : 492.0 Median : 13.85
Mean : 1997.6 Mean : 1773.3 Mean : 71.80
3rd Qu.: 1726.9 3rd Qu.: 1699.1 3rd Qu.: 53.15
Max. :120458.8 Max. :47272.0 Max. :3780.60
cap
Min. : 3.7
1st Qu.: 292.2
Median : 682.6
Mean : 1974.8
3rd Qu.: 1711.6
Max. :61909.7
Suppose the relationship between output and inputs is described by a Cobb-Douglas production function: \[Y_{i} = A K_{i}^{\alpha} L_{i}^{\beta}\]
where \(Y_{i}\) is a measure of output, \(K_{i}\) is the capital stock, and \(L_{i}\) is employment. Answer all questions for the year 1958 only.
Transform the production function to a linear equation by taking logs. Estimate the parameters \(\alpha\) and \(\beta\) by an OLS regression using total value added as your measure of output.
What is the coefficient on the log(emp) variable (rounded to 3 decimal places)?
What is the coefficient on the log(cap) variable (rounded to 3 decimal places)?
Taking logs gives the following linear regression model:
\[\ln Y_{i}=\ln A + \alpha \ln K_{i} +\beta \ln L_{i} +\varepsilon_{i}\]
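The estimation for 1958; the call and the object name mod1 are confirmed by the output and the linearHypothesis(mod1, ...) test below:
mod1 <- lm(log(vadd) ~ log(emp) + log(cap), data = data[data$year == 58, ])
summary(mod1)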
Call:
lm(formula = log(vadd) ~ log(emp) + log(cap), data = data[data$year ==
58, ])
Residuals:
Min 1Q Median 3Q Max
-0.72254 -0.16410 -0.02135 0.16161 1.17701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.41898 0.06172 22.99 <2e-16 ***
log(emp) 0.69748 0.01868 37.34 <2e-16 ***
log(cap) 0.27609 0.01430 19.30 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2754 on 456 degrees of freedom
Multiple R-squared: 0.9269, Adjusted R-squared: 0.9266
F-statistic: 2893 on 2 and 456 DF, p-value: < 2.2e-16
Test whether your estimates are consistent with the production function exhibiting constant returns to scale, i.e.
\[H_{0}: \alpha+\beta=1\]
against the alternative
\[H_{1}: \alpha+\beta\neq1\]
Do you reject the null hypothesis at the 5% level?
What is the p-value of your test (rounded to 3 decimal places)?
library(car)
linearHypothesis(mod1,"log(emp)+log(cap)=1" )
Linear hypothesis test
Hypothesis:
log(emp) + log(cap) = 1
Model 1: restricted model
Model 2: log(vadd) ~ log(emp) + log(cap)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 457 34.892
2 456 34.580 1 0.31205 4.115 0.04309 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An alternative way to test the hypothesis of constant returns to scale is to impose this restriction on the parameters and transform your regression model. Derive the necessary transformation, and show how the constant returns hypothesis amounts to a t-test in this transformed model. Carry out this test.
Does your result match what you found in (b)?
If H0 is true we have \(\beta=1-\alpha\), so the logged equation becomes \[\ln\frac{Y}{L}=\ln A + \alpha \ln\frac{K}{L}+\varepsilon\] More generally, without imposing the restriction, the unrestricted model can be rewritten as \[\ln\frac{Y}{L}=\ln A + \alpha \ln\frac{K}{L}+(\alpha+\beta-1)\ln L+\varepsilon\]
Hence we can regress \(\ln\frac{Y}{L}\) on \(\ln\frac{K}{L}\) and \(\ln L\) and do a t-test of the hypothesis that the coefficient on \(\ln L\) is zero, which is equivalent to \(\alpha+\beta=1\):
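The transformed regression (its call matches the output below; the object name is hypothetical):
mod_crs <- lm(log(vadd/emp) ~ log(cap/emp) + log(emp), data = data[data$year == 58, ])
summary(mod_crs)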
Call:
lm(formula = log(vadd/emp) ~ log(cap/emp) + log(emp), data = data[data$year ==
58, ])
Residuals:
Min 1Q Median 3Q Max
-0.72254 -0.16410 -0.02135 0.16161 1.17701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.41898 0.06172 22.991 <2e-16 ***
log(cap/emp) 0.27609 0.01430 19.304 <2e-16 ***
log(emp) -0.02643 0.01303 -2.029 0.0431 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2754 on 456 degrees of freedom
Multiple R-squared: 0.4571, Adjusted R-squared: 0.4547
F-statistic: 191.9 on 2 and 456 DF, p-value: < 2.2e-16
Note that we get the same p-value as before because it is essentially the same test.
What is the average industry size, in terms of numbers employed, across all industries in 1958?
Suppose an industry of average size employs an additional 1000 workers. What does the model estimated in part (a) imply about the effect this will have on value added (in percent, to 1 decimal place)?
memp58 <- mean(data$emp[data$year==58])
memp58
[1] 34.26035
The average industry employs about 34 thousand workers (note that employment is measured in thousands).
Adding 1000 workers to the average industry therefore implies an increase in employment of about 3%. Given our estimate of \(\beta \approx 0.7\), which we can interpret as an elasticity, we would expect value added to increase by roughly 3% × 0.7 ≈ 2.1%.
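A minimal sketch of this back-of-the-envelope calculation using the estimated coefficient:
pct_emp <- 1 / memp58                    # adding 1000 to ~34.26 thousand workers: about a 3% employment increase
pct_emp * coef(mod1)["log(emp)"]         # about 0.02, i.e. roughly a 2% (2.1% with rounded figures) rise in value added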
Do you think that our estimates in (a) could be biased? If so, why would that be?
There is a concern of endogeneity with regressions such as the one in (a). While more inputs clearly have a causal effect on output, it is also plausible that industries which, for whatever reason, experience a positive shock to their output attract more workers and capital. What's more, with more than one explanatory variable it is no longer easy to predict the sign of the bias: it depends not only on the relationship between the error term and the explanatory variables, but also on the relationships between the various explanatory variables, as well as the relative strength of endogeneity for the different variables. For instance, in the case of production function regressions the suspicion is that the labour coefficient is upward biased whereas the capital coefficient is downward biased. This is not because one is positively and the other negatively correlated with the shock, but because labour, being a more flexible production factor, is likely to be more strongly correlated with the shock than capital.
Use the dataset attend.dta to analyze whether attending lectures has a causal effect on final exam performance. The dataset contains 674 observations on college students who took a particular course. Most variables on the dataset should be self-explanatory. The ACT is a college entry test. GPA is grade point average, the average performance in all courses.
library(foreign)
library(lmtest)
library(car)
data <- read.dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AABGEGjs2k7bZZQxU0H9SDtga/attend.dta?dl=1")
summary(data)
attend termgpa priGPA ACT
Min. : 2.00 Min. :0.000 Min. :0.857 Min. :13.00
1st Qu.:24.00 1st Qu.:2.150 1st Qu.:2.200 1st Qu.:20.00
Median :28.00 Median :2.680 Median :2.560 Median :22.00
Mean :26.28 Mean :2.614 Mean :2.592 Mean :22.48
3rd Qu.:30.00 3rd Qu.:3.120 3rd Qu.:2.950 3rd Qu.:25.00
Max. :32.00 Max. :4.000 Max. :3.930 Max. :32.00
final atndrte hwrte frosh
Min. :10.00 Min. : 6.25 Min. : 12.50 Min. :0.00
1st Qu.:22.00 1st Qu.: 75.00 1st Qu.: 87.50 1st Qu.:0.00
Median :26.00 Median : 87.50 Median :100.00 Median :0.00
Mean :25.89 Mean : 82.12 Mean : 87.91 Mean :0.23
3rd Qu.:29.00 3rd Qu.: 93.75 3rd Qu.:100.00 3rd Qu.:0.00
Max. :39.00 Max. :100.00 Max. :100.00 Max. :1.00
soph skipped stndfnl
Min. :0.0000 Min. : 0.000 Min. :-3.30882
1st Qu.:0.0000 1st Qu.: 2.000 1st Qu.:-0.78782
Median :1.0000 Median : 4.000 Median : 0.05252
Mean :0.5801 Mean : 5.721 Mean : 0.02914
3rd Qu.:1.0000 3rd Qu.: 8.000 3rd Qu.: 0.68277
Max. :1.0000 Max. :30.000 Max. : 2.78361
Run a regression of stndfnl, the standardized final exam score, on attend, the number of lectures attended (note: the data is from the US where they call a lecture a class).
How much will attending one extra lecture add to the standardised final score (4 decimal places)?
Is the effect large or small?
Attending one extra lecture adds 0.0278 to the standardized final score. In total there were 32 classes. Based on this, attending only half of them would imply a reduction in the final outcome of 16 × 0.0278 ≈ 0.44. This seems fairly substantial. While it is not enough to make the difference between a weak student (e.g. somebody in the bottom quartile, stndfnl = -0.788) and a strong student (say somebody in the top quartile, stndfnl = 0.683), it has the potential to move a median student (stndfnl = 0.053) close to top-quartile territory (see the summary stats for stndfnl below).
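The regression itself (its call matches the output below; the object name is hypothetical):
mod_a <- lm(stndfnl ~ attend, data = data)
summary(mod_a)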
Call:
lm(formula = stndfnl ~ attend, data = data)
Residuals:
Min 1Q Median 3Q Max
-3.4415 -0.6700 -0.0245 0.6614 2.6787
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.702364 0.192243 -3.654 0.000279 ***
attend 0.027836 0.007172 3.881 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9824 on 672 degrees of freedom
Multiple R-squared: 0.02192, Adjusted R-squared: 0.02047
F-statistic: 15.06 on 1 and 672 DF, p-value: 0.0001143
summary(data$stndfnl)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.30882 -0.78782 0.05252 0.02914 0.68277 2.78361
Do you think that our estimates in (a) could be biased?
What would be the ideal way of addressing the problem?
The regression in (a) could be biased for a number of reasons. The main worry would be that students who are better irrespective of attendance are also more diligent in general and therefore more likely to attend class.
The best way of dealing with endogeneity in this context would be to use an instrumental variable. In the absence of a good instrument (a nice instrument would, for example, be if some students were kept from attending class by a weather or transport problem), we have to rely on the inclusion of further controls which hopefully capture the ability, diligence, etc. of students.
Enter each of the following variables one at a time as a control in your regression: termgpa, priGPA, ACT. For each of these controls, answer the following questions:
Does entering the control variable termgpa help solve the problem you discussed in part (b) and get you closer to a causal effect of lecture attendance?
Does entering the control variable priGPA create potential new problems in interpreting the coefficient on attend causally? Why?
What happens to the coefficient on attend when you add the ACT variable? Interpret this result.
Call:
lm(formula = stndfnl ~ attend + termgpa, data = data)
Residuals:
Min 1Q Median 3Q Max
-3.3122 -0.5612 -0.0228 0.5745 2.3635
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.273935 0.166898 -7.633 7.90e-14 ***
attend -0.035146 0.007222 -4.866 1.42e-06 ***
termgpa 0.851879 0.052597 16.196 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8336 on 671 degrees of freedom
Multiple R-squared: 0.2968, Adjusted R-squared: 0.2947
F-statistic: 141.6 on 2 and 671 DF, p-value: < 2.2e-16
The term grade point average is strongly positive and significant, and its inclusion turns the attendance effect negative. Note that termgpa is most likely itself affected by higher attendance; however, by including it as a separate explanatory variable, the resulting estimate ignores this channel. Instead, what we are picking up is the final grade of people who attended class a lot without it making a difference to their term GPA. In other words, these are probably very weak students who desperately tried to improve by attending classes a lot, but it did not have a positive effect on either their term GPA or their final score.
Call:
lm(formula = stndfnl ~ attend + priGPA, data = data)
Residuals:
Min 1Q Median 3Q Max
-3.4949 -0.6228 -0.0443 0.6241 2.4043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.698644 0.208907 -8.131 2.06e-15 ***
attend -0.001748 0.007425 -0.235 0.814
priGPA 0.684353 0.072085 9.494 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9231 on 671 degrees of freedom
Multiple R-squared: 0.1377, Adjusted R-squared: 0.1352
F-statistic: 53.6 on 2 and 671 DF, p-value: < 2.2e-16
priGPA is the GPA before the term. Again, we find a strong positive relation with stndfnl, which is not surprising as the ability to get good (or bad) results is very persistent. Again, the attendance coefficient becomes negative, although it is not significant. Unlike for termgpa, we cannot argue that attendance causes priGPA as well: attendance cannot affect past grades. So perhaps this means that there is really no causal effect of attendance on final outcomes.
Call:
lm(formula = stndfnl ~ attend + ACT, data = data)
Residuals:
Min 1Q Median 3Q Max
-3.3142 -0.5702 -0.0045 0.6175 2.4836
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.448301 0.307065 -11.230 < 2e-16 ***
attend 0.037599 0.006671 5.636 2.56e-08 ***
ACT 0.110749 0.010114 10.950 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9056 on 671 degrees of freedom
Multiple R-squared: 0.1702, Adjusted R-squared: 0.1677
F-statistic: 68.81 on 2 and 671 DF, p-value: < 2.2e-16
The college entrance results captured by ACT are another way of controlling for ability. However, including ACT, far from destroying the effect of attendance, makes it slightly stronger. A mechanism that could explain this is as follows: far from being more diligent, the most able students might actually have a somewhat cavalier attitude. They know they are good, and they know they don't need to attend class to do well. If this was the only effect, we would expect a downward bias in the simple regression of final scores on attendance. It is plausible that ACT captures this, as it is the college entrance score.
Drawing on your discussion in (c), which of the control variables termgpa, priGPA, ACT would you like to have in your regression in order to uncover the causal effect of lecture attendance?
Why? Run your preferred specification and think of why it’s best.
In reality we will potentially have both effects causing endogeneity (diligent students attending more classes and cavalier geniuses not attending). Our best bet to control for this is to include both ACT and priGPA. However, we should not include termgpa because of the potential reverse causality. This leads to a lower effect of attendance (0.017), but it is still significant.
Call:
lm(formula = stndfnl ~ attend + priGPA + ACT, data = data)
Residuals:
Min 1Q Median 3Q Max
-3.2386 -0.5518 -0.0396 0.5927 2.3329
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.359279 0.301720 -11.134 < 2e-16 ***
attend 0.017415 0.007603 2.291 0.0223 *
priGPA 0.410436 0.078675 5.217 2.43e-07 ***
ACT 0.083060 0.011253 7.381 4.66e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8884 on 670 degrees of freedom
Multiple R-squared: 0.2026, Adjusted R-squared: 0.199
F-statistic: 56.74 on 3 and 670 DF, p-value: < 2.2e-16
Students who are diligent in attending lectures may also be more diligent about other aspects of their coursework, like completing homework. There is a variable hwrte in the dataset indicating the percentage of homework turned in. Add this variable as a regressor to your preferred specification from (d).
Is hwrte significant at the 5% level?
Is it a regressor you want in order to uncover the causal effect of lecture attendance on student performance?
What happens to the coefficient on attend?
Interpret your results.
Call:
lm(formula = stndfnl ~ attend + priGPA + ACT + hwrte, data = data)
Residuals:
Min 1Q Median 3Q Max
-2.98362 -0.56827 -0.03468 0.60324 2.32557
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.471448 0.308536 -11.251 < 2e-16 ***
attend 0.009107 0.009046 1.007 0.3144
priGPA 0.400493 0.078786 5.083 4.82e-07 ***
ACT 0.083834 0.011246 7.454 2.81e-13 ***
hwrte 0.003855 0.002282 1.689 0.0917 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8872 on 669 degrees of freedom
Multiple R-squared: 0.206, Adjusted R-squared: 0.2012
F-statistic: 43.39 on 4 and 669 DF, p-value: < 2.2e-16
Including hwrte reduces the coefficient on attend and makes it insignificant. However, the effect of hwrte is itself not very significant either. This is likely because the two variables are highly collinear: students who attend class a lot are also more likely to do their homework diligently, and so it is difficult to separately identify the effects of the two.
It is easy to imagine a situation where knowing them separately would be useful: suppose a teacher is trying to figure out whether it would be more useful to give another lecture or instead to set another homework. Equally, it might not be important, or even be outright wrong, to try to distinguish the two; e.g. suppose a teacher wants to work out the effect of introducing sanctions (i.e. a negative participation grade) for not attending lectures. Part of the mechanism from more lectures to better grades might be a positive effect on homework as well (e.g. students might be less forgetful about homework or might be worried about looking bad in class). This causal effect of attendance would be ignored in a regression that controls for homework.
There is a variable skipped in the dataset indicating the number of skipped lectures. Add this variable as a regressor to your preferred specification from (d).
What happens and why? Interpret this result.
“skipped” is perfectly collinear with “attend” (skipped = 32 − attend, since there were 32 lectures in total), so R drops it from the regression:
Call:
lm(formula = stndfnl ~ attend + priGPA + ACT + hwrte + skipped,
data = data)
Residuals:
Min 1Q Median 3Q Max
-2.98362 -0.56827 -0.03468 0.60324 2.32557
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.471448 0.308536 -11.251 < 2e-16 ***
attend 0.009107 0.009046 1.007 0.3144
priGPA 0.400493 0.078786 5.083 4.82e-07 ***
ACT 0.083834 0.011246 7.454 2.81e-13 ***
hwrte 0.003855 0.002282 1.689 0.0917 .
skipped NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8872 on 669 degrees of freedom
Multiple R-squared: 0.206, Adjusted R-squared: 0.2012
F-statistic: 43.39 on 4 and 669 DF, p-value: < 2.2e-16
Download the data TeachingRatings.dta. This data set contains data on the teaching evaluations of 463 professors at the University of Texas, and various attributes of the professor and course.
library(foreign)
library(lmtest)
data <- read.dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AACPj9dkKUrGIGji8Ar8nEPla/TeachingRatings.dta?dl=1")
summary(data)
minority age female onecredit
Min. :0.0000 Min. :29.00 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:42.00 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.0000 Median :48.00 Median :0.0000 Median :0.00000
Mean :0.1382 Mean :48.37 Mean :0.4212 Mean :0.05832
3rd Qu.:0.0000 3rd Qu.:57.00 3rd Qu.:1.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :73.00 Max. :1.0000 Max. :1.00000
beauty course_eval intro
Min. :-1.45049 Min. :2.100 Min. :0.0000
1st Qu.:-0.65627 1st Qu.:3.600 1st Qu.:0.0000
Median :-0.06801 Median :4.000 Median :0.0000
Mean : 0.00000 Mean :3.998 Mean :0.3391
3rd Qu.: 0.54560 3rd Qu.:4.400 3rd Qu.:1.0000
Max. : 1.97002 Max. :5.000 Max. :1.0000
nnenglish
Min. :0.00000
1st Qu.:0.00000
Median :0.00000
Mean :0.06048
3rd Qu.:0.00000
Max. :1.00000
Run a regression of course_eval on beauty. Beauty is an index that was based on a subjective scoring.
What is the slope coefficient of the regression?
We get a slope coefficient of 0.133.
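The regression (the object name mod1 is confirmed by its use in the inter-quartile comparison further below):
mod1 <- lm(course_eval ~ beauty, data = data)
summary(mod1)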
Call:
lm(formula = course_eval ~ beauty, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.80015 -0.36304 0.07254 0.40207 1.10373
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.99827 0.02535 157.727 < 2e-16 ***
beauty 0.13300 0.03218 4.133 4.25e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5455 on 461 degrees of freedom
Multiple R-squared: 0.03574, Adjusted R-squared: 0.03364
F-statistic: 17.08 on 1 and 461 DF, p-value: 4.247e-05
The number in (a) will not have much meaning to anyone who doesn’t know anything about the data.
Is the effect of beauty on course_eval big or small? How would you go about assessing this?
To understand whether something is big or small we need some sort of benchmark to compare it to. If we lack other obvious benchmarks, we can always compare what happens to the dependent variable in response to a reasonable change in the explanatory variable. For instance, we could see what happens if we change the beauty
variable by one standard deviation (sd(data$beauty) = 0.789). The coefficient above suggests that this would lead to a change of 0.133 \(\times\) 0.789 = 0.105.
Is this big? That depends in part on how much the dependent variable usually varies; e.g. we can compare it to the standard deviation of the dependent variable, which is 0.555. Hence, the beauty effect amounts to about 18.9% of a standard deviation in evaluations. That seems quite a lot for something that does not appear directly relevant to learning. But of course that depends a bit on the eye of the beholder. Of course, we could understand very well if some of you only came to university to enjoy the physical beauty of your lecturers.
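A sketch of this standard-deviation comparison in code:
sd_beauty <- sd(data$beauty)                   # about 0.789
sd_eval   <- sd(data$course_eval)              # about 0.555
coef(mod1)["beauty"] * sd_beauty / sd_eval     # about 0.19, i.e. roughly 19% of a standard deviation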
As an alternative we can look at the inter-quartile range (i.e. the difference between the 75th and 25th percentile) of beauty and see how the implied change compares to the inter-quartile range of course scores:
a <- mod1$coefficients[[2]] * (quantile(data$beauty, 0.75)[[1]] - quantile(data$beauty, 0.25)[[1]])
b <- quantile(data$course_eval, 0.75)[[1]] - quantile(data$course_eval, 0.25)[[1]]
a / b
[1] 0.1998132
i.e. this gives a very similar answer (about 20% of the inter-quartile range).
Is the slope coefficient in (a) statistically significantly different from zero?
We get a significant slope coefficient.
Call:
lm(formula = course_eval ~ beauty, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.80015 -0.36304 0.07254 0.40207 1.10373
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.99827 0.02535 157.727 < 2e-16 ***
beauty 0.13300 0.03218 4.133 4.25e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5455 on 461 degrees of freedom
Multiple R-squared: 0.03574, Adjusted R-squared: 0.03364
F-statistic: 17.08 on 1 and 461 DF, p-value: 4.247e-05
Run a regression of course_eval on beauty and female. Did the coefficient on beauty become bigger or smaller?
How would you explain this?
Call:
lm(formula = course_eval ~ beauty + female, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.87197 -0.36913 0.03493 0.39919 1.03237
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.08158 0.03293 123.94 < 2e-16 ***
beauty 0.14859 0.03195 4.65 4.34e-06 ***
female -0.19781 0.05098 -3.88 0.00012 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5373 on 460 degrees of freedom
Multiple R-squared: 0.0663, Adjusted R-squared: 0.06224
F-statistic: 16.33 on 2 and 460 DF, p-value: 1.407e-07
The beauty coefficient becomes a bit bigger (0.149). This is because women tend to get lower teaching evaluations but higher beauty scores.
What is the \(R^{2}\) of the regression in (d)?
Does it increase or decrease compared to the regression which does not include female?
Think why that could be.
\(R^{2}\) goes from 0.0357 to 0.0663. Adding more variables will always increase the \(R^{2}\) .
Is it possible that even in this case, the interpretation of the effect of beauty might not necessarily be causal?
Think of why.
As usual we have to consider confounding factors. It is not immediately obvious which other factors could drive both perceived beauty and good teaching evaluations. But here is one story: beauty is, of course, a very subjective measure, and we perceive people who are well dressed as more beautiful. Perhaps the same traits that make a person dress well are also useful when teaching (e.g. their lecture slides are tidier). Hence, when we just consider beauty we may in part be picking up the effect of tidier lecture slides.