Exercises 5


Becoming a Data Detective: Uncovering Racial Bias & more

Ralf Martin https://mondpanther.github.io/wwwmondpanther/
2021-11-20

Exercise 5.1

For this question we will use a dataset from a randomized experiment conducted by Marianne Bertrand and Sendhil Mullainathan, who sent 4,870 fictitious resumes to employers in response to job adverts in Boston and Chicago in 2001. The resumes differ in various attributes including the names of the applicants, and different resumes were randomly allocated to job openings. Some of the names are distinctly white-sounding and some distinctly black-sounding. The researchers were interested in whether black-sounding names obtain fewer callbacks for interviews than white-sounding names. Load the data set bm.dta.

  library(haven)
  data <- read_dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AABua74TH54FcmOsAs0ayMY5a/bm.dta?dl=1")
  summary(data)
   education         ofjobs         yearsexp      computerskills  
 Min.   :0.000   Min.   :1.000   Min.   : 1.000   Min.   :0.0000  
 1st Qu.:3.000   1st Qu.:3.000   1st Qu.: 5.000   1st Qu.:1.0000  
 Median :4.000   Median :4.000   Median : 6.000   Median :1.0000  
 Mean   :3.618   Mean   :3.661   Mean   : 7.843   Mean   :0.8205  
 3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.: 9.000   3rd Qu.:1.0000  
 Max.   :4.000   Max.   :7.000   Max.   :44.000   Max.   :1.0000  
      call             female           black    
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0  
 1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:0.0  
 Median :0.00000   Median :1.0000   Median :0.5  
 Mean   :0.08049   Mean   :0.7692   Mean   :0.5  
 3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.0  

Part (a)

The data set contains two dummy variables (0-1 variables) for gender (female) and whether the applicant has computer skills (computerskills). Tabulate these variables by black.

Do gender and computer skills look balanced, i.e. random, across race groups?

  # This can be answered with the prop.table command

  
  table(data$female,data$black)
   
       0    1
  0  575  549
  1 1860 1886
  prop.table(table(data$female,data$black))
   
            0         1
  0 0.1180698 0.1127310
  1 0.3819302 0.3872690
  # Alternatively you can compute the shares using dplyr:
  library(dplyr)  # provides %>%, group_by(), summarise() and mutate()
  data %>% group_by(black, female) %>% 
    summarise(n=n()) %>% 
    group_by(female) %>%
    mutate(black_share_within_gender=n/sum(n)) 
# A tibble: 4 x 4
# Groups:   female [2]
  black female     n black_share_within_gender
  <dbl>  <dbl> <int>                     <dbl>
1     0      0   575                     0.512
2     0      1  1860                     0.497
3     1      0   549                     0.488
4     1      1  1886                     0.503
  data %>% group_by(black, female) %>% 
    summarise(n=n()) %>% 
    group_by(black) %>% 
    mutate(female_share_within_ethnicity=n/sum(n)) %>% 
    arrange(female)    # note that arrange is dplyr's sort command
# A tibble: 4 x 4
# Groups:   black [2]
  black female     n female_share_within_ethnicity
  <dbl>  <dbl> <int>                         <dbl>
1     0      0   575                         0.236
2     1      0   549                         0.225
3     0      1  1860                         0.764
4     1      1  1886                         0.775

There are many more female than male CVs (about 38% + 38% = 76% of the sample are female). However, gender seems not at all correlated with race; i.e. the split between black and non-black is virtually 50:50 for both men and women.

A similar pattern emerges for computer skills and racial background.

  prop.table(table(data$computerskills,data$black))
   
             0          1
  0 0.09568789 0.08377823
  1 0.40431211 0.41622177

Part (b)

Do a similar tabulation for education and the number of jobs previously held (ofjobs). These variables take on 5 and 7 different values, respectively.

Do education and the number of previous jobs look balanced across race groups?

To be sure, run a regression. Are the differences significant?

You can tabulate to eye-ball proportions but if you want to be sure that there aren’t small discrepancies, it is always best to run a regression.

  table(data$education,data$black)
   
       0    1
  0   18   28
  1   18   22
  2  142  132
  3  513  493
  4 1744 1760
  prop.table(table(data$education,data$black))
   
              0           1
  0 0.003696099 0.005749487
  1 0.003696099 0.004517454
  2 0.029158111 0.027104723
  3 0.105338809 0.101232033
  4 0.358110883 0.361396304
  table(data$ofjobs,data$black)
   
      0   1
  1  54  56
  2 347 357
  3 726 703
  4 800 811
  5 258 275
  6 243 221
  7   7  12
  prop.table(table(data$ofjobs,data$black))
   
              0           1
  1 0.011088296 0.011498973
  2 0.071252567 0.073305955
  3 0.149075975 0.144353183
  4 0.164271047 0.166529774
  5 0.052977413 0.056468172
  6 0.049897331 0.045379877
  7 0.001437372 0.002464066
There is no clear relation between education/number of jobs and race either. If we are worried about small discrepancies we could also run a regression to test whether the differences are significant:

  data$ofjobs=factor(data$ofjobs)
  summary(lm(black~ofjobs,data))

Call:
lm(formula = black ~ ofjobs, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.63158 -0.50341 -0.05394  0.49659  0.52371 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.509091   0.047690  10.675   <2e-16 ***
ofjobs2     -0.001989   0.051281  -0.039    0.969    
ofjobs3     -0.017138   0.049492  -0.346    0.729    
ofjobs4     -0.005677   0.049291  -0.115    0.908    
ofjobs5      0.006857   0.052381   0.131    0.896    
ofjobs6     -0.032798   0.053043  -0.618    0.536    
ofjobs7      0.122488   0.124264   0.986    0.324    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5002 on 4863 degrees of freedom
Multiple R-squared:  0.0007238, Adjusted R-squared:  -0.0005091 
F-statistic: 0.587 on 6 and 4863 DF,  p-value: 0.741

Hence the different job categories are individually (and jointly) insignificant. Note that an OLS regression summary always reports an F-test of the null hypothesis that all slope coefficients are jointly zero (the p-value of 0.741 reported at the end of the output).
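
To make the joint test explicit, we can compare the model against an intercept-only model; a minimal sketch (this should reproduce the F-statistic and p-value from the summary output):

  anova(lm(black ~ 1, data), lm(black ~ ofjobs, data))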

Part (c)

Look at the mean and standard deviation of the years of experience variable (yearsexp) separately for blacks and whites.

Does this variable look similar by race?

  mean(data$yearsexp[data$black==1])
[1] 7.829569
  sd(data$yearsexp[data$black==1])
[1] 5.010764
  mean(data$yearsexp[data$black==0])
[1] 7.856263
  sd(data$yearsexp[data$black==0])
[1] 5.079228
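
The means and standard deviations are virtually identical across the two groups. As a quick formal check we could also run a two-sample t-test; a minimal sketch:

  t.test(yearsexp ~ black, data = data)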

Part (d)

What do you make of the overall results on resume characteristics?

Is it important to figure out if these variables look similar across the race groups? Think why or why not.

If there was any evidence of a systematic relationship between race and any of those characteristics we could potentially be in trouble when simply comparing interview call backs for different race groups. Any differences found could simply be due to those other factors rather than racial bias by employers.

Part (e)

The variable of interest on the data set is the variable call, which indicates a call back for an interview.

What percentage of people receive a call back (rounded up to 2 decimal places)?

Do you find differences in call back rates by race?

  prop.table(table(data$call))

         0          1 
0.91950719 0.08049281 
  prop.table(table(data$call,data$black))
   
             0          1
  0 0.45174538 0.46776181
  1 0.04825462 0.03223819
  prop.table(table(data$call,data$black),2) 
   
             0          1
  0 0.90349076 0.93552361
  1 0.09650924 0.06447639
  # Note that by specifying 2 we report proportions by column; i.e. it divides
  # the number of cases falling in a cell by the total number in a column.
  # (if you specify 1 or omit the parameter it reports proportions by rows)

We see that most CVs never received a call back (i.e. overall only 8.05% received a call back).
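
Since call is a 0-1 variable, the callback rate by race is simply the group mean of call; a short dplyr sketch that reproduces the callback rates computed above:

  data %>% group_by(black) %>% summarise(callback_rate = mean(call))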

Part (f)

What do you conclude from the results of the Bertrand and Mullainathan experiment?

Are black people as likely to receive a call back as white people?

There seems to be a clear difference between the different ethnic backgrounds. Whereas call back rates for white-sounding names were above average (9.65%), they were below average for black-sounding names (6.45%), suggesting racial bias by employers.

Exercise 5.2 (more advanced)

Let's use the dataset from the experiment by Bertrand and Mullainathan (bm.dta) again.

  library(haven)  # read_dta() comes from haven, not foreign
  data <- read_dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AABua74TH54FcmOsAs0ayMY5a/bm.dta?dl=1")

Part (a)

Develop a regression to examine if the difference in interview callbacks between black and white “sounding” CVs is significantly different.

What is the code for the simplest linear regression (we call this model “mod”)? mod <-

What is the code for the simplest logit regression (we call this model “mod2”)? mod2 <-

  mod <- lm(call~black,data)
  mod2 <- glm(call~black,data,family=binomial)

Part (b)

Execute both regression models in turn.

Are the results similar?

Linear regression:

  mod <- lm(call~black,data)
  summary(mod)

Call:
lm(formula = call ~ black, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.09651 -0.09651 -0.06448 -0.06448  0.93552 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.096509   0.005505  17.532  < 2e-16 ***
black       -0.032033   0.007785  -4.115 3.94e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2716 on 4868 degrees of freedom
Multiple R-squared:  0.003466,  Adjusted R-squared:  0.003261 
F-statistic: 16.93 on 1 and 4868 DF,  p-value: 3.941e-05

Alternatively we can use logit:

  mod2 <- glm(call~black,data,family=binomial)
  summary(mod2)

Call:
glm(formula = call ~ black, family = binomial, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.4505  -0.4505  -0.3651  -0.3651   2.3416  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.23663    0.06863 -32.590  < 2e-16 ***
black       -0.43818    0.10732  -4.083 4.45e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2726.9  on 4869  degrees of freedom
Residual deviance: 2709.9  on 4868  degrees of freedom
AIC: 2713.9

Number of Fisher Scoring iterations: 5
    black
 -0.03232
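
The stray output above is the average marginal effect of black; the code chunk that produced it is missing from the original (the print format suggests a package such as margins). The same number can be computed by hand in base R as the mean of the logistic density at the fitted linear predictor times the coefficient; a minimal sketch:

  # Average marginal effect of black: mean slope of P(call=1) w.r.t. black
  mean(dlogis(predict(mod2))) * coef(mod2)["black"]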

Note that the marginal effect from the logit model (-0.0323) is very similar to the coefficient from the linear model (-0.0320).

Exercise 5.3

For this question, download the data set cps.dta, which comes from the responses to the monthly US Current Population Survey (CPS) in 2001, a large labour market survey. This data set contains data on 8,891 individuals living in Boston and Chicago. We want to use these data to compare the skills and employment outcomes of real black and white workers (as opposed to made-up CVs) and see how the findings differ from those of the exercises involving the bm.dta dataset.

  library(foreign)
  data <- read.dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AAAOb_v-Y2V0NN4-rxahZjl4a/cps.dta?dl=1")
  summary(data)
 employed        black            female          education   
 No  :1942   Min.   :0.0000   Min.   :0.0000   HSD     : 980  
 Yes :6931   1st Qu.:0.0000   1st Qu.:0.0000   HSG     :2358  
 NA's:  18   Median :0.0000   Median :1.0000   some col:2460  
             Mean   :0.1508   Mean   :0.5161   col+    :3093  
             3rd Qu.:0.0000   3rd Qu.:1.0000                  
             Max.   :1.0000   Max.   :1.0000                  
    yearsexp    
 Min.   : 0.00  
 1st Qu.:11.00  
 Median :21.00  
 Mean   :20.94  
 3rd Qu.:30.00  
 Max.   :59.00  

Part (a)

The data set contains a variable education, which takes on four values (high school dropouts, high school graduates, some college, and college degree and more). Use the education variable to create a new dummy for resumes indicating some college or more (i.e. those in the “some college” category plus those in the college and more category).

What percentage of respondents has at least some college education (rounded up to the nearest percent)?

  data=data %>% mutate(HE =  education=="some col" | education=="col+")
  mean(data$HE)  
[1] 0.6245642

Note that about 62.5% of the sample (63% rounded up) have at least some college education.

Part (b)

First make employed into a dichotomous variable and then conduct a regression analysis of the chances of being employed for people with different racial backgrounds.

Are black people significantly less likely to be employed?

If yes, what is the percentage difference (rounded to the nearest percentage point)?

  #modify employed as a dichotomous variable
  data=data%>% mutate(employed2=employed=="Yes")
  mod1 <- lm(employed2~black,data)
  summary(mod1)

Call:
lm(formula = employed2 ~ black, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7949  0.2051  0.2051  0.2051  0.2964 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.794879   0.004748  167.40  < 2e-16 ***
black       -0.091286   0.012237   -7.46 9.48e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4122 on 8871 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.006234,  Adjusted R-squared:  0.006122 
F-statistic: 55.65 on 1 and 8871 DF,  p-value: 9.481e-14

A regression of employment status on a dummy indicating a black racial background suggests that black people are significantly less likely to be employed. The difference amounts to 9.1 percentage points.

Part (c)

Conduct a regression analysis of the chances of having college education for people with different racial backgrounds.

What is the percentage difference (rounded to the nearest percentage point)?

  mod2 <- lm(HE~black,data)
  summary(mod2)

Call:
lm(formula = HE ~ black, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6420 -0.6420  0.3580  0.3580  0.4735 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.641987   0.005553 115.607  < 2e-16 ***
black       -0.115514   0.014299  -8.078 7.41e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4825 on 8889 degrees of freedom
Multiple R-squared:  0.007288,  Adjusted R-squared:  0.007177 
F-statistic: 65.26 on 1 and 8889 DF,  p-value: 7.414e-16

The difference in the probability of having higher education between people of non-black and black background is about 11.6 percentage points (12 when rounded to the nearest point).

Part (d)

On the basis of your evidence what can you conclude about racial discrimination in the US labor market?

Do we see an open-and-shut case of racial bias here?

Think of the potential caveats and alternative explanations.

What analysis could you undertake to address some of these caveats?

The result in (b) is consistent with racial bias. However, from (c) we also see that people with a black background tend to have less college education, and college education is another major driver of being employed:

  summary(lm(employed2~HE,data))

Call:
lm(formula = employed2 ~ HE, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8206  0.1794  0.1794  0.2845  0.2845 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.715486   0.007109  100.64   <2e-16 ***
HETRUE      0.105124   0.008996   11.69   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4104 on 8871 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.01516,   Adjusted R-squared:  0.01505 
F-statistic: 136.5 on 1 and 8871 DF,  p-value: < 2.2e-16

Hence, far from implying a racial issue, the result in (b) could simply reflect employers' preference for more highly educated workers. We can examine this by doing the analysis in (b) separately for workers with different educational attainment; i.e.:

  mod3 <- lm(employed2~black,data,subset=data$HE==1)
  summary(mod3)

Call:
lm(formula = employed2 ~ black, data = data, subset = data$HE == 
    1)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8260  0.1740  0.1740  0.1740  0.2162 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.825961   0.005513 149.809  < 2e-16 ***
black       -0.042177   0.015479  -2.725  0.00645 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3835 on 5539 degrees of freedom
  (12 observations deleted due to missingness)
Multiple R-squared:  0.001339,  Adjusted R-squared:  0.001158 
F-statistic: 7.425 on 1 and 5539 DF,  p-value: 0.006453
  mod4 <- lm(employed2~black,data,subset=data$HE==0)
  summary(mod4)

Call:
lm(formula = employed2 ~ black, data = data, subset = data$HE == 
    0)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7392 -0.6145  0.2608  0.2608  0.3855 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.739163   0.008636   85.59  < 2e-16 ***
black       -0.124629   0.019814   -6.29 3.59e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4487 on 3330 degrees of freedom
  (6 observations deleted due to missingness)
Multiple R-squared:  0.01174,   Adjusted R-squared:  0.01144 
F-statistic: 39.56 on 1 and 3330 DF,  p-value: 3.588e-10

The results suggest that for either group there is a significant racial gap when it comes to being employed. Note that the effect is considerably stronger for less educated workers. Hence this reinforces the hypothesis that there is discrimination against workers with a black background which is unrelated to their productivity in the workplace.

However, there might be further caveats: our simple regression cannot account for the quality of the college education which can vary considerably and might vary systematically along racial lines. Furthermore, an important driver for a good education and for various other skills might have to do with parental income and status. Again, this is likely to vary systematically along racial lines.

While it is interesting to ask if employers discriminate above and beyond what could be expected on the basis of workers' education and skills (which is what we were implicitly doing above), we might also be concerned about the overall impact of racial background on labor market outcomes, which includes initially different educational outcomes. Hence, depending on our interest, we might be primarily focused on the effect of race holding education fixed, or we might be focused on the overall effect.

Exercise 5.4

Consider once more the cps.dta dataset. In Exercise 5.3 we argued that not accounting for education might bias our estimate of the impact of racial background. We then examined the issue by looking at college educated vs non college educated workers separately.

  library(foreign)
  data <- read.dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AAAOb_v-Y2V0NN4-rxahZjl4a/cps.dta?dl=1")
  summary(data)
 employed        black            female          education   
 No  :1942   Min.   :0.0000   Min.   :0.0000   HSD     : 980  
 Yes :6931   1st Qu.:0.0000   1st Qu.:0.0000   HSG     :2358  
 NA's:  18   Median :0.0000   Median :1.0000   some col:2460  
             Mean   :0.1508   Mean   :0.5161   col+    :3093  
             3rd Qu.:0.0000   3rd Qu.:1.0000                  
             Max.   :1.0000   Max.   :1.0000                  
    yearsexp    
 Min.   : 0.00  
 1st Qu.:11.00  
 Median :21.00  
 Mean   :20.94  
 3rd Qu.:30.00  
 Max.   :59.00  

Part (a)

First add the HE (higher education) variable to the data frame and recode the employed variable into a dichotomous variable called “employed2”. Can you propose an alternative strategy using a multivariate regression approach?

What is the command for a linear regression with two variables (we call this model “mod1”)? mod1 <-

What is the command for a slightly more complex linear regression with two variables and their interaction (we call this model “mod2”)? mod2 <-

  #adding HE variable to dataframe
  data["HE"] <- as.integer(data$education=="some col" | data$education=="col+")
  #modify employed as a dichotomous variable
  data["employed2"] <- as.integer(data$employed)-1
  mod1 <- lm(employed2~black+HE,data)
  summary(mod1)

Call:
lm(formula = employed2 ~ black + HE, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8307  0.1693  0.1693  0.2694  0.3491 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.730628   0.007462  97.917  < 2e-16 ***
black       -0.079705   0.012198  -6.534 6.75e-11 ***
HE           0.100094   0.009008  11.111  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4094 on 8870 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.01988,   Adjusted R-squared:  0.01966 
F-statistic: 89.94 on 2 and 8870 DF,  p-value: < 2.2e-16
  mod2 <- lm(employed2~black+HE+black*HE,data)
  summary(mod2)

Call:
lm(formula = employed2 ~ black + HE + black * HE, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8260  0.1740  0.1740  0.2608  0.3855 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.739163   0.007876  93.849  < 2e-16 ***
black       -0.124629   0.018070  -6.897 5.68e-12 ***
HE           0.086798   0.009831   8.829  < 2e-16 ***
black:HE     0.082451   0.024481   3.368  0.00076 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4092 on 8869 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.02113,   Adjusted R-squared:  0.0208 
F-statistic: 63.81 on 3 and 8869 DF,  p-value: < 2.2e-16

Part (b)

There are potentially two models you might have used in part (a): one that implies that the effect of race is the same for both educational groups, and one that allows the effect of race to differ across educational groups.

Which model is more appropriate?

The test for the interaction coefficient is significant. That suggests the negative “black” effect is significantly less strong for more highly educated workers; i.e. it is important to use the more complicated model with the interaction term.

Some alternative ways of testing include:

  anova(mod2,mod1)
Analysis of Variance Table

Model 1: employed2 ~ black + HE + black * HE
Model 2: employed2 ~ black + HE
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1   8869 1484.9                                  
2   8870 1486.8 -1   -1.8992 11.344 0.0007603 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  # or alternatively
  library(lmtest)
  waldtest(mod2,mod1)
Wald test

Model 1: employed2 ~ black + HE + black * HE
Model 2: employed2 ~ black + HE
  Res.Df Df      F    Pr(>F)    
1   8869                        
2   8870 -1 11.344 0.0007603 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  # or alternatively
  coeftest(mod2)

t test of coefficients:

              Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  0.7391627  0.0078761 93.8488 < 2.2e-16 ***
black       -0.1246287  0.0180702 -6.8969 5.677e-12 ***
HE           0.0867985  0.0098305  8.8295 < 2.2e-16 ***
black:HE     0.0824513  0.0244806  3.3680 0.0007603 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Part (c)

On the basis of your regressions in (b), what is the racial gap for college educated people in percentage points (to 1 decimal place)?

What is the racial gap for non college educated people in percentage points (to 1 decimal place)?

Think of how this compares to your findings in Exercise 5.3.

College educated black people have a 12.5 - 8.2 = 4.3 percentage point lower likelihood of being employed (4.2 points using the unrounded coefficients). For non college educated people the probability is 12.5 percentage points lower; i.e. up to rounding error this corresponds exactly to what we found in Exercise 5.3.
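
These numbers can be read off the coefficients of mod2 from part (a) directly; a quick check:

  coef(mod2)["black"]                           # gap for non college educated, about -0.125
  coef(mod2)["black"] + coef(mod2)["black:HE"]  # gap for college educated, about -0.042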

Exercise 5.5

Use wage1.dta and examine once more the relationship between education and wages.

  library(foreign)
  library(car)
  data <- read.dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AADZ4MZhDDk9R8sFSjBvmcRma/WAGE1.DTA?dl=1")

Would you say the relationship is different for men and women?

We can allow for a separate response for women by including a female dummy and also interacting the education variable with the female dummy. Note that you could do this by creating the various variables first. A faster way is to use R's ability to create new variables as part of the model description in the lm command:

  mod2 <- lm(wage~educ+educ*female,data)
  summary(mod2)

Call:
lm(formula = wage ~ educ + educ * female, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.1611 -1.8028 -0.6367  1.0054 15.5258 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.20050    0.84356   0.238    0.812    
educ         0.53948    0.06422   8.400 4.24e-16 ***
female      -1.19852    1.32504  -0.905    0.366    
educ:female -0.08600    0.10364  -0.830    0.407    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.186 on 522 degrees of freedom
Multiple R-squared:  0.2598,    Adjusted R-squared:  0.2555 
F-statistic: 61.07 on 3 and 522 DF,  p-value: < 2.2e-16
  linearHypothesis(mod2,c("educ:female=0","female=0"))
Linear hypothesis test

Hypothesis:
educ:female = 0
female = 0

Model 1: restricted model
Model 2: wage ~ educ + educ * female

  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    524 5980.7                                  
2    522 5300.2  2    680.51 33.511 2.031e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that this leads to “female” parameters that are individually not significant. However, a joint significance test reveals that the variables combined have explanatory power; i.e. the low individual significance is likely a consequence of collinearity between the female and female*education variables. We can also see this by dropping the female dummy and keeping only the interaction term, which then becomes highly significant (and negative):

    summary( lm(wage~educ+female:educ,data))

Call:
lm(formula = wage ~ educ + female:educ, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.3234 -1.7394 -0.6856  1.0567 15.5795 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.28526    0.65041  -0.439    0.661    
educ         0.57548    0.05039  11.421  < 2e-16 ***
educ:female -0.17764    0.02183  -8.138 2.95e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.186 on 523 degrees of freedom
Multiple R-squared:  0.2586,    Adjusted R-squared:  0.2558 
F-statistic: 91.23 on 2 and 523 DF,  p-value: < 2.2e-16

You might consequently ask, which is the right model? There are several considerations:

  1. Work with the more general model: even though the coefficients are individually not significant, what matters is that they are jointly significant.
  2. Consider theory: is it more plausible that there are other factors that imply that women have a lower wage whatever they do, or is it more plausible that the impact of education is radically different? The former seems more plausible to me, but you are free to differ.

Exercise 5.6

Use the production2.dta dataset. This data set contains data on output (value added) and inputs at the industry level for 459 industries in 1958 and 1993.

  library(foreign)
  library(lmtest)
  library(car)
  data <- read.dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AAB8jv5xON_4OlZG50sj2l-xa/production2.dta?dl=1")
  summary(data)
      sic            year           emp             vship         
 Min.   :2011   Min.   :58.0   Min.   :  0.10   Min.   :    24.9  
 1st Qu.:2437   1st Qu.:58.0   1st Qu.: 10.30   1st Qu.:   323.3  
 Median :3221   Median :75.5   Median : 19.30   Median :  1026.6  
 Mean   :3039   Mean   :75.5   Mean   : 35.59   Mean   :  3768.3  
 3rd Qu.:3560   3rd Qu.:93.0   3rd Qu.: 40.05   3rd Qu.:  3493.3  
 Max.   :3999   Max.   :93.0   Max.   :511.40   Max.   :167825.8  
    matcost              vadd             energy       
 Min.   :     8.1   Min.   :    9.1   Min.   :   0.20  
 1st Qu.:   155.4   1st Qu.:  155.7   1st Qu.:   3.20  
 Median :   510.9   Median :  492.0   Median :  13.85  
 Mean   :  1997.6   Mean   : 1773.3   Mean   :  71.80  
 3rd Qu.:  1726.9   3rd Qu.: 1699.1   3rd Qu.:  53.15  
 Max.   :120458.8   Max.   :47272.0   Max.   :3780.60  
      cap         
 Min.   :    3.7  
 1st Qu.:  292.2  
 Median :  682.6  
 Mean   : 1974.8  
 3rd Qu.: 1711.6  
 Max.   :61909.7  

Suppose the relationship between output and inputs is described by a Cobb-Douglas production function: \[Y_{i} = AK_{i}^{\alpha}L_{i}^{\beta}\]

where \(Y_{i}\) is a measure of output, \(K_{i}\) is the capital stock, and \(L_{i}\) is employment. Answer all questions for the year 1958 only.

Part (a)

Transform the production function to a linear equation by taking logs. Estimate the parameters \(\alpha\) and \(\beta\) by an OLS regression using total value added as your measure of output.

What is the coefficient on the log(emp) variable (rounded to 3 decimal places)?

What is the coefficient on the log(cap) variable (rounded to 3 decimal places)?

We can estimate the following linear regression model:

\[\ln Y_{i}=\ln A+\alpha \ln K_{i} +\beta \ln L_{i} +\varepsilon_{i}\]

  mod1 <- lm(log(vadd)~log(emp)+log(cap),data[data$year==58,])
  summary(mod1)

Call:
lm(formula = log(vadd) ~ log(emp) + log(cap), data = data[data$year == 
    58, ])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.72254 -0.16410 -0.02135  0.16161  1.17701 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.41898    0.06172   22.99   <2e-16 ***
log(emp)     0.69748    0.01868   37.34   <2e-16 ***
log(cap)     0.27609    0.01430   19.30   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2754 on 456 degrees of freedom
Multiple R-squared:  0.9269,    Adjusted R-squared:  0.9266 
F-statistic:  2893 on 2 and 456 DF,  p-value: < 2.2e-16

Part (b)

Test whether your estimates are consistent with the production function exhibiting constant returns to scale, i.e. 

\[H_{0}: \alpha+\beta=1\]

against the alternative

\[H_{1}: \alpha+\beta\neq1\]

Do you reject the null hypothesis at the 5% level?

What is the p-value of your test (rounded to 3 decimal places)?

  library(car)
  linearHypothesis(mod1, "log(emp)+log(cap)=1")
Linear hypothesis test

Hypothesis:
log(emp)  + log(cap) = 1

Model 1: restricted model
Model 2: log(vadd) ~ log(emp) + log(cap)

  Res.Df    RSS Df Sum of Sq     F  Pr(>F)  
1    457 34.892                             
2    456 34.580  1   0.31205 4.115 0.04309 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Part (c) (Advanced)

An alternative way to test the hypothesis of constant returns to scale is to impose this restriction on the parameters and transform your regression model. Derive the necessary transformation, and show how the constant returns hypothesis amounts to a t-test in this transformed model. Carry out this test.

Does your result match what you found in (b)?

If \(H_{0}\) is true we have \(\beta=1-\alpha\). Substituting this into the logged production function and rearranging gives \[\ln\frac{Y}{L}=\ln A+\alpha \ln\frac{K}{L}+(\alpha+\beta-1)\ln L+\varepsilon\]

Hence we can regress \(\ln\frac{Y}{L}\) on \(\ln\frac{K}{L}\) and \(\ln L\) and do a t-test of the hypothesis that the coefficient on \(\ln L\) (which equals \(\alpha+\beta-1\)) is 0:

  mod2 <- lm(log(vadd/emp)~log(cap/emp)+log(emp),data[data$year==58,])
  summary(mod2)

Call:
lm(formula = log(vadd/emp) ~ log(cap/emp) + log(emp), data = data[data$year == 
    58, ])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.72254 -0.16410 -0.02135  0.16161  1.17701 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.41898    0.06172  22.991   <2e-16 ***
log(cap/emp)  0.27609    0.01430  19.304   <2e-16 ***
log(emp)     -0.02643    0.01303  -2.029   0.0431 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2754 on 456 degrees of freedom
Multiple R-squared:  0.4571,    Adjusted R-squared:  0.4547 
F-statistic: 191.9 on 2 and 456 DF,  p-value: < 2.2e-16

Note that we get the same p-value as before because it is essentially the same test.

Part (d)

What is the average size in numbers employed across all industries in 1958?

Suppose an industry of average size employs an additional 1000 workers. What does the model estimated in part (a) imply about the effect this will have on value added (in percent, to 1 decimal place)?

  memp58 <- mean(data$emp[data$year==58])
  memp58
[1] 34.26035

The average industry employs about 34 thousand workers (note that employment is measured in thousands).

Adding 1000 workers to the average industry implies an increase in employment of about 3%. Given our estimate of \(\beta \approx 0.7\), which we can interpret as an elasticity, we would expect value added to increase by about 3% × 0.7 = 2.1%.
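
A quick version of this computation in code, using the estimates from part (a) (the small discrepancy with 2.1% is just rounding):

  d_log_emp <- 1/memp58               # +1000 workers = +1 unit (emp is in thousands), about 3%
  coef(mod1)["log(emp)"]*d_log_emp    # implied increase in value added, about 2%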

Part (e)

Do you think that our estimates in (a) could be biased? If so, why would that be?

There is a concern of endogeneity with regressions such as in (a). While more input factors clearly have a causal effect on output, it is also plausible that industries which, for whatever reason, experience a positive shock to their output attract more workers and capital. What's more, with more than one explanatory variable it is no longer easy to predict the sign of the bias: it depends not only on the relationship between the error term and each explanatory variable, but also on the relationship between the explanatory variables themselves and on the relative strength of the endogeneity for different variables. For instance, in production function regressions the suspicion is that the labour coefficient is biased upward whereas the capital coefficient is biased downward; not because one is positively and the other negatively correlated with the shock, but because labour, being the more flexible production factor, is likely to be more strongly positively correlated with the shock than capital.

Exercise 5.7

Use the dataset attend.dta to analyze whether attending lectures has a causal effect on final exam performance. The dataset contains 674 observations on college students who took a particular course. Most variables on the dataset should be self-explanatory. The ACT is a college entry test. GPA is grade point average, the average performance in all courses.

  library(foreign)
  library(lmtest)
  library(car)
  data <- read.dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AABGEGjs2k7bZZQxU0H9SDtga/attend.dta?dl=1")
  summary(data)
     attend         termgpa          priGPA           ACT       
 Min.   : 2.00   Min.   :0.000   Min.   :0.857   Min.   :13.00  
 1st Qu.:24.00   1st Qu.:2.150   1st Qu.:2.200   1st Qu.:20.00  
 Median :28.00   Median :2.680   Median :2.560   Median :22.00  
 Mean   :26.28   Mean   :2.614   Mean   :2.592   Mean   :22.48  
 3rd Qu.:30.00   3rd Qu.:3.120   3rd Qu.:2.950   3rd Qu.:25.00  
 Max.   :32.00   Max.   :4.000   Max.   :3.930   Max.   :32.00  
     final          atndrte           hwrte            frosh     
 Min.   :10.00   Min.   :  6.25   Min.   : 12.50   Min.   :0.00  
 1st Qu.:22.00   1st Qu.: 75.00   1st Qu.: 87.50   1st Qu.:0.00  
 Median :26.00   Median : 87.50   Median :100.00   Median :0.00  
 Mean   :25.89   Mean   : 82.12   Mean   : 87.91   Mean   :0.23  
 3rd Qu.:29.00   3rd Qu.: 93.75   3rd Qu.:100.00   3rd Qu.:0.00  
 Max.   :39.00   Max.   :100.00   Max.   :100.00   Max.   :1.00  
      soph           skipped          stndfnl        
 Min.   :0.0000   Min.   : 0.000   Min.   :-3.30882  
 1st Qu.:0.0000   1st Qu.: 2.000   1st Qu.:-0.78782  
 Median :1.0000   Median : 4.000   Median : 0.05252  
 Mean   :0.5801   Mean   : 5.721   Mean   : 0.02914  
 3rd Qu.:1.0000   3rd Qu.: 8.000   3rd Qu.: 0.68277  
 Max.   :1.0000   Max.   :30.000   Max.   : 2.78361  

Part (a)

Run a regression of stndfnl, the standardized final exam score, on attend, the number of lectures attended (note: the data is from the US where they call a lecture a class).

How much will attending one extra lecture add to the standardised final score (4 decimal places)?

Is the effect large or small?

Attending one extra lecture adds 0.0278 to the standardized final score. In total there were 32 classes. Based on this, attending only half of those classes would imply a reduction in the final outcome of 16 × 0.0278 ≈ 0.44. This seems fairly substantial. While it is not enough to make the difference between a weak student (e.g. somebody at the bottom quartile, stndfnl = -0.788) and a strong student (somebody at the top quartile, stndfnl = 0.683), it has the potential to move a median student (stndfnl = 0.053) a long way towards top quartile territory (see the summary stats for stndfnl below).

  mod1 <- lm(stndfnl~attend,data)
  summary(mod1)

Call:
lm(formula = stndfnl ~ attend, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4415 -0.6700 -0.0245  0.6614  2.6787 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.702364   0.192243  -3.654 0.000279 ***
attend       0.027836   0.007172   3.881 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9824 on 672 degrees of freedom
Multiple R-squared:  0.02192,   Adjusted R-squared:  0.02047 
F-statistic: 15.06 on 1 and 672 DF,  p-value: 0.0001143
  summary(data$stndfnl)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.30882 -0.78782  0.05252  0.02914  0.68277  2.78361 
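  # Standardized effect: a one-sd change in attendance, in sds of the final score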
  mod1$coefficients[[2]]*sd(data$attend)/sd(data$stndfnl)
[1] 0.1480643

Part (b)

Do you think that our estimates in (a) could be biased?

What would be the ideal way of addressing the problem?

The regression in (a) could be biased for a number of reasons. The main worry would be that students who are better irrespective of attendance are also more diligent in general and therefore more likely to attend class.

The best way of dealing with endogeneity in this context would be to use an instrumental variable. In the absence of a good instrument (a nice instrument would be, for example, if some students were kept from attending class by a weather or transport problem), we have to rely on the inclusion of further controls which hopefully capture ability/diligence etc. of students.
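
For concreteness, an IV estimation could look like the sketch below. Note that weather_shock is a purely hypothetical instrument that does not exist in attend.dta, so the estimation call is left commented out:

  library(AER)  # provides ivreg() for two-stage least squares
  # mod_iv <- ivreg(stndfnl ~ attend | weather_shock, data = data)  # hypothetical instrument
  # summary(mod_iv)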

Part (c)

Enter each of the following variables one at a time as a control in your regression: termgpa, priGPA, ACT. For each of these controls, answer the following questions:

  1. Does entering the control variable termgpa help solve the problem you discussed in part (b) and get you closer to a causal effect of lecture attendance?

  2. Does entering the control variable priGPA create potential new problems in interpreting the coefficient on attend causally? Why?

  3. What happens to the coefficient on attend when you add the ACT variable? It becomes slightly larger. Interpret this result.

  mod2 <- lm(stndfnl~attend+termgpa,data)
  summary(mod2)

Call:
lm(formula = stndfnl ~ attend + termgpa, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3122 -0.5612 -0.0228  0.5745  2.3635 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.273935   0.166898  -7.633 7.90e-14 ***
attend      -0.035146   0.007222  -4.866 1.42e-06 ***
termgpa      0.851879   0.052597  16.196  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8336 on 671 degrees of freedom
Multiple R-squared:  0.2968,    Adjusted R-squared:  0.2947 
F-statistic: 141.6 on 2 and 671 DF,  p-value: < 2.2e-16

The term grade point average is strongly positive and significant, and its inclusion turns the attendance effect negative. Note that termgpa is most likely itself affected by higher attendance; by including it as a separate explanatory variable the resulting estimate ignores this channel. Instead, what we are picking up is the final score of people who attended class a lot without it making a difference to their term GPA. In other words, these are probably very weak students who desperately tried to improve by attending classes a lot, but it did not have a positive effect on either their term GPA or their final score.

  mod3 <- lm(stndfnl~attend+priGPA,data)
  summary(mod3)

Call:
lm(formula = stndfnl ~ attend + priGPA, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4949 -0.6228 -0.0443  0.6241  2.4043 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.698644   0.208907  -8.131 2.06e-15 ***
attend      -0.001748   0.007425  -0.235    0.814    
priGPA       0.684353   0.072085   9.494  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9231 on 671 degrees of freedom
Multiple R-squared:  0.1377,    Adjusted R-squared:  0.1352 
F-statistic:  53.6 on 2 and 671 DF,  p-value: < 2.2e-16

priGPA is the GPA before the term. Again, we find a strong positive relation with stndfnl which is not surprising as ability to get good (or bad) results is very persistent. Again, the attendance coefficient becomes negative although this is not significant. Unlike for termgpa we cannot argue that attendance causes priGPA as well: attendance cannot affect past grades. So perhaps this means that there is really no causal effect of attendance on final outcomes.

  mod4 <- lm(stndfnl~attend+ACT,data)
  summary(mod4)

Call:
lm(formula = stndfnl ~ attend + ACT, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3142 -0.5702 -0.0045  0.6175  2.4836 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.448301   0.307065 -11.230  < 2e-16 ***
attend       0.037599   0.006671   5.636 2.56e-08 ***
ACT          0.110749   0.010114  10.950  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9056 on 671 degrees of freedom
Multiple R-squared:  0.1702,    Adjusted R-squared:  0.1677 
F-statistic: 68.81 on 2 and 671 DF,  p-value: < 2.2e-16

The college entrance results captured by ACT are another way of controlling for ability. However, including it, far from destroying the effect of attendance, makes it slightly stronger. A mechanism that could explain this is as follows: far from being more diligent, the most able students might actually have a somewhat cavalier attitude. They know they are good and they know they do not need to attend class to be good. If this were the dominant effect we would expect a downward bias in the simple regression of final scores on attendance. It is plausible that ACT captures this, as it is the college entrance score.

Part (d)

Drawing on your discussion in (c), which of the control variables termgpa, priGPA, ACT would you like to have in your regression in order to uncover the causal effect of lecture attendance?

Why? Run your preferred specification and think of why it’s best.

In reality both effects will potentially cause endogeneity (diligent students attending more classes and cavalier geniuses not attending). Our best bet to control for this is to include both ACT and priGPA. However, we should not include termgpa because of the potential reverse causality. This leads to a lower effect of attendance (0.017), but it is still significant.

  mod5 <- lm(stndfnl~attend+priGPA+ACT,data)
  summary(mod5)

Call:
lm(formula = stndfnl ~ attend + priGPA + ACT, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2386 -0.5518 -0.0396  0.5927  2.3329 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.359279   0.301720 -11.134  < 2e-16 ***
attend       0.017415   0.007603   2.291   0.0223 *  
priGPA       0.410436   0.078675   5.217 2.43e-07 ***
ACT          0.083060   0.011253   7.381 4.66e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8884 on 670 degrees of freedom
Multiple R-squared:  0.2026,    Adjusted R-squared:  0.199 
F-statistic: 56.74 on 3 and 670 DF,  p-value: < 2.2e-16

Part (e)

Students who are diligent in attending lectures may also be more diligent about other aspects of their coursework, like completing homework. There is a variable hwrte in the dataset indicating the percentage of homework turned in. Add this variable as a regressor to your preferred specification from (d).

Is hwrte significant at the 5% level?

Is it a regressor you want in order to uncover the causal effect of lecture attendance on student performance?

What happens to the coefficient on attend?

Interpret your results.

  mod6 <- lm(stndfnl~attend+priGPA+ACT+hwrte,data)
  summary(mod6)

Call:
lm(formula = stndfnl ~ attend + priGPA + ACT + hwrte, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.98362 -0.56827 -0.03468  0.60324  2.32557 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.471448   0.308536 -11.251  < 2e-16 ***
attend       0.009107   0.009046   1.007   0.3144    
priGPA       0.400493   0.078786   5.083 4.82e-07 ***
ACT          0.083834   0.011246   7.454 2.81e-13 ***
hwrte        0.003855   0.002282   1.689   0.0917 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8872 on 669 degrees of freedom
Multiple R-squared:  0.206, Adjusted R-squared:  0.2012 
F-statistic: 43.39 on 4 and 669 DF,  p-value: < 2.2e-16

Including hwrte reduces the coefficient on attend and renders it insignificant. However, hwrte itself is not significant at the 5% level either. This is likely because the two variables are highly collinear: students who attend class a lot are also more likely to do their homework diligently, so it is difficult to separately identify the two effects.
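
We can gauge the collinearity directly by looking at the correlation between the two regressors; a quick check:

  cor(data$attend, data$hwrte)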

It is easy to imagine a situation where knowing them separately would be useful: suppose a teacher is trying to figure out whether it would be more useful to give another lecture or instead to set another homework. Equally, it might not be important, or might even be outright wrong, to try to distinguish the two; e.g. suppose a teacher wants to work out the effect of introducing sanctions (i.e. a negative participation grade) for not attending lectures. Part of the mechanism from more lectures to better grades might be a positive effect on homework as well (e.g. students might be less forgetful about homework or might worry about looking bad in class). This causal effect of attendance would be ignored in a joint regression.

Part (f)

There is a variable skipped in the dataset indicating the number of skipped lectures. Add this variable as a regressor to your preferred specification from (d).

What happens and why? The coefficient on skipped is reported as NA. Interpret this result.

“skipped” is perfectly collinear with “attend”: there were 32 classes in total, so skipped = 32 - attend. R therefore cannot estimate separate coefficients and drops skipped from the regression.
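
We can verify this exact linear relationship directly; a quick check:

  # attend + skipped should be constant at 32 for every student
  summary(data$attend + data$skipped)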

  mod7 <- lm(stndfnl~attend+priGPA+ACT+hwrte+skipped,data)
  summary(mod7)

Call:
lm(formula = stndfnl ~ attend + priGPA + ACT + hwrte + skipped, 
    data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.98362 -0.56827 -0.03468  0.60324  2.32557 

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.471448   0.308536 -11.251  < 2e-16 ***
attend       0.009107   0.009046   1.007   0.3144    
priGPA       0.400493   0.078786   5.083 4.82e-07 ***
ACT          0.083834   0.011246   7.454 2.81e-13 ***
hwrte        0.003855   0.002282   1.689   0.0917 .  
skipped            NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8872 on 669 degrees of freedom
Multiple R-squared:  0.206, Adjusted R-squared:  0.2012 
F-statistic: 43.39 on 4 and 669 DF,  p-value: < 2.2e-16

Exercise 5.8

Download the data TeachingRatings.dta. This data set contains data on the teaching evaluations of 463 professors at the University of Texas, and various attributes of the professor and course.

  library(foreign)
  library(lmtest)
  data <- read.dta("https://www.dropbox.com/sh/rqmo1hvij1veff0/AACPj9dkKUrGIGji8Ar8nEPla/TeachingRatings.dta?dl=1")
  summary(data)
    minority           age            female         onecredit      
 Min.   :0.0000   Min.   :29.00   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :0.0000   Median :48.00   Median :0.0000   Median :0.00000  
 Mean   :0.1382   Mean   :48.37   Mean   :0.4212   Mean   :0.05832  
 3rd Qu.:0.0000   3rd Qu.:57.00   3rd Qu.:1.0000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :73.00   Max.   :1.0000   Max.   :1.00000  
     beauty          course_eval        intro       
 Min.   :-1.45049   Min.   :2.100   Min.   :0.0000  
 1st Qu.:-0.65627   1st Qu.:3.600   1st Qu.:0.0000  
 Median :-0.06801   Median :4.000   Median :0.0000  
 Mean   : 0.00000   Mean   :3.998   Mean   :0.3391  
 3rd Qu.: 0.54560   3rd Qu.:4.400   3rd Qu.:1.0000  
 Max.   : 1.97002   Max.   :5.000   Max.   :1.0000  
   nnenglish      
 Min.   :0.00000  
 1st Qu.:0.00000  
 Median :0.00000  
 Mean   :0.06048  
 3rd Qu.:0.00000  
 Max.   :1.00000  

Part (a)

Run a regression of course_eval on beauty. Beauty is an index that was based on a subjective scoring.

What is the slope coefficient of the regression?

We get a slope coefficient of 0.133.

  mod1 <- lm(course_eval~beauty,data)
  summary(mod1)

Call:
lm(formula = course_eval ~ beauty, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.80015 -0.36304  0.07254  0.40207  1.10373 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.99827    0.02535 157.727  < 2e-16 ***
beauty       0.13300    0.03218   4.133 4.25e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5455 on 461 degrees of freedom
Multiple R-squared:  0.03574,   Adjusted R-squared:  0.03364 
F-statistic: 17.08 on 1 and 461 DF,  p-value: 4.247e-05

Part (b)

The number in (a) will not have much meaning to anyone who doesn’t know anything about the data.

Is the effect of beauty on course_eval big or small? How would you go about assessing this?

To understand if something is big or small we need some sort of benchmark to compare it to. Lacking other obvious benchmarks, we can always look at what a reasonable change in the explanatory variable implies for the dependent variable. For instance, we could see what happens if we change the beauty variable by 1 standard deviation (sd(data$beauty) = 0.789). The coefficient above suggests that this would lead to a change of 0.133 \(\times\) 0.789 = 0.105.

Is this big? That depends in part on how much the dependent variable usually varies; e.g. we can compare it to the standard deviation of the dependent variable, which is 0.555. Hence, the beauty effect amounts to about 18.9% of the evaluation standard deviation. That seems quite a lot for something that is not exactly relevant to learning. But of course that depends a bit on the eye of the beholder; we could understand very well if some of you only came to university to enjoy the physical beauty of your lecturers.
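
The same numbers can be reproduced in code; a minimal sketch:

  sd_effect <- mod1$coefficients[[2]]*sd(data$beauty)  # about 0.105
  sd_effect/sd(data$course_eval)                       # about 0.189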

As an alternative we can look at the change in the inter-quartile (i.e. the difference between the 75th and 25th percentile) range in beauty and see how that compares to the inter-quartile range in course scores:

  a=mod1$coefficients[[2]]*(quantile(data$beauty,0.75)[[1]]-quantile(data$beauty,0.25)[[1]])
  b=(quantile(data$course_eval,0.75)[[1]]-quantile(data$course_eval,0.25)[[1]])
  
  a/b
[1] 0.1998132

i.e. this gives us a very similar answer (about 20%).

Part (c)

Is the slope coefficient in (a) statistically significantly different from zero?

We get a significant slope coefficient.

  mod1 <- lm(course_eval~beauty,data)
  summary(mod1)

Call:
lm(formula = course_eval ~ beauty, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.80015 -0.36304  0.07254  0.40207  1.10373 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.99827    0.02535 157.727  < 2e-16 ***
beauty       0.13300    0.03218   4.133 4.25e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5455 on 461 degrees of freedom
Multiple R-squared:  0.03574,   Adjusted R-squared:  0.03364 
F-statistic: 17.08 on 1 and 461 DF,  p-value: 4.247e-05

Part (d)

Run a regression of course_eval on beauty and female. Did the coefficient on beauty become bigger or smaller?

How would you explain this?

  mod2 <- lm(course_eval~beauty+female,data)
  summary(mod2)

Call:
lm(formula = course_eval ~ beauty + female, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.87197 -0.36913  0.03493  0.39919  1.03237 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.08158    0.03293  123.94  < 2e-16 ***
beauty       0.14859    0.03195    4.65 4.34e-06 ***
female      -0.19781    0.05098   -3.88  0.00012 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5373 on 460 degrees of freedom
Multiple R-squared:  0.0663,    Adjusted R-squared:  0.06224 
F-statistic: 16.33 on 2 and 460 DF,  p-value: 1.407e-07

The beauty coefficient becomes a bit bigger (0.149). This is because women tend to get lower teaching evaluations but higher beauty scores.
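
We can check this pattern directly by comparing group means; a quick sketch:

  # Mean beauty score and mean course evaluation by gender
  aggregate(cbind(beauty, course_eval) ~ female, data = data, FUN = mean)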

Part (e)

What is the \(R^{2}\) of the regression in (d)?

Does it increase or decrease compared to the regression which does not include female?

Think why that could be.

\(R^{2}\) goes from 0.0357 to 0.0663. Adding more variables can never decrease the \(R^{2}\); note that the adjusted \(R^{2}\), which penalizes additional regressors, also increases here (from 0.0336 to 0.0622).
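
A quick check in code:

  summary(mod1)$r.squared  # 0.0357
  summary(mod2)$r.squared  # 0.0663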

Part (f)

Is it possible that even in this case, the interpretation of the effect of beauty might not necessarily be causal?

Think of why.

As usual we have to consider confounding factors. It is not immediately clear why other factors driving good teaching should be correlated with beauty, or why there should be reverse causality from teaching to beauty. But here is one story: beauty is of course a very subjective measure. We perceive people who are well dressed as more beautiful. Perhaps the same traits that make a person dress well are also useful when teaching (e.g. their lecture slides are more tidy). Hence, when we just consider beauty we may in part pick up the effect of tidier lecture slides.


Citation

For attribution, please cite this work as

Martin (2021, Nov. 20). Datastories Hub: Exercises 5. Retrieved from https://mondpanther.github.io/datastorieshub/posts/exercises/exercises5/

BibTeX citation

@misc{martin2021exercises,
  author = {Martin, Ralf},
  title = {Datastories Hub: Exercises 5},
  url = {https://mondpanther.github.io/datastorieshub/posts/exercises/exercises5/},
  year = {2021}
}