Exercises 2


Exercises to get familiar with R.

Ralf Martin https://mondpanther.github.io/wwwmondpanther/
2022-10-10

Exercise 2.1

In the lecture we introduced R. The goal of this problem set is for you to work with R by yourself, so that you start to become comfortable with some basic commands, which will be needed throughout the course (and beyond?). Hence, even if you think you could answer a question with a different software package, could you please try using R?

In this exercise we are using a dataset called auto.csv. It contains various characteristics for a sample of cars. You can download this dataset from here in the course data repository. Note that a big part of working with data is becoming familiar with different data formats. R can easily work with almost any data format imaginable, but only if the user (i.e. you) gives it the right instructions. You are probably already familiar with the .csv format. But just in case not: csv stands for comma separated values. In other words, it is a table arranged as a simple text file, where each line represents a table row and the values that go into separate columns are separated by commas.
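
For instance, a small csv file might look like this (a made-up example, not the actual content of auto.csv):

make,price,mpg
Car A,5000,25
Car B,7000,20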

To do anything with data you need to get a dataset into a dataframe (essentially another word for table) in memory. For that you can first store it on your computer’s harddrive and then load it. You can do that using RStudio’s import file menu, very similar to how you would open an Excel worksheet, say.

Part (a)

Instead of using the import dialog, can you load the dataset with an R command either from the console command line or via an R script?

If you use the RStudio import dialog just discussed, it will show you the R command that is needed in the Console window:

auto <- read.csv("https://www.dropbox.com/s/gm22o5efboc3q0w/auto.csv?dl=1", header=TRUE)

Of course, if you load the file from disk, the exact formulation depends on where you saved it; i.e. it is unlikely to be in a directory called C:/Users/Ralf Martin/Dropbox/datastories/data/.
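
For instance, if you had saved auto.csv in your Downloads folder (a hypothetical location, adjust to your own machine), the command might look like this:

auto <- read.csv("C:/Users/yourname/Downloads/auto.csv", header=TRUE)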

Note that header=TRUE means that R can figure out the column names from the first line of the auto.csv file.

Part (b)

Can you suggest an alternative version of the previous command with a relative path to locate the file?

getwd()
[1] "C:/Users/rmartin/github/datastorieshub/_posts/exercises/exercises2"

To write a relative path you need to know what the current working directory is. You can find out with the getwd() command. In my case this happens to be the C:/Users/Ralf Martin/Dropbox/datastories/datastorieshub/_posts/exercises/exercises2 directory. Hence, to get to the directory C:/Users/Ralf Martin/Dropbox/datastories/data we need to go 4 directories up (to C:/Users/Ralf Martin/Dropbox/datastories) and then down into the data directory. Note that you can use .. to tell R that you want to move up one level of the directory structure. Hence the correct command with relative path notation becomes:

auto <- read.csv("../../../../data/auto.csv", header=TRUE)

Part (c)

Probably the easiest way to load a web-based CSV file into R is to read it directly from the web. This can be achieved by using the file’s URL in the read.csv command. Can you work out how?

auto <- read.csv("https://www.dropbox.com/s/gm22o5efboc3q0w/auto.csv?dl=1", header=TRUE)

You can figure out the URL by right clicking on a hyperlink.

Part (d)

Describe the data set.

How many observations does the data set have?

and how many variables does it have?

What variables are contained in the dataset?

R commands you might want to use for that include summary, nrow and ncol. You can type ? followed by the function name in the RStudio console to learn more about a function. Alternatively, try googling and see how people use those functions.
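
For example, either of the following will open the help page for the summary function:

?summary
help(summary)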

# Summary stats for all variables
summary(auto)
       X             make               price            mpg       
 Min.   : 1.00   Length:74          Min.   : 3291   Min.   :12.00  
 1st Qu.:19.25   Class :character   1st Qu.: 4220   1st Qu.:18.00  
 Median :37.50   Mode  :character   Median : 5006   Median :20.00  
 Mean   :37.50                      Mean   : 6165   Mean   :21.30  
 3rd Qu.:55.75                      3rd Qu.: 6332   3rd Qu.:24.75  
 Max.   :74.00                      Max.   :15906   Max.   :41.00  
                                                                   
     rep78          headroom         trunk           weight    
 Min.   :1.000   Min.   :1.500   Min.   : 5.00   Min.   :1760  
 1st Qu.:3.000   1st Qu.:2.500   1st Qu.:10.25   1st Qu.:2250  
 Median :3.000   Median :3.000   Median :14.00   Median :3190  
 Mean   :3.406   Mean   :2.993   Mean   :13.76   Mean   :3019  
 3rd Qu.:4.000   3rd Qu.:3.500   3rd Qu.:16.75   3rd Qu.:3600  
 Max.   :5.000   Max.   :5.000   Max.   :23.00   Max.   :4840  
 NA's   :5                                                     
     length           turn        displacement     gear_ratio   
 Min.   :142.0   Min.   :31.00   Min.   : 79.0   Min.   :2.190  
 1st Qu.:170.0   1st Qu.:36.00   1st Qu.:119.0   1st Qu.:2.730  
 Median :192.5   Median :40.00   Median :196.0   Median :2.955  
 Mean   :187.9   Mean   :39.65   Mean   :197.3   Mean   :3.015  
 3rd Qu.:203.8   3rd Qu.:43.00   3rd Qu.:245.2   3rd Qu.:3.353  
 Max.   :233.0   Max.   :51.00   Max.   :425.0   Max.   :3.890  
                                                                
    foreign      
 Min.   :0.0000  
 1st Qu.:0.0000  
 Median :0.0000  
 Mean   :0.2973  
 3rd Qu.:1.0000  
 Max.   :1.0000  
                 
# Number of observations:
nrow(auto)
[1] 74
#Number of variables
ncol(auto)
[1] 13
#Just the names of all variables
names(auto)
 [1] "X"            "make"         "price"        "mpg"         
 [5] "rep78"        "headroom"     "trunk"        "weight"      
 [9] "length"       "turn"         "displacement" "gear_ratio"  
[13] "foreign"     

Part (e)

Much of the power of R derives from extensions (so-called packages) that have been contributed to widen the capabilities of the basic software. Particularly nice extensions are the dplyr and tidyr packages, which allow vastly more efficient and transparent wrangling of data. Can you install those packages on your computer and load them into memory?

To install these packages use install.packages(c("dplyr","tidyr")) (note the c(), which combines the package names into a vector).

Note that you only need to do this once on a given computer (it’s like installing software). However, each time you re-start R you need to load the packages you want to use into memory again. For that you use the library() command:
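
library(dplyr)
library(tidyr)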

Part (f)

One of the great things included in those packages is the functionality of piping, which we shall be using extensively throughout this course. Piping allows you to send the result from one R command to the next. This typically makes for more transparent code, as you will hopefully come to appreciate, maybe not just yet but sometime halfway through the course.

Just to give you an idea of what piping does, look at the code below. It does exactly the same as the summary(auto) command, but we have now written it with the piping operator %>%. The operator sends whatever comes on its left to its right (i.e. flushes it down the %>% pipe). On the right-hand side it is then used as an argument to the command that comes there (hence we don’t need to put anything in the brackets of the summary command):

auto %>% summary()
       X             make               price            mpg       
 Min.   : 1.00   Length:74          Min.   : 3291   Min.   :12.00  
 1st Qu.:19.25   Class :character   1st Qu.: 4220   1st Qu.:18.00  
 Median :37.50   Mode  :character   Median : 5006   Median :20.00  
 Mean   :37.50                      Mean   : 6165   Mean   :21.30  
 3rd Qu.:55.75                      3rd Qu.: 6332   3rd Qu.:24.75  
 Max.   :74.00                      Max.   :15906   Max.   :41.00  
                                                                   
     rep78          headroom         trunk           weight    
 Min.   :1.000   Min.   :1.500   Min.   : 5.00   Min.   :1760  
 1st Qu.:3.000   1st Qu.:2.500   1st Qu.:10.25   1st Qu.:2250  
 Median :3.000   Median :3.000   Median :14.00   Median :3190  
 Mean   :3.406   Mean   :2.993   Mean   :13.76   Mean   :3019  
 3rd Qu.:4.000   3rd Qu.:3.500   3rd Qu.:16.75   3rd Qu.:3600  
 Max.   :5.000   Max.   :5.000   Max.   :23.00   Max.   :4840  
 NA's   :5                                                     
     length           turn        displacement     gear_ratio   
 Min.   :142.0   Min.   :31.00   Min.   : 79.0   Min.   :2.190  
 1st Qu.:170.0   1st Qu.:36.00   1st Qu.:119.0   1st Qu.:2.730  
 Median :192.5   Median :40.00   Median :196.0   Median :2.955  
 Mean   :187.9   Mean   :39.65   Mean   :197.3   Mean   :3.015  
 3rd Qu.:203.8   3rd Qu.:43.00   3rd Qu.:245.2   3rd Qu.:3.353  
 Max.   :233.0   Max.   :51.00   Max.   :425.0   Max.   :3.890  
                                                                
    foreign      
 Min.   :0.0000  
 1st Qu.:0.0000  
 Median :0.0000  
 Mean   :0.2973  
 3rd Qu.:1.0000  
 Max.   :1.0000  
                 

You can even do several of those pipes in a row. For instance, look at the following piece of code. Can you work out what it does?

auto_summary <- auto %>%   select(mpg, weight)  %>% 
             summarise_all( list(mmm=mean,  p50=median, sd=sd  ) ) 

This piece of code first selects only 2 columns of the auto dataframe. It then sends this smaller dataframe with only 2 variables to the command summarise_all, which computes the statistics you mention within list(...) for all the variables in the dataframe it is given. Note that in the list what comes after the = sign is the statistic you want, whereas what comes before is a label that will appear in the resulting dataframe, which we called auto_summary. You can look at it simply by typing its name in the console or a script:

auto_summary
  mpg_mmm weight_mmm mpg_p50 weight_p50   mpg_sd weight_sd
1 21.2973   3019.459      20       3190 5.785503  777.1936
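
As an aside, in more recent versions of dplyr the recommended way to do this is with across() inside summarise(); the sketch below should produce the same statistics (possibly in a different column order):

auto %>% summarise(across(c(mpg, weight), list(mmm=mean, p50=median, sd=sd)))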

Part (g)

The summary statistics in the previous part are not displayed very nicely. The dplyr and tidyr packages offer a lot of nice functions to reshape any output in whichever way you want. Below is a somewhat elaborate example, which I explain step by step further down. However, before you look at my explanation, maybe you can run those commands yourself step by step and figure out what they do?

auto_summary_reshape <- auto_summary %>% gather(stat, val) %>%
                    separate(stat, into = c("var", "stat"), sep = "_") %>%
                    spread(stat,val) 

The code above reshapes the auto_summary dataframe so that it looks like this:

auto_summary_reshape 
     var       mmm  p50         sd
1    mpg   21.2973   20   5.785503
2 weight 3019.4595 3190 777.193567

We can break this into its individual steps to see how it does that:

 step = auto_summary %>% gather(stat, val)
 step
        stat         val
1    mpg_mmm   21.297297
2 weight_mmm 3019.459459
3    mpg_p50   20.000000
4 weight_p50 3190.000000
5     mpg_sd    5.785503
6  weight_sd  777.193567

The gather command arranges whatever you send down the pipe as key-value pairs; i.e. 2 columns, where the first column is the key (i.e. a name or label) for the value that follows in the value column. stat and val are simply the names we want to give those two columns.

 step = step %>% separate(stat, into = c("var", "stat"), sep = "_")
 step
     var stat         val
1    mpg  mmm   21.297297
2 weight  mmm 3019.459459
3    mpg  p50   20.000000
4 weight  p50 3190.000000
5    mpg   sd    5.785503
6 weight   sd  777.193567

The separate command splits the stat column into two new columns, var and stat, where sep="_" tells R where to split the original stat column.

 step = step %>% spread(stat, val) 
 step
     var       mmm  p50         sd
1    mpg   21.2973   20   5.785503
2 weight 3019.4595 3190 777.193567

The spread command spreads out the stat column again; i.e. it creates a new column for every category represented in stat. Alternatively, we could have spread var:

auto_summary %>% gather(stat, val) %>%
                    separate(stat, into = c("var", "stat"), sep = "_") %>%
                    spread(var,val) 
  stat       mpg    weight
1  mmm 21.297297 3019.4595
2  p50 20.000000 3190.0000
3   sd  5.785503  777.1936
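
Note that in newer versions of tidyr, gather() and spread() have been superseded by pivot_longer() and pivot_wider(). A sketch of the same reshaping with those functions (assuming a reasonably recent tidyr) would be:

auto_summary %>% pivot_longer(everything(), names_to="stat", values_to="val") %>%
                    separate(stat, into = c("var", "stat"), sep = "_") %>%
                    pivot_wider(names_from=stat, values_from=val)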

Exercise 2.2

This github repository provides COVID related data for the US at the state and county level: https://github.com/nytimes/covid-19-data

Part(a)

Download the state level data (in csv format) from here and load it into an R dataframe. Have a look at the dataframe. How many rows are there? What do the rows represent? What do the columns represent?

To load the dataset into a dataframe you can use:

covid=read.csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")

You can view the dataframe like an Excel table by clicking on its name in the Global Environment tab, which also shows the number of rows and columns of the dataframe (note that you might have a different number of rows, as the dataframe is constantly updated online).

We see that the dataframe contains daily COVID case and death numbers for every US state. Note that the numbers never decrease, which suggests that these are cumulative figures (i.e. the total number of cases and deaths up until a given day).
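
If you prefer commands to clicking, you can also get a quick overview from the console (your exact numbers may differ since the data is updated continuously):

nrow(covid)
names(covid)
head(covid)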

Part(b)

Compute the death rate (deaths over cases) for every state by July 31st, 2020 using R. Also add columns that contain each state’s share of total deaths and its share of total cases.

With filter() we can select specific rows of a dataframe. With mutate() we can create new variables in a dataframe or modify existing ones. Note that the date variable is a factor variable. You can see this in the Global Environment tab.

We will talk a lot about factor variables later in the course. For now it’s enough to know that with the function as.character() you can convert the date variable from a factor to a character string, which is necessary to compare it to a character string such as “2020-07-31”. As ever, you can only compare like with like.

Hence, the following command provides the result (note that I am loading the dplyr library again, which is not necessary if you have already done that in this session):

library(dplyr)


covid_july=covid %>% dplyr::filter(as.character(date)=="2020-07-31") %>%
          mutate(deathsOcases=deaths/cases,
                 deaths_sh=deaths/sum(deaths),
                 cases_sh=cases/sum(cases))

head(covid_july,10)
         date                state fips  cases deaths deathsOcases
1  2020-07-31              Alabama    1  87723   1580  0.018011240
2  2020-07-31               Alaska    2   3675     21  0.005714286
3  2020-07-31              Arizona    4 174108   3695  0.021222460
4  2020-07-31             Arkansas    5  42511    453  0.010656065
5  2020-07-31           California    6 502273   9222  0.018360533
6  2020-07-31             Colorado    8  46948   1841  0.039213598
7  2020-07-31          Connecticut    9  49810   4432  0.088978117
8  2020-07-31             Delaware   10  14788    585  0.039559102
9  2020-07-31 District of Columbia   11  12126    585  0.048243444
10 2020-07-31              Florida   12 470378   6842  0.014545748
      deaths_sh    cases_sh
1  0.0102689423 0.019188397
2  0.0001364859 0.000803864
3  0.0240150265 0.038084122
4  0.0029441967 0.009298792
5  0.0599368265 0.109866440
6  0.0119652676 0.010269335
7  0.0288050331 0.010895364
8  0.0038021084 0.003234705
9  0.0038021084 0.002652423
10 0.0444684198 0.102889776

The head() command shows you the first couple of lines of a dataframe. That is useful if you want to have a quick peek at larger dataframes.

Part(c)

Which is the state with the highest death rate (by July 31)? Can you work it out with an R command (rather than just looking through the table)?

covid_july %>% filter(deathsOcases==max(deathsOcases))
        date       state fips cases deaths deathsOcases  deaths_sh
1 2020-07-31 Connecticut    9 49810   4432   0.08897812 0.02880503
    cases_sh
1 0.01089536

Part(d)

Can you compute daily figures of deaths and cases for the US as a whole from the state level data? We haven’t discussed all the commands necessary for that yet, but see if you can work it out by checking online. If that takes too much time, just check the answer below. As ever, there are many different ways you could do that.

We can use the group_by() command, which is part of dplyr. It allows you to group the rows of a dataframe by one (or several) categorical variables. We can then apply other commands that will be carried out separately for each of those groups. In the case below we want to sum over all states for every day:

usdaily=covid %>% group_by(date) %>% summarise(deaths=sum(deaths),cases=sum(cases))
head(usdaily)
# A tibble: 6 x 3
  date       deaths cases
  <chr>       <int> <int>
1 2020-01-21      0     1
2 2020-01-22      0     1
3 2020-01-23      0     1
4 2020-01-24      0     2
5 2020-01-25      0     3
6 2020-01-26      0     5

Part(e)

Examine the US death rate over time. How would you expect this series to behave? How does it behave? How do you explain what you find?

The best way to do this is with a figure. We shall talk a lot more about plotting in the next lecture; however, there is no harm in trying this out now. There are a number of commands to plot something, but the most versatile is the ggplot command (you need to install and load the ggplot2 package before you can use it). Also, things will look better if you tell R that the date column is a date, which we can do with the as.Date() command.

A couple of words about the ggplot command: its first argument is the dataframe, aes() maps variables to the axes (here date on x and the death rate on y), geom_point() adds the points, and scale_x_date() controls the date axis (here a tick every month, labelled with the abbreviated month name).

library(ggplot2)

usdaily=usdaily %>% mutate(date=as.Date(date), deathsOcases=deaths/cases)
ggplot(usdaily, aes(x=date, y=deathsOcases)) +
      geom_point() + 
      scale_x_date(date_breaks="1 month",date_labels = "%b")

Now what do we expect to find? To die from COVID you have to get sick first, and that takes at least a couple of days. Hence, we would expect death rates to be low at first and then to increase. Over time we would hope that doctors get better at treating COVID patients. Also, people who know they are at risk will increasingly shield, so that the people who do get sick are those less likely to die. Both factors should bring the death rate down after an initial peak. This is what we see when looking only at the series from the end of March onwards, with a peak in mid May. However, there is a first peak in early March. What could explain this? The most plausible explanation is probably measurement error: early on, cases of infection were probably not counted properly, as there was no systematic way of testing the population. On the other hand, if somebody was so ill that they died, this would almost certainly be picked up by the authorities; hence death figures were not undercounted. Also, early on the numbers were low, so a couple of missed cases could easily swing the ratio. Here is some further discussion of this.

Part(f)

A large part of data analysis work is getting the data into a format that lends itself to analysis, and a very big part of that is combining datasets from different sources. Here we start learning how this can be done efficiently. Here you find a dataset in csv format with information on the size (in terms of population) of different US states. Can you combine it with the COVID case count dataframe for US states at the end of July that we created earlier?

First you need to load the population dataset into a dataframe as well:

pop=read.csv("https://mondpanther.github.io/datastorieshub/data/populationdata.csv")

To combine the two dataframes we can use a command called merge():

covid_julyb=covid_july %>% merge(pop,by="state", all.x=TRUE)

This tells R to create a new dataframe (covid_julyb) by combining the information in covid_july and pop. Rows are matched using the variable state. Maybe you noted that pop has only 52 observations whereas covid_july has 55. This is because pop doesn’t include information on places such as Guam and a couple of other islands that are not formal US states but are controlled by the US, whereas covid_july does. The all.x option tells R to keep observations from the first dataset (i.e. covid_july) even if no equivalent observation in the second dataset can be found. This is also known as a left join. Similarly, there is a right join with all.y=TRUE or a full join with all=TRUE. You might want to read a bit more about the merge command e.g. here.
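
As an aside, dplyr offers equivalent join verbs; the same left join could, for instance, be written as:

covid_julyb=covid_july %>% left_join(pop, by="state")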

Part(g)

Using the newly created dataframe, create a new variable: COVID cases per capita (i.e. per head of population). Explore the relationship between per capita cases and population density. What kind of relationship would you expect? What do you find?

Add per capita cases with mutate (to make things easier to read we use cases per 1000 of population):

covid_julyb=covid_julyb %>% mutate(casesOpop000=cases/pop*1000)

We would expect a positive relationship between per capita cases and density. A scatter plot is usually the best way to get going on a question like this. We can once more use the ggplot command (adding geom_smooth(method=lm) to include a regression line):

ggplot(covid_julyb, aes(x=density, y=casesOpop000 ))+ 
      geom_point() + geom_smooth(method=lm)

Let’s drop the outlier on the right (i.e. Washington DC, which is not really a state):

ggplot(covid_julyb %>% filter(density<6000), aes(x=density, y=casesOpop000 ))+ 
      geom_point() + geom_smooth(method=lm)

Hence, there is clearly a positive relationship between density and COVID cases. It becomes more pronounced when dropping Washington DC. This is also confirmed by looking at the parameters of the regression line directly, which we can do with the lm() command:

summary(lm(casesOpop000~density,covid_julyb  %>% filter(density<6000) ))

Call:
lm(formula = casesOpop000 ~ density, data = covid_julyb %>% filter(density < 
    6000))

Residuals:
     Min       1Q   Median       3Q      Max 
-10.6365  -3.1810  -0.4811   3.3123  14.2090 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.103760   0.970901  10.407 5.27e-14 ***
density      0.006921   0.002751   2.516   0.0152 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.466 on 49 degrees of freedom
Multiple R-squared:  0.1144,    Adjusted R-squared:  0.0963 
F-statistic: 6.328 on 1 and 49 DF,  p-value: 0.01521

Part (h)

Computers, unlike humans, are good at doing repeated tasks reliably and without frustration. A basic way to accomplish that in a programming language is a loop. Let’s look at a concrete example. Suppose you want to create time series figures of COVID death rates for several US states: California, Texas, New York and Alaska, say. See if you can work out how to do that with a for loop, using the help function and online googling. If you are stuck, check out the answer below.

Let’s start by arranging the states in a vector. We also add a death rate column to the covid dataframe (and convert the date variable to a proper date):

states=c("California","Texas","New York","Alaska")

covid=covid%>% mutate(deathsOcases=deaths/cases, date=as.Date(date))

Now loop over those states:

for(s in states){
  print(s)
  p=ggplot(covid  %>% filter(state==s), aes(x=date, y=deathsOcases)) +
      geom_point() + ggtitle(s)+
      scale_x_date(date_breaks="1 month",date_labels = "%b")
  
  print(p)
}
[1] "California"

[1] "Texas"

[1] "New York"

[1] "Alaska"

It’s interesting to see quite a lot of variation in the shape of this series across states. New York hasn’t managed a meaningful decline in death rates. In Texas, on the other hand, death rates are increasing again after a substantial decline in July.

Exercise 2.3

Part (a)

Besides analysing data, we can use R to make up data. Why would we do that? It’s a common approach called “Monte Carlo analysis”. If we make up the data, we know the underlying model that is driving it. We can use this to test the methods we are using: do they lead us to the true underlying model, or do they have biases that give us the wrong answer? In this exercise we walk through how this can be done in a simple case. As ever, try to work through all of this while creating an R script or R Markdown file. First we need to decide on the sample size. Let’s use a sample of size 100. It makes sense to assign this number to a variable as follows:

obs<-100

Assume that the data is driven by the following true model \(Y=2+0.5 X +\epsilon\), where \(\epsilon\) is a random variable that follows a standard normal distribution; i.e. its distribution follows the famous bell curve. Using R you can draw realisations of such a random variable using the function rnorm(). To assign those values to a variable called eps we can write:

df=data.frame(eps=rnorm(obs))

where the data.frame command arranges the newly created variable eps into a dataframe rather than leaving it standalone.

Similarly we can generate the explanatory variable X as:

df=df%>% mutate(X=runif(obs))

This generates the values of X by drawing random numbers on the interval between 0 and 1 with uniform probability (i.e. all values are equally likely). Look at the values of X and \(\epsilon\). Here are the first 10:

head(df,10)
          eps          X
1   0.7087166 0.01557453
2  -0.7568827 0.63896289
3  -0.7173113 0.50702648
4  -2.1784831 0.50394019
5   1.2146494 0.73980429
6   0.3188138 0.29162189
7  -0.7861682 0.42168957
8  -0.2599122 0.60679514
9  -0.4659159 0.94634844
10 -1.4880522 0.11489128

Part (b)

Now create the Y variable according to the formula above.

df=df%>% mutate(Y=2+0.5*X+eps)

Part (c)

Run a regression of Y on X. What value do you find for the coefficient on X? Would you think this is an unbiased estimate?

    reg=lm(Y~X,df)
    summary(reg)

Call:
lm(formula = Y ~ X, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6671 -0.7811  0.1055  0.7418  1.9315 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.8517     0.2006   9.233  5.6e-15 ***
X             0.9012     0.3618   2.491   0.0144 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9863 on 98 degrees of freedom
Multiple R-squared:  0.05954,   Adjusted R-squared:  0.04994 
F-statistic: 6.204 on 1 and 98 DF,  p-value: 0.01443

Part (d)

Now use a “for”-loop to repeat the steps from section (a)-(c) 1000 times.

We can do this in R with the following for-loop command

for(i in 1:1000) {

<several lines of code >

}

i.e. the lines of code between the first line and } will be repeated 1000 times. The variable i will increase by 1 on each round.

beta <- double(1000)
for(i in 1:1000){
    
    #print(i)
    
    df=data.frame(eps=rnorm(obs))
    df=df%>% mutate(X=runif(obs))
    df=df%>% mutate(Y=2+0.5*X+eps)
    #print(lm(Y~X,df)$coefficients[[2]])
    beta[i]<-lm(Y~X,df)$coefficients[[2]]
}

Note: with $coefficients[[2]] we extract the estimate of the second coefficient (i.e. the slope) from the regression results and store it in a vector called beta.
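
Equivalently, you could extract the slope with the coef() helper, for example:

reg <- lm(Y~X, df)
beta[i] <- coef(reg)[["X"]]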

Part (e)

Compute the average value of the estimate of the X coefficient across the 1000 repetitions.

mean(beta)
[1] 0.5042947
results=data.frame(beta)
ggplot(results, aes(x=beta)) + 
       geom_histogram(color="black",fill="white",bins=40) + 
       theme_minimal()

Note: we can use ggplot to draw a histogram of all the results for beta as well. This shows that, while we do not always get a value close to 0.5, over many replications the values we find are centered around the true value of 0.5.
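
If you also want a numeric sense of the spread, you could, for instance, look at the standard deviation and a few quantiles of the simulated estimates:

sd(beta)
quantile(beta, c(0.025, 0.5, 0.975))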

Citation

For attribution, please cite this work as

Martin (2022, Oct. 10). Datastories Hub: Exercises 2. Retrieved from https://mondpanther.github.io/datastorieshub/posts/exercises/exercises2/

BibTeX citation

@misc{martin2022exercises,
  author = {Martin, Ralf},
  title = {Datastories Hub: Exercises 2},
  url = {https://mondpanther.github.io/datastorieshub/posts/exercises/exercises2/},
  year = {2022}
}