Exercises to get familiar with R.
In the lecture we introduced R. The goal of this problem set is for you to work with R by yourself, so that you start to become comfortable with some basic commands, which will be needed throughout the course (and beyond?). Hence, even if you think you could answer a question with a different software package, could you please try using R?
In this exercise we are using a dataset called auto.csv. It contains various characteristics for a sample of cars. You can download this dataset from here, in the course data repository. Note that a big part of working with data is becoming familiar with different data formats. R can easily work with almost any data format imaginable, but only if the user (i.e. you) gives it the right instructions. You are probably already familiar with the .csv format. But just in case not: csv stands for comma separated values. In other words, this is a table arranged as a simple text file, where rows represent table rows and the values that go into separate columns are separated by commas.
To do anything with data you need to get a dataset into a dataframe (essentially another word for table) in memory. For that you can first store it on your computer’s hard drive and then load it. You can do that using RStudio’s import file menu, very similar to how you would open an Excel worksheet, say:
Instead of using the import dialog, can you load the dataset with an R command either from the console command line or via an R script?
If you use the RStudio dialog just discussed, it will show you the R command that is needed in the Console window:
auto <- read.csv("https://www.dropbox.com/s/gm22o5efboc3q0w/auto.csv?dl=1", header=TRUE)
Of course, the exact formulation depends on where you saved your file on disk; i.e. it is unlikely to be in a directory called C:/Users/Ralf Martin/Dropbox/datastories/data/. Note that header=TRUE means that R can figure out the column names from the first line of the auto.csv file.
Can you suggest an alternative version of the previous command with a relative path to locate the file?
getwd()
[1] "C:/Users/rmartin/github/datastorieshub/_posts/exercises/exercises2"
To write a relative path you need to know what the currently active directory is. You can find this out with the getwd() command. In my case it happens to be the C:/Users/Ralf Martin/Dropbox/datastories/datastorieshub/_posts/exercises/exercises2 directory. Hence, to get to the directory C:/Users/Ralf Martin/Dropbox/datastories/data we need to go 4 directories up (to directory C:/Users/Ralf Martin/Dropbox/datastories) and then down into the data directory. Note that you can use .. to indicate to R that you want to move up one layer of the directory structure. Hence the correct command with relative path notation becomes:
auto <- read.csv("../../../../data/auto.csv", header=TRUE)
Probably the easiest way to load a web based CSV file into R is
directly from the web. This can be achieved by using the file URL in the
read.csv
command. Can you work out how?
auto <- read.csv("https://www.dropbox.com/s/gm22o5efboc3q0w/auto.csv?dl=1", header=TRUE)
You can figure out the URL by right clicking on a hyperlink.
Describe the data set. How many observations does the data set have, and how many variables? What variables are contained in the dataset?
R commands you might want to use for that include summary, nrow and ncol. You can type ? followed by a command name (e.g. ?summary) to get help on any of them.
# Summary stats for all variables
summary(auto)
X make price mpg
Min. : 1.00 Length:74 Min. : 3291 Min. :12.00
1st Qu.:19.25 Class :character 1st Qu.: 4220 1st Qu.:18.00
Median :37.50 Mode :character Median : 5006 Median :20.00
Mean :37.50 Mean : 6165 Mean :21.30
3rd Qu.:55.75 3rd Qu.: 6332 3rd Qu.:24.75
Max. :74.00 Max. :15906 Max. :41.00
rep78 headroom trunk weight
Min. :1.000 Min. :1.500 Min. : 5.00 Min. :1760
1st Qu.:3.000 1st Qu.:2.500 1st Qu.:10.25 1st Qu.:2250
Median :3.000 Median :3.000 Median :14.00 Median :3190
Mean :3.406 Mean :2.993 Mean :13.76 Mean :3019
3rd Qu.:4.000 3rd Qu.:3.500 3rd Qu.:16.75 3rd Qu.:3600
Max. :5.000 Max. :5.000 Max. :23.00 Max. :4840
NA's :5
length turn displacement gear_ratio
Min. :142.0 Min. :31.00 Min. : 79.0 Min. :2.190
1st Qu.:170.0 1st Qu.:36.00 1st Qu.:119.0 1st Qu.:2.730
Median :192.5 Median :40.00 Median :196.0 Median :2.955
Mean :187.9 Mean :39.65 Mean :197.3 Mean :3.015
3rd Qu.:203.8 3rd Qu.:43.00 3rd Qu.:245.2 3rd Qu.:3.353
Max. :233.0 Max. :51.00 Max. :425.0 Max. :3.890
foreign
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.2973
3rd Qu.:1.0000
Max. :1.0000
# Number of observations:
nrow(auto)
[1] 74
#Number of variables
ncol(auto)
[1] 13
#Just the names of all variables
names(auto)
[1] "X" "make" "price" "mpg"
[5] "rep78" "headroom" "trunk" "weight"
[9] "length" "turn" "displacement" "gear_ratio"
[13] "foreign"
Much of the power of R derives from extensions (so-called packages) that have been contributed to widen the capabilities of the basic software. Particularly nice extensions are the dplyr and tidyr packages, which allow vastly more efficient and transparent wrangling of data. Can you install those packages on your computer and load them into memory?
To install these packages use
install.packages(c("dplyr", "tidyr"))
Note that you only need to do this once on a given computer (it’s like installing software). However, each time you re-start R you need to load the packages you want to use into memory again. For that you use the library() command, for example:
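library(dplyr)   # load the packages into memory (needed once per R session)
library(tidyr)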
One of the great things included in those packages is the functionality of piping, which we shall be using extensively throughout this course. Piping allows you to send the result from one R command to the next. This typically makes for more transparent code, as you will hopefully appreciate, maybe not just yet but sometime halfway through the course.
Just to give you an idea of what piping does, look at the code below. It does exactly the same as the summary(auto) command, but we have now written it with the piping operator %>%. What the operator does is send whatever comes on its left to whatever comes on its right (i.e. it flushes it down the %>% pipe). On the right-hand side it is then used as an argument to the command that comes there (hence we don’t need to put anything in brackets in the summary command):
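auto %>% summary()   # equivalent to summary(auto), written with the pipe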
X make price mpg
Min. : 1.00 Length:74 Min. : 3291 Min. :12.00
1st Qu.:19.25 Class :character 1st Qu.: 4220 1st Qu.:18.00
Median :37.50 Mode :character Median : 5006 Median :20.00
Mean :37.50 Mean : 6165 Mean :21.30
3rd Qu.:55.75 3rd Qu.: 6332 3rd Qu.:24.75
Max. :74.00 Max. :15906 Max. :41.00
rep78 headroom trunk weight
Min. :1.000 Min. :1.500 Min. : 5.00 Min. :1760
1st Qu.:3.000 1st Qu.:2.500 1st Qu.:10.25 1st Qu.:2250
Median :3.000 Median :3.000 Median :14.00 Median :3190
Mean :3.406 Mean :2.993 Mean :13.76 Mean :3019
3rd Qu.:4.000 3rd Qu.:3.500 3rd Qu.:16.75 3rd Qu.:3600
Max. :5.000 Max. :5.000 Max. :23.00 Max. :4840
NA's :5
length turn displacement gear_ratio
Min. :142.0 Min. :31.00 Min. : 79.0 Min. :2.190
1st Qu.:170.0 1st Qu.:36.00 1st Qu.:119.0 1st Qu.:2.730
Median :192.5 Median :40.00 Median :196.0 Median :2.955
Mean :187.9 Mean :39.65 Mean :197.3 Mean :3.015
3rd Qu.:203.8 3rd Qu.:43.00 3rd Qu.:245.2 3rd Qu.:3.353
Max. :233.0 Max. :51.00 Max. :425.0 Max. :3.890
foreign
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.2973
3rd Qu.:1.0000
Max. :1.0000
You can even do several of those pipes in a row. For instance, look at the following piece of code. Can you work out what it does?
auto_summary <- auto %>% select(mpg, weight) %>%
summarise_all( list(mmm=mean, p50=median, sd=sd ) )
This piece of code first selects only 2 columns of the auto dataframe. It sends this smaller dataframe with only 2 variables to the command summarise_all, which will compute the statistics you mention within list(...) for all the variables in the dataframe it is given. Note that in the list, what comes after the = sign is the statistic you want, whereas what comes before is a label that will appear in the resulting dataframe, which we called auto_summary. You can look at it simply by typing its name in the console or a script:
auto_summary
mpg_mmm weight_mmm mpg_p50 weight_p50 mpg_sd weight_sd
1 21.2973 3019.459 20 3190 5.785503 777.1936
The summary statistics in the previous part are not displayed very nicely. The dplyr and tidyr packages offer a lot of nice functions to reshape any output in whichever way you want. Below is a somewhat elaborate example, which I then explain step by step. However, before you look at my explanation, maybe you can run those commands yourself step by step and figure out what they do?
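One way of doing it, using the gather, separate and spread functions from tidyr (a sketch; your exact code may differ):
auto_summary_reshape <- auto_summary %>%
  gather(stat, val) %>%                            # collect all columns into key-value pairs
  separate(stat, c("var", "stat"), sep="_") %>%    # split e.g. mpg_mmm into var=mpg, stat=mmm
  spread(stat, val)                                # one column per statistic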
The code above reshapes the auto_summary
dataframe so
that it looks like this:
auto_summary_reshape
var mmm p50 sd
1 mpg 21.2973 20 5.785503
2 weight 3019.4595 3190 777.193567
We can break this down into its individual steps to see how it does that:
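The first step gathers all columns of auto_summary (a sketch):
auto_summary %>% gather(stat, val)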
stat val
1 mpg_mmm 21.297297
2 weight_mmm 3019.459459
3 mpg_p50 20.000000
4 weight_p50 3190.000000
5 mpg_sd 5.785503
6 weight_sd 777.193567
The gather command arranges what you send down the pipe as key-value pairs; i.e. 2 columns, where the first column is the key (i.e. a name or label) for the value that follows in the value column. stat and val are simply what we want to call those two columns.
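The next step adds separate to the pipe (again a sketch):
auto_summary %>% gather(stat, val) %>%
  separate(stat, c("var", "stat"), sep="_")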
var stat val
1 mpg mmm 21.297297
2 weight mmm 3019.459459
3 mpg p50 20.000000
4 weight p50 3190.000000
5 mpg sd 5.785503
6 weight sd 777.193567
The separate command splits the stat column into two new columns, var and stat, where sep="_" tells R where to split the original stat column.
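Finally, spread completes the pipeline (sketch):
auto_summary %>% gather(stat, val) %>%
  separate(stat, c("var", "stat"), sep="_") %>%
  spread(stat, val)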
var mmm p50 sd
1 mpg 21.2973 20 5.785503
2 weight 3019.4595 3190 777.193567
The spread command spreads out the stat column again; i.e. it creates a new column for every category represented in stat. Alternatively, we could have spread var:
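auto_summary %>% gather(stat, val) %>%
  separate(stat, c("var", "stat"), sep="_") %>%
  spread(var, val)   # one column per variable instead of per statistic (sketch)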
This github repository provides COVID related data for the US at the state and county level: https://github.com/nytimes/covid-19-data
Download the state level data (in csv format) from
here and load it into an R dataframe. Have a look at the dataframe. How many rows are there? What do the rows represent? What do the columns represent?
covid=read.csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
Once loaded, the dataframe appears in RStudio’s Global Environment tab, which also shows us the number of rows and columns of the dataframe (note that you might get a different number of rows as the dataframe is constantly updated online).
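You can also check directly with commands such as:
nrow(covid)    # number of rows
names(covid)   # column names
head(covid)    # first few rows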
We see that the dataframe contains daily COVID case and death numbers for every US state. Note that the numbers never decrease, which suggests that these are cumulative figures (i.e. the total number of cases and deaths up until a given day).
Compute the death rate (deaths over cases) for every state by July 31st, 2020 using R. Also add columns that contain the share of deaths in total deaths and the share of cases in total cases
With filter() we can select specific rows of a dataframe. With mutate() we can create new variables in a dataframe or modify existing ones. Note that the date variable is a factor variable. You can see this in the Global Environment tab.
We will talk a lot more about factor variables later in the course. For now it’s enough to know that with the function as.character() you can convert the date variable from a factor to a character string, which is necessary to compare it to a character string such as "2020-07-31". As ever, you can only compare like with like.
Hence, the following command provides the result (note that I am loading the dplyr library again, which is not necessary if you have already done that in this session):
library(dplyr)
covid_july=covid %>% dplyr::filter(as.character(date)=="2020-07-31") %>%
mutate(deathsOcases=deaths/cases,
deaths_sh=deaths/sum(deaths),
cases_sh=cases/sum(cases))
head(covid_july,10)
date state fips cases deaths deathsOcases
1 2020-07-31 Alabama 1 87723 1580 0.018011240
2 2020-07-31 Alaska 2 3675 21 0.005714286
3 2020-07-31 Arizona 4 174108 3695 0.021222460
4 2020-07-31 Arkansas 5 42511 453 0.010656065
5 2020-07-31 California 6 502273 9222 0.018360533
6 2020-07-31 Colorado 8 46948 1841 0.039213598
7 2020-07-31 Connecticut 9 49810 4432 0.088978117
8 2020-07-31 Delaware 10 14788 585 0.039559102
9 2020-07-31 District of Columbia 11 12126 585 0.048243444
10 2020-07-31 Florida 12 470378 6842 0.014545748
deaths_sh cases_sh
1 0.0102689423 0.019188397
2 0.0001364859 0.000803864
3 0.0240150265 0.038084122
4 0.0029441967 0.009298792
5 0.0599368265 0.109866440
6 0.0119652676 0.010269335
7 0.0288050331 0.010895364
8 0.0038021084 0.003234705
9 0.0038021084 0.002652423
10 0.0444684198 0.102889776
The head() command shows you the first couple of lines of a dataframe. That is useful if you want to have a quick peek at larger dataframes.
Which is the state with the highest death rate (by July 31)? Can you work it out with an R command (rather than just looking through the table)?
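One of many possible ways (a sketch):
covid_july %>% filter(deathsOcases == max(deathsOcases))   # keep the row(s) with the maximum death rate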
Can you compute daily figures of deaths and cases for the US as a whole from the state level data? We haven’t discussed all the commands necessary for that yet, but see if you can work it out by checking online. If that takes too much time, just check the answer below. As ever, there are many different ways you could do that.
We can use the group_by() command, which is part of dplyr. It allows you to group the rows of a dataframe by one (or several) categorical variables. We can then apply other commands that will be carried out separately for each of those groups. In the case below we want to sum over all states for every day:
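# a sketch of one way to do this; the exact code may differ
usdaily = covid %>%
  group_by(date) %>%                                    # one group per day
  summarise(cases = sum(cases), deaths = sum(deaths))   # sum over all states within each day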
Examine the US death rate over time. How would you expect this series to behave? How does it behave? How do you explain what you find?
The best way to do this is with a figure. We shall talk a lot more about plotting in the next lecture. However, there is no harm in trying this out already. There are a number of commands to plot something; the most versatile is the ggplot command. (You need to install and load the ggplot2 package before you can use it.) Also, things will look better if you tell R that the date column is a date. We can do this with the as.Date() command.
A couple of words about the ggplot command:
- The first argument is the dataframe you want to plot (usdaily).
- Within aes() (for aesthetics) you declare which variables to use for the x and y axis.
- With further components added via the + sign you control how things look; e.g. that you want to see points (geom_point()) and that you want monthly date labels.
library(ggplot2)
usdaily=usdaily %>% mutate(date=as.Date(date), deathsOcases=deaths/cases)
ggplot(usdaily, aes(x=date, y=deathsOcases)) +
geom_point() +
scale_x_date(date_breaks="1 month",date_labels = "%b")
Now what do we expect to find? To die from COVID you have to get sick first, and that will take at least a couple of days. Hence, we would expect death rates to be low at first and then to increase. Over time we would hope that doctors get better at treating COVID patients. Also, people who know they are at risk will increasingly shield themselves, so that the only people who get sick will be those who are less likely to die. Both factors should bring the death rate down after an initial peak. This is what we see when looking only at the series from the end of March onwards, with a peak in mid May. However, there is a first peak in early March. What could explain this? The most plausible explanation is probably measurement error: early on, cases of infection were probably not counted properly as there was no systematic way of testing the population. On the other hand, if somebody was so ill that they died, this would almost certainly be picked up by the authorities. Hence death figures were not undercounted. Also, early on the numbers were low, so a couple of counts too few for cases could easily swing the ratio. Here is some further discussion of this.
A large part of data analysis work is getting the data into a format that lends itself to analysis. And a very big part of that is the process of combining datasets from different sources. Here we start learning how this can be done in an efficient way. Here you find a dataset in csv format with information on the size (in terms of population) of different US states. Can you combine this with the COVID case count dataframe for US states at the end of July that we created earlier?
First you need to load the population dataset into a dataframe as well:
pop=read.csv("https://mondpanther.github.io/datastorieshub/data/populationdata.csv")
To combine the two dataframes we can use a command called
merge()
:
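# a sketch; this assumes the state identifier in pop is also called state
covid_julyb = merge(covid_july, pop, by="state", all.x=TRUE)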
This tells R to create a new dataframe (covid_julyb) by combining the information in covid_july and pop. Rows are being matched using the variable state. Maybe you noticed that pop has only 52 observations whereas covid_july has 55. This is because pop doesn’t include information on places such as Guam and a couple of other islands that are not formal US states but are controlled by the US, whereas covid_july does. The all.x option tells R to keep observations from the first dataset (i.e. covid_july) even if no equivalent observation in the second dataset can be found. This is also known as a left join. Similarly, there is a right join with all.y=TRUE or a full join with all=TRUE. You might want to read a bit more about the merge command, e.g. here.
Using the newly created dataframe, create a new variable: COVID cases per capita (i.e. per head of population). Explore the relationship between per capita cases and population density. What kind of relationship would you expect? What do you find?
Add per capita cases with mutate (to make things easier to read we use cases per 1000 of population):
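# a sketch -- I am assuming the population column in pop is called 'population'; adjust if your download uses a different name
covid_julyb = covid_julyb %>%
  mutate(casesOpop000 = cases / population * 1000)   # cases per 1000 people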
We would expect a positive relationship between per capita cases and density. A scatter plot is usually the best way to get going on a question like this. We can once more use the ggplot command (adding geom_smooth(method=lm) to include a regression line):
ggplot(covid_julyb, aes(x=density, y=casesOpop000 ))+
geom_point() + geom_smooth(method=lm)
Dropping the outlier on the right (i.e. Washington DC, which is not really a state):
ggplot(covid_julyb %>% filter(density<6000), aes(x=density, y=casesOpop000 ))+
geom_point() + geom_smooth(method=lm)
Hence, there is clearly a positive relationship between density and COVID cases. It becomes more pronounced when dropping Washington DC. This is also confirmed when looking at the parameters of the regression line directly, which we can do with the lm() command.
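For instance, something along these lines produces the output below:
lm(casesOpop000 ~ density, data = covid_julyb %>% filter(density < 6000)) %>% summary()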
Call:
lm(formula = casesOpop000 ~ density, data = covid_julyb %>% filter(density <
6000))
Residuals:
Min 1Q Median 3Q Max
-10.6365 -3.1810 -0.4811 3.3123 14.2090
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.103760 0.970901 10.407 5.27e-14 ***
density 0.006921 0.002751 2.516 0.0152 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.466 on 49 degrees of freedom
Multiple R-squared: 0.1144, Adjusted R-squared: 0.0963
F-statistic: 6.328 on 1 and 49 DF, p-value: 0.01521
Computers - unlike humans - are good at doing repeated tasks reliably and without frustration. A basic way to accomplish that in a programming language are loops. Let’s look at a concrete example. Suppose you want to create time series figures of COVID death rates for several US states: California, Texas, New York and Alaska, say. See if you can work out how to do that with a for loop, using the help function and online searching. If you are stuck, check out the answer below.
Let’s start by arranging the states in a collection. Also add a death rate column to the covid dataframe (and clear up the date variable):
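# a sketch of one way to set this up; the exact code may differ
states = c("California", "Texas", "New York", "Alaska")     # the collection of states to loop over
covid  = covid %>% mutate(date = as.Date(date),              # clean up the date variable
                          deathsOcases = deaths / cases)     # death rate column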
Now loop over those states:
for(s in states){
print(s)
p=ggplot(covid %>% filter(state==s), aes(x=date, y=deathsOcases)) +
geom_point() + ggtitle(s)+
scale_x_date(date_breaks="1 month",date_labels = "%b")
print(p)
}
[1] "California"
[1] "Texas"
[1] "New York"
[1] "Alaska"
It’s interesting to see quite a bit of variation in the shape of this series between different states. New York hasn’t managed a meaningful decline in death rates. In Texas, on the other hand, death rates are increasing again after a substantial decline in July.
Besides analysing data, we can use R to make up data. Why would we do that? It’s a common approach called “Monte Carlo analysis”. If we make up the data, we know the underlying model that is driving the data. We can use this to test the methods we are using: do they lead us to the true underlying model, or do they have biases that give us the wrong answer? In this exercise we walk through how this can be done in a simple case. As ever, try to work through all of this while creating an R script or R Markdown file. First we need to decide on the sample size. Let’s use a sample of size 100. It makes sense to assign this number to a variable as follows:
obs<-100
Assume that the data is driven by the following true model \(Y=2+0.5 X +\epsilon\), where \(\epsilon\) is a random variable that follows a standard normal distribution; i.e. its distribution follows the famous bell curve. Using R you can draw realisations of such a random variable with the function rnorm(). To assign those values to a variable called eps we can write
df=data.frame(eps=rnorm(obs))
where the data.frame command arranges the newly created variable eps into a dataframe rather than leaving it standalone. Similarly, we can generate the explanatory variable X as:
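# a sketch using dplyr's mutate; base R's df$X <- runif(obs) works just as well
df = df %>% mutate(X = runif(obs))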
This draws the values of X as random numbers on the interval between 0 and 1 with uniform probability (i.e. all values are equally likely). Look at the values of X and \(\epsilon\). Here are the first 10:
head(df,10)
eps X
1 0.7087166 0.01557453
2 -0.7568827 0.63896289
3 -0.7173113 0.50702648
4 -2.1784831 0.50394019
5 1.2146494 0.73980429
6 0.3188138 0.29162189
7 -0.7861682 0.42168957
8 -0.2599122 0.60679514
9 -0.4659159 0.94634844
10 -1.4880522 0.11489128
Now create the Y variable according to the formula above.
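For instance:
df = df %>% mutate(Y = 2 + 0.5 * X + eps)   # true model: Y = 2 + 0.5 X + eps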
Run a regression of Y on X. What value do you find for the coefficient on X? Would you think this is an unbiased estimate?
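A regression along these lines produces the output below:
summary(lm(Y ~ X, data = df))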
Call:
lm(formula = Y ~ X, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.6671 -0.7811 0.1055 0.7418 1.9315
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.8517 0.2006 9.233 5.6e-15 ***
X 0.9012 0.3618 2.491 0.0144 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9863 on 98 degrees of freedom
Multiple R-squared: 0.05954, Adjusted R-squared: 0.04994
F-statistic: 6.204 on 1 and 98 DF, p-value: 0.01443
Now use a “for”-loop to repeat the steps from section (a)-(c) 1000 times.
We can do this in R with the following for-loop command
for(i in 1:1000) {
<several lines of code >
}
i.e. the lines of code between the first line and the closing } will be repeated 1000 times. The variable i will increase by 1 on each round.
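Putting it together, something along these lines will do (a sketch; the exact code may differ):
beta = c()                                  # empty vector to collect the slope estimates
for(i in 1:1000) {
  df   = data.frame(eps = rnorm(obs)) %>%   # new draw of the error term
         mutate(X = runif(obs),             # new draw of the explanatory variable
                Y = 2 + 0.5 * X + eps)      # true model
  reg  = lm(Y ~ X, data = df)               # run the regression
  beta = c(beta, reg$coefficients[[2]])     # store the slope estimate
}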
Note: with $coefficients[[2]] we extract the coefficient estimate of the second (i.e. the slope) coefficient from the regression results and store it in a vector called beta.
Compute the average value of the estimate of the X coefficient across the 1000 repetitions.
mean(beta)
[1] 0.5042947
results=data.frame(beta)
ggplot(results, aes(x=beta)) +
geom_histogram(color="black",fill="white",bins=40) +
theme_minimal()
Note: we can use ggplot to draw a histogram of all the results for beta as well. This shows that, while we do not always get a value close to 0.5, over many replications the values we find are centred around the true value of 0.5.
For attribution, please cite this work as
Martin (2022, Oct. 10). Datastories Hub: Exercises 2. Retrieved from https://mondpanther.github.io/datastorieshub/posts/exercises/exercises2/
BibTeX citation
@misc{martin2022exercises,
  author = {Martin, Ralf},
  title = {Datastories Hub: Exercises 2},
  url = {https://mondpanther.github.io/datastorieshub/posts/exercises/exercises2/},
  year = {2022}
}