Quick Guide - Unit roots

Quickguides
Ralf Martin
11-17-2021

When dealing with time series we have to account for the possibility that the observations in our data are correlated from one observation to the next; e.g. if we see a particularly high value in one period we can also expect a higher value in the next period.

One of the simplest possibilities is a so-called autoregressive model with one lag; i.e.

\[y_t=\beta_0+\beta_1 y_{t-1}+\epsilon_t \tag{1}\] Hence, this is like the normal linear regression model except that one of our explanatory variables is the dependent variable from the previous period. In principle we can estimate this like any other regression model; i.e. with OLS using the lm() command in R.
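For concreteness, here is a minimal sketch of estimating Equation 1 by OLS. It assumes some numeric time series vector y; the helper column y_lag is introduced here just for illustration:

library(dplyr)

# OLS estimate of y_t = beta_0 + beta_1 * y_{t-1} + epsilon_t
d = data.frame(y=y) %>% mutate(y_lag=dplyr::lag(y))
lm(y~y_lag,d) %>% summary()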

However, this goes wrong if \(\beta_1\) gets too big in absolute value; i.e. if it is equal to or larger than 1, or equal to or smaller than -1 (i.e. \(|\beta_1|\ge 1\)). The borderline case \(\beta_1=1\) is referred to as the time series having a “unit root”.

If this happens, OLS estimates and their standard errors become unreliable, and we run the risk of spurious regression; i.e. we might find a strong correlation between two data series even though there is no causal mechanism between them. All that we are picking up is that both series have a unit root.

To check if we are dealing with a unit root in a given data series we can use the so-called Dickey-Fuller (DF) test. This test is based on a transformed version of the above model where we subtract \(y_{t-1}\) from both sides of the equation to get:

\[\Delta y_t= y_t - y_{t-1}= \beta_0+ \delta y_{t-1}+\epsilon_t \tag{2}\]

where \(\delta = \beta_1-1\). Hence, in Equation 2 a unit root would imply \(\delta=0\) and this is the (null) hypothesis the DF test examines.
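Continuing the sketch above, Equation 2 can be estimated directly with lm() as well. The t-value on the lagged level is the DF test statistic, but as discussed below it must be compared against special critical values rather than the usual ones:

# Dickey-Fuller regression: Delta y_t = beta_0 + delta * y_{t-1} + epsilon_t
d = d %>% mutate(Dy=y-dplyr::lag(y))
lm(Dy~y_lag,d) %>% summary()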

We can implement this in R with the ur.df() command from the urca package. Here we do this with a simulated series with a unit root:

set.seed(2.22112)
obs=100
eps=rnorm(obs)                  # i.i.d. shocks
y100=eps
for(i in 2:obs){
  y100[i]= y100[i-1]*1+eps[i]   # beta_1 = 1; i.e. a unit root
}
sdf=data.frame(y100,eps,period=1:obs) 

If we plot y100 this will look as follows:

library(ggplot2)
ggplot(sdf,aes(x=period))  + geom_line(aes(y=y100))+
           theme_minimal()+ylab("y")

We can now implement the DF test:

library(urca)
library(dplyr)

ur.df(sdf$y100,lags=0) %>% summary()

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression none 


Call:
lm(formula = z.diff ~ z.lag.1 - 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.21711 -0.74493 -0.05352  1.00502  2.23895 

Coefficients:
        Estimate Std. Error t value Pr(>|t|)
z.lag.1 -0.03789    0.02850  -1.329    0.187

Residual standard error: 1.153 on 98 degrees of freedom
Multiple R-squared:  0.01771,   Adjusted R-squared:  0.00769 
F-statistic: 1.767 on 1 and 98 DF,  p-value: 0.1868


Value of test-statistic is: -1.3294 

Critical values for test statistics: 
     1pct  5pct 10pct
tau1 -2.6 -1.95 -1.61

The summary of the output of ur.df() provides us with the regression output we would get if we simply implemented Equation 2 as an OLS model (note that with the default type="none" the intercept \(\beta_0\) is dropped). What is called z.lag.1 corresponds to \(\delta\) in Equation 2. However, as we said above: this estimate will be biased, and so will be the p-value which we would normally use to decide if a regression coefficient is different from 0. Equally, we cannot use the typical critical values we might normally compare the t-value to. Luckily, ur.df() reports appropriate critical values at the bottom of its output (under Critical values for test statistics:), for the 1, 5 and 10% significance levels. What we need to check is whether our test statistic is smaller than the values reported there. If that is the case, we can reject the (null) hypothesis that \(\beta_1\) is equal to 1. Not surprisingly, given that we simulated our series with \(\beta_1=1\), we cannot do that here. So we conclude that we have a unit root, and we check whether we can get rid of it by differencing the series:

library(urca)
library(dplyr)

ur.df(diff(sdf$y100),lags=0) %>% summary()

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression none 


Call:
lm(formula = z.diff ~ z.lag.1 - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4514 -0.9035 -0.1341  0.8249  2.1159 

Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
z.lag.1  -1.0580     0.1015  -10.43   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.167 on 97 degrees of freedom
Multiple R-squared:  0.5284,    Adjusted R-squared:  0.5235 
F-statistic: 108.7 on 1 and 97 DF,  p-value: < 2.2e-16


Value of test-statistic is: -10.4246 

Critical values for test statistics: 
     1pct  5pct 10pct
tau1 -2.6 -1.95 -1.61

This turns out to be the case: the test statistic (-10.42) is now much smaller than even the 1% critical value (-2.6), so we can reject the unit-root hypothesis for the differenced series.
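As an aside, the test statistic and critical values can also be extracted programmatically; a sketch (ur.df() returns an S4 object, so its slots are accessed with @):

df_test=ur.df(diff(sdf$y100),lags=0)
df_test@teststat                               # the tau1 statistic
df_test@cval                                   # 1%, 5% and 10% critical values
df_test@teststat[1] < df_test@cval[1,"5pct"]   # TRUE: reject the unit root at 5%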

Unit roots and causal inference

A problem with unit roots is that they create spurious correlations. Let’s explore this by creating another time series with a unit root:

epsx=rnorm(obs)                  # i.i.d. shocks for the X series
x100=epsx
for(i in 2:obs){
  x100[i]= x100[i-1]*1+epsx[i]   # beta_1 = 1; a unit root in X as well
}
sdf=bind_cols(sdf,data.frame(x100,epsx)) 


Let’s plot the time series

library(ggplot2)
ggplot(sdf,aes(x=period))  + geom_line(aes(y=y100,color="Y"))+
           theme_minimal() + geom_line(aes(y=x100,color="X"))+ylab("")+ labs(color = "")

Let’s also run a regression of Y on X:

lm(y100~x100,sdf) %>% summary()

Call:
lm(formula = y100 ~ x100, data = sdf)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.3732 -2.0077 -0.1562  2.0718  6.2196 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.82297    0.29442   9.588 9.49e-16 ***
x100        -0.05317    0.29917  -0.178    0.859    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.943 on 98 degrees of freedom
Multiple R-squared:  0.0003222, Adjusted R-squared:  -0.009879 
F-statistic: 0.03158 on 1 and 98 DF,  p-value: 0.8593
Let’s also include a linear time trend:

lm(y100~x100+period,sdf) %>% summary()

Call:
lm(formula = y100 ~ x100 + period, data = sdf)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5871 -1.9073 -0.0065  2.1269  5.7041 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.097333   0.551052   9.250 5.58e-15 ***
x100        -0.360405   0.278915  -1.292    0.199    
period      -0.044834   0.009504  -4.718 8.01e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.667 on 97 degrees of freedom
Multiple R-squared:  0.1869,    Adjusted R-squared:  0.1701 
F-statistic: 11.15 on 2 and 97 DF,  p-value: 4.39e-05

In this particular draw the coefficient on x100 happens not to be significant, but with unit-root series we cannot trust these p-values at all: regressions like these produce spurious “significant” correlations far more often than the nominal significance level suggests, despite X and Y being absolutely unrelated to each other. Nor can the matter in general be addressed by including a linear time trend. Let’s see what happens if we difference the Y series:

sdf=sdf %>% mutate(Dy100=y100-dplyr::lag(y100))
lm(Dy100~x100,sdf %>% filter(period>1)) %>% summary()

Call:
lm(formula = Dy100 ~ x100, data = sdf %>% filter(period > 1))

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4155 -0.8461 -0.1152  0.8612  2.1528 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02115    0.11744  -0.180    0.857
x100        -0.03505    0.11945  -0.293    0.770

Residual standard error: 1.168 on 97 degrees of freedom
Multiple R-squared:  0.0008868, Adjusted R-squared:  -0.009413 
F-statistic: 0.0861 on 1 and 97 DF,  p-value: 0.7698

i.e. we cannot find a significant relationship (as we should not).
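How often does the spurious-regression problem bite? Here is a small Monte Carlo sketch (sim_spurious and all its parameters are made up for illustration): simulate many pairs of independent random walks, regress one on the other in levels, and record how often the slope appears “significant” at the 5% level. With unit-root series the rejection rate comes out far above the nominal 5%:

set.seed(1)

# p-value of the slope when regressing one independent random walk on another
sim_spurious=function(n=100){
  y=cumsum(rnorm(n))   # random walk Y
  x=cumsum(rnorm(n))   # independent random walk X
  coef(summary(lm(y~x)))["x","Pr(>|t|)"]
}

pvals=replicate(1000,sim_spurious())
mean(pvals<0.05)       # far above 0.05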

Let’s also see what happens if we have a Y variable that is actually driven by X. We use the following true model:

\[ y_t=\beta_0 + y_{t-1}+ 0.5 x_{t} +\varepsilon_t\]
Let’s head for Monte Carlo and simulate data from this model:

epsyyy=rnorm(obs)                               # i.i.d. shocks for the new Y
yyy=epsyyy[1]+ 0.5 * x100[1]                    # initial value (beta_0 = 0 here)


for(i in 2:obs){
  yyy[i]= yyy[i-1]*1+epsyyy[i] + 0.5 * x100[i]  # unit root in Y plus the effect of X
}
sdf=bind_cols(sdf,data.frame(yyy,epsyyy)) 


Once more we can plot:

library(ggplot2)
ggplot(sdf,aes(x=period))  + geom_line(aes(y=y100,color="Y"))+ 
            geom_line(aes(y=yyy,color="Y(x)"))+
           theme_minimal() +geom_line(aes(y=x100,color="X"))+ylab("")+ labs(color = "")

And regress:

lm(yyy~x100,sdf) %>% summary()

Call:
lm(formula = yyy ~ x100, data = sdf)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.5245  -2.8366   0.7516   3.3588   9.1748 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  12.8317     0.5151  24.911   <2e-16 ***
x100         -0.8355     0.5234  -1.596    0.114    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.148 on 98 degrees of freedom
Multiple R-squared:  0.02534,   Adjusted R-squared:  0.0154 
F-statistic: 2.548 on 1 and 98 DF,  p-value: 0.1136
And again with a time trend:

lm(yyy~x100+period,sdf) %>% summary()

Call:
lm(formula = yyy ~ x100 + period, data = sdf)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1768 -2.1512  0.1285  2.1781  4.6953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.764846   0.515937   9.235 6.01e-15 ***
x100        0.254195   0.261142   0.973    0.333    
period      0.159019   0.008898  17.871  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.497 on 97 degrees of freedom
Multiple R-squared:  0.7729,    Adjusted R-squared:  0.7683 
F-statistic: 165.1 on 2 and 97 DF,  p-value: < 2.2e-16
Now let’s run the regression in first differences:

sdf=sdf %>%  mutate(Dyyy=yyy-dplyr::lag(yyy),Dx100=x100-dplyr::lag(x100))


lm(Dyyy~x100,sdf ) %>% summary()

Call:
lm(formula = Dyyy ~ x100, data = sdf)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8024 -0.6544 -0.1271  0.6305  2.7633 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.1425     0.1071   1.330    0.187    
x100          0.4486     0.1090   4.117 8.08e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.066 on 97 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.1487,    Adjusted R-squared:  0.1399 
F-statistic: 16.95 on 1 and 97 DF,  p-value: 8.082e-05


Note that if we run a regression of yyy on x100 in levels we get an estimate that is wildly off the true value. If we run the regression in first differences we get something fairly close to the true value of 0.5. This makes sense: differencing the true model gives \(\Delta y_t=\beta_0+0.5x_t+\varepsilon_t\), which is exactly the specification we just estimated.

It is also good practice to check that the differenced yyy series is really stationary:

ur.df(diff(sdf$yyy),lags=0) %>% summary()

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression none 


Call:
lm(formula = z.diff ~ z.lag.1 - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4984 -0.5904  0.1305  0.9971  3.0370 

Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
z.lag.1  -1.0482     0.1016  -10.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.161 on 97 degrees of freedom
Multiple R-squared:  0.5231,    Adjusted R-squared:  0.5182 
F-statistic: 106.4 on 1 and 97 DF,  p-value: < 2.2e-16


Value of test-statistic is: -10.315 

Critical values for test statistics: 
     1pct  5pct 10pct
tau1 -2.6 -1.95 -1.61

Indeed, we can reject \(\delta=0\) for the differenced series, even at the 1% level.

Extensions

Above we explored the simplest autoregressive model. More complex autoregressive models could include more lags, a (nonzero) constant or a time trend. We can accommodate these cases with the ur.df() command. For instance,

ur.df((sdf$y100),lags=2, type="trend") %>% summary()

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression trend 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4080 -0.8137 -0.1309  0.8624  2.2273 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.805594   0.344225   2.340   0.0214 *
z.lag.1     -0.111454   0.047331  -2.355   0.0207 *
tt          -0.010131   0.004671  -2.169   0.0327 *
z.diff.lag1 -0.027582   0.102665  -0.269   0.7888  
z.diff.lag2  0.074329   0.102043   0.728   0.4682  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.14 on 92 degrees of freedom
Multiple R-squared:  0.07854,   Adjusted R-squared:  0.03848 
F-statistic:  1.96 on 4 and 92 DF,  p-value: 0.1071


Value of test-statistic is: -2.3548 2.4378 3.593 

Critical values for test statistics: 
      1pct  5pct 10pct
tau3 -4.04 -3.45 -3.15
phi2  6.50  4.88  4.16
phi3  8.73  6.49  5.47

examines a model with three lags and a time trend. Note that the command says lags=2 because this refers to the number of lagged differences included on top of the lagged level, i.e. the lags beyond the first. How do we know whether such a more general model is needed? We can go from the more general to the more specific model; the general model here looks as follows:

\[\Delta y_t=\beta_0 +\rho \times t+ \delta y_{t-1}+ \beta_2 \Delta y_{t-1} + \beta_3 \Delta y_{t-2}+\varepsilon_t\]

First, consider the lags. It turns out that the standard errors for the coefficients on the lagged differences are actually not biased in the differenced form of the equation. Hence, we can use the normal t-tests to throw out lags; e.g. let’s try lags=1 only, as above the second lag term z.diff.lag2 is not significant.

ur.df((sdf$y100),lags=1, type="trend") %>% summary()

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression trend 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5129 -0.7817 -0.1183  0.8220  2.2355 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.861754   0.326152   2.642  0.00965 **
z.lag.1     -0.112466   0.045150  -2.491  0.01450 * 
tt          -0.011050   0.004485  -2.464  0.01557 * 
z.diff.lag  -0.033659   0.101355  -0.332  0.74056   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.134 on 94 degrees of freedom
Multiple R-squared:  0.08781,   Adjusted R-squared:  0.0587 
F-statistic: 3.016 on 3 and 94 DF,  p-value: 0.03379


Value of test-statistic is: -2.4909 2.9158 4.3499 

Critical values for test statistics: 
      1pct  5pct 10pct
tau3 -4.04 -3.45 -3.15
phi2  6.50  4.88  4.16
phi3  8.73  6.49  5.47

It turns out that z.diff.lag is not significant either. So maybe back to lags=0?

ur.df((sdf$y100),lags=0, type="trend") %>% summary()

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression trend 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5413 -0.7635 -0.1778  0.8154  2.2900 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.804043   0.308978   2.602   0.0107 *
z.lag.1     -0.108795   0.042926  -2.534   0.0129 *
tt          -0.010252   0.004307  -2.380   0.0193 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.125 on 96 degrees of freedom
Multiple R-squared:  0.08291,   Adjusted R-squared:  0.06381 
F-statistic:  4.34 on 2 and 96 DF,  p-value: 0.01569


Value of test-statistic is: -2.5345 2.9056 4.3396 

Critical values for test statistics: 
      1pct  5pct 10pct
tau3 -4.04 -3.45 -3.15
phi2  6.50  4.88  4.16
phi3  8.73  6.49  5.47
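As an aside, instead of dropping lags by hand we can also let ur.df() choose the lag length with an information criterion via its selectlags argument; a sketch, where lags sets the maximum considered:

# Start from up to 4 lagged differences and let AIC pick
ur.df(sdf$y100,lags=4, type="trend", selectlags="AIC") %>% summary()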

But what about the time trend? Note that compared to our simple case above, several values are now reported in the Value of test-statistic is: line, and the critical values table has correspondingly more rows (tau3, phi2 and phi3). What they refer to can be a bit confusing at first, but hopefully it will become clear. Again, you might want to think of this as going from a more general to a more specific model.

Let’s first look at the third value in the Value of test-statistic is: line above, which is 4.3396 (phi3). It refers to a joint hypothesis test with \[H_0: \delta = 0, \rho=0\] i.e. a unit root and no time trend. Given that \(4.3396 < 5.47\), we cannot reject these restrictions jointly, even at the 10% level. The second test statistic (i.e. 2.9056, phi2) examines the joint hypothesis that we have a unit root, no time trend and no intercept (i.e. \(H_0: \delta = 0, \rho=0, \beta_0=0\)); since \(2.9056 < 4.16\), we cannot reject that either. Consequently, we proceed on the basis that there is a unit root but no time trend or intercept. Note that if we were in a situation where the first test statistic is larger than the tau3 critical value (so we cannot reject the unit root) but the third test statistic is larger than the phi3 critical value (so we reject the joint hypothesis), we would conclude that we have a unit root together with a time trend.
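If you want to automate these comparisons, the statistics and critical values can again be pulled from the returned object; a sketch (note the direction of the comparisons: tau3 rejects when the statistic is below its critical value, while the phi statistics reject when above):

adf=ur.df(sdf$y100,lags=0, type="trend")

adf@teststat                                      # tau3, phi2, phi3
adf@cval                                          # critical values table

adf@teststat[1,"tau3"] < adf@cval["tau3","5pct"]  # reject the unit root?
adf@teststat[1,"phi2"] > adf@cval["phi2","5pct"]  # reject unit root + no intercept + no trend?
adf@teststat[1,"phi3"] > adf@cval["phi3","5pct"]  # reject unit root + no trend?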

Citation

For attribution, please cite this work as

Martin (2021, Nov. 17). Datastories Hub: Quick Guide - Unit roots. Retrieved from https://mondpanther.github.io/datastorieshub/posts/quickguides/quickguide_unitroots/

BibTeX citation

@misc{martin2021quick,
  author = {Martin, Ralf},
  title = {Datastories Hub: Quick Guide - Unit roots},
  url = {https://mondpanther.github.io/datastorieshub/posts/quickguides/quickguide_unitroots/},
  year = {2021}
}