Apr 2017

Economists use a statistical procedure called regression analysis to determine whether there is a relationship between economic variables. For example, a labor economist might use regression analysis to determine whether there is a relationship between salaries and education after controlling for differences in job tenure and geographic region. An antitrust economist might use regression analysis to determine whether an attempted collusion in the airline industry affected prices after controlling for other possible explanations for changes in prices, such as the cost of jet fuel. Financial economists use regression analysis to determine the relationship between the returns to shares in publicly traded company stock and the returns to baskets of shares of other publicly traded companies.

The simplest regression analysis is called

OLS requires the following five key assumptions to hold for the regression results to be valid.

Successful application of regression analysis must recognize potential problems with the data being analyzed to ensure that faulty inferences are not drawn from naive application of a valid technique to problematic data.

Data sets include

Applying regression analysis to time series data poses pervasive challenges. An analyst who is not knowledgeable about basic statistics could conclude that two completely unrelated series are closely related by failing to test for serial correlation. This problem is especially acute in time series data, where time trends and other omitted variables are likely to cause series to move together even when there is no independent relationship between the variables. Fortunately, there are ways of drawing valid inferences from time series data with regression analysis; they can be found in almost any introductory econometrics textbook and are well known to any competent analyst.

A classic example of potentially faulty inferences drawn from time series data is GNP and sunspots. See Charles I. Plosser and G. William Schwert's "Money, Income, and Sunspots: Measuring Economic Relationships and the Effects of Differencing".

Simplifying Plosser and Schwert's example, imagine accumulating the sunspots observed each year into a running total. Such a series would increase each year by the number of sunspots observed that year. GNP also increases over time. Regressing GNP on cumulative sunspots generates a high R-squared and a significant t-statistic on cumulative sunspots even though GNP and sunspots are obviously unrelated. The analyst could avoid a possibly embarrassing mistake by noting that the residuals are positively correlated, as evidenced by the Durbin-Watson (DW) statistic, and making the correct transformation of the GNP and sunspots variables. This is a famous example because the regression of levels on levels is so obviously wrong and because regressing first differences (annual changes in GNP and sunspots) demonstrates that GNP and sunspots are unrelated other than by virtue of both having a positive time trend.
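The levels-versus-differences contrast can be reproduced with simulated data. The sketch below is not Plosser and Schwert's data or code: it assumes two independent series built as running totals of positive-mean random increments, and the helper name `ols_stats`, the seed and the sample size are all hypothetical choices made for illustration.

```python
import numpy as np

def ols_stats(x, y):
    """OLS of y on a constant and x.

    Returns (R-squared, t-statistic on x, Durbin-Watson statistic).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - sse / sst
    # Conventional OLS standard error, valid only if errors are uncorrelated
    cov = (sse / (n - k)) * np.linalg.inv(X.T @ X)
    t_stat = beta[1] / np.sqrt(cov[1, 1])
    dw = np.sum(np.diff(resid) ** 2) / sse
    return r_squared, t_stat, dw

# Assumption: two unrelated series, each a running total of noisy
# positive increments (a stand-in for "cumulative sunspots" and "GNP").
rng = np.random.default_rng(0)
n = 500
x_level = np.cumsum(rng.normal(1.0, 1.0, n))
y_level = np.cumsum(rng.normal(1.0, 1.0, n))

levels_r2, levels_t, levels_dw = ols_stats(x_level, y_level)
diffs_r2, diffs_t, diffs_dw = ols_stats(np.diff(x_level), np.diff(y_level))

print(f"levels:      R2={levels_r2:.3f}  t={levels_t:.1f}  DW={levels_dw:.3f}")
print(f"differences: R2={diffs_r2:.3f}  t={diffs_t:.1f}  DW={diffs_dw:.3f}")
```

In the levels regression the shared upward trend produces a high R-squared and a DW statistic near 0. In the differenced regression the apparent relationship disappears and the DW statistic moves toward 2.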

As

It is exceedingly hard to miss serial correlation in error terms since statistical packages automatically produce the Durbin-Watson (DW) statistic which tests for serial correlation of the error terms.

The DW statistic ranges from 0 to 4; DW = 0 means the error terms are perfectly positively correlated, DW = 4 means the error terms are perfectly negatively correlated and DW = 2 means the error terms are uncorrelated.
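As a minimal sketch of where those endpoints come from, the DW statistic is the sum of squared changes in adjacent residuals divided by the sum of squared residuals. The function name `durbin_watson` and the three illustrative residual series below are our own assumptions, not output from any particular statistical package.

```python
import numpy as np

def durbin_watson(resid):
    """Sum of squared changes in adjacent residuals over sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Residuals that drift slowly: each one is almost equal to the one before it.
drifting = np.linspace(-1.0, 1.0, 100)

# Residuals that flip sign every period: strongly negatively correlated.
alternating = np.tile([1.0, -1.0], 50)

# Independent draws: no serial correlation (seed chosen arbitrarily).
noise = np.random.default_rng(0).normal(size=2000)

dw_drifting = durbin_watson(drifting)
dw_alternating = durbin_watson(alternating)
dw_noise = durbin_watson(noise)

print(f"drifting:    {dw_drifting:.4f}")    # close to 0
print(f"alternating: {dw_alternating:.4f}") # close to 4
print(f"noise:       {dw_noise:.4f}")       # close to 2
```

In a finite sample the statistic approaches, but does not exactly reach, 0 and 4 for the drifting and alternating cases.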

Our recent research into Puerto Rico municipal bond returns provides an example of the potential for finding an unreliable relationship in time series data which can be avoided by paying attention to what the DW statistic tells us about serial correlation in the error terms.

In Figure 1, we plot the S&P Puerto Rico Municipal Total Return Index (SAPIPR) and the S&P Investment Grade Municipal Total Return Index (SAPIINV), both set equal to 100 on January 29, 1999. We also plot the cumulative sunspots observed from January 1, 1982, normalized to equal 100 on January 29, 1999, when our total return data starts.

As we expect from a fixed income total return index, there is a strong upward trend in both the Puerto Rico and USA series. There is a drop in both indexes in 2008, 2010 and 2013. Each time, the drop in the Puerto Rico series is larger than the drop in the USA series. The drop in Puerto Rico municipal bond prices in 2013 was especially dramatic and persistent.

Figure 1 might tempt an unsophisticated analyst to run an OLS regression on these two total return index level series from January 29, 1999 to December 31, 2007 to support a belief that Puerto Rico municipal bond returns are highly correlated with mainland municipal bond returns. The results of that regression, run in Excel, are reported in Table 1.

An R-squared of 0.9995 and a t-statistic of 2,159 on the explanatory variable are red flags that there is something wrong with this regression.

To see how silly it would be to defend the regression reported in Table 1, consider a regression of the Puerto Rico Municipal Bond Total Return Index on the Cumulative Sunspots variable plotted in Figure 1. Obviously, Puerto Rico municipal bond returns are unrelated to sunspot activity, yet the regression yields an R-squared of 0.93 and a t-statistic on Cumulative Sunspots of 177.8. See Table 2.

An analyst who accepts or defends the regression results reported in Table 1 would likely also accept and defend the nonsensical results in Table 2. The correlation coefficient for adjacent residuals from the regression in Table 2 is 0.997 and the DW statistic is 0.0054, demonstrating that the residuals are nearly perfectly positively correlated and the regression results are therefore unreliable. Figure 3 is a residual plot for the regression of Puerto Rico municipal bond total return levels on cumulative sunspots. It looks very similar to Figure 2 because both regressions suffer from severe serial correlation.

With near-perfect positive serial correlation like that exhibited in Figure 2 and Figure 3, the standard fix is to run the regression on the first differences of the variables. The results of a regression of first differences in the Puerto Rico Municipal Bond Total Return Index on first differences in Cumulative Sunspots are reported in Table 3. The R-squared drops to 0 and the t-statistic on the Sunspots variable is not statistically significant at standard confidence levels. This is what we expect, since Puerto Rico municipal bond returns cannot be related to sunspot activity. The correlation coefficient for adjacent residuals from the regression in Table 3 is -0.020 and the DW statistic is 2.039, demonstrating that the residuals are uncorrelated and the regression statistics are therefore likely reliable.

We now return to the more serious question of the relationship between the returns on Puerto Rican municipal bonds and the returns on mainland municipal bonds. The adjacent residuals from the regression in Table 1 are nearly perfectly correlated, so the careful analyst would run the regression on the first differences of the two total return index levels. Table 4 reports the results of such a regression. The correlation coefficient of adjacent residuals is now 0.069 (instead of 0.99) and the DW statistic is 1.862 (instead of 0.02). The R-squared is still quite high and the t-statistic on the mainland variable is still implausibly high at 317, but the results in Table 4 make more sense than the results in Table 1.
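The correlation coefficients and DW statistics quoted for Tables 1 through 4 can be checked against one another. A standard large-sample approximation, which we bring in here as an assumption rather than something derived in this post, is DW ≈ 2(1 - r), where r is the correlation between adjacent residuals. The short sketch below applies it to the figures reported in the text; the dictionary labels are ours.

```python
# (adjacent-residual correlation r, reported DW) pairs quoted in the text
reported = {
    "Table 1: PR levels on mainland levels": (0.99, 0.02),
    "Table 2: PR levels on cumulative sunspots": (0.997, 0.0054),
    "Table 3: PR differences on sunspot differences": (-0.020, 2.039),
    "Table 4: PR differences on mainland differences": (0.069, 1.862),
}

gaps = {}
for label, (r, dw) in reported.items():
    approx = 2.0 * (1.0 - r)  # large-sample approximation to DW
    gaps[label] = abs(approx - dw)
    print(f"{label}: 2(1 - r) = {approx:.4f}, reported DW = {dw}")
```

The approximation lines up closely with each reported DW statistic. Small gaps are expected because the correlation coefficients are quoted to only two or three decimal places.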

There is a second major data problem that would cause the error terms to still be correlated. The index levels from 1999 to mid-2006 are only reported monthly. Our hypothetical hapless analyst has filled in all the missing days from 1999 to mid-2006 with the last reported value for both total return series and treated each day as a new, independent observation rather than as 20 identical copies of one monthly observation.

Running OLS regressions on first differences of this "fake" data will still generate serially correlated errors, since on every day but one each month during the period from January 1999 to August 2006 the error term will equal the previous day's error term. The correlation coefficient and DW statistic reported above don't fully reflect the positive serial correlation because once a month there is a large reversal in the error term. In fact, continuing to difference the variables a second time, a third time and so on, as suggested by Plosser and Schwert, won't fix this data problem. Even without looking at the DW statistic, the analyst would know from the residuals of both the levels and differences regressions that the data from January 1999 to August 2006 is not daily data and can't be used as daily data.
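One simple way to spot filled-in data before running any regression is to count how often the series does not change from one day to the next. The sketch below uses simulated numbers, not the S&P index data: the helper name `share_of_zero_changes`, the 20-day month, the 24-month span and the return parameters are all assumptions made for illustration.

```python
import numpy as np

def share_of_zero_changes(levels):
    """Fraction of day-over-day changes that are exactly zero."""
    changes = np.diff(np.asarray(levels, dtype=float))
    return np.mean(changes == 0.0)

rng = np.random.default_rng(1)
days_per_month = 20   # assumption: 20 trading days in every month
n_months = 24         # assumption: two years of monthly reports

# A monthly index, then the same index "filled in" by repeating each
# monthly value on every trading day of that month.
monthly_levels = 100.0 + np.cumsum(rng.normal(0.4, 1.0, n_months))
filled_daily = np.repeat(monthly_levels, days_per_month)

# For comparison, a series that really does move every day.
true_daily = 100.0 + np.cumsum(rng.normal(0.02, 0.25, n_months * days_per_month))

share_filled = share_of_zero_changes(filled_daily)
share_true = share_of_zero_changes(true_daily)

print(f"filled-in monthly data: {share_filled:.1%} of daily changes are zero")
print(f"genuinely daily data:   {share_true:.1%} of daily changes are zero")
```

A series in which the large majority of daily changes are exactly zero, with a jump once a month, is monthly data regardless of how many rows it occupies.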

There are other serious problems with analyzing this data and interpreting the results, but the lesson for today is simply: like any good undergraduate student, check the residuals for serial correlation.

_______________________________________