This page was generated from notebooks/9_specification_tests.ipynb. Interactive online version:
Specification Tests¶
Luc Anselin¶
(revised 09/11/2024)¶
Preliminaries¶
In this notebook, the basic regression diagnostics for spatial autocorrelation are introduced. These include the classic Moran’s I test, as well as the Lagrange Multiplier/Rao Score tests for lag and error dependence developed during the 1980s and 1990s. In addition, the recent tests for the Spatial Durbin specification developed by Koley and Bera are covered as well. The tests are detailed in Anselin and Rey (2014), Modern Spatial Econometrics in Practice and in Anselin, Serenini and Amaral (2024). Spatial Econometric Model Specification Search: Another Look (DOI: 10.13140/RG.2.2.10650.86721), as well as in the references therein.
In addition to the classic case with OLS estimates, a test for spatial correlation is covered for models with endogenous variables, estimated by means of 2SLS.
Prerequisites¶
Familiarity with OLS and 2SLS estimation in spreg is assumed, as covered in the respective notebooks, as well as basics of numpy, pandas, geopandas, and libpysal. In addition, it is assumed that the chicagoSDOH PySAL sample data set has been installed (for specific instructions, refer to the sample data notebook).
Modules Needed¶
The main module for spatial regression in PySAL is spreg. In addition libpysal is needed for data import and spatial weights manipulation, and geopandas for data input from a shape file. This notebook is based on version 1.7 of spreg.
As before, only the needed functions from libpysal are imported, i.e., libpysal.io.open
as open
, libpysal.examples.get_path
as get_path
, and libpysal.weights
as weights
. The OLS
and TSLS
estimation routines are imported from spreg
.
Some additional imports are included to avoid excessive warning messages. With later versions of PySAL, these may not be needed. As before, the set_printoptions
is used for numpy 2.0 and later.
[1]:
import warnings
warnings.filterwarnings("ignore")
import os
os.environ['USE_PYGEOS'] = '0'
import numpy as np
import pandas as pd
import geopandas as gpd
from libpysal.io import open
from libpysal.examples import get_path
import libpysal.weights as weights
from spreg import OLS, TSLS
np.set_printoptions(legacy="1.25")
Functionality Used¶
from geopandas:
read_file
from libpysal:
examples.get_path
io.open
weights.transform
from spreg:
OLS
TSLS
Data, Weights and Variables¶
As in the previous notebooks, all data sets, weights files and variables are specified at the top, so that they can be easily changed to other examples.
Date sets and weights are from the chicagoSDOH sample data set:
Chi-SDOH.shp,shx,dbf,prj: socio-economic indicators of health for 2014 in 791 Chicago tracts
Chi-SDOH_q.gal: queen contiguity spatial weights created with GeoDa
The weights are used in row-standardized form.
For the OLS case, the same model specification is used as before, with YPLL_rate (an index measuring premature mortality, i.e., higher values are worse health outcomes) as the dependent variable, and HIS_ct (economic hardship index), Blk14P (percent Black population), and Hisp14P (percent Hispanic population) as the explanatory variables. These are specified in the y_name etc. variables, which are then used to create the corresponding numpy vectors and matrices for use as y, x, etc. in the regression specification.
For the 2SLS case, the variable HIS_ct is considered to be endogenous, with COORD_X and COORD_Y (the tract centroids) as instruments, as in the 2SLS notebook.
The various initializations are carried out in two steps:
first, all file names and variable names are defined
second, the files are read and variable vectors/matrices constructed
The first step allows for customization to other examples, the second step is agnostic to the actual files and variables that were specified. To keep the code simple, there are no error checks for missing files or mismatches in the variable names.
[2]:
infileshp = get_path("Chi-SDOH.shp") # input shape file with data
infileq = get_path("Chi-SDOH_q.gal") # queen contiguity weights from GeoDa
y_name = 'YPLL_rate'
x_names1 = ['Blk14P','Hisp14P']
x_names2 = ['Blk14P','Hisp14P','HIS_ct']
yend_names = ['HIS_ct']
q_names = ['COORD_X', 'COORD_Y']
ds_name = 'Chi-SDOH'
w_name = 'Chi-SDOH_q'
[3]:
dfs = gpd.read_file(infileshp)
wq = open(infileq).read() # queen contiguity weights
wq.transform = 'r' # row-transform the weights
y = dfs[y_name]
x1 = dfs[x_names1]
x2 = dfs[x_names2]
yend = dfs[yend_names]
q = dfs[q_names]
Specification Tests¶
The specification tests for spatial autocorrelation in regression residuals are invoked by setting the argument spat_diag = True
in the OLS call and passing a spatial weights object. This will result in the LM tests being added to the result output. In order to also list Moran’s I, the argument moran = True
is required as well. Since this test statistic involves more computation, it is not included in the default spat_diag
listing.
Before considering the implementation of these tests in spreg
, their formal expressions are briefly summarized.
Moran’s I¶
Formally,the Moran’s I test statistic is:
\begin{equation*} I = \frac{e'We / S_0}{e'e / n}. \end{equation*}
In this expression, \(S_0 = \sum_i \sum_j w_{ij}\) is the sum of the weights in matrix \(W\), and \(e\) are the residuals.
Inference for Moran’s I is based on an asymptotic standard normal approximation. The statistic itself is first converted into a standardized or z-value by subtracting the mean and dividing by the standard deviation. Those two moments are obtained under the null hypothesis of no spatial autocorrelation.
The moments of Moran’s I for regression residuals, under the null of no spatial autocorrelation (for the regression error terms) were derived by Cliff and Ord (1972) (Testing for spatial autocorrelation among regression residuals, Geographical Analysis 4, 267-284).
The mean is:
\begin{equation*} E[ I ] = \frac{tr(MW)}{(n - k)}, \end{equation*}
where the matrix \(M\) in the trace expression follows the conventional notation as \(M = I - X(X'X)^{-1}X'\), a \(n \times n\) projection matrix.
The variance of Moran’s I is:
:nbsphinx-math:`begin{equation*} Var[I] = frac{ tr(MWMW’) + tr(MWMW) + (tr(MW))^2}
{(n - k)(n - k + 2)} - ( E[I])^2.
end{equation*}`
The so-called z-value is then obtained in the usual fashion, as:
\begin{equation*} I_z = \frac{I - E[I]}{\sqrt{Var[I]}} \sim \ N(0, 1). \end{equation*}
The \(I_z\) statistic has an asymptotic distribution that is approximated by the standard normal.
Moran’s I is a very powerful misspecification test. While designed to detect spatial error autocorrelation, it also has power against a range of other misspecifications, including heteroskedasticity and non-normality. Hence, when the null is not rejected, one can be fairly confident that none of these misspecifications are present. On the other hand, when the null is rejected, it is not always clear what the next step should be, other than maybe implement HAC standard errors. In contrast, the LM statistics are so-called focused tests against spatial autocorrelation that consider a specific alternative (lag or error).
LM Statistics for Spatial Lag and Spatial Error¶
Technical details pertaining to these tests are given in Chapter 5 of Anselin and Rey (2014). The original results are from Burridge (1980) (On the Cliff-Ord test for spatial autocorrelation, Journal of the Royal Statistical Society B, 42, 107-108) for the LM-Error test, Anselin (1988) (Lagrange Multiplier test diagnostics for spatial dependence and spatial heterogeneity, Geographical Analysis 20, 1-17), for the LM-Lag test, and Anselin, Bera, Florax, Yoon (1996) (Simple diagnostic tests for spatial dependence, Regional Science and Urban Economics 26, 77-104) for their robust forms.
The LM test for spatial lag is:
\begin{equation*} LM_{\rho} = \frac{d_{\rho}^2}{D} \sim \chi^2(1), \end{equation*}
with \(d_{\rho} = e'Wy / \hat{\sigma}^2\) as the score for \(\rho\), and \(D = (WX\hat{\beta})' M (WX \hat{\beta}) / \hat{\sigma}^2 + T\), where \(M = I - X(X'X)^{-1}X'\), \(T = tr(WW + W'W)\) (with tr as the trace of a matrix), \(e\) is the OLS residual vector and \(\hat{\beta}\) are the OLS regression coefficients (so, \(X \hat{\beta}\) is the vector of predicted values).
The LM test for spatial error is:
\begin{equation*} LM_{\lambda} = \frac{d_{\lambda}^2}{T} \sim \chi^2(1), \end{equation*}
where \(d_{\lambda} = (e'We) /\hat{\sigma}^2\) is the score for \(\lambda\) and \(T\) is as before.
The LM test for lag robust to the presence of error is:
\begin{equation*} LM_{\rho}^* = \frac{(d_{\rho} - d_{\lambda} )^2}{( D - T)} \sim \chi^2(1), \end{equation*}
The LM test for error robust to the presence of lag is:
\begin{equation*} LM_{\lambda}^* = \frac{(d_{\lambda} - TD^{-1}d_{\rho} )^2}{[T (1 - TD) ] } \sim \chi^2(1), \end{equation*}
in the same notation.
LM Statistics for Spatial Durbin Model¶
In Koley and Bera (2024) (To use or not to use the spatial Durbin model? - that is the question. Spatial Economic Analysis 19, 30-56), robust LM tests are derived for \(\rho\) and \(\gamma\) in the spatial Durbin model. Again, the point of departure is an OLS regression of the classic non-spatial specification.
LM test for \(\gamma\):
\begin{equation*} LM_{\gamma} = \frac{(e'(WX_0))[(WX_0)'M(WX_0)]^{-1}((WX_0)'e)}{\hat{\sigma}^2} \sim \chi^2(h), \end{equation*}
where \(h = k-1\) (i.e., not counting the constant term), and \(X_0\) is the \(X\) matrix without the constant column. The expression for the \(LM_{\rho}\) test is the same as before.
Also, the joint LM test on \(\rho\) and \(\gamma\) is:
\begin{equation*} LM_{\rho\gamma} = \begin{bmatrix} (Wy)'e / \hat{\sigma}^2 & e'(WX_0) / \hat{\sigma}^2 \end{bmatrix} \begin{bmatrix} (WX\beta)'M(WX\hat{\beta}) + T & (WX\hat{\beta})'M(WX_0) \\ (WX_0)'M(WX\hat{\beta}) & (WX_0)'M(WX_0) \end{bmatrix}^{-1} \begin{bmatrix} (Wy)'e / \hat{\sigma}^2 \\ (WX_0)'e / \hat{\sigma}^2 \end{bmatrix} \sim \chi^2(k) \end{equation*}
In these expressions, \(T\) and \(M\) are as before.
The robust forms of the \(LM_{\rho}\) and \(LM_{\gamma}\) tests can be obtained from the result of the joint test and the expressions for \(LM_{\rho}\) and \(LM_{\gamma}\), since the following equality holds:
\begin{equation*} LM_{\rho\gamma} = LM_{\rho} + LM_{\gamma}^* = LM_{\rho}^* + LM_{\gamma}. \end{equation*}
Diagnostics in a 2SLS Regression¶
As in the classic OLS estimation, the residuals from 2SLS estimation can be assessed for spatial autocorrelation. Specifically, the so-called \(AK\) test extends the principle behind the Moran’s I statistic to residuals from a 2SLS estimation (Anselin and Kelejian 1997. Testing for spatial error autocorrelation in the presence of endogenous regressors. International Regional Science Review 20, 153-182).
The statistic reduces to the same expression as the Lagrange Multiplier test for error spatial autocorrelation, but using the residuals from the 2SLS regression. Formally: \begin{equation*} AK = \frac{[ (e'We)/e'e/n) ]^2}{tr(WW + W'W)} \sim \chi^2(1) \end{equation*} where \(e\) is a vector of 2SLS residuals, \(W\) is the spatial weights matrix, and and \(tr\) stands for a matrix trace expression. The statistic is distributed asymptotically as Chi-squared with one degree of freedom.
Some caution is needed in the interpretation of the results of the \(AK\) test. Even though it takes the form of an \(LM\) test, the statistic is really a generalization of Moran’s I and therefore not actually a Lagrange Multiplier test. Since estimation is based on 2SLS, there is no assumption of normality and thus also no likelihood function (on which the \(LM\) statistic is based). Therefore, the \(AK\) test needs to be interpreted as a diffuse test rather than as a focused test (e.g., in the standard \(LM\) case). In other words, the rejection of the null of no spatial autocorrelation does not point to either a lag or an error specification as the proper alternative. As is the case for Moran’s I in a classic regression, rejection of the null points to the absence of independence, but not to a particular specification that may be the reason for the spatial correlation.
Spatial Diagnostics in OLS Regression¶
To obtain the spatial diagnostics, the argument spat_diag = True
must be set, with, in addition, moran = True
if Moran’s I is desired. Also, a spatial weights matrix must be specified as the w
argument (optionally with its name in name_w
).
Two explanatory variables¶
First, this is illustrated for the immigrant paradox regression with just two explanatory variables, i.e., with the arguments y and x1. The call to OLS
is the same as in the earlier notebook, but now with the arguments for the spatial diagnostics included.
[ ]:
ols1 = OLS(y,x1,w=wq,spat_diag=True,moran=True,
name_w=w_name,name_ds=ds_name)
print(ols1.summary)
The regression output is augmented with a section at the bottom, entitled DIAGNOSTICS FOR SPATIAL DEPENDENCE. It is organized into two parts, one dealing the SAR-Error model as the alternative specification, the other with the Spatial Durbin specification (SDM).
In the SAR-Error part, the first result (if moran=True
) is for Moran’s I. The value of the Moran’s I is listed under MI/DF, here 0.1271, with the associated z-value under VALUE, as 6.521. Finally, in the PROB column, the associated p-value is given. In this example, Moran’s I is highly significant, suggesting a misspecification problem. However, since the test is not focused on a specific alternative, it is not clear what the next step would be. Also, there is very strong evidence of
heteroskedasticity, against which Moran’s I has power as well.
The next set of tests are the LM tests and their associated robust forms, first for Lag and then for Error. The LM-Lag test (26.211) is highly significant, but its robust form (0.258) is not. In contrast, both LM-Error (39.569) and the associated robust form (13.615) are highly significant, suggesting an Error alternative. Finally, the joint test for Lag and Error is highly significant as well. This test is not always indicative of the need for a higher order alternative, since it has high power against the single alternatives as well.
The upshot of these statistics is strong evidence towards a spatial error alternative.
The final set of tests are the LM tests in a Spatial Durbin context. The LM test for the coefficients of WX is given first, with its robust counterpart. The degrees of freedom (DF, 2) match the number of explanatory variables (not counting the constant term). It this example, the LM test of 0.836 is not significant, but its robust counterpart of 14.193 is (robust to the presence of a spatial lag). The LM test for Lag has the same value as in the SAR-Error context (26.211), but its robust form (39.569) is now robust to the presence of an SLX term and is highly significant. The joint test for Spatial Durbin (40.404) has degrees of freedom equal to the number of explanatory variables + 1, or 3 in this case. It is highly significant.
Koley and Bera (2024) suggest the following interpretation of these results. First, consider whether the joint test is significant, which it clearly is. Then consider each of the robust forms of the test statistic, which are also both highly significant in this example. This would point to a Spatial Durbin alternative. However, there may be some other misspecifications going on here, since the values of the robust tests are larger than their original counterparts, which is not standard behavior. In a strict sense, the robust tests are robust to small departures from the null and they should be smaller than the original test, since they correct for the ignored misspecification. When this inequality does not hold, there is an indication that the misspecification is more major than the correction is able to accommodate.
Clearly, the recommendations given by the SAR-Error and the Spatial Durbin contexts are at odds. This is further explored in the notebook on specification searches.
Three explanatory variables¶
With the hardship indicator (HIS_ct) included in the regression, the call uses x2, but is otherwise identical to the previous one.
[ ]:
ols2 = OLS(y,x2,w=wq,spat_diag=True,moran=True,
name_w=w_name,name_ds=ds_name)
print(ols2.summary)
The inclusion of the hardship indication in the regression specification changes not only the interpretation of the regression coefficients, but also greatly affects the results for the spatial diagnostics. In the SAR-Error context, the LM-Lag statistic is still significant, but its robust form is not. There is now only the weakest of evidence for the Error case (5.705 with a p-value of 0.02). Neither of the robust tests are significant. The joint LM test (7.329) only achieves a p-value of 0.0256, which provides very weak evidence. On the other hand, Moran’s I remains significant, but much less so than before (z-value of 2.619 with p=0.0088).
The situation is completely different on the Spatial Durbin front. Now, the values for the robust forms of the tests are smaller than the original counterparts, as they should be. There is strong evidence for an SLX alternative. Following the Koley-Bera (2024) recommendations, interest focuses on the robust forms of the one-directional tests, since the joint test is highly significant. The robust Lag test (5.705) is no longer significant for a p-value of 0.01 (p=0169), whereas the robust WX test (17.742) is still highly significant (p=0.0005). This would suggest an SLX alternative, or possibly, with a lower standard for significance (e.g., p=0.05), a Spatial Durbin model.
Since it is straightforward to estimate an SLX model by means of OLS, the effect of this specification on the spatial diagnostics is examined next. A full consideration of the estimation of various SLX models is considered in a separate notebook.
Spatial Diagnostics in SLX Regression¶
The call for the estimation of an SLX specification is the same as for any other OLS estimation, except that an additional argument slx_lags=1
must be included. If higher order lags are desired, the argument to slx_lags
must be adjusted accordingly.
The results for the full regression (with x2) is:
[ ]:
slx = OLS(y,x2,w=wq,slx_lags=1,
spat_diag=True,moran=True,
name_w=w_name,name_ds=ds_name)
print(slx.summary)
The inclusion of the three SLX terms changes the estimates and significance of the other coefficients considerably. The Hisp14P variable is no longer significant, but its spatial lag is, with a large negative coefficient. All three SLX coefficients are significant. However, their magnitude and sign raise some concerns. It is very difficult to interpret a case where the coefficient of the neighbor’s influence has a different sign from the original coefficient, as is the case for Blk14P and W_Blk14P. This suggests negative spatial autocorrelation, which is rare, but not impossible, although it runs counter to Tobler’s first law of geography.
Similarly, obtaining a coefficient for WX that is larger than the corresponding coefficient for X suggests a stronger effect of the neighbors than for the location itself, which runs counter to the distance decay implied by Tobler’s law. While this does not make the SLX specification invalid as such, it does require a careful consideration of the interpretation of the coefficients.
Diagnostics are only reported for the SAR-Error alternatives, since the WX term is already included in the model. This inclusion seems to have eliminated most evidence for remaining spatial autocorrelation. While Moran’s I and the LM tests are weakly significant (but not for a p-value of p=0.01), the robust LM tests are not, and neither is the joint LM test.
This would suggest that the inclusion of the SLX terms has taken care of the spatial autocorrelation problem. However, as mentioned, the signs and magnitudes of the coefficients are difficult to interpret and would demand closer scrutiny.
Spatial Diagnostics in 2SLS Regression¶
The AK test is included in the output of a 2SLS regression when spat_diag=True
and spatial weights are specified. In all other respects, the call is the same as reviewed in a previous notebook, with x1 as the exogenous explanatory variables, yend as the endogenous variable and q as the instruments.
[ ]:
tsls1 = TSLS(y,x=x1,yend=yend,q=q,w=wq,spat_diag = True,
name_w=w_name,name_ds=ds_name)
print(tsls1.summary)
In contrast to what was found with Moran’s I for the OLS case, there is no evidence for residual spatial autocorrelation when the variable HIS_CT is treated as being endogenous.
Practice¶
At this point, it would most effective if you could continue with your baseline regression, assess whether there is any evidence of spatial autocorrelation and what type of alternative suggests itself. If warranted, check the effect of including SLX terms or correcting for endogeneity on the regression coefficients and spatial diagnostics. Consider using a number of different spatial weights to assess the robustness of your findings.