This page was generated from notebooks/16_GMM_estimation_spatial_error.ipynb.

GMM Estimation - Spatial Error Model

Luc Anselin

(revised 09/26/2024)

Preliminaries

This module covers the estimation of spatial error models by means of the generalized method of moments (GMM). As for the IV estimation of the spatial lag model, this allows for both exogenous and endogenous explanatory variables. The estimate for the spatial error parameter is obtained as the solution of a set of moment conditions. As of version 1.4 of spreg, all spatial error estimation is implemented through the spreg.GMM_Error command. This is essentially a wrapper around the original implementations of the various GM_Error, GM_Endog_Error and GM_Combo functions (which still work exactly as before) with a simpler interface. Beyond the classic spatial error specification, more complex models can be estimated by including slx_lags (for additional WX) or add_wy (for inclusion of a spatially lagged dependent variable). This yields a range of higher order models, such as SLX Error, SARSAR (spatial lag with spatial autoregressive errors), and the generalized nested specification (GNS), i.e., a Spatial Durbin model with spatial autoregressive errors. The latter two models are treated in a separate notebook.

Modules Needed

The main module continues to be spreg for spatial regression analysis. From this, GMM_Error is imported. In addition, the utilities in libpysal (to open spatial weights and access the sample data set), pandas and geopandas are needed. All of these rely on numpy as a dependency.

[1]:
import warnings
warnings.filterwarnings("ignore")
import os
os.environ['USE_PYGEOS'] = '0'

import numpy as np
import pandas as pd
import geopandas as gpd
from libpysal.io import open
from libpysal.examples import get_path
from libpysal.weights import lag_spatial

from spreg import GMM_Error

Functions Used

  • from pandas/geopandas:

    • read_file

  • from libpysal:

    • io.open

    • examples.get_path

  • from spreg:

    • spreg.GMM_Error

Variable definition and data input

The data set and spatial weights are again from the chicagoSDOH sample data set:

  • Chi-SDOH.shp,shx,dbf,prj: socio-economic indicators of health for 2014 in 791 Chicago tracts

  • Chi-SDOH_q.gal: queen contiguity weights

To illustrate the methods, the same descriptive model is used as in the ML notebook. It relates the rate of households without health insurance in a tract (EP_UNINSUR) to the lack of high school education (EP_NOHSDP), the economic deprivation index (HIS_ct), limited command of English (EP_LIMENG) and the lack of access to a vehicle (EP_NOVEH). This is purely illustrative of a spatial error specification and does not have a particular theoretical or policy motivation.

In an alternative specification, HIS_ct is considered to be endogenous, with, as before, COORD_X and COORD_Y as instruments.

The file names and variable names are set in the usual manner. Any customization for different data sets/weights and different variables should be specified in this top cell.

[2]:
infileshp = get_path("Chi-SDOH.shp")     # input shape file with data
infileq = get_path("Chi-SDOH_q.gal")     # queen contiguity weights created with GeoDa

y_name = 'EP_UNINSUR'
x_names = ['EP_NOHSDP','HIS_ct','EP_LIMENG','EP_NOVEH']
xe_names = ['EP_NOHSDP','EP_LIMENG','EP_NOVEH']
yend_names = ['HIS_ct']
q_names = ['COORD_X','COORD_Y']
ds_name = 'Chi-SDOH'
w_name = 'Chi-SDOH_q'

The read_file and open functions are used to access the sample data set and contiguity weights. The weights are row-standardized and the data frames for the dependent and explanatory variables are constructed. As before, this functionality is agnostic to the actual data sets and variables used, since it relies on the specification given in the initial block above.

[3]:
dfs = gpd.read_file(infileshp)
wq =  open(infileq).read()
wq.transform = 'r'    # row-transform the weights
y = dfs[y_name]
x = dfs[x_names]
yend = dfs[yend_names]
xe = dfs[xe_names]
q = dfs[q_names]

GMM Estimation of the Error Model

Principle

As in the ML case, estimation of the spatial error model by means of GMM is based on spatially weighted least squares, where a consistent estimate for \(\lambda\) is used to create the spatially filtered dependent and explanatory variables, \((I - \hat{\lambda} W)y\) and \((I - \hat{\lambda} W)X\). If the only interest is in obtaining consistent estimates for \(\beta\), then any consistent estimate for \(\lambda\) will do. The only property that matters is consistency, since, unlike what holds in the spatial lag case, the precision of the \(\lambda\) estimate does not affect the precision of the \(\beta\) estimates.
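
To make this concrete, the construction of the spatially filtered variables can be sketched with lag_spatial (imported above). This is purely illustrative of the principle and not how GMM_Error proceeds internally; the value of lam_hat below is a hypothetical placeholder for a consistent estimate of \(\lambda\).

[ ]:
# illustrative only: spatial filtering with a hypothetical lambda estimate
lam_hat = 0.5                              # hypothetical consistent estimate of lambda
y_arr = y.to_numpy().reshape(-1, 1)        # dependent variable as a column vector
x_arr = x.to_numpy()                       # matrix of explanatory variables

# spatially filtered variables: (I - lambda W) y and (I - lambda W) X
y_filtered = y_arr - lam_hat * lag_spatial(wq, y_arr)
x_filtered = x_arr - lam_hat * lag_spatial(wq, x_arr)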

In two classic papers, Kelejian and Prucha (1998, 1999) showed how a consistent estimate can be obtained as the solution of a set of moment equations formulated in terms of the regression residuals and their spatial lags. However, this approach did not provide an asymptotic variance for the spatial parameter, and thus did not allow for inference. In a series of later papers (Kelejian and Prucha 2010; Arraiz et al. 2010, 2013; Drukker et al. 2013), the approach was extended to also allow for heteroskedasticity of unknown form and to include an asymptotic variance matrix.

The technical details are quite complex and are outlined in Chapter 9 of Anselin and Rey (2014). The upshot is that there are three main estimation methods: the original Kelejian-Prucha generalized moments (GM) approach, the generalized method of moments (GMM) approach with heteroskedastic errors, and the GMM approach with homoskedastic errors. In practice, the GMM-heteroskedastic approach is greatly preferred. It is also the default used by GMM_Error.

The GM and GMM approaches can be extended to models that include endogenous variables by means of spatially weighted two stage least squares (SW2SLS). This estimator uses the same expression as standard 2SLS, but replaces the \(Z\) matrix of exogenous and endogenous regressors by its spatially filtered counterpart, \(Z - \hat{\lambda} WZ\). The instrument matrix \(Q\) is unaffected.

In both cases, inference is based on two asymptotic variance matrices: one for the regression coefficients and one for the spatial error coefficient.

Implementation methods

GMM estimation of the spatial error model is implemented in spreg.GMM_Error. This requires the standard regression arguments (i.e., at a minimum, y, x and w, as well as yend and q for the endogenous case). Three methods are implemented, specified by means of the estimator argument.

The default is estimator = "het" for the heteroskedastic-robust GMM method. Other options are estimator = "hom", for GMM with homoskedastic errors, and estimator = "kp98", for the legacy Kelejian-Prucha GM approach.

In addition, there are some more technical options for the GMM methods. In practice, these are rarely needed.

For GMM-heteroskedastic, the extra arguments include step1c, max_iter and epsilon. The defaults are step1c = False, max_iter = 1 and epsilon = 0.00001. When set to True, step1c carries out an additional estimation step for the autoregressive coefficient after the solution of the initial set of moment equations, as in Arraiz et al. (2010). The default follows the later paper by Arraiz et al. (2013) and skips this step. It is possible to iterate the procedure by using the new/updated residuals in additional rounds of estimation by setting max_iter to a value larger than 1. Typically, this is not needed. When there are additional iterations, epsilon is used as a convergence criterion to stop the procedure.
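
For reference, these defaults can be written out explicitly. The call below is equivalent to relying on the defaults and is shown only to make the argument names concrete.

[ ]:
# explicitly passing the GMM-heteroskedastic defaults (equivalent to omitting them)
err_het = GMM_Error(y,x,w=wq,
                    estimator="het",      # default estimator
                    step1c=False,         # skip the extra step of Arraiz et al. (2010)
                    max_iter=1,           # single round of moment estimation
                    epsilon=0.00001,      # convergence criterion when max_iter > 1
                    name_w=w_name,name_ds=ds_name)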

For GMM-homoskedastic, the option A1 determines the exact manner in which the first set of moment equations is constructed. This pertains to very technical details regarding the trace of a matrix. The default is A1 = "hom_sc", which applies a scaling factor suggested in Drukker et al. (2013). Other options are A1 = "hom" for no scaling, and A1 = "het" for a slightly different matrix expression (details are given on p. 217 of Anselin and Rey 2014). In practice, the options seldom make much of a difference and the default is fine. There are no step1c or max_iter options for this case.
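
Analogously, the A1 option could be set explicitly for the homoskedastic case. The sketch below assumes A1 is passed through to the homoskedastic estimator as described above and is equivalent to the default.

[ ]:
# GMM-homoskedastic with the scaling option made explicit (same as the default)
err_hom = GMM_Error(y,x,w=wq,
                    estimator="hom",
                    A1="hom_sc",          # default scaling from Drukker et al. (2013)
                    name_w=w_name,name_ds=ds_name)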

The legacy kp98 method has no special options.

GMM-heteroskedastic is the default, so the estimator argument does not need to be specified. The first illustration is for all default settings with only exogenous regressors.

[ ]:
err1 = GMM_Error(y,x,w=wq,
                     name_w=w_name,name_ds=ds_name)
print(err1.summary)

The estimate of \(\lambda\), 0.4677, is almost identical to that obtained by means of ML (0.451). The other characteristics of the model are very similar as well, with a pseudo \(R^2\) of 0.634 (compared to 0.633). The interpretation of the other features of the regression object is the same as for ML estimation and will not be repeated here.

Again, there is only one type of predicted value, contained in predy. There are two types of residuals, the classic residual, \(u = y - X \hat{\beta}\), stored in u, and the spatially filtered residuals, \(u - \lambda W u\), stored in e_filtered.
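
These results can be inspected directly as attributes of the regression object, for example:

[ ]:
# predicted values and the two types of residuals (first five observations)
print(err1.predy[:5])          # predicted values
print(err1.u[:5])              # classic residuals, u = y - X b
print(err1.e_filtered[:5])     # spatially filtered residuals, u - lambda W u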

To illustrate the (minor) effect of the additional arguments, step1c is next set to True.

[ ]:
err2 = GMM_Error(y,x,w=wq,
                     step1c=True,
                     name_w=w_name,name_ds=ds_name)
print(err2.summary)

The estimate for \(\lambda\) is only marginally different, at 0.4722, with a slightly smaller standard error, but the fit is unaffected (in fact, in terms of pseudo \(R^2\), it is slightly worse than before, 0.6337 vs. 0.6340).

With max_iter = 10, the differences are again marginal.

[ ]:
err3 = GMM_Error(y,x,w=wq,
                     max_iter=10,
                     name_w=w_name,name_ds=ds_name)
print(err3.summary)

In practice, ignoring heteroskedasticity is typically not a good idea for cross-sectional regressions. The hom estimator is included here for the sake of completeness, but should usually not be considered. Only the default settings are illustrated.

[ ]:
err4 = GMM_Error(y,x,w=wq,
                     estimator="hom",
                     name_w=w_name,name_ds=ds_name)
print(err4.summary)

The spatial coefficient is again slightly different, but the effect on the \(\beta\) coefficients is marginal. The main differences are in the results for the standard errors. As mentioned, ignoring heteroskedasticity may yield an overly optimistic impression of precision.
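
For a quick check, the standard errors from the heteroskedastic (err1) and homoskedastic (err4) fits can be compared directly. This is a minimal sketch, assuming the std_err attribute holds the standard errors of the estimates reported in the summary.

[ ]:
# compare coefficient standard errors under GMM-het and GMM-hom
print("het:", err1.std_err)
print("hom:", err4.std_err)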

The final estimator is the legacy kp98 method, which is only included for completeness. Since it does not provide inference for the spatial parameter, it is otherwise not recommended for use in practice.

[ ]:
err5 = GMM_Error(y,x,w=wq,
                     estimator="kp98",
                     name_w=w_name,name_ds=ds_name)
print(err5.summary)

Note how the output now does not list a standard error, z-statistic and p-value for the \(\lambda\) estimate. The regression estimates are essentially the same.

GMM Estimation of the SLX-Error Model

As for ML, GMM estimation of the SLX-Error model is a special case of spreg.GMM_Error, with the additional argument slx_lags=1 (or a larger value). Everything else remains the same. More specifically, the three estimators het, hom and kp98 are again available. Only the default het is considered here; the results show only minor differences for the other methods.

[ ]:
slxerr1 = GMM_Error(y,x,w=wq,slx_lags=1,
                    name_w=w_name,name_ds=ds_name)
print(slxerr1.summary)

In general, as before, there is little support for including all the WX terms after controlling for spatial error autocorrelation. Only the coefficient of W_HIS_ct is significant, but, as before, the value of the estimate is larger than that for the unlagged regressor.

The spatial autoregressive coefficient, 0.422, is highly significant.

As usual, further refinements of the model specification can be carried out by eliminating some lag terms by means of slx_vars, as in the standard SLX model. This is not pursued in detail here beyond the brief sketch below.
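
The sketch below is a hypothetical illustration of that interface; it assumes slx_vars accepts a list of booleans aligned with x_names, with True marking the columns of x whose spatial lag is retained (here, only W_HIS_ct).

[ ]:
# hypothetical refinement: keep only the spatial lag of HIS_ct among the WX terms
slxerr2 = GMM_Error(y,x,w=wq,slx_lags=1,
                    slx_vars=[False,True,False,False],   # aligned with x_names
                    name_w=w_name,name_ds=ds_name)
print(slxerr2.summary)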

Exogenous and Endogenous Variables

Additional endogenous variables and associated instruments are added in the standard way by including yend and q as arguments. Note that in the example, the regression matrix is now set to xe, for only the exogenous variables. Everything else is the same as before. Only the default estimator = "het" is illustrated, with its default settings.

[ ]:
enderr = GMM_Error(y,xe,w=wq,yend=yend,q=q,
                     name_w=w_name,name_ds=ds_name)
print(enderr.summary)

The estimate for \(\lambda\) is again in the same general ballpark, but HIS_ct is no longer significant. As usual, the endogenous variable is listed under Instrumented and the instruments under Instruments.

Finally, the SLX model can be estimated with additional endogenous variables, by including yend and q with slx_lags=1.

[ ]:
slxenderr = GMM_Error(y,xe,w=wq,yend=yend,q=q,
                      slx_lags=1,
                     name_w=w_name,name_ds=ds_name)
print(slxenderr.summary)

In this example, none of the WX terms end up being significant. In addition, of the original regressors, only EP_LIMENG remains as significant.

Practice

As practice, different model specifications could be considered, including adding additional explanatory variables, selectively removing some lag terms, and using different spatial weights, in the same way as for the other models.