This page was generated from notebooks/17_GMM_higher_order.ipynb.

GMM Estimation - Higher Order Models

Luc Anselin

(revised 09/26/2024)

Preliminaries

This module covers the estimation of higher order spatial models, such as SAR-SAR (spatial lag with spatial autoregressive errors) and the generalized nested specification (GNS), i.e., a Spatial Durbin model with spatial autoregressive errors. These specifications are estimated as special cases of spreg.GMM_Error, by including the argument add_wy = True, without or with slx_lags = 1.

In general, these specifications should be avoided, but they are included here for the sake of completeness. As shown by Koley and Bera (2024), the full set of parameters in the GNS model is not identified, and ML cannot be applied. The SAR-SAR model suffers from similar problems, and its ML estimation typically has a very hard time converging, switching back and forth between the estimates for \(\rho\) and \(\lambda\) (this is referred to by Bivand and Piras in the R-spatialreg package as the “banana” problem). As a result, ML estimation of these models is not included in spreg. However, it remains possible to estimate them by means of IV/GMM methods, although the results need to be interpreted with caution: in practice, they often do not make sense and are difficult to interpret.
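For reference, the two specifications can be written out as below. This is the standard textbook formulation rather than something stated in this notebook: \(\rho\) is the spatial lag coefficient and \(\lambda\) the spatial error coefficient, while the symbols \(\beta\), \(\gamma\), \(u\) and \(\varepsilon\), and the use of the same weights matrix \(W\) in both the lag and error parts, are assumptions introduced here for illustration.

```latex
\begin{aligned}
\text{SAR-SAR:}\quad & y = \rho W y + X\beta + u, \qquad u = \lambda W u + \varepsilon \\
\text{GNS:}\quad & y = \rho W y + X\beta + W X\gamma + u, \qquad u = \lambda W u + \varepsilon
\end{aligned}
```

Setting \(\gamma = 0\) in the GNS expression gives back the SAR-SAR model, which is why both can be handled by the same estimation routine with different options.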

Modules Needed

The same modules are needed as for the GMM estimation of the error model: GMM_Error imported from spreg, utilities in libpysal (to open spatial weights and access the sample data set), pandas and geopandas.

[1]:
import warnings
warnings.filterwarnings("ignore")
import os
os.environ['USE_PYGEOS'] = '0'

import numpy as np
import pandas as pd
import geopandas as gpd
from libpysal.io import open
from libpysal.examples import get_path
from libpysal.weights import lag_spatial

from spreg import GMM_Error

Functions Used

  • from pandas/geopandas:

    • read_file

  • from libpysal:

    • io.open

    • examples.get_path

  • from spreg:

    • spreg.GMM_Error

Variable definition and data input

The data set and spatial weights are again from the chicagoSDOH sample data set. They are the same as for the GMM Error estimation:

  • Chi-SDOH.shp,shx,dbf,prj: socio-economic indicators of health for 2014 in 791 Chicago tracts

  • Chi-SDOH_q.gal: queen contiguity weights

To illustrate the methods, the same descriptive model is used as in the ML notebook. It relates the rate of uninsured households in a tract (for health insurance, EP_UNINSUR) to the lack of high school education (EP_NOHSDP), the economic deprivation index (HIS_ct), limited command of English (EP_LIMENG) and the lack of access to a vehicle (EP_NOVEH). This is purely illustrative of a spatial error specification and does not have a particular theoretical or policy motivation.

In an alternative specification, HIS_ct is considered to be endogenous, with, as before, COORD_X and COORD_Y as instruments.

The file names and variable names are set in the usual manner. Any customization for different data sets/weights and different variables should be specified in this top cell.

[2]:
infileshp = get_path("Chi-SDOH.shp")     # input shape file with data
infileq = get_path("Chi-SDOH_q.gal")     # queen contiguity weights created with GeoDa

y_name = 'EP_UNINSUR'
x_names = ['EP_NOHSDP','HIS_ct','EP_LIMENG','EP_NOVEH']
xe_names = ['EP_NOHSDP','EP_LIMENG','EP_NOVEH']
yend_names = ['HIS_ct']
q_names = ['COORD_X','COORD_Y']
ds_name = 'Chi-SDOH'
w_name = 'Chi-SDOH_q'

The read_file and open functions are used to access the sample data set and contiguity weights. The weights are row-standardized and the data frames for the dependent and explanatory variables are constructed. As before, this functionality is agnostic to the actual data sets and variables used, since it relies on the specification given in the initial block above.

[3]:
dfs = gpd.read_file(infileshp)
wq =  open(infileq).read()
wq.transform = 'r'    # row-transform the weights
y = dfs[y_name]
x = dfs[x_names]
yend = dfs[yend_names]
xe = dfs[xe_names]
q = dfs[q_names]

The SAR-SAR Model

Exogenous variables only

GMM estimation of the SAR-SAR model is implemented in spreg.GMM_Error. This requires the standard regression arguments (at a minimum, y, x and w, plus yend and q for the endogenous case), together with add_wy = True. The same three methods are implemented as for generic GMM_Error, but here only the default estimator = "het" will be considered. Also, none of the special options are included (see the GMM Error notebook for details).

GMM-heteroskedastic is the default, so the estimator argument does not need to be specified. The first illustration is for all default settings with only exogenous regressors. As usual, there is the option to use higher order lags for the instruments, but this is not pursued here.

Also, note that in contrast to the standard lag (and spatial Durbin) specifications, there is no spat_impacts option.
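The idea of using spatial lags of the exogenous variables, possibly of higher order, as instruments for the spatially lagged dependent variable can be sketched in plain numpy. The helper `lagged_instruments`, the ring-shaped weights and the random `X_toy` are all hypothetical and introduced only for illustration; this is not how spreg constructs its instruments internally, only a sketch of the general principle.

```python
import numpy as np

def lagged_instruments(X, W, order=1):
    """Stack WX, W^2 X, ..., W^order X column-wise (hypothetical helper)."""
    blocks = []
    current = X
    for _ in range(order):
        current = W @ current      # one more application of the spatial lag
        blocks.append(current)
    return np.hstack(blocks)

# Hypothetical ring of 12 observations: each one has two neighbors,
# so the row-standardized weights are all 0.5
n = 12
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

# Made-up exogenous variables
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(n, 2))

Q1 = lagged_instruments(X_toy, W, order=1)   # first-order lags only
Q2 = lagged_instruments(X_toy, W, order=2)   # first- and second-order lags
print(Q1.shape, Q2.shape)
```

Each additional order adds as many instrument columns as there are explanatory variables, which is one reason higher orders are usually used sparingly.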

[ ]:
sarsar1 = GMM_Error(y,x,w=wq,
                     add_wy = True,
                     name_w=w_name,name_ds=ds_name)
print(sarsar1.summary)

In this example, the AK test in the spatial lag model gave no indication that a spatial error term should be included (see the notebook on IV estimation of the spatial lag model). As a result, it is no surprise that the coefficient \(\lambda\) is not significant. As is typical in the SAR-SAR model, the signs of \(\rho\) and \(\lambda\) tend to be opposite, which is difficult to interpret and may point to an identification problem.

Exogenous and endogenous variables

An extension to include additional endogenous variables is carried out in the standard way. The endogenous variables and associated instruments are listed below the table with estimates. As usual, there is an option to include the spatial lags of the instruments (True by default).

[ ]:
sarsar2 = GMM_Error(y,xe,w=wq,yend=yend,q=q,
                     add_wy = True,
                     name_w=w_name,name_ds=ds_name)
print(sarsar2.summary)

The consideration of endogeneity makes HIS_ct insignificant (as well as EP_NOHSDP). The \(\lambda\) coefficient remains non-significant, but its sign changes.

The GNS Model

The GNS model is treated as an SLX-Error model with an additional spatially lagged dependent variable as a regressor. This is accomplished by setting both slx_lags = 1 (or higher) and add_wy = True in the GMM_Error call. As in the SAR-SAR case, there is no spat_impacts option.
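The structure of the GNS model as an SLX-Error model with an added spatially lagged dependent variable can be checked on simulated data. The sketch below is purely illustrative: the ring-shaped weights, the parameter values (`rho`, `lam`, `beta`, `gamma`) and the random draws are all made up, and the same weights matrix is assumed for the lag and error parts. It generates data from the reduced form and then verifies that subtracting the regressors `[X, WX, Wy]` times their coefficients leaves the spatially autoregressive error.

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 10

# Hypothetical ring: each observation has two neighbors (weights of 0.5)
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

# Made-up parameter values, for illustration only
rho, lam = 0.4, 0.3
beta = np.array([1.0, -0.5])
gamma = np.array([0.2, 0.1])

X = rng.normal(size=(n, 2))
eps = rng.normal(size=n)
I = np.eye(n)

WX = W @ X                                     # SLX terms
u = np.linalg.solve(I - lam * W, eps)          # spatially autoregressive error
y_sim = np.linalg.solve(I - rho * W, X @ beta + WX @ gamma + u)  # reduced form

# Regressors of the structural equation: X, WX and the spatial lag Wy
Z = np.column_stack([X, WX, W @ y_sim])
delta = np.concatenate([beta, gamma, [rho]])

# What remains after removing Z @ delta is the error term u
resid = y_sim - Z @ delta
print(np.allclose(resid, u))
```

Dropping the `WX` block (that is, setting `gamma` to zero) gives the corresponding SAR-SAR data generating process.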

Exogenous variables only

[ ]:
gns1 = GMM_Error(y,x,w=wq,
                     slx_lags = 1, add_wy = True,
                     name_w=w_name,name_ds=ds_name)
print(gns1.summary)

The results are only provided as an illustration of the functionality. Again, the typical pattern emerges of opposite signs for \(\rho\) and \(\lambda\), but now only \(\lambda\) is significant. None of the SLX terms is significant at p=0.01, and of the original regressors, only EP_LIMENG remains significant.

Exogenous and endogenous variables

Endogenous variables are included by specifying yend and q (the x argument is set to xe, for exogenous variables only).

[ ]:
gns2 = GMM_Error(y,xe,w=wq,yend=yend,q=q,
                     add_wy = True,slx_lags = 1,
                     name_w=w_name,name_ds=ds_name)
print(gns2.summary)

In this example, the estimate of \(\rho\) is 0.9, which is highly suspicious. The estimate for \(\lambda\) is now marginally significant and positive. Of the other variables in the model, only EP_LIMENG remains significant.

Again, this highlights the caution that is needed when implementing this model. In general, it should be avoided and the model under consideration should be respecified in a different way.

Practice

Since these models should be avoided, there is not much point in practicing them, other than to gain insight into the often conflicting (and confusing) indications provided by the parameter estimates. The fact that a model is not identified does not mean that no estimates can be obtained. However, those estimates are usually not meaningful.