This page was generated from notebooks/14_IV_estimation_spatial_lag.ipynb. Interactive online version:
Instrumental Variables Estimation - Spatial Lag Model¶
Luc Anselin¶
(revised 09/19/2024)¶
Preliminaries¶
An alternative to maximum likelihood is to tackle the endogeneity of the spatially lagged dependent variable by means of instrumental variables (IV) estimation. This is implemented in the spreg.GM_Lag
class. As before, the estimation of the Spatial Durbin model is achieved through the inclusion of the slx_lags
argument.
Distinct from what is possible for maximum likelihood, other endogenous variables can be included as well, using the familiar yend (for the endogenous variables) and q (for the instruments) arguments.
The treatment in this notebook will focus on the specific properties of the IV estimation. Generic properties of the spatial lag model, such as the different predicted values, residuals, and the impact measures will not be treated in detail again. Technical aspects pertaining to these issues are covered in the maximum likelihood notebook.
Modules Needed¶
As before, the main module is spreg for spatial regression analysis. From this, OLS
and GM_Lag
are imported. In addition, the utilities in libpysal (to open spatial weights and access the sample data set), pandas and geopandas are needed. All of these rely on numpy as a dependency.
The usual numpy set_printoptions
is included as well.
[1]:
import warnings
warnings.filterwarnings("ignore")
import os
os.environ['USE_PYGEOS'] = '0'
import numpy as np
import pandas as pd
import geopandas as gpd
from libpysal.io import open
from libpysal.examples import get_path
from libpysal import weights
from spreg import OLS, GM_Lag
np.set_printoptions(legacy="1.25")
Functions Used¶
from pandas/geopandas:
read_file
from libpysal:
io.open
examples.get_path
weights.distance.Kernel
from spreg:
spreg.OLS
spreg.GM_Lag
Variable definition and data input¶
As in the maximum likelihood notebook, the data set and spatial weights are from the chicagoSDOH sample data set:
Chi-SDOH.shp,shx,dbf,prj: socio-economic indicators of health for 2014 in 791 Chicago tracts
Chi-SDOH_q.gal: queen contiguity weights
Chi-SDOH_k10tri.kwt: triangular kernel weights based on a variable bandwidth with 10 nearest neighbors from
GeoDa
The same descriptive model is used that relates the rate of uninsured households in a tract(for health insurance, EP_UNINSUR) to the lack of high school education (EP_NOHSDP), the economic deprivation index (HIS_ct), limited command of English (EP_LIMENG) and the lack of access to a vehicle (EP_NOVEH). This is purely illustrative of a spatial lag specification and does not have a particular theoretical or policy motivation.
In an alternative specification, HIS_ct is considered to be endogenous, with, as before, COORD_X and COORD_Y as instruments.
The file names and variable names are set in the usual manner. Any customization for different data sets/weights and different variables should be specified in this top cell.
[2]:
infileshp = get_path("Chi-SDOH.shp") # input shape file with data
infileq = get_path("Chi-SDOH_q.gal") # queen contiguity weights created with GeoDa
infilek = get_path("Chi-SDOH_k10tri.kwt") # triangular kernel weights
y_name = 'EP_UNINSUR'
x_names = ['EP_NOHSDP','HIS_ct','EP_LIMENG','EP_NOVEH']
xe_names = ['EP_NOHSDP','EP_LIMENG','EP_NOVEH']
yend_names = ['HIS_ct']
q_names = ['COORD_X','COORD_Y']
ds_name = 'Chi-SDOH'
w_name = 'Chi-SDOH_q'
wk_name = 'Chi-SDOH_k10tri'
The read_file
and open
functions are used to access the sample data set and contiguity weights. The contiguity weights are row-standardized, the class
of the kernel weights adjusted, and the data frames for the dependent and explanatory variables are constructed. As before, this functionality is agnostic to the actual data sets and variables used, since it relies on the specification given in the initial block above.
[3]:
dfs = gpd.read_file(infileshp)
wq = open(infileq).read()
wq.transform = 'r' # row-transform the weights
wk = open(infilek).read()
wk.__class__ = weights.distance.Kernel
y = dfs[y_name]
x = dfs[x_names]
yend = dfs[yend_names]
xe = dfs[xe_names]
q = dfs[q_names]
OLS and SLX with Spatial Diagnostics¶
For easy reference, standard OLS and SLX regressions with spatial diagnostics are repeated here to provide a point of reference. The results are identical to those reported in the maximum likelihood notebook
Refer to the specific OLS notebook for further details.
OLS¶
[ ]:
ols1 = OLS(y,x,w=wq,spat_diag=True,moran=True,
name_w=w_name,name_ds=ds_name)
print(ols1.summary)
The specification achieves an acceptable \(R^2\) of about 0.63 and all coefficients are positive and highly significant.
The non-spatial diagnostics suggest non-normality as well as a hight degree of heteroskedasticity. There is no problem with multicollinearity.
The spatial diagnostics against the SARERR alternatives show very significant LM-Lag and LM-Error, but of the two robust tests, only RLM-Lag is highly significant (RLM-Error only at p < 0.03). Hence, there is a strong indication that a Lag rather than an Error alternative may be appropriate. While the joint LM test is also highly significant, this is likely due to a strong one-sided (Lag) alternative.
Interestingly, the diagnostics against a spatial Durbin alternative strongly support the latter as well. Both LM tests and their robust forms are highly significant, and so is the joint test. Moreover, the value for the robust forms of the test is smaller than the original, which is the expected behavior (although not always reflected in empirical practice).
In sum, in addition to a spatial Lag model as an alternative, the spatial Durbin specification deserves consideration as well.
SLX¶
[ ]:
slx1 = OLS(y,x,w=wq,slx_lags=1,spat_diag=True,moran=True,
name_w=w_name,name_ds=ds_name)
print(slx1.summary)
Relative to the classic regression model, the fit improves slightly, but the constant, EP_NOHSDP and HIS_CT become non-significant at p = 0.01 (they are marginally signifcant at p=0.05). All but one coefficient of the SLX terms are significant (W_EP_NOVEH is not). The signs and magnitudes of the SLX coefficients relative to their unlagged counterparts remain a bit confusing. Only for EP_LIMENG and W_EP_LIMENG are they the same, with the lag coefficient smaller than the unlagged one, in accordance with Tobler’s law. The coefficient for W_HIS_ct is significant and larger than that of HIS_ct, while the latter is not significant at p = 0.01. In other words, the interpretation of these results in terms of distance decay and Tobler’s law may be a bit problematic.
In terms of diagnostics, there is a slight problem with multicollinearity (often the case in SLX specifications), strong non-normality and evidence of heteroskedasticity. Moran’s I is significant, as are both LM-tests, but neither of the robust forms is significant. Based on the relative magnitudes of the test statistics, there is a slight indication of a possible Lag alternative, i.e., a spatial Durbin specification. However, this indication is not as strong as that provided by the LM-SDM test statistics in the classic specification.
IV Estimation of the Lag Model¶
Principle¶
The lag model \(y = \rho Wy + X\beta + u\) can also be written as \(y = Z\delta + u\), where \(Z = [ X, Wy ]\) and the coefficient vector is rearranged as \(\delta = [\beta, \rho ]\). This setup is the same as that considered in the treatment of endogeneity by means of 2SLS estimation.
The spatially lagged dependent variable is indeed endogenous. Based on the reduced form,
\(Wy\) follows as:
As a result, after some algebra, \(E[(Wy)'u] = tr[W(I - \rho W)'^{-1}].E[uu'] \neq 0\). This follows because of the presence of non-zero diagonal terms in the inverse matrix (see the discussion of impacts in the maximum likelihood notebook). Consequently, one of the fundamental assumptions of OLS estimation is violated, and Wy must be treated as an endogenous variable.
This is carried out by means of 2SLS estimation, whereby the main question becomes the choice of good (optimal) instruments for Wy. The conditional expectation of Wy, given X, provides the basis for this:
As a result, the spatially lagged explanatory variables \(WX, W^2X, \dots\) suggest themselves as instruments. Since the \(X\) are assumed to be uncorrelated with the error terms, so are the \(WX\). Also, from the conditional expectation, it follows that they are related to the endogenous Wy, which satisfies a second requirement for instruments. Note that WX is not applied to the constant term, since this would result in perfect multicollinearity.
Estimation then follows as a straightforward application of 2SLS, referred to as spatial 2SLS, or S2SLS. The estimator, variance-covariance matrix, and robust standard errors are the same as for the classis 2SLS (see the 2SLS notebook for details).
In a straightforward manner, additional endogenous variables with associated instruments can be incorporated as well. One question remains whether the additional instruments should be lagged or not. In spreg.GM_Lag
, this is handled by the lag_q
option (the default is lag_q = True
).
Implementation¶
IV estimation of the spatial lag model is carried out by means of spreg.GM_Lag
. This is a customized implementation of the two stage least squares estimation, with the instruments for the spatially lagged dependent variable computed internally. As mentioned, additional endogenous variables can be specified as well.
The default setup requires the dependent variable, y
, the explanatory variables (without a constant term), x
, and the spatial weights w
. The instruments are the spatially lagged explanatory variables, WX. They do not need to be specified separately. The order of spatial weights used as instruments is set by means of w_lags
(the default is w_lags = 1
).
As is customary, the main results are listed using the summary
method. For now, no impacts are listed by setting spat_impacts = None
(note that spat_impacts = "simple"
is the default). The AK-test for remaining residual spatial autocorrelation is included by default (i.e., spat_diag = True
).
[ ]:
lag1 = GM_Lag(y,x,w=wq,
name_w=w_name,name_ds=ds_name,
spat_impacts = None)
print(lag1.summary)
In this example, the estimates are very similar to the ML results (see the ML notebook). This is not always the case, but it is encouraging when it is (in the sense of not having other misspecification issues). The autoregressive coefficient of 0.377 compares to 0.392 in the ML case and is highly significant. As in the ML case, the coefficients of the constant term and of EP_NOHSDP become non-significant. The other regression coefficients are of similar magnitudes as in the ML case and thus also generally smaller than the corresponding OLS estimates.
The results show the Instrumented variable as W_EP_UNINSUR and the Instruments as the spatial lags of the explanatory variables.
As in the ML case, there are two measures of fit, the Pseudo R-squared (based on the naive residuals) and the Spatial Pseudo R-squared (based on the predicted values from the reduced form). The results are essentially the same as for the ML estimates, e.g., respectively 0.684 (vs. 0.685) and 0.647 (vs. 0.647). In constrast to the ML results, there is no Likelihood, AIC or SC.
Note that with S2SLS estimation, it is possible for the spatial autoregressive parameter to take on a value larger than one. Unlike maximum likelihood estimation, where the parameter space is constrained in the optimization routine, there is no such constraint in S2SLS. When the parameter estimate is outside the bounds, a warning is given and any properties based on the reduced form predicted values will not be computed. This includes predy_e and e_pred, as well as the Spatial Pseudo R-squared and the spatial impacts (for an example, see below).
The AK test shows no evidence of remaining residual spatial autocorrelation.
Instrument lag order¶
The order of the spatial lags used for the instruments is determined by the w_lags
argument. By default, this is set to 1. The use of higher order lags results in greater precision, but often at the cost of an increase in multicollinearity. In practice, using lags larger than 2 is not recommended.
The effect of setting w_lags = 2
is illustrated next.
[ ]:
lag2 = GM_Lag(y,x,w=wq,
name_w=w_name,name_ds=ds_name,
w_lags = 2,
spat_impacts = None)
print(lag2.summary)
The additional instruments are listed below the parameter estimates with the prefix W2. In this example, the effect of including the second order lags is minimal. The autoregressive coefficient becomes 0.381 (vs. 0.377) and the standard errors are marginally smaller. The fit is unchanged.
Predicted values and residuals¶
The treatment of predicted values and residuals is the same as for maximum likelihood estimation. The naive results are stored in the regression object attributes predy and u, and the values that are based on the reduced form prediction are in predy_e and e_pred.
Robust Standard Errors¶
As in the standard 2SLS estimation, it is possible to obtain robust standard errors by means of the robust = "white"
and robust = "hac"
options. Given the common presence of heteroskedastic errors in cross-sectional regression, this is highly recommended. The results are given below. As before, the hac
option requires that a kernel spatial weights object is specified as the argument gwk
, with, optionally, name_gwk
.
White standard errors¶
[ ]:
lag3 = GM_Lag(y,x,w=wq,
name_w=w_name,name_ds=ds_name,
robust = 'white',
spat_impacts = None)
print(lag3.summary)
The estimates are the same as before, but the standard errors are slightly larger. For example, for the spatial autoregressive coefficient, the standard error becomes 0.07377, compared to 0.06455. The overall impact is minimal.
HAC standard errors¶
[ ]:
lag4 = GM_Lag(y,x,w=wq,
name_w=w_name,name_ds=ds_name,
robust = 'hac',gwk=wk,name_gwk=wk_name,
spat_impacts = None)
print(lag4.summary)
In this case, the HAC standard errors are very similar to the heteroskedasticity-robust ones, suggesting the main source of misspecification comes from the latter. This is in accordance with the lack of significance of the AK-test. The standard error for the spatial autoregressive coefficient, 0.07238, is even slightly smaller than the White standard error (0.07337).
Spatial Multipliers - Impacts¶
Similar to what holds for ML estimation, impact measures are computed for spatial lag models estimated by means of instrumental variables. This is implemented through the spat_impacts
argument. The default setting is spat_impacts = "simple"
for the Kim et al approach (see the ML notebook for technical details). As before, the other options are full
, power
, all
or `None
.
In the example below, the argument is set as spat_impacts = ["simple","full"]
. Only classic standard errors are considered. Since the multipliers and impacts only depend on the coefficient estimates, the type of standard error is immaterial.
Note that the reported impacts are only average effects. See the spatial multipliers notebook for a more extensive analysis of the associated spatial pattern.
[ ]:
lag5 = GM_Lag(y,x,w=wq,
name_w=w_name,name_ds=ds_name,
spat_impacts=['simple','full'])
print(lag5.summary)
The interpretation of the direct, indirect and total effects is the same as before. Since the coefficient estimates were very similar to the results obtained for ML, the impacts are similar as well.
Additional Endogenous Variables¶
Additional endogenous variables are included by means of the yend and q arguments, in the same way as for classic 2SLS estimation. The only relevant option in this regard is whether the instruments should be lagged as well. The default is to include their spatial lags, through lag_q = True
. In practice, there is no good reason not to lag them, since they provide additional information. For clarity, the lag_q
argument is included, even though it is not needed, since it is the
default.
Note that here x = xe
, which contains only the exogenous variables.
[ ]:
lag6 = GM_Lag(y,x=xe,w=wq,yend=yend,q=q,
lag_q = True,
name_w=w_name,name_ds=ds_name,
spat_impacts=['simple','full'])
print(lag6.summary)
The output listing now includes HIS_ct as Instrumented, and both COORD_X and COORD_Y, as well as their spatial lags among the Instruments.
The impact on the estimates is relatively minor, although the spatial autoregressive coefficient decreases slightly to 0.32995 (compared to 0.37698). The treatment of HIS_ct as endogenous results in its coefficient becoming only marginally significant (p=0.02). The model impacts reflect the slightly different coefficient estimates.
Note that impacts are only computed for the exogenous variables. In a strict sense, the additional endogenous variables should not be part of the reduced form. Moreover, since they are not determined in a fully simultaneous equation system, there is no practical way to include other exogenous variables from such a system. Therefore, the multiplier effect of additional endogenous variables is not considered.
IV Estimation of Spatial Durbin Model¶
IV estimation of the Spatial Durbin model is a special case of spreg.GM_Lag
, with the additional argument of slx_lags=1
(or a larger value). Everything else remains the same.
In the example, spat_impacts
is set to ["simple","full"]
(note, the default setting remains spat_impacts = "simple"
).
Another default setting is spat_diag = True
, which yields the results for the both the AK test and the Common Factor Hypothesis test. To avoid these tests, spat_diag
must be set to False
. In the illustration, both arguments are listed explicitly for clarity.
[ ]:
spdur1 = GM_Lag(y,x,w=wq,slx_lags=1,
name_w=w_name,name_ds=ds_name,
spat_impacts = ['simple','full'],
spat_diag = True)
print(spdur1.summary)
Note how the list of Instruments now consists of the second order spatial lags, even though w_lags = 1
. This is because the spatial Durbin model already includes the first order lags as explanatory variables. Using first order lags as instruments would result in perfect multicollinearity. This is detected internally in GM_Lag
, so that the computation of the lagged instruments is adjusted accordingly.
As in the case of ML estimation, the inclusion of the lagged explanatory variables has quite an effect, even though only W_EP_NOHSDP turns out to be significant (for ML, W_HIS_ct was significant as well). Also, the spatial autoregressive coefficient (0.509) is no longer significant. Three of the four lag coefficients (including the non-significant ones) have a negative sign, the opposite of the original regression coefficients. This raises the suspicion that the proper specification may
be a spatial error model. The results of the Common Factory Hypothesis Test bear this out. At 4.083, it is not significant, which means that the common factor constraint can not be rejected. This contrasts greatly with the result for ML estimation, where the constraint was strongly rejected. However, given the lack of significance of the spatial lag terms, this needs to be interpreted with caution. A better strategy would be to consider further refinements of the model specification by
eliminating some lag terms by means of slx_vars
, as in the SLX model. This is not further pursued here.
The impacts are computed as before. Also, robust standard errors can be implemented in the usual fashion.
Spatial Durbin model with endogenous variables¶
Additional endogenous variables can be included in a spatial Durbin specification in the same way as in the standard spatial lag model. With spat_diag = True
, the default, only the AK test is produced. Since there is no spatial lag for the endogenous variable, a common factor test is not meaningful.
The example below uses the same specification as in the spatial lag model.
[ ]:
spdur2 = GM_Lag(y,x=xe,w=wq,yend=yend,q=q,
lag_q = True,
slx_lags=1,
name_w=w_name,name_ds=ds_name,
spat_impacts=['simple','full'],
spat_diag = True)
print(spdur2.summary)
In this example, the estimate for the spatial autoregressive coefficient turns out to be larger than one (1.01654), which is outside the accepted parameter space. This raises a warning and precludes the computation of the spatial pseudo R-squared and the model impacts.
When this happens, further refinement of the model and/or estimation is required. There are several options, such as using higher lags for the instruments, using different/more instruments for the additional endogenous variable, dropping some lag terms (using slx_vars
), or changing the original specification altogether. This is left as an exercise.
Practice¶
IV estimation of the spatial lag model is sensitive to several aspects of the model specification. However, in large(r) samples, it generally yields more robust results than maximum likelihood, especially when using robust standard errors.
As practice, different model specifications could be considered, including adding additional explanatory variables, selectively removing some lag terms, and using different spatial weights. Make sure to carefully consider the interpretation of the estimated coefficients and associated direct and indirect effects, as well as the robustness of the standard errors.