This page was generated from notebooks/7_spatial_models.ipynb.
Spatial Model Specifications¶
Luc Anselin¶
(revised 09/07/2024)¶
Preliminaries¶
In this notebook, the basic model specifications that incorporate spatial dependence into a linear regression are introduced. This is implemented through the use of spatially lagged variables. The spatial lag is applied to the dependent variable, as \(Wy\), to the explanatory variables (excluding the constant term), as \(WX\), and to the error terms, as \(We\). For technical details, see the relevant chapters in Anselin and Rey (2014), Modern Spatial Econometrics in Practice.
As an illustration, the effect of a spatial misspecification will be investigated. Specifically, the impact of spatial effects on the classical OLS estimates will be assessed for the case where such effects are present in the data generating process (DGP) but ignored in the regression specification. To accomplish this, some simple simulation experiments are carried out.
Prerequisites¶
Familiarity with OLS estimation in spreg is assumed, as covered in the OLS notebook. For the graphs, some familiarity with matplotlib may be useful, although it is not strictly needed to replicate the graphs used here.
Modules Needed¶
The main module for spatial regression in PySAL is spreg. In addition, libpysal is needed for spatial weights manipulation, and pandas for data frame manipulation. In the current illustrations, geopandas is not needed.
In addition, in order to plot the results of the simulation experiments, seaborn will be used, which in turn relies on matplotlib as a dependency. While everything is actually matplotlib code under the hood, seaborn is a bit more intuitive and makes it easier to achieve simple results. An in-depth coverage of the seaborn functionality is beyond the current scope, but the illustrations given here should be enough to get some decent graphs. For full details, see https://seaborn.pydata.org.
Finally, the module time is imported to provide some timing results (optional).
[7]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
import seaborn as sns
from libpysal import weights
from spreg import OLS, make_x, make_xb, make_wx, make_wxg, make_error, \
dgp_lag, dgp_sperror, dgp_slx, dgp_spdurbin
Functions Used¶
from numpy:
random.default_rng
zeros
from pandas:
DataFrame
describe
melt
from seaborn:
displot
from matplotlib.pyplot:
show
from libpysal:
weights.lat2W
w.transform
w.n
from spreg:
OLS
make_x
make_xb
make_wx
make_wxg
make_error
dgp_lag
dgp_sperror
dgp_slx
dgp_spdurbin
Files and Variables¶
In this notebook, no actual data are used, but data sets will be created by means of simulation.
Model Parameters¶
The various model parameters are set here, so that it is easy to replicate the experiments for different sample sizes and coefficient values.
gridx: the number of cells in the horizontal dimension of a regular lattice of dimension gridx x gridy – set to 20
gridy: the number of cells in the vertical dimension of a regular lattice of dimension gridx x gridy – set to 20
b1: a list with regression parameters (includes a coefficient for the constant term as the first element) – set to 1, 1
rndseed: the random seed to ensure reproducibility – set to 123456789
reps: the number of replications – set to 1000
rhovals: a list with spatial autoregressive coefficients \(\rho\) for the lag variables Wy – set to [0, 0.2, 0.5, 0.7, 0.9]
lamvals: a list with spatial coefficients \(\lambda\) for the error lag variables We – set to [0, 0.2, 0.5, 0.7, 0.9]
gamvals: a list with coefficients for the SLX variable (WX) – set to [0.1, 0.3, 0.5, 0.7, 0.9]
gamma: coefficient for WX in the Spatial Durbin Model – set to 0.5
[8]:
gridx = 20
gridy = 20
b1 = [1, 1]
rndseed = 123456789
reps = 1000
rhovals = [0, 0.2, 0.5, 0.7, 0.9]
lamvals = [0, 0.2, 0.5, 0.7, 0.9]
gamvals = [0.1, 0.3, 0.5, 0.7, 0.9]
gamma = 0.5
Spatial Weights¶
Row-standardized queen contiguity spatial weights are constructed for a regular lattice. The number of observations is the product of the row and column dimensions of the grid. The libpysal command weights.lat2W with rook=False is used to obtain queen weights.
[ ]:
w = weights.lat2W(gridx,gridy,rook=False)
w.transform = 'r'
n = w.n
n
X Matrices¶
The X matrix is constructed using the make_x command from the dgp module in spreg, with the default uniform distribution. Next, \(X\beta\) is computed with make_xb and \(WX\) with make_wx. This uses the X matrix (without a constant column), the spatial weights, and the specified regression parameters as inputs. The resulting matrices will be used as input to create the dependent vector \(y\) for specific data generating processes (DGP). For these DGP, the regression coefficients and random seed specified at the top are used.
Note that in this example make_x will make a one-column vector, since there is only one slope coefficient. In make_xb, a constant column is added automatically. So, whereas a constant column is not included in x1, the result of make_x, its coefficient must be included in the coefficient list b1 to compute \(X\beta\) correctly. The result is a column vector.
[10]:
# Create X
rng=np.random.default_rng(seed=rndseed) # set seed for X
x1 = make_x(rng,n,mu=[0],varu=[6],method="uniform")
xb1 = make_xb(x1,b1) # no constant in x1, but a coefficient for the constant in b1
wx1 = make_wx(x1,w) # default first order no constant
Spatial Lag Model¶
The spatial lag model takes on the form: \begin{equation*} y = \rho Wy + X\beta + u, \end{equation*} where \(Wy\) is the spatial lag term, \(\rho\) is the spatial autoregressive coefficient, \(\beta\) are the regression coefficients, and \(u\) is an error vector.
The dependent variable for a spatial lag model is generated by means of spreg.dgp_lag. This uses the reduced form for the spatial lag model:
\begin{equation*} y = (I - \rho W)^{-1} X\beta + (I - \rho W)^{-1}u. \end{equation*}
Interest centers on finding out what happens to the estimated coefficient \(\hat{\beta}\) when the true DGP is the spatial lag model, but the estimation ignores the spatial lag term and uses OLS. In other words, \(\beta\) is estimated in the regression \(y = X\beta + u\). The properties of the OLS estimates in this misspecified regression are investigated by means of a small simulation.
The simulation consists of the following steps:
initialize the result matrix for the estimates of \(\beta\)
in each step of the replications, generate \(y\) from the DGP and obtain \(\hat{\beta}\) from an OLS estimation
run the simulation loop over all replications and spatial parameters and collect the results
turn the results matrix into a DataFrame
compute descriptive statistics
plot the distribution of parameter estimates for different values of the spatial parameter
For greatest efficiency, these steps could all be combined in one big function, but to get a good sense of what is going on in each step, for now they are kept separate.
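As an illustration, a minimal sketch of what such a combined function could look like is given below. The function name run_lag_sim is hypothetical (not part of spreg), and the sketch assumes the objects created above (w, n, xb1, x1) are available.
[ ]:
def run_lag_sim(rhovals,reps,seed=rndseed):
    # initialize the results matrix: one column per rho, one row per replication
    best = np.zeros((reps,len(rhovals)))
    rng = np.random.default_rng(seed=seed)
    for r in range(len(rhovals)):
        for i in range(reps):
            u = make_error(rng,n)               # standard normal error vector
            y = dgp_lag(u,xb1,w,rhovals[r])     # y from the spatial lag DGP
            best[i,r] = OLS(y,x1,nonspat_diag=False).betas[1]
    # convert to a data frame with meaningful column names
    return pd.DataFrame(best,columns=["rho"+str(r) for r in rhovals])

# usage: results = run_lag_sim(rhovals,reps)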
Initialize results matrix¶
The results matrix has one column for each value of \(\rho\) and one row for each replication, so it is of dimension reps by the number of \(\rho\) values.
[11]:
best = np.zeros((reps,len(rhovals)))
Simulation loop¶
The first step is to initialize the random seed for reproducibility. The function spreg.make_error is used to generate a standard normal random error vector. Then, together with the computed \(X\beta\), the error term is used to generate \(y\) by means of spreg.dgp_lag.
The \(\hat{\beta}\) coefficient is extracted as the second element of the betas attribute of the regression object obtained from spreg.OLS, and then entered in the results matrix. Running 1000 replications may take a minute or so, depending on hardware.
[ ]:
t0 = time.time()
rng=np.random.default_rng(seed=rndseed)
for r in range(len(rhovals)):
    for i in range(reps):
        u = make_error(rng,n)
        y = dgp_lag(u,xb1,w,rhovals[r])
        reg = OLS(y,x1,nonspat_diag=False)
        best[i,r] = reg.betas[1]
t1 = time.time()
print("time in minutes: ",(t1-t0)/60.0)
Results data frame¶
By means of Python list comprehension, a list of meaningful column headers is created, related to the spatial parameter values. The result array is then converted to a pandas DataFrame using this list as the column names.
[ ]:
rr = ["rho"+str(r) for r in rhovals]
results = pd.DataFrame(best,columns=rr)
results.head()
Descriptive statistics¶
The default descriptive statistics are obtained by means of the describe method of the pandas DataFrame. Of most interest are the mean and the standard deviation (std). Any difference between the mean and the true value of \(\beta\) indicates bias, whereas changes in the standard deviation suggest a change in precision with values of \(\rho\).
[ ]:
results.describe()
The descriptive statistics for this example illustrate how the estimate for \(\beta\) becomes more and more biased with increasing values of \(\rho\), moving from a mean of 1.002 under the null to a mean of 1.389 for \(\rho = 0.9\). In addition, the standard deviation of the estimate increases as well, from 0.0198 under the null to 0.0409 for \(\rho = 0.9\), more than double (which also means that the variance is more than four times as large).
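The bias can also be computed directly, as the difference between the column means and the true value of \(\beta\) (the second element of b1):
[ ]:
# bias of the OLS estimates for each value of rho
results.mean() - b1[1]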
Graphing the results¶
Using seaborn, it is very straightforward to plot the distribution of a column in a data frame by means of the sns.displot command, specifying kind="kde". The other arguments are the name of the data frame (here, results) and the variable for the x-axis, for example rho0.5. The command plt.show() is included to show the plot.
[ ]:
sns.displot(results, x="rho0.5", kind="kde")
plt.show()
The plot illustrates how the mean is no longer centered on the true value of 1.000, but instead lies around 1.06.
Plotting the distribution of the estimates for all the spatial parameters in a single plot is a little trickier and requires the use of the pandas melt functionality. As it stands, the results data frame is in so-called wide format, with a different column for each value of \(\rho\). seaborn needs this in so-called long format, or tidy format (in R terminology), where all the \(\beta\) estimates form one long column, with an additional variable that gives the corresponding value of \(\rho\). In a sense, each individual estimation result becomes a separate observation, with a value for \(\rho\) and a value for the \(\beta\) estimate.
To accomplish this, two arguments need to be set: var_name, the name of the new column that will contain the values of \(\rho\), and value_name, the name of the column with the regression coefficient that matches each replication-rho combination. In the example, var_name becomes "rho", since each of the columns contains the prefix rho, and value_name becomes "b". The result is a long data frame.
[ ]:
reslong = results.melt(var_name='rho',value_name='b')
reslong
At this point, the sns.displot command can be applied to the new data frame with x="b", differentiated by the value of \(\rho\) by specifying hue="rho". As before, kind="kde" yields a kernel density curve. Other parameters can be set as well, but that is beyond the current scope.
[ ]:
sns.displot(reslong,x="b",hue="rho",kind="kde")
plt.show()
The plots clearly illustrate both the increasing bias and the larger variance associated with growing values of the spatial autoregressive parameter.
Spatial Error Model¶
The same exercise is now repeated for a spatial error specification, using spreg.dgp_sperror. The error model is:
\begin{equation*} y = X\beta + u, \quad u = \lambda Wu + e, \end{equation*}
where \(\lambda\) is the spatial autoregressive coefficient for the error vector.
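For an autoregressive error process, this implies the reduced form:
\begin{equation*} y = X\beta + (I - \lambda W)^{-1} e. \end{equation*}
Since the spatial multiplier only affects the error term, and not \(X\beta\), OLS can be expected to remain unbiased, although its precision is affected, as the simulation below illustrates.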
The values for \(\lambda\) are specified in lamvals at the top of the notebook.
The default error process in spreg.dgp_sperror is autoregressive, specified by the option model='sar' (since it is the default, it does not have to be specified, but it is included here for clarity).
Note how the parameter values have been changed to lamvals and the column names in the data frame have the lam prefix. In all other respects, the logic is the same as in the previous experiment.
[ ]:
t0 = time.time()
best = np.zeros((reps,len(lamvals)))
rng=np.random.default_rng(seed=rndseed)
for r in range(len(lamvals)):
    for i in range(reps):
        u = make_error(rng,n)
        y = dgp_sperror(u,xb1,w,lamvals[r],model="sar")
        reg = OLS(y,x1,nonspat_diag=False)
        best[i,r] = reg.betas[1]
t1 = time.time()
print("time in minutes: ",(t1-t0)/60.0)
rr = ["lam"+str(r) for r in lamvals]
results = pd.DataFrame(best,columns=rr)
results.describe()
In contrast to the spatial lag model, the estimates for \(\beta\) remain unbiased, with the mean centered on the correct value of 1.0. However, as the spatial parameter increases, the standard error goes from 0.0198 under the null to 0.0409 for \(\lambda = 0.9\), more than double (so that the variance is roughly four times as large).
A graph can be created for all values of \(\lambda\) in the same way as before.
[ ]:
reslong = results.melt(var_name='lam',value_name='b')
sns.displot(reslong,x="b",hue="lam",kind="kde")
plt.show()
The pattern is confirmed by the graphs, with a lower curve corresponding to a larger variance. Note how for \(\lambda = 0.2\), the effect is negligible, with even a slightly smaller standard error. The impact becomes really pronounced for larger values of \(\lambda\).
Moving Average Errors¶
For comparison, the simulations are also run for a spatial moving average error process. In this model, the error vector is:
\begin{equation*} u = \lambda We + e, \end{equation*} with \(e\) as a standard normal error vector.
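Equivalently, the error vector can be written as:
\begin{equation*} u = (I + \lambda W) e, \end{equation*}
so that, in contrast to the autoregressive case, there is no spatial multiplier: the spillover in the errors is limited to first-order neighbors. This suggests a much smaller impact on the precision of the OLS estimates, which is confirmed by the results below.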
The commands to carry out the simulation are all the same, except that in spreg.dgp_sperror, the model argument should be set to "ma".
[ ]:
t0 = time.time()
best = np.zeros((reps,len(lamvals)))
rng=np.random.default_rng(seed=rndseed)
for r in range(len(lamvals)):
    for i in range(reps):
        u = make_error(rng,n)
        y = dgp_sperror(u,xb1,w,lamvals[r],model="ma")
        reg = OLS(y,x1,nonspat_diag=False)
        best[i,r] = reg.betas[1]
t1 = time.time()
print("time in minutes: ",(t1-t0)/60.0)
rr = ["lam"+str(r) for r in lamvals]
results = pd.DataFrame(best,columns=rr)
results.describe()
Here, the effect on the variance is much less pronounced, with the standard deviation going from 0.020 under the null to 0.022 for \(\lambda = 0.9\). The same can be observed in the frequency graphs.
[ ]:
reslong = results.melt(var_name='lam',value_name='b')
sns.displot(reslong,x="b",hue="lam",kind="kde")
plt.show()
SLX¶
The same approach is followed to assess the effect of ignoring a spatially lagged explanatory variable (SLX) on the regression slope coefficient. Formally, the SLX specification is:
\begin{equation*} y = X \beta + WX \gamma + u, \end{equation*}
where \(WX\) does not contain a constant term (the spatial lag of a constant term is the same constant, which would create perfect multicollinearity) and \(\gamma\) is a vector of parameters.
The DGP for the SLX model is obtained from spreg.dgp_slx. Both xb1 and wxg1 need to be passed as arguments. The latter is the product \(WX \gamma\), which must be computed for each different value of \(\gamma\).
To investigate a range of values for \(\gamma\), some slight adjustments to the code are needed: in the simulation loop, wxg1 must be updated for each new value of \(\gamma\). Otherwise, the logic remains the same as before.
[ ]:
t0 = time.time()
best = np.zeros((reps,len(gamvals)))
rng=np.random.default_rng(seed=rndseed)
for r in range(len(gamvals)):
    wxg1 = make_wxg(wx1,gamvals[r])   # update WX gamma for each value of gamma
    for i in range(reps):
        u = make_error(rng,n)
        y = dgp_slx(u,xb1,wxg1)
        reg = OLS(y,x1,nonspat_diag=False)
        best[i,r] = reg.betas[1]
t1 = time.time()
print("time in minutes: ",(t1-t0)/60.0)
rr = ["gam"+str(r) for r in gamvals]
results = pd.DataFrame(best,columns=rr)
results.describe()
This result illustrates the classic omitted variable bias effect, since the ignored \(WX\) is nothing but an omitted regressor. The effect is a slight bias in the \(\beta\) estimate that becomes larger with larger \(\gamma\); the effect on the variance is minimal.
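To see why, consider the standard omitted variable result. When the true DGP is \(y = X\beta + WX\gamma + u\) but \(y\) is regressed on \(X\) alone, the expected value of the OLS estimate is:
\begin{equation*} E[\hat{\beta}] = \beta + (X'X)^{-1}X'(WX)\gamma, \end{equation*}
so the bias is proportional to \(\gamma\) and depends on the degree to which \(WX\) is correlated with \(X\).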
This behavior is nicely illustrated by the frequency plots, which show a gradual shift away from the value of 1.0, but in contrast to what happens for the spatial lag model, all the curves have basically the same shape, indicating a minimal effect on the variance.
[ ]:
reslong = results.melt(var_name='gam',value_name='b')
sns.displot(reslong,x="b",hue="gam",kind="kde")
plt.show()
Spatial Durbin¶
The final model illustrated here is a Spatial Durbin model, which includes both a spatially lagged dependent variable, \(Wy\), and spatially lagged explanatory variables (SLX), \(WX\):
\begin{equation*} y = \rho Wy + X\beta + WX \gamma + u. \end{equation*}
The dependent variable is generated by means of the reduced form, similar to what is the case for a spatial lag model, but now including the SLX term as part of the regression:
\begin{equation*} y = (I - \rho W)^{-1} (X\beta + WX\gamma) + (I - \rho W)^{-1} u. \end{equation*}
The effect of a misspecified spatial Durbin model is illustrated for a range of spatial autoregressive coefficients. For the sake of simplicity, \(\gamma\) is kept fixed, but of course, there could be a double loop over values of both \(\rho\) and \(\gamma\). The relevant function is spreg.dgp_spdurbin. It takes as arguments an error term, xb, wxg, the spatial weights w, and the spatial parameter \(\rho\) (\(\gamma\) is used in the calculation of wxg, outside of the loop).
[ ]:
t0 = time.time()
best = np.zeros((reps,len(rhovals)))
rng = np.random.default_rng(seed=rndseed)
wxg1 = make_wxg(wx1,gamma)   # gamma fixed at 0.5, computed outside the loop
for r in range(len(rhovals)):
    for i in range(reps):
        u = make_error(rng,n)
        y = dgp_spdurbin(u,xb1,wxg1,w,rhovals[r])
        reg = OLS(y,x1,nonspat_diag=False)
        best[i,r] = reg.betas[1]
t1 = time.time()
print("time in minutes: ",(t1-t0)/60.0)
rr = ["rho"+str(r) for r in rhovals]
results = pd.DataFrame(best,columns=rr)
results.describe()
The effect on the bias of the regression coefficient is similar in nature to, but greater than, that for the simple spatial lag model; the standard error changes in very much the same way, essentially doubling over the range of \(\rho\).
This is nicely illustrated by the series of frequency curves, with the right-most curve centered on 1.6.
[ ]:
reslong = results.melt(var_name='rho',value_name='b')
sns.displot(reslong,x="b",hue="rho",kind="kde")
plt.show()
Other Options¶
The dgp module contains several other options, such as:
SLX error model with SAR and MA errors: spreg.dgp_slxerror
SAR-Error model: spreg.dgp_lagerr
SARMA model: spreg.dgp_lagerr with model="ma"
GNS SAR model: spreg.dgp_gns
GNS MA model: spreg.dgp_gns with model="ma"
In addition, the random errors can be generated for different variance values and, besides the default normal distribution, for a laplace, cauchy, or lognormal distribution by specifying the argument method.
Finally, the X matrix can be generated with a uniform distribution (the default), but also as independent normal vectors (method = 'normal') or as bivariate correlated vectors (method = 'bivnormal'). While the latter can only be implemented for two explanatory variables, the number of regressors for the other options is not constrained.
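As a minimal sketch of these options (reusing the rng and n defined above, with only the method argument varied):
[ ]:
# errors drawn from a laplace distribution instead of the default normal
u_lap = make_error(rng,n,method="laplace")
# X generated as an independent normal vector instead of the default uniform
x_norm = make_x(rng,n,mu=[0],varu=[6],method="normal")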
Practice¶
Assess the properties of OLS for different forms of spatial misspecifications. For example, you could examine the effect of negative spatial coefficients, or of multiple spatially lagged regressors, as well as the misspecification caused by the models that were not included in the examples, such as SAR-Error and SLX-Error. You could also experiment with different sample sizes and different spatial weights, especially weights with different connectedness characteristics.
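For instance, one possible starting point for the first suggestion is to rerun the spatial lag simulation with negative values of \(\rho\); the specific values below are just one choice.
[ ]:
rhoneg = [0,-0.2,-0.5,-0.7,-0.9]   # hypothetical negative counterparts of rhovals
best = np.zeros((reps,len(rhoneg)))
rng = np.random.default_rng(seed=rndseed)
for r in range(len(rhoneg)):
    for i in range(reps):
        u = make_error(rng,n)
        y = dgp_lag(u,xb1,w,rhoneg[r])
        best[i,r] = OLS(y,x1,nonspat_diag=False).betas[1]
results = pd.DataFrame(best,columns=["rho"+str(r) for r in rhoneg])
results.describe()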
For now, the treatment of estimation has been limited to classic OLS. Before venturing into the estimation of the spatial models, it is important to have a good sense of the implications of ignoring spatial effects and to what extent they matter.