This page was generated from notebooks/6_TWOSLS.ipynb.
Two Stage Least Squares Regression (2SLS)¶
Luc Anselin¶
(revised 09/08/2024)¶
Preliminaries¶
In this notebook, endogenous and instrumental variables are introduced and the basics of two stage least squares estimation are presented.
Technical details are given in Chapter 6 of Anselin and Rey (2014), Modern Spatial Econometrics in Practice.
Prerequisites¶
Familiarity with pandas, geopandas, and libpysal is assumed for reading data files and manipulating spatial weights, as well as knowledge of how to use PySAL sample data sets. Again, the chicagoSDOH sample data set is used. If it is not available, it must be installed first with libpysal.examples.load_example("chicagoSDOH").
Modules Needed¶
The main module for spatial regression in PySAL is spreg. In addition, libpysal is needed to handle the example data sets and spatial weights, and pandas and geopandas for data input and output. This notebook is based on version 1.7 of spreg.
Some additional imports are included to avoid excessive warning messages. With later versions of PySAL, these may not be needed. Finally, to avoid issues with the print function for numpy 2.0 and later, the legacy option is set in set_printoptions.
[1]:
import warnings
warnings.filterwarnings("ignore")
import os
os.environ['USE_PYGEOS'] = '0'
import numpy as np
import pandas as pd
import geopandas as gpd
from libpysal.io import open
from libpysal.examples import get_path
import libpysal.weights as weights
from spreg import OLS, TSLS, f_stat
np.set_printoptions(legacy="1.25")
Functionality Used¶
from pandas/geopandas:
read_file
DataFrame
concat
from libpysal:
examples.get_path
io.open
weights.distance.Kernel
from spreg:
OLS
TSLS
f_stat
Input Files¶
All notebooks are organized such that the relevant filenames and variable names are listed at the top, so that they can be easily adjusted for use with your own data sets and variables.
In the current notebook, the Chi-SDOH sample shape file is used, with associated kernel weights (for HAC standard errors). The specific file names are:
Chi-SDOH.shp,shx,dbf,prj: socio-economic determinants of health for 2014 in 791 Chicago tracts
Chi-SDOH_k10tri.kwt: triangular kernel weights based on a variable bandwidth with 10 nearest neighbors from GeoDa
As before, the input files are specified generically as infileshp (for the shape file) and infilek (for the kernel weights). The libpysal.examples.get_path functionality is used to get the correct path.
[2]:
infileshp = get_path("Chi-SDOH.shp") # input shape file with data
infilek = get_path("Chi-SDOH_k10tri.kwt") # triangular kernel weights from GeoDa
Variable Names¶
The illustration in this notebook considers two different specifications. One has YPLL_rate (index of premature mortality) as the dependent variable and Blk14P (percent Black population), Hisp14P (percent Hispanic population), and HIS_ct (economic hardship index) as explanatory variables. The variable HIS_ct is considered to be endogenous, with the census tract centroids as instruments (COORD_X and COORD_Y). The full set of regressors is specified in z_names1, the exogenous variables in xe_names, endogenous variable in yend_names1, and instruments in q_names1.
A second specification uses the same dependent variable (y_name) and exogenous regressors (xe_names), but now includes two endogenous regressors, HIS_ct and EP_NOHSDP (percent without a high school diploma) in yend_names2 (with the full set of regressors in z_names2), and one additional instrument (in addition to the census tract centroids), EP_LIMENG (percent with limited English proficiency), specified in q_names2.
[3]:
y_name = ['YPLL_rate']
z_names1 = ['Blk14P','Hisp14P','HIS_ct']
z_names2 = ['Blk14P','Hisp14P','EP_NOHSDP','HIS_ct']
xe_names = ['Blk14P','Hisp14P']
yend_names1 = ['HIS_ct']
yend_names2 = ['EP_NOHSDP','HIS_ct']
q_names1 = ['COORD_X', 'COORD_Y']
q_names2 = ['COORD_X', 'COORD_Y','EP_LIMENG']
ds_name = 'Chi-SDOH'
gwk_name = 'Chi-SDOH_k10tri'
Variable definition and data input¶
The input geo data frame is created using read_file and the weights file using libpysal.io (imported as open). Also, the class of the kernel weights is corrected using libpysal.weights.distance.Kernel (libpysal.weights is imported as weights). Next, all the relevant variables are initialized as subsets from the data frame.
[ ]:
dfs = gpd.read_file(infileshp)
wk = open(infilek).read()
print(wk.n)
wk.__class__ = weights.distance.Kernel
print(type(wk))
y = dfs[y_name]
z1 = dfs[z_names1]
z2 = dfs[z_names2]
xe = dfs[xe_names]
yend1 = dfs[yend_names1]
yend2 = dfs[yend_names2]
q1 = dfs[q_names1]
q2 = dfs[q_names2]
The Principle of Two Stage Least Squares¶
A fundamental assumption behind OLS is, loosely put, that the explanatory variables (\(X\)) are uncorrelated with the error terms. If this is not the case for one or more variables, the OLS estimates will be biased (in the early days, this used to be called simultaneous equation bias; now it is referred to as endogeneity). To correct for this bias, the offending endogenous variables are replaced by instrumental variables that are: (1) uncorrelated with the error term; and (2) closely related (but not too closely) to the original endogenous variables.
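The bias itself is easy to demonstrate with a small numpy simulation. The sketch below uses purely synthetic data with illustrative names (none of these variables relate to the Chi-SDOH data): a regressor is constructed to be correlated with the error term, and the OLS slope estimate overshoots the true coefficient.

```python
import numpy as np

# Synthetic illustration of endogeneity bias: the regressor w is
# constructed to be correlated with the error term e.
rng = np.random.default_rng(7)
n = 5000
e = rng.normal(size=n)
w = rng.normal(size=n) + e          # w is correlated with e
y = 1.0 + 2.0 * w + e               # true coefficient on w is 2

# OLS by least squares: the slope estimate is biased upward
X = np.column_stack([np.ones(n), w])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b[1])                         # well above the true value of 2
```

With this construction, the probability limit of the slope is 2 plus the ratio of cov(w, e) to var(w), so the estimate settles near 2.5 no matter how large the sample, which is exactly the point: more data does not cure endogeneity.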
For example, consider a set of explanatory variables, such that: \begin{equation*} y = Z\delta + u, \end{equation*} where \(Z\) is organized such that the exogenous variables \(X\) are first and the endogenous ones \(Y\) are second, as \(Z = [X \ Y]\).
In essence, 2SLS estimation can be thought of as proceeding in two stages (hence the acronym). In the first stage, each of the endogenous variables is regressed on a matrix of instruments, \(Q\), which includes all of the exogenous variables in \(X\) as well as specific instruments. The predicted values from this regression are then used as explanatory variables, replacing the endogenous variables in the second stage, which yields consistent estimates for the coefficients \(\delta\). Note that the predicted value in a regression of the \(X\) variables on \(Q\) (which includes the \(X\)) simply yields the original \(X\), so that there are in fact no new instrumental variables for the \(X\).
The first stage can be expressed succinctly as: \begin{equation*} \hat{Z} = Q [ (Q'Q)^{-1} Q'Z ], \end{equation*} where the term in square brackets is the vector of OLS regression coefficients in a regression of each of the \(Z\) on the instrument matrix \(Q\).
The second stage consists of an OLS regression of \(y\) on the predicted values \(\hat{Z}\): \begin{equation*} \hat{\delta} = (\hat{Z}' \hat{Z} )^{-1} \hat{Z}' y. \end{equation*}
Alternatively, substituting the results of the first regression yields the full expression as: \begin{equation*} \hat{\delta} = [ Z'Q (Q'Q)^{-1} Q'Z ]^{-1} Z'Q (Q'Q)^{-1} Q' y. \end{equation*}
Strictly speaking, the 2SLS estimator is an instrumental variables estimator with instruments: \begin{equation*} H = Q (Q'Q)^{-1} Q'Z, \end{equation*} where \(Q (Q'Q)^{-1} Q'\) is often referred to as a projection matrix. The estimator can be expressed in the usual way as: \begin{equation*} \hat{\delta} = (H'Z)^{-1} H'y, \end{equation*} which, after substituting the full expression for \(H\) and some matrix algebra yields the same result as above.
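As a sanity check on the algebra, the numpy sketch below (synthetic data, illustrative variable names not drawn from the Chi-SDOH data) verifies that the explicit two-stage computation and the instrumental variables form \((H'Z)^{-1} H'y\) with \(H = \hat{Z}\) yield identical coefficients.

```python
import numpy as np

# Synthetic data: one exogenous regressor x, one endogenous regressor
# yend, and one external instrument q (names are illustrative only).
rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=(n, 1))
q = rng.normal(size=(n, 1))
e = rng.normal(size=(n, 1))
yend = 0.5 * q + e + rng.normal(size=(n, 1))   # correlated with e
y = 1.0 + 2.0 * x + 3.0 * yend + e

one = np.ones((n, 1))
Z = np.hstack([one, x, yend])    # Z = [X Y]
Q = np.hstack([one, x, q])       # instruments include the exogenous X

# First stage: Zhat = Q (Q'Q)^{-1} Q'Z
Zhat = Q @ np.linalg.solve(Q.T @ Q, Q.T @ Z)

# Second stage: delta = (Zhat'Zhat)^{-1} Zhat'y
d_two_stage = np.linalg.solve(Zhat.T @ Zhat, Zhat.T @ y)

# IV form with H = Zhat: delta = (H'Z)^{-1} H'y
d_iv = np.linalg.solve(Zhat.T @ Z, Zhat.T @ y)

print(np.allclose(d_two_stage, d_iv))   # True: the two forms coincide
```

The equivalence follows because the projection matrix \(Q(Q'Q)^{-1}Q'\) is symmetric and idempotent, so \(\hat{Z}'\hat{Z} = \hat{Z}'Z\).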
The Two Stages of 2SLS¶
Before illustrating the spreg.TSLS functionality, the two stages of 2SLS are spelled out in detail, using the first regression example. First, as a reference point, the OLS estimation is carried out, using spreg.OLS with nonspat_diag = False to limit the output. The complete set of regressors is contained in z1.
[ ]:
ols1 = OLS(y,x=z1,nonspat_diag = False,name_ds = ds_name)
print(ols1.summary)
Creating the instruments¶
In the example, there is only one endogenous variable, HIS_ct. The associated instrumental variable is the predicted value from a regression of HIS_ct on the exogenous variables xe and the instruments q1. The terminology can get a bit confusing, since instrumental variables and instruments are often used interchangeably. In a strict sense, the predicted value is an instrumental variable for the endogenous variable, and the exogenous variables are often not explicitly designated as instruments. This is the meaning used here. So, the full set of instruments consists of both xe and q1. This is accomplished by means of the pandas concat function.
The instrumental variable for HIS_ct is then extracted as the predy attribute of the regression object and turned into a data frame.
[ ]:
qq = pd.concat([xe,q1],axis=1)
olsinst = OLS(yend1,qq,nonspat_diag=False,name_ds=ds_name)
print(olsinst.summary)
[ ]:
yend_p = olsinst.predy
yep = pd.DataFrame(yend_p,columns=['HIS_ct_p'])
yep
Second stage¶
The second stage consists of an OLS estimation of YPLL_rate on the three regressors, where HIS_ct is replaced by its predicted value. Again, concat is first used to combine the exogenous variables with the newly created predicted value.
[ ]:
zz = pd.concat([xe,yep],axis=1)
ols2 = OLS(y,zz,nonspat_diag = False, name_ds=ds_name)
print(ols2.summary)
Compared to the results of the original OLS regression (ols1), there are some marked differences in both the magnitude and significance of the coefficients. The coefficient of Blk14P is less than half its previous value (16.4 vs. 42.1) and is no longer significant. The coefficient of Hisp14P is three times as large (-45.7 vs. -14.6) and that of HIS_ct_p doubles (153.6 vs. 72.7). Both remain highly significant. However, as noted below, the standard errors, and thus also the significance, must be interpreted with caution.
Predicted values and residuals¶
The proper residuals for the 2SLS regression should be \(y - Z\hat{\delta}\), and not \(y - \hat{Z}\hat{\delta}\), as given by the second stage OLS regression. As a consequence, the standard errors (which are based on those residuals) reported by the second stage OLS regression are not the proper standard errors for a 2SLS estimation. In some software implementations, this is overlooked when the estimation is carried out as two actual separate stages. The correct standard errors and associated (asymptotic) t-statistics and p-values are given by the spreg.TSLS command.
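The difference is easy to see on synthetic data. The numpy sketch below (illustrative names only, unrelated to the Chi-SDOH variables) computes both residual vectors for the same coefficient estimates: they do not coincide whenever the endogenous variable's coefficient is nonzero.

```python
import numpy as np

# Synthetic data (illustrative names): compare the correct 2SLS
# residuals y - Z d with the second-stage OLS residuals y - Zhat d.
rng = np.random.default_rng(0)
n = 200
one = np.ones((n, 1))
x = rng.normal(size=(n, 1))
q = rng.normal(size=(n, 1))
e = rng.normal(size=(n, 1))
yend = 0.5 * q + e                    # endogenous regressor
y = 1.0 + 2.0 * x + 3.0 * yend + e

Z = np.hstack([one, x, yend])
Q = np.hstack([one, x, q])
Zhat = Q @ np.linalg.solve(Q.T @ Q, Q.T @ Z)
d = np.linalg.solve(Zhat.T @ Zhat, Zhat.T @ y)

u_correct = y - Z @ d                 # proper 2SLS residuals
u_stage2 = y - Zhat @ d               # naive second-stage residuals

# The two residual vectors differ, so standard errors based on the
# naive residuals are not the correct 2SLS standard errors.
print(np.allclose(u_correct, u_stage2))   # False
```

The gap between the two is \((Z - \hat{Z})\hat{\delta}\), i.e., the first-stage residuals of the endogenous variable scaled by its coefficient, which is why software that naively reuses the second-stage OLS output reports the wrong standard errors.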
TSLS¶
The proper standard errors for 2SLS are obtained with spreg.TSLS. The required arguments are y for the dependent variable, x for the exogenous variables, yend for the endogenous variables, and q for the instruments. The default is nonspat_diag = True, so this is set to False to focus on just the estimates and their significance. Finally, name_ds is included for the data set name.
[ ]:
tsls1 = TSLS(y,x=xe,yend=yend1,q=q1,nonspat_diag = False,name_ds=ds_name)
print(tsls1.summary)
The coefficient estimates are identical to the ones obtained in ols2, but the standard errors are slightly different. In the end, this does not affect the significance in a meaningful way. Blk14P remains not significant and the other p-values are only marginally affected.
The complete set of attributes of the regression object is given with the usual dir command. They are essentially the same as for an OLS object, e.g., with the estimates in betas, the predicted values in predy, and the residuals in u. In addition, several intermediate matrix results are included as well, which are useful for customized calculations, but beyond the current scope.
Predicted values and residuals¶
As mentioned, the residuals are \(y - \hat{y}\), with \(\hat{y}\) as the predicted values. The latter are obtained as \(Z \hat{\delta}\). This is somewhat misleading as a measure of fit, since observations on the endogenous variables are included in the matrix \(Z\). A proper predicted value would be obtained from a solution of the reduced form, with only exogenous variables on the RHS of the equation. However, in a single equation cross-sectional setting, this is not possible. As computed, the predicted value thus may give an overly optimistic measure of the fit of the model. Consequently, the associated residual sum of squares may be too small to reflect the correct performance of the model.
Two endogenous variables¶
To further illustrate that 2SLS also works for multiple endogenous variables, the second specification is used, now with yend2 as the endogenous variables and q2 as the associated instruments. For reference, the OLS results are given as well.
[ ]:
ols3 = OLS(y,x=z2,nonspat_diag=False,name_ds=ds_name)
print(ols3.summary)
[ ]:
tsls2 = TSLS(y,x=xe,yend=yend2,q=q2,nonspat_diag=False,name_ds=ds_name)
print(tsls2.summary)
The effect of correcting for potential endogeneity is again significant. In the OLS results, all coefficients except that of EP_NOHSDP are highly significant. However, in the 2SLS results, neither Blk14P nor Hisp14P is significant, but EP_NOHSDP becomes significant. Its coefficient also changes sign, from positive (but not significant, so essentially zero) to negative. This illustrates the importance of checking for potential endogeneity.
Durbin-Wu-Hausman Test¶
The Durbin-Wu-Hausman statistic (DWH) is a test for the endogeneity of some regressors, provided instruments are available. It is a simple F-test on the joint significance of selected coefficients in an augmented regression. The augmented regression specification consists of the original exogenous variables and the predicted values of the endogenous variables when regressed on the instruments (those include the exogenous variables). If the null hypothesis holds, the coefficients of these predicted values should not be significant. The DWH test then boils down to an F-test on their joint significance (see Davidson and MacKinnon 2004, pp. 338-340).
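The mechanics of the test can be sketched with numpy on synthetic data (all names illustrative, unrelated to the Chi-SDOH variables): regress the endogenous variable on the instruments, append its predicted value to the original specification, and run an F-test comparing the restricted and augmented residual sums of squares.

```python
import numpy as np

# Synthetic data with built-in endogeneity (illustrative names only)
rng = np.random.default_rng(1)
n = 300
one = np.ones((n, 1))
x = rng.normal(size=(n, 1))
q = rng.normal(size=(n, 1))
e = rng.normal(size=(n, 1))
yend = 0.5 * q + e                    # correlated with the error e
y = 1.0 + 2.0 * x + 3.0 * yend + e

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    u = y - X @ b
    return float(u.T @ u)

# Predicted value of the endogenous variable from the instruments
Q = np.hstack([one, x, q])
yend_hat = Q @ np.linalg.solve(Q.T @ Q, Q.T @ yend)

# F-test on the coefficient of yend_hat in the augmented regression
rss_r = rss(np.hstack([one, x, yend]), y)             # restricted
rss_u = rss(np.hstack([one, x, yend, yend_hat]), y)   # augmented
k = 4                                  # regressors in augmented model
F = (rss_r - rss_u) / (rss_u / (n - k))   # one restriction
print(F > 10)    # True here: strong evidence of endogeneity
```

Since the data are generated with a regressor that shares the error term, the F-statistic comes out very large and the null of exogeneity is clearly rejected, mirroring what the built-in diagnostic reports below.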
The test is included as a default diagnostic in spreg.TSLS (the default is nonspat_diag=True, so this does not need to be included explicitly). For the two example regressions, this yields the following results.
[ ]:
tsls1a = TSLS(y,x=xe,yend=yend1,q=q1,name_ds=ds_name)
print(tsls1a.summary)
[ ]:
tsls2a = TSLS(y,x=xe,yend=yend2,q=q2,name_ds=ds_name)
print(tsls2a.summary)
In both instances, there is clear evidence that controlling for endogeneity was warranted.
Since the predicted value for HIS_ct in the first regression is already available as yep above, the DWH test can be verified by carrying out the auxiliary regression. To this end z1 must be concatenated with yep to form the new matrix of regressors.
[ ]:
zza = pd.concat([z1,yep],axis=1)
ols4 = OLS(y,zza,nonspat_diag=False,name_ds=ds_name)
print(ols4.summary)
The DWH statistic then follows as the result of an F-test on the coefficient of HIS_ct_p in the auxiliary regression.
[ ]:
dwhf = f_stat(ols4,df=1)
print(dwhf)
The resulting F-statistic is exactly the value of DWH listed for tsls1a, with the associated p-value. In the same way, the result for the second 2SLS regression can be verified, but this is slightly more involved. In the second case, two additional regressions are required to obtain predicted values for each of the endogenous variables, but otherwise everything is the same. This is all carried out under the hood for the DWH statistic.
Robust Standard Errors¶
As was the case for OLS regression, 2SLS allows for robust standard errors by setting the argument robust="white" or robust="hac". Everything operates in the same way as for OLS, i.e., the estimates are identical, but the standard errors, z-statistics, and p-values may differ.
White standard errors¶
[ ]:
tsls1b = TSLS(y,x=xe,yend=yend1,q=q1,name_ds=ds_name,
robust='white')
print(tsls1b.summary)
[ ]:
tsls2b = TSLS(y,x=xe,yend=yend2,q=q2,name_ds=ds_name,
robust='white')
print(tsls2b.summary)
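For reference, the White-type correction for 2SLS can be sketched in a few lines of numpy on synthetic data (illustrative names only): the textbook sandwich form \((\hat{Z}'Z)^{-1} \hat{Z}'\, \text{diag}(\hat{u}_i^2)\, \hat{Z} (Z'\hat{Z})^{-1}\), with the correct 2SLS residuals in the middle. This is a generic textbook sketch, not necessarily spreg's exact implementation (which may differ, e.g., in small-sample adjustments).

```python
import numpy as np

# Synthetic data with heteroskedastic errors (illustrative names only)
rng = np.random.default_rng(2)
n = 200
one = np.ones((n, 1))
x = rng.normal(size=(n, 1))
q = rng.normal(size=(n, 1))
e = rng.normal(size=(n, 1)) * (1.0 + x**2)   # variance depends on x
yend = 0.5 * q + e
y = 1.0 + 2.0 * x + 3.0 * yend + e

Z = np.hstack([one, x, yend])
Q = np.hstack([one, x, q])
Zhat = Q @ np.linalg.solve(Q.T @ Q, Q.T @ Z)
d = np.linalg.solve(Zhat.T @ Z, Zhat.T @ y)
u = (y - Z @ d).ravel()                      # correct 2SLS residuals

# White sandwich: bread (Zhat'Z)^{-1}, meat Zhat' diag(u^2) Zhat
bread = np.linalg.inv(Zhat.T @ Z)
meat = Zhat.T @ (Zhat * (u**2)[:, None])
v_white = bread @ meat @ bread.T
se_white = np.sqrt(np.diag(v_white))
print(se_white)                              # one robust s.e. per coefficient
```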
HAC standard errors¶
As was the case for OLS, in addition to robust="hac", the kernel weights (gwk) and optionally their name (name_gwk) must be specified as well.
[ ]:
tsls1c = TSLS(y,x=xe,yend=yend1,q=q1,name_ds=ds_name,
robust='hac',gwk=wk,name_gwk=gwk_name)
print(tsls1c.summary)
[ ]:
tsls2c = TSLS(y,x=xe,yend=yend2,q=q2,name_ds=ds_name,
robust='hac',gwk=wk,name_gwk=gwk_name)
print(tsls2c.summary)
Practice¶
At this point, you can assess endogeneity in your own regression specification, or experiment with different models constructed from the variables in the chicagoSDOH sample data set. Replicate the 2SLS results by explicitly carrying out the two stage OLS regressions and/or check on the value of the Durbin-Wu-Hausman test by means of the auxiliary regression.