This page was generated from notebooks/5_OLS.ipynb.
Basic Ordinary Least Squares Regression (OLS)¶
Luc Anselin¶
(revised 09/08/2024)¶
Preliminaries¶
In this notebook, the basic OLS regression and elementary regression diagnostics are reviewed. In addition, a number of concepts related to robust standard errors are covered.
Technical details are treated in Chapter 5 of Anselin and Rey (2014), Modern Spatial Econometrics in Practice.
Video recordings of the basics and non-spatial diagnostics are available from the GeoDa Center YouTube channel playlist Applied Spatial Regression - Notebooks:
- https://www.youtube.com/watch?v=x63dL2M5x_0&list=PLzREt6r1NenmhNy-FCUwiXL17Vyty5VL6&index=5
- https://www.youtube.com/watch?v=pimNfZyOyKk&list=PLzREt6r1NenmhNy-FCUwiXL17Vyty5VL6&index=6
Prerequisites¶
Familiarity with pandas, geopandas and libpysal is assumed in order to read data files and manipulate spatial weights. In addition, the sample data set chicagoSDOH should be installed.
Modules Needed¶
The main module for spatial regression in PySAL is spreg. In addition libpysal is needed to handle the example data sets and spatial weights, and pandas and geopandas for data input and output. This notebook is based on version 1.7 of spreg. In order to make the graphs a bit nicer looking, matplotlib.pyplot is imported as well, although this is not strictly needed since it is behind the plotting functionality of pandas and geopandas. Importing matplotlib.pyplot also allows for some further customization of the plots and maps, but that is not pursued in detail here.
Some additional imports are included to avoid excessive warning messages. With later versions of PySAL, these may not be needed. As before, in order to avoid some arguably obnoxious new features of numpy 2.0, it is necessary to include the set_printoptions
command if you are using a Python 3.12 environment with numpy 2.0 or greater.
[4]:
import warnings
warnings.filterwarnings("ignore")
import os
os.environ['USE_PYGEOS'] = '0'
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from libpysal.io import open
from libpysal.examples import get_path
import libpysal.weights as weights
from spreg import OLS, f_stat, wald_test
np.set_printoptions(legacy="1.25")
Functionality Used¶
from numpy:
array
hstack
identity
zeros
from pandas/geopandas:
read_file
DataFrame
concat
to_file
plot
from libpysal:
examples.get_path
io.open
weights.distance.Kernel
from spreg:
OLS
f_stat
wald_test
Input Files¶
All notebooks are organized such that the relevant filenames and variables names are listed at the top, so that they can be easily adjusted for use with your own data sets and variables.
In the examples, the Chi-SDOH sample shape file is used, with associated kernel weights (for HAC standard errors). The specific file names are:
Chi-SDOH.shp,shx,dbf,prj: socio-economic determinants of health for 2014 in 791 Chicago tracts
Chi-SDOH_k10tri.kwt: triangular kernel weights based on a variable bandwidth with 10 nearest neighbors from
GeoDa
In this and the other spreg-related notebooks, it is assumed that you have installed the chicagoSDOH example data set by means of libpysal.examples.load_example("chicagoSDOH").
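If the example data set has not been installed yet, a one-time call such as the one sketched below will download it; afterwards, get_path resolves the file names as shown in the next cell.
[ ]:
from libpysal.examples import load_example
# one-time download of the chicagoSDOH sample data set (only needed if not yet installed)
load_example("chicagoSDOH")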
The input files are specified generically as infileshp (for the shape file) and infilek (for the kernel weights). In addition, optionally, an output file is specified to add predicted values and residuals to a shape file as outfileshp.
[9]:
infileshp = get_path("Chi-SDOH.shp") # input shape file with data
infilek = get_path("Chi-SDOH_k10tri.kwt") # triangular kernel weights from GeoDa
outfileshp = "./testols.shp" # output shape file with predicted values and residuals
Variable Names¶
In this notebook, the regression model is illustrated in the context of the immigrant paradox. This is based on four variables from the SDOH data set: YPLL_rate (an index measuring premature mortality, i.e., higher values are worse health outcomes), HIS_ct (economic hardship index), Blk14P (percent Black population), and Hisp14P (percent Hispanic population).
The easiest way to specify a generic regression model is to first create lists with the variable names for the dependent variable (y_name), the explanatory variables (x_names1 and x_names2), the data set name (optional, as ds_name), the name of the contiguity weights (optional, as w_name, but not needed in this notebook), and the file name for the kernel weights (optional, as gwk_name). In this way, all the regression commands pertain to these generic variable names and do not need to be adjusted for different specifications and/or data sets.
[10]:
y_name = 'YPLL_rate'
x_names1 = ['Blk14P','Hisp14P']
x_names2 = ['Blk14P','Hisp14P','HIS_ct']
ds_name = 'Chi-SDOH'
gwk_name = 'Chi-SDOH_k10tri'
Reading Input Files from the Example Data Set¶
The actual path to the files contained in the local copy of the remote data set was found above by means of the libpysal get_path
command. This is then passed to the geopandas read_file
function in the usual way.
As mentioned earlier, if the example data are not installed locally by means of libpysal.examples
, the get_path
command must be replaced by an explicit reference to the correct file path name. This is easiest if the files are in the current working directory, in which case just specifying the file names in infileshp etc. is sufficient.
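For example, along the following lines (the paths shown are hypothetical local copies):
[ ]:
# only needed when the example data set is not installed through libpysal.examples;
# replace the get_path calls with explicit (hypothetical) local file names
# infileshp = "Chi-SDOH.shp"
# infilek = "Chi-SDOH_k10tri.kwt"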
[ ]:
dfs = gpd.read_file(infileshp)
print(dfs.shape)
print(dfs.columns)
Reading Weights¶
The weights are read by means of the libpysal.io.open command, which was imported simply as open. In the example, the kernel weights are loaded from a file created with GeoDa, and the full path was specified above using get_path. After reading the weights, their class is adjusted to avoid raising an exception in the computation of HAC standard errors, as illustrated in the spatial weights notebook (note that libpysal.weights was imported as weights).
[ ]:
wk = open(infilek).read()
print("dimension ",wk.n)
wk.__class__ = weights.distance.Kernel
print("type of wk ",wk.__class__)
Setting up the Variables¶
In legacy spreg, variables were typically read from a dBase data file (associated with a shape file). This required a multi-step process to extract the variables and then turn them into numpy arrays. With the more recent support for pandas and geopandas, this process has become greatly simplified. The variables can be loaded directly from the pandas or geopandas data frame. As a side effect, this no longer requires the name_y
and name_x
arguments to be set explicitly in the
OLS call.
The setup uses the variable names specified in the respective y_name, x_names, etc. lists.
Note that there is no constant term in the x matrices. A constant vector is included by default in the spreg.OLS
routine.
[13]:
y = dfs[y_name]
x1 = dfs[x_names1]
x2 = dfs[x_names2]
OLS Regression¶
A first illustration uses the simplest of OLS regressions, where only y and X are specified as arguments in the spreg.OLS
command. Since OLS
was imported explicitly, the spreg
part is omitted in what follows.
The resulting OLS object has many attributes. The easiest way to list the results is to print the summary attribute.
First, the regression with the two ethnicity explanatory variables is considered. The dependent variable is y and the explanatory variables are contained in x1.
The option nonspat_diag=False is set here, since by default the output includes the non-spatial diagnostics, which are considered in more detail below.
[14]:
ols1 = OLS(y,x1,nonspat_diag=False)
The basic result of this operation is to create an OLS object. The range of attributes and methods included can be readily viewed by means of the dir
command.
[ ]:
print(dir(ols1))
The regression coefficients are in betas, the predicted values in predy, the residuals in u, the coefficient variance-covariance matrix in vm, and the \(R^2\) measure of fit in r2. The summary attribute provides a nicely formatted output listing.
[ ]:
print(ols1.summary)
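Besides the summary listing, individual attributes can be accessed directly. A quick sketch:
[ ]:
# direct access to selected attributes of the OLS object
print("coefficients:\n", ols1.betas)   # column vector of estimates, intercept first
print("R-squared    :", ols1.r2)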
As is, the output does not list any information on the data set. The variable names were extracted directly from the data frame. In order to have a more informative output, name_ds
should be specified in the arguments, as illustrated below. Since no spatial diagnostics are specified so far, the Weights matrix is listed as None.
[ ]:
ols1a = OLS(y,x1,nonspat_diag=False,
name_ds=ds_name)
print(ols1a.summary)
This bare bones regression output lists the variables and data set involved. In addition to the coefficient estimates, standard errors, t-statistics and p-values, it includes the \(R^2\) and adjusted \(R^2\) as measures of fit. The mean and standard deviation of the dependent variable are given as well.
The overall fit of about 0.60 is reasonable and both slope coefficients are positive and highly significant.
Alternative way to pass arguments¶
The approach taken in the example, whereby y and x1 are defined separately and then passed, is mostly a matter of convenience. In practice, one can specify the dependent and explanatory variables directly as subsets of a data frame. For example, one could use dfs["YPLL_rate"] and dfs[["Blk14P","Hisp14P"]] directly as arguments instead of taking the extra step to define y and x1 separately. The results are identical. Which approach one takes is a matter of preference.
[ ]:
ols1b = OLS(dfs["YPLL_rate"],dfs[["Blk14P","Hisp14P"]],nonspat_diag=False,
name_ds=ds_name)
print(ols1b.summary)
Immigrant paradox¶
As it turns out, the initial regression result is somewhat misleading, which is revealed when the economic hardship variable is included. This is implemented by specifying x2 as the x-variable argument.
[ ]:
ols2 = OLS(y,x2,nonspat_diag=False,
name_ds=ds_name)
print(ols2.summary)
The inclusion of economic hardship turned the coefficient of the Hispanic share from positive to negative. This is the so-called immigrant paradox. All coefficients are highly significant, with an adjusted \(R^2\) of 0.6316.
Predicted Values and Residuals¶
Two important attributes of the regression object are the predicted values and residuals. Unlike many of the other attributes, these are full-length column vectors of size n x 1. They can be extracted as the predy
and u
attributes. A quick check is included to make sure the dimensions are correct.
For the residuals, the mean()
is computed to confirm that it is zero.
[ ]:
yp = ols2.predy
yp.shape
[ ]:
resid = ols2.u
print(resid.mean())
resid.shape
The most useful application of the predicted values and residuals in a diagnostic sense is to plot and map them. Before briefly illustrating this, it is shown how the two vectors can be added to the data frame and then optionally written out to a new shape file. This uses the DataFrame functionality from pandas and horizontal stacking (hstack) from numpy.
[ ]:
preds = pd.DataFrame(np.hstack((yp,resid)),columns=['ypred','resid'])
preds.head()
The data frame with the predicted values and residuals is then concatenated (pandas.concat) with the current geo data frame, and the result can be written to the output file outfileshp using the to_file function.
[25]:
dfs = pd.concat([dfs,preds],axis=1)
dfs.to_file(outfileshp,mode='w')
A common residual diagnostic is a scatter plot of the residuals against the predicted values. If the error variance is constant (homoskedasticity), the point cloud should form a band of roughly constant spread around zero (the mean of the residuals). On the other hand, a fan-like shape, or clearly different spreads for different ranges of the predicted values, would suggest heteroskedasticity.
The pandas plot
functionality can be used to make a simple scatter plot of residuals against predicted values. In the example, ypred is used as the x
argument, resid as the y
argument, with kind = "scatter"
. With matplotlib.pyplot imported, plt.show()
will produce a clean graph. Alternatively, one can set the plot object to ax
, as illustrated in the mapping notebook. This can be useful for further customization, but is not pursued here.
[ ]:
dfs.plot(x="ypred",y="resid",kind='scatter',title='Diagnostic plot')
plt.show()
The slight fan-like shape of the residuals, with values further away from zero for increasing predicted values, is a visual diagnostic suggesting heteroskedasticity. However, since this is purely visual, it remains only suggestive. More formal tests against heteroskedasticity are considered below.
Some insight into possible spatial patterns among the residuals can be gained from a residual map, i.e., a choropleth map of the residuals. This is readily implemented by means of the plot functionality of a geopandas object. The most useful classification in this respect is a standard deviational map, in which the bins are based on standard deviational units. This has two advantages: (1) it clearly shows the positive and negative residuals; and (2) it highlights outliers, i.e., residuals that are more than two standard deviations away from zero.
As covered in a previous notebook, a standard deviational map is obtained by setting the argument scheme = "std_mean" in the geopandas plot
function. This relies on the PySAL mapclassify package under the hood, but the latter does not need to be imported separately. As pointed out in the mapping notebook, in mapclassify, the convention is to merge all observations that are within one standard deviation below and above the mean into a single category. This may not be optimal for all
types of maps (it can be customized by means of anchor
), but it is useful for a residual map. Specifically, all observations that have residuals within one standard deviation from the mean (zero) are shown in the same color (here, white). To highlight the difference between negative and positive residuals, the color map cmap="bwr"
is chosen. In this scheme, the negative residuals are shades of blue (overprediction), whereas the positive residuals are shades of red (underprediction). The
middle category is white.
In order to make sure that the polygon boundaries are shown as well, edgecolor = "black" is set as an additional argument (otherwise the census tracts in the middle category cannot be distinguished from each other). The linewidth is adjusted somewhat to improve legibility. A legend is set in the usual way.
Finally, to remove the default axes and to add a title, the set_axis_off
and set_title
methods are applied to the plot object. Further customization can always be carried out using more advanced features of matplotlib. For some examples, see the mapping notebook.
[ ]:
ax = dfs.plot(
column = 'resid',
figsize=(8,8),
cmap='bwr',
scheme='std_mean',
edgecolor='black',
linewidth = 0.3,
legend=True,
legend_kwds={'loc':'center left','bbox_to_anchor':(1,0.5), 'title': "Residuals"})
ax.set_axis_off()
ax.set_title('Residual Standard Deviational Plot')
plt.show()
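As an aside, the class breaks used by the std_mean scheme can be inspected directly with mapclassify, the package geopandas relies on under the hood. A minimal sketch (assuming mapclassify is available in the current environment):
[ ]:
import mapclassify
# standard deviational classification of the residuals underlying the map above
print(mapclassify.StdMean(dfs["resid"]))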
The residual map seems to suggest some spatial patterning among the residuals, but such a visual impression can be misleading. More formal approaches are considered in the notebook that deals with Specification Tests.
Latex Table Output¶
In addition to the standard regression output shown so far, it is also possible to generate a LaTeX-formatted table for the main coefficient results. This makes it easier to incorporate the regression results into reports or papers. It is accomplished by setting the latex option to True (the default is False).
Note that only the table with the estimated coefficients, standard errors, etc. is in LaTeX format. The other items (various headings and diagnostics) remain plain text.
[ ]:
ols2lat = OLS(y,x2,nonspat_diag=False,
name_ds=ds_name,
latex=True)
print(ols2lat.summary)
Non-Spatial Diagnostics¶
By default, OLS includes the non-spatial diagnostics (nonspat_diag=True), so this argument does not need to be set explicitly. The output with the diagnostics is as follows.
[ ]:
ols2a = OLS(y,x2,
name_ds=ds_name)
print(ols2a.summary)
The previous listing is now augmented with several additional measures of fit (shown above the coefficients) as well as diagnostics for multicollinearity, non-normality and heteroskedasticity (listed below the coefficients).
The measures of fit in the left-hand column are all related to the sum of squared residuals. This is listed, together with two estimates of \(\sigma^2\) (one obtained by dividing by the degrees of freedom, the other - the ML estimate - by the number of observations) and their square roots, the S.E. of regression.
On the right-hand side are the results of an F-statistic for the joint significance of the coefficients and its associated p-value, the log-likelihood \(L\) (under the assumption of normality) for use in comparisons with spatial models, and two adjustments of the log-likelihood for the number of variables in the model, the \(AIC\) and \(SC\), with \(AIC = -2L + 2k\) and \(SC = -2L + k \ln(n)\) (\(SC\) is sometimes referred to as \(BIC\)). Whereas a better fit is reflected in a higher log-likelihood, it is the reverse for AIC and SC (lower is better).
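To make these measures concrete, the sketch below recomputes the log-likelihood, AIC and SC from the residuals and the dimensions of the regression (the attributes u, n and k of the OLS object); the values should match the ones reported in the output above.
[ ]:
# recompute log-likelihood, AIC and SC from first principles (sketch)
n = ols2a.n                            # number of observations
k = ols2a.k                            # number of coefficients, including the constant
uu = float((ols2a.u ** 2).sum())       # sum of squared residuals
logl = -0.5 * n * (np.log(2.0 * np.pi) + np.log(uu / n) + 1.0)  # Gaussian log-likelihood
print("log-likelihood:", logl)
print("AIC           :", -2.0 * logl + 2.0 * k)
print("SC (BIC)      :", -2.0 * logl + k * np.log(n))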
Below the listing of coefficients are the multicollinearity condition number, the Jarque-Bera test on normality of the errors and two tests for heteroskedasticity (random coefficients): the Breusch-Pagan LM test and the more robust (against non-normality) Koenker-Bassett test.
There is no evidence of a problem with multicollinearity (typically associated with values larger than 30), but a strong indication of both non-normality and heteroskedasticity.
White test¶
Because it requires additional computation, the White test against heteroskedasticity is not included by default. To include it, the argument white_test=True
must be set.
All the output is the same as before, except for the addition of the test results, which again strongly indicate the presence of heteroskedasticity.
[ ]:
ols2b = OLS(y,x2,white_test=True,
name_ds=ds_name)
print(ols2b.summary)
Variance Inflation Factor (VIF)¶
The Variance Inflation Factor or VIF is an alternative to the multicollinearity condition number for assessing the sensitivity of the regression results to the presence of multicollinearity. The condition number is computed for all variables jointly and measures the degree of linear dependence among all the columns of \(X\). In contrast, the VIF is computed for each variable separately.
The VIF depends on the fit of a regression of the variable in question on all other x-variables, e.g., \(R^2_k\) for \(x_k\). \(1 - R^2_k\) is called the tolerance (where \(R^2_k\) is the unadjusted \(R^2\)). The more collinear \(x_k\) is with the other variables, the larger \(R^2_k\) and thus the lower the tolerance. The VIF is then the inverse of the tolerance.
The VIF is not reported as part of the standard regression output, since it requires considerable additional computation, but is available by setting the argument vif = True
(the default setting is vif = False
). The output is augmented with a listing of the VIF and corresponding tolerance (its inverse) for each coefficient.
While there is no indication of a multicollinearity problem in the current set of results, it may still be informative to check which variables are the most correlated with the others.
[ ]:
ols2c = OLS(y,x2,vif=True,
name_ds=ds_name)
print(ols2c.summary)
In this result, Blk14P has the highest VIF at 4.323. However, this is not very high. When there are more serious multicollinearity problems, this value can be much higher. For example, with \(R^2_k = 0.95\), the VIF would be 20, and with \(R^2_k = 0.99\), it would be 100.
In the example, the corresponding tolerance is 0.231. In other words, a regression of Blk14P on the other two variables would have an (unadjusted) \(R^2\) of 0.769. This can be readily verified in a simple regression by specifying the respective subsets of the dfs data frame.
[ ]:
regv = OLS(dfs["Blk14P"],dfs[["Hisp14P","HIS_ct"]],nonspat_diag = False)
print(regv.summary)
The unadjusted \(R^2\) in this regression is indeed 0.769.
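The connection with the VIF reported above can be made explicit: the tolerance is one minus this \(R^2\), and the VIF is its inverse. A quick check:
[ ]:
# tolerance and VIF for Blk14P from the auxiliary regression
tol = 1.0 - regv.r2
print("tolerance:", tol)
print("VIF      :", 1.0 / tol)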
F-Test on joint significance of coefficients¶
The F-test reported as part of the regression output is a joint test on all slope coefficients. It is also possible to test the joint significance of a subset of coefficients. In order to carry this out, the variables of interest must be the last columns in the x matrix. In other words, the test is on the joint significance of the last df coefficients in betas, where df is passed as the second argument to spreg.f_stat (given the import statement used above, available here simply as f_stat). The first argument is a regression object. The test can be applied to one or more variables.
For example, to test the significance of HIS_ct in the last regression, the F-test would have df=1
:
[ ]:
f_hisct = f_stat(ols2a,df=1)
print(f_hisct)
The result is an F-statistic with 1 and 787 degrees of freedom: 1 for the numerator (df) and 787 (\(n - k\)) for the denominator. The value of 65.796 is clearly significant, confirming the indication given by the t-test. Since the first degree of freedom of the F-statistic is 1, its value is exactly the square of the t-statistic in the regression, i.e., \(8.11149^2 = 65.796\).
To replicate the result of the F test that is listed in the regression output, two different approaches are valid. In one, the degrees of freedom is set to 3 (the last three coefficients), in the other it is not specified. The default of f_stat
is to carry out an F test on all slope coefficients. The results for the two approaches are identical and match the regression output.
[ ]:
f_all1 = f_stat(ols2a,df=3)
print(f_all1)
f_all2 = f_stat(ols2a)
print(f_all2)
Finally, to test the joint significance of Hisp14P and HIS_ct, df is set to 2.
[ ]:
f_two = f_stat(ols2a,df=2)
print(f_two)
Wald test on joint significance of coefficients¶
The F-test as currently implemented is a bit awkward in that the joint test only applies to the last df coefficients. A more flexible approach is based on the textbook Wald test for linear restrictions on the regression coefficients (see, for example, Greene, 2012, pp. 117-118). A set of linear constraints on the coefficients can be expressed as:
\begin{equation*} R \beta = q, \end{equation*}
where \(R\) is a \(J \times k\) matrix of constants, for \(J\) constraints and \(k\) total parameters, \(\beta\) is the column vector with all the model coefficients (i.e., including both the intercept and the slopes), and \(q\) is a \(J \times 1\) column vector of constants, which consists of zeros for a test on the joint significance of coefficients. In that case, the matrix \(R\) is a truncated identity matrix, with only the rows selected that pertain to the coefficients of interest. This is illustrated in more detail below.
The test statistic takes the form of a general Wald statistic, or: \begin{equation*} (R \hat{\beta} - q)' [ R V R' ]^{-1} (R \hat{\beta} - q) \sim \chi^2(J), \end{equation*}
i.e., following a chi-squared distribution with \(J\) degrees of freedom, and where \(V\) is the estimated variance-covariance matrix for the coefficients, a \(k \times k\) matrix.
The Wald test is available from spreg.regimes.wald_test
(imported as just wald_test
in this notebook). The arguments are betas
, the vector with regression coefficients, r
, the matrix \(R\), q
, the vector \(q\), and vm
, the estimated variance-covariance matrix for the regression. The result is a tuple with the test statistic and associated p-value.
For example, taking the results from ols2, the regression coefficients would be in ols2.betas with associated variance-covariance matrix in ols2.vm. For a test on the joint significance of the coefficients of Blk14P and HIS_ct, r
would be a \(2 \times 4\) matrix, with a 1 in position [0,1] and [1,3] (keep in mind that the count starts at 0). q
would then be a \(2 \times 1\) column vector of zeros.
[ ]:
rr = np.identity(ols2.k)
r = rr[(1,3),:]
r
[ ]:
j = 2 # number of constraints
q = np.zeros((j,1))
q
The Wald statistic then follows as:
[ ]:
wald_test(ols2.betas,r,q,ols2.vm)
The two coefficients are clearly jointly significant (one doesn’t actually need a test to see that).
A final example illustrates the equivalence of the Wald test, the t-test and the F-test when testing a single coefficient. For the last coefficient, of HIS_ct, q
equals 0 and r
is a row vector with all zeros except for the last element, which is 1.
[ ]:
r = rr[3,:].reshape(1,-1)
q = 0
wald_test(ols2.betas,r,q,ols2.vm)
The value of the statistic is the square of the t-statistic and the same as the F-test given above.
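This equivalence can be verified directly. The sketch below assumes the t_stat attribute of the OLS object, which holds a (statistic, p-value) pair for each coefficient:
[ ]:
# square of the t-statistic for HIS_ct (the last coefficient); t_stat attribute assumed
t_his = ols2.t_stat[3][0]
print("t-statistic squared:", t_his ** 2)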
Robust Standard Errors¶
The classical standard errors of the regression coefficients are not robust to deviations from the regularity conditions, such as i.i.d. errors. In cross-sectional data, it is rather unrealistic to assume the absence of heteroskedasticity. In fact, in the example used here, all three diagnostics indicate strong evidence of heteroskedasticity.
The so-called White standard errors (also known as Huber sandwich standard errors) correct for the presence of heteroskedasticity of unspecified form. They are usually (but not always) larger than the classical standard errors, leading to slightly more conservative inference.
A more recent development is the so-called HAC standard errors, introduced by Kelejian and Prucha (2007) (“HAC estimation in a spatial framework,” Journal of Econometrics 140, 131-154), where HAC stands for heteroskedasticity and autocorrelation consistent. The computations use a kernel logic to obtain standard errors that correct for both heteroskedasticity and the possible presence of spatial autocorrelation of unspecified form. Typically (but again not always), these are the largest standard errors.
The robust standard errors are invoked with the option robust
in spreg.OLS
. The default is None
, for classical standard errors.
White Standard Errors¶
White standard errors are obtained by setting robust='white'
. Except for the standard errors, the regression output remains exactly the same.
[ ]:
ols3 = OLS(y,x2,robust='white',
name_ds=ds_name)
print(ols3.summary)
In the example, the standard errors increase for all coefficients except for Blk14P (3.39 vs 3.49). However, the impact on inference is negligible, with all coefficients still significant at p < 0.003.
HAC Standard Errors¶
HAC standard errors are obtained with robust='hac'
. In addition, a kernel weights object must be passed (gwk
), and optionally, its name (name_gwk
). Again, except for the standard errors, the output is the same as before. One slight difference is that no diagnostics for spatial dependence are implemented when the HAC option is used. In the example, the triangular kernel weights contained in wk are used. Several other kernel functions (and different bandwidths) can be used to
create kernel weights through libpysal.weights. In practice, some experimentation is advised.
[ ]:
ols4 = OLS(y,x2,robust='hac',
gwk=wk,name_gwk=gwk_name,
name_ds=ds_name)
print(ols4.summary)
In the example, the HAC standard errors are slightly larger than in the White case, but again not sufficient to affect inference.
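As mentioned above, kernel weights can also be constructed directly with libpysal.weights rather than read from a GeoDa file. The sketch below builds adaptive-bandwidth triangular kernel weights from the tract centroids; it assumes projected coordinates, and the result is similar in spirit to, but not necessarily identical with, the GeoDa-created weights used here.
[ ]:
# adaptive (variable) bandwidth triangular kernel weights based on 10 nearest neighbors,
# computed from the tract centroids (sketch)
pts = np.array([(p.x, p.y) for p in dfs.centroid])
wk_alt = weights.Kernel(pts, k=10, fixed=False, function='triangular')
print("type of wk_alt ", wk_alt.__class__)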
Practice¶
At this point, it would be useful to set up your own baseline regression model, with a continuous dependent variable and at least two or three explanatory variables. You can pick any set of variables from the Chicago data set, from one of the PySAL sample data sets or your own data, but of course, make sure that your research question makes sense. Create some kernel weights (use libpysal.weights
) to check out the HAC estimation.
Assess the extent to which your initial specification may suffer from some forms of misspecification, as well as the sensitivity of your results to different measures of standard errors. In addition, test some linear restrictions on the coefficients.
You may also want to experiment with the plotting and mapping functionality contained in geopandas to create plots and maps of the predicted values and/or residuals.